Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Skipping Indices #4143

Merged
merged 155 commits into from Feb 5, 2019
Merged

Data Skipping Indices #4143

merged 155 commits into from Feb 5, 2019

Conversation

@nikvas0
Copy link
Contributor

nikvas0 commented Jan 24, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • New Feature

Short description (up to few sentences):
Data Skipping Indices for ClickHouse.
These indices aggregate some info about big blocks of data during writing/merging. After that, they are used to avoid reading the data which do not satisfy the query.

Detailed description (optional):

  • Indices description in the CREATE TABLE query.
  • ALTER TABLE ... ADD/DROP INDEX
  • min-max index (stores extremes of the specified expression)
  • unique index (stores a set of unique values of the specified expression)
nikvas0 added 13 commits Jan 29, 2019
Nikvas0/unique index
fix
@nikvas0 nikvas0 changed the title Data Skipping Indices [WIP] Data Skipping Indices Jan 30, 2019
const auto one = std::make_shared<ASTLiteral>(Field(1));
const auto two = std::make_shared<ASTLiteral>(Field(2));

node = makeASTFunction(

This comment has been minimized.

Copy link
@nikvas0

nikvas0 Jan 30, 2019

Author Contributor

Would special function for swapping last two bits be better in this case?

This comment has been minimized.

Copy link
@alexey-milovidov

alexey-milovidov Jan 30, 2019

Member

Yes, but this function would not be usable for anything else.

This comment has been minimized.

Copy link
@alexey-milovidov

alexey-milovidov Jan 30, 2019

Member

Let's make this function nevertheless.

else if (select.prewhere_expression)
new_expression = select.prewhere_expression->clone();
else
/// 11_2 -- can be true and false at the same time

This comment has been minimized.

Copy link
@alexey-milovidov

alexey-milovidov Jan 30, 2019

Member

11_2 - I guess you mean "11 in binary".
If you write 0b11 - it will be more obvious.

nikvas0 added 2 commits Jan 30, 2019
fix
Copy link
Member

alesapin left a comment

My current remarks mostly about style. Code is really cool. I'll try to look one more time later.


String MergeTreeUniqueGranule::toString() const
{
String res = "";

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

some stream (like ostringstream) would be better



/// Structure for storing basic index info like columns, expression, arguments, ...
class MergeTreeIndex

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

Better IMergeTreeIndex

using MutableMergeTreeIndexPtr = std::shared_ptr<MergeTreeIndex>;


struct MergeTreeIndexGranule

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

IMergeTreeIndexGranule

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

Need comment

IndexConditionPtr MergeTreeMinMaxIndex::createIndexCondition(
const SelectQueryInfo & query, const Context & context) const
{
return std::make_shared<MinMaxCondition>(query, context, *this);

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

indent


};

std::unique_ptr<MergeTreeIndex> MergeTreeMinMaxIndexCreator(

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

Move definition of this function to .cpp and add declaration to factory .cpp file. Also function name starts from lowercase.

{
stream.assertMark();
}
catch (Exception &e)

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

style

/// If there are no marks after the end of range, just use max_read_buffer_size
if (right >= marks_count
|| (right + 1 == marks_count
&& getMark(right).offset_in_compressed_file

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

Hard to read, maybe save into several tmp variables?

virtual String toString() const = 0;
virtual bool empty() const = 0;

virtual void update(const Block & block, size_t * pos, size_t limit) = 0;

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

need comment


void MergeTreeMinMaxGranule::update(const Block & block, size_t * pos, size_t limit)
{
size_t rows_read = std::min(limit, block.rows() - *pos);

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

Maybe add check that *pos is less than number of rows? I know from outer code, that it's always true.

auto granule = std::dynamic_pointer_cast<MergeTreeUniqueGranule>(idx_granule);
if (!granule)
throw Exception(
"Unique index condition got wrong granule", ErrorCodes::LOGICAL_ERROR);

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 1, 2019

Member

granule with wrong type?

INSERT INTO test.minmax_idx VALUES (1, 5, 6.9, 1.57, 'bac', 'c', '2014-07-11');

/* simple select */
SELECT * FROM test.minmax_idx WHERE i32 = 5 AND i32 + f64 < 12 AND 3 < d AND d < 7 AND (s = 'bac' OR s = 'cba') ORDER BY dt;

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 5, 2019

Member

maybe it would be better to append FORMAT JSON to end of query. After that we can see how many rows were read:

{
        "meta":
        [
                {
                        "name": "u64",
                        "type": "UInt64"
                },
                {
                        "name": "i32",
                        "type": "Int32"
                },
                {
                        "name": "f64",
                        "type": "Float64"
                },
                {
                        "name": "d",
                        "type": "Decimal(10, 2)"
                },
                {
                        "name": "s",
                        "type": "String"
                },
                {
                        "name": "e",
                        "type": "Enum8('a' = 1, 'b' = 2, 'c' = 3)"
                },
                {
                        "name": "dt",
                        "type": "Date"
                }
        ],

        "data":
        [
                {
                        "u64": "0",
                        "i32": 5,
                        "f64": 4.7,
                        "d": 6.50,
                        "s": "cba",
                        "e": "b",
                        "dt": "2014-01-04"
                }
        ],

        "rows": 1,

        "statistics":
        {
                "elapsed": 0.000601825,
                "rows_read": 1,
                "bytes_read": 43
        }
}


* `ALTER ADD INDEX name expression TYPE type GRANULARITY value AFTER name [AFTER name2]` - Adds index description to tables metadata.

* `ALTER DROP INDEX name` - Removes index description from tables metadata and deletes index files from disk.

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 5, 2019

Member

ALTER TABLE ... DROP INDEX?

namespace DB
{

using IndicesAsts = std::vector<std::shared_ptr<ASTIndexDeclaration>>;

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 5, 2019

Member

IndiciesASTs

String name;
IAST * expr;
ASTFunction * type;
Field granularity;

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 5, 2019

Member

Why we store granularity as Field if it's unsigned integer?

@@ -113,6 +113,7 @@ namespace ErrorCodes
extern const int KEEPER_EXCEPTION;
extern const int ALL_REPLICAS_LOST;
extern const int REPLICA_STATUS_CHANGED;
extern const int INCORRECT_QUERY;

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 5, 2019

Member

unused

MergeTreeReaderStream(
const String &path_prefix_, const String &extension_, size_t marks_count_,
const MarkRanges &all_mark_ranges,
MarkCache *mark_cache, bool save_marks_in_cache,

This comment has been minimized.

Copy link
@alesapin

alesapin Feb 5, 2019

Member

style

@alexey-milovidov alexey-milovidov merged commit a1b0ded into ClickHouse:master Feb 5, 2019
16 checks passed
16 checks passed
ClickHouse build check 9/9 builds are OK
Details
Clickhouse build check 9/9 builds are OK
Details
Functional stateful tests (address) fail:0, passed:92
Details
Functional stateful tests (debug) fail:0, passed:92
Details
Functional stateful tests (release) fail:0, passed:92
Details
Functional stateful tests (ubsan) fail:0, passed:92
Details
Functional stateless tests (address) fail:0, passed:999, skipped:1
Details
Functional stateless tests (debug) fail:0, passed:999, skipped:1
Details
Functional stateless tests (release) fail:0, passed:1000
Details
Functional stateless tests (ubsan) fail:0, passed:998, skipped:2
Details
Functional stateless tests (unbundled) fail:0, passed:995, skipped:5
Details
Integration tests (asan) fail:0, passed:112, error:0
Details
Integration tests (release) fail:0, passed:112, error:0
Details
PVS check Found 67 errors and 78 warnings
Details
Stress test No errors found
Details
Style check Style check passed
Details
* `minmax`
Хранит минимум и максимум выражения (если выражение - `tuple`, то для каждого элемента `tuple`), используя их для пропуска блоков аналогично первичному ключу.

* `unique(max_rows)`

This comment has been minimized.

Copy link
@alexey-milovidov

alexey-milovidov Feb 5, 2019

Member

It can be easily confused with unique constraint. Let's think about better name for it.

alexey-milovidov added a commit that referenced this pull request Feb 5, 2019
@nikvas0

This comment has been minimized.

Copy link
Contributor Author

nikvas0 commented Feb 8, 2019

Fixed in #4286

@4ertus2 4ertus2 mentioned this pull request Jun 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.