Assorted ideas to slightly improve JOINs #21047
Comments
Does it make sense to use a HashedArray-like layout for JOIN hash tables? (#30236) It uses less memory but gives pretty similar performance to a hashed dictionary, especially in the case of multi-attribute lookup (because we need to get the index offset only once for all values). It would also speed up the initial hash table build, which is important for joins.
@UnamedRus It already uses one hash table for all columns. The hash table contains a reference to the row.
Is RowRef worth optimizing? Currently RowRef has two members (Block * block, size_t row_num). If we can use (size_t block_number, size_t row_num) instead, we can save hash table memory and keep the cache compact.
@awakeljw Currently, we use uint32_t as SizeT in code.
@watemus and I have started working on this |
Only ideas that are rather easy to implement.
Add an IColumn::shrinkToFit method. It will remove the overallocation of columns and save memory (for hash join) up to 2x. Use this method in HashJoin unconditionally.

For CROSS JOIN (nested loops): compress blocks in memory if there is a large amount of data. Uncompress while joining (repeatedly, for every iteration of the outer loop). The implementation is very easy after Compression for Memory tables #20168.
For CROSS JOIN (nested loops): write blocks to a tmp directory in Native format (similar to external sorting and external aggregation) if there is a large amount of data. Read them back many times while joining.
Compress blocks for hash join in memory. While joining, maintain an LRU cache of uncompressed blocks. This can work well if the JOIN is skewed; otherwise it is questionable.
If the amount of data is large, serialize all records to disk in RowBinary format and keep offsets in the hash table (we will have 8 bytes per record + key size + hash table overhead, instead of keeping all the data in memory). While joining, do batch reads with AIO and also maintain a small LRU hash table in memory. The performance can be decent (1 million IOPS on a modern SSD).

If the amount of data is large, replace HashJoin with SSDCacheDictionary (the performance of SSDCacheDictionary is assumed to be decent).
Represent the data structure for the right-hand side of JOIN or IN as a table for key-value requests. When doing a distributed JOIN, instead of the usual (broadcast or shuffle) algorithms, do lookup requests over the network. The right-hand side of a distributed JOIN can be represented as a special kind of distributed table with a local cache of lookup results. Applicability is limited but can be good for some scenarios: a large rhs table, but only a small subset of keys being JOINed.
For INNER and RIGHT JOIN, try to use the set of keys of the rhs table as an index for the lhs table, similar to IN.