Parallel processing on replicas, reworked. #26748

Closed
alexey-milovidov opened this issue Jul 23, 2021 · 6 comments · Fixed by #29279

alexey-milovidov (Member) commented Jul 23, 2021:

We want to parallelize data processing using multiple replicas of a single shard.
Every replica should process some split of the data.

The following considerations make the task non-trivial:

  1. Replicas may contain different sets of data parts, because some replicas may lag behind and miss some new parts.
  2. Replicas may contain different sets of data parts, because some parts may be merged on one replica and not yet merged on another.
  3. We want to distribute the work uniformly across replicas, so if one replica is slower than another, we should not be bound by the performance of the slowest replica.

We cannot use a hashing-based split of data parts, because replicas have different sets of data parts.
We cannot split the set of data parts statically and assign them to replicas before query processing.

This task will also help to implement distributed processing over shared storage.
In the simplest case, multiple computation nodes over shared storage can look like replicas, and we split data processing across them.

Proposal

The initiator node sends the query to all participating replicas (all available replicas or a limited subset of them, depending on settings), along with a setting telling these replicas to coordinate parallel processing, and another setting: a hint on how many replicas are participating.
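For illustration only, the coordination-related information attached to the fanned-out query might look like the sketch below. The struct and field names are invented for this sketch and are not the actual ClickHouse settings.

```cpp
#include <cstddef>

// Illustrative sketch only: field names are invented and do not correspond
// to the actual ClickHouse settings. This is the extra information the
// initiator would attach to the query it fans out to each replica.
struct ParallelReplicasQueryInfo
{
    bool coordinate_with_initiator = false;  // "participate in coordinated parallel reading"
    size_t participating_replicas = 1;       // hint: how many replicas take part
    size_t replica_number = 0;               // which of them this replica is
};
```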

Every replica collects a snapshot of data parts to process the query as usual. From these data parts, "read tasks" are formed, such as: read this data part; read this range of marks from this data part (for large data parts). Read tasks are identical to what the replica would want to read during normal query processing.
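A read task and the request a replica sends to the initiator could be modelled roughly as follows. This is an illustrative C++ sketch; the type names, fields, and the example part name are assumptions, not the actual ClickHouse classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch only: names do not correspond to real ClickHouse classes.
// A mark range identifies a contiguous slice of a data part by mark numbers.
struct MarkRange
{
    size_t begin = 0;
    size_t end = 0;   // exclusive
};

// A read task: either a whole small part, or a range of marks of a large part.
struct ReadTask
{
    std::string part_name;          // e.g. "all_1_10_2" (hypothetical example)
    std::vector<MarkRange> ranges;  // empty means "the whole part"
};

// A batch of tasks that one replica proposes to the initiator in a single request.
struct ReadTaskRequest
{
    size_t replica_number = 0;
    std::vector<ReadTask> tasks;
};
```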

A replica collects some number of read tasks and sends a request with the list of read tasks back to the initiator node. The initiator node acts like a "semaphore" for replicas to coordinate their data processing. The request is essentially "tell me if I can take these tasks and assign them to me". The initiator node builds the set of data parts (in-memory state) with the ranges inside them that are being processed.

Note: this is similar to the already implemented s3Cluster table function, which coordinates processing of files on remote storage across multiple computation nodes.

For every read task it answers:

  • if no replica has already taken this data part, or a covering or intersecting data part, then allow it to be processed;
  • if some other replica has already taken this data part as a whole, or anything from a covering or intersecting data part, then answer that the replica should skip it;
  • if some other replica has already taken a range of this data part, then answer that the replica should take another range of this data part, starting from the already taken range, with a size comparable to the already taken range.

In addition, it can send info about already taken tasks, so the replica will not ask about them later.
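A minimal sketch of this initiator-side decision logic follows. It assumes parts are tracked by name and stubs out the covering/intersection check; it is illustrative C++, not the actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Illustrative sketch of the initiator-side "semaphore" described above.
// The real implementation works on ClickHouse part info; here parts are plain strings
// and the covering/intersection check is stubbed out.

enum class Decision { Take, Skip, TakeOtherRange };

struct TakenRange { size_t begin = 0; size_t end = 0; };   // in marks, end exclusive

struct CoordinatorState
{
    // For each part that is already (partially) assigned: the ranges taken so far.
    std::map<std::string, std::vector<TakenRange>> taken;

    // Stub: in ClickHouse, part names encode min/max block numbers, and the real check
    // compares those. The sketch conservatively treats distinct names as disjoint.
    static bool coversOrIntersects(const std::string &, const std::string &) { return false; }

    // Decide what the asking replica should do with one proposed task over a whole part.
    // If the answer is TakeOtherRange, `suggested` holds the mark range to read instead.
    Decision decide(const std::string & part_name, size_t part_marks, std::optional<TakenRange> & suggested)
    {
        // Anything already taken from a covering or intersecting part => skip entirely.
        for (const auto & entry : taken)
            if (entry.first != part_name && coversOrIntersects(entry.first, part_name))
                return Decision::Skip;

        auto it = taken.find(part_name);
        if (it == taken.end())
        {
            // Nobody touched this part yet: allow the replica to take it as proposed.
            // (For simplicity the sketch records the whole part as taken.)
            taken[part_name].push_back({0, part_marks});
            return Decision::Take;
        }

        // Some ranges of this very part are taken: suggest the next range of comparable size.
        const TakenRange & last = it->second.back();
        if (last.end >= part_marks)
            return Decision::Skip;   // the whole part is already assigned

        size_t size = std::max<size_t>(last.end - last.begin, 1);
        TakenRange next{last.end, std::min(part_marks, last.end + size)};
        it->second.push_back(next);
        suggested = next;
        return Decision::TakeOtherRange;
    }
};
```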

The number of read tasks sent to the initiator in one network request is chosen to balance uniform workload distribution against a low number of RTTs for short queries.
E.g. it can be one request per 1 GB of data (controlled by a setting).
This automatically lowers the number of replicas involved and the amount of coordination needed to process short queries.
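As a rough illustration of this batching heuristic: the byte threshold and the `approximate_bytes` field below are assumptions of the sketch, not real settings or fields.

```cpp
#include <cstddef>
#include <vector>

// Illustrative: a replica accumulates read tasks until roughly `bytes_per_request`
// worth of data is collected, then ships the batch to the initiator in one round trip.
// `Task::approximate_bytes` is an assumed field, not a real ClickHouse member.
template <typename Task, typename SendFn>
void sendInBatches(const std::vector<Task> & tasks, size_t bytes_per_request, SendFn send)
{
    std::vector<Task> batch;
    size_t batch_bytes = 0;
    for (const auto & task : tasks)
    {
        batch.push_back(task);
        batch_bytes += task.approximate_bytes;
        if (batch_bytes >= bytes_per_request)    // e.g. ~1 GB, controlled by a setting
        {
            send(batch);
            batch.clear();
            batch_bytes = 0;
        }
    }
    if (!batch.empty())
        send(batch);
}
```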

If some replicas lag behind and miss some parts, the result will include the data available on at least one replica (the most complete result).

If some replicas have processed a mutation and others have not, the result will be calculated over the data parts either before or after the mutation (non-deterministically).
E.g. if some records were deleted on one replica and not yet deleted on another, the result may include half of these records.

If some replicas have lost some of the data parts that should be merged while other replicas have completed the merge (a rare case), the result may not include the data from the lost parts.

Every replica may select read tasks in an order determined by a consistent hash of the replica number and the data part. This allows maintaining better cache affinity in the case of shared storage.
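One simple way to get such an ordering is sketched below (closer to rendezvous-style hashing than to a classic hash ring). `std::hash` is used for brevity where a stable hash function would be needed in practice, and the function names are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch: order a replica's candidate parts by a hash that combines the
// part name and the replica number, so each replica tends to start from "its own"
// parts and cache affinity on shared storage is preserved across queries.
inline size_t affinity(const std::string & part_name, size_t replica_number)
{
    return std::hash<std::string>{}(part_name + '#' + std::to_string(replica_number));
}

inline void orderByAffinity(std::vector<std::string> & part_names, size_t replica_number)
{
    std::sort(part_names.begin(), part_names.end(),
              [replica_number](const std::string & a, const std::string & b)
              {
                  return affinity(a, replica_number) < affinity(b, replica_number);
              });
}
```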

Note: as a simple extension, we can implement failover during query processing. If some replica dies in the middle of query processing and has not returned any block of data yet, we can drop processing on that replica and reassign its tasks.
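A sketch of what the initiator-side bookkeeping for this failover could look like (illustrative names; the rule that only replicas which have not yet sent any data may be failed over follows from the note above):

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch of the failover extension: if a replica dies before returning
// any data block, the initiator releases the tasks assigned to it so that the
// remaining replicas can pick them up on their next coordination request.
struct Assignment { std::string part_name; size_t begin_mark = 0; size_t end_mark = 0; };

struct FailoverState
{
    std::map<size_t /*replica*/, std::vector<Assignment>> assigned;
    std::set<size_t> replicas_that_sent_data;
    std::vector<Assignment> unassigned_pool;

    void onReplicaFailure(size_t replica)
    {
        // Reassignment is only safe if the replica has not produced any result block yet,
        // otherwise rows could be counted twice.
        if (replicas_that_sent_data.count(replica))
            return;
        auto it = assigned.find(replica);
        if (it == assigned.end())
            return;
        for (auto & task : it->second)
            unassigned_pool.push_back(std::move(task));
        assigned.erase(it);
    }
};
```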

yiguolei (Contributor) commented:

Any progress on this task?

nikitamikhaylov (Member) commented:

> Any progress on this task?

The feature is being developed; there is no PR yet. I would write a bit about the chosen algorithm, but I'd better write some code.

alexey-milovidov (Member, Author) commented:

FYI we found that the current implementation is inefficient, and @nikitamikhaylov is reworking the details of it.

maskshell commented:

How about extending Zero Copy from S3/HDFS to general-purpose disk storage? Similar to RAC's shared disk.

LuPan92 commented Aug 16, 2023:

@nikitamikhaylov This feature is awesome, I'm going to try it out. Do you have more detailed design documents?

alexey-milovidov (Member, Author) commented:

@LuPan92 the description in this issue is the complete design document.
