-
Parts merging is very complicated; it's a heuristic algorithm that depends on part sizes, part creation times, free disk space, the number of free merging threads, and so on. If you really want to understand it, please look at the code base. The main logic is around MergeTreeDataMergerMutator::selectPartsToMerge if you are on a code base at or before 22.3-lts.
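If you just want to observe what the selector actually picks, rather than reading the C++, a small sketch against the standard system tables may help (the table name `events` is a placeholder, and `system.part_log` must be enabled in your server config):

```sql
-- Merges currently in flight, with the source parts the selector chose.
SELECT table, elapsed, progress, num_parts, source_part_names, result_part_name
FROM system.merges;

-- Recently completed merges, drawn from the part log.
SELECT event_time, merged_from, part_name, formatReadableSize(size_in_bytes) AS size
FROM system.part_log
WHERE table = 'events' AND event_type = 'MergeParts'
ORDER BY event_time DESC
LIMIT 10;
```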
-
Some more detailed information on your questions. In general, as a first point, it's good to understand the relation between parts and the partitioning key (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key).
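To make that relation concrete, here is a minimal sketch (the table and its schema are invented for illustration): each `INSERT` creates one or more parts per partition, and parts are only ever merged within the same partition, which you can watch via `system.parts`:

```sql
-- Hypothetical table partitioned by month; parts never merge across partitions.
CREATE TABLE events
(
    date Date,
    id UInt64,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (date, id);

-- Each partition accumulates and merges its own set of parts.
SELECT partition, name, rows, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'events' AND active
ORDER BY partition, name;
```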
For more details on merges, e.g. how they are picked and which parameters are involved, see https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/MergeTree/SimpleMergeSelector.h#L6-L81

To configure multiple disks with S3, you could for example use a config like this:

```yaml
storage_configuration:
  disks:
    s3disk:
      endpoint: ...
      path: ...
      region: ...
      type: s3
    s3diskWithCache:
      cache_on_write_operations: 1
      disk: s3disk
      do_not_evict_index_and_mark_files: 0
      max_size: ...
      path: ...
      type: cache
    default:
      keep_free_space_bytes: 536870912
  policies:
    default:
      move_factor: 0.5
      volumes:
        01_hot:
          disk: default
          max_data_part_size_bytes: 4294967296
        02_cold:
          disk: s3diskWithCache
    s3:
      volumes:
        main:
          disk: s3diskWithCache
```

There are various resources on this in our docs, e.g. https://clickhouse.com/docs/en/operations/storing-data/#using-local-cache. We would recommend not using `prefer_not_to_merge`.
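To attach such a policy to a table, you would then reference it by name in the table settings (a minimal sketch; the table name and schema are placeholders):

```sql
CREATE TABLE example_tiered
(
    date Date,
    id UInt64
)
ENGINE = MergeTree
ORDER BY (date, id)
SETTINGS storage_policy = 'default';

-- Check which disk each active part currently sits on.
SELECT name, disk_name, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'example_tiered' AND active;
```

With the `default` policy above, new parts land on the local `default` disk and move to the S3-backed volume once free space drops below `move_factor` or a part exceeds `max_data_part_size_bytes`.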
-
Hello,
I wanted to understand background merges better. I understand that background merges keep happening until a part reaches 150GB, as defined by `max_bytes_to_merge_at_max_space_in_pool`. I would like to understand the following:

1. Does merging take the `ORDER BY` key into account? So, for example, if the `ORDER BY` is by `date` and new inserts have an older date, will they eventually get merged with the corresponding older part that has the matching date (even if that older part is already 150GB and needs no merging)? Or is ordering not maintained across parts, so that these new inserts can be merged into a new 150GB part?
2. With `ReplacingMergeTree`, if the duplicate entries to be removed sit in large parts of 150GB, do all of those parts get reprocessed (given that deduplication only occurs during merges, and there is no need to merge parts that have already reached 150GB)?

I am trying to understand all of this because I would like ClickHouse to make no (or minimal) changes to parts once they reach the size defined by `max_bytes_to_merge_at_max_space_in_pool`. The reason is that as our data grows, we would want to use slower, larger disks and maybe S3 for older data. I understand I can also create multiple volumes, have ClickHouse move parts to such a volume once they reach 150GB using `max_data_part_size_bytes`, and set `prefer_not_to_merge` on that volume (see the sketch below). What would be the disadvantages of doing this?
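For reference, the setup described in the last paragraph would look roughly like this as a storage policy (a sketch only; the disk names are placeholders, and note the reply above recommends against `prefer_not_to_merge`):

```yaml
storage_configuration:
  policies:
    tiered:
      volumes:
        hot:
          disk: default
          # Parts above ~150GB no longer fit here and get moved to the next volume.
          max_data_part_size_bytes: 161061273600
        cold:
          disk: s3disk
          # Ask the merge scheduler to leave parts on this volume alone.
          prefer_not_to_merge: 1
```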