backups: add prefixes to backup keys to improve S3 sharding#88418
backups: add prefixes to backup keys to improve S3 sharding#88418jkartseva merged 48 commits intoClickHouse:masterfrom
Conversation
|
Workflow [PR], commit [1517d12] Summary: ❌
|
b48e41e to
2ac4803
Compare
|
@jkartseva Let's not expose the |
|
Also, please check how restores would work with this scheme. |
I think it should be enabled through a setting – either a query setting or a configuration setting, currently under We'll also need to account for incremental backups. If we enable |
| [ | ||
| pytest.param("", id="first_file_name"), | ||
| pytest.param("checksum"), | ||
| ], |
There was a problem hiding this comment.
I would prefer to create a separate test*.py file, enable there the checksum data file name generation and copy there some of these tests.
| @pytest.fixture( | ||
| params=[[], ["configs/data_file_name_generator_from_checksum.xml"]], | ||
| ids=["data_file_name_from_first_file_name", "data_file_name_from_checksum"], | ||
| scope="module", |
There was a problem hiding this comment.
Can you explain how this parametrization works?
There was a problem hiding this comment.
This is a parametrized fixture – pytest will run every test that depends on setup_cluster twice, once for each value in params.
[] adds no extra config file
["configs/data_file_name_generator_from_checksum.xml"] adds one more config file that directs data_file_name to be default from checksum.
Extra configs passed as a list, which concatenated with the main_congis list in setup_cluster function.
tests/integration/test_backup_restore_azure_blob_storage/test.py
Outdated
Show resolved
Hide resolved
| node1.query("SYSTEM SYNC REPLICA ON CLUSTER 'cluster' tbl") | ||
|
|
||
| backup_name = new_backup_name() | ||
| backup_settings = {"data_file_name_generator": name_gen} if name_gen else None |
There was a problem hiding this comment.
We could randomly enable this feature in some tests. So if something fails it'll make the tests flaky.
There was a problem hiding this comment.
I don't know how to to this in the integration tests. I can think of fault injection in C++ code, such as randomly adding a data_file_name_generator=checksum setting when a backup is created, but it may not be what we want.
|
If I understood it correctly we need this feature mostly for S3. We are not sure if we need it for Azure or for backups on local disks. Thus it seems we need to broadly test this feature for S3, and maybe just a bit of testing for Azure and for backups on local disks. So it seems right to keep your parametrized test approach for S3, and maybe have just one test for Azure and one for backups on local disks. |
5327542 to
af8a8c0
Compare
7196ad8
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
S3 partitions objects internally based on key-name prefixes and automatically scales to high request rates per partition. This change introduces two new BACKUP settings: data_file_name_generator and data_file_name_prefix_length. When data_file_name_generator=checksum, backup data files are named using a hash of their contents. Example: for a checksum =
abcd1234ef567890abcd1234ef567890anddata_file_name_prefix_length = 3, the resulting path will be:abc/d1234ef567890abcd1234ef567890. The resulting key distribution enhances load balancing across S3 partitions and reduces the risk of throttling.Documentation entry for user-facing changes
Details
Addresses https://github.com/ClickHouse/clickhouse-private/issues/35866#issuecomment-3346135504