
[FEATURE] properly backup S3/GCS/Azure disks, support server-side copy to back up bucket during clickhouse-backup create and during clickhouse-backup restore #447

Closed
datavisorzhizhu opened this issue May 30, 2022 · 13 comments · Fixed by #702

datavisorzhizhu commented May 30, 2022

I have two disks: one is the default disk and the other is an S3 disk. Here is the storage configuration on both the source and the target ClickHouse servers:

<yandex>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.us-west-2.amazonaws.com/datavisor-access-logs/fedex/</endpoint>
                <use_environment_credentials>false</use_environment_credentials>
                <access_key_id>xxxxxx</access_key_id>
                <secret_access_key>xxxxxx</secret_access_key>
                <region>us-west-2</region>
                <connect_timeout_ms>10000</connect_timeout_ms>
                <request_timeout_ms>5000</request_timeout_ms>
                <retry_attempts>10</retry_attempts>
                <single_read_retries>4</single_read_retries>
                <min_bytes_for_seek>1000</min_bytes_for_seek>
                <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3/cache/</cache_path>
                <skip_access_check>false</skip_access_check>
            </s3>
        </disks>
        <policies>
          <default>
            <volumes>
              <volume_groups>
                <disk>default</disk>
              </volume_groups>
            </volumes>
          </default>
          <tiered>
            <volumes>
              <volume_groups>
                 <disk>default</disk>
              </volume_groups>
              <s3>
                 <disk>s3</disk>
              </s3>
            </volumes>
          </tiered>
          <s3only>
            <volumes>
              <s3>
                <disk>s3</disk>
              </s3>
            </volumes>
          </s3only>
        </policies>
    </storage_configuration>
</yandex>

I created a table using the storage policy 'tiered' so that part of the data is stored in the S3 bucket.
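For illustration, such a table could be declared as in the following sketch (the table name and schema are hypothetical; only the storage_policy setting matters here):

  CREATE TABLE example.events
  (
      event_date Date,
      value UInt64
  )
  ENGINE = MergeTree
  PARTITION BY toYear(event_date)
  ORDER BY event_date
  SETTINGS storage_policy = 'tiered';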

When I restore the ClickHouse database using clickhouse-backup, the partitions on the default disk are restored properly, but the partitions on the S3 disk cannot be restored. The following command throws an exception:
ALTER TABLE fedex.ontime ATTACH PART '2022_2_2_0';

Code: 499. DB::Exception: Received from localhost:9000. DB::Exception: The specified key does not exist.. (S3_ERROR)

In the target ClickHouse database, I have the following detached parts:
(screenshot of the detached parts attached in the original issue)

Please help, thanks in advance.


Slach commented May 30, 2022

Unfortunately, S3 disks are not fully supported by clickhouse-backup right now.

When we added S3 support we didn't properly clean the data in the test case, so the currently passing test is a false positive.

Right now clickhouse-backup successfully backs up only the metadata that contains the S3 object keys, not the real data.
If you drop the real data in S3, you can't restore it with clickhouse-backup.

So, currently, I don't know of a good solution to fully support S3 disks.
An ugly workaround (sketched after this list) is:

  • make clickhouse-backup create + upload
  • use something like https://rclone.org to copy the existing s3 data to the new destination
  • change the clickhouse s3 disk settings / storage_policy on the destination server
  • run clickhouse-backup restore
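A rough shell sketch of that workaround; the backup name, bucket paths and rclone remotes below are hypothetical and depend on your actual disk configuration:

  # on the source server: create and upload the backup (covers schema + local-disk data)
  clickhouse-backup create my_backup
  clickhouse-backup upload my_backup

  # copy the s3 disk objects themselves to the destination bucket (names are placeholders)
  rclone sync src-remote:source-bucket/data/ dst-remote:destination-bucket/data/

  # on the destination server: point the s3 disk <endpoint> at the destination bucket,
  # then download and restore the backup
  clickhouse-backup download my_backup
  clickhouse-backup restore my_backup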

datavisorzhizhu (Author) commented:

@Slach thanks for your explanation. Should I close the ticket or leave it open for future improvement?

datavisorzhizhu (Author) commented:

I want to confirm one more thing: if everything goes smoothly, the partition name should remain intact after restoring from a backup, right?

I ask the above question because I saw that, during restore, a partition name on the default disk changed.


Slach commented May 30, 2022

What exactly do you mean when you say "partition name"?
Do you mean the name generated from the PARTITION BY clause during CREATE TABLE execution?

datavisorzhizhu (Author) commented:

SELECT name, table, part_type, disk_name, path
FROM system.parts
WHERE (table = 'ontime') AND active;

By "partition name", I mean 'name' column from above SQL. In source clickhouse, the partition name in default disk changes from '2022_6_6_0' to '2022_1_1_0' in restored clickhouse


Slach commented May 31, 2022

By "partition name", I mean 'name' column from above SQL. In source clickhouse, the partition name in default disk changes from '2022_6_6_0' to '2022_1_1_0' in restored clickhouse

This is not a "data partition", this is a "data part".

I don't know offhand whether the part name should change after ALTER TABLE ... ATTACH PART ... or not.

parts.name has the following format:
<partition_name_prefix>_<min_block>_<max_block>_<merge_level>

partition_name_prefix - 2022 - means you have something like PARTITION BY toYear(xxx); if your table doesn't have PARTITION BY, it will be 'all' for every part.

min_block and max_block are the numbers of the blocks produced during INSERT statements; _6_6 means your part contains only block number 6.

merge_level = 0 means this part was never merged with other parts, so block number 6 was never merged with blocks 5 and 4.
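As a sketch, the same pieces can be read directly from system.parts, whose partition, min_block_number, max_block_number and level columns correspond to the segments of the part name:

  SELECT
      name,              -- e.g. '2022_6_6_0'
      partition,         -- partition prefix, e.g. '2022'
      min_block_number,  -- 6
      max_block_number,  -- 6
      level              -- merge level, 0 if the part was never merged
  FROM system.parts
  WHERE (table = 'ontime') AND active;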

Slach changed the title from "issue when restoring data stored in s3 bucket using clickhouse-backup" to "[FEATURE] Support s3 bucket copy via rclone during clickhouse-backup create and properly restore storage policy" on May 31, 2022
Slach self-assigned this on May 31, 2022
Slach added this to the 2.0.0 milestone on May 31, 2022
datavisorzhizhu (Author) commented:

@Slach do we have an estimated time for this feature to be ready? thanks


Slach commented Aug 23, 2022

No estimate for this feature right now;
I haven't figured out how to implement it on the s3 side.

On the v2 branch we have already implemented use_embedded_backup_restore: true,
which allows backing up s3 disks properly, but it has a trade-off (data streams from the source s3 into clickhouse-server memory and is then streamed to the destination s3 bucket).

You can also try to use rclone sync with the v2 branch via remote_storage: custom.
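The two alternatives map to config keys roughly like this sketch (key placement assumed from the v2 config layout; pick one approach or the other):

  # alternative 1: let clickhouse-server itself stream the s3 disk data
  # (trade-off: data passes through server memory)
  clickhouse:
    use_embedded_backup_restore: true

  # alternative 2: handle the bucket copy yourself (e.g. with rclone)
  # via the custom remote storage hooks
  general:
    remote_storage: custom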

Feel free to make a pull request.

datavisorzhizhu (Author) commented:

Thanks for the explanation @Slach

Slach modified the milestones: 2.0.0 → 2.1.0 on Sep 4, 2022

gfunc commented Oct 10, 2022

Perhaps it would be nice to add a switch for this feature (snapshot data on S3 disk)? For my use case, I need to back up metadata and metadata only.

Slach modified the milestones: 2.1.0 → 2.2.0 on Oct 13, 2022

Slach commented Jan 1, 2023

@gfunc if you want to back up metadata only (meaning the SQL definitions of databases / tables / views / functions), just use clickhouse-backup create --schema.
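A minimal sketch of that (the backup name is a placeholder):

  # back up only the SQL definitions, then upload them
  clickhouse-backup create --schema schema_only
  clickhouse-backup upload schema_only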

Slach modified the milestones: 2.2.0 → 2.3.0 on Feb 13, 2023
Slach changed the title from "[FEATURE] Support s3 bucket copy via rclone during clickhouse-backup create and properly restore storage policy" to "[FEATURE] Support s3 bucket server copy during clickhouse-backup create and properly restore with check storage policy" on Jun 7, 2023
Slach changed the title from "[FEATURE] Support s3 bucket server copy during clickhouse-backup create and properly restore with check storage policy" to "[FEATURE] properly backup S3 disks, support s3 keys server copy during clickhouse-backup create and properly restore with check storage policy" on Jun 7, 2023
Slach modified the milestones: 2.3.0 → 2.4.0 on Jun 21, 2023
Slach changed the title from "[FEATURE] properly backup S3 disks, support s3 keys server copy during clickhouse-backup create and properly restore with check storage policy" to "[FEATURE] properly backup S3/GCS/Azure disks, support server-side copy to back up bucket during clickhouse-backup create and during clickhouse-backup restore" on Jul 21, 2023
silentsokolov commented:

@Slach I need to back up metadata only, but when I used the --schema option, the metadata (aka "shadow") wasn't uploaded; only metadata.json and the .sql files were uploaded. I cannot restore a table on an S3 disk without "shadow".


Slach commented Sep 14, 2023

@silentsokolov could you create a separate issue and provide the exact command and the logs it outputs?
