
[FEATURE] properly backup S3/GCS/Azure disks, support server-side copy to back up bucket during clickhouse-backup create and during clickhouse-backup restore #447

Closed
datavisorzhizhu opened this issue May 30, 2022 · 13 comments · Fixed by #702

datavisorzhizhu commented May 30, 2022

I have two disks: one is the default disk and the other is an S3 disk. Here is the storage configuration on both the source and the target ClickHouse servers:

<yandex>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://s3.us-west-2.amazonaws.com/datavisor-access-logs/fedex/</endpoint>
                <use_environment_credentials>false</use_environment_credentials>
                <access_key_id>xxxxxx</access_key_id>
                <secret_access_key>xxxxxx</secret_access_key>
                <region>us-west-2</region>
                <connect_timeout_ms>10000</connect_timeout_ms>
                <request_timeout_ms>5000</request_timeout_ms>
                <retry_attempts>10</retry_attempts>
                <single_read_retries>4</single_read_retries>
                <min_bytes_for_seek>1000</min_bytes_for_seek>
                <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3/cache/</cache_path>
                <skip_access_check>false</skip_access_check>
            </s3>
        </disks>
        <policies>
          <default>
            <volumes>
              <volume_groups>
                <disk>default</disk>
              </volume_groups>
            </volumes>
          </default>
          <tiered>
            <volumes>
              <volume_groups>
                 <disk>default</disk>
              </volume_groups>
              <s3>
                 <disk>s3</disk>
              </s3>
            </volumes>
          </tiered>
          <s3only>
            <volumes>
              <s3>
                <disk>s3</disk>
              </s3>
            </volumes>
          </s3only>
        </policies>
    </storage_configuration>
</yandex>

I created a table using the storage policy 'tiered' so that part of the data is stored in the S3 bucket.
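For illustration, such a table could be declared as in the following sketch (the table name and schema are hypothetical; only the storage_policy setting matters here):

  CREATE TABLE example.events
  (
      event_date Date,
      value UInt64
  )
  ENGINE = MergeTree
  PARTITION BY toYear(event_date)
  ORDER BY event_date
  SETTINGS storage_policy = 'tiered';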

When I restore the ClickHouse database using clickhouse-backup, the partitions on the default disk are restored properly, but the partitions on the S3 disk cannot be restored. The following command throws an exception:
ALTER TABLE fedex.ontime ATTACH PART '2022_2_2_0';

Code: 499. DB::Exception: Received from localhost:9000. DB::Exception: The specified key does not exist.. (S3_ERROR)

In the target ClickHouse database, I have the following detached parts:
(screenshot of the detached parts attached in the original issue)

Please help, thanks in advance.


Slach commented May 30, 2022

Unfortunately, S3 disks are not fully supported by clickhouse-backup right now.

When we added S3 support we didn't properly clean the data in the test case, so the currently passing test is a false positive.

Right now clickhouse-backup successfully backs up only the metadata that contains the S3 object keys, not the real data.
If you drop the real data in S3, you can't restore it with clickhouse-backup.

So, currently, I don't know of a good solution to fully support S3 disks.
An ugly workaround (sketched after this list) is:

  • make clickhouse-backup create + upload
  • use something like https://rclone.org to copy the existing s3 data to the new destination
  • change the clickhouse s3 disk settings / storage_policy on the destination server
  • run clickhouse-backup restore
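A rough shell sketch of that workaround; the backup name, bucket paths and rclone remotes below are hypothetical and depend on your actual disk configuration:

  # on the source server: create and upload the backup (covers schema + local-disk data)
  clickhouse-backup create my_backup
  clickhouse-backup upload my_backup

  # copy the s3 disk objects themselves to the destination bucket (names are placeholders)
  rclone sync src-remote:source-bucket/data/ dst-remote:destination-bucket/data/

  # on the destination server: point the s3 disk <endpoint> at the destination bucket,
  # then download and restore the backup
  clickhouse-backup download my_backup
  clickhouse-backup restore my_backup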

datavisorzhizhu (Author) commented:

@Slach thanks for your explanation. Should I close the ticket or leave it open for future improvement?

datavisorzhizhu (Author) commented:

I want to confirm one more thing: if everything goes smoothly, the partition name should remain intact after restoring from a backup, right?

I ask the above question because I saw that, during restore, a partition name on the default disk changed.


Slach commented May 30, 2022

What exactly do you mean when you say "partition name"?
Do you mean the name generated from the PARTITION BY clause during CREATE TABLE execution?

datavisorzhizhu (Author) commented:

SELECT name, table, part_type, disk_name, path
FROM system.parts
WHERE (table = 'ontime') AND active;

By "partition name", I mean 'name' column from above SQL. In source clickhouse, the partition name in default disk changes from '2022_6_6_0' to '2022_1_1_0' in restored clickhouse


Slach commented May 31, 2022

By "partition name", I mean 'name' column from above SQL. In source clickhouse, the partition name in default disk changes from '2022_6_6_0' to '2022_1_1_0' in restored clickhouse

This is not a "data partition", this is a "data part".

I don't know offhand whether the part name should change after ALTER TABLE ... ATTACH PART ... or not.

parts.name has the following format:
<partition_name_prefix>_<min_block>_<max_block>_<merge_level>

partition_name_prefix - 2022 - means you have something like PARTITION BY toYear(xxx); if your table doesn't have PARTITION BY, it will be 'all' for every part.

min_block and max_block are the numbers of the blocks produced during INSERT statements; _6_6 means your part contains only block number 6.

merge_level = 0 means this part was never merged with other parts, so block number 6 was never merged with blocks 5 and 4.
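As a sketch, the same pieces can be read directly from system.parts, whose partition, min_block_number, max_block_number and level columns correspond to the segments of the part name:

  SELECT
      name,              -- e.g. '2022_6_6_0'
      partition,         -- partition prefix, e.g. '2022'
      min_block_number,  -- 6
      max_block_number,  -- 6
      level              -- merge level, 0 if the part was never merged
  FROM system.parts
  WHERE (table = 'ontime') AND active;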

Slach changed the title from "issue when restoring data stored in s3 bucket using clickhouse-backup" to "[FEATURE] Support s3 bucket copy via rclone during clickhouse-backup create and properly restore storage policy" on May 31, 2022
Slach self-assigned this on May 31, 2022
Slach added this to the 2.0.0 milestone on May 31, 2022
datavisorzhizhu (Author) commented:

@Slach do we have an estimated time for this feature to be ready? thanks


Slach commented Aug 23, 2022

No estimate for this feature right now;
I haven't figured out how to implement it on the s3 side.

On the v2 branch we have already implemented use_embedded_backup_restore: true,
which allows backing up s3 disks properly, but it has a trade-off (data streams from the source s3 into clickhouse-server memory and is then streamed to the destination s3 bucket).

You can also try to use rclone sync with the v2 branch via remote_storage: custom.
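The two alternatives map to config keys roughly like this sketch (key placement assumed from the v2 config layout; pick one approach or the other):

  # alternative 1: let clickhouse-server itself stream the s3 disk data
  # (trade-off: data passes through server memory)
  clickhouse:
    use_embedded_backup_restore: true

  # alternative 2: handle the bucket copy yourself (e.g. with rclone)
  # via the custom remote storage hooks
  general:
    remote_storage: custom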

Feel free to make a pull request.

datavisorzhizhu (Author) commented:

Thanks for the explanation @Slach

Slach modified the milestones: 2.0.0 → 2.1.0 on Sep 4, 2022

gfunc commented Oct 10, 2022

Perhaps it would be nice to add a switch for this feature (snapshot data on S3 disk)? For my use case, I need to back up metadata and metadata only.

Slach modified the milestones: 2.1.0 → 2.2.0 on Oct 13, 2022

Slach commented Jan 1, 2023

@gfunc if you want to back up metadata only (meaning the SQL definitions of databases / tables / views / functions), just use clickhouse-backup create --schema.
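A minimal sketch of that (the backup name is a placeholder):

  # back up only the SQL definitions, then upload them
  clickhouse-backup create --schema schema_only
  clickhouse-backup upload schema_only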

Slach modified the milestones: 2.2.0 → 2.3.0 on Feb 13, 2023
Slach changed the title from "[FEATURE] Support s3 bucket copy via rclone during clickhouse-backup create and properly restore storage policy" to "[FEATURE] Support s3 bucket server copy during clickhouse-backup create and properly restore with check storage policy" on Jun 7, 2023
Slach changed the title from "[FEATURE] Support s3 bucket server copy during clickhouse-backup create and properly restore with check storage policy" to "[FEATURE] properly backup S3 disks, support s3 keys server copy during clickhouse-backup create and properly restore with check storage policy" on Jun 7, 2023
Slach modified the milestones: 2.3.0 → 2.4.0 on Jun 21, 2023
Slach changed the title from "[FEATURE] properly backup S3 disks, support s3 keys server copy during clickhouse-backup create and properly restore with check storage policy" to "[FEATURE] properly backup S3/GCS/Azure disks, support server-side copy to back up bucket during clickhouse-backup create and during clickhouse-backup restore" on Jul 21, 2023
silentsokolov commented:

@Slach I need to back up metadata only, but when I used the --schema option, the metadata (aka "shadow") wasn't uploaded; only metadata.json and the .sql files were uploaded. I cannot restore a table on an S3 disk without "shadow".


Slach commented Sep 14, 2023

@silentsokolov could you create a separate issue and provide the exact command and the logs it outputs?
