Update storing-data.md #60024

Merged (9 commits) on Mar 14, 2024

kssenii
Member

@kssenii kssenii commented Feb 15, 2024

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Updated documentation to include changes from #58357.

@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Feb 15, 2024
@robot-clickhouse-ci-1
Contributor

robot-clickhouse-ci-1 commented Feb 15, 2024

This is an automated comment for commit 69a6631 with description of existing statuses. It's updated for the latest CI running

Successful checks

| Check name | Description | Status |
| --- | --- | --- |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Mergeable Check | Checks if all other necessary checks are successful | ✅ success |
| PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |

@kssenii kssenii force-pushed the add-documentation-for-disks-configuration branch from ba77897 to 9bcd4da February 15, 2024 15:36
@kssenii kssenii marked this pull request as draft February 15, 2024 15:37
@alexey-milovidov alexey-milovidov self-assigned this Feb 15, 2024
@kssenii kssenii marked this pull request as ready for review February 15, 2024 17:58
@danthegoodman1 danthegoodman1 left a comment

I didn't get to read through everything, but I tried to target the areas where I know the docs were previously lacking. Hopefully this is helpful and can lead to some great docs!


### Using Plain Storage {#s3-storage}

There is a disk type `s3_plain`, which provides a write-once storage. Unlike `s3` disk type, it stores data as is, e.g. instead of randomly-generated blob names, it uses normal file names as clickhouse stores files on local disk. So this disk type allows to keeper a static version of the table and can also be used to create backups on it.

Shouldn't this talk about the plain metadata type now, since this setting no longer exists?

Also, because it's write-once, does this mean it never merges? Or when it merges it never deletes old parts?

> So this disk type allows to keeper a static version of the table and can also be used to create backups on it. Configuration parameters are the same as for s3 disk type.

Doesn't mean much to me tbh. Does this mean it basically just puts the exact clickhouse path on S3? If so, why are we doing the random blobs locally then, so metadata is then stored locally? Maybe that confusion is a result of this not being updated to reflect the plain metadata_type.

It would be useful to show this in context as well, like when I do this:

  • what does the S3 file structure look like?
  • how does this affect merges (as previously mentioned)?
  • should this ever be a working-set table, or do I need to make this a materialized view target? If this is a working table, what are the performance implications?
  • what do you mean by keep a static version and create a backup? Are you saying that because merges never delete it's a backup? Confusing without deep clickhouse context

Member Author

@kssenii kssenii Mar 14, 2024

> Shouldn't this talk about the plain metadata type now, since this setting no longer exists?

Yes, this reminded me that some code was missing to allow the use of plain metadata for other object storage types. That was addressed in #60396, which is why I did not continue with this documentation PR until now: the PR with the fix had to be merged first.

> Also, because it's write-once, does this mean it never merges? Or when it merges it never deletes old parts?

Yes, merges are disabled. Inserts are not allowed either (an exception is thrown on an attempt to insert data). Added this to the doc.

> Doesn't mean much to me tbh.

Added some more explanation to the doc.

> Does this mean it basically just puts the exact clickhouse path on S3?

Yes.

> If so, why are we doing the random blobs locally then, so metadata is then stored locally?

For the s3 disk type we store data in random blobs because, unlike s3_plain, it is not "write once": we have inserts and merges, so the requirements for ordinary s3 are higher. The limitations of object storage (no rename, move, hardlink operations, etc.) do not allow the same usability as a local filesystem, so we cannot handle it the same way.

> what does the S3 file structure look like?

Just randomly generated strings with a 3-digit prefix, e.g. /prefix_from_disk_config/blob_random_3_digit_prefix/blob_random_name.
There is also a feature that allows changing this blob path representation in a more performant way (#57663); as far as I can see, it is already documented.
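
To make the contrast between the two layouts concrete, here is a rough Python sketch of how an object key might be derived in each case. It is an illustration only, not ClickHouse code: the function names, the lowercase-letter alphabet, the 29-character blob name length, and the example part path are all assumptions made for this sketch. Only the overall shapes (`prefix/3-char-prefix/random-name` for `s3`, the original file path for `s3_plain`) come from the discussion above.

```python
import posixpath
import secrets
import string

# Assumptions for illustration only: the alphabet and the blob-name length
# are not stated in this thread.
ALPHABET = string.ascii_lowercase
PREFIX_LEN = 3
NAME_LEN = 29


def random_blob_key(disk_prefix: str) -> str:
    """Key shaped like the ordinary `s3` disk layout described above:
    <prefix from disk config>/<random 3-char prefix>/<random blob name>."""
    prefix = "".join(secrets.choice(ALPHABET) for _ in range(PREFIX_LEN))
    name = "".join(secrets.choice(ALPHABET) for _ in range(NAME_LEN))
    return posixpath.join(disk_prefix, prefix, name)


def plain_key(disk_prefix: str, local_path: str) -> str:
    """Key shaped like the `s3_plain` layout: the file's own relative
    path is reused unchanged under the disk prefix."""
    return posixpath.join(disk_prefix, local_path.lstrip("/"))


if __name__ == "__main__":
    part_file = "store/abc/all_1_1_0/columns.txt"  # hypothetical part file
    print(random_blob_key("data"))       # different on every call
    print(plain_key("data", part_file))  # always the same, readable key
```

With random keys, something else has to remember which blob belongs to which file, which is why the ordinary `s3` disk keeps separate metadata. With plain keys, the object path itself carries that information.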

> should this ever be a working-set table, or do I need to make this a materialized view target?

This is a normal read-only table, you can do whatever you want. The initial use case for the s3_plain disk was to create backups on it (I added some info about this to the doc). Backups to any disk type other than plain are not allowed.

> If this is a working table, what are the performance implications?

None apart from data parts not being merged.

> what do you mean by keep a static version and create a backup?

Added an explanation to the doc.

Member Author

@kssenii kssenii Mar 14, 2024

@danthegoodman1 I will merge this PR for now. If you have more comments - please write - I will address them in the next PR.

docs/en/operations/storing-data.md (outdated, resolved)

There's no explanation on what each metadata type does. I think it would be useful to briefly explain each one, and have an example config for each one that is something that might actually be used.

I see that they are explained in other context in other places (e.g. https://github.com/ClickHouse/ClickHouse/blob/09e630e02be9ccd19681b34f33e24cea849ca9fd/docs/en/operations/storing-data.md#using-static-web-storage-read-only-web-storage) but having them in one spot so it's easy to find the answer will make this far more accessible for users

docs/en/operations/storing-data.md (outdated, resolved)
@kssenii kssenii merged commit 83f1c89 into master Mar 14, 2024
39 checks passed
@kssenii kssenii deleted the add-documentation-for-disks-configuration branch March 14, 2024 18:20
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Mar 14, 2024
Labels

  • pr-not-for-changelog: This PR should not be mentioned in the changelog
  • pr-synced-to-cloud: The PR is synced to the cloud repo

5 participants