RFC: disk without rename/hardlink/mv/copy operations #42337

Closed
UnamedRus opened this issue Oct 15, 2022 · 4 comments
Labels
development Developement process & source code & implementation details feature st-fixed


@UnamedRus
Contributor

UnamedRus commented Oct 15, 2022

Use case
ClickHouse currently supports multiple types of disks for data (disk, s3, hdfs, blob) and some special ones (cache, encryption).
<disk>
Basically any POSIX-compatible filesystem.
ClickHouse was initially designed around how ext4 deals with file and directory renames.
But not all filesystems are equal: although all of them claim POSIX compatibility, the performance of certain operations can differ a lot.
For example, there are known performance issues on XFS.

Another problem is that the POSIX-compatible disk type hides an entire family of filesystems and backend storage based on completely different ideas and physical principles:
from locally attached disks (NVMe, SSD, HDD, software/hardware RAID arrays) to network storage based on SAN/NAS (NFS/Lustre/Gluster/Ceph) and cloud-provided block devices (EBS/PV/Azure Block Storage).
All of them provide different guarantees on latency and on the performance of certain operations, and sometimes they are far from a locally attached ext4 filesystem.

It's especially noticeable during mutations, when a lot of parts are renamed or hard links are created. (This is one of the main reasons why lightweight mutations are not as lightweight as they were meant to be; the second one is writing new parts to ZK.)

Such problems also exist in non-POSIX storage options like S3, HDFS?, BLOB. But there it's more extreme, as you can't rename a file without re-uploading it (S3), and there are no hard links (in any of them).

Current approach

  1. All files related to a single part reside in a single directory.
  2. All parts of a table reside in the table directory (for the Atomic database engine it is named by a unique UUID instead of DB_NAME/TABLE_NAME as in Ordinary, so there are no FS renames on RENAME TABLE, and EXCHANGE TABLES is atomic, YAY).
  3. ATTACH PART[ITION] and ALTER TABLE UPDATE use hard links to create a copy of the data.

It also complicates backups made with snapshotting tools, because ClickHouse renames and moves files quite often.
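
To make item 3 of the list concrete, here is a minimal Python sketch of cloning a part directory with hard links and then renaming it into place. This is only an illustration of the filesystem calls involved, not ClickHouse's actual code; the part names (`all_1_1_0`, `all_1_1_0_2`), the temporary directory name and the helper `clone_part_with_hardlinks` are all hypothetical.

```python
import os
import tempfile


def clone_part_with_hardlinks(src_part: str, dst_part: str) -> None:
    """Clone a flat part directory by hard-linking every file (no data copy)."""
    os.mkdir(dst_part)
    for name in os.listdir(src_part):
        os.link(os.path.join(src_part, name), os.path.join(dst_part, name))


def demo() -> bool:
    with tempfile.TemporaryDirectory() as table_dir:
        # Hypothetical source part with a single data file.
        src = os.path.join(table_dir, "all_1_1_0")
        os.mkdir(src)
        with open(os.path.join(src, "data.bin"), "wb") as f:
            f.write(b"column data")

        # Build the clone under a temporary name, then rename it into place.
        tmp = os.path.join(table_dir, "tmp_clone_all_1_1_0")
        clone_part_with_hardlinks(src, tmp)
        final = os.path.join(table_dir, "all_1_1_0_2")
        os.rename(tmp, final)

        # Both directory entries point at the same inode.
        a = os.stat(os.path.join(src, "data.bin"))
        b = os.stat(os.path.join(final, "data.bin"))
        return a.st_ino == b.st_ino and a.st_nlink == 2
```

Every file in the part costs one `link` call plus a directory rename at the end, and the cost of those metadata operations is exactly what varies between ext4, XFS, network filesystems and cloud block devices.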

The proposal is simple:
Reduce the set of possible file manipulations to a minimum, basically CREATE, READ, REMOVE (the same set as in object storage).
This will require a layout similar to what we have in object storage:
a shared bin where all files have randomly assigned names. (ClickHouse can probably reuse the implementation from the s3 disk.)
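
A rough Python sketch of the proposed layout, under the assumption that a mapping from logical paths to blob names is kept separately from the data. The class `BlobStore` and its methods are hypothetical and are not the s3 disk implementation; the mapping lives in memory here only to keep the example short, a real disk would have to persist it.

```python
import os
import tempfile
import uuid


class BlobStore:
    """Hypothetical store that only creates, reads and removes files on disk."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.paths = {}  # logical path -> blob name
        self.refs = {}   # blob name -> number of logical paths using it

    def create(self, path: str, data: bytes) -> None:
        if path in self.paths:
            raise FileExistsError(path)
        blob = uuid.uuid4().hex  # randomly assigned name in the shared bin
        with open(os.path.join(self.root, blob), "wb") as f:
            f.write(data)
        self.paths[path] = blob
        self.refs[blob] = 1

    def read(self, path: str) -> bytes:
        with open(os.path.join(self.root, self.paths[path]), "rb") as f:
            return f.read()

    def link(self, src: str, dst: str) -> None:
        """'Hard link' replacement: metadata only, the blob is untouched."""
        if dst in self.paths:
            raise FileExistsError(dst)
        blob = self.paths[src]
        self.paths[dst] = blob
        self.refs[blob] += 1

    def rename(self, src: str, dst: str) -> None:
        """'Rename' replacement: metadata only, the blob is untouched."""
        if dst in self.paths:
            raise FileExistsError(dst)
        self.paths[dst] = self.paths.pop(src)

    def remove(self, path: str) -> None:
        blob = self.paths.pop(path)
        self.refs[blob] -= 1
        if self.refs[blob] == 0:  # last reference gone: delete the blob
            os.remove(os.path.join(self.root, blob))
            del self.refs[blob]
```

With this shape, the only operations that reach the underlying storage are file creation, reads and removal; links and renames become updates to the mapping.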

@alexey-milovidov
Member

This is already implemented, see IObjectStorage.

@alexey-milovidov alexey-milovidov added st-fixed development Developement process & source code & implementation details labels Oct 16, 2022
@UnamedRus
Contributor Author

But there is no such disk type for POSIX filesystems.

@alexey-milovidov
Member

See LocalObjectStorage.h

@UnamedRus
Contributor Author

PR #48791
