RFC: disk without rename/hardlink/mv/copy operations #42337

UnamedRus · 2022-10-15T20:53:09Z

Use case
ClickHouse currently supports multiple types of disks for data (disk, s3, hdfs, blob) and some special ones (cache, encryption).
<disk>
Basically any POSIX-compatible filesystem.
ClickHouse was initially designed around how ext4 deals with file and directory renames.
But not all filesystems are equal: although all of them claim compatibility, the performance of certain operations can differ a lot.
For example, there are known performance issues on XFS.

Another problem is that a POSIX-compatible disk hides an entire family of filesystems and backend storage, based on completely different ideas and physical principles:
from locally attached disks (NVMe, SSD, HDD, software/hardware RAID arrays), to network storage based on SAN/NAS (NFS, Lustre, Gluster, Ceph), to cloud-provided block devices (EBS, PV, Azure Block Storage).
All of them provide different guarantees on latency and performance of certain operations, and sometimes they are far from a locally attached ext4 filesystem.

It's especially noticeable during mutations, when a lot of parts are renamed or hard links are created. (This is one of the main reasons why lightweight mutations are not as lightweight as they were meant to be; the second one is writing new parts to ZK.)

Such problems also exist in non-POSIX storage options like S3, HDFS(?), and blob storage. But there it's more extreme, as you can't rename a file without re-uploading it (S3), and there are no hard links (all of them).

Current approach

All files related to a single part reside in a single directory.
All parts of a table reside in the table directory. (For the Atomic DB engine, it's a unique UUID instead of DB_NAME/TABLE_NAME as it was in Ordinary, so there are no FS renames in case of RENAME TABLE, and atomic EXCHANGE TABLE, YAY.)
ATTACH PART[ITION] and ALTER TABLE UPDATE use hardlinks to create a copy of the data.

It also complicates things when people want to make backups with snapshotting tools, because ClickHouse quite often renames/moves files.
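
To make the hardlink-based copy described above concrete, here is a minimal sketch in Python. It is not ClickHouse code: the function name `clone_part_with_hardlinks`, the `.tmp` suffix, and the flat part directory are assumptions made only for illustration. It shows the two filesystem operations the issue is about, `link` and `rename`.

```python
import os
import tempfile


def clone_part_with_hardlinks(src_dir, dst_dir):
    """Create dst_dir whose files are hardlinks to the files in src_dir.

    Assumes a flat part directory (no subdirectories) and a filesystem
    that supports hard links. The ".tmp" suffix is a hypothetical naming.
    """
    tmp_dir = dst_dir + ".tmp"
    os.mkdir(tmp_dir)
    for name in os.listdir(src_dir):
        # One link() call per file: cheap on local ext4, but the cost
        # depends heavily on the filesystem and storage backend.
        os.link(os.path.join(src_dir, name), os.path.join(tmp_dir, name))
    # One rename() call to publish the new directory under its final name.
    os.rename(tmp_dir, dst_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "all_1_1_0")
        os.mkdir(src)
        with open(os.path.join(src, "data.bin"), "wb") as f:
            f.write(b"column data")

        dst = os.path.join(root, "all_1_1_0_2")
        clone_part_with_hardlinks(src, dst)

        # Both names point at the same inode; no data was copied.
        print(os.path.samefile(os.path.join(src, "data.bin"),
                               os.path.join(dst, "data.bin")))
```

On storage without hard links or cheap renames, both calls in this sketch either fail or turn into full copies, which is the problem raised here.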

Proposal is simple:
Reduce the set of possible file manipulations to a minimum, basically CREATE, READ, REMOVE (the same set as in object storage).
It will require a layout similar to what we have in object storage:
a shared bin where all files have randomly assigned names. (ClickHouse can probably reuse the implementation from the s3 disk.)
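
A rough sketch of the proposed layout, in Python, under stated assumptions: the class names `BlobBin` and `PartCatalog` are hypothetical, the path-to-blob mapping is kept in memory (a real disk would have to persist it), and nothing here mirrors the actual s3 disk code. The point is only that once files live in a shared bin under random names, "rename" and "hardlink" become bookkeeping changes, and the storage itself sees nothing but create, read, and remove.

```python
import os
import tempfile
import uuid


class BlobBin:
    """Shared directory of blobs with random names: create, read, remove only."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def create(self, data):
        name = uuid.uuid4().hex  # random name, unrelated to table or part
        # Mode "xb" refuses to overwrite, so an existing blob is never modified.
        with open(os.path.join(self.root, name), "xb") as f:
            f.write(data)
        return name

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()

    def remove(self, name):
        os.remove(os.path.join(self.root, name))


class PartCatalog:
    """Maps logical file paths to blob names (in memory, for illustration)."""

    def __init__(self, blob_bin):
        self.bin = blob_bin
        self.paths = {}     # logical path -> blob name
        self.refcount = {}  # blob name -> number of logical paths using it

    def _require_free(self, path):
        if path in self.paths:
            raise FileExistsError(path)

    def write(self, path, data):
        self._require_free(path)
        name = self.bin.create(data)
        self.paths[path] = name
        self.refcount[name] = 1

    def read(self, path):
        return self.bin.read(self.paths[path])

    def link(self, src, dst):
        # Replaces a filesystem hardlink: only the mapping changes.
        self._require_free(dst)
        name = self.paths[src]
        self.paths[dst] = name
        self.refcount[name] += 1

    def rename(self, src, dst):
        # Replaces a filesystem rename: the blob itself is not touched.
        self._require_free(dst)
        self.paths[dst] = self.paths.pop(src)

    def remove(self, path):
        name = self.paths.pop(path)
        self.refcount[name] -= 1
        if self.refcount[name] == 0:
            # Last reference gone: now the blob can really be removed.
            del self.refcount[name]
            self.bin.remove(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        catalog = PartCatalog(BlobBin(os.path.join(root, "bin")))
        catalog.write("all_1_1_0/data.bin", b"column data")
        # A mutation that leaves this file unchanged just adds a reference.
        catalog.link("all_1_1_0/data.bin", "tmp_all_1_1_0_2/data.bin")
        catalog.rename("tmp_all_1_1_0_2/data.bin", "all_1_1_0_2/data.bin")
        catalog.remove("all_1_1_0/data.bin")
        print(catalog.read("all_1_1_0_2/data.bin"))
```

Because blobs are never renamed or rewritten after creation, a snapshot of the bin directory taken at any moment contains only complete or not-yet-referenced files, which is what makes this layout friendlier to snapshot-based backups.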


alexey-milovidov · 2022-10-16T20:54:16Z

This is already implemented, see IObjectStorage.

UnamedRus · 2022-10-16T22:20:04Z

But there is no such disk for POSIX filesystems.

alexey-milovidov · 2022-10-17T03:15:24Z

See LocalObjectStorage.h

UnamedRus · 2023-04-14T18:53:11Z

PR #48791

UnamedRus added the feature label Oct 15, 2022

alexey-milovidov closed this as completed Oct 16, 2022

alexey-milovidov added the st-fixed and development labels Oct 16, 2022

alexey-milovidov self-assigned this Oct 17, 2022

UnamedRus mentioned this issue Mar 7, 2023

Decouple table elements(column, projection, index) names from file name in part on disk/(hdfs?) #47320

Open

kssenii mentioned this issue May 3, 2023

Make local object storage work consistently with s3 object storage, fix problem with append, make it configurable as independent storage #48791

Merged
