
S3 disk garbage collection? #52393

Closed
ray-at-github opened this issue Jul 21, 2023 · 73 comments · Fixed by #55309
Comments

@ray-at-github

ray-at-github commented Jul 21, 2023

Hi,

If we use S3 as a disk for a table, how does ClickHouse make sure there are no orphan files left in the S3 bucket?

In my recent test, I encountered a lot of S3 rate limiting errors (HTTP 503, "Slow Down") when bulk inserting lots of data.

Then I truncated the table. This seems to have left many files in the S3 bucket undeleted, even though the table is empty.

This bucket is used only in this test and only for this table, so I expected it to be empty after the truncate. I also waited some time for ClickHouse to perform some sort of "garbage collection", but it does not seem to happen.

This will be an issue if we use S3 disks in production and the 'garbage' keeps growing (storage costs $$).

Does ClickHouse have any mechanism to detect and collect garbage (unused S3 objects)?

Thanks for any insights if I missed something.

ray-at-github added the question label Jul 21, 2023
@ray-at-github
Author

Sorry I do not have any logs - the test server has been killed.

This is more of a design / architecture question. :)

@ray-at-github
Author

ray-at-github commented Jul 24, 2023

[version 23.6.2.18]
I have reproduced this issue (junk files left in the S3 bucket after a table truncate):

  1. truncated a table that is stored on S3

  2. waited until the old parts were deleted; checked system.parts, 0 parts left:

:) select count(*) from system.parts where table = 'device_h3s';

SELECT count(*)
FROM system.parts
WHERE table = 'device_h3s'

Query id: 132cf50c-e1d2-40a2-86e6-d3a4eb73281c

┌─count()─┐
│       0 │
└─────────┘

1 row in set. Elapsed: 0.003 sec.
  3. checked system.remote_data_paths, it's empty:
:) select * from system.remote_data_paths;

SELECT *
FROM system.remote_data_paths

Query id: e8bdf86b-ea79-4f4f-a04a-d4d1002a816c

Ok.

0 rows in set. Elapsed: 0.001 sec.
  4. there are still 2,244 objects left in the S3 bucket:

[screenshots: S3 console listing showing the 2,244 objects remaining in the bucket]

  5. some error logs may be related to not being able to delete the S3 objects (lots of them like this):
2023.07.24 02:37:12.258849 [ 20504 ] {} <Error> core_au.device_h3s (d6625e32-e74f-4b7b-8f46-8d3a59f309fc): Cannot quickly remove directory /var/lib/clickhouse/disks/s3/store/d66/d6625e32-e74f-4b7b-8f46-8d3a59f309fc/delete_tmp_20190305_34140_34140_0 by removing files; fallback to recursive removal. Reason: Code: 499. DB::Exception: Please reduce your request rate. (Code: 19). (S3_ERROR) (version 23.6.2.18 (official build))
  6. an empty directory is left in /var/lib/clickhouse/disks/s3:
# ls -R /var/lib/clickhouse/disks/s3/store/
/var/lib/clickhouse/disks/s3/store/:
d66

/var/lib/clickhouse/disks/s3/store/d66:

@ray-at-github
Author

It also happens after background merges, when ClickHouse tries to delete the old parts, leaving junk files in the S3 bucket:

[screenshot: S3 bucket listing showing leftover objects after a background merge]

@alifirat

I'm facing the same issue here.
Does anyone have any feedback to share?

@CheSema
Member

CheSema commented Sep 22, 2023

You need to take care of partitioning for your bucket in S3 in order to prevent SlowDown.

What is happening to your instance is the following:

  • you reach the request rate limit and get Slow Down responses
  • the delete requests also suffer from that limiting
  • if an S3 delete request fails, the blobs are left behind

CheSema self-assigned this Sep 22, 2023
@alifirat

What do you mean by partitioning for your bucket in S3 in order to prevent this?
Today I'm using MergeTree, so the storage layout on S3 is internal to ClickHouse.

@ray-at-github
Author

ray-at-github commented Sep 22, 2023

What is happening to your instance is the following:

  • you reach the request rate limit and get Slow Down responses
  • the delete requests also suffer from that limiting
  • if an S3 delete request fails, the blobs are left behind

Thanks, I understand this part. Will increasing retries completely overcome rate limiting and leave no orphan files? If not, do we need some process to keep the metadata in sync with the bucket content, for example a background process? I don't know what the best way is, but we definitely don't want an ever-growing S3 bucket in production.
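
For reference, the per-disk retry budget can be raised in the storage configuration. A minimal sketch, assuming a disk named s3 and a ClickHouse version that supports the retry_attempts disk setting (endpoint and values are illustrative placeholders); more retries make it less likely that a delete is abandoned after exhausting them, but they don't guarantee it:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <!-- placeholder endpoint -->
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <!-- how many times a failed S3 request (including deletes rejected
                     with "Slow Down") is retried before the error is surfaced -->
                <retry_attempts>100</retry_attempts>
                <request_timeout_ms>30000</request_timeout_ms>
            </s3>
        </disks>
    </storage_configuration>
</clickhouse>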

@alifirat

Can this PR #53221 help to track the orphan files on S3?

@ray-at-github
Author

ray-at-github commented Sep 22, 2023

#52431 suggested a manual comparison of the S3 contents with system.remote_data_paths to find orphan objects. IMHO this should ideally be implemented and automated in ClickHouse, because of the unique internal file/folder structure of the S3 disk.
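
A rough sketch of that manual comparison, for anyone who wants to try it (the bucket, prefix, and disk name are assumptions, and any candidate should be re-checked against a fresh system.remote_data_paths before deletion, since objects written by in-flight inserts or merges are not referenced yet):

#!/bin/bash
# List every object key under the prefix used by the S3 disk.
aws s3 ls --recursive s3://my-bucket/clickhouse/ | awk '{print $4}' | sort > s3_keys.txt

# List every key ClickHouse still references through its local metadata.
clickhouse-client --query "SELECT remote_path FROM system.remote_data_paths WHERE disk_name = 's3'" | sort > referenced_keys.txt

# Keys present in S3 but not referenced by ClickHouse are orphan candidates.
comm -23 s3_keys.txt referenced_keys.txt > orphan_candidates.txt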

@alifirat

To me there is still a discrepancy when looking at system.remote_data_paths, and that confirms my concern about orphan files on S3.

@lazywei

lazywei commented Sep 22, 2023

Just to add a small observation: we hit this issue as well, and we noticed that if we take a full DB backup to S3, the backed-up structure/parts contain no orphans or uncollected garbage. So if we restore from that backup, the issue is resolved "temporarily". That is not a solution, just a side effect we noticed while testing our backup/restore.
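
For anyone who wants to try that workaround, this is roughly the shape of it (the bucket URL and credentials are placeholders): a backup copies only the objects ClickHouse still references, so a restore from it comes back without the orphaned blobs.

BACKUP TABLE mydb.mytable
    TO S3('https://my-backup-bucket.s3.amazonaws.com/backups/mytable', '<access_key_id>', '<secret_access_key>');

-- drop or rename the original table first, then:
RESTORE TABLE mydb.mytable
    FROM S3('https://my-backup-bucket.s3.amazonaws.com/backups/mytable', '<access_key_id>', '<secret_access_key>');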

@alifirat

Hey @CheSema

Can you specify what you mean by partitioning your bucket in S3 in order to prevent SlowDown?
In the storage.xml file we only specify a bucket name with a first prefix, and then ClickHouse uses a three-letter prefix like aaa, aab, etc. to partition the data on S3, so how can we increase the number of partitions on S3?
Do you have an example from the docs?

@CheSema
Member

CheSema commented Sep 25, 2023

Hi @alifirat
It is literally on the S3 side. You need to read the S3 docs and reach out to S3 support to manage it.

@alifirat

So nothing special to tune on the ClickHouse side?

@CheSema
Member

CheSema commented Sep 25, 2023

So nothing special to tune on the ClickHouse side?

Nope.

@alifirat

Hey @CheSema, which AWS API is used for the delete operation? The plain DeleteObject, or a POST request?

@CheSema
Copy link
Member

CheSema commented Sep 27, 2023

Model::DeleteObjectOutcome DeleteObject(const DeleteObjectRequest & request) const;
Model::DeleteObjectsOutcome DeleteObjects(const DeleteObjectsRequest & request) const;

/// Uses `DeleteObjectsRequest` if it is allowed by `s3_capabilities`, otherwise `DeleteObjectRequest`.
/// `DeleteObjectsRequest` is not supported on GCS, see https://issuetracker.google.com/issues/162653700 .
void removeObjects(const StoredObjects & objects) override;

Both can be used. It depends on which operation is supported by the S3 provider. At startup the client performs a check to determine whether DeleteObjects (POST) is available.

tavplubix added the unexpected behaviour label and removed the question label Sep 27, 2023
@alifirat

alifirat commented Sep 28, 2023

Hey @tavplubix

I'm coming back with a small test that I did with the 23.8.2.7:

  • I've created a table stored entirely on S3
  • ⚠️ I've activated zero-copy replication
  • From another cluster with 3 shards, I did an ALTER TABLE FETCH PARTITION ... => no problem, data consistency is perfect
  • At this point, the data in ClickHouse and S3 matched perfectly (a diff of 1 file between CH and S3, due to a file that was not deleted after the create step)
  • Because I now had 3 parts that I wanted to merge, I ran OPTIMIZE TABLE mytable, and this is where it gets interesting
  • In CH, system.remote_data_paths returns 704 local metadata files with an active storage of around 6.25 GiB, while the aws-cli returns 987 files totalling 7.3 GiB, so the OPTIMIZE has generated orphan files.

As you requested some logs last time, I've activated trace logging, and we will zoom in on a specific part that was not removed:

f6dbdef8-bcaf-4ae3-b94a-cc968ce0d6fc} <Trace> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Renaming temporary part detached/attaching_20230914_0_5361_55 to 20230914_1_1_0 with tid (1, 1, 00000000-0000-0000-0000-000000000000).
2023.09.28 09:58:14.952021 [ 16646 ] {f6dbdef8-bcaf-4ae3-b94a-cc968ce0d6fc} <Debug> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Attached part 20230914_0_5361_55 as 20230914_1_1_0
2023.09.28 09:58:14.954670 [ 13340 ] {} <Trace> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415) (Replicated PartsService): Sending part 20230914_1_1_0
2023.09.28 11:50:33.216836 [ 6154 ] {} <Debug> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Removing 3 parts from filesystem (serially): Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 11:50:33.996400 [ 6154 ] {} <Trace> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Part 20230914_1_1_0 local references is zero, will check blobs can be removed in zookeeper
2023.09.28 11:50:33.996860 [ 6154 ] {} <Trace> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/1859b231-ac56-43bf-a1ad-061d2716b415/20230914_1_1_0/clickhouse-lakehouse_buz_wdscpmuxgnusskvuiyqtdjrbmjwus/clickhouse-lakehouse-shard1-r1-staging for part 20230914_1_1_0
2023.09.28 11:50:33.999340 [ 6154 ] {} <Trace> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Removed last parent zookeeper lock /clickhouse/zero_copy/zero_copy_s3/1859b231-ac56-43bf-a1ad-061d2716b415/20230914_1_1_0/clickhouse-lakehouse_buz_wdscpmuxgnusskvuiyqtdjrbmjwus for part 20230914_1_1_0 with id clickhouse-lakehouse_buz_wdscpmuxgnusskvuiyqtdjrbmjwus
2023.09.28 11:50:34.000474 [ 6154 ] {} <Trace> default.mytable (1859b231-ac56-43bf-a1ad-061d2716b415): Removed last parent zookeeper lock /clickhouse/zero_copy/zero_copy_s3/1859b231-ac56-43bf-a1ad-061d2716b415/20230914_1_1_0 for part 20230914_1_1_0 (part is finally unlocked)
2023.09.28 11:57:33.881677 [ 19166 ] {b22910af-65d5-43b5-87a0-103a1fcabb38} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Renaming temporary part detached/attaching_20230914_0_5361_55 to 20230914_1_1_0 with tid (1, 1, 00000000-0000-0000-0000-000000000000).
2023.09.28 11:57:33.885003 [ 19166 ] {b22910af-65d5-43b5-87a0-103a1fcabb38} <Debug> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Attached part 20230914_0_5361_55 as 20230914_1_1_0
2023.09.28 11:57:33.887778 [ 19174 ] {} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c) (Replicated PartsService): Sending part 20230914_1_1_0
2023.09.28 12:00:31.692197 [ 6250 ] {} <Trace> eb5c4673-2669-4536-ad52-8bf44a24535c::20230914_0_2_1 (MergeFromLogEntryTask): Executing log entry to merge parts 20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0 to 20230914_0_2_1
{} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Found 3 old parts to remove. Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 12:14:46.499137 [ 6110 ] {} <Debug> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Removing 3 parts from filesystem (serially): Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 12:14:46.516500 [ 6110 ] {} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Part 20230914_1_1_0 local references is zero, will check blobs can be removed in zookeeper
2023.09.28 12:14:46.516804 [ 6110 ] {} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/eb5c4673-2669-4536-ad52-8bf44a24535c/20230914_1_1_0/clickhouse-lakehouse_kgg_njuthgudrnkbpjiaxwojthyhjbnwx/clickhouse-lakehouse-shard1-r1-staging for part 20230914_1_1_0
2023.09.28 12:14:46.517962 [ 6110 ] {} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Blobs of part 20230914_1_1_0 cannot be removed
2023.09.28 12:14:46.561677 [ 6110 ] {} <Debug> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Removing 3 parts from memory: Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]

I can guess that the orphan files can be explained by the following log:

2023.09.28 12:14:46.517962 [ 6110 ] {} <Trace> default.mytable (eb5c4673-2669-4536-ad52-8bf44a24535c): Blobs of part 20230914_1_1_0 cannot be removed

Also, I have the feeling that once the parts are removed from memory, there is no further check on whether the blobs were actually deleted from S3.

It looks like the lock in ZK has been removed, but for a reason unknown to me, the remote blobs cannot be removed.
Happy to help with more details if required.

@den-crane
Contributor

den-crane commented Sep 28, 2023

@alifirat you can check ZooKeeper, specifically the znodes related to zero-copy replication. The last replica should remove the S3 objects when there are zero references to the object left in ZK.

@alifirat

alifirat commented Sep 28, 2023

@den-crane

What should I check on the ZK side?

Looks like this trace log is emitted here, in the code available on the master branch.

@tavplubix
Member

I can guess that the orphan files can be explained by the following log
Blobs of part 20230914_1_1_0 cannot be removed

Yes, but it means that the blobs are still locked by some replica. Also, some log messages between "Remove zookeeper lock ..." and "Blobs of part 20230914_1_1_0 cannot be removed" are missing.

@alifirat

@tavplubix

I've checked out the code locally (the specific tag of the version I'm using) and I would expect the following logs:

LOG_TRACE(logger, "Remove zookeeper lock {} for part {}", zookeeper_part_replica_node, part_name);

        if (auto ec = zookeeper_ptr->tryRemove(zookeeper_part_replica_node); ec != Coordination::Error::ZOK)
        {
            /// Very complex case. It means that lock already doesn't exist when we tried to remove it.
            /// So we don't know are we owner of this part or not. Maybe we just mutated it, renamed on disk and failed to lock in ZK.
            /// But during mutation we can have hardlinks to another part. So it's not Ok to remove blobs of this part if it was mutated.
            if (ec == Coordination::Error::ZNONODE)
            {
                if (has_parent)
                {
                    LOG_INFO(logger, "Lock on path {} for part {} doesn't exist, refuse to remove blobs", zookeeper_part_replica_node, part_name);
                    return {false, {}};
                }
                else
                {
                    LOG_INFO(logger, "Lock on path {} for part {} doesn't exist, but we don't have mutation parent, can remove blobs", zookeeper_part_replica_node, part_name);
                }
            }
            else
            {
                throw zkutil::KeeperException::fromPath(ec, zookeeper_part_replica_node);
            }
        }

I expected to see these logs in the error case:

  • Lock on path {} for part {} doesn't exist, refuse to remove blobs
  • Lock on path {} for part {} doesn't exist, but we don't have mutation parent, can remove blobs

But when looking in the log files:
$ cat /var/log/clickhouse-server/clickhouse-server.err.log | grep -i "lock on path"

nothing is returned :/

@den-crane
Contributor

@alifirat

Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/eb5c4673-2669-4536-ad52-8bf44a24535c/20230914_1_1_0/clickhouse-lakehouse_kgg_njuthgudrnkbpjiaxwojthyhjbnwx/clickhouse-lakehouse-shard1-r1-staging for part 20230914_1_1_0
Blobs of part 20230914_1_1_0 cannot be removed

check what else is in /clickhouse/zero_copy/zero_copy_s3/eb5c4673-2669-4536-ad52-8bf44a24535c/20230914_1_1_0/

select * from system.zookeeper 
where path = '/clickhouse/zero_copy/zero_copy_s3/eb5c4673-2669-4536-ad52-8bf44a24535c/20230914_1_1_0/';

@alifirat

alifirat commented Sep 28, 2023

@den-crane

I had an initial suspicion about 2 merges that were ongoing on both of my replicas, so I decided to deactivate merges on the second replica by setting always_fetch_merged_part to 1.

Then I retried my process.

What I have on the first replica:

2023.09.28 15:59:29.980125 [ 6191 ] {} <Trace> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Found 3 old parts to remove. Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 15:59:29.982307 [ 6191 ] {} <Debug> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Removing 3 parts from filesystem (serially): Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 15:59:29.982397 [ 6191 ] {} <Trace> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Part 20230914_0_0_0 local references is zero, will check blobs can be removed in zookeeper
2023.09.28 15:59:29.982617 [ 6191 ] {} <Trace> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_0_0/clickhouse-lakehouse_omm_tkilhtszrrzjkkxwantfvywqjstlp/clickhouse-lakehouse-shard1-r1-staging for part 20230914_0_0_0
2023.09.28 15:59:29.983749 [ 6191 ] {} <Trace> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): No more children left for for /clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_0_0/clickhouse-lakehouse_omm_tkilhtszrrzjkkxwantfvywqjstlp, will try to remove the whole node
2023.09.28 15:59:29.984750 [ 6191 ] {} <Trace> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Removed last parent zookeeper lock /clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_0_0/clickhouse-lakehouse_omm_tkilhtszrrzjkkxwantfvywqjstlp for part 20230914_0_0_0 with id clickhouse-lakehouse_omm_tkilhtszrrzjkkxwantfvywqjstlp
2023.09.28 15:59:29.985891 [ 6191 ] {} <Trace> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Removed last parent zookeeper lock /clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_0_0 for part 20230914_0_0_0 (part is finally unlocked)
2023.09.28 15:59:32.359409 [ 6191 ] {} <Debug> default.mytable (2ddde820-1e04-4f98-ab74-5015d45ef5e1): Removing 3 parts from memory: Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]

On the second one

2023.09.28 14:08:30.213281 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Found 3 old parts to remove. Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 14:08:30.213330 [ 6005 ] {} <Debug> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Removing 3 old parts from ZooKeeper
2023.09.28 14:08:30.215774 [ 6005 ] {} <Debug> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Removed 3 old parts from ZooKeeper. Removing them from filesystem.
2023.09.28 14:08:30.215807 [ 6005 ] {} <Debug> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Removing 3 parts from filesystem (serially): Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 14:08:30.215900 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Part 20230914_0_0_0 local references is zero, will check blobs can be removed in zookeeper
2023.09.28 14:08:30.216085 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/426ff7b4-30af-4f6f-b66e-8ddcd65ab013/20230914_0_0_0/clickhouse-lakehouse_nxh_pcajdxzktxqdyvavmwkupktwfviza/clickhouse-lakehouse-shard1-r2-staging for part 20230914_0_0_0
2023.09.28 14:08:30.217216 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Found 1 (clickhouse-lakehouse-shard1-r1-staging) zookeeper locks for /clickhouse/zero_copy/zero_copy_s3/426ff7b4-30af-4f6f-b66e-8ddcd65ab013/20230914_0_0_0/clickhouse-lakehouse_nxh_pcajdxzktxqdyvavmwkupktwfviza
2023.09.28 14:08:30.217232 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Blobs of part 20230914_0_0_0 cannot be removed
2023.09.28 14:08:30.232966 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Part 20230914_1_1_0 local references is zero, will check blobs can be removed in zookeeper
2023.09.28 14:08:30.233229 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/426ff7b4-30af-4f6f-b66e-8ddcd65ab013/20230914_1_1_0/clickhouse-lakehouse_vyk_ypowbksogpoldhwocuohhwcuadece/clickhouse-lakehouse-shard1-r2-staging for part 20230914_1_1_0
2023.09.28 14:08:30.234406 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Found 1 (clickhouse-lakehouse-shard1-r1-staging) zookeeper locks for /clickhouse/zero_copy/zero_copy_s3/426ff7b4-30af-4f6f-b66e-8ddcd65ab013/20230914_1_1_0/clickhouse-lakehouse_vyk_ypowbksogpoldhwocuohhwcuadece
2023.09.28 14:08:30.234422 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Blobs of part 20230914_1_1_0 cannot be removed
2023.09.28 14:08:30.249259 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Part 20230914_2_2_0 local references is zero, will check blobs can be removed in zookeeper
2023.09.28 14:08:30.249497 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Remove zookeeper lock /clickhouse/zero_copy/zero_copy_s3/426ff7b4-30af-4f6f-b66e-8ddcd65ab013/20230914_2_2_0/clickhouse-lakehouse_ejh_etvktfmxockhoiteuiuaqgslenmab/clickhouse-lakehouse-shard1-r2-staging for part 20230914_2_2_0
2023.09.28 14:08:30.250787 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Found 1 (clickhouse-lakehouse-shard1-r1-staging) zookeeper locks for /clickhouse/zero_copy/zero_copy_s3/426ff7b4-30af-4f6f-b66e-8ddcd65ab013/20230914_2_2_0/clickhouse-lakehouse_ejh_etvktfmxockhoiteuiuaqgslenmab
2023.09.28 14:08:30.250803 [ 6005 ] {} <Trace> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Blobs of part 20230914_2_2_0 cannot be removed
2023.09.28 14:08:30.265514 [ 6005 ] {} <Debug> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Removing 3 parts from memory: Parts: [20230914_0_0_0, 20230914_1_1_0, 20230914_2_2_0]
2023.09.28 14:08:30.265566 [ 6005 ] {} <Debug> default.mytable (426ff7b4-30af-4f6f-b66e-8ddcd65ab013): Removed 3 old parts

On r2, it seems one of the replicas holds a lock and blocks the blob deletion, so let's look at the ZK table:

SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_0_0/'

Query id: c51a9243-57e8-4ebb-aa9d-b96b7c100700

Ok.

0 rows in set. Elapsed: 0.001 sec.

But if we look at the part that resulted from the merge in ZK:

SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_2_1'
FORMAT Vertical

Query id: d771db18-97b6-4541-a333-9243c5004b1e

Row 1:
──────
name:  clickhouse-lakehouse_vcu_gdcmoedblaypoxjvgxqjlkbaemabm
value:
path:  /clickhouse/zero_copy/zero_copy_s3/2ddde820-1e04-4f98-ab74-5015d45ef5e1/20230914_0_2_1

It's pointing to the part that resulted from the merge.

@den-crane
Contributor

den-crane commented Sep 29, 2023

I tried:

insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;

┌─disk_name─┬─part_type─┬─partition─┬─sum(rows)─┬─size─────┬─part_count─┐
│ s3disk    │ Wide      │ tuple()   │    180000 │ 1.01 GiB │          6 │
└───────────┴───────────┴───────────┴───────────┴──────────┴────────────┘

┌─disk_name─┬─part_type─┬─partition─┬─sum(rows)─┬─size─────┬─part_count─┐
│ s3disk    │ Wide      │ tuple()   │    180000 │ 1.01 GiB │          1 │
└───────────┴───────────┴───────────┴───────────┴──────────┴────────────┘

MergeTree & ReplicatedMergeTree / 2 replicas / zero-copy replication

no orphans


/usr/local/bin/aws s3 ls --recursive  s3://t..../denis/s3cached/|wc -l
0

@alifirat

Which version of ClickHouse?

@alifirat

@den-crane

I was finally able to identify (with some very ugly methods) one orphan file in the logs.

I see this:

ket mybucketkey clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh (LogSeriesLimiter: on interval from 2023-09-29 20:51:28 to 2023-09-29 20:51:33 accepted series 1 / 32 for the logger WriteBufferFromS3 : 1.7491308770058435e+38)
/var/log/clickhouse-server/clickhouse-server.log:1721990:2023.09.29 20:51:33.939590 [ 427 ] {30206194-d191-49b2-b53b-03ec60a7d4a1::20230914_0_2_1} <Trace> WriteBufferFromS3: Single part upload has completed. Details: bucket mybucketkey clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh, size 527519
/var/log/clickhouse-server/clickhouse-server.log:1723884:2023.09.29 20:52:19.357159 [ 29937 ] {30206194-d191-49b2-b53b-03ec60a7d4a1::20230914_0_2_1} <Trace> WriteBufferFromS3: finalizeImpl WriteBufferFromS3. Details: bucket mybucketkey clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh.
/var/log/clickhouse-server/clickhouse-server.log:1723885:2023.09.29 20:52:19.357885 [ 29937 ] {30206194-d191-49b2-b53b-03ec60a7d4a1::20230914_0_2_1} <Trace> WriteBufferFromS3: Close WriteBufferFromS3. Details: bucket mybucketkey clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh.

Looking for the file in system.remote_data_paths:

SELECT *
FROM system.remote_data_paths
WHERE remote_path = 'clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh'

Query id: e78ffddf-ec5d-4c6a-be1b-875ef64b67bf

Ok.

0 rows in set. Elapsed: 0.013 sec.

On S3 now

aws s3 ls --recursive --human-readable --summarize s3://mybucket/clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh
2023-09-29 20:51:34  515.2 KiB clickhouse/zyr/dxegmxjahvabjzlxrdxcxxlbfujwh

Total Objects: 1
   Total Size: 515.2 KiB

@alifirat

alifirat commented Sep 30, 2023

@ray-at-github for your use case, can you try to drop the data using the sync keyword at the end?

If it works for you I will create another issue

@ray-at-github
Author

ray-at-github commented Sep 30, 2023

@alifirat unfortunately I won’t be able to run the test again any time soon. I can probably do it in a month or so.

But you can test it on your end. The only requirement is a large amount of data in the table, say about 100 million rows and about 100k S3 objects. The point is to trigger lots of S3 delete operations within a short period and therefore trigger AWS rate limiting. ClickHouse will then give up on the failed S3 deletes and remove the local metadata links, hence leaving S3 orphans.

I’m not quite sure your case is the same.
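
For reference, a hypothetical sketch of that reproduction (the table name, disk name, and sizes are made up; the idea is just to create many parts on the S3 disk and then drop them all at once, so the burst of delete requests hits the AWS rate limit):

CREATE TABLE s3_gc_test (id UInt64, payload String)
ENGINE = MergeTree
ORDER BY id
SETTINGS disk = 's3', old_parts_lifetime = 1;

-- many small insert blocks => many parts => many S3 objects
INSERT INTO s3_gc_test
SELECT number, randomString(100)
FROM numbers(100000000)
SETTINGS min_insert_block_size_rows = 100000, max_insert_block_size = 100000;

-- issues a burst of S3 delete requests; afterwards compare
-- system.remote_data_paths with the bucket listing
TRUNCATE TABLE s3_gc_test;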

@alifirat

@ray-at-github On huge tables, I prefer to use an AWS lifecycle rule, because there is no magic in ClickHouse here: it will either time out in ClickHouse or hit the AWS S3 rate limits.

@ray-at-github
Author

ray-at-github commented Sep 30, 2023

@alifirat the ClickHouse S3 disk has its own way of organising table folders. You cannot tell from the bucket alone which S3 folder belongs to which ClickHouse table. What if I have 2 tables in the same bucket and I only want to truncate one? An AWS lifecycle rule won't work in this case.

In this ticket I am prompting (or hoping for) better handling of rate limiting errors in ClickHouse, rather than looking for a workaround. My own solution for now is not to use the S3 disk in production.
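
The bucket layout alone is indeed opaque, but while the server is running the referenced remote objects can be mapped back to their owning table through system.remote_data_paths, since local_path embeds the table UUID. A sketch (assuming Atomic databases so tables have UUIDs, that local_path contains the store/.../<uuid>/ prefix, and a disk named s3):

SELECT
    t.database,
    t.name AS table_name,
    r.remote_path
FROM system.remote_data_paths AS r
INNER JOIN system.tables AS t
    ON toString(t.uuid) = extract(r.local_path, '([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})')
WHERE r.disk_name = 's3'
LIMIT 10;

Orphaned objects, of course, appear nowhere in this mapping, which is exactly the problem.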

@den-crane
Contributor

ClickHouse will then give up on the failed S3 deletes and remove the local metadata links

ClickHouse could store such failures (objects) in a table or in a file and retry them automatically later, or leave it to the user's discretion.

@den-crane
Contributor

Which version of ClickHouse?

23.8

@ray-at-github
Author

ClickHouse will then give up on the failed S3 deletes and remove the local metadata links

ClickHouse could store such failures (objects) in a table or in a file and retry them automatically later, or leave it to the user's discretion.

I understand, but that sounds like a workaround to me. As a user, ideally I could use the S3 disk like a local disk, transparently. That's what I am hoping for in this ticket, if it's possible at all.

@alifirat

@ray-at-github I think it's a trade-off for production usage: it helps reduce storage costs a lot, but it can also increase the amount of operational tooling you have to build around it.

I'm also curious whether ClickHouse Cloud has already faced similar issues.

@ray-at-github
Author

ray-at-github commented Sep 30, 2023

@alifirat I agree, it’s a balance and each of us may have our own considerations. For me, the cost savings haven’t outweighed the hassle of manual intervention just yet.

I'm also curious whether ClickHouse Cloud has already faced similar issues.

I second that. Also wondering how SharedMergeTree handles rate limiting.

@alexey-milovidov
Member

I prefer to use AWS lifecycle

No, it's not possible to use the AWS lifecycle correctly; it is not supported and will damage the tables.

@alifirat

@alexey-milovidov

To be more precise, when I mentioned the lifecycle rule, I was thinking of this ⬇️
[screenshot: the S3 lifecycle rule for cleaning up incomplete multipart uploads]

@alexey-milovidov
Member

Yes, it is alright.

To summarize:

  • set up automatic deletion of incomplete multipart uploads (a sketch of such a rule is below);
  • do not set up automatic data deletion.
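
A sketch of such a rule with the AWS CLI (the bucket name and the 7-day window are placeholders); aborting incomplete multipart uploads only discards uploads that were never completed, so it does not delete any of the table's data objects:

# write the lifecycle rule to a file
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
EOF

# apply it to the bucket used by the S3 disk
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-clickhouse-bucket \
    --lifecycle-configuration file://lifecycle.json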

@alexey-milovidov
Member

I would expect that an exception during deletion will lead to parts not being removed (and retried later), but need to check the code...

@alifirat

Hey @alexey-milovidov

If you look at the analysis I did, it's not the files of the inactive parts that were left undeleted; rather, the optimize process seems to generate extra files (I imagine tmp files), and those are not removed, but I don't know why.

@alifirat

alifirat commented Oct 2, 2023

Hey @alexey-milovidov @den-crane

I continued analysing the issue I have and finally found something that makes sense.

Thanks to your remarks, I changed my mind about the optimize process generating temporary files and reasoned like this:

If the orphans are duplicates, then based on their size I should find them in system.remote_data_paths by filtering on the size column

And guess what? It worked!

I have a list of 281 orphan files, and for each of them I made a request to AWS to get its size (simple script like the one below):

#!/bin/bash
# For each orphan key listed in orphans_s3.tsv, print the object size
# reported by S3 (third column of `aws s3 ls`).
for s3_file in $(cat orphans_s3.tsv);
do
	aws s3 ls "s3://mybucket/$s3_file" | awk '{print $3}'
done

It generates a file with only the sizes of the orphan files, like this:

1388050
26768
752573
35065
34084
...

Now, using this information, I looked them up in system.remote_data_paths by doing:

select local_path, remote_path from system.remote_data_paths where size in (1388050, 26768, 752573, etc)

Bingo! I was able to find them. What mattered to me was whether those files are really equal: the sizes are, but what about the content? Here it gets very interesting, and I have a couple of examples like this one.

I have some files that were written during the optimize for which there were at least 2 retries. For instance:

SELECT remote_path
FROM system.remote_data_paths
WHERE size = 25834657

Query id: c9ed504f-b71a-4157-9685-d82d0a650a1a

┌─remote_path────────────────────────────────────────────┐
│ clickhouse/hqz/blrkslobggtquxettcjvvwxgtfrbl │
└────────────────────────────────────────────────────────┘
$ aws s3 ls s3://mybucket/../blrkslobggtquxettcjvvwxgtfrbl
2023-10-02 12:13:52   25834657 blrkslobggtquxettcjvvwxgtfrbl

Duplicate 1
$ aws s3 ls s3://../nbkiqfojndhhqxzpvxzkxxcmxxjrh
2023-10-02 12:13:50   25834657 nbkiqfojndhhqxzpvxzkxxcmxxjrh

Duplicate 2 
2023-10-02 12:13:47   25834657 valmuzrhzvyhfoxrjndoixofejlgi

Duplicate 3 
2023-10-02 12:13:49   25834657 hyuqbkdnfjvjoyhgehsksfufwlqbw

Duplicate 4 
2023-10-02 12:13:52   25834657 uzsnquqtizzadujcvztwduxtobgms

ℹ️ I copied all of them locally and compared them; they were all equal.

ℹ️ But now, and I think this is the most important information in my comment: can we find a common pattern for those orphan files?
I have looked at their local_path, and all of them come from Nested types; in particular, all the orphan files share this:

  • size0.bin
  • size0.cmrk2

Last but not least, I have computed the total size of those orphan files:

SELECT formatReadableSize(sum(c1))
FROM file('s3_orphans_file_size.tsv') AS s3
INNER JOIN file('ch_file_size.tsv') AS ch ON ch.c2 = s3.c1

Query id: e15276d1-8cf1-4618-ac8a-6ce562c99557

┌─formatReadableSize(sum(c1))─┐
│ 1.08 GiB                    │
└─────────────────────────────┘
SELECT formatReadableSize(sum(data_compressed_bytes))
FROM system.parts
WHERE active AND (database = 'default')

Query id: b26961a2-5d22-4448-8455-a04d32fab63e

┌─formatReadableSize(sum(data_compressed_bytes))─┐
│ 6.25 GiB                                       │
└────────────────────────────────────────────────┘

$ aws s3 ls s3://mybucket/mykey --human-readable --summarize --recursive
Total Size: 7.3 GiB

So 6.25 + 1.08 = 7.33 GiB (the size returned by AWS).

Also, I've just noticed that during the merge operation:

SELECT count(1)
FROM system.remote_data_paths
WHERE local_path ILIKE '%size0%'

Query id: 32f6d770-b036-4224-8ae2-03299af6bf40

┌─count()─┐
│     282 │
└─────────┘

This is exactly the number of orphan files I have.

@den-crane
Contributor

den-crane commented Oct 2, 2023

size0.bin

Nice catch. Now it's crystal clear.

/usr/local/bin/aws s3 ls --recursive  s3://t..../denis/s3cached/|wc -l
0


CREATE TABLE test_s3
(
    c1 Int8,
    c2 Date,
    `c3.k` Array(String),
    `c3.v1` Array(Int64),
    `c3.v3` Array(Int64),
    `c3.v4` Array(Int64)
) ENGINE = MergeTree
ORDER BY (c1, c2)
SETTINGS disk = 's3disk', old_parts_lifetime = 1,
    min_bytes_for_wide_part = 1,
    vertical_merge_algorithm_min_rows_to_activate = 1,
    vertical_merge_algorithm_min_columns_to_activate = 1;

insert into test_s3 values(1,1,['a','b'],[1,2],[1,2],[1,2]);
insert into test_s3 values(1,1,['a','b'],[1,2],[1,2],[1,2]);

optimize table test_s3 final;

drop table test_s3 sync;

select * FROM system.remote_data_paths limit 10;
0 rows in set. Elapsed: 0.003 sec.

/usr/local/bin/aws s3 ls --recursive  s3://t..../denis/s3cached/|wc -l
6

@alifirat

alifirat commented Oct 2, 2023

@den-crane super fast to reproduce the issue 😅😅😅

At least you have a simple example to reproduce it 🙏

@alifirat

alifirat commented Oct 3, 2023

@den-crane When I look at the settings you used to reproduce the issue, is this a side effect of the merge algorithm (the vertical one in this case)?

@alifirat

alifirat commented Oct 3, 2023

@tavplubix @CheSema Do you think you will have the time to look for a potential fix, now that @den-crane has provided a way to reproduce the issue?

@alifirat

alifirat commented Oct 3, 2023

Hey @ray-at-github

Did you have Nested or Array fields in your table? Maybe you are also impacted by the issue I've raised in my previous comment.

@ray-at-github
Author

@alifirat no, only primitive types like int and string.

@den-crane
Contributor

den-crane commented Oct 3, 2023

@den-crane When I look at the settings you used to reproduce the issue, is this a side effect of the merge algorithm (the vertical one in this case)?

Yes.
Nested subcolumns are stored as Arrays, and they all have the same length.
Nested subcolumns share the size column and store a single pair of size0.bin/mrk files.
Vertical merge processes all columns and Nested subcolumns separately, one after another.
It seems Vertical merge just creates the same size0.bin/mrk for each Nested subcolumn again and again. Probably the same issue exists on a local disk too, where the merge simply overwrites the already existing size0.bin/mrk file.
So the fix is pretty obvious: create size0 only for the first Nested subcolumn, or check whether the file already exists and skip creating it.

@alifirat

alifirat commented Oct 3, 2023

@den-crane do you know which file(s) handle that?

@den-crane
Contributor

@den-crane do you know which file(s) handle that?

Do you mean source code .cpp files related to Vertical merge? No idea.
