S3 disk garbage collection? #52393
Sorry, I do not have any logs - the test server has been killed. This is more of a design / architecture question. :)
[version 23.6.2.18]
I'm facing the same issue here.
You need to take care of
What is happening to your instance is the following:
What do you mean by
Thanks, I understand this part. Will increasing retries completely overcome rate limiting and leave no orphan files? If not, do we need some process to keep the metadata in sync with the bucket content, for example a background process? I don't know what the best way is, but we definitely don't want an ever-growing S3 bucket in production.
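For illustration, a minimal sketch of the retry knob being discussed, assuming the query-level `s3_retry_attempts` setting is available in your version (verify against your release's settings reference):

```sql
-- Raise the S3 SDK retry budget before a delete-heavy operation.
-- Setting name and scope are assumptions; the table is a placeholder.
SET s3_retry_attempts = 1000;
TRUNCATE TABLE test_s3;
```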
Can this PR #53221 help to track the orphan files on S3?
#52431 suggested a manual comparison of S3 contents with system.remote_data_paths for orphan objects (see the sketch below). IMHO this should ideally be implemented and automated in ClickHouse because of the unique internal file/folder structure of the S3 disk.
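A hedged sketch of that comparison, assuming a recent version with the `One` input format and the `_path` virtual column on the s3 table function; the bucket name and prefixes are placeholders, and the exact shape of `_path` depends on your endpoint:

```sql
-- List every object under the disk's root and keep those that no local
-- metadata references; adjust the concat() prefix to match _path's shape.
SELECT _path AS candidate_orphan
FROM s3('https://my-bucket.s3.amazonaws.com/clickhouse/**', 'One')
WHERE _path NOT IN
(
    SELECT concat('my-bucket/clickhouse/', remote_path)
    FROM system.remote_data_paths
);
```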
To me, there is still a discrepancy when looking at remote_data_paths, and that confirms my concern about orphan files on S3.
Just to add an observation: we hit this issue as well, and we noticed that if we take a full DB backup to S3, the backed-up structure / parts have no orphans or uncollected garbage. So if we restore from that backup, the issue is resolved "temporarily". That is not a solution, just a side effect we noticed while testing our backup/restore.
Hey @CheSema, can you specify what you mean by
Hi @alifirat |
So nothing special to tune on the ClickHouse side?
Nope.
Hey @CheSema, which AWS API is used for the delete operation? The plain DeleteObject, or a POST request?
(Lines 201 to 202 at 3a1663a)
(ClickHouse/src/Disks/ObjectStorages/S3/S3ObjectStorage.h, lines 120 to 122 at 3a1663a)
Both could be used; it depends on which operation the S3 provider supports. At startup, the client runs checks to determine whether DeleteObjects (POST) is available.
Hey @tavplubix, I'm coming back with a small test that I did with the
As you requested some logs last time, I've activated trace logs; let's zoom in on a specific part that has not been removed:
I can guess that the orphan files can be explained by the following log:
Also, I have the feeling that once the parts are removed from memory, there is no further check for deleting the blobs from S3. It looks like the lock in ZooKeeper has been removed, but for a reason unknown to me, the remote blobs cannot be removed.
@alifirat you can check ZooKeeper for the znodes related to zero-copy replication; see the sketch below. The last replica should remove the S3 objects when there are zero references to the object in ZK.
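For reference, a hedged way to peek at those znodes from ClickHouse itself; the ZooKeeper path below is an assumed default layout, so adjust it to your configuration:

```sql
-- system.zookeeper requires an explicit path filter; this lists the
-- children of the assumed zero-copy lock root.
SELECT path, name, value
FROM system.zookeeper
WHERE path = '/clickhouse/zero_copy/zero_copy_s3';
```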
What should I check on the ZK side? Looks like this trace log is located here, in the code available on the master branch.
Yes, but it means that blobs are still locked by some replica. Also, some log messages between "Remove zookeeper lock ..." and "Blobs of part 20230914_1_1_0 cannot be removed" are missing.
I've checked out the code locally (the specific tag version I'm using) and I should expect the following logs:

```cpp
LOG_TRACE(logger, "Remove zookeeper lock {} for part {}", zookeeper_part_replica_node, part_name);

if (auto ec = zookeeper_ptr->tryRemove(zookeeper_part_replica_node); ec != Coordination::Error::ZOK)
{
    /// Very complex case. It means that lock already doesn't exist when we tried to remove it.
    /// So we don't know are we owner of this part or not. Maybe we just mutated it, renamed on disk and failed to lock in ZK.
    /// But during mutation we can have hardlinks to another part. So it's not Ok to remove blobs of this part if it was mutated.
    if (ec == Coordination::Error::ZNONODE)
    {
        if (has_parent)
        {
            LOG_INFO(logger, "Lock on path {} for part {} doesn't exist, refuse to remove blobs", zookeeper_part_replica_node, part_name);
            return {false, {}};
        }
        else
        {
            LOG_INFO(logger, "Lock on path {} for part {} doesn't exist, but we don't have mutation parent, can remove blobs", zookeeper_part_replica_node, part_name);
        }
    }
    else
    {
        throw zkutil::KeeperException::fromPath(ec, zookeeper_part_replica_node);
    }
}
```

I expected to see these logs in the error case, but when looking in the log files, nothing is returned :/
check what else is in
I had a first suspicion about 2 merges that were ongoing on both my replicas, so I decided to deactivate merges on the second replica. Then I retried my process. Here is what I have on the first replica:
On the second one:
On r2, it seems one of the replicas holds a lock and blocks the blob deletion, so let's look at the ZK table:
But if we look at the parts that have been merged in ZK:
It's pointing to the part that has been merged.
Tried:

```sql
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
insert into test_s3 SELECT * FROM generateRandom() LIMIT 30000;
```

```
┌─disk_name─┬─part_type─┬─partition─┬─sum(rows)─┬─size─────┬─part_count─┐
│ s3disk    │ Wide      │ tuple()   │    180000 │ 1.01 GiB │          6 │
└───────────┴───────────┴───────────┴───────────┴──────────┴────────────┘
┌─disk_name─┬─part_type─┬─partition─┬─sum(rows)─┬─size─────┬─part_count─┐
│ s3disk    │ Wide      │ tuple()   │    180000 │ 1.01 GiB │          1 │
└───────────┴───────────┴───────────┴───────────┴──────────┴────────────┘
```

MergeTree & ReplicatedMergeTree / 2 replicas / zero-copy replication: no orphans.

```
/usr/local/bin/aws s3 ls --recursive s3://t..../denis/s3cached/|wc -l
0
```
Which version of ClickHouse?
I was finally able to identify (with very ugly methods) one orphan file in the logs. I see this:
Looking for the file in remote_data_paths:
On S3 now:
@ray-at-github for your use case, can you try to drop the data using the SYNC keyword at the end (see the example below)? If it works for you, I will create another issue.
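For clarity, a sketch of that suggestion; the table name is a placeholder:

```sql
-- SYNC makes the drop wait until data removal completes instead of
-- deferring it to a background task of the Atomic database engine.
DROP TABLE test_s3 SYNC;
```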
@alifirat unfortunately I won’t be able to run the test again any time soon. I can probably do it in a month or so. But you can test it on your end. The only requirement is that you need some large amount of data in the table, say about 100 million rows and about 100k S3 objects. The point is to trigger lots of S3 delete operations within a short period, and therefore trigger AWS rate limiting. ClickHouse will then give up the failed S3 deletes and remove the local metadata links, hence leaving S3 orphans. I’m not quite sure your case is the same.
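A rough sketch of that load pattern, reusing the generateRandom() trick from earlier in this thread; the row count and table name are placeholders:

```sql
-- Bulk-insert enough rows to produce on the order of 100k S3 objects,
-- then truncate to fire a burst of S3 delete requests at once.
INSERT INTO test_s3 SELECT * FROM generateRandom() LIMIT 100000000;
TRUNCATE TABLE test_s3;
```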
@ray-at-github On huge tables, I prefer to use AWS lifecycle rules, because there is nothing magic on the ClickHouse side: either it will time out in ClickHouse or the AWS S3 rate limits will be reached.
@alifirat the ClickHouse S3 disk has its own method of organising table folders. You cannot tell which S3 folder belongs to which ClickHouse table. What if I have 2 tables in the same bucket and I only want to truncate one? AWS lifecycle rules won’t work in this case. In this ticket I am prompting (or hoping for) better handling of rate-limiting errors in ClickHouse, instead of looking for a workaround. My solution for myself, for now, is not to use the S3 disk for production.
ClickHouse could store such failures (objects) in a table or in a file, and either retry automatically later or leave it to the user's discretion.
23.8
I understand, but that sounds like a workaround to me. As a user, ideally I could use the S3 disk like a local disk, transparently. That’s what I am hoping for in this ticket, if it’s possible at all.
@ray-at-github I think it's a trade-off to make for production usage, because it helps reduce storage costs a lot, but it can also increase the number of operations you have to develop around its usage. I'm also curious to know whether ClickHouse Cloud has already faced similar issues.
@alifirat I agree, it’s a balance and each of us may have our own considerations. For me, the cost savings haven’t outweighed the hassle of manual intervention just yet.
I second that. Also wondering how SharedMergeTree handles rate limiting.
No, it's not possible to use AWS lifecycle rules correctly; this is not supported and will damage the tables.
Yes, it is alright. To summarize:
I would expect that an exception during deletion would lead to parts not being removed (and retried later), but I need to check the code...
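One hedged way to check whether such parts linger for a retry; the table name is a placeholder:

```sql
-- Parts whose removal failed should still show up as inactive entries
-- until the background cleanup succeeds.
SELECT name, active, modification_time
FROM system.parts
WHERE table = 'test_s3' AND NOT active;
```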
If you look at the analysis I did, it's not the files of the inactive parts that have not been deleted; rather, the optimize process seems to generate more files (I imagine tmp files), and those are not removed, but I don't know why.
Hey @alexey-milovidov @den-crane I continued analyzing the issue I have and finally found something that makes sense. Thanks to your remarks, I've changed my opinion about the optimize process generating temporary files, and reasoned like this: if the orphans are duplicates, then based on their size I should find matching entries in system.remote_data_paths by filtering on the size column. And guess what? It worked! I have a list of 281 orphan files, and for each of them I made a request to AWS to get its size (simple script like below):
It generates a file containing only the sizes of the orphan files, like this:
Now, using this information, I've looked them up in system.remote_data_paths; a sketch of the query follows.
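A sketch of that size-based lookup; the size literals are placeholders standing in for the values harvested from AWS, and the `size` column of system.remote_data_paths may not exist in older releases:

```sql
-- Match the orphan candidates' sizes against local metadata entries.
SELECT remote_path, local_path, size
FROM system.remote_data_paths
WHERE size IN (1234567, 7654321);
```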
Bingo! I was able to find them, and what mattered to me was whether those files are really equal: the sizes match, but what about the content? Here it gets very interesting, and I have a couple of examples like this one: I have some files that were written during the optimize, for which I have at least two retries. For instance:
ℹ️ I copied all of them locally and compared them: they were all equal. ℹ️ But now, for what I think is the most important information in my comment: can we find a common pattern for those orphan files?
Last but not least, I have computed the total size of those orphan files:
So 6.25 + 1.08 = 7.33 GiB (the size returned by AWS). Also, I've just noticed that during the merge operation
This is exactly the number of orphan files I have.
Nice catch. Now it's crystal clear.

```
/usr/local/bin/aws s3 ls --recursive s3://t..../denis/s3cached/|wc -l
0
```

```sql
CREATE TABLE test_s3
(
    c1 Int8,
    c2 Date,
    `c3.k` Array(String),
    `c3.v1` Array(Int64),
    `c3.v3` Array(Int64),
    `c3.v4` Array(Int64)
) ENGINE = MergeTree
order by (c1,c2)
SETTINGS disk = 's3disk',
    old_parts_lifetime = 1,
    min_bytes_for_wide_part = 1,
    vertical_merge_algorithm_min_rows_to_activate = 1,
    vertical_merge_algorithm_min_columns_to_activate = 1;

insert into test_s3 values(1,1,['a','b'],[1,2],[1,2],[1,2]);
insert into test_s3 values(1,1,['a','b'],[1,2],[1,2],[1,2]);
optimize table test_s3 final;
drop table test_s3 sync;

select * FROM system.remote_data_paths limit 10;
-- 0 rows in set. Elapsed: 0.003 sec.
```

```
/usr/local/bin/aws s3 ls --recursive s3://t..../denis/s3cached/|wc -l
6
```
@den-crane super fast at reproducing the issue 😅😅😅 At least we now have a simple example to reproduce it 🙏
@den-crane Looking at the settings you used for reproducing the issue, is it a side effect of the merge algorithm (the vertical one in this case)?
@tavplubix @CheSema Do you think you will have time to look for a potential fix, now that @den-crane has provided a way to reproduce the issue?
Hey @ray-at-github, did you have
@alifirat no, only primitive types like int and string.
Yes.
@den-crane do you know which file(s) handle that?
Do you mean source code .cpp files related to the vertical merge? No idea.
Hi,
If we use S3 as a disk for a table, how does ClickHouse make sure there are no orphan files left in the S3 bucket?
In my recent test, I encountered a lot of S3 rate-limiting errors (HTTP 503, "Slow Down") when I bulk-inserted lots of data.
Then I truncated the table. This seems to have left many files in the S3 bucket undeleted, although the table is empty.
This bucket is used only in this test and only for this table, so I expected the S3 bucket to be empty after the truncate. I also waited some time for ClickHouse to perform some sort of "garbage collection", but it doesn't seem to happen.
This will be an issue if we use S3 disks for production and the 'garbage' size keeps growing (storage costs $$).
Does ClickHouse have any mechanism to detect and collect garbage (unused S3 objects)?
Thanks for any insights if I missed something.