
Optimize memory consumption during backup to S3 #45188

Merged: 5 commits merged into ClickHouse:master on Jan 21, 2023

Conversation

@vitlibar (Member) commented Jan 11, 2023

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Optimize memory consumption during backup to S3: files are now copied to S3 directly, without using WriteBufferFromS3 (which could use a lot of memory).

This PR fixes #43083

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Jan 11, 2023
@vitlibar vitlibar force-pushed the backup-to-s3-memory-optimization branch from 1f8e6d0 to 39153b0 Compare January 11, 2023 20:38
@CheSema (Member) commented Jan 14, 2023

I like this approach. Actually, this is the best way to upload a file from a local disk to S3.
I have several questions, mostly just to record in the history what this code does and why.

When you copy a file, you know its size. Why do you still use an exponentially growing part upload size instead of a fixed one?

When you copy from S3 to S3, why do you prefer seekable reading from S3 instead of UploadPartCopyRequest? My concern is the following: with seekable reading you read a file twice, first to calculate the SHA hash and then to upload it to S3. Since this read is done with a fixed buffer size, memory consumption is low. Still, I think it is worth doing it with the existing code in order not to have one more implementation based on UploadPartCopyRequest.

@vitlibar (Member, Author) commented Jan 16, 2023

When you copy a file, you know its size. Why do you still use an exponentially growing part upload size instead of a fixed one?

Yes, you're right. That code was copied from WriteBufferFromS3, but there is no reason to keep the exponential approach in these copy functions.

When you copy from S3 to S3, why do you prefer seekable reading from S3 instead of UploadPartCopyRequest? My concern is the following: with seekable reading you read a file twice, first to calculate the SHA hash and then to upload it to S3. Since this read is done with a fixed buffer size, memory consumption is low. Still, I think it is worth doing it with the existing code in order not to have one more implementation based on UploadPartCopyRequest.

I don't quite understand: I don't prefer seekable reading over UploadPartCopyRequest. Copying from S3 to S3 with UploadPartCopyRequest must always be better. Seekable reading is not for copying from S3 to S3; it's mostly for uploading data from local disks to S3.
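
For reference, here is a minimal sketch of what a server-side part copy looks like with the AWS SDK for C++'s UploadPartCopyRequest; the bucket names, key names and upload_id below are hypothetical, and the PR's own wrapper code may differ:

#include <aws/s3/S3Client.h>
#include <aws/s3/model/UploadPartCopyRequest.h>

// Copies one part of a multipart upload entirely on the S3 side,
// so the object's data never passes through the client.
Aws::S3::Model::UploadPartCopyRequest request;
request.SetBucket("dest-bucket");                 // hypothetical destination bucket
request.SetKey("dest-key");                       // hypothetical destination key
request.SetCopySource("src-bucket/src-key");      // hypothetical source object
request.SetCopySourceRange("bytes=0-5242879");    // source range for this part (first 5 MiB)
request.SetUploadId(upload_id);                   // id returned by CreateMultipartUpload
request.SetPartNumber(1);
auto outcome = client.UploadPartCopy(request);    // client is an Aws::S3::S3Client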

@CheSema (Member) commented Jan 16, 2023

I don't quite understand: I don't prefer seekable reading over UploadPartCopyRequest.

Sorry, indeed you don't. You have a dedicated implementation for S3-to-S3 copying with UploadPartCopyRequest. This is great.

@vitlibar (Member, Author) commented:

When you copy a file, you know its size. Why do you still use an exponentially growing part upload size instead of a fixed one?

I did that. Now an uploaded file is split into parts of the same size.
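
As an illustration of the fixed-size approach, a hedged sketch of how a part size can be chosen when the total file size is known up front (the constants and the function name are assumptions, not the PR's exact code):

#include <algorithm>
#include <cstddef>

// Pick one part size for the whole upload so that every part (except possibly
// the last one) has the same size and the S3 limits are respected.
size_t choosePartSize(size_t total_size, size_t desired_part_size)
{
    constexpr size_t min_part_size = 5 * 1024 * 1024;  // S3 minimum for non-final parts
    constexpr size_t max_part_number = 10000;          // S3 limit on parts per upload

    size_t part_size = std::max(desired_part_size, min_part_size);

    // If the file would need more than max_part_number parts,
    // grow the part size just enough to fit within the limit.
    if ((total_size + part_size - 1) / part_size > max_part_number)
        part_size = (total_size + max_part_number - 1) / max_part_number;

    return part_size;
}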

@vitlibar vitlibar force-pushed the backup-to-s3-memory-optimization branch from 87e4f70 to a934fbf Compare January 16, 2023 15:13
Correctly use AbortMultipleUpload request.
Support std::ios_base::end in StdStreamBufFromReadBuffer::seekpos().
@vitlibar vitlibar force-pushed the backup-to-s3-memory-optimization branch from a934fbf to 1c84518 Compare January 17, 2023 08:36
{
task->req = fillUploadPartRequest(part_number, part_offset, part_size);

schedule([this, task, task_finish_notify]()
@CheSema (Member) commented Jan 17, 2023

Help me, please, to understand one thing here: how many concurrent uploads are run here? Is it a limited thread pool?

My concern is that it runs a lot of tasks, each of which makes a new connection. It could run out of available inodes or hit other limits inside ClickHouse.

Compare this with WriteBufferFromS3: it uses threads to write the data it has consumed, and if consumption is not fast, it just uploads in the background with one thread. However, WriteBufferFromS3 is also able to produce a lot of tasks. Here the situation is different: a lot of tasks are started right away, and each allocates at least 1 MB (its buffer) plus a new connection.

@vitlibar (Member, Author) replied:

It's normally used with IOThreadPool. It's limited, of course: by default it runs up to 100 threads, so at most 100 parallel uploads. In practice it will be even fewer, because IOThreadPool is also used for other tasks.

@vitlibar (Member, Author) added:

In fact it should use its own thread pools, backups_thread_pool and restores_thread_pool, so I changed the code to use the correct thread pools. Those pools are created specifically for backups and restores.
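
A rough sketch of the idea, assuming ClickHouse's ThreadPool interface with scheduleOrThrowOnError()/wait(); the pool size, num_parts and uploadPart() below are placeholders rather than the PR's actual code:

// A dedicated pool bounds how many part uploads run concurrently,
// independently of IOThreadPool and other server activity.
ThreadPool backups_thread_pool(/* max_threads = */ 16);

for (size_t part_number = 1; part_number <= num_parts; ++part_number)
{
    backups_thread_pool.scheduleOrThrowOnError([&, part_number]
    {
        uploadPart(part_number);  // hypothetical helper issuing one UploadPart request
    });
}

backups_thread_pool.wait();       // block until every scheduled part has finished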


for (size_t part_number = 1; position < end_position; ++part_number)
{
if (multipart_upload_aborted)
@CheSema (Member) commented Jan 17, 2023

Isn't it better to also check multipart_upload_aborted right before processUploadPartRequest, in order to stop the work as fast as possible if a non-retryable error occurs?
My concern is that the code has to wait for all the tasks that have been scheduled. If 1000 tasks were scheduled, only 100 of them run at a time, and each finishes with a connection timeout, it would take 10 x timeout to finish the upload with an error.

@CheSema (Member) added:

It seems like the queue size is pretty big

    auto queue_size = config.getUInt(".threadpool_writer_queue_size", 1000000);

and all the tasks are about to be pushed to the scheduler's queue. If the first task fails, there is no need to run the rest.

@vitlibar (Member, Author) replied:

Isn't it better to also check multipart_upload_aborted right before processUploadPartRequest?

Done, I changed the code.
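
A minimal sketch of what that change can look like inside the scheduled task (using the multipart_upload_aborted and processUploadPartRequest names from the discussion above; the real code in the PR may be structured differently):

schedule([this, task, task_finish_notify]()
{
    // If another part already failed with a non-retryable error, skip the work
    // so queued tasks drain quickly instead of each hitting its own timeout.
    if (!multipart_upload_aborted)
        processUploadPartRequest(*task);

    task_finish_notify();  // still notify, so the waiting thread can finish up
});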

@vitlibar vitlibar force-pushed the backup-to-s3-memory-optimization branch 2 times, most recently from 633d795 to 1a680b0 Compare January 20, 2023 16:29
@vitlibar (Member, Author) commented:

The test test_backup_restore_on_cluster/test_disallow_concurrency.py::test_concurrent_restores_on_different_node is flaky.

@vitlibar vitlibar merged commit f0fda58 into ClickHouse:master Jan 21, 2023
@vitlibar vitlibar deleted the backup-to-s3-memory-optimization branch January 21, 2023 11:37
AVMusorin added a commit to AVMusorin/ClickHouse that referenced this pull request May 17, 2023
Comment on lines +31 to +32
StdStreamFromReadBuffer(std::unique_ptr<ReadBuffer> buf, size_t size) : Base(&stream_buf), stream_buf(std::move(buf), size) { }
StdStreamFromReadBuffer(ReadBuffer & buf, size_t size) : Base(&stream_buf), stream_buf(buf, size) { }
A contributor commented:

I have some questions about this.

Base(&stream_buf) will call the following constructor:

template <class _CharT, class _Traits>
basic_streambuf<_CharT, _Traits>::basic_streambuf()
    : __binp_(nullptr),
      __ninp_(nullptr),
      __einp_(nullptr),
      __bout_(nullptr),
      __nout_(nullptr),
      __eout_(nullptr)
{
}

template <class _CharT, class _Traits>
basic_streambuf<_CharT, _Traits>::basic_streambuf(const basic_streambuf& __sb)
    : __loc_(__sb.__loc_),
      __binp_(__sb.__binp_),
      __ninp_(__sb.__ninp_),
      __einp_(__sb.__einp_),
      __bout_(__sb.__bout_),
      __nout_(__sb.__nout_),
      __eout_(__sb.__eout_)
{
}

So __ninp_ and __einp_ are nullptr. But std::iostream will call sgetc() to get a char, so we will always call underflow(), which does not advance the position.

inline _LIBCPP_HIDE_FROM_ABI_AFTER_V1
int_type sgetc() {
    if (__ninp_ == __einp_)
        return underflow();
    return traits_type::to_int_type(*__ninp_);
}
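
For comparison, a hedged sketch of the usual way a streambuf keeps sgetc()/sbumpc() advancing: underflow() refills a private buffer and publishes it with setg(), so the base class stops hitting the empty-get-area path on every call. The ReadBuffer::read() call here is an assumption about its interface, and this is not the PR's actual implementation:

#include <streambuf>
#include <cstddef>

class ReadBufferStreamBuf : public std::streambuf
{
public:
    explicit ReadBufferStreamBuf(ReadBuffer & in_) : in(in_) {}

protected:
    int_type underflow() override
    {
        if (gptr() < egptr())                          // unread data is still in the get area
            return traits_type::to_int_type(*gptr());

        size_t n = in.read(buffer, sizeof(buffer));    // assumed ReadBuffer API: returns bytes read, 0 at EOF
        if (n == 0)
            return traits_type::eof();

        setg(buffer, buffer, buffer + n);              // publish the refilled get area
        return traits_type::to_int_type(*gptr());
    }

private:
    ReadBuffer & in;
    char buffer[4096]{};
};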

Labels
pr-improvement Pull request with some product improvements

Successfully merging this pull request may close these issues.

Backup to S3 endpoint out of memory