Update ReadBuffer for AzureBlobStorage #61884
base: master
Conversation
This is an automated comment for commit fa4d205 with a description of the existing statuses. It is updated for the latest CI run. ❌
Successful checks
@@ -50,14 +50,12 @@ class ReadBufferFromAzureBlobStorage : public ReadBufferFromFileBase
 private:
-    void initialize();
+    size_t readBytes();
Renamed this function, as it also reads data now. But we still need the initialized bool for other checks, so I kept that as it is.
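The rename described above can be sketched with a minimal mock (this is a hypothetical illustration with in-memory data, not the actual ReadBufferFromAzureBlobStorage implementation): one function that both performs lazy one-time setup and reads data, while a separate initialized flag stays available for other checks.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical mock: readBytes() both initializes (once) and reads,
// which is why initialize() was renamed; the `initialized` flag is
// kept because other code paths still need to check it.
class MockReadBuffer
{
public:
    explicit MockReadBuffer(std::string data_) : data(std::move(data_)) {}

    // Formerly initialize(); now also reads, hence the name readBytes().
    size_t readBytes(char * to, size_t n)
    {
        if (!initialized)
            initialized = true;          // one-time setup would happen here

        size_t available = data.size() - offset;
        size_t to_copy = n < available ? n : available;
        std::memcpy(to, data.data() + offset, to_copy);
        offset += to_copy;
        return to_copy;
    }

    bool initialized = false;            // still needed for other checks

private:
    std::string data;
    size_t offset = 0;
};
```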
-    std::unique_ptr<Azure::Core::IO::BodyStream> body_stream = std::move(download_response.Value.BodyStream);
-    bytes_copied = body_stream->ReadToCount(reinterpret_cast<uint8_t *>(to), body_stream->Length());
+    auto download_response = blob_client->DownloadTo(reinterpret_cast<uint8_t *>(to), n, download_options);
Does it really give a performance improvement? It seems like just a different API (and this new way with DownloadTo is better and clearer, so the change is good), but it does not seem to improve the performance in any way.
LGTM, but it may be worth updating the changelog entry to not-for-changelog, or, if not updating it, writing a somewhat more coherent description of the change in the changelog entry.
size_t bytes_read = 0;
size_t sleep_time_with_backoff_milliseconds = 100;
if (static_cast<size_t>(offset) >= getFileSize())
If it is an Azure disk read, it would be better not to execute an additional request to Azure just to get the file size (getFileSize is only used for external table engines), since it is already known before we create the read buffer. So maybe just pass file_size as an optional argument in the constructor of ReadBufferFromAzureBlobStorage?
By the way, why do we need this check? There is no such check for other buffers, AFAIK.
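The suggestion above can be sketched as follows (a hypothetical illustration, with the remote call mocked as a counter; the class and member names are assumptions, not the actual ClickHouse code): take an optional file_size in the constructor so getFileSize() only issues a request when the size was not already known.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical sketch of the reviewer's suggestion: if the caller
// already knows the file size, pass it in and skip the extra request.
class ReadBufferSketch
{
public:
    explicit ReadBufferSketch(std::optional<size_t> file_size_ = std::nullopt)
        : file_size(file_size_) {}

    size_t getFileSize()
    {
        if (!file_size)
        {
            ++requests_made;    // stands in for a size-fetching request to Azure
            file_size = 4096;   // mocked remote size
        }
        return *file_size;      // cached after the first fetch
    }

    size_t requests_made = 0;   // instrumentation for this sketch only

private:
    std::optional<size_t> file_size;
};
```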
For other buffers, I see we send a request to read the full intended length of the file.
Here we use the readBytes function to read bytes into a buffer, and we only read as much as its capacity (data_capacity). So every time we read, we advance the offset; at some point it will reach the intended size or file_size.
In the getFileSize() function we only fetch the value once; it is just returned for future calls.
Please let me know if you have any suggestions.
In the getFileSize() function we only fetch the value once; it is just returned for future calls.
Yes, I saw that it is fetched only once, but if it is avoidable, better to avoid it: we already have the file size when we create this read buffer, so better to use it.
For other buffers, I see we send a request to read the full intended length of the file.
But isn't it the same as here? Here we have a buffer and try to read data_capacity bytes into it, but if we reach the end of the file while reading into the buffer, we will just read nothing, the buffer will be empty, and we return false as well. Or am I missing something?
But isn't it the same as here? Here we have a buffer and try to read data_capacity bytes into it, but if we reach the end of the file while reading into the buffer, we will just read nothing, the buffer will be empty, and we return false as well. Or am I missing something?
The DownloadTo function takes a Range parameter (offset, length); if length is zero it reads the full file. So when we reach the end of the file, the range becomes (offset = file_size, length), and even though length is zero we get the error "The range specified is invalid for the current size of the resource".
Previously, when we used Download and then ReadToBuffer, there was a check to_read_bytes = min(total_size - offset, length), and we would just read 0 bytes and return false.
Hope that clarifies the need. Please suggest if you have better options.
Ah, ok, understood. Also, what if there are N bytes remaining to read from the file and N is less than buffer_capacity? Shouldn't it fail with an incorrect range error as well? The current code does not seem to be protected against this.
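The guard being discussed, clamping the requested length so the (offset, length) range never extends past the end of the blob, can be sketched as a pure helper (a hypothetical function name; this mirrors the min(total_size - offset, length) check from the old Download/ReadToBuffer path, not code from this PR):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: clamp the read length so a range request
// (offset, length) never exceeds file_size, which Azure rejects with
// "The range specified is invalid for the current size of the resource".
size_t clampedReadLength(size_t offset, size_t buffer_capacity, size_t file_size)
{
    if (offset >= file_size)
        return 0;                                   // nothing left; caller returns false
    return std::min(buffer_capacity, file_size - offset);
}
```

With this clamp, the "N remaining bytes < buffer_capacity" case simply requests N bytes instead of a full buffer's worth.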
But read_until_position is not always set. It is not set for remote_filesystem_read_method=read.
for (size_t i = 0; i < max_single_download_retries; ++i)
{
    try
    {
-        auto download_response = blob_client->Download(download_options);
-        data_stream = std::move(download_response.Value.BodyStream);
+        auto download_response = blob_client->DownloadTo(reinterpret_cast<uint8_t *>(data_ptr), data_capacity, download_options);
I think it will not improve performance but will decrease it instead, because it will make a GET request for each call of nextImpl instead of one request in the initialize method, as I understand the Azure SDK. DownloadTo looks more suitable for reading small blobs that can be read with one call.
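The request-count concern above can be modeled with a trivial sketch (hypothetical functions; this only counts requests under the reviewer's stated assumption about the SDK, it does not call Azure): streaming via Download() opens one body stream in initialize() and all subsequent reads come from it, while DownloadTo() issues a fresh ranged GET on every nextImpl() call.

```cpp
#include <cstddef>

// Hypothetical model of the reviewer's concern (assumption, not measured):
// Download() -> one GET total; DownloadTo() -> one GET per nextImpl() call.
size_t streamingGets(size_t next_impl_calls)
{
    (void)next_impl_calls;      // reads come from the already-open body stream
    return 1;                   // single Download() in initialize()
}

size_t downloadToGets(size_t next_impl_calls)
{
    return next_impl_calls;     // one ranged DownloadTo() per nextImpl()
}
```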
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Updated to use the Azure API DownloadTo instead of the Download and ReadTo APIs. DownloadTo reads into a buffer and uses parallel requests.