Azure Blob Storage Disk support in ClickHouse #29430
Comments
Have you considered using a MinIO gateway for an Azure server as a proxy server?
Hey @UnamedRus, Yes, we have considered using a MinIO proxy, but we still decided to write the Disk Blob Storage part on our own. This approach gives more direct access to the storage and is thus more flexible, plus we remove one bottleneck from the system in the form of a proxy server - all the data would have to come through this proxy, which comes with a performance and system design penalty.
Is there documentation for this feature? It seems it was released in 22.1, but I can't find e.g. configuration docs.
Hey @mxalis, The feature isn't documented yet, as far as I know. Below I list the available configuration parameters. Connection parameters:
Authentication parameters (the disk will try all available methods and Managed Identity Credential):
Limit parameters (mainly for internal usage):
Other parameters:
@kssenii, would you like us to update the documentation or shall we leave it to the ClickHouse team? If the former, shall we just create a section in https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/ like the one for S3?
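For anyone looking for a starting point before the documentation exists, a minimal storage configuration using such a disk might look roughly like the sketch below. The disk type, parameter names and policy name here are assumptions for illustration, following the parameter categories listed above rather than confirmed documentation.

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <!-- Hypothetical disk name and parameters; adjust to the actual documented names. -->
            <blob_storage_disk>
                <type>azure_blob_storage</type>
                <storage_account_url>https://myaccount.blob.core.windows.net</storage_account_url>
                <container_name>clickhouse-data</container_name>
                <!-- One of the authentication options; Managed Identity would omit the key. -->
                <account_name>myaccount</account_name>
                <account_key>***</account_key>
            </blob_storage_disk>
        </disks>
        <policies>
            <blob_storage_policy>
                <volumes>
                    <main>
                        <disk>blob_storage_disk</disk>
                    </main>
                </volumes>
            </blob_storage_policy>
        </policies>
    </storage_configuration>
</clickhouse>
```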
@jkuklis thanks, this will get me started
@jkuklis Any suggestions on data ingestion? I tried to insert a 30~40 MB ORC/Parquet file into a MergeTree table with Azure Blob Storage, and it ended in a timeout exception (exceeded 300 seconds).
It would be great if you could do that :)
Hey @openxxx,
When I was working on the implementation, I also tested it using [...]. In steps, I created a table using the same schema as [...].
Have you managed to insert and query anything with that table? Maybe your authentication doesn't work as intended? For Premium page blob accounts it's likely unsupported; the implementation was tested only with a Standard account. Could you share more context on how it manifests, e.g. some configs or the error message if there is one?
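As a quick sanity check before loading 30-40 MB ORC/Parquet files, a small smoke test along these lines could confirm that the disk works end to end; the table name, schema and policy name are illustrative and assume a storage policy like the one sketched earlier.

```sql
-- Hypothetical table placed on the Blob Storage-backed policy; names are illustrative.
CREATE TABLE blob_storage_test
(
    key UInt64,
    value String
)
ENGINE = MergeTree
ORDER BY key
SETTINGS storage_policy = 'blob_storage_policy';

-- Small insert and read-back before attempting larger ORC/Parquet loads.
INSERT INTO blob_storage_test VALUES (1, 'hello'), (2, 'world');
SELECT count() FROM blob_storage_test;
```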
Working config:
Non-working config:
disk config:
Error message on server start:
Hello!
We would like to propose introducing support for Azure Blob Storage Disks in ClickHouse, in a similar way to how it was done for AWS S3 Disks. At Contentsquare we have already started preliminary work to make sure this is feasible.
Context
We use S3 Disks in our AWS servers, for example for storing raw data or monitoring data loss with certain metadata, for which regular disks would be too expensive. The DiskS3 approach (and not e.g. the S3Engine one) is the best for us, as it can be used with MergeTrees.
Soon we will need a similar solution in Azure servers. We decided internally that the best way for us to go would be to develop for Blob Storage Disks the same logic that was developed for S3 Disks.
Note on alternatives: we considered using DataLake Gen 2, a higher abstraction built on top of Blob Storage to mimic a disk behavior, but it doesn't offer enough flexibility, for example a possibility to do a move operation, which is important for ClickHouse. We also considered using a proxy server to translate commands from S3 to Blob Storage, but it would be too error-prone and inefficient.
Work plan
Below we present what we think is necessary to add the Blob Storage Disk to ClickHouse:
Azure SDK dependency
We managed to add the dependency by:
- adding the azure-sdk-for-cpp directory and an azure-cmake directory with a custom CMakeLists.txt to contrib
- referencing them in the src and contrib CMakeLists.txt files
- adding a .cmake file to cmake/find and including it in the main CMakeLists.txt file
- adjusting the boringssl-cmake CMakeLists.txt, as one of its functions is used in the Azure SDK

We were able to manipulate Blob Storage from within ClickHouse with this configuration.
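To give a rough idea of the wiring this involves (the option, file and target names below are assumptions, not the actual patch), the build-system changes could take a shape along these lines:

```cmake
# cmake/find/blob_storage.cmake (hypothetical file name), included from the main CMakeLists.txt
option(USE_AZURE_BLOB_STORAGE "Enable Azure Blob Storage Disk support" ON)

# contrib/CMakeLists.txt: build the vendored SDK through the custom azure-cmake/CMakeLists.txt
if (USE_AZURE_BLOB_STORAGE)
    add_subdirectory(azure-cmake)
endif()

# src/CMakeLists.txt: link the SDK target (hypothetical name) into the IO library
if (USE_AZURE_BLOB_STORAGE)
    target_link_libraries(clickhouse_common_io PUBLIC azure_sdk)
endif()
```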
POCO HTTP wrapper for Azure
This part is used for communication over the network and interpretation of messages. It would be based on the S3 counterpart, with all its files located in src/IO/S3. The S3 version is quite developed and robust; for a start, we could probably implement a simpler solution. We could also extract the common part with S3 and create a parent class for it.
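To give a flavour of the plumbing such a wrapper sits on (this is plain POCO usage, not the actual ClickHouse wrapper), a request/response round trip with POCO looks roughly like this; the real wrapper would add authentication headers, retries, timeouts and buffer integration on top.

```cpp
#include <Poco/Net/HTTPClientSession.h>
#include <Poco/Net/HTTPRequest.h>
#include <Poco/Net/HTTPResponse.h>
#include <Poco/StreamCopier.h>
#include <iostream>

int main()
{
    // Plain HTTP GET via POCO: open a session, send a request, stream the response body.
    Poco::Net::HTTPClientSession session("example.com", 80);
    Poco::Net::HTTPRequest request(Poco::Net::HTTPRequest::HTTP_GET, "/", Poco::Net::HTTPMessage::HTTP_1_1);
    session.sendRequest(request);

    Poco::Net::HTTPResponse response;
    std::istream & body = session.receiveResponse(response);
    Poco::StreamCopier::copyStream(body, std::cout);
}
```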
Azure authentication part
To start, we would like to rely on role-based authentication, in which authentication is granted to an Azure instance as a whole (so there are no credentials or secrets). We have already conducted preliminary tests for this type of authentication; it is an open question whether we can leave it like that for now, as the S3 implementation supports more ways to authenticate. For S3, authentication is done in src/IO/S3Common.h and .cpp.
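For the role-based path specifically, the Azure C++ SDK provides a managed identity credential; a minimal sketch of obtaining a container client with it (the container URL is illustrative, and this is not the ClickHouse code itself) might look like:

```cpp
#include <azure/identity/managed_identity_credential.hpp>
#include <azure/storage/blobs.hpp>
#include <memory>

int main()
{
    // Managed Identity: the credential comes from the Azure instance itself,
    // so no account key or secret has to be stored in the ClickHouse configuration.
    auto credential = std::make_shared<Azure::Identity::ManagedIdentityCredential>();

    Azure::Storage::Blobs::BlobContainerClient container_client(
        "https://myaccount.blob.core.windows.net/clickhouse-data", credential);

    // List a few blob names just to verify that the role assignment grants access.
    for (auto page = container_client.ListBlobs(); page.HasPage(); page.MoveToNextPage())
        for (const auto & blob : page.Blobs)
            (void)blob.Name;
}
```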
Blob Storage buffer handling
This part covers the actual read and write buffers for Blob Storage. For S3, these are implemented in src/IO, in the ReadBufferFromS3 and WriteBufferFromS3 .h and .cpp files. It is unclear whether these need to be extracted from the Disk implementation very early on.
Blob Storage Disk
The Blob Storage Disk would be an implementation of the IDiskRemote interface, based on the equivalent src/Disks/S3 files. Regarding mutualization of the logic for Blob Storage and S3: on one hand, it might be hard, as the implementations are short and quite Disk-specific; on the other hand, this part seems to be updated rather frequently, so it might make sense to mutualize the logic to ensure that potential fixes and refactors are applied to both Disks.
Integration tests
We would like to create a couple of end-to-end integration tests on Contentsquare use cases. We aim to run the full Azure pipeline for at least a couple of days to make sure the solution runs smoothly. Functional and unit tests are also considered.
Execution
We aim to implement this feature on our own at Contentsquare provided that we get a green light from you on the design. We have already started working on this feature and expect it to be ready in the first quarter of 2022.
Questions
Thanks for your attention, let us know what you think!
Jakub Kuklis
Contentsquare