Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accessing tables as files in HTTP interface #46925

Open
alexey-milovidov opened this issue Feb 27, 2023 · 5 comments · May be fixed by #64336
Open

Accessing tables as files in HTTP interface #46925

alexey-milovidov opened this issue Feb 27, 2023 · 5 comments · May be fixed by #64336
Labels
feature warmup task The task for new ClickHouse team members. Low risk, moderate complexity, no urgency.

Comments

@alexey-milovidov
Copy link
Member

Allow to add an HTTP handler representing a database or all databases under a specified URL prefix:

prefix/table.tsv
prefix/database/table.json

It contains the database name, the table name, optional data format as a file extension, and optional additional parameters.

The request is processed as SELECT * FROM database.table FORMAT TSV.

The additional parameters are the same as what's available in the HTTP interface, including the settings, like offset, limit, but we should add more parameters for query construction:

columns - provide a subset of columns to return;
where - filter the data by the corresponding expression;
select - write a select query explicitly, omitting the FROM section;

The authentication and access control work as usual.

Additional notes

It composes nicely with #38775.

Content-length, range requests, and ETag are either not implemented or can be implemented by caching the resulting content.

You might want a GraphQL interface, but it's out of scope.

We can also represent it under an S3-compatible interface, although it is much harder to do.

Motivation

Exploration of the possibilities. Proposed by @camuel

@UnamedRus
Copy link
Contributor

May be we don't need to have separate feature for that?
IE predefined HTTP handles already can do that (or can do with minor changes).

@alexey-milovidov
Copy link
Member Author

Yes, let's check if it's already possible.
If we can extract parameters from the URL's path and use parameterized queries with the Identifier parameter type, it should suffice.

@alexey-milovidov
Copy link
Member Author

Looks like it would not work for file extensions.

@UnamedRus
Copy link
Contributor

If we could add format setting, it should be ok to set this setting from parameters.

@camuel
Copy link

camuel commented Feb 27, 2023

My idea was to create an overlay S3 interface for DeltaLake/Hudi/Iceberg data already on S3. Right now ClickHouse provides read-only support for those. If a full support for cooperatively transactional data on S3 is planned in the future, I think it is worth considering providing a S3-compatible interface on top in addition to a standard ClickHouse HTTP interface for the following reasons:

  1. DERIVED DATA: It is not clear how to implement derived data like mat. views with DeltaLake/Hudi/Iceberg. Will mat. views only available through ClickHouse? In this case DeltaLake/Hudi/Iceberg interoperability is compromised as mat. views are only accessible through the particular engine. Let's assume Snowflake and ClickHouse work with the single Iceberg table on AWS S3. If a mat. view is required over this table, the same mat. view needs to be built in both ClickHouse and Snowflake. There is no other way. Or somehow mat. view should be introduced from CH to SF as a foreign table or vice versa which makes the whole thing pretty clumsy and I think Iceberg was not designed to work in this way. Creating an S3 overlay interface solves this problem by one engine being special and other engines connecting to a virtual S3 interface provided by this special engine. Mat. views become possible now and they look just like ordinary Iceberg tables. Heck, it will be possible even to represent Iceberg tables as ordinary parquet data and then on the client there is not even a need for Iceberg support!

  2. CORRUPTION AVOIDANCE: DeltaLake/Hudi/Iceberg are not well defined formats and also because of their cooperative nature corruption is easy and goes silent. One buggy client or just a client for another language or older version or just a client which assumes it is just ordinary parquet data can quickly and silently corrupt the data. Even read-only access is not safe as clients will read data wrongly and poison the entire downstream pipeline. An overlay S3-compatible engine can enforce the cooperative transactionality convention and never allow corruption.

  3. POLYMORPHIC: Such a design can always be reversed. If a user removes CH with such an overlay S3-compatible interface from his deployment, almost nothing will change for him and he can continue to use "DeltaLake/Hudi/Iceberg" in the old way. Just the URLs will need updating. So switching to this functionality and back is easy for the user. 

  4. STREAMING SUPPORT: DeltaLake/Hudi/Iceberg data don't like streaming as it creates too many small objects on underlying S3 service and this is costly and very slow. With the overlay S3-compatible interface, streaming can be supported in the original DeltaLake/Hudi/Iceberg interface and implemented as combining many small transactions into bigger ones on the underlying S3 service (but then of course N last transactions will be in danger) 

  5. QUERYING SUPPORT: S3Select facility in the S3 interface can be used to implement SQL querying without breaking S3 compatibility too much and without a need to extend it too much.

I haven't thought about an S3-compatible interface for ordinary ClickHouse deployment, but maybe this is a good idea too. I suggested this functionality only for the data which is already on S3 in DeltaLake/Hudi/Iceberg format. 

There are numerous technical challenges here. How to support range queries? How to support ETags/hashing? However, I think most of them are solvable somehow and if not then 90% compatibility is still better in my view than having the above problems like impossibility to support derived data, streaming data, silent corruption etc.. 

As a prototype/demo it will be nice to have Snowflake accessing a mat. view in ClickHouse thinking it is an Iceberg table on S3. 

@alexey-milovidov alexey-milovidov added the warmup task The task for new ClickHouse team members. Low risk, moderate complexity, no urgency. label Sep 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature warmup task The task for new ClickHouse team members. Low risk, moderate complexity, no urgency.
Projects
None yet
3 participants