Accessing tables as files in HTTP interface #46925

alexey-milovidov · 2023-02-27T01:20:35Z

Allow to add an HTTP handler representing a database or all databases under a specified URL prefix:

prefix/table.tsv
prefix/database/table.json

It contains the database name, the table name, optional data format as a file extension, and optional additional parameters.

The request is processed as SELECT * FROM database.table FORMAT TSV.

The additional parameters are the same as what's available in the HTTP interface, including the settings, like offset, limit, but we should add more parameters for query construction:

columns - provide a subset of columns to return;
where - filter the data by the corresponding expression;
select - write a select query explicitly, omitting the FROM section;

The authentication and access control work as usual.

Additional notes

It composes nicely with #38775.

Content-length, range requests, and ETag are either not implemented or can be implemented by caching the resulting content.

You might want a GraphQL interface, but it's out of scope.

We can also represent it under an S3-compatible interface, although it is much harder to do.

Motivation

Exploration of the possibilities. Proposed by @camuel

The text was updated successfully, but these errors were encountered:

UnamedRus · 2023-02-27T02:11:43Z

May be we don't need to have separate feature for that?
IE predefined HTTP handles already can do that (or can do with minor changes).

alexey-milovidov · 2023-02-27T04:03:42Z

Yes, let's check if it's already possible.
If we can extract parameters from the URL's path and use parameterized queries with the Identifier parameter type, it should suffice.

alexey-milovidov · 2023-02-27T04:04:31Z

Looks like it would not work for file extensions.

UnamedRus · 2023-02-27T12:00:15Z

If we could add format setting, it should be ok to set this setting from parameters.

camuel · 2023-02-27T15:59:14Z

My idea was to create an overlay S3 interface for DeltaLake/Hudi/Iceberg data already on S3. Right now ClickHouse provides read-only support for those. If a full support for cooperatively transactional data on S3 is planned in the future, I think it is worth considering providing a S3-compatible interface on top in addition to a standard ClickHouse HTTP interface for the following reasons:

DERIVED DATA: It is not clear how to implement derived data like mat. views with DeltaLake/Hudi/Iceberg. Will mat. views only available through ClickHouse? In this case DeltaLake/Hudi/Iceberg interoperability is compromised as mat. views are only accessible through the particular engine. Let's assume Snowflake and ClickHouse work with the single Iceberg table on AWS S3. If a mat. view is required over this table, the same mat. view needs to be built in both ClickHouse and Snowflake. There is no other way. Or somehow mat. view should be introduced from CH to SF as a foreign table or vice versa which makes the whole thing pretty clumsy and I think Iceberg was not designed to work in this way. Creating an S3 overlay interface solves this problem by one engine being special and other engines connecting to a virtual S3 interface provided by this special engine. Mat. views become possible now and they look just like ordinary Iceberg tables. Heck, it will be possible even to represent Iceberg tables as ordinary parquet data and then on the client there is not even a need for Iceberg support!
CORRUPTION AVOIDANCE: DeltaLake/Hudi/Iceberg are not well defined formats and also because of their cooperative nature corruption is easy and goes silent. One buggy client or just a client for another language or older version or just a client which assumes it is just ordinary parquet data can quickly and silently corrupt the data. Even read-only access is not safe as clients will read data wrongly and poison the entire downstream pipeline. An overlay S3-compatible engine can enforce the cooperative transactionality convention and never allow corruption.
POLYMORPHIC: Such a design can always be reversed. If a user removes CH with such an overlay S3-compatible interface from his deployment, almost nothing will change for him and he can continue to use "DeltaLake/Hudi/Iceberg" in the old way. Just the URLs will need updating. So switching to this functionality and back is easy for the user.
STREAMING SUPPORT: DeltaLake/Hudi/Iceberg data don't like streaming as it creates too many small objects on underlying S3 service and this is costly and very slow. With the overlay S3-compatible interface, streaming can be supported in the original DeltaLake/Hudi/Iceberg interface and implemented as combining many small transactions into bigger ones on the underlying S3 service (but then of course N last transactions will be in danger)
QUERYING SUPPORT: S3Select facility in the S3 interface can be used to implement SQL querying without breaking S3 compatibility too much and without a need to extend it too much.

I haven't thought about an S3-compatible interface for ordinary ClickHouse deployment, but maybe this is a good idea too. I suggested this functionality only for the data which is already on S3 in DeltaLake/Hudi/Iceberg format.

There are numerous technical challenges here. How to support range queries? How to support ETags/hashing? However, I think most of them are solvable somehow and if not then 90% compatibility is still better in my view than having the above problems like impossibility to support derived data, streaming data, silent corruption etc..

As a prototype/demo it will be nice to have Snowflake accessing a mat. view in ClickHouse thinking it is an Iceberg table on S3.

alexey-milovidov added the feature label Feb 27, 2023

alexey-milovidov added the warmup task The task for new ClickHouse team members. Low risk, moderate complexity, no urgency. label Sep 3, 2023

alexey-milovidov mentioned this issue Jan 2, 2024

Roadmap 2024 (discussion) #58392

Open

nooblose linked a pull request May 24, 2024 that will close this issue

Add combining http parameters with queries, new handler for accessing tables as files #64336

Draft

29 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accessing tables as files in HTTP interface #46925

Accessing tables as files in HTTP interface #46925

alexey-milovidov commented Feb 27, 2023

UnamedRus commented Feb 27, 2023

alexey-milovidov commented Feb 27, 2023

alexey-milovidov commented Feb 27, 2023

UnamedRus commented Feb 27, 2023

camuel commented Feb 27, 2023

Accessing tables as files in HTTP interface #46925

Accessing tables as files in HTTP interface #46925

Comments

alexey-milovidov commented Feb 27, 2023

Additional notes

Motivation

UnamedRus commented Feb 27, 2023

alexey-milovidov commented Feb 27, 2023

alexey-milovidov commented Feb 27, 2023

UnamedRus commented Feb 27, 2023

camuel commented Feb 27, 2023