
[Apache Iceberg] Failing to create table #49958

Closed
melvynator opened this issue May 17, 2023 · 20 comments · Fixed by #50232

Comments

@melvynator
Member

melvynator commented May 17, 2023

Creating a table fails when using Apache Iceberg:

CREATE TABLE iceberg Engine=Iceberg(...)

2023.05.11 10:21:36.140819 [ 19080 ] {bc194292-7197-4ab9-9a30-b9461ab43ecd} <Error> TCPHandler: Code: 499. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /usr/bin/clickhouse
1. ? @ 0x9801d4d in /usr/bin/clickhouse
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, DB::S3Settings::RequestSettings const&, bool, bool, bool) @ 0x126aa4fe in /usr/bin/clickhouse
3. DB::StorageS3Source::KeysIterator::KeysIterator(DB::S3::Client const&, String const&, std::vector<String, std::allocator<String>> const&, String const&, DB::S3Settings::RequestSettings const&, std::shared_ptr<DB::IAST>, DB::Block const&, std::shared_ptr<DB::Context const>, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x1447e152 in /usr/bin/clickhouse
4. DB::StorageS3::createFileIterator(DB::StorageS3::Configuration const&, bool, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST>, DB::Block const&, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x14486c4b in /usr/bin/clickhouse
5. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x144861c6 in /usr/bin/clickhouse

We have looked at the codebase and I have the feeling that you're calling HeadObject on a folder to get the last modification time, but a folder on S3 never has a last modification time. We ran the following command on the same bucket and keys set in the Iceberg engine and we see the exact same error (aws s3api head-object bucket --key mykey returns a 404).

Version used: 23.4.1.1943

@tavplubix
Member

cc: @ucasfl, @kssenii

@kssenii
Member

kssenii commented May 17, 2023

Please check the log message "New configuration path: {}, keys: {}" ("trace" log level) and verify what is in keys: .... It should match the S3 files.

LOG_TRACE(
    &Poco::Logger::get("DataLake"),
    "New configuration path: {}, keys: {}",
    configuration.getPath(), fmt::join(configuration.keys, ", "));

@kssenii
Member

kssenii commented May 17, 2023

We have looked at the codebase and I have the feeling that you're calling HeadObject on a folder to get the last modification time, but a folder on S3 never has a last modification time.

Not the folder; it makes a head request to the files, which are added here:

if (keys.empty())
    configuration.keys = getDataFiles(configuration, local_context);
else
    configuration.keys = keys;

@alifirat

alifirat commented May 17, 2023

@kssenii
Let me add more context to this ticket. To debug, we have tested locally with the docker image clickhouse/clickhouse-server:23.4-alpine and we have enabled the trace logs.
We have identified the line you mentioned, "New configuration path: {}, keys: {}", and in the trace log we see:

[clickhouse1] 2023.05.17 14:58:39.060859 [ 8 ] {795cb46f-4f58-4f6b-b325-29336025f225} <Trace> DataLake: New configuration path: iceberg-catalog/iceberg/, keys: s3://tmp/iceberg-catalog/iceberg/1.parquet, etc.

In total, keys returns 1725 files (we have checked in the AWS UI; we see the exact same number of files).
Later in the trace, we see the error:

: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR) (version 23.4.2.11 (official build)) (from 127.0.0.1:57074) (in query: select count(1) from test), Stack trace (when copying this message, always include the lines below)

that results from the command CREATE TABLE our_table Engine = Iceberg('s3://tmp/iceberg-catalog/iceberg/').

My question is:

  • do you do a head request for every file found? If yes, why?

@kssenii
Member

kssenii commented May 17, 2023

/tmp/iceberg-catalog/iceberg/1.parquet - does the URL in the Iceberg engine argument (iceberg(<url>)) have the same prefix /tmp/iceberg-catalog/iceberg/?

UPDATE: ah I see it in your comment "that results from the command CREATE TABLE our_table Engine = Iceberg('s3://tmp/iceberg-catalog/iceberg/')."

do you do a head request for every file found? If yes, why?

I see it is done here, and indeed all keys are iterated to make a head request:

for (auto && key : all_keys)
{
    auto info = S3::getObjectInfo(client_, bucket, key, version_id_, request_settings_);
    total_size += info.size;
    keys.emplace_back(std::move(key), std::move(info));
}

Probably it is needed for schema inference,
but I do not know why we do it for all keys; maybe @Avogar can tell?

@kssenii
Member

kssenii commented May 17, 2023

/tmp/iceberg-catalog/iceberg/1.parquet - does the URL in the Iceberg engine argument (iceberg()) have the same prefix /tmp/iceberg-catalog/iceberg/?

UPDATE: ah I see it in your comment "that results from the command CREATE TABLE our_table Engine = Iceberg('s3://tmp/iceberg-catalog/iceberg/')."

But the log line

<Trace> DataLake: New configuration path: iceberg-catalog/iceberg/, keys: s3://tmp/iceberg-catalog/iceberg/1.parquet,

suggests that you had Engine = Iceberg('s3://iceberg-catalog/iceberg/') instead of Engine = Iceberg('s3://tmp/iceberg-catalog/iceberg/'), doesn't it? (Because I see New configuration path: iceberg-catalog/iceberg/ without the tmp prefix.)

@alifirat

Hey @kssenii
cs-tmp is the bucket name while the iceberg-catalog-iceberg is the key.

It's not possible to create Engine = Iceberg('s3://iceberg-catalog/iceberg/') without a URL, no?

@kssenii
Member

kssenii commented May 17, 2023

cs-tmp is the bucket name, while iceberg-catalog-iceberg is the key.

OK, just make sure that the paths in the Iceberg metadata match the paths in S3 (you can also read the Iceberg metadata files with clickhouse-local and check), because the error you get, "Failed to get object info: No response body..", means that it tries to read a non-existing file, and a prefix mismatch between the Iceberg metadata and the S3 URL can be the reason. (It is possible that in a previous version such a mismatch was allowed and in a recent version it is not; I did not check.)

data_file column, example:

┌─status─┬─────────snapshot_id─┬─data_file──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1 │ 2252246380142525104 │ ('/iceberg_data/db/table_name/data/a=0/00000-1-c9535a00-2f4f-405c-bcfa-6d4f9f477235-00001.parquet','PARQUET',(0),1,631,67108864,[(1,46),(2,48)],[(1,1),(2,1)],[(1,0),(2,0)],[],[(1,'\0\0\0\0\0\0\0\0'),(2,'1')],[(1,'\0\0\0\0\0\0\0\0'),(2,'1')],NULL,[4],0) │
│ 1 │ 2252246380142525104 │ ('/iceberg_data/db/table_name/data/a=1/00000-1-c9535a00-2f4f-405c-bcfa-6d4f9f477235-00002.parquet','PARQUET',(1),1,631,67108864,[(1,46),(2,48)],[(1,1),(2,1)],[(1,0),(2,0)],[],[(1,'\0\0\0\0\0\0\0'),(2,'2')],[(1,'\0\0\0\0\0\0\0'),(2,'2')],NULL,[4],0) │
│ 1 │ 2252246380142525104 │ ('/iceberg_data/db/table_name/data/a=2/00000-1-c9535a00-2f4f-405c-bcfa-6d4f9f477235-00003.parquet','PARQUET',(2),1,631,67108864,[(1,46),(2,48)],[(1,1),(2,1)],[(1,0),(2,0)],[],[(1,'\0\0\0\0\0\0\0'),(2,'3')],[(1,'\0\0\0\0\0\0\0'),(2,'3')],NULL,[4],0) │

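For example, something like this should show which prefixes are recorded in the metadata (just a sketch, using the s3 table function instead of clickhouse-local; bucket, prefix and snapshot file name are placeholders you would replace with your own):

-- list the manifests referenced by a snapshot; the paths stored there should
-- resolve under the same bucket/prefix you pass to Engine = Iceberg(...)
SELECT manifest_path, added_data_files_count
FROM s3('https://<bucket>.s3.<region>.amazonaws.com/<prefix>/metadata/snap-<snapshot-id>.avro')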

It's not possible to create Engine = Iceberg('s3://iceberg-catalog/iceberg/') without a URL, no?

No, or I didn't understand the question; how would it work without a URL?

@alifirat

@kssenii I'll come back with a shorter example. In the meantime, I was just wondering (because a log on the key is missing here): when calling the S3 API, do you think you keep the bucket name in the key? That may explain the 404.

I'm able to reproduce the issue with a single data file.

@alifirat

@kssenii I would like to add this also: if the number of data files is high, the time to create the table will grow linearly with it.

If it's about schema inference, the schema is available in the metadata, so why not look there? We plan to use Iceberg and I cannot imagine doing a request for each data file.

@kssenii
Member

kssenii commented May 17, 2023

cs-tmp is the bucket name, while iceberg-catalog-iceberg is the key.

"cs-tmp" or "tmp"? here keys: s3://tmp/iceberg-catalog/iceberg/1.parquet, I see just "tmp"

when calling the S3 API, do you think you keep the bucket name in the key?

No, it is stored separately from the key: https://github.com/ClickHouse/ClickHouse/blob/master/src/IO/S3/URI.h#L25-L26
By the way, it is quite strange that the Iceberg metadata contains "tmp" if, as you say, it is a bucket name; there should be only a key.

it's available in the metadata, so why not look there?

It is possible, just not implemented; you can create an issue.

@alifirat

Hey @kssenii

Let me give you a simple and easy reproduction.

How did I generate the data?

Run spark-sql with

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hadoop \
  --conf spark.sql.catalog.iceberg.warehouse=s3://cs-tmp/akilic/iceberg-catalog/ \
  --conf spark.sql.defaultCatalog=iceberg
spark-sql> create table iceberg.test (a bigint) TBLPROPERTIES('format-version'='2');
spark-sql> INSERT INTO iceberg.test values 1, 2, 3;
spark-sql> select * from iceberg.test;
1
2
3

As you can see, I'm able to read the data with spark-sql.

What do we have in AWS?

Data

$ aws s3 ls  s3://cs-tmp/akilic/iceberg-catalog/test/data/
2023-05-17 22:52:14        422 00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
2023-05-17 22:52:15        425 00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet

Metadata

$ aws s3 ls s3://cs-tmp/akilic/iceberg-catalog/test/metadata/
2023-05-17 22:52:15       6629 0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
2023-05-17 22:52:15       4263 snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro
2023-05-17 22:51:38        847 v1.metadata.json
2023-05-17 22:52:16       1891 v2.metadata.json
2023-05-17 22:52:16          1 version-hint.text

ClickHouse

I'm using the docker image clickhouse-server:23.4-alpine

The issue happens when executing this simple request:

$  SET send_logs_level = 'trace'; 
$ create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')
[clickhouse1] 2023.05.17 21:07:03.317667 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: CREATE TABLE ON default.test
[clickhouse1] 2023.05.17 21:07:03.319788 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.321678 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.337961 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> NamedCollectionsUtils: Loaded 0 collections from SQL
[clickhouse1] 2023.05.17 21:07:04.549157 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> DataLake: New configuration path: akilic/iceberg-catalog/test/, keys: s3://cs-tmp/akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet, s3://cs-tmp/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
[clickhouse1] 2023.05.17 21:07:06.354270 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Error> executeQuery: Code: 499. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR) (version 23.4.2.11 (official build)) (from 127.0.0.1:56116) (in query: create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xbc87ee4 in /usr/bin/clickhouse
1. ? @ 0x86621f8 in /usr/bin/clickhouse
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, DB::S3Settings::RequestSettings const&, bool, bool, bool) @ 0xf5f857c in /usr/bin/clickhouse
3. DB::StorageS3Source::KeysIterator::KeysIterator(DB::S3::Client const&, String const&, std::vector<String, std::allocator<String>> const&, String const&, DB::S3Settings::RequestSettings const&, std::shared_ptr<DB::IAST>, DB::Block const&, std::shared_ptr<DB::Context const>, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e09e78 in /usr/bin/clickhouse
4. DB::StorageS3::createFileIterator(DB::StorageS3::Configuration const&, bool, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST>, DB::Block const&, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e11170 in /usr/bin/clickhouse
5. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x10e10910 in /usr/bin/clickhouse
6. DB::StorageS3::StorageS3(DB::StorageS3::Configuration const&, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, bool, std::shared_ptr<DB::IAST>) @ 0x10e0f794 in /usr/bin/clickhouse
7. ? @ 0x10ed2294 in /usr/bin/clickhouse
8. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x10c7d7fc in /usr/bin/clickhouse
9. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x104b89b0 in /usr/bin/clickhouse
10. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x104b2a9c in /usr/bin/clickhouse
11. DB::InterpreterCreateQuery::execute() @ 0x104bd204 in /usr/bin/clickhouse
12. ? @ 0x1097d198 in /usr/bin/clickhouse
13. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x1097a5c4 in /usr/bin/clickhouse
14. DB::TCPHandler::runImpl() @ 0x11531b1c in /usr/bin/clickhouse
15. DB::TCPHandler::run() @ 0x115443e4 in /usr/bin/clickhouse
16. Poco::Net::TCPServerConnection::start() @ 0x121b0604 in /usr/bin/clickhouse
17. Poco::Net::TCPServerDispatcher::run() @ 0x121b1b20 in /usr/bin/clickhouse
18. Poco::PooledThread::run() @ 0x1235ac7c in /usr/bin/clickhouse
19. Poco::ThreadImpl::runnableEntry(void*) @ 0x12358544 in /usr/bin/clickhouse
20. start_thread @ 0x7624 in /lib/libpthread.so.0
21. ? @ 0xd149c in /lib/libc.so.6

In the trace you see two data files mentioned:

  • s3://cs-tmp/akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
  • s3://cs-tmp/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet

and we can see that it is consistent with

$ aws s3 ls  s3://cs-tmp/akilic/iceberg-catalog/test/data/
2023-05-17 22:52:14        422 00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
2023-05-17 22:52:15        425 00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet

Let's check the head-object request for both files:

aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
    "LastModified": "2023-05-17T20:52:14+00:00",
    "ContentLength": 422,
    "ETag": "\"9d27f6c2a869bf8424fc66076918b5d9\"",
    "ContentType": "binary/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
    "AcceptRanges": "bytes",
    "Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
    "LastModified": "2023-05-17T20:52:15+00:00",
    "ContentLength": 425,
    "ETag": "\"1c919896c4bfc3f46260c2d7baa9e55c\"",
    "ContentType": "binary/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

If we try to read the data without Iceberg:

select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet')
┌─a─┐
│ 2 │
│ 3 │
└───┘

select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')

SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')

Query id: 4050b563-57f2-4b2f-ab12-f4aac87a9cff

[clickhouse1] 2023.05.17 21:14:14.402574 [ 9 ] {4050b563-57f2-4b2f-ab12-f4aac87a9cff} <Debug> executeQuery: (from 127.0.0.1:56116) select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro') (stage: Complete)
Exception on client:
Code: 92. DB::Exception: Tuple cannot be empty: while receiving packet from localhost:9000. (EMPTY_DATA_PASSED)

select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro') Format Vertical

Row 1:
──────
manifest_path:             s3://cs-tmp/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
manifest_length:           6629
partition_spec_id:         0
content:                   0
sequence_number:           1
min_sequence_number:       1
added_snapshot_id:         7532076000798921356
added_data_files_count:    2
existing_data_files_count: 0
deleted_data_files_count:  0
added_rows_count:          3
existing_rows_count:       0
deleted_rows_count:        0
partitions:                []

SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json')

Code: 117. DB::Exception: Expected field "meta" with columns names and types, found field format-version: Cannot extract table structure from JSON format file. You can specify the structure manually. (INCORRECT_DATA) (version 23.4.2.11 (official build)) (from 127.0.0.1:39778) (in query: select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json')), Stack trace (when copying this message, always include the lines below):

I've put in as much detail as possible to help determine whether the issue is in the way we generated the data or whether there is a real bug in the 23.4 release.

@alifirat

@Avogar do you know if the head requests are used for schema inference? If yes, I would like to suggest an improvement by creating an issue.

I just want to be sure before 🙂

@Avogar
Member

Avogar commented May 24, 2023

Probably it is needed for schema inference,
but I do not know why we do it for all keys; maybe @Avogar can tell?

@Avogar do you know if the head requests are used for schema inference? If yes, I would like to suggest an improvement by creating an issue.

When I implemented schema inference, there was a head request only for the keys that we actually read data from. There have been a lot of changes/improvements/refactorings from other people since then. Need to check.

@Avogar
Member

Avogar commented May 24, 2023

Actually, @kssenii it's your code :)
#43454

It was needed for calculating total_size for the progress bar. We use the same iterator for reading and for schema inference. For reading it's OK, we will do all these head requests anyway, but for schema inference we should not do it: we read only the first few files, and we don't need to calculate total_size because we don't send progress during schema inference. We can add a flag to KeysIterator so that for schema inference the head request is done only when a new key is requested. I will create a PR for it.

UPD: #50203

@alifirat

@Avogar if you don't mind, what do you think about the enhancement I've shared in #50012?

TL;DR: why not use the metadata for schema inference?
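To illustrate the idea (a rough, untested sketch; the URL reuses the reproduction above, and the field name depends on the metadata format version: "schema" in v1, "schemas" in v2):

-- the table schema is already stored in the Iceberg metadata JSON,
-- so in principle it could be read from there instead of opening data files
SELECT JSONExtractRaw(content, 'schemas') AS iceberg_schemas
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json', 'JSONAsString', 'content String')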

@Avogar
Member

Avogar commented May 24, 2023

what do you think about the enhancement I've shared

It's a very good idea.

why not use the metadata for schema inference?

It just wasn't implemented; the current schema inference was simply derived from the S3 table engine.

@alifirat

alifirat commented May 25, 2023

Hey @Avogar, I was wondering if there is a setting that allows disabling schema inference. Let's imagine that I already know the schema of the data lake tables and would like to skip the inference step:

  • Is it possible?
  • If yes, does it work?

@Avogar
Member

Avogar commented May 25, 2023

Sure, schema inference works only if the user didn't specify the structure manually. You can always specify the structure in the CREATE statement as usual:

CREATE TABLE iceberg (column1 Type1, column2 Type2, ...) Engine=Iceberg(...)

The same works for the table function (as for the s3 table function):

SELECT * FROM iceberg(url, format, 'column1 Type1, column2 Type2')
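For the reproduction above, that would look something like this (a sketch; the single bigint column from the Spark table is mapped to Int64, and I haven't checked which of the failing requests this actually skips on 23.4):

CREATE TABLE test (a Int64)
ENGINE = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')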

@kssenii
Member

kssenii commented Jun 14, 2023

@alifirat, fixed the slowness caused by HEAD requests here: #50976
