[Apache Iceberg] Failing to create table #49958
Please check the log message: ClickHouse/src/Storages/DataLakes/IStorageDataLake.h, lines 72 to 75 in 36c31e1.
Not the folder; it makes a head request to the files, which are added here: ClickHouse/src/Storages/DataLakes/IStorageDataLake.h, lines 67 to 70 in 36c31e1.
@kssenii
In total, the keys iterator is returning 1725 files (we have checked in the AWS UI; we have the exact same number of files) that result from the command [...]. My question is: [...]
UPDATE: ah, I see it in your comment.
I see it is done here, and indeed all keys are iterated to make a head request: ClickHouse/src/Storages/StorageS3.cpp, lines 459 to 464 in 67b8aca.
Probably it is needed for schema inference, but I do not know why we do it for all keys; maybe @Avogar can tell?
But the log line [...] suggests that you had [...]
Hey @kssenii, it's not possible to create [...]
OK, just make sure that the paths in the Iceberg metadata match the paths in S3 (you can also read the Iceberg metadata files with clickhouse-local and check), because the error you get, "Failed to get object info: No response body..", means that ClickHouse tries to read a non-existent file, and a prefix mismatch between the Iceberg metadata and the S3 URL can be the reason. (It is possible that in a previous version such a mismatch was allowed and in a recent version it is not; I did not check.) The relevant code is in ClickHouse/src/Storages/DataLakes/IcebergMetadataParser.cpp, lines 196 to 199 in 67b8aca.
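For example, the snapshot Avro file can be read directly to see exactly which paths the metadata references; a minimal sketch (bucket, region, prefix, and snapshot file name are placeholders to adapt), runnable with clickhouse-local or clickhouse-client:
SELECT manifest_path
FROM s3('https://<bucket>.s3.<region>.amazonaws.com/<prefix>/metadata/snap-<snapshot-id>.avro')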
No, or I didn't understand the question; how would it work without the url?
@kssenii I'll come back with a shorter example. In the meantime, I was just wondering (because a log of the key is missing here): when calling the S3 API, do you think you keep the bucket name in the key? That may explain the 404. I'm able to reproduce the issue with a single data file.
@kssenii I would like to add this as well: if the number of data files is high, the time to create the table will be linear in that number. If it's about schema inference, it's available in the metadata; why not look there? We plan to use Iceberg, and I cannot imagine doing a request for each data file.
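To put a rough number on that concern: at an assumed ~50 ms per sequential HEAD request (an illustrative figure, not measured), 1725 files would cost roughly 1725 × 0.05 s ≈ 86 s in HEAD requests alone before the table is even created.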
"cs-tmp" or "tmp"? here
No, it is stored separately from the key: https://github.com/ClickHouse/ClickHouse/blob/master/src/IO/S3/URI.h#L25-L26
It is possible, but not implemented; you can create an issue.
Hey @kssenii, let me give you something simple and easy to reproduce.
How did I generate the data? Run spark-sql with [...]
As you can see, I'm able to read the data with spark-sql.
What do we have in AWS?
Data:
$ aws s3 ls s3://cs-tmp/akilic/iceberg-catalog/test/data/
2023-05-17 22:52:14        422 00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
2023-05-17 22:52:15        425 00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
Metadata:
$ aws s3 ls s3://cs-tmp/akilic/iceberg-catalog/test/metadata/
2023-05-17 22:52:15       6629 0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
2023-05-17 22:52:15       4263 snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro
2023-05-17 22:51:38        847 v1.metadata.json
2023-05-17 22:52:16       1891 v2.metadata.json
2023-05-17 22:52:16          1 version-hint.text
ClickHouse:
I'm using the docker image clickhouse-server:23.4-alpine. The issue happens when executing this simple request:
$ SET send_logs_level = 'trace';
$ create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')
[clickhouse1] 2023.05.17 21:07:03.317667 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: CREATE TABLE ON default.test
[clickhouse1] 2023.05.17 21:07:03.319788 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.321678 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.337961 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> NamedCollectionsUtils: Loaded 0 collections from SQL
[clickhouse1] 2023.05.17 21:07:04.549157 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> DataLake: New configuration path: akilic/iceberg-catalog/test/, keys: s3://cs-tmp/akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet, s3://cs-tmp/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
[clickhouse1] 2023.05.17 21:07:06.354270 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Error> executeQuery: Code: 499. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR) (version 23.4.2.11 (official build)) (from 127.0.0.1:56116) (in query: create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xbc87ee4 in /usr/bin/clickhouse
1. ? @ 0x86621f8 in /usr/bin/clickhouse
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, DB::S3Settings::RequestSettings const&, bool, bool, bool) @ 0xf5f857c in /usr/bin/clickhouse
3. DB::StorageS3Source::KeysIterator::KeysIterator(DB::S3::Client const&, String const&, std::vector<String, std::allocator<String>> const&, String const&, DB::S3Settings::RequestSettings const&, std::shared_ptr<DB::IAST>, DB::Block const&, std::shared_ptr<DB::Context const>, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e09e78 in /usr/bin/clickhouse
4. DB::StorageS3::createFileIterator(DB::StorageS3::Configuration const&, bool, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST>, DB::Block const&, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e11170 in /usr/bin/clickhouse
5. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x10e10910 in /usr/bin/clickhouse
6. DB::StorageS3::StorageS3(DB::StorageS3::Configuration const&, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, bool, std::shared_ptr<DB::IAST>) @ 0x10e0f794 in /usr/bin/clickhouse
7. ? @ 0x10ed2294 in /usr/bin/clickhouse
8. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x10c7d7fc in /usr/bin/clickhouse
9. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x104b89b0 in /usr/bin/clickhouse
10. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x104b2a9c in /usr/bin/clickhouse
11. DB::InterpreterCreateQuery::execute() @ 0x104bd204 in /usr/bin/clickhouse
12. ? @ 0x1097d198 in /usr/bin/clickhouse
13. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x1097a5c4 in /usr/bin/clickhouse
14. DB::TCPHandler::runImpl() @ 0x11531b1c in /usr/bin/clickhouse
15. DB::TCPHandler::run() @ 0x115443e4 in /usr/bin/clickhouse
16. Poco::Net::TCPServerConnection::start() @ 0x121b0604 in /usr/bin/clickhouse
17. Poco::Net::TCPServerDispatcher::run() @ 0x121b1b20 in /usr/bin/clickhouse
18. Poco::PooledThread::run() @ 0x1235ac7c in /usr/bin/clickhouse
19. Poco::ThreadImpl::runnableEntry(void*) @ 0x12358544 in /usr/bin/clickhouse
20. start_thread @ 0x7624 in /lib/libpthread.so.0
21. ? @ 0xd149c in /lib/libc.so.6
In the trace you see two data files mentioned:
and we can see that they are consistent with the S3 listing above.
Let's check the head-object request for both files:
$ aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
"AcceptRanges": "bytes",
"Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
"LastModified": "2023-05-17T20:52:14+00:00",
"ContentLength": 422,
"ETag": "\"9d27f6c2a869bf8424fc66076918b5d9\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
$ aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
"AcceptRanges": "bytes",
"Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
"LastModified": "2023-05-17T20:52:15+00:00",
"ContentLength": 425,
"ETag": "\"1c919896c4bfc3f46260c2d7baa9e55c\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
If we try to read the data without Iceberg:
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet')
┌─a─┐
│ 2 │
│ 3 │
└───┘
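For completeness, both data files can be read in one query using a glob in the s3() path (a sketch; the s3 table function accepts * wildcards), which should return all three rows if the files themselves are healthy:
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/data/*.parquet')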
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')
SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')
Query id: 4050b563-57f2-4b2f-ab12-f4aac87a9cff
[clickhouse1] 2023.05.17 21:14:14.402574 [ 9 ] {4050b563-57f2-4b2f-ab12-f4aac87a9cff} <Debug> executeQuery: (from 127.0.0.1:56116) select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro') (stage: Complete)
Exception on client:
Code: 92. DB::Exception: Tuple cannot be empty: while receiving packet from localhost:9000. (EMPTY_DATA_PASSED)
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro') Format Vertical
Row 1:
──────
manifest_path: s3://cs-tmp/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
manifest_length: 6629
partition_spec_id: 0
content: 0
sequence_number: 1
min_sequence_number: 1
added_snapshot_id: 7532076000798921356
added_data_files_count: 2
existing_data_files_count: 0
deleted_data_files_count: 0
added_rows_count: 3
existing_rows_count: 0
deleted_rows_count: 0
partitions: []
SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json')
Code: 117. DB::Exception: Expected field "meta" with columns names and types, found field format-version: Cannot extract table structure from JSON format file. You can specify the structure manually. (INCORRECT_DATA) (version 23.4.2.11 (official build)) (from 127.0.0.1:39778) (in query: select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json'))
I've put in as much detail as possible, to help determine whether the issue is in the way we generated the data or a real bug in the 23.4 release.
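Incidentally, the table-level path recorded by Iceberg can still be checked from ClickHouse by reading the metadata JSON as a single string, which sidesteps the inference error above; a sketch (the JSONAsString format with an explicit one-column structure avoids any inference, and location is the standard Iceberg table-metadata field):
select JSONExtractString(json, 'location')
from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json', 'JSONAsString', 'json String')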
@Avogar, do you know if the head requests are used for schema inference? If yes, I would like to suggest an improvement by creating an issue. I just want to be sure first 🙂
When I implemented schema inference, there was a head request only for the keys we actually read data from. There have been a lot of changes/improvements/refactorings from other people since then; I need to check.
Actually, @kssenii, it's your code :) It was needed for calculating total_size for the progress bar. We use the same iterator for reading and for schema inference. For reading it's OK, we will do all these head requests anyway, but for schema inference we should not do it: we read only the first few files, and we don't need to calculate total_size because we don't send progress during schema inference. We can add a flag to KeysIterator so that, during schema inference, the head request is made only when a new key is actually requested. I will create a PR for it.
UPD: #50203
It's a very good idea.
It just wasn't implemented; the current schema inference was simply derived from the S3 table engine.
Hey @Avogar, I was wondering if there is a setting that allows disabling schema inference. Let's imagine that I already know the schema of the Data Lake tables and I would like to skip the inference step.
Sure, schema inference works only if the user didn't specify the structure manually. You can always specify the structure in the create statement as usual:
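For instance, a sketch using the single column a from the sample data in this thread (Int32 is an assumed type; any explicit column list disables inference):
CREATE TABLE test (a Int32) Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')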
As well as for the table function (same as for [...]).
Creating a table fails when using Apache Iceberg:
CREATE TABLE iceberg Engine=Iceberg(...)
We have looked at the codebase, and I have the feeling that you're calling head-object on a folder to get the last modification time, but a folder on S3 never has a last modification time. We have run the following command against the same bucket and keys set in the Iceberg engine, and we see the same exact error (aws s3api head-object --bucket mybucket --key mykey returns a 404).
Version used:
23.4.1.1943