[Apache Iceberg] Failing to create table #49958
Please check the log message: ClickHouse/src/Storages/DataLakes/IStorageDataLake.h, lines 72 to 75 in 36c31e1.
Not the folder; it makes a head request to the files, which are added here: ClickHouse/src/Storages/DataLakes/IStorageDataLake.h, lines 67 to 70 in 36c31e1.
@kssenii
In total, the keys iterator is returning 1725 files (we have checked in the AWS UI; we have the exact same number of files) that result from the command [...]. My question is: [...]
UPDATE: ah, I see it in your comment.
I see it is done here, and indeed all keys are iterated to make a head request: ClickHouse/src/Storages/StorageS3.cpp, lines 459 to 464 in 67b8aca.
Probably it is needed for schema inference, but I do not know why we do it for all keys; maybe @Avogar can tell?
But the log line [...] suggests that you had [...]
Hey @kssenii, it's not possible to create [...]
OK, just make sure that the paths in the Iceberg metadata match the paths in S3 (you can also read the Iceberg metadata files with clickhouse-local and check), because the error you get, "Failed to get object info: No response body..", means that ClickHouse tries to read a non-existent file, and a prefix mismatch between the Iceberg metadata and the S3 URL can be the reason. (It is possible that in a previous version such a mismatch was allowed and in a recent version it is not; I did not check.) The relevant code is in ClickHouse/src/Storages/DataLakes/IcebergMetadataParser.cpp, lines 196 to 199 in 67b8aca.
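For example, the snapshot Avro file can be read directly to see exactly which paths the metadata references; a minimal sketch (bucket, region, prefix, and snapshot file name are placeholders to adapt), runnable with clickhouse-local or clickhouse-client:
SELECT manifest_path
FROM s3('https://<bucket>.s3.<region>.amazonaws.com/<prefix>/metadata/snap-<snapshot-id>.avro')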
No, or I didn't understand the question; how would it work without the url?
@kssenii I'll come back with a shorter example. In the meantime, I was just wondering (because a log of the key is missing here): when calling the S3 API, do you think you keep the bucket name in the key? That may explain the 404. I'm able to reproduce the issue with a single data file.
@kssenii I would like to add this as well: if the number of data files is high, the time to create the table will be linear in that number. If it's about schema inference, it's available in the metadata; why not look there? We plan to use Iceberg, and I cannot imagine doing a request for each data file.
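To put a rough number on that concern: at an assumed ~50 ms per sequential HEAD request (an illustrative figure, not measured), 1725 files would cost roughly 1725 × 0.05 s ≈ 86 s in HEAD requests alone before the table is even created.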
"cs-tmp" or "tmp"? here
No, it is stored separately from the key: https://github.com/ClickHouse/ClickHouse/blob/master/src/IO/S3/URI.h#L25-L26
It is possible, but not implemented; you can create an issue.
Hey @kssenii, let me give you something simple and easy to reproduce.
How did I generate the data? Run spark-sql with [...]
As you can see, I'm able to read the data with spark-sql.
What do we have in AWS?
Data:
$ aws s3 ls s3://cs-tmp/akilic/iceberg-catalog/test/data/
2023-05-17 22:52:14        422 00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
2023-05-17 22:52:15        425 00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
Metadata:
$ aws s3 ls s3://cs-tmp/akilic/iceberg-catalog/test/metadata/
2023-05-17 22:52:15       6629 0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
2023-05-17 22:52:15       4263 snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro
2023-05-17 22:51:38        847 v1.metadata.json
2023-05-17 22:52:16       1891 v2.metadata.json
2023-05-17 22:52:16          1 version-hint.text
ClickHouse:
I'm using the docker image clickhouse-server:23.4-alpine. The issue happens when executing this simple request:
$ SET send_logs_level = 'trace';
$ create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')
[clickhouse1] 2023.05.17 21:07:03.317667 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: CREATE TABLE ON default.test
[clickhouse1] 2023.05.17 21:07:03.319788 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.321678 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> ContextAccess (default): Access granted: S3 ON *.*
[clickhouse1] 2023.05.17 21:07:03.337961 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> NamedCollectionsUtils: Loaded 0 collections from SQL
[clickhouse1] 2023.05.17 21:07:04.549157 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Trace> DataLake: New configuration path: akilic/iceberg-catalog/test/, keys: s3://cs-tmp/akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet, s3://cs-tmp/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
[clickhouse1] 2023.05.17 21:07:06.354270 [ 9 ] {9cedf26e-46a4-4b0f-a616-995a1919f8d7} <Error> executeQuery: Code: 499. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404. (S3_ERROR) (version 23.4.2.11 (official build)) (from 127.0.0.1:56116) (in query: create table test Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xbc87ee4 in /usr/bin/clickhouse
1. ? @ 0x86621f8 in /usr/bin/clickhouse
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, DB::S3Settings::RequestSettings const&, bool, bool, bool) @ 0xf5f857c in /usr/bin/clickhouse
3. DB::StorageS3Source::KeysIterator::KeysIterator(DB::S3::Client const&, String const&, std::vector<String, std::allocator<String>> const&, String const&, DB::S3Settings::RequestSettings const&, std::shared_ptr<DB::IAST>, DB::Block const&, std::shared_ptr<DB::Context const>, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e09e78 in /usr/bin/clickhouse
4. DB::StorageS3::createFileIterator(DB::StorageS3::Configuration const&, bool, std::shared_ptr<DB::Context const>, std::shared_ptr<DB::IAST>, DB::Block const&, std::vector<DB::StorageS3Source::KeyWithInfo, std::allocator<DB::StorageS3Source::KeyWithInfo>>*) @ 0x10e11170 in /usr/bin/clickhouse
5. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x10e10910 in /usr/bin/clickhouse
6. DB::StorageS3::StorageS3(DB::StorageS3::Configuration const&, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, bool, std::shared_ptr<DB::IAST>) @ 0x10e0f794 in /usr/bin/clickhouse
7. ? @ 0x10ed2294 in /usr/bin/clickhouse
8. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x10c7d7fc in /usr/bin/clickhouse
9. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x104b89b0 in /usr/bin/clickhouse
10. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x104b2a9c in /usr/bin/clickhouse
11. DB::InterpreterCreateQuery::execute() @ 0x104bd204 in /usr/bin/clickhouse
12. ? @ 0x1097d198 in /usr/bin/clickhouse
13. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x1097a5c4 in /usr/bin/clickhouse
14. DB::TCPHandler::runImpl() @ 0x11531b1c in /usr/bin/clickhouse
15. DB::TCPHandler::run() @ 0x115443e4 in /usr/bin/clickhouse
16. Poco::Net::TCPServerConnection::start() @ 0x121b0604 in /usr/bin/clickhouse
17. Poco::Net::TCPServerDispatcher::run() @ 0x121b1b20 in /usr/bin/clickhouse
18. Poco::PooledThread::run() @ 0x1235ac7c in /usr/bin/clickhouse
19. Poco::ThreadImpl::runnableEntry(void*) @ 0x12358544 in /usr/bin/clickhouse
20. start_thread @ 0x7624 in /lib/libpthread.so.0
21. ? @ 0xd149c in /lib/libc.so.6
In the trace you see two data files mentioned:
and we can see that they are consistent with the S3 listing above.
Let's check the head-object request for both files:
$ aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00000-0-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
"AcceptRanges": "bytes",
"Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
"LastModified": "2023-05-17T20:52:14+00:00",
"ContentLength": 422,
"ETag": "\"9d27f6c2a869bf8424fc66076918b5d9\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
$ aws s3api head-object --bucket cs-tmp --key akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet
{
"AcceptRanges": "bytes",
"Expiration": "expiry-date=\"Sat, 17 Jun 2023 00:00:00 GMT\", rule-id=\"1-month retention\"",
"LastModified": "2023-05-17T20:52:15+00:00",
"ContentLength": 425,
"ETag": "\"1c919896c4bfc3f46260c2d7baa9e55c\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
If we try to read the data without Iceberg:
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/data/00001-1-407951c1-f4a7-4aad-b40e-8c2ca3f3fb27-00001.parquet')
┌─a─┐
│ 2 │
│ 3 │
└───┘
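For completeness, both data files can be read in one query using a glob in the s3() path (a sketch; the s3 table function accepts * wildcards), which should return all three rows if the files themselves are healthy:
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/data/*.parquet')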
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')
SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro')
Query id: 4050b563-57f2-4b2f-ab12-f4aac87a9cff
[clickhouse1] 2023.05.17 21:14:14.402574 [ 9 ] {4050b563-57f2-4b2f-ab12-f4aac87a9cff} <Debug> executeQuery: (from 127.0.0.1:56116) select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro') (stage: Complete)
Exception on client:
Code: 92. DB::Exception: Tuple cannot be empty: while receiving packet from localhost:9000. (EMPTY_DATA_PASSED)
select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/snap-7532076000798921356-1-0e150576-5308-4020-a66f-cdf8cc4cc2c2.avro') Format Vertical
Row 1:
──────
manifest_path: s3://cs-tmp/akilic/iceberg-catalog/test/metadata/0e150576-5308-4020-a66f-cdf8cc4cc2c2-m0.avro
manifest_length: 6629
partition_spec_id: 0
content: 0
sequence_number: 1
min_sequence_number: 1
added_snapshot_id: 7532076000798921356
added_data_files_count: 2
existing_data_files_count: 0
deleted_data_files_count: 0
added_rows_count: 3
existing_rows_count: 0
deleted_rows_count: 0
partitions: []
SELECT *
FROM s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json')
Code: 117. DB::Exception: Expected field "meta" with columns names and types, found field format-version: Cannot extract table structure from JSON format file. You can specify the structure manually. (INCORRECT_DATA) (version 23.4.2.11 (official build)) (from 127.0.0.1:39778) (in query: select * from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json'))
I've put in as much detail as possible, to help determine whether the issue is in the way we generated the data or a real bug in the 23.4 release.
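Incidentally, the table-level path recorded by Iceberg can still be checked from ClickHouse by reading the metadata JSON as a single string, which sidesteps the inference error above; a sketch (the JSONAsString format with an explicit one-column structure avoids any inference, and location is the standard Iceberg table-metadata field):
select JSONExtractString(json, 'location')
from s3('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/metadata/v2.metadata.json', 'JSONAsString', 'json String')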
@Avogar, do you know if the head requests are used for schema inference? If yes, I would like to suggest an improvement by creating an issue. I just want to be sure first 🙂
When I implemented schema inference, there was a head request only for the keys we actually read data from. There have been a lot of changes/improvements/refactorings from other people since then; I need to check.
Actually, @kssenii, it's your code :) It was needed for calculating total_size for the progress bar. We use the same iterator for reading and for schema inference. For reading it's OK, we will do all these head requests anyway, but for schema inference we should not do it: we read only the first few files, and we don't need to calculate total_size because we don't send progress during schema inference. We can add a flag to KeysIterator so that, during schema inference, the head request is made only when a new key is actually requested. I will create a PR for it.
UPD: #50203
It's a very good idea.
It just wasn't implemented; the current schema inference was simply derived from the S3 table engine.
Hey @Avogar, I was wondering if there is a setting that allows disabling schema inference. Let's imagine that I already know the schema of the Data Lake tables and I would like to skip the inference step.
Sure, schema inference works only if the user didn't specify the structure manually. You can always specify the structure in the create statement as usual:
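For instance, a sketch using the single column a from the sample data in this thread (Int32 is an assumed type; any explicit column list disables inference):
CREATE TABLE test (a Int32) Engine = Iceberg('https://cs-tmp.s3.eu-west-1.amazonaws.com/akilic/iceberg-catalog/test/')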
As well as for the table function (same as for [...]).
Creating a table fails when using Apache Iceberg:
CREATE TABLE iceberg Engine=Iceberg(...)
We have looked at the codebase, and I have the feeling that you're calling head-object on a folder to get the last modification time, but a folder on S3 never has a last modification time. We have run the following command against the same bucket and keys set in the Iceberg engine, and we see the same exact error (aws s3api head-object --bucket mybucket --key mykey returns a 404).
Version used:
23.4.1.1943