Use case
As a user, I would like table creation to be instant when creating a data lake table such as Iceberg or Delta Lake.
Describe the solution you'd like
Today, I have the feeling that the schema of a data lake table is inferred by reading the data files directly. The main issue I see with this approach is that if we have millions of data files, it is going to take a while just to infer the schema.
To me, the schema can be inferred by reading the metadata:
For Iceberg, if you read the v{N}.metadata.json, there is a field called schemas that contains all the fields.
For Delta Lake, if you read the path metadata.schemaString, you also have the schema.
To me, the improvements are going to be on two sides:
performance: reading a single JSON file on S3 versus reading N files on S3
cost: a single AWS call for the schema versus multiple calls
A potential drawback is if the schema changes over time:
Well, in this case, as with any ClickHouse table today, the user may need to run an ALTER TABLE command or recreate the table, since it is in read-only mode?
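To illustrate the idea, here is a minimal Python sketch of reading the schema from metadata only. It is not how ClickHouse does it; the function names `iceberg_schema` and `delta_schema` are hypothetical, and the metadata content is passed in as text so no particular S3 client is assumed. It assumes the Iceberg metadata file carries `schemas` and `current-schema-id`, and that the Delta Lake transaction log is a set of JSON lines in which the metadata action appears under the key `metaData` with a JSON-encoded `schemaString`.

```python
import json


def iceberg_schema(metadata_json):
    """Return (name, type) pairs for the current schema in a v{N}.metadata.json document.

    Primitive types are plain strings; nested types (struct, list, map)
    are returned as the raw JSON objects found in the metadata.
    """
    meta = json.loads(metadata_json)
    # Assumption: fall back to schema id 0 if "current-schema-id" is absent.
    current_id = meta.get("current-schema-id", 0)
    for schema in meta["schemas"]:
        if schema.get("schema-id", 0) == current_id:
            return [(field["name"], field["type"]) for field in schema["fields"]]
    raise ValueError("current schema id not found in 'schemas'")


def delta_schema(log_lines):
    """Return (name, type) pairs from the last metadata action in Delta log lines.

    `log_lines` is an iterable of JSON strings, one action per line,
    given in commit order.
    """
    result = None
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "metaData" in action:
            # schemaString is itself a JSON document stored as a string.
            struct = json.loads(action["metaData"]["schemaString"])
            result = [(field["name"], field["type"]) for field in struct["fields"]]
    if result is None:
        raise ValueError("no metadata action found in the log")
    return result
```

In both cases only small JSON documents are parsed, and no data file is opened.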
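As a rough sketch of how such a change could be detected, the schema stored at table creation could be compared with the one currently in the metadata. The helper `schema_diff` below is hypothetical and only illustrates the comparison; it does not reflect any existing ClickHouse behavior.

```python
def schema_diff(stored, current):
    """Compare two schemas given as lists of (name, type) pairs.

    Returns the column names that were added, removed, or whose type
    changed between the stored schema and the current one.
    """
    stored_map = dict(stored)
    current_map = dict(current)
    return {
        "added": sorted(n for n in current_map if n not in stored_map),
        "removed": sorted(n for n in stored_map if n not in current_map),
        "changed": sorted(
            n for n in stored_map
            if n in current_map and stored_map[n] != current_map[n]
        ),
    }
```

A non-empty result would be the signal that the user needs to alter or recreate the table.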
Describe alternatives you've considered
I have not looked into alternatives.
Additional context
See #49958 #49959