[Improvement] Infer Data Lakes schemas by reading the metadata #50012

Open
alifirat opened this issue May 19, 2023 · 1 comment

alifirat commented May 19, 2023

Use case

As a user, I would like table creation to be near-instant when creating a data lake table such as Iceberg or Delta Lake.

Describe the solution you'd like

As far as I can tell, the schema of a data lake table is currently inferred by reading the data files directly. The main issue I see with this approach is that with millions of data files, inferring the schema alone is going to take a while.

In my view, the schema can be inferred by reading the metadata instead:

  • For Iceberg, the v{N}.metadata.json file has a field called schemas that contains all the fields.
  • For Delta Lake, the metadata.schemaString path also contains the schema.
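To illustrate the idea, here is a minimal Python sketch of pulling column names and types out of both kinds of metadata. It is not how ClickHouse implements it. The sample documents are hypothetical and trimmed to the relevant keys, the function names are made up for this example, and it assumes the Delta transaction log spells the action key `metaData` with `schemaString` holding a JSON-encoded struct. Only primitive column types are covered.

```python
import json


def iceberg_schema_fields(metadata_json: str) -> list:
    """Return (name, type) pairs for the current schema of an Iceberg table.

    Assumes the metadata file carries a `schemas` list and a
    `current-schema-id` key, and that column types are primitive strings.
    """
    meta = json.loads(metadata_json)
    current_id = meta.get("current-schema-id", 0)
    for schema in meta.get("schemas", []):
        if schema.get("schema-id") == current_id:
            return [(f["name"], f["type"]) for f in schema["fields"]]
    raise ValueError("current schema not found in metadata")


def delta_schema_fields(commit_log: str) -> list:
    """Return (name, type) pairs from the last metaData action in a commit file.

    A Delta commit file is newline-delimited JSON, one action per line.
    """
    fields = None
    for line in commit_log.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "metaData" in action:
            # schemaString is itself a JSON document stored as a string.
            schema = json.loads(action["metaData"]["schemaString"])
            fields = [(f["name"], f["type"]) for f in schema["fields"]]
    if fields is None:
        raise ValueError("no metaData action found in commit file")
    return fields


# Hypothetical, heavily trimmed sample documents.
ICEBERG_SAMPLE = json.dumps({
    "format-version": 2,
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "type": "struct", "fields": [
            {"id": 1, "name": "id", "required": True, "type": "long"},
        ]},
        {"schema-id": 1, "type": "struct", "fields": [
            {"id": 1, "name": "id", "required": True, "type": "long"},
            {"id": 2, "name": "name", "required": False, "type": "string"},
        ]},
    ],
})

DELTA_SAMPLE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {
        "id": "sample-table",
        "schemaString": json.dumps({"type": "struct", "fields": [
            {"name": "id", "type": "long", "nullable": False, "metadata": {}},
            {"name": "name", "type": "string", "nullable": True, "metadata": {}},
        ]}),
    }}),
])

iceberg_fields = iceberg_schema_fields(ICEBERG_SAMPLE)
delta_fields = delta_schema_fields(DELTA_SAMPLE)
```

In both cases a single small document is parsed and no data file is opened, which is the point of the proposal.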

I expect improvements on two fronts:

  • performance: reading a single small JSON file on S3 versus reading N files on S3
  • cost: one AWS call for the schema versus multiple calls

A potential drawback is that the schema may change over time:

  • In that case, as with any ClickHouse table today, the user may need to run an ALTER TABLE command or recreate the table, since it is in read-only mode?
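As a rough sketch of how such a change could be detected, the columns declared on the table can be compared with the columns read from the latest metadata. The function name and the sample column mappings below are hypothetical and only serve the illustration.

```python
def schema_diff(declared: dict, current: dict) -> dict:
    """Compare two {column: type} mappings and report what changed."""
    added = {c: t for c, t in current.items() if c not in declared}
    removed = {c: t for c, t in declared.items() if c not in current}
    retyped = {
        c: (declared[c], current[c])
        for c in declared
        if c in current and declared[c] != current[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical example: the table was created before `email` was added
# and before `age` was widened from int to long.
declared_columns = {"id": "long", "name": "string", "age": "int"}
metadata_columns = {"id": "long", "name": "string", "age": "long", "email": "string"}

diff = schema_diff(declared_columns, metadata_columns)
needs_update = any(diff.values())
```

A non-empty diff would tell the user that the table definition is stale and needs an ALTER TABLE or a recreation.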

Describe alternatives you've considered

I have not looked into alternatives.

Additional context

See #49958 #49959

Member

kssenii commented May 23, 2024

For Iceberg, we already infer the schema from metadata (from #55695); for DeltaLake, it will be added once #63201 is merged.
