
[PROPOSAL] Capture dataset table and storage format #256

Closed
mobuchowski opened this issue Sep 8, 2021 · 4 comments

@mobuchowski
Member

Purpose:
The current naming scheme captures a dataset's namespace and name. The namespace is defined as a combination of a "scheme" - which tells us which type of data source we are dealing with - and an "authority" - which can be thought of as an instance of that data source, such as an object storage bucket, a SaaS account name, or a database instance. In some cases there is only one, global instance - as with BigQuery. The dataset name is then a "path" uniquely identifying the dataset within the particular namespace.
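
For concreteness, a small sketch of that scheme, with plain Python dicts standing in for OpenLineage datasets; the host, project, and table names are hypothetical:

```python
# Postgres: the scheme identifies the source type, the authority the instance.
postgres_dataset = {
    "namespace": "postgres://db.example.com:5432",  # scheme://authority
    "name": "public.orders",                        # path within the namespace
}

# BigQuery: one global instance, so the namespace is the scheme alone.
bigquery_dataset = {
    "namespace": "bigquery",
    "name": "my-project.my_dataset.orders",
}
```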

However, in some cases there are additional pieces of information intrinsically tied to a particular dataset instance: its table format (Hive, Iceberg, Hudi, Delta Lake, ...) and its storage format (Parquet, ORC, Avro, CSV). We should capture them.

Proposed implementation
The most basic approach would be to embed the information into the name somehow.
One inspiration would be embedding it into the scheme part of the namespace, like SQLAlchemy connection strings:
postgresql+psycopg2://host:port. This approach has two main downsides. The smaller one is that it can simply get unwieldy, for example: gcs+iceberg+parquet://{path}. The worse consequence is that by interfering with the namespace part, it becomes possible to register duplicate datasets in one namespace, differing only by format - see the sketch below.
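
A small sketch of that hazard, assuming a hypothetical bucket and path:

```python
# Both entries describe the same physical data, but because the storage
# format is baked into the namespace, their identities differ.
ds_parquet = {"namespace": "gcs+iceberg+parquet://my-bucket", "name": "/warehouse/orders"}
ds_orc     = {"namespace": "gcs+iceberg+orc://my-bucket",     "name": "/warehouse/orders"}

# A lineage consumer keying datasets on (namespace, name) now sees two
# unrelated datasets where there is really only one.
assert ds_parquet["name"] == ds_orc["name"]
assert ds_parquet["namespace"] != ds_orc["namespace"]
```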

Another option would be to add a dedicated optional field to the DatasourceDatasetFacet.

A third option would be to add another facet (DatasourceFormatDatasetFacet?) that would capture this and similar information.
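
To make that option concrete, a hypothetical payload for such a facet; the issue only floats the facet's name, so the field names below are invented for illustration:

```python
# tableFormat and storageFormat are invented field names; _producer and
# _schemaURL follow the standard facet envelope, with placeholder URLs.
datasource_format_facet = {
    "_producer": "https://example.com/my-producer",         # placeholder
    "_schemaURL": "https://example.com/facet-schema.json",  # placeholder
    "tableFormat": "iceberg",    # Hive, Iceberg, Hudi, Delta Lake, ...
    "storageFormat": "parquet",  # Parquet, ORC, Avro, CSV
}
```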

@julienledem
Member

If this is specific to Spark, then I would add a separate facet. Is that the case?

@mobuchowski
Member Author

I think there are quite a few processing engines interested in writing data to files in different formats - at the very least Flink and Beam, in addition to Spark.

@collado-mike
Contributor

Serialization format and even table location information is really a facet that's auxiliary to the dataset - it's not really part of the dataset's identity. The serialization format can (and often does) change without changing the nature of the data. It is absolutely useful to capture (e.g., someone changes their file format from Parquet to ORC and breaks downstream consumers), but I'd keep it as separate facets rather than trying to build it into the dataset/datasource naming.

@mobuchowski
Member Author

Already done in StorageDatasetFacet.
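
For reference, a minimal sketch of such a facet payload; to my reading of the spec its fields are storageLayer and fileFormat, but treat the exact names as an assumption and check the published facet schema:

```python
storage_facet = {
    "storageLayer": "iceberg",  # table format: Hive, Iceberg, Hudi, Delta Lake, ...
    "fileFormat": "parquet",    # storage format: Parquet, ORC, Avro, CSV
}
```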
