
[PROPOSAL] Capture dataset table and storage format #256

Closed
mobuchowski opened this issue Sep 8, 2021 · 4 comments

@mobuchowski
Member

Purpose:
The current naming scheme captures a dataset's namespace and name. The namespace is defined as a combination of a "scheme" - which tells us which type of data source we are dealing with - and an "authority" - which can be thought of as an instance of that data source, such as an object storage bucket, a SaaS account name, or a database instance. In some cases there is only one, global instance - as with BigQuery. The dataset name is then a "path" uniquely identifying the dataset within the particular namespace.
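
For concreteness, a small sketch of that scheme, with plain Python dicts standing in for OpenLineage datasets; the host, project, and table names are hypothetical:

```python
# Postgres: the scheme identifies the source type, the authority the instance.
postgres_dataset = {
    "namespace": "postgres://db.example.com:5432",  # scheme://authority
    "name": "public.orders",                        # path within the namespace
}

# BigQuery: one global instance, so the namespace is the scheme alone.
bigquery_dataset = {
    "namespace": "bigquery",
    "name": "my-project.my_dataset.orders",
}
```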

However, in some cases there are additional pieces of information intrinsically tied to a particular dataset instance: its table format (Hive, Iceberg, Hudi, Delta Lake, ...) and its storage format (Parquet, ORC, Avro, CSV). We should capture them.

Proposed implementation
The most basic approach would be to embed the information into the name somehow.
One inspiration would be embedding it into the scheme part of the namespace, like SQLAlchemy connection strings:
postgresql+psycopg2://host:port. This approach has two main downsides. The smaller one is that it can simply get unwieldy, for example: gcs+iceberg+parquet://{path}. The worse consequence is that by interfering with the namespace part, it becomes possible to register duplicate datasets in one namespace, differing only by format - see the sketch below.
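
A small sketch of that hazard, assuming a hypothetical bucket and path:

```python
# Both entries describe the same physical data, but because the storage
# format is baked into the namespace, their identities differ.
ds_parquet = {"namespace": "gcs+iceberg+parquet://my-bucket", "name": "/warehouse/orders"}
ds_orc     = {"namespace": "gcs+iceberg+orc://my-bucket",     "name": "/warehouse/orders"}

# A lineage consumer keying datasets on (namespace, name) now sees two
# unrelated datasets where there is really only one.
assert ds_parquet["name"] == ds_orc["name"]
assert ds_parquet["namespace"] != ds_orc["namespace"]
```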

Another option would be to add a dedicated optional field to the DatasourceDatasetFacet.

A third option would be to add another facet (DatasourceFormatDatasetFacet?) that would capture this and similar information.
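
To make that option concrete, a hypothetical payload for such a facet; the issue only floats the facet's name, so the field names below are invented for illustration:

```python
# tableFormat and storageFormat are invented field names; _producer and
# _schemaURL follow the standard facet envelope, with placeholder URLs.
datasource_format_facet = {
    "_producer": "https://example.com/my-producer",         # placeholder
    "_schemaURL": "https://example.com/facet-schema.json",  # placeholder
    "tableFormat": "iceberg",    # Hive, Iceberg, Hudi, Delta Lake, ...
    "storageFormat": "parquet",  # Parquet, ORC, Avro, CSV
}
```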

@julienledem
Member

If this is specific to Spark, then I would add a separate facet. Is that the case?

@mobuchowski
Member Author

I think there are quite a few processing engines interested in writing data to files in different formats - at the very least Flink and Beam, in addition to Spark.

@collado-mike
Contributor

Serialization format and even table location information is really a facet that's auxiliary to the dataset - it's not really part of the dataset's identity. The serialization format can (and often does) change without changing the nature of the data. It is absolutely useful to capture (e.g., someone changes their file format from Parquet to ORC and breaks downstream consumers), but I'd keep it as separate facets rather than trying to build it into the dataset/datasource naming.

@mobuchowski
Member Author

Already done in StorageDatasetFacet.
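
For reference, a minimal sketch of such a facet payload; to my reading of the spec its fields are storageLayer and fileFormat, but treat the exact names as an assumption and check the published facet schema:

```python
storage_facet = {
    "storageLayer": "iceberg",  # table format: Hive, Iceberg, Hudi, Delta Lake, ...
    "fileFormat": "parquet",    # storage format: Parquet, ORC, Avro, CSV
}
```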
