Purpose:
The current naming scheme captures a dataset's namespace and name. The namespace is defined as a combination of "scheme" - which tells us which type of data source we are dealing with - and "authority" - which can be thought of as an instance of the data source, like an object storage bucket, a SaaS account name, or a database instance. In some cases there is only one, global instance - as with BigQuery. The dataset name is then a "path" uniquely identifying the dataset within the particular namespace.
However, in some cases there are additional pieces of information intrinsically connected to a particular dataset instance - its table format (Hive, Iceberg, Hudi, Delta Lake...) and storage format (Parquet, ORC, Avro, CSV). We should capture them.
Proposed implementation
The most basic scheme would be to embed the info into the name somehow.
One inspiration would be embedding it into the scheme part of the namespace, like SQLAlchemy connection strings: postgresql+psycopg2://host:port. This approach has two main downsides. The smaller one is that it can simply get unwieldy, for example: gcs+iceberg+parquet://{path}. The worse consequence is that by interfering with the namespace part, it becomes possible to register duplicate datasets in one namespace, differing only by format.
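To illustrate the duplicate-identity problem, here is a minimal sketch (the bucket and path are made up for illustration) of how encoding the format into the scheme splits one physical dataset into two identities:

```python
# Sketch of the duplicate-identity downside when the storage format
# is encoded into the namespace scheme. Names here are illustrative only.

def dataset_id(scheme: str, authority: str, path: str) -> str:
    """Build a dataset identifier from namespace (scheme + authority) and name."""
    return f"{scheme}://{authority}{path}"

# Same bucket, same path - but two different identifiers,
# so lineage for one physical dataset is split in two.
as_parquet = dataset_id("gcs+iceberg+parquet", "my-bucket", "/warehouse/orders")
as_orc = dataset_id("gcs+iceberg+orc", "my-bucket", "/warehouse/orders")

assert as_parquet != as_orc  # duplicates in what should be one namespace
```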
Another option would be to add a dedicated optional field to DatasourceDatasetFacet.
A third option would be to add another facet (DatasourceFormatDatasetFacet?) that would capture this and similar information.
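As a sketch of the third option - the facet name and its fields are hypothetical here, not part of any spec - such a facet would ride alongside the existing dataset identity instead of altering it:

```python
# Hypothetical DatasourceFormatDatasetFacet payload, modeled as a plain dict.
# Field names are assumptions for illustration, not a spec.
dataset = {
    "namespace": "gcs://my-bucket",      # identity stays format-free
    "name": "/warehouse/orders",
    "facets": {
        "datasourceFormat": {
            "tableFormat": "iceberg",    # Hive, Iceberg, Hudi, Delta Lake...
            "storageFormat": "parquet",  # Parquet, ORC, Avro, CSV
        }
    },
}

assert "parquet" not in dataset["namespace"]  # format does not leak into identity
```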
I think there are quite a few processing engines interested in writing data to files in different formats - at the very least Flink and Beam, additionally.
Serialization format and even table location information is really a facet that's auxiliary to the dataset - it's not really part of the dataset's identity. The serialization format can (and often does) change without changing the nature of the data. It is absolutely useful to capture (e.g., someone changes their file format from parquet to orc and breaks downstream consumers), but I'd keep it as separate facets rather than trying to build it into the dataset/datasource naming.
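A short sketch of that point, using the same hypothetical facet fields as above: a format migration swaps a facet value while the dataset identity is untouched:

```python
import copy

before = {
    "namespace": "gcs://my-bucket",
    "name": "/warehouse/orders",
    "facets": {"datasourceFormat": {"storageFormat": "parquet"}},
}

# Someone migrates the files from Parquet to ORC: only the facet changes,
# so downstream consumers can spot the break without the dataset "moving".
after = copy.deepcopy(before)
after["facets"]["datasourceFormat"]["storageFormat"] = "orc"

assert (before["namespace"], before["name"]) == (after["namespace"], after["name"])
assert before["facets"] != after["facets"]
```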