[SPARK] support builtin lineage within DatasourceV2Relation #2394

Merged
merged 1 commit into from Feb 6, 2024

Conversation

pawel-big-lebowski
Contributor

@pawel-big-lebowski pawel-big-lebowski commented Jan 25, 2024

Problem

Support builtin lineage within DatasourceV2Relation

Related to: #2349

Solution

Extensions can include OpenLineage-related properties in the table member of DatasourceV2Relation nodes in a job's logical plan. This PR adds logic to extract those properties and convert them into dataset facets.
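As a rough illustration of the extraction step described above, here is a minimal Python sketch. It is not the integration's actual code (which lives in the Spark integration); it only assumes the `openlineage.dataset.facets.<name>` key convention that appears in the documentation snippet under review in this PR. The function name `extract_facets`, the sample `provider` property, and the choice to silently skip malformed values are all hypothetical.

```python
import json

# Key prefix taken from the documentation snippet in this PR.
FACET_PREFIX = "openlineage.dataset.facets."


def extract_facets(table_properties):
    """Collect dataset facets from a flat string-to-string properties map.

    Hypothetical helper: every key starting with FACET_PREFIX is treated as
    a facet whose value is a JSON object; all other keys are ignored.
    """
    facets = {}
    for key, raw_value in table_properties.items():
        if not key.startswith(FACET_PREFIX):
            continue
        facet_name = key[len(FACET_PREFIX):]
        if not facet_name:
            continue  # prefix with no facet name after it
        try:
            facet = json.loads(raw_value)
        except json.JSONDecodeError:
            continue  # assumption: malformed JSON is skipped, not raised
        if isinstance(facet, dict):
            facets[facet_name] = facet
    return facets


# Example properties map; "provider" is an unrelated, made-up property.
properties = {
    "openlineage.dataset.facets.customFacet": '{"property": "value"}',
    "provider": "example",
}
facets = extract_facets(properties)
print(facets)
```

The part of the key after the prefix becomes the facet name, so unrelated table properties pass through untouched.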

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/builtin-lineage-from-v2-relation branch from d7a1b65 to de2edd2 Compare January 26, 2024 08:47
@boring-cyborg boring-cyborg bot added the documentation Improvements or additions to documentation label Jan 26, 2024
@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/builtin-lineage-from-v2-relation branch from de2edd2 to ed978a6 Compare January 26, 2024 09:07
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review January 26, 2024 10:42
@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/builtin-lineage-from-v2-relation branch from ed978a6 to e0ddce8 Compare January 26, 2024 10:46
Member

@mobuchowski mobuchowski left a comment

One comment regarding documentation; looks good to me besides that.


Properties can also be used to pass any dataset facet. For example:
```
openlineage.dataset.facets.customFacet={"property": "value", "_producer": "https://..."}
```
Member

Please explain that customFacet is the key under which the facet will be attached inside the facets dictionary in the resulting event JSON:

inputs: [{
    "name": "dataset.schema.name",
    "namespace": "bigquery",
    "facets": {
        "customFacet": {
            "property": "value"
            ...
        }
    }
}]

Also, shouldn't the OpenLineage side add _producer? I doubt those custom facets would add _schemaURL either, but that's another topic.
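The two points in this comment can be sketched in Python as follows. This is only an illustration of the reviewer's suggestion, not confirmed behavior of the integration: the facet is attached under its name inside the dataset's `facets` dictionary, and `_producer` is filled in only when the extension did not supply one. The helper `attach_facet` and the placeholder producer URL are hypothetical.

```python
import json

# Placeholder value, NOT a real OpenLineage producer URL.
DEFAULT_PRODUCER = "https://example.com/hypothetical-producer"


def attach_facet(dataset, facet_name, facet, default_producer=DEFAULT_PRODUCER):
    """Return a copy of `dataset` with `facet` stored under `facet_name`.

    Hypothetical helper: `_producer` is added only if the facet lacks it,
    and the input dataset dictionary is left unmodified.
    """
    enriched = dict(facet)
    enriched.setdefault("_producer", default_producer)
    result = dict(dataset)
    result["facets"] = {**dataset.get("facets", {}), facet_name: enriched}
    return result


# Dataset name and namespace reuse the values from the comment above.
dataset = {"name": "dataset.schema.name", "namespace": "bigquery", "facets": {}}
event_input = attach_facet(dataset, "customFacet", {"property": "value"})
print(json.dumps({"inputs": [event_input]}, indent=2))
```

A facet that already carries its own `_producer` keeps it, so extensions can still override the default.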

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/builtin-lineage-from-v2-relation branch 2 times, most recently from 2b4b858 to 741f1c6 Compare February 6, 2024 11:00
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/builtin-lineage-from-v2-relation branch from 741f1c6 to 4ae9be5 Compare February 6, 2024 11:07
@pawel-big-lebowski pawel-big-lebowski merged commit 73b4a3b into main Feb 6, 2024
30 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the spark/builtin-lineage-from-v2-relation branch February 6, 2024 11:36
Labels
documentation Improvements or additions to documentation integration/spark

2 participants