Openlineage API does not set Dataset current_version_uuid #1361

Open
vitoravancini opened this issue May 27, 2021 · 2 comments

@vitoravancini

It seems that when using the OpenLineage API, the current version ID is not populated for new datasets, so the join that fetches the fields for the dataset never matches.

This is the join that fails: 'dv' is never actually found, so dv.fields is always null.
image

The screenshot is for the DatasetDao.java file.

"SELECT d.*, dv.fields, sv.schema_location, facets, t.tags\n"
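The symptom can be reproduced in isolation. The following is a minimal sketch using an in-memory SQLite database, not the actual Marquez schema or query: the table layout, the sample rows, and the join condition on `current_version_uuid` are assumptions inferred from the issue title and the query fragment above. It only illustrates why a LEFT JOIN on an unset version pointer leaves `dv.fields` null.

```python
import sqlite3

# Hypothetical, simplified schema; the real Marquez tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasets (
        uuid TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        current_version_uuid TEXT  -- left NULL, as reported in this issue
    );
    CREATE TABLE dataset_versions (
        uuid TEXT PRIMARY KEY,
        dataset_uuid TEXT NOT NULL,
        fields TEXT
    );
    INSERT INTO datasets VALUES ('d-1', 'my-dataset', NULL);
    INSERT INTO dataset_versions VALUES ('dv-1', 'd-1', '["a", "b"]');
""")

# Assumed join condition: the dataset row points at its current version.
QUERY = """
    SELECT d.name, dv.fields
    FROM datasets d
    LEFT JOIN dataset_versions dv ON dv.uuid = d.current_version_uuid
"""


def fetch_fields(connection):
    """Return (name, fields) rows for every dataset."""
    return connection.execute(QUERY).fetchall()


# A version row exists, but the pointer is NULL, so the join finds nothing.
before = fetch_fields(conn)

# Once the pointer is set, the same query returns the fields.
conn.execute("UPDATE datasets SET current_version_uuid = 'dv-1' WHERE uuid = 'd-1'")
after = fetch_fields(conn)

print(before)
print(after)
```

In this sketch the version row is present the whole time; only the pointer on the dataset row decides whether the fields come back.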

Thank you!

@wslulciuc wslulciuc added the bug Something isn't working label May 27, 2021
@julienledem julienledem added this to To do in Marquez 0.15.0 via automation Jun 8, 2021
@wslulciuc wslulciuc removed this from To do in Marquez 0.15.0 Jun 10, 2021
@wslulciuc wslulciuc added this to To do in Marquez 0.16.0 via automation Jun 10, 2021
@wslulciuc
Member

wslulciuc commented Jun 14, 2021

Thanks for opening this issue, @vitoravancini! As of Marquez 0.15.+, the input datasets for an OpenLineage (OL) event are no longer automatically registered for a given run ID; see 56f9...ef20. This change, introduced in PR #1258, ensures that output dataset versions are only created on run completion. You can view the job and its outputs as a single unit, where the output datasets provide a clear public contract for downstream job consumption. That said, we may want to reconsider the logic (or rather, refine it). Below, I outline the current flow when an OL event is received, followed by a proposal for a possible alternative:

How are input / output datasets for an OpenLineage event handled?

If we reference the OL events in the Marquez quickstart, the my-input dataset is not created in Step 1. In the Marquez backend, the run is initially created with the ID provided (as well as any run-specific facets), but the input dataset is ignored and assumed to have already been registered by an upstream job (or directly using the DatasetAPI). Then, in Step 2, the OL event has my-output as an output dataset, which is registered and versioned, then linked to the run ID. This results in the behavior you outlined in this issue.
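The flow described above can be modeled with a toy in-memory backend. This is a sketch under stated assumptions, not Marquez code: `ToyBackend`, the version labels, the `run-1` ID, and the `my-namespace` namespace are hypothetical, and the events are reduced to the few fields needed to show the behavior.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBackend:
    """Hypothetical stand-in for the backend; not the Marquez implementation."""

    runs: dict = field(default_factory=dict)
    datasets: dict = field(default_factory=dict)  # dataset name -> version labels

    def receive(self, event: dict) -> None:
        run_id = event["run"]["runId"]
        # The run is created (or looked up) with the ID provided in the event.
        run = self.runs.setdefault(run_id, {"state": None, "outputs": []})
        run["state"] = event["eventType"]
        # Input datasets are ignored: they are assumed to be registered already.
        if event["eventType"] != "COMPLETE":
            return
        # Outputs are registered, versioned, and linked to the run on completion.
        for ds in event.get("outputs", []):
            versions = self.datasets.setdefault(ds["name"], [])
            versions.append(f"v{len(versions) + 1}")
            run["outputs"].append(ds["name"])


# Simplified events; real OL events carry more fields (job, producer, facets).
start_event = {
    "eventType": "START",
    "run": {"runId": "run-1"},
    "inputs": [{"namespace": "my-namespace", "name": "my-input"}],
}
complete_event = {
    "eventType": "COMPLETE",
    "run": {"runId": "run-1"},
    "outputs": [{"namespace": "my-namespace", "name": "my-output"}],
}

backend = ToyBackend()
backend.receive(start_event)
after_start = dict(backend.datasets)  # snapshot: nothing registered yet
backend.receive(complete_event)
print(sorted(backend.datasets))
```

After both events, only my-output exists in the toy registry; my-input was never created, which mirrors the behavior reported in this issue.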

Should an input dataset be registered and versioned if present in an OL event?

As an alternative, we could register all input datasets present in an OL event. But we'd have to consider:

  1. How should input dataset ownership be assigned?
  2. Can the current dataset version be guaranteed? That is, an upstream job A may be in a running state with dataset B as its output, but a downstream job C, with dataset B as its input, may modify it in some way, thereby overriding the version created by job A.

For ease of use, we may want to register and version a dataset if it does not yet exist. This would be a common use case for jobs at the edge of a lineage graph: for example, you may have an ETL job that loads data from a public vendor, or there may be no convenient way to link the job that produced the data.
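One way to picture the "register only if it does not yet exist" idea is the sketch below. It is purely illustrative: `register_inputs_if_missing`, the registry layout, and the `initial` version label are hypothetical, and it deliberately leaves the ownership and version-guarantee questions above unanswered.

```python
def register_inputs_if_missing(datasets: dict, event: dict) -> list:
    """Register and version input datasets that are not yet known.

    `datasets` maps (namespace, name) to a dict describing the dataset.
    Existing entries are left untouched, so a version created by an
    upstream job is never overwritten by a downstream event.
    Returns the keys that were newly created.
    """
    created = []
    for ds in event.get("inputs", []):
        key = (ds["namespace"], ds["name"])
        if key not in datasets:
            # Hypothetical placeholder version for a dataset first seen as an input.
            datasets[key] = {"versions": ["initial"], "current_version": "initial"}
            created.append(key)
    return created


# A registry where one input already exists (e.g. produced by an upstream job).
registry = {
    ("my-namespace", "known-input"): {"versions": ["v1", "v2"], "current_version": "v2"},
}
event = {
    "eventType": "START",
    "inputs": [
        {"namespace": "my-namespace", "name": "known-input"},
        {"namespace": "my-namespace", "name": "vendor-data"},  # edge of the graph
    ],
}

created = register_inputs_if_missing(registry, event)
print(created)
```

Only the dataset at the edge of the graph is created here; the one with an upstream producer keeps its current version.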

@julienledem @collado-mike: It'd be great to get your thoughts on this. We may also want the OpenLineage standard to outline how consumers should handle input datasets.

@wslulciuc
Member

wslulciuc commented Jun 14, 2021

@OleksandrDvornik: Since you'll be looking into this, to reproduce the issue, follow the Marquez quickstart guide using Marquez 0.15.+. You'll notice that only the output datasets for an OL event are registered as outlined by @vitoravancini. Let me know if you have any questions as you get started!

@julienledem julienledem removed this from To do in Marquez 0.16.0 Jul 1, 2021
@julienledem julienledem added this to To do in Marquez 0.17.0 via automation Jul 1, 2021
@collado-mike collado-mike removed this from To do in Marquez 0.17.0 Aug 17, 2021
@collado-mike collado-mike added this to To do in Marquez 0.18.0 via automation Aug 17, 2021
@julienledem julienledem removed this from To do in Marquez 0.18.0 Sep 14, 2021
@julienledem julienledem added this to To do in Marquez 0.19.0 via automation Sep 14, 2021
@wslulciuc wslulciuc removed this from To do in Marquez 0.19.0 Oct 19, 2021
@wslulciuc wslulciuc added this to To do in Marquez 0.20.0 via automation Oct 19, 2021
@wslulciuc wslulciuc removed this from To do in Marquez 0.20.0 Dec 13, 2021
@wslulciuc wslulciuc added this to To do in Marquez 0.21.0 via automation Dec 13, 2021
@wslulciuc wslulciuc removed this from To do in Marquez 0.21.0 Mar 31, 2022
4 participants