Fix lineage for orphaned datasets #2314

collado-mike · 2022-12-12T21:05:23Z

Signed-off-by: Michael Collado <collado.mike@gmail.com>

Problem

Sometimes a dataset was generated by a job whose current version no longer writes to that dataset. Because the lineage logic for a dataset always starts from a job that has written to or read from it, we generate the lineage for the current version of that job, which may not include the dataset we started from.

Solution

This change validates that the selected dataset node is always present in the lineage results returned from the database. If the dataset is not in the set of returned nodes, we assume it is no longer connected to the original job and return a lineage graph containing only the original dataset node.
Note that we always select the latest job that has written to or read from the dataset, so if a newer job now writes to that dataset, it will not be treated as an orphaned dataset.
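
For reviewers, the check described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual change in LineageService: the class name `OrphanedDatasetLineage`, the method `lineageFor`, and the use of plain strings as node IDs are all hypothetical.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch: node IDs are plain strings here, not Marquez's real node types.
public class OrphanedDatasetLineage {

  /**
   * Returns the nodes fetched from the database when they contain the selected
   * dataset. Otherwise the dataset is treated as orphaned and a graph holding
   * only that dataset node is returned.
   */
  public static Set<String> lineageFor(String selectedDataset, Set<String> nodesFromDb) {
    if (nodesFromDb.contains(selectedDataset)) {
      // The current job version still references the dataset: keep the full lineage.
      return nodesFromDb;
    }
    // The current job version no longer touches the dataset: return it on its own.
    return Collections.singleton(selectedDataset);
  }
}
```

With this shape, a caller that asks for the lineage of a dataset the current job version no longer writes to gets back a single-node graph instead of a graph that omits the dataset it asked about.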

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

codecov · 2022-12-12T21:16:24Z

Codecov Report

Merging #2314 (95aac6f) into main (b1ff80e) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #2314      +/-   ##
============================================
+ Coverage     77.01%   77.07%   +0.06%     
- Complexity     1166     1170       +4     
============================================
  Files           222      222              
  Lines          5307     5317      +10     
  Branches        424      425       +1     
============================================
+ Hits           4087     4098      +11     
  Misses          747      747              
+ Partials        473      472       -1

Impacted Files	Coverage Δ
.../src/main/java/marquez/service/LineageService.java	`86.77% <100.00%> (+2.09%)`	⬆️

collado-mike requested a review from wslulciuc December 12, 2022 21:05

boring-cyborg bot added the api (API layer changes) label Dec 12, 2022

Fix lineage for orphaned datasets

b302008

Signed-off-by: Michael Collado <collado.mike@gmail.com>

collado-mike force-pushed the fix/include_orphaned_dataset_lineage branch from 2e760fb to b302008 Compare December 12, 2022 21:08

wslulciuc approved these changes Dec 12, 2022

Merge branch 'main' into fix/include_orphaned_dataset_lineage

95aac6f

wslulciuc enabled auto-merge (squash) December 12, 2022 21:28

wslulciuc disabled auto-merge December 12, 2022 21:38

wslulciuc merged commit 3212c8f into main Dec 12, 2022

wslulciuc deleted the fix/include_orphaned_dataset_lineage branch December 12, 2022 21:38
