fix symlink display on marquez #2736

sophiely · 2024-01-23T10:40:04Z

Problem

Sending an event with a dataset symlink creates an empty namespace with 0 datasets in it.

For example, this event:

{
  ...
  },
  "inputs": [
    {
      "namespace": "test_namespace",
      "name": "dataset_0",
      "facets": {
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.1.0/client/python",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json",
          "identifiers": [
            {
              "name": "symlink_prefix",
              "type": "DB_TABLE",
              "namespace": "symlink_test"
            }
          ]
        }
      }
    }
  ],
  ...
}

creates an empty namespace called symlink_test.

Closes: #2645

Solution

Display the datasets created by a symlink facet in their own namespace (in the example, the namespace symlink_test will contain a dataset called symlink_prefix).
The lineage of such a dataset is redirected to the lineage of the "main" dataset (test_namespace.dataset_0 in our example).
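
To illustrate the redirection, here is a minimal sketch in Java. The class SymlinkResolver and its methods are hypothetical names made up for this illustration and are not part of Marquez; the sketch only assumes that each symlink identifier points at exactly one primary dataset.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: maps symlink identifiers to the primary dataset they point at. */
class SymlinkResolver {
  // Key: "namespace.name" of a symlink; value: "namespace.name" of the primary dataset.
  private final Map<String, String> primaryBySymlink = new HashMap<>();

  /** Registers a symlink identifier for a primary dataset. */
  void addSymlink(
      String primaryNamespace, String primaryName, String symlinkNamespace, String symlinkName) {
    primaryBySymlink.put(
        key(symlinkNamespace, symlinkName), key(primaryNamespace, primaryName));
  }

  /** Returns the dataset whose lineage is shown for the requested identifier. */
  String resolveLineageTarget(String namespace, String name) {
    String requested = key(namespace, name);
    // A dataset that is not a known symlink is its own lineage target.
    return primaryBySymlink.getOrDefault(requested, requested);
  }

  private static String key(String namespace, String name) {
    return namespace + "." + name;
  }
}
```

With the event from the Problem section, asking for symlink_test / symlink_prefix would resolve to test_namespace.dataset_0, while asking for the primary dataset returns it unchanged.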

Example:

If we execute these two runs:

new_dataset_a {facet: new_dataset_sym_a} ------- new_symlink_job_a ----------> new_dataset_b

then:
new_dataset_sym_a ------- new_symlink_job_b --------------> new_dataset_sym_b

In Marquez we'll have:

Please find a more detailed explanation in the comments below.

This fix introduces another issue on the front end, though: the versions/facets of the selected dataset are not displayed directly.

Because the selected dataset is the symlink and not the primary dataset, the front end doesn't recognize it as part of the lineage; as a result, the version endpoint is not called.

But if we additionally click on the dataset new_dataset_a, the dataset version query is run and the information is displayed.

One-line summary:

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant) TODO
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've included a one-line summary of your change for the [CHANGELOG.md](https://github.com/MarquezProject/marquez/blob/main/CHANGELOG.md#unreleased) (Depending on the change, this may not be necessary).
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

netlify · 2024-01-23T10:40:24Z

✅ Deploy Preview for peppy-sprite-186812 canceled.

Name	Link
🔨 Latest commit	`e832a6f`
🔍 Latest deploy log	https://app.netlify.com/sites/peppy-sprite-186812/deploys/65e75d29fb98ab0008f9e98c

codecov · 2024-01-23T10:58:34Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.46%. Comparing base (cc9c2c0) to head (e832a6f).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #2736   +/-   ##
=========================================
  Coverage     84.45%   84.46%           
+ Complexity     1416     1415    -1     
=========================================
  Files           251      251           
  Lines          6447     6450    +3     
  Branches        291      292    +1     
=========================================
+ Hits           5445     5448    +3     
  Misses          850      850           
  Partials        152      152


dkt-sophie-ly · 2024-01-23T15:48:56Z

api/src/main/resources/marquez/db/migration/V68__alter_datasets_view_to_keep_symlinks.sql

+   FROM datasets d
+     JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
+     JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid


The query is not much different from before, but the dataset uuid no longer identifies a single row, since a dataset and its symlinks share the same dataset uuid (that's why the GROUP BY is gone).
We join with dataset_symlinks to identify whether a row is the primary dataset or not. If it is the primary dataset, the values in the row remain the same as before. If not, the namespace, name and namespace uuid are replaced by the ones from the symlink (values from the joined table).
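
As a rough illustration of the row shape described here, a small in-memory model in Java. It is only a sketch and not the actual SQL: the class DatasetsViewSketch, its nested types and the string uuid are hypothetical, and it assumes the primary dataset also has its own entry flagged as primary among the symlinks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the reworked view; not the actual SQL. */
class DatasetsViewSketch {
  /** One symlink entry attached to a dataset (field names are illustrative). */
  static final class Symlink {
    final String namespace;
    final String name;
    final boolean primary;

    Symlink(String namespace, String name, boolean primary) {
      this.namespace = namespace;
      this.name = name;
      this.primary = primary;
    }
  }

  /** One row of the view: the dataset uuid plus the identifier taken from the symlink. */
  static final class Row {
    final String datasetUuid;
    final String namespace;
    final String name;
    final boolean primary;

    Row(String datasetUuid, String namespace, String name, boolean primary) {
      this.datasetUuid = datasetUuid;
      this.namespace = namespace;
      this.name = name;
      this.primary = primary;
    }
  }

  /** Emits one row per symlink entry, so all rows of one dataset share the same uuid. */
  static List<Row> rowsFor(String datasetUuid, List<Symlink> symlinks) {
    List<Row> rows = new ArrayList<>();
    for (Symlink s : symlinks) {
      rows.add(new Row(datasetUuid, s.namespace, s.name, s.primary));
    }
    return rows;
  }
}
```

For the dataset from the PR description, this yields two rows with the same uuid: one for test_namespace / dataset_0 flagged as primary, and one for symlink_test / symlink_prefix.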

Couldn't we modify the view definition in R__3_Datasets_view.sql?
Flyway scripts whose names start with R are run with each migration (repeatable migrations).

Valid, since we recreate the dataset_view on every marquez deploy, we can make your changes in R__3_Datasets_view.sql as @pawel-big-lebowski suggested.
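
For readers less familiar with the naming convention mentioned in this thread, a small Java sketch that tells the two kinds of file names apart. It assumes Flyway's default prefixes (V for versioned, R for repeatable), the default "__" separator and the .sql suffix; the class MigrationNames is a hypothetical helper and not part of Flyway or Marquez.

```java
/** Sketch of Flyway's default file-name convention for SQL migrations. */
class MigrationNames {
  /** Repeatable migrations: prefix R directly followed by the "__" separator. */
  static boolean isRepeatable(String fileName) {
    return fileName.startsWith("R__") && fileName.endsWith(".sql");
  }

  /** Versioned migrations: prefix V, a version (dots or underscores between parts), then "__". */
  static boolean isVersioned(String fileName) {
    return fileName.matches("V\\d+([._]\\d+)*__.+\\.sql");
  }
}
```

Under these assumptions, R__3_Datasets_view.sql is classified as repeatable and V68__alter_datasets_view_to_keep_symlinks.sql as versioned.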

dkt-sophie-ly · 2024-01-23T15:50:07Z

api/src/main/java/marquez/db/LineageDao.java

+      LEFT JOIN dataset_versions dv ON dv.uuid = ds.current_version_uuid
+      LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
+      WHERE dsym.is_primary = true
+      AND ds.uuid IN (<dsUuids>)""")


Here, since the view datasets_view can have several rows with the same uuid, we choose the one flagged as primary.

dkt-sophie-ly · 2024-01-23T15:50:15Z

api/src/main/java/marquez/db/LineageDao.java

+      LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
+      INNER JOIN (
+        SELECT uuid
+        FROM datasets_view as u
+        WHERE
+            u.name = :datasetName
+            AND u.namespace_name = :namespaceName
+        ) as u
+      	on u.uuid = ds.uuid
+      WHERE dsym.is_primary is true""")


Here, since the view datasets_view can have several rows with the same uuid, we choose the one flagged as primary.

Could we use datasets_view for symlink filtering, like:

INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)

dkt-sophie-ly · 2024-01-23T15:51:00Z

api/src/main/java/marquez/service/LineageService.java

+    if (nodeId.isDatasetType()) {
+      DatasetId datasetId = nodeId.asDatasetId();
+      DatasetData datasetData =
+          this.getDatasetData(datasetId.getNamespace().getValue(), datasetId.getName().getValue());
+
+      if (!datasetIds.contains(datasetData.getUuid())) {
+        log.warn(
+            "Found jobs {} which no longer share lineage with dataset '{}' - discarding",
+            jobData.stream().map(JobData::getId).toList(),
+            nodeId.getValue());
+        return toLineageWithOrphanDataset(nodeId.asDatasetId());
+      }
+    }


Now we check the uuid of the node rather than the namespace + name.
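
A small Java sketch of why the uuid comparison matters when the selected node is a symlink. The class LineageMembershipSketch and both methods are hypothetical and only contrast the two ways of checking; they are not the LineageService code.

```java
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch contrasting name-based and uuid-based membership checks. */
class LineageMembershipSketch {
  /** Name-based check: fails for a symlink, whose namespace + name differ from the primary's. */
  static boolean sharesLineageByName(Set<String> lineageDatasetNames, String selectedName) {
    return lineageDatasetNames.contains(selectedName);
  }

  /** Uuid-based check: succeeds for a symlink, because it shares the primary dataset's uuid. */
  static boolean sharesLineageByUuid(Set<UUID> lineageDatasetUuids, UUID selectedUuid) {
    return lineageDatasetUuids.contains(selectedUuid);
  }
}
```

If the lineage only contains test_namespace.dataset_0 and the selected node is symlink_test.symlink_prefix, the name-based check reports no shared lineage, while the uuid-based check matches because both identifiers resolve to the same dataset uuid.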

Nice! Thanks for adding the warn log 💯

pawel-big-lebowski

Would it be OK to add a test to LineageService that fails prior to the code change and passes afterwards?

pawel-big-lebowski · 2024-01-26T14:53:11Z

api/src/main/resources/marquez/db/migration/V68__alter_datasets_view_to_keep_symlinks.sql

+   FROM datasets d
+     JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
+     JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid


Couldn't we modify the view definition in R__3_Datasets_view.sql?
Flyway scripts whose names start with R are run with each migration (repeatable migrations).

pawel-big-lebowski · 2024-01-26T14:59:27Z

api/src/main/java/marquez/db/LineageDao.java

+      LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
+      INNER JOIN (
+        SELECT uuid
+        FROM datasets_view as u
+        WHERE
+            u.name = :datasetName
+            AND u.namespace_name = :namespaceName
+        ) as u
+      	on u.uuid = ds.uuid
+      WHERE dsym.is_primary is true""")


Could we use datasets_view for symlink filtering, like:

INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)

wslulciuc · 2024-01-30T09:09:54Z

CHANGELOG.md

@@ -2,6 +2,11 @@

 ## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.44.0...HEAD)

+### Fixed


Thanks for updating the changelog 💯

dkt-sophie-ly · 2024-02-07T14:10:43Z

Hi @pawel-big-lebowski @wslulciuc

I just updated the code according to your comments:

Modified R__3_Datasets_view.sql
Used INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks) as a filter
Added a test in LineageService
Concerning this test, I feel that the way I add the symlinks facet is not the best one (the dataset is not considered a symlink). I may need a little help on that, if that's OK with you :)

Thanks for your review!

sophiely · 2024-02-19T15:19:57Z

Hi @wslulciuc @pawel-big-lebowski :)
Did you have time to check this PR?

Thanks for your review

Signed-off-by: sophiely <ly.sophie200@gmail.com>

wslulciuc

Amazing work and test, @sophiely 💯

boring-cyborg bot added the api API layer changes label Jan 23, 2024

sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch from 47d94f9 to 50d3507 Compare January 23, 2024 10:49

boring-cyborg bot added the docs label Jan 23, 2024

dkt-sophie-ly reviewed Jan 23, 2024


sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch 2 times, most recently from 576a831 to e42ae18 Compare January 26, 2024 08:43

pawel-big-lebowski reviewed Jan 26, 2024


wslulciuc reviewed Jan 30, 2024


sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch 3 times, most recently from 6bfdd68 to b0e02fe Compare February 7, 2024 12:43

dkt-sophie-ly mentioned this pull request Feb 7, 2024

If a Dataset symlink is created afterwards with a DatasetEvent, the link is not created in the lineage #2738

Open

sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch 3 times, most recently from 062fa7d to 35129a8 Compare February 19, 2024 14:48

sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch from 35129a8 to 2a39b98 Compare February 21, 2024 08:38

sophiely added 7 commits February 22, 2024 09:27

fix symlink display on marquez

a03621f

Signed-off-by: sophiely <ly.sophie200@gmail.com>

fix code formatting

40bbca1

Signed-off-by: sophiely <ly.sophie200@gmail.com>

update changelog

005928a

Signed-off-by: sophiely <ly.sophie200@gmail.com>

change dataset_views query

e3e83b5

Signed-off-by: sophiely <ly.sophie200@gmail.com>

update changelog

ec44bab

Signed-off-by: sophiely <ly.sophie200@gmail.com>

rename migration file

e70b1a6

Signed-off-by: sophiely <ly.sophie200@gmail.com>

rename migration file

00a7d34

Signed-off-by: sophiely <ly.sophie200@gmail.com>

sophiely added 4 commits February 22, 2024 09:27

fix formatting and add migration file

93b4363

Signed-off-by: sophiely <ly.sophie200@gmail.com>

fix formatting

a6dde9b

Signed-off-by: sophiely <ly.sophie200@gmail.com>

resolve comments

a4bfbe1

Signed-off-by: sophiely <ly.sophie200@gmail.com>

resolve tests

0178bb3

Signed-off-by: sophiely <ly.sophie200@gmail.com>

sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch from 2a39b98 to 0178bb3 Compare February 22, 2024 08:28

wslulciuc approved these changes Mar 5, 2024


Merge branch 'main' into fix/display-symlinks-datasets-and-lineage

e832a6f

wslulciuc enabled auto-merge (squash) March 5, 2024 18:01

wslulciuc disabled auto-merge March 5, 2024 18:24

wslulciuc merged commit b0683ad into MarquezProject:main Mar 5, 2024
16 checks passed

sophiely deleted the fix/display-symlinks-datasets-and-lineage branch July 25, 2024 07:45
