Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix symlink display on marquez #2736

Conversation

sophiely
Copy link
Contributor

@sophiely sophiely commented Jan 23, 2024

Problem

Sending an event with a dataset symlink create an empty namespace with 0 dataset in it.

For example, this event:

{
  ...
  },
  "inputs": [
    {
      "namespace": "test_namespace"
      "name": "dataset_0",
      "facets": {
        "symlinks": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.1.0/client/python",
          "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json",
          "identifiers": [
            {
              "name": "symlink_prefix",
              "type": "DB_TABLE",
              "namespace": "symlink_test"
            }
          ]
        },
      },
    }
  ],
  ...
}

create an empty namespace called symlink_test

Closes: #2645

Solution

  • Display the dataset create by a symlink facet in their own namespace (here in the example the namespace symlink_test will contain a dataset called symlink_prefix)
  • The lineage of this dataset will be redirect to the lineage of the "main" dataset (the dataset test_namespace.dataset_0 in our example)

Example:

If we run these 2 runs:

new_dataset_a {facet: new_dataset_sym_a} ------- new_symlink_job_a ----------> new_dataset_b

then:
new_dataset_sym_a ------- new_symlink_job_b --------------> new_dataset_sym_b

On marquez we'll have

image image

Please find more detailed explanation on the comments below.

This fix include another issue on the front though, the version/facets of the selected dataset are not directly displayed.

Since the selected dataset is the symlink and not the primary dataset, the front doesn't recognize the selected dataset as part of the lineage as a result the version endpoint is not run.

image

But if we click on the dataset new_dataset_a additional, the dataset version query is run and information are displayed

image

One-line summary:

Checklist

@boring-cyborg boring-cyborg bot added the api API layer changes label Jan 23, 2024
Copy link

netlify bot commented Jan 23, 2024

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit e832a6f
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/65e75d29fb98ab0008f9e98c

@sophiely sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch from 47d94f9 to 50d3507 Compare January 23, 2024 10:49
Copy link

codecov bot commented Jan 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.46%. Comparing base (cc9c2c0) to head (e832a6f).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2736   +/-   ##
=========================================
  Coverage     84.45%   84.46%           
+ Complexity     1416     1415    -1     
=========================================
  Files           251      251           
  Lines          6447     6450    +3     
  Branches        291      292    +1     
=========================================
+ Hits           5445     5448    +3     
  Misses          850      850           
  Partials        152      152           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@boring-cyborg boring-cyborg bot added the docs label Jan 23, 2024
Comment on lines 30 to 32
FROM datasets d
JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query is not much different from before, but now the dataset uuid is not the primary key anymore since a dataset and his symlink has the same dataset uuid (that's why the group by is not here anymore).
We join this view with dataset_symlinks to identify if it's a primary dataset or not. if it is a primary dataset, the values in the row remain the same as before. If not, the namespace, name and namespace uuid are replaced by the one from the symlinks (value from the join table).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we modify view definition in R__3_Datasets_view.sql?
Flywaydb scripts starting with R are run with each migration Repeatable migration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid, since we recreate the dataset_view on every marquez deploy, we can make your changes in R__3_Datasets_view.sql as @pawel-big-lebowski suggested.

Comment on lines +105 to +108
LEFT JOIN dataset_versions dv ON dv.uuid = ds.current_version_uuid
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
WHERE dsym.is_primary = true
AND ds.uuid IN (<dsUuids>)""")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here since the view datasets_views can have several rows with the same uuid we choose the one flagged as primary.

Comment on lines 116 to 125
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
INNER JOIN (
SELECT uuid
FROM datasets_view as u
WHERE
u.name = :datasetName
AND u.namespace_name = :namespaceName
) as u
on u.uuid = ds.uuid
WHERE dsym.is_primary is true""")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here since the view datasets_views can have several rows with the same uuid we choose the one flagged as primary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use dataset_views for symlink filtering like

 INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid
 WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)

Comment on lines +118 to +130
if (nodeId.isDatasetType()) {
DatasetId datasetId = nodeId.asDatasetId();
DatasetData datasetData =
this.getDatasetData(datasetId.getNamespace().getValue(), datasetId.getName().getValue());

if (!datasetIds.contains(datasetData.getUuid())) {
log.warn(
"Found jobs {} which no longer share lineage with dataset '{}' - discarding",
jobData.stream().map(JobData::getId).toList(),
nodeId.getValue());
return toLineageWithOrphanDataset(nodeId.asDatasetId());
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we check if the uuid of the node and not the namespace+name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for adding the warn log 💯

@sophiely sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch 2 times, most recently from 576a831 to e42ae18 Compare January 26, 2024 08:43
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be OK to add some test to LineageService which is failing prior to the code change introduced and is passing afterwards?

Comment on lines 30 to 32
FROM datasets d
JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we modify view definition in R__3_Datasets_view.sql?
Flywaydb scripts starting with R are run with each migration Repeatable migration

Comment on lines 116 to 125
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
INNER JOIN (
SELECT uuid
FROM datasets_view as u
WHERE
u.name = :datasetName
AND u.namespace_name = :namespaceName
) as u
on u.uuid = ds.uuid
WHERE dsym.is_primary is true""")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use dataset_views for symlink filtering like

 INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid
 WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)

@@ -2,6 +2,11 @@

## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.44.0...HEAD)

### Fixed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the changelog 💯

@sophiely sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch 3 times, most recently from 6bfdd68 to b0e02fe Compare February 7, 2024 12:43
@dkt-sophie-ly
Copy link

dkt-sophie-ly commented Feb 7, 2024

Hi @pawel-big-lebowski @wslulciuc

Just updated the code accordingly yo your comments

  • Modify R__3_Datasets_view.sql
  • Use INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks) as a filter
  • Add a test in LineageService
    Concerning this test I feel like the way I add a symlinks facets is not the best way (the dataset is not considered as a symlink). I may need a little help on that if that's ok with you :)

Thanks for your review !

@sophiely
Copy link
Contributor Author

Hi @wslulciuc @pawel-big-lebowski :)
Did you have the time to check this PR ?

Thanks for your review

@sophiely sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch from 35129a8 to 2a39b98 Compare February 21, 2024 08:38
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
Signed-off-by: sophiely <ly.sophie200@gmail.com>
@sophiely sophiely force-pushed the fix/display-symlinks-datasets-and-lineage branch from 2a39b98 to 0178bb3 Compare February 22, 2024 08:28
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work and test, @sophiely 💯

@wslulciuc wslulciuc enabled auto-merge (squash) March 5, 2024 18:01
@wslulciuc wslulciuc disabled auto-merge March 5, 2024 18:24
@wslulciuc wslulciuc merged commit b0683ad into MarquezProject:main Mar 5, 2024
16 checks passed
@sophiely sophiely deleted the fix/display-symlinks-datasets-and-lineage branch July 25, 2024 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Symlink facet create an empty namespace with 0 dataset in it
4 participants