fix symlink display on marquez #2736
✅ Deploy Preview for peppy-sprite-186812 canceled.
47d94f9 to 50d3507
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2736 +/- ##
=========================================
Coverage 84.45% 84.46%
+ Complexity 1416 1415 -1
=========================================
Files 251 251
Lines 6447 6450 +3
Branches 291 292 +1
=========================================
+ Hits 5445 5448 +3
Misses 850 850
Partials 152 152
FROM datasets d
JOIN dataset_symlinks symlinks ON d.uuid = symlinks.dataset_uuid
JOIN namespaces ON symlinks.namespace_uuid = namespaces.uuid
The query is not much different from before, but the dataset uuid no longer identifies a single row, since a dataset and its symlinks share the same dataset uuid (that's why the GROUP BY is gone).
We join this view with dataset_symlinks to identify whether a row is the primary dataset or not. If it is the primary dataset, the values in the row remain the same as before. If not, the namespace, name and namespace uuid are replaced by those of the symlink (values from the joined table).
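To make the row semantics described above concrete, here is a minimal in-memory sketch in Java. It is not the actual view definition: the record types, the `viewRows` helper and any namespace or dataset names passed to it are hypothetical and exist only for illustration. It assumes every dataset has one symlink row flagged as primary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical in-memory model of the join; not Marquez code.
final class DatasetsViewSketch {
  record Dataset(UUID uuid, String namespace, String name) {}

  record Symlink(UUID datasetUuid, String namespace, String name, boolean isPrimary) {}

  record ViewRow(UUID uuid, String namespace, String name, boolean isPrimary) {}

  // Emits one row per (dataset, symlink) pair, so a uuid can appear several times.
  static List<ViewRow> viewRows(List<Dataset> datasets, List<Symlink> symlinks) {
    List<ViewRow> rows = new ArrayList<>();
    for (Dataset d : datasets) {
      for (Symlink s : symlinks) {
        if (!s.datasetUuid().equals(d.uuid())) {
          continue; // this symlink belongs to another dataset
        }
        if (s.isPrimary()) {
          // Primary row: values stay those of the dataset itself.
          rows.add(new ViewRow(d.uuid(), d.namespace(), d.name(), true));
        } else {
          // Symlink row: namespace and name come from the symlink.
          rows.add(new ViewRow(d.uuid(), s.namespace(), s.name(), false));
        }
      }
    }
    return rows;
  }
}
```

With one dataset and two symlinks (one primary, one not), `viewRows` returns two rows that share the same uuid but carry different namespace and name values.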
Couldn't we modify the view definition in R__3_Datasets_view.sql?
Flyway scripts starting with R are repeatable migrations: they are re-applied whenever their checksum changes.
Valid. Since we recreate the datasets_view on every marquez deploy, we can make your changes in R__3_Datasets_view.sql, as @pawel-big-lebowski suggested.
LEFT JOIN dataset_versions dv ON dv.uuid = ds.current_version_uuid
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
WHERE dsym.is_primary = true
AND ds.uuid IN (<dsUuids>)""")
Here, since the view datasets_view can have several rows with the same uuid, we choose the one flagged as primary.
LEFT JOIN dataset_symlinks dsym ON dsym.namespace_uuid = ds.namespace_uuid and dsym.name = ds.name
INNER JOIN (
  SELECT uuid
  FROM datasets_view as u
  WHERE
    u.name = :datasetName
    AND u.namespace_name = :namespaceName
) as u
on u.uuid = ds.uuid
WHERE dsym.is_primary is true""")
Here, since the view datasets_view can have several rows with the same uuid, we choose the one flagged as primary.
Could we use datasets_view for symlink filtering, like this?
INNER JOIN datasets_view AS d ON d.uuid = df.dataset_uuid
WHERE CAST((:namespaceName, :datasetName) AS DATASET_NAME) = ANY(d.dataset_symlinks)
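The filtering idea in this suggestion can be sketched in plain Java as follows. This is an illustration only, assuming (as the suggested query implies) that each view row carries the list of names under which the dataset is known; `SymlinkFilterSketch`, `DatasetName`, `ViewRow` and `matchingUuids` are hypothetical names, not Marquez classes.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical model of "(:namespaceName, :datasetName) = ANY(d.dataset_symlinks)".
final class SymlinkFilterSketch {
  record DatasetName(String namespace, String name) {}

  record ViewRow(UUID uuid, List<DatasetName> datasetSymlinks) {}

  // Returns the uuids of all rows whose symlink list contains the requested name.
  static List<UUID> matchingUuids(List<ViewRow> rows, String namespaceName, String datasetName) {
    DatasetName wanted = new DatasetName(namespaceName, datasetName);
    return rows.stream()
        .filter(row -> row.datasetSymlinks().contains(wanted)) // record equality compares both fields
        .map(ViewRow::uuid)
        .toList();
  }
}
```

Under that assumption, looking a dataset up by its primary name or by any of its symlink names resolves to the same uuid, without a separate join on dataset_symlinks.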
if (nodeId.isDatasetType()) {
  DatasetId datasetId = nodeId.asDatasetId();
  DatasetData datasetData =
      this.getDatasetData(datasetId.getNamespace().getValue(), datasetId.getName().getValue());

  if (!datasetIds.contains(datasetData.getUuid())) {
    log.warn(
        "Found jobs {} which no longer share lineage with dataset '{}' - discarding",
        jobData.stream().map(JobData::getId).toList(),
        nodeId.getValue());
    return toLineageWithOrphanDataset(nodeId.asDatasetId());
  }
}
Now we check the uuid of the node rather than its namespace+name.
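A small sketch of why the uuid comparison matters when the selected node is a symlink. The class, record and method names below are hypothetical, and the "by name" variant is only my reading of the previous behaviour as described in this comment.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical comparison of the two membership checks; not Marquez code.
final class LineageNodeCheckSketch {
  record DatasetName(String namespace, String name) {}

  // Name-based check: a symlink name differs from the primary name, so it never matches.
  static boolean inLineageByName(Set<DatasetName> lineageNames, DatasetName selected) {
    return lineageNames.contains(selected);
  }

  // Uuid-based check: resolve the selected name to its dataset uuid first, then compare uuids.
  static boolean inLineageByUuid(
      Set<UUID> lineageUuids, Map<DatasetName, UUID> uuidByName, DatasetName selected) {
    UUID uuid = uuidByName.get(selected);
    return uuid != null && lineageUuids.contains(uuid);
  }
}
```

If the lineage contains the primary dataset and the selected node is one of its symlinks, the name-based check reports the node as missing, while the uuid-based check finds it because both names resolve to the same uuid.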
Nice! Thanks for adding the warn log 💯
576a831 to e42ae18
Would it be OK to add a test for LineageService that fails before this code change and passes afterwards?
@@ -2,6 +2,11 @@

## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.44.0...HEAD)

### Fixed
Thanks for updating the changelog 💯
6bfdd68 to b0e02fe
Hi @pawel-big-lebowski @wslulciuc, I just updated the code according to your comments.
Thanks for your review!
062fa7d to 35129a8
Hi @wslulciuc @pawel-big-lebowski :) Thanks for your review
35129a8 to 2a39b98
Signed-off-by: sophiely <ly.sophie200@gmail.com>
2a39b98 to 0178bb3
Amazing work and test, @sophiely 💯
Problem
Sending an event with a dataset symlink creates an empty namespace with 0 datasets in it.
For example, this event:
creates an empty namespace called symlink_test
Closes: #2645
Solution
Example:
If we run these 2 runs:
new_dataset_a {facet: new_dataset_sym_a} ------- new_symlink_job_a ----------> new_dataset_b
then:
new_dataset_sym_a ------- new_symlink_job_b --------------> new_dataset_sym_b
On marquez we'll have
Please find a more detailed explanation in the comments below.
This fix comes with another issue on the front end, though: the version/facets of the selected dataset are not displayed directly.
Since the selected dataset is the symlink and not the primary dataset, the front end doesn't recognize the selected dataset as part of the lineage, so the version endpoint is not called.
But if we then click on the dataset new_dataset_a, the dataset version query is run and the information is displayed.
One-line summary:
Checklist
CHANGELOG.md (https://github.com/MarquezProject/marquez/blob/main/CHANGELOG.md#unreleased) (Depending on the change, this may not be necessary).
.sql database schema migration according to Flyway's naming convention (if relevant)