dataset symlinks provided #2087

pawel-big-lebowski · 2022-08-25T08:24:52Z

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

Problem

A SymlinksDatasetFacet was recently introduced in the OpenLineage spec (OpenLineage/OpenLineage#936); it allows a dataset to be identified by multiple (name, namespace) tuples. Marquez needs to be modified to handle it, because many joins to the datasets table are currently done directly on the name and namespace values.

Part of: #2066

Solution

This PR contains:

a refactor of the current Marquez database model to allow alternative dataset names, by creating an extra dataset_symlinks table that holds the dataset name.
removal of the dataset's name from the datasets table.
implementation of the SymlinksDatasetFacet logic.
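
For readers who want the intended model in concrete terms, here is a minimal in-memory sketch in Java. It is not the code from this PR: the DatasetId and SymlinkRegistry classes and their methods are hypothetical, and a HashMap stands in for the dataset_symlinks table. It only illustrates the idea that every (namespace, name) pair points at one dataset, so a lookup by any of the names finds the same dataset.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.UUID;

// Hypothetical value type for a (namespace, name) identifier.
final class DatasetId {
  final String namespace;
  final String name;

  DatasetId(String namespace, String name) {
    this.namespace = namespace;
    this.name = name;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof DatasetId)) return false;
    DatasetId other = (DatasetId) o;
    return namespace.equals(other.namespace) && name.equals(other.name);
  }

  @Override
  public int hashCode() {
    return Objects.hash(namespace, name);
  }
}

// Hypothetical in-memory stand-in for a dataset_symlinks table.
final class SymlinkRegistry {
  // Each known identifier points at exactly one dataset UUID.
  private final Map<DatasetId, UUID> symlinks = new HashMap<>();

  // Returns the UUID for the identifier, creating a new dataset if it is unknown.
  UUID register(DatasetId id) {
    return symlinks.computeIfAbsent(id, k -> UUID.randomUUID());
  }

  // Records an alternative identifier; an identifier that is already taken is left unchanged.
  void addSymlink(UUID datasetUuid, DatasetId alias) {
    symlinks.putIfAbsent(alias, datasetUuid);
  }

  // Looks a dataset up by any of its identifiers.
  Optional<UUID> resolve(DatasetId id) {
    return Optional.ofNullable(symlinks.get(id));
  }
}
```

With this shape, registering (X, A), adding (Y, B) as a symlink, and then resolving either pair returns the same UUID, which is the behavior a direct join on name and namespace cannot provide.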

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)

codecov · 2022-08-26T10:17:40Z

Codecov Report

Merging #2087 (60c423d) into main (bb3d163) will increase coverage by 0.19%.
The diff coverage is 97.82%.

@@             Coverage Diff              @@
##               main    #2087      +/-   ##
============================================
+ Coverage     75.30%   75.49%   +0.19%     
- Complexity     1038     1045       +7     
============================================
  Files           203      206       +3     
  Lines          4883     4925      +42     
  Branches        399      399              
============================================
+ Hits           3677     3718      +41     
  Misses          763      763              
- Partials        443      444       +1

Impacted Files	Coverage Δ
api/src/main/java/marquez/db/Columns.java	`81.81% <ø> (ø)`
...a/marquez/db/mappers/DatasetSymlinksRowMapper.java	`90.00% <90.00%> (ø)`
api/src/main/java/marquez/db/DatasetDao.java	`98.64% <100.00%> (+0.11%)`	⬆️
...pi/src/main/java/marquez/db/DatasetSymlinkDao.java	`100.00% <100.00%> (ø)`
api/src/main/java/marquez/db/OpenLineageDao.java	`95.41% <100.00%> (+0.24%)`	⬆️
...main/java/marquez/db/models/DatasetSymlinkRow.java	`100.00% <100.00%> (ø)`
...main/java/marquez/service/models/LineageEvent.java	`84.12% <100.00%> (+1.07%)`	⬆️


mobuchowski · 2022-08-30T14:35:41Z

@pawel-big-lebowski let me try to summarize the changes to the model: after this change, the datasets table internally doesn't represent the "full" dataset, since name is dropped; getting the dataset identifier requires joining with the symlinks table. The datasets table still contains namespace_name, which is the same namespace name used by the primary symlink.

The primary symlink is the one sent as the regular dataset name, rather than the ones in SymlinksDatasetFacet. That raises a question: what happens if two jobs produce the same dataset but have the identifiers in a different order; that is, one has namespace X and name A as primary, and the second has namespace Y and name B? Does the first one "win", so that the dataset is forever known as the (X, A) pair?

pawel-big-lebowski · 2022-08-30T16:13:54Z

> @pawel-big-lebowski let me try to summarize the changes to the model: after this change, the datasets table internally doesn't represent the "full" dataset, since name is dropped; getting the dataset identifier requires joining with the symlinks table. The datasets table still contains namespace_name, which is the same namespace name used by the primary symlink.

Yes, you're right. The reason for dropping the name column is to make sure that no one, now or later, tries to retrieve a dataset by filtering on the name column in the datasets table. Whenever that is required, the dataset_symlinks table should be joined to make sure we also look for alternative names.

The namespace_name column existed before, together with the namespace_uuid field. In my opinion it should be removed, but I found that to be out of scope for this PR.

> The primary symlink is the one sent as the regular dataset name, rather than the ones in SymlinksDatasetFacet. That raises a question: what happens if two jobs produce the same dataset but have the identifiers in a different order; that is, one has namespace X and name A as primary, and the second has namespace Y and name B? Does the first one "win", so that the dataset is forever known as the (X, A) pair?

Yes, the first (namespace, name) pair becomes primary, and I think that's the best we can do.
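
The "first one wins" rule discussed above can be sketched as follows. This is an assumption about how such resolution might behave, not the logic from this PR: the PrimaryNameTracker class, its ingest method, and the "namespace/name" string encoding are all hypothetical, and the sketch ignores the case where two identifiers already belong to different datasets.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker: identifiers are encoded as "namespace/name" strings.
final class PrimaryNameTracker {
  // Maps every known identifier to the primary identifier of its dataset.
  private final Map<String, String> primaryOf = new HashMap<>();

  // eventName is the regular dataset identifier sent in the event;
  // symlinks are the identifiers carried by the facet.
  // Returns the primary identifier of the dataset after ingesting the event.
  String ingest(String eventName, List<String> symlinks) {
    // If any identifier is already known, the existing primary is kept.
    String primary = primaryOf.get(eventName);
    for (String symlink : symlinks) {
      if (primary == null) {
        primary = primaryOf.get(symlink);
      }
    }
    // First sighting of this dataset: the event's own name becomes primary.
    if (primary == null) {
      primary = eventName;
    }
    primaryOf.putIfAbsent(eventName, primary);
    for (String symlink : symlinks) {
      primaryOf.putIfAbsent(symlink, primary);
    }
    return primary;
  }
}
```

If one job reports "X/A" with symlink "Y/B" and a later job reports "Y/B" with symlink "X/A", both calls return "X/A" in this sketch: the dataset keeps the primary name it was first seen with.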

julienledem

Overall this looks good to me. Is it going to conflict with the delete feature @mobuchowski is working on?

api/src/main/java/marquez/db/DatasetDao.java

collado-mike

Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.

pawel-big-lebowski · 2022-08-31T07:30:06Z

> Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.

To confirm that, one would need to know every Marquez feature, to make sure datasets are never joined by name where symlinks should apply. That creates an assumption which is not stated explicitly and is therefore error-prone in the future. That's why I am in favour of the assumption that dataset_symlinks should always be used to retrieve a dataset by name. If a dataset has two different names, it gets returned no matter which one we request it by.

Performance-wise, this should not affect PostgreSQL. It's just an extra join on an indexed column to retrieve an extra column (not used for filtering), plus an uncorrelated subquery, run before the existing queries, to get the possible symlinks. Postgres is smart, and joins on foreign keys, which are applied only to the result rows, should be fast. Full scans that dig into jsonb columns are slow; we know that, and there is a proposal to fix it.

To sum up, there is a tradeoff between "(1) an unclean assumption in the DB model" and "(2) a DB performance risk that should not materialize but cannot be ruled out due to the lack of perf tests". On one hand, I would go with (2), as it gives a good chance of keeping things clean in the future. On the other hand, I don't mind going with (1), as I haven't worked that much with Marquez DB perf issues.

api/src/main/resources/marquez/db/migration/V46__dataset_symlinks.sql

api/src/main/java/marquez/db/DatasetSymlinkDao.java

mobuchowski · 2022-08-31T09:57:44Z

@julienledem great point. I talked with @pawel-big-lebowski and we agreed to use datasets_view to provide joined data from datasets and dataset_symlinks. This way, most of the SQL busywork behind this change will be hidden from the queries.

With that, the only issue is that "namespaces" aren't unique anymore; methods like

List<Dataset> findAll(String namespaceName, int limit, int offset)

still need to filter by the is_primary field to avoid returning duplicates.
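
To illustrate why the is_primary filter matters, here is a small sketch over in-memory rows. It is an assumption-laden simplification, not the DAO from this PR: SymlinkViewRow and DatasetListing are hypothetical names, the rows stand in for whatever the joined view returns, and findAll here returns dataset names instead of Dataset objects.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical row of a view joining datasets with their symlinks.
final class SymlinkViewRow {
  final String datasetUuid;
  final String namespaceName;
  final String name;
  final boolean isPrimary;

  SymlinkViewRow(String datasetUuid, String namespaceName, String name, boolean isPrimary) {
    this.datasetUuid = datasetUuid;
    this.namespaceName = namespaceName;
    this.name = name;
    this.isPrimary = isPrimary;
  }
}

final class DatasetListing {
  // Simplified stand-in for findAll: returns names rather than Dataset objects.
  static List<String> findAll(
      List<SymlinkViewRow> rows, String namespaceName, int limit, int offset) {
    return rows.stream()
        .filter(row -> row.namespaceName.equals(namespaceName))
        // Without this filter, a dataset with two names in one namespace is listed twice.
        .filter(row -> row.isPrimary)
        .skip(offset)
        .limit(limit)
        .map(row -> row.name)
        .collect(Collectors.toList());
  }
}
```

Applying the filter before skip and limit keeps paging stable: offset and limit then count datasets rather than symlink rows.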

api/src/main/java/marquez/db/Columns.java

api/src/main/java/marquez/db/DatasetDao.java

api/src/main/java/marquez/db/Columns.java

api/src/main/java/marquez/db/DatasetSymlinkDao.java

api/src/main/resources/marquez/db/migration/V46__dataset_symlinks.sql

wslulciuc · 2022-09-07T07:52:19Z

> To sum up, there is a tradeoff between "(1) an unclean assumption in the DB model" and "(2) a DB performance risk that should not materialize but cannot be ruled out due to the lack of perf tests". On one hand, I would go with (2), as it gives a good chance of keeping things clean in the future. On the other hand, I don't mind going with (1), as I haven't worked that much with Marquez DB perf issues.

@pawel-big-lebowski I think @collado-mike's comment, "As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences", wasn't necessarily about performance (though follow-up work on job symlinks was needed to improve performance and implementation), but rather about unexpected behavior given the number of queries modified. That said, I agree with you that "dataset_symlinks should always be used to retrieve a dataset by name", but would prefer that we use datasets_view (introduced in #2032).

Note, we're hoping to add load testing to CI at some point to help with DB performance insight, see #2047

wslulciuc

I would also update our CHANGELOG.md with this sweet sweet feature!

wslulciuc · 2022-09-07T08:05:12Z

> The namespace_name column existed before, together with the namespace_uuid field. In my opinion it should be removed, but I found that to be out of scope for this PR.

@pawel-big-lebowski mind opening an issue to capture this change as follow up work?

pawel-big-lebowski · 2022-09-09T10:58:55Z

@collado-mike, @mobuchowski and @wslulciuc, thank you for the great reviews and feedback.

Changes applied in response:

will reuse the datasets_view introduced by Maciej,
won't drop the name column in the datasets table, to avoid risky migrations and limit the number of possible unexpected behavior changes.

api/src/main/resources/marquez/db/migration/V47__dataset_symlinks.sql

api/src/main/java/marquez/db/DatasetSymlinkDao.java

api/src/main/resources/marquez/db/migration/V47__dataset_symlinks.sql

wslulciuc

Great work, @pawel-big-lebowski! I left some additional comments, but looks good to merge overall once they are resolved 💯 🥇

api/src/main/java/marquez/db/DatasetSymlinkDao.java

api/src/main/java/marquez/db/DatasetDao.java

api/src/main/java/marquez/db/DatasetSymlinkDao.java

api/src/main/java/marquez/db/OpenLineageDao.java

api/src/main/java/marquez/service/models/DatasetSymlink.java

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski requested review from collado-mike and wslulciuc August 25, 2022 08:31

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch from a7d9feb to eea3bb9 Compare August 26, 2022 10:12

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch from eea3bb9 to eb67b0f Compare August 26, 2022 10:40

julienledem reviewed Aug 30, 2022

api/src/main/java/marquez/db/DatasetDao.java (outdated)

collado-mike reviewed Aug 30, 2022

pawel-big-lebowski commented Aug 31, 2022

api/src/main/resources/marquez/db/migration/V46__dataset_symlinks.sql (outdated)

mobuchowski reviewed Aug 31, 2022

api/src/main/java/marquez/db/DatasetSymlinkDao.java (outdated)

wslulciuc reviewed Sep 6, 2022

api/src/main/java/marquez/db/Columns.java

wslulciuc reviewed Sep 6, 2022

api/src/main/java/marquez/db/DatasetDao.java (outdated)

api/src/main/java/marquez/db/Columns.java

wslulciuc reviewed Sep 7, 2022

wslulciuc requested changes Sep 7, 2022

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch from eb67b0f to fda3425 Compare September 8, 2022 09:51

boring-cyborg bot added the api API layer changes label Sep 8, 2022

pawel-big-lebowski marked this pull request as draft September 8, 2022 09:51

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch 2 times, most recently from e435702 to 32d8785 Compare September 9, 2022 06:38

boring-cyborg bot added the docs label Sep 9, 2022

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch from 32d8785 to f25b6f5 Compare September 9, 2022 10:56

pawel-big-lebowski marked this pull request as ready for review September 9, 2022 10:56

pawel-big-lebowski requested review from wslulciuc and collado-mike September 9, 2022 11:41

pawel-big-lebowski requested a review from mobuchowski September 9, 2022 11:41

mobuchowski reviewed Sep 20, 2022

api/src/main/resources/marquez/db/migration/V47__dataset_symlinks.sql (outdated)

mobuchowski reviewed Sep 21, 2022

api/src/main/java/marquez/db/DatasetSymlinkDao.java (outdated)

mobuchowski reviewed Sep 21, 2022

api/src/main/resources/marquez/db/migration/V47__dataset_symlinks.sql (outdated)

wslulciuc approved these changes Sep 27, 2022

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch from f25b6f5 to 2f57f19 Compare September 28, 2022 06:36

dataset symlinks provided

60c423d

Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>

pawel-big-lebowski force-pushed the feature/dataset-symlinks branch from 2f57f19 to 60c423d Compare September 28, 2022 06:47

pawel-big-lebowski merged commit 2909864 into main Sep 28, 2022

pawel-big-lebowski deleted the feature/dataset-symlinks branch September 28, 2022 07:05

wslulciuc mentioned this pull request Nov 16, 2022

Have a strategy to deal with renamed datasets. #1512

Open

ddave09 mentioned this pull request Oct 30, 2023

Merge query issues for tables datasets, dataset_versions, and job_versions #2673

Open
