Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset symlinks provided #2087

Merged
merged 1 commit into from
Sep 28, 2022
Merged

Conversation

pawel-big-lebowski
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Aug 25, 2022

Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com

Problem

A SymlinkDatasetFacet has been introduced in spec recently (OpenLineage/OpenLineage#936) and it allows a dataset to be identified by multiple (name, namespace) tuples. We need to modify Marquez to handle it, as currently many joins to dataset table are done directly based on name and namespace value.

Part of: #2066

Solution

This PR contains:

  • a refactor of current Marquez database model to allow alternative dataset names by creating an extra dataset_symlinks table with dataset name.
  • removal of dataset's name from dataset table.
  • implementation of SymlinkDatasetFacet logic.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@codecov
Copy link

codecov bot commented Aug 26, 2022

Codecov Report

Merging #2087 (60c423d) into main (bb3d163) will increase coverage by 0.19%.
The diff coverage is 97.82%.

@@             Coverage Diff              @@
##               main    #2087      +/-   ##
============================================
+ Coverage     75.30%   75.49%   +0.19%     
- Complexity     1038     1045       +7     
============================================
  Files           203      206       +3     
  Lines          4883     4925      +42     
  Branches        399      399              
============================================
+ Hits           3677     3718      +41     
  Misses          763      763              
- Partials        443      444       +1     
Impacted Files Coverage Δ
api/src/main/java/marquez/db/Columns.java 81.81% <ø> (ø)
...a/marquez/db/mappers/DatasetSymlinksRowMapper.java 90.00% <90.00%> (ø)
api/src/main/java/marquez/db/DatasetDao.java 98.64% <100.00%> (+0.11%) ⬆️
...pi/src/main/java/marquez/db/DatasetSymlinkDao.java 100.00% <100.00%> (ø)
api/src/main/java/marquez/db/OpenLineageDao.java 95.41% <100.00%> (+0.24%) ⬆️
...main/java/marquez/db/models/DatasetSymlinkRow.java 100.00% <100.00%> (ø)
...main/java/marquez/service/models/LineageEvent.java 84.12% <100.00%> (+1.07%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mobuchowski
Copy link
Contributor

@pawel-big-lebowski let me try to summarize changes to model: after this change, internally datasets table don't represent "full" dataset, since name is dropped - getting dataset identifier requires joining with symlinks table. datasets table still contains namespace_name - it's the same namespace name that is used by primary symlink.

Primary symlink is the one that is being send in regular dataset name, rather the ones in SymlinksDatasetFacet. That suggest question: what happens if two jobs produce the same dataset, and they have the identifiers in different order; that is one has namespace X and name A as primary, and second one has namespace Y and name B? Does the first one "win" and dataset is forever known as (X, A) pair?

@pawel-big-lebowski
Copy link
Collaborator Author

@pawel-big-lebowski let me try to summarize changes to model: after this change, internally datasets table don't represent "full" dataset, since name is dropped - getting dataset identifier requires joining with symlinks table. datasets table still contains namespace_name - it's the same namespace name that is used by primary symlink.

Yes, you're right. The reason behind droping a name column is to make sure that no-one now or later tries to retrieve dataset by filtering name column in dataset table. Whenever this is required, dataset_symlinks table should be joined to make sure we look for alternative names.

The namespace_name column existed before together with namespace_uuid field. In my opinion, it should be removed but I found out it's out of scope for this PR.

Primary symlink is the one that is being send in regular dataset name, rather the ones in SymlinksDatasetFacet. That suggest question: what happens if two jobs produce the same dataset, and they have the identifiers in different order; that is one has namespace X and name A as primary, and second one has namespace Y and name B? Does the first one "win" and dataset is forever known as (X, A) pair?

Yes, the first (namespace/name) becomes primary and I think it's the best we can do.

Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. Is it going to conflict with the delete feature @mobuchowski is working on?

api/src/main/java/marquez/db/DatasetDao.java Outdated Show resolved Hide resolved
Copy link
Collaborator

@collado-mike collado-mike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.

@pawel-big-lebowski
Copy link
Collaborator Author

pawel-big-lebowski commented Aug 31, 2022

Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.

In order to confirm that, one should know all the Marquez features to make sure datasets are not joined by name whenever symlinks should be applied. This creates an assumption which is not stated explicitly, so is error prone in future. That's why I am in favour of an assumption that dataset_symlinks should be always use to retrieve a dataset by name. If a dataset has two different names, it gets returned to matter how we request it.

Performance wise, this should not affect postgresql. It's just an extra join of index column to retrieve extra column (not used for filtering) and an uncorrelated subquery to get possible symlinks run before existing queries. Postgres is smart and joins on foreign keys, which are just applied on result rows, should be fast. Full-scans digging jsonb columns are slow, we know that and there is proposal to fix it.

To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues.

@mobuchowski
Copy link
Contributor

@julienledem great point. I talked with @pawel-big-lebowski and we aligned to use datasets_view to provide joined data from datasets and datasets_symlinks. This way, most of the SQL busywork behind this change will be opaque to queries.

With that, only issue is that "namespaces" aren't unique anymore - methods like

List<Dataset> findAll(String namespaceName, int limit, int offset)

still need to filter by is_primary field to not return duplicates.

@wslulciuc
Copy link
Member

wslulciuc commented Sep 7, 2022

To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues.

@pawel-big-lebowski I think @collado-mike's comment: "As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences" wasn't necessarily performance related (though there was follow up work on job symlinks needed to improve performance and implementation), but rather unexpected behavior given the number of queries modified. That said, I agree with you that "dataset_symlinks should be always use to retrieve a dataset by name", but would prefer we use datasets_view (introduced in #2032).

Note, we're hoping to add load testing to CI at some point to help with DB performance insight, see #2047

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also update our CHANGELOG.md with this sweet sweet feature!

@wslulciuc
Copy link
Member

The namespace_name column existed before together with namespace_uuid field. In my opinion, it should be removed but I found out it's out of scope for this PR.

@pawel-big-lebowski mind opening an issue to capture this change as follow up work?

@boring-cyborg boring-cyborg bot added the api API layer changes label Sep 8, 2022
@pawel-big-lebowski pawel-big-lebowski marked this pull request as draft September 8, 2022 09:51
@pawel-big-lebowski pawel-big-lebowski force-pushed the feature/dataset-symlinks branch 2 times, most recently from e435702 to 32d8785 Compare September 9, 2022 06:38
@boring-cyborg boring-cyborg bot added the docs label Sep 9, 2022
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review September 9, 2022 10:56
@pawel-big-lebowski
Copy link
Collaborator Author

pawel-big-lebowski commented Sep 9, 2022

@collado-mike , @mobuchowski and @wslulciuc, thank you for great reviews and feedback.

Changes applied according to that:

  • will reuse a dataset_view introduced by Maciej,
  • won't drop name column in datasets table to avoid risky migrations and limit amount of possible unexpected behavior changes.

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @pawel-big-lebowski! I left some additional comments, but looks good to merge overall once they are resolved 💯 🥇

api/src/main/java/marquez/db/DatasetSymlinkDao.java Outdated Show resolved Hide resolved
api/src/main/java/marquez/db/DatasetDao.java Outdated Show resolved Hide resolved
api/src/main/java/marquez/db/DatasetSymlinkDao.java Outdated Show resolved Hide resolved
api/src/main/java/marquez/db/OpenLineageDao.java Outdated Show resolved Hide resolved
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants