-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dataset symlinks provided #2087
Conversation
a7d9feb
to
eea3bb9
Compare
Codecov Report
@@ Coverage Diff @@
## main #2087 +/- ##
============================================
+ Coverage 75.30% 75.49% +0.19%
- Complexity 1038 1045 +7
============================================
Files 203 206 +3
Lines 4883 4925 +42
Branches 399 399
============================================
+ Hits 3677 3718 +41
Misses 763 763
- Partials 443 444 +1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
eea3bb9
to
eb67b0f
Compare
@pawel-big-lebowski let me try to summarize changes to model: after this change, internally Primary symlink is the one that is being send in regular dataset name, rather the ones in |
Yes, you're right. The reason behind droping a The
Yes, the first (namespace/name) becomes primary and I think it's the best we can do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me. Is it going to conflict with the delete feature @mobuchowski is working on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.
In order to confirm that, one should know all the Marquez features to make sure datasets are not joined by name whenever symlinks should be applied. This creates an assumption which is not stated explicitly, so is error prone in future. That's why I am in favour of an assumption that Performance wise, this should not affect postgresql. It's just an extra join of index column to retrieve extra column (not used for filtering) and an uncorrelated subquery to get possible symlinks run before existing queries. Postgres is smart and joins on foreign keys, which are just applied on result rows, should be fast. Full-scans digging jsonb columns are slow, we know that and there is proposal to fix it. To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues. |
api/src/main/resources/marquez/db/migration/V46__dataset_symlinks.sql
Outdated
Show resolved
Hide resolved
@julienledem great point. I talked with @pawel-big-lebowski and we aligned to use With that, only issue is that "namespaces" aren't unique anymore - methods like
still need to filter by |
api/src/main/resources/marquez/db/migration/V46__dataset_symlinks.sql
Outdated
Show resolved
Hide resolved
api/src/main/resources/marquez/db/migration/V46__dataset_symlinks.sql
Outdated
Show resolved
Hide resolved
@pawel-big-lebowski I think @collado-mike's comment: "As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences" wasn't necessarily performance related (though there was follow up work on job symlinks needed to improve performance and implementation), but rather unexpected behavior given the number of queries modified. That said, I agree with you that "dataset_symlinks should be always use to retrieve a dataset by name", but would prefer we use Note, we're hoping to add load testing to CI at some point to help with DB performance insight, see #2047 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also update our CHANGELOG.md
with this sweet sweet feature!
@pawel-big-lebowski mind opening an issue to capture this change as follow up work? |
eb67b0f
to
fda3425
Compare
e435702
to
32d8785
Compare
32d8785
to
f25b6f5
Compare
@collado-mike , @mobuchowski and @wslulciuc, thank you for great reviews and feedback. Changes applied according to that:
|
api/src/main/resources/marquez/db/migration/V47__dataset_symlinks.sql
Outdated
Show resolved
Hide resolved
api/src/main/resources/marquez/db/migration/V47__dataset_symlinks.sql
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, @pawel-big-lebowski! I left some additional comments, but looks good to merge overall once they are resolved 💯 🥇
f25b6f5
to
2f57f19
Compare
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
2f57f19
to
60c423d
Compare
Signed-off-by: Pawel Leszczynski leszczynski.pawel@gmail.com
Problem
A
SymlinkDatasetFacet
has been introduced in spec recently (OpenLineage/OpenLineage#936) and it allows a dataset to be identified by multiple(name, namespace)
tuples. We need to modify Marquez to handle it, as currently many joins to dataset table are done directly based onname
andnamespace
value.Part of: #2066
Solution
This PR contains:
dataset_symlinks
table with dataset name.SymlinkDatasetFacet
logic.Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)