[SPARK][FLINK] job ownership facet #2533
Conversation
@@ -129,6 +129,7 @@ private static List<Tuple2<String, String>> filterProperties(SparkConf conf) {
        .filter(
            e ->
                e._1.startsWith("transport")
                    || e._1.startsWith("job.owners")
Why not just `job`?
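The diff above whitelists SparkConf keys by prefix before forwarding them. A minimal standalone sketch of that idea (the key names here are illustrative, not the integration's exact ones):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterPropertiesSketch {
    // Prefixes the integration forwards from SparkConf (mirrors the diff above).
    static final List<String> ALLOWED = List.of("transport", "job.owners");

    static Map<String, String> filter(Map<String, String> conf) {
        return conf.entrySet().stream()
            .filter(e -> ALLOWED.stream().anyMatch(p -> e.getKey().startsWith(p)))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("transport.type", "http");           // kept
        conf.put("job.owners.team", "data-platform"); // kept (hypothetical key)
        conf.put("app.name", "my-job");               // dropped
        System.out.println(filter(conf).keySet());    // [transport.type, job.owners.team]
    }
}
```

Filtering on a broader `job` prefix, as suggested, would just mean shortening the entry in `ALLOWED`.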
this.openLineageEventEmitter = openLineageEventEmitter;
this.openLineageYaml = openLineageYaml;
Can't we pass the specific object we use later rather than the whole `OpenLineageYaml`?
I think it's better to pass the whole config to the context for future use cases (for example, a config-based dataset facet builder). The problem is that for Spark we sometimes rely on Spark conf entries, which means that if you put a config entry into openlineage.yaml, it will not work.
@pawel-big-lebowski what if you have multiple different jobs with different owners using the same cluster? I don't think there's a way to properly assign ownership with the proposed mechanism then. What if we added some jobName -> owners mapping in the config? Is getting the proper job name too much of a hurdle for users?
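The jobName -> owners mapping floated here could look roughly like this (a hypothetical structure with made-up job names and ownership types, not an implemented OpenLineage mechanism):

```java
import java.util.Map;

public class JobOwnersMapping {
    // Hypothetical per-job ownership mapping with a cluster-wide default.
    static final Map<String, Map<String, String>> PER_JOB = Map.of(
        "nightly_etl", Map.of("team", "data-platform", "maintainer", "alice"),
        "ml_training", Map.of("team", "ml-infra"));
    static final Map<String, String> DEFAULT_OWNERS = Map.of("team", "unknown");

    // Fall back to the cluster-wide default when a job has no explicit entry.
    static Map<String, String> ownersFor(String jobName) {
        return PER_JOB.getOrDefault(jobName, DEFAULT_OWNERS);
    }
}
```

The open question in the thread is exactly the fallback step: whether users can reliably know the job name the integration will compute.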
This seems to be another problem. We do have two ways of providing config entries:
However, we're missing a way to combine those two approaches, which would make total sense, like: take common config entries from
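One way the two sources could be combined, sketched under the assumption that integration-specific entries (e.g. from SparkConf) should override openlineage.yaml defaults:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigMerge {
    // Overlay integration-side conf entries on top of openlineage.yaml
    // defaults; the integration side takes precedence on conflicts.
    static Map<String, String> merge(Map<String, String> yamlDefaults,
                                     Map<String, String> integrationConf) {
        Map<String, String> merged = new LinkedHashMap<>(yamlDefaults);
        merged.putAll(integrationConf);
        return merged;
    }
}
```

Whether the precedence should run this way (or the reverse) is part of the disagreement later in the thread.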
Here are my thoughts:
global vs. per-job: while the above approach simplifies configuration, including global or per-job settings might introduce unnecessary complexity. Maybe having a global configuration per client (or some default) and overriding it per job via the integration config could solve the problem? A bit of a side note: the Python client should have a similar way to provide job ownership, shouldn't it?
The job ownership facet is not such an example, as it's common to Spark and Flink, which is a good reason to keep its config in the Java client library.
I really would like to avoid explaining to users where to put each config entry. It makes sense for a user to put common environment-related properties into
From the Spark & Flink integration side, it would be useful to have a single object containing all the config entries to be passed with the context. I think
I think that even if we have non-client-library config entries, we should be able to access them through client library classes (like
I think the pure facet is one thing; how we use it can be a separate issue. There is no rule that we can't have per-integration configuration that exists slightly differently in two different integrations, if that's the best solution.
I would avoid putting anything integration-specific to
I very much disagree - if anything,
I would (mostly) go the opposite way: client-specific config should be translated to integration-specific config. The client, or rather what it's becoming, a generic helper library, should provide mechanisms that the integration defines how to use. For example, the client provides a circuit breaker or metrics, but it's up to the integration to use them.
@mobuchowski To wrap up the thoughts in the context of this PR: your suggestion is to use only Spark & Flink conf entries for the job ownership facet while not changing anything in
Is my understanding correct?
@pawel-big-lebowski we can provide a generic mechanism in
I think circuit breaker and metrics are good examples of the desired behavior -
BTW, if we're discussing generic configuration, I would rather have
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
I see that a changelog entry was added, but under 1.12.0 instead of Unreleased. Fix: #2636
Problem
The Spark and Flink integrations should be capable of passing job owner information from config entries, resulting in an `ownership` facet being attached to the OL event.

Solution
- add a config entry in the Java client
- consume the config entry in both Flink & Spark job facet generation code
- refactor on the Spark side to include OpenLineageYaml in OpenLineageContext; this should be useful for other features in the future as well
TODO: docs PR will be prepared once this PR gets approved.
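For reference, the job ownership facet carries a list of owners, each with a `name` and a `type`. A minimal sketch of assembling that shape from `<type> -> <name>` config entries, using plain maps rather than the client's generated model classes:

```java
import java.util.List;
import java.util.Map;

public class OwnershipFacetSketch {
    // Build the {"owners": [{"name": ..., "type": ...}]} shape from
    // config entries mapping an ownership type to an owner name.
    static Map<String, Object> ownershipFacet(Map<String, String> owners) {
        List<Map<String, String>> list = owners.entrySet().stream()
            .map(e -> Map.of("name", e.getValue(), "type", e.getKey()))
            .toList();
        return Map.of("owners", list);
    }
}
```

In the real integration the facet is attached to the job in the emitted OpenLineage event; this sketch only shows the payload shape.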