
[SPARK][FLINK] job ownership facet #2533

Merged
merged 1 commit into main from job-ownership on Apr 22, 2024
Conversation

@pawel-big-lebowski
Contributor

commented Mar 22, 2024

Problem

The Spark and Flink integrations should be capable of passing job owner information from config entries, resulting in an ownership facet being attached to the OL event.

Solution

  • add a config entry in the Java client (see the config sketch after this list)

  • consume the config entry in both the Flink & Spark job facet generation code

  • refactor on the Spark side to include OpenLineageYaml in OpenLineageContext; this should be useful for other features in the future as well

  • TODO: a docs PR will be prepared once this PR gets approved
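
As mentioned in the first bullet, here is a rough sketch of the intended configuration on the Spark side. Only the "job.owners" prefix is confirmed by the diff discussed below; the exact keys and values are illustrative assumptions:

    # Hypothetical entries - only the "job.owners" prefix is confirmed by this PR
    spark.openlineage.transport.type=http
    spark.openlineage.job.owners.team=data-platform
    spark.openlineage.job.owners.engineer=jane.doe

The idea is that each job.owners.<type>=<name> entry would surface as one owner in the job ownership facet attached to emitted OL events.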


Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)


@@ -129,6 +129,7 @@ private static List<Tuple2<String, String>> filterProperties(SparkConf conf) {
         .filter(
             e ->
                 e._1.startsWith("transport")
+                    || e._1.startsWith("job.owners")
Member

Why not just job?

this.openLineageEventEmitter = openLineageEventEmitter;
this.openLineageYaml = openLineageYaml;
Member

Can't we pass the specific object that we're using later, rather than the whole OpenLineageYaml?

Contributor Author

I think it's better to pass the whole config to the context for future use cases (for example, a config-based dataset facet builder). The problem is that for Spark we sometimes rely on Spark conf entries, which means that if you put a config entry into openlineage.yaml, it will not work.
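
A minimal sketch of the approach described in this reply - the context carrying the whole parsed config so that future consumers (such as a config-based dataset facet builder) can reach arbitrary entries. Class and member names are illustrative, not the PR's actual code:

    // Illustrative sketch: the context holds the full parsed config,
    // not just the pieces the emitter needs today.
    public class OpenLineageContext {
      private final EventEmitter openLineageEventEmitter;
      private final OpenLineageYaml openLineageYaml;

      public OpenLineageContext(EventEmitter emitter, OpenLineageYaml yaml) {
        this.openLineageEventEmitter = emitter;
        this.openLineageYaml = yaml;
      }

      // Future features (e.g. a config-based dataset facet builder)
      // can read any config entry from here.
      public OpenLineageYaml getOpenLineageYaml() {
        return openLineageYaml;
      }
    }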

@mobuchowski
Member

@pawel-big-lebowski what if you have multiple different jobs with different owners using the same cluster? I don't think there's a way to properly assign ownership with the proposed mechanism then.

What if we added a jobName -> owners mapping in the config? Is getting the proper job name too much of a hurdle for users?
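
A purely hypothetical openlineage.yaml fragment sketching the jobName -> owners mapping suggested here (no such key exists in this PR):

    # Hypothetical shape, for illustration only
    job:
      owners:
        daily_ingest_job:
          team: data-platform
        nightly_training_job:
          team: ml-infra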

@pawel-big-lebowski
Contributor Author

> @pawel-big-lebowski what if you have multiple different jobs with different owners using the same cluster? I don't think there's a way to properly assign ownership with the proposed mechanism then.
>
> What if we added a jobName -> owners mapping in the config? Is getting the proper job name too much of a hurdle for users?

This seems to be a separate problem. We have two ways of providing config entries:

  • an openlineage.yaml file, which is convenient for global OL settings shared by all the jobs run in the cluster

  • Spark config entries, which are convenient on a per-job basis

However, we're missing a way to combine those two approaches, which would make total sense: take the common config entries from openlineage.yaml and mix them with the job-specific Spark conf (see the merge sketch below). Otherwise we end up with two separate worlds of OpenLineage configuration entries, as we have now.
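
A sketch of the missing combination step, assuming both sources are flattened into string maps and that per-job Spark conf entries should win over the global file (names are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    class ConfigMerger {
      // Start from the global openlineage.yaml entries, then let
      // job-specific Spark conf entries override them.
      static Map<String, String> merge(
          Map<String, String> yamlEntries, Map<String, String> sparkConfEntries) {
        Map<String, String> merged = new HashMap<>(yamlEntries);
        merged.putAll(sparkConfEntries); // per-job settings take precedence
        return merged;
      }
    }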

@JDarDagran
Contributor

Here are my thoughts:

  • integration-specific configuration: settings unique to each integration (like Spark) should be defined solely within that integration's configuration system (e.g. Spark configuration). This keeps things clean and avoids overloading the client configuration.

  • common client-side configuration: general settings applicable across integrations belong in the client configuration. This allows for centralized management and easier updates. You can leverage approaches like providing an openlineage.yaml file or environment variables for client configuration.

  • integration-to-client configuration translation to bridge the gap: if specific integration settings affect the client's behavior, there should be a mechanism to translate them into client-side configuration. This ensures the client receives the necessary information regardless of the initial configuration source. In Airflow you can use Airflow's configuration mechanism to configure the Python client (unfortunately, we have to maintain the old environment variables as well, but that's a different story).

global vs. per-job:

While the above approach simplifies configuration, including global or per-job analysis settings might introduce unnecessary complexity. Maybe having a global configuration per client (or some default) and overriding it per job via the integration config could solve the problem?

A bit of a side note: the Python client should have a similar way to provide job ownership, shouldn't it?

@pawel-big-lebowski
Contributor Author

> Here are my thoughts:
>
> • integration-specific configuration: settings unique to each integration (like Spark) should be defined solely within that integration's configuration system (e.g. Spark configuration). This keeps things clean and avoids overloading the client configuration.

The job ownership facet is not such an example, as it's common to Spark and Flink - which is a good reason to keep its config in the Java client library.

> • common client-side configuration: general settings applicable across integrations belong in the client configuration. This allows for centralized management and easier updates. You can leverage approaches like providing an openlineage.yaml file or environment variables for client configuration.

I would really like to avoid explaining to users where to put each config entry. It makes sense for a user to put common, environment-related properties into openlineage.yaml while keeping job-specific ones in spark.conf. But we shouldn't enforce any distinction between config properties on our end.

From the Spark & Flink integration side, it would be useful to have a single object containing all the config entries, passed along with the context. I think OpenLineageYaml is the object for this, and it should be extended with a Map<String, String> to hold other config entries that don't have to be used purely by the Java client.

> • integration-to-client configuration translation to bridge the gap: if specific integration settings affect the client's behavior, there should be a mechanism to translate them into client-side configuration. This ensures the client receives the necessary information regardless of the initial configuration source. In Airflow you can use Airflow's configuration mechanism to configure the Python client (unfortunately, we have to maintain the old environment variables as well, but that's a different story).

I think that even if we have non-client-library config entries, we should be able to access them through client library classes (like OpenLineageYaml), because having a consistent way to pass configuration, from the user's perspective, is more important.

> global vs. per-job:
>
> While the above approach simplifies configuration, including global or per-job analysis settings might introduce unnecessary complexity. Maybe having a global configuration per client (or some default) and overriding it per job via the integration config could solve the problem?
>
> A bit of a side note: the Python client should have a similar way to provide job ownership, shouldn't it?

@dolfinus
Contributor

commented Apr 4, 2024

> It makes sense for a user to put common, environment-related properties into openlineage.yaml

Deploying an openlineage.yml file in different Spark modes (client, cluster) may be tricky. IMHO it's better if the configuration can be passed either via a .yml file or via explicitly set Spark session config.

@mobuchowski
Member

commented Apr 4, 2024

> The job ownership facet is not such an example, as it's common to Spark and Flink - which is a good reason to keep its config in the Java client library.

I think the pure facet is one thing; how we use it can be a separate issue. There is no rule that we can't have per-integration configuration that exists slightly differently in two different integrations, if that's the best solution.

> I would really like to avoid explaining to users where to put each config entry. It makes sense for a user to put common, environment-related properties into openlineage.yaml while keeping job-specific ones in spark.conf. But we shouldn't enforce any distinction between config properties on our end.

I would avoid putting anything integration-specific into openlineage.yaml, regardless of whether it's used per-job, per-cluster, or in some other way.

> From the Spark & Flink integration side, it would be useful to have a single object containing all the config entries, passed along with the context. I think OpenLineageYaml is the object for this, and it should be extended with a Map<String, String> to hold other config entries that don't have to be used purely by the Java client.

I very much disagree - if anything, OpenLineageYaml should be a source (or one of the sources) for creating an integration-specific object with the relevant context. We should deal less in stringly-typed comparisons and more in objects constructed from that configuration. In Spark, this way of dealing with things already exists: OpenLineageContext.

> integration-to-client configuration translation to bridge the gap: if specific integration settings affect the client's behavior, there should be a mechanism to translate them into client-side configuration. This ensures the client receives the necessary information regardless of the initial configuration source. In Airflow you can use Airflow's configuration mechanism to configure the Python client (unfortunately, we have to maintain the old environment variables as well, but that's a different story).

I would (mostly) go the opposite way - client-specific config should be translated into integration-specific config. The client - or rather what it's becoming, a generic helper library - should provide mechanisms whose use the integration defines. For example, the client provides a circuit breaker or metrics, but it's up to the integration to use them.

@pawel-big-lebowski
Contributor Author

@mobuchowski To wrap up the thoughts in the context of this PR: your suggestion is to use only Spark & Flink conf entries for the job ownership facet, while not changing anything in openlineage-java. This leads to some duplicated code, but that's acceptable because this feature doesn't have anything in common with openlineage-java. Users will not be able to configure job ownership via openlineage.yaml, and this shouldn't be considered a problem.

Is my understanding correct?

@mobuchowski
Member

@pawel-big-lebowski we can provide a generic mechanism in openlineage-java - but it can't assume how particular integrations will use it, and especially not that they will serialize some data into an untyped Map<String, String> passed with OpenLineageYaml.

I think the circuit breaker and metrics are good examples of the desired behavior - openlineage-java acting as a library - but I'm not sure it would be the best solution here, as it can't cleanly deal with per-job vs per-cluster config, so I would consider a mechanism specific to the integration here.
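
To make the "client provides the mechanism, integration decides the policy" idea concrete, a hypothetical sketch - these types are not the real openlineage-java API:

    // Hypothetical: the generic helper library defines the mechanism...
    interface CircuitBreaker {
      boolean isClosed(); // emission is allowed only while the breaker is closed
    }

    // ...and each integration decides when and how to apply it.
    class SparkEmitLoop {
      private final CircuitBreaker breaker;

      SparkEmitLoop(CircuitBreaker breaker) {
        this.breaker = breaker;
      }

      void emitIfAllowed(Runnable emit) {
        if (breaker.isClosed()) {
          emit.run();
        }
      }
    }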

@mobuchowski
Member

BTW, if we're discussing generic configuration, I would rather have an OpenLineageConfig (renamed from OpenLineageYaml) that integration-specific configs (like SparkOpenLineageConfig) derive from, and which would create the internal object used in OpenLineageContext. There should be no need for stringly-typed checks in code outside of the objects it creates - such checks strongly couple general code to specific configuration options rather than to interfaces.
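
A sketch of the hierarchy proposed in this comment; the class names follow the comment itself, while the fields and accessors are illustrative:

    import java.util.List;
    import java.util.Map;

    // Generic config owned by the client library (renamed from OpenLineageYaml)...
    class OpenLineageConfig {
      protected Map<String, String> transportProperties;
    }

    // ...derived by integration-specific configs that expose typed accessors,
    // so downstream code consumes objects via OpenLineageContext instead of
    // doing stringly-typed prefix checks.
    class SparkOpenLineageConfig extends OpenLineageConfig {
      private List<String> jobOwners;

      List<String> getJobOwners() {
        return jobOwners;
      }
    }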

@pawel-big-lebowski force-pushed the job-ownership branch 2 times, most recently from 4446de3 to dedf3f7, on April 17, 2024 06:18
@pawel-big-lebowski changed the base branch from main to spark/configuration-refactor on April 17, 2024 06:19
@pawel-big-lebowski marked this pull request as draft on April 17, 2024 06:19
@pawel-big-lebowski force-pushed the spark/configuration-refactor branch 6 times, most recently from 42b2aac to 7eff37b, on April 22, 2024 11:21
Base automatically changed from spark/configuration-refactor to main April 22, 2024 11:55
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@mobuchowski merged commit d1bd39d into main on Apr 22, 2024
42 checks passed
@mobuchowski deleted the job-ownership branch on April 22, 2024 20:06
HuangZhenQiu pushed a commit to HuangZhenQiu/OpenLineage that referenced this pull request Apr 23, 2024
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@dolfinus
Contributor

commented Apr 23, 2024

I see that a changelog entry was added, but under 1.12.0 instead of Unreleased. Fix: #2636
