
[SPARK][FLINK] job ownership facet #2533

Merged
merged 1 commit into main from job-ownership on Apr 22, 2024
Conversation

@pawel-big-lebowski
Contributor

commented Mar 22, 2024

Problem

The Spark and Flink integrations should be capable of passing job owner information from config entries, resulting in an ownership facet being attached to the OL event.

Solution

  • add a config entry in the Java client (see the config sketch after this list)

  • consume the config entry in both the Flink & Spark job facet generation code

  • refactor on the Spark side to include OpenLineageYaml in OpenLineageContext; this should be useful for other features in the future as well

  • TODO: a docs PR will be prepared once this PR gets approved
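
As mentioned in the first bullet, here is a rough sketch of the intended configuration on the Spark side. Only the "job.owners" prefix is confirmed by the diff discussed below; the exact keys and values are illustrative assumptions:

    # Hypothetical entries - only the "job.owners" prefix is confirmed by this PR
    spark.openlineage.transport.type=http
    spark.openlineage.job.owners.team=data-platform
    spark.openlineage.job.owners.engineer=jane.doe

The idea is that each job.owners.<type>=<name> entry would surface as one owner in the job ownership facet attached to emitted OL events.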


Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)


@@ -129,6 +129,7 @@ private static List<Tuple2<String, String>> filterProperties(SparkConf conf) {
         .filter(
             e ->
                 e._1.startsWith("transport")
+                    || e._1.startsWith("job.owners")
Member

Why not just job?

this.openLineageEventEmitter = openLineageEventEmitter;
this.openLineageYaml = openLineageYaml;
Member

Can't we pass the specific object that we're using later, rather than the whole OpenLineageYaml?

Contributor Author

I think it's better to pass the whole config to the context for future use cases (for example, a config-based dataset facet builder). The problem is that for Spark we sometimes rely on Spark conf entries, which means that if you put a config entry into openlineage.yaml, it will not work.
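
A minimal sketch of the approach described in this reply - the context carrying the whole parsed config so that future consumers (such as a config-based dataset facet builder) can reach arbitrary entries. Class and member names are illustrative, not the PR's actual code:

    // Illustrative sketch: the context holds the full parsed config,
    // not just the pieces the emitter needs today.
    public class OpenLineageContext {
      private final EventEmitter openLineageEventEmitter;
      private final OpenLineageYaml openLineageYaml;

      public OpenLineageContext(EventEmitter emitter, OpenLineageYaml yaml) {
        this.openLineageEventEmitter = emitter;
        this.openLineageYaml = yaml;
      }

      // Future features (e.g. a config-based dataset facet builder)
      // can read any config entry from here.
      public OpenLineageYaml getOpenLineageYaml() {
        return openLineageYaml;
      }
    }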

@mobuchowski
Member

@pawel-big-lebowski what if you have multiple different jobs with different owners using the same cluster? I don't think there's a way to properly assign ownership with the proposed mechanism then.

What if we added a jobName -> owners mapping in the config? Is getting the proper job name too much of a hurdle for users?
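
A purely hypothetical openlineage.yaml fragment sketching the jobName -> owners mapping suggested here (no such key exists in this PR):

    # Hypothetical shape, for illustration only
    job:
      owners:
        daily_ingest_job:
          team: data-platform
        nightly_training_job:
          team: ml-infra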

@pawel-big-lebowski
Contributor Author

> @pawel-big-lebowski what if you have multiple different jobs with different owners using the same cluster? I don't think there's a way to properly assign ownership with the proposed mechanism then.
>
> What if we added a jobName -> owners mapping in the config? Is getting the proper job name too much of a hurdle for users?

This seems to be a separate problem. We have two ways of providing config entries:

  • an openlineage.yaml file, which is convenient for global OL settings shared by all the jobs run in the cluster

  • Spark config entries, which are convenient on a per-job basis

However, we're missing a way to combine those two approaches, which would make total sense: take the common config entries from openlineage.yaml and mix them with the job-specific Spark conf (see the merge sketch below). Otherwise we end up with two separate worlds of OpenLineage configuration entries, as we have now.
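
A sketch of the missing combination step, assuming both sources are flattened into string maps and that per-job Spark conf entries should win over the global file (names are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    class ConfigMerger {
      // Start from the global openlineage.yaml entries, then let
      // job-specific Spark conf entries override them.
      static Map<String, String> merge(
          Map<String, String> yamlEntries, Map<String, String> sparkConfEntries) {
        Map<String, String> merged = new HashMap<>(yamlEntries);
        merged.putAll(sparkConfEntries); // per-job settings take precedence
        return merged;
      }
    }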

@JDarDagran
Contributor

Here are my thoughts:

  • integration-specific configuration: settings unique to each integration (like Spark) should be defined solely within that integration's configuration system (e.g. Spark configuration). This keeps things clean and avoids overloading the client configuration.

  • common client-side configuration: general settings applicable across integrations belong in the client configuration. This allows for centralized management and easier updates. You can leverage approaches like providing an openlineage.yaml file or environment variables for client configuration.

  • integration-to-client configuration translation to bridge the gap: if specific integration settings affect the client's behavior, there should be a mechanism to translate them into client-side configuration. This ensures the client receives the necessary information regardless of the initial configuration source. In Airflow you can use Airflow's configuration mechanism to configure the Python client (unfortunately, we have to maintain the old environment variables as well, but that's a different story).

global vs. per-job:

While the above approach simplifies configuration, including global or per-job analysis settings might introduce unnecessary complexity. Maybe having a global configuration per client (or some default) and overriding it per job via the integration config could solve the problem?

A bit of a side note: the Python client should have a similar way to provide job ownership, shouldn't it?

@pawel-big-lebowski
Contributor Author

> Here are my thoughts:
>
> • integration-specific configuration: settings unique to each integration (like Spark) should be defined solely within that integration's configuration system (e.g. Spark configuration). This keeps things clean and avoids overloading the client configuration.

The job ownership facet is not such an example, as it's common to Spark and Flink - which is a good reason to keep its config in the Java client library.

> • common client-side configuration: general settings applicable across integrations belong in the client configuration. This allows for centralized management and easier updates. You can leverage approaches like providing an openlineage.yaml file or environment variables for client configuration.

I would really like to avoid explaining to users where to put each config entry. It makes sense for a user to put common, environment-related properties into openlineage.yaml while keeping job-specific ones in spark.conf. But we shouldn't enforce any distinction between config properties on our end.

From the Spark & Flink integration side, it would be useful to have a single object containing all the config entries, passed along with the context. I think OpenLineageYaml is the object for this, and it should be extended with a Map<String, String> to hold other config entries that don't have to be used purely by the Java client.

> • integration-to-client configuration translation to bridge the gap: if specific integration settings affect the client's behavior, there should be a mechanism to translate them into client-side configuration. This ensures the client receives the necessary information regardless of the initial configuration source. In Airflow you can use Airflow's configuration mechanism to configure the Python client (unfortunately, we have to maintain the old environment variables as well, but that's a different story).

I think that even if we have non-client-library config entries, we should be able to access them through client library classes (like OpenLineageYaml), because having a consistent way to pass configuration, from the user's perspective, is more important.

> global vs. per-job:
>
> While the above approach simplifies configuration, including global or per-job analysis settings might introduce unnecessary complexity. Maybe having a global configuration per client (or some default) and overriding it per job via the integration config could solve the problem?
>
> A bit of a side note: the Python client should have a similar way to provide job ownership, shouldn't it?

@dolfinus
Contributor

commented Apr 4, 2024

> It makes sense for a user to put common, environment-related properties into openlineage.yaml

Deploying an openlineage.yml file in different Spark modes (client, cluster) may be tricky. IMHO it's better if the configuration can be passed either via a .yml file or via explicitly set Spark session config.

@mobuchowski
Member

commented Apr 4, 2024

> The job ownership facet is not such an example, as it's common to Spark and Flink - which is a good reason to keep its config in the Java client library.

I think the pure facet is one thing; how we use it can be a separate issue. There is no rule that we can't have per-integration configuration that exists slightly differently in two different integrations, if that's the best solution.

> I would really like to avoid explaining to users where to put each config entry. It makes sense for a user to put common, environment-related properties into openlineage.yaml while keeping job-specific ones in spark.conf. But we shouldn't enforce any distinction between config properties on our end.

I would avoid putting anything integration-specific into openlineage.yaml, regardless of whether it's used per-job, per-cluster, or in some other way.

> From the Spark & Flink integration side, it would be useful to have a single object containing all the config entries, passed along with the context. I think OpenLineageYaml is the object for this, and it should be extended with a Map<String, String> to hold other config entries that don't have to be used purely by the Java client.

I very much disagree - if anything, OpenLineageYaml should be a source (or one of the sources) for creating an integration-specific object with the relevant context. We should deal less in stringly-typed comparisons and more in objects constructed from that configuration. In Spark, this way of dealing with things already exists: OpenLineageContext.

> integration-to-client configuration translation to bridge the gap: if specific integration settings affect the client's behavior, there should be a mechanism to translate them into client-side configuration. This ensures the client receives the necessary information regardless of the initial configuration source. In Airflow you can use Airflow's configuration mechanism to configure the Python client (unfortunately, we have to maintain the old environment variables as well, but that's a different story).

I would (mostly) go the opposite way - client-specific config should be translated into integration-specific config. The client - or rather what it's becoming, a generic helper library - should provide mechanisms whose use the integration defines. For example, the client provides a circuit breaker or metrics, but it's up to the integration to use them.

@pawel-big-lebowski
Contributor Author

@mobuchowski To wrap up the thoughts in the context of this PR: your suggestion is to use only Spark & Flink conf entries for the job ownership facet, while not changing anything in openlineage-java. This leads to some duplicated code, but that's acceptable because this feature doesn't have anything in common with openlineage-java. Users will not be able to configure job ownership via openlineage.yaml, and this shouldn't be considered a problem.

Is my understanding correct?

@mobuchowski
Member

@pawel-big-lebowski we can provide a generic mechanism in openlineage-java - but it can't assume how particular integrations will use it, and especially not that they will serialize some data into an untyped Map<String, String> passed with OpenLineageYaml.

I think the circuit breaker and metrics are good examples of the desired behavior - openlineage-java acting as a library - but I'm not sure it would be the best solution here, as it can't cleanly deal with per-job vs per-cluster config, so I would consider a mechanism specific to the integration here.
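
To make the "client provides the mechanism, integration decides the policy" idea concrete, a hypothetical sketch - these types are not the real openlineage-java API:

    // Hypothetical: the generic helper library defines the mechanism...
    interface CircuitBreaker {
      boolean isClosed(); // emission is allowed only while the breaker is closed
    }

    // ...and each integration decides when and how to apply it.
    class SparkEmitLoop {
      private final CircuitBreaker breaker;

      SparkEmitLoop(CircuitBreaker breaker) {
        this.breaker = breaker;
      }

      void emitIfAllowed(Runnable emit) {
        if (breaker.isClosed()) {
          emit.run();
        }
      }
    }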

@mobuchowski
Member

BTW, if we're discussing generic configuration, I would rather have an OpenLineageConfig (renamed from OpenLineageYaml) that integration-specific configs (like SparkOpenLineageConfig) derive from, and which would create the internal object used in OpenLineageContext. There should be no need for stringly-typed checks in code outside of the objects it creates - such checks strongly couple general code to specific configuration options rather than to interfaces.
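
A sketch of the hierarchy proposed in this comment; the class names follow the comment itself, while the fields and accessors are illustrative:

    import java.util.List;
    import java.util.Map;

    // Generic config owned by the client library (renamed from OpenLineageYaml)...
    class OpenLineageConfig {
      protected Map<String, String> transportProperties;
    }

    // ...derived by integration-specific configs that expose typed accessors,
    // so downstream code consumes objects via OpenLineageContext instead of
    // doing stringly-typed prefix checks.
    class SparkOpenLineageConfig extends OpenLineageConfig {
      private List<String> jobOwners;

      List<String> getJobOwners() {
        return jobOwners;
      }
    }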

@pawel-big-lebowski force-pushed the job-ownership branch 2 times, most recently from 4446de3 to dedf3f7, on April 17, 2024 06:18
@pawel-big-lebowski changed the base branch from main to spark/configuration-refactor on April 17, 2024 06:19
@pawel-big-lebowski marked this pull request as draft on April 17, 2024 06:19
@pawel-big-lebowski force-pushed the spark/configuration-refactor branch 6 times, most recently from 42b2aac to 7eff37b, on April 22, 2024 11:21
Base automatically changed from spark/configuration-refactor to main April 22, 2024 11:55
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@mobuchowski merged commit d1bd39d into main on Apr 22, 2024
42 checks passed
@mobuchowski deleted the job-ownership branch on April 22, 2024 20:06
HuangZhenQiu pushed a commit to HuangZhenQiu/OpenLineage that referenced this pull request Apr 23, 2024
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@dolfinus
Contributor

commented Apr 23, 2024

I see that a changelog entry was added, but under 1.12.0 instead of Unreleased. Fix: #2636
