
[SPARK][FLINK][JAVA] Support yaml config files together with SparkConf/FlinkConf #2583

Merged
merged 1 commit into from
Apr 22, 2024

Conversation

pawel-big-lebowski
Contributor

@pawel-big-lebowski pawel-big-lebowski commented Apr 5, 2024

Problem

The client-java documentation says configuration can be read directly from an openlineage.yaml file, yet the Spark integration reads config entries from SparkConf only, and the Flink integration reads them from either the file or the Flink conf, but never both combined. This is confusing for users.

Problems:

  • There is no way to add Flink-specific config to the Flink integration.
  • Some config entries in the Flink conf will not work if passed through the yaml file.
  • The Spark integration does not support reading config from a yaml file at all.

Solution

  • Rename OpenlineageYaml -> OpenLineageConfig.
  • Create SparkOpenLineageConfig and FlinkOpenLineageConfig, extending the above, to hold integration-specific entries.
  • Modify the code to use only the OpenLineageConfig classes.
  • Update the docs to mention that both mechanisms can be used interchangeably and that the final configuration is a merge of all values provided.
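The "merge of all values provided" behavior can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual API: the class and method names below are hypothetical, and the precedence rule (SparkConf entries overriding yaml entries, with the result returned as a new object) is the general idea described in this PR.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigMergeSketch {
  // Hypothetical stand-in for one section of the OpenLineage config.
  static class TransportConfig {
    String type;                 // e.g. "http"
    String url;                  // e.g. endpoint URL
    Map<String, String> headers; // optional extra headers

    TransportConfig(String type, String url, Map<String, String> headers) {
      this.type = type;
      this.url = url;
      this.headers = headers;
    }

    /** Returns a NEW config; non-null fields of 'other' take precedence. */
    TransportConfig mergeWith(TransportConfig other) {
      Map<String, String> mergedHeaders = new HashMap<>();
      if (headers != null) mergedHeaders.putAll(headers);
      if (other.headers != null) mergedHeaders.putAll(other.headers);
      return new TransportConfig(
          other.type != null ? other.type : type,
          other.url != null ? other.url : url,
          mergedHeaders);
    }
  }

  public static void main(String[] args) {
    // values read from openlineage.yaml
    TransportConfig fromYaml =
        new TransportConfig("http", "http://localhost:5000", Map.of("a", "1"));
    // values read from SparkConf (spark.openlineage.* entries); url overrides yaml
    TransportConfig fromSparkConf =
        new TransportConfig(null, "http://prod:5000", Map.of("b", "2"));

    TransportConfig merged = fromYaml.mergeWith(fromSparkConf);
    // merged keeps type from yaml, takes url from SparkConf, unions headers
    System.out.println(merged.type + " " + merged.url + " " + merged.headers);
  }
}
```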

Please describe your change as it relates to the problem, or bug fix, as well as any dependencies. If your change requires a schema change, please describe the schema modification(s) and whether it's a backwards-incompatible or backwards-compatible change, then select one of the following:

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 3 times, most recently from 8982f93 to aaaba7f Compare April 9, 2024 11:43
@pawel-big-lebowski pawel-big-lebowski changed the title [SPARK][JAVA] Unify config entries [SPARK][JAVA] Allow config values through openlineage.yaml , SparkConf and FlinkConf Apr 9, 2024
@pawel-big-lebowski pawel-big-lebowski changed the title [SPARK][JAVA] Allow config values through openlineage.yaml , SparkConf and FlinkConf [SPARK][FLINK][JAVA] Support openlineage.yaml together with SparkConf/FlinkConf Apr 9, 2024
@pawel-big-lebowski pawel-big-lebowski changed the title [SPARK][FLINK][JAVA] Support openlineage.yaml together with SparkConf/FlinkConf [SPARK][FLINK][JAVA] Support yaml config files together with SparkConf/FlinkConf Apr 9, 2024
```java
@Getter @Setter private @Nullable Map<String, String> urlParams;
@Getter @Setter private @Nullable Map<String, String> headers;

@JsonProperty(access = JsonProperty.Access.WRITE_ONLY)
```
Contributor Author


Added to prevent serializing the field within the debug facet.
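For illustration, this is what Jackson's `WRITE_ONLY` access does: the annotated field is populated on deserialization but skipped on serialization, so it never leaks into emitted JSON such as a debug facet. The `Creds` class below is a hypothetical example, not OpenLineage code; it only assumes the standard `jackson-databind` library.

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class WriteOnlySketch {
  public static class Creds {
    public String user;
    // WRITE_ONLY: accepted when reading config JSON/yaml, but omitted when
    // the object is serialized back out (e.g. into a debug facet)
    @JsonProperty(access = JsonProperty.Access.WRITE_ONLY)
    public String apiKey;
  }

  public static void main(String[] args) {
    try {
      ObjectMapper mapper = new ObjectMapper();
      Creds c = mapper.readValue("{\"user\":\"u\",\"apiKey\":\"secret\"}", Creds.class);
      System.out.println(c.apiKey);                     // field was deserialized
      System.out.println(mapper.writeValueAsString(c)); // apiKey is omitted here
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```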

Member


👍

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 3 times, most recently from ea0c971 to 7d12b4c Compare April 10, 2024 13:09
@boring-cyborg boring-cyborg bot added the area:documentation Improvements or additions to documentation label Apr 10, 2024
@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 8 times, most recently from b8e85bf to 04f03c7 Compare April 11, 2024 11:53
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review April 11, 2024 12:20
```java
 * Contains methods that allow overwriting values of a config object with values
 * from another object
 */
public interface OverwriteConfig<T> {
```
Member


I see you've added this mechanism, but what's its actual purpose? Couldn't we just create a new config instance, rather than making the config mutable, when we need to merge config entries from multiple sources?

Contributor Author


I think we can go with the approach you describe: change the interface to return a new value instead. I will prepare a change for this.
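The immutable alternative agreed on above can be sketched as below. The names here are illustrative (the interface the PR ultimately ships is referred to later in this thread as MergeConfig): merging returns a fresh instance instead of mutating the receiver, so config objects stay immutable.

```java
public class ImmutableMergeSketch {
  // Hypothetical replacement for the mutating OverwriteConfig<T>.
  interface Mergeable<T> {
    /** Returns a NEW instance combining this config with 'other';
     *  non-null values from 'other' take precedence. */
    T mergeWith(T other);
  }

  static final class FacetsConfig implements Mergeable<FacetsConfig> {
    final String namespace;   // e.g. job namespace
    final String parentRunId; // e.g. parent run identifier

    FacetsConfig(String namespace, String parentRunId) {
      this.namespace = namespace;
      this.parentRunId = parentRunId;
    }

    @Override
    public FacetsConfig mergeWith(FacetsConfig other) {
      return new FacetsConfig(
          other.namespace != null ? other.namespace : namespace,
          other.parentRunId != null ? other.parentRunId : parentRunId);
    }
  }

  public static void main(String[] args) {
    FacetsConfig yaml = new FacetsConfig("default", "run-1");
    FacetsConfig conf = new FacetsConfig("prod", null);
    // both inputs are left untouched; merged takes "prod" and keeps "run-1"
    FacetsConfig merged = yaml.mergeWith(conf);
    System.out.println(merged.namespace + " " + merged.parentRunId);
  }
}
```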

```diff
@@ -31,33 +32,27 @@ public class EventEmitter {
   @Getter private String applicationJobName;
   @Getter private Optional<List<String>> customEnvironmentVariables;

-  public EventEmitter(ArgumentParser argument, String applicationJobName)
+  public EventEmitter(SparkOpenLineageConfig config, String applicationJobName)
```
Member


👍

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 3 times, most recently from dff1c24 to de990d0 Compare April 16, 2024 06:55
@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 3 times, most recently from 62c7ce0 to d5c8a3a Compare April 18, 2024 06:59
Contributor

@d-m-h d-m-h left a comment


I'm on a phone right now, so it's a bit difficult to see the changes; forgive me if I ask something obvious.

When converting from flat properties to a config object, how is this done?

Do we try to infer a hierarchy from dot-separated properties? For example, if we set `spark.openlineage.transport.foo.bar`, do we assume the object has the hierarchy:

```yaml
transport:
  foo:
    bar: $value
```

If so, this could pose a problem in my use case, as I am letting the custom transport act as a pass-through for the values the Spark integration provides. For example, I have the following property:

`spark.openlineage.transport.properties.cluster.federation`

The `cluster.federation` part is a property that the library my transport wraps expects. So if this code tries to infer hierarchy, it would break my integration.
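The concern above can be made concrete with a small sketch. This is a hypothetical illustration of naive dot-splitting, not the actual ArgumentParser logic: splitting every `.` turns the flat key into a nested map and loses the literal `cluster.federation` key the wrapped library expects.

```java
import java.util.HashMap;
import java.util.Map;

public class DotKeySketch {
  /** Naively splits a dotted key into nested maps, e.g. "a.b.c" -> {a={b={c=value}}}. */
  @SuppressWarnings("unchecked")
  static void put(Map<String, Object> root, String dottedKey, String value) {
    String[] parts = dottedKey.split("\\.");
    Map<String, Object> node = root;
    for (int i = 0; i < parts.length - 1; i++) {
      node = (Map<String, Object>)
          node.computeIfAbsent(parts[i], k -> new HashMap<String, Object>());
    }
    node.put(parts[parts.length - 1], value);
  }

  public static void main(String[] args) {
    Map<String, Object> cfg = new HashMap<>();
    // key relative to the spark.openlineage.transport prefix
    put(cfg, "properties.cluster.federation", "on");
    // naive inference yields {properties={cluster={federation=on}}}, not the
    // flat {properties={"cluster.federation"="on"}} a pass-through transport needs
    System.out.println(cfg);
  }
}
```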

@pawel-big-lebowski
Contributor Author

@d-m-h Actually this is already done within this piece of code -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/ArgumentParser.java#L88

I don't think this PR changes this behaviour anyway.

Member

@mobuchowski mobuchowski left a comment


Amazing job @pawel-big-lebowski. This change was very much needed. I'm approving, but please hold off on merging until we hear whether @d-m-h's concerns are answered.

One thing you can improve in the meantime is the comments on the MergeConfig class - I believe we can explain the semantics better and in simpler language.

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 2 times, most recently from ac1a464 to 7e5b0e2 Compare April 19, 2024 07:07
@pawel-big-lebowski
Contributor Author

@mobuchowski fixed docs for MergeConfig

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/configuration-refactor branch 2 times, most recently from 263055e to 42b2aac Compare April 22, 2024 11:08
Signed-off-by: Pawel Leszczynski <leszczynski.pawel@gmail.com>
@pawel-big-lebowski pawel-big-lebowski merged commit 10ad0aa into main Apr 22, 2024
42 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the spark/configuration-refactor branch April 22, 2024 11:55
@dolfinus dolfinus mentioned this pull request Apr 23, 2024
8 tasks
@dolfinus
Contributor

I see that changelog entry was added, but for 1.12.0 instead of Unreleased. Fix: #2636

@pawel-big-lebowski
Contributor Author

@dolfinus thank you, sry
