Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spark: Change SparkPropertyFacetBuilder to support recording spark run time … #2523

Merged

Conversation

Ruihua98
Copy link
Contributor

@Ruihua98 Ruihua98 commented Mar 15, 2024

Change SparkPropertyFacetBuilder to support recording spark run time config

Problem

Modify SparkPropertyFacetBuilder to capture RuntimeConfig of Spark session, because the existing SparkPropertyFacet can only capture static config of spark context.
This Facet will be added in both RDD-related runs and SQL related runs.

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

One-line summary:

Modify SparkPropertyFacetBuilder to support recording spark run time config

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

…config.

Signed-off-by: Ruihua Wang <ruihuawang@microsoft.com>
@Ruihua98 Ruihua98 changed the title Change SparkPropertyFacetBuilder to support recording spark run time … spark: Change SparkPropertyFacetBuilder to support recording spark run time … Mar 15, 2024
@Ruihua98
Copy link
Contributor Author

@pawel-big-lebowski Moved the old PR to here!
spark: modify SparkPropertyFacetBuilder to support recording spark's runtime config #2480

@Ruihua98
Copy link
Contributor Author

Ruihua98 commented Mar 15, 2024

The old PR is reviewed. I just copied it to here to make it clean. List the old PR below.
#2480

@Ruihua98
Copy link
Contributor Author

@pawel-big-lebowski I found that after these lines of codes added, not all the openlineage events reported has a SparkPropertyFacet.
if (!(event instanceof SparkListenerJobStart) && !(event instanceof SparkListenerJobEnd)) { return; }
There might be a mismatch between openlineage status 'START'/'COMPLETE' and spark event type 'SparkListenerJobStart'/'SparkListenerJobEnd'. Seems that other types who extends SparkListenerEvent will also contribute a Openlineage event, but after we add these lines of code, these events will not have a SparkPropertyFacet.
Is this a normal use case? I'm a little bit worry about this.

@pawel-big-lebowski
Copy link
Contributor

First, I think CI is failing because of different reason:

 2024-03-15 17:10:48 ERROR io.openlineage.client.circuitBreaker.NoOpCircuitBreaker - OpenLineage callable failed to execute. Swallowing the exception {}
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.internal.SessionStateBuilder':
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1178) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:162) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at scala.Option.getOrElse(Option.scala:189) ~[iceberg-spark-runtime-3.3_2.12-0.14.0.jar:?]
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:160) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:157) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:185) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:185) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at io.openlineage.spark.agent.facets.builder.SparkPropertyFacetBuilder.lambda$buildFacet$4(SparkPropertyFacetBuilder.java:77) ~[openlineage-spark-shared_2.12-1.11.0-SNAPSHOT.jar:?]

It looks to me as if there was a problem with SparkSession at the end event - which is possible if a whole spark context is already down.

My understanding of the code is:

  • before this change facet was added for SparkListenerJobStart events only,
  • now it's also included for SparkListenerJobEnd

My previous comment on adding the condition was based on your comment:

This is because attributes in run-time config might change during the running of a job. However, Java doen't support grammar like 'public void build(SparkListenerJobStart|SparkListenerJobEnd event)'. So we change the parameter of this function from SparkListenerJobStart to SparkListenerEvent.

I thought you want to add the facet for SparkListenerJobStart|SparkListenerJobEnd only. In case of adding the facet to all the events, the condition is not necessary.

…ception

Signed-off-by: ruihuawang <ruihuawang@microsoft.com>
@Ruihua98 Ruihua98 force-pushed the spark-property-facet-modification branch from fc5727a to ec58be3 Compare March 19, 2024 02:54
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>
@Ruihua98
Copy link
Contributor Author

Ruihua98 commented Mar 19, 2024

First, I think CI is failing because of different reason:

 2024-03-15 17:10:48 ERROR io.openlineage.client.circuitBreaker.NoOpCircuitBreaker - OpenLineage callable failed to execute. Swallowing the exception {}
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.internal.SessionStateBuilder':
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1178) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:162) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at scala.Option.getOrElse(Option.scala:189) ~[iceberg-spark-runtime-3.3_2.12-0.14.0.jar:?]
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:160) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:157) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:185) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:185) ~[spark-sql_2.12-3.3.4.jar:3.3.4]
        at io.openlineage.spark.agent.facets.builder.SparkPropertyFacetBuilder.lambda$buildFacet$4(SparkPropertyFacetBuilder.java:77) ~[openlineage-spark-shared_2.12-1.11.0-SNAPSHOT.jar:?]

It looks to me as if there was a problem with SparkSession at the end event - which is possible if a whole spark context is already down.

My understanding of the code is:

  • before this change facet was added for SparkListenerJobStart events only,
  • now it's also included for SparkListenerJobEnd

My previous comment on adding the condition was based on your comment:

This is because attributes in run-time config might change during the running of a job. However, Java doen't support grammar like 'public void build(SparkListenerJobStart|SparkListenerJobEnd event)'. So we change the parameter of this function from SparkListenerJobStart to SparkListenerEvent.

I thought you want to add the facet for SparkListenerJobStart|SparkListenerJobEnd only. In case of adding the facet to all the events, the condition is not necessary.

I apologize for my misleading comments. What we really want is that: each open lineage event with Start/Complete status has this SparkPropertyFacet. Previously I believed SparkListenerJobStart is OpenLineage START status and SparkListenerJobEnd is OpenLineage COMPLETE status. Recently I found I was wrong. So I finally delete the checking logic. By the way, I also change the error handling logic to fix the CI. I change the catcher to super class RuntimeException and now it can handle all the cases in CI.
All the above changes has been submitted in this PR. @pawel-big-lebowski

@Ruihua98
Copy link
Contributor Author

@pawel-big-lebowski
Seems that all the checks have been passed. Could we merge this PR now?

@pawel-big-lebowski pawel-big-lebowski merged commit c82c208 into OpenLineage:main Mar 20, 2024
31 checks passed
Copy link

boring-cyborg bot commented Mar 20, 2024

Great job! Congrats on your first merged pull request in OpenLineage!

mobuchowski pushed a commit that referenced this pull request Mar 29, 2024
…n time … (#2523)

* Change SparkPropertyFacetBuilder to support recording spark run time config.

Signed-off-by: Ruihua Wang <ruihuawang@microsoft.com>

* Remove checking logic for SparkListener instances and catch RuntimeException

Signed-off-by: ruihuawang <ruihuawang@microsoft.com>

* Fix format
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>

* Fix format again
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>

---------

Signed-off-by: Ruihua Wang <ruihuawang@microsoft.com>
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>
Co-authored-by: pawel.leszczynski <leszczynski.pawel@gmail.com>
blacklight pushed a commit to blacklight/OpenLineage that referenced this pull request Apr 4, 2024
…n time … (OpenLineage#2523)

* Change SparkPropertyFacetBuilder to support recording spark run time config.

Signed-off-by: Ruihua Wang <ruihuawang@microsoft.com>

* Remove checking logic for SparkListener instances and catch RuntimeException

Signed-off-by: ruihuawang <ruihuawang@microsoft.com>

* Fix format
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>

* Fix format again
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>

---------

Signed-off-by: Ruihua Wang <ruihuawang@microsoft.com>
Signed-off-by: ruihuawang <ruihuawang@microsoft.com>
Co-authored-by: pawel.leszczynski <leszczynski.pawel@gmail.com>
Signed-off-by: Fabio Manganiello <fabio@manganiello.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants