
[Spark Integration] environment-properties not showing up #2203

Closed
rodrigo-maia-manta opened this issue Oct 20, 2023 · 9 comments
Labels
area:integration/spark kind:bug Something isn't working

Comments

@rodrigo-maia-manta

Hello!
We are currently trying to work with the spark integration for OpenLineage in our Databricks instance.

We've recently started using the "environment-properties" attribute, which (in our context) carries the notebook path (for notebook runs) or the job run ID (for Databricks job runs). The problem is that these attributes are not always present, if present at all.
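For reference, the run facet we are looking for has roughly this shape in the emitted event; the runId and property values below are placeholders, and the exact key names may differ between notebook and job runs:

{
  "run": {
    "runId": "00000000-0000-0000-0000-000000000000",
    "facets": {
      "environment-properties": {
        "environment-properties": {
          "spark.databricks.notebook.path": "/Users/user@example.com/example_notebook",
          "spark.databricks.job.runId": "123456789"
        }
      }
    }
  }
}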

Problem:
"environment-properties" attribute is not present for all runID and sometimes is not present at all. (Maybe its being filtered by some condition)

Context:
Databricks platform
"openLineageVersion": "1.4.1"
"sparkVersion": "3.4.0"
"scalaVersion": "2.12.15"

Spark Cluster Config:
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.version v1
spark.openlineage.debugFacet enabled
spark.openlineage.transport.type console

Example Notebook
Simple Read and Write operation

**Resulting JSON payload / without environment-property attribute**
metadata.json

**Resulting JSON payload from another example / with environment-property attribute for some runIds**
metadata.json

@rodrigo-maia-manta
Author

Any ideas on how I can support the investigation of this issue?

@pawel-big-lebowski
Contributor

Any ideas on how I can support the investigation of this issue?

Am I correct that this feature was working and got suddenly broken? If so, could you help us pinpoint which OL release broke it?

@gerson23

gerson23 commented Apr 2, 2024

I ran into a similar issue recently: most of the START events were missing the environment-properties field, which broke the integration with the Purview-ADB-Lineage-Solution-Accelerator.

After some investigation, I could drill down to:

  • SparkListenerJobStart is skipped several times in logs:
INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerJobStart
  • When this is skipped, the DatabricksEnvironmentFacetBuilder class is never invoked, so environment-properties isn't populated. Hence it goes missing from the START event.

To test this possible cause, I removed the skipping logic from SparkSQLExecutionContext for the SparkListenerJobStart event and forced eventType = START. With this workaround, START events were properly populated and the integration with Purview worked again. However, it causes duplicated START events, so it shouldn't be the real fix.

Environment

  • Databricks: 14.3 (Spark 3.5.0, Scala 2.12)
  • OpenLineage: 1.8.0

Note: we were using OL 0.18 before, but that version also stopped working, so I guess this could be a side effect of a recent Databricks/Spark change.

@pawel-big-lebowski
Contributor

I was fixing this last week in PR #2537.
Within the PR, I've added an integration test to verify that the environment property facet is filled.
I hope you will be able to test this out in a few days.

@gerson23

gerson23 commented Apr 3, 2024

Thanks @pawel-big-lebowski. I've rebuilt and tested with the new jar, but this environment property issue is still happening.

Looking at the logs from a job I have, I could count:

  • 59 START events
  • of those, only 4 had the environment-property

I then changed the following line and tested the custom jar again with the same job:

EventType eventType = emittedOnSqlExecutionStart ? RUNNING : START;

Changed to EventType eventType = START;

Results:

  • 114 START events
  • of those, 56 had environment-property

So, my guess is that there is some race condition between the SparkListenerJobStart and SparkListenerSQLExecutionStart events. In most cases, though, we are getting START events from SparkListenerSQLExecutionStart and RUNNING events from SparkListenerJobStart. The issue is that the Databricks environment facet is only built from the latter.

Note: on the good-news side, #2537 really does look to resolve #2499 without adding the jars back.

@pawel-big-lebowski
Contributor

Our model allows sending only a single START event per run, so more START events is not better.

The issue you encounter may be related to these lines:

extends CustomFacetBuilder<SparkListenerJobStart, EnvironmentFacet> {
private Map<String, Object> dbProperties;

which restrict building this particular facet to SparkListenerJobStart events.
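For illustration, a stripped-down builder of this kind could look roughly like the sketch below. The class is hypothetical, and the build method signature and EnvironmentFacet constructor are approximated from the CustomFacetBuilder API, so they may not match the current codebase exactly. The point is that the first type parameter is what limits the facet to SparkListenerJobStart, so events produced from SparkListenerSQLExecutionStart never trigger it:

import io.openlineage.spark.agent.facets.EnvironmentFacet;
import io.openlineage.spark.api.CustomFacetBuilder;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import org.apache.spark.scheduler.SparkListenerJobStart;

// Hypothetical, simplified builder. The first generic parameter (SparkListenerJobStart)
// determines which Spark events this builder reacts to; anything else is ignored.
public class ExampleEnvironmentFacetBuilder
    extends CustomFacetBuilder<SparkListenerJobStart, EnvironmentFacet> {

  @Override
  protected void build(
      SparkListenerJobStart event, BiConsumer<String, ? super EnvironmentFacet> consumer) {
    Map<String, Object> props = new HashMap<>();
    props.put("example-property", "example-value"); // placeholder properties
    consumer.accept("environment-properties", new EnvironmentFacet(props));
  }
}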

Please keep in mind that the OpenLineage model is cumulative. So, as long as any event for a given run contains environment-property, it is perfectly fine. It is the backend that should be able to merge all the run-related events.
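As an illustration of what that consumer-side merging can look like, a backend could simply accumulate the environment-properties map per runId regardless of the event type that carried it. This is a minimal sketch, not part of OpenLineage, with the facet reduced to a plain map:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of consumer-side merging: accumulate environment properties per runId,
// no matter whether they arrived on a START, RUNNING or COMPLETE event.
public class RunFacetAccumulator {
  private final Map<String, Map<String, Object>> propsByRunId = new HashMap<>();

  public void onEvent(String runId, Map<String, Object> environmentProperties) {
    if (environmentProperties == null || environmentProperties.isEmpty()) {
      return; // this event did not carry the facet; nothing to merge
    }
    propsByRunId
        .computeIfAbsent(runId, id -> new HashMap<>())
        .putAll(environmentProperties); // later events may add or overwrite keys
  }

  public Map<String, Object> propertiesFor(String runId) {
    return propsByRunId.getOrDefault(runId, Map.of());
  }
}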

@gerson23

gerson23 commented Apr 4, 2024

Our model allows sending only a single START event per run. So, more start events does not seem better.

Yeah, I know. I only wanted to check if the RUNNING events were being generated by a SparkListenerJobStart, which is the case. So, as you mentioned, events are cumulative, therefore this looks to be correct behavior now.

extends CustomFacetBuilder<SparkListenerJobStart, EnvironmentFacet> {
private Map<String, Object> dbProperties;

I thought about changing this facet, or adding a different version of it, to accept SparkListenerSQLExecutionStart, but that event doesn't contain the Databricks job information.

I guess https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator needs to be updated in order to accept environment-property coming from RUNNING events as well. Then, IMHO this issue can be closed.

@kacpermuda
Contributor

@gerson23 Should this issue be closed?

@gerson23

gerson23 commented May 3, 2024

@kacpermuda Yes, we can close this. Thanks
