Simplify job hierarchy for Spark #1672
First, the current version (11.2023) of the integration creates job names based on `{spark.app.name}` for the application and `{spark.app.name}.{action.name}` for each action. In my opinion, there is nothing wrong with the verbosity, although it requires an extra amount of work on the backend side to merge multiple OL events into a single overview of a Spark job. However, I think there are other issues that I would consider problems:
I think the solution to this should be some
From discussion with @harels @pawel-big-lebowski: there are two different options we can talk about here:
We should think about whether we can make sure that introducing those won't cause a memory leak.
Associating the input/output datasets with the parent job (the Spark application) makes more sense for data catalog-based backends. Sending events at the application level (instead of the action level) can help improve the experience, as all the intermediate data assets would be skipped and we'd just have the input and output datasets for a job (the entire Spark app) run. It can be simpler.
There are a few topics here:
1. How to map events for Spark actions to the Spark application?
2. How to present the lineage information to the user? It's important to also recognise that aggregation is not a lossless transformation.
3. Whether inputs and outputs accumulate between events.
@mgorsk1 replying to OpenLineage/docs#268
Agreed.
I would describe what we need to do differently, since use of
This allows us to create a deeper hierarchy - for example, when we know that a Spark job is scheduled by Airflow or some other scheduler.
Ideally, our "unique identifier" of a job would be a composite aggregation of the names of all parent jobs - but for practical reasons, like making it easier to process OpenLineage data in relational databases, some duplication, like including

@jenspfaug @pawel-big-lebowski @wslulciuc I think we have consensus on Jens' 1st point as described above. I will work on adding application events to the Spark integration. As for consumer-side aggregation, I'd skip it for now unless we have additional input, since it introduces a possible loss of information.
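As a rough, purely illustrative sketch of the composite-identifier idea above (the level names and the dot-joining convention here are invented, not the actual naming scheme):

```python
# Hypothetical sketch: build a single job identifier from the chain of parent jobs.
# Each element is the name of one level in the hierarchy, outermost first.
parent_chain = [
    "my_airflow_dag",              # scheduler-level job (e.g. the Airflow DAG)
    "my_spark_task",               # the operator/task that launched the application
    "my_spark_app",                # the Spark application
    "execute_insert_into_table",   # the individual Spark action
]

# One possible composite identifier: dot-join the whole chain.
composite_job_name = ".".join(parent_chain)
print(composite_job_name)
# my_airflow_dag.my_spark_task.my_spark_app.execute_insert_into_table
```

In such a scheme the parent names are repeated inside each child's identifier, which is presumably the kind of duplication the comment above alludes to.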
@mobuchowski would you mind describing a whole example where a DAG spawns a task, which spawns a Spark application, which spawns a Spark action? What are the parent run id, parent namespace, and parent name going to be at each of the levels? Btw, in addition to
@jenspfaug I've created a diagram for that case. It describes the parent hierarchy of an Airflow DAG that has two operators, with the first one spawning a Spark job. The Spark job has two actions. This is the order of events that would come from the execution of the above:
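The original diagram and event listing are not reproduced here. Purely as an illustration of how the parent facets might look at two of the levels (all run IDs, namespaces, and names below are invented), a minimal sketch could be:

```python
# Hypothetical sketch of two lineage events, reduced to the fields relevant for
# the parent hierarchy. Run IDs, namespaces and job names are made up.
airflow_task_run_id = "11111111-1111-1111-1111-111111111111"
spark_app_run_id    = "22222222-2222-2222-2222-222222222222"

spark_application_start = {
    "eventType": "START",
    "run": {
        "runId": spark_app_run_id,
        "facets": {
            "parent": {  # points back at the Airflow task that spawned the application
                "run": {"runId": airflow_task_run_id},
                "job": {"namespace": "airflow_namespace", "name": "my_dag.spark_submit_task"},
            }
        },
    },
    "job": {"namespace": "spark_namespace", "name": "my_spark_app"},
}

spark_action_start = {
    "eventType": "START",
    "run": {
        "runId": "33333333-3333-3333-3333-333333333333",
        "facets": {
            "parent": {  # points back at the Spark application run
                "run": {"runId": spark_app_run_id},
                "job": {"namespace": "spark_namespace", "name": "my_spark_app"},
            }
        },
    },
    "job": {"namespace": "spark_namespace", "name": "my_spark_app.action1"},
}
```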
I'd rather move to using https://openlineage.io/docs/ fully for all documentation rather than scattered .md files. I'll add a doc there when the PR regarding this topic gets accepted.
Yes - added this in the PR.
Somehow I still fail to understand this hop where the parent of the Spark application is not the DAG but the SparkOperator. This overcomplicates things and brings little value in my opinion. Consider a comparison between SparkOperator and, for example, BashOperator (which can also be a source of lineage information, considering it accepts inlets and outlets kwargs):
So your design assumes that for just one operator a different behavior is desired than for any other operator in Airflow, and I don't see how that is a good thing.
@mobuchowski thank you very much for putting this together. From the discussion above I understood that there should be separate OnApplicationStart and OnApplicationEnd events, with the job namespace and name representing the Spark application itself. You only seem to account for events coming from the actions themselves. @mgorsk1 I see the following reasons for having the Spark application as the parent of the Spark actions.
Indeed, if BashOperator spawns another application that would produce lineage, that spawned process should be responsible for emitting lineage with the Airflow task as the parent run. I used the term application above because I believe that if BashOperator just uses Bash to do Bash-specific processing, then there should be no child for BashOperator. An example of that would be: `sort input.txt > sorted.txt`. The same goes for PythonOperator - if no application external to Airflow's execution environment is executed, then PythonOperator is the primary job that produces inputs and outputs.
@mgorsk1 to add to what @JDarDagran and @jenspfaug wrote, imagine multiple operators spawning different jobs, not only Spark - having granular information allows you to determine what actually spawned the job. Looking at your example, similarly, a DAG alone does not do anything, but spawns the actual jobs - the operators - that do the work. @jenspfaug - you are obviously right, I had not included those events in the previous graph. The fixed one is here, with the Spark action START/RUNNING/COMPLETE events grouped for clarity.
From Airflow's perspective, I believe the parent of the Spark application should be the specific operator/task that spawned it. An Airflow DAG is not granular enough - it can have a thousand tasks, so just pointing from the Spark job to the DAG does not seem sufficient. Having said that, DAG metadata can also be present in a lineage event facet, as a sort of additional context. On the Spark side, I think events should be aggregated at the metastore level and not decided on by the consumer (at least not by default), given the lossy character of aggregation and potentially consumer-dependent lineage context.
Merged in #2371 - I will close the issue, but please reopen it or create a new one if you have further comments. We're going with the ⏫ solution now, but that does not mean we can't revisit it further.
We currently name Spark jobs using the following naming pattern:

- `{spark.app.name}`: name of the Spark app
- `{spark.app.name}.{action.name}`: name of the Spark action

Although collecting lineage events for a parent job (the Spark app) and children (the actions) can be very useful, the level of integration might be too verbose. That is, input/output datasets are associated with the Spark actions, but in some cases, associating them with the parent job would be preferred.

I propose we allow users to toggle the verbosity of the integration level. For example, let's say we have the Spark app `MySparkApp` with actions (assuming the naming pattern above):

- `MySparkApp.action1`
- `MySparkApp.action2`
- `MySparkApp.action3`

where each Spark action has input/output datasets. This is the current default behavior. But when verbosity is disabled, the input/output datasets associated with the Spark actions would instead be associated only with `MySparkApp`, thereby simplifying the lineage graph and "flattening" the job hierarchy.
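As a minimal consumer-side sketch of this "flattened" view (not part of the integration; it assumes events are plain dicts whose job names follow the `{spark.app.name}.{action.name}` pattern and that carry lists of input/output dataset names):

```python
from collections import defaultdict

def flatten_to_app(events):
    """Collapse action-level events into a single app-level inputs/outputs view.

    Datasets that are both produced and consumed within the same app are treated
    as intermediates and dropped - which is exactly why this kind of aggregation
    is lossy, as noted in the discussion above.
    """
    apps = defaultdict(lambda: {"inputs": set(), "outputs": set()})
    for event in events:
        app_name = event["job"]["name"].split(".", 1)[0]  # "MySparkApp.action1" -> "MySparkApp"
        apps[app_name]["inputs"].update(event.get("inputs", []))
        apps[app_name]["outputs"].update(event.get("outputs", []))
    for summary in apps.values():
        intermediates = summary["inputs"] & summary["outputs"]
        summary["inputs"] -= intermediates
        summary["outputs"] -= intermediates
    return dict(apps)

# Made-up example events for the MySparkApp actions above.
events = [
    {"job": {"name": "MySparkApp.action1"}, "inputs": ["db.raw_a"], "outputs": ["db.tmp_1"]},
    {"job": {"name": "MySparkApp.action2"}, "inputs": ["db.tmp_1"], "outputs": ["db.tmp_2"]},
    {"job": {"name": "MySparkApp.action3"}, "inputs": ["db.tmp_2"], "outputs": ["db.final"]},
]

print(flatten_to_app(events))
# {'MySparkApp': {'inputs': {'db.raw_a'}, 'outputs': {'db.final'}}}
```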