spark: Add config for disabling RDD event emitting #4118
FYI @tnazarew
To add more context, there are OpenLineage From my understanding, OpenLineage produces these two jobs when these RDDs are used:
I believe these will always show the inputs and outputs as the same dataset. This seems relevant only to how Spark works internally, not to data lineage tracking. However, there may be RDDs that do provide relevant data lineage. @kchledowski, can we selectively filter on specific RDDs as an enhancement to this PR, or is it only feasible to disable RDD metadata extraction entirely?
@luke-hoffman1 As far as I understand it from the code, the only RDDs that would generate any kind of input dataset in the event are:
If any events are triggered by other RDDs, they wouldn't have any input datasets, which I assume makes them largely irrelevant for data lineage. At this point I think it makes more sense to disable event emitting for RDDs completely; if filtering for specific RDDs is needed in the future, that would be a separate PR. To be clear, by "disable event emitting for RDDs completely" I mean only the cases where an RDD is the trigger for event emission. An RDD might also be part of a Logical Plan, where it would probably be relevant for data lineage, and in that case the events would still get emitted.
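As a rough illustration of the behaviour described in this comment, the decision could be sketched as follows. This is not the integration's actual code: `Trigger`, `FilterConfig`, and `should_emit` are hypothetical names, and the default value of the flag is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    """What caused the listener to consider emitting an event (hypothetical model)."""

    RDD = "rdd"                    # a bare RDD action with no logical plan attached
    LOGICAL_PLAN = "logical_plan"  # a SQL/DataFrame execution that has a logical plan


@dataclass(frozen=True)
class FilterConfig:
    # Mirrors spark.openlineage.filter.rddEventsDisabled.
    # Defaulting to False (events enabled) is an assumption.
    rdd_events_disabled: bool = False


def should_emit(trigger: Trigger, config: FilterConfig) -> bool:
    """Suppress the event only when the flag is set and an RDD is the trigger."""
    if config.rdd_events_disabled and trigger is Trigger.RDD:
        return False
    # Logical-plan executions still emit, even if an RDD appears inside the plan.
    return True
```

The point of keying the check on the trigger rather than on the mere presence of an RDD is that plans which happen to contain an RDD keep producing lineage events.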
The PR still needs some work (marked as draft): I think I've disabled too much, and I need to move the check to a different place in the code.
Signed-off-by: kchledowski <github@chledowski.com>
Problem
In some Spark environments, users may want to disable OpenLineage event emission specifically for RDD operations while keeping SQL-based operations enabled. Currently, the OpenLineage Spark integration only provides a global disable option (spark.openlineage.disabled) which turns off all lineage tracking. This creates a limitation where users cannot selectively disable RDD lineage while maintaining visibility into SQL operations.
Solution
Introduce a new configuration option spark.openlineage.filter.rddEventsDisabled that allows users to selectively disable OpenLineage event emission for RDD operations only.
One-line summary:
Add config for disabling RDD event emitting
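A minimal sketch of how a user might pass the new option to a Spark job, written as a small helper that assembles `--conf` arguments for `spark-submit`. The helper names `openlineage_conf` and `to_submit_args` are hypothetical, and passing the value as the string "true"/"false" is an assumption about how the option is parsed.

```python
# Listener class shipped by the OpenLineage Spark integration.
LISTENER = "io.openlineage.spark.agent.OpenLineageSparkListener"


def openlineage_conf(disable_rdd_events: bool) -> dict:
    """Build Spark conf entries that keep OpenLineage on but toggle RDD events."""
    return {
        "spark.extraListeners": LISTENER,
        # Option proposed in this PR; the "true"/"false" encoding is an assumption.
        "spark.openlineage.filter.rddEventsDisabled": str(disable_rdd_events).lower(),
    }


def to_submit_args(conf: dict) -> list:
    """Flatten a conf mapping into spark-submit style arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints the arguments one would append to a spark-submit command line.
    print(" ".join(to_submit_args(openlineage_conf(disable_rdd_events=True))))
```

Unlike spark.openlineage.disabled, which turns off all lineage tracking, this setting is intended to leave SQL-based events untouched.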
Checklist
SPDX-License-Identifier: Apache-2.0
Copyright 2018-2025 contributors to the OpenLineage project