spark: Add config for disabling RDD event emitting#4118

Merged
tnazarew merged 2 commits into
OpenLineage:mainfrom
kchledowski:config-disable-rdd-lineage
Jan 7, 2026

Conversation

@kchledowski
Contributor

@kchledowski kchledowski commented Nov 4, 2025

Problem

In some Spark environments, users may want to disable OpenLineage event emission specifically for RDD operations while keeping SQL-based operations enabled. Currently, the OpenLineage Spark integration only provides a global disable option (spark.openlineage.disabled) which turns off all lineage tracking. This creates a limitation where users cannot selectively disable RDD lineage while maintaining visibility into SQL operations.

Solution

Introduce a new configuration option spark.openlineage.filter.rddEventsDisabled that allows users to selectively disable OpenLineage event emission for RDD operations only.
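A minimal sketch of how such a flag could be read, assuming it is a plain boolean that defaults to off. `RddFilterConfig` and `rddEventsDisabled` are hypothetical names used only for illustration and are not taken from the OpenLineage codebase; only the property key comes from this PR. In a real job the key would presumably be supplied like other Spark properties, for example with `--conf spark.openlineage.filter.rddEventsDisabled=true`.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the actual OpenLineage implementation.
final class RddFilterConfig {
    // Property key as named in this PR.
    static final String KEY = "spark.openlineage.filter.rddEventsDisabled";

    // Assumption: the option is off unless explicitly set to "true".
    static boolean rddEventsDisabled(Map<String, String> sparkConf) {
        return Boolean.parseBoolean(sparkConf.getOrDefault(KEY, "false"));
    }

    public static void main(String[] args) {
        // A plain map stands in for the Spark configuration here.
        Map<String, String> conf = Map.of(KEY, "true");
        System.out.println(rddEventsDisabled(conf));
    }
}
```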

One-line summary:

Add config for disabling RDD event emitting

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (not required for changes to tests, docs, or CI config)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2025 contributors to the OpenLineage project

@kchledowski kchledowski requested a review from a team as a code owner November 4, 2025 14:55
@boring-cyborg boring-cyborg Bot added area:documentation Improvements or additions to documentation area:integration/spark area:tests Testing code language:java Uses Java programming language labels Nov 4, 2025
@kchledowski
Contributor Author

FYI @tnazarew

@luke-hoffman1
Contributor

To add more context, some OpenLineage RunEvents contain lineage that might not make sense when it comes to tracking data movement across datasets. For example, the following three RDDs:

From my understanding, OpenLineage produces these two jobs when these RDDs are used:

  • map_partitions_file_scan
  • map_partitions_parallel_collection

I believe these will always show the inputs and outputs as the same dataset. This seems relevant only to how Spark works internally, not necessarily to data lineage tracking. However, there may be RDDs that do provide relevant data lineage. @kchledowski, can we selectively filter on specific RDDs as an enhancement to this PR, or is it only feasible to disable RDD metadata extraction entirely?

@kchledowski kchledowski marked this pull request as draft November 7, 2025 14:04
@kchledowski
Contributor Author

@luke-hoffman1 As far as I understand it from the code, the only RDDs that would generate any kind of input dataset in the event are:

  • HadoopRDD
  • NewHadoopRDD
  • FileScanRDD
  • ParallelCollectionRDD

If we have any events triggered by different RDDs, they wouldn't have any input datasets, which I assume would make them rather irrelevant in terms of data lineage.
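The list above can be sketched as a simple lookup, under the assumption that matching on the RDD class's simple name is enough for illustration. `RddInputSources` and `mayProduceInputDataset` are hypothetical names and do not reflect how the integration actually inspects RDDs.

```java
import java.util.Set;

// Hypothetical lookup for illustration; the real integration may detect these differently.
final class RddInputSources {
    // RDD types that, per the comment above, can yield an input dataset in an event.
    static final Set<String> WITH_INPUT_DATASET =
        Set.of("HadoopRDD", "NewHadoopRDD", "FileScanRDD", "ParallelCollectionRDD");

    // Returns true if an RDD with this simple class name may contribute an input dataset.
    static boolean mayProduceInputDataset(String rddSimpleName) {
        return WITH_INPUT_DATASET.contains(rddSimpleName);
    }
}
```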

At this point I think it makes more sense to disable event emitting for RDDs completely. If a need for RDD filtering comes up in the future, that would be a separate PR.

Also, when I say disable event emitting for RDDs completely, it means in cases where RDD is the trigger for event emission. It might be that RDD is part of a Logical Plan, where it would probably be relevant in terms of data lineage, in which case the events would still get emitted.
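The distinction described above can be sketched as a small decision function, assuming an event is classified by what triggered it. `RddEventFilter`, `Trigger`, and `shouldEmit` are hypothetical names for illustration only; the actual check in the integration lives elsewhere and may be structured differently.

```java
// Hypothetical decision logic for illustration; not the actual OpenLineage code.
final class RddEventFilter {
    // What caused the event: an RDD operation directly, or a SQL logical plan
    // (which may itself contain RDDs).
    enum Trigger { RDD, LOGICAL_PLAN }

    // Only RDD-triggered events are suppressed when the flag is set;
    // logical-plan events are emitted regardless of the flag.
    static boolean shouldEmit(Trigger trigger, boolean rddEventsDisabled) {
        if (trigger == Trigger.RDD) {
            return !rddEventsDisabled;
        }
        return true;
    }
}
```

With the flag enabled, `shouldEmit(Trigger.RDD, true)` is false while `shouldEmit(Trigger.LOGICAL_PLAN, true)` stays true, which matches the intent that RDDs inside a logical plan still contribute lineage.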

@kchledowski
Contributor Author

The PR still needs some work (marked as draft). I think I've disabled too much, so I need to move the check to a different place in the code.

@kchledowski kchledowski force-pushed the config-disable-rdd-lineage branch from db8de69 to e88ef92 Compare November 28, 2025 15:48
@kchledowski kchledowski marked this pull request as ready for review December 1, 2025 08:16
@kchledowski kchledowski force-pushed the config-disable-rdd-lineage branch from e88ef92 to ea32692 Compare December 10, 2025 11:32
Signed-off-by: kchledowski <github@chledowski.com>
Signed-off-by: kchledowski <github@chledowski.com>
@kchledowski kchledowski force-pushed the config-disable-rdd-lineage branch from ea32692 to 41eaf96 Compare December 10, 2025 14:37
@tnazarew tnazarew merged commit e03658a into OpenLineage:main Jan 7, 2026
25 checks passed

3 participants