spark: Disable CLL for LogicalRDD plans #4329

Merged
mobuchowski merged 1 commit into OpenLineage:main from kchledowski:disable-rdd-cll
Feb 11, 2026

@kchledowski
Contributor

One-line summary for changelog:

Disable Column-Level Lineage extraction for Spark LogicalRDD plans to prevent incorrect lineage caused by lost schema and transformation context.

Meaningful description

Column-Level Lineage generated from Spark LogicalRDD plans is currently
unreliable. Although OpenLineage can still identify the underlying data
source associated with an RDD, all transformation context is lost. This
includes both:

  • transformations performed before converting the DataFrame to an RDD,
  • and transformations performed on the RDD itself.

When Spark constructs a LogicalRDD, it contains only the final row
schema, without any expression trees, renames, derived columns, or
projections that produced it. As a result, OpenLineage cannot determine
whether the output schema corresponds directly to the source schema or
has been altered by transformations.

This leads to a situation where:

  • lineage might be correct if no schema‑changing transformations
    occurred,
  • lineage might be incorrect if transformations happened before or
    during RDD processing,
  • and there is no way for OpenLineage to distinguish these cases.

Therefore, any column-level lineage produced from a LogicalRDD is
inherently unverifiable and may be misleading.

Example illustrating the problem

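import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.concat
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// Source table: (a, b)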
val df = spark.table("source_table")

// Derive new column 'c'
val transformed = df.select(
  $"a",
  $"b",
  concat($"a", $"b").as("c")
)

// Convert to RDD and perform another transformation on the RDD
val rdd = transformed.rdd.map { row =>
  // Example RDD-level transformation, e.g. uppercasing column c
  Row(
    row.getString(0),
    row.getString(1),
    row.getString(2).toUpperCase
  )
}

val finalDf = spark.createDataFrame(rdd, transformed.schema)

finalDf.write.saveAsTable("result_table")

Expected lineage:

result_table.c → source_table.a, source_table.b

Current lineage with LogicalRDD (note that source_table has no column c):

result_table.c → source_table.c

To avoid producing wrong lineage — which is worse than incomplete lineage — this PR disables Column-Level Lineage generation when Spark produces a LogicalRDD. This ensures correctness and avoids misleading downstream consumers.
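
The guard described above can be sketched roughly as follows. This is a minimal illustration in Java (the language this PR is labelled with) over a toy plan model, not the code merged in this PR: `PlanNode`, `containsLogicalRdd`, and `columnLineage` are hypothetical names, and the identity mapping merely stands in for real lineage extraction. It also assumes the simplest possible policy, namely dropping column-level lineage for the whole plan as soon as any node in it is a LogicalRDD.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a Spark logical plan. These are NOT Spark or OpenLineage classes.
final class LogicalRddGuardSketch {

    /** Hypothetical plan node: a kind label, output column names, and child nodes. */
    static final class PlanNode {
        final String kind;
        final List<String> output;
        final List<PlanNode> children;

        PlanNode(String kind, List<String> output, PlanNode... children) {
            this.kind = kind;
            this.output = output;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns true if the node itself or any descendant is an RDD-backed node. */
    static boolean containsLogicalRdd(PlanNode node) {
        if ("LogicalRDD".equals(node.kind)) {
            return true;
        }
        for (PlanNode child : node.children) {
            if (containsLogicalRdd(child)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns output column -> input columns, or an empty map when the plan
     * contains a LogicalRDD and the mapping therefore cannot be verified.
     */
    static Map<String, List<String>> columnLineage(PlanNode plan) {
        if (containsLogicalRdd(plan)) {
            return Collections.emptyMap(); // incomplete rather than possibly wrong
        }
        // Placeholder for real extraction: map every output column to itself.
        Map<String, List<String>> lineage = new LinkedHashMap<>();
        for (String column : plan.output) {
            lineage.put(column, Collections.singletonList(column));
        }
        return lineage;
    }

    public static void main(String[] args) {
        List<String> abc = Arrays.asList("a", "b", "c");
        PlanNode rdd = new PlanNode("LogicalRDD", abc);
        PlanNode write = new PlanNode("CreateTable", abc, rdd);
        System.out.println(columnLineage(write)); // prints {}

        List<String> ab = Arrays.asList("a", "b");
        PlanNode relation = new PlanNode("LogicalRelation", ab);
        PlanNode project = new PlanNode("Project", ab, relation);
        System.out.println(columnLineage(project)); // prints {a=[a], b=[b]}
    }
}
```

With this shape, a write whose input is a LogicalRDD leaf yields no column-level lineage at all, instead of the misleading result_table.c → source_table.c mapping, while plans without RDD-backed nodes are processed as before.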

Column-level lineage produced for Spark LogicalRDD plans cannot be
trusted. While the underlying data source can still be identified,
all transformation context—both before converting a DataFrame
to an RDD and during RDD-level operations—is lost. LogicalRDD
only retains the final schema, with no expression trees or
derivation history.

As a result, the lineage may accidentally be correct if no
schema-changing transformations occurred, but may also be incorrect when
columns are derived, renamed, or otherwise modified. Since there is
no way to distinguish these cases, the lineage becomes inherently
unverifiable and potentially misleading.

To avoid generating incorrect metadata, column-level lineage
extraction is disabled for LogicalRDD nodes.

Signed-off-by: kchledowski <github@chledowski.com>
@boring-cyborg bot added the area:integration/spark, area:tests, and language:java labels Feb 10, 2026
@kchledowski kchledowski marked this pull request as ready for review February 10, 2026 14:46
@kchledowski kchledowski requested a review from a team as a code owner February 10, 2026 14:46
@mobuchowski mobuchowski merged commit 316f697 into OpenLineage:main Feb 11, 2026
33 checks passed