spark: Disable CLL for LogicalRDD plans by kchledowski · Pull Request #4329 · OpenLineage/OpenLineage

kchledowski · 2026-02-10T14:12:58Z

One-line summary for changelog:

Disable Column-Level Lineage extraction for Spark LogicalRDD plans to prevent incorrect lineage caused by lost schema and transformation context.

Meaningful description

Column-Level Lineage generated from Spark LogicalRDD plans is currently
unreliable. Although OpenLineage can still identify the underlying data
source associated with an RDD, all transformation context is lost. This
covers both transformations performed before converting the DataFrame to
an RDD and transformations performed on the RDD itself.

When Spark constructs a LogicalRDD, it contains only the final row
schema, without any expression trees, renames, derived columns, or
projections that produced it. As a result, OpenLineage cannot determine
whether the output schema corresponds directly to the source schema or
has been altered by transformations.

This leads to a situation where lineage might be correct if no
schema-changing transformations occurred, might be incorrect if
transformations happened before or during RDD processing, and OpenLineage
has no way to distinguish these cases.

Therefore, any column-level lineage produced from a LogicalRDD is
inherently unverifiable and may be misleading.
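
The difference can be illustrated with a toy model. This is only a sketch under stated assumptions: `Expr`, `Plan`, `Project`, `SchemaOnlyRdd`, and `lineageOf` are hypothetical names invented for illustration, not Spark or OpenLineage classes. A projection node keeps the expressions that produced each column, so its inputs can be read off. A schema-only node keeps just column names, so the best a naive extractor can do is guess by name.

```scala
// Toy model, assuming a drastically simplified plan tree.
// None of these types are Spark or OpenLineage classes.
sealed trait Expr
final case class Col(name: String) extends Expr
final case class Concat(parts: List[Expr]) extends Expr
final case class Alias(child: Expr, name: String) extends Expr

sealed trait Plan
final case class Table(name: String, columns: List[String]) extends Plan
final case class Project(exprs: List[Expr], child: Table) extends Plan
// Stand-in for a LogicalRDD-like node: output column names only.
final case class SchemaOnlyRdd(columns: List[String], source: Table) extends Plan

object LineageSketch {
  // Source columns referenced anywhere inside an expression.
  def inputs(e: Expr): Set[String] = e match {
    case Col(n)        => Set(n)
    case Concat(parts) => parts.flatMap(inputs).toSet
    case Alias(c, _)   => inputs(c)
  }

  // Name of the output column an expression produces.
  def outputName(e: Expr): String = e match {
    case Alias(_, n) => n
    case Col(n)      => n
    case _: Concat   => "concat" // unnamed expression; placeholder name
  }

  // Maps each output column to fully qualified source columns.
  def lineageOf(plan: Plan): Map[String, Set[String]] = plan match {
    case Table(t, cols) =>
      cols.map(c => c -> Set(s"$t.$c")).toMap
    case Project(exprs, child) =>
      exprs.map(e => outputName(e) -> inputs(e).map(c => s"${child.name}.$c")).toMap
    case SchemaOnlyRdd(cols, src) =>
      // Only names are available here, so this is a guess by name.
      cols.map(c => c -> Set(s"${src.name}.$c")).toMap
  }

  def main(args: Array[String]): Unit = {
    val source = Table("source_table", List("a", "b"))
    val exprs = List(Col("a"), Col("b"), Alias(Concat(List(Col("a"), Col("b"))), "c"))
    // With expressions: c is traced to source_table.a and source_table.b.
    println(lineageOf(Project(exprs, source))("c"))
    // Schema only: c is attributed to source_table.c, which does not exist.
    println(lineageOf(SchemaOnlyRdd(List("a", "b", "c"), source))("c"))
  }
}
```

In this model the two plans expose the same output columns, yet only the first carries enough information to justify its lineage.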

Example illustrating the problem

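import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.concat
import spark.implicits._ // enables the $"..." column syntax

// Source table: (a, b)
val df = spark.table("source_table")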

// Derive new column 'c'
val transformed = df.select(
  $"a",
  $"b",
  concat($"a", $"b").as("c")
)

// Convert to RDD and perform another transformation on the RDD
val rdd = transformed.rdd.map { row =>
  // Example RDD-level transformation, e.g. uppercasing column c
  Row(
    row.getString(0),
    row.getString(1),
    row.getString(2).toUpperCase
  )
}

val finalDf = spark.createDataFrame(rdd, transformed.schema)

finalDf.write.saveAsTable("result_table")

Expected lineage:

result_table.c → source_table.a, source_table.b

Current lineage with LogicalRDD:

result_table.c → source_table.c

To avoid producing wrong lineage — which is worse than incomplete lineage — this PR disables Column-Level Lineage generation when Spark produces a LogicalRDD. This ensures correctness and avoids misleading downstream consumers.
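
In outline, this kind of change amounts to a guard in front of column-level lineage extraction. The sketch below is a toy model, not the code in this PR: `Node`, `Scan`, `Rename`, `RddLeaf`, `containsRddLeaf`, and `columnLineage` are hypothetical names, and checking the whole plan rather than individual nodes is an assumption made here for simplicity.

```scala
// Toy model; these are not Spark or OpenLineage types.
sealed trait Node { def children: List[Node] }
final case class Scan(table: String, columns: List[String]) extends Node {
  val children: List[Node] = Nil
}
// mapping: new column name -> old column name
final case class Rename(mapping: Map[String, String], child: Node) extends Node {
  val children: List[Node] = List(child)
}
// Stand-in for a LogicalRDD-like leaf that carries only a schema.
final case class RddLeaf(columns: List[String]) extends Node {
  val children: List[Node] = Nil
}

object CllGuardSketch {
  // True if the node or any descendant is a schema-only RDD leaf.
  def containsRddLeaf(node: Node): Boolean = node match {
    case _: RddLeaf => true
    case other      => other.children.exists(containsRddLeaf)
  }

  // Column lineage for plans the toy model can trace end to end.
  private def trace(node: Node): Map[String, Set[String]] = node match {
    case Scan(t, cols) =>
      cols.map(c => c -> Set(s"$t.$c")).toMap
    case Rename(mapping, child) =>
      val below = trace(child)
      mapping.map { case (newName, oldName) =>
        newName -> below.getOrElse(oldName, Set.empty[String])
      }
    case RddLeaf(_) =>
      Map.empty
  }

  // Guard: return nothing rather than a guess when an RDD leaf is present.
  def columnLineage(plan: Node): Option[Map[String, Set[String]]] =
    if (containsRddLeaf(plan)) None else Some(trace(plan))

  def main(args: Array[String]): Unit = {
    val traceable = Rename(Map("x" -> "a"), Scan("source_table", List("a", "b")))
    val opaque = Rename(Map("x" -> "a"), RddLeaf(List("a", "b")))
    println(columnLineage(traceable)) // lineage is emitted: x comes from source_table.a
    println(columnLineage(opaque))    // no lineage is emitted
  }
}
```

Returning `Option` makes the absence of lineage explicit to callers, which matches the stated preference for incomplete lineage over wrong lineage.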

Column-level lineage produced for Spark LogicalRDD plans cannot be trusted. While the underlying data source can still be identified, all transformation context (both before converting a DataFrame to an RDD and during RDD-level operations) is lost. LogicalRDD only retains the final schema, with no expression trees or derivation history. As a result, the lineage may accidentally be correct if no schema-changing transformations occurred, but may also be incorrect when columns are derived, renamed, or otherwise modified. Since there is no way to distinguish these cases, the lineage becomes inherently unverifiable and potentially misleading. To avoid generating incorrect metadata, column-level lineage extraction is disabled for LogicalRDD nodes.

Signed-off-by: kchledowski <github@chledowski.com>

boring-cyborg bot added the area:integration/spark, area:tests, and language:java labels Feb 10, 2026

kchledowski marked this pull request as ready for review February 10, 2026 14:46

kchledowski requested a review from a team as a code owner February 10, 2026 14:46

mobuchowski approved these changes Feb 11, 2026

mobuchowski merged commit 316f697 into OpenLineage:main Feb 11, 2026
33 checks passed

kchledowski mentioned this pull request Feb 13, 2026

spark: Disable input schema extraction for LogicalRDD and add schema extraction for Iceberg DataSourceRDD #4331

Merged

mobuchowski merged 1 commit into OpenLineage:main from kchledowski:disable-rdd-cll
