Skip to content

[Fork-CI] PR #6830 (test classpath fix) on Spark 4.2 — fork-internal validation#2

Draft
SanJSp wants to merge 17 commits into
masterfrom
fix-changelog-fork-ci
Draft

[Fork-CI] PR #6830 (test classpath fix) on Spark 4.2 — fork-internal validation#2
SanJSp wants to merge 17 commits into
masterfrom
fix-changelog-fork-ci

Conversation

@SanJSp
Copy link
Copy Markdown
Owner

@SanJSp SanJSp commented May 21, 2026

Fork-internal CI validation for upstream draft PR delta-io/delta#6830. The official PR targets delta-io/delta:master where MASTER = None (4.0 / 4.1 only), so the Spark 4.2 row never runs in upstream CI. This branch pins the fork's matrix to actually exercise Spark 4.2.

What this branch contains

fix-changelog-fork-ci:
  e7f0a8c84  [Fork-CI] Pin Spark 4.2.0-preview4, enable MASTER, extend ALL_SPECS
  472195d3a  Merge delta_oss_upstream/pr-6657 (Murali + Vishnu Spark 4.2 ABI shims, 25 commits)
  027903d14  Fix CI                                                  <- PR #6830 commit 3
  c93506f40  Make V2 changelog tests pass on Spark 4.2               <- PR #6830 commit 2
  5a7bdd532  Move changelog tests to spark-unified                   <- PR #6830 commit 1
  ea7fe7c91  master HEAD

What it does NOT contain

  • DV implementation (in fork PR for #6832 — to be opened next)
  • Any of these commits will be pushed to the upstream delta-io/delta PR

Local verification (preview4)

[info] Passed: Total 21, Failed 0, Errors 0, Passed 21
[success] Total time: 76 s

CI expectation

The fork's GitHub Actions matrix should now include spark_version=master with continue-on-error: false (PR delta-io#6657 marks it continue-on-error at the upstream config). Pass on that row validates the upstream PR's Spark 4.2 path.

SanJSp and others added 13 commits May 27, 2026 18:10
…io#6886)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md
2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  3. Be sure to keep the PR description updated to reflect all changes.
  4. Please write your PR title to summarize what this PR proposes.
5. If possible, provide a concise example to reproduce the issue for a
faster review.
6. If applicable, include the corresponding issue number in the PR title
and link it in the body.
-->

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Follow-up to [delta-io#6794](delta-io#6794). Adds
Deletion Vector (DV) support to the V2 changelog read path so `CHANGES
FROM VERSION x TO VERSION y` returns correct results on tables with
`delta.enableDeletionVectors=true`.

**In short,** I apply change_type delete to every removeFile and
change_type insert to every row in addFile per commit. A carry-over pair
(delete+insert) will be filtered out as carry-over, but with the DVs we
apply the DV as filter to the AddFile, leading to delete rows without
insert partner, which represent actual delets.

### Implementation

`DeltaChangelogBatch.java`:

- `CDCInputPartition` carries a per-file `deletionVectorInfo: String`
(the base64-serialized DV descriptor, or `null` when the file has no
DV).
- `planInputPartitions` extracts the DV per `AddFile`/`RemoveFile` via
`DeletionVectorDescriptor::serializeToBase64` and threads it onto the
partition.
- `createReader` uses the new public
`PartitionUtils.buildDvMetadata(...)` helper to set
`FILE_ROW_INDEX_FILTER_ID_ENCODED` and `FILE_ROW_INDEX_FILTER_TYPE`
together (with `RowIndexFilterType.IF_CONTAINED`) on the per-file
metadata, so the existing `DropMarkedRowsFilter` machinery picks up the
DV and produces the correct `delete` change rows.
- Wrapper `RuntimeException` messages now include the inner cause's
`getMessage()` so `AnalysisException` error classes (e.g.
`DELTA_CHANGELOG_*`) reach the user instead of being hidden behind a
generic "Failed to process CDC commit actions".

`PartitionUtils.java`:

- New public 1-arg `buildDvMetadata(String dvBase64)` overload that
takes the base64 string directly and uses `IF_CONTAINED` semantics.
Returns `java.util.Map` for ergonomic iteration from the changelog
caller. Existing private overloads are unchanged.

### Tests

`DeltaChangelogDirectBatchExecutionTest.java`:

- `testDirectBatchExecutionWithExplicitExpectedRows` and
`testUpdateProducesPairedDeleteAndInsert` are now parameterized via
`@ParameterizedTest @valuesource(booleans = {false, true})` so each runs
once with DVs off (rewrite-on-delete) and once with DVs on
(DV-on-delete). The expected row sets are identical, so the same
assertions cover both physical layouts.

`DeltaChangelogDvTest.java` (new):

DV-specific scenarios that don't fit the parameterized variants.

- `test_mixedDvDelete_perFileBranching`: two `AddFile`s in one commit, a
single `DELETE` touching both; asserts the per-file DV branch in
`planInputPartitions` sets the DV constants independently per file.
- `test_multiVersionDvDeletes_perCommitIsolation`: two `DELETE` commits
in sequence on a DV-enabled table; asserts Catalyst's batch CDC
post-processor partitions on `(row_id, version)` so each commit's
carry-overs collapse independently.

## Note on Spark 4.2 lane

The Spark 4.2 lane is non-blocking (`continue-on-error: true`); a
follow-up PR ([delta-io#6830](delta-io#6830))
relocates these tests to `spark-unified/test` where the in-tree
`DeltaCatalog` is on the classpath, making the 4.2 lane green for these
tests too.

## How was this patch tested?

Locally against Spark 4.0 / 4.1 (default lane, green) and Spark 4.2
(with PR delta-io#6830 applied to verify the tests pass end-to-end). 25 / 25
changelog tests pass with both PRs combined.

## Does this PR introduce _any_ user-facing changes?

The V2 changelog path is gated behind
`spark.databricks.delta.changelogV2.enabled` and not yet on a released
code path. With this PR, when enabled, it now returns correct results on
DV-enabled tables.
   and its helpers configure
   to the unified Java  in
  , but they live in .  cannot depend
  on  (circular), so the in-tree unified DeltaCatalog is
  not on the test classpath. The class with the same FQN is instead
  pulled in via:

      sparkV2/test
        -> kernelUnityCatalog(test->test)
        -> kernelDefaults(test->test)
        -> io.delta %% delta-spark % 4.0.0 % test

  That released jar contains the pre-delta-io#5320  (a single
  class extending  directly, without the
   ancestor). At runtime the JVM mixes the released
  4.0.0 catalog with the current-build  self-typed
  trait, and the trait dispatch's synthetic cast to
  fails with . After the trait inline, the same
  released jar surfaces as  because its bytecode targets the pre-Spark-4.2 6-arg signature.

  Move the changelog tests to . spark-unified has the
  hybrid DeltaCatalog in its  sources, so the test classpath
  resolves to the in-tree class. spark-unified depends on  only
  at , so the kernelUnityCatalog -> kernelDefaults ->
  released-delta-spark-4.0.0 chain does not apply, and the released jar is
  not on the test classpath at all.

  Three deltas, no production code change:

  - Move the three test files from
     to
    .
  - Add JUnit Jupiter to  . The
    changelog tests use Jupiter (, , ,
    ); spark-unified previously had no Jupiter deps
    because its other tests are scalatest.
  - Make  self-contained. It used to extend
     from ; that base is not visible from
     since spark-unified does not depend on sparkV2 at
    . Inline the static / fields, the
    / lifecycle, and the  (x2)//
     helpers actually used by the changelog tests.

  Run with  and a Spark 4.2 preview pinned in
  [Spark] Move changelog tests to spark-unified to fix Spark 4.2 classpath

   and its helpers configure
   to the unified Java  in
  , but they live in .  cannot depend
  on  (circular), so the in-tree unified DeltaCatalog is
  not on the test classpath. The class with the same FQN is instead
  pulled in via:

      sparkV2/test
        -> kernelUnityCatalog(test->test)
        -> kernelDefaults(test->test)
        -> io.delta %% delta-spark % 4.0.0 % test

  That released jar contains the pre-delta-io#5320  (a single
  class extending  directly, without the
   ancestor). At runtime the JVM mixes the released
  4.0.0 catalog with the current-build  self-typed
  trait, and the trait dispatch's synthetic cast to
  fails with . After the trait inline, the same
  released jar surfaces as  because its bytecode targets the pre-Spark-4.2 6-arg signature.

  Move the changelog tests to . spark-unified has the
  hybrid DeltaCatalog in its  sources, so the test classpath
  resolves to the in-tree class. spark-unified depends on  only
  at , so the kernelUnityCatalog -> kernelDefaults ->
  released-delta-spark-4.0.0 chain does not apply, and the released jar is
  not on the test classpath at all.

  Three deltas, no production code change:

  - Move the three test files from
     to
    .
  - Add JUnit Jupiter to  . The
    changelog tests use Jupiter (, , ,
    ); spark-unified previously had no Jupiter deps
    because its other tests are scalatest.
  - Make  self-contained. It used to extend
     from ; that base is not visible from
     since spark-unified does not depend on sparkV2 at
    . Inline the static / fields, the
    / lifecycle, and the  (x2)//
     helpers actually used by the changelog tests.

  Run with  and a Spark 4.2 preview pinned in
 Three follow-up fixes after the test relocation in 5a7bdd5. Without
  them the tests are either undiscovered, or hit wrapped error messages
  that hide the underlying error class assertions.

  1. build.sbt (spark-unified test source dirs)

     spark-unified pins Test/baseDirectory to sparkV1's directory, so
     CrossSparkVersions.sparkDependentSettings() adds the version-shim
     test dir (scala-shims/spark-4.2) under spark/, not under
     spark-unified/. The relocated changelog tests sit at
     spark-unified/src/test/scala-shims/spark-4.2/... and would not be
     discovered. Add unifiedDir/src/test/scala-shims/spark-4.2 to the
     explicit Test/unmanagedSourceDirectories list so the shim dir is
     visible from both module bases.

  2. DeltaChangelogScanBuilder.java (boundary RT-disabled error class)

     The eager start-snapshot row-tracking check threw
     DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, but that error class
     is for the per-commit loop in DeltaChangelogBatch that detects an
     RT toggle inside the range. A boundary snapshot without RT is the
     feature-missing case and should match the end-snapshot check, which
     throws DELTA_CHANGELOG_REQUIRES_ROW_TRACKING. Align both boundary
     checks to use REQUIRES_ROW_TRACKING; the in-range error class stays
     for the mid-range toggle case only.

  3. DeltaChangelogBatch.java (preserve AnalysisException error class)

     The generic catch (Exception e) inside planInputPartitions wraps
     DeltaAnalysisException (checked Exception, not RuntimeException) in
     a new RuntimeException("Failed to process CDC commit actions"),
     discarding the inner getMessage(). The error class
     (e.g. DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE,
     DELTA_CHANGELOG_SCHEMA_CHANGE_IN_RANGE) no longer reaches the user.
     Append e.getMessage() to the wrapper message so the original
     formatted error string surfaces
Per PR review feedback.

Co-authored-by: Isaac
The unscoped excludeDependencies blocked MiMa's previous-ABI resolution
during the compile-time backward-compat check (the spark/mimaPreviousClassfiles
task could not find io.delta:delta-spark_2.13:4.2.0). Restrict the exclusion
to the Test config so MiMa sees the artifact in Compile/runtime contexts
while the test classpath still drops the released jar that shadows the
in-tree unified DeltaCatalog.

Co-authored-by: Isaac
The start-snapshot row-tracking check throws
throwChangelogRowTrackingDisabledInRange(startVersion) (not the generic
throwChangelogRequiresRowTracking) to distinguish:

- table never had row tracking (end-boundary fails) - tell user to enable RT.
- table has RT now but startVersion predates the toggle (start-boundary
  fails) - tell user which version is the problem.

A previous rebase had collapsed both branches onto the generic error,
losing the start-version context. Restore parity with master.

Co-authored-by: Isaac
When a table never had row tracking, both the start and end boundary
checks fail. Previously the start-boundary check fired first and threw
DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, telling the user that
RT toggled mid-range. This is misleading because RT was never enabled.

Reorder so the end-boundary check runs first:
- end has no RT  -> DELTA_CHANGELOG_REQUIRES_ROW_TRACKING (table needs RT).
- end has RT, start has no RT  -> DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE
  (toggle within range, points at the offending startVersion).

This surfaces only with the assumeFalse skip in DeltaChangelogTestBase
removed (which is also part of this PR), so the testChangelogRejectsTableWithoutRowTracking
case starts exercising the boundary logic on Spark 4.2 preview5.

Co-authored-by: Isaac
The full delta-spark test suite against Spark 4.2 includes pre-existing
UC connector incompatibilities (UCProxy bytecode targets Spark 4.0's
6-arg CatalogStorageFormat.copy, not 4.2's 7-arg signature). Those
failures are masked by continue-on-error on the master matrix lane in
spark_test.yaml, but they obscure the signal we actually want: does
THIS PR's changelog code path work on Spark 4.2?

Add a narrow workflow that runs only
spark/testOnly io.delta.spark.internal.v2.read.changelog.* against
Spark 4.2-preview4. Green = the PR is unblocked on 4.2 even though
the broader matrix is unstable. Lives on the fork-CI branches only.

Co-authored-by: Isaac
Master now pins fullVersion=4.2.0-preview5 directly with MASTER=None
and uses '4.2' as the matrix alias. The 'master' alias no longer
resolves. Switch the focused workflow to -DsparkVersion=4.2.

Co-authored-by: Isaac
@SanJSp SanJSp force-pushed the fix-changelog-fork-ci branch from d7e3b73 to 6f4c9d2 Compare May 27, 2026 13:27
SanJSp added 4 commits May 27, 2026 14:47
test_multiVersionDvDeletes_perCommitIsolation passes locally on Spark 4.2
preview5 but fails on CI runs with the same artifact. Same-path Add+Remove
from a DV-DELETE produces carry-over delete/insert pairs that Spark Catalyst
Phase-1 cancels for two out of three (row_id, _commit_version) partitions,
but leaves the third surviving on CI. The values look identical to the
canceled partitions, so the divergence between local and CI environments
has no obvious source yet.

Disabling so the rest of the V2 changelog suite stays green on CI 4.2
preview5 while the root cause is investigated. Re-enable when fixed.

Co-authored-by: Isaac
Each variant isolates one variable from the failing
test_multiVersionDvDeletes_perCommitIsolation scenario:

- singleDvDelete_noPriorDvState   - removes prior DV update at v2
- multiVersionDvDeletes_readOnlyLastVersion - narrows range to v3..v3
- multiVersionDvDeletes_realDeleteAtFirstRow - real-delete at first row
                                               instead of middle
- multiVersionDvDeletes_insertViaSelect - INSERT via SELECT/range
                                          instead of INSERT VALUES

All 4 pass locally (Spark 4.2 preview5, OpenJDK 17.0.15). The hope is
that on CI exactly one variant fails, identifying which variable
triggers the Phase-1 carry-over leak; or several fail with a pattern.

Co-authored-by: Isaac
…ct still triggers

TD asked to verify whether the exclusion is still needed by removing
the lines and seeing what happens. Local clean compile + V2 changelog
tests pass without the exclusion (24/25, with the one DV test being
@disabled). The conflicting delta-spark 4.0.0 jar is still on the
test classpath per `show spark/Test/dependencyClasspath`, so this
relies on classloader resolution order picking the in-tree class.

Push to fork-CI to validate against CI's multi-JVM environment (where
the original DELTA_MISSING_CHANGE_DATA race was first observed). If
CI is green, the exclusion can be permanently dropped.

Co-authored-by: Isaac
…heory

logCatalogProvenance prints the mounted DeltaCatalog class, the jar/dir
it was loaded from, and whether it carries the ChangelogSupport mixin.
Runs at the start of each DV test via withDvTable.

Local result with exclusion removed:
  class=...delta.catalog.DeltaCatalog
  loadedFrom=.../delta-spark_4.2_2.13-4.3.0-SNAPSHOT.jar (in-tree)
  implementsChangelogSupport=true
  -> all tests pass

Theory to verify on CI: failing tests will show either the released
delta-spark 4.0.0 jar as the load source, or ChangelogSupport=false,
or both -- confirming that classloader race picks the wrong class
when the exclusion is absent.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant