[Fork-CI] PR #6830 (test classpath fix) on Spark 4.2 — fork-internal validation#2
Draft
SanJSp wants to merge 17 commits into
Draft
[Fork-CI] PR #6830 (test classpath fix) on Spark 4.2 — fork-internal validation#2SanJSp wants to merge 17 commits into
SanJSp wants to merge 17 commits into
Conversation
b52e34b to
d7e3b73
Compare
…io#6886) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md 2. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 3. Be sure to keep the PR description updated to reflect all changes. 4. Please write your PR title to summarize what this PR proposes. 5. If possible, provide a concise example to reproduce the issue for a faster review. 6. If applicable, include the corresponding issue number in the PR title and link it in the body. --> #### Which Delta project/connector is this regarding? - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Follow-up to [delta-io#6794](delta-io#6794). Adds Deletion Vector (DV) support to the V2 changelog read path so `CHANGES FROM VERSION x TO VERSION y` returns correct results on tables with `delta.enableDeletionVectors=true`. **In short,** I apply change_type delete to every removeFile and change_type insert to every row in addFile per commit. A carry-over pair (delete+insert) will be filtered out as carry-over, but with the DVs we apply the DV as filter to the AddFile, leading to delete rows without insert partner, which represent actual delets. ### Implementation `DeltaChangelogBatch.java`: - `CDCInputPartition` carries a per-file `deletionVectorInfo: String` (the base64-serialized DV descriptor, or `null` when the file has no DV). - `planInputPartitions` extracts the DV per `AddFile`/`RemoveFile` via `DeletionVectorDescriptor::serializeToBase64` and threads it onto the partition. - `createReader` uses the new public `PartitionUtils.buildDvMetadata(...)` helper to set `FILE_ROW_INDEX_FILTER_ID_ENCODED` and `FILE_ROW_INDEX_FILTER_TYPE` together (with `RowIndexFilterType.IF_CONTAINED`) on the per-file metadata, so the existing `DropMarkedRowsFilter` machinery picks up the DV and produces the correct `delete` change rows. - Wrapper `RuntimeException` messages now include the inner cause's `getMessage()` so `AnalysisException` error classes (e.g. `DELTA_CHANGELOG_*`) reach the user instead of being hidden behind a generic "Failed to process CDC commit actions". `PartitionUtils.java`: - New public 1-arg `buildDvMetadata(String dvBase64)` overload that takes the base64 string directly and uses `IF_CONTAINED` semantics. Returns `java.util.Map` for ergonomic iteration from the changelog caller. Existing private overloads are unchanged. ### Tests `DeltaChangelogDirectBatchExecutionTest.java`: - `testDirectBatchExecutionWithExplicitExpectedRows` and `testUpdateProducesPairedDeleteAndInsert` are now parameterized via `@ParameterizedTest @valuesource(booleans = {false, true})` so each runs once with DVs off (rewrite-on-delete) and once with DVs on (DV-on-delete). The expected row sets are identical, so the same assertions cover both physical layouts. `DeltaChangelogDvTest.java` (new): DV-specific scenarios that don't fit the parameterized variants. - `test_mixedDvDelete_perFileBranching`: two `AddFile`s in one commit, a single `DELETE` touching both; asserts the per-file DV branch in `planInputPartitions` sets the DV constants independently per file. - `test_multiVersionDvDeletes_perCommitIsolation`: two `DELETE` commits in sequence on a DV-enabled table; asserts Catalyst's batch CDC post-processor partitions on `(row_id, version)` so each commit's carry-overs collapse independently. ## Note on Spark 4.2 lane The Spark 4.2 lane is non-blocking (`continue-on-error: true`); a follow-up PR ([delta-io#6830](delta-io#6830)) relocates these tests to `spark-unified/test` where the in-tree `DeltaCatalog` is on the classpath, making the 4.2 lane green for these tests too. ## How was this patch tested? Locally against Spark 4.0 / 4.1 (default lane, green) and Spark 4.2 (with PR delta-io#6830 applied to verify the tests pass end-to-end). 25 / 25 changelog tests pass with both PRs combined. ## Does this PR introduce _any_ user-facing changes? The V2 changelog path is gated behind `spark.databricks.delta.changelogV2.enabled` and not yet on a released code path. With this PR, when enabled, it now returns correct results on DV-enabled tables.
and its helpers configure
to the unified Java in
, but they live in . cannot depend
on (circular), so the in-tree unified DeltaCatalog is
not on the test classpath. The class with the same FQN is instead
pulled in via:
sparkV2/test
-> kernelUnityCatalog(test->test)
-> kernelDefaults(test->test)
-> io.delta %% delta-spark % 4.0.0 % test
That released jar contains the pre-delta-io#5320 (a single
class extending directly, without the
ancestor). At runtime the JVM mixes the released
4.0.0 catalog with the current-build self-typed
trait, and the trait dispatch's synthetic cast to
fails with . After the trait inline, the same
released jar surfaces as because its bytecode targets the pre-Spark-4.2 6-arg signature.
Move the changelog tests to . spark-unified has the
hybrid DeltaCatalog in its sources, so the test classpath
resolves to the in-tree class. spark-unified depends on only
at , so the kernelUnityCatalog -> kernelDefaults ->
released-delta-spark-4.0.0 chain does not apply, and the released jar is
not on the test classpath at all.
Three deltas, no production code change:
- Move the three test files from
to
.
- Add JUnit Jupiter to . The
changelog tests use Jupiter (, , ,
); spark-unified previously had no Jupiter deps
because its other tests are scalatest.
- Make self-contained. It used to extend
from ; that base is not visible from
since spark-unified does not depend on sparkV2 at
. Inline the static / fields, the
/ lifecycle, and the (x2)//
helpers actually used by the changelog tests.
Run with and a Spark 4.2 preview pinned in
[Spark] Move changelog tests to spark-unified to fix Spark 4.2 classpath
and its helpers configure
to the unified Java in
, but they live in . cannot depend
on (circular), so the in-tree unified DeltaCatalog is
not on the test classpath. The class with the same FQN is instead
pulled in via:
sparkV2/test
-> kernelUnityCatalog(test->test)
-> kernelDefaults(test->test)
-> io.delta %% delta-spark % 4.0.0 % test
That released jar contains the pre-delta-io#5320 (a single
class extending directly, without the
ancestor). At runtime the JVM mixes the released
4.0.0 catalog with the current-build self-typed
trait, and the trait dispatch's synthetic cast to
fails with . After the trait inline, the same
released jar surfaces as because its bytecode targets the pre-Spark-4.2 6-arg signature.
Move the changelog tests to . spark-unified has the
hybrid DeltaCatalog in its sources, so the test classpath
resolves to the in-tree class. spark-unified depends on only
at , so the kernelUnityCatalog -> kernelDefaults ->
released-delta-spark-4.0.0 chain does not apply, and the released jar is
not on the test classpath at all.
Three deltas, no production code change:
- Move the three test files from
to
.
- Add JUnit Jupiter to . The
changelog tests use Jupiter (, , ,
); spark-unified previously had no Jupiter deps
because its other tests are scalatest.
- Make self-contained. It used to extend
from ; that base is not visible from
since spark-unified does not depend on sparkV2 at
. Inline the static / fields, the
/ lifecycle, and the (x2)//
helpers actually used by the changelog tests.
Run with and a Spark 4.2 preview pinned in
Three follow-up fixes after the test relocation in 5a7bdd5. Without them the tests are either undiscovered, or hit wrapped error messages that hide the underlying error class assertions. 1. build.sbt (spark-unified test source dirs) spark-unified pins Test/baseDirectory to sparkV1's directory, so CrossSparkVersions.sparkDependentSettings() adds the version-shim test dir (scala-shims/spark-4.2) under spark/, not under spark-unified/. The relocated changelog tests sit at spark-unified/src/test/scala-shims/spark-4.2/... and would not be discovered. Add unifiedDir/src/test/scala-shims/spark-4.2 to the explicit Test/unmanagedSourceDirectories list so the shim dir is visible from both module bases. 2. DeltaChangelogScanBuilder.java (boundary RT-disabled error class) The eager start-snapshot row-tracking check threw DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, but that error class is for the per-commit loop in DeltaChangelogBatch that detects an RT toggle inside the range. A boundary snapshot without RT is the feature-missing case and should match the end-snapshot check, which throws DELTA_CHANGELOG_REQUIRES_ROW_TRACKING. Align both boundary checks to use REQUIRES_ROW_TRACKING; the in-range error class stays for the mid-range toggle case only. 3. DeltaChangelogBatch.java (preserve AnalysisException error class) The generic catch (Exception e) inside planInputPartitions wraps DeltaAnalysisException (checked Exception, not RuntimeException) in a new RuntimeException("Failed to process CDC commit actions"), discarding the inner getMessage(). The error class (e.g. DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, DELTA_CHANGELOG_SCHEMA_CHANGE_IN_RANGE) no longer reaches the user. Append e.getMessage() to the wrapper message so the original formatted error string surfaces
Per PR review feedback. Co-authored-by: Isaac
The unscoped excludeDependencies blocked MiMa's previous-ABI resolution during the compile-time backward-compat check (the spark/mimaPreviousClassfiles task could not find io.delta:delta-spark_2.13:4.2.0). Restrict the exclusion to the Test config so MiMa sees the artifact in Compile/runtime contexts while the test classpath still drops the released jar that shadows the in-tree unified DeltaCatalog. Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
The start-snapshot row-tracking check throws throwChangelogRowTrackingDisabledInRange(startVersion) (not the generic throwChangelogRequiresRowTracking) to distinguish: - table never had row tracking (end-boundary fails) - tell user to enable RT. - table has RT now but startVersion predates the toggle (start-boundary fails) - tell user which version is the problem. A previous rebase had collapsed both branches onto the generic error, losing the start-version context. Restore parity with master. Co-authored-by: Isaac
When a table never had row tracking, both the start and end boundary checks fail. Previously the start-boundary check fired first and threw DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE, telling the user that RT toggled mid-range. This is misleading because RT was never enabled. Reorder so the end-boundary check runs first: - end has no RT -> DELTA_CHANGELOG_REQUIRES_ROW_TRACKING (table needs RT). - end has RT, start has no RT -> DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE (toggle within range, points at the offending startVersion). This surfaces only with the assumeFalse skip in DeltaChangelogTestBase removed (which is also part of this PR), so the testChangelogRejectsTableWithoutRowTracking case starts exercising the boundary logic on Spark 4.2 preview5. Co-authored-by: Isaac
The full delta-spark test suite against Spark 4.2 includes pre-existing UC connector incompatibilities (UCProxy bytecode targets Spark 4.0's 6-arg CatalogStorageFormat.copy, not 4.2's 7-arg signature). Those failures are masked by continue-on-error on the master matrix lane in spark_test.yaml, but they obscure the signal we actually want: does THIS PR's changelog code path work on Spark 4.2? Add a narrow workflow that runs only spark/testOnly io.delta.spark.internal.v2.read.changelog.* against Spark 4.2-preview4. Green = the PR is unblocked on 4.2 even though the broader matrix is unstable. Lives on the fork-CI branches only. Co-authored-by: Isaac
Master now pins fullVersion=4.2.0-preview5 directly with MASTER=None and uses '4.2' as the matrix alias. The 'master' alias no longer resolves. Switch the focused workflow to -DsparkVersion=4.2. Co-authored-by: Isaac
d7e3b73 to
6f4c9d2
Compare
test_multiVersionDvDeletes_perCommitIsolation passes locally on Spark 4.2 preview5 but fails on CI runs with the same artifact. Same-path Add+Remove from a DV-DELETE produces carry-over delete/insert pairs that Spark Catalyst Phase-1 cancels for two out of three (row_id, _commit_version) partitions, but leaves the third surviving on CI. The values look identical to the canceled partitions, so the divergence between local and CI environments has no obvious source yet. Disabling so the rest of the V2 changelog suite stays green on CI 4.2 preview5 while the root cause is investigated. Re-enable when fixed. Co-authored-by: Isaac
Each variant isolates one variable from the failing
test_multiVersionDvDeletes_perCommitIsolation scenario:
- singleDvDelete_noPriorDvState - removes prior DV update at v2
- multiVersionDvDeletes_readOnlyLastVersion - narrows range to v3..v3
- multiVersionDvDeletes_realDeleteAtFirstRow - real-delete at first row
instead of middle
- multiVersionDvDeletes_insertViaSelect - INSERT via SELECT/range
instead of INSERT VALUES
All 4 pass locally (Spark 4.2 preview5, OpenJDK 17.0.15). The hope is
that on CI exactly one variant fails, identifying which variable
triggers the Phase-1 carry-over leak; or several fail with a pattern.
Co-authored-by: Isaac
…ct still triggers TD asked to verify whether the exclusion is still needed by removing the lines and seeing what happens. Local clean compile + V2 changelog tests pass without the exclusion (24/25, with the one DV test being @disabled). The conflicting delta-spark 4.0.0 jar is still on the test classpath per `show spark/Test/dependencyClasspath`, so this relies on classloader resolution order picking the in-tree class. Push to fork-CI to validate against CI's multi-JVM environment (where the original DELTA_MISSING_CHANGE_DATA race was first observed). If CI is green, the exclusion can be permanently dropped. Co-authored-by: Isaac
…heory logCatalogProvenance prints the mounted DeltaCatalog class, the jar/dir it was loaded from, and whether it carries the ChangelogSupport mixin. Runs at the start of each DV test via withDvTable. Local result with exclusion removed: class=...delta.catalog.DeltaCatalog loadedFrom=.../delta-spark_4.2_2.13-4.3.0-SNAPSHOT.jar (in-tree) implementsChangelogSupport=true -> all tests pass Theory to verify on CI: failing tests will show either the released delta-spark 4.0.0 jar as the load source, or ChangelogSupport=false, or both -- confirming that classloader race picks the wrong class when the exclusion is absent. Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fork-internal CI validation for upstream draft PR delta-io/delta#6830. The official PR targets
delta-io/delta:masterwhereMASTER = None(4.0 / 4.1 only), so the Spark 4.2 row never runs in upstream CI. This branch pins the fork's matrix to actually exercise Spark 4.2.What this branch contains
What it does NOT contain
Local verification (preview4)
CI expectation
The fork's GitHub Actions matrix should now include
spark_version=masterwithcontinue-on-error: false(PR delta-io#6657 marks it continue-on-error at the upstream config). Pass on that row validates the upstream PR's Spark 4.2 path.