
[SparkConnector][No Review] Fix NoClassDefFoundError for MetadataVersionUtil #48837

Merged
FabianMeiswinkel merged 10 commits into Azure:main from xinlian12:fix/cosmos-spark-metadataversion-noclass
Apr 17, 2026

Conversation

@xinlian12
Member

@xinlian12 xinlian12 commented Apr 16, 2026

Description

Fixes a NoClassDefFoundError for MetadataVersionUtil in the Cosmos Spark connector that occurs on certain Spark distributions (e.g., Databricks Runtime 17.3+) where MetadataVersionUtil has been relocated or removed.

Problem

ChangeFeedInitialOffsetWriter directly references org.apache.spark.sql.execution.streaming.MetadataVersionUtil, which is an internal Spark class. Some Spark distributions relocate or remove this class, causing a NoClassDefFoundError at runtime when the change feed offset writer attempts to deserialize a log file.

Solution

Inline the validateVersion logic from MetadataVersionUtil into a private companion object of ChangeFeedInitialOffsetWriter, eliminating the runtime dependency on MetadataVersionUtil. The inlined implementation preserves the same validation semantics:

  • Parses the version string (e.g., "v1") from the log file header
  • Validates the version is within the supported range
  • Throws IllegalStateException with descriptive messages for malformed or unsupported versions
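The inlined helper presumably mirrors Spark's own validateVersion. A minimal self-contained sketch of the semantics listed above (object and method names follow the PR description; the exact merged code may differ):

```scala
// Sketch of the inlined validation logic described above. It mirrors the
// semantics of Spark's MetadataVersionUtil.validateVersion; this is an
// illustration, not the exact code merged in this PR.
object ChangeFeedInitialOffsetWriter {
  // Parses a "v<N>" version string from the offset log header and validates
  // it against the highest log version this reader supports.
  def validateVersion(text: String, maxSupportedVersion: Int): Int = {
    if (text.nonEmpty && text(0) == 'v') {
      val version =
        try {
          text.substring(1).toInt
        } catch {
          case _: NumberFormatException =>
            throw new IllegalStateException(
              s"Log file was malformed: failed to read correct log version from $text.")
        }
      if (version > 0) {
        if (version > maxSupportedVersion) {
          throw new IllegalStateException(
            s"UnsupportedLogVersion: maximum supported log version is " +
              s"v$maxSupportedVersion, but encountered v$version.")
        }
        return version
      }
    }
    throw new IllegalStateException(
      s"Log file was malformed: failed to read correct log version from $text.")
  }
}
```

Malformed headers ("1", "v", "v0", "v-1", non-numeric) all surface as IllegalStateException with a descriptive message, matching the semantics above.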

Changes

  • ChangeFeedInitialOffsetWriter.scala: Removed the import of MetadataVersionUtil, replaced the call to MetadataVersionUtil.validateVersion(...) with a local ChangeFeedInitialOffsetWriter.validateVersion(...), and added a companion object with the inlined validation logic.

Tests

  • Added a Spark live test for change feed streaming
  • Manual testing on Databricks Runtime 17.3
  • Also confirmed the code path for HDFSMetadataLog

…ector

Inline version validation logic in ChangeFeedInitialOffsetWriter instead
of depending on Spark-internal MetadataVersionUtil, which has been
relocated in Databricks Runtime 17.3 LTS (Spark 4.0).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 marked this pull request as ready for review April 16, 2026 21:39
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners April 16, 2026 21:39
Copilot AI review requested due to automatic review settings April 16, 2026 21:39
@xinlian12 xinlian12 changed the title from Fix NoClassDefFoundError for MetadataVersionUtil in Cosmos Spark conn… to [SparkConnector][No Review] Fix NoClassDefFoundError for MetadataVersionUtil on Apr 16, 2026
@xinlian12
Member Author

@sdkReviewAgent

Contributor

Copilot AI left a comment


Pull request overview

This PR removes a runtime dependency on Spark’s MetadataVersionUtil (which can be relocated in some Spark distributions) by inlining equivalent log-version validation logic into the Cosmos Spark connector’s change feed offset metadata reader/writer.

Changes:

  • Removed the import/reference to org.apache.spark.sql.execution.streaming.MetadataVersionUtil.
  • Added an internal validateVersion implementation and switched deserialize to use it.

@xinlian12
Member Author

@sdkReviewAgent

@xinlian12
Member Author

Review complete (14:39)

No new comments — existing review coverage is sufficient.

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@FabianMeiswinkel
Member

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 and others added 2 commits April 16, 2026 15:12
Add ChangeFeedInitialOffsetWriterSpec with tests covering:
- Valid version strings within supported range
- Version exceeding max supported (UnsupportedLogVersion)
- Malformed versions: non-numeric, empty, missing v prefix, v0, negative, bare v

Widen companion object visibility to private[spark] for testability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…st notebooks

Add structured streaming scenarios using cosmos.oltp.changeFeed to both
basicScenario.scala and basicScenarioAadManagedIdentity.scala notebooks.
These scenarios exercise the ChangeFeedInitialOffsetWriter and
HDFSMetadataLog code paths that can break on certain Spark distributions
(e.g. Databricks Runtime 17.3+).

Each scenario:
- Creates a sink container
- Reads change feed from source via readStream with micro-batch
- Writes to sink container via writeStream
- Validates records were copied
- Cleans up both containers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 and others added 2 commits April 16, 2026 20:01
Use file:/tmp/ instead of /tmp/ for checkpoint location to avoid DBFS
access issues on Unity Catalog-enabled Databricks clusters. Also:
- Remove unused Trigger import
- Stop query before reading sink to avoid race conditions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace cosmos.oltp sink with in-memory sink to eliminate the need for
a separate sink container. This avoids 404 errors from sink container
creation/resolution and removes checkpoint path concerns.

The test still exercises the full ChangeFeedInitialOffsetWriter and
HDFSMetadataLog code paths (readStream with cosmos.oltp.changeFeed),
which is the goal for validating the MetadataVersionUtil fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
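Put together, the streaming scenario described in these commits might look roughly like the following sketch. All option values, the query name, and the checkpoint path are placeholders rather than the actual notebook code, and running it requires a live Spark session and Cosmos DB account:

```scala
// Illustrative sketch only: exercises the readStream path
// (ChangeFeedInitialOffsetWriter / HDFSMetadataLog) and writes to an
// in-memory sink so no sink container is needed.
val changeFeedCfg = Map(
  "spark.cosmos.accountEndpoint" -> "<accountEndpoint>",   // placeholder
  "spark.cosmos.accountKey"      -> "<accountKey>",        // placeholder
  "spark.cosmos.database"        -> "<database>",          // placeholder
  "spark.cosmos.container"       -> "<sourceContainer>",   // placeholder
  "spark.cosmos.read.inferSchema.enabled" -> "false"
)

val stream = spark.readStream
  .format("cosmos.oltp.changeFeed")
  .options(changeFeedCfg)
  .load()

val query = stream.writeStream
  .format("memory")                       // in-memory sink: no sink container
  .queryName("changeFeedValidation")
  .option("checkpointLocation", "file:/tmp/changefeed-checkpoint/")
  .start()

query.processAllAvailable()
query.stop()                              // stop before reading the sink
val copied = spark.sql("SELECT * FROM changeFeedValidation").count()
```

Using `file:/tmp/` for the checkpoint location and stopping the query before counting the sink match the adjustments described in the commit messages above.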
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Both notebooks now use the same pattern: derive changeFeedCfg from the
existing cfg map (which already has the correct auth config) plus the
change feed-specific options. Write to an in-memory sink to avoid
container creation issues. This ensures both key-based and AAD/MSI
notebooks exercise identical streaming logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

The MSI notebook shares a cluster with basicScenario, and the Cosmos
client cache retains references from the first notebook's proactive
connection init. When basicScenario drops the source container during
cleanup, the MSI notebook's change feed streaming fails with 404 on
the cached (now-deleted) container. The change feed streaming test in
basicScenario already provides sufficient coverage for the
ChangeFeedInitialOffsetWriter code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xinlian12 and others added 2 commits April 16, 2026 22:03
Add detailed logging to capture:
- Endpoint, database, container, auth config used
- Source container record count before streaming
- Streaming query ID
- Full exception details on failure

This will help diagnose why the change feed streaming fails
on the MSI notebook but succeeds on the key-based one.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MSI change feed test passes on a fresh cluster but fails when
basicScenario runs first on the same cluster without restart. The
basicScenario leaves cached Cosmos client state (proactive connection
init on the ephemeral endpoint) that causes the MSI streaming query
to resolve to the wrong endpoint, resulting in a 404. The change feed
test in basicScenario provides sufficient coverage for the
ChangeFeedInitialOffsetWriter/HDFSMetadataLog code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment


LGTM

@FabianMeiswinkel FabianMeiswinkel merged commit df7614a into Azure:main Apr 17, 2026
36 checks passed
tvaron3 added a commit to tvaron3/azure-sdk-for-java that referenced this pull request Apr 17, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 added a commit that referenced this pull request Apr 17, 2026
* Release azure-cosmos-spark 4.47.0

Version bumps and CHANGELOG updates for:
- azure-cosmos-spark_3-3_2-12 4.47.0
- azure-cosmos-spark_3-4_2-12 4.47.0
- azure-cosmos-spark_3-5_2-12 4.47.0
- azure-cosmos-spark_3-5_2-13 4.47.0
- azure-cosmos-spark_4-0_2-13 4.47.0

Features Added:
- Added support for change feed with startFrom point-in-time on merged partitions (PR #48752)

Bugs Fixed:
- Fixed readContainerThroughput unnecessary permission requirement (PR #48800)

Also updated azure-cosmos CHANGELOG to reclassify the startFrom fix as a feature.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review: add clinit fix to CHANGELOGs and DBR 17.3 known issue

- Added JVM <clinit> deadlock fix (PR #48689) to all 5 spark connector CHANGELOGs
- Added Known Issues section to Spark 4.0 README for Structured Streaming
  incompatibility with Databricks Runtime 17.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Reword DBR 17.3 known issue based on IcM 779484786

Updated with accurate details: MetadataVersionUtil$ class removal,
DBR 17.3 includes Spark 4.1 changes while reporting 4.0.0, and
recommendation to stay on previous LTS until DBR 18 LTS.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove DBR 17.3 known issue - will be fixed before release

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update spark release date to 2026-04-17

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add MetadataVersionUtil fix to Spark 4.0 CHANGELOG (PR #48837)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
