fix404ExceptionOnSparkAfterSplit #29982

Merged
merged 5 commits into from
Jul 18, 2022

@xinlian12 xinlian12 commented Jul 15, 2022

Description

When a partition split happens, a customer may get a 404 exception.
Example stack traces:
First, a 410/1002 is returned:

22/07/05 18:09:53 WARN TransientIOErrorsRetryingIterator: Transient failure handled in TransientIOErrorsRetryingIterator.hasNextInternal - will be retried (attempt#1) in 3586ms
{"ClassName":"GoneException","userAgent":"azsdk-java-cosmos/4.30.0 Linux/5.4.0-1080-azure JRE/1.8.0_302","statusCode":410,"resourceAddress":null,"innerErrorMessage":"Epk Range '{\"min\":\"05C1AFF91793C0\",\"max\":\"05C1B18FD52380\"}' is gone.","causeInfo":null,"responseHeaders":"{x-ms-substatus=1002}"}
	at azure_cosmos_spark.com.azure.cosmos.implementation.feedranges.FeedRangeEpkImpl.lambda$populateFeedRangeFilteringHeaders$2(FeedRangeEpkImpl.java:192)
Then a 404 is returned:
22/07/05 18:09:57 INFO TransientIOErrorsRetryingIterator: Attempting to cancel oldPagedFlux, Context: n/a
22/07/05 18:09:57 WARN ChangeFeedFetcher$FeedRangeContinuationSplitRetryPolicy: Exception not applicable - will fail the request. Context: n/a
{"ClassName":"NotFoundException","userAgent":"azsdk-java-cosmos/4.30.0 Linux/5.4.0-1080-azure JRE/1.8.0_302","statusCode":404,"resourceAddress":null,"innerErrorMessage":"Entity with the specified id does not exist in the system. More info: https://aka.ms/cosmosdb-tsg-not-found-java: Stale cache for collection rid 'HBpFAO8LmCQ='.","causeInfo":null,"responseHeaders":"{}"}
	at azure_cosmos_spark.com.azure.cosmos.implementation.feedranges.FeedRangeEpkImpl.lambda$populateFeedRangeFilteringHeaders$2(FeedRangeEpkImpl.java:180)
	at azure_cosmos_spark.reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:125)
	at azure_cosmos_spark.reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:127)

This PR includes two fixes:

  1. Return the correct shouldRetryResult after FeedRangeContinuationSplitRetryPolicy.handleSplit, so that processing continues instead of the exception being thrown back to TransientIOErrorsRetryingIterator.
  2. The 404 exception occurs because, after FeedRangeContinuationSplitRetryPolicy.handleSplit, a range with Min = Max, isMinInclusive = true, isMaxInclusive = false is generated. That range has no overlapping ranges in FeedRangeEpkImpl.populateFeedRangeFilteringHeaders, so a client-side 404 is generated. The fix is to use the correct isMinInclusive and isMaxInclusive flags.
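
To see why a range with Min = Max, isMinInclusive = true, isMaxInclusive = false cannot overlap anything, here is a minimal Java sketch. `EpkRangeSketch` and its methods are hypothetical names made up for illustration, not the SDK's own range type, and plain string comparison is assumed as a stand-in for EPK ordering.

```java
// Hypothetical stand-in for an EPK range; not the SDK's own class.
final class EpkRangeSketch {
    final String min;
    final String max;
    final boolean isMinInclusive;
    final boolean isMaxInclusive;

    EpkRangeSketch(String min, String max, boolean isMinInclusive, boolean isMaxInclusive) {
        this.min = min;
        this.max = max;
        this.isMinInclusive = isMinInclusive;
        this.isMaxInclusive = isMaxInclusive;
    }

    // A range with min == max contains a value only when both ends are inclusive.
    boolean isEmpty() {
        int cmp = min.compareTo(max);
        return cmp > 0 || (cmp == 0 && !(isMinInclusive && isMaxInclusive));
    }

    // Two ranges overlap only if neither is empty and each one starts before the other ends.
    static boolean overlaps(EpkRangeSketch a, EpkRangeSketch b) {
        if (a.isEmpty() || b.isEmpty()) {
            return false;
        }
        return startsBeforeEnd(a, b) && startsBeforeEnd(b, a);
    }

    // True when x's lower bound lies before y's upper bound, or touches it with both ends inclusive.
    private static boolean startsBeforeEnd(EpkRangeSketch x, EpkRangeSketch y) {
        int cmp = x.min.compareTo(y.max);
        return cmp < 0 || (cmp == 0 && x.isMinInclusive && y.isMaxInclusive);
    }

    public static void main(String[] args) {
        // Boundaries borrowed from the 410/1002 message above, purely as sample values.
        EpkRangeSketch partition = new EpkRangeSketch("05C1AFF91793C0", "05C1B18FD52380", true, false);
        EpkRangeSketch degenerate = new EpkRangeSketch("05C1B18FD52380", "05C1B18FD52380", true, false);

        System.out.println("degenerate is empty: " + degenerate.isEmpty());
        System.out.println("degenerate overlaps partition: " + overlaps(degenerate, partition));
    }
}
```

In this sketch the degenerate range is empty, so an overlap lookup for it finds no ranges at all, which is the condition the description says leads to the client-side 404.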

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

@ghost ghost added the Cosmos label Jul 15, 2022
@azure-sdk
Collaborator

API change check

API changes are not detected in this pull request.

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Thanks Annie!

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Annie - please ping me offline. Want to understand whether after the split we still have three ranges - one with min==max (but proper flags) - if so, that is a bug we should address as well.

@xinlian12
Member Author

> Annie - please ping me offline. Want to understand whether after the split we still have three ranges - one with min==max (but proper flags) - if so, that is a bug we should address as well.

With the proper flags passed, there is no longer any range with min == max.
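
The PR does not spell out the exact code path, but one plausible way to picture how a wrong inclusive flag could yield a third, min == max piece is the Java sketch below. Everything in it is an assumption made for illustration: `SplitIntersectionSketch`, `Range`, `splitAlongChildren`, and the single-letter keys are hypothetical and are not taken from the SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical, not SDK APIs.
final class SplitIntersectionSketch {

    // Minimal range over string keys, mirroring the min/max/inclusive flags named in the PR.
    static final class Range {
        final String min;
        final String max;
        final boolean isMinInclusive;
        final boolean isMaxInclusive;

        Range(String min, String max, boolean isMinInclusive, boolean isMaxInclusive) {
            this.min = min;
            this.max = max;
            this.isMinInclusive = isMinInclusive;
            this.isMaxInclusive = isMaxInclusive;
        }

        // min == max without both ends inclusive means the range holds nothing.
        boolean isDegenerate() {
            return min.equals(max) && !(isMinInclusive && isMaxInclusive);
        }
    }

    // Cuts `parent` along the boundaries of `children`, each child being [min, max).
    // The parent's isMaxInclusive flag decides whether a child starting at parent.max matches.
    static List<Range> splitAlongChildren(Range parent, List<Range> children) {
        List<Range> pieces = new ArrayList<>();
        for (Range child : children) {
            int startVsParentEnd = child.min.compareTo(parent.max);
            boolean startsBeforeParentEnd =
                startVsParentEnd < 0 || (startVsParentEnd == 0 && parent.isMaxInclusive);
            boolean endsAfterParentStart = child.max.compareTo(parent.min) > 0;
            if (startsBeforeParentEnd && endsAfterParentStart) {
                String lo = child.min.compareTo(parent.min) > 0 ? child.min : parent.min;
                String hi = child.max.compareTo(parent.max) < 0 ? child.max : parent.max;
                pieces.add(new Range(lo, hi, true, false));
            }
        }
        return pieces;
    }

    static int countDegenerate(List<Range> pieces) {
        int count = 0;
        for (Range piece : pieces) {
            if (piece.isDegenerate()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Range> children = new ArrayList<>();
        children.add(new Range("A", "B", true, false));
        children.add(new Range("B", "C", true, false));
        children.add(new Range("C", "D", true, false)); // neighbour that starts where the parent ends

        List<Range> wrongFlags = splitAlongChildren(new Range("A", "C", true, true), children);
        List<Range> properFlags = splitAlongChildren(new Range("A", "C", true, false), children);

        System.out.println("wrong flags:  " + wrongFlags.size() + " pieces, "
            + countDegenerate(wrongFlags) + " degenerate");
        System.out.println("proper flags: " + properFlags.size() + " pieces, "
            + countDegenerate(properFlags) + " degenerate");
    }
}
```

Under these assumptions, treating the parent's max as inclusive pulls in the neighbouring child and leaves a [C, C) piece, while the half-open flags produce only the two real child ranges.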

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM - thanks!

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

3 participants