Storage - Fix Flaky Stress Tests #48359

Merged
browndav-msft merged 25 commits into Azure:main from browndav-msft:fix-flaky-tests
Mar 16, 2026

@browndav-msft
Member

This PR fixes the flakiness we've been seeing in the stress tests.

- The read functions used FAIL_FAST, which threw an error when the stream had reached the end and we tried to read from it again, so we removed it from both reads.
- Refactored the code so that the exit criteria are checked at the beginning.
- Refactored emitContentInfo to keep the code DRY.
- Changed emitValue to tryEmitValue.
- Removed Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes do not cause an error to be thrown.
- Downgraded opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha to 2.15.0-alpha.
- Downgraded opentelemetry-logback-appender-1.0 from 2.24.0-alpha to 2.15.0-alpha.
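
The stream changes in this description can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `CrcSketch`, `CrcTrackingInputStream`, and `drainTwiceAndClose` are hypothetical names, and the JDK's `CompletableFuture.complete` (which returns `false` on a repeated completion instead of throwing) is swapped in for Reactor's `tryEmitValue`, which likewise reports its outcome as a return value. The sketch assumes the goal is that reading past EOF and closing more than once are both harmless.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.zip.CRC32;

// Hypothetical stand-in for a CRC-tracking stream; an assumption, not the PR's CrcInputStream.
final class CrcSketch {

    static final class CrcTrackingInputStream extends FilterInputStream {
        private final CRC32 crc = new CRC32();
        private final CompletableFuture<Long> contentInfo = new CompletableFuture<>();
        private boolean eof;

        CrcTrackingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            if (eof) {
                return -1; // exit criteria first: a read after EOF is a no-op
            }
            int b = super.read();
            if (b == -1) {
                emitContentInfo();
            } else {
                crc.update(b);
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (eof) {
                return -1; // same early exit as the single-byte read
            }
            int n = super.read(buf, off, len);
            if (n == -1) {
                emitContentInfo();
            } else {
                crc.update(buf, off, n);
            }
            return n;
        }

        @Override
        public void close() throws IOException {
            emitContentInfo(); // safe to call again: a repeated completion is ignored
            super.close();
        }

        // Shared by both reads and close(); returns false if the value was already emitted.
        private boolean emitContentInfo() {
            eof = true;
            return contentInfo.complete(crc.getValue());
        }

        CompletableFuture<Long> contentInfo() {
            return contentInfo;
        }
    }

    // Drains the data, reads once past EOF, closes twice; returns {crc, resultOfExtraRead}.
    static long[] drainTwiceAndClose(byte[] data) {
        try {
            CrcTrackingInputStream in = new CrcTrackingInputStream(new ByteArrayInputStream(data));
            byte[] buf = new byte[4];
            while (in.read(buf, 0, buf.length) != -1) {
                // keep reading until EOF
            }
            int extra = in.read();
            in.close();
            in.close();
            return new long[] {in.contentInfo().join(), extra};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = drainTwiceAndClose("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("crc32=" + Long.toHexString(result[0]) + " readPastEof=" + result[1]);
    }
}
```

The design point is that every terminal path goes through one helper that tolerates being called again, so neither an extra read nor an extra close needs its own error handling.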
@github-actions github-actions Bot added Azure.Core azure-core Storage Storage Service (Queues, Blobs, Files) labels Mar 10, 2026
@ibrandes ibrandes marked this pull request as ready for review March 10, 2026 23:35
Copilot AI review requested due to automatic review settings March 10, 2026 23:35
@ibrandes ibrandes changed the title Fix flaky tests Storage - Fix Flaky Stress Tests Mar 10, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to reduce flakiness in Storage stress tests by making cleanup more resilient, making CRC telemetry streams tolerate re-subscription/double-close behaviors, and aligning dependencies with the chosen OpenTelemetry runtime metrics version.

Changes:

  • Replace unconditional deletes with deleteIfExists() across multiple stress scenarios to avoid cleanup failures when resources are already gone.
  • Add retry/timeout-based global cleanup logic in scenario base classes and add retry logic to async runs.
  • Adjust CRC stream emission behavior to avoid failures on repeated terminal events; downgrade OpenTelemetry instrumentation dependencies to 2.15.0-alpha.
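
As a rough illustration of the cleanup pattern summarized above, the sketch below applies the same two ideas (delete only if the resource still exists, and retry a bounded number of times) to a local directory using the JDK's `Files.deleteIfExists`. It does not use the Azure Storage clients, and `CleanupSketch`, `deleteTreeIfExists`, `cleanupWithRetry`, the attempt count, and the backoff are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogue of the stress-test cleanup; names are assumptions.
final class CleanupSketch {

    // Deletes root and everything under it, children first. Returns how many entries were removed.
    static int deleteTreeIfExists(Path root) throws IOException {
        if (Files.notExists(root)) {
            return 0; // already gone: nothing to do, and not an error
        }
        List<Path> childrenFirst;
        try (Stream<Path> walk = Files.walk(root)) {
            childrenFirst = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        int deleted = 0;
        for (Path path : childrenFirst) {
            if (Files.deleteIfExists(path)) { // returns false instead of throwing if missing
                deleted++;
            }
        }
        return deleted;
    }

    // Retries the whole cleanup up to maxAttempts times; true means the tree is gone.
    static boolean cleanupWithRetry(Path root, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                deleteTreeIfExists(root);
                return true;
            } catch (IOException | UncheckedIOException e) {
                System.err.println("cleanup attempt " + attempt + " failed: " + e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    // Builds a small tree, cleans it up twice, and checks that the second pass is harmless.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("cleanup-sketch");
            Files.createFile(dir.resolve("a.txt"));
            Path sub = Files.createDirectory(dir.resolve("sub"));
            Files.createFile(sub.resolve("b.txt"));
            boolean first = cleanupWithRetry(dir, 3, 10L);
            boolean second = cleanupWithRetry(dir, 3, 10L);
            return first && second && Files.notExists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cleanup ok: " + demo());
    }
}
```

Running the cleanup a second time on an already-deleted tree succeeds, which is the property that keeps a repeated or racing cleanup from failing a test run.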

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.

Show a summary per file
| File | Description |
| --- | --- |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/TelemetryHelper.java | Adjust JVM runtime metrics registration and make timeout/cancellation detection null-safe. |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcOutputStream.java | Switch sink emission to tryEmitValue to tolerate double-close. |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcInputStream.java | Refactor EOF emission, add resubscription state reset, and switch to tryEmitValue. |
| sdk/storage/azure-storage-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/UploadFromFile.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/ShareScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-file-share-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/UploadFromFile.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/Upload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/DataLakeScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-file-datalake-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/UploadPages.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/Upload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/StageBlock.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobOutputStream.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/CommitBlockList.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobUpload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobOutputStream.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlobScenarioBase.java | Add retrying global cleanup + async retry behavior and structured logging. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlock.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlobOutputStream.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/parents/azure-client-sdk-parent/pom.xml | Downgrade io.clientcore:linting-extensions used by checkstyle plugin from beta.2 to beta.1. |

Comment thread sdk/storage/azure-storage-file-datalake-stress/pom.xml
Comment thread sdk/storage/azure-storage-stress/pom.xml
Comment thread sdk/storage/azure-storage-file-share-stress/pom.xml
@browndav-msft browndav-msft requested a review from ibrandes March 11, 2026 17:04
Member

@ibrandes ibrandes left a comment

i resolved all comments that i don't think need to be addressed right now, but there are still a couple lingering ones (make sure you expand hidden conversations). not sure how important the ones about globalCleanupAsync are, but i think we should address the ones about doOnError and deleteAllFilesInFileSystem, thoughts?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.

Comment thread sdk/storage/azure-storage-stress/pom.xml
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.

Comment thread sdk/storage/azure-storage-file-datalake-stress/pom.xml
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.

Comment thread sdk/storage/azure-storage-stress/pom.xml
Comment thread sdk/storage/azure-storage-blob-stress/pom.xml
Comment thread sdk/storage/azure-storage-file-share-stress/pom.xml
Comment thread sdk/storage/azure-storage-file-datalake-stress/pom.xml
Member

@ibrandes ibrandes left a comment

lgtm!

@browndav-msft browndav-msft enabled auto-merge (squash) March 16, 2026 22:35
@browndav-msft browndav-msft merged commit 3544379 into Azure:main Mar 16, 2026
12 checks passed
@browndav-msft browndav-msft deleted the fix-flaky-tests branch March 16, 2026 22:37
@weidongxu-microsoft
Member

weidongxu-microsoft commented Mar 17, 2026

@browndav-msft Please update
https://github.com/Azure/azure-sdk-for-java/blob/main/eng/versioning/external_dependencies.txt#L126

Otherwise any increment PR would revert this version change. e.g. #48434

browndav-msft added a commit to browndav-msft/azure-sdk-for-java that referenced this pull request Mar 17, 2026
* removed enableDeterministic

* change .delete() to .deleteIfExists()

* remove Sinks.EmitFailureHandler.FAIL_FAST from CrcInputStream

- read functions had FAIL_FAST, which would throw an error when the stream had reached the end and we wanted to read from the stream again, so we removed it from both reads
- refactor code so that the exit criteria are at the beginning
- refactor emitContentInfo to keep the code DRY

* prevent crashes on reattempted close on stream

- changed emitValue to tryEmitValue
- remove Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes do not cause an error to be thrown

* fix telemetry so that it doesn't swallow errors

* roll back two deps because they were causing failures in the containers

- opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha
- opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha

* rollback azure-client-sdk-parent linting extensions from 1.0.0-beta.2 to beta.1

* revert linting extensions to beta2

* remove comments with old code

* add logging for errors

* remove catches for double close issue and okay status

* recursively delete files then delete the directory

* change to sync deletes, refactor for easier reading

* restructuring share cleanup so super is called only once

* incorporate copilot suggestions

* incorporate copilot suggestions

* incorporate copilot suggestions

* incorporate copilot suggestions

* fix all deletes to make sync and wrap in try-catch

* fix tests so that super.globalCleanupAsync() is only called once

* change telemetry to logging only returns final state instead of failed retries when ultimately successful

* undo version downgrade for linting-extensions

* Fixing spacing in error messages

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* refactor datalake delete all so that it is easier to read

* refactor runAsync in ShareScenarioBase so retry failures don't show as failures upon success

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
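
The two commits above about reporting only the final state of a retried operation can be sketched roughly as follows. This is an assumption about the intent rather than the code in `ShareScenarioBase` or the telemetry helper: `FinalOutcomeRetrySketch`, `Outcome`, and `runWithRetry` are hypothetical names, and the sketch is synchronous plain Java rather than Reactor.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: intermediate retry failures are remembered but never reported on their own.
final class FinalOutcomeRetrySketch {

    // The single result a caller (or telemetry) sees for the whole retried operation.
    static final class Outcome<T> {
        final T value;
        final Exception error;
        final int attempts;

        private Outcome(T value, Exception error, int attempts) {
            this.value = value;
            this.error = error;
            this.attempts = attempts;
        }

        boolean succeeded() {
            return error == null;
        }
    }

    // Runs op up to maxAttempts times and returns one Outcome describing the final state.
    static <T> Outcome<T> runWithRetry(Callable<T> op, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return new Outcome<>(op.call(), null, attempt);
            } catch (Exception e) {
                last = e; // kept for the final report only if every attempt fails
            }
        }
        return new Outcome<>(null, last, maxAttempts);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fails twice with a transient error, then succeeds on the third call.
        Outcome<String> outcome = runWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "done";
        }, 5);
        System.out.println("succeeded=" + outcome.succeeded() + " attempts=" + outcome.attempts);
    }
}
```

Because the caller only ever sees one `Outcome`, a run that succeeds on its third attempt is recorded as a success, while a run that exhausts its attempts surfaces the last error.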
browndav-msft added a commit that referenced this pull request May 1, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
browndav-msft added a commit to browndav-msft/azure-sdk-for-java that referenced this pull request May 6, 2026
Labels

Azure.Core azure-core Storage Storage Service (Queues, Blobs, Files)

4 participants