Storage - Fix Flaky Stress Tests#48359
Conversation
- read functions had FAIL_FAST which would throw an error when the stream had reached then end and we wanted to read from the stream again. So we removed from both reads. - refactor code so that the exit criteria is a tthe beginning - refactor the emitContentInfo for dry
- changed emitValue to tryEmitValue - remove Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes does not cause an error to be thrown
- opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha - opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce flakiness in Storage stress tests by making cleanup more resilient, making CRC telemetry streams tolerate re-subscription/double-close behaviors, and aligning dependencies with the chosen OpenTelemetry runtime metrics version.
Changes:
- Replace unconditional deletes with
deleteIfExists()across multiple stress scenarios to avoid cleanup failures when resources are already gone. - Add retry/timeout-based global cleanup logic in scenario base classes and add retry logic to async runs.
- Adjust CRC stream emission behavior to avoid failures on repeated terminal events; downgrade OpenTelemetry instrumentation dependencies to
2.15.0-alpha.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/TelemetryHelper.java | Adjust JVM runtime metrics registration and make timeout/cancellation detection null-safe. |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcOutputStream.java | Switch sink emission to tryEmitValue to tolerate double-close. |
| sdk/storage/azure-storage-stress/src/main/java/com/azure/storage/stress/CrcInputStream.java | Refactor EOF emission, add resubscription state reset, and switch to tryEmitValue. |
| sdk/storage/azure-storage-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/UploadFromFile.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-share-stress/src/main/java/com/azure/storage/file/share/stress/ShareScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-file-share-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/UploadFromFile.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/Upload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-file-datalake-stress/src/main/java/com/azure/storage/file/datalake/stress/DataLakeScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-file-datalake-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/UploadPages.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/Upload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/StageBlock.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobScenarioBase.java | Add retrying global cleanup + async retry behavior and new logging. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/PageBlobOutputStream.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/CommitBlockList.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobUpload.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlockBlobOutputStream.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/BlobScenarioBase.java | Add retrying global cleanup + async retry behavior and structured logging. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlock.java | Use deleteIfExists() and swallow delete errors during cleanup. |
| sdk/storage/azure-storage-blob-stress/src/main/java/com/azure/storage/blob/stress/AppendBlobOutputStream.java | Use deleteIfExists() during per-test cleanup. |
| sdk/storage/azure-storage-blob-stress/pom.xml | Downgrade OTel runtime telemetry + logback appender to 2.15.0-alpha. |
| sdk/parents/azure-client-sdk-parent/pom.xml | Downgrade io.clientcore:linting-extensions used by checkstyle plugin from beta.2 to beta.1. |
ibrandes
left a comment
There was a problem hiding this comment.
i resolved all comments that i don't think need to be addressed right now, but there are still a couple lingering ones (make sure you expand hidden conversations). not sure how important the ones about globalCleanupAsync are, but i think we should address the ones about doOnError and deleteAllFilesInFileSystem, thoughts?
… retries when ultimately successful
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
… failures upon success
|
@browndav-msft Please update Otherwise any increment PR would revert this version change. e.g. #48434 |
* removed enableDeterministic * change .delete() to .deleteIfExists() * remove Sinks.EmitFailureHandler.FAIL_FAST from CrcInputStream - read functions had FAIL_FAST which would throw an error when the stream had reached then end and we wanted to read from the stream again. So we removed from both reads. - refactor code so that the exit criteria is a tthe beginning - refactor the emitContentInfo for dry * prevent crashes on reattempted close on stream - changed emitValue to tryEmitValue - remove Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes does not cause an error to be thrown * fix telemetry so that it doesnt swallow errors * roll back two deps because they were causing failures in the containers - opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha - opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha * rollback azure-client-sdk-parent linting extensions from 1.0.0-beta.2 t0 beta.1 * revert linting extensions to beta2 * remove comments with old code * add logging for errors * remove catches for double close issue and okay status * recursively delete files then delete the directory * change to sync deletes, refactor for easier reading * restructing share clean up so super calls only once * incorporate copilot suggestions * incorporate copilot suggestions * incorporate copilot suggestions * incorporate copilot suggestions * fix all deletes to make sync and wrap in try-catch * fix tests so that super.globalCleanupAsync() is only called once * change telemetry to loggin only returns final state instead of failed retries when ultimately successful * undo versio downgrade for linting-extensions * Fixing spacing in error messages Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * refactor datalake delete all so that it is easier to read * refactor runAsync in ShareScenarioBase so retry failures dont show as failures upon success --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* removed enableDeterministic * change .delete() to .deleteIfExists() * remove Sinks.EmitFailureHandler.FAIL_FAST from CrcInputStream - read functions had FAIL_FAST which would throw an error when the stream had reached then end and we wanted to read from the stream again. So we removed from both reads. - refactor code so that the exit criteria is a tthe beginning - refactor the emitContentInfo for dry * prevent crashes on reattempted close on stream - changed emitValue to tryEmitValue - remove Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes does not cause an error to be thrown * fix telemetry so that it doesnt swallow errors * roll back two deps because they were causing failures in the containers - opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha - opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha * rollback azure-client-sdk-parent linting extensions from 1.0.0-beta.2 t0 beta.1 * revert linting extensions to beta2 * remove comments with old code * add logging for errors * remove catches for double close issue and okay status * recursively delete files then delete the directory * change to sync deletes, refactor for easier reading * restructing share clean up so super calls only once * incorporate copilot suggestions * incorporate copilot suggestions * incorporate copilot suggestions * incorporate copilot suggestions * fix all deletes to make sync and wrap in try-catch * fix tests so that super.globalCleanupAsync() is only called once * change telemetry to loggin only returns final state instead of failed retries when ultimately successful * undo versio downgrade for linting-extensions * Fixing spacing in error messages Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * refactor datalake delete all so that it is easier to read * refactor runAsync in ShareScenarioBase so retry failures dont show as failures upon success --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* removed enableDeterministic * change .delete() to .deleteIfExists() * remove Sinks.EmitFailureHandler.FAIL_FAST from CrcInputStream - read functions had FAIL_FAST which would throw an error when the stream had reached then end and we wanted to read from the stream again. So we removed from both reads. - refactor code so that the exit criteria is a tthe beginning - refactor the emitContentInfo for dry * prevent crashes on reattempted close on stream - changed emitValue to tryEmitValue - remove Sinks.EmitFailureHandler.FAIL_FAST so that multiple closes does not cause an error to be thrown * fix telemetry so that it doesnt swallow errors * roll back two deps because they were causing failures in the containers - opentelemetry-runtime-telemetry-java8 from 2.24.0-alpha -> 2.15.0-alpha - opentelemetry-logback-appender-1.0 from 2.24.0-alpha -> 2.15.0-alpha * rollback azure-client-sdk-parent linting extensions from 1.0.0-beta.2 t0 beta.1 * revert linting extensions to beta2 * remove comments with old code * add logging for errors * remove catches for double close issue and okay status * recursively delete files then delete the directory * change to sync deletes, refactor for easier reading * restructing share clean up so super calls only once * incorporate copilot suggestions * incorporate copilot suggestions * incorporate copilot suggestions * incorporate copilot suggestions * fix all deletes to make sync and wrap in try-catch * fix tests so that super.globalCleanupAsync() is only called once * change telemetry to loggin only returns final state instead of failed retries when ultimately successful * undo versio downgrade for linting-extensions * Fixing spacing in error messages Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * refactor datalake delete all so that it is easier to read * refactor runAsync in ShareScenarioBase so retry failures dont show as failures upon success --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This is a fix to fix the issues we've been having with the stress tests.