fix(storage): standardize filename sanitization across all backends#543
Merged
michael-richey merged 6 commits into main on Apr 27, 2026
Conversation
…ckends

Colons in resource IDs were encoded inconsistently: LocalFile replaced `:` with `.`, but S3/GCS/Azure used raw IDs. This made state files non-portable between backends and violated GCS/S3/Azure naming guidelines.

Adds `BaseStorage._sanitize_id_for_filename()` and applies it in all four backends' `put()` methods. Also adds `get_by_ids()` and `get_single()` abstract methods (with implementations) needed by the upcoming `--minimize-reads` feature.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
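Based on the description above, a minimal sketch of the shared sanitizer (the class and method names come from the PR; the one-line body and docstring are assumptions):

```python
class BaseStorage:
    """Sketch of the storage base class; the real class has more members."""

    @staticmethod
    def _sanitize_id_for_filename(resource_id: str) -> str:
        # Colons are reserved on Windows filesystems and discouraged in
        # S3/GCS/Azure object keys; replacing them with '.' gives every
        # backend the same portable filename. (Assumed one-line body.)
        return resource_id.replace(":", ".")


print(BaseStorage._sanitize_id_for_filename("dashboard:abc-123"))
# dashboard.abc-123
```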
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ultdict imports

- Add `BaseStorage._check_id_collisions()` to log an error when two resource IDs sanitize to the same filename (e.g. 'foo:bar' and 'foo.bar'), called before every per-file write loop in all four backends
- Guard `get_by_ids()` with `ValueError` in all backends when `resource_per_file=False`, since key construction by ID only works in per-file mode
- Move the `defaultdict` import from inside `_list_and_load()` to module level in azure_blob_container.py and gcs_bucket.py (it was already at module level in the S3 backend)
- Add 4 new tests covering collision detection and the `get_by_ids()` guard

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
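A sketch of how such a collision check might work, grouping IDs by their sanitized filename with a `defaultdict` (written here as a free function for illustration; the PR implements it as `BaseStorage._check_id_collisions()`, whose exact signature is not shown):

```python
from collections import defaultdict
import logging

logger = logging.getLogger(__name__)


def check_id_collisions(resource_ids):
    """Return the set of IDs whose sanitized filename collides with an
    earlier ID's filename. (Hypothetical stand-in for the PR's method.)"""
    by_filename = defaultdict(list)
    for rid in resource_ids:
        by_filename[rid.replace(":", ".")].append(rid)
    colliding = set()
    for filename, ids in by_filename.items():
        if len(ids) > 1:
            logger.error("IDs %s all sanitize to filename %r", ids, filename)
            colliding.update(ids[1:])  # first ID wins; later ones are flagged
    return colliding


print(sorted(check_id_collisions(["foo:bar", "foo.bar", "baz"])))
# ['foo.bar']
```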
Dict iteration in Python 3.7+ is insertion-ordered, so the first ID encountered wins when two IDs collide on the same sanitized filename. Add a comment to make this behavior explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
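The guarantee that comment documents can be shown in a few lines (illustrative snippet, not the PR's actual code):

```python
# Dicts preserve insertion order (a language guarantee since Python 3.7),
# so when two IDs map to the same sanitized key, setdefault() keeps the
# first one encountered and silently ignores the later one.
sanitized = {}
for rid in ["foo:bar", "foo.bar"]:
    sanitized.setdefault(rid.replace(":", "."), rid)
print(sanitized)
# {'foo.bar': 'foo:bar'}
```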
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`_check_id_collisions()` now returns the set of IDs that would collide with an earlier ID's sanitized filename. All four backends capture the return value and skip those IDs in the write loop, preventing the second resource from clobbering the first. The error log is still emitted so operators know a collision occurred. The test is updated to assert both the error log and that only one file is written to disk (containing the first ID's data, not the second's).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
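The skip-in-the-write-loop pattern described here might look like the following sketch (function and parameter names are hypothetical; the real backends call the local filesystem or cloud SDKs instead of building path strings):

```python
def write_resources(resources, storage_dir, colliding_ids):
    """Write one file per resource, skipping IDs flagged as collisions
    so a later resource never clobbers an earlier one. (Sketch only.)"""
    written = []
    for rid, body in resources.items():
        if rid in colliding_ids:
            continue  # skip: this ID would overwrite an earlier ID's file
        written.append(f"{storage_dir}/{rid.replace(':', '.')}.json")
    return written


paths = write_resources(
    {"foo:bar": "{}", "foo.bar": "{}"}, "state", colliding_ids={"foo.bar"}
)
print(paths)
# ['state/foo.bar.json']
```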
riyazsh
approved these changes
Apr 27, 2026
Summary
- `BaseStorage._sanitize_id_for_filename()`: shared static method that replaces `:` with `.` for cross-platform filename safety. `LocalFile` was sanitizing colons in filenames but S3/GCS/Azure were writing raw IDs (potentially including colons).
- `BaseStorage._check_id_collisions()`: detects when two resource IDs sanitize to the same filename, logs an error, and returns the colliding IDs so callers can skip them, preventing silent overwrites.
- `get_by_ids()` and `get_single()` methods added to `BaseStorage`, with implementations in all four backends (needed by the upcoming `--minimize-reads` feature); both require `--resource-per-file`.

Why colons are problematic: `:` is reserved in Windows filenames and discouraged by the S3/GCS/Azure object-key naming guidelines, so raw IDs containing colons produce non-portable state files.

Migration: existing S3/GCS/Azure state files with `:` in keys will be orphaned. The next `import` run rewrites them with sanitized keys. Since most Datadog resource IDs are UUIDs or integers without colons, practical impact is minimal.

Test plan
- `pytest tests/unit/test_storage_sanitization.py`: 14 tests, all pass
- `pytest tests/unit/`: full regression, 350 tests pass
- `test_round_trip_colon_id_put_then_get_single` verifies the sanitized filename on disk and reads the resource back via the original ID
- `test_collision_is_logged_and_skipped` verifies that only one file is written (first ID wins, second is skipped) and that an error is logged

🤖 Generated with Claude Code
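The `--resource-per-file` guard mentioned in the summary can be sketched as follows (a hypothetical free function standing in for the backend `get_by_ids()` methods; the real ones read and deserialize files rather than returning key names):

```python
def get_by_ids(resource_ids, resource_per_file):
    """Build per-resource keys from IDs. Only valid in per-file mode,
    since key construction by ID requires one file per resource. (Sketch.)"""
    if not resource_per_file:
        raise ValueError("get_by_ids() requires resource_per_file=True")
    return [f"{rid.replace(':', '.')}.json" for rid in resource_ids]


try:
    get_by_ids(["foo:bar"], resource_per_file=False)
except ValueError as exc:
    print(exc)
# get_by_ids() requires resource_per_file=True
```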