fix(storage): standardize filename sanitization across all backends #543

Merged
michael-richey merged 6 commits into main from hamr/minimize-reads-1-sanitize on Apr 27, 2026

michael-richey (Collaborator) commented Apr 23, 2026

Summary

  • Adds BaseStorage._sanitize_id_for_filename() shared static method that replaces : with . for cross-platform filename safety
  • Fixes inconsistency where LocalFile was sanitizing colons in filenames but S3/GCS/Azure were writing raw IDs (potentially including colons)
  • State files are now portable between backends
  • Adds BaseStorage._check_id_collisions() that detects when two resource IDs sanitize to the same filename — logs an error and returns the colliding IDs so callers can skip them, preventing silent overwrites
  • Adds abstract get_by_ids() and get_single() methods to BaseStorage with implementations in all four backends (needed by the upcoming --minimize-reads feature); both require --resource-per-file
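
As a rough illustration of the first bullet, the shared helper might look like the sketch below. Only the method name and the `:` to `.` replacement come from this PR; the class name `BaseStorageSketch`, the parameter name, and the `str()` coercion are assumptions made for the example.

```python
class BaseStorageSketch:
    """Hypothetical stand-in for BaseStorage; only the helper is shown."""

    @staticmethod
    def _sanitize_id_for_filename(resource_id):
        # Replace ':' with '.' so the ID can be used as a filename or object key.
        # The str() coercion is an assumption, so integer IDs also work here.
        return str(resource_id).replace(":", ".")


print(BaseStorageSketch._sanitize_id_for_filename("team:1234"))  # prints team.1234
```

Because the helper is static, every backend can call it without holding any state, which is what keeps the resulting filenames identical across LocalFile, S3, GCS, and Azure.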

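Why colons are problematic: they are not safe in filenames on every platform (hence the `:` to `.` replacement), and because only LocalFile sanitized them, state files written by one backend could not be read by another.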

Migration: Existing S3/GCS/Azure state files with : in keys will be orphaned. Next import run rewrites them with sanitized keys. Since most Datadog resource IDs are UUIDs or integers without colons, practical impact is minimal.
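
To gauge the migration impact before upgrading, one could scan an existing key listing for colons. The sketch below is an assumption-laden illustration: `find_orphaned_keys` is a hypothetical helper that is not part of this PR, and the key layout in `existing` is invented for the example.

```python
def find_orphaned_keys(keys):
    """Map each key containing ':' to the sanitized key a later import would write.

    Hypothetical helper: it only inspects key strings and touches no storage.
    """
    return {key: key.replace(":", ".") for key in keys if ":" in key}


# Invented key names; a real bucket listing would come from the storage backend.
existing = [
    "resources/dashboards.abc-123.json",
    "resources/roles.team:42.json",
]
for old_key, new_key in find_orphaned_keys(existing).items():
    print(f"{old_key} -> {new_key}")
```

An empty result would suggest that no keys are orphaned by the change.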

Test plan

  • pytest tests/unit/test_storage_sanitization.py — 14 tests, all pass
  • pytest tests/unit/ — full regression, 350 tests pass
  • Round-trip test: test_round_trip_colon_id_put_then_get_single verifies sanitized filename on disk, read back via original ID
  • Collision test: test_collision_is_logged_and_skipped verifies only one file is written (first ID wins, second is skipped) and error is logged
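
The round-trip check in the plan can be sketched as follows. This is not the test from the PR: `LocalFileSketch`, its `put()` and `get_single()` signatures, and the `<type>.<id>.json` filename layout are all assumptions chosen to keep the example self-contained.

```python
import json
import os
import tempfile


class LocalFileSketch:
    """Hypothetical per-file local backend, not the project's LocalFile class."""

    def __init__(self, root):
        self.root = root

    @staticmethod
    def _sanitize_id_for_filename(resource_id):
        return str(resource_id).replace(":", ".")

    def _path(self, resource_type, resource_id):
        # Assumed filename layout: <type>.<sanitized id>.json
        name = f"{resource_type}.{self._sanitize_id_for_filename(resource_id)}.json"
        return os.path.join(self.root, name)

    def put(self, resource_type, resource_id, data):
        # The file body keeps the original ID; only the filename is sanitized.
        with open(self._path(resource_type, resource_id), "w") as f:
            json.dump({resource_id: data}, f)

    def get_single(self, resource_type, resource_id):
        path = self._path(resource_type, resource_id)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            return json.load(f).get(resource_id)


def test_round_trip_colon_id():
    with tempfile.TemporaryDirectory() as root:
        storage = LocalFileSketch(root)
        storage.put("roles", "team:42", {"name": "admins"})
        # The filename on disk is sanitized...
        assert os.listdir(root) == ["roles.team.42.json"]
        # ...but the caller still reads it back with the original ID.
        assert storage.get_single("roles", "team:42") == {"name": "admins"}


test_round_trip_colon_id()
```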

🤖 Generated with Claude Code

…ckends

Colons in resource IDs were encoded inconsistently: LocalFile replaced
':' with '.', but S3/GCS/Azure used raw IDs. This made state files
non-portable between backends and violated GCS/S3/Azure naming guidelines.

Adds `BaseStorage._sanitize_id_for_filename()` and applies it in all four
backends' `put()` methods. Also adds `get_by_ids()` and `get_single()`
abstract methods (with implementations) needed by the upcoming
--minimize-reads feature.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
michael-richey and others added 2 commits April 23, 2026 11:05
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ultdict imports

- Add BaseStorage._check_id_collisions() to log an error when two resource IDs
  sanitize to the same filename (e.g. 'foo:bar' and 'foo.bar'), called before
  every per-file write loop in all four backends
- Guard get_by_ids() with ValueError in all backends when resource_per_file=False,
  since key construction by ID only works in per-file mode
- Move defaultdict import from inside _list_and_load() to module level in
  azure_blob_container.py and gcs_bucket.py (was already at module level in S3)
- Add 4 new tests covering collision detection and the get_by_ids guard

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
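
The `get_by_ids()` guard described in this commit could look roughly like the sketch below. The class `PerFileGuardSketch`, the in-memory `_files` dict, the method signature, and the error message are assumptions; only the behavior (raise `ValueError` when `resource_per_file=False`) comes from the commit message.

```python
class PerFileGuardSketch:
    """Hypothetical backend that keeps 'files' in a dict instead of real storage."""

    def __init__(self, resource_per_file):
        self.resource_per_file = resource_per_file
        self._files = {}  # sanitized filename -> stored data

    @staticmethod
    def _sanitize_id_for_filename(resource_id):
        return str(resource_id).replace(":", ".")

    def _key(self, resource_type, resource_id):
        return f"{resource_type}.{self._sanitize_id_for_filename(resource_id)}.json"

    def put(self, resource_type, resource_id, data):
        self._files[self._key(resource_type, resource_id)] = data

    def get_by_ids(self, resource_type, ids):
        # A key can only be derived from an ID when each resource has its own file.
        if not self.resource_per_file:
            raise ValueError("get_by_ids() requires resource_per_file=True")
        found = {}
        for resource_id in ids:
            key = self._key(resource_type, resource_id)
            if key in self._files:
                found[resource_id] = self._files[key]
        return found
```

Failing fast here avoids silently returning an empty result when the storage holds one combined file per resource type.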
michael-richey marked this pull request as ready for review April 23, 2026 20:52
michael-richey requested a review from a team as a code owner April 23, 2026 20:52
michael-richey and others added 3 commits April 24, 2026 11:47
Dict iteration in Python 3.7+ is insertion-ordered, so the first ID
encountered wins when two IDs collide on the same sanitized filename.
Add a comment to make this behavior explicit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_check_id_collisions() now returns the set of IDs that would collide
with an earlier ID's sanitized filename. All four backends capture the
return value and skip those IDs in the write loop, preventing the
second resource from clobbering the first. Error log is still emitted
so operators know a collision occurred.

Test updated to assert both the error log and that only one file is
written to disk (containing the first ID's data, not the second's).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
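
The collision handling from the last two commits can be sketched as below: the check returns the IDs that lose to an earlier ID, and the write loop skips them. The function names, signatures, log message, and the dict standing in for written files are assumptions; the first-ID-wins behavior relies on dict insertion order, as the commit notes.

```python
import logging

log = logging.getLogger("storage_sketch")


def sanitize(resource_id):
    return str(resource_id).replace(":", ".")


def check_id_collisions(resource_type, ids):
    """Return the set of IDs whose sanitized filename is already claimed."""
    first_owner = {}  # sanitized filename -> first ID that produced it
    colliding = set()
    for resource_id in ids:  # insertion order decides which ID wins
        name = sanitize(resource_id)
        if name in first_owner:
            log.error(
                "%s: IDs %r and %r both map to filename %r; skipping %r",
                resource_type, first_owner[name], resource_id, name, resource_id,
            )
            colliding.add(resource_id)
        else:
            first_owner[name] = resource_id
    return colliding


def write_all(resource_type, resources):
    """Hypothetical per-file write loop; returns filename -> data instead of writing."""
    skip = check_id_collisions(resource_type, resources)
    written = {}
    for resource_id, data in resources.items():
        if resource_id in skip:
            continue  # do not clobber the file owned by the earlier ID
        written[f"{resource_type}.{sanitize(resource_id)}.json"] = data
    return written


resources = {"foo:bar": {"v": 1}, "foo.bar": {"v": 2}}
print(write_all("roles", resources))  # prints {'roles.foo.bar.json': {'v': 1}}
```

Returning the colliding IDs, rather than only logging, lets each backend decide how to skip them while sharing one detection routine.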
michael-richey merged commit 16eb4b9 into main Apr 27, 2026
11 checks passed
michael-richey deleted the hamr/minimize-reads-1-sanitize branch April 27, 2026 15:41

2 participants