fix(storage): standardize filename sanitization across all backends#543
Merged
michael-richey merged 6 commits into main on Apr 27, 2026
Conversation
…ckends

Colons in resource IDs were encoded inconsistently: LocalFile replaced `:` with `.`, but S3/GCS/Azure used raw IDs. This made state files non-portable between backends and violated GCS/S3/Azure naming guidelines.

Adds `BaseStorage._sanitize_id_for_filename()` and applies it in all four backends' `put()` methods. Also adds `get_by_ids()` and `get_single()` abstract methods (with implementations) needed by the upcoming `--minimize-reads` feature.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
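Based on the description above, a minimal sketch of the shared sanitizer (the class and method names come from the PR; the one-line body and docstring are assumptions):

```python
class BaseStorage:
    """Sketch of the storage base class; the real class has more members."""

    @staticmethod
    def _sanitize_id_for_filename(resource_id: str) -> str:
        # Colons are reserved on Windows filesystems and discouraged in
        # S3/GCS/Azure object keys; replacing them with '.' gives every
        # backend the same portable filename. (Assumed one-line body.)
        return resource_id.replace(":", ".")


print(BaseStorage._sanitize_id_for_filename("dashboard:abc-123"))
# dashboard.abc-123
```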
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ultdict imports

- Add `BaseStorage._check_id_collisions()` to log an error when two resource IDs sanitize to the same filename (e.g. 'foo:bar' and 'foo.bar'), called before every per-file write loop in all four backends
- Guard `get_by_ids()` with `ValueError` in all backends when `resource_per_file=False`, since key construction by ID only works in per-file mode
- Move the `defaultdict` import from inside `_list_and_load()` to module level in azure_blob_container.py and gcs_bucket.py (it was already at module level in the S3 backend)
- Add 4 new tests covering collision detection and the `get_by_ids()` guard

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
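A sketch of how such a collision check might work, grouping IDs by their sanitized filename with a `defaultdict` (written here as a free function for illustration; the PR implements it as `BaseStorage._check_id_collisions()`, whose exact signature is not shown):

```python
from collections import defaultdict
import logging

logger = logging.getLogger(__name__)


def check_id_collisions(resource_ids):
    """Return the set of IDs whose sanitized filename collides with an
    earlier ID's filename. (Hypothetical stand-in for the PR's method.)"""
    by_filename = defaultdict(list)
    for rid in resource_ids:
        by_filename[rid.replace(":", ".")].append(rid)
    colliding = set()
    for filename, ids in by_filename.items():
        if len(ids) > 1:
            logger.error("IDs %s all sanitize to filename %r", ids, filename)
            colliding.update(ids[1:])  # first ID wins; later ones are flagged
    return colliding


print(sorted(check_id_collisions(["foo:bar", "foo.bar", "baz"])))
# ['foo.bar']
```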
Dict iteration in Python 3.7+ is insertion-ordered, so the first ID encountered wins when two IDs collide on the same sanitized filename. Add a comment to make this behavior explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
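The guarantee that comment documents can be shown in a few lines (illustrative snippet, not the PR's actual code):

```python
# Dicts preserve insertion order (a language guarantee since Python 3.7),
# so when two IDs map to the same sanitized key, setdefault() keeps the
# first one encountered and silently ignores the later one.
sanitized = {}
for rid in ["foo:bar", "foo.bar"]:
    sanitized.setdefault(rid.replace(":", "."), rid)
print(sanitized)
# {'foo.bar': 'foo:bar'}
```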
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`_check_id_collisions()` now returns the set of IDs that would collide with an earlier ID's sanitized filename. All four backends capture the return value and skip those IDs in the write loop, preventing the second resource from clobbering the first. The error log is still emitted so operators know a collision occurred. The test is updated to assert both the error log and that only one file is written to disk (containing the first ID's data, not the second's).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
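The skip-in-the-write-loop pattern described here might look like the following sketch (function and parameter names are hypothetical; the real backends call the local filesystem or cloud SDKs instead of building path strings):

```python
def write_resources(resources, storage_dir, colliding_ids):
    """Write one file per resource, skipping IDs flagged as collisions
    so a later resource never clobbers an earlier one. (Sketch only.)"""
    written = []
    for rid, body in resources.items():
        if rid in colliding_ids:
            continue  # skip: this ID would overwrite an earlier ID's file
        written.append(f"{storage_dir}/{rid.replace(':', '.')}.json")
    return written


paths = write_resources(
    {"foo:bar": "{}", "foo.bar": "{}"}, "state", colliding_ids={"foo.bar"}
)
print(paths)
# ['state/foo.bar.json']
```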
riyazsh
approved these changes
Apr 27, 2026
Summary
- `BaseStorage._sanitize_id_for_filename()`: shared static method that replaces `:` with `.` for cross-platform filename safety. `LocalFile` was sanitizing colons in filenames but S3/GCS/Azure were writing raw IDs (potentially including colons).
- `BaseStorage._check_id_collisions()`: detects when two resource IDs sanitize to the same filename, logs an error, and returns the colliding IDs so callers can skip them, preventing silent overwrites.
- `get_by_ids()` and `get_single()` methods added to `BaseStorage`, with implementations in all four backends (needed by the upcoming `--minimize-reads` feature); both require `--resource-per-file`.

Why colons are problematic: `:` is reserved in Windows filenames and discouraged by the S3/GCS/Azure object-key naming guidelines, so raw IDs containing colons produce non-portable state files.

Migration: existing S3/GCS/Azure state files with `:` in keys will be orphaned. The next `import` run rewrites them with sanitized keys. Since most Datadog resource IDs are UUIDs or integers without colons, practical impact is minimal.

Test plan
- `pytest tests/unit/test_storage_sanitization.py`: 14 tests, all pass
- `pytest tests/unit/`: full regression, 350 tests pass
- `test_round_trip_colon_id_put_then_get_single` verifies the sanitized filename on disk and reads the resource back via the original ID
- `test_collision_is_logged_and_skipped` verifies that only one file is written (first ID wins, second is skipped) and that an error is logged

🤖 Generated with Claude Code
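The `--resource-per-file` guard mentioned in the summary can be sketched as follows (a hypothetical free function standing in for the backend `get_by_ids()` methods; the real ones read and deserialize files rather than returning key names):

```python
def get_by_ids(resource_ids, resource_per_file):
    """Build per-resource keys from IDs. Only valid in per-file mode,
    since key construction by ID requires one file per resource. (Sketch.)"""
    if not resource_per_file:
        raise ValueError("get_by_ids() requires resource_per_file=True")
    return [f"{rid.replace(':', '.')}.json" for rid in resource_ids]


try:
    get_by_ids(["foo:bar"], resource_per_file=False)
except ValueError as exc:
    print(exc)
# get_by_ids() requires resource_per_file=True
```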