workload-replay: anonymize SQL on the AST, not by regex#36801

Closed
jasonhernandez wants to merge 1 commit into workload-anonymize-require-parser from workload-anonymize-ast

@jasonhernandez
Contributor

Sixth in the stack — base workload-anonymize-require-parser (#36799). The big one: replaces text-regex SQL rewriting with AST-based rewriting using Materialize's own parser.

Why

Running the tool against a real production capture (565 queries) showed the regex identifier substitution corrupted ~12% of queries — 65 failures (47 SELECT, 18 FETCH). The raw SQL parsed 100%; our own regex broke it by matching identifiers as substrings and inside string literals. The strict --require-parser mode then (correctly) refused to emit the broken output, so the tool couldn't complete on a real capture at all.
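
Both matching errors are easy to reproduce. The following is a minimal sketch, not the tool's actual regex: it assumes a naive substitution loop with no token awareness, and the table, column, and mapping names are made up for illustration.

```python
import re

# Hypothetical mapping from original identifiers to anonymized names.
mapping = {"user": "t1", "id": "c1"}


def naive_rename(sql: str, mapping: dict) -> str:
    """Replace every occurrence of each key, with no notion of tokens."""
    for original, replacement in mapping.items():
        sql = re.sub(re.escape(original), replacement, sql)
    return sql


sql = "SELECT id, user_email FROM user WHERE note = 'valid user'"
broken = naive_rename(sql, mapping)
print(broken)
# -> SELECT c1, t1_email FROM t1 WHERE note = 'valc1 t1'
```

`user_email` is rewritten because `user` matches as a substring, and the string literal is rewritten because the regex cannot tell a literal from an identifier. This toy output still happens to parse; on the production capture the same class of error produced SQL that did not.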

Text substitution was never the right tool — and we already shell out to the real parser. So do all rewriting on the AST.

What changed

mz-sql-anonymize (Rust) now parses each statement and rewrites the AST via VisitMut:

  • Identifier renaming as whole tokens. Reaches object/cluster/type references by overriding the visitors for Raw's AstInfo associated types (visit_item_name_mut, visit_cluster_name_mut, visit_data_type_mut, …), whose generic defaults are no-ops. No substring corruption, no in-string rewrites, no word-boundary/case guesswork. (Mz identifiers are case-sensitive → exact match is correct.)
  • Query-literal redaction on the AST — numbers, hex strings, intervals included (the regex only caught single-quoted strings).
  • Config preserved: CREATE CLUSTER/CLUSTER REPLICA/ALTER CLUSTER and SET/RESET/SET TRANSACTION/ALTER SYSTEM keep their literals (sizes, timeouts) — replay needs them.
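
The three rules above can be illustrated on a toy AST. This is a Python sketch of the idea only, not the Rust `VisitMut` implementation: the `Ident`, `Literal`, and `Statement` classes, the flat `parts` list, and the `'<REDACTED>'` placeholder are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ident:
    name: str


@dataclass
class Literal:
    text: str


@dataclass
class Statement:
    kind: str                                   # e.g. "SELECT", "SET"
    parts: list = field(default_factory=list)   # Ident, Literal, or keyword str


# Statement kinds whose literals are kept, per the list above.
CONFIG_KINDS = {
    "CREATE CLUSTER", "CREATE CLUSTER REPLICA", "ALTER CLUSTER",
    "SET", "RESET", "SET TRANSACTION", "ALTER SYSTEM",
}
PLACEHOLDER = "'<REDACTED>'"  # hypothetical redaction marker


def rewrite(stmt: Statement, mapping: dict) -> Statement:
    """Rename identifier nodes exactly; redact literals outside config statements."""
    keep_literals = stmt.kind in CONFIG_KINDS
    for node in stmt.parts:
        if isinstance(node, Ident):
            # Whole-node, case-sensitive lookup: no substring matching.
            node.name = mapping.get(node.name, node.name)
        elif isinstance(node, Literal) and not keep_literals:
            node.text = PLACEHOLDER
    return stmt


def render(stmt: Statement) -> str:
    out = []
    for node in stmt.parts:
        if isinstance(node, Ident):
            out.append(node.name)
        elif isinstance(node, Literal):
            out.append(node.text)
        else:
            out.append(node)
    return " ".join(out)


mapping = {"user": "t1"}
query = Statement("SELECT", [
    "SELECT", Ident("user_email"), "FROM", Ident("user"),
    "WHERE", Ident("note"), "=", Literal("'valid user'"),
    "LIMIT", Literal("10"),
])
config = Statement("SET", ["SET", Ident("statement_timeout"), "=", Literal("'5s'")])

query_sql = render(rewrite(query, mapping))
config_sql = render(rewrite(config, mapping))
print(query_sql)   # SELECT user_email FROM t1 WHERE note = '<REDACTED>' LIMIT '<REDACTED>'
print(config_sql)  # SET statement_timeout = '5s'
```

Because the rename keys on whole identifier nodes, `user_email` is untouched and the string literal is redacted rather than partially rewritten; the `SET` statement keeps its `'5s'`.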

Protocol: {mapping, rename_identifiers, redact_literals, statements} → rewritten SQL per statement (or null if unparseable).
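
For orientation, a hedged sketch of what a caller might send and receive. The four request keys come from the protocol line above; the JSON encoding, the array-shaped response aligned with `statements`, and the function names are assumptions, and the response string is hand-written rather than real helper output.

```python
import json


def build_request(statements, mapping, rename_identifiers=True, redact_literals=True):
    """Serialize one batch of statements using the four documented keys."""
    return json.dumps({
        "mapping": mapping,
        "rename_identifiers": rename_identifiers,
        "redact_literals": redact_literals,
        "statements": statements,
    })


def parse_response(raw, statements):
    """Assumed shape: a JSON array with one entry per input statement."""
    results = json.loads(raw)
    if len(results) != len(statements):
        raise ValueError("helper returned a different number of statements")
    return results  # JSON null arrives as None (statement did not parse)


statements = ["SELECT a FROM t", "SELEC oops"]
request = build_request(statements, {"t": "t1", "a": "c1"})

# Hand-written stand-in for the helper's reply.
fake_response = '["SELECT c1 FROM t1", null]'
results = parse_response(fake_response, statements)
print(results)  # ['SELECT c1 FROM t1', None]
```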

Python sends all cluster/DDL/query SQL through the helper, and:

  • still scrubs DDL create_sql literals with the blanket regex, because option strings (broker addresses, hosts) are typed AST fields that neither the visitor nor the engine's own redacted Display treats as redactable literals;
  • applies the structural identifier mapping (column types, child schema/db, query routing fields) directly, since those fields are not SQL;
  • falls back to regex only when the binary is unavailable (gated by --require-parser) or a statement doesn't parse.
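
The fallback rule in the last bullet could look roughly like this. It is a sketch under assumptions: `choose_rewrites`, `ParserUnavailable`, and the convention that `helper_results is None` means "binary unavailable" are all hypothetical, and it assumes --require-parser gates only the missing-binary case, which the description does not spell out for per-statement parse failures.

```python
from typing import Callable, List, Optional


class ParserUnavailable(RuntimeError):
    """Raised when strict mode is on and the helper binary cannot be used."""


def choose_rewrites(
    statements: List[str],
    helper_results: Optional[List[Optional[str]]],
    regex_fallback: Callable[[str], str],
    require_parser: bool,
) -> List[str]:
    if helper_results is None:  # assumed signal: helper binary unavailable
        if require_parser:
            raise ParserUnavailable("mz-sql-anonymize not available")
        return [regex_fallback(s) for s in statements]
    # Per statement: prefer the AST rewrite, fall back only on parse failure.
    return [
        rewritten if rewritten is not None else regex_fallback(original)
        for original, rewritten in zip(statements, helper_results)
    ]


def tag_fallback(sql: str) -> str:
    # Stand-in for the regex path so the example shows which branch ran.
    return "/* regex */ " + sql


chosen = choose_rewrites(
    ["SELECT a FROM t", "SELEC oops"],
    ["SELECT c1 FROM t1", None],
    tag_fallback,
    require_parser=True,
)
print(chosen)  # ['SELECT c1 FROM t1', '/* regex */ SELEC oops']
```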

verify now also exempts preserved config-statement literals (matching the helper), so a kept SET … = '5s' isn't flagged.
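
A rough sketch of such an exemption follows. The real verify step may be structured differently; the prefix check, the `leaked_literals` name, the string-literal pattern, and the `'<REDACTED>'` placeholder are assumptions for illustration.

```python
import re

# Statement prefixes whose literals the helper preserves. "CREATE CLUSTER"
# also covers CREATE CLUSTER REPLICA, and "SET " covers SET TRANSACTION.
CONFIG_PREFIXES = ("CREATE CLUSTER", "ALTER CLUSTER", "SET ", "RESET ", "ALTER SYSTEM")

# Single-quoted SQL string, with '' as an escaped quote.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def leaked_literals(sql: str, placeholder: str = "'<REDACTED>'") -> list:
    """Return string literals that look unredacted, skipping config statements."""
    head = sql.lstrip().upper()
    if head.startswith(CONFIG_PREFIXES):
        return []  # preserved on purpose, so not a leak
    return [lit for lit in STRING_LITERAL.findall(sql) if lit != placeholder]


print(leaked_literals("SET statement_timeout = '5s'"))              # []
print(leaked_literals("SELECT * FROM t1 WHERE c1 = 'alice'"))       # ["'alice'"]
print(leaked_literals("SELECT * FROM t1 WHERE c1 = '<REDACTED>'"))  # []
```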

Validation (same production capture, re-captured)

| Metric | Regex (before) | AST (after) |
| --- | --- | --- |
| Anonymized queries that fail to re-parse | 65 / 565 (12%) | 0 / 623 |
| Identifier leaks (non-keyword originals) | | 0 (only the transaction_id format key) |
| Query numeric literals remaining | leaked | 0 |
| Cluster SIZE / SET timeout | preserved | preserved |

Default mode (--require-parser) now completes on a real capture, where before it errored.

Tests

  • 8 Rust unit tests (token-level rename, qualified refs, no substring/in-string corruption, query-literal redaction incl. numbers, cluster/SET preservation, parse-failure → null).
  • 22 Python tests, incl. a skip-if-unbuilt end-to-end AST rename test and the config-literal verify exemption.
  • bin/fmt, clippy, ruff, test-attribute lint clean; no Cargo.lock version bumps (adds serde, already present).

Note for reviewers

This supersedes #36746's query-only redaction approach. The stack tells the story (heuristic → parser-literals → tests → subsource fix → require-parser → full AST); if preferred I can squash the SQL-rewriting PRs before merge.

🤖 Generated with Claude Code

Validating against a real production capture showed the regex identifier
substitution corrupted ~12% of queries (65 of 565: 47 SELECT, 18 FETCH) —
it matched identifiers as substrings and inside string literals, producing
SQL that no longer parsed. The raw SQL parsed fine; our own rewrite broke it.

Rework mz-sql-anonymize to do all SQL rewriting on the parsed AST:

- Rename identifiers as whole tokens via a VisitMut over the AST. This reaches
  object/cluster/type references (the `Raw` AstInfo associated types, whose
  generic visitors are no-ops) by overriding visit_item_name_mut and friends.
  No substring corruption, no in-string rewrites, no word-boundary or case
  guesswork. Mz identifiers are case-sensitive, so exact matching is correct.
- Redact query literals on the AST (visit_value_mut), covering numbers, hex
  strings, and intervals that the single-quoted-string regex missed.
- Preserve config literals (CREATE CLUSTER / CLUSTER REPLICA / ALTER CLUSTER
  and SET / RESET / SET TRANSACTION / ALTER SYSTEM) so replay keeps sizes and
  timeouts.

The helper now takes {mapping, rename_identifiers, redact_literals,
statements} and returns the rewritten SQL per statement (or null if it does
not parse). Python sends all cluster/DDL/query SQL through it, and:

- still scrubs DDL create_sql literals with the blanket regex, because option
  strings (broker addresses, hosts) are typed AST fields neither the visitor
  nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db,
  query routing fields) directly, since those are not SQL;
- falls back to the regex only when the binary is unavailable (gated by
  --require-parser) or a statement does not parse.

verify now also exempts preserved config-statement literals (matching the
helper) so a kept `SET ... = '5s'` is not flagged.

Re-validated on the production capture: 0 of 623 anonymized queries fail to
re-parse (was 65/565), 0 identifier leaks beyond the format key, numbers
redacted, cluster sizes preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jasonhernandez
Contributor Author

Superseded by #36803, which squashes this stack into a single PR against main.
