workload-replay: anonymize SQL on the AST, not by regex by jasonhernandez · Pull Request #36801 · MaterializeInc/materialize

jasonhernandez · 2026-05-29T21:10:38Z

Sixth in the stack — base workload-anonymize-require-parser (#36799). The big one: replaces text-regex SQL rewriting with AST-based rewriting using Materialize's own parser.

Why

Running the tool against a real production capture (565 queries) showed that the regex identifier substitution corrupted ~12% of queries: 65 failures (47 SELECT, 18 FETCH). All of the raw SQL parsed; our own regex broke it by matching identifiers as substrings and inside string literals. The strict --require-parser mode then (correctly) refused to emit the broken output, so the tool couldn't complete on a real capture at all.

Text substitution was never the right tool — and we already shell out to the real parser. So do all rewriting on the AST.
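For illustration, here is a minimal sketch of the two failure modes named above (substring matches and matches inside string literals). The mapping, the sample query, and `naive_regex_rename` are hypothetical; the tool's actual regex is not shown here and may differ in detail.

```python
import re

# Hypothetical mapping from original identifiers to anonymized names.
mapping = {"id": "col_1", "orders": "table_1"}


def naive_regex_rename(sql: str, mapping: dict) -> str:
    """Replace every occurrence of each identifier, with no notion of tokens."""
    for original, replacement in mapping.items():
        sql = re.sub(re.escape(original), replacement, sql)
    return sql


sql = "SELECT id, valid FROM orders WHERE note = 'paid orders'"
broken = naive_regex_rename(sql, mapping)
print(broken)
# SELECT col_1, valcol_1 FROM table_1 WHERE note = 'pacol_1 table_1'
```

The unrelated column `valid` is mangled because it contains `id`, and the string literal is rewritten because plain text substitution cannot tell a literal from an identifier.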

What changed

mz-sql-anonymize (Rust) now parses each statement and rewrites the AST via VisitMut:

Identifier renaming as whole tokens. Reaches object/cluster/type references by overriding the visitors for Raw's AstInfo associated types (visit_item_name_mut, visit_cluster_name_mut, visit_data_type_mut, …), whose generic defaults are no-ops. No substring corruption, no in-string rewrites, no word-boundary/case guesswork. (Mz identifiers are case-sensitive → exact match is correct.)
Query-literal redaction on the AST — numbers, hex strings, intervals included (the regex only caught single-quoted strings).
Config preserved: CREATE CLUSTER/CLUSTER REPLICA/ALTER CLUSTER and SET/RESET/SET TRANSACTION/ALTER SYSTEM keep their literals (sizes, timeouts) — replay needs them.
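The three behaviours above can be sketched with a toy tokenizer. This is not the Rust helper: the real implementation walks the AST produced by Materialize's parser, while the lexer, the `rewrite` function, and the `'<REDACTED>'` / `0` placeholders below are assumptions made only to show whole-token renaming, literal redaction, and config preservation in one place.

```python
import re

# Toy lexer: strings, numbers, identifiers/keywords, then any other character.
TOKEN = re.compile(r"""
    (?P<string>'(?:[^']|'')*')          # single-quoted literal, '' escapes a quote
  | (?P<number>\b\d+(?:\.\d+)?\b)       # integer or decimal literal
  | (?P<word>[A-Za-z_][A-Za-z_0-9]*)    # identifier or keyword
  | (?P<other>.)                        # whitespace, punctuation, operators
""", re.VERBOSE | re.DOTALL)

# Statement prefixes whose literals are kept (from the list above).
CONFIG_PREFIXES = ("CREATE CLUSTER", "ALTER CLUSTER", "SET", "RESET", "ALTER SYSTEM")


def rewrite(sql, mapping, rename_identifiers=True, redact_literals=True):
    keep_literals = sql.lstrip().upper().startswith(CONFIG_PREFIXES)
    out = []
    for m in TOKEN.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "word" and rename_identifiers:
            # Exact, case-sensitive match on the whole token only.
            out.append(mapping.get(text, text))
        elif kind == "string" and redact_literals and not keep_literals:
            out.append("'<REDACTED>'")
        elif kind == "number" and redact_literals and not keep_literals:
            out.append("0")
        else:
            out.append(text)
    return "".join(out)


mapping = {"id": "col_1", "orders": "table_1", "c1": "cluster_1"}
print(rewrite("SELECT id, valid FROM orders WHERE note = 'paid orders' AND qty > 42", mapping))
print(rewrite("CREATE CLUSTER c1 (SIZE = '100cc')", mapping))
```

A tokenizer still cannot tell a keyword from an identifier with the same spelling, or know which AST position a name occupies, which is one reason to do the rewrite on the AST rather than on tokens.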

Protocol: {mapping, rename_identifiers, redact_literals, statements} → rewritten SQL per statement (or null if unparseable).
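A rough sketch of what a request and reply could look like from the Python side. The four request fields come from the protocol line above; the reply shape (a JSON array aligned with `statements`), the function names, and the sample reply are assumptions, not captured output.

```python
import json


def build_request(statements, mapping, rename_identifiers=True, redact_literals=True):
    """Serialize one batch using the four fields named in the protocol."""
    return json.dumps({
        "mapping": mapping,
        "rename_identifiers": rename_identifiers,
        "redact_literals": redact_literals,
        "statements": statements,
    })


def parse_response(raw, statements):
    """Assumed reply: a JSON array aligned with `statements`, null if unparseable."""
    rewritten = json.loads(raw)
    if len(rewritten) != len(statements):
        raise ValueError("helper returned a different number of statements")
    return rewritten


statements = ["SELECT id FROM orders", "SELEC oops"]
request = build_request(statements, {"id": "col_1", "orders": "table_1"})

# Illustrative reply only: the second statement does not parse, so it maps to null.
reply = '["SELECT col_1 FROM table_1", null]'
print(parse_response(reply, statements))
```

Keeping the reply positionally aligned with the request lets the caller decide per statement what to do with a null.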

Python sends all cluster/DDL/query SQL through the helper, and:

still scrubs DDL create_sql literals with the blanket regex, because option strings (broker addresses, hosts) are typed AST fields that neither the visitor nor the engine's own redacted Display treats as redactable literals;
applies the structural identifier mapping (column types, child schema/db, query routing fields) directly, since those fields are not SQL;
falls back to regex only when the binary is unavailable (gated by --require-parser) or a statement doesn't parse.
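The fallback rule in the last item can be sketched as follows. This is a hedged reading of the description, not the actual Python code: `anonymize_batch`, `regex_fallback`, and `ParserUnavailable` are hypothetical names, and it assumes that an unparseable statement falls back to the regex regardless of --require-parser.

```python
import re


class ParserUnavailable(RuntimeError):
    """Raised when the helper binary is missing and --require-parser is in effect."""


def regex_fallback(sql):
    # Stand-in for the legacy text rewrite: blank out single-quoted literals only.
    return re.sub(r"'(?:[^']|'')*'", "'<REDACTED>'", sql)


def anonymize_batch(statements, helper, require_parser=True):
    """`helper` maps a list of statements to a list of rewritten strings (None
    where a statement did not parse); pass helper=None if the binary is missing."""
    if helper is None:
        if require_parser:
            raise ParserUnavailable("mz-sql-anonymize not found")
        return [regex_fallback(s) for s in statements]
    results = helper(statements)
    # Per-statement fallback: only statements the parser rejected use the regex.
    return [r if r is not None else regex_fallback(s)
            for s, r in zip(statements, results)]
```

For example, `anonymize_batch(stmts, None, require_parser=False)` runs every statement through the regex, while the default raises instead of silently degrading.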

verify now also exempts preserved config-statement literals (matching the helper), so a kept SET … = '5s' isn't flagged.
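One way such an exemption could look, as a sketch only: the prefix regex, the `leaked_literals` name, and the `'<REDACTED>'` placeholder are assumptions, and the real verify step may classify statements differently (for example via the parser).

```python
import re

# Statement kinds whose literals the helper preserves (from the list above).
CONFIG_STATEMENT = re.compile(
    r"^\s*(CREATE\s+CLUSTER|ALTER\s+CLUSTER|SET|RESET|ALTER\s+SYSTEM)\b",
    re.IGNORECASE,
)
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def leaked_literals(sql, placeholder="'<REDACTED>'"):
    """Return string literals verify would flag; config statements are exempt."""
    if CONFIG_STATEMENT.match(sql):
        return []
    return [lit for lit in STRING_LITERAL.findall(sql) if lit != placeholder]


print(leaked_literals("SET statement_timeout = '5s'"))        # []
print(leaked_literals("SELECT * FROM t WHERE a = 'secret'"))  # ["'secret'"]
```

The point is that verify and the helper have to agree on the same set of preserved statements, otherwise a deliberately kept literal is reported as a leak.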

Validation (same production capture, re-captured)

Metric	Regex (before)	AST (after)
Anonymized queries that fail to re-parse	65 / 565 (12%)	0 / 623
Identifier leaks (non-keyword originals)	—	0 (only the `transaction_id` format key)
Query numeric literals remaining	leaked	0
Cluster SIZE / SET timeout	preserved	preserved

Default mode (--require-parser) now completes on a real capture, where before it errored.

Tests

8 Rust unit tests (token-level rename, qualified refs, no substring/in-string corruption, query-literal redaction incl. numbers, cluster/SET preservation, parse-failure → null).
22 Python tests, incl. a skip-if-unbuilt end-to-end AST rename test and the config-literal verify exemption.
bin/fmt, clippy, ruff, test-attribute lint clean; no Cargo.lock version bumps (adds serde, already present).

Note for reviewers

This supersedes #36746's query-only redaction approach. The stack tells the story (heuristic → parser-literals → tests → subsource fix → require-parser → full AST); if preferred I can squash the SQL-rewriting PRs before merge.

🤖 Generated with Claude Code

Validating against a real production capture showed the regex identifier substitution corrupted ~12% of queries (65 of 565: 47 SELECT, 18 FETCH): it matched identifiers as substrings and inside string literals, producing SQL that no longer parsed. The raw SQL parsed fine; our own rewrite broke it.

Rework mz-sql-anonymize to do all SQL rewriting on the parsed AST:

- Rename identifiers as whole tokens via a VisitMut over the AST. This reaches object/cluster/type references (the `Raw` AstInfo associated types, whose generic visitors are no-ops) by overriding visit_item_name_mut and friends. No substring corruption, no in-string rewrites, no word-boundary or case guesswork. Mz identifiers are case-sensitive, so exact matching is correct.
- Redact query literals on the AST (visit_value_mut), covering numbers, hex strings, and intervals that the single-quoted-string regex missed.
- Preserve config literals (CREATE CLUSTER / CLUSTER REPLICA / ALTER CLUSTER and SET / RESET / SET TRANSACTION / ALTER SYSTEM) so replay keeps sizes and timeouts.

The helper now takes {mapping, rename_identifiers, redact_literals, statements} and returns the rewritten SQL per statement (or null if it does not parse). Python sends all cluster/DDL/query SQL through it, and:

- still scrubs DDL create_sql literals with the blanket regex, because option strings (broker addresses, hosts) are typed AST fields neither the visitor nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db, query routing fields) directly, since those are not SQL;
- falls back to the regex only when the binary is unavailable (gated by --require-parser) or a statement does not parse.

verify now also exempts preserved config-statement literals (matching the helper) so a kept `SET ... = '5s'` is not flagged.

Re-validated on the production capture: 0 of 623 anonymized queries fail to re-parse (was 65/565), 0 identifier leaks beyond the format key, numbers redacted, cluster sizes preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jasonhernandez · 2026-05-29T22:28:33Z

Superseded by #36803, which squashes this stack into a single PR against main.

jasonhernandez mentioned this pull request May 29, 2026

workload-replay: harden and AST-anonymize captured workloads #36803

Draft

jasonhernandez closed this May 29, 2026

jasonhernandez wants to merge 1 commit into workload-anonymize-require-parser from workload-anonymize-ast
