workload-replay: anonymize SQL on the AST, not by regex#36801
Closed
jasonhernandez wants to merge 1 commit into
Closed
workload-replay: anonymize SQL on the AST, not by regex#36801jasonhernandez wants to merge 1 commit into
jasonhernandez wants to merge 1 commit into
Conversation
Validating against a real production capture showed the regex identifier
substitution corrupted ~12% of queries (65 of 565: 47 SELECT, 18 FETCH) —
it matched identifiers as substrings and inside string literals, producing
SQL that no longer parsed. The raw SQL parsed fine; our own rewrite broke it.
Rework mz-sql-anonymize to do all SQL rewriting on the parsed AST:
- Rename identifiers as whole tokens via a VisitMut over the AST. This reaches
object/cluster/type references (the `Raw` AstInfo associated types, whose
generic visitors are no-ops) by overriding visit_item_name_mut and friends.
No substring corruption, no in-string rewrites, no word-boundary or case
guesswork. Mz identifiers are case-sensitive, so exact matching is correct.
- Redact query literals on the AST (visit_value_mut), covering numbers, hex
strings, and intervals that the single-quoted-string regex missed.
- Preserve config literals (CREATE CLUSTER / CLUSTER REPLICA / ALTER CLUSTER
and SET / RESET / SET TRANSACTION / ALTER SYSTEM) so replay keeps sizes and
timeouts.
The helper now takes {mapping, rename_identifiers, redact_literals,
statements} and returns the rewritten SQL per statement (or null if it does
not parse). Python sends all cluster/DDL/query SQL through it, and:
- still scrubs DDL create_sql literals with the blanket regex, because option
strings (broker addresses, hosts) are typed AST fields neither the visitor
nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db,
query routing fields) directly, since those are not SQL;
- falls back to the regex only when the binary is unavailable (gated by
--require-parser) or a statement does not parse.
verify now also exempts preserved config-statement literals (matching the
helper) so a kept `SET ... = '5s'` is not flagged.
Re-validated on the production capture: 0 of 623 anonymized queries fail to
re-parse (was 65/565), 0 identifier leaks beyond the format key, numbers
redacted, cluster sizes preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
Superseded by #36803, which squashes this stack into a single PR against main. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sixth in the stack — base
workload-anonymize-require-parser(#36799). The big one: replaces text-regex SQL rewriting with AST-based rewriting using Materialize's own parser.Why
Running the tool against a real production capture (565 queries) showed the regex identifier substitution corrupted ~12% of queries — 65 failures (47
SELECT, 18FETCH). The raw SQL parsed 100%; our own regex broke it by matching identifiers as substrings and inside string literals. The strict--require-parsermode then (correctly) refused to emit the broken output, so the tool couldn't complete on a real capture at all.Text substitution was never the right tool — and we already shell out to the real parser. So do all rewriting on the AST.
What changed
mz-sql-anonymize(Rust) now parses each statement and rewrites the AST viaVisitMut:Raw'sAstInfoassociated types (visit_item_name_mut,visit_cluster_name_mut,visit_data_type_mut, …), whose generic defaults are no-ops. No substring corruption, no in-string rewrites, no word-boundary/case guesswork. (Mz identifiers are case-sensitive → exact match is correct.)CREATE CLUSTER/CLUSTER REPLICA/ALTER CLUSTERandSET/RESET/SET TRANSACTION/ALTER SYSTEMkeep their literals (sizes, timeouts) — replay needs them.Protocol:
{mapping, rename_identifiers, redact_literals, statements}→ rewritten SQL per statement (ornullif unparseable).Python sends all cluster/DDL/query SQL through the helper, and:
create_sqlliterals with the blanket regex, because option strings (broker addresses, hosts) are typed AST fields the visitor — and the engine's own redacted Display — don't treat as redactable literals;--require-parser) or a statement doesn't parse.verifynow also exempts preserved config-statement literals (matching the helper), so a keptSET … = '5s'isn't flagged.Validation (same production capture, re-captured)
transaction_idformat key)Default mode (
--require-parser) now completes on a real capture, where before it errored.Tests
bin/fmt,clippy,ruff, test-attribute lint clean; noCargo.lockversion bumps (addsserde, already present).Note for reviewers
This supersedes #36746's query-only redaction approach. The stack tells the story (heuristic → parser-literals → tests → subsource fix → require-parser → full AST); if preferred I can squash the SQL-rewriting PRs before merge.
🤖 Generated with Claude Code