workload-replay: redact query literals with the Mz parser#36746
Draft
jasonhernandez wants to merge 2 commits into
Draft
workload-replay: redact query literals with the Mz parser#36746jasonhernandez wants to merge 2 commits into
jasonhernandez wants to merge 2 commits into
Conversation
The regex literal redaction only handles single-quoted strings, so numeric
literals (account numbers, SSNs, ids), dollar-quoted strings, and escape
strings in query predicates were emitted verbatim. Use Materialize's own
parser instead, which handles every literal form the dialect supports.
Add `mz-sql-anonymize`, a small CLI that reads a JSON array of SQL strings
on stdin and writes back each statement run through
`to_ast_string_redacted()` (or null when it does not parse). It depends only
on the standalone `mz-sql-parser` crate. The anonymizer locates the built
binary (via MZ_SQL_ANONYMIZE_BIN or target/{release,debug}), redacts all
query SQL in one batch, and falls back per-statement to the regex when the
binary is unavailable or a statement does not parse, printing a warning that
points at `cargo build --release -p mz-sql-anonymize`.
Scope: only query SQL goes through the parser. DDL create_sql keeps the
blanket regex, because `to_ast_string_redacted()` deliberately does not
redact DDL option strings (connection hosts/users, sink topics, source
options) — routing those through the parser would regress the connection and
sink leak fix from the parent commit. The verify pass accepts both the
parser's `'<REDACTED>'` and the regex's `'literal_N'` placeholders.
This addresses the "wrap the Mz parser" TODO for literals. Identifier
renaming still uses the regex: the parser alone cannot resolve which object a
bare name refers to (it has no catalog), so scoped renaming remains future
work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's check-rust-test-attributes.sh rejects plain #[test]. Add mz-ore as a dev-dependency with the 'test' feature (matching sql-parser's pattern) and switch the unit tests over. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on #36745 (base branch
workload-anonymize-hardening). Review/merge that one first.Motivation
The follow-up to #36745. That PR's regex literal redaction only catches single-quoted strings, so numeric literals in query predicates (
WHERE ssn = 123456789, account ids), dollar-quoted strings, and escape strings were emitted verbatim. This routes query SQL through Materialize's own parser, which redacts every literal form the dialect supports.What changed
src/sql-anonymize— a small CLI (mz-sql-anonymize) that reads a JSON array of SQL strings on stdin and writes back each statement run throughto_ast_string_redacted()(the same'<REDACTED>'placeholder the rest of Materialize uses to turn customer data into usage data), ornullif it doesn't parse. Depends only on the standalonemz-sql-parsercrate.MZ_SQL_ANONYMIZE_BIN, thentarget/{release,debug}), redacts all query SQL in one batch subprocess call, and falls back per-statement to the regex when the binary isn't built or a statement doesn't parse (with a warning pointing atcargo build --release -p mz-sql-anonymize). Zero-setup users still get the regex behavior.'<REDACTED>'(parser) and'literal_N'(regex fallback) placeholders.Key design decision (discovered during testing)
to_ast_string_redacted()intentionally does not redact DDL option strings — I confirmed thatCREATE CONNECTION ... (BROKER 'host', SASL USERNAME 'admin')comes back with'host'/'admin'intact, because the engine treats connection/sink/source config as usage data, not customer data. So:create_sql→ blanket regex (kept from workload-replay: harden workload anonymization #36745), because it must scrub connection hosts/users, sink topics, and source options that the parser leaves intact.Routing DDL through the parser would have regressed the connection/sink leak fix. (The verify pass would have caught it, but better to design it right.)
Why subprocess and not PyO3
I started toward PyO3 as requested, but found: zero PyO3 precedent in the repo, a deliberately "binary-wheels-only" Python venv (
ci/builder/requirements.txt), and no native-compile step inbin/pyactivate— so PyO3 would mean adding the first native build to the venv (maturin + pyactivate/CI changes). A subprocess CLI matches existing repo patterns (Rust binaries +spawn.py), needs no venv changes, and uses the same parser. Happy to revisit PyO3 if preferred.Known costs / follow-ups
mz-sql-parsernatively pulls its transitivemz-oredeps (axum, sentry, tonic, opentelemetry). Fine for an on-demand dev tool, but heavier than the WASM build (which dodges this via the wasm32 target). Could be trimmed by paringmz-orefeatures.mz-sql) or a running instance'sRENAME.Testing
cargo test -p mz-sql-anonymize(4 tests),cargo clippyclean,bin/fmtclean,Cargo.lockgains only the new package (no version bumps).id = 987654321is redacted to'<REDACTED>'(the regex left it exposed), connection/sink/default strings scrubbed via regex, cluster SIZE preserved, verify passes. Confirmed the no-binary fallback warns and degrades to regex.🤖 Generated with Claude Code