workload-replay: redact query literals with the Mz parser #36746

Draft
jasonhernandez wants to merge 2 commits into workload-anonymize-hardening from workload-anonymize-parser

jasonhernandez commented May 27, 2026

Stacked on #36745 (base branch workload-anonymize-hardening). Review/merge that one first.

Motivation

The follow-up to #36745. That PR's regex literal redaction only catches single-quoted strings, so numeric literals in query predicates (WHERE ssn = 123456789, account ids), dollar-quoted strings, and escape strings were emitted verbatim. This routes query SQL through Materialize's own parser, which redacts every literal form the dialect supports.
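To make the gap concrete, here is a minimal sketch of a single-quote-only redaction regex. The pattern, the `regex_redact` name, and the per-statement numbering are assumptions for illustration; the actual regex from #36745 is not shown in this PR.

```python
import re

# Hypothetical stand-in for the regex in #36745: matches single-quoted
# strings, treating '' as an escaped quote inside the string.
SINGLE_QUOTED = re.compile(r"'(?:[^']|'')*'")


def regex_redact(sql: str) -> str:
    """Replace each single-quoted string with a numbered 'literal_N' placeholder."""
    counter = 0

    def repl(_match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        return f"'literal_{counter}'"

    return SINGLE_QUOTED.sub(repl, sql)


# The quoted string is replaced, but the numeric literal passes through.
print(regex_redact("SELECT * FROM t WHERE name = 'alice' AND ssn = 123456789"))
# A dollar-quoted string contains no single quotes, so nothing is redacted.
print(regex_redact("SELECT $$secret$$"))
```

Under these assumptions the first statement keeps `ssn = 123456789` and the second is returned unchanged, which is the leak this PR closes for query SQL.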

What changed

  • New crate src/sql-anonymize — a small CLI (mz-sql-anonymize) that reads a JSON array of SQL strings on stdin and writes back each statement run through to_ast_string_redacted() (the same '<REDACTED>' placeholder the rest of Materialize uses to turn customer data into usage data), or null if it doesn't parse. Depends only on the standalone mz-sql-parser crate.
  • Python integration — the anonymizer locates the built binary (MZ_SQL_ANONYMIZE_BIN, then target/{release,debug}), redacts all query SQL in one batch subprocess call, and falls back per-statement to the regex when the binary isn't built or a statement doesn't parse (with a warning pointing at cargo build --release -p mz-sql-anonymize). Zero-setup users still get the regex behavior.
  • README + verify updated; the verify pass accepts both '<REDACTED>' (parser) and 'literal_N' (regex fallback) placeholders.
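The Python side described above could look roughly like the following sketch. It is not the code in this PR: the function names, the `fallback` callable, the warning text, and the assumption that the CLI prints a JSON array (strings or null) on stdout are all illustrative.

```python
import json
import os
import subprocess
import sys
from pathlib import Path
from typing import Callable, List, Optional, Sequence


def find_binary(repo_root: Path) -> Optional[Path]:
    """Prefer MZ_SQL_ANONYMIZE_BIN, then target/release, then target/debug."""
    candidates = []
    override = os.environ.get("MZ_SQL_ANONYMIZE_BIN")
    if override:
        candidates.append(Path(override))
    for profile in ("release", "debug"):
        candidates.append(repo_root / "target" / profile / "mz-sql-anonymize")
    return next((c for c in candidates if c.is_file()), None)


def redact_batch(
    statements: List[str],
    cmd: Optional[Sequence[str]],
    fallback: Callable[[str], str],
) -> List[str]:
    """Redact all statements in one subprocess call; fall back per statement."""
    results: List[Optional[str]] = [None] * len(statements)
    if cmd is None:
        print(
            "warning: mz-sql-anonymize not found, using regex fallback; "
            "build it with: cargo build --release -p mz-sql-anonymize",
            file=sys.stderr,
        )
    else:
        try:
            proc = subprocess.run(
                list(cmd),
                input=json.dumps(statements),
                capture_output=True,
                text=True,
                check=True,
            )
            parsed = json.loads(proc.stdout)
            # Only trust the output if it lines up one-to-one with the input.
            if isinstance(parsed, list) and len(parsed) == len(statements):
                results = parsed
        except (OSError, subprocess.CalledProcessError, json.JSONDecodeError):
            pass  # Treat any failure as "nothing parsed" and fall back.
    # A null entry means the statement did not parse: use the fallback for it.
    return [r if r is not None else fallback(s) for s, r in zip(statements, results)]
```

A caller would pass `[str(path)]` as `cmd` when `find_binary` returns a path and `None` otherwise, with the existing regex redaction as `fallback`.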

Key design decision (discovered during testing)

to_ast_string_redacted() intentionally does not redact DDL option strings — I confirmed that CREATE CONNECTION ... (BROKER 'host', SASL USERNAME 'admin') comes back with 'host'/'admin' intact, because the engine treats connection/sink/source config as usage data, not customer data. So:

  • Query SQL → parser (comprehensive; catches numbers + all string forms).
  • DDL create_sql → blanket regex (kept from #36745), because it must scrub connection hosts/users, sink topics, and source options that the parser leaves intact.

Routing DDL through the parser would have regressed the connection/sink leak fix. (The verify pass would have caught it, but better to design it right.)
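The routing rule can be sketched as below. This is an illustration only: `blanket_regex_redact`, `anonymize`, the `is_ddl` flag, and the `parser_redact` callable (standing in for the subprocess call, returning `None` on a parse failure) are hypothetical names, and the regex is a stand-in for the one in #36745.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in for the blanket regex kept from #36745.
SINGLE_QUOTED = re.compile(r"'(?:[^']|'')*'")


def blanket_regex_redact(sql: str) -> str:
    """Replace every single-quoted string, including DDL option values."""
    counter = 0

    def repl(_match: "re.Match[str]") -> str:
        nonlocal counter
        counter += 1
        return f"'literal_{counter}'"

    return SINGLE_QUOTED.sub(repl, sql)


def anonymize(
    sql: str,
    is_ddl: bool,
    parser_redact: Callable[[str], Optional[str]],
) -> str:
    """Route DDL to the regex and query SQL to the parser (regex on failure)."""
    if is_ddl:
        # The parser would leave option strings such as BROKER 'host' intact.
        return blanket_regex_redact(sql)
    redacted = parser_redact(sql)
    return redacted if redacted is not None else blanket_regex_redact(sql)


ddl = "CREATE CONNECTION kafka_conn TO KAFKA (BROKER 'host', SASL USERNAME 'admin')"
# The lambda mimics the parser leaving DDL option strings untouched.
print(anonymize(ddl, is_ddl=True, parser_redact=lambda s: s))
```

With this sketch the DDL comes back with `'literal_1'` and `'literal_2'` in place of the broker and username, regardless of what the parser would have returned.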

Why subprocess and not PyO3

I started toward PyO3 as requested, but found: zero PyO3 precedent in the repo, a deliberately "binary-wheels-only" Python venv (ci/builder/requirements.txt), and no native-compile step in bin/pyactivate — so PyO3 would mean adding the first native build to the venv (maturin + pyactivate/CI changes). A subprocess CLI matches existing repo patterns (Rust binaries + spawn.py), needs no venv changes, and uses the same parser. Happy to revisit PyO3 if preferred.

Known costs / follow-ups

  • Dependency weight: linking mz-sql-parser natively pulls its transitive mz-ore deps (axum, sentry, tonic, opentelemetry). Fine for an on-demand dev tool, but heavier than the WASM build (which dodges this via the wasm32 target). Could be trimmed by paring mz-ore features.
  • Identifiers still use the regex. The parser alone can't resolve which object a bare name refers to (no catalog), so the scoped-rename / mapping-collapse issue from #36745 remains future work; it needs the planner (mz-sql) or a running instance's RENAME.
  • Fallback gap: without the binary, numeric literals in queries still leak (regex limitation) — surfaced via the warning, not the verify pass.

Testing

  • cargo test -p mz-sql-anonymize (4 tests), cargo clippy clean, bin/fmt clean, Cargo.lock gains only the new package (no version bumps).
  • End-to-end on a synthetic capture: confirmed a query's id = 987654321 is redacted to '<REDACTED>' (the regex left it exposed), connection/sink/default strings scrubbed via regex, cluster SIZE preserved, verify passes. Confirmed the no-binary fallback warns and degrades to regex.
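As a rough picture of what "verify accepts both placeholders" means, the sketch below flags any single-quoted string that is neither `'<REDACTED>'` nor `'literal_N'`. The function name and regexes are assumptions, not the verify pass in this PR, and the sketch does not model any allow-listing such as the preserved cluster SIZE.

```python
import re
from typing import List

SINGLE_QUOTED = re.compile(r"'(?:[^']|'')*'")
# Accept the parser placeholder and the regex fallback placeholder.
PLACEHOLDER = re.compile(r"'(?:<REDACTED>|literal_\d+)'")


def unredacted_literals(sql: str) -> List[str]:
    """Return single-quoted strings that are not a known placeholder."""
    return [
        m.group(0)
        for m in SINGLE_QUOTED.finditer(sql)
        if not PLACEHOLDER.fullmatch(m.group(0))
    ]


print(unredacted_literals("SELECT '<REDACTED>', 'literal_3', 'alice'"))
# A bare number is invisible to a check like this one.
print(unredacted_literals("SELECT * FROM t WHERE id = 987654321"))
```

A check of this shape only sees quoted strings, which is consistent with the fallback gap noted above: a numeric literal that the regex fallback leaves in place would not be reported.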

🤖 Generated with Claude Code

The regex literal redaction only handles single-quoted strings, so numeric
literals (account numbers, SSNs, ids), dollar-quoted strings, and escape
strings in query predicates were emitted verbatim. Use Materialize's own
parser instead, which handles every literal form the dialect supports.

Add `mz-sql-anonymize`, a small CLI that reads a JSON array of SQL strings
on stdin and writes back each statement run through
`to_ast_string_redacted()` (or null when it does not parse). It depends only
on the standalone `mz-sql-parser` crate. The anonymizer locates the built
binary (via MZ_SQL_ANONYMIZE_BIN or target/{release,debug}), redacts all
query SQL in one batch, and falls back per-statement to the regex when the
binary is unavailable or a statement does not parse, printing a warning that
points at `cargo build --release -p mz-sql-anonymize`.

Scope: only query SQL goes through the parser. DDL create_sql keeps the
blanket regex, because `to_ast_string_redacted()` deliberately does not
redact DDL option strings (connection hosts/users, sink topics, source
options) — routing those through the parser would regress the connection and
sink leak fix from the parent commit. The verify pass accepts both the
parser's `'<REDACTED>'` and the regex's `'literal_N'` placeholders.

This addresses the "wrap the Mz parser" TODO for literals. Identifier
renaming still uses the regex: the parser alone cannot resolve which object a
bare name refers to (it has no catalog), so scoped renaming remains future
work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI's check-rust-test-attributes.sh rejects plain #[test]. Add mz-ore as a
dev-dependency with the 'test' feature (matching sql-parser's pattern) and
switch the unit tests over.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>