Skip to content

fix(query): six critical-path hardening fixes from query-path review#414

Merged
xe-nvdk merged 3 commits intomainfrom
review/query-path-criticals
Apr 28, 2026
Merged

fix(query): six critical-path hardening fixes from query-path review#414
xe-nvdk merged 3 commits intomainfrom
review/query-path-criticals

Conversation

@xe-nvdk
Copy link
Copy Markdown
Member

@xe-nvdk xe-nvdk commented Apr 28, 2026

Summary

A 4-agent staff/principal-engineer review of the query execution path (mirror of the ingest review that produced #413) surfaced 8 raw CRITICALs, 27 HIGH, 50 MEDIUM, 24 LOW findings — deduplicated to 6 unique CRITICALs, all addressed in this PR. The remaining HIGH/MEDIUM/LOW findings will be filed as separate issues in a follow-up.

The 6 Fixes

# File:Line Title Class
C1 query.go:506-541 Expanded SQL denylist + comment-strip + literal-mask normalisation Defense-in-depth on read-only API
C2 query.go (8 sites), query_arrow.go:52, internal/sql/mask.go x-arc-database header validation + universal read_parquet path quoting Path-literal escape
C3 query.go:506+ (read_parquet rejection), query.go:75 (CTE regex) Reject direct read_parquet() in user SQL + extend CTE name regex RBAC bypass closure
C4 query.go:1238+ (caller), query/parallel_executor.go:97+ Parallel-partition partial-failure now fails the whole request + launch-loop semaphore Silent data-loss + fan-out cap
C5 query_json_writer.go:80, query_arrow_json.go:169, query_arrow.go:128, query.go (3 callsites) streamTypedJSON/streamArrowJSON return (int, error); per-row/per-batch ctx check; registry Fail/TimedOut on partial-failure Stream-error observability
C6 query_arrow.go:147 Arrow IPC: defer→explicit batch.Release per iteration Memory pinning

Bench Delta (within 5% no-regression budget)

Workload: SELECT RegionID, COUNT(*), SUM(ResolutionWidth) FROM clickbench.hits GROUP BY RegionID ORDER BY 2 DESC LIMIT 100 on 99.9 M rows, 20 iterations per format.

Format Baseline (post-restart, pre-fix) After 6 fixes Delta
JSON avg 73.07 ms 74.44 ms +1.9%
JSON p95 75.86 ms 78.58 ms +3.6%
JSON p99 75.86 ms 78.58 ms +3.6%
Arrow avg 73.29 ms 73.35 ms +0.1%
Arrow p95 76.14 ms 78.36 ms +2.9%
Arrow p99 76.14 ms 78.36 ms +2.9%

JSON and Arrow paths both go through executeArrowJSONQuerystreamArrowJSON in the production build (-tags=duckdb_arrow), so the per-row ctx check in streamTypedJSON (only on the database/sql fallback) is NOT on the hot path. The most likely sources of the small delta: the expanded ValidateSQLRequest pipeline (mask-literals + strip-comments + larger denylist regex + new userSQLReadParquetPattern check) which now runs three regex passes per request instead of one, and the quotePath() helper allocations. All on the per-request path; well within the 5% budget.

Test Plan

  • go test ./internal/api/... ./internal/query/... ./internal/queryregistry/... ./internal/database/... (no -tags)
  • go test -tags=duckdb_arrow ./internal/api/... (Arrow code paths)
  • go build ./internal/... ./cmd/...
  • Smoke test against running Arc with each attack vector + a legit query:
    • C2: x-arc-database: evil', enable_external_access=true) UNION SELECT 1 FROM read_parquet('y → 400 invalid x-arc-database header: name contains invalid characters
    • C3: SELECT * FROM read_parquet('/etc/passwd') → 400 Direct read_parquet() calls are not allowed in user SQL
    • C1 (ATTACH): ATTACH 'http://attacker/db' AS evil → 400 Dangerous SQL operation not allowed
    • C1 (PRAGMA): PRAGMA enable_external_access=true → 400 Dangerous SQL operation not allowed
    • Legit: SELECT COUNT(*) FROM clickbench.hits99,997,497 rows, 1 ms
  • New regression tests:
    • TestDangerousSQLPatterns extended with 13 RCE-class keyword cases + 4 case variations (now goes through ValidateSQLRequest, not raw regex)
    • TestValidateSQLRequest_BypassesAndFalsePositives — comment-injection bypasses + literal/quoted-identifier false-positive tolerance + 6 read_parquet-direct-call rejections
    • TestValidateHeaderDatabase — 18 cases covering legitimate values + 11 SQLi/path-traversal vectors
    • TestQuotePath — 6 cases pinning the escape contract
    • TestExtractCTENames extended with 3 cases for parenthesized CTE column-list form
    • TestStreamTypedJSON_ReturnsErrorOnScanFailure and ..._OnCtxCancel for C5 contract

Review-Agent Reports

The 4 agent reports (correctness / security / code quality / perf) are retained in the conversation transcript that drove this PR; the remaining HIGH/MEDIUM/LOW findings (~80 items after dedup against #313/#309/#308/#307/#334/#332/#331/#330/#352) will be filed as individual GitHub issues after this PR opens.

🤖 Generated with Claude Code

A 4-agent staff/principal-engineer review of the query execution path
(mirror of the ingest review that produced PR #413) surfaced six
issues we want addressed before 26.05.1 GA. All are fixed here. Bench
delta is within the 5% no-regression budget on a 99.9M-row ClickBench
GROUP BY workload (JSON p95 75.86ms→78.58ms, Arrow p95 76.14ms→78.36ms).

C1 — Expanded SQL denylist + comment-strip + literal-mask normalisation.
The query API guardrail previously gated only DDL/DML keywords. It
now also gates ATTACH/DETACH/COPY/EXPORT/IMPORT DATABASE/PRAGMA/
SET <var>=/LOAD/INSTALL/CALL — session/extension/file-system ops that
don't belong in a read-only query path. ValidateSQLRequest now strips
comments and masks string literals before the regex check, so
`DROP /* */ TABLE x` cannot interleave comments past token boundaries
and `SELECT 'DROP TABLE x'` is not falsely rejected. Removed the dead
(h *QueryHandler).validateSQL method (code-quality finding from the
same review). Test matrix covers every blocked keyword + comment-
injection + quoted-identifier false-positive cases.

C2 — x-arc-database header validation + universal read_parquet path
quoting. The header value lands inside read_parquet('<base>/<db>/...')
storage paths Arc generates internally; in some edge cases a header
containing a single quote or shell-active char could break out of the
literal. New validateHeaderDatabase helper validates at every entry
point (executeQuery, executeQueryArrow, estimateQuery). New quotePath()
helper routes every read_parquet('PATH', ...) interpolation site
through sqlutil.EscapeStringLiteral (added to internal/sql/mask.go as
a single source of truth). Eight read_parquet sites in query.go
converted. Tests cover SQLi vectors (quote, NUL, newline, comma,
backslash, path-traversal) and boundary conditions.

C3 — Reject direct read_parquet() in user SQL + extend CTE name regex.
Arc's transformation layer is the only legitimate source of
read_parquet — in some edge cases, a user query containing
read_parquet directly produced zero extracted table references and
bypassed the (database, measurement) RBAC pair-check. ValidateSQLRequest
now rejects user SQL containing read_parquet(. Companion fix: the
CTE-name extraction regex was extended to recognise the parenthesized
column-list form (WITH foo(c1, c2) AS (...)), so a CTE name doesn't
leak into the table-reference list. Tests cover read_parquet in JOIN,
subqueries, CTEs, plus the new CTE column-list parse.

C4 — Parallel-partition partial-failure now fails the whole request.
When the parallel executor fans out a query across N partitions and
one or more partition queries error, NewMergedRowIterator previously
returned the surviving partitions' rows as a 200/success — a fraction
of the result silently dropped. The handler now inspects per-partition
Error after ExecutePartitioned and fails the request with HTTP 500 on
any partition error. Companion fix in parallel_executor.go: the
goroutine fan-out semaphore is acquired in the launch loop instead of
inside each spawned goroutine, so a 10K-path query bounds in-flight
goroutines at MaxConcurrentPartitions (default 4) instead of spawning
10K goroutines parked on the semaphore.

C5 — Streaming-response error semantics. streamTypedJSON and
streamArrowJSON previously returned only the row count; in some edge
cases (Scan failures mid-stream, deferred *sql.Rows.Err(), context
cancellation after the response envelope was already flushed) the
loop silently `continue`'d and the caller marked the query
Complete(rowCount). Both functions now return (int, error), perform a
ctx.Err() check at every row/batch boundary, and check the iterator's
deferred error after the loop. Callers route any error to
IncQueryErrors, registry Fail() (or TimedOut() on
context.DeadlineExceeded), and an Error log line. The HTTP status
cannot be changed retroactively (headers already flushed) but
operator-side observability is now correct. The Arrow IPC stream loop
in executeQueryArrow adopts the same per-batch ctx-check pattern.

C6 — Arrow IPC streaming memory pinned by deferred Release. The
executeQueryArrow IPC loop used `defer batch.Release()` inside the
for-reader.Next() body. Defers accumulate on the closure stack until
the closure exits — for a 10M-row result with 10K-row batches,
1,000 deferred Release calls held all casted Arrow records alive
until the entire stream completed. Releases each casted batch
explicitly after ipcWriter.Write, restoring constant per-batch
memory. The reader-owned input batch needs no Release; reader.Next()
releases the prior record automatically.

Build clean; full ./internal/api/... ./internal/query/...
./internal/queryregistry/... ./internal/database/... test suites
pass. Smoke test against running Arc verified all 5 attack vectors
are rejected (header SQLi, direct read_parquet, ATTACH RCE, PRAGMA
hardening) and legitimate queries still work.
@xe-nvdk
Copy link
Copy Markdown
Member Author

xe-nvdk commented Apr 28, 2026

/gemini review

@gemini-code-assist this PR addresses six critical-path findings from a 4-agent staff/principal-engineer review of Arc's query execution path — companion to the ingest-path review that landed as #413.

Please pay particular attention to:

  1. SQL denylist normalisation pipeline (internal/api/query.go ValidateSQLRequest ~1465+ and dangerousSQLPattern ~506+). Comment-strip + literal-mask runs before the regex; verify there's no normalisation order that lets a keyword slip past, especially in interaction with the new userSQLReadParquetPattern check.
  2. x-arc-database header validation + quotePath() escape symmetry (internal/api/query.go validateHeaderDatabase + quotePath ~1750+, internal/sql/mask.go EscapeStringLiteral). Eight read_parquet interpolation sites in query.go now route through quotePath. Look for any site I missed, especially under buildMultiTierReadParquet and buildReadParquetExprForParallel.
  3. Stream-error semantics (streamTypedJSON / streamArrowJSON returning (int, error)). The HTTP envelope is closed even on error so the client gets valid JSON; the registry/metrics/log path differentiates Complete vs Fail vs TimedOut. Check the three callsites in query.go (parallel + standard + queryMeasurement) and the Arrow JSON callsite in query_arrow_json.go for divergence.
  4. Parallel-partition error aggregation (query.go after ExecutePartitioned). Any errored partition fails the whole request with HTTP 500. Companion launch-loop semaphore in parallel_executor.go:101+ bounds in-flight goroutines at MaxConcurrentPartitions.
  5. Arrow IPC defer→explicit Release (query_arrow.go:147+). The fix tracks the casted record via a separate castedBatch variable so the reader-owned input record is never accidentally released; reader.Next() releases the prior one automatically.
  6. CTE name regex (patternCTENames ~75). Now handles WITH foo(c1, c2) AS (...) form. Three new test cases pin this; verify the (?:\s*\([^)]*\))? non-greedy column-list match doesn't introduce a backtracking pathology.

Bench delta on 99.9M-row ClickBench GROUP BY: JSON p95 +3.6%, Arrow p95 +2.9% — within the 5% budget. Smoke test against running Arc verified all four attack vectors (header SQLi, direct read_parquet, ATTACH RCE, PRAGMA hardening) reject correctly while legitimate queries return successfully.

Looking for any findings.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements significant hardening of the query execution path, addressing security vulnerabilities and correctness issues identified during a staff-level review. Key changes include expanding the SQL denylist to block RCE-class operations, introducing robust input validation for the x-arc-database header, and ensuring safe path interpolation via a new quotePath helper. Additionally, the PR improves error handling for streaming responses and parallel partition execution to prevent silent data loss, and optimizes memory usage in Arrow IPC streaming. I have provided feedback on further tightening the SQL denylist regex and ensuring consistent context usage for streaming responses.

Comment thread internal/api/query.go Outdated
Comment thread internal/api/query.go Outdated
Comment thread internal/api/query.go Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements critical security hardening and correctness fixes for the query execution path. Key changes include expanding the SQL denylist with normalization (comment stripping and literal masking), validating the x-arc-database header, and preventing direct read_parquet calls in user SQL to enforce RBAC. Additionally, the PR ensures partial failures in streaming and parallel execution are surfaced as errors rather than silent partial results, and it fixes a memory leak in Arrow IPC streaming by releasing batches explicitly. Review feedback suggests using more idiomatic select blocks for context cancellation checks within streaming loops and simplifying error collection logic.

Comment thread internal/api/query.go
Comment thread internal/api/query_arrow.go Outdated
Comment thread internal/api/query_json_writer.go Outdated
…elect

Addresses 5 NEW gemini findings on PR #414 commit 244849b. One MEDIUM
(G4 — simplify error-collection loop) was a no-op suggestion identical
to the existing code; skipped.

CRITICAL/security-critical (gemini G1) — SET regex bypassable:
The previous \bSET\s+(?:GLOBAL\s+|...)?\w+\s*= form required an
equals sign and a word after the optional scope keyword. DuckDB
also accepts:
  - SET enable_external_access TO true (TO instead of =)
  - SET VARIABLE x = 1            (VARIABLE keyword)
  - RESET enable_external_access  (mutates session state)
All three slipped past the regex. Replaced the multi-form pattern
with bare \bSET\b and added \bRESET\b. The query API is read-only
so any session-state mutation is forbidden — bare-keyword match is
correct here. New regression tests for all three bypass shapes.

CRITICAL/security-critical (gemini G2) — CALL regex bypassable:
\bCALL\s+\w+ required whitespace before the procedure name, but
DuckDB accepts CALL(proc_name) with no space. Replaced with bare
\bCALL\b. Regression test for the no-space form.

MEDIUM (gemini G3) — queryMeasurement streamCtx:
Switched from context.Background() to c.UserContext() so client
disconnects propagate to per-row cancellation. Fasthttp keeps
c.Context() alive across the SetBodyStreamWriter boundary so it's
safe inside the async stream callback (consistent with the
executeQuery and executeQueryArrow patterns).

MEDIUM (gemini G5+G6) — idiomatic select for ctx cancellation:
Replaced 'if ctx.Err() != nil' with 'select { case <-ctx.Done() }'
+ default at three call sites — streamTypedJSON, streamArrowJSON,
and the Arrow IPC stream loop in executeQueryArrow. Required
labeled-break (break scanLoop / batchLoop / streamLoop) because
plain break inside select-in-for would only break the select.
Same semantics, more idiomatic Go.

Bench: two runs on warm state show JSON avg 72.64-73.85ms, p95
74.34-81.31ms; Arrow avg 72.30-73.66ms, p95 74.30-75.50ms — well
within run-to-run jitter of baseline (73ms avg / 76ms p95). The
"+3.6% JSON p95" delta from the initial commit appears to have
been first-restart cold-cache noise, not a real regression.

Build clean (default + -tags=duckdb_arrow); full test sweep across
internal/api/... internal/query/... internal/queryregistry/...
internal/database/... passes.
@xe-nvdk
Copy link
Copy Markdown
Member Author

xe-nvdk commented Apr 28, 2026

/gemini review

@gemini-code-assist round 2 fixes pushed in commit 892d24c. All 5 actionable findings addressed; G4 (simplify error-collection loop) was a no-op suggestion — your snippet is identical to the existing code, so I skipped it.

🚨 G1 (CRITICAL) — SET regex bypassable:

  • ✅ Replaced \bSET\s+(?:GLOBAL\s+|...)?\w+\s*= with bare \bSET\b + new \bRESET\b. The query API is read-only so any session-state mutation is forbidden — bare-keyword match is correct.
  • New regression tests for SET ... TO, SET VARIABLE x = 1, RESET <var>. All pass.

🚨 G2 (CRITICAL) — CALL regex bypassable:

  • ✅ Replaced \bCALL\s+\w+ with bare \bCALL\b. Regression test for CALL(proc) with no whitespace before paren.

G3 (MEDIUM) — queryMeasurement ctx:

  • streamCtx := c.UserContext() instead of context.Background() — client disconnects now propagate. Consistent with executeQuery/executeQueryArrow.

G5+G6 (MEDIUM) — idiomatic select for ctx:

  • ✅ All three sites switched: streamTypedJSON, streamArrowJSON, Arrow IPC loop in executeQueryArrow. Used labeled-break (break scanLoop/batchLoop/streamLoop) because plain break inside select-in-for only breaks the select.

Bench: two runs on warm state show JSON avg 72.6-73.9ms / p95 74.3-81.3ms; Arrow avg 72.3-73.7ms / p95 74.3-75.5ms — well within run-to-run jitter of baseline (73ms avg / 76ms p95). The "+3.6% JSON p95" delta from the initial commit appears to have been first-restart cold-cache noise.

Looking for any final findings.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements comprehensive security hardening and correctness fixes for the query execution path. Key improvements include an expanded SQL denylist that blocks session and file-system operations, normalized SQL validation that strips comments and masks literals, and strict validation of the x-arc-database header to prevent path injection. The changes also ensure that partial failures in parallel and streaming queries are correctly surfaced as errors rather than silent successes. Furthermore, a memory leak in Arrow IPC streaming was resolved by explicitly releasing record batches, and the parallel executor was optimized to bound goroutine creation. I have no feedback to provide.

The C1 denylist bullet listed SET <var>=/CALL (the original narrower
regex form). After gemini round 1 found two bypass shapes (SET TO,
SET VARIABLE, RESET, CALL with no whitespace before paren), the
regex simplified to bare-keyword anchors plus added RESET. Update
the release notes to match.
@xe-nvdk xe-nvdk merged commit 947d2b7 into main Apr 28, 2026
5 checks passed
@xe-nvdk xe-nvdk deleted the review/query-path-criticals branch April 28, 2026 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant