
Add query timeout to prevent indefinite hangs on S3 disconnection#152

Merged
xe-nvdk merged 5 commits into Basekick-Labs:main from khalid244:feature/query-timeout
Jan 25, 2026
Conversation

@khalid244

Problem

When S3 becomes unavailable during query execution, queries hang indefinitely (120+ seconds) waiting for DuckDB's internal HTTP timeout. This causes:

  • Poor user experience with unresponsive queries
  • Connection pool exhaustion under load
  • No clear error feedback to clients

Solution

Add query timeout support that uses the existing server.read_timeout configuration value. When the timeout is exceeded, queries return HTTP 504 with a clear "Query timed out" error message.

Changes

internal/database/duckdb.go

  • Added QueryContext method that accepts a context for timeout/cancellation support

internal/api/query.go

  • Added queryTimeout field to QueryHandler struct
  • Updated NewQueryHandler to accept timeout parameter
  • Modified query execution to use QueryContext with timeout
  • Added timeout detection that returns HTTP 504 Gateway Timeout

cmd/arc/main.go

  • Pass cfg.Server.ReadTimeout to query handler
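Taken together, a minimal sketch of what the QueryContext wrapper and the handler wiring described above could look like (method bodies, field layout, and extra constructor arguments here are assumptions, not the exact diff):

package sketch

import (
    "context"
    "database/sql"
    "time"
)

// DuckDB wraps the pooled database/sql handle (stand-in for internal/database/duckdb.go).
type DuckDB struct {
    db *sql.DB
}

// QueryContext forwards the caller's context so a deadline or cancellation
// set by the handler is visible to the driver.
func (d *DuckDB) QueryContext(ctx context.Context, query string) (*sql.Rows, error) {
    return d.db.QueryContext(ctx, query)
}

// QueryHandler carries the timeout taken from server.read_timeout in this
// revision of the PR (stand-in for internal/api/query.go).
type QueryHandler struct {
    db           *DuckDB
    queryTimeout time.Duration
}

// NewQueryHandler accepts the timeout so cmd/arc/main.go can pass
// cfg.Server.ReadTimeout (other constructor arguments omitted).
func NewQueryHandler(db *DuckDB, queryTimeout time.Duration) *QueryHandler {
    return &QueryHandler{db: db, queryTimeout: queryTimeout}
}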

Configuration

Query timeout is controlled by the existing server config:

[server]
read_timeout = "30s"  # Also used as query timeout

Behavior

| Scenario | Before | After |
| --- | --- | --- |
| S3 unavailable | Hangs 120+ seconds | Returns 504 after read_timeout |
| Error message | "Query execution failed" | "Query timed out" |
| HTTP status | 500 | 504 |

Limitations

The context timeout detects when the deadline has passed but cannot forcibly cancel DuckDB's internal HTTP operations. The query will still wait for DuckDB's internal timeout, but clients receive the correct error type and status code.
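In practice that means the 504 mapping only happens once the blocking call returns; a fragment-level sketch of the detection inside the Fiber handler (context and fiber imports assumed, names match the snippets later in this thread):

ctx, cancel := context.WithTimeout(c.UserContext(), h.queryTimeout)
defer cancel()

// Blocks until DuckDB returns; the expired deadline is not pushed down
// into DuckDB's internal HTTP operations.
rows, err := h.db.QueryContext(ctx, convertedSQL)
if err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        // Deadline passed while the query was running: report 504, not 500.
        return c.Status(fiber.StatusGatewayTimeout).JSON(fiber.Map{"error": "Query timed out"})
    }
    return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{"error": "Query execution failed"})
}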

Testing

  1. Deploy with read_timeout = "30s"
  2. Block S3 egress (NetworkPolicy or firewall)
  3. Execute a query; it should return HTTP 504 with "Query timed out"

Member

@xe-nvdk xe-nvdk left a comment


Thanks for the PR! You've identified a real issue - queries can hang when S3 is unavailable. However, there are several concerns with this implementation:

1. Config Confusion

read_timeout is Fiber's HTTP read timeout - it controls how long the server waits to read the request body from the client. It does NOT control query execution time.

These are different concerns:

  • read_timeout = HTTP layer (request reading)
  • query_timeout = Application layer (query execution)

A query can run for 2 minutes even with read_timeout=30s because the request was already fully received. Reusing this config is misleading.

Suggestion: Add a dedicated query_timeout config instead of reusing read_timeout.

2. Incomplete Coverage

The timeout is only applied to the non-profiled query path:

if profileMode {
    rows, profile, err = h.db.QueryWithProfile(convertedSQL)  // ❌ No timeout
} else {
    rows, err = h.db.QueryContext(ctx, convertedSQL)  // ✅ Has timeout
}

Profiled queries will still hang indefinitely.

3. Doesn't Actually Cancel DuckDB

As you noted in "Limitations", the context timeout doesn't stop DuckDB's internal HTTP operations. So:

  • Client gets 504 after 30s ✅
  • But the DuckDB query keeps running in the background until its 120s timeout ❌
  • This wastes server resources and could cause connection pool issues

4. We Already Have Query Timeout

DuckDB itself has timeout settings that actually cancel queries:

SET http_timeout=30000;  -- 30 seconds

This is already configurable and actually stops the query.
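For reference, a setting like that could be applied when each pooled connection is set up — a sketch assuming a plain database/sql handle:

// Value is milliseconds, matching the example above (30000 = 30 seconds).
if _, err := db.Exec("SET http_timeout=30000"); err != nil {
    log.Printf("failed to set http_timeout: %v", err)
}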

Recommendation

If you want to improve timeout handling:

  1. Add a dedicated [server] query_timeout config (don't reuse read_timeout)
  2. Apply timeout to ALL query paths (including profiled queries)
  3. Consider using DuckDB's native http_timeout setting which actually cancels operations

I'd suggest closing this PR and opening an issue to discuss the best approach for query timeouts.

Member

@xe-nvdk xe-nvdk left a comment


Correction to my previous review:

I was wrong about us already having query timeout - we don't. The ReadTimeout is only for HTTP request reading, not query execution. So this PR is adding something genuinely new and useful.

Updated Review

The core idea is valid - we need query execution timeouts. However, I still have concerns:

1. Config Naming

Reusing read_timeout for query timeout is confusing since they're different concepts. Consider adding a separate query_timeout config, or at minimum document clearly that read_timeout is being reused for both purposes.

2. Incomplete Coverage

Profiled queries still don't get the timeout:

if profileMode {
    rows, profile, err = h.db.QueryWithProfile(convertedSQL)  // ❌ No timeout
} else {
    rows, err = h.db.QueryContext(ctx, convertedSQL)  // ✅ Has timeout
}

3. Context Cancellation Limitation

As you noted, the context doesn't actually cancel DuckDB's HTTP operations - it just detects the timeout. This is a known Go/CGO limitation. Worth documenting but not a blocker.


If you address #1 (separate config or clear docs) and #2 (cover profiled queries), I'd be happy to approve.

@xe-nvdk
Member

xe-nvdk commented Jan 25, 2026

Python Implementation Reference

I checked the Python legacy implementation (python-legacy branch) and found that query timeout does exist there:

File: api/duckdb_pool.py

async def execute_async(
    self,
    sql: str,
    priority: QueryPriority = QueryPriority.NORMAL,
    timeout: float = 300.0  # 5 minute default
) -> Dict[str, Any]:

Key differences from your PR:

| Aspect | Python | Your PR |
| --- | --- | --- |
| Default timeout | 300s (5 minutes) | 30s (via read_timeout) |
| Implementation | Connection pool level | Query handler level |
| Metrics | total_queries_timeout counter | None |
| Per-query control | Yes (timeout param) | No (global only) |

Recommendations based on Python implementation:

  1. Use 300s default (or make it configurable) - 30s may be too aggressive for complex queries
  2. Add a dedicated config like query_timeout instead of reusing read_timeout
  3. Add timeout metrics - Python tracks total_queries_timeout which is useful for monitoring
  4. Consider per-query timeout - Python allows different timeouts per query based on priority

This PR is essentially porting a missing feature from Python to Go, which is valuable. The implementation just needs some refinement to match what we had before.
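For point 4, a purely hypothetical sketch of what a per-request override could look like on the Go side (the X-Query-Timeout header is an invented example, not part of this PR):

// Hypothetical per-request override, clamped by the global query timeout.
timeout := h.queryTimeout
if v := c.Get("X-Query-Timeout"); v != "" {
    if n, err := strconv.Atoi(v); err == nil && n > 0 {
        if d := time.Duration(n) * time.Second; d < timeout {
            timeout = d
        }
    }
}
ctx, cancel := context.WithTimeout(c.UserContext(), timeout)
defer cancel()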

@khalid244
Author

Thanks for the detailed review! I've addressed all your feedback:

1. Dedicated query.timeout config ✅

  • Added new query.timeout config (default: 300 seconds / 5 minutes)
  • No longer reusing read_timeout - these are now separate concerns
  • Configurable via ARC_QUERY_TIMEOUT environment variable
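A rough sketch of how the new section could be modelled in internal/config/config.go (field names, the TOML tag, and the env-override helper are illustrative, not the actual diff):

package config

import (
    "os"
    "strconv"
)

// QueryConfig holds query-execution settings.
type QueryConfig struct {
    // Timeout in seconds; 0 disables the timeout (default 300).
    Timeout int `toml:"timeout"`
}

// applyEnvOverrides lets ARC_QUERY_TIMEOUT take precedence over the file value.
func (c *QueryConfig) applyEnvOverrides() {
    if v := os.Getenv("ARC_QUERY_TIMEOUT"); v != "" {
        if n, err := strconv.Atoi(v); err == nil {
            c.Timeout = n
        }
    }
}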

2. Timeout applies to ALL query paths ✅

  • Added QueryWithProfileContext() method in duckdb.go
  • Both profiled and non-profiled queries now use the context with timeout:
if profileMode {
    rows, profile, err = h.db.QueryWithProfileContext(ctx, convertedSQL)  // ✅ Now has timeout
} else {
    rows, err = h.db.QueryContext(ctx, convertedSQL)  // ✅ Has timeout
}

3. Timeout metrics ✅

  • Added query_timeouts_total counter for monitoring
  • Available in JSON API (/api/v1/metrics) and Prometheus (/metrics):
# HELP arc_query_timeouts_total Queries that exceeded timeout
# TYPE arc_query_timeouts_total counter
arc_query_timeouts_total 1
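A sketch of how such a counter might be kept and read (the IncQueryTimeouts name appears elsewhere in this thread; everything else here is an assumption):

package metrics

import "sync/atomic"

// Metrics holds process-wide counters.
type Metrics struct {
    queryTimeouts atomic.Int64
}

// IncQueryTimeouts records one query that exceeded query.timeout.
func (m *Metrics) IncQueryTimeouts() {
    m.queryTimeouts.Add(1)
}

// QueryTimeouts is read by both the JSON metrics endpoint and the
// Prometheus exporter when rendering arc_query_timeouts_total.
func (m *Metrics) QueryTimeouts() int64 {
    return m.queryTimeouts.Load()
}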

Files changed:

  • internal/config/config.go - Added query.timeout config
  • internal/database/duckdb.go - Added QueryWithProfileContext()
  • internal/api/query.go - Use new config, apply timeout to profiled queries, increment timeout metric
  • internal/metrics/metrics.go - Added query_timeouts_total metric
  • cmd/arc/main.go - Use cfg.Query.Timeout instead of cfg.Server.ReadTimeout
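The corresponding wiring in cmd/arc/main.go could then be roughly (a sketch; assumes the timeout is stored as seconds, other constructor arguments omitted):

// cfg.Query.Timeout (seconds) replaces cfg.Server.ReadTimeout here.
queryTimeout := time.Duration(cfg.Query.Timeout) * time.Second
queryHandler := api.NewQueryHandler(db, queryTimeout)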

Member

@xe-nvdk xe-nvdk left a comment


Review: Query Timeout - Almost There!

Great job addressing the previous feedback! The implementation is clean and well-documented. Just one more thing needed:

Missing: Timeout Support for Arrow and Estimate Endpoints

The timeout is only applied to executeQuery. The other two query endpoints also need timeout support:

1. Arrow Endpoint (query_arrow.go)

func (h *QueryHandler) executeQueryArrow(c *fiber.Ctx) error {
    // ... needs timeout context like executeQuery
    rows, err := h.db.Query(convertedSQL)  // <- No timeout!
}

2. Estimate Endpoint (query.go:estimateQuery)

func (h *QueryHandler) estimateQuery(c *fiber.Ctx) error {
    // ... needs timeout context
    rows, err := h.db.Query(estimateSQL)  // <- No timeout!
}

Suggested Changes

Apply the same pattern from executeQuery to both endpoints:

ctx := c.UserContext()
var cancel context.CancelFunc
if h.queryTimeout > 0 {
    ctx, cancel = context.WithTimeout(ctx, h.queryTimeout)
    defer cancel()
}

rows, err := h.db.QueryContext(ctx, sql)

// Check for timeout
if h.queryTimeout > 0 && ctx.Err() == context.DeadlineExceeded {
    m.IncQueryTimeouts()
    return c.Status(fiber.StatusGatewayTimeout).JSON(...)
}

Why This Matters

  • Arrow queries can also hang indefinitely on S3 issues
  • Estimate queries run EXPLAIN ANALYZE which executes the query
  • Consistent behavior across all query APIs

Once these are added, this PR is ready to merge!

@khalid244
Author

Addressed: Timeout Support for Arrow and Estimate Endpoints ✅

Added timeout support to both endpoints as requested:

1. Arrow Endpoint (query_arrow.go)

ctx := c.UserContext()
var cancel context.CancelFunc
if h.queryTimeout > 0 {
    ctx, cancel = context.WithTimeout(ctx, h.queryTimeout)
    defer cancel()
}

rows, err := h.db.QueryContext(ctx, convertedSQL)

if h.queryTimeout > 0 && ctx.Err() == context.DeadlineExceeded {
    m.IncQueryTimeouts()
    return c.Status(fiber.StatusGatewayTimeout).JSON(...)
}

2. Estimate Endpoint (query.go:estimateQuery)

  • Same pattern applied to the COUNT(*) query execution

Files changed:

  • internal/api/query_arrow.go - Added context/metrics imports, timeout support
  • internal/api/query.go - Added timeout support to estimateQuery

All three query APIs now have consistent timeout behavior and increment the query_timeouts_total metric.

Member

@xe-nvdk xe-nvdk left a comment


@khalid244 Thanks for the updates! The main query endpoint looks great now with:

  • Dedicated query.timeout config (default 300s)
  • QueryWithProfileContext for profiled queries
  • query_timeouts_total metric
  • 504 Gateway Timeout response

However, I noticed query_arrow.go is not in the diff. Your comment mentioned adding timeout support to Arrow and Estimate endpoints, but those changes aren't in the PR yet. Can you verify and push those commits?

Missing from diff:

  • internal/api/query_arrow.go - Arrow endpoint timeout
  • estimateQuery function timeout in query.go

Once those are added, this PR is ready to merge!

Khalid added 5 commits January 26, 2026 01:37
- Add QueryContext method to DuckDB wrapper for timeout support
- Add queryTimeout field to QueryHandler using server.read_timeout
- Return HTTP 504 "Query timed out" when timeout is exceeded

When S3 becomes unavailable, queries now respect the configured
read_timeout instead of hanging for 120+ seconds waiting for
DuckDB's internal HTTP timeout.

Changes:
1. Add dedicated query.timeout config instead of reusing server.read_timeout
   - New config: query.timeout (default: 30 seconds, 0 = no timeout)
   - This separates HTTP layer timeout from query execution timeout

2. Apply timeout to profiled queries
   - Add QueryWithProfileContext method to support context/timeout
   - QueryWithProfile now delegates to QueryWithProfileContext
   - Both profiled and non-profiled queries now respect timeout

3. Update main.go to use cfg.Query.Timeout instead of cfg.Server.ReadTimeout

Addresses review feedback from xe-nvdk on PR #152

- Apply query timeout to executeQueryArrow in query_arrow.go
- Apply query timeout to estimateQuery in query.go
- Both endpoints now return HTTP 504 on timeout
- Increment query_timeouts_total metric on timeout
Member

@xe-nvdk xe-nvdk left a comment


All feedback addressed. Implementation is complete:

  • Dedicated query.timeout config (default 300s)
  • QueryWithProfileContext for profiled queries
  • Arrow endpoint timeout support
  • Estimate endpoint timeout support
  • query_timeouts_total metric
  • 504 Gateway Timeout responses

Ready to merge.

@xe-nvdk xe-nvdk merged commit 50bbb55 into Basekick-Labs:main Jan 25, 2026
xe-nvdk added a commit that referenced this pull request Jan 25, 2026
Documents the new query.timeout configuration for preventing
indefinite hangs during S3 disconnection.
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
## New Features

### InfluxDB Client Compatibility
Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support
Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)
Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only — no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning
Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)** — Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)** — Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)** — `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)** — `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency** — All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)** — Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)** — Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)** — Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)** — Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion** — Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)** — Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning** — Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)** — Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)** — Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction** — Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)** — Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)** — Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)** — Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)** — Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)** — Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)** — Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)** — Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)** — Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts** — `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization** — `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning** — Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)** — 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations** — Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization** — URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization** — `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID** — Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint** — `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model** — New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
@xe-nvdk xe-nvdk mentioned this pull request Feb 1, 2026
xe-nvdk added a commit that referenced this pull request Feb 1, 2026