
Add query timeout to prevent indefinite hangs on S3 disconnection#152

Merged
xe-nvdk merged 5 commits into Basekick-Labs:main from khalid244:feature/query-timeout
Jan 25, 2026
Conversation

@khalid244

Problem

When S3 becomes unavailable during query execution, queries hang indefinitely (120+ seconds) waiting for DuckDB's internal HTTP timeout. This causes:

  • Poor user experience with unresponsive queries
  • Connection pool exhaustion under load
  • No clear error feedback to clients

Solution

Add query timeout support that uses the existing server.read_timeout configuration value. When the timeout is exceeded, queries return HTTP 504 with a clear "Query timed out" error message.

Changes

internal/database/duckdb.go

  • Added QueryContext method that accepts a context for timeout/cancellation support

internal/api/query.go

  • Added queryTimeout field to QueryHandler struct
  • Updated NewQueryHandler to accept timeout parameter
  • Modified query execution to use QueryContext with timeout
  • Added timeout detection that returns HTTP 504 Gateway Timeout

cmd/arc/main.go

  • Pass cfg.Server.ReadTimeout to query handler
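Taken together, a minimal sketch of what the QueryContext wrapper and the handler wiring described above could look like (method bodies, field layout, and extra constructor arguments here are assumptions, not the exact diff):

package sketch

import (
    "context"
    "database/sql"
    "time"
)

// DuckDB wraps the pooled database/sql handle (stand-in for internal/database/duckdb.go).
type DuckDB struct {
    db *sql.DB
}

// QueryContext forwards the caller's context so a deadline or cancellation
// set by the handler is visible to the driver.
func (d *DuckDB) QueryContext(ctx context.Context, query string) (*sql.Rows, error) {
    return d.db.QueryContext(ctx, query)
}

// QueryHandler carries the timeout taken from server.read_timeout in this
// revision of the PR (stand-in for internal/api/query.go).
type QueryHandler struct {
    db           *DuckDB
    queryTimeout time.Duration
}

// NewQueryHandler accepts the timeout so cmd/arc/main.go can pass
// cfg.Server.ReadTimeout (other constructor arguments omitted).
func NewQueryHandler(db *DuckDB, queryTimeout time.Duration) *QueryHandler {
    return &QueryHandler{db: db, queryTimeout: queryTimeout}
}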

Configuration

Query timeout is controlled by the existing server config:

[server]
read_timeout = "30s"  # Also used as query timeout

Behavior

| Scenario | Before | After |
| --- | --- | --- |
| S3 unavailable | Hangs 120+ seconds | Returns 504 after read_timeout |
| Error message | "Query execution failed" | "Query timed out" |
| HTTP status | 500 | 504 |

Limitations

The context timeout detects when the deadline has passed but cannot forcibly cancel DuckDB's internal HTTP operations. The query will still wait for DuckDB's internal timeout, but clients receive the correct error type and status code.
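In practice that means the 504 mapping only happens once the blocking call returns; a fragment-level sketch of the detection inside the Fiber handler (context and fiber imports assumed, names match the snippets later in this thread):

ctx, cancel := context.WithTimeout(c.UserContext(), h.queryTimeout)
defer cancel()

// Blocks until DuckDB returns; the expired deadline is not pushed down
// into DuckDB's internal HTTP operations.
rows, err := h.db.QueryContext(ctx, convertedSQL)
if err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        // Deadline passed while the query was running: report 504, not 500.
        return c.Status(fiber.StatusGatewayTimeout).JSON(fiber.Map{"error": "Query timed out"})
    }
    return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{"error": "Query execution failed"})
}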

Testing

  1. Deploy with read_timeout = "30s"
  2. Block S3 egress (NetworkPolicy or firewall)
  3. Execute a query; it should return HTTP 504 with "Query timed out"

Member

@xe-nvdk xe-nvdk left a comment


Thanks for the PR! You've identified a real issue - queries can hang when S3 is unavailable. However, there are several concerns with this implementation:

1. Config Confusion

read_timeout is Fiber's HTTP read timeout - it controls how long the server waits to read the request body from the client. It does NOT control query execution time.

These are different concerns:

  • read_timeout = HTTP layer (request reading)
  • query_timeout = Application layer (query execution)

A query can run for 2 minutes even with read_timeout=30s because the request was already fully received. Reusing this config is misleading.

Suggestion: Add a dedicated query_timeout config instead of reusing read_timeout.

2. Incomplete Coverage

The timeout is only applied to the non-profiled query path:

if profileMode {
    rows, profile, err = h.db.QueryWithProfile(convertedSQL)  // ❌ No timeout
} else {
    rows, err = h.db.QueryContext(ctx, convertedSQL)  // ✅ Has timeout
}

Profiled queries will still hang indefinitely.

3. Doesn't Actually Cancel DuckDB

As you noted in "Limitations", the context timeout doesn't stop DuckDB's internal HTTP operations. So:

  • Client gets 504 after 30s ✅
  • But the DuckDB query keeps running in the background until its 120s timeout ❌
  • This wastes server resources and could cause connection pool issues

4. We Already Have Query Timeout

DuckDB itself has timeout settings that actually cancel queries:

SET http_timeout=30000;  -- 30 seconds

This is already configurable and actually stops the query.
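For reference, a setting like that could be applied when each pooled connection is set up — a sketch assuming a plain database/sql handle:

// Value is milliseconds, matching the example above (30000 = 30 seconds).
if _, err := db.Exec("SET http_timeout=30000"); err != nil {
    log.Printf("failed to set http_timeout: %v", err)
}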

Recommendation

If you want to improve timeout handling:

  1. Add a dedicated [server] query_timeout config (don't reuse read_timeout)
  2. Apply timeout to ALL query paths (including profiled queries)
  3. Consider using DuckDB's native http_timeout setting which actually cancels operations

I'd suggest closing this PR and opening an issue to discuss the best approach for query timeouts.

Member

@xe-nvdk xe-nvdk left a comment


Correction to my previous review:

I was wrong about us already having query timeout - we don't. The ReadTimeout is only for HTTP request reading, not query execution. So this PR is adding something genuinely new and useful.

Updated Review

The core idea is valid - we need query execution timeouts. However, I still have concerns:

1. Config Naming

Reusing read_timeout for query timeout is confusing since they're different concepts. Consider adding a separate query_timeout config, or at minimum document clearly that read_timeout is being reused for both purposes.

2. Incomplete Coverage

Profiled queries still don't get the timeout:

if profileMode {
    rows, profile, err = h.db.QueryWithProfile(convertedSQL)  // ❌ No timeout
} else {
    rows, err = h.db.QueryContext(ctx, convertedSQL)  // ✅ Has timeout
}

3. Context Cancellation Limitation

As you noted, the context doesn't actually cancel DuckDB's HTTP operations - it just detects the timeout. This is a known Go/CGO limitation. Worth documenting but not a blocker.


If you address #1 (separate config or clear docs) and #2 (cover profiled queries), I'd be happy to approve.

@xe-nvdk
Member

xe-nvdk commented Jan 25, 2026

Python Implementation Reference

I checked the Python legacy implementation (python-legacy branch) and found that query timeout does exist there:

File: api/duckdb_pool.py

async def execute_async(
    self,
    sql: str,
    priority: QueryPriority = QueryPriority.NORMAL,
    timeout: float = 300.0  # 5 minute default
) -> Dict[str, Any]:

Key differences from your PR:

| Aspect | Python | Your PR |
| --- | --- | --- |
| Default timeout | 300s (5 minutes) | 30s (via read_timeout) |
| Implementation | Connection pool level | Query handler level |
| Metrics | total_queries_timeout counter | None |
| Per-query control | Yes (timeout param) | No (global only) |

Recommendations based on Python implementation:

  1. Use 300s default (or make it configurable) - 30s may be too aggressive for complex queries
  2. Add a dedicated config like query_timeout instead of reusing read_timeout
  3. Add timeout metrics - Python tracks total_queries_timeout which is useful for monitoring
  4. Consider per-query timeout - Python allows different timeouts per query based on priority

This PR is essentially porting a missing feature from Python to Go, which is valuable. The implementation just needs some refinement to match what we had before.
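For point 4, a purely hypothetical sketch of what a per-request override could look like on the Go side (the X-Query-Timeout header is an invented example, not part of this PR):

// Hypothetical per-request override, clamped by the global query timeout.
timeout := h.queryTimeout
if v := c.Get("X-Query-Timeout"); v != "" {
    if n, err := strconv.Atoi(v); err == nil && n > 0 {
        if d := time.Duration(n) * time.Second; d < timeout {
            timeout = d
        }
    }
}
ctx, cancel := context.WithTimeout(c.UserContext(), timeout)
defer cancel()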

@khalid244
Author

Thanks for the detailed review! I've addressed all your feedback:

1. Dedicated query.timeout config ✅

  • Added new query.timeout config (default: 300 seconds / 5 minutes)
  • No longer reusing read_timeout - these are now separate concerns
  • Configurable via ARC_QUERY_TIMEOUT environment variable
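A rough sketch of how the new section could be modelled in internal/config/config.go (field names, the TOML tag, and the env-override helper are illustrative, not the actual diff):

package config

import (
    "os"
    "strconv"
)

// QueryConfig holds query-execution settings.
type QueryConfig struct {
    // Timeout in seconds; 0 disables the timeout (default 300).
    Timeout int `toml:"timeout"`
}

// applyEnvOverrides lets ARC_QUERY_TIMEOUT take precedence over the file value.
func (c *QueryConfig) applyEnvOverrides() {
    if v := os.Getenv("ARC_QUERY_TIMEOUT"); v != "" {
        if n, err := strconv.Atoi(v); err == nil {
            c.Timeout = n
        }
    }
}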

2. Timeout applies to ALL query paths ✅

  • Added QueryWithProfileContext() method in duckdb.go
  • Both profiled and non-profiled queries now use the context with timeout:
if profileMode {
    rows, profile, err = h.db.QueryWithProfileContext(ctx, convertedSQL)  // ✅ Now has timeout
} else {
    rows, err = h.db.QueryContext(ctx, convertedSQL)  // ✅ Has timeout
}

3. Timeout metrics ✅

  • Added query_timeouts_total counter for monitoring
  • Available in JSON API (/api/v1/metrics) and Prometheus (/metrics):
# HELP arc_query_timeouts_total Queries that exceeded timeout
# TYPE arc_query_timeouts_total counter
arc_query_timeouts_total 1
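A sketch of how such a counter might be kept and read (the IncQueryTimeouts name appears elsewhere in this thread; everything else here is an assumption):

package metrics

import "sync/atomic"

// Metrics holds process-wide counters.
type Metrics struct {
    queryTimeouts atomic.Int64
}

// IncQueryTimeouts records one query that exceeded query.timeout.
func (m *Metrics) IncQueryTimeouts() {
    m.queryTimeouts.Add(1)
}

// QueryTimeouts is read by both the JSON metrics endpoint and the
// Prometheus exporter when rendering arc_query_timeouts_total.
func (m *Metrics) QueryTimeouts() int64 {
    return m.queryTimeouts.Load()
}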

Files changed:

  • internal/config/config.go - Added query.timeout config
  • internal/database/duckdb.go - Added QueryWithProfileContext()
  • internal/api/query.go - Use new config, apply timeout to profiled queries, increment timeout metric
  • internal/metrics/metrics.go - Added query_timeouts_total metric
  • cmd/arc/main.go - Use cfg.Query.Timeout instead of cfg.Server.ReadTimeout
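The corresponding wiring in cmd/arc/main.go could then be roughly (a sketch; assumes the timeout is stored as seconds, other constructor arguments omitted):

// cfg.Query.Timeout (seconds) replaces cfg.Server.ReadTimeout here.
queryTimeout := time.Duration(cfg.Query.Timeout) * time.Second
queryHandler := api.NewQueryHandler(db, queryTimeout)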

Member

@xe-nvdk xe-nvdk left a comment


Review: Query Timeout - Almost There!

Great job addressing the previous feedback! The implementation is clean and well-documented. Just one more thing needed:

Missing: Timeout Support for Arrow and Estimate Endpoints

The timeout is only applied to executeQuery. The other two query endpoints also need timeout support:

1. Arrow Endpoint (query_arrow.go)

func (h *QueryHandler) executeQueryArrow(c *fiber.Ctx) error {
    // ... needs timeout context like executeQuery
    rows, err := h.db.Query(convertedSQL)  // <- No timeout!
}

2. Estimate Endpoint (query.go:estimateQuery)

func (h *QueryHandler) estimateQuery(c *fiber.Ctx) error {
    // ... needs timeout context
    rows, err := h.db.Query(estimateSQL)  // <- No timeout!
}

Suggested Changes

Apply the same pattern from executeQuery to both endpoints:

ctx := c.UserContext()
var cancel context.CancelFunc
if h.queryTimeout > 0 {
    ctx, cancel = context.WithTimeout(ctx, h.queryTimeout)
    defer cancel()
}

rows, err := h.db.QueryContext(ctx, sql)

// Check for timeout
if h.queryTimeout > 0 && ctx.Err() == context.DeadlineExceeded {
    m.IncQueryTimeouts()
    return c.Status(fiber.StatusGatewayTimeout).JSON(...)
}

Why This Matters

  • Arrow queries can also hang indefinitely on S3 issues
  • Estimate queries run EXPLAIN ANALYZE which executes the query
  • Consistent behavior across all query APIs

Once these are added, this PR is ready to merge!

@khalid244
Author

Addressed: Timeout Support for Arrow and Estimate Endpoints ✅

Added timeout support to both endpoints as requested:

1. Arrow Endpoint (query_arrow.go)

ctx := c.UserContext()
var cancel context.CancelFunc
if h.queryTimeout > 0 {
    ctx, cancel = context.WithTimeout(ctx, h.queryTimeout)
    defer cancel()
}

rows, err := h.db.QueryContext(ctx, convertedSQL)

if h.queryTimeout > 0 && ctx.Err() == context.DeadlineExceeded {
    m.IncQueryTimeouts()
    return c.Status(fiber.StatusGatewayTimeout).JSON(...)
}

2. Estimate Endpoint (query.go:estimateQuery)

  • Same pattern applied to the COUNT(*) query execution

Files changed:

  • internal/api/query_arrow.go - Added context/metrics imports, timeout support
  • internal/api/query.go - Added timeout support to estimateQuery

All three query APIs now have consistent timeout behavior and increment the query_timeouts_total metric.

Member

@xe-nvdk xe-nvdk left a comment


@khalid244 Thanks for the updates! The main query endpoint looks great now with:

  • Dedicated query.timeout config (default 300s)
  • QueryWithProfileContext for profiled queries
  • query_timeouts_total metric
  • 504 Gateway Timeout response

However, I noticed query_arrow.go is not in the diff. Your comment mentioned adding timeout support to Arrow and Estimate endpoints, but those changes aren't in the PR yet. Can you verify and push those commits?

Missing from diff:

  • internal/api/query_arrow.go - Arrow endpoint timeout
  • estimateQuery function timeout in query.go

Once those are added, this PR is ready to merge!

Khalid added 5 commits January 26, 2026 01:37
- Add QueryContext method to DuckDB wrapper for timeout support
- Add queryTimeout field to QueryHandler using server.read_timeout
- Return HTTP 504 "Query timed out" when timeout is exceeded

When S3 becomes unavailable, queries now respect the configured
read_timeout instead of hanging for 120+ seconds waiting for
DuckDB's internal HTTP timeout.

Changes:
1. Add dedicated query.timeout config instead of reusing server.read_timeout
   - New config: query.timeout (default: 30 seconds, 0 = no timeout)
   - This separates HTTP layer timeout from query execution timeout

2. Apply timeout to profiled queries
   - Add QueryWithProfileContext method to support context/timeout
   - QueryWithProfile now delegates to QueryWithProfileContext
   - Both profiled and non-profiled queries now respect timeout

3. Update main.go to use cfg.Query.Timeout instead of cfg.Server.ReadTimeout

Addresses review feedback from xe-nvdk on PR #152

- Apply query timeout to executeQueryArrow in query_arrow.go
- Apply query timeout to estimateQuery in query.go
- Both endpoints now return HTTP 504 on timeout
- Increment query_timeouts_total metric on timeout
Member

@xe-nvdk xe-nvdk left a comment


All feedback addressed. Implementation is complete:

  • Dedicated query.timeout config (default 300s)
  • QueryWithProfileContext for profiled queries
  • Arrow endpoint timeout support
  • Estimate endpoint timeout support
  • query_timeouts_total metric
  • 504 Gateway Timeout responses

Ready to merge.

@xe-nvdk xe-nvdk merged commit 50bbb55 into Basekick-Labs:main Jan 25, 2026
xe-nvdk added a commit that referenced this pull request Jan 25, 2026
Documents the new query.timeout configuration for preventing
indefinite hangs during S3 disconnection.
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
## New Features

### InfluxDB Client Compatibility
Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support
Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)
Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only — no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning
Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)** — Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)** — Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)** — `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)** — `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency** — All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)** — Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)** — Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)** — Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)** — Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion** — Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)** — Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning** — Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)** — Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)** — Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction** — Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)** — Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)** — Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)** — Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)** — Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)** — Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)** — Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)** — Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)** — Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts** — `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization** — `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning** — Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)** — 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations** — Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization** — URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization** — `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID** — Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint** — `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model** — New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
@xe-nvdk xe-nvdk mentioned this pull request Feb 1, 2026
xe-nvdk added a commit that referenced this pull request Feb 1, 2026