Fix WHERE clause regex to match across newlines by khalid244 · Pull Request #148 · Basekick-Labs/arc

khalid244 · 2026-01-21T22:39:13Z

Summary

- Fix partition pruning failure for multi-line SQL queries.
- Change `.+?` to `[\s\S]+?` in the `whereClausePattern` regex so that it matches newlines.

Fixes #146

The `whereClausePattern` regex used `.+?`, and in Go's `regexp` package `.` does not match newline characters unless the `s` flag is set. As a result, the WHERE clause was not extracted from multi-line SQL queries, partition pruning failed, and those queries fell back to full table scans instead of targeted partition paths. The pattern now uses `[\s\S]+?`, which explicitly matches any character, including newlines.
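The difference between the two patterns can be reproduced in isolation. The sketch below is an illustration only: `oldWhere`, `newWhere`, `extractWhere`, and the terminating keywords are assumptions made for this example, not Arc's actual `whereClausePattern`. Only the `.+?` versus `[\s\S]+?` change comes from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-ins for the old and new patterns; the real
// whereClausePattern in Arc may be structured differently.
var (
	oldWhere = regexp.MustCompile(`(?i)\bWHERE\s+(.+?)(?:\s+GROUP\s+BY|\s+ORDER\s+BY|\s+LIMIT|$)`)
	newWhere = regexp.MustCompile(`(?i)\bWHERE\s+([\s\S]+?)(?:\s+GROUP\s+BY|\s+ORDER\s+BY|\s+LIMIT|$)`)
)

// extractWhere returns the captured WHERE body, or "" if the pattern does not match.
func extractWhere(re *regexp.Regexp, sql string) string {
	m := re.FindStringSubmatch(sql)
	if m == nil {
		return ""
	}
	return m[1]
}

// A query whose WHERE conditions span two lines.
const multiLineQuery = "SELECT *\nFROM cpu\nWHERE time >= '2026-01-01'\n  AND time < '2026-01-02'\nORDER BY time"

func main() {
	fmt.Printf("old: %q\n", extractWhere(oldWhere, multiLineQuery))
	fmt.Printf("new: %q\n", extractWhere(newWhere, multiLineQuery))
}
```

With the old pattern the capture group cannot cross the newline between the two conditions, so no match is found. With the new pattern both conditions are captured. Setting the `(?s)` flag on the old pattern would be an alternative way to get the same effect.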

xe-nvdk

Reviewed and tested extensively. The fix is correct:

- Changes `.+?` to `[\s\S]+?` to match across newlines in Go regex
- All existing tests pass (21+ test cases)
- Multi-line queries now correctly extract time ranges for partition pruning
- No regressions detected
- Edge cases (subqueries, CTEs, UNION, long clauses) handled correctly

Thanks for the contribution, @khalid244!

Note: We'll add a test case for multi-line queries to prevent future regressions.

@khalid244

## New Features

### InfluxDB Client Compatibility

Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support

Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)

Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only: no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning

Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)**: Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)**: Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)**: `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)**: `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency**: All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)**: Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)**: Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)**: Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)**: Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion**: Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)**: Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning**: Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)**: Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)**: Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction**: Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)**: Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)**: Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)**: Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)**: Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)**: Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)**: Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)**: Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)**: Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts**: `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization**: `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning**: Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)**: 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations**: Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization**: URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization**: `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID**: Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint**: `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model**: New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.
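The string literal masking mentioned under Bug Fixes is related to the regex change in this PR, so a rough sketch may help. The release notes do not describe how Arc implements it. The function name `maskStringLiterals`, the `x` filler character, and the handling of doubled quotes are assumptions made for this example. The sketch preserves length, so offsets found in the masked text map back to the original query.

```go
package main

import (
	"fmt"
	"strings"
)

// maskStringLiterals replaces the contents of single-quoted SQL literals with
// 'x' bytes of the same length, so keywords inside literals cannot be matched
// and byte offsets stay aligned with the original query.
// A doubled quote ('') inside a literal is treated as an escaped quote.
// Hypothetical helper for illustration; not Arc's implementation.
func maskStringLiterals(sql string) string {
	var b strings.Builder
	b.Grow(len(sql))
	inLiteral := false
	for i := 0; i < len(sql); i++ {
		c := sql[i]
		switch {
		case !inLiteral && c == '\'':
			inLiteral = true
			b.WriteByte(c)
		case inLiteral && c == '\'':
			if i+1 < len(sql) && sql[i+1] == '\'' {
				b.WriteString("xx") // escaped quote stays inside the literal
				i++
				continue
			}
			inLiteral = false
			b.WriteByte(c)
		case inLiteral:
			b.WriteByte('x')
		default:
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	q := "SELECT * FROM logs WHERE msg = 'ORDER BY' AND level = 'it''s'"
	fmt.Println(maskStringLiterals(q))
}
```

A clause-boundary regex would then run on the masked text, and the matched offsets would be used to slice the original query so that literal values such as timestamps remain readable.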

This comment was marked as duplicate.


xe-nvdk merged commit 2ade299 into Basekick-Labs:main Jan 21, 2026

xe-nvdk approved these changes Jan 21, 2026


xe-nvdk mentioned this pull request Feb 1, 2026

release: Arc v26.02.1 #175

Merged

xe-nvdk merged 1 commit into Basekick-Labs:main from khalid244:fix/where-clause-newline-regex
