Skip to content

perf(ingest): remove dead code and optimize ingestion path#159

Merged
xe-nvdk merged 2 commits intomainfrom
perf/ingest-optimization-26.02.1
Jan 25, 2026
Merged

perf(ingest): remove dead code and optimize ingestion path#159
xe-nvdk merged 2 commits intomainfrom
perf/ingest-optimization-26.02.1

Conversation

@xe-nvdk
Copy link
Copy Markdown
Member

@xe-nvdk xe-nvdk commented Jan 25, 2026

Summary

Code cleanup and performance optimizations for release 26.02.1:

  • Remove ~195 lines of dead code (unused WriteParquetFromInterface, buildArrayFromInterface, getSchemaFromInterface)
  • Consolidate duplicate checkWritePermissions() into shared permissions.go
  • Fix duplicate Parquet writer properties creation (was allocating twice per write)
  • Optimize tryXxxZeroCopy() functions to check type before allocating (avoids 1-2MB wasted allocation on type mismatch)

Changes

File Lines Change
internal/ingest/arrow_writer.go -195 Remove dead code, fix writer props, optimize zero-copy
internal/api/lineprotocol.go -32 Use shared permissions function
internal/api/msgpack.go -32 Use shared permissions function
internal/api/permissions.go +47 New shared RBAC permission check

Net change: -223 lines

Benchmark Results

geomean: -0.17% (no regression, slight improvement)

All existing benchmarks pass with comparable or better performance.

Test plan

  • Unit tests pass (go test ./internal/ingest/... ./internal/api/...)
  • Build succeeds (go build ./...)
  • Benchmarks show no regression
  • Manual endpoint testing for /write, /api/v2/write, /api/v1/write/msgpack

Code cleanup and performance optimizations for release 26.02.1:

1. Remove dead code (~195 lines):
   - WriteParquetFromInterface() - never called
   - buildArrayFromInterface() - never called
   - getSchemaFromInterface() - only used by above

2. Consolidate duplicate code:
   - Extract checkWritePermissions() to shared permissions.go
   - Reduce code duplication between lineprotocol.go and msgpack.go

3. Performance optimizations:
   - Fix duplicate Parquet writer properties creation (was creating twice)
   - Optimize tryXxxZeroCopy() to check first element type before allocating
     (avoids 1-2MB wasted allocation on type mismatch)

Net change: -223 lines of code removed

Benchmarks show no regression, slight improvement in geomean (-0.17%).
The pre-check optimization added in the previous commit caused a 2%
throughput regression:
- Before: 17,937,193 records/sec
- After:  17,590,365 records/sec

Root cause: The extra checks (len, nil, type assertion) on every call
added overhead in the hot path. The "saved allocation" case (type
mismatch) is rare in real workloads where data is typically homogeneous.

Revert to original single-pass implementation which allocates and
converts in one loop iteration.
@xe-nvdk xe-nvdk merged commit d8bb37b into main Jan 25, 2026
5 checks passed
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
## New Features

### InfluxDB Client Compatibility
Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support
Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)
Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only — no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning
Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)** — Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)** — Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)** — `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)** — `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency** — All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)** — Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)** — Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)** — Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)** — Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion** — Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)** — Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning** — Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)** — Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)** — Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction** — Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)** — Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)** — Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)** — Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)** — Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)** — Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)** — Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)** — Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)** — Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts** — `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization** — `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning** — Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)** — 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations** — Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization** — URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization** — `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID** — Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint** — `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model** — New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
## New Features

### InfluxDB Client Compatibility
Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support
Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)
Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only — no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning
Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)** — Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)** — Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)** — `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)** — `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency** — All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)** — Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)** — Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)** — Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)** — Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion** — Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)** — Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning** — Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)** — Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)** — Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction** — Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)** — Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)** — Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)** — Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)** — Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)** — Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)** — Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)** — Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)** — Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts** — `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization** — `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning** — Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)** — 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations** — Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization** — URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization** — `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID** — Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint** — `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model** — New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.
@xe-nvdk xe-nvdk mentioned this pull request Feb 1, 2026
xe-nvdk added a commit that referenced this pull request Feb 1, 2026
## New Features

### InfluxDB Client Compatibility
Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support
Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)
Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only — no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning
Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)** — Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)** — Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)** — `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)** — `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency** — All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)** — Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)** — Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)** — Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)** — Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion** — Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)** — Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning** — Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)** — Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)** — Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction** — Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)** — Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)** — Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)** — Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)** — Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)** — Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)** — Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)** — Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)** — Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts** — `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization** — `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning** — Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)** — 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations** — Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization** — URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization** — `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID** — Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint** — `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model** — New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant