fix(compaction): cleanup orphaned temp directories on startup by xe-nvdk · Pull Request #165 · Basekick-Labs/arc

xe-nvdk · 2026-01-25T22:11:42Z

Summary

Fixes #164

When a pod crashes mid-compaction, the deferred cleanup at job.go:202 never runs, so temp directories accumulate at ./data/compaction/{job_id}/. This PR adds a CleanupOrphanedTempDirs() method that removes all orphaned temp directories on startup.

Changes

| File | Change |
| --- | --- |
| `internal/compaction/manager.go` | Add `CleanupOrphanedTempDirs()` method |
| `cmd/arc/main.go` | Call cleanup on startup before compaction schedulers |
| `internal/compaction/manager_test.go` | Add unit tests |

Behavior

On startup, before compaction schedulers are started:

1. Read all entries in TempDirectory (default: ./data/compaction/)
2. Remove any subdirectories (these are orphaned job temp dirs)
3. Log each removed directory and the total count

Test plan

- Unit tests pass: `go test ./internal/compaction/... -run TestManager_CleanupOrphanedTempDirs -v`
- All compaction tests pass
- Build passes

Manual verification:

mkdir -p ./data/compaction/fake_orphan_job_123
# Start arc
# Check logs for "Cleaned up orphaned temp directory"
# Verify ./data/compaction/fake_orphan_job_123 is removed

Fixes #164

When a pod crashes mid-compaction, the defer cleanup never runs and temp directories accumulate at ./data/compaction/{job_id}/. This adds a CleanupOrphanedTempDirs() method that removes all orphaned temp directories on startup.

- Add CleanupOrphanedTempDirs() method to Manager
- Call cleanup on startup before compaction schedulers start
- Add unit tests for cleanup functionality

…notes

Documents Issue #164 and PR #165 fixes:

- Startup cleanup for orphaned directories from previous crashes
- Parent-side cleanup after subprocess completes/fails

@khalid244

## New Features

### InfluxDB Client Compatibility

Arc's Line Protocol endpoints now use the same paths as InfluxDB, enabling drop-in compatibility with all official InfluxDB client libraries (Go, Python, JavaScript, Java, C#, PHP, Ruby, Telegraf, Node-RED).

- `/api/v1/write` → `/write` (InfluxDB 1.x clients, Telegraf)
- `/api/v1/write/influxdb` → `/api/v2/write` (InfluxDB 2.x clients)
- Arc-native endpoint `/api/v1/write/line-protocol` preserved
- Supports Bearer token, Token header, API key header, and query parameter authentication

### MQTT Ingestion Support

Native MQTT subscription for IoT and edge data ingestion. Connect directly to MQTT brokers without requiring additional infrastructure.

- Subscribe to multiple MQTT topics with wildcard support (`+`, `#`)
- Dynamic subscription management via REST API
- TLS/SSL connections with certificate validation
- Authentication via username/password or client certificates
- Connection auto-reconnect with exponential backoff
- Per-subscription statistics and monitoring
- Passwords encrypted at rest
- Auto-start subscriptions on server restart

### S3 File Caching via cache_httpfs Extension (PR #149)

Optional in-memory caching of S3 Parquet files via DuckDB's `cache_httpfs` community extension. 5-10x query performance improvement for workloads with repeated file access (CTEs, subqueries, Grafana dashboards).

- In-memory only: no disk caching, preserves stateless compute philosophy
- Opt-in, configurable cache size and TTL
- Graceful degradation if extension fails to load

*Contributed by @khalid244*

### Relative Time Expression Support in Partition Pruning

Queries using `NOW() - INTERVAL` now benefit from partition pruning. Supports seconds, minutes, hours, days, weeks, months.

## Bug Fixes

- **Control Characters in Measurement Names Break S3 (Issue #122)**: Added strict validation for measurement names across all ingestion endpoints (Line Protocol, MsgPack, Continuous Queries). Names must start with a letter, contain only alphanumeric/underscore/hyphen, max 128 chars.
- **Partition Pruner Fails on Non-Existent S3 Partitions (Issue #125, #144, PR #145)**: Fixed "No files found" errors when time range includes non-existent S3 partitions. Extended `filterExistingPaths()` for S3/Azure storage. Also fixed `filepath.Join()` mangling S3 URLs. *(Day-level fix by @khalid244)*
- **Server Timeout Config Values Ignored (Issue #126)**: `server.read_timeout` and `server.write_timeout` now correctly use configuration values instead of hardcoded 30s.
- **Large Payload Ingestion (413 Request Entity Too Large)**: `MaxPayloadSize` config now correctly passed to Fiber's `BodyLimit`.
- **Query Results Timestamp Timezone Inconsistency**: All timestamps in query results now normalized to UTC.
- **Azure Blob Storage Query SSL Certificate Errors on Linux (PR #92)**: Fixed by using system curl for SSL on Linux. *(Contributed by @schotime)*
- **UTC Consistency for Compaction Filenames (PR #132)**: Compacted filenames now use UTC timestamps consistently. *(Contributed by @schotime)*
- **S3 Subprocess Configuration Issues (Issue #131)**: Fixed missing credentials and SSL config forwarding to compaction subprocess.
- **Query Failures with Non-UTF8 Data (Issue #136)**: Added automatic UTF-8 sanitization during ingestion with optimized fast-path (~6-25ns overhead).
- **Nanosecond Timestamp Support for MessagePack Ingestion**: Extended timestamp detection to handle 19-digit nanosecond timestamps (important for InfluxDB migrations).
- **WHERE Clause Regex Fails to Match Multi-Line Queries (Issue #146, PR #148)**: Fixed partition pruner failing on multi-line SQL queries. *(Contributed by @khalid244)*
- **String Literals Containing SQL Keywords Break Partition Pruning**: Added string literal masking before regex matching.
- **Buffer Age-Based Flush Timing Under High Load (Issue #142)**: Ticker now fires at half the configured interval for more consistent flush timing (~25% improvement).
- **Arrow Writer Panic During High-Concurrency Writes (Issue #130)**: Fixed out-of-bounds panic during schema evolution with concurrent writes.
- **Empty Directories Not Cleaned Up After Daily Compaction**: Automatic cleanup of empty hour-level partition directories after compaction.
- **Compactor OOM and Segfaults with Large Datasets (Issue #102)**: Streaming I/O, memory limit passthrough, file batching (1000 max), and adaptive batch sizing on failure.
- **Orphaned Compaction Temp Directories (Issue #164, PR #165)**: Two-layer cleanup: startup cleanup + parent-side cleanup after subprocess completion.
- **Compaction Data Duplication on Crash (Issue #157, PR #163)**: Manifest-based tracking in S3 prevents re-compaction after crash. *(Contributed by @khalid244)*
- **WAL-Based S3 Recovery (Issue #159, PR #162)**: Fixed data loss during S3 outages with startup recovery, periodic recovery, and backpressure handling. *(Contributed by @khalid244)*
- **Tiered Storage Query Routing (Issue #166, PR #167)**: Fixed queries not routing to cold tier data and database listing not showing cold-only databases.
- **Retention Policies for S3/Azure Storage Backends (Issue #169, PR #170)**: Retention policies now work with all storage backends.
- **Retention Policy Empty Directory Cleanup (Issue #171)**: Empty directories cleaned up after retention policy file deletion.
- **Query Timeout for S3 Disconnection (Issue #151, PR #152)**: Configurable query timeout prevents indefinite hangs. Returns HTTP 504 when exceeded. *(Contributed by @khalid244)*

## Improvements

- **Configurable Server Idle and Shutdown Timeouts**: `server.idle_timeout` and `server.shutdown_timeout` now configurable.
- **Automatic Time Function Query Optimization**: `time_bucket()` and `date_trunc()` automatically rewritten to epoch arithmetic (2-2.5x faster GROUP BY queries).
- **Parallel Partition Scanning**: Queries spanning 3+ partitions now execute concurrently (2-4x speedup).
- **Two-Stage Distributed Aggregation (Enterprise)**: 5-20x speedup for cross-shard aggregations via scatter/gather.
- **DuckDB Query Engine Optimizations**: Parquet metadata caching, prefetching, pool-wide `SET GLOBAL` consistency (18-24% faster aggregations). *(SET GLOBAL fix by @khalid244)*
- **Automatic Regex-to-String Function Optimization**: URL domain extraction patterns rewritten to native string functions (2x+ faster).
- **Database Header for Query Optimization**: `x-arc-database` header skips regex parsing (5-17% faster queries).
- **MQTT Client Auto-Generated Client ID**: Prevents client ID collisions across multiple Arc instances.
- **MQTT Restart Endpoint**: `/api/v1/restart` to apply MQTT config changes without server restart.

## Security

- **Token Hashing Security Model**: New tokens hashed with bcrypt (cost 10). SHA256 prefixes for O(1) lookup. Legacy tokens supported.

## Breaking Changes

- `/api/v1/write` → `/write` and `/api/v1/write/influxdb` → `/api/v2/write`. Update client config if using old Arc-specific paths. InfluxDB client libraries need no changes.

## Contributors

Thanks to @schotime (Adam Schroder) and @khalid244 for their contributions to this release.

xe-nvdk merged commit 61ab359 into main Jan 25, 2026
5 checks passed

xe-nvdk mentioned this pull request Feb 1, 2026

release: Arc v26.02.1 #175

Merged

xe-nvdk merged 1 commit into main from fix/cleanup-orphaned-compaction-temp-164
