Timeseries queries only read active memtable — flushed partitions and WAL segments invisible #6

Summary

When ingesting large volumes of data into a timeseries collection via ILP, queries return results only from the active memtable. Once data is flushed to TS partitions on disk or sits in WAL segments, it becomes completely invisible to the query engine: SELECT, COUNT(*), WHERE, and GROUP BY all see only the unflushed memtable.

This means at scale, the query engine is blind to 99%+ of ingested data.

Environment

  • NodeDB version: v0.0.0-beta.1 (branch main, commit fd8fb8f)
  • OS: Linux 6.17.0 / 8GB RAM / 4 CPU
  • Ingest method: ILP over TCP on port 8086
  • Collection type: Timeseries (CREATE TIMESERIES dns_mega)

Reproduction steps

  1. Start NodeDB with NODEDB_PORT_ILP=8086 ./target/release/nodedb
  2. Create a timeseries collection: CREATE TIMESERIES dns_mega
  3. Ingest 100M rows via ILP (high-cardinality DNS telemetry data with tags client_ip, qname, qtype, rcode, upstream, cached, interface, interface_ip and field elapsed_ms); see the ingest sketch after this list
  4. Wait for memtable flushes to occur (they happen automatically as data accumulates)
  5. Run: SELECT COUNT(*) FROM dns_mega
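
A minimal ingest sketch for step 3, assuming the port-8086 ILP listener accepts standard newline-delimited InfluxDB line protocol over plain TCP (tag and field names mirror the reproduction; the value generators are synthetic placeholders):

import socket
import time

HOST, PORT = "127.0.0.1", 8086
BATCH = 5000

def ilp_line(i: int) -> str:
    # One ILP point: measurement,tags fields timestamp_ns
    return (
        f"dns_mega,client_ip=10.0.{i % 256}.{(i % 254) + 1},"
        f"qname=host{i % 100000}.example.com,"
        f"qtype={'A' if i % 5 < 2 else 'AAAA'},"  # ~40% 'A', matching the observed mix
        f"rcode={'SERVFAIL' if i % 50 == 0 else 'NOERROR'},"
        f"upstream=8.8.8.8,cached={'true' if i % 3 == 0 else 'false'},"
        f"interface=eth0,interface_ip=192.168.1.1 "
        f"elapsed_ms={(i % 300) / 10} {time.time_ns()}"
    )

with socket.create_connection((HOST, PORT)) as sock:
    for start in range(0, 100_000_000, BATCH):
        payload = "\n".join(ilp_line(start + j) for j in range(BATCH)) + "\n"
        sock.sendall(payload.encode())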

Expected behavior

COUNT(*) should return 100,000,000 (or close to it, accounting for any in-flight data).

Actual behavior

COUNT(*) returns 931,455 — which is exactly the number of rows in the active memtable. The other ~99M rows exist on disk but are not queried.

Evidence — data is on disk, just not queried

Flushed TS partitions (18 partitions, 18.8M rows)

$ ls -d ~/.local/share/nodedb/ts/dns_mega/ts-* | wc -l
18

$ # Row counts from partition.meta files:
ts-1774916767625_1774921964441: 1,048,782 rows
ts-1774921967625_1774927398621: 1,049,009 rows
ts-1774927398625_1774932638623: 1,049,040 rows
ts-1774932638625_1774937958623: 1,048,729 rows
...
(18 partitions, each ~1.05M rows)
TOTAL in flushed partitions: 18,880,226 rows

Each partition is ~15MB with proper columnar files (.col, .sym, schema.json, partition.meta, sparse_index.bin).
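
To total the flushed rows programmatically, a hypothetical sketch (it assumes partition.meta is JSON with a row_count key; the real key name and encoding may differ):

import glob
import json
import os

base = os.path.expanduser("~/.local/share/nodedb/ts/dns_mega")
total = 0
for meta_path in sorted(glob.glob(f"{base}/ts-*/partition.meta")):
    with open(meta_path) as f:
        meta = json.load(f)
    # "row_count" is a placeholder key -- adjust to the actual field name
    part = os.path.basename(os.path.dirname(meta_path))
    print(f"{part}: {meta['row_count']:,} rows")
    total += meta["row_count"]
print(f"TOTAL in flushed partitions: {total:,} rows")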

WAL segments (73 segments, ~81M rows)

$ ls ~/.local/share/nodedb/wal/*.seg | wc -l
73

$ ls -lh ~/.local/share/nodedb/wal/*.seg | head -3
-rw-rw-r-- 1 system system 65M ... wal-00000000000000021981.seg
-rw-rw-r-- 1 system system 65M ... wal-00000000000000033197.seg
-rw-rw-r-- 1 system system 65M ... wal-00000000000000040883.seg

73 WAL segments × 65MB each = 4.7GB in WAL. Each segment also has a corresponding .dwb file.

The math

Location                 Rows            Size
Flushed TS partitions    18,880,226      256MB
WAL segments             ~81,000,000     4.7GB
Active memtable          931,455         (in memory)
Total                    ~100,000,000    ~5GB
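
The WAL row estimate in the table is simply the remainder of the ingest after subtracting the other two locations:

ingested = 100_000_000
flushed  = 18_880_226   # summed from partition.meta
memtable = 931_455      # the COUNT(*) result
wal_rows = ingested - flushed - memtable
print(f"{wal_rows:,}")  # 80,188,319 -- the "~81M" estimate above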

Only the memtable (931,455 rows) is visible to queries.

Additional observations

WHERE predicates are completely ignored on COUNT

Every WHERE clause returns the same 931,455 count, regardless of the predicate — even logically impossible ones:

SELECT COUNT(*) FROM dns_mega WHERE rcode = 'SERVFAIL'      → 931,455
SELECT COUNT(*) FROM dns_mega WHERE rcode = 'NONEXISTENT'   → 931,455  -- impossible value
SELECT COUNT(*) FROM dns_mega WHERE elapsed_ms > 99999      → 931,455
SELECT COUNT(*) FROM dns_mega WHERE elapsed_ms < 0          → 931,455  -- no negative values exist
SELECT COUNT(*) FROM dns_mega WHERE qtype = 'A'             → 931,455  -- should be ~40% of total

This suggests the timeseries COUNT(*) code path bypasses predicate evaluation entirely and just returns the memtable row count.
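
Purely as an illustration of the suspected shape (not NodeDB source), a count path that short-circuits to the memtable's size reproduces the observed behavior, returning the same number for every predicate:

# Hypothetical illustration, not actual NodeDB code.
memtable = [{"rcode": "NOERROR", "elapsed_ms": 12.5}] * 931_455

def count_suspected(predicate):
    # Suspected bug shape: predicate never evaluated, disk never read.
    return len(memtable)

def count_expected(predicate, partitions=(), wal=()):
    # Expected: evaluate the predicate over every data source.
    sources = [memtable, *partitions, *wal]
    return sum(1 for rows in sources for row in rows if predicate(row))

print(count_suspected(lambda r: r["rcode"] == "NONEXISTENT"))  # 931455
print(count_expected(lambda r: r["rcode"] == "NONEXISTENT"))   # 0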

GROUP BY works but only on memtable data

GROUP BY queries do function correctly against the memtable — they return proper aggregations with correct distributions. But they only see the ~931K memtable rows, not the full 100M.

WAL segments not reclaimed after TS flush

The 73 WAL segments (4.7GB) persist on disk even though 18 TS partitions have been flushed from them. This appears to be a separate WAL garbage collection issue, but is related — the WAL contains the "missing" ~81M rows that are neither in TS partitions nor in the memtable.
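
One way to confirm the reclamation gap, assuming file mtimes track write/flush order (paths as in the evidence above):

import glob
import os

wal = sorted(glob.glob(os.path.expanduser("~/.local/share/nodedb/wal/*.seg")),
             key=os.path.getmtime)
parts = sorted(glob.glob(os.path.expanduser("~/.local/share/nodedb/ts/dns_mega/ts-*")),
               key=os.path.getmtime)
# Any segment older than the newest flushed partition is at least a
# candidate for reclamation once its rows are covered by partitions.
newest_flush = os.path.getmtime(parts[-1])
stale = [s for s in wal if os.path.getmtime(s) < newest_flush]
print(f"{len(stale)} of {len(wal)} WAL segments predate the newest flush")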

Impact

  • Critical for production use: Any dataset that exceeds memtable capacity becomes mostly unqueryable
  • The 100M row ingest completed at 684K rows/sec with 0 errors, RSS stayed under 220MB — the ingest path is solid, but the query path doesn't follow the data to disk
  • Users will see counts decrease over time as data flushes out of the memtable
