
Query optimization (SCHEMA CHANGE REQUIRED) #1

Merged: ian-ross merged 17 commits into main from query-optimization on Apr 28, 2026
Conversation

ian-ross (Member) commented — Apr 23, 2026

Query optimization using Claude Code. The following description is lightly edited from Claude's own description of the changes.


Motivation

The flight query API is going to be used in a long-lived server process to load trajectory data for projecting onto ground-based camera video. A representative query — 1-hour time window, 50 km bounding box, spatially_crosses + filter_waypoints — was taking ~1.9 s, well above the ~200 ms target needed to keep below video-stream initialization time.

What was changed

1. Vectorized point parsing with numpy (feder-common)

Root cause: Point.unpack deserialized trajectory points one at a time in a Python loop, calling struct.unpack and round(..., 3) (milli()) per point. With ~890 K points unpacked per query call, this generated 22 M Python function calls and dominated the profile at ~76 % of wall time.

Fix:

  • Added a _POINT_DTYPE numpy structured dtype matching the on-disk binary layout ('>Lddddd?').
  • Added Point._unpack_blob(data) — parses a full decompressed blob in one numpy.frombuffer call.
  • Added Point._array_to_points(arr) — materialises Point objects only for trajectories that survive all filters.
  • Replaced the Python-lambda points_checker (spatial crossing check) and _make_point_filter (waypoint filter) with numpy boolean-mask operations, so Point objects are never constructed for the ~59 % of R-tree candidates that are ultimately rejected.
  • Removed milli() from the hot path. It was originally needed for round-tripping against an old CSV format; it is preserved in utils.py and remains in tests/conftest.py where it still makes comparisons robust.

Result: 1.90 s → 0.56 s (3.4×)
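
As a sketch of the approach (field names here are invented, not the actual feder-common definitions), a structured dtype lets a single numpy.frombuffer call replace millions of per-point struct.unpack/round calls:

    import numpy as np

    # Hypothetical field names; the on-disk layout is struct format
    # '>Lddddd?' (big-endian u32, five float64s, one bool) = 45 bytes/point.
    _POINT_DTYPE = np.dtype([
        ("timestamp", ">u4"),
        ("lat", ">f8"),
        ("lon", ">f8"),
        ("alt", ">f8"),
        ("speed", ">f8"),
        ("heading", ">f8"),
        ("flag", "?"),
    ])

    def unpack_blob(data: bytes) -> np.ndarray:
        # One C-level parse of the whole decompressed blob; filters then
        # run as boolean masks (e.g. arr["lat"] >= lat_min) without ever
        # constructing Point objects for rejected trajectories.
        return np.frombuffer(data, dtype=_POINT_DTYPE)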

2. Parallel bz2 decompression with a thread pool (feder-common)

Root cause: After the numpy change, profiling showed bz2 decompression of ~1950 trajectory blobs was 92 % of remaining time. Decompressions were serial, and bz2 is a C extension that releases the GIL — straightforward to parallelise with threads.

Fix:

  • Added a module-level ThreadPoolExecutor (default 8 workers — empirically the sweet spot on the test machine; more threads increase scheduling and memory-bandwidth contention without adding throughput for this workload).
  • Added _process_blob(blob, points_check, pt_filter) as the per-trajectory unit of work submitted to the pool: decompress → numpy parse → spatial check → waypoint filter, all running in parallel.
  • Removed the batched(ids, 50) loop in query_flights: all R-tree IDs are now passed to _retrieve in a single SQL query, giving the pool the maximum available parallelism in one shot. (A follow-up commit restores batching with a 5000-ID batch size after large queries hit SQLite's SQLITE_MAX_VARIABLE_NUMBER limit; see the commit log below.)

Result: 0.56 s → ~0.18 s median (further 3.1×)
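
The shape of the change, sketched with assumed helper names (the real _process_blob lives in feder-common's db.py):

    import bz2
    from concurrent.futures import ThreadPoolExecutor

    N_WORKERS = 8  # empirical sweet spot on the test machine
    _pool = ThreadPoolExecutor(max_workers=N_WORKERS)

    def _process_blob(blob, points_check, pt_filter):
        # Per-trajectory unit of work.  bz2.decompress is a C call that
        # releases the GIL, so the pool gets real parallelism here.
        arr = unpack_blob(bz2.decompress(blob))  # numpy parse from section 1
        if not points_check(arr):                # spatial crossing mask
            return None
        return pt_filter(arr)                    # waypoint filter mask

    # futures = [_pool.submit(_process_blob, b, check, filt) for b in blobs]
    # kept = [r for r in (f.result() for f in futures) if r is not None]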

Overall results

                          Time (per call)    Speedup vs baseline
  Baseline                1.90 s             —
  + numpy vectorisation   0.56 s             3.4×
  + thread pool           ~0.18 s median     ~10.5×

Measured on a warm server process (OS page cache hot, 10 repeated calls after one warm-up). Min observed: 128 ms; p90: ~275 ms (tail variance is from shared-host load, not the query path itself).

Dependencies added

  • numpy >= 1.24 added to feder-common.

What's next

The remaining ~150 ms is almost entirely bz2 decompression. Replacing bz2 with lz4 (decompresses at ~6 GB/s vs bz2's ~10 MB/s) would reduce decompression to near-zero and push all percentiles comfortably under 100 ms. That requires a data migration and ingestion-pipeline change, and is tracked separately.


Addendum: lz4 storage compression (Feder 1.0.0)

Motivation

After the numpy and thread-pool changes, profiling showed bz2 decompression of ~1 950 trajectory blobs still accounted for the bulk of remaining latency. lz4 decompresses at ~6 GB/s vs bz2's ~10 MB/s — roughly 600× faster — reducing decompression to near-zero.

This required converting 380 GB of historical data (1 307 daily SQLite files) and is a breaking change: old library versions will raise OSError: Invalid data stream against the new files. It is released as Feder 1.0.0, using the cumulative ~11× query speedup as justification for the compatibility break.

What was changed

3. lz4 storage format with version byte (feder-common, feder-ingest)

Blob format: each points blob is now stored as:

b'\x01' + lz4.frame.compress(Point.pack(traj.points))

The leading version byte (0x01) future-proofs subsequent compression changes without requiring another flag day.

Reader (db.py): validates the version byte and calls lz4.frame.decompress(blob[1:]). An unrecognised version byte raises ValueError with a message directing the user to upgrade.

Writer (writeable_db.py): prepends 0x01 and compresses with lz4.frame.
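
A minimal sketch of the read/write pair under this layout (helper names are illustrative, not the actual db.py/writeable_db.py functions):

    import lz4.frame

    VERSION_LZ4 = b"\x01"

    def pack_points_blob(point_bytes: bytes) -> bytes:
        # Writer side: version byte first, then the lz4-framed payload.
        return VERSION_LZ4 + lz4.frame.compress(point_bytes)

    def unpack_points_blob(blob: bytes) -> bytes:
        # Reader side: refuse anything but the known version byte so a
        # future format change fails loudly instead of mis-decoding.
        if not blob or blob[:1] != VERSION_LZ4:
            raise ValueError(
                "unsupported points-blob version; upgrade feder to read this database")
        return lz4.frame.decompress(blob[1:])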

Thread-pool fix: switching from bz2 to lz4 exposed a granularity problem — lz4 decompressions take ~0.4 µs each (vs ~260 µs for bz2), making 1 950 individual thread-pool tasks slower than single-threaded due to per-task Python overhead. Fixed by splitting the row list into N_WORKERS chunks and submitting one future per chunk. _array_to_points is also moved inside the chunked work so Point construction is parallelised too.
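
The chunking pattern, sketched (the work argument stands in for decompress + numpy parse + filters + _array_to_points):

    import math
    from concurrent.futures import ThreadPoolExecutor

    N_WORKERS = 8
    _pool = ThreadPoolExecutor(max_workers=N_WORKERS)

    def _process_chunk(rows, work):
        # One future per chunk: pool overhead is O(workers), not O(trajectories).
        return [work(row) for row in rows]

    def process_rows(rows, work):
        if not rows:
            return []
        size = math.ceil(len(rows) / N_WORKERS)
        chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
        futures = [_pool.submit(_process_chunk, chunk, work) for chunk in chunks]
        return [result for f in futures for result in f.result()]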

4. Migration tooling (scripts/)

Nine standalone scripts for the cluster-side data migration:

  Script                     Purpose
  convert_file.py            Per-file conversion: SQLite backup API + UPDATE + VACUUM + integrity check + row-count check
  bulk_convert.sh            Slurm job array (one task per file, 20 concurrent)
  bulk_convert_parallel.sh   GNU Parallel equivalent for local use
  tail_convert.sh            Daily cron to keep the lz4 copy in sync until cutover
  verify_counts.py           Row-count sweep across all file pairs
  verify_sample.py           Decompress-and-compare spot check on a random sample
  verify_queries.py          Canned FlightQuery calls against both directories, exact equality
  hypothesis_test.py         Hypothesis property test with random queries, max_examples=2000
  soak_test.py               while-True loop for 48-hour pre-cutover confidence building
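
For flavour, the core of convert_file.py looks roughly like this (table and column names are assumptions; the real script adds row-count checks and works on a temp copy):

    import bz2
    import sqlite3

    import lz4.frame

    def convert(src_path: str, dst_path: str) -> None:
        # Snapshot the source with the SQLite backup API, then rewrite
        # every points blob from bz2 to version-tagged lz4 in the copy.
        src = sqlite3.connect(src_path)
        dst = sqlite3.connect(dst_path)
        src.backup(dst)
        src.close()
        rows = dst.execute("SELECT id, points FROM trajectory").fetchall()
        for row_id, blob in rows:
            new_blob = b"\x01" + lz4.frame.compress(bz2.decompress(blob))
            dst.execute("UPDATE trajectory SET points = ? WHERE id = ?",
                        (new_blob, row_id))
        dst.commit()
        assert dst.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        dst.execute("VACUUM")  # rewrite contiguously; reclaim freed pages
        dst.close()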

Overall results (cumulative)

                            Time (per call)              Speedup vs baseline
  Baseline                  1.90 s                       —
  + numpy vectorisation     0.56 s                       3.4×
  + thread pool (bz2)       ~0.18 s median               ~10.5×
  + lz4 + chunked threads   ~68 ms min / ~178 ms median  ~11×

Measured on a warm server process against the converted live dataset. p90 is ~264 ms; tail variance is shared-host scheduling noise, not query-path cost.

Dependencies added

  • lz4 >= 4.4.5 added to feder-common and feder-ingest.

Breaking change

Users on old library versions querying converted data will see OSError: Invalid data stream. The fix is pip install --upgrade feder. This will be called out prominently in the 1.0.0 release notes and announced to users before the data cutover.

ian-ross changed the title from Query optimization to Query optimization (no schema change) — Apr 23, 2026
ian-ross changed the title from Query optimization (no schema change) to Query optimization (SCHEMA CHANGE REQUIRED) — Apr 23, 2026
ian-ross force-pushed the query-optimization branch from 8cef2e1 to 2a08efa — April 23, 2026 19:36
ian-ross (Member, Author) left a comment:
A couple of change requests to @claude.

Comment threads: libs/common/src/feder_common/db.py (outdated), libs/common/src/feder_common/models.py
claude (bot) commented — Apr 23, 2026

Claude finished @ian-ross's task in 3m 57s.


Tasks

  • Gather context (read changed files)
  • Simplify code at db.py:416 (too "clever clever")
  • Add comment at models.py:14 explaining what it is and how it's used
  • Commit and push changes

ian-ross (Member, Author) commented:

There is a slightly silly thing that needs fixing here: all of the scripts that compare "old" and "new" schema databases import the current API version, which doesn't work with old-schema databases. We need to think of a better way to handle that.

ian-ross force-pushed the query-optimization branch from e350746 to 1b7dc10 — April 24, 2026 11:19
ian-ross and others added 17 commits April 28, 2026 17:28
Replace the per-point Python struct loop in Point.unpack (and its
accompanying milli()/round() calls on every float) with numpy.frombuffer
on the whole blob at once.  Spatial crossing checks and waypoint filters
are likewise converted to numpy boolean-mask operations, so Point objects
are only materialised for trajectories that survive all filters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eedup

Add a module-level ThreadPoolExecutor (8 workers, the empirical sweet
spot on a 28-core host — more threads increase scheduling and memory-
bandwidth contention without adding throughput for this workload).

_process_blob() is the per-trajectory unit of work: decompress, numpy-
parse, spatial check, and waypoint filter all run in parallel.  Batching
of R-tree IDs is removed since passing all IDs in one SQL query is safe
(SQLite handles thousands of IN-clause parameters) and gives the pool
the most work to parallelise at once.

Cumulative speedup vs. baseline: ~10.5x (1.9 s → ~180 ms median).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 6c9f113 removed the `batched(ids, 50)` loop when refactoring for
thread-pooled decompression, passing the full ID list straight into the
`IN (?,?,...)` clause.  Queries matching more than SQLite's default
SQLITE_MAX_VARIABLE_NUMBER of trajectories (999 on older builds, 32766
on modern ones) fail with `OperationalError: too many SQL variables`.

Restore the batching with a larger batch size (5000, as a named
constant) so fewer round-trips are needed for large result sets while
staying well under the modern default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
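
(A sketch of the restored batching; the table name and the batched helper are illustrative — batched is itertools.batched on Python >= 3.12:)

    from itertools import islice

    QUERY_BATCH_SIZE = 5000  # named constant; well under the modern 32766 default

    def batched(iterable, n):
        it = iter(iterable)
        while batch := tuple(islice(it, n)):
            yield batch

    def retrieve_rows(db, ids):
        rows = []
        for batch in batched(ids, QUERY_BATCH_SIZE):
            marks = ",".join("?" * len(batch))
            rows.extend(db.execute(
                f"SELECT id, points FROM trajectory WHERE id IN ({marks})",
                batch))
        return rows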
Each trajectory points blob is now stored as:
  b'\x01' + lz4.frame.compress(Point.pack(traj.points))

The leading version byte future-proofs subsequent compression migrations
without requiring another flag day.  lz4.frame decompresses ~600x faster
than bz2, pushing remaining query latency toward zero.

Reader (_process_blob in db.py): validate version byte, decompress with
lz4.frame.  An unsupported version byte raises ValueError with a clear
message directing the user to upgrade.

Writer (writeable_db.py): prepend version byte, compress with lz4.frame.

Test data (tests/api/data/): converted to lz4 format in-place using the
SQLite backup API + UPDATE approach that the migration script will use for
the full 380 GB dataset.

Dependencies: lz4>=4.4.5 added to feder-common and feder-ingest.
Note: the benchmark requires the live data at /data2/feder/main/ to be
migrated before it will run; tests/api/ passes against the converted test
data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eight standalone scripts for the cluster-side data migration:

  convert_file.py    Per-file conversion (SQLite backup API + UPDATE).
                     Used by the Slurm job array and directly at cutover.
  bulk_convert.sh    Slurm job array: one task per file, 20 concurrent.
  tail_convert.sh    Daily cron to keep the lz4 copy in sync with new
                     day-files until cutover.
  verify_counts.py   Row-count sweep across all file pairs; runs in minutes.
  verify_sample.py   Decompress-and-compare spot check on a random sample;
                     asserts lz4.frame.decompress(blob[1:]) == bz2.decompress(blob).
  verify_queries.py  Query-level check: canned FlightQuery calls against
                     both directories, exact equality on every Point field.
  hypothesis_test.py Hypothesis @given test with random queries anchored to
                     real data dates; max_examples=2000, exact equality.
  soak_test.py       while-True loop for 48 h pre-cutover confidence building;
                     reports every 100 queries.

Also add bz2-lz4-migration-plan.md (full migration runbook) and
pr-description.md (PR description for the query-optimization branch).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bulk_convert_parallel.sh is a drop-in alternative to bulk_convert.sh for
running the bz2→lz4 conversion on a development machine without Slurm.

Defaults to nproc workers; override with --jobs N or JOBS=N env var.
Writes a job log to logs/parallel_convert.log so failed files can be
retried with --retry-failed.  Uses --halt soon,fail=1 so a run with
widespread failures stops early rather than continuing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… task per blob

With lz4, each decompression takes ~0.4 µs vs ~260 µs for bz2.  Submitting
1950 individual tasks to the pool meant synchronisation overhead (~190 ms)
exceeded the actual work (~110 ms), giving worse performance than bz2+threads.

Fix: split the row list into at most N_WORKERS chunks and submit one future
per chunk.  Thread-pool overhead is now O(workers) rather than O(trajectories).
_array_to_points is also moved inside the threaded chunk work so Point
construction is parallelised too.

Result: 1.9 s baseline → ~68 ms min / ~178 ms median (cumulative ~11× speedup).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The UPDATE step replaces smaller bz2 blobs with larger lz4 blobs, freeing
the original pages in place.  VACUUM rewrites the file contiguously,
improving I/O locality and recovering the freed space.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pr-description-addendum.md covers the lz4 compression change, the
thread-pool granularity fix, migration tooling, cumulative benchmark
results, and the breaking-change notice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ian Ross <ian-ross@users.noreply.github.com>
(The GitHub Claude bot tried to do this, but couldn't modify the same file while another change was in flight, so I applied the change manually.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The bz2→lz4 verification scripts previously reloaded the `feder` module
with a changed FEDER_DATA_DIR to point at each store in turn.  After
commit 7a8d892 only the lz4 format decodes, so any query issued against
a bz2 store from the current codebase fails.

Introduce a long-running subprocess `scripts/query_executor.py` that can
be invoked under either a bz2- or lz4-capable uv environment:

    uv --project <code_dir> run python scripts/query_executor.py \
        <data_dir> --label {bz2,lz4}

It reads newline-delimited JSON `FlightQuery` requests from stdin and
writes responses to stdout: digest per trajectory by default (source_id,
n_points, points_sha256), or full base64-encoded point bytes on
`"full": true`.  The op set is whitelisted (time_{intersects,within,
starts_in,ends_in}, spatially_crosses, with_bounds, filter_waypoints,
with_orig); per-query exceptions become `{"ok": false, ...}` replies so
the subprocess stays alive for the next request.

`scripts/_query_harness.py` provides a `ComparisonHarness` context
manager that spawns one executor per format, validates a startup
handshake with a 30s timeout, and exposes `compare(t1, t2, ops)` which
returns a digest-based `Verdict`.  On a digest mismatch it transparently
re-sends the query with `"full": true` to both sides and diffs raw point
bytes to pinpoint the offending trajectory and point index.  Each
executor's stderr is streamed to `scripts/logs/query_executor_{label}_{pid}_{ts}.log`.
On context exit stdin is closed for graceful shutdown, with terminate →
kill fallbacks.

Configuration is env-var only: BZ2_CODE_DIR, BZ2_DATA_DIR, LZ4_CODE_DIR,
LZ4_DATA_DIR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
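
(A minimal sketch of the executor's request/reply loop under the protocol above; field and attribute names are guesses, not the actual implementation:)

    import base64
    import hashlib
    import json
    import sys

    def serve(run_query):
        # run_query(t1, t2, ops) -> iterable of (source_id, point_bytes).
        for line in sys.stdin:
            req = json.loads(line)
            try:
                results = []
                for source_id, pts in run_query(req["t1"], req["t2"], req["ops"]):
                    if req.get("full"):
                        results.append({"source_id": source_id,
                                        "points": base64.b64encode(pts).decode()})
                    else:
                        results.append({"source_id": source_id,
                                        "n_points": len(pts) // 45,  # 45-byte points
                                        "points_sha256": hashlib.sha256(pts).hexdigest()})
                reply = {"ok": True, "results": results}
            except Exception as exc:
                # Per-query failures become replies; the subprocess stays alive.
                reply = {"ok": False, "error": repr(exc)}
            print(json.dumps(reply), flush=True)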
…ness

All three scripts previously imported `feder` and toggled FEDER_DATA_DIR
per query; this no longer works now that only the lz4 format decodes.
Each script now drives two `query_executor.py` subprocesses via the
shared `ComparisonHarness`.

Query shape updates, motivated by early soak runs that stalled on
continent-wide unbounded queries:

  - Every generated query carries a bounding box drawn as a CONUS
    centre point plus independent N-S and E-W extents in [10, 200] km.
    Longitude half-width compensates for the centre latitude so kilometres
    on the ground are honoured, not just the degree delta.  This models
    the primary use case: regions around ground-based cameras covering
    their effective field of view.
  - Time windows are 5-30 minutes (was 15-240), with start time offset
    by a random minute so windows aren't always aligned to the hour.

Configuration is env-var only across all three: BZ2_CODE_DIR,
BZ2_DATA_DIR, LZ4_CODE_DIR, LZ4_DATA_DIR.  The positional bz2/lz4 data
arguments to verify_queries.py and soak_test.py are removed;
hypothesis_test.py's previous BZ2_DIR/LZ4_DIR env vars become
BZ2_DATA_DIR/LZ4_DATA_DIR and are joined by the two new *_CODE_DIR vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
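
(The latitude compensation for the bounding-box generator, sketched; constants are illustrative:)

    import math
    import random

    KM_PER_DEG_LAT = 111.32  # ~constant; km per degree longitude shrinks with cos(lat)

    def random_camera_bbox():
        # CONUS centre plus independent N-S / E-W extents in [10, 200] km.
        lat = random.uniform(25.0, 49.0)
        lon = random.uniform(-124.0, -67.0)
        ns_km = random.uniform(10.0, 200.0)
        ew_km = random.uniform(10.0, 200.0)
        dlat = ns_km / 2.0 / KM_PER_DEG_LAT
        # Honour kilometres on the ground at this latitude, not just the
        # raw degree delta, by widening the longitude half-extent.
        dlon = ew_km / 2.0 / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)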
The source directory may not be writable by the user running the
script, which breaks sqlite3's backup path. Since the source DB is
guaranteed to be vacuumed with no concurrent writers, copy it into a
writable temp directory and operate on that copy instead.

Co-authored-by: Claude Opus 4.7 <anthropic-claude-opus-4-7@pi.local>
Pi-Model: anthropic/claude-opus-4-7
ian-ross force-pushed the query-optimization branch from 6101d8c to 19b730f — April 28, 2026 15:28
@ian-ross ian-ross merged commit ba68de0 into main Apr 28, 2026
1 check passed
@ian-ross ian-ross deleted the query-optimization branch April 28, 2026 15:31