
Query optimization (SCHEMA CHANGE REQUIRED) #1

Merged: ian-ross merged 17 commits into main from query-optimization on Apr 28, 2026
Conversation

ian-ross (Member) commented — Apr 23, 2026

Query optimization using Claude Code. The following description is lightly edited from Claude's own description of the changes.


Motivation

The flight query API is going to be used in a long-lived server process to load trajectory data for projecting onto ground-based camera video. A representative query — 1-hour time window, 50 km bounding box, spatially_crosses + filter_waypoints — was taking ~1.9 s, well above the ~200 ms target needed to keep below video-stream initialization time.

What was changed

1. Vectorized point parsing with numpy (feder-common)

Root cause: Point.unpack deserialized trajectory points one at a time in a Python loop, calling struct.unpack and round(..., 3) (milli()) per point. With ~890 K points unpacked per query call, this generated 22 M Python function calls and dominated the profile at ~76 % of wall time.

Fix:

  • Added a _POINT_DTYPE numpy structured dtype matching the on-disk binary layout ('>Lddddd?').
  • Added Point._unpack_blob(data) — parses a full decompressed blob in one numpy.frombuffer call.
  • Added Point._array_to_points(arr) — materialises Point objects only for trajectories that survive all filters.
  • Replaced the Python-lambda points_checker (spatial crossing check) and _make_point_filter (waypoint filter) with numpy boolean-mask operations, so Point objects are never constructed for the ~59 % of R-tree candidates that are ultimately rejected.
  • Removed milli() from the hot path. It was originally needed for round-tripping against an old CSV format; it is preserved in utils.py and remains in tests/conftest.py where it still makes comparisons robust.

Result: 1.90 s → 0.56 s (3.4×)
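
As a sketch of the approach (field names here are invented, not the actual feder-common definitions), a structured dtype lets a single numpy.frombuffer call replace millions of per-point struct.unpack/round calls:

    import numpy as np

    # Hypothetical field names; the on-disk layout is struct format
    # '>Lddddd?' (big-endian u32, five float64s, one bool) = 45 bytes/point.
    _POINT_DTYPE = np.dtype([
        ("timestamp", ">u4"),
        ("lat", ">f8"),
        ("lon", ">f8"),
        ("alt", ">f8"),
        ("speed", ">f8"),
        ("heading", ">f8"),
        ("flag", "?"),
    ])

    def unpack_blob(data: bytes) -> np.ndarray:
        # One C-level parse of the whole decompressed blob; filters then
        # run as boolean masks (e.g. arr["lat"] >= lat_min) without ever
        # constructing Point objects for rejected trajectories.
        return np.frombuffer(data, dtype=_POINT_DTYPE)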

2. Parallel bz2 decompression with a thread pool (feder-common)

Root cause: After the numpy change, profiling showed bz2 decompression of ~1950 trajectory blobs was 92 % of remaining time. Decompressions were serial, and bz2 is a C extension that releases the GIL — straightforward to parallelise with threads.

Fix:

  • Added a module-level ThreadPoolExecutor (default 8 workers — empirically the sweet spot on the test machine; more threads increase scheduling and memory-bandwidth contention without adding throughput for this workload).
  • Added _process_blob(blob, points_check, pt_filter) as the per-trajectory unit of work submitted to the pool: decompress → numpy parse → spatial check → waypoint filter, all running in parallel.
  • Removed the batched(ids, 50) loop in query_flights: all R-tree IDs are now passed to _retrieve in a single SQL query, giving the pool the maximum available parallelism in one shot. (A follow-up commit restores batching with a 5000-ID batch size after large queries hit SQLite's SQLITE_MAX_VARIABLE_NUMBER limit; see the commit log below.)

Result: 0.56 s → ~0.18 s median (further 3.1×)
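
The shape of the change, sketched with assumed helper names (the real _process_blob lives in feder-common's db.py):

    import bz2
    from concurrent.futures import ThreadPoolExecutor

    N_WORKERS = 8  # empirical sweet spot on the test machine
    _pool = ThreadPoolExecutor(max_workers=N_WORKERS)

    def _process_blob(blob, points_check, pt_filter):
        # Per-trajectory unit of work.  bz2.decompress is a C call that
        # releases the GIL, so the pool gets real parallelism here.
        arr = unpack_blob(bz2.decompress(blob))  # numpy parse from section 1
        if not points_check(arr):                # spatial crossing mask
            return None
        return pt_filter(arr)                    # waypoint filter mask

    # futures = [_pool.submit(_process_blob, b, check, filt) for b in blobs]
    # kept = [r for r in (f.result() for f in futures) if r is not None]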

Overall results

                          Time (per call)    Speedup vs baseline
  Baseline                1.90 s             —
  + numpy vectorisation   0.56 s             3.4×
  + thread pool           ~0.18 s median     ~10.5×

Measured on a warm server process (OS page cache hot, 10 repeated calls after one warm-up). Min observed: 128 ms; p90: ~275 ms (tail variance is from shared-host load, not the query path itself).

Dependencies added

  • numpy >= 1.24 added to feder-common.

What's next

The remaining ~150 ms is almost entirely bz2 decompression. Replacing bz2 with lz4 (decompresses at ~6 GB/s vs bz2's ~10 MB/s) would reduce decompression to near-zero and push all percentiles comfortably under 100 ms. That requires a data migration and ingestion-pipeline change, and is tracked separately.


Addendum: lz4 storage compression (Feder 1.0.0)

Motivation

After the numpy and thread-pool changes, profiling showed bz2 decompression of ~1 950 trajectory blobs still accounted for the bulk of remaining latency. lz4 decompresses at ~6 GB/s vs bz2's ~10 MB/s — roughly 600× faster — reducing decompression to near-zero.

This required converting 380 GB of historical data (1 307 daily SQLite files) and is a breaking change: old library versions will raise OSError: Invalid data stream against the new files. It is released as Feder 1.0.0, using the cumulative ~11× query speedup as justification for the compatibility break.

What was changed

3. lz4 storage format with version byte (feder-common, feder-ingest)

Blob format: each points blob is now stored as:

b'\x01' + lz4.frame.compress(Point.pack(traj.points))

The leading version byte (0x01) future-proofs subsequent compression changes without requiring another flag day.

Reader (db.py): validates the version byte and calls lz4.frame.decompress(blob[1:]). An unrecognised version byte raises ValueError with a message directing the user to upgrade.

Writer (writeable_db.py): prepends 0x01 and compresses with lz4.frame.
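
A minimal sketch of the read/write pair under this layout (helper names are illustrative, not the actual db.py/writeable_db.py functions):

    import lz4.frame

    VERSION_LZ4 = b"\x01"

    def pack_points_blob(point_bytes: bytes) -> bytes:
        # Writer side: version byte first, then the lz4-framed payload.
        return VERSION_LZ4 + lz4.frame.compress(point_bytes)

    def unpack_points_blob(blob: bytes) -> bytes:
        # Reader side: refuse anything but the known version byte so a
        # future format change fails loudly instead of mis-decoding.
        if not blob or blob[:1] != VERSION_LZ4:
            raise ValueError(
                "unsupported points-blob version; upgrade feder to read this database")
        return lz4.frame.decompress(blob[1:])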

Thread-pool fix: switching from bz2 to lz4 exposed a granularity problem — lz4 decompressions take ~0.4 µs each (vs ~260 µs for bz2), making 1 950 individual thread-pool tasks slower than single-threaded due to per-task Python overhead. Fixed by splitting the row list into N_WORKERS chunks and submitting one future per chunk. _array_to_points is also moved inside the chunked work so Point construction is parallelised too.
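
The chunking pattern, sketched (the work argument stands in for decompress + numpy parse + filters + _array_to_points):

    import math
    from concurrent.futures import ThreadPoolExecutor

    N_WORKERS = 8
    _pool = ThreadPoolExecutor(max_workers=N_WORKERS)

    def _process_chunk(rows, work):
        # One future per chunk: pool overhead is O(workers), not O(trajectories).
        return [work(row) for row in rows]

    def process_rows(rows, work):
        if not rows:
            return []
        size = math.ceil(len(rows) / N_WORKERS)
        chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
        futures = [_pool.submit(_process_chunk, chunk, work) for chunk in chunks]
        return [result for f in futures for result in f.result()]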

4. Migration tooling (scripts/)

Nine standalone scripts for the cluster-side data migration:

  Script                     Purpose
  convert_file.py            Per-file conversion: SQLite backup API + UPDATE + VACUUM + integrity check + row-count check
  bulk_convert.sh            Slurm job array (one task per file, 20 concurrent)
  bulk_convert_parallel.sh   GNU Parallel equivalent for local use
  tail_convert.sh            Daily cron to keep the lz4 copy in sync until cutover
  verify_counts.py           Row-count sweep across all file pairs
  verify_sample.py           Decompress-and-compare spot check on a random sample
  verify_queries.py          Canned FlightQuery calls against both directories, exact equality
  hypothesis_test.py         Hypothesis property test with random queries, max_examples=2000
  soak_test.py               while-True loop for 48-hour pre-cutover confidence building
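
For flavour, the core of convert_file.py looks roughly like this (table and column names are assumptions; the real script adds row-count checks and works on a temp copy):

    import bz2
    import sqlite3

    import lz4.frame

    def convert(src_path: str, dst_path: str) -> None:
        # Snapshot the source with the SQLite backup API, then rewrite
        # every points blob from bz2 to version-tagged lz4 in the copy.
        src = sqlite3.connect(src_path)
        dst = sqlite3.connect(dst_path)
        src.backup(dst)
        src.close()
        rows = dst.execute("SELECT id, points FROM trajectory").fetchall()
        for row_id, blob in rows:
            new_blob = b"\x01" + lz4.frame.compress(bz2.decompress(blob))
            dst.execute("UPDATE trajectory SET points = ? WHERE id = ?",
                        (new_blob, row_id))
        dst.commit()
        assert dst.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        dst.execute("VACUUM")  # rewrite contiguously; reclaim freed pages
        dst.close()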

Overall results (cumulative)

                            Time (per call)              Speedup vs baseline
  Baseline                  1.90 s                       —
  + numpy vectorisation     0.56 s                       3.4×
  + thread pool (bz2)       ~0.18 s median               ~10.5×
  + lz4 + chunked threads   ~68 ms min / ~178 ms median  ~11×

Measured on a warm server process against the converted live dataset. p90 is ~264 ms; tail variance is shared-host scheduling noise, not query-path cost.

Dependencies added

  • lz4 >= 4.4.5 added to feder-common and feder-ingest.

Breaking change

Users on old library versions querying converted data will see OSError: Invalid data stream. The fix is pip install --upgrade feder. This will be called out prominently in the 1.0.0 release notes and announced to users before the data cutover.

ian-ross changed the title from Query optimization to Query optimization (no schema change) — Apr 23, 2026
ian-ross changed the title from Query optimization (no schema change) to Query optimization (SCHEMA CHANGE REQUIRED) — Apr 23, 2026
ian-ross force-pushed the query-optimization branch from 8cef2e1 to 2a08efa — April 23, 2026 19:36
ian-ross (Member, Author) left a comment:
A couple of change requests to @claude.

Comment threads: libs/common/src/feder_common/db.py (outdated), libs/common/src/feder_common/models.py
claude (bot) commented — Apr 23, 2026

Claude finished @ian-ross's task in 3m 57s.


Tasks

  • Gather context (read changed files)
  • Simplify code at db.py:416 (too "clever clever")
  • Add comment at models.py:14 explaining what it is and how it's used
  • Commit and push changes

ian-ross (Member, Author) commented:

There is a slightly silly thing that needs fixing here: all of the scripts that compare "old" and "new" schema databases import the current API version, which doesn't work with old-schema databases. We need to think of a better way to handle that.

ian-ross force-pushed the query-optimization branch from e350746 to 1b7dc10 — April 24, 2026 11:19
ian-ross and others added 17 commits April 28, 2026 17:28
Replace the per-point Python struct loop in Point.unpack (and its
accompanying milli()/round() calls on every float) with numpy.frombuffer
on the whole blob at once.  Spatial crossing checks and waypoint filters
are likewise converted to numpy boolean-mask operations, so Point objects
are only materialised for trajectories that survive all filters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eedup

Add a module-level ThreadPoolExecutor (8 workers, the empirical sweet
spot on a 28-core host — more threads increase scheduling and memory-
bandwidth contention without adding throughput for this workload).

_process_blob() is the per-trajectory unit of work: decompress, numpy-
parse, spatial check, and waypoint filter all run in parallel.  Batching
of R-tree IDs is removed since passing all IDs in one SQL query is safe
(SQLite handles thousands of IN-clause parameters) and gives the pool
the most work to parallelise at once.

Cumulative speedup vs. baseline: ~10.5x (1.9 s → ~180 ms median).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 6c9f113 removed the `batched(ids, 50)` loop when refactoring for
thread-pooled decompression, passing the full ID list straight into the
`IN (?,?,...)` clause.  Queries matching more than SQLite's default
SQLITE_MAX_VARIABLE_NUMBER of trajectories (999 on older builds, 32766
on modern ones) fail with `OperationalError: too many SQL variables`.

Restore the batching with a larger batch size (5000, as a named
constant) so fewer round-trips are needed for large result sets while
staying well under the modern default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
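
(A sketch of the restored batching; the table name and the batched helper are illustrative — batched is itertools.batched on Python >= 3.12:)

    from itertools import islice

    QUERY_BATCH_SIZE = 5000  # named constant; well under the modern 32766 default

    def batched(iterable, n):
        it = iter(iterable)
        while batch := tuple(islice(it, n)):
            yield batch

    def retrieve_rows(db, ids):
        rows = []
        for batch in batched(ids, QUERY_BATCH_SIZE):
            marks = ",".join("?" * len(batch))
            rows.extend(db.execute(
                f"SELECT id, points FROM trajectory WHERE id IN ({marks})",
                batch))
        return rows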
Each trajectory points blob is now stored as:
  b'\x01' + lz4.frame.compress(Point.pack(traj.points))

The leading version byte future-proofs subsequent compression migrations
without requiring another flag day.  lz4.frame decompresses ~600x faster
than bz2, pushing remaining query latency toward zero.

Reader (_process_blob in db.py): validate version byte, decompress with
lz4.frame.  An unsupported version byte raises ValueError with a clear
message directing the user to upgrade.

Writer (writeable_db.py): prepend version byte, compress with lz4.frame.

Test data (tests/api/data/): converted to lz4 format in-place using the
SQLite backup API + UPDATE approach that the migration script will use for
the full 380 GB dataset.

Dependencies: lz4>=4.4.5 added to feder-common and feder-ingest.
Note: the benchmark requires the live data at /data2/feder/main/ to be
migrated before it will run; tests/api/ passes against the converted test
data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eight standalone scripts for the cluster-side data migration:

  convert_file.py    Per-file conversion (SQLite backup API + UPDATE).
                     Used by the Slurm job array and directly at cutover.
  bulk_convert.sh    Slurm job array: one task per file, 20 concurrent.
  tail_convert.sh    Daily cron to keep the lz4 copy in sync with new
                     day-files until cutover.
  verify_counts.py   Row-count sweep across all file pairs; runs in minutes.
  verify_sample.py   Decompress-and-compare spot check on a random sample;
                     asserts lz4.frame.decompress(blob[1:]) == bz2.decompress(blob).
  verify_queries.py  Query-level check: canned FlightQuery calls against
                     both directories, exact equality on every Point field.
  hypothesis_test.py Hypothesis @given test with random queries anchored to
                     real data dates; max_examples=2000, exact equality.
  soak_test.py       while-True loop for 48 h pre-cutover confidence building;
                     reports every 100 queries.

Also add bz2-lz4-migration-plan.md (full migration runbook) and
pr-description.md (PR description for the query-optimization branch).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bulk_convert_parallel.sh is a drop-in alternative to bulk_convert.sh for
running the bz2→lz4 conversion on a development machine without Slurm.

Defaults to nproc workers; override with --jobs N or JOBS=N env var.
Writes a job log to logs/parallel_convert.log so failed files can be
retried with --retry-failed.  Uses --halt soon,fail=1 so a run with
widespread failures stops early rather than continuing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… task per blob

With lz4, each decompression takes ~0.4 µs vs ~260 µs for bz2.  Submitting
1950 individual tasks to the pool meant synchronisation overhead (~190 ms)
exceeded the actual work (~110 ms), giving worse performance than bz2+threads.

Fix: split the row list into at most N_WORKERS chunks and submit one future
per chunk.  Thread-pool overhead is now O(workers) rather than O(trajectories).
_array_to_points is also moved inside the threaded chunk work so Point
construction is parallelised too.

Result: 1.9 s baseline → ~68 ms min / ~178 ms median (cumulative ~11× speedup).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The UPDATE step replaces smaller bz2 blobs with larger lz4 blobs, freeing
the original pages in place.  VACUUM rewrites the file contiguously,
improving I/O locality and recovering the freed space.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pr-description-addendum.md covers the lz4 compression change, the
thread-pool granularity fix, migration tooling, cumulative benchmark
results, and the breaking-change notice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ian Ross <ian-ross@users.noreply.github.com>
(The GitHub Claude bot tried to do this, but couldn't modify the same file while another change was in flight, so I applied the change manually.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The bz2→lz4 verification scripts previously reloaded the `feder` module
with a changed FEDER_DATA_DIR to point at each store in turn.  After
commit 7a8d892 only the lz4 format decodes, so any query issued against
a bz2 store from the current codebase fails.

Introduce a long-running subprocess `scripts/query_executor.py` that can
be invoked under either a bz2- or lz4-capable uv environment:

    uv --project <code_dir> run python scripts/query_executor.py \
        <data_dir> --label {bz2,lz4}

It reads newline-delimited JSON `FlightQuery` requests from stdin and
writes responses to stdout: digest per trajectory by default (source_id,
n_points, points_sha256), or full base64-encoded point bytes on
`"full": true`.  The op set is whitelisted (time_{intersects,within,
starts_in,ends_in}, spatially_crosses, with_bounds, filter_waypoints,
with_orig); per-query exceptions become `{"ok": false, ...}` replies so
the subprocess stays alive for the next request.

`scripts/_query_harness.py` provides a `ComparisonHarness` context
manager that spawns one executor per format, validates a startup
handshake with a 30s timeout, and exposes `compare(t1, t2, ops)` which
returns a digest-based `Verdict`.  On a digest mismatch it transparently
re-sends the query with `"full": true` to both sides and diffs raw point
bytes to pinpoint the offending trajectory and point index.  Each
executor's stderr is streamed to `scripts/logs/query_executor_{label}_{pid}_{ts}.log`.
On context exit stdin is closed for graceful shutdown, with terminate →
kill fallbacks.

Configuration is env-var only: BZ2_CODE_DIR, BZ2_DATA_DIR, LZ4_CODE_DIR,
LZ4_DATA_DIR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
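
(A minimal sketch of the executor's request/reply loop under the protocol above; field and attribute names are guesses, not the actual implementation:)

    import base64
    import hashlib
    import json
    import sys

    def serve(run_query):
        # run_query(t1, t2, ops) -> iterable of (source_id, point_bytes).
        for line in sys.stdin:
            req = json.loads(line)
            try:
                results = []
                for source_id, pts in run_query(req["t1"], req["t2"], req["ops"]):
                    if req.get("full"):
                        results.append({"source_id": source_id,
                                        "points": base64.b64encode(pts).decode()})
                    else:
                        results.append({"source_id": source_id,
                                        "n_points": len(pts) // 45,  # 45-byte points
                                        "points_sha256": hashlib.sha256(pts).hexdigest()})
                reply = {"ok": True, "results": results}
            except Exception as exc:
                # Per-query failures become replies; the subprocess stays alive.
                reply = {"ok": False, "error": repr(exc)}
            print(json.dumps(reply), flush=True)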
…ness

All three scripts previously imported `feder` and toggled FEDER_DATA_DIR
per query; this no longer works now that only the lz4 format decodes.
Each script now drives two `query_executor.py` subprocesses via the
shared `ComparisonHarness`.

Query shape updates, motivated by early soak runs that stalled on
continent-wide unbounded queries:

  - Every generated query carries a bounding box drawn as a CONUS
    centre point plus independent N-S and E-W extents in [10, 200] km.
    Longitude half-width compensates for the centre latitude so kilometres
    on the ground are honoured, not just the degree delta.  This models
    the primary use case: regions around ground-based cameras covering
    their effective field of view.
  - Time windows are 5-30 minutes (was 15-240), with start time offset
    by a random minute so windows aren't always aligned to the hour.

Configuration is env-var only across all three: BZ2_CODE_DIR,
BZ2_DATA_DIR, LZ4_CODE_DIR, LZ4_DATA_DIR.  The positional bz2/lz4 data
arguments to verify_queries.py and soak_test.py are removed;
hypothesis_test.py's previous BZ2_DIR/LZ4_DIR env vars become
BZ2_DATA_DIR/LZ4_DATA_DIR and are joined by the two new *_CODE_DIR vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
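
(The latitude compensation for the bounding-box generator, sketched; constants are illustrative:)

    import math
    import random

    KM_PER_DEG_LAT = 111.32  # ~constant; km per degree longitude shrinks with cos(lat)

    def random_camera_bbox():
        # CONUS centre plus independent N-S / E-W extents in [10, 200] km.
        lat = random.uniform(25.0, 49.0)
        lon = random.uniform(-124.0, -67.0)
        ns_km = random.uniform(10.0, 200.0)
        ew_km = random.uniform(10.0, 200.0)
        dlat = ns_km / 2.0 / KM_PER_DEG_LAT
        # Honour kilometres on the ground at this latitude, not just the
        # raw degree delta, by widening the longitude half-extent.
        dlon = ew_km / 2.0 / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)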
The source directory may not be writable by the user running the
script, which breaks sqlite3's backup path. Since the source DB is
guaranteed to be vacuumed with no concurrent writers, copy it into a
writable temp directory and operate on that copy instead.

Co-authored-by: Claude Opus 4.7 <anthropic-claude-opus-4-7@pi.local>
Pi-Model: anthropic/claude-opus-4-7
ian-ross force-pushed the query-optimization branch from 6101d8c to 19b730f — April 28, 2026 15:28
@ian-ross ian-ross merged commit ba68de0 into main Apr 28, 2026
1 check passed
@ian-ross ian-ross deleted the query-optimization branch April 28, 2026 15:31