Query optimization (SCHEMA CHANGE REQUIRED) #1 (Merged)
ian-ross (Member, Author) commented on Apr 23, 2026:
@ian-ross There is a slightly silly thing that needs fixing here: all of the scripts that compare "old" and "new" schema databases import the current API version, which doesn't work with old schema databases... We need to think of a better way to handle that.
Replace the per-point Python struct loop in Point.unpack (and its accompanying milli()/round() calls on every float) with numpy.frombuffer on the whole blob at once. Spatial crossing checks and waypoint filters are likewise converted to numpy boolean-mask operations, so Point objects are only materialised for trajectories that survive all filters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
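A minimal sketch of the vectorised parse, assuming a packed layout mirroring the `'>Lddddd?'` struct format mentioned in the PR description; the names here (`POINT_DTYPE`, the field labels) are illustrative, not feder-common's actual API:

```python
import struct

import numpy as np

# Hypothetical packed big-endian layout matching struct format '>Lddddd?':
# u32 timestamp, five f64 values, one bool flag. Field names are illustrative.
POINT_DTYPE = np.dtype([
    ("t", ">u4"), ("lat", ">f8"), ("lon", ">f8"),
    ("alt", ">f8"), ("speed", ">f8"), ("heading", ">f8"), ("on_ground", "?"),
])

def unpack_blob(data: bytes) -> np.ndarray:
    # One frombuffer call over the whole blob replaces a per-point
    # struct.unpack loop (and its per-float round() calls).
    return np.frombuffer(data, dtype=POINT_DTYPE)

# Round-trip against the scalar struct format: three identical points.
raw = struct.pack(">Lddddd?", 7, 51.5, -0.1, 10000.0, 250.0, 90.0, False) * 3
arr = unpack_blob(raw)
```

Because the dtype is packed (no alignment padding), its itemsize matches `struct.calcsize(">Lddddd?")` exactly, so the array is a zero-copy view of the blob.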
…eedup

Add a module-level ThreadPoolExecutor (8 workers, the empirical sweet spot on a 28-core host; more threads increase scheduling and memory-bandwidth contention without adding throughput for this workload). _process_blob() is the per-trajectory unit of work: decompress, numpy-parse, spatial check, and waypoint filter all run in parallel.

Batching of R-tree IDs is removed since passing all IDs in one SQL query is safe (SQLite handles thousands of IN-clause parameters) and gives the pool the most work to parallelise at once.

Cumulative speedup vs. baseline: ~7.4x (1.9 s → ~180 ms median).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 6c9f113 removed the `batched(ids, 50)` loop when refactoring for thread-pooled decompression, passing the full ID list straight into the `IN (?,?,...)` clause. Queries matching more than SQLite's default SQLITE_MAX_VARIABLE_NUMBER of trajectories (999 on older builds, 32766 on modern ones) fail with `OperationalError: too many SQL variables`.

Restore the batching with a larger batch size (5000, as a named constant) so fewer round-trips are needed for large result sets while staying well under the modern default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
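The restored batching can be sketched as below, under assumed table and column names (`trajectories`, `id`, `blob`); the real `query_flights`/`_retrieve` code differs:

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 5000  # named constant; well under the modern 32766 variable limit

def batched(seq, n):
    it = iter(seq)
    while chunk := list(islice(it, n)):
        yield chunk

def retrieve(conn, ids):
    # Split the ID list so no single query exceeds SQLITE_MAX_VARIABLE_NUMBER.
    rows = []
    for chunk in batched(list(ids), BATCH_SIZE):
        marks = ",".join("?" * len(chunk))
        rows.extend(conn.execute(
            f"SELECT id, blob FROM trajectories WHERE id IN ({marks})", chunk))
    return rows

# Demo: 6000 IDs forces two batches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trajectories (id INTEGER PRIMARY KEY, blob BLOB)")
conn.executemany("INSERT INTO trajectories VALUES (?, ?)",
                 [(i, b"pts") for i in range(6000)])
rows = retrieve(conn, range(6000))
```

A batch of 5000 keeps round-trips rare for large result sets while never tripping either the old 999 or the modern 32766 default limit.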
Each trajectory points blob is now stored as:

    b'\x01' + lz4.frame.compress(Point.pack(traj.points))

The leading version byte future-proofs subsequent compression migrations without requiring another flag day. lz4.frame decompresses ~600x faster than bz2, pushing remaining query latency toward zero.

Reader (_process_blob in db.py): validate version byte, decompress with lz4.frame. An unsupported version byte raises ValueError with a clear message directing the user to upgrade.

Writer (writeable_db.py): prepend version byte, compress with lz4.frame.

Test data (tests/api/data/): converted to lz4 format in-place using the SQLite backup API + UPDATE approach that the migration script will use for the full 380 GB dataset.

Dependencies: lz4>=4.4.5 added to feder-common and feder-ingest.

Note: the benchmark requires the live data at /data2/feder/main/ to be migrated before it will run; tests/api/ passes against the converted test data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eight standalone scripts for the cluster-side data migration:
convert_file.py Per-file conversion (SQLite backup API + UPDATE).
Used by the Slurm job array and directly at cutover.
bulk_convert.sh Slurm job array: one task per file, 20 concurrent.
tail_convert.sh Daily cron to keep the lz4 copy in sync with new
day-files until cutover.
verify_counts.py Row-count sweep across all file pairs; runs in minutes.
verify_sample.py Decompress-and-compare spot check on a random sample;
asserts lz4.frame.decompress(blob[1:]) == bz2.decompress(blob).
verify_queries.py Query-level check: canned FlightQuery calls against
both directories, exact equality on every Point field.
hypothesis_test.py Hypothesis @given test with random queries anchored to
real data dates; max_examples=2000, exact equality.
soak_test.py while-True loop for 48 h pre-cutover confidence building;
reports every 100 queries.
Also add bz2-lz4-migration-plan.md (full migration runbook) and
pr-description.md (PR description for the query-optimization branch).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bulk_convert_parallel.sh is a drop-in alternative to bulk_convert.sh for running the bz2→lz4 conversion on a development machine without Slurm. Defaults to nproc workers; override with --jobs N or the JOBS=N env var.

Writes a job log to logs/parallel_convert.log so failed files can be retried with --retry-failed. Uses --halt soon,fail=1 so a run with widespread failures stops early rather than continuing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… task per blob

With lz4, each decompression takes ~0.4 µs vs ~260 µs for bz2. Submitting 1950 individual tasks to the pool meant synchronisation overhead (~190 ms) dwarfed the actual work (~110 ms), giving worse performance than bz2+threads.

Fix: split the row list into at most N_WORKERS chunks and submit one future per chunk. Thread-pool overhead is now O(workers) rather than O(trajectories). _array_to_points is also moved inside the threaded chunk work so Point construction is parallelised too.

Result: 1.9 s baseline → ~68 ms min / ~178 ms median (cumulative ~11× speedup).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
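The chunking can be sketched as follows; this is a simplified stand-in, and the real per-chunk work (`_process_blob`, `_array_to_points`) lives in feder-common:

```python
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 8

def chunked(rows, n):
    # At most n nearly equal chunks, so pool overhead is O(workers),
    # not O(trajectories).
    k, m = divmod(len(rows), n)
    chunks, start = [], 0
    for i in range(min(n, len(rows))):
        size = k + (1 if i < m else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks

def process_rows(rows, work):
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        # One future per chunk; results are concatenated in submission
        # order, so output order matches input order.
        futures = [pool.submit(lambda c: [work(r) for r in c], chunk)
                   for chunk in chunked(rows, N_WORKERS)]
        out = []
        for f in futures:
            out.extend(f.result())
        return out
```

With microsecond-scale work items, the per-task cost of submit/result bookkeeping dominates unless it is amortised over a whole chunk like this.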
The UPDATE step replaces smaller bz2 blobs with larger lz4 blobs, freeing the original pages in place. VACUUM rewrites the file contiguously, improving I/O locality and recovering the freed space.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
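The per-file conversion shape can be illustrated as below; the table/column names and the `recompress` stand-in are assumptions of this sketch (the real conversion swaps bz2 for lz4 payloads):

```python
import os
import sqlite3
import tempfile

# Build a tiny stand-in day-file (schema names assumed for illustration).
path = os.path.join(tempfile.mkdtemp(), "day.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE trajectories (id INTEGER PRIMARY KEY, points BLOB)")
conn.executemany("INSERT INTO trajectories VALUES (?, ?)",
                 [(i, b"old-payload" * 20) for i in range(100)])
conn.commit()

def recompress(blob: bytes) -> bytes:
    # Stand-in for the bz2 -> versioned-lz4 conversion.
    return b"\x01" + blob

# Rewrite every blob in place; old pages become free-list pages.
for rowid, blob in conn.execute("SELECT id, points FROM trajectories").fetchall():
    conn.execute("UPDATE trajectories SET points = ? WHERE id = ?",
                 (recompress(blob), rowid))
conn.commit()

# VACUUM rewrites the file contiguously, reclaiming the freed pages.
conn.execute("VACUUM")
```

Fetching the rows before updating avoids mutating the table under an open cursor, and the VACUUM runs outside any transaction (SQLite rejects it otherwise).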
pr-description-addendum.md covers the lz4 compression change, the thread-pool granularity fix, migration tooling, cumulative benchmark results, and the breaking-change notice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ian Ross <ian-ross@users.noreply.github.com>
(GitHub Claude bot tried to do it, but couldn't fix the same file in parallel with another change going on, so I applied the change manually.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The bz2→lz4 verification scripts previously reloaded the `feder` module with a changed FEDER_DATA_DIR to point at each store in turn. After commit 7a8d892 only the lz4 format decodes, so any query issued against a bz2 store from the current codebase fails.

Introduce a long-running subprocess `scripts/query_executor.py` that can be invoked under either a bz2- or lz4-capable uv environment:

    uv --project <code_dir> run python scripts/query_executor.py \
        <data_dir> --label {bz2,lz4}

It reads newline-delimited JSON `FlightQuery` requests from stdin and writes responses to stdout: a digest per trajectory by default (source_id, n_points, points_sha256), or full base64-encoded point bytes on `"full": true`. The op set is whitelisted (time_{intersects,within,starts_in,ends_in}, spatially_crosses, with_bounds, filter_waypoints, with_orig); per-query exceptions become `{"ok": false, ...}` replies so the subprocess stays alive for the next request.

`scripts/_query_harness.py` provides a `ComparisonHarness` context manager that spawns one executor per format, validates a startup handshake with a 30 s timeout, and exposes `compare(t1, t2, ops)`, which returns a digest-based `Verdict`. On a digest mismatch it transparently re-sends the query with `"full": true` to both sides and diffs raw point bytes to pinpoint the offending trajectory and point index. Each executor's stderr is streamed to `scripts/logs/query_executor_{label}_{pid}_{ts}.log`. On context exit stdin is closed for graceful shutdown, with terminate → kill fallbacks.

Configuration is env-var only: BZ2_CODE_DIR, BZ2_DATA_DIR, LZ4_CODE_DIR, LZ4_DATA_DIR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
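The per-request behaviour can be sketched as a reply builder; the op names follow the whitelist above, while the remaining field names and the `run_query` callback are assumptions of this sketch:

```python
import base64
import hashlib
import json

# Whitelisted op names from the commit message.
ALLOWED_OPS = {"time_intersects", "time_within", "time_starts_in", "time_ends_in",
               "spatially_crosses", "with_bounds", "filter_waypoints", "with_orig"}

def handle_request(line, run_query):
    try:
        req = json.loads(line)
        for op in req.get("ops", []):
            if op not in ALLOWED_OPS:
                raise ValueError(f"op {op!r} not whitelisted")
        out = []
        for source_id, n_points, point_bytes in run_query(req):
            entry = {"source_id": source_id, "n_points": n_points,
                     "points_sha256": hashlib.sha256(point_bytes).hexdigest()}
            if req.get("full"):
                # On digest mismatch the harness re-sends with "full": true
                # to diff raw point bytes.
                entry["points"] = base64.b64encode(point_bytes).decode()
            out.append(entry)
        return json.dumps({"ok": True, "trajectories": out})
    except Exception as exc:
        # A failed query must not kill the subprocess; reply and await
        # the next stdin line.
        return json.dumps({"ok": False, "error": str(exc)})
```

Keeping every failure inside the reply envelope is what lets a 48-hour soak run survive individual bad queries.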
…ness
All three scripts previously imported `feder` and toggled FEDER_DATA_DIR
per query; this no longer works now that only the lz4 format decodes.
Each script now drives two `query_executor.py` subprocesses via the
shared `ComparisonHarness`.
Query shape updates, motivated by early soak runs that stalled on
continent-wide unbounded queries:
- Every generated query carries a bounding box drawn as a CONUS
centre point plus independent N-S and E-W extents in [10, 200] km.
Longitude half-width compensates for the centre latitude so kilometres
on the ground are honoured, not just the degree delta. This models
the primary use case: regions around ground-based cameras covering
their effective field of view.
- Time windows are 5-30 minutes (was 15-240), with start time offset
by a random minute so windows aren't always aligned to the hour.
Configuration is env-var only across all three: BZ2_CODE_DIR,
BZ2_DATA_DIR, LZ4_CODE_DIR, LZ4_DATA_DIR. The positional bz2/lz4 data
arguments to verify_queries.py and soak_test.py are removed;
hypothesis_test.py's previous BZ2_DIR/LZ4_DIR env vars become
BZ2_DATA_DIR/LZ4_DATA_DIR and are joined by the two new *_CODE_DIR vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
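The latitude-compensated box generation described above might look like this; the 111.32 km-per-degree constant and the function name are assumptions of this sketch:

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate kilometres per degree of latitude

def bbox_around(lat, lon, ns_km, ew_km):
    # Bounding box centred on (lat, lon) with independent north-south and
    # east-west ground extents in km. The longitude half-width is scaled by
    # cos(lat) so east-west kilometres are honoured, not just degree deltas.
    half_lat = (ns_km / 2) / KM_PER_DEG_LAT
    half_lon = (ew_km / 2) / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat - half_lat, lon - half_lon, lat + half_lat, lon + half_lon)
```

At CONUS latitudes a degree of longitude covers fewer ground kilometres than a degree of latitude, so an uncompensated box would be noticeably too narrow east-west.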
The source directory may not be writable by the user running the script, which breaks sqlite3's backup path. Since the source DB is guaranteed to be vacuumed with no concurrent writers, copy it into a writable temp directory and operate on that copy instead.

Co-authored-by: Claude Opus 4.7 <anthropic-claude-opus-4-7@pi.local>
Pi-Model: anthropic/claude-opus-4-7
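A minimal sketch of that copy-then-convert approach (the function name and temp-dir prefix are illustrative):

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

def open_writable_copy(src):
    # The source dir may be read-only for the converting user, which breaks
    # sqlite3's backup path. The source is vacuumed with no concurrent
    # writers, so a plain file copy into a writable temp dir is safe.
    tmp_dir = Path(tempfile.mkdtemp(prefix="feder_convert_"))
    dst = tmp_dir / Path(src).name
    shutil.copy2(src, dst)
    return sqlite3.connect(dst), dst

# Demo: build a tiny source DB, then write only to the copy.
src = Path(tempfile.mkdtemp()) / "day.sqlite"
with sqlite3.connect(src) as c:
    c.execute("CREATE TABLE t (x INTEGER)")
conn, copy_path = open_writable_copy(src)
conn.execute("INSERT INTO t VALUES (1)")
```

The no-concurrent-writers guarantee matters: copying a SQLite file mid-write could capture a torn page, whereas a vacuumed, quiescent file copies cleanly.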
Query optimization using Claude Code. The following description is lightly edited from Claude's own description of the changes.
Motivation
The flight query API is going to be used in a long-lived server process to load trajectory data for projecting onto ground-based camera video. A representative query (1-hour time window, 50 km bounding box, `spatially_crosses` + `filter_waypoints`) was taking ~1.9 s, well above the ~200 ms target needed to stay below video-stream initialization time.

What was changed
1. Vectorized point parsing with numpy (`feder-common`)

Root cause: `Point.unpack` deserialized trajectory points one at a time in a Python loop, calling `struct.unpack` and `round(..., 3)` (`milli()`) per point. With ~890 K points unpacked per query call, this generated 22 M Python function calls and dominated the profile at ~76 % of wall time.

Fix:
- A `_POINT_DTYPE` numpy structured dtype matching the on-disk binary layout (`'>Lddddd?'`).
- `Point._unpack_blob(data)` parses a full decompressed blob in one `numpy.frombuffer` call.
- `Point._array_to_points(arr)` materialises `Point` objects only for trajectories that survive all filters.
- `points_checker` (spatial crossing check) and `_make_point_filter` (waypoint filter) are replaced with numpy boolean-mask operations, so `Point` objects are never constructed for the ~59 % of R-tree candidates that are ultimately rejected.
- `milli()` is removed from the hot path. It was originally needed for round-tripping against an old CSV format; it is preserved in `utils.py` and remains in `tests/conftest.py`, where it still makes comparisons robust.

Result: 1.90 s → 0.56 s (3.4×)
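For illustration, a boolean-mask spatial check might look like the following; this is an any-point-inside simplification, not the actual `spatially_crosses` implementation, and the field access pattern assumes a structured array with `lat`/`lon` fields:

```python
import numpy as np

def crosses_bbox(lats, lons, lat_min, lat_max, lon_min, lon_max):
    # Vectorised candidate filter: a trajectory survives if any of its
    # points falls inside the box. Rejected trajectories never have
    # Point objects constructed for them.
    inside = ((lats >= lat_min) & (lats <= lat_max) &
              (lons >= lon_min) & (lons <= lon_max))
    return bool(inside.any())
```

The mask runs over whole coordinate arrays at C speed, which is what lets the ~59 % of rejected R-tree candidates skip Python-level Point construction entirely.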
2. Parallel bz2 decompression with a thread pool (`feder-common`)

Root cause: After the numpy change, profiling showed bz2 decompression of ~1950 trajectory blobs was 92 % of remaining time. Decompressions were serial, and `bz2` is a C extension that releases the GIL, so it is straightforward to parallelise with threads.

Fix:
- A module-level `ThreadPoolExecutor` (default 8 workers, empirically the sweet spot on the test machine; more threads increase scheduling and memory-bandwidth contention without adding throughput for this workload).
- `_process_blob(blob, points_check, pt_filter)` is the per-trajectory unit of work submitted to the pool: decompress → numpy parse → spatial check → waypoint filter, all running in parallel.
- The `batched(ids, 50)` loop in `query_flights` is removed: all R-tree IDs are now passed to `_retrieve` in a single SQL query (SQLite handles thousands of `IN`-clause parameters without issue), giving the pool the maximum available parallelism in one shot.

Result: 0.56 s → ~0.18 s median (a further 3.1×)
Overall results
Measured on a warm server process (OS page cache hot, 10 repeated calls after one warm-up). Min observed: 128 ms; p90: ~275 ms (tail variance is from shared-host load, not the query path itself).
Dependencies added
`numpy >= 1.24` added to `feder-common`.

What's next
The remaining ~150 ms is almost entirely bz2 decompression. Replacing bz2 with lz4 (decompresses at ~6 GB/s vs bz2's ~10 MB/s) would reduce decompression to near-zero and push all percentiles comfortably under 100 ms. That requires a data migration and ingestion-pipeline change, and is tracked separately.
Addendum: lz4 storage compression (Feder 1.0.0)
Motivation
After the numpy and thread-pool changes, profiling showed bz2 decompression of ~1 950 trajectory blobs still accounted for the bulk of remaining latency. lz4 decompresses at ~6 GB/s vs bz2's ~10 MB/s — roughly 600× faster — reducing decompression to near-zero.
This required converting 380 GB of historical data (1,307 daily SQLite files) and is a breaking change: old library versions will raise `OSError: Invalid data stream` against the new files. It is released as Feder 1.0.0, using the cumulative ~11× query speedup as justification for the compatibility break.

What was changed
3. lz4 storage format with version byte (`feder-common`, `feder-ingest`)

Blob format: each `points` blob is now stored as:

    b'\x01' + lz4.frame.compress(Point.pack(traj.points))

The leading version byte (`0x01`) future-proofs subsequent compression changes without requiring another flag day.

Reader (`db.py`): validates the version byte and calls `lz4.frame.decompress(blob[1:])`. An unrecognised version byte raises `ValueError` with a message directing the user to upgrade.

Writer (`writeable_db.py`): prepends `0x01` and compresses with `lz4.frame`.

Thread-pool fix: switching from bz2 to lz4 exposed a granularity problem: lz4 decompressions take ~0.4 µs each (vs ~260 µs for bz2), making 1950 individual thread-pool tasks slower than single-threaded due to per-task Python overhead. Fixed by splitting the row list into `N_WORKERS` chunks and submitting one future per chunk. `_array_to_points` is also moved inside the chunked work so Point construction is parallelised too.

4. Migration tooling (`scripts/`)
scripts/)Nine standalone scripts for the cluster-side data migration:
convert_file.pybulk_convert.shbulk_convert_parallel.shtail_convert.shverify_counts.pyverify_sample.pyverify_queries.pyhypothesis_test.pymax_examples=2000soak_test.pywhile Trueloop for 48-hour pre-cutover confidence buildingOverall results (cumulative)
Measured on a warm server process against the converted live dataset. p90 is ~264 ms; tail variance is shared-host scheduling noise, not query-path cost.
Dependencies added
`lz4 >= 4.4.5` added to `feder-common` and `feder-ingest`.

Breaking change

Users on old library versions querying converted data will see `OSError: Invalid data stream`. The fix is `pip install --upgrade feder`. This will be called out prominently in the 1.0.0 release notes and announced to users before the data cutover.