Use MEST multi-entry indexing in the TRTREE index#143
Conversation
A temporal column was indexed as a single minimum bounding box per row, so a wiggly or high-extent trajectory matched far more query boxes than it actually overlapped. The MEOS RTree now exposes rtree_insert_temporal_split, which decomposes a temporal value into several tight per-segment bounding boxes sharing one id. Construct now indexes each temporal value as up to max_boxes tight boxes via rtree_insert_temporal_split (degenerating to the previous single-box behaviour for instants or max_boxes <= 1). Temporal-spatial values are SRID-normalised to 0 before splitting so every produced stbox matches the SRID-stripped query box; otherwise ensure_same_srid would reject the overlap. Search deduplicates row ids with a seen-set because multi-entry leaves return the same id once per overlapping segment box, which would otherwise surface as duplicate rows. The split bound is exposed as WITH (max_boxes = N), default 8. Advances the pinned MEOS to the commit that adds the MEST functions (master plus the single MEST commit, so the spatial-rels API the pin bump adapts to is unchanged). Extends test/sql/parity/050_index_types with MEST correctness, dedup, explicit max_boxes and degenerate single-box coverage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The temporal-spatial Construct path normalised the SRID to 0 with tspatial_set_srid(temp, 0) before rtree_insert_temporal_split. SRID 0 is "unknown", so the stbox conversion inside the splitter raised "Invalid Input Error: The SRID cannot be unknown" at CREATE INDEX time for any tgeompoint/tgeogpoint column (reproduced with SRID=4326 data, both max_boxes=1 and 8). The index ended up empty; results stayed correct only because the optimizer currently falls back to a sequential scan, which masked the failure in CI. Split the temporal directly with its own SRID, exactly like rtree_insert_temporal / tspatial_to_stbox. The pre-split SRID strip was unnecessary: the working single-box path in the parent PR strips at the box level after conversion, never at the temporal level before it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two defects surfaced running the index end-to-end in DuckDB:
SRID: the index code normalised SRID at three call sites (Insert,
Construct-blob, InitializeScan-query). Post the SRID-build fix the
temporal index keys carry their real SRID while the query box was
stripped to 0, so ensure_same_srid rejected every overlap
("Exception during rtree_search", NULL results). Removed all three
strips: index/predicate code is now SRID-agnostic and && requires
matching SRID, mirroring MobilityDB/PostGIS GiST. SRID is dataset
metadata applied at load, never normalised per call site.
Projection: RTreeIndexScanGlobalState.all_columns was never
Initialized and projection_ids never populated, so a scan of a
multi-column table projecting a non-indexed column crashed with
"Expected vector of type VARCHAR, found INT32". Mirrors DuckDB's
canonical table_scan: copy input.projection_ids and Initialize
all_columns with the scanned column types. rowid uses
LogicalType(LogicalTypeId::BIGINT) rather than the
LogicalType::ROW_TYPE static member, whose ODR-use from an extension
TU multiply-defines the symbol against libduckdb at link.
Verified in-engine: CREATE INDEX USING TRTREE on a multi-column
tgeompoint table, constant && query returns correct rows via
MOBILITY RTREE INDEX scan (no crash, no SRID exception).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer's quickstart — ~5 minutesWhat this PR does in one sentence: switches the TRTREE index from single-entry per row to multi-entry indexing (MEST) — every spatial bbox of a temporal trajectory's component segments becomes an index entry, dramatically improving recall for spatial predicates. Files (5):
Why it matters: previously a single trajectory got ONE bbox in the index — that bbox is loose (covers the whole trajectory's extent), so queries with small predicate polygons returned huge false-positive sets that had to be re-filtered. With MEST, each segment's bbox is indexed separately; a small predicate polygon hits only the relevant segments. Foundational for: #144 (constant-geometry pushdown — depends on the multi-entry structure existing). Cross-link: Linux arm64 build needs #161. Why it's safe to merge: the index becomes strictly more precise — never lower recall (every old entry still maps to the trajectory it came from), never wrong results. Build cost goes up (more entries), query cost goes down (less false-positive filtering). |
`meosType` (lower-case) is the **pre-consolidation** MEOS type name; `MeosType` (upper-case) is the **post-consolidation** target that the upstream rename sweep has not yet reached. The current vcpkg pin (`vcpkg_ports/meos/portfile.cmake` REF f11b7443ee98…) is still pre-consolidation: `meos/include/temporal/meos_catalog.h` line 121 declares the typedef as `} meosType;` and every MEOS API uses the lower-case spelling. MobilityDuck's source code consistently uses `meosType` to match — `grep -rn '\bMeosType\b' src/` finds the name only on the alias line and its comment, nowhere else. c8cad6d added `using meosType = MeosType;` as a forward-looking bridge for the eventual consolidation bump. That bridge points at `MeosType`, which the current pin does NOT yet expose, so it breaks every PR's Linux arm64 build with: /duckdb_build_dir/src/include/tydef.hpp:18:18: error: ‘MeosType’ does not name a type; did you mean ‘meosType’? The fix is to drop the premature alias and replace the misleading comment with one that documents the pre/post-consolidation distinction and the resume path for the next pin bump — at that point a reviewer can either restore the bridge (this time it'll be valid because `MeosType` will exist) or sweep the MobilityDuck source from `meosType` to `MeosType` in a single PR. Unblocks every in-flight PR's Linux arm64 build: #126, #130, #149, #158, #159, #160, plus the entire `feat/*_port_core` extended-type stack (#148/#150/#151/#153/#155/#156).
The stage_icu helper mapped only the Linux uname values, so on the macOS arm64 test runner uname -m returned "arm64" and the icu extension was copied to .duckdb/extensions/v1.4.4/arm64 instead of .../osx_arm64, where DuckDB's autoload looks. The hub fallback is not reliably resolvable on that runner, so the osx_arm64 Test step failed to load the extension. Map the OS and architecture to the DuckDB platform string (linux_amd64, linux_arm64, osx_amd64, osx_arm64) so the locally built icu is staged at the path autoload expects on every tested platform; the Linux mapping is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A temporal column was indexed as a single minimum bounding box per row, so a wiggly or high-extent trajectory matched far more query boxes than it actually overlapped. Construct now indexes each temporal value as up to max_boxes tight per-segment bounding boxes via the MEOS rtree_insert_temporal_split (degenerating to the previous single-box behaviour for instants or max_boxes <= 1); temporal-spatial values are SRID-normalised to 0 before splitting so every produced stbox matches the SRID-stripped query box, and Search deduplicates row ids with a seen-set because multi-entry leaves return the same id once per overlapping segment box. The split bound is exposed as WITH (max_boxes = N), default 8, validated on the MEST selectivity benchmark (false-positive rate 47.2 percent single-box to 13.8 percent at 8). The portfile.cmake change here pins MEOS at the MEST commit only as a TEMPORARY BUILD-BASE so this PR is green in isolation; it deliberately conflicts with the wasm pin work in #142. At finalization this PR drops the portfile edit entirely and inherits #142s pin, which requires MobilityDB #1032 (MEST) to be merged to master alongside #1036 (wasm) so a single master commit carries both; the MobilityDuck merge order is #134, #136, #137, #138, #140, #141, #142, then #139, then this PR. Stacked on the TRTREE crash fix (fix/trtree-index-assertion).