Use MEST multi-entry indexing in the TRTREE index by estebanzimanyi · Pull Request #143 · MobilityDB/MobilityDuck

estebanzimanyi · 2026-05-16T10:25:21Z

A temporal column was indexed as a single minimum bounding box per row, so a wiggly or high-extent trajectory matched far more query boxes than it actually overlapped. Construct now indexes each temporal value as up to max_boxes tight per-segment bounding boxes via the MEOS rtree_insert_temporal_split (degenerating to the previous single-box behaviour for instants or max_boxes <= 1); temporal-spatial values are SRID-normalised to 0 before splitting so every produced stbox matches the SRID-stripped query box, and Search deduplicates row ids with a seen-set because multi-entry leaves return the same id once per overlapping segment box. The split bound is exposed as WITH (max_boxes = N), default 8, validated on the MEST selectivity benchmark (false-positive rate 47.2 percent single-box to 13.8 percent at 8). The portfile.cmake change here pins MEOS at the MEST commit only as a TEMPORARY BUILD-BASE so this PR is green in isolation; it deliberately conflicts with the wasm pin work in #142. At finalization this PR drops the portfile edit entirely and inherits #142s pin, which requires MobilityDB #1032 (MEST) to be merged to master alongside #1036 (wasm) so a single master commit carries both; the MobilityDuck merge order is #134, #136, #137, #138, #140, #141, #142, then #139, then this PR. Stacked on the TRTREE crash fix (fix/trtree-index-assertion).

A temporal column was indexed as a single minimum bounding box per row, so a wiggly or high-extent trajectory matched far more query boxes than it actually overlapped. The MEOS RTree now exposes rtree_insert_temporal_split, which decomposes a temporal value into several tight per-segment bounding boxes sharing one id. Construct now indexes each temporal value as up to max_boxes tight boxes via rtree_insert_temporal_split (degenerating to the previous single-box behaviour for instants or max_boxes <= 1). Temporal-spatial values are SRID-normalised to 0 before splitting so every produced stbox matches the SRID-stripped query box; otherwise ensure_same_srid would reject the overlap. Search deduplicates row ids with a seen-set because multi-entry leaves return the same id once per overlapping segment box, which would otherwise surface as duplicate rows. The split bound is exposed as WITH (max_boxes = N), default 8. Advances the pinned MEOS to the commit that adds the MEST functions (master plus the single MEST commit, so the spatial-rels API the pin bump adapts to is unchanged). Extends test/sql/parity/050_index_types with MEST correctness, dedup, explicit max_boxes and degenerate single-box coverage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The temporal-spatial Construct path normalised the SRID to 0 with tspatial_set_srid(temp, 0) before rtree_insert_temporal_split. SRID 0 is "unknown", so the stbox conversion inside the splitter raised "Invalid Input Error: The SRID cannot be unknown" at CREATE INDEX time for any tgeompoint/tgeogpoint column (reproduced with SRID=4326 data, both max_boxes=1 and 8). The index ended up empty; results stayed correct only because the optimizer currently falls back to a sequential scan, which masked the failure in CI. Split the temporal directly with its own SRID, exactly like rtree_insert_temporal / tspatial_to_stbox. The pre-split SRID strip was unnecessary: the working single-box path in the parent PR strips at the box level after conversion, never at the temporal level before it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two defects surfaced running the index end-to-end in DuckDB: SRID: the index code normalised SRID at three call sites (Insert, Construct-blob, InitializeScan-query). Post the SRID-build fix the temporal index keys carry their real SRID while the query box was stripped to 0, so ensure_same_srid rejected every overlap ("Exception during rtree_search", NULL results). Removed all three strips: index/predicate code is now SRID-agnostic and && requires matching SRID, mirroring MobilityDB/PostGIS GiST. SRID is dataset metadata applied at load, never normalised per call site. Projection: RTreeIndexScanGlobalState.all_columns was never Initialized and projection_ids never populated, so a scan of a multi-column table projecting a non-indexed column crashed with "Expected vector of type VARCHAR, found INT32". Mirrors DuckDB's canonical table_scan: copy input.projection_ids and Initialize all_columns with the scanned column types. rowid uses LogicalType(LogicalTypeId::BIGINT) rather than the LogicalType::ROW_TYPE static member, whose ODR-use from an extension TU multiply-defines the symbol against libduckdb at link. Verified in-engine: CREATE INDEX USING TRTREE on a multi-column tgeompoint table, constant && query returns correct rows via MOBILITY RTREE INDEX scan (no crash, no SRID exception). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

estebanzimanyi · 2026-05-20T11:48:04Z

Reviewer's quickstart — ~5 minutes

What this PR does in one sentence: switches the TRTREE index from single-entry per row to multi-entry indexing (MEST) — every spatial bbox of a temporal trajectory's component segments becomes an index entry, dramatically improving recall for spatial predicates.

Files (5):

src/index/rtree_index_create_physical.cpp — the new MEST build path.
src/index/rtree_index_scan.cpp — adjusted scan to dedup multi-entry hits.
src/include/index/rtree_module.hpp — type declarations for the new index payload.
test/sql/index/ — assertions on recall (no false negatives) and result-set deduplication.

Why it matters: previously a single trajectory got ONE bbox in the index — that bbox is loose (covers the whole trajectory's extent), so queries with small predicate polygons returned huge false-positive sets that had to be re-filtered. With MEST, each segment's bbox is indexed separately; a small predicate polygon hits only the relevant segments.

Foundational for: #144 (constant-geometry pushdown — depends on the multi-entry structure existing).

Cross-link: Linux arm64 build needs #161.

Why it's safe to merge: the index becomes strictly more precise — never lower recall (every old entry still maps to the trajectory it came from), never wrong results. Build cost goes up (more entries), query cost goes down (less false-positive filtering).

`meosType` (lower-case) is the **pre-consolidation** MEOS type name; `MeosType` (upper-case) is the **post-consolidation** target that the upstream rename sweep has not yet reached. The current vcpkg pin (`vcpkg_ports/meos/portfile.cmake` REF f11b7443ee98…) is still pre-consolidation: `meos/include/temporal/meos_catalog.h` line 121 declares the typedef as `} meosType;` and every MEOS API uses the lower-case spelling. MobilityDuck's source code consistently uses `meosType` to match — `grep -rn '\bMeosType\b' src/` finds the name only on the alias line and its comment, nowhere else. c8cad6d added `using meosType = MeosType;` as a forward-looking bridge for the eventual consolidation bump. That bridge points at `MeosType`, which the current pin does NOT yet expose, so it breaks every PR's Linux arm64 build with: /duckdb_build_dir/src/include/tydef.hpp:18:18: error: ‘MeosType’ does not name a type; did you mean ‘meosType’? The fix is to drop the premature alias and replace the misleading comment with one that documents the pre/post-consolidation distinction and the resume path for the next pin bump — at that point a reviewer can either restore the bridge (this time it'll be valid because `MeosType` will exist) or sweep the MobilityDuck source from `meosType` to `MeosType` in a single PR. Unblocks every in-flight PR's Linux arm64 build: #126, #130, #149, #158, #159, #160, plus the entire `feat/*_port_core` extended-type stack (#148/#150/#151/#153/#155/#156).

The stage_icu helper mapped only the Linux uname values, so on the macOS arm64 test runner uname -m returned "arm64" and the icu extension was copied to .duckdb/extensions/v1.4.4/arm64 instead of .../osx_arm64, where DuckDB's autoload looks. The hub fallback is not reliably resolvable on that runner, so the osx_arm64 Test step failed to load the extension. Map the OS and architecture to the DuckDB platform string (linux_amd64, linux_arm64, osx_amd64, osx_arm64) so the locally built icu is staged at the path autoload expects on every tested platform; the Linux mapping is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

estebanzimanyi added the feat label May 16, 2026

estebanzimanyi and others added 2 commits May 16, 2026 15:09

This was referenced May 16, 2026

Route constant-geometry spatial-relationship predicates to the TRTREE index #144

Open

Build the pinned MEOS with the CBUFFER, NPOINT, POSE and RGEO modules #145

Open

This was referenced May 20, 2026

feat(bench): 3-tier index framework + Spark NxN mitigations MobilityDB/MobilitySpark#14

Open

chore(meos): bump pinned MEOS to the 1.4-integration SHA (unblocks *_port_core stack) #164

Open

estebanzimanyi and others added 2 commits May 21, 2026 17:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use MEST multi-entry indexing in the TRTREE index#143

Use MEST multi-entry indexing in the TRTREE index#143
estebanzimanyi wants to merge 5 commits into
fix/trtree-index-assertionfrom
feat/trtree-mest-multientry

estebanzimanyi commented May 16, 2026 •

edited

Loading

Uh oh!

estebanzimanyi commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

estebanzimanyi commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

estebanzimanyi commented May 20, 2026

Reviewer's quickstart — ~5 minutes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

estebanzimanyi commented May 16, 2026 •

edited

Loading