Skip to content

Use MEST multi-entry indexing in the TRTREE index#143

Open
estebanzimanyi wants to merge 5 commits into
fix/trtree-index-assertionfrom
feat/trtree-mest-multientry
Open

Use MEST multi-entry indexing in the TRTREE index#143
estebanzimanyi wants to merge 5 commits into
fix/trtree-index-assertionfrom
feat/trtree-mest-multientry

Conversation

@estebanzimanyi
Copy link
Copy Markdown
Member

@estebanzimanyi estebanzimanyi commented May 16, 2026

A temporal column was indexed as a single minimum bounding box per row, so a wiggly or high-extent trajectory matched far more query boxes than it actually overlapped. Construct now indexes each temporal value as up to max_boxes tight per-segment bounding boxes via the MEOS rtree_insert_temporal_split (degenerating to the previous single-box behaviour for instants or max_boxes <= 1); temporal-spatial values are SRID-normalised to 0 before splitting so every produced stbox matches the SRID-stripped query box, and Search deduplicates row ids with a seen-set because multi-entry leaves return the same id once per overlapping segment box. The split bound is exposed as WITH (max_boxes = N), default 8, validated on the MEST selectivity benchmark (false-positive rate 47.2 percent single-box to 13.8 percent at 8). The portfile.cmake change here pins MEOS at the MEST commit only as a TEMPORARY BUILD-BASE so this PR is green in isolation; it deliberately conflicts with the wasm pin work in #142. At finalization this PR drops the portfile edit entirely and inherits #142s pin, which requires MobilityDB #1032 (MEST) to be merged to master alongside #1036 (wasm) so a single master commit carries both; the MobilityDuck merge order is #134, #136, #137, #138, #140, #141, #142, then #139, then this PR. Stacked on the TRTREE crash fix (fix/trtree-index-assertion).

A temporal column was indexed as a single minimum bounding box per row,
so a wiggly or high-extent trajectory matched far more query boxes than
it actually overlapped. The MEOS RTree now exposes
rtree_insert_temporal_split, which decomposes a temporal value into
several tight per-segment bounding boxes sharing one id.

Construct now indexes each temporal value as up to max_boxes tight
boxes via rtree_insert_temporal_split (degenerating to the previous
single-box behaviour for instants or max_boxes <= 1). Temporal-spatial
values are SRID-normalised to 0 before splitting so every produced
stbox matches the SRID-stripped query box; otherwise ensure_same_srid
would reject the overlap. Search deduplicates row ids with a seen-set
because multi-entry leaves return the same id once per overlapping
segment box, which would otherwise surface as duplicate rows. The split
bound is exposed as WITH (max_boxes = N), default 8.

Advances the pinned MEOS to the commit that adds the MEST functions
(master plus the single MEST commit, so the spatial-rels API the pin
bump adapts to is unchanged). Extends test/sql/parity/050_index_types
with MEST correctness, dedup, explicit max_boxes and degenerate
single-box coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
estebanzimanyi and others added 2 commits May 16, 2026 15:09
The temporal-spatial Construct path normalised the SRID to 0 with
tspatial_set_srid(temp, 0) before rtree_insert_temporal_split. SRID 0
is "unknown", so the stbox conversion inside the splitter raised
"Invalid Input Error: The SRID cannot be unknown" at CREATE INDEX time
for any tgeompoint/tgeogpoint column (reproduced with SRID=4326 data,
both max_boxes=1 and 8). The index ended up empty; results stayed
correct only because the optimizer currently falls back to a
sequential scan, which masked the failure in CI.

Split the temporal directly with its own SRID, exactly like
rtree_insert_temporal / tspatial_to_stbox. The pre-split SRID strip
was unnecessary: the working single-box path in the parent PR strips
at the box level after conversion, never at the temporal level before
it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two defects surfaced running the index end-to-end in DuckDB:

SRID: the index code normalised SRID at three call sites (Insert,
Construct-blob, InitializeScan-query). Post the SRID-build fix the
temporal index keys carry their real SRID while the query box was
stripped to 0, so ensure_same_srid rejected every overlap
("Exception during rtree_search", NULL results). Removed all three
strips: index/predicate code is now SRID-agnostic and && requires
matching SRID, mirroring MobilityDB/PostGIS GiST. SRID is dataset
metadata applied at load, never normalised per call site.

Projection: RTreeIndexScanGlobalState.all_columns was never
Initialized and projection_ids never populated, so a scan of a
multi-column table projecting a non-indexed column crashed with
"Expected vector of type VARCHAR, found INT32". Mirrors DuckDB's
canonical table_scan: copy input.projection_ids and Initialize
all_columns with the scanned column types. rowid uses
LogicalType(LogicalTypeId::BIGINT) rather than the
LogicalType::ROW_TYPE static member, whose ODR-use from an extension
TU multiply-defines the symbol against libduckdb at link.

Verified in-engine: CREATE INDEX USING TRTREE on a multi-column
tgeompoint table, constant && query returns correct rows via
MOBILITY RTREE INDEX scan (no crash, no SRID exception).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@estebanzimanyi
Copy link
Copy Markdown
Member Author

Reviewer's quickstart — ~5 minutes

What this PR does in one sentence: switches the TRTREE index from single-entry per row to multi-entry indexing (MEST) — every spatial bbox of a temporal trajectory's component segments becomes an index entry, dramatically improving recall for spatial predicates.

Files (5):

  • src/index/rtree_index_create_physical.cpp — the new MEST build path.
  • src/index/rtree_index_scan.cpp — adjusted scan to dedup multi-entry hits.
  • src/include/index/rtree_module.hpp — type declarations for the new index payload.
  • test/sql/index/ — assertions on recall (no false negatives) and result-set deduplication.

Why it matters: previously a single trajectory got ONE bbox in the index — that bbox is loose (covers the whole trajectory's extent), so queries with small predicate polygons returned huge false-positive sets that had to be re-filtered. With MEST, each segment's bbox is indexed separately; a small predicate polygon hits only the relevant segments.

Foundational for: #144 (constant-geometry pushdown — depends on the multi-entry structure existing).

Cross-link: Linux arm64 build needs #161.

Why it's safe to merge: the index becomes strictly more precise — never lower recall (every old entry still maps to the trajectory it came from), never wrong results. Build cost goes up (more entries), query cost goes down (less false-positive filtering).

estebanzimanyi and others added 2 commits May 21, 2026 17:36
`meosType` (lower-case) is the **pre-consolidation** MEOS type name;
`MeosType` (upper-case) is the **post-consolidation** target that the
upstream rename sweep has not yet reached.  The current vcpkg pin
(`vcpkg_ports/meos/portfile.cmake` REF f11b7443ee98…) is still
pre-consolidation: `meos/include/temporal/meos_catalog.h` line 121
declares the typedef as `} meosType;` and every MEOS API uses the
lower-case spelling.  MobilityDuck's source code consistently uses
`meosType` to match — `grep -rn '\bMeosType\b' src/` finds the name
only on the alias line and its comment, nowhere else.

c8cad6d added `using meosType = MeosType;` as a forward-looking
bridge for the eventual consolidation bump.  That bridge points at
`MeosType`, which the current pin does NOT yet expose, so it
breaks every PR's Linux arm64 build with:

  /duckdb_build_dir/src/include/tydef.hpp:18:18:
    error: ‘MeosType’ does not name a type; did you mean ‘meosType’?

The fix is to drop the premature alias and replace the misleading
comment with one that documents the pre/post-consolidation distinction
and the resume path for the next pin bump — at that point a reviewer
can either restore the bridge (this time it'll be valid because
`MeosType` will exist) or sweep the MobilityDuck source from
`meosType` to `MeosType` in a single PR.

Unblocks every in-flight PR's Linux arm64 build: #126, #130, #149,
#158, #159, #160, plus the entire `feat/*_port_core` extended-type
stack (#148/#150/#151/#153/#155/#156).
The stage_icu helper mapped only the Linux uname values, so on the
macOS arm64 test runner uname -m returned "arm64" and the icu
extension was copied to .duckdb/extensions/v1.4.4/arm64 instead of
.../osx_arm64, where DuckDB's autoload looks. The hub fallback is not
reliably resolvable on that runner, so the osx_arm64 Test step failed
to load the extension. Map the OS and architecture to the DuckDB
platform string (linux_amd64, linux_arm64, osx_amd64, osx_arm64) so
the locally built icu is staged at the path autoload expects on every
tested platform; the Linux mapping is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant