doc(rfc): TemporalParquet — Parquet footer convention for MobilityDB temporal types#911
estebanzimanyi wants to merge 5 commits into
Force-pushed from e16bc5f to 655535b.
…temporal types

Proposes the TemporalParquet convention for storing MEOS-WKB temporal columns in Parquet files, modelled directly on GeoParquet.

Adds doc/temporal-parquet/README.md covering:
- File structure: BYTE_ARRAY columns + temporal key in key_value_metadata
- JSON schema: base_type, subtype, interpolation, srid, geodetic, has_z, encoding_version per column
- Type coverage: all temporal types, spans, spansets, sets, bounding boxes
- Encoding versioning (MAJOR.MINOR per file)
- Implementation status: MobilityDuck temporalFooter() done; asBinary/fromBinary for tgeompoint/tgeogpoint/tint/tfloat/tbool/ttext done
- Zero-data quickstart and AIS data-lake demo links
- Beta-testing guide (two-part: testers + committers) at beta-testing.md

Community Discussion: MobilityDB#870
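To make the footer description above concrete, here is a minimal sketch that parses a hypothetical `temporal` JSON value and type-checks the per-column fields named in the commit message. The top-level `columns` wrapper, the sample values, the choice to treat only `base_type` as required, and the helper name `check_temporal_footer` are all assumptions made for illustration, loosely modelled on GeoParquet. The README remains the authority on the actual schema.

```python
import json

# Field names come from the commit message; the expected Python types and the
# "columns" wrapper are assumptions made for this sketch only.
FIELD_TYPES = {
    "base_type": str,
    "subtype": str,
    "interpolation": str,
    "srid": int,
    "geodetic": bool,
    "has_z": bool,
    "encoding_version": str,
}


def check_temporal_footer(raw):
    """Return a list of problems found in a `temporal` footer JSON string."""
    problems = []
    footer = json.loads(raw)
    for name, meta in footer.get("columns", {}).items():
        # Assumption: base_type is the only field every column must declare.
        if "base_type" not in meta:
            problems.append(f"{name}: missing base_type")
        for key, value in meta.items():
            expected = FIELD_TYPES.get(key)
            if expected is None:
                continue  # unknown keys are tolerated in this sketch
            # bool is a subclass of int in Python, so reject it for srid explicitly
            if expected is int and isinstance(value, bool):
                problems.append(f"{name}.{key}: expected int, got bool")
            elif not isinstance(value, expected):
                problems.append(f"{name}.{key}: expected {expected.__name__}")
    return problems


# Hypothetical footer value for a single tgeompoint column.
EXAMPLE = json.dumps({
    "columns": {
        "trip": {
            "base_type": "tgeompoint",
            "subtype": "Sequence",
            "interpolation": "Linear",
            "srid": 4326,
            "geodetic": False,
            "has_z": False,
            "encoding_version": "1.0",
        }
    }
})

print(check_temporal_footer(EXAMPLE))
```

A reader built this way can reject a malformed footer before touching any BYTE_ARRAY payload, which is the same separation GeoParquet relies on.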
Force-pushed from 0a1fee3 to 443b21a.
…yDB#931 to Tier 1

MobilityDB#911 confirmed green (re-run passed); MobilityDB#922 closed; MobilityDB#820 root cause fixed (trgeo_spatialfuncs.c now compiled), CI re-running. MobilityDB#931 added to Tier 1 as the guide itself.
Adds tools/scripts/pre-push-check.sh and a 'Pre-push checks' section to CONTRIBUTING.md that together close the gap between the documented 'CI green before push' policy (memory: feedback_ci_green_before_push.md) and the actual contributor workflow.

Recent CI history surfaced three patterns where the policy was not applied:

* PR MobilityDB#934 (meos/bootstrap-batch): introduces .github/workflows/windows_msys2_meos.yml; the only push of the branch left that workflow red. The contributor likely had no Windows test environment; the script now documents this explicitly as an acknowledged gap with a PR-description note as the workaround.
* PR MobilityDB#938 (feat/h3-static-geo-coverage): five consecutive pushes left the coverage-matrix variant red, first because libh3-dev wasn't installed in the workflow (fixed mid-branch), then because six th3index tests failed on the full build with 'Unknown spatiotemporal type: tbigint' on a (tbigint)::th3index cast. Running pre-push-check.sh locally would have caught this before any push.
* PRs MobilityDB#899, MobilityDB#911, MobilityDB#929, MobilityDB#934 (coverage-only failures): all from coveralls.io 502 Bad Gateway during lcov upload. These are the documented external-infra exception; they can stay red until Coveralls recovers and the failed job is re-run.

The script mirrors the coverage matrix variant locally (the most exhaustive Linux build), so contributors see the same failures CI would, on their own machine, before pushing.
…ution

The th3index port shipped in lock-step across the three platforms:

* MobilityDB MobilityDB#807 (type) + MobilityDB#866 (spatial wiring) + MobilityDB#938 (h3-static geo)
* MobilityDuck MobilityDB#129 (66 MEOS exports: full H3 cell index API)
* MobilitySpark #9 (10 UDFs covering the BerlinMOD-relevant subset)

It is the first **extended** temporal type to ship a working binding pair on all three engines with a shared SQL surface. This commit folds the state into the RFC so the spec stops treating th3index as 'pending' and starts treating it as the reference shape for the remaining extended types (tcbuffer / tnpoint / tpose / trgeo / tpcpoint / tpcpatch).

Concrete edits:

* Status table: bumped date to 2026-05-11, added a 'Cross-platform th3index port' row marked 'done across the three platforms' with cross-references to all 5 PRs.
* Type-specific optional fields: new subsection that names the discriminator-only-from-base_type contract. Adds 'h3_resolution' as a th3index-specific optional integer in [0, 15]. Consumers MAY rely on it to gate cell-membership prefilters (everEqH3IndexTh3Index / everEqTh3IndexTh3Index) without decoding any row.
* Worked example: a new 'Worked example: th3index' section in the Proposal shows the BerlinMOD trips footer with both 'trip' (tgeompoint) and 'trip_h3' (th3index) columns, the SQL prefilter pattern, and the h3_resolution=7 soundness rationale (cell edge ~1.2 km vs BerlinMOD 3-10 m thresholds).
* Open Questions: 'Coverage gap for asBinary/fromBinary' updated to remove th3index from the pending list and pin it as the reference shape for the remaining extended types.
* Type coverage table: th3index split out from the bundled extended row into its own row, with a link to the worked example.
* Related: section split into a general references list and a dedicated 'th3index reference port' subsection naming the 5 PRs.

No structural change to the MEOS-WKB encoding or the JSON-schema; 'h3_resolution' is purely an optional consumer hint. Spec version stays at 1.0.0 (the th3index encoding has been on disk in MEOS-WKB since the th3index branch landed; this RFC just names how the metadata declares it).
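As a rough illustration of how a consumer might read the optional hint described above, the sketch below returns the declared resolution only when the column is a th3index column and the value is an integer in [0, 15]. The function name `h3_prefilter_resolution`, the dict shape of the column metadata, and the decision to raise on an out-of-range value are assumptions, not part of the RFC. Reading the hint is also independent of whether a prefilter built on it is sound.

```python
def h3_prefilter_resolution(column_meta):
    """Return the declared H3 resolution, or None when no usable hint exists.

    `column_meta` is assumed to be one column's entry from the parsed footer.
    """
    # The hint is th3index-specific; other base types never carry it.
    if column_meta.get("base_type") != "th3index":
        return None
    res = column_meta.get("h3_resolution")
    # The field is optional, and bool must not pass as an integer.
    if isinstance(res, bool) or not isinstance(res, int):
        return None
    if not 0 <= res <= 15:
        raise ValueError(f"h3_resolution out of range [0, 15]: {res}")
    return res


# Hypothetical column entries for the two BerlinMOD columns named above.
trip = {"base_type": "tgeompoint", "srid": 4326}
trip_h3 = {"base_type": "th3index", "h3_resolution": 7}

print(h3_prefilter_resolution(trip))
print(h3_prefilter_resolution(trip_h3))
```

Because the hint is optional, a consumer that gets `None` back simply skips the prefilter and falls through to the exact predicate.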
Folded a follow-up commit (
What the commit adds:
No structural change to the MEOS-WKB encoding or the existing JSON schema.
Two follow-ups to the th3index extension (commit 7bb80bf) that close the 'Convention spec (DocBook + docs page)' open item in the Status table and bring the beta-testing guide up to date with the cross-platform port that landed in MobilityDB MobilityDB#807/MobilityDB#866/MobilityDB#938 + MobilityDuck MobilityDB#129 + MobilitySpark #9.

doc/temporal_parquet.xml (new appendix)
* File structure overview + JSON footer example
* Type coverage table (tbool/tint/tfloat/ttext, tgeompoint/tgeogpoint, tgeometry/tgeography, th3index, remaining extended types, boxes, spans/spansets/sets); th3index broken out as its own row
* Type-specific optional fields section with srid / geodetic / has_z / h3_resolution
* Worked example for th3index with the BerlinMOD prefilter SQL
* Encoding versioning rules (MAJOR.MINOR)
* Reference-implementation pointer to scripts/parquet (PR MobilityDB#831)
* Related references (RFC MobilityDB#911, MobilityDB#912, MEOS-WKB MobilityDB#833, GeoParquet)

doc/mobilitydb-manual.xml
* Wires &temporal_parquet; entity after &data_generator;

doc/temporal-parquet/beta-testing.md
* PRs and branches table: adds MobilityDB#912 (Temporal Data Lake), MobilityDB#129 (MobilityDuck th3index), #9 (MobilitySpark th3index)
* Implementation status: th3index promoted to 'done' with PR links; the remaining-extended-types row updated to name th3index as the reference shape; tpcpoint / tpcpatch added to the list
* New 'Test scenario: th3index cross-platform round-trip' section walks an export/annotate/import/Q5 recipe that produces identical answers on MobilityDB / MobilityDuck / MobilitySpark

No structural spec change. The DocBook chapter mirrors the RFC body prose-for-prose so it stays in sync; the beta-testing guide updates acknowledge the th3index port as the first extended type to ship a cross-platform binding pair.
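The commit message names MAJOR.MINOR encoding versioning but does not restate the compatibility rules, so the following is only a sketch under an assumed policy: a reader can decode a file when the MAJOR components match, and MINOR differences are treated as additive. Both helper names and that policy are assumptions. The rules in the spec itself take precedence.

```python
def parse_encoding_version(text):
    """Split a 'MAJOR.MINOR' string into a pair of integers."""
    major, minor = text.split(".")  # raises ValueError if not exactly two parts
    return int(major), int(minor)


def reader_can_decode(file_version, reader_version):
    """Assumed policy: equal MAJOR means decodable; MINOR is ignored."""
    file_major, _ = parse_encoding_version(file_version)
    reader_major, _ = parse_encoding_version(reader_version)
    return file_major == reader_major


print(reader_can_decode("1.3", "1.0"))
print(reader_can_decode("2.0", "1.9"))
```

Under this assumed policy the first call reports a compatible pair and the second an incompatible one, since only the MAJOR component is compared.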
The DocBook chapter wired in commit 363a4b5 covers the 'DocBook' half of the previous combined 'DocBook + docs page' status row. The rendered docs.mobilitydb.com page lives downstream of the manual build and does not need a separate spec change, so it stays as the single 'open' item. Splits the status row into three explicit pointers so a reader can see at a glance which surface is live and which is pending.
…s under-sampling
The earlier worked example claimed the th3index prefilter at H3
resolution 7 was 'sound for BerlinMOD's 3-10 m thresholds'. The
2026-05-11 BerlinMOD ch1 + h3 prefilter benchmark contradicts this:
the prefilter drops ~81% of true 'eIntersects(trip, region.geom)'
hits, from two distinct sources:
1. Trip side — 'tgeompoint_to_th3index' samples one cell per source
instant. Cells traversed by the trip's straight-line segment
between consecutive instants are not visited.
2. Polygon side — 'meos/src/h3/h3_geo.c::polygon_to_cells_into' uses
libh3's CONTAINMENT_CENTER mode. Cells that intersect the polygon
but whose centroid is outside it are missed.
Both gaps are tracked: 'fix/th3index-srid-flags-lift' covers the three
latent th3index defects (SRID, spatial-flags, lifting); a separate
prefilter-soundness pass covers the polygon-side
'polygonToCellsExperimental' (libh3 4.2+) or 'gridDisk(c, 1)' (4.1
wrapper) and the trip-side over-sampling.
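The two gaps above can be reproduced on a toy grid. The sketch below uses unit squares as a stand-in for H3 cells and an axis-aligned rectangle as the region, so every function name and the geometry are illustrative assumptions. It does not call libh3 and is not the fix tracked above. Dense sampling along each segment is used only to approximate the set of traversed cells.

```python
import math


def cell_of(x, y):
    """Map a point to its unit-square grid cell (a stand-in for an H3 cell)."""
    return (math.floor(x), math.floor(y))


def cells_per_instant(points):
    """Gap 1: one cell per source instant, nothing in between."""
    return {cell_of(x, y) for x, y in points}


def cells_along_path(points, steps=100):
    """Approximate the traversed cells by densely sampling each segment."""
    cells = set()
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        for i in range(steps + 1):
            t = i / steps
            cells.add(cell_of(x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return cells


def candidate_cells(xmin, ymin, xmax, ymax):
    """All grid cells whose area overlaps the rectangle."""
    return {(i, j)
            for i in range(math.floor(xmin), math.ceil(xmax))
            for j in range(math.floor(ymin), math.ceil(ymax))}


def cells_by_centroid(xmin, ymin, xmax, ymax):
    """Gap 2: keep only cells whose centre lies inside the rectangle."""
    return {(i, j) for i, j in candidate_cells(xmin, ymin, xmax, ymax)
            if xmin <= i + 0.5 <= xmax and ymin <= j + 0.5 <= ymax}


# A two-instant trip that crosses a narrow region between its instants.
trip = [(0.5, 0.5), (3.5, 0.5)]
region = (1.6, 0.0, 2.4, 1.0)

# Prefilter as described in the two gaps: no shared cell, so the hit is dropped.
print(sorted(cells_per_instant(trip) & cells_by_centroid(*region)))  # []
# Path cells against overlap cells: the crossing is found.
print(sorted(cells_along_path(trip) & candidate_cells(*region)))  # [(1, 0), (2, 0)]
```

In this toy case either gap alone is enough to lose the hit: the per-instant cells miss the region's overlap cells, and the centre-containment cover of the region is empty.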
This commit corrects three documents to match reality:
doc/temporal-parquet/README.md
Replaces the 'sound for BerlinMOD's 3-10 m thresholds' line with
a soundness-note blockquote that explicitly names the under-
sampling, references the pending fix chain, and notes that the
metadata schema is unaffected — base_type / h3_resolution describe
the bytes regardless of production-time soundness.
doc/temporal_parquet.xml
Mirrors the same correction in the DocBook chapter as a <note>
block.
doc/temporal-parquet/beta-testing.md
Reframes the cross-platform round-trip scenario as a binary-and-
metadata test, drops the prefilter from the example Q5 SQL, and
adds an explicit 'Soundness note (2026-05-11)' that names both
gaps + the prefilter-loss measurement (81% of true hits in
BerlinMOD ch1).
No metadata-schema change; the TemporalParquet bytes + JSON footer
contract for th3index columns is unchanged. Spec version remains
1.0.0.
Package + clean-up checkpoint (2026-05-11). The scope of this TemporalParquet RFC PR is now consolidated. Five commits, all on
Out of scope (deferred for a future task per the user direction):
Dependency chain for the soundness fix (tracked separately, not in this PR):
The spec content now ships in the consolidated Temporal Data Lake RFC PR #912 alongside the umbrella architecture document. The four layers (MEOS-WKB wire format / TemporalParquet file format / MEOS-API function registry / portable-SQL dialect) are presented as one integration story.
Proposes the TemporalParquet convention for storing MEOS-WKB temporal columns in Parquet files, modelled directly on GeoParquet.
What this adds
A doc/temporal-parquet/README.md spec covering:
- BYTE_ARRAY columns + temporal key in Parquet key_value_metadata
- (base_type, encoding, srid, geodetic, subtype, interpolation)
- (encoding_version)
- tgeogpoint ("geodetic": true) for metre-unit analytics

Reference implementation (already delivered in MobilityDuck)
- tools/temporal_parquet.py
- test/sql/parquet/temporal_parquet.test
- examples/ais-data-lake/ais_data_lake.sql
- docs/tgeogpoint-design.md

No C-side changes to MobilityDB in this PR: the MEOS-WKB encoders/decoders already exist. This is purely the convention spec.
Closes the directional sign-off discussion at #830.