doc(rfc): TemporalParquet — Parquet footer convention for MobilityDB temporal types#911

Closed
estebanzimanyi wants to merge 5 commits into MobilityDB:master from estebanzimanyi:rfc/temporal-parquet

@estebanzimanyi estebanzimanyi commented May 7, 2026

👀 Reviewers: tier ranking, dependency chains and the standards checklist live in doc/contributing/reviewer-guide.md.

Proposes the TemporalParquet convention for storing MEOS-WKB temporal columns in Parquet files, modelled directly on GeoParquet.

What this adds

A doc/temporal-parquet/README.md spec covering:

  • File structure: BYTE_ARRAY columns + temporal key in Parquet key_value_metadata
  • JSON schema for per-column metadata (base_type, encoding, srid, geodetic, subtype, interpolation)
  • Type coverage table (all scalar, spatial, extended, box, span, spanset types)
  • Encoding versioning (encoding_version)
  • Geodetic distance note: prefer tgeogpoint ("geodetic": true) for metre-unit analytics
  • Alternatives considered, open questions, related links
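
To make the footer shape concrete, here is a minimal sketch of how the `temporal` key might be assembled and read back. Only the field names (`base_type`, `encoding`, `subtype`, `interpolation`, `srid`, `geodetic`, `encoding_version`) come from this description. The `columns` nesting, the helper name `build_temporal_footer`, and the example values (`"MEOS-WKB"`, `"Sequence"`, `"Linear"`, `4326`) are assumptions for illustration, loosely following GeoParquet's layout, and are not taken from the spec.

```python
import json

def build_temporal_footer(columns):
    """Serialise per-column metadata for the `temporal` footer key.

    Hypothetical layout: a top-level object with a `columns` map,
    one entry per BYTE_ARRAY column holding MEOS-WKB values.
    """
    return json.dumps({"columns": columns}, sort_keys=True)

# Example values are assumed, not quoted from the spec.
trip_column = {
    "base_type": "tgeogpoint",
    "encoding": "MEOS-WKB",
    "encoding_version": "1.0",
    "subtype": "Sequence",
    "interpolation": "Linear",
    "srid": 4326,
    "geodetic": True,  # metre-unit analytics, as the geodetic note suggests
}

# Parquet key_value_metadata is a string-to-string map, so the JSON
# document is stored as one string under the `temporal` key.
key_value_metadata = {"temporal": build_temporal_footer({"trip": trip_column})}

# A reader parses the string back without touching any row data.
decoded = json.loads(key_value_metadata["temporal"])
```

A real writer would attach `key_value_metadata` through its Parquet library of choice; that step is left out here to keep the sketch dependency-free.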

Reference implementation (already delivered in MobilityDuck)

| Artefact | Location |
| --- | --- |
| Python annotate/describe/verify CLI | tools/temporal_parquet.py |
| Round-trip regression test | test/sql/parquet/temporal_parquet.test |
| AIS data-lake demo | examples/ais-data-lake/ais_data_lake.sql |
| TGEOGPOINT design note | docs/tgeogpoint-design.md |

No C-side changes to MobilityDB in this PR — the MEOS-WKB encoders/decoders already exist. This is purely the convention spec.

Closes the directional sign-off discussion at #830.

…temporal types

Proposes the TemporalParquet convention for storing MEOS-WKB temporal columns
in Parquet files, modelled directly on GeoParquet.

Adds doc/temporal-parquet/README.md covering:
- File structure: BYTE_ARRAY columns + temporal key in key_value_metadata
- JSON schema: base_type, subtype, interpolation, srid, geodetic, has_z,
  encoding_version per column
- Type coverage: all temporal types, spans, spansets, sets, bounding boxes
- Encoding versioning (MAJOR.MINOR per file)
- Implementation status: MobilityDuck temporalFooter() done; asBinary/fromBinary
  for tgeompoint/tgeogpoint/tint/tfloat/tbool/ttext done
- Zero-data quickstart and AIS data-lake demo links
- Beta-testing guide (two-part: testers + committers) at beta-testing.md

Community Discussion: MobilityDB#870
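
The commit message above mentions MAJOR.MINOR encoding versioning without restating the compatibility rule. As a rough sketch only, a reader-side check could look like the following. The rule used here (same MAJOR means readable, MINOR changes are additive) is an assumption borrowed from common semantic-versioning practice, not something this PR states, and both function names are hypothetical.

```python
def parse_encoding_version(text):
    """Split a 'MAJOR.MINOR' string into a pair of integers."""
    major, minor = text.split(".")
    return int(major), int(minor)

def reader_accepts(reader_version, file_version):
    """Assumed rule: a reader can decode any file sharing its MAJOR.

    MINOR bumps are treated as additive, so unknown optional fields
    written by a newer MINOR would simply be ignored by the reader.
    """
    reader_major, _ = parse_encoding_version(reader_version)
    file_major, _ = parse_encoding_version(file_version)
    return reader_major == file_major
```

Under that assumption a 1.0 reader would accept a 1.3 file but reject a 2.0 file.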
@estebanzimanyi estebanzimanyi force-pushed the rfc/temporal-parquet branch from 0a1fee3 to 443b21a Compare May 7, 2026 19:32
estebanzimanyi added a commit to estebanzimanyi/MobilityDB that referenced this pull request May 9, 2026
…yDB#931 to Tier 1

MobilityDB#911 confirmed green (re-run passed); MobilityDB#922 closed; MobilityDB#820 root cause
fixed (trgeo_spatialfuncs.c now compiled) — CI re-running. MobilityDB#931 added
to Tier 1 as the guide itself.
estebanzimanyi added a commit to estebanzimanyi/MobilityDB that referenced this pull request May 11, 2026
Adds tools/scripts/pre-push-check.sh and a 'Pre-push checks' section
to CONTRIBUTING.md that together close the gap between the documented
'CI green before push' policy (memory: feedback_ci_green_before_push.md)
and the actual contributor workflow.

Recent CI history surfaced three patterns where the policy was not
applied:

  * PR MobilityDB#934 (meos/bootstrap-batch): introduces .github/workflows/
    windows_msys2_meos.yml; the only push of the branch left that
    workflow red. The contributor likely had no Windows test
    environment; the script now documents this explicitly as an
    acknowledged gap with a PR-description note as the workaround.

  * PR MobilityDB#938 (feat/h3-static-geo-coverage): five consecutive pushes
    left the coverage-matrix variant red — first because libh3-dev
    wasn't installed in the workflow (fixed mid-branch), then because
    six th3index tests failed on the full build with
    'Unknown spatiotemporal type: tbigint' on a (tbigint)::th3index
    cast. Running pre-push-check.sh locally would have caught this
    before any push.

  * PRs MobilityDB#899, MobilityDB#911, MobilityDB#929, MobilityDB#934 (coverage-only failures): all from
    coveralls.io 502 Bad Gateway during lcov upload. These are the
    documented external-infra exception; they can stay red until
    Coveralls recovers and the failed job is re-run.

The script mirrors the coverage matrix variant locally (the most
exhaustive Linux build), so contributors see the same failures CI
would, on their own machine, before pushing.
…ution

The th3index port shipped in lock-step across the three platforms:

  * MobilityDB  MobilityDB#807 (type) + MobilityDB#866 (spatial wiring) + MobilityDB#938 (h3-static geo)
  * MobilityDuck MobilityDB#129 (66 MEOS exports — full H3 cell index API)
  * MobilitySpark #9 (10 UDFs covering the BerlinMOD-relevant subset)

It is the first **extended** temporal type to ship a working binding pair
on all three engines with a shared SQL surface. This commit folds the
state into the RFC so the spec stops treating th3index as 'pending' and
starts treating it as the reference shape for the remaining extended
types (tcbuffer / tnpoint / tpose / trgeo / tpcpoint / tpcpatch).

Concrete edits:

  * Status table: bumped date to 2026-05-11, added a 'Cross-platform
    th3index port' row marked 'done across the three platforms' with
    cross-references to all 5 PRs.

  * Type-specific optional fields: new subsection that names the
    discriminator-only-from-base_type contract. Adds 'h3_resolution'
    as a th3index-specific optional integer in [0, 15]. Consumers MAY
    rely on it to gate cell-membership prefilters
    (everEqH3IndexTh3Index / everEqTh3IndexTh3Index) without decoding
    any row.

  * Worked example: a new 'Worked example: th3index' section in the
    Proposal shows the BerlinMOD trips footer with both 'trip'
    (tgeompoint) and 'trip_h3' (th3index) columns, the SQL prefilter
    pattern, and the h3_resolution=7 soundness rationale (cell edge
    ~1.2 km vs BerlinMOD 3-10 m thresholds).

  * Open Questions: 'Coverage gap for asBinary/fromBinary' updated to
    remove th3index from the pending list and pin it as the reference
    shape for the remaining extended types.

  * Type coverage table: th3index split out from the bundled extended
    row into its own row, with a link to the worked example.

  * Related: section split into a general references list and a
    dedicated 'th3index reference port' subsection naming the 5 PRs.

No structural change to the MEOS-WKB encoding or the JSON-schema —
'h3_resolution' is purely an optional consumer hint. Spec version
stays at 1.0.0 (the th3index encoding has been on disk in MEOS-WKB
since the th3index branch landed; this RFC just names how the metadata
declares it).
@estebanzimanyi
Member Author

Folded a follow-up commit (7bb80bfe7) extending the spec for the th3index cross-platform port that landed across the three platforms in lock-step since this RFC was first written:

  • MobilityDB #807 + #866 + #938
  • MobilityDuck #129 (66 MEOS exports)
  • MobilitySpark #9 (10 UDFs covering the BerlinMOD-relevant subset)

What the commit adds:

| Section | Edit |
| --- | --- |
| Status (2026-05-11) | New row marking the cross-platform th3index port as done with PR links |
| Type coverage | th3index split out from the bundled extended-types row into its own row pointing at the worked example |
| Type-specific optional fields | New subsection naming srid / geodetic / has_z / h3_resolution. h3_resolution is an optional integer in [0, 15] declaring uniform resolution per column so consumers can gate cell-membership prefilters without decoding any row |
| Worked example: th3index | New section. Concrete BerlinMOD-shape footer with both trip (tgeompoint) and trip_h3 (th3index, h3_resolution: 7) columns. Includes the Q5 prefilter SQL pattern and the soundness rationale (cell edge ≈ 1.2 km vs BerlinMOD 3-10 m thresholds) |
| Open Questions | th3index removed from the "coverage gap" pending list; pinned as the reference shape for the remaining extended types (tcbuffer / tnpoint / tpose / trgeo / tpcpoint / tpcpatch) |
| Related | New "th3index reference port" subsection with the 5-PR cross-reference |

No structural change to the MEOS-WKB encoding or the existing JSON schema. h3_resolution is a purely-optional consumer hint; spec version stays at 1.0.0 (the th3index encoding has been on disk in MEOS-WKB since the th3index branch landed — this RFC just names how the metadata declares it).
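
As an illustration of how a consumer might treat the optional hint, here is a small sketch that reads `h3_resolution` from one column's metadata. The [0, 15] integer range and the th3index-only scope come from the description above. The function name, the choice to raise `ValueError`, and the decision to reject the field on other base types are assumptions made for this example.

```python
def h3_resolution_of(column_meta):
    """Return the declared H3 resolution, or None when the hint is absent."""
    res = column_meta.get("h3_resolution")
    if res is None:
        # Optional hint: absence means "unknown", not an error.
        return None
    if column_meta.get("base_type") != "th3index":
        # Assumed strictness: the field is only defined for th3index.
        raise ValueError("h3_resolution is only meaningful for th3index columns")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(res, bool) or not isinstance(res, int) or not 0 <= res <= 15:
        raise ValueError(f"h3_resolution must be an integer in [0, 15], got {res!r}")
    return res

trip_h3 = {"base_type": "th3index", "h3_resolution": 7}
resolution = h3_resolution_of(trip_h3)
```

Because the value lives in the footer, this check runs before any row is decoded, which is the point of declaring a uniform resolution per column.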

Two follow-ups to the th3index extension (commit 7bb80bf) that close
the 'Convention spec (DocBook + docs page)' open item in the Status
table and bring the beta-testing guide up to date with the
cross-platform port that landed in MobilityDB MobilityDB#807/MobilityDB#866/MobilityDB#938 +
MobilityDuck MobilityDB#129 + MobilitySpark #9.

doc/temporal_parquet.xml (new appendix)
  * File structure overview + JSON footer example
  * Type coverage table (tbool/tint/tfloat/ttext, tgeompoint/tgeogpoint,
    tgeometry/tgeography, th3index, remaining extended types, boxes,
    spans/spansets/sets) — th3index broken out as its own row
  * Type-specific optional fields section with srid / geodetic / has_z /
    h3_resolution
  * Worked example for th3index with the BerlinMOD prefilter SQL
  * Encoding versioning rules (MAJOR.MINOR)
  * Reference-implementation pointer to scripts/parquet (PR MobilityDB#831)
  * Related references (RFC MobilityDB#911, MobilityDB#912, MEOS-WKB MobilityDB#833, GeoParquet)

doc/mobilitydb-manual.xml
  * Wires &temporal_parquet; entity after &data_generator;

doc/temporal-parquet/beta-testing.md
  * PRs and branches table: adds MobilityDB#912 (Temporal Data Lake), MobilityDB#129
    (MobilityDuck th3index), #9 (MobilitySpark th3index)
  * Implementation status: th3index promoted to 'done' with PR links;
    the remaining-extended-types row updated to name th3index as
    the reference shape; tpcpoint / tpcpatch added to the list
  * New 'Test scenario: th3index cross-platform round-trip' section
    walks an export/annotate/import/Q5 recipe that produces identical
    answers on MobilityDB / MobilityDuck / MobilitySpark

No structural spec change. The DocBook chapter mirrors the RFC body
prose-for-prose so it stays in sync; the beta-testing guide updates
acknowledge the th3index port as the first extended type to ship a
cross-platform binding pair.
The DocBook chapter wired in commit 363a4b5 covers the 'DocBook'
half of the previous combined 'DocBook + docs page' status row. The
rendered docs.mobilitydb.com page lives downstream of the manual
build and does not need a separate spec change, so it stays as the
single 'open' item.

Splits the status row into three explicit pointers so a reader can
see at a glance which surface is live and which is pending.
…s under-sampling

The earlier worked example claimed the th3index prefilter at H3
resolution 7 was 'sound for BerlinMOD's 3-10 m thresholds'. The
2026-05-11 BerlinMOD ch1 + h3 prefilter benchmark contradicts this:
the prefilter drops ~81% of true 'eIntersects(trip, region.geom)'
hits, from two distinct sources:

1. Trip side — 'tgeompoint_to_th3index' samples one cell per source
   instant. Cells traversed by the trip's straight-line segment
   between consecutive instants are not visited.

2. Polygon side — 'meos/src/h3/h3_geo.c::polygon_to_cells_into' uses
   libh3's CONTAINMENT_CENTER mode. Cells that intersect the polygon
   but whose centroid is outside it are missed.

Both gaps are tracked: 'fix/th3index-srid-flags-lift' covers the three
latent th3index defects (SRID, spatial-flags, lifting); a separate
prefilter-soundness pass covers the polygon-side
'polygonToCellsExperimental' (libh3 4.2+) or 'gridDisk(c, 1)' (4.1
wrapper) and the trip-side over-sampling.
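
The trip-side gap can be reproduced on a toy example. The sketch below deliberately swaps the H3 hexagonal grid for a plain unit-square grid so it runs without libh3; none of these helpers exist in MEOS, and the dense sampling in `cells_along_segments` is only an approximation of an exact segment traversal.

```python
import math

def cell_of(point, size=1.0):
    """Index of the square grid cell containing a point (stand-in for an H3 cell)."""
    x, y = point
    return (math.floor(x / size), math.floor(y / size))

def cells_per_instant(instants, size=1.0):
    """One cell per source instant, mirroring the per-instant lift described above."""
    return {cell_of(p, size) for p in instants}

def cells_along_segments(instants, size=1.0, steps=1000):
    """Cells touched by the straight segments between consecutive instants.

    Approximated by dense sampling; a production version would walk
    the grid exactly instead.
    """
    cells = cells_per_instant(instants, size)
    for (x0, y0), (x1, y1) in zip(instants, instants[1:]):
        for k in range(steps + 1):
            t = k / steps
            cells.add(cell_of((x0 + t * (x1 - x0), y0 + t * (y1 - y0)), size))
    return cells

# Two instants three cells apart: the trip crosses cells it never samples.
trip = [(0.5, 0.5), (3.5, 0.5)]
sampled = cells_per_instant(trip)
traversed = cells_along_segments(trip)
missed = traversed - sampled
```

A region lying entirely inside one of the `missed` cells truly intersects the trip, yet a cell-equality prefilter built from `sampled` would discard it.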

This commit corrects three documents to match reality:

  doc/temporal-parquet/README.md
    Replaces the 'sound for BerlinMOD's 3-10 m thresholds' line with
    a soundness-note blockquote that explicitly names the under-
    sampling, references the pending fix chain, and notes that the
    metadata schema is unaffected — base_type / h3_resolution describe
    the bytes regardless of production-time soundness.

  doc/temporal_parquet.xml
    Mirrors the same correction in the DocBook chapter as a <note>
    block.

  doc/temporal-parquet/beta-testing.md
    Reframes the cross-platform round-trip scenario as a binary-and-
    metadata test, drops the prefilter from the example Q5 SQL, and
    adds an explicit 'Soundness note (2026-05-11)' that names both
    gaps + the prefilter-loss measurement (81% of true hits in
    BerlinMOD ch1).

No metadata-schema change; the TemporalParquet bytes + JSON footer
contract for th3index columns is unchanged. Spec version remains
1.0.0.
@estebanzimanyi
Member Author

Package + clean-up checkpoint (2026-05-11). The scope of this TemporalParquet RFC PR is now consolidated: five commits, all on rfc/temporal-parquet:

| Commit | Change |
| --- | --- |
| 443b21a6 | Original RFC body |
| 7bb80bfe | th3index worked example + h3_resolution field + Status table |
| 363a4b50 | DocBook chapter wired in doc/mobilitydb-manual.xml + beta-testing.md th3index inclusion |
| 31e35774 | Status row split (DocBook = done; docs.mobilitydb.com page = open; pinned downstream of the manual build) |
| 46afc3ee | Corrective: soundness-of-th3index-prefilter language was wrong. BerlinMOD ch1 bench shows the prefilter drops ~81% of true eIntersects hits because of under-sampling (per-instant trip lift + CONTAINMENT_CENTER polygon mode). README + DocBook chapter now carry an explicit soundness note; beta-testing.md reframes the round-trip scenario as a bytes+metadata test and drops the prefilter from the example Q5. Metadata schema unchanged: base_type / h3_resolution describe the bytes regardless of production-time soundness |

Out of scope (deferred to a future task, per the user direction): the dependency chain for the soundness fix, which is tracked separately and is not part of this PR:

  1. fix/th3index-srid-flags-lift (three latent th3index defects: SRID dispatch, spatial-flags dispatch, lifting machinery)
  2. Prefilter-soundness pass: polygon side via libh3 polygonToCellsExperimental (4.2+) or gridDisk(c, 1) wrapper (4.1); trip side via straight-line-segment cell visit
  3. Once both land, the example Q5 in beta-testing.md can re-add the WHERE everEqTh3IndexTh3Index(...) prefilter as a sound optimisation.
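
The polygon-side part of step 2 can also be sketched on a toy grid. As before, unit squares stand in for H3 cells and an axis-aligned rectangle stands in for the region polygon. `ring_pad` is a hypothetical square-grid analogue of padding each cell by one ring (in the spirit of `gridDisk(c, 1)`), not the libh3 call itself.

```python
import math

def candidate_cells(rect):
    """All unit cells whose index range overlaps rect = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return {(i, j)
            for i in range(math.floor(xmin), math.floor(xmax) + 1)
            for j in range(math.floor(ymin), math.floor(ymax) + 1)}

def cells_by_center(rect):
    """Centre-containment cover: keep a cell only if its centre is inside rect."""
    xmin, ymin, xmax, ymax = rect
    return {(i, j) for (i, j) in candidate_cells(rect)
            if xmin <= i + 0.5 <= xmax and ymin <= j + 0.5 <= ymax}

def ring_pad(cells):
    """Add the eight neighbours of every cell (one-ring dilation)."""
    return {(i + di, j + dj)
            for (i, j) in cells
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)}

region = (0.6, 0.4, 2.2, 1.2)
by_center = cells_by_center(region)   # centre test keeps a single cell
by_overlap = candidate_cells(region)  # six cells actually overlap the region
padded = ring_pad(by_center)
```

In this example the padded set covers every overlapping cell, at the cost of a few false candidates. That is not a general guarantee: a region small enough to contain no cell centre yields an empty centre cover, and padding an empty set stays empty, which is one reason an overlap-aware cover may be the more robust route.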

@estebanzimanyi
Member Author

The spec content now ships in the consolidated Temporal Data Lake RFC PR #912 alongside the umbrella architecture document. The four layers (MEOS-WKB wire format / TemporalParquet file format / MEOS-API function registry / portable-SQL dialect) are presented as one integration story.

@estebanzimanyi estebanzimanyi deleted the rfc/temporal-parquet branch May 12, 2026 06:13