docs(iceberg): readiness doc for TemporalParquet + Apache Polaris catalog #179

Open
estebanzimanyi wants to merge 9 commits into MobilityDB:main from estebanzimanyi:feat/iceberg-polaris-readiness

@estebanzimanyi
Member

Summary

Stacks on #158 (the edge-to-cloud quickstart substrate). Adds docs/iceberg-polaris.md documenting how MobilityDuck's TemporalParquet shards interoperate with Apache Iceberg via Apache Polaris as the catalog server.

Content

| Section | What it covers |
| --- | --- |
| Why Polaris | Why Polaris over alternatives (RBAC, OAuth2, credential vending, protocol-vanilla, Quarkus/Postgres operability) |
| TemporalParquet + Iceberg composition | Iceberg/Polaris see TemporalParquet shards as opaque Parquet; bbox sidecars feed Iceberg column stats for free manifest-level pruning |
| Per-engine integration | MobilityDuck temporal_iceberg_scan plan (~1 PR), MobilitySpark zero-cost (native Iceberg runtime), PyMEOS via pl.scan_iceberg |
| Snapshot-time vs valid-time orthogonality | The two time axes never interfere |
| OAuth2 + credential vending | Pattern for short-lived storage credentials per request; engines never hold static cloud secrets |
| Deployment recipe | Dev (Docker compose) + production (Quarkus cluster + Postgres + cloud storage) |
| Status + future work | What works today vs. follow-up PRs |
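
The manifest-level pruning mentioned in the table can be sketched as follows. This is a minimal illustration, assuming each data file carries per-column lower and upper bounds (Iceberg manifests record such bounds) derived from the bbox sidecar. The `FileStats` class, its field names, and the shard file names are hypothetical; they are not part of MobilityDuck or of any Iceberg API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file bounds, as a bbox sidecar might supply them."""
    path: str
    xmin: float
    xmax: float
    tmin: int  # epoch seconds
    tmax: int  # epoch seconds


def overlaps(lo_a, hi_a, lo_b, hi_b) -> bool:
    """True when the closed intervals [lo_a, hi_a] and [lo_b, hi_b] intersect."""
    return lo_a <= hi_b and lo_b <= hi_a


def prune(files: List[FileStats], xmin, xmax, tmin, tmax) -> List[str]:
    """Keep only the files whose bounds can intersect the query box.

    A file is dropped as soon as one dimension cannot overlap, so the
    opaque temporal payload of a pruned shard is never opened.
    """
    return [
        f.path
        for f in files
        if overlaps(f.xmin, f.xmax, xmin, xmax)
        and overlaps(f.tmin, f.tmax, tmin, tmax)
    ]


shards = [
    FileStats("shard-a.parquet", 0.0, 5.0, 0, 100),
    FileStats("shard-b.parquet", 10.0, 15.0, 0, 100),
    FileStats("shard-c.parquet", 0.0, 5.0, 200, 300),
]
selected = prune(shards, xmin=4.0, xmax=6.0, tmin=50, tmax=150)
```

Here only `shard-a.parquet` survives: `shard-b` misses on the x range and `shard-c` misses on the time range. A real planner would also track y bounds; they are omitted to keep the sketch short.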

Why stack on #158

The substrate (temporalFooter KV metadata + bbox sidecars + tgeogpoint SRID/geodetic fix) is the foundation Iceberg/Polaris compose with. Stacking here keeps the readiness doc adjacent to the substrate it depends on, so a reviewer reading #158 sees the Polaris readiness in the same PR train.

Branch base

feat/edge-to-cloud-quickstart-rebased (= #158's branch). Adds one commit; no submodule churn.

What this doc is NOT

This is not a temporal_iceberg_scan UDF implementation — that's a future ~1-PR delegator. It's the design + deployment recipe document that lands ahead of the implementation PR so reviewers have the design context before the UDF arrives.

Local verification

$ wc -l docs/iceberg-polaris.md
156 docs/iceberg-polaris.md

Doc-only PR; no build verification needed.

Related

…eogpoint tests

Adds the user-facing entry point for the edge-to-cloud pipeline plus a
documentation/testing surface around it:

- examples/quickstart/quickstart.sql — 5-vessel synthetic ingest →
  Parquet → Spark-readable layer.
- examples/generic-ingest/generic_ingest.sql — generic CSV → tgeompoint
  ingest template.
- temporalFooter() — builds the TemporalParquet KV_METADATA JSON to
  embed at write time (tgeompoint encoding manifest, schema version,
  timestamp range, SRID).
- SRID/geodetic mixin fix in tgeompoint_functions.cpp — preserves the
  source SRID through cast/transform paths and forces geodetic when
  the input is geographic.
- docs/beta-testing-edge-to-cloud.md — beta-testing guide.
- docs/testing-tz-neutral-policy.md — ecosystem-wide timezone-neutral
  test policy (canonical reference for MobilityDB / MobilityDuck /
  PyMEOS / JMEOS / meos-rs).
- docs/tgeogpoint-design.md — design notes.
- test/sql/tgeogpoint.test, test/sql/parquet/temporal_parquet.test —
  test coverage for the new pieces.
- .gitignore + CMakeLists.txt + extension load — wiring.
`meosType` (lower-case) is the **pre-consolidation** MEOS type name;
`MeosType` (upper-case) is the **post-consolidation** target that the
upstream rename sweep has not yet reached.  The current vcpkg pin
(`vcpkg_ports/meos/portfile.cmake` REF f11b7443ee98…) is still
pre-consolidation: `meos/include/temporal/meos_catalog.h` line 121
declares the typedef as `} meosType;` and every MEOS API uses the
lower-case spelling.  MobilityDuck's source code consistently uses
`meosType` to match — `grep -rn '\bMeosType\b' src/` finds the name
only on the alias line and its comment, nowhere else.

c8cad6d added `using meosType = MeosType;` as a forward-looking
bridge for the eventual consolidation bump.  That bridge points at
`MeosType`, which the current pin does NOT yet expose, so it
breaks every PR's Linux arm64 build with:

  /duckdb_build_dir/src/include/tydef.hpp:18:18:
    error: ‘MeosType’ does not name a type; did you mean ‘meosType’?

The fix is to drop the premature alias and replace the misleading
comment with one that documents the pre/post-consolidation distinction
and the resume path for the next pin bump — at that point a reviewer
can either restore the bridge (this time it'll be valid because
`MeosType` will exist) or sweep the MobilityDuck source from
`meosType` to `MeosType` in a single PR.

Unblocks every in-flight PR's Linux arm64 build: MobilityDB#126, MobilityDB#130, MobilityDB#149,
MobilityDB#158, MobilityDB#159, MobilityDB#160, plus the entire `feat/*_port_core` extended-type
stack (MobilityDB#148/MobilityDB#150/MobilityDB#151/MobilityDB#153/MobilityDB#155/MobilityDB#156).
…MobilityDB#136)

Cherry-picked from open PR MobilityDB#136 (commit 9e1d7a6) so this PR's CI goes
green before MobilityDB#136 lands. When MobilityDB#136 reaches main, the rebase will
collapse this commit to a no-op and it will drop out.

--- original commit body ---
Pre-stage icu extension for amd64 docker tests

LoadInternal calls ExtensionHelper::AutoLoadExtension(db, "icu") so the
Europe/Brussels timezone option is honoured. Inside the linux_amd64 test
docker container there is no network egress and the local extension
directory is empty, so the autoload fails. Copy the icu.duckdb_extension
that was just built locally (declared in extension_config.cmake) into the
expected path before running the unittester.
…en PR MobilityDB#140)

On macOS LP64 and Wasm/emscripten, int64 (long) and int64_t (long long) are
the same width but distinct types, so clang rejects passing bigint_to_set
where a Set *(*)(int64_t) is expected as a non-type template arg. Cherry-
picked from open PR MobilityDB#140 (a8b1755) so this PR goes green on osx_amd64,
osx_arm64, and wasm_mvp before MobilityDB#140 lands. The cast is a no-op on Linux,
where int64 and int64_t are both long. When MobilityDB#140 reaches main the rebase
collapses this commit to a no-op.
…alog

Adds docs/iceberg-polaris.md documenting how MobilityDuck's
TemporalParquet substrate (this PR stack's edge-to-cloud quickstart)
interoperates with Apache Iceberg via Apache Polaris as the catalog
server.

Covered:
  - Why Polaris (RBAC, OAuth2, credential vending, protocol-vanilla)
  - How TemporalParquet shards compose with Iceberg (opaque BLOB column,
    bbox sidecars feed Iceberg column stats for free manifest pruning)
  - Per-engine integration (DuckDB temporal_iceberg_scan plan,
    MobilitySpark zero-cost via native runtime, PyMEOS via pl.scan_iceberg)
  - Snapshot-time vs valid-time orthogonality
  - OAuth2 + credential vending pattern
  - Deployment recipe (dev compose + production cluster + storage)

The doc is a readiness document — the substrate works today with
vanilla Iceberg; Polaris adds the production-controls layer. It
records the recommendation previously captured in memory and ships
as part of the substrate PR.
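
The snapshot-time vs valid-time orthogonality listed above can be illustrated with a toy model. It assumes Iceberg-style time travel that selects the latest snapshot committed at or before a given instant; the `Snapshot` class and the sample rows are hypothetical and only stand in for real table versions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, int]  # (vessel id, valid-time instant in epoch seconds)


@dataclass(frozen=True)
class Snapshot:
    committed_at: int  # snapshot-time: when this table version was written
    rows: List[Row]


def snapshot_as_of(history: List[Snapshot], as_of: int) -> Snapshot:
    """Snapshot-time axis: latest version committed at or before `as_of`."""
    eligible = [s for s in history if s.committed_at <= as_of]
    if not eligible:
        raise LookupError("no snapshot at or before the requested time")
    return max(eligible, key=lambda s: s.committed_at)


def valid_between(rows: List[Row], start: int, end: int) -> List[Row]:
    """Valid-time axis: observations whose own timestamp lies in [start, end]."""
    return [r for r in rows if start <= r[1] <= end]


history = [
    Snapshot(committed_at=1000, rows=[("v1", 10), ("v1", 20)]),
    Snapshot(committed_at=2000, rows=[("v1", 10), ("v1", 20), ("v2", 15)]),
]

# Same valid-time window, two different table versions.
old_view = valid_between(snapshot_as_of(history, 1500).rows, 12, 25)
new_view = valid_between(snapshot_as_of(history, 2500).rows, 12, 25)
```

The valid-time filter is identical in both calls; only the chosen table version differs, which is the sense in which the two axes do not interfere.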
estebanzimanyi and others added 3 commits May 21, 2026 11:43
…ion init)

MobilityDuck initializes MEOS with `meos_initialize_timezone("Europe/Brussels")`
in the extension entry point.  tgeogpoint.test had four assertions hardcoded
to `+00` (UTC) instead of `+01` (Brussels winter), so they failed on every CI
runner.

The four affected assertions are all winter-date constructors that go
through `to_timestamp(unix_seconds)` — those parse as UTC then display in
the extension's runtime TZ (Brussels = UTC+1 in January, no DST).  Updates:

  - asText(TGEOGPOINT(...))              -> POINT(...)@2000-01-01 01:00:00+01
  - asEWKT(TGEOGPOINT(...))              -> SRID=4326;POINT(...)@... 01:00:00+01
  - asText(tgeogpointSeq(ARRAY[..., ...])) -> [POINT(...)@... 01:00:00+01, ...]

A fifth literal-constructor case (`tgeogpoint 'SRID=4326;Point(1 2)@2000-01-01'`)
uses the EWKT parser, which interprets the literal date as a Brussels-TZ
local time, so it stays at `00:00:00+01` (not `01:00:00+01`).

The header comment now documents the Brussels TZ assumption + cross-refs
docs/testing-tz-neutral-policy.md so future test authors don't repeat
this.
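
The difference between the two constructor paths can be reproduced outside the extension. The sketch below uses Python's standard `zoneinfo` module rather than MEOS. The `render` helper is hypothetical, handles whole-hour offsets only, and imitates the `+01` suffix seen in the test output. If the machine has no tz database, the code falls back to a fixed UTC+1 offset, which matches Brussels only in winter.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

try:
    BRUSSELS = ZoneInfo("Europe/Brussels")
except ZoneInfoNotFoundError:
    # No tz database available: fixed UTC+1 stand-in (winter only).
    BRUSSELS = timezone(timedelta(hours=1))


def render(dt: datetime) -> str:
    """Format as 'YYYY-MM-DD HH:MM:SS+HH' (whole-hour offsets only)."""
    hours = int(dt.utcoffset().total_seconds() // 3600)
    return f"{dt:%Y-%m-%d %H:%M:%S}{hours:+03d}"


def from_epoch(unix_seconds: int) -> str:
    """Epoch path: the instant is fixed in UTC, then displayed in Brussels."""
    return render(datetime.fromtimestamp(unix_seconds, tz=BRUSSELS))


def from_local_literal(year: int, month: int, day: int) -> str:
    """Literal path: the date is read as Brussels local midnight."""
    return render(datetime(year, month, day, tzinfo=BRUSSELS))


EPOCH_2000 = 946684800  # 2000-01-01 00:00:00 UTC

epoch_text = from_epoch(EPOCH_2000)            # 2000-01-01 01:00:00+01
literal_text = from_local_literal(2000, 1, 1)  # 2000-01-01 00:00:00+01
```

The two strings denote instants one hour apart, which is why the epoch-based assertions moved to `01:00:00+01` while the literal case kept `00:00:00+01`.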
…spatch

DuckDB function resolution treats GEOMETRY, TGEOMPOINT, TGEOGPOINT,
TGEOMETRY, TGEOGRAPHY as alias-equivalent because each is a
LogicalType::BLOB with an alias label. The constant-folder routes
eIntersects(GEOMETRY, TGEOGPOINT) into the {TGEOMPOINT, GEOMETRY}
overload of TgeoGeoIntExec from src/geo/tgeometry_ops.cpp (registered
earlier). That overload assumes args.data[0] is the Temporal and feeds
the GEOMETRY blob to tspatial_srid, which hits ensure_tspatial_type
inside MEOS with 'must be a spatiotemporal value' before any of the
correctly direction-aware *_geo_tgeo executors get a chance to run.

Add BlobLooksLikeTemporal to geo_util.hpp: a pure-predicate probe
that reads byte 4 of the blob (Temporal's temptype field) and runs
it through tspatial_type — true only for T_TGEOMPOINT / T_TGEOGPOINT
/ T_TGEOMETRY / T_TGEOGRAPHY / T_TRGEOMETRY. In a DuckDB GEOMETRY blob,
byte 4 lies in the WKB header and never matches one of those
MeosType enum values.

Use the probe at the top of TgeoGeoIntExec in all three *_ops.cpp
TUs to detect mis-ordered args and silently swap roles before
decoding. Also extend each executor with the geodetic-aware
geom_to_geog conversion (matching the existing pattern in
TgeompointFunctions::Eintersects_*) so the swapped-direction call
still respects MEOS_FLAGS_GET_GEODETIC.

Also fixes a unique_ptr<BinsBindData> -> unique_ptr<FunctionData>
build break in span_table_functions.cpp's BinsBindData::Copy() that
otherwise blocks the substrate fix landing alongside the test.

Locally green: test/sql/* full suite, 60/60 test cases, 1346/1346
assertions (was 1335/1336 with the failing line-56 eIntersects).
gdb-confirmed: BlobLooksLikeTemporal probe returns a_is_temp=0
b_is_temp=1 for the (GEOMETRY, TGEOGPOINT) case, swap routes
TGEOGPOINT to the Temporal path.
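
The probe-and-swap logic can be sketched in a few lines. This is a Python illustration of the idea, not the C++ code in geo_util.hpp. The type codes in `SPATIOTEMPORAL_CODES` are hypothetical placeholders (the real values come from the MEOS type enum), the 0-based reading of "byte 4" is an assumption, and the sample blobs are made up.

```python
from typing import Tuple

# Hypothetical stand-ins for the spatiotemporal type codes; the real
# values come from the MEOS type enum and are not reproduced here.
SPATIOTEMPORAL_CODES = frozenset({0x70, 0x71, 0x72, 0x73, 0x74})
TEMPTYPE_OFFSET = 4  # assumed 0-based offset of the temptype byte


def blob_looks_like_temporal(blob: bytes) -> bool:
    """Pure predicate: inspect one byte, never decode the blob."""
    return (
        len(blob) > TEMPTYPE_OFFSET
        and blob[TEMPTYPE_OFFSET] in SPATIOTEMPORAL_CODES
    )


def order_args(a: bytes, b: bytes) -> Tuple[bytes, bytes]:
    """Return (temporal, geometry), swapping when the roles arrive reversed."""
    a_is_temp = blob_looks_like_temporal(a)
    b_is_temp = blob_looks_like_temporal(b)
    if b_is_temp and not a_is_temp:
        return b, a
    return a, b


# Made-up blobs: only the byte at TEMPTYPE_OFFSET matters to the probe.
fake_geometry = bytes([0x01, 0x01, 0x00, 0x00, 0x00, 0xAA])
fake_temporal = bytes([0x00, 0x00, 0x00, 0x00, 0x70, 0xBB])

temporal, geometry = order_args(fake_geometry, fake_temporal)
```

With the arguments passed as (geometry, temporal), the probe reports `a_is_temp` false and `b_is_temp` true, so the pair is swapped before decoding. This mirrors the gdb observation quoted in the commit message.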
The stage_icu helper mapped only the Linux uname values, so on the
macOS arm64 test runner uname -m returned "arm64" and the icu
extension was copied to .duckdb/extensions/v1.4.4/arm64 instead of
.../osx_arm64, where DuckDB's autoload looks. The hub fallback is not
reliably resolvable on that runner, so the osx_arm64 Test step failed
to load the extension. Map the OS and architecture to the DuckDB
platform string (linux_amd64, linux_arm64, osx_amd64, osx_arm64) so
the locally built icu is staged at the path autoload expects on every
tested platform; the Linux mapping is unchanged.
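
The mapping described above can be expressed as a small lookup. The sketch below is a Python rendering of the idea rather than the actual stage_icu helper. The function names and the `base` argument are hypothetical, and the `uname` values shown are the usual ones reported on Linux and macOS.

```python
from pathlib import PurePosixPath

# `uname -s` and `uname -m` values mapped to DuckDB platform components.
_OS = {"Linux": "linux", "Darwin": "osx"}
_ARCH = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def duckdb_platform(system: str, machine: str) -> str:
    """Combine OS and architecture into a string such as 'osx_arm64'."""
    try:
        return f"{_OS[system]}_{_ARCH[machine]}"
    except KeyError as exc:
        raise ValueError(f"unsupported platform: {system}/{machine}") from exc


def staged_icu_path(base: str, version: str, system: str, machine: str) -> PurePosixPath:
    """Directory layout assumed from the commit message:
    <base>/.duckdb/extensions/<version>/<platform>/icu.duckdb_extension
    """
    return (
        PurePosixPath(base)
        / ".duckdb"
        / "extensions"
        / version
        / duckdb_platform(system, machine)
        / "icu.duckdb_extension"
    )


mac_path = staged_icu_path("/home/ci", "v1.4.4", "Darwin", "arm64")
```

Keying on both OS and architecture is what separates `osx_arm64` from `linux_arm64`; mapping the architecture alone is how the file ended up under a bare `arm64` directory.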

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>