
fix error for initializing Geos #51

Closed
ngochoawindy wants to merge 1 commit into main from geos-init-handle

Conversation

@ngochoawindy
Collaborator

This PR adds mutex serialization for MobilityDuck scalar function bodies that ultimately call MEOS / liblwgeom. MEOS internally uses GEOS's legacy global API (initGEOS / finishGEOS) from worker threads; concurrent legacy GEOS init/teardown under DuckDB's parallelism triggers intermittent GEOS errors and allocator corruption when the spatial and MobilityDuck extensions coexist.
Wrapping each MEOS-backed ScalarFunction execution in the mutex prevents overlapping legacy MEOS/GEOS init/teardown pairs across concurrent pipelines while preserving parallelism elsewhere.

@ngochoawindy
Collaborator Author

@estebanzimanyi could you help me review this PR and fix the error?
In this PR I tried to fix the error that occurs when running the query:
select tgeompointSeq(list(tgeompoint(ST_Transform(geom_4326, 'EPSG:4326', 'EPSG:25832', always_xy := true), t) ORDER BY t))
Sometimes this query throws an 'Uninitialize Geos' error, sometimes a malloc error, and sometimes it works.

@estebanzimanyi
Member

Thanks Ngoc for the detailed report. The mutex approach in this PR should in principle serialize all MobilityDuck–MEOS calls against each other, which prevents concurrent GEOS legacy-API (initGEOS/finishGEOS) calls originating from within MobilityDuck itself.

However, your specific query mixes two extensions in the same pipeline:

tgeompointSeq(list(tgeompoint(ST_Transform(geom_4326, 'EPSG:4326', 'EPSG:25832', always_xy := true), t) ORDER BY t))

ST_Transform is from DuckDB's spatial extension — it does not go through our mutex. Even though coordinate transformation uses PROJ (not GEOS) for the actual math, the spatial extension does initialize its own GEOS handle at load time. In some GEOS versions the legacy global-API (initGEOS) and the handle-based API (initGEOS_r) share internal state, which means a race can still occur between the spatial extension's GEOS instance and MEOS's.

To narrow down the root cause, could you try:

  1. Remove ST_Transform — test with a geometry already in the target SRID to see if the error disappears without the spatial extension in the same pipeline.
  2. Disable DuckDB parallelism — add SET threads=1; before the query. If it stops failing, it is definitely a concurrency issue.
  3. Share the full error stack trace — the exact message ("Uninitialize Geos" vs. "malloc error") and any accompanying backtrace would help pinpoint which GEOS/MEOS code path is failing.

If the error vanishes when ST_Transform is removed (or when threads=1), the remaining fix is to ensure the spatial extension and MobilityDuck share one GEOS context — which requires either upgrading to GEOS's thread-safe handle API within MEOS, or using a process-wide GEOS init guard shared across both extensions. The correct long-term solution is the per-thread MEOS state work tracked in issue #404.

@ngochoawindy
Collaborator Author

  1. Removing ST_Transform: there is still an error, now a segmentation fault.
  2. Setting threads = 1 works both with and without ST_Transform.
  3. Without threads = 1, the error is now a segmentation fault:
SELECT
        mmsi,
        tgeompointSeq(list(tgeompoint(ST_Transform(geom_4326, 'EPSG:4326', 'EPSG:25832', always_xy := true), t) ORDER BY t)) AS trip
      FROM AISInputTarget
      GROUP BY mmsi;
zsh: segmentation fault  ./build/release/duckdb 
SELECT
        mmsi,
        tgeompointSeq(list(tgeompoint(geom_4326, t) ORDER BY t)) AS trip
      FROM AISInputTarget
      GROUP BY mmsi;
zsh: segmentation fault  ./build/release/duckdb 

@estebanzimanyi
Member

Thank you for investigating this! The GEOS initialization segfault turned out to have a deeper root cause that the mutex approach doesn't fully address.

Root cause

MEOS PR #815 made all mutable MEOS state thread-local (MEOS_TLS): the PROJ context (PJ_CONTEXT), GSL RNGs (MEOS_GENERATION_RNG, MEOS_AGGREGATION_RNG), the error code (MEOS_ERR_NO), and the PROJ/Ways caches. The error handler (MEOS_ERROR_HANDLER) is the only piece that is process-global (written with __ATOMIC_RELEASE).

The initialization in mobilityduck_extension.cpp used std::call_once, which fires exactly once — on whichever thread first touches the extension. DuckDB's other worker threads never call meos_initialize(). With MEOS's TLS design, those threads hit NULL TLS pointers whenever they call any MEOS function → segfault.

Why the mutex doesn't fix it

The global std::mutex in this PR serializes every MEOS call through one thread, which has two problems:

  1. It destroys DuckDB's parallel execution (the whole point of a columnar engine).
  2. It still doesn't call meos_initialize() on the worker threads — the mutex only ensures exclusion, not initialization.

The correct fix

Each DuckDB worker thread must call meos_initialize() before touching any MEOS function. Branch fix/per-thread-meos-init replaces the mutex with a thread_local bool guard:

inline void EnsureMeosInitializedOnThread() {
    // One flag per thread: each DuckDB worker initializes MEOS on first use.
    thread_local bool initialized = false;
    if (!initialized) {
        meos_initialize();
        meos_initialize_error_handler(&MobilityduckMeosErrorHandler);
        initialized = true;
    }
}

RegisterSerializedScalarFunction wraps every scalar function with a call to EnsureMeosInitializedOnThread() before dispatching — cost is one bool check per call, negligible.

Verified: tgeompointSeq(list(TGEOMPOINT(geom, ts) ORDER BY ts)) with SET threads = 4, 2000 rows / 50 vessels — previously segfaulted, now returns correct results with full parallelism.

Superseding this PR with fix/per-thread-meos-init. Thank you again for surfacing the issue!

@estebanzimanyi
Member

Closing in favour of branch fix/per-thread-meos-init; see the comment above for the full analysis.

estebanzimanyi added a commit that referenced this pull request May 10, 2026
Replaces the process-global `std::mutex` (from the now-closed PR #51)
that wrapped every MEOS scalar-function call with a per-thread
`thread_local bool` guard in `EnsureMeosInitializedOnThread()`.

Each DuckDB worker thread calls `meos_initialize()` and
`meos_initialize_error_handler()` exactly once on first use; cost is
one `bool` check per scalar-function call thereafter — negligible.
Full DuckDB parallelism is preserved.

Bundles the related refinements (3 commits squashed into 1):

1. **Per-thread MEOS init** (was c237f6c)
   The mutex → thread_local swap.

2. **Cover cast functions** (was 55d4c5a)
   Cast functions bypass `RegisterSerializedScalarFunction` and would
   otherwise leave MEOS uninitialised when a cast is the first MEOS
   call (e.g. `SELECT stbox 'STBOX XT(...)'` before any scalar
   function).  Initialise on the loading thread to cover them.

3. **TZ-neutral test assertions** (was 9dd765a)
   Make `stbox.test` timestamp assertions timezone-neutral so the test
   passes under both local and CI session timezones.

## Root cause

MEOS [PR #815](MobilityDB/MobilityDB#815) made
all mutable MEOS state thread-local (`MEOS_TLS`).  Threads that hadn't
called `meos_initialize()` had NULL TLS pointers and segfaulted.  The
mutex was the conservative workaround; per-thread init is the correct
fix.

Closes #404.