
Lost-update race in _refresh_cache_after_resolution (read-path side effect) #74

@ZmeiGorynych

Description

Problem

_refresh_cache_after_resolution in slayer/engine/query_engine.py writes the entire previously-loaded model back to storage when refreshing the query-backed cache. Under concurrent load this is a lost-update race:

Time  Request A (execute "qb")           Request B (edit_model "qb")
---- --------------------------------    ----------------------------------
T0   storage.get_model("qb")  → vA
T1                                       storage.get_model("qb") → vA
T2                                       update description / joins …
T3                                       storage.save_model(vA')   ← persists vA'
T4   _query_as_model(...)
T5   _refresh_cache_after_resolution
       model_copy(update={
         columns, backing_query_sql,
         data_source})
       storage.save_model(vA + cache)    ← OVERWRITES B's update
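The interleaving above can be reproduced deterministically with a toy in-memory store and two staggered asyncio tasks (a sketch only; the dict-backed storage and all names here are illustrative stand-ins, not the real StorageBackend API):

```python
# Deterministic reproduction of the lost-update race with a toy store.
import asyncio
import copy

storage = {"qb": {"description": "old", "columns": []}}

async def request_a_execute():
    # T0: read the model
    model = copy.deepcopy(storage["qb"])
    # T4: enrichment / SQL-gen work happens while B commits its edit
    await asyncio.sleep(0.02)
    # T5: cache refresh writes the WHOLE stale model back
    model["columns"] = ["id", "name"]  # the cache fields
    storage["qb"] = model              # overwrites B's update

async def request_b_edit():
    await asyncio.sleep(0.01)
    model = copy.deepcopy(storage["qb"])
    model["description"] = "new"       # user edit at T2
    storage["qb"] = model              # persisted at T3

async def main():
    await asyncio.gather(request_a_execute(), request_b_edit())

asyncio.run(main())
print(storage["qb"]["description"])  # "old": B's edit is silently undone
```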

Hit on query_engine.py:641 (the _expand_query_backed_model cache-refresh branch added in PR #67 Batch 3) and query_engine.py:933 (the _refresh_cache_after_resolution save itself).

This is a read-path side effect — engine.execute / inspect_model / get_column_types all trigger it transparently. Easy to overlook because the in-PR tests are single-threaded.

Why this is real

  • StorageBackend.save_model takes a full SlayerModel and overwrites the file (YAML) or row (SQLite) atomically. There is no compare-and-swap or partial-update primitive.
  • The model_copy at the call site preserves vA's description, joins, filters, query_variables, meta, etc. — every field other than the three cache fields. Any concurrent edit to any of those is silently undone.
  • Multi-tenant or multi-user deployments (the obvious target for an API server) will see this.
  • The window is wide: enrichment + SQL-gen + a possible second canonical render (Batch 10 fix) all happen between the read and the write.

Why we couldn't find it in unit tests

Tests are single-threaded and use :memory: SQLite. The race needs two concurrent requests sharing the same engine + storage; nothing in the existing suite exercises that.

Proposed approach

Two layers, roughly independent:

1. Add partial-update / patch semantics to StorageBackend

Extend the ABC with:

```python
async def patch_model(
    self,
    name: str,
    *,
    expected_version: Optional[int] = None,  # optional CAS
    updates: Dict[str, Any],
) -> bool:
    """Apply a sparse update to the named model. Returns False if the model
    was concurrently mutated (when expected_version is set and mismatched).
    """
```

Implement in YAMLStorage (read-modify-write inside a file lock) and SQLiteStorage (row-level UPDATE with optional WHERE on a version column or content hash). The version field already exists on SlayerModel (version: int = 2) but it tracks the schema, not the row revision — add a separate revision: int column / field for CAS.
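The SQLite side could look roughly like this (a synchronous sketch for clarity; the `models` table with a JSON `body` column and a `revision` column is an assumed schema, not the real SQLiteStorage layout):

```python
# CAS-guarded sparse update: merge `updates` into the stored body and bump
# the revision, but only if the revision still matches what the caller read.
import json
import sqlite3
from typing import Any, Dict, Optional

def patch_model(
    conn: sqlite3.Connection,
    name: str,
    *,
    expected_revision: Optional[int] = None,
    updates: Dict[str, Any],
) -> bool:
    row = conn.execute(
        "SELECT body, revision FROM models WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False
    body, revision = json.loads(row[0]), row[1]
    if expected_revision is not None and revision != expected_revision:
        return False  # concurrent writer won; caller drops the refresh
    body.update(updates)
    # The WHERE revision = ? guard makes the write itself a compare-and-swap.
    cur = conn.execute(
        "UPDATE models SET body = ?, revision = revision + 1 "
        "WHERE name = ? AND revision = ?",
        (json.dumps(body), name, revision),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, body TEXT, revision INTEGER)")
conn.execute("INSERT INTO models VALUES ('qb', '{\"description\": \"old\"}', 0)")
ok = patch_model(conn, "qb", expected_revision=0, updates={"columns": ["id"]})
stale = patch_model(conn, "qb", expected_revision=0, updates={"columns": ["x"]})
print(ok, stale)  # True False
```

The YAML backend would do the same read-merge-write sequence under a file lock instead of relying on a row-level WHERE clause.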

2. Use patch_model for all cache refreshes

_refresh_cache_after_resolution switches to patch_model(name, updates={"columns": ..., "backing_query_sql": ..., "data_source": ...}). If CAS fails (someone updated the model since we read it), drop the refresh — the next execute will retry and the cache will eventually converge.

Alternative (lower-cost short-term)

If full CAS is too much for this round: stop refreshing the cache from the read path. Move all cache writes into the explicit save path (engine.save_model), and have inspect_model / get_column_types / execute use whatever backing_query_sql they find at read time without writing back. Cache becomes "warmed by save, never updated by read" — slightly staler in some cases (a backing-query rewrite outside engine.save_model won't be picked up until the next save) but races disappear.
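The split could be sketched as follows (all class and method names are illustrative; the real engine surface is larger):

```python
# "Warmed by save, never updated by read": cache fields are written only on
# the explicit save path, so the read path has no write-back and no race.
class ToyStorage:
    def __init__(self):
        self._models = {}
    def get(self, name):
        return dict(self._models[name])  # copy: callers can't mutate the store
    def put(self, name, model):
        self._models[name] = dict(model)

class Engine:
    def __init__(self, storage):
        self.storage = storage

    def save_model(self, name, model):
        # Write path: the ONLY place cache fields are refreshed.
        model = dict(model)
        model["backing_query_sql"] = self._render_sql(model)
        self.storage.put(name, model)

    def execute(self, name):
        # Read path: use whatever cache is there; never write back.
        model = self.storage.get(name)
        return model.get("backing_query_sql")

    def _render_sql(self, model):
        # Stand-in for the real SQL generation.
        return f"SELECT * FROM {model['source']}"

engine = Engine(ToyStorage())
engine.save_model("qb", {"source": "events"})
print(engine.execute("qb"))  # SELECT * FROM events
```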

I'd start with the alternative (small, safe, ships in a single PR) and reserve the full CAS work for when we have a real concurrency story (e.g., when the API server gets behind a process pool).

Acceptance criteria

Whichever path:

  • A regression test that fires two concurrent engine.execute calls (or one execute + one engine.save_model) on the same query-backed model and asserts the post-state preserves both writes.
  • Documented behavior in docs/concepts/models.md ("Cache may be stale until the next save" or "Cache refresh is CAS — concurrent writers retry").
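The regression test could take roughly this shape (a sketch against a toy lock-guarded patch-based store; `PatchStore` and its methods are illustrative, not the real StorageBackend):

```python
# Two concurrent writers against one model; assert BOTH writes survive.
import asyncio

class PatchStore:
    """Toy store with sparse patch semantics."""
    def __init__(self):
        self._models = {"qb": {"description": "old", "columns": []}}
        self._lock = asyncio.Lock()

    async def patch_model(self, name, updates):
        async with self._lock:
            # Sparse merge: fields not in `updates` are never rewritten.
            self._models[name].update(updates)

    def get(self, name):
        return self._models[name]

store = PatchStore()

async def main():
    await asyncio.gather(
        store.patch_model("qb", {"columns": ["id", "name"]}),  # A: cache refresh
        store.patch_model("qb", {"description": "new"}),       # B: user edit
    )

asyncio.run(main())
model = store.get("qb")
assert model["description"] == "new"       # B's edit preserved
assert model["columns"] == ["id", "name"]  # A's refresh preserved
```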
