Problem
_refresh_cache_after_resolution in slayer/engine/query_engine.py writes the entire previously-loaded model back to storage when refreshing the query-backed cache. Under concurrent load this is a lost-update race:
```
Time  Request A (execute "qb")               Request B (edit_model "qb")
----  -------------------------------------  --------------------------------------
T0    storage.get_model("qb") → vA
T1                                           storage.get_model("qb") → vA
T2                                           update description / joins …
T3                                           storage.save_model(vA') ← persists vA'
T4    _query_as_model(...)
T5    _refresh_cache_after_resolution
        model_copy(update={columns,
          backing_query_sql, data_source})
        storage.save_model(vA + cache)       ← OVERWRITES B's update
```
The offending writes are at query_engine.py:641 (the _expand_query_backed_model cache-refresh branch added in PR #67 Batch 3) and query_engine.py:933 (the _refresh_cache_after_resolution save itself).
This is a read-path side effect — engine.execute / inspect_model / get_column_types all trigger it transparently. Easy to overlook because the in-PR tests are single-threaded.
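The interleaving above can be reproduced deterministically with a toy in-memory stand-in for the storage backend. Model, MemoryStorage, and the coroutine names here are illustrative, not the project's API; the point is the full-model write-back inside the read path:

```python
import asyncio
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Model:
    description: str = "old"
    columns: tuple = ()

class MemoryStorage:
    def __init__(self):
        self._models = {"qb": Model()}

    async def get_model(self, name):
        return self._models[name]

    async def save_model(self, name, model):
        self._models[name] = model

async def edit_model(storage):
    # Request B: a user edit read-modify-writes the whole model
    m = await storage.get_model("qb")
    await storage.save_model("qb", replace(m, description="new"))

async def execute(storage):
    # Request A: read-path cache refresh
    m = await storage.get_model("qb")  # reads vA
    await asyncio.sleep(0)             # enrichment / SQL-gen window; B runs here
    # full-model write-back: persists vA + cache, clobbering B's save
    await storage.save_model("qb", replace(m, columns=("a", "b")))

async def demo():
    storage = MemoryStorage()
    await asyncio.gather(execute(storage), edit_model(storage))
    return await storage.get_model("qb")

final = asyncio.run(demo())
print(final.description)  # "old": B's edit was silently undone
```

The `asyncio.sleep(0)` makes the race deterministic for the demo; in the real engine the same window is the enrichment and SQL-generation work between read and write.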
Why this is real
StorageBackend.save_model takes a full SlayerModel and overwrites the file (YAML) or row (SQLite) atomically. There is no compare-and-swap or partial-update primitive.
- The model_copy at the call site preserves vA's description, joins, filters, query_variables, meta, etc. — every field other than the three cache fields. Any concurrent edit to any of those is silently undone.
- Multi-tenant or multi-user deployments (the obvious target for an API server) will see this.
- The window is wide: enrichment + SQL-gen + a possible second canonical render (Batch 10 fix) all happen between the read and the write.
Why we couldn't find it in unit tests
Tests are single-threaded and use :memory: SQLite. The race needs two concurrent requests sharing the same engine + storage; nothing in the existing suite exercises that.
Proposed approach
Two layers, roughly independent:
1. Add partial-update / patch semantics to StorageBackend
Extend the ABC with:

```python
async def patch_model(
    self,
    name: str,
    *,
    expected_version: Optional[int] = None,  # optional CAS token
    updates: Dict[str, Any],
) -> bool:
    """Apply a sparse update to the named model.

    Returns False if the model was concurrently mutated (when
    expected_version is set and mismatched).
    """
```
Implement in YAMLStorage (read-modify-write inside a file lock) and SQLiteStorage (row-level UPDATE with optional WHERE on a version column or content hash). The version field already exists on SlayerModel (version: int = 2) but it tracks the schema, not the row revision — add a separate revision: int column / field for CAS.
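The SQLite side might look roughly like this. The `models(name, revision, doc)` table with the model stored as JSON is a hypothetical schema, and the sketch is synchronous for brevity where the real backend is async; the CAS shape is the point:

```python
import json
import sqlite3
from typing import Any, Dict, Optional

def patch_model(
    conn: sqlite3.Connection,
    name: str,
    *,
    expected_version: Optional[int] = None,
    updates: Dict[str, Any],
) -> bool:
    """Sparse update with optional compare-and-swap on a revision column."""
    row = conn.execute(
        "SELECT revision, doc FROM models WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False
    revision, doc = row
    if expected_version is not None and revision != expected_version:
        return False  # a concurrent writer won; caller drops the refresh
    merged = {**json.loads(doc), **updates}
    # WHERE revision = ? makes the write itself a CAS, so the
    # check-then-write above is safe even across connections
    cur = conn.execute(
        "UPDATE models SET doc = ?, revision = revision + 1 "
        "WHERE name = ? AND revision = ?",
        (json.dumps(merged), name, revision),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, revision INTEGER, doc TEXT)")
conn.execute("INSERT INTO models VALUES (?, ?, ?)",
             ("qb", 1, json.dumps({"description": "old"})))

# stale token: a concurrent writer bumped the revision first, refresh is dropped
print(patch_model(conn, "qb", expected_version=0, updates={"columns": ["a"]}))  # False
# matching token: sparse update lands and bumps the revision
print(patch_model(conn, "qb", expected_version=1, updates={"columns": ["a"]}))  # True
```

The YAMLStorage variant is the same read-merge-write, with the file lock playing the role of the `WHERE revision = ?` guard.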
2. Use patch_model for all cache refreshes
_refresh_cache_after_resolution switches to patch_model(name, updates={"columns": ..., "backing_query_sql": ..., "data_source": ...}). If CAS fails (someone updated the model since we read it), drop the refresh — the next execute will retry and the cache will eventually converge.
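The drop-on-failure policy at the call site can be sketched like this; StubStorage and refresh_cache are hypothetical stand-ins for the real storage backend and _refresh_cache_after_resolution:

```python
import asyncio
import logging
from typing import Any, Dict, Optional

log = logging.getLogger("slayer.sketch")

class StubStorage:
    """Stand-in storage; cas_ok controls whether patch_model reports success."""
    def __init__(self, cas_ok: bool):
        self.cas_ok = cas_ok
        self.patched: Optional[Dict[str, Any]] = None

    async def patch_model(self, name: str, *, expected_version: int,
                          updates: Dict[str, Any]) -> bool:
        if not self.cas_ok:
            return False  # a concurrent writer bumped the revision
        self.patched = updates
        return True

async def refresh_cache(storage: StubStorage, name: str, revision: int,
                        cache_fields: Dict[str, Any]) -> bool:
    ok = await storage.patch_model(name, expected_version=revision,
                                   updates=cache_fields)
    if not ok:
        # drop-on-failure: the next execute re-resolves, so the cache converges
        log.debug("cache refresh for %s lost CAS race; skipping", name)
    return ok

cache = {"columns": ["a"], "backing_query_sql": "SELECT 1", "data_source": "duck"}
print(asyncio.run(refresh_cache(StubStorage(cas_ok=True), "qb", 7, cache)))   # True
print(asyncio.run(refresh_cache(StubStorage(cas_ok=False), "qb", 7, cache)))  # False
```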
Alternative (lower-cost short-term)
If full CAS is too much for this round, stop refreshing the cache from the read path. Move all cache writes into the explicit save path (engine.save_model), and have inspect_model / get_column_types / execute use whatever backing_query_sql they find at read time without writing back. The cache becomes "warmed by save, never updated by read" — slightly staler in some cases (a backing-query rewrite outside engine.save_model won't be picked up until the next save), but the race disappears.
I'd start with the alternative (small, safe, ships in a single PR) and reserve the full CAS work for when we have a real concurrency story (e.g., when the API server gets behind a process pool).
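For concreteness, the "warmed by save, never updated by read" shape might look like this, with toy Model and Engine classes standing in for the real ones:

```python
import asyncio
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Model:
    description: str = ""
    backing_query_sql: str = ""

class Engine:
    """Toy engine; names are illustrative, not the project's API."""
    def __init__(self):
        self.store: dict = {}

    async def save_model(self, name: str, model: Model) -> None:
        # the ONLY write of cache fields: warm the cache on explicit save
        warmed = replace(model, backing_query_sql=f"SELECT * FROM {name}")
        self.store[name] = warmed

    async def execute(self, name: str) -> str:
        # read path: use the stored cache as-is, never write back
        return self.store[name].backing_query_sql

engine = Engine()
asyncio.run(engine.save_model("qb", Model(description="d")))
print(asyncio.run(engine.execute("qb")))  # SELECT * FROM qb
```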
Acceptance criteria
Whichever path:
- A regression test that fires two concurrent engine.execute calls (or one execute + one engine.save_model) on the same query-backed model and asserts the post-state preserves both writes.
- Documented behavior in docs/concepts/models.md ("Cache may be stale until the next save" or "Cache refresh is CAS — concurrent writers retry").
Background
Related: commit 73f69b0 and "sqlite:///:memory: not supported across multiple async calls" #72 — both are about untangling execution/resolution concerns from the stored model schema. This issue is operational rather than schema-shaped, but lives in the same neighborhood.