Summary
The per-(agent, claude_uuid) resume lock at src/backend/routers/sessions.py:175-180 short-circuits to key=None when the session's cached_claude_session_id is unset (the cold-turn path). Two concurrent POSTs to a brand-new session both bypass the lock, race on update_cached_claude_session_id, and end up with one orphaned JSONL.
self._key = (
    f"trinity:session:lock:{cached_claude_session_id}"
    if cached_claude_session_id
    else None
)
This is the same JSONL-corruption risk window the architecture doc already calls out (Anthropic claude-code #20992) — the lock was added specifically to prevent it for resume turns but doesn't yet cover cold turns.
Component
Backend / Session router (routers/sessions.py)
Priority
P2 — data-integrity edge case; requires concurrent POSTs on a fresh session.
Steps to Reproduce
Reproduced against dev (commit 9a57c6e2) via API:
# Replace <name> with the agent name before running.
# Create a brand-new session (no cached_claude_session_id yet)
SID=$(curl -s -X POST "http://localhost:8000/api/agents/<name>/session" \
  -H "Authorization: Bearer $TOKEN" | jq -r .id)
# Fire two POSTs at the same time
( curl -s -X POST "http://localhost:8000/api/agents/<name>/sessions/$SID/message" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"message":"cold concurrent 1"}' ) &
( curl -s -X POST "http://localhost:8000/api/agents/<name>/sessions/$SID/message" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"message":"cold concurrent 2"}' ) &
wait
Observed result:
- Both POSTs return HTTP 200.
- agent_session_messages ends with 4 rows (2 user + 2 assistant).
- agent_sessions.cached_claude_session_id holds the second turn's UUID; the first turn's UUID is lost.
- Both assistant turns saw both user messages in context (they raced on the JSONL the other was writing), producing confused replies.
- The orphaned JSONL stays in the agent container until session_cleanup_service reaps it.
Expected Behavior
Cold turns on the same session should serialize the same way warm turns do — the first POST establishes cached_claude_session_id, subsequent POSTs join the same Claude session via --resume. No orphaned JSONLs, no overwritten UUIDs.
Root Cause
_ResumeLock is keyed on the cached UUID, which doesn't exist yet during the first turn. The lock is bypassed precisely when concurrent first-turn races can corrupt state.
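The race can be illustrated with a small in-memory simulation. This is only a sketch: FakeSession and cold_turn are hypothetical stand-ins for the agent_sessions row and the message handler, and asyncio.sleep stands in for the Claude turn. The only part taken from the real code is the key expression.

```python
import asyncio
import uuid


class FakeSession:
    """Hypothetical in-memory stand-in for one agent_sessions row."""

    def __init__(self):
        self.cached_claude_session_id = None
        self.jsonl_files = []  # one entry per JSONL created in the "container"


async def cold_turn(session):
    # On a brand-new session both callers read None here.
    cached = session.cached_claude_session_id
    key = f"trinity:session:lock:{cached}" if cached else None
    assert key is None  # no key, so no lock is taken

    await asyncio.sleep(0.01)  # stands in for the Claude turn

    new_id = str(uuid.uuid4())  # each unlocked cold turn starts its own session
    session.jsonl_files.append(new_id)
    session.cached_claude_session_id = new_id  # last writer wins
    return new_id


async def main():
    session = FakeSession()
    first, second = await asyncio.gather(cold_turn(session), cold_turn(session))
    return session, first, second


session, first, second = asyncio.run(main())
orphans = [u for u in session.jsonl_files if u != session.cached_claude_session_id]
print(f"JSONLs created: {len(session.jsonl_files)}, orphaned: {len(orphans)}")
```

In this simulation two JSONLs are created and only one of them remains referenced by cached_claude_session_id, which mirrors the observed result above.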
Reachable from the UI?
Yes, but rare:
- Double-click on the send button before loading flips (single user, fast hands).
- Two browser tabs / two devices opening the same session and submitting in parallel.
- Programmatic API use with no client-side debouncing (CLI scripts, tests).
The Session tab's local loading ref makes this unlikely for the common single-tab single-click path, but the backend should not rely on client-side debouncing for data integrity.
Suggested Fix
Use a per-session lock key when cached_claude_session_id is null, so cold turns on the same session also serialize:
self._key = (
    f"trinity:session:lock:{cached_claude_session_id}"
    if cached_claude_session_id
    else f"trinity:session:lock:cold:{session_id}"
)
This is purely additive: warm turns continue to use the existing UUID-keyed lock. Cold turns on the same session now block each other (correct), but cold turns on different sessions still run in parallel (the per-session key prevents cross-session contention).
The same 30s wait ceiling + 429 escalation already in place will apply.
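As a sanity check on the key scheme, the derivation can be pulled into a pure function and exercised directly. This is a sketch, not the actual _ResumeLock code: resume_lock_key and LOCK_PREFIX are hypothetical names introduced here for illustration.

```python
from typing import Optional

LOCK_PREFIX = "trinity:session:lock"  # prefix taken from the snippet above


def resume_lock_key(cached_claude_session_id: Optional[str], session_id: str) -> str:
    """Return the lock key for a turn (hypothetical helper, not the real API)."""
    if cached_claude_session_id:
        # Warm turn: unchanged, keyed on the cached Claude UUID.
        return f"{LOCK_PREFIX}:{cached_claude_session_id}"
    # Cold turn: fall back to a per-session key instead of None.
    return f"{LOCK_PREFIX}:cold:{session_id}"


# Warm turns keep the existing key format.
assert resume_lock_key("abc-123", "s1") == "trinity:session:lock:abc-123"
# Cold turns on the same session share a key, so they contend.
assert resume_lock_key(None, "s1") == resume_lock_key(None, "s1")
# Cold turns on different sessions get different keys, so they run in parallel.
assert resume_lock_key(None, "s1") != resume_lock_key(None, "s2")
```

The key is never None under this scheme, so any "skip the lock when key is None" branch would no longer be reached for session turns.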
Test
A regression test should:
- Create a fresh session
- Fire two concurrent POSTs to its /message endpoint
- Assert: both succeed, message_count == 4, both assistant messages reply to their own user message (not the other's), and exactly one JSONL exists in the agent container.
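The shape of such a test can be sketched against an in-memory fake. Everything here is hypothetical scaffolding: FakeBackend stands in for the router plus agent container, an asyncio.Lock per key stands in for the real lock, and the sketch assumes the handler re-reads cached_claude_session_id after acquiring the lock so that the second turn resumes instead of starting a new session. A real test would issue HTTP requests and inspect the database and container instead.

```python
import asyncio
import uuid
from collections import defaultdict


class FakeBackend:
    """Hypothetical in-memory stand-in for the session router and agent container."""

    def __init__(self):
        self.cached_claude_session_id = None
        self.jsonl_files = set()
        self.messages = []  # (role, text) rows, like agent_session_messages
        self._locks = defaultdict(asyncio.Lock)

    def _lock_key(self, session_id):
        cached = self.cached_claude_session_id
        if cached:
            return f"trinity:session:lock:{cached}"
        return f"trinity:session:lock:cold:{session_id}"  # suggested fix

    async def post_message(self, session_id, text):
        async with self._locks[self._lock_key(session_id)]:
            # Assumption: re-read inside the lock, since the first turn may
            # have established the ID while this request was waiting.
            claude_id = self.cached_claude_session_id or str(uuid.uuid4())
            self.jsonl_files.add(claude_id)
            self.messages.append(("user", text))
            await asyncio.sleep(0.01)  # stands in for the Claude turn
            self.messages.append(("assistant", f"reply to: {text}"))
            self.cached_claude_session_id = claude_id
            return 200


async def run_test():
    backend = FakeBackend()
    statuses = await asyncio.gather(
        backend.post_message("s1", "cold concurrent 1"),
        backend.post_message("s1", "cold concurrent 2"),
    )
    return backend, statuses


backend, statuses = asyncio.run(run_test())
assert statuses == [200, 200]
assert len(backend.messages) == 4
# Each assistant row directly follows, and answers, its own user row.
for i in (0, 2):
    assert backend.messages[i][0] == "user"
    assert backend.messages[i + 1] == ("assistant", f"reply to: {backend.messages[i][1]}")
assert len(backend.jsonl_files) == 1
```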
Related
Surfaced during reproduction work for #759. Not bundled into that PR because it's a separate code path and a separate risk (JSONL corruption vs. UI-state inconsistency).