
bug(session-tab): cold-turn concurrent POSTs bypass resume lock — UUID overwrite + orphaned JSONL #779

@webmixgamer


Summary

The per-(agent, claude_uuid) resume lock at src/backend/routers/sessions.py:175-180 short-circuits to key=None when the session's cached_claude_session_id is unset (the cold-turn path). Two concurrent POSTs to a brand-new session both bypass the lock, race on update_cached_claude_session_id, and end up with one orphaned JSONL.

self._key = (
    f"trinity:session:lock:{cached_claude_session_id}"
    if cached_claude_session_id
    else None
)

This is the same JSONL-corruption risk window the architecture doc already calls out (Anthropic claude-code #20992) — the lock was added specifically to prevent it for resume turns but doesn't yet cover cold turns.

Component

Backend / Session router (routers/sessions.py)

Priority

P2 — data-integrity edge case; requires concurrent POSTs on a fresh session.

Steps to Reproduce

Reproduced against dev (commit 9a57c6e2) via API:

# Create a brand-new session (no cached_claude_session_id yet)
SID=$(curl -s -X POST http://localhost:8000/api/agents/<name>/session \
  -H "Authorization: Bearer $TOKEN" | jq -r .id)

# Fire two POSTs at the same time
( curl -s -X POST "http://localhost:8000/api/agents/<name>/sessions/$SID/message" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"message":"cold concurrent 1"}' ) &
( curl -s -X POST "http://localhost:8000/api/agents/<name>/sessions/$SID/message" \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"message":"cold concurrent 2"}' ) &
wait

Observed result:

  • Both POSTs return HTTP 200.
  • agent_session_messages ends with 4 rows (2 user + 2 assistant).
  • agent_sessions.cached_claude_session_id holds the second turn's UUID — the first turn's UUID is lost.
  • Both assistant turns saw both user messages in context (they raced on the JSONL the other was writing), producing confused replies.
  • The orphaned JSONL stays in the agent container until session_cleanup_service reaps it.

Expected Behavior

Cold turns on the same session should serialize the same way warm turns do — the first POST establishes cached_claude_session_id, subsequent POSTs join the same Claude session via --resume. No orphaned JSONLs, no overwritten UUIDs.

Root Cause

_ResumeLock is keyed on the cached UUID, which doesn't exist yet during the first turn. The lock is bypassed precisely when concurrent first-turn races can corrupt state.
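The bypass can be illustrated with a minimal in-process sketch. It does not reproduce the real _ResumeLock; `resume_lock` and `max_concurrency` are hypothetical helpers, and an `asyncio.Lock` stands in for whatever lock backend the router actually uses. The only behavior taken from the issue is that a key of None skips locking.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager


@asynccontextmanager
async def resume_lock(locks, key):
    """Serialize on `key`; a key of None skips locking, as on the cold path."""
    if key is None:
        yield
        return
    async with locks[key]:
        yield


async def max_concurrency(key, turns=2):
    """Run `turns` simulated turns under `key` and return the peak overlap."""
    locks = defaultdict(asyncio.Lock)  # hypothetical stand-in for the lock store
    active = 0
    peak = 0

    async def turn():
        nonlocal active, peak
        async with resume_lock(locks, key):
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for the Claude call
            active -= 1

    await asyncio.gather(*(turn() for _ in range(turns)))
    return peak
```

With a UUID-style key the two turns run one after the other; with key=None both are inside the critical section at once, which is the window in which the UUID overwrite happens.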

Reachable from the UI?

Yes, but rare:

  • Double-click on the send button before the local loading ref flips to true (single user, fast hands).
  • Two browser tabs / two devices opening the same session and submitting in parallel.
  • Programmatic API use with no client-side debouncing (CLI scripts, tests).

The Session tab's local loading ref makes this unlikely for the common single-tab, single-click path, but the backend should not rely on client-side debouncing for data integrity.

Suggested Fix

Use a per-session lock key when cached_claude_session_id is null, so cold turns on the same session also serialize:

self._key = (
    f"trinity:session:lock:{cached_claude_session_id}"
    if cached_claude_session_id
    else f"trinity:session:lock:cold:{session_id}"
)

This is purely additive: warm turns continue to use the existing UUID-keyed lock. Cold turns on the same session now block each other (correct), but cold turns on different sessions still run in parallel (the per-session key prevents cross-session contention).

The existing 30s wait ceiling and 429 escalation apply unchanged to the cold-turn key.
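As a standalone sketch, the key derivation could be pulled into a small pure function so it can be unit-tested on its own. The function name `resume_lock_key` and the `LOCK_PREFIX` constant are assumptions for illustration; only the two key formats come from the snippet above.

```python
from typing import Optional

LOCK_PREFIX = "trinity:session:lock"  # prefix taken from the existing key format


def resume_lock_key(session_id: str, cached_claude_session_id: Optional[str]) -> str:
    """Return the lock key for a turn (hypothetical helper).

    Warm turns keep the existing UUID-keyed lock. Cold turns fall back to a
    per-session key, so they serialize per session instead of skipping the lock.
    """
    if cached_claude_session_id:
        return f"{LOCK_PREFIX}:{cached_claude_session_id}"
    return f"{LOCK_PREFIX}:cold:{session_id}"
```

One point that may be worth checking in the real implementation: a request that waited on the cold key should re-read cached_claude_session_id after acquiring the lock, so that it resumes the session the first turn created rather than starting a second one.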

Test

A regression test should:

  1. Create a fresh session.
  2. Fire two concurrent POSTs to its /message endpoint.
  3. Assert: both succeed, message_count == 4, both assistant messages reply to their own user message (not the other's), and exactly one JSONL exists in the agent container.
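The steps above can be sketched against an in-memory fake. This is not the project's test harness: `FakeSessionBackend` and `run_cold_race` are hypothetical, the fake only mimics the lock-key logic and the UUID caching described in this issue, and a real regression test would hit the actual /message endpoint and inspect the agent container.

```python
import asyncio
import uuid
from collections import defaultdict


class FakeSessionBackend:
    """Hypothetical in-memory stand-in for the session router and agent container."""

    def __init__(self, lock_cold_turns):
        self.lock_cold_turns = lock_cold_turns
        self.cached_uuid = None   # mirrors cached_claude_session_id
        self.jsonl_files = set()  # one entry per Claude session created
        self.messages = []        # mirrors agent_session_messages rows
        self._locks = defaultdict(asyncio.Lock)

    def _key(self, session_id):
        if self.cached_uuid:
            return f"trinity:session:lock:{self.cached_uuid}"
        if self.lock_cold_turns:
            return f"trinity:session:lock:cold:{session_id}"
        return None  # current behavior: cold turns skip the lock

    async def _turn(self, text):
        # Read the cached UUID only once the lock (if any) is held.
        resume_uuid = self.cached_uuid
        await asyncio.sleep(0.01)  # stands in for the Claude call
        claude_uuid = resume_uuid or str(uuid.uuid4())
        self.jsonl_files.add(claude_uuid)
        self.cached_uuid = claude_uuid
        self.messages += [("user", text), ("assistant", f"reply to: {text}")]

    async def post_message(self, session_id, text):
        key = self._key(session_id)
        if key is None:
            await self._turn(text)
        else:
            async with self._locks[key]:
                await self._turn(text)


async def run_cold_race(lock_cold_turns):
    """Fire two concurrent cold POSTs at one session and return the backend."""
    backend = FakeSessionBackend(lock_cold_turns)
    await asyncio.gather(
        backend.post_message("s1", "cold concurrent 1"),
        backend.post_message("s1", "cold concurrent 2"),
    )
    return backend
```

Running `run_cold_race(False)` models the current behavior and leaves two JSONL entries; `run_cold_race(True)` models the suggested fix and leaves one, with four message rows in both cases.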

Related

Surfaced during reproduction work for #759. Not bundled into that PR because it's a separate code path and a separate risk (JSONL corruption vs. UI-state inconsistency).
