You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Seen in production while running #81 through the dev pipeline:
```
❌ Failed on #81: Error: --resume requires a valid session ID or session title
when used with --print. Usage: claude -p --resume <session-id|title>.
Provided value "dev-AInvirion-ctrlrelay-81-c49c00f7" is not a UUID and does
not match any session title.
```
The initial dev-pipeline session worked — PR #82 landed and closed #81 — but the post-merge / blocked-resume step errored. Operators see a misleading "❌ Failed" Telegram notification for an issue that actually shipped.
Root cause
`src/ctrlrelay/core/dispatcher.py` (`ClaudeDispatcher.spawn_session`) builds the resume command with:
```python
if resume_session_id:
cmd.extend(["--resume", resume_session_id])
```
...where `resume_session_id` is the composite identifier we coin ourselves — format `dev----` (see `src/ctrlrelay/pipelines/dev.py`, `run_dev_issue`).
Older `claude` CLI versions accepted arbitrary strings for `--resume` and just dropped you into a fresh session when there was no match. Newer versions (seen in the v2.0.x line) validate that the argument is either a UUID of a known session OR an exact match of a session "title". Our composite ids aren't UUIDs and we don't pass them as titles, so resume dies.
This only manifests on paths that resume a session — the post-merge handler's "PR merged, close the loop" step and the Telegram-bridge-driven answer-to-blocked-question flow. The initial `spawn_session` call (no `resume_session_id`) is unaffected, which is why PRs are still being opened successfully.
What Claude's own session id looks like
`claude -p --output-format json` returns a JSON payload with a `session_id` field — a real UUID that Claude stores in its local sessions database. That's the value `--resume` expects on subsequent calls.
Run a quick smoke to confirm the field name exactly before shipping (the shape has shifted historically): `claude -p --output-format json "noop"` and inspect the JSON keys.
Proposed fix
Capture Claude's UUID on every spawn. After `proc.communicate()` in `spawn_session`, parse `stdout` as JSON (when `--output-format json`), pull out the session UUID, and attach it to the returned `SessionResult`. Probably rename today's `SessionResult.session_id` → `SessionResult.orchestrator_session_id` (our composite) and add `SessionResult.agent_session_id` (Claude's UUID) for clarity; update the 4-5 call sites.
Persist the agent UUID. The state_db `sessions` table keeps our composite id as the primary key; add a nullable `agent_session_id` column alongside. Small Alembic-style migration (or a one-shot `ALTER TABLE` guarded by `PRAGMA user_version`) so existing rows backfill to NULL.
Use the agent UUID on resume. Wherever `dispatcher.spawn_session(resume_session_id=...)` is called, fetch the prior `agent_session_id` from the state DB (not our composite) and pass it through. If NULL (session predates this fix), skip the `--resume` and fall back to a fresh session — better than hard-failing.
Abstract into the protocol. Since we just shipped `AgentAdapter` in feat(agent): multi-agent config surface (claude -> agent, with type field) #73, this work naturally generalizes: every adapter will have its own "session identifier" concept (Codex's thread id, etc.). The protocol should define `session_id` (the opaque-to-us string the adapter wants on resume) independently of whatever composite id the orchestrator uses to track the session in state_db. That way the next adapter author doesn't re-discover this footgun.
Tests
Unit: mock `claude` stdout with a known JSON payload containing `session_id`; assert `SessionResult.agent_session_id` is populated.
Unit: state_db roundtrip — insert a session, fetch, assert both ids survive.
Unit: `spawn_session(resume_session_id=)` puts that UUID after `--resume` in the argv, not the composite.
Integration: two consecutive `spawn_session` calls — second call uses the agent UUID from the first's result. Mock `claude` binary behaviour.
Regression: `spawn_session(resume_session_id=None)` on a fresh start emits no `--resume` flag.
Out of scope for this issue
Cleaning up already-blocked sessions in state_db that predate the fix. Probably a separate admin CLI command (`ctrlrelay sessions cleanup`) to mark them "abandoned" so operators don't see zombie rows.
Title-based resume semantics (`claude -p --resume "some-title"`). We could set a title per session to double-up as a fallback; not worth the complexity until we see demand.
Operator impact right now
The bug doesn't block PR creation — only post-merge / mid-session resume. In practice:
Every dev-pipeline run that should resume after a blocked question → fails silently from the operator's perspective (they see "❌ Failed" even when the underlying work is fine).
Every dev-pipeline run whose PR merges and triggers the post-merge handler → emits the misleading Telegram "❌ Failed" even though the PR is already in.
Fix priority: high-ish, not emergency. Noise rather than data loss.
Symptom
Seen in production while running #81 through the dev pipeline:
```
❌ Failed on #81: Error: --resume requires a valid session ID or session title
when used with --print. Usage: claude -p --resume <session-id|title>.
Provided value "dev-AInvirion-ctrlrelay-81-c49c00f7" is not a UUID and does
not match any session title.
```
The initial dev-pipeline session worked — PR #82 landed and closed #81 — but the post-merge / blocked-resume step errored. Operators see a misleading "❌ Failed" Telegram notification for an issue that actually shipped.
Root cause
`src/ctrlrelay/core/dispatcher.py` (`ClaudeDispatcher.spawn_session`) builds the resume command with:
```python
if resume_session_id:
cmd.extend(["--resume", resume_session_id])
```
...where `resume_session_id` is the composite identifier we coin ourselves — format `dev----` (see `src/ctrlrelay/pipelines/dev.py`, `run_dev_issue`).
Older `claude` CLI versions accepted arbitrary strings for `--resume` and just dropped you into a fresh session when there was no match. Newer versions (seen in the v2.0.x line) validate that the argument is either a UUID of a known session OR an exact match of a session "title". Our composite ids aren't UUIDs and we don't pass them as titles, so resume dies.
This only manifests on paths that resume a session — the post-merge handler's "PR merged, close the loop" step and the Telegram-bridge-driven answer-to-blocked-question flow. The initial `spawn_session` call (no `resume_session_id`) is unaffected, which is why PRs are still being opened successfully.
What Claude's own session id looks like
`claude -p --output-format json` returns a JSON payload with a `session_id` field — a real UUID that Claude stores in its local sessions database. That's the value `--resume` expects on subsequent calls.
Run a quick smoke to confirm the field name exactly before shipping (the shape has shifted historically): `claude -p --output-format json "noop"` and inspect the JSON keys.
Proposed fix
Capture Claude's UUID on every spawn. After `proc.communicate()` in `spawn_session`, parse `stdout` as JSON (when `--output-format json`), pull out the session UUID, and attach it to the returned `SessionResult`. Probably rename today's `SessionResult.session_id` → `SessionResult.orchestrator_session_id` (our composite) and add `SessionResult.agent_session_id` (Claude's UUID) for clarity; update the 4-5 call sites.
Persist the agent UUID. The state_db `sessions` table keeps our composite id as the primary key; add a nullable `agent_session_id` column alongside. Small Alembic-style migration (or a one-shot `ALTER TABLE` guarded by `PRAGMA user_version`) so existing rows backfill to NULL.
Use the agent UUID on resume. Wherever `dispatcher.spawn_session(resume_session_id=...)` is called, fetch the prior `agent_session_id` from the state DB (not our composite) and pass it through. If NULL (session predates this fix), skip the `--resume` and fall back to a fresh session — better than hard-failing.
Abstract into the protocol. Since we just shipped `AgentAdapter` in feat(agent): multi-agent config surface (claude -> agent, with type field) #73, this work naturally generalizes: every adapter will have its own "session identifier" concept (Codex's thread id, etc.). The protocol should define `session_id` (the opaque-to-us string the adapter wants on resume) independently of whatever composite id the orchestrator uses to track the session in state_db. That way the next adapter author doesn't re-discover this footgun.
Tests
Out of scope for this issue
Operator impact right now
The bug doesn't block PR creation — only post-merge / mid-session resume. In practice:
Fix priority: high-ish, not emergency. Noise rather than data loss.