Merged
49 commits
6e81aac
checkpoint: pre-yolo 2026-05-04T10:29:04
aksOps May 4, 2026
eea4add
checkpoint: pre-yolo 2026-05-04T11:58:44
aksOps May 4, 2026
60523e0
checkpoint: pre-yolo 2026-05-04T12:55:34
aksOps May 4, 2026
7c682bb
feat(locks): per-session asyncio lock registry
aksOps May 4, 2026
941d62a
feat(storage): optimistic-concurrency version on Session rows
aksOps May 4, 2026
f6317a2
feat(storage): append-only session event log
aksOps May 4, 2026
860a29a
fix(gateway): persist pending_approval row to DB before interrupt
aksOps May 4, 2026
412e076
refactor(gateway): explicit prefixed-then-bare lookup order
aksOps May 4, 2026
40ad9f5
feat(incident): typed pydantic schemas for terminal tool requests
aksOps May 4, 2026
6b30e2f
feat(incident): add mark_resolved, mark_escalated, submit_hypothesis …
aksOps May 4, 2026
d76a036
feat(incident): typed schema for update_incident patch (extra=forbid)
aksOps May 4, 2026
33bfbf8
feat(graph): harvester reads confidence from typed terminal tool args
aksOps May 4, 2026
b7d10ab
test(harvester): align test name + values with spec verbatim
aksOps May 4, 2026
64d0f68
checkpoint: pre-yolo 2026-05-04T14:07:18
aksOps May 4, 2026
7fa420d
fix(graph): lock typed-terminal confidence/rationale against same-mes…
aksOps May 4, 2026
d6ece33
refactor(skills): use typed terminal tools (mark_resolved/escalated/s…
aksOps May 4, 2026
8114474
fix(skill): preserve matched_prior_inc validation guideline in deep_i…
aksOps May 4, 2026
c8e3129
fix(orchestrator): infer terminal status from tool history (no blind …
aksOps May 4, 2026
00d389c
fix(orchestrator): handle StaleVersionError + tighten finalize hygiene
aksOps May 4, 2026
875b711
fix(orchestrator): lock-guarded async finalize (no concurrent race)
aksOps May 4, 2026
d4f8d48
checkpoint: pre-yolo 2026-05-04T14:38:52
aksOps May 4, 2026
dab107d
fix(orchestrator): reject concurrent retry_session on same id
aksOps May 4, 2026
3193390
fix(orchestrator): operator log on retry_session rejection
aksOps May 4, 2026
6265eae
feat(mcp): config-roster-driven validation for environment + team
aksOps May 4, 2026
a54988c
feat(skills): load-time validation of tools.local + when:default routes
aksOps May 4, 2026
a6e49cf
fix(skill-validator): reject ambiguous bare tool refs + test supervis…
aksOps May 4, 2026
01a7f37
feat(storage): GC orphaned LangGraph checkpoints on startup
aksOps May 4, 2026
eb7c73f
test(checkpoint-gc): cover suffix-strip retry threads stay when base …
aksOps May 4, 2026
309f9fa
test+docs(mcp): lock per-instance isolation guarantee for IncidentMCP…
aksOps May 4, 2026
2e912ae
test(e2e): regression for finalize→needs_review on real Orchestrator
aksOps May 4, 2026
0574955
build: regen dist/* bundles for prompt-vs-code remediation
aksOps May 4, 2026
d5fb9ad
checkpoint: pre-yolo 2026-05-05T00:19:38
aksOps May 5, 2026
9cf86f6
fix(tests): remove unused 'inc' assignment flagged by ruff F841 in CI
aksOps May 5, 2026
c89fa4f
checkpoint: pre-yolo 2026-05-05T09:23:14
aksOps May 5, 2026
d86d57c
fix(quality): SonarCloud — async suppression + cognitive complexity r…
aksOps May 5, 2026
a8eb97b
checkpoint: pre-yolo 2026-05-05T11:34:48
aksOps May 5, 2026
997f9a4
checkpoint: pre-yolo 2026-05-06T07:53:14
aksOps May 6, 2026
ae0ee4d
checkpoint: pre-yolo 2026-05-06T07:57:13
aksOps May 6, 2026
ea43964
feat(concurrency): per-session task-reentrant lock with fail-fast Ses…
aksOps May 6, 2026
7ae577f
docs(plan): complete 01-01 summary — per-session lock + SessionBusy (…
aksOps May 6, 2026
50acaf6
feat(01-02): watchdog is_locked() skip + HARD-06 stop() drain + pytes…
aksOps May 6, 2026
3e2e813
test(concurrency): add 7 PVC-09 lock-protocol tests to test_session_lock
aksOps May 6, 2026
57cd7cd
checkpoint: pre-yolo 2026-05-06T12:46:37
aksOps May 6, 2026
f345e59
feat(01.1-01): add SessionLockRegistry.try_acquire (D-18)
aksOps May 6, 2026
5ce54ce
feat(01.1-01): wire try_acquire + api lock acquire + dist regen (D-09…
aksOps May 6, 2026
20bf02c
test(01.1-02): rewrite 5 sequential PVC-09 tests as concurrent (R3)
aksOps May 6, 2026
b6f45fc
test(01.1-02): add R1 watchdog-acquire + R2 api-429 tests
aksOps May 6, 2026
71b3c16
test(01.1-03): add R4 D-01 lock-cycle verification test
aksOps May 6, 2026
acfc15d
checkpoint: pre-yolo 2026-05-06T14:21:22
aksOps May 6, 2026
1 change: 1 addition & 0 deletions .claude/worktrees/agent-a5e8856c1b01a8d2f
Submodule agent-a5e8856c1b01a8d2f added at 7ae577
1 change: 1 addition & 0 deletions .claude/worktrees/agent-ad51a9f71a5268747
Submodule agent-ad51a9f71a5268747 added at ae0ee4
3 changes: 3 additions & 0 deletions .gitignore
@@ -51,6 +51,9 @@ Thumbs.db
AGENTS.md
ASR.md
docs/
REVIEW_*.md
review_*.md
.planning/

# Coverage / CI artefacts
coverage.xml
134 changes: 134 additions & 0 deletions .planning/phases/01-concurrency-foundation/01-01-SUMMARY.md
@@ -0,0 +1,134 @@
---
phase: 01-concurrency-foundation
plan: 01
subsystem: infra
tags: [asyncio, locks, concurrency, fastapi, streamlit, session-management]

# Dependency graph
requires: []
provides:
- SessionBusy(RuntimeError) exception with session_id attribute
- SessionLockRegistry.is_locked(session_id) non-blocking predicate
- Per-session task-reentrant lock held across full graph turn including HITL pause
- HTTP 429 + Retry-After:1 on all three session-start/approval API callsites
- UI retry hint on SessionBusy at investigation form submission
- locks.py inlined into dist/ bundles
affects:
- 01-02-concurrency-foundation # approval_watchdog retry path uses SessionBusy

# Tech tracking
tech-stack:
  added: []
  patterns:
    - class-name match for exception handling in api.py (no hard import at module load)
    - task-reentrant asyncio lock with is_locked() fail-fast check before acquire()
    - "D-09: dist/ regeneration in same atomic commit as src/ changes"

key-files:
  created: []
  modified:
    - src/runtime/locks.py
    - src/runtime/service.py
    - src/runtime/api.py
    - src/runtime/ui.py
    - tests/test_session_lock.py
    - scripts/build_single_file.py
    - dist/app.py
    - dist/ui.py
    - dist/apps/incident-management.py
    - dist/apps/code-review.py

key-decisions:
- "D-01: Lock held across entire graph turn including LangGraph interrupt() HITL pause"
- "D-02: Single acquire site inside _run() closure, not at start_session() entry"
- "D-03: Fail-fast contention — SessionBusy raised, not queued"
- "D-04: Reads stay lock-free throughout"
- "D-09: dist/ regenerated in same atomic commit as src/ changes"
- "D-10: Direct atomic commit on refactor/prompt-vs-code-remediation branch"
- "D-15: Slot eviction deferred to v2 — TODO comment added to _slots dict"
- "D-16 (location override): SessionBusy raised inside _run() at acquire site, NOT at start_session() entry — start_session() mints fresh session_id so no pre-existing lock slot exists"
- "D-17: EventLog stays lock-free"
- "locks.py added to RUNTIME_MODULE_ORDER in build_single_file.py (was missing)"

patterns-established:
- "Exception class-name matching pattern: e.__class__.__name__ in ('SessionCapExceeded', 'SessionBusy') — avoids hard import at module load time"
- "is_locked() + acquire() pattern: check is_locked() first for fail-fast, then async with acquire() for the body — non-contending in steady state"
- "asyncio_mode=auto: new async tests in tests/ do NOT need @pytest.mark.asyncio decorator"

requirements-completed:
- PVC-01

# Metrics
duration: ~35min
completed: 2026-05-06
---

# Phase 01: Concurrency Foundation — Plan 01 Summary

**Per-session task-reentrant asyncio lock with fail-fast SessionBusy, HTTP 429/Retry-After mapping at all three API callsites, UI retry hint, and locks.py bundled into dist/**

## Performance

- **Duration:** ~35 min
- **Started:** 2026-05-06T08:00:00Z
- **Completed:** 2026-05-06T08:35:00Z
- **Tasks:** 3
- **Files modified:** 10

## Accomplishments
- `SessionBusy(RuntimeError)` exception and `is_locked()` predicate added to `locks.py`; 5 new unit tests pass (838 total)
- `service.py._run()` wrapped with per-session lock acquire; fail-fast contention check via `is_locked()` before `acquire()`
- All three FastAPI callsites (`/investigate`, `POST /sessions`, approval submission) now map `SessionBusy` → HTTP 429 + `Retry-After: 1`; UI shows `st.warning` + early return
- `locks.py` added to `RUNTIME_MODULE_ORDER` in `build_single_file.py` (was omitted); all four dist bundles regenerated with `SessionBusy`, `is_locked`, `_locks.acquire` present
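
The fail-fast pattern described above (check `is_locked()`, raise `SessionBusy`, otherwise enter `acquire()`) can be sketched roughly as follows. This is a minimal illustration, not the code in `src/runtime/locks.py`: the registry body, `run_turn`, and `demo` are hypothetical, and task-reentrancy is deliberately left out to keep the sketch short. It assumes no `await` happens between the check and the acquire, so the two steps cannot interleave on a single event loop.

```python
import asyncio
from contextlib import asynccontextmanager


class SessionBusy(RuntimeError):
    """Raised when a session's lock is already held (fail-fast, not queued)."""

    def __init__(self, session_id: str) -> None:
        super().__init__(f"session {session_id!r} is busy")
        self.session_id = session_id


class SessionLockRegistry:
    """Hypothetical minimal registry: one asyncio.Lock per session id."""

    def __init__(self) -> None:
        # Slots are never evicted in this sketch.
        self._slots: dict[str, asyncio.Lock] = {}

    def is_locked(self, session_id: str) -> bool:
        """Non-blocking predicate: True if some task holds the session lock."""
        lock = self._slots.get(session_id)
        return lock is not None and lock.locked()

    @asynccontextmanager
    async def acquire(self, session_id: str):
        lock = self._slots.setdefault(session_id, asyncio.Lock())
        async with lock:
            yield


async def run_turn(locks: SessionLockRegistry, session_id: str, hold: float) -> str:
    # Fail fast instead of queueing behind the current holder.
    if locks.is_locked(session_id):
        raise SessionBusy(session_id)
    async with locks.acquire(session_id):
        await asyncio.sleep(hold)  # stands in for the graph turn
        return "done"


async def demo() -> tuple[str, str]:
    locks = SessionLockRegistry()
    first = asyncio.create_task(run_turn(locks, "s1", 0.05))
    await asyncio.sleep(0)  # let the first task take the lock
    try:
        second = await run_turn(locks, "s1", 0)
    except SessionBusy as exc:
        second = f"busy:{exc.session_id}"
    return await first, second
```

Running `asyncio.run(demo())` lets the first turn complete while the overlapping second turn is rejected with `SessionBusy` rather than waiting.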

## Task Commits

All tasks committed atomically in a single commit per D-09/D-10:

1. **Tasks 1-3: All changes** - `ea43964` (feat)

## Files Created/Modified
- `src/runtime/locks.py` - Added `SessionBusy` class, `is_locked()` predicate, TODO(v2) eviction note
- `src/runtime/service.py` - Wrapped `_run()` body with `async with orch._locks.acquire(session_id):`; `is_locked()` fail-fast guard
- `src/runtime/api.py` - Extended class-name match at 2 existing handlers + 1 new handler at approval submission callsite
- `src/runtime/ui.py` - `SessionBusy` try/except around the `asyncio.run()` call on the investigation form path
- `tests/test_session_lock.py` - 5 new tests for `is_locked()` + `SessionBusy` (no `@pytest.mark.asyncio` per asyncio_mode=auto)
- `scripts/build_single_file.py` - Added `(RUNTIME_ROOT, "locks.py")` before `orchestrator.py` in `RUNTIME_MODULE_ORDER`
- `dist/app.py`, `dist/ui.py`, `dist/apps/incident-management.py`, `dist/apps/code-review.py` - Regenerated with locks.py inlined
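
The class-name match used in `api.py` could look something like the sketch below. It is framework-agnostic on purpose: `to_http_error` and its return shape are hypothetical, and only the `SessionBusy` to HTTP 429 + `Retry-After: 1` mapping comes from this summary. The summary's match tuple also lists `SessionCapExceeded`, whose status code is not stated here, so it is left out.

```python
from typing import Optional


class SessionBusy(RuntimeError):
    """Stand-in for runtime.locks.SessionBusy (not imported by the handler)."""


def to_http_error(exc: BaseException) -> Optional[tuple[int, dict[str, str]]]:
    """Map an exception to (status, headers) by class name.

    Matching on the name instead of isinstance() avoids importing the
    exception class at module load time.
    """
    if exc.__class__.__name__ == "SessionBusy":
        return 429, {"Retry-After": "1"}
    return None  # caller falls through to its normal error handling


def handle(exc: BaseException) -> str:
    """Hypothetical caller: render the mapping as a status line."""
    mapped = to_http_error(exc)
    if mapped is None:
        return "500"
    status, headers = mapped
    return f"{status} Retry-After={headers['Retry-After']}"
```

In a FastAPI handler the same tuple would typically be turned into `HTTPException(status_code=429, headers={"Retry-After": "1"})`.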

## Decisions Made
- D-16 location override confirmed: `SessionBusy` raised inside `_run()` not at `start_session()` entry — `start_session()` mints a fresh `session_id` so there is no pre-existing lock slot to check
- `locks.py` was missing from `RUNTIME_MODULE_ORDER` in the build script — added before `orchestrator.py` which instantiates `SessionLockRegistry`
- Used `is_locked()` as a pre-check before `acquire()` to satisfy D-03 fail-fast without blocking; `acquire()` itself is non-contending in the steady state

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 3 - Blocking] locks.py missing from build_single_file.py RUNTIME_MODULE_ORDER**
- **Found during:** Task 3 (dist/ regeneration verification)
- **Issue:** `def is_locked`, `class SessionBusy` absent from `dist/app.py` after initial build; `locks.py` was not listed in `RUNTIME_MODULE_ORDER`
- **Fix:** Added `(RUNTIME_ROOT, "locks.py")` to `RUNTIME_MODULE_ORDER` before `orchestrator.py`; rebuilt all four bundles
- **Files modified:** `scripts/build_single_file.py`, all four dist files
- **Verification:** `grep -c "def is_locked" dist/app.py` → 1; `grep -c "class SessionBusy" dist/app.py` → 1; `grep -c "_locks\.acquire" dist/app.py` → 2
- **Committed in:** `ea43964` (same atomic commit)

---

**Total deviations:** 1 auto-fixed (1 blocking — missing bundle entry)
**Impact on plan:** Essential fix for D-09 compliance. No scope creep.
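
The `grep -c` verification above could be automated as a post-build check along these lines. This is only a sketch: `count_matching_lines`, `verify_bundle`, and the sample bundle text are hypothetical, while the three patterns and their expected counts are the ones quoted in the verification step.

```python
import re

# Patterns and expected line counts from the verification step above.
EXPECTED = {
    r"def is_locked": 1,
    r"class SessionBusy": 1,
    r"_locks\.acquire": 2,
}


def count_matching_lines(text: str, pattern: str) -> int:
    """Count lines containing a regex match, like `grep -c`."""
    regex = re.compile(pattern)
    return sum(1 for line in text.splitlines() if regex.search(line))


def verify_bundle(text: str, expected: dict[str, int]) -> dict[str, int]:
    """Return {pattern: actual_count} for every pattern whose count is off."""
    mismatches: dict[str, int] = {}
    for pattern, want in expected.items():
        got = count_matching_lines(text, pattern)
        if got != want:
            mismatches[pattern] = got
    return mismatches


# Hypothetical bundle excerpts: one with locks.py inlined, one without.
GOOD_BUNDLE = """\
class SessionBusy(RuntimeError): ...
    def is_locked(self, session_id): ...
    async with orch._locks.acquire(session_id):
    async with self._locks.acquire(sid):
"""
BAD_BUNDLE = """\
    async with orch._locks.acquire(session_id):
    async with self._locks.acquire(sid):
"""
```

An empty result from `verify_bundle` means the bundle matches; a non-empty one names the definitions that went missing, which is the symptom the omitted `RUNTIME_MODULE_ORDER` entry produced.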

## Issues Encountered
None beyond the locks.py bundle omission documented above.

## User Setup Required
None - no external service configuration required.

## Next Phase Readiness
- Per-session lock foundation complete; `SessionBusy` exception available for 01-02
- 01-02 (`approval_watchdog.py` retry path) can import `SessionBusy` from `runtime.locks` without circular import risk
- All 838 tests pass; ruff clean on all modified files

---
*Phase: 01-concurrency-foundation*
*Completed: 2026-05-06*
12 changes: 9 additions & 3 deletions config/config.yaml
@@ -140,12 +140,18 @@ runtime:
# only TIGHTENS — it can never relax a higher-risk tool to ``auto``.
gateway:
policy:
# Tool-name lookups try the server-prefixed (``<server>:<tool>``)
# AND bare forms — config can use either. Bare names below are
# easier to keep aligned with the MCP source.
update_incident: medium
"remediation:restart_service": high
"remediation:rollback": high
apply_fix: high
prod_overrides:
prod_environments:
- production
# Tools that ALWAYS require human approval in production. ``apply_fix``
# is the only currently-implemented remediation; ``update_incident``
# gates resolution closures (status: resolved/escalated). Globs are
# matched against the prefixed and bare forms.
resolution_trigger_tools:
- update_incident
- "remediation:*"
- apply_fix
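
The lookup behaviour the comments above describe (try the server-prefixed name, then the bare name, and match globs against both forms) might be sketched as follows. The function names and the sample dictionaries are hypothetical; the real gateway code is not shown in this diff, and the sample values simply reuse entries visible in the config lines above.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical sample data mirroring entries from the config above.
POLICY = {"update_incident": "medium", "apply_fix": "high"}
TRIGGERS = ["update_incident", "remediation:*", "apply_fix"]


def lookup_risk(policy: dict[str, str], server: str, tool: str) -> Optional[str]:
    """Prefixed-then-bare lookup: '<server>:<tool>' wins over '<tool>'."""
    for key in (f"{server}:{tool}", tool):
        if key in policy:
            return policy[key]
    return None


def requires_approval(patterns: list[str], server: str, tool: str) -> bool:
    """True if any glob matches either the prefixed or the bare tool name."""
    names = (f"{server}:{tool}", tool)
    return any(fnmatchcase(name, pat) for pat in patterns for name in names)
```

With this ordering, a config that lists both `"remediation:restart_service"` and a bare `restart_service` resolves to the prefixed entry, while a config that lists only bare names still matches any server.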