Problem
Kennel's runtime concurrency is still spread across raw threads, thread events/locks, file locks (flock), registry-owned shared state, and background helper threads. Correctness depends on subtle interleavings between webhook handlers, worker turns, provider sessions, restart/recovery, and file-backed coordination.
On the free-threaded Python runtime this is exactly the wrong shape. The goal is not a better lock graph. The goal is a runtime kernel that owns concurrency so product code cannot keep inventing it.
Scope
This is the active runtime-concurrency subtree under #396.
It absorbs the still-relevant work from the older #548 coordination lane:
- #550 contract / ownership vocabulary
- #551 and #552 webhook ingress and coordinator boundaries
- #553 durable outbox/store work
- #554 worker recovery and wake/abort cleanup
- #555 scenario-test migration
Hard decisions for this subtree
- One runtime instance per repo. The default shape is one repo-runtime instance per repo, likely one process per repo under a thin supervisor. Do not build one giant shared multi-repo runtime with shared mutable internals.
- SQLite is the durable truth. Use SQLite for the command inbox, parked frames, leases/session identity, epochs, outbox state, snapshots, and migrated task/state storage.
- Real preemption, not just prioritization. Urgent work must be able to interrupt an active provider turn immediately, seize the lease inside the runtime owner, park the worker, drain the urgent burst, then yield back and resume.
- Preemption transport stays out-of-band. SQLite is not the wakeup path. Immediate preemption should use a direct runtime-owned signal/channel while durable truth stays in SQLite.
- Hard to misuse. Product code should not touch raw SQLite, provider session objects, locks/events, or private runtime state directly. It should go through narrow runtime primitives.
- Restartability is required. Crash-safe intent, crash-safe ownership, and crash-safe parked resume are framework requirements, not follow-up polish.
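As a rough illustration of the "durable truth" and "crash-safe parked resume" decisions above, a parked worker frame could be persisted along these lines. This is only a sketch: the `parked_frames` table, its columns, and the `open_store`, `park`, and `resume` helpers are hypothetical names, not the schema that #691 will define.

```python
import json
import sqlite3

# Hypothetical schema; the real one is the subject of #691.
SCHEMA = """
CREATE TABLE IF NOT EXISTS parked_frames (
    worker_id  TEXT PRIMARY KEY,
    phase      TEXT NOT NULL,
    checkpoint TEXT NOT NULL  -- JSON blob, opaque to the store
);
"""


def open_store(path):
    """Open (or create) the per-repo store file."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def park(conn, worker_id, phase, checkpoint):
    """Durably record where a worker stopped, replacing any older frame."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO parked_frames VALUES (?, ?, ?)",
            (worker_id, phase, json.dumps(checkpoint)),
        )


def resume(conn, worker_id):
    """Return (phase, checkpoint) and clear the frame, or None if not parked."""
    with conn:
        row = conn.execute(
            "SELECT phase, checkpoint FROM parked_frames WHERE worker_id = ?",
            (worker_id,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "DELETE FROM parked_frames WHERE worker_id = ?", (worker_id,)
        )
    return row[0], json.loads(row[1])
```

Because the frame is committed before the worker is considered parked, a runtime that crashes and reopens the same file can call `resume` and pick up from the recorded phase. The immediate wakeup itself would still travel over the out-of-band signal, not through this table.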
Recent partial fix
Merged PR #706 closed #672 by making the review-thread claim barrier bidirectional between the webhook handler and Worker.handle_threads().
That was the right tactical fix for the duplicate-reply race, but it is still an in-process shared-claim workaround. It should be treated as proof of the needed semantics, not as the final architecture:
- ingress and worker are still coordinating through shared mutable process state
- the dedupe barrier still lives in product code instead of a runtime-owned command/idempotence boundary
- the regression coverage still leans on patch-heavy race-window tests rather than a scenario runtime harness
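For contrast with the in-process claim barrier, a runtime-owned idempotence boundary might look roughly like the following. It is a minimal sketch under assumptions: the `inbox` table, the key format, and the `open_inbox` and `submit` helpers are invented for illustration and are not taken from #706 or #551.

```python
import sqlite3


def open_inbox():
    """Create an in-memory inbox; a real runtime would use the repo's store file."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE inbox (
               idempotence_key TEXT PRIMARY KEY,
               kind            TEXT NOT NULL,
               payload         TEXT NOT NULL
           )"""
    )
    return conn


def submit(conn, key, kind, payload):
    """Record a command once. Returns True if this call inserted it,
    False if a command with the same key was already recorded."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox VALUES (?, ?, ?)",
            (key, kind, payload),
        )
    return cur.rowcount == 1
```

With this shape, the webhook handler and the worker would both call `submit` with the same key for the same review-thread reply, and only the first caller sees `True`. The dedupe decision then lives in the store's primary-key constraint instead of in shared mutable process state.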
Required semantics
- Immediate provider-turn interruption for urgent webhook work.
- Session steal without worker death: interrupt the current provider turn, park the worker, drain the urgent burst, then yield back and resume the parked worker.
- Distinct preemption meanings: STEAL_SESSION, ABORT_KEEP, and ABORT_DROP.
- Direct provider child processes remain allowed, but only under repo-runtime ownership.
- 100% framework-owned concurrency: no product code may spin raw threads, use mutexes/events, or coordinate with flock.
- One authoritative owner per repo for provider/worktree/store mutation.
- Restartability: queued work survives crashes, stale lease holders cannot resume incorrectly, and parked work resumes from explicit frames/checkpoints.
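The steal, park, and resume semantics above, together with the rule that stale lease holders cannot resume incorrectly, can be sketched with an epoch-fenced lease. Everything here is illustrative: the `Lease` class and its methods are hypothetical, the model is in-memory only, and it covers STEAL_SESSION alone. The other two enum members are listed by name only, since this issue does not define their exact behavior.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Preempt(Enum):
    STEAL_SESSION = "steal_session"
    ABORT_KEEP = "abort_keep"    # semantics not modeled in this sketch
    ABORT_DROP = "abort_drop"    # semantics not modeled in this sketch


@dataclass
class Lease:
    """Single-owner lease; every ownership change bumps the epoch."""
    epoch: int = 0
    holder: Optional[str] = None
    parked: List[Tuple[str, str]] = field(default_factory=list)  # (worker, phase)

    def acquire(self, holder):
        self.epoch += 1
        self.holder = holder
        return self.epoch

    def steal(self, urgent_holder, current_phase):
        """STEAL_SESSION: park the current holder, hand the lease to urgent work."""
        if self.holder is not None:
            self.parked.append((self.holder, current_phase))
        return self.acquire(urgent_holder)

    def yield_back(self):
        """Return the lease to the most recently parked worker, if any."""
        if not self.parked:
            self.holder = None
            return None
        worker, phase = self.parked.pop()
        return worker, phase, self.acquire(worker)

    def accept_result(self, epoch):
        """Late results from a superseded epoch are suppressed."""
        return epoch == self.epoch
```

A result produced under the worker's original epoch is rejected once the urgent handler has stolen the lease, and the worker only becomes authoritative again after `yield_back` issues it a fresh epoch along with its parked phase.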
GitHub hierarchy note
GitHub will not allow a deeper three-level subtree here once #681 lives under #396, because the repo's version tree already consumes most of the seven-layer hierarchy limit.
So the detailed runtime issues are intentionally tracked as direct children of #681.
Direct child issues
Contract and misuse rails
- #682 Summary issue for the framework contract lane
- #550 Define authoritative repo/PR/thread coordination model and transition vocabulary
- #684 Specify repo-runtime states, parked worker frames, and epoch rules
- #685 Define the only public concurrency API and ban direct reach-through
Runtime kernel
- #686 Summary issue for the framework kernel lane
- #687 Build supervisor and per-repo runtime processes
- #688 Implement prioritized mailboxes, preemption transport, timers, and snapshots
- #689 Supervise repo runtimes and provider children with explicit restart semantics
Durable store
- #690 Summary issue for the framework store lane
- #553 Unify durable owed-reply and task intents behind cohesive outbox/store services
- #691 Design SQLite schema for commands, tasks, frames, leases, epochs, outbox state, and snapshots
- #692 Migrate tasks.json, state.json, reply promises, and sync.lock semantics into the store
- #693 Delete flock/lockfile protocols and make direct filesystem coordination unsupported
Provider leases and interruption
- #694 Summary issue for the provider-lease lane
- #695 Ensure only the repo runtime can own or talk to a provider session
- #696 Interrupt current turns immediately, drain comment bursts, then yield back
- #697 Persist session identity, suppress late results, and resume parked work safely
Product-flow migration
- #698 Summary issue for the flow-migration lane
- #551 Translate webhook ingress into injected commands with explicit idempotence keys
- #552 Introduce repo, PR, and thread coordinators with constructor-injected collaborators
- #700 Split worker execution into resumable phases and parked worker frames
- #701 Route provider-using helper flows through runtime actions instead of direct session access
Cleanup and guardrails
- #702 Summary issue for the cleanup lane
- #554 Move worker recovery and wake/abort orchestration behind explicit transition services
- #555 Rewrite coordination tests around scenario fakes instead of patching timing edges
- #703 Move rescope, sync, watchdog, status, and registry orchestration onto runtime commands
- #704 Remove raw background threads, locks, events, queues, and flock from product code
Adjacent infrastructure built in-place (pre-existing to this umbrella)
A chunk of substrate that fits this umbrella's shape has been built inside the existing kennel/ tree rather than as the unified kernel this issue describes. These modules should either be refactored into the kernel as its items land, or the kernel scope should shrink because the in-place versions prove sufficient.
| Module | Lines | Current shape | Migration target inside this umbrella |
| --- | --- | --- | --- |
| kennel/registry.py | 470 | Per-repo WorkerThread lifecycle, activity + crash reporting, per-repo IssueTreeCache ownership, provider rescue across Worker crashes. Constructor-DI shaped. | #687 per-repo runtimes (natural home). |
| kennel/issue_cache.py | 457 | Lock-protected per-repo issue tree cache. Idempotent webhook-event application with timestamp-ordered staleness rejection, pre-inventory queue, hourly reconcile. | A Rocq-modeled Band-D item under #710 (new Dn, closest to #743 D5 webhook→command). |
| kennel/cache_webhooks.py | 183 | Pure value-only translator: raw GitHub webhook → cache event tuple. Handles issues and sub_issues event families. | Subsumed by #743 D5 webhook→command when that Rocq model lands. |
| kennel/rate_limit.py | 156 | 60s poller on GET /rate_limit; lock-protected snapshot. Exposed in kennel status. | Framework timer primitive #688 (mailboxes/timers/snapshots) would own it. |
| kennel/watchdog.py | 148 | Two classes: Watchdog (WorkerThread liveness, restart-on-death) and ReconcileWatchdog (hourly cache reconcile). | Watchdog → #689 supervisor with restart semantics. ReconcileWatchdog travels with whichever item subsumes issue_cache.py. |
These are real progress on the kernel-shaped substrate. The open question this umbrella resolves: are they refactored into the kernel, or does the kernel scope collapse because they already suffice?