Framework: replace mutexes, flock, and raw threads with a repo runtime kernel #681

@FidoCanCode

Problem

Kennel's runtime concurrency is still spread across raw threads, thread events/locks, file locks (flock), registry-owned shared state, and background helper threads. Correctness depends on subtle interleavings between webhook handlers, worker turns, provider sessions, restart/recovery, and file-backed coordination.

On the free-threaded Python runtime this is exactly the wrong shape. The goal is not a better lock graph. The goal is a runtime kernel that owns concurrency so product code cannot keep inventing it.

Scope

This is the active runtime-concurrency subtree under #396.

It absorbs the still-relevant work from the older #548 coordination lane:

  • #550 contract / ownership vocabulary
  • #551 and #552 webhook ingress and coordinator boundaries
  • #553 durable outbox/store work
  • #554 worker recovery and wake/abort cleanup
  • #555 scenario-test migration

Hard decisions for this subtree

  1. One runtime instance per repo. The default shape is one repo-runtime instance per repo, likely one process per repo under a thin supervisor. Do not build one giant shared multi-repo runtime with shared mutable internals.
  2. SQLite is the durable truth. Use SQLite for the command inbox, parked frames, leases/session identity, epochs, outbox state, snapshots, and migrated task/state storage.
  3. Real preemption, not just prioritization. Urgent work must be able to interrupt an active provider turn immediately, seize the lease inside the runtime owner, park the worker, drain the urgent burst, then yield back and resume.
  4. Preemption transport stays out-of-band. SQLite is not the wakeup path. Immediate preemption should use a direct runtime-owned signal/channel while durable truth stays in SQLite.
  5. Hard to misuse. Product code should not touch raw SQLite, provider session objects, locks/events, or private runtime state directly. It should go through narrow runtime primitives.
  6. Restartability is required. Crash-safe intent, crash-safe ownership, and crash-safe parked resume are framework requirements, not follow-up polish.
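
As a minimal sketch of decisions 2 and 6, the snippet below keeps a prioritized command inbox and an epoch-stamped lease in SQLite, so that a stale lease holder cannot complete work after ownership has moved on. The table names (`inbox`, `lease`), the column layout, and the helper names are all hypothetical illustrations, not the schema planned under #691. The out-of-band wakeup path from decision 4 is deliberately left out.

```python
import sqlite3

# Hypothetical schema: not the one planned under #691.
SCHEMA = """
CREATE TABLE IF NOT EXISTS inbox (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    priority INTEGER NOT NULL,          -- lower value = more urgent
    payload  TEXT    NOT NULL,
    done     INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS lease (
    repo   TEXT PRIMARY KEY,
    holder TEXT    NOT NULL,
    epoch  INTEGER NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) the per-repo store."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def enqueue(conn, priority, payload):
    """Durably record a command; returns its inbox id."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO inbox (priority, payload) VALUES (?, ?)",
            (priority, payload),
        )
    return cur.lastrowid


def next_command(conn):
    """Return (id, payload) of the most urgent pending command, or None."""
    return conn.execute(
        "SELECT id, payload FROM inbox WHERE done = 0 "
        "ORDER BY priority, id LIMIT 1"
    ).fetchone()


def acquire_lease(conn, repo, holder):
    """Take the repo lease and bump its epoch; returns the new epoch."""
    with conn:
        row = conn.execute(
            "SELECT epoch FROM lease WHERE repo = ?", (repo,)
        ).fetchone()
        epoch = 1 if row is None else row[0] + 1
        conn.execute(
            "INSERT OR REPLACE INTO lease (repo, holder, epoch) VALUES (?, ?, ?)",
            (repo, holder, epoch),
        )
    return epoch


def complete(conn, repo, epoch, command_id):
    """Mark a command done only if the caller still holds the current epoch."""
    with conn:
        row = conn.execute(
            "SELECT epoch FROM lease WHERE repo = ?", (repo,)
        ).fetchone()
        if row is None or row[0] != epoch:
            return False  # stale holder: its late result is suppressed
        conn.execute("UPDATE inbox SET done = 1 WHERE id = ?", (command_id,))
    return True
```

With a file path instead of `:memory:`, queued commands and the lease epoch survive a process restart, which is the property the restartability requirement asks for.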

Recent partial fix

Merged PR #706 closed #672 by making the review-thread claim barrier bidirectional between the webhook handler and Worker.handle_threads().

That was the right tactical fix for the duplicate-reply race, but it is still an in-process shared-claim workaround. It should be treated as proof of the needed semantics, not as the final architecture:

  • ingress and worker are still coordinating through shared mutable process state
  • the dedupe barrier still lives in product code instead of a runtime-owned command/idempotence boundary
  • the regression coverage still leans on patch-heavy race-window tests rather than a scenario runtime harness
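
One way a runtime-owned idempotence boundary could replace the in-process claim barrier is sketched below: the first caller to insert a key wins, and every later caller sees the claim as already taken. The `claims` table, the `try_claim` helper, and the key format are assumptions made for illustration; they are not taken from PR #706 or from the planned #551 design.

```python
import sqlite3


def open_claims(path=":memory:"):
    """Open a store holding one row per claimed idempotence key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims ("
        " idempotence_key TEXT PRIMARY KEY,"
        " claimed_by TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def try_claim(conn, key, claimant):
    """Return True if this caller claimed the key, False if it was already taken."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO claims (idempotence_key, claimed_by) "
            "VALUES (?, ?)",
            (key, claimant),
        )
    # The primary-key conflict makes the second insert a no-op (rowcount 0).
    return cur.rowcount == 1
```

Because the claim is a single durable insert, the webhook handler and the worker no longer need to agree on shared mutable process state to avoid a duplicate reply.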

Required semantics

  1. Immediate provider-turn interruption for urgent webhook work.
  2. Session steal without worker death: interrupt the current provider turn, park the worker, drain the urgent burst, then yield back and resume the parked worker.
  3. Distinct preemption meanings: STEAL_SESSION, ABORT_KEEP, and ABORT_DROP.
  4. Direct provider child processes remain allowed, but only under repo-runtime ownership.
  5. 100% framework-owned concurrency: no product code may spin raw threads, use mutexes/events, or coordinate with flock.
  6. One authoritative owner per repo for provider/worktree/store mutation.
  7. Restartability: queued work survives crashes, stale lease holders cannot resume incorrectly, and parked work resumes from explicit frames/checkpoints.
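
The three preemption meanings and the park/drain/resume cycle can be illustrated with a small single-owner state machine. This is a sketch under stated assumptions: the issue names STEAL_SESSION, ABORT_KEEP, and ABORT_DROP but does not define them here, so the interpretation below (steal parks the frame, abort-keep requeues the task, abort-drop discards it) is a guess, and `Frame`, `RepoRuntime`, and their fields are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Preempt(Enum):
    STEAL_SESSION = auto()  # assumed: park the worker so it can resume later
    ABORT_KEEP = auto()     # assumed: abort the turn, keep the task queued
    ABORT_DROP = auto()     # assumed: abort the turn and discard the task


@dataclass
class Frame:
    """Explicit checkpoint for a parked worker (hypothetical shape)."""
    task_id: str
    phase: str


@dataclass
class RepoRuntime:
    """Single owner of the active turn; product code never touches these fields."""
    active: Optional[Frame] = None
    parked: List[Frame] = field(default_factory=list)
    requeued: List[str] = field(default_factory=list)

    def preempt(self, kind: Preempt) -> None:
        """Interrupt the active turn according to the preemption kind."""
        frame, self.active = self.active, None
        if frame is None:
            return
        if kind is Preempt.STEAL_SESSION:
            self.parked.append(frame)
        elif kind is Preempt.ABORT_KEEP:
            self.requeued.append(frame.task_id)
        # ABORT_DROP: the frame is simply discarded.

    def drain_urgent(self, burst: List[str]) -> List[str]:
        """Handle every urgent item in arrival order before yielding back."""
        handled = []
        while burst:
            handled.append(burst.pop(0))
        return handled

    def resume(self) -> Optional[Frame]:
        """Yield back to the most recently parked frame, if any."""
        if self.active is None and self.parked:
            self.active = self.parked.pop()
        return self.active
```

A real kernel would persist `parked` frames in the durable store so that resume also works after a crash; the in-memory lists here only show the ownership transitions.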

GitHub hierarchy note

GitHub will not allow a deeper three-level subtree here once #681 lives under #396, because the repo's version tree already consumes most of the seven-layer hierarchy limit.

So the detailed runtime issues are intentionally tracked as direct children of #681.

Direct child issues

Contract and misuse rails

  • #682 Summary issue for the framework contract lane
  • #550 Define authoritative repo/PR/thread coordination model and transition vocabulary
  • #684 Specify repo-runtime states, parked worker frames, and epoch rules
  • #685 Define the only public concurrency API and ban direct reach-through

Runtime kernel

  • #686 Summary issue for the framework kernel lane
  • #687 Build supervisor and per-repo runtime processes
  • #688 Implement prioritized mailboxes, preemption transport, timers, and snapshots
  • #689 Supervise repo runtimes and provider children with explicit restart semantics

Durable store

  • #690 Summary issue for the framework store lane
  • #553 Unify durable owed-reply and task intents behind cohesive outbox/store services
  • #691 Design SQLite schema for commands, tasks, frames, leases, epochs, outbox state, and snapshots
  • #692 Migrate tasks.json, state.json, reply promises, and sync.lock semantics into the store
  • #693 Delete flock/lockfile protocols and make direct filesystem coordination unsupported

Provider leases and interruption

  • #694 Summary issue for the provider-lease lane
  • #695 Ensure only the repo runtime can own or talk to a provider session
  • #696 Interrupt current turns immediately, drain comment bursts, then yield back
  • #697 Persist session identity, suppress late results, and resume parked work safely

Product-flow migration

  • #698 Summary issue for the flow-migration lane
  • #551 Translate webhook ingress into injected commands with explicit idempotence keys
  • #552 Introduce repo, PR, and thread coordinators with constructor-injected collaborators
  • #700 Split worker execution into resumable phases and parked worker frames
  • #701 Route provider-using helper flows through runtime actions instead of direct session access

Cleanup and guardrails

  • #702 Summary issue for the cleanup lane
  • #554 Move worker recovery and wake/abort orchestration behind explicit transition services
  • #555 Rewrite coordination tests around scenario fakes instead of patching timing edges
  • #703 Move rescope, sync, watchdog, status, and registry orchestration onto runtime commands
  • #704 Remove raw background threads, locks, events, queues, and flock from product code

Adjacent infrastructure built in-place (pre-existing to this umbrella)

A chunk of substrate that fits this umbrella's shape has been built inside the existing kennel/ tree rather than as the unified kernel this issue describes. Each module below should either be refactored into the kernel as the matching item lands, or the kernel scope should shrink where the in-place version proves sufficient.

| Module | Lines | Current shape | Migration target inside this umbrella |
| --- | --- | --- | --- |
| kennel/registry.py | 470 | Per-repo WorkerThread lifecycle, activity + crash reporting, per-repo IssueTreeCache ownership, provider rescue across Worker crashes. Constructor-DI shaped. | #687 per-repo runtimes (natural home). |
| kennel/issue_cache.py | 457 | Lock-protected per-repo issue tree cache. Idempotent webhook-event application with timestamp-ordered staleness rejection, pre-inventory queue, hourly reconcile. | A Rocq-modeled Band-D item under #710 (new Dn, closest to #743 D5 webhook→command). |
| kennel/cache_webhooks.py | 183 | Pure value-only translator: raw GitHub webhook → cache event tuple. Handles issues and sub_issues event families. | Subsumed by #743 D5 webhook→command when that Rocq model lands. |
| kennel/rate_limit.py | 156 | 60s poller on GET /rate_limit; lock-protected snapshot. Exposed in kennel status. | Framework timer primitive #688 (mailboxes/timers/snapshots) would own it. |
| kennel/watchdog.py | 148 | Two classes: Watchdog (WorkerThread liveness, restart-on-death) and ReconcileWatchdog (hourly cache reconcile). | Watchdog → #689 supervisor with restart semantics. ReconcileWatchdog travels with whichever item subsumes issue_cache.py. |

These are real progress on the kernel-shaped substrate. The open question this umbrella resolves: are they refactored into the kernel, or does the kernel scope collapse because they already suffice?
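
For reference, the "timestamp-ordered staleness rejection" noted for kennel/issue_cache.py can be sketched as follows. This is not the module's actual code: the class name, field layout, and `apply` signature are hypothetical, and the real cache also handles the pre-inventory queue and hourly reconcile, which are omitted here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Tuple


@dataclass
class IssueTreeCacheSketch:
    """Maps issue number -> (last applied timestamp, title)."""
    entries: Dict[int, Tuple[datetime, str]] = field(default_factory=dict)

    def apply(self, number: int, updated_at: datetime, title: str) -> bool:
        """Apply a webhook event unless it is older than, or a repeat of, what is stored."""
        current = self.entries.get(number)
        if current is not None and updated_at <= current[0]:
            # Stale or redelivered event: rejecting it keeps application idempotent.
            return False
        self.entries[number] = (updated_at, title)
        return True
```

Using `<=` rather than `<` means a redelivered webhook with the same timestamp is ignored, so applying the same event twice leaves the cache unchanged.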
