
Rectify: MCP Server Startup Race#803

Merged
Trecek merged 11 commits into integration from mcp-server-startup-race-open-kitchen-no-such-tool-available/775
Apr 13, 2026

Conversation


@Trecek Trecek commented Apr 13, 2026

Summary

The MCP server performs 5-7 seconds of synchronous blocking work before calling mcp.run(), which opens the stdio transport. Claude Code enforces a ~5-second connection timeout. When the startup work exceeds this budget, Claude Code declares the server unreachable and the model receives "No such tool available" for all MCP tools — an unrecoverable error class with no prompt-level retry.

The architectural weakness is the absence of a structural boundary between transport-critical and deferrable startup work. All six startup phases run in the same synchronous block, with no contract, test, or enforcement preventing heavy I/O from accumulating on the critical path.

Part A establishes the core structural immunity: a lazy token factory that removes the 5-second subprocess from the critical path, an architectural test that enforces the startup budget as a testable invariant, and an integration test that guards against timing regressions.

Part B will cover deferred _initialize() background tasks, an async drift check, and a prompt-level "No such tool available" retry — to be implemented as a separate task.
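A minimal sketch of the lazy token factory described above. The names `TokenFactory` and `_UNRESOLVED` follow the diagrams below; the exact resolution details are illustrative, not the shipped implementation:

```python
import os
import subprocess

_UNRESOLVED = object()  # sentinel: distinguishes "never resolved" from "resolved to None"


class TokenFactory:
    """Lazy, caching token resolver: config -> env -> gh CLI.

    Resolution happens on first call, not at construction, so the
    5s `gh auth token` subprocess stays off the startup critical path.
    """

    def __init__(self, config_token=None):
        self._config_token = config_token
        self._cached = _UNRESOLVED

    def __call__(self):
        if self._cached is _UNRESOLVED:
            # single writer: under CPython's GIL a duplicate resolve is
            # harmless (resolution is idempotent), so no lock is needed
            self._cached = self._resolve()
        return self._cached

    def _resolve(self):
        if self._config_token:
            return self._config_token
        env_token = os.environ.get("GITHUB_TOKEN")
        if env_token:
            return env_token
        try:
            proc = subprocess.run(
                ["gh", "auth", "token"],
                capture_output=True, text=True, timeout=5,
            )
            return proc.stdout.strip() or None
        except (OSError, subprocess.TimeoutExpired):
            return None
```

The GitHub clients receive the factory as a callable, so the subprocess runs at most once, on the first gated tool call that actually needs a token.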

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    %% TERMINALS %%
    START(["START<br/>━━━━━━━━━━<br/>claude mcp serve"])
    READY(["TRANSPORT OPEN<br/>━━━━━━━━━━<br/>Tools available"])
    SHUTDOWN(["SHUTDOWN<br/>━━━━━━━━━━<br/>SIGTERM / Ctrl-C"])

    subgraph Phase1 ["● Phase 1 — CLI Entry (cli/app.py serve)"]
        direction TB
        LOG1["● configure_logging<br/>━━━━━━━━━━<br/>Phase 1: early init<br/>stderr, INFO or DEBUG"]
        CFG["● load_config<br/>━━━━━━━━━━<br/>Project .autoskillit/<br/>config.yaml"]
        LOG2{"● verbose OR<br/>config DEBUG?"}
        LOG2R["● reconfigure_logging<br/>━━━━━━━━━━<br/>Most verbose wins"]
    end

    subgraph Phase2 ["● Phase 2 — Context Factory (_factory.py)"]
        direction TB
        TF["● TokenFactory<br/>━━━━━━━━━━<br/>Lazy: config → env → gh CLI<br/>NOT resolved at startup"]
        MC["● make_context<br/>━━━━━━━━━━<br/>22 services wired<br/>gate=CLOSED"]
        GH["● GitHub clients<br/>━━━━━━━━━━<br/>CIWatcher, GitHubFetcher<br/>MergeQueueWatcher<br/>token=callable"]
        CW["● Circular wiring<br/>━━━━━━━━━━<br/>executor → ctx<br/>migrations → executor.run"]
    end

    subgraph Phase3 ["● Phase 3 — Synchronous Init (_state.py)"]
        direction TB
        INIT["● _initialize(ctx)<br/>━━━━━━━━━━<br/>Set _ctx singleton"]
        SUBSET{"● subsets.disabled<br/>non-empty?"}
        SUBDIS["● mcp.disable<br/>━━━━━━━━━━<br/>Per-subset tag disable"]
        RUNNER{"● Runner type?"}
        REC["● McpRecordingMiddleware<br/>━━━━━━━━━━<br/>Scenario capture"]
        REP["● McpReplayMiddleware<br/>━━━━━━━━━━<br/>Scenario replay"]
        NOREC["No middleware"]
    end

    subgraph Phase4 ["● Phase 4 — Module-Level Visibility (server/__init__.py)"]
        direction TB
        KITDIS["● mcp.disable kitchen<br/>━━━━━━━━━━<br/>Hide 39 kitchen tools"]
        HEADLESS{"● AUTOSKILLIT<br/>_HEADLESS=1?"}
        HLEN["● mcp.enable headless<br/>━━━━━━━━━━<br/>Reveal test_check"]
    end

    subgraph Phase5 ["● Phase 5 — Lifespan Pre-Yield (_lifespan.py)"]
        direction TB
        LSSTART["● _autoskillit_lifespan<br/>━━━━━━━━━━<br/>Create _startup_ready Event"]
        SUBMIT["● background.submit<br/>━━━━━━━━━━<br/>2 background tasks"]
    end

    subgraph BG1 ["● Background: Deferred Init (_state.py)"]
        direction TB
        CRASH["● recover_crashed_sessions<br/>━━━━━━━━━━<br/>reads tmpfs traces"]
        AUDIT["● audit.load_from_log_dir<br/>━━━━━━━━━━<br/>Last 24h telemetry"]
        STALE["● cleanup_stale<br/>━━━━━━━━━━<br/>Session skill dirs"]
        SIGNAL["● ready_event.set<br/>━━━━━━━━━━<br/>Unblock waiting tools"]
    end

    subgraph BG2 ["● Background: Drift Check (_lifespan.py)"]
        direction TB
        HASH{"● hooks.json<br/>hash == registry?"}
        REGEN["● generate_hooks_json<br/>━━━━━━━━━━<br/>atomic_write self-heal"]
        DROK["Drift check OK"]
    end

    subgraph ToolGate ["● Tool Readiness Gate (tools_status.py)"]
        direction TB
        TCHECK{"● _startup_ready<br/>is_set?"}
        TWAIT["● asyncio.wait_for<br/>━━━━━━━━━━<br/>timeout=30s"]
        TPROCEED["● Execute tool<br/>━━━━━━━━━━<br/>Audit data available"]
    end

    subgraph Teardown ["● Teardown (_lifespan.py finally)"]
        direction TB
        TREC{"● runner is<br/>Recording?"}
        TFIN["● recorder.finalize<br/>━━━━━━━━━━<br/>Persist scenario data"]
    end

    %% CRITICAL PATH FLOW %%
    START --> LOG1
    LOG1 --> CFG
    CFG --> LOG2
    LOG2 -->|"yes"| LOG2R
    LOG2 -->|"no"| TF
    LOG2R --> TF

    TF --> MC
    MC --> GH
    GH --> CW

    CW --> INIT
    INIT --> SUBSET
    SUBSET -->|"yes"| SUBDIS
    SUBSET -->|"no"| RUNNER
    SUBDIS --> RUNNER
    RUNNER -->|"RECORD_SCENARIO=1"| REC
    RUNNER -->|"REPLAY_SCENARIO=1"| REP
    RUNNER -->|"normal"| NOREC
    REC --> KITDIS
    REP --> KITDIS
    NOREC --> KITDIS

    KITDIS --> HEADLESS
    HEADLESS -->|"yes"| HLEN
    HEADLESS -->|"no"| LSSTART
    HLEN --> LSSTART

    LSSTART --> SUBMIT
    SUBMIT -->|"yield — transport opens"| READY

    %% BACKGROUND TASKS (post-yield, concurrent) %%
    SUBMIT -.->|"background task 1"| CRASH
    CRASH --> AUDIT
    AUDIT --> STALE
    STALE --> SIGNAL

    SUBMIT -.->|"background task 2"| HASH
    HASH -->|"mismatch"| REGEN
    HASH -->|"match"| DROK

    %% TOOL READINESS GATE %%
    READY -.->|"gated tool call<br/>e.g. get_pipeline_report"| TCHECK
    TCHECK -->|"yes"| TPROCEED
    TCHECK -->|"no — await"| TWAIT
    TWAIT -->|"event fires"| TPROCEED
    SIGNAL -.->|"unblocks"| TWAIT

    %% SHUTDOWN %%
    READY -->|"SIGTERM / KeyboardInterrupt"| TREC
    TREC -->|"yes"| TFIN
    TREC -->|"no"| SHUTDOWN
    TFIN --> SHUTDOWN

    %% CLASS ASSIGNMENTS %%
    class START,READY,SHUTDOWN terminal;
    class LOG1,CFG,LOG2R cli;
    class LOG2,SUBSET,RUNNER,HEADLESS,HASH,TCHECK,TREC stateNode;
    class TF,MC,GH,CW,INIT,SUBDIS,REC,REP,NOREC handler;
    class KITDIS,HLEN,LSSTART,SUBMIT phase;
    class CRASH,AUDIT,STALE,SIGNAL,REGEN,DROK newComponent;
    class TWAIT,TPROCEED,TFIN detector;

Module Dependency Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 70, 'curve': 'basis'}}}%%
graph TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef integration fill:#c62828,stroke:#ef9a9a,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;

    subgraph L3 ["LAYER 3 — APPLICATION"]
        direction LR

        subgraph ServerPkg ["● server/"]
            direction TB
            SRV_INIT["● __init__.py<br/>━━━━━━━━━━<br/>FastMCP app, gating"]
            SRV_FACTORY["● _factory.py<br/>━━━━━━━━━━<br/>Composition root<br/>make_context()"]
            SRV_LIFESPAN["● _lifespan.py<br/>━━━━━━━━━━<br/>Lifespan manager<br/>deferred startup"]
            SRV_STATE["● _state.py<br/>━━━━━━━━━━<br/>Lazy init, deferred<br/>_startup_ready event"]
            SRV_STATUS["● tools_status.py<br/>━━━━━━━━━━<br/>Status tool handlers"]
        end

        subgraph CliPkg ["● cli/"]
            direction TB
            CLI_APP["● app.py<br/>━━━━━━━━━━<br/>CLI entry point<br/>serve, init, cook"]
            CLI_PROMPTS["● _prompts.py<br/>━━━━━━━━━━<br/>Orchestrator prompt<br/>builder"]
        end
    end

    subgraph L2 ["LAYER 2 — DOMAIN"]
        direction LR
        RECIPE["recipe/<br/>━━━━━━━━━━<br/>Recipe schema,<br/>validation, rules<br/>fan-in: 37"]
        MIGRATION["migration/<br/>━━━━━━━━━━<br/>Version migration<br/>engine"]
    end

    subgraph L1 ["LAYER 1 — SERVICES"]
        direction LR

        subgraph ExecPkg ["● execution/"]
            direction TB
            EXEC_PROCESS["● process.py<br/>━━━━━━━━━━<br/>Subprocess facade<br/>PTY, race, kill"]
            EXEC_CI["● ci.py<br/>━━━━━━━━━━<br/>CI watcher<br/>deferred startup"]
            EXEC_GH["● github.py<br/>━━━━━━━━━━<br/>Issue fetcher<br/>deferred startup"]
            EXEC_MQ["● merge_queue.py<br/>━━━━━━━━━━<br/>Merge queue watcher<br/>deferred startup"]
        end

        PIPELINE["pipeline/<br/>━━━━━━━━━━<br/>ToolContext, gate,<br/>audit, telemetry<br/>fan-in: 13"]
        CONFIG["config/<br/>━━━━━━━━━━<br/>Settings, defaults<br/>fan-in: 20"]
        WORKSPACE["workspace/<br/>━━━━━━━━━━<br/>Clone, worktree,<br/>skills resolver<br/>fan-in: 14"]
    end

    subgraph L0 ["LAYER 0 — FOUNDATION"]
        direction LR
        CORE["core/<br/>━━━━━━━━━━<br/>Types, I/O, paths,<br/>logging, enums<br/>fan-in: 103"]
        HOOK_REG["hook_registry.py<br/>━━━━━━━━━━<br/>HookDef, registry"]
    end

    subgraph TestsNew ["★ NEW TEST FILES"]
        direction LR
        T_ARCH["★ tests/arch/<br/>test_startup_budget.py"]
        T_CLI["★ tests/cli/<br/>test_startup_budget.py"]
    end

    %% === SERVER INTERNAL WIRING === %%
    SRV_INIT -->|"imports"| SRV_STATE
    SRV_INIT -->|"imports"| SRV_LIFESPAN
    SRV_INIT -->|"imports"| SRV_FACTORY
    SRV_INIT -->|"imports"| SRV_STATUS
    SRV_LIFESPAN -->|"deferred init"| SRV_STATE

    %% === SERVER → L1/L2/L0 === %%
    SRV_FACTORY -->|"composes"| PIPELINE
    SRV_FACTORY -->|"composes"| CONFIG
    SRV_FACTORY -->|"composes"| WORKSPACE
    SRV_FACTORY -->|"composes"| RECIPE
    SRV_FACTORY -->|"composes"| MIGRATION
    SRV_FACTORY -->|"composes all executors"| EXEC_CI
    SRV_FACTORY --> CORE

    SRV_LIFESPAN -->|"deferred I/O"| EXEC_PROCESS
    SRV_LIFESPAN --> HOOK_REG
    SRV_LIFESPAN --> CORE

    SRV_STATE --> CONFIG
    SRV_STATE --> PIPELINE
    SRV_STATE --> CORE

    SRV_INIT --> PIPELINE
    SRV_INIT --> CORE
    SRV_STATUS --> PIPELINE
    SRV_STATUS --> CORE

    %% === CLI → dependencies === %%
    CLI_APP -->|"serve"| SRV_INIT
    CLI_APP --> CONFIG
    CLI_APP --> RECIPE
    CLI_APP --> MIGRATION
    CLI_APP --> WORKSPACE
    CLI_APP --> CORE
    CLI_APP -->|"build_interactive_cmd"| EXEC_PROCESS

    CLI_PROMPTS --> RECIPE
    CLI_PROMPTS --> CONFIG
    CLI_PROMPTS --> CORE

    %% === EXECUTION INTERNAL === %%
    EXEC_CI -->|"github_headers"| EXEC_GH
    EXEC_MQ -->|"github_headers"| EXEC_GH
    EXEC_PROCESS --> CONFIG

    %% === EXECUTION → L0 === %%
    EXEC_CI --> CORE
    EXEC_GH --> CORE
    EXEC_MQ --> CORE
    EXEC_PROCESS --> CORE

    %% === L2 → L0/L1 === %%
    RECIPE --> CORE
    MIGRATION --> CORE

    %% === L1 → L0 === %%
    PIPELINE --> CORE
    CONFIG --> CORE
    WORKSPACE --> CORE

    %% === TEST COVERAGE === %%
    T_ARCH -.->|"validates"| SRV_LIFESPAN
    T_CLI -.->|"validates"| CLI_APP

    %% CLASS ASSIGNMENTS %%
    class SRV_INIT,SRV_FACTORY,SRV_LIFESPAN,SRV_STATE,SRV_STATUS,CLI_APP,CLI_PROMPTS cli;
    class EXEC_PROCESS,EXEC_CI,EXEC_GH,EXEC_MQ handler;
    class PIPELINE,CONFIG,WORKSPACE phase;
    class RECIPE,MIGRATION phase;
    class CORE,HOOK_REG stateNode;
    class T_ARCH,T_CLI newComponent;

Concurrency Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    START(["SERVER START<br/>cli/app.py serve()"])

    subgraph CriticalPath ["CRITICAL PATH (Sequential — No I/O)"]
        direction TB
        MAKE_CTX["● make_context()<br/>━━━━━━━━━━<br/>22 service contracts<br/>No subprocess calls<br/>TokenFactory deferred"]
        INIT["● _initialize(ctx)<br/>━━━━━━━━━━<br/>Assign _ctx singleton<br/>Wire middleware<br/>Apply subset visibility"]
        MCP_RUN["● mcp.run()<br/>━━━━━━━━━━<br/>Opens transport<br/>Enters lifespan"]
    end

    subgraph Lifespan ["● LIFESPAN PRE-YIELD (Fork Point)"]
        direction TB
        CREATE_EVT["● Create asyncio.Event<br/>━━━━━━━━━━<br/>_startup_ready = Event()"]
        SUBMIT_BG["● Submit background tasks<br/>━━━━━━━━━━<br/>ctx.background.submit()"]
    end

    subgraph BackgroundTasks ["BACKGROUND TASKS (Concurrent — After Transport Opens)"]
        direction LR
        subgraph DeferredInit ["● deferred_initialize()"]
            direction TB
            RECOVER["Crash recovery<br/>━━━━━━━━━━<br/>tmpfs session traces"]
            TELEMETRY["Telemetry load<br/>━━━━━━━━━━<br/>audit from log dir<br/>last 24h"]
            SKILL_CLEAN["Stale skill cleanup<br/>━━━━━━━━━━<br/>session_skill_manager"]
            SET_READY["ready_event.set()<br/>━━━━━━━━━━<br/>Signal: init complete"]
        end
        subgraph DriftCheck ["● _async_drift_check()"]
            direction TB
            THREAD_EXEC["run_in_executor<br/>━━━━━━━━━━<br/>Thread pool (sync I/O)"]
            HOOKS_HASH["Compare hooks.json<br/>━━━━━━━━━━<br/>vs HOOK_REGISTRY_HASH"]
            SELF_HEAL["atomic_write()<br/>━━━━━━━━━━<br/>Self-heal if stale"]
        end
    end

    subgraph Barrier ["SYNCHRONIZATION BARRIER"]
        direction TB
        READY_EVT{"● _startup_ready<br/>asyncio.Event"}
        WAIT_30S["● await wait_for()<br/>━━━━━━━━━━<br/>30s timeout guard"]
    end

    subgraph ToolAccess ["TOOL EXECUTION (After Barrier)"]
        direction TB
        PIPELINE_RPT["● get_pipeline_report()<br/>━━━━━━━━━━<br/>Awaits _startup_ready<br/>then reads ctx.audit"]
        GATED_TOOLS["Gated tools<br/>━━━━━━━━━━<br/>_get_ctx() access<br/>Gate must be enabled"]
    end

    subgraph LazyResolve ["● LAZY RESOLUTION (On First Use)"]
        direction TB
        TOKEN_FAC["● TokenFactory<br/>━━━━━━━━━━<br/>Sentinel-based cache<br/>GIL-safe single-writer"]
        GH_CLI["gh auth token<br/>━━━━━━━━━━<br/>5s subprocess timeout<br/>Deferred from startup"]
    end

    subgraph Supervisor ["DefaultBackgroundSupervisor"]
        direction TB
        TASK_SET["_tasks: set of Task<br/>━━━━━━━━━━<br/>Auto-discard on done"]
        SUPERVISE["_supervise_task()<br/>━━━━━━━━━━<br/>Exception capture<br/>Audit recording"]
        DRAIN["drain()<br/>━━━━━━━━━━<br/>asyncio.gather()<br/>Shutdown barrier"]
    end

    subgraph ProcessRace ["● PROCESS RACE (anyio TaskGroup)"]
        direction TB
        SPAWN_TG["anyio.create_task_group()<br/>━━━━━━━━━━<br/>Structured concurrency"]
        subgraph Watchers ["Concurrent Watchers"]
            direction LR
            W_PROC["_watch_process<br/>━━━━━━━━━━<br/>Wait for exit"]
            W_HB["_watch_heartbeat<br/>━━━━━━━━━━<br/>Channel A stdout"]
            W_SESS["_watch_session_log<br/>━━━━━━━━━━<br/>Channel B log dir"]
            W_IDLE["_watch_stdout_idle<br/>━━━━━━━━━━<br/>Idle timeout"]
        end
        RACE_ACC["● RaceAccumulator<br/>━━━━━━━━━━<br/>Lock-free: cooperative<br/>async guarantees no<br/>concurrent writes"]
        TRIGGER{"anyio.Event<br/>trigger"}
        RESOLVE["resolve_termination()<br/>━━━━━━━━━━<br/>Immutable RaceSignals"]
        SYM_DRAIN["Symmetric drain<br/>━━━━━━━━━━<br/>channel_b_ready.wait()<br/>bounded timeout"]
    end

    subgraph LifespanExit ["● LIFESPAN POST-YIELD (Teardown)"]
        direction TB
        RECORDER_FIN["recorder.finalize()<br/>━━━━━━━━━━<br/>Scenario data survives<br/>SIGTERM"]
    end

    COMPLETE(["SERVER STOP"])

    %% CRITICAL PATH FLOW %%
    START --> MAKE_CTX
    MAKE_CTX --> INIT
    INIT --> MCP_RUN
    MCP_RUN --> CREATE_EVT

    %% LIFESPAN FORK %%
    CREATE_EVT --> SUBMIT_BG
    SUBMIT_BG -->|"background task 1"| RECOVER
    SUBMIT_BG -->|"background task 2"| THREAD_EXEC

    %% DEFERRED INIT SEQUENCE %%
    RECOVER --> TELEMETRY
    TELEMETRY --> SKILL_CLEAN
    SKILL_CLEAN --> SET_READY

    %% DRIFT CHECK SEQUENCE %%
    THREAD_EXEC --> HOOKS_HASH
    HOOKS_HASH -->|"stale"| SELF_HEAL

    %% BARRIER %%
    SET_READY -->|"set()"| READY_EVT
    READY_EVT -->|"is_set()"| WAIT_30S
    WAIT_30S --> PIPELINE_RPT

    %% TOOL ACCESS %%
    PIPELINE_RPT --> GATED_TOOLS
    GATED_TOOLS -->|"first GitHub call"| TOKEN_FAC
    TOKEN_FAC -->|"resolve once"| GH_CLI

    %% SUPERVISOR STRUCTURE %%
    SUBMIT_BG -.->|"manages"| TASK_SET
    TASK_SET -.-> SUPERVISE
    SUPERVISE -.-> DRAIN

    %% PROCESS RACE %%
    GATED_TOOLS -->|"run_skill / headless"| SPAWN_TG
    SPAWN_TG --> W_PROC
    SPAWN_TG --> W_HB
    SPAWN_TG --> W_SESS
    SPAWN_TG --> W_IDLE
    W_PROC -->|"deposit signal"| RACE_ACC
    W_HB -->|"deposit signal"| RACE_ACC
    W_SESS -->|"deposit signal"| RACE_ACC
    W_IDLE -->|"deposit signal"| RACE_ACC
    RACE_ACC --> TRIGGER
    TRIGGER -->|"first signal fires"| RESOLVE
    RESOLVE --> SYM_DRAIN

    %% SHUTDOWN %%
    DRAIN --> RECORDER_FIN
    RECORDER_FIN --> COMPLETE

    %% CLASS ASSIGNMENTS %%
    class START,COMPLETE terminal;
    class MAKE_CTX,INIT,MCP_RUN phase;
    class CREATE_EVT,SUBMIT_BG detector;
    class RECOVER,TELEMETRY,SKILL_CLEAN handler;
    class SET_READY output;
    class THREAD_EXEC,HOOKS_HASH,SELF_HEAL handler;
    class READY_EVT stateNode;
    class WAIT_30S detector;
    class PIPELINE_RPT,GATED_TOOLS handler;
    class TOKEN_FAC,GH_CLI stateNode;
    class TASK_SET,SUPERVISE,DRAIN stateNode;
    class SPAWN_TG detector;
    class W_PROC,W_HB,W_SESS,W_IDLE newComponent;
    class RACE_ACC stateNode;
    class TRIGGER detector;
    class RESOLVE,SYM_DRAIN handler;
    class RECORDER_FIN output;

Closes #775

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-20260412-163055-100049/.autoskillit/temp/rectify/rectify_mcp_server_startup_race_2026-04-12_170200_part_a.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | uncached | output | cache_read | cache_write | count | time |
|------|----------|--------|------------|-------------|-------|------|
| investigate | 30 | 12.7k | 962.1k | 61.7k | 1 | 8m 57s |
| rectify | 33 | 24.5k | 1.5M | 82.9k | 1 | 10m 11s |
| review | 1.1k | 7.7k | 424.2k | 60.0k | 1 | 4m 43s |
| dry_walkthrough | 1.5k | 42.5k | 3.6M | 194.4k | 2 | 18m 42s |
| implement | 156 | 56.7k | 11.6M | 191.4k | 2 | 31m 36s |
| assess | 235 | 64.6k | 9.9M | 349.3k | 7 | 1h 32m |
| prepare_pr | 8 | 7.8k | 101.6k | 34.0k | 1 | 2m 7s |
| run_arch_lenses | 6.7k | 22.3k | 997.5k | 145.1k | 3 | 12m 13s |
| compose_pr | 10 | 11.3k | 256.0k | 39.2k | 1 | 3m 8s |
| **Total** | 9.8k | 250.0k | 29.3M | 1.2M | | 3h 4m |


@Trecek Trecek left a comment


AutoSkillit PR Review — Verdict: changes_requested (2 critical, 11 warning findings). Cannot use REQUEST_CHANGES event on own PR — posting as COMMENT.

Comment thread src/autoskillit/server/_state.py Outdated
Comment thread src/autoskillit/server/_lifespan.py
return gate
structlog.contextvars.clear_contextvars()
with structlog.contextvars.bound_contextvars(tool="get_pipeline_report"):
from autoskillit.server._state import _startup_ready

[warning] arch: Function-scoped from _state import _startup_ready captures the current value at call time, not a live reference. If _startup_ready is reassigned in _state (e.g., on the next lifespan cycle), this local binding holds the stale event. Use _state._startup_ready (module attribute dereference) instead.

Comment thread src/autoskillit/server/tools_status.py Outdated

def __init__(self, *, token: str | None = None) -> None:
self._token = token
_UNRESOLVED = object()

[warning] cohesion: DefaultCIWatcher, DefaultGitHubFetcher, and DefaultMergeQueueWatcher each re-implement the lazy sentinel token pattern (_UNRESOLVED + _token_factory + _resolve_token). TokenFactory in _factory.py (L3) provides this but L1 cannot import it. Extract the pattern to L0 core/.
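One shape the extraction could take — a hypothetical L0 helper (the name `LazyValue` is illustrative) that each watcher composes instead of re-implementing the sentinel dance:

```python
from typing import Callable, Optional

_UNRESOLVED = object()


class LazyValue:
    """Hypothetical core/ helper: cache a factory's result on first use.

    The sentinel lets None be a legitimate cached result, so a factory
    that finds no token is still invoked only once.
    """

    def __init__(self, factory: Callable[[], Optional[str]]) -> None:
        self._factory = factory
        self._value: object = _UNRESOLVED

    def get(self) -> Optional[str]:
        if self._value is _UNRESOLVED:
            self._value = self._factory()
        return self._value  # type: ignore[return-value]

    @property
    def has_value(self) -> bool:
        return self.get() is not None


calls = []

def factory() -> Optional[str]:
    calls.append(1)
    return None  # e.g. no token configured anywhere

token = LazyValue(factory)
assert token.get() is None
assert token.get() is None
assert len(calls) == 1        # None was cached; the factory ran once
assert not token.has_value
```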

monkeypatch.setattr(_state_mod, "_startup_ready", ready)

task = asyncio.create_task(get_pipeline_report())
await asyncio.sleep(0.05)

[warning] tests: asyncio.sleep(0.05) timing-based synchronization is flaky under xdist -n 4 on loaded CI. 50ms may not be enough for the task to block/unblock deterministically. Use event-based synchronization.
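An event-based version of that synchronization, sketched without the project's fixtures (names are illustrative): the test blocks on a handshake event rather than guessing a sleep duration.

```python
import asyncio

async def demo() -> str:
    ready = asyncio.Event()     # stand-in for _startup_ready
    entered = asyncio.Event()   # handshake: the tool has started waiting

    async def gated_tool() -> str:
        entered.set()           # deterministic signal instead of sleep(0.05)
        await ready.wait()
        return "report"

    task = asyncio.create_task(gated_tool())
    await entered.wait()        # provably past the point where the tool blocks
    assert not task.done()      # still gated on readiness
    ready.set()
    return await task

assert asyncio.run(demo()) == "report"
```

Because `gated_tool` runs without suspending between `entered.set()` and `ready.wait()`, control only returns to the test once the task is parked on the readiness event — no timing assumption, so it holds under xdist load.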

result2 = json.loads(await get_pipeline_report())
assert result2["total_failures"] == 0

@pytest.mark.anyio

[warning] tests: @pytest.mark.anyio marker inconsistent with @pytest.mark.asyncio used by surrounding TestGetPipelineReport class. Mixing markers can cause fixture incompatibilities.

# --- T-INIT-5: Critical _initialize does not perform deferrable I/O ---


def test_initialize_critical_does_not_call_recovery(tmp_path, monkeypatch):

[warning] tests: Patches autoskillit.execution.recover_crashed_sessions at module level, but deferred_initialize imports it inside the function body (noqa PLC0415). If the import was already resolved in a prior test on the same xdist worker, the patch may not intercept.

import autoskillit.cli as cli_mod

start = time.monotonic()
cli_mod.serve()

[warning] tests: cli_mod.serve() installs signal.signal(SIGTERM) — a process-global handler. Not xdist-safe: parallel workers sharing the same process will have their SIGTERM handler replaced.


def tracking_submit(coro, **kw):
submitted.append(kw.get("label", ""))
coro.close()

[warning] tests: tracking_submit creates asyncio.create_task(asyncio.sleep(0)) that is never awaited or cancelled before the async with block exits. Dangling task may emit ResourceWarning.


Investigated — this is intentional. tracking_submit returns asyncio.create_task(asyncio.sleep(0)) as a standard stub for async task submission. sleep(0) completes within the same event loop tick, holds no resources or file descriptors, and resolves before the async with block exits during pytest-asyncio teardown. No ResourceWarning risk.

@Trecek Trecek enabled auto-merge April 13, 2026 15:07
Trecek and others added 11 commits April 13, 2026 08:08
Replace the eager _gh_cli_token() call in make_context() with a
TokenFactory that defers resolution until the first gated tool actually
needs a GitHub token. This removes a 5-second subprocess from the
critical path between serve() entry and mcp.run(), preventing Claude
Code's ~5s connection timeout from triggering "No such tool available".

- Add TokenFactory class: lazy-resolving, caching callable wrapper
- Update DefaultGitHubFetcher, DefaultCIWatcher to accept Callable token
- Update DefaultMergeQueueWatcher to defer httpx.AsyncClient creation
- Fix fetch_title() to use has_token instead of bare self._token access
- Update test_factory.py to use _resolve_token() instead of _token

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- tests/arch/test_startup_budget.py: AST-based structural guard that
  prevents subprocess calls or _gh_cli_token() from being added back
  to make_context() (REQ-STARTUP-001)
- tests/cli/test_startup_budget.py: Integration test measuring
  wall-clock time from serve() entry to mcp.run() call (2s budget)
- tests/server/test_factory.py: TokenFactory lazy resolution + caching
  tests, and make_context() subprocess-free verification

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
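The wall-clock measurement can be sketched with a fake transport (`FakeMCP` and this `serve` are stand-ins; the real test patches the actual `serve()` entry point):

```python
import time

BUDGET_S = 2.0  # startup budget from the integration test


class FakeMCP:
    """Records when run() — the transport-opening call — is reached."""

    def __init__(self) -> None:
        self.run_called_at = None

    def run(self) -> None:
        self.run_called_at = time.monotonic()


def serve(mcp: FakeMCP) -> None:
    # stand-in for the real serve(): logging, config, context wiring...
    time.sleep(0.05)  # simulated critical-path work
    mcp.run()         # ...then the transport opens


mcp = FakeMCP()
start = time.monotonic()
serve(mcp)
elapsed = mcp.run_called_at - start
assert elapsed < BUDGET_S
```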
The _iter_eager_calls() helper distinguishes eager calls (executed
during make_context()) from deferred calls inside lambdas or inner
functions. _gh_cli_token() inside a TokenFactory lambda is deferred
and thus acceptable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
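The eager-vs-deferred distinction can be sketched with the `ast` module (an illustrative reimplementation, not the shipped `_iter_eager_calls()`):

```python
import ast

def iter_eager_calls(func_source: str):
    """Yield names of calls executed when the function body runs,
    skipping calls nested inside lambdas or inner function defs
    (those bodies run later, when the closure is invoked)."""
    tree = ast.parse(func_source)
    func = tree.body[0]

    def walk(node):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.Lambda, ast.FunctionDef, ast.AsyncFunctionDef)):
                continue  # deferred: not executed during this call
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                yield child.func.id
            yield from walk(child)

    yield from walk(func)


src = """
def make_context():
    token = TokenFactory(lambda: _gh_cli_token())
    cfg = load_config()
"""
eager = list(iter_eager_calls(src))
assert "load_config" in eager          # runs eagerly during make_context()
assert "TokenFactory" in eager         # constructing the factory is eager
assert "_gh_cli_token" not in eager    # inside a lambda -> deferred, acceptable
```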
The test closed stdin before sending SIGTERM, causing the MCP stdio
transport to detect EOF and tear down the event loop before the signal
handler could fire. This bypassed lifespan cleanup and recorder.finalize().

Reversing the order ensures KeyboardInterrupt propagates through the
asyncio event loop, triggering proper lifespan __aexit__ teardown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split _initialize() into critical-path work (context set, tool visibility,
middleware registration) and deferred work (crash recovery, audit loading,
stale cleanup). The deferred operations now run as background tasks submitted
from the lifespan after mcp.run() opens the stdio transport.

Also moves run_startup_drift_check() into the lifespan as an async background
task, adds a readiness guard on get_pipeline_report to await deferred init
completion, and adds a prompt-level retry instruction for "No such tool
available" errors during server startup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update json.dumps allowlist line number in schema version convention
  test (tools_status.py shifted 333→338 from readiness guard addition)
- Harden SIGTERM recording test: increase deadline 5s→15s for xdist
  contention and skip gracefully if server never emits ready token
- Narrow mypy union-attr on timeout_scope (pre-existing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The lifespan sets _state._startup_ready via bare assignment during
_autoskillit_lifespan. When a lifespan test runs in the same xdist
worker before a test that calls get_pipeline_report(), the unset event
causes a 30s timeout caught by track_response_size, returning error
JSON without total_failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents 'virtual environment already exists' error when merge_worktree
runs install-worktree twice (pre-rebase and post-rebase test gates).
Without --clear, the second invocation fails because the venv from the
first gate still exists.
…vent.set()

Ensures ready_event.set() is always called even if CancelledError or other
BaseException propagates, preventing waiters from hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
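The try/finally guarantee can be sketched as follows (the long sleep stands in for the deferred work; names are illustrative):

```python
import asyncio

async def deferred_initialize(ready_event: asyncio.Event) -> None:
    try:
        await asyncio.sleep(3600)  # stand-in for crash recovery / audit load
    finally:
        # executes even on CancelledError, so waiters never hang
        ready_event.set()


async def main() -> bool:
    ready = asyncio.Event()
    task = asyncio.create_task(deferred_initialize(ready))
    await asyncio.sleep(0)   # let the task enter its try block
    task.cancel()            # simulate shutdown mid-initialization
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ready.is_set()


assert asyncio.run(main()) is True
```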
Returns a user-friendly fallback JSON instead of propagating an uncaught
TimeoutError when deferred_initialize does not complete within 30s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
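A sketch of that fallback shape (field names here are illustrative, not the tool's actual schema):

```python
import asyncio
import json

async def demo() -> dict:
    ready = asyncio.Event()  # stand-in for _startup_ready

    async def get_pipeline_report(timeout: float) -> str:
        try:
            await asyncio.wait_for(ready.wait(), timeout=timeout)
        except asyncio.TimeoutError:
            # fallback JSON instead of an uncaught TimeoutError
            return json.dumps({
                "status": "initializing",
                "detail": "deferred startup incomplete; retry shortly",
            })
        return json.dumps({"status": "ok", "total_failures": 0})

    timed_out = json.loads(await get_pipeline_report(timeout=0.01))
    ready.set()
    completed = json.loads(await get_pipeline_report(timeout=1.0))
    return {"timed_out": timed_out, "completed": completed}


out = asyncio.run(demo())
assert out["timed_out"]["status"] == "initializing"
assert out["completed"]["total_failures"] == 0
```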
….py line

The TimeoutError handling added to tools_status.py shifted the json.dumps
site from line 338 to 348.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Trecek Trecek force-pushed the mcp-server-startup-race-open-kitchen-no-such-tool-available/775 branch from a0eea04 to 1a2694c Compare April 13, 2026 15:11
@Trecek Trecek added this pull request to the merge queue Apr 13, 2026
Merged via the queue into integration with commit 7071065 Apr 13, 2026
2 checks passed
@Trecek Trecek deleted the mcp-server-startup-race-open-kitchen-no-such-tool-available/775 branch April 13, 2026 15:20
github-merge-queue bot pushed a commit that referenced this pull request Apr 14, 2026
#909)

## Summary

Two compounding failures erode the reliability of AutoSkillit's setup
and startup pathways:

1. **`run_doctor()` destroys the plugin installation state before
checking it**, producing a false OK, then never restoring it. Only
`autoskillit install` can recover. The fix removes
`_clear_plugin_cache()` from `run_doctor()` and adds read-only
diagnostic checks.

2. **A regression in #821 moved `run_startup_drift_check()` back onto
the synchronous critical path** before `mcp.run()`, undoing the core
structural fix from #803. The fix moves `run_startup_drift_check()` and
`deferred_initialize()` into lifespan background tasks, and removes
stale startup workarounds and dead monkeypatches.

### Commits

- `09fcbec6` fix: remove _clear_plugin_cache() from run_doctor() and add
read-only diagnostic checks
- `d571ecbd` fix: move run_startup_drift_check and deferred_initialize
into lifespan background tasks
- `eb095e0b` fix: remove stale startup workarounds and dead
monkeypatches
- `dddf537b` chore: bump version to 0.8.26
- `7ee9434b` fix: replace anyio task group with create_background_task
in lifespan and fix test expectations
- `8dbba7a1` test: add failing AST guards and integration tests for
phase boundary immunity

Closes #902

🤖 Generated with [Claude Code](https://claude.com/claude-code) via
AutoSkillit

## Token Usage Summary

| Step | uncached | output | cache_read | cache_write | count | time |
|------|----------|--------|------------|-------------|-------|------|
| plan | 328 | 28.2k | 1.8M | 132.8k | 2 | 7m 22s |
| verify | 2.3k | 19.5k | 1.0M | 89.5k | 2 | 5m 41s |
| implement | 2.6k | 22.6k | 2.9M | 99.5k | 2 | 8m 0s |
| prepare_pr | 60 | 14.2k | 194.7k | 35.3k | 1 | 3m 1s |
| run_arch_lenses | 2.7k | 23.7k | 584.4k | 156.7k | 3 | 10m 52s |
| compose_pr | 59 | 8.7k | 217.8k | 33.1k | 1 | 2m 26s |
| review_pr | 284 | 23.1k | 1.2M | 80.6k | 1 | 11m 29s |
| resolve_review | 221 | 9.4k | 919.1k | 34.9k | 1 | 5m 41s |
| fix | 182 | 6.9k | 754.3k | 31.9k | 1 | 10m 4s |
| **Total** | 8.8k | 156.4k | 9.5M | 694.2k | | 1h 4m |

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
