Spec 21: per-session static analysis pass — complexity / coverage / lint deltas #93

@0bserver07

Goal

For every session that touched code, compute pre/post static-analysis deltas for cyclomatic complexity, test coverage, lint findings, and type completeness (coverage is deferred in v1; see Hard parts). Surface sessions where the agent reduced complexity by 20% or more versus sessions where it increased complexity by 20% or more. This is the first metric that lets you say agent X is actually better than agent Y on YOUR code.

Why now

Outcome attribution today is git-correlation. That tells you the code shipped; it doesn't tell you whether the code is good. Static analysis is the cheapest objective signal.

Schema

v018: static_analysis_findings table:

CREATE TABLE IF NOT EXISTS static_analysis_findings (
  id INTEGER PRIMARY KEY,
  session_id TEXT NOT NULL,
  file_path TEXT NOT NULL,
  language TEXT NOT NULL,         -- 'python' | 'typescript' | 'go' | ...
  ts TEXT NOT NULL,               -- when the analysis ran
  metric TEXT NOT NULL,           -- 'complexity' | 'coverage' | 'lint_count' | 'type_completeness'
  pre_value REAL,                 -- before the session's edits
  post_value REAL,                -- after
  delta REAL,                     -- post - pre (NULL when one side is unobservable)
  details_json TEXT,              -- per-metric extras (e.g. lint rule ids)
  UNIQUE (session_id, file_path, metric)
);
CREATE INDEX IF NOT EXISTS idx_sa_session ON static_analysis_findings(session_id);
CREATE INDEX IF NOT EXISTS idx_sa_file ON static_analysis_findings(file_path);

Additive, IF NOT EXISTS-guarded.
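
A minimal sketch of how a row might be written against this schema, using only the standard-library sqlite3 module. The helper name record_finding and the choice of INSERT OR REPLACE are assumptions for illustration, not part of the spec; the point is that the UNIQUE constraint makes re-analysis overwrite rather than duplicate, and that delta stays NULL when either side is missing.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS static_analysis_findings (
  id INTEGER PRIMARY KEY,
  session_id TEXT NOT NULL,
  file_path TEXT NOT NULL,
  language TEXT NOT NULL,
  ts TEXT NOT NULL,
  metric TEXT NOT NULL,
  pre_value REAL,
  post_value REAL,
  delta REAL,
  details_json TEXT,
  UNIQUE (session_id, file_path, metric)
);
CREATE INDEX IF NOT EXISTS idx_sa_session ON static_analysis_findings(session_id);
CREATE INDEX IF NOT EXISTS idx_sa_file ON static_analysis_findings(file_path);
"""


def record_finding(conn, session_id, file_path, language, metric,
                   pre_value, post_value, details=None):
    """Hypothetical helper: write one finding, computing delta as post - pre."""
    # delta stays NULL when either side is unobservable, as the schema comment says.
    delta = None
    if pre_value is not None and post_value is not None:
        delta = post_value - pre_value
    # INSERT OR REPLACE leans on the UNIQUE constraint, so re-running analysis
    # for the same (session, file, metric) overwrites the previous row.
    conn.execute(
        "INSERT OR REPLACE INTO static_analysis_findings "
        "(session_id, file_path, language, ts, metric, "
        " pre_value, post_value, delta, details_json) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (session_id, file_path, language,
         datetime.now(timezone.utc).isoformat(), metric,
         pre_value, post_value, delta,
         json.dumps(details) if details is not None else None),
    )
    return delta


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript(SCHEMA)  # second run is a no-op thanks to IF NOT EXISTS
record_finding(conn, "sess-1", "app/main.py", "python", "complexity", 12.0, 9.0)
record_finding(conn, "sess-1", "app/main.py", "python", "complexity", 12.0, 9.0)
rows = conn.execute(
    "SELECT metric, pre_value, post_value, delta FROM static_analysis_findings"
).fetchall()
```

Writing the same finding twice leaves a single row, which is the property the backfill idempotency test relies on.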

User-visible surface

  • CLI: stackunderflow analyze session <id> runs analysis on a single session's touched files (using Playback v2 to reconstruct pre/post states).
  • CLI: stackunderflow analyze backfill [--since 30d] [--limit N] runs analysis on every recent session lacking findings.
  • API: GET /api/static-analysis/session/{id} — return findings for a session.
  • Meta-agent tool: get_session_quality(session_id) returns a structured quality summary.
  • UI: Quality column on Sessions tab + a "Quality" panel on the per-session detail view.
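
One way the structured quality summary could be derived from a session's findings is sketched below. The Finding dataclass, the function name summarize_session_quality, the output keys, and the choice to aggregate per metric before applying the 20% threshold are all assumptions for illustration; the spec only fixes the threshold and the metric names.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 0.20  # the 20% cut-off from the Goal section


@dataclass
class Finding:
    file_path: str
    metric: str
    pre_value: Optional[float]
    post_value: Optional[float]


def relative_change(pre: float, post: float) -> Optional[float]:
    """(post - pre) / pre, or None when the baseline is zero."""
    if pre == 0:
        return None
    return (post - pre) / pre


def summarize_session_quality(session_id: str, findings: list) -> dict:
    """Aggregate per-metric totals and label the session's complexity trend."""
    totals = {}
    for f in findings:
        if f.pre_value is None or f.post_value is None:
            continue  # e.g. file created in session: no baseline to compare
        t = totals.setdefault(f.metric, {"pre": 0.0, "post": 0.0})
        t["pre"] += f.pre_value
        t["post"] += f.post_value

    summary = {"session_id": session_id, "metrics": {}, "complexity_trend": "unknown"}
    for metric, t in totals.items():
        summary["metrics"][metric] = {
            "pre": t["pre"],
            "post": t["post"],
            "delta": t["post"] - t["pre"],
            "relative_change": relative_change(t["pre"], t["post"]),
        }

    change = summary["metrics"].get("complexity", {}).get("relative_change")
    if change is not None:
        if change <= -THRESHOLD:
            summary["complexity_trend"] = "reduced"
        elif change >= THRESHOLD:
            summary["complexity_trend"] = "increased"
        else:
            summary["complexity_trend"] = "flat"
    return summary


example = summarize_session_quality("sess-1", [
    Finding("a.py", "complexity", 10.0, 7.0),
    Finding("b.py", "complexity", 10.0, 8.0),
    Finding("c.py", "lint_count", None, 3.0),  # created in session, skipped
])
```

Summing before dividing keeps one tiny file from dominating the percentage; a per-file average would be an equally defensible choice.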

Implementation plan

  1. v018 migration.
  2. New module stackunderflow/services/static_analysis/ with one analyzer per language:
    • python_analyzer.py: radon for complexity (already-popular, MIT, optional dep), coverage.py parse, ruff --output-format=json for lint, mypy --no-error-summary for type completeness.
    • typescript_analyzer.py: tsc --noEmit --pretty false for type errors, eslint --format json for lint. Complexity: defer (no clean cross-toolchain answer).
    • go_analyzer.py: go vet, gocyclo, go test -coverprofile. Defer if go not on PATH.
  3. Coordinator in services/static_analysis/runner.py — reconstruct pre/post via Playback v2's reconstruct_fs_at(at_pre) / reconstruct_fs_at(at_post), write to a tmpdir, run the analyzer, persist deltas.
  4. Optional dep: add [analysis] extra in pyproject.toml with radon, coverage, mypy. Check for binaries (tsc, eslint, go) at runtime; skip cleanly if missing.
  5. CLI + API + meta-agent wiring.
  6. Backfill batch with concurrency cap (analyzers fork shell processes — cap at min(4, cpu_count)).
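
Steps 4 and 6 could look roughly like the sketch below: a runtime check for external binaries and a backfill loop bounded by min(4, cpu_count). The REQUIRED_BINARIES mapping and the function names are hypothetical, and analyze_session stands in for whatever the coordinator in runner.py ends up exposing.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

# Binaries each analyzer shells out to, per the plan above.
# The mapping itself is illustrative, not part of the spec.
REQUIRED_BINARIES = {
    "python": ["ruff", "mypy"],
    "typescript": ["tsc", "eslint"],
    "go": ["go", "gocyclo"],
}


def missing_binaries(language, which=shutil.which):
    """Return the required binaries for `language` that are not on PATH."""
    return [b for b in REQUIRED_BINARIES.get(language, []) if which(b) is None]


def max_workers():
    """Concurrency cap from step 6: min(4, cpu_count)."""
    return min(4, os.cpu_count() or 1)


def backfill(session_ids, analyze_session):
    """Run analyze_session over each id with a bounded thread pool.

    Threads are enough here because the heavy work happens in the child
    processes that the analyzers fork, not in Python itself.
    """
    session_ids = list(session_ids)
    with ThreadPoolExecutor(max_workers=max_workers()) as pool:
        return dict(zip(session_ids, pool.map(analyze_session, session_ids)))
```

An analyzer would call missing_binaries(language) first and return without writing findings when the list is non-empty, which is one reading of "skip cleanly".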

Tests

  • Each analyzer: synthetic file with known complexity/coverage/lint result, assert metric.
  • Coordinator: pre + post fixture, assert delta computation.
  • Missing-binary handling: TS analyzer skips cleanly when tsc not on PATH.
  • Backfill: idempotent (re-running doesn't duplicate findings).
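
As an illustration of the per-analyzer test style, the sketch below parses canned JSON lint output and asserts the resulting metric. parse_ruff_json is a hypothetical helper, and the canned diagnostics are trimmed to the code, filename, and message fields; real ruff JSON output carries more fields per diagnostic.

```python
import json
from collections import Counter


def parse_ruff_json(output: str) -> dict:
    """Reduce ruff's JSON lint output to a count plus per-rule tallies."""
    diagnostics = json.loads(output) if output.strip() else []
    # A diagnostic without a rule code is bucketed as "unknown".
    rules = Counter(d.get("code") or "unknown" for d in diagnostics)
    return {"lint_count": len(diagnostics), "rules": dict(rules)}


def test_parse_ruff_json_counts_rules():
    # Canned output trimmed to the fields used here.
    canned = json.dumps([
        {"code": "F401", "filename": "a.py", "message": "`os` imported but unused"},
        {"code": "F401", "filename": "a.py", "message": "`sys` imported but unused"},
        {"code": "E501", "filename": "a.py", "message": "Line too long"},
    ])
    result = parse_ruff_json(canned)
    assert result["lint_count"] == 3
    assert result["rules"] == {"F401": 2, "E501": 1}


def test_parse_ruff_json_empty_output():
    assert parse_ruff_json("") == {"lint_count": 0, "rules": {}}
```

Keeping the parser separate from the subprocess call means the test needs no linter installed; the rules dict is the kind of per-metric extra that could go into details_json.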

Hard parts

  • Cross-language is genuinely hard. Python / TS / Go cover ~80% of usage; the long tail (Rust, Ruby, Java, Swift, etc.) is per-language adapter work. Document explicitly which languages are supported in v1.
  • "pre" state for a session sometimes doesn't exist (the file was created in the session). Handle: pre_value = NULL, delta = NULL, details_json = {"reason": "file_created_in_session"}.
  • Some analyzers are slow (mypy on a big project can be 30s+). Use timeouts (default 60s per file) and cache results.
  • Coverage requires running tests — that's a SEPARATE deliverable, defer (Spec 22 sub-task). v1 handles complexity + lint + types only.
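
The missing-pre-state and timeout cases above might be handled along these lines. Both function names are hypothetical, and the "analysis_unavailable" reason string is an assumption added for the timeout case; only "file_created_in_session" comes from the spec. Caching is left out of the sketch.

```python
import subprocess
import sys


def run_analyzer(cmd, timeout_s=60.0):
    """Run an analyzer command; return stdout, or None on timeout or missing binary."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return proc.stdout


def build_finding(pre_value, post_value, file_existed_before):
    """Hypothetical helper: shape the value columns for one finding row."""
    if not file_existed_before:
        # No pre state exists, so there is nothing to diff against.
        return {"pre_value": None, "post_value": post_value, "delta": None,
                "details": {"reason": "file_created_in_session"}}
    if pre_value is None or post_value is None:
        # One side unobservable, e.g. the analyzer timed out (assumed reason string).
        return {"pre_value": pre_value, "post_value": post_value, "delta": None,
                "details": {"reason": "analysis_unavailable"}}
    return {"pre_value": pre_value, "post_value": post_value,
            "delta": post_value - pre_value, "details": None}


# A child process that outlives the timeout yields None instead of raising.
slow = run_analyzer([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
created = build_finding(None, 4.0, file_existed_before=False)
```

Returning None rather than raising lets the coordinator record a NULL side and move on, so one slow file does not fail the whole session.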

Out of scope

  • Test-running for coverage measurement (separate spec — needs sandboxing).
  • Rust / Java / Swift / Ruby analyzers.
  • Real-time analysis as the agent edits (defer; this is offline backfill).

Dependencies

  • None blocking. Playback v2 (shipped) provides pre/post reconstruction.
  • Consumed by Spec 22 (outcome attribution v2) and Spec 26 (comparative benchmark).

Estimated effort

Size L — single agent, ~2-2.5 hr.

Hard rules

  • DO NOT touch versions / CHANGELOG headings.
  • Pre-assigned schema slot: v018.
  • Branch: feat/static-analysis-pass off main.
  • New optional dep [analysis] in pyproject.toml is allowed (similar to [embeddings]).

Metadata

Labels: size-l (~2 hr agent run), spec (Spec/feature for an agent to implement), wave-2 (Wave 2: outcome-attribution rails)
