Skip to content

Spec 23: LLM-graded session quality — offline rubric scoring of every session #95

@0bserver07

Description

@0bserver07

Goal

Run an offline LLM (local Ollama) over each session's transcript + tool calls and emit a quality grade against a multi-dimensional rubric: problem-solved, tests-added, code-clean, edge-cases-handled. Subjective per call, consistent in aggregate. Used for ranking, not absolute scores.

Why now

Static analysis (Spec 21) gives objective deltas; outcome attribution (Spec 22) gives "did the code ship". Neither answers "was the agent's reasoning sound?" or "did the agent skip an edge case?". An LLM can grade that — coarsely, but consistently.

Schema

v020session_quality_grades:

CREATE TABLE session_quality_grades (
  id INTEGER PRIMARY KEY,
  session_id TEXT NOT NULL,
  grader_model TEXT NOT NULL,     -- which model graded (e.g. 'qwen2.5-coder:7b')
  rubric_version INTEGER NOT NULL,-- bump when the rubric changes
  ts TEXT NOT NULL,
  problem_solved REAL,            -- [0, 1] or NULL if "can't tell"
  tests_added REAL,
  code_clean REAL,
  edge_cases REAL,
  overall REAL,                   -- weighted average; weights in details
  reasoning TEXT,                 -- ~200-char summary
  details_json TEXT,              -- per-rubric notes
  UNIQUE (session_id, grader_model, rubric_version)
);
CREATE INDEX idx_sqg_session ON session_quality_grades(session_id);
CREATE INDEX idx_sqg_overall ON session_quality_grades(overall DESC);

User-visible surface

  • CLI: stackunderflow grade session <id> [--model qwen2.5-coder:7b] runs the grader on one session.
  • CLI: stackunderflow grade backfill [--since 30d] [--limit N] [--model M] batches over recent ungraded sessions.
  • API: GET /api/grades/session/{id} and GET /api/grades/recent?min_overall=0.7.
  • Meta-agent tool: get_session_grade(session_id).
  • UI: a "Quality" sortable column on Sessions tab + a per-session breakdown panel.

Implementation plan

  1. v020 migration.
  2. New service stackunderflow/services/quality_grader.py:
    • RUBRIC_V1 — the prompt that asks the grader to score each dimension.
    • _serialize_session_for_grading(session_id, max_tokens=8000) — produce a compact transcript: user prompts, assistant summaries, key tool calls, final outcome. Stays under the model's context budget.
    • grade(conn, session_id, grader_model) — call Ollama, parse JSON-shaped response, persist.
    • backfill(conn, since, limit, model) — batch with concurrency cap (default 2).
  3. Reuse the meta-agent's Ollama proxy / NDJSON streaming where appropriate.
  4. CLI + API + meta-agent wiring.
  5. Frontend column.

Tests

  • Mock Ollama with a deterministic response → assert parse + persist.
  • Malformed grader response (non-JSON) → graceful failure (don't persist; log; continue).
  • Idempotency: re-grading with the same (session, model, rubric_version) upserts.
  • Rubric-version bump invalidates old grades on read (return latest).

Hard parts

  • The grader output must be machine-parseable. Use structured JSON output (Ollama supports format: "json"). Have a strict schema; reject malformed responses.
  • Context-window management: a 6000-message session can't fit. Truncate intelligently: keep first 5 + last 5 user prompts, sample the middle, include key signals (errors, reverts, "thanks").
  • The grader is a model — it has its own biases. Document this explicitly. Recommend running the grader with multiple models and averaging if precision matters.
  • Cost: grading is ~$0.001-0.01 per session at local Ollama (free) or ~$0.05-0.20 via cloud. Local-only by default.

Out of scope

  • Cloud-API grading by default (local Ollama only; cloud would be a future opt-in).
  • Real-time grading (this is offline backfill).
  • Cross-session "agent-skill ranking" (Spec 26's job).

Dependencies

  • Spec 21 (static analysis) is helpful context but not blocking.
  • Consumed by Spec 26 (comparative benchmark) — combines static + LLM grades into the scoring rubric.

Estimated effort

Size M — single agent, ~1-1.5 hr.

Hard rules

  • DO NOT touch versions / CHANGELOG headings.
  • Pre-assigned schema slot: v020.
  • Branch: feat/llm-graded-quality off main.

Metadata

Metadata

Assignees

No one assigned

    Labels

    size-m~1 hr agent runspecSpec/feature for an agent to implementwave-3Wave 3: outcome attribution + grading

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions