diff --git a/docs/proposals/84-drift-measurement-pseudocode-impl.md b/docs/proposals/84-drift-measurement-pseudocode-impl.md
new file mode 100644
index 000000000..1077fa68a
--- /dev/null
+++ b/docs/proposals/84-drift-measurement-pseudocode-impl.md
@@ -0,0 +1,336 @@
# Proposal 84: Pseudocode Implementation — Drift Measurement Task

**Author:** EgonBot
**Date:** 2026-03-07
**Status:** Proposal (pseudocode only — full implementation pending review)
**Depends on:** Proposal 82 (framework), Proposal 83 (agent specification)
**Scope:** One new Luigi task, four Pydantic models, one LLM prompt sequence
**Files touched (when implemented):** 3–4 new files; zero changes to existing tasks

---

## 1. Purpose

This document presents a pseudocode implementation of `DriftEvaluationTask` — a post-pipeline Luigi task that measures how faithfully a generated plan represents the original user prompt.

It is intended for neoneye review **before** any real code is written.

Nothing here is executable. All class names, field names, and LLM prompt text are illustrative and subject to change.

---

## 2. Placement in the Pipeline

The task runs **after** the pipeline has fully completed. It is an optional post-processing step, not part of the core generation DAG.

```
[StartTimeTask]
      |
[... all generation tasks ...]
      |
[FinalReportTask]
      |
[DriftEvaluationTask]   ← new, optional
```

Inputs required:
- `001-2-initial_plan.txt` — the original user prompt (already exists)
- `final-report.md` or equivalent final plan artifact (already exists)

Outputs:
- `drift-evaluation.json` — structured drift report
- `drift-evaluation.md` — human-readable verdict

---

## 3. Pydantic Models (pseudocode)

```python
# FILE: worker_plan/worker_plan_internal/plan/drift_models.py
# (pseudocode — not executable)

from typing import Literal

from pydantic import BaseModel

class DriftIncident(BaseModel):
    drift_type: str        # TypeA–TypeJ from proposal 82, section 6
    severity: int          # 0–4
    section: str           # which plan section
    source_reference: str  # what the prompt said (or didn't say)
    output_claim: str      # what the plan claimed
    explanation: str       # why this is drift

class DimensionScores(BaseModel):
    scope_fidelity: int                # 0–5
    constraint_fidelity: int           # 0–5
    claim_strength_fidelity: int       # 0–5
    evidence_grounding_fidelity: int   # 0–5
    entity_fidelity: int               # 0–5
    causal_fidelity: int               # 0–5
    epistemic_fidelity: int            # 0–5
    source_trace_fidelity: int         # 0–5
    structural_priority_fidelity: int  # 0–5
    language_posture_fidelity: int     # 0–5

class PromptContract(BaseModel):
    core_intent: str
    primary_problem: str
    proposed_solution: str
    non_goals: list[str]
    constraints: list[str]
    core_entities: dict[str, str]  # e.g. {"buyer": "...", "user": "..."}
    optional_features: list[str]
    uncertainties: list[str]
    success_metrics: list[str]

class DriftEvaluationResult(BaseModel):
    prompt_contract: PromptContract
    dimension_scores: DimensionScores
    drift_incidents: list[DriftIncident]
    overall_fidelity_score: float  # weighted, 0–5
    overall_drift_risk: Literal["low", "medium", "high", "critical"]
    critical_drift_count: int
    unsupported_claim_count: int
    constraint_violation_count: int
    confidence_inflation_count: int
    usable_as_is: bool
    verdict_preserved_well: list[str]
    verdict_major_failures: list[str]
    verdict_recommended_actions: list[str]
```

---

## 4. LLM Prompt Sequence (pseudocode)

The evaluation is split into three sequential LLM calls, each returning structured output that is validated against the models in section 3.
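All three calls share the same mechanics: send a system and a user message, then parse and validate the raw JSON response against a Pydantic model. A minimal sketch of the shared helper that section 5 assumes, where `llm.complete` is a placeholder for whatever chat interface the pipeline's LLM wrapper actually exposes (Pydantic v2 assumed):

```python
# Sketch only. Assumes Pydantic v2; `llm.complete` is a placeholder for the
# pipeline's real client interface, not an existing API.
from typing import Any

from pydantic import TypeAdapter

def call_llm_structured(llm: Any, system: str, user: str, output_model: Any) -> Any:
    """Run one structured-output call and validate the raw JSON response.

    `output_model` may be a BaseModel subclass (Calls 1 and 3) or a container
    type such as list[DriftIncident] (Call 2); TypeAdapter handles both.
    """
    raw = llm.complete(system=system, user=user)  # hypothetical client call
    return TypeAdapter(output_model).validate_json(raw)
```

A retry loop for malformed JSON would likely wrap this helper, but it is omitted from the sketch.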
### Call 1: Extract Prompt Contract

```
SYSTEM:
  You are a strict prompt analyst.
  Your job is to extract a structured contract from a user's planning prompt.
  Do not add anything not in the prompt.
  Do not invent goals, constraints, or entities.
  If something is not stated, leave the field empty or say "not specified".

USER:
  Here is the initial user prompt:
  ---
  {initial_plan_text}
  ---
  Extract the prompt contract. Output ONLY the JSON.

EXPECTED OUTPUT: PromptContract
```

### Call 2: Identify Drift Incidents

```
SYSTEM:
  You are a strict drift evaluator.
  You have been given:
    1. A prompt contract (the structured version of the original user prompt)
    2. A generated plan
  Your job is to identify every place the generated plan departs from the prompt contract.
  Use the drift type taxonomy below:
    TypeA = scope expansion
    TypeB = constraint erosion
    TypeC = unsupported invention
    TypeD = confidence inflation
    TypeE = business model drift
    TypeF = customer drift
    TypeG = mechanism drift
    TypeH = priority drift
    TypeI = governance/regulatory drift
    TypeJ = style-induced semantic drift
  Severity scale: 0=no drift, 1=minor, 2=moderate, 3=major, 4=critical

USER:
  Prompt contract:
  ---
  {prompt_contract_json}
  ---
  Generated plan:
  ---
  {final_report_text}
  ---
  Identify all drift incidents. Output ONLY the JSON list.

EXPECTED OUTPUT: list[DriftIncident]
```

### Call 3: Score and Verdict

```
SYSTEM:
  You are a strict drift scoring judge.
  You have been given:
    1. A prompt contract
    2. A list of drift incidents with severities
  Your job is to assign fidelity scores across 10 dimensions and produce a final verdict.
  Scoring: 5=excellent fidelity, 4=good, 3=mixed, 2=weak, 1=severe drift, 0=failed.
  Weighted fidelity score formula:
    (constraint_fidelity * 0.20) +
    (scope_fidelity * 0.15) +
    (evidence_grounding_fidelity * 0.15) +
    (causal_fidelity * 0.10) +
    (entity_fidelity * 0.10) +
    (epistemic_fidelity * 0.10) +
    (structural_priority_fidelity * 0.08) +
    (claim_strength_fidelity * 0.05) +
    (source_trace_fidelity * 0.04) +
    (language_posture_fidelity * 0.03)
  Drift risk: critical if any severity-4 incident, else high if overall_fidelity_score < 2.5,
  medium if < 3.5, else low.
  Disqualifying conditions (force usable_as_is=false regardless of score):
    - any explicitly banned concept reintroduced
    - target customer materially changed
    - business model materially changed
    - multiple critical unsupported numerical claims
    - explicit non-goals violated

USER:
  Prompt contract:
  ---
  {prompt_contract_json}
  ---
  Drift incidents:
  ---
  {drift_incidents_json}
  ---
  Produce dimension scores and verdict. Output ONLY the JSON.

EXPECTED OUTPUT: DriftEvaluationResult (dimension_scores + verdict fields)
```

---

## 5. Luigi Task (pseudocode)

```python
# FILE: worker_plan/worker_plan_internal/plan/run_plan_pipeline.py
# (pseudocode — shows where the task fits, not production code)

class DriftEvaluationTask(PlanTask):
    """
    Post-pipeline task: evaluate how faithfully the generated plan
    represents the original user prompt.
    Optional — does not block report generation.
    """

    def requires(self):
        return {
            'prompt': self.clone(SetupTask),        # 001-2-initial_plan.txt
            'report': self.clone(FinalReportTask),  # or equivalent final artifact
        }

    def output(self):
        return {
            'json': self.local_target('drift-evaluation.json'),
            'markdown': self.local_target('drift-evaluation.md'),
        }

    def run_with_llm(self, llm: LLM) -> None:
        # Read inputs
        initial_plan_text = read(self.input()['prompt'])
        final_report_text = read(self.input()['report'])

        # Call 1: extract prompt contract
        prompt_contract = call_llm_structured(
            llm,
            system=SYSTEM_PROMPT_CONTRACT_EXTRACTION,
            user=initial_plan_text,
            output_model=PromptContract,
        )

        # Call 2: identify drift incidents
        drift_incidents = call_llm_structured(
            llm,
            system=SYSTEM_DRIFT_INCIDENT_DETECTION,
            user=render_user_message(prompt_contract, final_report_text),  # fills the USER template from section 4
            output_model=list[DriftIncident],
        )

        # Call 3: score and verdict
        result = call_llm_structured(
            llm,
            system=SYSTEM_DRIFT_SCORING,
            user=render_user_message(prompt_contract, drift_incidents),
            output_model=DriftEvaluationResult,
        )

        # Merge all three call results; recompute score and risk
        # deterministically rather than trusting the model's arithmetic.
        result.prompt_contract = prompt_contract
        result.drift_incidents = drift_incidents
        result.overall_fidelity_score = compute_weighted_score(result.dimension_scores)
        result.overall_drift_risk = classify_risk(result)

        # Write outputs
        write_json(self.output()['json'], result)
        write_markdown(self.output()['markdown'], render_verdict(result))
```
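The pseudocode above deliberately recomputes `overall_fidelity_score` and `overall_drift_risk` in code rather than trusting the model's arithmetic. Both helpers are deterministic and follow directly from the weights and thresholds in the Call 3 prompt. A possible sketch (field and class names follow section 3; nothing here is final):

```python
# Sketch only: deterministic companions to the Call 3 prompt, assuming the
# section 3 models are importable from drift_models.
WEIGHTS = {
    "constraint_fidelity": 0.20,
    "scope_fidelity": 0.15,
    "evidence_grounding_fidelity": 0.15,
    "causal_fidelity": 0.10,
    "entity_fidelity": 0.10,
    "epistemic_fidelity": 0.10,
    "structural_priority_fidelity": 0.08,
    "claim_strength_fidelity": 0.05,
    "source_trace_fidelity": 0.04,
    "language_posture_fidelity": 0.03,
}  # weights sum to 1.00, so the result stays on the 0-5 scale

def compute_weighted_score(scores: DimensionScores) -> float:
    return sum(getattr(scores, name) * weight for name, weight in WEIGHTS.items())

def classify_risk(result: DriftEvaluationResult) -> str:
    # A single critical incident overrides the numeric score.
    if any(incident.severity == 4 for incident in result.drift_incidents):
        return "critical"
    if result.overall_fidelity_score < 2.5:
        return "high"
    if result.overall_fidelity_score < 3.5:
        return "medium"
    return "low"
```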
+ """ + + def requires(self): + return { + 'prompt': self.clone(SetupTask), # 001-2-initial_plan.txt + 'report': self.clone(FinalReportTask), # or equivalent final artifact + } + + def output(self): + return { + 'json': self.local_target('drift-evaluation.json'), + 'markdown': self.local_target('drift-evaluation.md'), + } + + def run_with_llm(self, llm: LLM) -> None: + # Read inputs + initial_plan_text = read(self.input()['prompt']) + final_report_text = read(self.input()['report']) + + # Call 1: extract prompt contract + prompt_contract = call_llm_structured( + llm, + system=SYSTEM_PROMPT_CONTRACT_EXTRACTION, + user=initial_plan_text, + output_model=PromptContract, + ) + + # Call 2: identify drift incidents + drift_incidents = call_llm_structured( + llm, + system=SYSTEM_DRIFT_INCIDENT_DETECTION, + user=format(prompt_contract, final_report_text), + output_model=list[DriftIncident], + ) + + # Call 3: score and verdict + result = call_llm_structured( + llm, + system=SYSTEM_DRIFT_SCORING, + user=format(prompt_contract, drift_incidents), + output_model=DriftEvaluationResult, + ) + + # Merge all three call results into final output + result.prompt_contract = prompt_contract + result.drift_incidents = drift_incidents + result.overall_fidelity_score = compute_weighted_score(result.dimension_scores) + result.overall_drift_risk = classify_risk(result) + + # Write outputs + write_json(self.output()['json'], result) + write_markdown(self.output()['markdown'], render_verdict(result)) +``` + +--- + +## 6. Output File: `drift-evaluation.md` (example render) + +```markdown +# Drift Evaluation Report + +**Overall Fidelity Score:** 3.8 / 5.0 +**Drift Risk:** medium +**Usable As-Is:** yes + +## What Was Preserved Well +- Core intent (HVT drone paintball simulation) unchanged +- Customer definition intact (players, event organisers) +- Budget uncertainty preserved + +## Major Failures +- Confidence inflation: "may reduce setup time" became "will eliminate logistics overhead" +- Unsupported invention: specific vendor names added (2 incidents, severity 3) + +## Recommended Actions +- Restore modal language in logistics section +- Remove unsupported vendor claims or flag as speculative + +## Dimension Scores +| Dimension | Score | +|---|---| +| Scope fidelity | 4 | +| Constraint fidelity | 4 | +| Evidence grounding | 3 | +| Entity fidelity | 5 | +| Epistemic fidelity | 3 | +| ... | ... | + +## Drift Incidents (3 total) +### Incident 1 — Severity 3 (TypeC: Unsupported Invention) +... +``` + +--- + +## 7. Open Questions for neoneye + +1. **Which final artifact to read?** The plan has multiple output files. Should this task read `final-report.md`, the full HTML report, or a specific intermediate markdown artifact? The HTML is ~700KB; a markdown section file may be more appropriate. + +2. **Is this task mandatory or always-optional?** Proposal suggests optional (does not block report). Confirm? + +3. **Three LLM calls per evaluation.** For large plans this is expensive on frontier models. Should `DriftEvaluationTask` use a cheaper model profile regardless of what profile ran the rest of the pipeline? + +4. **Scope of the incident list.** A 20-section plan could produce 50+ incidents. Should Call 2 be limited to top-N most severe, or exhaustive? + +5. **`list[DriftIncident]` as structured output.** This is an unbounded list. For local models this may hit token limits. Should the schema cap at e.g. `max_items=20`? + +--- + +## 8. 
---

## 8. What This Proposal Does NOT Include

- No production code
- No changes to existing tasks
- No changes to `run_plan_pipeline.py` beyond the new task class
- No changes to report rendering
- No API endpoint changes
- No frontend changes

Full implementation will be a separate PR after this proposal is approved.