Evaluate merged structure+action_items prompt quality with A/B comparison #4656

@beastoin

Description

Parent: #4635

Problem

Merging get_transcript_structure() + extract_action_items() into a single gpt-5.1 call (#4636) would eliminate one full transcript pass and save ~15-20% of pipeline costs. However, the action items prompt is highly complex (~2095 words) with strict extraction rules, while the structure prompt encourages compression, which creates a risk of conflicting instructions.

Why We Can't Merge Blindly

  • Structure prompt says "condense into summary" (compression)
  • Action items prompt says "Read the ENTIRE conversation" 3 times and "resolve ALL vague references" (expansion)
  • Multi-task interference may cause the model to compromise between these goals
  • Action item quality (especially subtle task detection and coreference resolution) could degrade

Proposed Evaluation

  1. Select ~100 recent conversations with known action items (mix of simple and complex)
  2. Run current separate pipeline → collect baseline action items
  3. Run merged pipeline (with mitigations: explicit priority, staged workflow, structured outputs) → collect test action items
  4. Compare:
    • Precision: % of extracted action items that are correct
    • Recall: % of real action items that were detected
    • Due date accuracy: correct timezone conversion
    • Coreference resolution: vague references properly resolved
  5. Pass threshold: <5% regression on any metric
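
The precision/recall comparison in step 4 and the pass threshold in step 5 could be sketched as below. This is a hypothetical harness, not code from the repo: the matching rule (normalized description equality against a hand-labeled gold set) and all function names are assumptions.

```python
# Hypothetical eval helpers; matching by normalized description string
# is an assumption -- the real eval may use fuzzy or semantic matching.

def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: share of extracted items that are correct.
    Recall: share of real (gold) items that were detected."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def regression(baseline: float, test: float) -> float:
    """Relative drop of the merged pipeline vs the separate pipeline.
    The pass threshold from step 5 is regression < 0.05 on every metric."""
    return (baseline - test) / baseline if baseline else 0.0

def passes(baseline: float, test: float, threshold: float = 0.05) -> bool:
    return regression(baseline, test) < threshold
```

The same `regression`/`passes` check would apply per metric (precision, recall, due date accuracy, coreference resolution), so a single degraded dimension fails the eval even if the others improve.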

Merged prompt mitigations to test:

  • Hard section separation with headers
  • Explicit priority: "action item extraction takes precedence"
  • Two-stage internal workflow: "FIRST extract action items, THEN generate structure"
  • Structured Outputs with strict: true
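
The four mitigations above could be combined into something like the following sketch. The prompt wording and every schema field name here are assumptions for illustration; they are not the actual prompts from get_transcript_structure() or extract_action_items().

```python
# Hypothetical merged prompt: hard section headers, explicit priority,
# and a two-stage internal workflow, as listed in the mitigations above.
MERGED_PROMPT = """\
## PRIORITY
Action item extraction takes precedence over summary compression.

## STAGE 1: ACTION ITEMS
FIRST extract action items. Read the ENTIRE conversation and resolve
ALL vague references before writing anything else.

## STAGE 2: STRUCTURE
THEN generate the structured summary of the same transcript.
"""

# Hypothetical Structured Outputs schema with strict: true; field names
# (action_items, due_at, structure, ...) are assumed, not from the repo.
RESPONSE_SCHEMA = {
    "name": "merged_structure_action_items",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "action_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "due_at": {"type": ["string", "null"]},
                    },
                    "required": ["description", "due_at"],
                    "additionalProperties": False,
                },
            },
            "structure": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "overview": {"type": "string"},
                },
                "required": ["title", "overview"],
                "additionalProperties": False,
            },
        },
        "required": ["action_items", "structure"],
        "additionalProperties": False,
    },
}
```

Putting action items first in both the prompt stages and the schema `required` list is the point under test: it biases the model toward the expansion-style extraction rules before the compression-style structure task.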

Impact

If evals pass: eliminates 1 full gpt-5.1 call per conversation → ~15-20% cost reduction.
If evals fail: keep separate calls with prompt caching (#4654).

Risk

Medium — this is research/evaluation work. No production risk until results are confirmed.

Metadata

Assignees

No one assigned

    Labels

    intelligence (Layer: Summaries, insights, action items)
    maintainer (Lane: High-risk, cross-system changes)
    p2 (Priority: Important, score 14-21)
