Open
Labels
intelligence (Layer: Summaries, insights, action items), maintainer (Lane: High-risk, cross-system changes), p2 (Priority: Important, score 14-21)
Description
Parent: #4635
Problem
Merging get_transcript_structure() + extract_action_items() into a single gpt-5.1 call (#4636) would eliminate one full transcript pass and save ~15-20% of pipeline costs. However, the action items prompt is highly complex (~2095 words) with strict extraction rules, while the structure prompt encourages compression — creating an instruction conflict risk.
Why We Can't Merge Blindly
- Structure prompt says "condense into summary" (compression)
- Action items prompt says "Read the ENTIRE conversation" 3 times and "resolve ALL vague references" (expansion)
- Multi-task interference may cause the model to compromise between these goals
- Action item quality (especially subtle task detection and coreference resolution) could degrade
Proposed Evaluation
- Select ~100 recent conversations with known action items (mix of simple and complex)
- Run current separate pipeline → collect baseline action items
- Run merged pipeline (with mitigations: explicit priority, staged workflow, structured outputs) → collect test action items
- Compare:
  - Precision: % of extracted action items that are correct
  - Recall: % of real action items that were detected
  - Due date accuracy: correct timezone conversion
  - Coreference resolution: vague references properly resolved
- Pass threshold: <5% regression on any metric
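The precision/recall comparison and the pass threshold could be scored roughly as follows. This is a minimal sketch, not the project's eval harness: `normalize`, `score`, and `passes_threshold` are hypothetical names, action items are matched by normalized exact text (a real eval would likely need fuzzy or human matching), and "<5% regression" is assumed to mean an absolute drop of less than 0.05 in each metric.

```python
from dataclasses import dataclass


def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(item.lower().split())


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float


def score(extracted: list[str], gold: list[str]) -> Scores:
    """Score one pipeline's extracted action items against the known (gold) items."""
    extracted_set = {normalize(x) for x in extracted}
    gold_set = {normalize(x) for x in gold}
    true_positives = len(extracted_set & gold_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return Scores(precision, recall)


def passes_threshold(baseline: Scores, merged: Scores,
                     max_regression: float = 0.05) -> bool:
    """Assumption: a regression is the absolute drop from baseline to merged."""
    for name in ("precision", "recall"):
        drop = getattr(baseline, name) - getattr(merged, name)
        if drop >= max_regression:
            return False
    return True


# Toy data, invented purely for illustration.
gold = ["Send report to Ana", "Book room"]
baseline = score(["send report to ana", "Book room"], gold)
merged = score(["Book room", "Call Bob"], gold)
result = passes_threshold(baseline, merged)
```

In this toy case the merged pipeline misses one item and invents another, so both metrics fall from 1.0 to 0.5 and the check fails. Due date accuracy and coreference resolution would need their own per-item comparisons, which this sketch does not cover.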
Merged prompt mitigations to test:
- Hard section separation with headers
- Explicit priority: "action item extraction takes precedence"
- Two-stage internal workflow: "FIRST extract action items, THEN generate structure"
- Structured Outputs with `strict: true`
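As an illustration of the last two mitigations, a merged-call response format might look like the sketch below. The field names (`action_items`, `description`, `due_date`, `structure`, `title`, `overview`) and the schema name are hypothetical, not taken from the existing pipeline. The outer shape follows the OpenAI Structured Outputs `json_schema` response format, where strict mode requires every object to list all of its properties in `required` and to set `additionalProperties` to false. Declaring `action_items` before `structure` is only a nudge toward the "FIRST extract action items" ordering; whether it helps is one of the things the eval would have to show.

```python
def build_response_format() -> dict:
    """Build a hypothetical strict JSON schema for the merged call."""
    action_item = {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            # Nullable field: strict mode still requires the key to be present.
            "due_date": {"type": ["string", "null"]},
        },
        "required": ["description", "due_date"],
        "additionalProperties": False,
    }
    structure = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "overview": {"type": "string"},
        },
        "required": ["title", "overview"],
        "additionalProperties": False,
    }
    schema = {
        "type": "object",
        # action_items is declared first to mirror the staged workflow.
        "properties": {
            "action_items": {"type": "array", "items": action_item},
            "structure": structure,
        },
        "required": ["action_items", "structure"],
        "additionalProperties": False,
    }
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "merged_structure_and_action_items",
            "strict": True,
            "schema": schema,
        },
    }


def check_strict_ready(schema: dict) -> bool:
    """Local sanity check: all properties required, no extra properties allowed."""
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            return False
        if set(schema.get("required", [])) != set(props):
            return False
        return all(check_strict_ready(p) for p in props.values())
    if schema.get("type") == "array":
        return check_strict_ready(schema.get("items", {}))
    return True


response_format = build_response_format()
ok = check_strict_ready(response_format["json_schema"]["schema"])
```

The resulting dictionary would be passed as the response format of the merged call; `check_strict_ready` is just a cheap guard that can run in tests before any API request is made.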
Impact
If evals pass: eliminates 1 full gpt-5.1 call per conversation → ~15-20% cost reduction.
If evals fail: keep separate calls with prompt caching (#4654).
Risk
Medium — this is research/evaluation work. No production risk until results are confirmed.