Evaluate merged structure+action_items prompt quality with A/B comparison #4656

@beastoin

Description

Parent: #4635

Problem

Merging get_transcript_structure() + extract_action_items() into a single gpt-5.1 call (#4636) would eliminate one full transcript pass and save ~15-20% of pipeline costs. However, the action items prompt is highly complex (~2095 words) with strict extraction rules, while the structure prompt encourages compression, which creates a risk of conflicting instructions.

Why We Can't Merge Blindly

  • Structure prompt says "condense into summary" (compression)
  • Action items prompt says "Read the ENTIRE conversation" 3 times and "resolve ALL vague references" (expansion)
  • Multi-task interference may cause the model to compromise between these goals
  • Action item quality (especially subtle task detection and coreference resolution) could degrade

Proposed Evaluation

  1. Select ~100 recent conversations with known action items (mix of simple and complex)
  2. Run current separate pipeline → collect baseline action items
  3. Run merged pipeline (with mitigations: explicit priority, staged workflow, structured outputs) → collect test action items
  4. Compare:
    • Precision: % of extracted action items that are correct
    • Recall: % of real action items that were detected
    • Due date accuracy: correct timezone conversion
    • Coreference resolution: vague references properly resolved
  5. Pass threshold: <5% regression on any metric
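
The precision/recall comparison in step 4 and the pass threshold in step 5 could be sketched as below. This is a hypothetical harness, not code from the repo: the matching rule (normalized description equality against a hand-labeled gold set) and all function names are assumptions.

```python
# Hypothetical eval helpers; matching by normalized description string
# is an assumption -- the real eval may use fuzzy or semantic matching.

def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: share of extracted items that are correct.
    Recall: share of real (gold) items that were detected."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def regression(baseline: float, test: float) -> float:
    """Relative drop of the merged pipeline vs the separate pipeline.
    The pass threshold from step 5 is regression < 0.05 on every metric."""
    return (baseline - test) / baseline if baseline else 0.0

def passes(baseline: float, test: float, threshold: float = 0.05) -> bool:
    return regression(baseline, test) < threshold
```

The same `regression`/`passes` check would apply per metric (precision, recall, due date accuracy, coreference resolution), so a single degraded dimension fails the eval even if the others improve.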

Merged prompt mitigations to test:

  • Hard section separation with headers
  • Explicit priority: "action item extraction takes precedence"
  • Two-stage internal workflow: "FIRST extract action items, THEN generate structure"
  • Structured Outputs with strict: true
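
The four mitigations above could be combined into something like the following sketch. The prompt wording and every schema field name here are assumptions for illustration; they are not the actual prompts from get_transcript_structure() or extract_action_items().

```python
# Hypothetical merged prompt: hard section headers, explicit priority,
# and a two-stage internal workflow, as listed in the mitigations above.
MERGED_PROMPT = """\
## PRIORITY
Action item extraction takes precedence over summary compression.

## STAGE 1: ACTION ITEMS
FIRST extract action items. Read the ENTIRE conversation and resolve
ALL vague references before writing anything else.

## STAGE 2: STRUCTURE
THEN generate the structured summary of the same transcript.
"""

# Hypothetical Structured Outputs schema with strict: true; field names
# (action_items, due_at, structure, ...) are assumed, not from the repo.
RESPONSE_SCHEMA = {
    "name": "merged_structure_action_items",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "action_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "due_at": {"type": ["string", "null"]},
                    },
                    "required": ["description", "due_at"],
                    "additionalProperties": False,
                },
            },
            "structure": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "overview": {"type": "string"},
                },
                "required": ["title", "overview"],
                "additionalProperties": False,
            },
        },
        "required": ["action_items", "structure"],
        "additionalProperties": False,
    },
}
```

Putting action items first in both the prompt stages and the schema `required` list is the point under test: it biases the model toward the expansion-style extraction rules before the compression-style structure task.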

Impact

If evals pass: eliminates 1 full gpt-5.1 call per conversation → ~15-20% cost reduction.
If evals fail: keep separate calls with prompt caching (#4654).

Risk

Medium — this is research/evaluation work. No production risk until results are confirmed.

Metadata

Assignees

No one assigned

    Labels

    intelligence (Layer: Summaries, insights, action items)
    maintainer (Lane: High-risk, cross-system changes)
    p2 (Priority: Important, score 14-21)
