# JudgeAgent Scoring Analysis — Research Results

This notebook presents the findings from our sensitivity analysis of the JudgeAgent's
weighted-sum scoring algorithm. We analyze 221 real judge decisions from a Pantheon-to-Vatican
walking tour to understand how parameter changes affect content selection.

**Research questions:**
1. How does the scoring algorithm distribute scores across content types?
2. How sensitive is content selection to type-preference weights?
3. How sensitive is content selection to the relevance multiplier?
4. What are the implications for algorithm tuning?

In [None]:
import json
from pathlib import Path
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from IPython.display import Image, display

RESULTS = Path('..') / 'results'
METRICS = RESULTS / 'metrics'
FIGURES = RESULTS / 'figures'

# Load metrics
with open(METRICS / 'baseline_metrics.json') as f:
    baseline = json.load(f)
with open(METRICS / 'type_preference_sensitivity.json') as f:
    type_sens = json.load(f)
with open(METRICS / 'relevance_sensitivity.json') as f:
    rel_sens = json.load(f)

print('Metrics loaded successfully.')

## 1. Baseline Score Distribution

The JudgeAgent uses a weighted-sum model with 6 criteria:
- **Content quality** (30 pts): Penalizes short/empty titles and descriptions
- **Title relevance** (20 pts): Keyword overlap between title and location name
- **Description relevance** (15 pts): Keyword overlap in description
- **Keyword overlap** (20 pts max): `min(overlap_count * 5, 20)`
- **Type preference** (10/5/5): Text=10, Video=5, Music=5
- **URL quality** (5 pts): Valid URL bonus
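
As a rough illustration of how the six criteria combine, the sketch below sums them into a 0-100 score. Only the keyword-overlap formula and the type-preference values come from the list above. The function names, the argument names, and the idea that the three content and relevance sub-scores arrive precomputed are assumptions made for this example, not the JudgeAgent's actual code.

```python
# Type-preference values from the criteria list; all names here are illustrative.
TYPE_PREFERENCE = {"text": 10, "video": 5, "music": 5}


def keyword_overlap_points(overlap_count, multiplier=5, cap=20):
    """Keyword-overlap term: min(overlap_count * multiplier, cap)."""
    return min(overlap_count * multiplier, cap)


def total_score(content_type, quality, title_rel, desc_rel, overlap_count, url_ok):
    """Sum the six criteria.

    quality, title_rel and desc_rel are assumed to be computed elsewhere;
    they are clamped here to the maxima stated in the list (30, 20, 15).
    """
    return (
        min(quality, 30)
        + min(title_rel, 20)
        + min(desc_rel, 15)
        + keyword_overlap_points(overlap_count)
        + TYPE_PREFERENCE[content_type]
        + (5 if url_ok else 0)
    )


# A hypothetical text candidate that maxes out every criterion.
print(total_score("text", 30, 20, 15, overlap_count=4, url_ok=True))  # 100
```

Under these assumptions, an otherwise identical video or music candidate tops out at 95, which is the full extent of the built-in type preference.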

In [None]:
# Headline count and per-type average scores from the baseline run
print(f"Total judge decisions analyzed: {baseline['total_judgments']}")
print("\nAverage scores by content type:")
for ct, score in baseline['type_avg_scores'].items():
    print(f"  {ct:8s}: {score:.1f}/100")

# Share of judgments in which each content type was selected as best
print("\nWin rates (% of times selected as best):")
for ct, rate in baseline['type_win_rates'].items():
    print(f"  {ct:8s}: {rate:.1f}%")

print(f"\nAverage score gap (winner - runner-up): {baseline['avg_score_gap']:.1f} pts")

# Per-type sample size and score range
print("\nDetailed statistics:")
for ct, details in baseline['type_score_details'].items():
    print(f"  {ct:8s}: n={details['count']:3d}, min={details['min']:.0f}, max={details['max']:.0f}, avg={details['avg']:.1f}")

In [None]:
display(Image(filename=str(FIGURES / 'score_distribution.png')))

In [None]:
display(Image(filename=str(FIGURES / 'win_rate_pie.png')))

### Key Observations — Baseline

- **Text dominates** with 86.4% win rate and highest average score (87.0)
- **Video is competitive** with 81.9 avg score but only 10.0% win rate
- **Music lags significantly** at 61.1 avg score and 3.6% win rate
- **Average score gap of 21.0 pts** (winner minus runner-up) indicates confident decisions, not close calls
- Text scores range 80-100 while music is stuck at 60-65 — a narrow, low band

The text-agent advantage comes from Wikipedia's rich, keyword-dense content that scores
highly on both content quality and relevance criteria.
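
For readers who want to reproduce the win-rate and score-gap figures, the sketch below shows one way they could be derived from raw judgments. The input schema (one dict of content type to total score per judgment) and the sample numbers are hypothetical, and the stored metrics may have been computed differently.

```python
from collections import Counter

# Hypothetical judgments: content type -> total score for one location.
judgments = [
    {"text": 90, "video": 80, "music": 60},
    {"text": 85, "video": 88, "music": 65},
    {"text": 95, "video": 70, "music": 60},
    {"text": 80, "video": 75, "music": 62},
]


def summarize(judgments):
    """Return (win rate in % per type, average winner-minus-runner-up gap)."""
    if not judgments:
        return {}, 0.0
    wins = Counter()
    gaps = []
    for scores in judgments:
        # Rank the candidates of this judgment from highest to lowest score.
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        wins[ranked[0][0]] += 1
        if len(ranked) > 1:
            gaps.append(ranked[0][1] - ranked[1][1])
    types = sorted({ct for scores in judgments for ct in scores})
    win_rates = {ct: 100 * wins[ct] / len(judgments) for ct in types}
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return win_rates, avg_gap


win_rates, avg_gap = summarize(judgments)
print(win_rates)  # {'music': 0.0, 'text': 75.0, 'video': 25.0}
print(avg_gap)    # 10.75
```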

## 2. Experiment 1: Type Preference Sensitivity

We sweep type-preference weights across 6 configurations to measure how much
the 10/5/5 default (favoring text) affects outcomes.
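
A sweep like this can be sketched as re-scoring each judgment under a different preference bonus and counting winners again. In the example below, the base scores (with the type-preference term already removed) are invented for illustration. The three configurations shown are the ones named in this notebook, and the helper name `win_rates` is an assumption.

```python
def win_rates(base_scores, prefs):
    """Re-score each judgment with the given type-preference bonus and
    return the percentage of judgments each content type wins.

    Ties go to whichever type comes first in the judgment dict.
    """
    wins = {ct: 0 for ct in prefs}
    for scores in base_scores:
        rescored = {ct: s + prefs[ct] for ct, s in scores.items()}
        winner = max(rescored, key=rescored.get)
        wins[winner] += 1
    return {ct: 100 * w / len(base_scores) for ct, w in wins.items()}


# Hypothetical scores with the type-preference term excluded.
base_scores = [
    {"text": 78, "video": 75, "music": 55},
    {"text": 70, "video": 74, "music": 60},
    {"text": 80, "video": 72, "music": 72},
]

configs = {
    "default (10/5/5)": {"text": 10, "video": 5, "music": 5},
    "equal (0/0/0)": {"text": 0, "video": 0, "music": 0},
    "music-boosted (5/5/15)": {"text": 5, "video": 5, "music": 15},
}

for name, prefs in configs.items():
    rates = win_rates(base_scores, prefs)
    print(f"{name:<24s}", {ct: round(r, 1) for ct, r in rates.items()})
```

On this toy data, text wins every judgment under the default weights and loses ground only when the bonus is removed or shifted to another type.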

In [None]:
print(f"{'Config':<30s}  {'Text%':>6s}  {'Video%':>6s}  {'Music%':>6s}")
print('-' * 54)
for r in type_sens:
    print(f"{r['config']:<30s}  {r['text_pct']:6.1f}  {r['video_pct']:6.1f}  {r['music_pct']:6.1f}")

In [None]:
display(Image(filename=str(FIGURES / 'type_preference_sensitivity.png')))

### Key Observations — Type Preference Sensitivity

| Finding | Detail |
|---------|--------|
| Text remains dominant in all configs | Even with no preference (0/0/0), text wins 80.9% |
| Removing text bonus costs ~9 ppts | Default 89.5% → Equal 80.9% |
| Music needs +15 to appear at all | Music-boosted (5/5/15) gives music only 6.2% |
| Video can reach 23% with +15 boost | But still far below text |

**Conclusion:** Type preference is a secondary factor. The 5-10 point preference bonus
is overshadowed by the ~26 point gap in average score between text and music (87.0 vs 61.1).
To meaningfully diversify content selection, tuning type preference alone is insufficient.

## 3. Experiment 2: Relevance Multiplier Sensitivity

The relevance component uses `min(keyword_overlap * multiplier, 20)` where default
multiplier = 5. We sweep from 0 (disable relevance) to 10 (double weight).

In [None]:
print(f"{'Multiplier':>10s}  {'Text%':>6s}  {'Video%':>6s}  {'Music%':>6s}  {'Avg Gap':>8s}")
print('-' * 44)
for r in rel_sens:
    print(f"{r['multiplier']:10d}  {r['text_pct']:6.1f}  {r['video_pct']:6.1f}  {r['music_pct']:6.1f}  {r['avg_gap']:8.1f}")

In [None]:
display(Image(filename=str(FIGURES / 'relevance_sensitivity.png')))

### Key Observations — Relevance Sensitivity

| Finding | Detail |
|---------|--------|
| **Most sensitive parameter** | Text win rate swings from 74.6% (mult=0) to 89.5% (mult=1-8) |
| Disabling relevance (mult=0) | Music appears at 6.2%, video rises to 19.1% |
| High relevance (mult=10) | Trend partially reverses: text drops to 80.9%, video rises to 19.1% |
| Judge confidence grows with the multiplier | Average score gap: 9.7 (mult=0) → 24.5 (mult=8) |
| Sweet spot at mult=1-8 | Stable text dominance with clear confidence |

**Conclusion:** The relevance multiplier is the most influential tuning knob.
At mult=0, the keyword-overlap term contributes nothing, so the algorithm loses part of
its ability to reward location-specific content and decisions become much closer
(average gap 9.7 pts). At mult=10, the 20-point cap is reached with only two overlapping
keywords, so text's higher keyword overlap no longer sets it apart and its win rate
slips to 80.9%. The default mult=5 sits in the stable plateau (mult=1-8), which makes
it a robust choice.
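
The cap effect can be checked with a few lines of arithmetic. The sketch below tabulates `min(keyword_overlap * multiplier, 20)` for overlap counts 0 through 5. The function name `relevance_points` and the chosen multipliers are illustrative, and only the formula itself comes from this notebook.

```python
def relevance_points(overlap_count, multiplier, cap=20):
    """Relevance term as described in the text: min(overlap * multiplier, cap)."""
    return min(overlap_count * multiplier, cap)


# One row per multiplier; columns are overlap counts 0..5.
for multiplier in (0, 1, 5, 8, 10):
    row = [relevance_points(k, multiplier) for k in range(6)]
    print(f"mult={multiplier:2d}: {row}")

# mult= 0: [0, 0, 0, 0, 0, 0]
# mult= 1: [0, 1, 2, 3, 4, 5]
# mult= 5: [0, 5, 10, 15, 20, 20]
# mult= 8: [0, 8, 16, 20, 20, 20]
# mult=10: [0, 10, 20, 20, 20, 20]
```

At mult=10, a candidate with two overlapping keywords earns the same 20 points as one with five, so additional overlap stops separating candidates. At mult=5, saturation only sets in at four overlapping keywords.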

## 4. Summary and Recommendations

### Research Findings

1. **Text dominance is structural**, not an artifact of tuning. Wikipedia's rich content
   naturally scores high on content quality (30 pts) and description relevance (15 pts).

2. **Type preference weights have limited impact** (~9 ppt swing, from 89.5% to 80.9%).
   They act as a tiebreaker, not a selection driver.

3. **Relevance multiplier is the most sensitive parameter** (~15 ppt swing, from 74.6%
   to 89.5%). It controls how strongly the algorithm favors location-specific content.

4. **Default parameters are well-chosen.** They sit in the stable region of the parameter
   space with robust text selection and clear confidence margins.

### Recommendations

| Recommendation | Rationale |
|---------------|----------|
| Keep defaults (mult=5, text pref=10) | Stable, well-tested operating point |
| Add diversity bonus for multi-modal tours | Current algorithm naturally clusters on text |
| Consider MMR-style re-ranking | Maximal Marginal Relevance could balance relevance + diversity |
| Improve Spotify metadata | Music's narrow 60-65 score band suggests poor search-to-location matching |
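
To make the MMR recommendation concrete, here is a minimal sketch of greedy Maximal Marginal Relevance re-ranking. It is an assumption-laden illustration rather than a proposal for the JudgeAgent's code: the candidate records, the `lam` trade-off value, and the similarity function (1.0 for the same content type, 0.0 otherwise) are all hypothetical choices.

```python
def similarity(a, b):
    """Assumed similarity: 1.0 if two candidates share a content type, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def mmr_rerank(candidates, lam=0.7, k=3):
    """Greedy MMR: repeatedly pick the candidate that maximizes
    lam * relevance - (1 - lam) * (max similarity to items already picked)."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_value(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            # Judge scores are on a 0-100 scale; normalize to 0-1 for relevance.
            return lam * c["score"] / 100 - (1 - lam) * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical candidates with judge scores.
candidates = [
    {"id": "t1", "type": "text", "score": 90},
    {"id": "t2", "type": "text", "score": 88},
    {"id": "v1", "type": "video", "score": 82},
    {"id": "m1", "type": "music", "score": 61},
]

print([c["id"] for c in mmr_rerank(candidates)])           # ['t1', 'v1', 'm1']
print([c["id"] for c in mmr_rerank(candidates, lam=1.0)])  # ['t1', 't2', 'v1']
```

With `lam=1.0` the ranking is by score alone and text fills the top slots. Lowering `lam` penalizes repeated content types, which lets video and music surface despite lower scores.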