# Evolver Loop 6 Analysis: Diagnosing DistilBERT Underperformance

**Objective**: Understand why DistilBERT feature extraction improved AUC by only +0.0059 (0.6312 vs. the 0.6253 baseline) and design a path forward.

**Key question**: Why did transformers underperform expectations, and what specific techniques will unlock their potential?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Collect experiment results (entered by hand) for comparison
print("="*70)
print("EXPERIMENT PERFORMANCE COMPARISON")
print("="*70)

results = {
    'Experiment': ['exp_001 (TF-IDF baseline)', 'exp_006 (DistilBERT feature extract)'],
    'CV_AUC': [0.6253, 0.6312],
    'Improvement': [0.0, 0.0059],
    'Relative_Improvement': [0.0, 0.94],
    'Variance': [0.0334, 0.0236]
}

df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
print()
expected_low, expected_high = 0.03, 0.05  # expected AUC gain range
actual_gain = 0.0059                      # observed AUC gain over the baseline
print(f"Expected improvement: +{expected_low:.2f} to +{expected_high:.2f} AUC")
print(f"Actual improvement: +{actual_gain:.4f} AUC")
print(f"Shortfall: {expected_low - actual_gain:.4f} to {expected_high - actual_gain:.4f} AUC")
print(f"Achieved: {actual_gain / expected_high:.1%} to {actual_gain / expected_low:.1%} of the expected gain")

EXPERIMENT PERFORMANCE COMPARISON
                          Experiment  CV_AUC  Improvement  Relative_Improvement  Variance
           exp_001 (TF-IDF baseline)  0.6253       0.0000                  0.00    0.0334
exp_006 (DistilBERT feature extract)  0.6312       0.0059                  0.94    0.0236

Expected improvement: +0.03 to +0.05 AUC
Actual improvement: +0.0059 AUC
Shortfall: 0.0241 to 0.0441 AUC
Achieved: 11.8% to 19.7% of the expected gain
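
The headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the relative improvement and compares the gain with the fold-to-fold spread. It assumes the `Variance` column holds a standard deviation of the CV AUC across folds, which the notebook does not state, so treat the last figure as indicative only.

```python
# Values copied from the comparison table above.
baseline_auc, distilbert_auc = 0.6253, 0.6312
# ASSUMPTION: the "Variance" column is a std of per-fold AUCs.
baseline_spread, distilbert_spread = 0.0334, 0.0236

gain = distilbert_auc - baseline_auc
relative_gain_pct = 100 * gain / baseline_auc

# Size of the gain relative to the noisier of the two experiments.
gain_in_spreads = gain / max(baseline_spread, distilbert_spread)

print(f"Absolute gain: {gain:+.4f} AUC")
print(f"Relative gain: {relative_gain_pct:.2f}%")
print(f"Gain / spread: {gain_in_spreads:.2f}")
```

If that assumption holds, the gain is well under one spread, so the two experiments may not be distinguishable from fold noise.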


## 1. Root Cause Analysis

Based on research and experiment results, identify why DistilBERT underperformed.

In [2]:
print("="*70)
print("ROOT CAUSE ANALYSIS: DISTILBERT UNDERPERFORMANCE")
print("="*70)

root_causes = {
    'Cause': [
        'Feature extraction (frozen embeddings)',
        '[CLS] token pooling loses information',
        'No task-specific fine-tuning',
        'LightGBM suboptimal for dense features',
        'DistilBERT capacity limitations'
    ],
    'Impact': [
        'High - Embeddings don\'t adapt to pizza request domain',
        'Medium - Discards token-level patterns',
        'High - No learning from task labels',
        'Medium - Tree models prefer sparse features',
        'Low-Medium - 66M params vs 110M BERT'
    ],
    'Evidence': [
        'Research shows frozen embeddings underperform on specific tasks',
        'Token-level patterns (keywords, sentiment) lost in pooling',
        'No gradient updates from classification loss',
        'TF-IDF sparse features work better with LightGBM',
        'DistilBERT is 40% smaller than BERT base'
    ],
    'Fix_Strategy': [
        'Fine-tune with classification head (2-3 epochs)',
        'Use mean pooling or attention-weighted pooling',
        'End-to-end fine-tuning on pizza request task',
        'Try neural classifier (MLP) or keep LightGBM',
        'Upgrade to RoBERTa or DeBERTa if needed'
    ]
}

df_causes = pd.DataFrame(root_causes)
for i, row in df_causes.iterrows():
    print(f"\n{i+1}. {row['Cause']}")
    print(f"   Impact: {row['Impact']}")
    print(f"   Evidence: {row['Evidence']}")
    print(f"   Fix: {row['Fix_Strategy']}")

ROOT CAUSE ANALYSIS: DISTILBERT UNDERPERFORMANCE

1. Feature extraction (frozen embeddings)
   Impact: High - Embeddings don't adapt to pizza request domain
   Evidence: Research shows frozen embeddings underperform on specific tasks
   Fix: Fine-tune with classification head (2-3 epochs)

2. [CLS] token pooling loses information
   Impact: Medium - Discards token-level patterns
   Evidence: Token-level patterns (keywords, sentiment) lost in pooling
   Fix: Use mean pooling or attention-weighted pooling

3. No task-specific fine-tuning
   Impact: High - No learning from task labels
   Evidence: No gradient updates from classification loss
   Fix: End-to-end fine-tuning on pizza request task

4. LightGBM suboptimal for dense features
   Impact: Medium - Tree models prefer sparse features
   Evidence: TF-IDF sparse features work better with LightGBM
   Fix: Try neural classifier (MLP) or keep LightGBM

5. DistilBERT capacity limitations
   Impact: Low-Medium - 66M params vs 110M BERT
   Evidence: DistilBERT is 40% smaller than BERT base
   Fix: Upgrade to RoBERTa or DeBERTa if needed

## 2. Research-Informed Solutions

Based on web research, identify proven techniques for improving transformer performance.

In [None]:
print("="*70)
print("RESEARCH-BACKED SOLUTIONS")
print("="*70)

solutions = {
    'Solution': [
        'Fine-tune DistilBERT (2-3 epochs)',
        'Freeze lower layers, train upper layers and head',
        'Use low learning rate (2e-5)',
        'Early stopping (patience=1)',
        'Ensemble TF-IDF + transformer',
        'Add enhanced meta-features',
        'Try RoBERTa base (125M params)',
        'Mean pooling instead of [CLS]'
    ],
    'Expected_Gain': [0.03, 0.02, 'Stability', 'Prevent overfit', 0.015, 0.02, 0.02, 0.01],
    'Risk_Level': ['Medium', 'Low', 'Low', 'Low', 'Low', 'Low', 'Medium', 'Low'],
    'Implementation_Time': ['3 hours', '2 hours', 'Included', 'Included', '1 hour', '2 hours', '2 hours', '1 hour'],
    'Research_Source': [
        'Multiple sources: fine-tuning adapts to task',
        'Freezing reduces overfit on small data',
        'Standard practice for transformer fine-tuning',
        'Critical for small datasets (<3000 samples)',
        'Kaggle winners: TF-IDF often gets 0.8 weight',
        'Readability, emotional intensity features',
        'Larger model, better performance than DistilBERT',
        'Captures token-level information better'
    ]
}

df_solutions = pd.DataFrame(solutions)
for i, row in df_solutions.iterrows():
    print(f"\n{i+1}. {row['Solution']}")
    print(f"   Expected gain: {row['Expected_Gain']}")
    print(f"   Risk: {row['Risk_Level']}")
    print(f"   Time: {row['Implementation_Time']}")
    print(f"   Source: {row['Research_Source']}")
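
The last solution in the list, mean pooling instead of [CLS], can be illustrated without loading a model. This is a minimal sketch on a made-up array. With a Hugging Face model, `hidden` would correspond to the `last_hidden_state` output and `mask` to the tokenizer's `attention_mask`; how exp_006 actually extracted its features is not shown in this notebook, so the function names here are hypothetical.

```python
import numpy as np

def cls_pool(hidden):
    """Return the first-token vector of each sequence (the [CLS] position)."""
    return hidden[:, 0, :]

def mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = mask[:, :, None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)              # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1e-9)       # (batch, 1), avoids /0
    return summed / counts

# Toy batch: 2 sequences, 4 token slots, 3-dim vectors (made-up values).
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],    # last slot is padding
                 [1, 1, 0, 0]])   # last two slots are padding

print(cls_pool(hidden))
print(mean_pool(hidden, mask))
```

Masking matters here: averaging over padded slots would pull short requests toward the padding vector and blur the difference between the two pooling strategies.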

## 3. Recommended Next Steps

Based on the analysis, prioritize the approaches with the best risk/reward ratio.

In [None]:
print("="*70)
print("PRIORITIZED ACTION PLAN")
print("="*70)

action_plan = {
    'Priority': [1, 2, 3, 4, 5],
    'Action': [
        'Fine-tune DistilBERT with classification head',
        'Add enhanced meta-features (readability, emotion)',
        'Ensemble TF-IDF + fine-tuned DistilBERT',
        'Try RoBERTa if DistilBERT plateaus',
        'Experiment with pooling strategies'
    ],
    'Expected_CV': [0.65, 0.64, 0.66, 0.67, 0.645],
    'Confidence': ['High', 'Medium', 'High', 'Medium', 'Low'],
    'Rationale': [
        'Direct fix for root cause - task adaptation',
        'Builds on proven meta-feature approach',
        'Kaggle winners use this combination',
        'Larger model if needed',
        'Experimental, lower priority'
    ]
}

df_plan = pd.DataFrame(action_plan)
print(df_plan.to_string(index=False))

print("\n" + "="*70)
print("EXPECTED TRAJECTORY")
print("="*70)
trajectory = [
    ("Current", 0.6312),
    ("After fine-tuning", 0.65),
    ("After enhanced features", 0.66),
    ("After ensembling", 0.68),
    ("After RoBERTa upgrade", 0.70)
]

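# Pad to the longest stage name so the scores line up in one column
width = max(len(stage) for stage, _ in trajectory)
for stage, score in trajectory:
    print(f"{stage:<{width}s}: {score:.4f} AUC")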
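
Priority 3 in the plan, ensembling TF-IDF with the fine-tuned transformer, could start as a rank-based weighted average. The sketch below is one possible form, not the method used in any experiment here: `rank_normalize`, `blend`, and the toy scores are hypothetical, and the 0.8 TF-IDF weight simply echoes the figure quoted in the solutions table and would need tuning on out-of-fold predictions.

```python
def rank_normalize(scores):
    """Map scores to (0, 1] by rank so models on different scales are comparable.

    Ties are broken by position, which is adequate for a first pass.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / len(scores)
    return ranks

def blend(tfidf_scores, bert_scores, tfidf_weight=0.8):
    """Weighted average of rank-normalized predictions from two models."""
    tfidf_ranks = rank_normalize(tfidf_scores)
    bert_ranks = rank_normalize(bert_scores)
    return [tfidf_weight * t + (1 - tfidf_weight) * b
            for t, b in zip(tfidf_ranks, bert_ranks)]

# Made-up out-of-fold probabilities for four requests.
tfidf_scores = [0.2, 0.9, 0.4, 0.6]
bert_scores = [0.1, 0.3, 0.8, 0.7]

blended = blend(tfidf_scores, bert_scores)
print([round(x, 2) for x in blended])
```

Rank normalization is used because AUC depends only on ordering, so it keeps a poorly calibrated model from dominating the average through its score scale alone.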

## 4. Risk Mitigation

Identify and mitigate risks for each approach.

In [None]:
print("="*70)
print("RISK MITIGATION STRATEGY")
print("="*70)

risks = {
    'Risk': [
        'Overfitting on small dataset (2878 samples)',
        'No improvement from fine-tuning',
        'Computational constraints (time limit)',
        'Ensembling complexity',
        'RoBERTa too slow for iteration'
    ],
    'Probability': ['Medium', 'Low', 'Medium', 'Low', 'Medium'],
    'Impact': ['High', 'Medium', 'High', 'Low', 'Medium'],
    'Mitigation': [
        'Freeze lower layers, early stopping, max 3 epochs',
        'Try RoBERTa or enhanced meta-features instead',
        'Use DistilBERT (faster), batch_size=16, 2-3 epochs max',
        'Simple weighted average first, stacking later',
        'Only if DistilBERT clearly plateaus'
    ]
}

df_risks = pd.DataFrame(risks)
for i, row in df_risks.iterrows():
    print(f"\n{i+1}. {row['Risk']}")
    print(f"   Probability: {row['Probability']}, Impact: {row['Impact']}")
    print(f"   Mitigation: {row['Mitigation']}")
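
The first mitigation combines early stopping with a hard cap of three epochs. The stopping rule can be sketched in plain Python as below. The function name and the validation AUCs are hypothetical. In a real run each value would come from evaluating the model on the held-out fold after an epoch, and the checkpoint from the best epoch would be the one kept.

```python
def best_epoch_with_early_stopping(val_aucs, patience=1, max_epochs=3):
    """Return (best_epoch, best_auc) given per-epoch validation AUCs.

    Stops once `patience` consecutive epochs fail to beat the best AUC
    seen so far, or after `max_epochs`, whichever comes first.
    """
    best_epoch, best_auc, bad_epochs = 0, float("-inf"), 0
    for epoch, auc in enumerate(val_aucs[:max_epochs], start=1):
        if auc > best_auc:
            best_epoch, best_auc, bad_epochs = epoch, auc, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement within the patience window
    return best_epoch, best_auc

# Hypothetical validation curves for illustration only.
print(best_epoch_with_early_stopping([0.640, 0.652, 0.647]))
print(best_epoch_with_early_stopping([0.640, 0.630, 0.700]))
```

With `patience=1` the second curve stops after epoch 2 and never sees the later recovery, which is the trade-off of such a tight setting on a noisy validation fold.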