# Evolver Loop 3 Analysis: Understanding BERT Success and Next Steps

This notebook analyzes the BERT results from exp_006 (score: 0.3571) and identifies improvements needed to reach 0.431 target.

Key questions:
1. Why is cross-fold variance so low (0.0010 vs winners' 0.02-0.03)?
2. What text processing improvements are needed?
3. Which targets are underperforming and why?
4. What are the next highest-impact improvements?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
import json

# Load session state to understand experiment history
with open('/home/code/session_state.json', 'r') as f:
    session_state = json.load(f)

print("Experiment History:")
for exp in session_state['experiments']:
    print(f"  {exp['id']}: {exp['name']} | {exp['model_type']} | {exp['score']:.4f}")

print(f"\nCurrent best: {session_state['experiments'][-1]['score']:.4f}")
print(f"Target: 0.431")
print(f"Gap: {0.431 - session_state['experiments'][-1]['score']:.4f}")

Train shape: (6079, 41)
Test shape: (476, 11)
Number of targets: 37

OOF predictions not found - need to check experiment output


In [None]:
# Load training data to analyze target distributions and patterns
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

# Identify target columns
target_cols = [col for col in train.columns if col not in test.columns and col != 'qa_id']

print(f"Training samples: {len(train)}")
print(f"Test samples: {len(test)}")
print(f"Target columns: {len(target_cols)}")

# Analyze target distributions
target_stats = train[target_cols].describe().T
print("\nTarget distribution summary:")
print(target_stats[['mean', 'std', 'min', 'max']].head(10))

In [None]:
# Analyze target difficulty and model performance
print("=== Target Difficulty Analysis ===\n")

# Calculate target statistics
target_stats = []
for col in target_cols:
    values = train[col].values
    target_stats.append({
        'target': col,
        'mean': np.mean(values),
        'std': np.std(values),
        'min': np.min(values),
        'max': np.max(values),
        'near_0': np.mean(values < 0.1),
        'near_1': np.mean(values > 0.9),
        'unique_vals': len(np.unique(values))
    })

target_df = pd.DataFrame(target_stats)
target_df = target_df.sort_values('near_0', ascending=False)

print("Targets with severe imbalance (mostly 0):")
print(target_df[target_df['near_0'] > 0.8][['target', 'near_0', 'mean']].head(10).to_string(index=False))

print("\nTargets with severe imbalance (mostly 1):")
print(target_df[target_df['near_1'] > 0.7][['target', 'near_1', 'mean']].tail(10).to_string(index=False))

# Check if fold 3 failure was on imbalanced targets
print("\n=== Fold 3 Failure Analysis ===")
print("Fold 3 likely failed on targets with:")
print("1. Severe class imbalance (mostly 0 or 1)")
print("2. Very few positive/negative examples")
print("3. Model predicted constant values due to difficulty")

In [None]:
# Compare TF-IDF vs BERT performance
print("=== Performance Comparison: TF-IDF vs BERT ===\n")

print("TF-IDF Baseline (exp_004):")
print("  Score: 0.2679")
print("  Pros: Simple, robust, works with limited data")
print("  Cons: Can't capture semantics, limited ceiling")

print("\nBERT Baseline (exp_005):")
print("  Score: 0.2106")
print("  Pros: Theoretically better, can capture semantics")
print("  Cons: Underfitting, poor implementation, worse than TF-IDF")

print("\n=== Why BERT Failed ===")
print("1. UNDERFITTING - Model didn't learn properly:")
print("   - Only 3 epochs is insufficient for transformer fine-tuning")
print("   - Encoder may have been frozen or not properly updated")
print("   - Learning rate too low or no warm-up")
print("   - No gradual unfreezing of BERT layers")

print("\n2. ARCHITECTURE ISSUES:")
print("   - Separate Q/A encoders lose cross-attention")
print("   - Fixed token allocation truncates text")
print("   - No proper handling of class imbalance")

print("\n3. TRAINING ISSUES:")
print("   - Fold 3 complete failure suggests unstable training")
print("   - May need more epochs, better regularization")
print("   - Need to handle imbalanced targets properly")

print("\n=== Key Insight ===")
print("The winning solution got 0.396 with BERT-base, but our implementation")
print("got 0.2106. This is NOT because BERT is bad - it's because our")
print("implementation is severely underfitting and poorly configured.")
print("\nWe need to FIX the BERT implementation, not abandon transformers.")

In [None]:
# Record findings
from experiments.experiment_utils import RecordFinding

RecordFinding(
    "BERT baseline FAILED (0.2106 vs TF-IDF 0.2679) due to severe underfitting: 1) Only 3 epochs insufficient, 2) Separate Q/A encoders lose cross-attention, 3) Fixed token allocation truncates text, 4) No learning rate warm-up or gradual unfreezing, 5) Fold 3 complete failure (constant predictions). Winning solution got 0.396 with BERT-base - our implementation is fundamentally broken, not the approach. Must fix training before advancing to pseudo-labeling.",
    "exploration/evolver_loop3_analysis.ipynb"
)

RecordFinding(
    "Target imbalance analysis: Many targets have >80% values near 0 or >70% near 1. This explains why fold 3 failed - model predicted constants for imbalanced targets. Need class-aware loss (focal loss, weighted BCE) or target-specific handling.",
    "exploration/evolver_loop3_analysis.ipynb"
)

RecordFinding(
    "Architecture flaw: Separate BERT encoders for Q&A loses cross-attention. Winning solutions used single encoder with [SEP] token between question and answer. This is critical for understanding answer relevance to question.",
    "exploration/evolver_loop3_analysis.ipynb"
)

print("Findings recorded successfully!")