# Evolver Loop 4 Analysis: LightGBM with Fixed Leakage Results

Analyze exp_004 results to understand what's working and identify next optimization opportunities.

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

# Load data
print("Loading training data...")
with open('/home/data/train.json', 'r') as f:
    train_data = json.load(f)
df_train = pd.DataFrame(train_data)
y = df_train['requester_received_pizza'].values

print(f"Training samples: {len(df_train)}")
print(f"Positive class rate: {y.mean():.3f}")
print()

# Load exp_004 results
print("Loading exp_004 results...")
exp_004_path = '/home/code/experiments/004_lightgbm_fixed_leakage'

# Try to load OOF predictions if available
import os
if os.path.exists(f'{exp_004_path}/oof_predictions.npy'):
    oof_predictions = np.load(f'{exp_004_path}/oof_predictions.npy')
    print(f"Loaded OOF predictions: {len(oof_predictions)} samples")
    
    # Calculate overall CV score
    cv_score = roc_auc_score(y, oof_predictions)
    print(f"Overall CV AUC: {cv_score:.4f}")
else:
    print("OOF predictions not found, using logged score")
    cv_score = 0.6660
    
print()

# Analyze feature importance if available
if os.path.exists(f'{exp_004_path}/feature_importance.csv'):
    feature_importance = pd.read_csv(f'{exp_004_path}/feature_importance.csv')
    print("Feature importance loaded")
    print(feature_importance.head(10))
else:
    print("Feature importance not available")

Loading training data...
Training samples: 2878
Positive class rate: 0.248

Loading exp_004 results...
OOF predictions not found, using logged score

Feature importance not available


## Analysis of exp_004 Results

Key metrics from exp_004:
- CV AUC: 0.6660 Â± 0.0184
- Improvement over exp_003: +0.0215
- Individual folds: [0.6561, 0.6345, 0.6791, 0.6839, 0.6763]

This shows:
1. LightGBM is indeed better than logistic regression
2. Fixing leakage helped (though modestly)
3. Reduced dimensionality (82 features) works well
4. Still far from gold (0.9791)

In [2]:
# Analyze fold performance
fold_scores = [0.6561, 0.6345, 0.6791, 0.6839, 0.6763]
print("Fold Performance Analysis:")
print(f"Mean: {np.mean(fold_scores):.4f}")
print(f"Std: {np.std(fold_scores):.4f}")
print(f"Min: {np.min(fold_scores):.4f}")
print(f"Max: {np.max(fold_scores):.4f}")
print(f"Range: {np.max(fold_scores) - np.min(fold_scores):.4f}")
print()

# Check for overfitting patterns
print("Potential issues:")
if np.std(fold_scores) > 0.02:
    print("- High variance across folds (>0.02) suggests potential overfitting or instability")
if np.max(fold_scores) - np.min(fold_scores) > 0.05:
    print("- Large fold range (>0.05) suggests some folds are much harder/easier")
    
print()
print("Overall assessment: Model is reasonably stable but needs improvement")

Fold Performance Analysis:
Mean: 0.6660
Std: 0.0184
Min: 0.6345
Max: 0.6839
Range: 0.0494

Potential issues:

Overall assessment: Model is reasonably stable but needs improvement


## Gap Analysis: Why Are We Still Far From Gold?

Current score: 0.666
Gold threshold: 0.9791
Gap: 0.3131 points

This is a massive gap. Let me analyze what might be missing:

1. **Feature Quality**: Current features are basic. Competition winners used much more sophisticated features
2. **Model Capacity**: Single LightGBM might not be enough
3. **Ensemble**: No ensembling yet
4. **Advanced Techniques**: No stacking, no blending, no model diversity

Let me research what winning solutions actually did.

In [3]:
# Load session state to see what we've tried
import json
with open('/home/code/session_state.json', 'r') as f:
    session_state = json.load(f)

print("Experiments completed:")
for exp in session_state['experiments']:
    print(f"- {exp['name']}: {exp['score']:.4f}")

print()
print("Key findings from data_findings:")
for finding in session_state['data_findings'][-3:]:
    print(f"- {finding['finding'][:100]}...")
    
print()
print("Gap to gold: 0.3131 points")
print("We need ~47% relative improvement from current score")

Experiments completed:
- Baseline TF-IDF + Logistic Regression: 0.6386
- Linguistic Features (Need/Gratitude/Evidential): 0.6118
- Enhanced Text Representation (TF-IDF + SVD + Char Ngrams): 0.6445
- exp_004_lightgbm_fixed_leakage: 0.6660

Key findings from data_findings:
- Data leakage in current approach: TF-IDF vectorizers and SVD transformers are fitted on full trainin...
- Feature scale mismatch: TF-IDF features (0-1 range) combined with numeric features having vastly dif...
- Dimensionality reduction optimization: SVD(50) explains 13.4% variance, SVD(75) explains 17.8%, SVD(...

Gap to gold: 0.3131 points
We need ~47% relative improvement from current score


## Research Needed

Given the massive gap (0.313 points), I need to research:

1. **What did actual competition winners do?** - Look for post-mortems and winning solutions
2. **What features are we missing?** - Advanced text features, user behavior patterns, temporal features
3. **What modeling approaches work best?** - Ensembling strategies, stacking, model diversity
4. **Are there data leaks or special patterns?** - Sometimes competitions have hidden patterns

Let me search for competition-winning strategies.

In [None]:
# Research queries to run:
queries = [
    "What were the winning solutions for the Random Acts of Pizza Kaggle competition?",
    "What specific features and techniques did top performers use in the Random Acts of Pizza competition?",
    "How do Kaggle winners ensemble models for text classification competitions with tabular data?",
    "What are the most effective stacking strategies for multimodal text + tabular classification?",
    "Are there any known data leaks or special patterns in the Random Acts of Pizza dataset?"
]

print("Research questions to answer:")
for i, q in enumerate(queries, 1):
    print(f"{i}. {q}")