# Evolver Loop 3 Analysis

## Objective
Analyze the current experimental results and identify critical issues based on evaluator feedback.

Focus areas:
1. Model mismatch - why logistic regression is suboptimal
2. Data leakage risk in current feature engineering
3. Dimensionality vs sample size analysis
4. Feature scaling issues
5. Path forward to gold threshold

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Load data
print("Loading data...")
with open('/home/data/train.json', 'r') as f:
    train_data = json.load(f)

df_train = pd.DataFrame(train_data)
print(f"Training samples: {len(df_train)}")
print(f"Features per sample: ~11,158 (from exp_002)")
print(f"Dimensionality ratio: {11158/len(df_train):.1f}:1 (MASSIVE OVERFITTING RISK)")
print(f"Positive class rate: {df_train['requester_received_pizza'].mean():.3f}")

Loading data...
Training samples: 2878
Features per sample: ~11,158 (from exp_002)
Dimensionality ratio: 3.9:1 (MASSIVE OVERFITTING RISK)
Positive class rate: 0.248


In [2]:
# Analyze feature scale mismatches
print("=== FEATURE SCALE ANALYSIS ===")
print("\nCurrent exp_002 features:")
print("- TF-IDF word features: 0-1 range (sparse)")
print("- TF-IDF char features: 0-1 range (sparse)")
print("- SVD components: centered, but varying scales")
print("- Numeric features: varying scales")

# Create sample numeric features like in exp_002
count_features = [
    'requester_number_of_comments_at_request',
    'requester_number_of_posts_at_request', 
    'requester_upvotes_plus_downvotes_at_request'
]

for feat in count_features:
    df_train[f'{feat}_log'] = np.log1p(df_train[feat])

df_train['upvotes_per_comment'] = df_train['requester_upvotes_plus_downvotes_at_request'] / (df_train['requester_number_of_comments_at_request'] + 1)
df_train['comments_per_post'] = df_train['requester_number_of_comments_at_request'] / (df_train['requester_number_of_posts_at_request'] + 1)
df_train['account_age_years'] = df_train['requester_account_age_in_days_at_request'] / 365.25
df_train['text_length'] = df_train['request_text_edit_aware'].fillna('').str.len()
df_train['word_count'] = df_train['request_text_edit_aware'].fillna('').str.split().str.len()

numeric_features = [
    'requester_number_of_comments_at_request_log',
    'requester_number_of_posts_at_request_log', 
    'requester_upvotes_plus_downvotes_at_request_log',
    'upvotes_per_comment',
    'comments_per_post',
    'account_age_years',
    'text_length',
    'word_count'
]

print("\nNumeric feature ranges:")
for feat in numeric_features:
    print(f"{feat}: {df_train[feat].min():.2f} to {df_train[feat].max():.2f}")

print("\n⚠️ PROBLEM: Features have wildly different scales!")
print("   - account_age_years: ~0-10")
print("   - text_length: ~0-5000")
print("   - upvotes_per_comment: ~0-100,000")
print("   - TF-IDF: 0-1")
print("\n Logistic regression will be dominated by large-scale features!")

=== FEATURE SCALE ANALYSIS ===

Current exp_002 features:
- TF-IDF word features: 0-1 range (sparse)
- TF-IDF char features: 0-1 range (sparse)
- SVD components: centered, but varying scales
- Numeric features: varying scales

Numeric feature ranges:
requester_number_of_comments_at_request_log: 0.00 to 6.89
requester_number_of_posts_at_request_log: 0.00 to 6.77
requester_upvotes_plus_downvotes_at_request_log: 0.00 to 14.07
upvotes_per_comment: 0.00 to 98921.00
comments_per_post: 0.00 to 384.00
account_age_years: 0.00 to 7.69
text_length: 0.00 to 4460.00
word_count: 0.00 to 854.00

⚠️ PROBLEM: Features have wildly different scales!
   - account_age_years: ~0-10
   - text_length: ~0-5000
   - upvotes_per_comment: ~0-100,000
   - TF-IDF: 0-1

 Logistic regression will be dominated by large-scale features!
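
One possible way to address the scale mismatch is sketched below. It is a minimal illustration on synthetic stand-in columns, not the exp_002 pipeline: the heavy-tailed ratio gets a `log1p` transform, and a `StandardScaler` is fitted on the training rows only so that no validation statistics enter the scaler. The sample size, split indices, and distributions are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two numeric columns (assumed shapes, not the real data).
text_length = rng.integers(0, 4500, size=200).astype(float)
upvotes_per_comment = rng.lognormal(mean=2.0, sigma=2.0, size=200)

# log1p compresses the heavy right tail before standardizing.
X_num = np.column_stack([text_length, np.log1p(upvotes_per_comment)])

# Hypothetical train/validation split for illustration.
train_idx = np.arange(0, 160)
val_idx = np.arange(160, 200)

# Fit the scaler on training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_num[train_idx])
X_train = scaler.transform(X_num[train_idx])
X_val = scaler.transform(X_num[val_idx])

print("train mean:", X_train.mean(axis=0).round(3))
print("train std: ", X_train.std(axis=0).round(3))
```

After this step the training columns have zero mean and unit variance, so no single numeric feature outweighs the others purely because of its units.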


In [3]:
# Demonstrate the leakage issue
print("=== DATA LEAKAGE DEMONSTRATION ===")
print("\nCurrent approach (WRONG):")
print("1. Fit TF-IDF on ALL training data")
print("2. Fit SVD on ALL training data")
print("3. Split into CV folds")
print("4. Train model on train_idx, evaluate on val_idx")
print("\n❌ PROBLEM: IDF statistics and SVD components contain information from validation set!")

print("\nCorrect approach:")
print("1. Split into CV folds first")
print("2. For each fold:")
print("   - Fit TF-IDF ONLY on train_idx")
print("   - Fit SVD ONLY on train_idx")
print("   - Transform both train_idx and val_idx")
print("   - Train model, evaluate")
print("\n✓ This prevents leakage of distributional information")

# Show how much this matters with a small example
print("\n=== IMPACT OF LEAKAGE ===")
print("With 2,878 samples and 5-fold CV:")
print("- Each validation fold: ~576 samples")
print("- Each training fold: ~2,302 samples")
print("- Validation set is 20% of total data")
print("\nFitting transformers on full data means:")
print("- IDF scores incorporate validation set word frequencies")
print("- SVD components are optimized for validation set patterns")
print("- CV scores may be OVEROPTIMISTIC")
print("\nThis is especially problematic with small datasets!")

=== DATA LEAKAGE DEMONSTRATION ===

Current approach (WRONG):
1. Fit TF-IDF on ALL training data
2. Fit SVD on ALL training data
3. Split into CV folds
4. Train model on train_idx, evaluate on val_idx

❌ PROBLEM: IDF statistics and SVD components contain information from validation set!

Correct approach:
1. Split into CV folds first
2. For each fold:
   - Fit TF-IDF ONLY on train_idx
   - Fit SVD ONLY on train_idx
   - Transform both train_idx and val_idx
   - Train model, evaluate

✓ This prevents leakage of distributional information

=== IMPACT OF LEAKAGE ===
With 2,878 samples and 5-fold CV:
- Each validation fold: ~576 samples
- Each training fold: ~2,302 samples
- Validation set is 20% of total data

Fitting transformers on full data means:
- IDF scores incorporate validation set word frequencies
- SVD components are optimized for validation set patterns
- CV scores may be OVEROPTIMISTIC

This is especially problematic with small datasets!
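
The "correct approach" above can be sketched with a scikit-learn `Pipeline`, which refits every transformer on the training portion of each fold automatically. This is a minimal sketch on a toy corpus; the vocabulary, labels, and component count are hypothetical and chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Toy vocabulary (hypothetical, for illustration only).
class_words = {
    0: ["bored", "craving", "party", "movie", "snack", "lazy"],
    1: ["rent", "paycheck", "family", "kids", "job", "broke"],
}
shared_words = ["pizza", "please", "help", "tonight", "thanks", "reddit"]

labels = np.array([i % 2 for i in range(100)])
texts = [
    " ".join(
        list(rng.choice(class_words[int(y)], size=4))
        + list(rng.choice(shared_words, size=4))
    )
    for y in labels
]

# Because TF-IDF and SVD live inside the pipeline, cross_val_score fits them
# on each fold's training rows only and merely transforms the validation rows.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5, random_state=42),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="roc_auc")
print("Per-fold AUC:", scores.round(3))
```

The same pattern should carry over to the real text column and to a different final estimator, since only the last pipeline step changes.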


In [4]:
# Analyze model capacity issues
print("=== MODEL CAPACITY ANALYSIS ===")

print("\nLogistic Regression limitations:")
print("1. Linear decision boundaries only")
print("2. No feature interactions unless they are engineered explicitly")
print("3. Sensitive to feature scaling")
print("4. Struggles with high dimensionality")
print("5. No built-in feature selection")

print("\nCurrent situation:")
print("- Samples: 2,878")
print("- Features: 11,158")
print("- Ratio: 3.9 features per sample")
print("- Feature types: Mixed (sparse TF-IDF + dense numeric)")
print("- Scales: Inconsistent (0-1 vs 0-5000)")

print("\nWhy LightGBM/XGBoost would be better:")
print("1. Handle mixed data types natively")
print("2. Capture non-linear interactions automatically")
print("3. Robust to unscaled features")
print("4. Built-in feature importance/selection")
print("5. Better regularization for high-dimensional data")
print("6. Tree-based splitting handles sparse features well")

print("\nExpected improvement from model upgrade:")
print("- Competition post-mortems: +0.03 to +0.08 AUC")
print("- Our gap to gold: 0.3346 points")
print("- Model upgrade alone won't get us to gold, but it's necessary")
print("- Combined with proper feature engineering: +0.10 to +0.15 potential")

=== MODEL CAPACITY ANALYSIS ===

Logistic Regression limitations:
1. Linear decision boundaries only
2. No feature interactions unless they are engineered explicitly
3. Sensitive to feature scaling
4. Struggles with high dimensionality
5. No built-in feature selection

Current situation:
- Samples: 2,878
- Features: 11,158
- Ratio: 3.9 features per sample
- Feature types: Mixed (sparse TF-IDF + dense numeric)
- Scales: Inconsistent (0-1 vs 0-5000)

Why LightGBM/XGBoost would be better:
1. Handle mixed data types natively
2. Capture non-linear interactions automatically
3. Robust to unscaled features
4. Built-in feature importance/selection
5. Better regularization for high-dimensional data
6. Tree-based splitting handles sparse features well

Expected improvement from model upgrade:
- Competition post-mortems: +0.03 to +0.08 AUC
- Our gap to gold: 0.3346 points
- Model upgrade alone won't get us to gold, but it's necessary
- Combined with proper feature engineering: +0.10 to +0.15 potential
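
The "robust to unscaled features" point can be illustrated with a small experiment. The sketch below uses scikit-learn's `DecisionTreeClassifier` as a stand-in for a boosted tree ensemble (it is not LightGBM or XGBoost), on synthetic data with an XOR-like interaction. Rescaling one column leaves the tree's predictions unchanged, because splits depend only on the ordering of values. The data, depth, and scale factor are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends on an interaction between two features.
X = rng.normal(size=(200, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Rescale one column by a power of two (exact in floating point),
# mimicking a feature measured in much larger units.
X_scaled = X.copy()
X_scaled[:, 0] *= 1024.0

# Stand-in tree model, fitted separately on the raw and rescaled inputs.
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)
pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)

# Linear baseline for comparison on the same interaction-driven labels.
logreg = LogisticRegression(max_iter=1000).fit(X, y)

print("Tree predictions identical after rescaling:", np.array_equal(pred_raw, pred_scaled))
print("Tree training accuracy:  ", (pred_raw == y).mean().round(3))
print("LogReg training accuracy:", (logreg.predict(X) == y).mean().round(3))
```

Training accuracy is shown only to contrast the two model families on an interaction the linear model cannot represent; it says nothing about generalization.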


In [5]:
# Analyze dimensionality reduction options
print("=== DIMENSIONALITY REDUCTION ANALYSIS ===")

# Simulate different SVD component counts
sample_text = df_train['request_title'].fillna('') + ' ' + df_train['request_text_edit_aware'].fillna('')

# Fit TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1,2))
tfidf_matrix = vectorizer.fit_transform(sample_text)

print(f"Original TF-IDF dimensions: {tfidf_matrix.shape}")

# Test different SVD component counts
component_counts = [50, 75, 100, 150, 200]
explained_variances = []

for n_components in component_counts:
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    svd.fit(tfidf_matrix)
    explained_variances.append(svd.explained_variance_ratio_.sum())
    print(f"SVD({n_components}): {svd.explained_variance_ratio_.sum():.3f} variance explained")

print("\nRecommendation:")
print("- Current: 100 word + 50 char = 150 components")
print("- Total features: 150 + 8 numeric = 158")
print("- Ratio: 158/2878 = 5.5% (much better than 388%)")
print("\nBut we can be more aggressive:")
print("- Try 50 word + 25 char = 75 components")
print("- Total features: 75 + 8 numeric = 83")
print("- Ratio: 83/2878 = 2.9% (even better)")
print("- Competition winners often use 50-100 components total")

=== DIMENSIONALITY REDUCTION ANALYSIS ===
Original TF-IDF dimensions: (2878, 5000)
SVD(50): 0.134 variance explained
SVD(75): 0.178 variance explained
SVD(100): 0.216 variance explained
SVD(150): 0.281 variance explained
SVD(200): 0.336 variance explained

Recommendation:
- Current: 100 word + 50 char = 150 components
- Total features: 150 + 8 numeric = 158
- Ratio: 158/2878 = 5.5% (much better than 388%)

But we can be more aggressive:
- Try 50 word + 25 char = 75 components
- Total features: 75 + 8 numeric = 83
- Ratio: 83/2878 = 2.9% (even better)
- Competition winners often use 50-100 components total
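
The feature-to-sample ratios quoted in the recommendation can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in this notebook (2,878 samples; 11,158, 158, and 83 features); the configuration labels are descriptive names chosen for the example.

```python
n_samples = 2878

# Feature counts taken from the analysis above; labels are illustrative.
configs = {
    "exp_002 (full feature stack)": 11158,
    "100 word + 50 char SVD + 8 numeric": 100 + 50 + 8,
    "50 word + 25 char SVD + 8 numeric": 50 + 25 + 8,
}

# Ratio of features to training samples for each configuration.
ratios = {name: n_features / n_samples for name, n_features in configs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {configs[name]} features, {ratio:.1%} of sample count")
```

Expressing each configuration as a percentage of the sample count makes the effect of the SVD reduction directly comparable across experiments.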


In [6]:
# Summary of findings
print("=== KEY FINDINGS ===")
print("\n1. MODEL MISMATCH (CRITICAL)")
print("   - Logistic regression is fundamentally wrong for this problem")
print("   - Need tree-based model (LightGBM/XGBoost) for non-linear patterns")
print("   - Expected gain: +0.03 to +0.08 AUC")

print("\n2. DATA LEAKAGE (HIGH PRIORITY)")
print("   - TF-IDF and SVD fitted on full data before CV splits")
print("   - Validation set information leaks into transformers")
print("   - CV scores may be optimistic")
print("   - Fix: Move ALL fitting inside CV loops")

print("\n3. DIMENSIONALITY CRISIS (HIGH PRIORITY)")
print("   - 11,158 features vs 2,878 samples = 3.9:1 ratio")
print("   - Extreme overfitting risk")
print("   - Solution: Aggressive SVD reduction (50-75 components total)")

print("\n4. FEATURE SCALING (MEDIUM PRIORITY)")
print("   - TF-IDF (0-1) mixed with numeric features (0-5000)")
print("   - Logistic regression dominated by large-scale features")
print("   - LightGBM/XGBoost don't need scaling, but proper preprocessing helps")

print("\n5. CONVERGENCE ISSUES (MEDIUM PRIORITY)")
print("   - Warnings despite max_iter=1000")
print("   - Indicates poor conditioning")
print("   - Will resolve with better model and dimensionality reduction")

print("\n=== PATH TO GOLD ===")
print("Current: 0.6445")
print("Target: 0.9791")
print("Gap: 0.3346")
print("\nNext experiment priorities:")
print("1. Switch to LightGBM (highest impact)")
print("2. Fix data leakage (proper CV)")
print("3. Reduce dimensionality (50-75 SVD components)")
print("4. Optimize feature combinations")
print("5. Ensemble diverse models")
print("\nExpected trajectory:")
print("- Experiment 004 (LightGBM + fix leakage): 0.68-0.72")
print("- Experiment 005 (optimized features): 0.75-0.80")
print("- Experiment 006 (ensembling): 0.82-0.85")
print("- Further iterations needed to reach 0.9791")

=== KEY FINDINGS ===

1. MODEL MISMATCH (CRITICAL)
   - Logistic regression is fundamentally wrong for this problem
   - Need tree-based model (LightGBM/XGBoost) for non-linear patterns
   - Expected gain: +0.03 to +0.08 AUC

2. DATA LEAKAGE (HIGH PRIORITY)
   - TF-IDF and SVD fitted on full data before CV splits
   - Validation set information leaks into transformers
   - CV scores may be optimistic
   - Fix: Move ALL fitting inside CV loops

3. DIMENSIONALITY CRISIS (HIGH PRIORITY)
   - 11,158 features vs 2,878 samples = 3.9:1 ratio
   - Extreme overfitting risk
   - Solution: Aggressive SVD reduction (50-75 components total)

4. FEATURE SCALING (MEDIUM PRIORITY)
   - TF-IDF (0-1) mixed with numeric features (0-5000)
   - Logistic regression dominated by large-scale features
   - LightGBM/XGBoost don't need scaling, but proper preprocessing helps

5. CONVERGENCE ISSUES (MEDIUM PRIORITY)
   - Warnings despite max_iter=1000
   - Indicates poor conditioning
   - Will resolve with better model and dimensionality reduction
