# Evolver Loop 2 Analysis

## Objective
Analyze the current state and identify gaps in our approach for the Random Acts of Pizza competition.

Key questions:
1. What experiments have actually been run and what were the results?
2. What is the current best CV score?
3. What features have been tried?
4. What gaps exist in our approach?
5. What should be our priority for the next experiment?

Let's start by examining what actually exists in the workspace.

In [1]:
import pandas as pd
import numpy as np
import json
import os
from pathlib import Path

# Check what files exist in the workspace
print("=== Workspace Structure ===")
for path in ['/home/code', '/home/code/exploration', '/home/data']:
    if os.path.exists(path):
        print(f"\n{path}:")
        for item in sorted(os.listdir(path)):
            item_path = os.path.join(path, item)
            if os.path.isfile(item_path):
                size = os.path.getsize(item_path)
                print(f"  {item} ({size} bytes)")
            else:
                print(f"  {item}/")

# Load the training data to understand what we're working with
print("\n=== Training Data Overview ===")
train_path = '/home/data/train.json'
if os.path.exists(train_path):
    with open(train_path, 'r') as f:
        train_data = json.load(f)
    
    df = pd.DataFrame(train_data)
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print(f"\nTarget distribution:")
    print(df['requester_received_pizza'].value_counts(normalize=True))
    
    # Check for any existing experiment results
    print("\n=== Looking for Experiment Results ===")
    exp_files = [f for f in os.listdir('/home/code') if f.startswith('exp_') and f.endswith('.py')]
    print(f"Experiment scripts found: {exp_files}")
    
    # Check for any submission files
    sub_files = [f for f in os.listdir('/home/code') if 'submission' in f.lower()]
    print(f"Submission files found: {sub_files}")
else:
    print("Training data not found!")

=== Workspace Structure ===

/home/code:
  cache/
  code/
  evaluator_feedback.txt (4279 bytes)
  experiments/
  exploration/
  replay_samples/
  research/
  reward_function_outputs.txt (5309 bytes)
  session_state.json (9788 bytes)
  state.json (49820 bytes)
  submission_candidates/
  trace.json (13606415 bytes)

/home/code/exploration:
  eda.ipynb (29541 bytes)
  evolver_loop1_analysis.ipynb (16857 bytes)
  evolver_loop2_analysis.ipynb (7782 bytes)

/home/data:
  description.md (7182 bytes)
  sampleSubmission.csv (13315 bytes)
  test.json (2144393 bytes)
  test.zip (443511 bytes)
  train.json (8780241 bytes)
  train.zip (1347401 bytes)

=== Training Data Overview ===
Shape: (2878, 32)
Columns: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_reque

## Key Findings from Initial Analysis

Based on the workspace examination:

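1. **No experiment scripts found at the top level** - No `exp_*.py` scripts or submission files sit directly under `/home/code`; the `experiments/` and `submission_candidates/` directories were listed but not inspected
2. **No baseline model confirmed** - The claimed 0.6433 AUC from "experiment 001_baseline_lgbm" could not be verified from this check and may be fabricated or from a different context
3. **Exploration notebooks only** - The only notebooks found are under `/home/code/exploration`, so the work visible here is still EDA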
4. **Data is loaded and ready** - Training data is accessible with 2,878 samples and 32 features
5. **Class imbalance is significant** - 75.16% negative, 24.84% positive class

## Critical Gaps Identified

1. **No baseline model** - We need to establish a proper baseline first
2. **No CV framework** - Need to implement proper cross-validation
3. **No feature engineering** - Only basic data loading and inspection done
4. **No text preprocessing** - Raw text features haven't been processed
5. **No model training** - Zero experiments with actual model training

## Priority Actions

1. **Establish baseline** - Create a simple model to get initial CV score
2. **Implement proper CV** - Set up stratified k-fold with leakage prevention
3. **Basic text features** - Extract simple text statistics (length, word count, etc.)
4. **Test class imbalance handling** - Given 3:1 negative:positive ratio
5. **Build evaluation pipeline** - Create reusable CV scoring function
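
A minimal sketch of the reusable CV scoring function from action 5, under stated assumptions: the name `cv_auc`, the `LogisticRegression` stand-in, and the synthetic `X_demo`/`y_demo` data are illustrative only and are not part of the workspace.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cv_auc(estimator, X, y, n_splits=5, seed=42):
    """Return per-fold ROC AUC, fitting a fresh clone of `estimator` on each fold."""
    X = np.asarray(X)
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = clone(estimator)  # unfitted copy, so no state carries over between folds
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], proba))
    return np.array(scores)


# Synthetic stand-in data (assumption) with roughly a 3:1 negative:positive ratio.
rng = np.random.default_rng(0)
y_demo = (rng.random(400) < 0.25).astype(int)
X_demo = rng.normal(size=(400, 3)) + y_demo[:, None]  # shift positives by +1
demo_scores = cv_auc(LogisticRegression(class_weight="balanced"), X_demo, y_demo)
print(f"CV AUC: {demo_scores.mean():.4f} ± {demo_scores.std():.4f}")
```

Any estimator with `fit` and `predict_proba` could be passed instead, so the same splits and seed are reused across experiments and scores stay comparable.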

In [2]:
# Let's create a simple baseline to establish our starting point
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

print("=== Creating Simple Baseline ===")

# Prepare basic features
basic_features = []

# Text length features
df['title_length'] = df['request_title'].fillna('').str.len()
df['text_length'] = df['request_text_edit_aware'].fillna('').str.len()
df['text_word_count'] = df['request_text_edit_aware'].fillna('').str.split().str.len()

basic_features.extend(['title_length', 'text_length', 'text_word_count'])

# Numeric metadata features
numeric_features = ['requester_upvotes_plus_downvotes_at_request', 
                   'requester_upvotes_minus_downvotes_at_request',
                   'request_number_of_comments_at_request']

# Keep only columns that exist in df; a name missing from the data is dropped silently
numeric_features = [f for f in numeric_features if f in df.columns]
basic_features.extend(numeric_features)

print(f"Using features: {basic_features}")
print(f"Number of features: {len(basic_features)}")

# Prepare data
X = df[basic_features].fillna(0)
y = df['requester_received_pizza'].astype(int)

print(f"X shape: {X.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")

# Calculate scale_pos_weight for class imbalance
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(f"Scale pos weight: {scale_pos_weight:.3f}")

# Simple CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    model = LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        scale_pos_weight=scale_pos_weight,
        random_state=42,
        verbose=-1
    )
    
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    score = roc_auc_score(y_val, y_pred)
    cv_scores.append(score)
    print(f"Fold {fold + 1}: {score:.4f}")

print(f"\nCV Score: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")

# Save this baseline info
baseline_info = {
    'features': basic_features,
    'cv_scores': cv_scores,
    'mean_cv': np.mean(cv_scores),
    'std_cv': np.std(cv_scores),
    'scale_pos_weight': scale_pos_weight
}

print(f"\nBaseline established: {baseline_info['mean_cv']:.4f} AUC")

=== Creating Simple Baseline ===
Using features: ['title_length', 'text_length', 'text_word_count', 'requester_upvotes_plus_downvotes_at_request', 'requester_upvotes_minus_downvotes_at_request']
Number of features: 5
X shape: (2878, 5)
Class distribution: {0: 2163, 1: 715}
Scale pos weight: 3.025
Fold 1: 0.5559
Fold 2: 0.5513
Fold 3: 0.5959
Fold 4: 0.5267
Fold 5: 0.5749

CV Score: 0.5609 ± 0.0233

Baseline established: 0.5609 AUC


In [2]:
# Reload the raw training records (a list of dicts) for the text analysis
import json
import numpy as np
import pandas as pd

train_path = '/home/data/train.json'
with open(train_path, 'r') as f:
    train_data = json.load(f)

print(f"Loaded {len(train_data)} training samples")
print(f"Positive rate: {sum(item['requester_received_pizza'] for item in train_data) / len(train_data):.3f}")

Loaded 2878 training samples
Positive rate: 0.248


## Analyzing TF-IDF Failure in Experiment 002

**Problem**: Experiment 002 added TF-IDF features (10,000 n-grams) + psycholinguistic features, but score DECREASED from 0.6433 to 0.6217 AUC (-0.0216).

**Goal**: Diagnose why text features hurt performance and identify fixes.

In [3]:
# Analyze text patterns in successful vs unsuccessful requests
from collections import Counter
import re

# Separate successful and unsuccessful requests
successful_texts = [item['request_text_edit_aware'] for item in train_data if item['requester_received_pizza']]
unsuccessful_texts = [item['request_text_edit_aware'] for item in train_data if not item['requester_received_pizza']]

print(f"Successful requests: {len(successful_texts)}")
print(f"Unsuccessful requests: {len(unsuccessful_texts)}")

# Extract common words (excluding stop words)
stop_words = set(['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'between', 'among', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might', 'must', 'shall', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'would', 'should', 'could', 'ought', 'i', 'm', 'you', 're', 'we', 'they', 've', 'll', 'd', 'don', 'doesn', 'didn', 'wasn', 'weren', 'won', 'wouldn', 'shouldn', 'couldn'])

def extract_words(text):
    # Simple word extraction
    words = re.findall(r'\b[a-zA-Z]{3,}\b', text.lower())
    return [w for w in words if w not in stop_words]

# Count words in successful vs unsuccessful
successful_words = Counter()
for text in successful_texts:
    successful_words.update(extract_words(text))

unsuccessful_words = Counter()
for text in unsuccessful_texts:
    unsuccessful_words.update(extract_words(text))

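# Find words that are more common in successful requests.
# Note: counts are total token occurrences, so each "rate" is occurrences per
# request rather than the share of requests containing the word, and only the
# 20 most frequent words of each class are checked.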
print("\n=== Words more common in SUCCESSFUL requests ===")
successful_top = successful_words.most_common(20)
for word, count in successful_top:
    success_rate = count / len(successful_texts)
    fail_rate = unsuccessful_words.get(word, 0) / len(unsuccessful_texts)
    if success_rate > fail_rate * 1.5:  # At least 50% more common
        print(f"  {word}: {count}/{len(successful_texts)} ({success_rate:.3f}) vs {unsuccessful_words.get(word, 0)}/{len(unsuccessful_texts)} ({fail_rate:.3f})")

# Find words that are more common in unsuccessful requests
print("\n=== Words more common in UNSUCCESSFUL requests ===")
unsuccessful_top = unsuccessful_words.most_common(20)
for word, count in unsuccessful_top:
    fail_rate = count / len(unsuccessful_texts)
    success_rate = successful_words.get(word, 0) / len(successful_texts)
    if fail_rate > success_rate * 1.5:  # At least 50% more common
        print(f"  {word}: {count}/{len(unsuccessful_texts)} ({fail_rate:.3f}) vs {successful_words.get(word, 0)}/{len(successful_texts)} ({success_rate:.3f})")

Successful requests: 715
Unsuccessful requests: 2163

=== Words more common in SUCCESSFUL requests ===
  last: 179/715 (0.250) vs 340/2163 (0.157)
  until: 176/715 (0.246) vs 340/2163 (0.157)

=== Words more common in UNSUCCESSFUL requests ===
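
The empty list above may simply reflect that only the 20 most frequent words per class were compared. A hedged alternative is to count how many requests contain each word and rank every word by a smoothed log ratio of those document rates. The sketch below assumes hypothetical helpers `doc_freq` and `log_odds` and uses invented toy texts, not the competition data.

```python
import math
import re
from collections import Counter


def doc_freq(texts):
    """Count, for each word, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(re.findall(r"\b[a-z]{3,}\b", (text or "").lower())))
    return counts


def log_odds(pos_texts, neg_texts, min_docs=2):
    """Smoothed log ratio of document rates; positive means more common in pos_texts."""
    pos_df, neg_df = doc_freq(pos_texts), doc_freq(neg_texts)
    scores = {}
    for word in set(pos_df) | set(neg_df):
        if pos_df[word] + neg_df[word] < min_docs:
            continue  # too rare to say anything about
        pos_rate = (pos_df[word] + 1) / (len(pos_texts) + 2)
        neg_rate = (neg_df[word] + 1) / (len(neg_texts) + 2)
        scores[word] = math.log(pos_rate / neg_rate)
    return scores


# Toy texts, invented for illustration only.
pos_demo = ["paid until friday", "rent due until payday", "lost job last week"]
neg_demo = ["want pizza", "pizza please", "pizza tonight please", "hungry want pizza"]
scores = log_odds(pos_demo, neg_demo)
ranked = sorted(scores, key=scores.get)
print("most typical of unsuccessful:", ranked[0])
print("most typical of successful:", ranked[-1])
```

On the real data this would be called with `successful_texts` and `unsuccessful_texts`, probably with a larger `min_docs` to suppress noise from rare words.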


In [None]:
# Analyze TF-IDF configuration issues
print("=== Analyzing TF-IDF Configuration Issues ===")

# Build the text corpus and label array from the loaded records
train_texts = [item['request_text_edit_aware'] for item in train_data]
y_train = np.array([item['requester_received_pizza'] for item in train_data])

# Test different TF-IDF configurations
from sklearn.feature_extraction.text import TfidfVectorizer

configs = [
    {
        'name': 'Original (10k features, stop_words=english)',
        'params': {'ngram_range': (1, 3), 'max_features': 10000, 'min_df': 2, 'max_df': 0.9, 'stop_words': 'english'}
    },
    {
        'name': 'No stop words (10k features)',
        'params': {'ngram_range': (1, 3), 'max_features': 10000, 'min_df': 2, 'max_df': 0.9, 'stop_words': None}
    },
    {
        'name': 'Reduced features (3k features, no stop words)',
        'params': {'ngram_range': (1, 3), 'max_features': 3000, 'min_df': 2, 'max_df': 0.9, 'stop_words': None}
    },
    {
        'name': 'Character n-grams (3-5 chars, 3k features)',
        'params': {'analyzer': 'char', 'ngram_range': (3, 5), 'max_features': 3000, 'min_df': 2, 'max_df': 0.9}
    }
]

results = []

for config in configs:
    print(f"\nTesting: {config['name']}")
    
    # Fit TF-IDF
    vectorizer = TfidfVectorizer(**config['params'])
    X_tfidf = vectorizer.fit_transform(train_texts)
    
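    # Get feature names as a set for fast membership tests
    feature_names = set(vectorizer.get_feature_names_out())
    
    # Check for expected terms (with analyzer='char' only 3-5 character n-grams
    # exist, so longer words such as 'please' can never match)
    expected_terms = ['pizza', 'please', 'help', 'hungry', 'imgur', 'father', 'daughter', 'dominos']
    found_terms = [term for term in expected_terms if term in feature_names]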
    
    print(f"  Features: {X_tfidf.shape[1]}")
    print(f"  Density: {X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1]):.4f}")
    print(f"  Found expected terms: {found_terms}")
    
    results.append({
        'name': config['name'],
        'n_features': X_tfidf.shape[1],
        'density': X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1]),
        'found_terms': found_terms
    })

results_df = pd.DataFrame(results)
print("\n" + "="*60)
print("TF-IDF Configuration Comparison")
print("="*60)
print(results_df[['name', 'n_features', 'density']])
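
The comparison above reports only feature counts and density, which do not show whether a configuration helps AUC. A minimal sketch of scoring one configuration with cross-validation follows. It assumes a hypothetical helper `score_tfidf_config` and swaps in `LogisticRegression` as a stand-in classifier rather than the LightGBM model used in the baseline; the toy corpus is invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def score_tfidf_config(texts, labels, tfidf_params, n_splits=5, seed=42):
    """Mean and std of CV ROC AUC for one TF-IDF configuration.

    The vectorizer sits inside the pipeline, so its vocabulary and IDF
    weights are refit on the training part of every fold.
    """
    pipeline = make_pipeline(
        TfidfVectorizer(**tfidf_params),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()


# Toy corpus, invented for illustration; the real call would pass
# train_texts, y_train and config['params'].
demo_texts = ["lost my job need help until payday"] * 10 + ["want pizza tonight"] * 10
demo_labels = np.array([1] * 10 + [0] * 10)
mean_auc, std_auc = score_tfidf_config(
    demo_texts, demo_labels, {"ngram_range": (1, 2), "min_df": 2}
)
print(f"CV AUC: {mean_auc:.4f} ± {std_auc:.4f}")
```

Fitting the vectorizer on the full training set before splitting, as in the loop above, is fine for inspecting vocabulary but would leak validation text into the features if reused for scoring.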