# Evolver Loop 2 Analysis: Understanding TF-IDF Failure & Identifying High-Value Features

**Objective**: Analyze why the TF-IDF experiment underperformed and identify the most promising feature-engineering directions based on the evaluator's feedback.

**Focus Areas**:
1. Why did TF-IDF degrade performance?
2. User flair patterns - are they truly predictive?
3. User history features - how to engineer them safely
4. Temporal patterns - strength and engineering approach
5. Potential leakage in retrieval features

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Load data
print("Loading data...")
train_path = '/home/data/train.json'
test_path = '/home/data/test.json'

with open(train_path, 'r') as f:
    train_data = json.load(f)

with open(test_path, 'r') as f:
    test_data = json.load(f)

train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
print(f"Features in train: {train_df.shape[1]}")
print(f"Target distribution: {train_df['requester_received_pizza'].value_counts().to_dict()}")
print(f"Positive rate: {train_df['requester_received_pizza'].mean():.3f}")

Loading data...
Training samples: 2878
Test samples: 1162
Features in train: 32
Target distribution: {False: 2163, True: 715}
Positive rate: 0.248


## 1. Analyze User Flair Patterns

The evaluator identified user flair as a high-value signal. Let's verify this and understand how to use it safely.

In [2]:
# Check user flair distribution and success rates
if 'requester_user_flair' in train_df.columns:
    flair_analysis = train_df.groupby('requester_user_flair').agg({
        'requester_received_pizza': ['count', 'sum', 'mean']
    }).round(3)
    flair_analysis.columns = ['count', 'successes', 'success_rate']
    flair_analysis = flair_analysis.sort_values('success_rate', ascending=False)
    
    print("Top user flairs by success rate:")
    print(flair_analysis.head(10))
    print("\nBottom user flairs by success rate:")
    print(flair_analysis.tail(10))
    
    # Check for rare flairs
    rare_flairs = flair_analysis[flair_analysis['count'] < 5]
    print(f"\nRare flairs (count < 5): {len(rare_flairs)}")
    print(rare_flairs)
    
    # Count distinct flair values (a cardinality check, not an entropy measure)
    n_unique_flairs = train_df['requester_user_flair'].nunique()
    print(f"\nUnique flairs: {n_unique_flairs}")
    print(f"Flair coverage: {train_df['requester_user_flair'].notna().mean():.3f}")
else:
    print("requester_user_flair column not found in data")
    print("Available columns:", train_df.columns.tolist())

Top user flairs by success rate:
                      count  successes  success_rate
requester_user_flair                                
PIF                      38         38           1.0
shroom                  677        677           1.0

Bottom user flairs by success rate:
                      count  successes  success_rate
requester_user_flair                                
PIF                      38         38           1.0
shroom                  677        677           1.0

Rare flairs (count < 5): 0
Empty DataFrame
Columns: [count, successes, success_rate]
Index: []

Unique flairs: 2
Flair coverage: 0.248
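
The numbers above deserve a second look: 38 + 677 = 715 flaired rows, which equals the number of positive examples, and the flair coverage (0.248) equals the positive rate. A minimal sketch of a check for this kind of label-proxy feature follows. `label_proxy_report` is a hypothetical helper name, and the toy frame only stands in for `train_df`.

```python
import pandas as pd

def label_proxy_report(df, feature, target):
    """Compare 'feature is present' (non-null) against a boolean target."""
    has_value = df[feature].notna()
    y = df[target].astype(bool)
    return {
        "coverage": float(has_value.mean()),
        "positive_rate": float(y.mean()),
        # Fraction of rows where feature presence and the target agree
        "agreement": float((has_value == y).mean()),
    }

# Toy stand-in for train_df: flair appears only on successful requests
toy = pd.DataFrame({
    "requester_user_flair": ["shroom", None, None, "PIF", None, None, None, "shroom"],
    "requester_received_pizza": [True, False, False, True, False, False, False, True],
})
report = label_proxy_report(toy, "requester_user_flair", "requester_received_pizza")
print(report)
```

An agreement of 1.0 on the real data would suggest that flair is recorded as a consequence of the outcome, in which case it should be dropped rather than encoded.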


## 2. Analyze User History Patterns

User history features describe a requester's past behavior. To avoid leakage, they must use only information that was available at request time.
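
One way to keep history features leakage-safe is to sort by time and count only strictly earlier rows per user. The sketch below assumes a user column, a timestamp column, and a boolean outcome; the function name `add_prior_history` and the toy column names are hypothetical.

```python
import pandas as pd

def add_prior_history(df, user_col, time_col, target_col):
    """Add per-user counts of strictly earlier requests and successes."""
    out = df.sort_values(time_col, kind="stable").copy()
    # cumcount() numbers each user's rows 0, 1, 2, ... in time order,
    # which equals the number of earlier requests by that user
    out["prev_requests"] = out.groupby(user_col).cumcount()
    # Running total of successes minus the current row, so a row never
    # sees its own outcome
    successes = out[target_col].astype(int)
    out["prev_successes"] = successes.groupby(out[user_col]).cumsum() - successes
    return out.sort_index()

toy = pd.DataFrame({
    "user": ["a", "b", "a", "a", "b"],
    "ts": [10, 20, 30, 40, 50],
    "got_pizza": [True, False, False, True, True],
})
hist = add_prior_history(toy, "user", "ts", "got_pizza")
print(hist[["user", "ts", "prev_requests", "prev_successes"]])
```

If every user appears only once, both columns are all zeros and carry no signal, which is worth checking before investing in this direction.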

In [3]:
# Check if we have user identifiers
user_id_col = None
for col in ['requester_username', 'requester_name', 'requester_id']:
    if col in train_df.columns:
        user_id_col = col
        break

if user_id_col:
    print(f"Using {user_id_col} as user identifier")
    
    # Check how many requests per user
    user_counts = train_df[user_id_col].value_counts()
    print(f"\nRequests per user distribution:")
    print(user_counts.describe())
    
    multi_request_users = user_counts[user_counts > 1]
    print(f"\nUsers with multiple requests: {len(multi_request_users)} ({len(multi_request_users)/len(user_counts):.1%})")
    
    if len(multi_request_users) > 0:
        print("Sample multi-request users:")
        print(multi_request_users.head())
        
        # For users with multiple requests, we can create history features
        # But the dataset is small, so this might not be very useful
        print("\nNote: Very few users have multiple requests, so user history features will have limited value")
    else:
        print("\n⚠️ CRITICAL: No users have multiple requests in training data!")
        print("This means we CANNOT create user history features from this dataset.")
        print(f"The '{user_id_col}' values are likely anonymized per-request, not per-user.")
        
else:
    print("No user identifier column found")
    print("Available columns:", train_df.columns.tolist())

Using requester_username as user identifier


User history features created:
       requester_username_prev_requests  requester_username_prev_successes  \
count                            2878.0                             2878.0   
mean                                0.0                                0.0   
std                                 0.0                                0.0   
min                                 0.0                                0.0   
25%                                 0.0                                0.0   
50%                                 0.0                                0.0   
75%                                 0.0                                0.0   
max                                 0.0                                0.0   

       requester_username_prev_success_rate  \
count                                   0.0   
mean                                    NaN   
std                                     NaN   
min                                     NaN   
25%                            

## 3. Analyze Temporal Patterns

The evaluator mentioned temporal patterns. Let's verify and quantify them.

In [None]:
# Extract temporal features if timestamp available
if 'unix_timestamp_of_request' in train_df.columns or 'unix_timestamp_of_request_utc' in train_df.columns:
    if 'unix_timestamp_of_request' in train_df.columns:
        train_df['request_datetime'] = pd.to_datetime(train_df['unix_timestamp_of_request'], unit='s')
    else:
        train_df['request_datetime'] = pd.to_datetime(train_df['unix_timestamp_of_request_utc'], unit='s')
    
    # Extract time components
    train_df['request_hour'] = train_df['request_datetime'].dt.hour
    train_df['request_dayofweek'] = train_df['request_datetime'].dt.dayofweek
    train_df['request_dayofmonth'] = train_df['request_datetime'].dt.day
    train_df['request_month'] = train_df['request_datetime'].dt.month
    
    # Analyze hour patterns
    hour_success = train_df.groupby('request_hour')['requester_received_pizza'].agg(['count', 'mean']).round(3)
    hour_success.columns = ['count', 'success_rate']
    print("Success rate by hour of day:")
    print(hour_success)
    
    # Find peak hours
    peak_hours = hour_success[hour_success['count'] >= 20].sort_values('success_rate', ascending=False).head(5)
    print(f"\nTop 5 hours (with >=20 samples):")
    print(peak_hours)
    
    # Analyze day of week patterns
    day_success = train_df.groupby('request_dayofweek')['requester_received_pizza'].agg(['count', 'mean']).round(3)
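    # Map weekday numbers to names; this also works if a weekday is absent from the data
    day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    day_success.index = [day_names[d] for d in day_success.index]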
    day_success.columns = ['count', 'success_rate']
    print(f"\nSuccess rate by day of week:")
    print(day_success)
    
    # Analyze month patterns
    month_success = train_df.groupby('request_month')['requester_received_pizza'].agg(['count', 'mean']).round(3)
    month_success.columns = ['count', 'success_rate']
    print(f"\nSuccess rate by month:")
    print(month_success)
    
else:
    print("No timestamp column found for temporal analysis")
    print("Available columns:", [col for col in train_df.columns if 'time' in col.lower() or 'date' in col.lower()])
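
If the hour or weekday tables show a real effect, cyclical encoding is one common way to expose it to a model so that hour 23 and hour 0 end up close together. A minimal sketch, with `add_cyclical` as a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period):
    """Append sin/cos columns so the last and first values of `col` are neighbours."""
    angle = 2 * np.pi * df[col] / period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

hours = pd.DataFrame({"request_hour": [0, 6, 12, 23]})
encoded = add_cyclical(hours, "request_hour", period=24)
print(encoded.round(3))

# Distance in the encoded space: 23h is close to 0h, 12h is far from it
xy = encoded[["request_hour_sin", "request_hour_cos"]].to_numpy()
dist_23_to_0 = np.linalg.norm(xy[3] - xy[0])
dist_12_to_0 = np.linalg.norm(xy[2] - xy[0])
```

Tree models such as LightGBM can split on the raw hour directly, so this encoding matters mostly for linear or distance-based models.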

## 4. Investigate Potential Leakage in Retrieval Features

The evaluator flagged "retrieval" features as potential leakage. Let's investigate.

In [None]:
# Identify retrieval vs request features
retrieval_features = [col for col in train_df.columns if '_at_retrieval' in col]
request_features = [col for col in train_df.columns if '_at_request' in col]

print("Retrieval features (potential leakage):")
for col in retrieval_features:
    print(f"  - {col}")

print("\nRequest features (safe):")
for col in request_features:
    print(f"  - {col}")

# Analyze correlation of retrieval features with target
# numeric_only=True skips non-numeric columns instead of raising; the target's
# self-correlation (always 1.0) is dropped so it does not top the list
if retrieval_features:
    retrieval_corr = train_df[retrieval_features + ['requester_received_pizza']].corr(numeric_only=True)['requester_received_pizza'].drop('requester_received_pizza').sort_values(ascending=False)
    print(f"\nCorrelation with target (retrieval features):")
    print(retrieval_corr.head(10))

# Analyze correlation of request features with target
if request_features:
    request_corr = train_df[request_features + ['requester_received_pizza']].corr(numeric_only=True)['requester_received_pizza'].drop('requester_received_pizza').sort_values(ascending=False)
    print(f"\nCorrelation with target (request features):")
    print(request_corr.head(10))

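# Test model performance with only request features vs all features
# (StratifiedKFold and roc_auc_score are already imported in the first cell)
import lightgbm as lgb

# Prepare request-time features. The prefix match also catches *_at_retrieval columns,
# so exclude those explicitly, drop duplicates, and keep numeric columns only.
prefix_matches = [col for col in train_df.columns if col.startswith(('requester_number', 'requester_upvotes', 'requester_account_age'))]
candidates = [col for col in dict.fromkeys(request_features + prefix_matches) if '_at_retrieval' not in col]
numeric_features = [col for col in candidates if pd.api.types.is_numeric_dtype(train_df[col])]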

if numeric_features:
    X_request = train_df[numeric_features].fillna(0)
    y = train_df['requester_received_pizza'].astype(int)
    
    # Simple CV with request features only
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    request_scores = []
    
    for train_idx, valid_idx in skf.split(X_request, y):
        X_train, X_valid = X_request.iloc[train_idx], X_request.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        
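        # Use distinct names so the raw JSON held in train_data (first cell) is not overwritten
        lgb_train = lgb.Dataset(X_train, label=y_train)
        lgb_valid = lgb.Dataset(X_valid, label=y_valid)
        
        params = {
            'objective': 'binary',
            'metric': 'auc',
            'boosting_type': 'gbdt',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'verbose': -1,
            'seed': 42
        }
        
        # verbose_eval was removed from lgb.train in LightGBM 4.0; use a logging callback instead
        model = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_valid],
                          callbacks=[lgb.log_evaluation(period=0)])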
        valid_pred = model.predict(X_valid)
        request_scores.append(roc_auc_score(y_valid, valid_pred))
    
    print(f"\nCV AUC with request features only: {np.mean(request_scores):.4f} ± {np.std(request_scores):.4f}")
    
    # Compare with all features (including retrieval)
    all_numeric = [col for col in train_df.columns if train_df[col].dtype in ['int64', 'float64'] and col != 'requester_received_pizza']
    X_all = train_df[all_numeric].fillna(0)
    
    all_scores = []
    for train_idx, valid_idx in skf.split(X_all, y):
        X_train, X_valid = X_all.iloc[train_idx], X_all.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        
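        # Distinct names avoid overwriting the raw JSON held in train_data (first cell)
        lgb_train = lgb.Dataset(X_train, label=y_train)
        lgb_valid = lgb.Dataset(X_valid, label=y_valid)
        
        # verbose_eval was removed from lgb.train in LightGBM 4.0; use a logging callback instead
        model = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_valid],
                          callbacks=[lgb.log_evaluation(period=0)])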
        valid_pred = model.predict(X_valid)
        all_scores.append(roc_auc_score(y_valid, valid_pred))
    
    print(f"CV AUC with all numeric features: {np.mean(all_scores):.4f} ± {np.std(all_scores):.4f}")
    print(f"Difference: {np.mean(all_scores) - np.mean(request_scores):.4f}")
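
Selecting columns by prefix such as `requester_upvotes` also matches the `_at_retrieval` variants, so a "request-only" feature list needs an explicit exclusion. The sketch below illustrates the pitfall. `split_by_availability` is a hypothetical helper, and the `_demo` column names are made up for illustration rather than taken from the dataset.

```python
def split_by_availability(columns):
    """Partition column names by when their values were recorded."""
    at_request = [c for c in columns if "_at_request" in c]
    at_retrieval = [c for c in columns if "_at_retrieval" in c]
    other = [c for c in columns if c not in at_request and c not in at_retrieval]
    return at_request, at_retrieval, other

# Illustrative names only; the "_demo" columns are hypothetical
cols = [
    "requester_account_age_demo_at_request",
    "requester_account_age_demo_at_retrieval",
    "requester_upvotes_demo_at_request",
    "requester_upvotes_demo_at_retrieval",
    "unix_timestamp_of_request",
]
at_request, at_retrieval, other = split_by_availability(cols)

# A prefix-based selection silently includes the retrieval-time variant
prefix_matches = [c for c in cols if c.startswith("requester_upvotes")]
leaked = [c for c in prefix_matches if c in at_retrieval]

print("request-time:", at_request)
print("retrieval-time:", at_retrieval)
print("caught by prefix but unsafe:", leaked)
```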

## 5. Analyze Why TF-IDF Failed

Let's examine the TF-IDF implementation and identify issues.

In [None]:
# Confirm the TF-IDF experiment notebook is present, then list suspected issues.
# These are hypotheses to test; this cell does not inspect the notebook's contents.
import os
if os.path.exists('/home/code/experiments/002_tfidf_features.ipynb'):
    print("TF-IDF notebook exists - suspected issues (not yet verified):")
    print("1. Simple preprocessing may not capture Reddit-specific patterns")
    print("2. No handling of Reddit markdown, usernames, subreddit mentions")
    print("3. No sentiment analysis")
    print("4. No keyword-specific features for high-impact words identified in analysis")
    print("5. May have introduced noise without proper feature selection")
else:
    print("TF-IDF notebook not found")

# Check what high-impact keywords were identified in original analysis
print("\nFrom original analysis, these keywords had high success rates:")
print("- 'hungry', 'broke', 'student', 'paycheck', 'week', 'ramen'")
print("- EDIT presence: 41.6% vs 22.6% success rate")
print("- Need to create binary features for these specific terms")
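
The binary keyword features mentioned above could look roughly like this. This is a sketch under assumptions: the text column name `request_text` is a guess, `add_keyword_flags` is a hypothetical helper, and "EDIT presence" is interpreted as the upper-case word `EDIT`.

```python
import re
import pandas as pd

KEYWORDS = ["hungry", "broke", "student", "paycheck", "week", "ramen"]

def add_keyword_flags(df, text_col):
    """Add one 0/1 column per keyword plus a flag for an upper-case EDIT marker."""
    out = df.copy()
    text = out[text_col].fillna("")
    for word in KEYWORDS:
        # Word boundaries keep 'week' from matching inside 'weekend'
        pattern = rf"\b{re.escape(word)}\b"
        out[f"kw_{word}"] = text.str.contains(pattern, case=False, regex=True).astype(int)
    # Assumption: edits are marked with the literal upper-case word EDIT
    out["has_edit"] = text.str.contains(r"\bEDIT\b", regex=True).astype(int)
    return out

toy = pd.DataFrame({"request_text": [
    "Broke student, paycheck lands next week.",
    "Just hungry. EDIT: thanks everyone!",
    None,
]})
flags = add_keyword_flags(toy, "request_text")
print(flags.drop(columns="request_text"))
```

Exact word matching misses inflections such as "weeks"; whether to stem or widen the patterns is a design choice to validate with cross-validation.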

## Summary of Findings

Based on this analysis, here are the key insights:

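1. **User History Features**: The history output shows `prev_requests` equal to 0 for all 2878 training rows, so previous success rate and request frequency cannot be derived from this dataset alone.
2. **User Flair**: Treat as leakage, not signal. Only two flairs exist (`PIF`, `shroom`), both with a 100% success rate, and the 715 flaired rows (38 + 677) are exactly the 715 positive examples, so flair appears to be assigned after the outcome.
3. **Temporal Features**: The temporal cell has not been run yet, so hour, day, and month effects are still unverified; if they hold up, cyclical encoding is a candidate.
4. **Leakage Risk**: Retrieval features may be leaking; the with/without comparison in Section 4 still needs to be run.
5. **TF-IDF Issues**: Simple preprocessing is likely insufficient; Reddit-aware tokenization and keyword features are the next things to try.
6. **High-Impact Keywords**: Add binary features for the specific terms identified in the original analysis (`hungry`, `broke`, `student`, `paycheck`, `week`, `ramen`, and EDIT presence at 41.6% vs 22.6% success).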
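
For the Reddit-aware preprocessing in item 5, a first pass might normalize URLs, user mentions, and subreddit mentions into placeholder tokens before vectorizing. The sketch below is one possible approach, not the experiment's actual pipeline; `clean_reddit_text` and the placeholder tokens are hypothetical names.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# The lookbehind prevents matches inside ordinary words such as "menu/item"
USER_RE = re.compile(r"(?<![A-Za-z0-9_])/?u/[A-Za-z0-9_-]+")
SUB_RE = re.compile(r"(?<![A-Za-z0-9_])/?r/[A-Za-z0-9_]+")
MARKDOWN_RE = re.compile(r"[*_~`>#]+")

def clean_reddit_text(text):
    """Replace Reddit-specific patterns with tokens and strip markdown symbols."""
    text = URL_RE.sub(" URLTOKEN ", text)
    text = USER_RE.sub(" USERTOKEN ", text)
    text = SUB_RE.sub(" SUBTOKEN ", text)
    text = MARKDOWN_RE.sub(" ", text)
    # Collapse whitespace and lowercase for the vectorizer
    return re.sub(r"\s+", " ", text).strip().lower()

sample = "**Thanks** /u/some_user! See /r/pizza and http://example.com/x"
cleaned = clean_reddit_text(sample)
print(cleaned)
```

A function like this can be passed to scikit-learn's `TfidfVectorizer` through its `preprocessor` argument; whether it improves AUC is something only cross-validation can confirm.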