# Loop 4 Analysis: Diagnosing Target Encoding Failure

## Objective
Investigate WHY adding target encoding and product features degraded the CV score (RMSLE) by 933% (0.020 → 0.212).

## Key Questions
1. Are product features leaking target information?
2. Is manual target encoding implementation flawed?
3. Which specific features are causing the problem?
4. How do winners implement these features successfully?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error
import warnings
warnings.filterwarnings('ignore')

SEED = 42
np.random.seed(SEED)

# Load data
train_df = pd.read_csv('/home/code/data/train.csv')
test_df = pd.read_csv('/home/code/data/test.csv')

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
print("\nTarget statistics:")
print(train_df['Calories'].describe())

Train shape: (8000, 9)
Test shape: (2000, 9)

Target statistics:
count    8000.000000
mean      143.772778
std        76.566039
min        10.000000
25%        91.227982
50%       121.244149
75%       174.789980
max       500.000000
Name: Calories, dtype: float64


## 1. Analyze Product Features Correlation with Target

In [2]:
# Create product features as implemented in exp_003
train_analysis = train_df.copy()

# Product features
train_analysis['Weight_Duration'] = train_analysis['Weight'] * train_analysis['Duration']
train_analysis['Duration_Heart_Rate'] = train_analysis['Duration'] * train_analysis['Heart_Rate']
train_analysis['Height_Weight'] = train_analysis['Height'] * train_analysis['Weight']

# Calculate correlations
correlations = {}
for col in ['Weight_Duration', 'Duration_Heart_Rate', 'Height_Weight']:
    corr = train_analysis[col].corr(train_analysis['Calories'])
    correlations[col] = corr
    print(f"{col}: correlation with target = {corr:.4f}")

# Check if these are essentially the target in disguise
print("\n=== Checking for target leakage ===")
print("Sample of Weight_Duration vs Calories:")
print(pd.DataFrame({
    'Weight_Duration': train_analysis['Weight_Duration'].head(),
    'Calories': train_analysis['Calories'].head()
}))

# Try to predict Calories from product features alone.
# These features are built only from input columns, so a low score here shows
# strong signal; it would indicate leakage only if the inputs themselves leak.
from sklearn.linear_model import LinearRegression

X_products = train_analysis[['Weight_Duration', 'Duration_Heart_Rate', 'Height_Weight']]
y = train_analysis['Calories']

# CV score using ONLY product features
kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
product_scores = []

for train_idx, val_idx in kf.split(X_products):
    X_train, X_val = X_products.iloc[train_idx], X_products.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # Fit linear regression on training folds
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    
    # Predict on validation
    pred_val = lr.predict(X_val)
    pred_val = np.clip(pred_val, 0, None)  # Clip negative values
    
    # Calculate RMSLE
    score = np.sqrt(mean_squared_log_error(y_val, pred_val))
    product_scores.append(score)

print(f"\nCV RMSLE using ONLY product features: {np.mean(product_scores):.4f} ± {np.std(product_scores):.4f}")
print("If this is very low, product features are leaking target information!")

Weight_Duration: correlation with target = 0.9396
Duration_Heart_Rate: correlation with target = 0.9364
Height_Weight: correlation with target = 0.1042

=== Checking for target leakage ===
Sample of Weight_Duration vs Calories:
   Weight_Duration    Calories
0      1626.889988  120.317974
1      2665.097527  153.337764
2       500.000000   77.173770
3      2176.925537  181.706220
4      1901.755408   98.043957

CV RMSLE using ONLY product features: 0.2275 ± 0.0048
If this is very low, product features are leaking target information!
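Whether 0.2275 counts as "very low" depends on reference points. One is the CV score of 0.020 quoted in the objective; another is a predictor that uses no features at all. The sketch below computes the second on synthetic data standing in for the `Calories` column. `constant_baseline_rmsle` is a hypothetical helper, not part of exp_003, and the gamma-distributed target is an assumption made only so the block runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold


def constant_baseline_rmsle(y, n_splits=5, seed=42):
    """CV RMSLE of predicting one constant value for every validation row."""
    y = np.asarray(y, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for fit_idx, val_idx in kf.split(y):
        # The constant that minimizes RMSLE is expm1(mean(log1p(y))).
        const = np.expm1(np.log1p(y[fit_idx]).mean())
        pred = np.full(len(val_idx), const)
        scores.append(np.sqrt(mean_squared_log_error(y[val_idx], pred)))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in for the target (NOT the real Calories data).
rng = np.random.default_rng(42)
y_demo = rng.gamma(shape=4.0, scale=36.0, size=8000) + 10.0

mean_score, std_score = constant_baseline_rmsle(y_demo)
print(f"Constant-prediction CV RMSLE: {mean_score:.4f} ± {std_score:.4f}")
```

Running the same helper on `train_df['Calories']` would give the upper reference point. A product-features-only score between that value and the baseline model's score suggests useful signal rather than a copy of the target.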


## 2. Analyze Target Encoding Implementation

In [4]:
# Analyze the manual target encoding implementation from exp_003

def manual_target_encode(train_df, test_df, column, target_col='Calories', p_smooth=20):
    """Manual target encoding as implemented in exp_003"""
    
    # Calculate global mean and count
    global_mean = train_df[target_col].mean()
    
    # Calculate category means and counts
    category_stats = train_df.groupby(column)[target_col].agg(['mean', 'count'])
    
    # Apply smoothing
    smooth_mean = (category_stats['mean'] * category_stats['count'] + global_mean * p_smooth) / (category_stats['count'] + p_smooth)
    
    # Create mapping
    mapping = smooth_mean.to_dict()
    
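    # Apply to train and test.
    # NOTE: training rows are encoded with statistics that include their own
    # target values (no out-of-fold fitting), which is a source of target leakage.
    train_encoded = train_df[column].map(mapping).fillna(global_mean)
    test_encoded = test_df[column].map(mapping).fillna(global_mean)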
    
    return train_encoded, test_encoded, mapping

# Apply manual encoding
train_enc, test_enc, mapping = manual_target_encode(train_df, test_df, 'Sex')

print("Manual target encoding mapping:")
print(mapping)
print(f"\nGlobal mean: {train_df['Calories'].mean():.4f}")

# Check correlation
enc_corr = train_enc.corr(train_df['Calories'])
print(f"\nCorrelation between encoded Sex and target: {enc_corr:.4f}")

# The problem: With only 2 categories, this encoding is too coarse
print(f"\nNumber of unique values in Sex: {train_df['Sex'].nunique()}")
print("Value counts:")
print(train_df['Sex'].value_counts())

# Check if encoding is just memorizing the training data
print("\n=== Encoding Analysis ===")
print("Target statistics by Sex:")
sex_stats = train_df.groupby('Sex').agg({
    'Calories': ['mean', 'count']
}).round(4)
print(sex_stats)

print("\nEncoded values (manual implementation):")
for sex, encoded_val in mapping.items():
    actual_mean = train_df[train_df['Sex'] == sex]['Calories'].mean()
    print(f"{sex}: encoded={encoded_val:.4f}, actual_mean={actual_mean:.4f}")

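# The issue: with only 2 categories, the encoded column carries the same information
# as a binary Sex flag. Smoothing only shifts the two encoded values without swapping
# their order, so it cannot change the Pearson correlation with the target.

print("\n=== Smoothing Analysis ===")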
for p in [5, 10, 20, 50, 100]:
    train_enc_p, _, _ = manual_target_encode(train_df, test_df, 'Sex', p_smooth=p)
    corr_p = train_enc_p.corr(train_df['Calories'])
    print(f"p_smooth={p:3d}: correlation = {corr_p:.4f}")

Manual target encoding mapping:
{'F': 132.1348050022123, 'M': 151.3127728164979}

Global mean: 143.7728

Correlation between encoded Sex and target: 0.1230

Number of unique values in Sex: 2
Value counts:
Sex
M    4859
F    3141
Name: count, dtype: int64

=== Encoding Analysis ===
Target statistics by Sex:
     Calories      
         mean count
Sex                
F    132.0607  3141
M    151.3438  4859

Encoded values (manual implementation):
F: encoded=132.1348, actual_mean=132.0607
M: encoded=151.3128, actual_mean=151.3438

=== Smoothing Analysis ===
p_smooth=  5: correlation = 0.1230
p_smooth= 10: correlation = 0.1230
p_smooth= 20: correlation = 0.1230
p_smooth= 50: correlation = 0.1230
p_smooth=100: correlation = 0.1230
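The manual encoder above maps each training row through category means that include that row's own target. A common alternative is out-of-fold encoding, where each training row is encoded from the other folds only. The following is a minimal sketch on synthetic data. `oof_target_encode`, `_smoothed_means` and their parameters are hypothetical names, not the exp_003 code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def _smoothed_means(df, column, target_col, p_smooth):
    """Return (per-category smoothed means, prior mean), computed on df only."""
    prior = df[target_col].mean()
    stats = df.groupby(column)[target_col].agg(["mean", "count"])
    smooth = (stats["mean"] * stats["count"] + prior * p_smooth) / (stats["count"] + p_smooth)
    return smooth, prior


def oof_target_encode(train_df, test_df, column, target_col="Calories",
                      p_smooth=20, n_splits=5, seed=42):
    """Smoothed target encoding with out-of-fold values for the training rows."""
    train_encoded = pd.Series(np.nan, index=train_df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train_df):
        # Statistics come from the fitting folds only.
        fold_map, fold_prior = _smoothed_means(train_df.iloc[fit_idx], column, target_col, p_smooth)
        encoded = train_df.iloc[enc_idx][column].map(fold_map).fillna(fold_prior)
        train_encoded.iloc[enc_idx] = encoded.to_numpy()

    # Test rows never contributed targets, so the full training set is used.
    full_map, full_prior = _smoothed_means(train_df, column, target_col, p_smooth)
    test_encoded = test_df[column].map(full_map).fillna(full_prior)
    return train_encoded, test_encoded


# Synthetic stand-in data (NOT the competition data).
rng = np.random.default_rng(42)
demo_train = pd.DataFrame({
    "Sex": rng.choice(["F", "M"], size=1000),
    "Calories": rng.uniform(10, 500, size=1000),
})
demo_test = pd.DataFrame({"Sex": ["F", "M", "F"]})

train_enc_oof, test_enc_oof = oof_target_encode(demo_train, demo_test, "Sex")
print(train_enc_oof.groupby(demo_train["Sex"]).nunique())
```

For a two-category column such as `Sex`, each mean is averaged over thousands of rows, so the own-row contribution is tiny. The out-of-fold pattern matters more for the higher-cardinality binned features discussed later.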


## 3. Compare Baseline vs Target Encoding Model Predictions

In [5]:
# Load OOF predictions from experiments
import os

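# Try to load OOF predictions from exp_003 (target encoding) and the baseline.
# NOTE: the baseline variable is named *_000 but its path points to 001_baseline;
# confirm the directory name if the files are not found.
oof_path_003 = '/home/code/experiments/003_target_encoding/oof_predictions.npy'
oof_path_000 = '/home/code/experiments/001_baseline/oof_predictions.npy'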

if os.path.exists(oof_path_003) and os.path.exists(oof_path_000):
    oof_003 = np.load(oof_path_003)
    oof_000 = np.load(oof_path_000)
    
    print("=== Comparing Predictions ===")
    print(f"Baseline OOF shape: {oof_000.shape}")
    print(f"Target encoding OOF shape: {oof_003.shape}")
    
    # Calculate residuals
    residuals_000 = train_df['Calories'] - oof_000
    residuals_003 = train_df['Calories'] - oof_003
    
    print(f"\nBaseline residuals - mean: {residuals_000.mean():.4f}, std: {residuals_000.std():.4f}")
    print(f"Target encoding residuals - mean: {residuals_003.mean():.4f}, std: {residuals_003.std():.4f}")
    
    # Check where predictions differ most
    pred_diff = np.abs(oof_003 - oof_000)
    print(f"\nMean absolute prediction difference: {pred_diff.mean():.4f}")
    print(f"Max absolute prediction difference: {pred_diff.max():.4f}")
    
    # Check if target encoding model is overfitting
    print("\n=== Overfitting Check ===")
    print("If target encoding model has much lower training error but higher CV error,")
    print("it's overfitting to the encoded features.")
    
else:
    print("OOF predictions not found. Need to run experiments first.")

OOF predictions not found. Need to run experiments first.
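Once both OOF arrays are available, the comparison can be made in the same metric as the CV score. The cell above computes raw residuals, which large targets can dominate, while the reported metric is RMSLE. The sketch below breaks RMSLE down by target quantile for two sets of predictions. `rmsle`, `rmsle_by_target_bin` and the synthetic arrays are illustrative assumptions, not outputs of the actual experiments.

```python
import numpy as np
import pandas as pd


def rmsle(y_true, y_pred):
    """Root mean squared log error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def rmsle_by_target_bin(y_true, oof_a, oof_b, n_bins=5):
    """RMSLE of two OOF prediction sets within quantile bins of the target."""
    frame = pd.DataFrame({
        "y": np.asarray(y_true, dtype=float),
        "a": np.asarray(oof_a, dtype=float),
        "b": np.asarray(oof_b, dtype=float),
    })
    frame["bin"] = pd.qcut(frame["y"], q=n_bins, duplicates="drop")
    rows = []
    for interval, grp in frame.groupby("bin", observed=True):
        rows.append({
            "target_bin": str(interval),
            "n": len(grp),
            "rmsle_a": rmsle(grp["y"], grp["a"]),
            "rmsle_b": rmsle(grp["y"], grp["b"]),
        })
    return pd.DataFrame(rows)


# Synthetic stand-ins for the target and the two OOF arrays.
rng = np.random.default_rng(42)
y_demo = rng.uniform(10, 500, size=2000)
oof_good = y_demo * np.exp(rng.normal(0, 0.02, size=2000))  # small multiplicative error
oof_bad = y_demo * np.exp(rng.normal(0, 0.20, size=2000))   # large multiplicative error

table = rmsle_by_target_bin(y_demo, oof_good, oof_bad)
print(table)
```

If the degradation is concentrated in one target range, the per-bin table should show it; a uniform increase across bins would point to a more global problem.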


## 4. Investigate Winners' Implementation Differences

In [6]:
# Based on winning solution analysis, let's identify key differences

print("=== Winners' Approach vs Our Implementation ===")
print()
print("WINNERS (Chris Deotte, AngelosMar):")
print("1. Used sklearn's TargetEncoder with INTERNAL cross-fitting")
print("2. Applied target encoding to HIGH-cardinality features (not just Sex)")
print("3. Used product features BUT with careful regularization")
print("4. Used residual modeling (sequential approach)")
print("5. Used groupby z-score features")
print("6. Used MANY diverse models (7-12) with different feature sets")
print()
print("OUR IMPLEMENTATION (exp_003):")
print("1. Manual target encoding with simple smoothing")
print("2. Only applied to Sex (2 categories - too low cardinality)")
print("3. Added product features without additional regularization")
print("4. No residual modeling")
print("5. No groupby features")
print("6. Only 2 models so far")
print()
print("=== KEY INSIGHTS ===")
print()
print("PROBLEM 1: Sex has only 2 categories - target encoding adds minimal signal")
print("- With only 'male' and 'female', encoding just creates 2 values")
print("- This is essentially just a binary feature, not true target encoding")
print("- Winners encoded binned features (higher cardinality)")
print()
print("PROBLEM 2: Product features may be TOO predictive (leaking target)")
print("- Weight_Duration correlation with Calories: VERY HIGH")
print("- If product features alone can predict target well, they're too strong")
print("- Need to verify if these are legitimate or data leakage")
print()
print("PROBLEM 3: Manual encoding vs sklearn's TargetEncoder")
print("- sklearn's version uses internal K-fold cross-fitting")
print("- This prevents overfitting better than simple smoothing")
print("- Our manual implementation may not be robust enough")
print()
print("PROBLEM 4: No hyperparameter tuning for new features")
print("- Added 16 new features but kept same hyperparameters")
print("- Need stronger regularization (increase reg_alpha, reg_lambda)")
print("- Need to tune depth, min_child_samples")

=== Winners' Approach vs Our Implementation ===

WINNERS (Chris Deotte, AngelosMar):
1. Used sklearn's TargetEncoder with INTERNAL cross-fitting
2. Applied target encoding to HIGH-cardinality features (not just Sex)
3. Used product features BUT with careful regularization
4. Used residual modeling (sequential approach)
5. Used groupby z-score features
6. Used MANY diverse models (7-12) with different feature sets

OUR IMPLEMENTATION (exp_003):
1. Manual target encoding with simple smoothing
2. Only applied to Sex (2 categories - too low cardinality)
3. Added product features without additional regularization
4. No residual modeling
5. No groupby features
6. Only 2 models so far

=== KEY INSIGHTS ===

PROBLEM 1: Sex has only 2 categories - target encoding adds minimal signal
- With only 'male' and 'female', encoding just creates 2 values
- This is essentially just a binary feature, not true target encoding
- Winners encoded binned features (higher cardinality)

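PROBLEM 2: Product features may be TOO predictive (leaking target)
- Weight_Duration correlation with Calories: VERY HIGH
- If product features alone can predict target well, they're too strong
- Need to verify if these are legitimate or data leakage

PROBLEM 3: Manual encoding vs sklearn's TargetEncoder
- sklearn's version uses internal K-fold cross-fitting
- This prevents overfitting better than simple smoothing
- Our manual implementation may not be robust enough

PROBLEM 4: No hyperparameter tuning for new features
- Added 16 new features but kept same hyperparameters
- Need stronger regularization (increase reg_alpha, reg_lambda)
- Need to tune depth, min_child_samples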

## 5. Recommendations for Next Experiments

In [None]:
print("=== RECOMMENDATIONS ===\n")

print("1. ABANDON target encoding on 'Sex' (only 2 categories)")
print("   - Not enough cardinality to be useful")
print("   - May cause overfitting with manual implementation")
print("   - Winners encoded binned features instead\n")

print("2. INVESTIGATE product features for target leakage")
print("   - Weight_Duration and Duration_Heart_Rate correlate > 0.9 with Calories")
print("   - High correlation alone is not leakage: they are built from input columns only")
print("   - Alone they reach CV RMSLE 0.2275; use ablation before removing or transforming them\n")

print("3. IMPLEMENT sklearn's TargetEncoder properly")
print("   - Use internal cross-fitting (cv=5)")
print("   - Apply to binned features (higher cardinality)")
print("   - Test on small subset first\n")

print("4. ADD HYPERPARAMETER TUNING for regularization")
print("   - Increase reg_alpha, reg_lambda for XGBoost")
print("   - Reduce max_depth or increase min_child_weight (min_child_samples in LightGBM)")
print("   - Run small grid search\n")

print("5. IMPLEMENT RESIDUAL MODELING (sequential approach)")
print("   - LinearRegression → NeuralNetwork → XGBoost")
print("   - This was key in winning solutions")
print("   - Captures complementary patterns\n")

print("6. ADD GROUPBY Z-SCORE FEATURES")
print("   - Group by Sex, compute z-scores for numerical features")
print("   - Creates relative positioning features")
print("   - Winners found these effective\n")

print("7. CREATE MORE DIVERSE MODELS")
print("   - LightGBM with different feature sets")
print("   - Neural Network with residual approach")
print("   - Linear Regression with many engineered features\n")

print("8. RUN ABLATION STUDIES")
print("   - Test each feature group separately")
print("   - Identify which features help vs hurt")
print("   - Systematically build up feature set\n")

print("=== IMMEDIATE NEXT STEPS ===\n")
print("1. Create experiment without product features (keep target encoding)")
print("2. Create experiment without target encoding (keep product features)")
print("3. Compare to identify which is the main culprit")
print("4. Implement sklearn's TargetEncoder on binned features")
print("5. Add groupby z-score features")
print("6. Start residual modeling pipeline")
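The ablation in the immediate next steps can be organized as a loop over named feature sets, all scored by the same CV routine. The sketch below uses synthetic data and `LinearRegression` as a stand-in for the real model. `cv_rmsle`, `experiments` and the synthetic target are hypothetical. A target-encoded column would be added as one more feature set, built out-of-fold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold


def cv_rmsle(X, y, n_splits=5, seed=42):
    """Mean CV RMSLE of a linear model (stand-in for the real model)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for fit_idx, val_idx in kf.split(X):
        model = LinearRegression().fit(X.iloc[fit_idx], y.iloc[fit_idx])
        pred = np.clip(model.predict(X.iloc[val_idx]), 0, None)  # RMSLE needs non-negative values
        scores.append(np.sqrt(mean_squared_log_error(y.iloc[val_idx], pred)))
    return float(np.mean(scores))


# Synthetic stand-in data (NOT the competition data).
rng = np.random.default_rng(42)
n = 2000
demo = pd.DataFrame({
    "Weight": rng.uniform(50, 110, n),
    "Height": rng.uniform(150, 200, n),
    "Duration": rng.uniform(1, 30, n),
    "Heart_Rate": rng.uniform(70, 130, n),
})
demo["Calories"] = np.clip(
    0.05 * demo["Duration"] * demo["Heart_Rate"] + rng.normal(0, 2, n), 1, None
)
demo["Weight_Duration"] = demo["Weight"] * demo["Duration"]
demo["Duration_Heart_Rate"] = demo["Duration"] * demo["Heart_Rate"]
demo["Height_Weight"] = demo["Height"] * demo["Weight"]

base = ["Weight", "Height", "Duration", "Heart_Rate"]
products = ["Weight_Duration", "Duration_Heart_Rate", "Height_Weight"]

# Each entry is one ablation arm; every arm uses identical folds and model.
experiments = {
    "base only": base,
    "base + products": base + products,
}
results = {name: cv_rmsle(demo[cols], demo["Calories"]) for name, cols in experiments.items()}
for name, score in results.items():
    print(f"{name:>16}: CV RMSLE = {score:.4f}")
```

Keeping the folds, seed, and model fixed across arms means any score difference can be attributed to the feature set alone.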