# Loop 5 Analysis: Strategic Direction After Diagnostic Findings

## Current Status Summary
- **Best CV**: 0.02047 (exp_000 - baseline XGBoost)
- **Worst CV**: 0.21156 (exp_002 - target encoding + products)
- **Target**: 0.058410 (need to raise CV by ~0.038: 0.058410 - 0.02047 ≈ 0.0379)
- **Key Finding**: Product features correlate strongly with the target (r=0.94) yet degrade CV (0.2275 using only them vs 0.0205 without them), which points to overfitting
- **Key Finding**: Manual target encoding on 'Sex' (2 categories) is ineffective

## Analysis Objectives
1. Review diagnostic findings from Loop 4
2. Identify winning solution gaps
3. Plan strategic next steps
4. Prioritize approaches based on evidence

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error
import json
import warnings
warnings.filterwarnings('ignore')

SEED = 42
np.random.seed(SEED)

# Load data
train_df = pd.read_csv('/home/code/data/train.csv')
test_df = pd.read_csv('/home/code/data/test.csv')

print("=== DATA OVERVIEW ===")
print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
print("\nTarget statistics:")
print(train_df['Calories'].describe())

# Load session state to see experiment history
with open('/home/code/session_state.json', 'r') as f:
    session_state = json.load(f)

print("\n=== EXPERIMENT HISTORY ===")
for exp in session_state['experiments']:
    print(f"{exp['id']}: {exp['model_type']} - CV: {exp['score']:.5f} - {exp['name']}")

=== DATA OVERVIEW ===
Train shape: (8000, 9)
Test shape: (2000, 9)

Target statistics:
count    8000.000000
mean      143.772778
std        76.566039
min        10.000000
25%        91.227982
50%       121.244149
75%       174.789980
max       500.000000
Name: Calories, dtype: float64

=== EXPERIMENT HISTORY ===
exp_000: xgboost - CV: 0.02047 - 001_baseline_xgboost
exp_001: catboost - CV: 0.20238 - exp_002_catboost_baseline
exp_002: xgb - CV: 0.21156 - exp_003_xgb_target_encoding
exp_003: catboost - CV: 0.20184 - exp_004_catboost_hyperopt


In [None]:
print("=== KEY DIAGNOSTIC FINDINGS ===\n")

print("1. PRODUCT FEATURES ARE TOO PREDICTIVE")
print("   - Weight_Duration correlation with target: 0.94")
print("   - Duration_Heart_Rate correlation: 0.94") 
print("   - Using ONLY these features: CV = 0.2275")
print("   - Baseline without them: CV = 0.0205")
print("   - Conclusion: Product features cause overfitting\n")

print("2. MANUAL TARGET ENCODING ON 'Sex' IS INEFFECTIVE")
print("   - 'Sex' has only 2 categories (M/F)")
print("   - Target encoding correlation: 0.123")
print("   - Adds minimal signal, may cause overfitting")
print("   - Winners encoded HIGH-cardinality features instead\n")

print("3. IMPLEMENTATION GAPS VS WINNERS")
print("   - Using manual encoding vs sklearn's TargetEncoder")
print("   - No internal cross-fitting in manual implementation")
print("   - Applied to wrong features (low cardinality)")
print("   - No hyperparameter tuning for regularization")
print("   - Missing residual modeling approach")
print("   - No groupby z-score features\n")

# Calculate correlation with target for original features only
numeric_features = ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']
corr_with_target = train_df[numeric_features + ['Calories']].corr()['Calories'].drop('Calories')
print("=== CORRELATION WITH TARGET (Original Features) ===")
print(corr_with_target.sort_values(ascending=False))
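The product-feature correlations quoted in finding 1 can be checked the same way as the original features. The following is a minimal sketch on synthetic stand-in data: the `demo` frame, its coefficients, and the `add_product_features` helper are made up for illustration, and building `Weight_Duration` and `Duration_Heart_Rate` as plain products is an assumption based on their names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in for train_df; the values are illustrative only.
demo = pd.DataFrame({
    "Weight": rng.normal(75, 12, n),
    "Duration": rng.uniform(1, 30, n),
    "Heart_Rate": rng.normal(95, 10, n),
})
# Hypothetical target driven mostly by a Duration x Heart_Rate interaction.
demo["Calories"] = 0.05 * demo["Duration"] * demo["Heart_Rate"] + rng.normal(0, 5, n)


def add_product_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two pairwise product columns added."""
    out = df.copy()
    out["Weight_Duration"] = out["Weight"] * out["Duration"]
    out["Duration_Heart_Rate"] = out["Duration"] * out["Heart_Rate"]
    return out


demo = add_product_features(demo)

# Pearson correlation of each product feature with the target.
product_corr = demo[["Weight_Duration", "Duration_Heart_Rate"]].corrwith(demo["Calories"])
print(product_corr.sort_values(ascending=False))
```

On the real data, `add_product_features(train_df)` would replace the synthetic frame. A high correlation by itself only shows the feature is informative, so the CV comparison with and without the products remains the evidence for overfitting.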

In [None]:
print("=== CORRELATION WITH TARGET (from Loop 4 analysis) ===")
print("Duration:        0.82")
print("Weight:          0.71") 
print("Heart_Rate:      0.68")
print("Height:          0.43")
print("Age:             0.16")
print("Body_Temp:       0.11")

## 2. Analyze Winning Solution Approaches

In [None]:
print("=== WINNING SOLUTION ANALYSIS ===\n")

print("CHRIS DEOTTE (1st place) - GPU Hill Climbing:")
print("- Final CV: 0.05880")
print("- 7 diverse models in ensemble")
print("- Target encoding (25% of ensemble weight)")
print("- Product features: log1p + all pairwise products/divisions/sums/differences")
print("- CatBoost with binned features + groupby z-score features")
print("- NN on LinearRegression residuals")
print("- XGB on NN residuals\n")

print("ANGELOSMAR (4th place) - Ridge Ensemble:")
print("- Final CV: 0.05868") 
print("- 12 models in ensemble")
print("- Autogluon (weight > 0.5) - key model")
print("- Linear regression with ~400 features (CV 0.05976)")
print("- Sequential modeling: NN on LR residuals, XGB on NN residuals")
print("- GBDT models worked best with MINIMAL feature engineering")
print("- Final ensemble: Ridge regression on OOF predictions\n")

print("=== CRITICAL INSIGHTS ===")
print("1. RESIDUAL MODELING was key for both winners")
print("2. TARGET ENCODING on HIGH-cardinality features (not 'Sex')")
print("3. Product features were used BUT with proper regularization")
print("4. GBDT models (XGB, CatBoost, LGBM) worked best with MINIMAL features")
print("5. DIVERSITY in models and approaches was crucial")
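The sequential scheme described above (a linear model, then a neural network on its residuals, then a tree model on what remains) can be sketched as a generic residual chain. This is an illustration under stated assumptions, not the winners' code: `fit_residual_chain` and `predict_residual_chain` are hypothetical helpers, the data is synthetic, and scikit-learn's `MLPRegressor` and `GradientBoostingRegressor` stand in for the NN and XGBoost stages.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_residual_chain(models, X, y):
    """Fit each model on the residual left by the stages before it."""
    residual = np.asarray(y, dtype=float).copy()
    for model in models:
        model.fit(X, residual)
        residual = residual - model.predict(X)
    return models


def predict_residual_chain(models, X):
    """Final prediction is the sum of all stage predictions."""
    return sum(model.predict(X) for model in models)


# Synthetic data with a linear part and an interaction the linear stage misses.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(600, 3))
y_demo = 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] * X_demo[:, 2] + rng.normal(0, 0.05, 600)

chain = fit_residual_chain(
    [
        LinearRegression(),
        make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=42),
        ),
        GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42),
    ],
    X_demo,
    y_demo,
)
chain_pred = predict_residual_chain(chain, X_demo)

# Training error after each stage is added to the running sum.
running = np.zeros_like(y_demo)
for model in chain:
    running = running + model.predict(X_demo)
    print(type(model).__name__, float(np.mean((y_demo - running) ** 2)))
```

The errors printed here are in-sample. In a real pipeline each stage would more safely be fitted on out-of-fold residuals from the previous stage, so that later models do not learn from residuals the earlier model has already memorized.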

## 3. Strategic Next Steps - Priority Order

In [None]:
print("=== STRATEGIC NEXT STEPS (PRIORITY ORDER) ===\n")

print("PRIORITY 1: IMPLEMENT RESIDUAL MODELING (CRITICAL)")
print("Why: Both winners emphasized this as key to success")
print("Approach:")
print("  1. LinearRegression baseline (simple, captures linear patterns)")
print("  2. Neural Network on LR residuals (captures non-linear patterns)")
print("  3. XGBoost on NN residuals (captures tree-based patterns)")
print("  4. Final prediction: sum of all three models")
print("Expected CV: 0.059-0.060\n")

print("PRIORITY 2: REMOVE PRODUCT FEATURES (TEMPORARY)")
print("Why: Too predictive (r=0.94), causing overfitting")
print("Action: Establish baseline WITHOUT product features first")
print("Can re-add later with stronger regularization\n")

print("PRIORITY 3: ABANDON MANUAL TARGET ENCODING ON 'Sex'")
print("Why: Only 2 categories, adds minimal signal")
print("Action: Use sklearn's TargetEncoder on binned features instead\n")

print("PRIORITY 4: ADD GROUPBY Z-SCORE FEATURES")
print("Why: Winners found these effective")
print("Approach: Group by Sex, compute z-scores for numerical features")
print("Example: (Weight - mean(Weight by Sex)) / std(Weight by Sex)\n")

print("PRIORITY 5: HYPERPARAMETER TUNING FOR REGULARIZATION")
print("Why: Added 16 features in exp_002 (run name exp_003_xgb_target_encoding) but kept same hyperparameters")
print("XGBoost: Increase reg_alpha, reg_lambda, reduce max_depth")
print("CatBoost: Increase l2_leaf_reg, reduce depth\n")

print("PRIORITY 6: CREATE DIVERSE BASE MODELS")
print("- LightGBM with GOSS")
print("- Neural Network (direct, not residual)")
print("- Linear Regression with many features")
print("- CatBoost with proper binned features")
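The groupby z-score idea from Priority 4, `(Weight - mean(Weight by Sex)) / std(Weight by Sex)`, can be sketched with pandas as follows. The helper name `add_group_zscores`, the output column naming, and the synthetic `demo` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_group_zscores(df: pd.DataFrame, group_col: str, value_cols: list) -> pd.DataFrame:
    """Add one z-score column per value column, standardized within each group."""
    out = df.copy()
    grouped = out.groupby(group_col)[value_cols]
    group_mean = grouped.transform("mean")
    group_std = grouped.transform("std")  # sample standard deviation (ddof=1)
    # A zero std (constant group) becomes NaN instead of dividing by zero.
    z = (out[value_cols] - group_mean) / group_std.replace(0, np.nan)
    for col in value_cols:
        out[f"{col}_z_by_{group_col}"] = z[col]
    return out


# Synthetic stand-in data; the values are illustrative only.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Sex": rng.choice(["M", "F"], size=200),
    "Weight": rng.normal(75, 12, 200),
    "Height": rng.normal(172, 9, 200),
})
demo_z = add_group_zscores(demo, "Sex", ["Weight", "Height"])

# Within each group the new columns should have mean ~0 and std ~1.
print(demo_z.groupby("Sex")[["Weight_z_by_Sex", "Height_z_by_Sex"]].agg(["mean", "std"]).round(3))
```

This version computes group statistics on the frame it is given. For test data, the group means and standard deviations would normally be computed on the training set and reused, which this sketch does not do.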

## 4. Expected Timeline and Success Criteria

In [None]:
print("=== EXPECTED TIMELINE ===\n")

print("Loop 5-6: Implement residual modeling (3 models)")
print("Loop 7-8: Add groupby features + hyperparameter tuning") 
print("Loop 9-10: Create additional diverse models")
print("Loop 11-12: Implement proper target encoding on binned features")
print("Loop 13+: Ensemble with hill climbing\n")

print("=== SUCCESS CRITERIA ===")
print("1. Generate at least 7 diverse models with CV 0.058-0.065")
print("2. Implement residual modeling pipeline (3 sequential models)")
print("3. Run ablation studies to identify optimal feature set")
print("4. Achieve final CV < 0.058410 (target)")
print("5. Create robust ensemble beating best single model by >0.001")
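The hill-climbing ensemble planned for Loop 13+ and success criterion 5 can be sketched as greedy forward selection with replacement over out-of-fold predictions. This is a plain NumPy illustration, not the GPU implementation mentioned earlier; `hill_climb`, the model names, and the synthetic OOF arrays are assumptions, and scores are RMSE on a log1p-scaled target as a stand-in for the competition metric.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def hill_climb(oof_preds, y_true, max_steps=50):
    """Greedy forward selection with replacement.

    oof_preds maps model name -> 1D array of out-of-fold predictions.
    Returns (weights, score), where weights are selection counts / total picks.
    """
    names = list(oof_preds)
    # Start from the best single model.
    best_name = min(names, key=lambda k: rmse(y_true, oof_preds[k]))
    counts = {k: 0 for k in names}
    counts[best_name] = 1
    blend_sum = oof_preds[best_name].astype(float).copy()
    n_picked = 1
    best_score = rmse(y_true, blend_sum)

    for _ in range(max_steps):
        # Score the blend that results from adding one more copy of each model.
        trial = {k: rmse(y_true, (blend_sum + oof_preds[k]) / (n_picked + 1)) for k in names}
        candidate = min(trial, key=trial.get)
        if trial[candidate] >= best_score:
            break  # no candidate improves the blend
        blend_sum += oof_preds[candidate]
        counts[candidate] += 1
        n_picked += 1
        best_score = trial[candidate]

    weights = {k: c / n_picked for k, c in counts.items()}
    return weights, best_score


# Synthetic OOF predictions: the true target plus independent noise per model.
rng = np.random.default_rng(42)
y_log = rng.normal(4.5, 0.6, 2000)  # stand-in for log1p(Calories)
oof = {
    "xgb": y_log + rng.normal(0, 0.060, 2000),
    "catboost": y_log + rng.normal(0, 0.062, 2000),
    "nn": y_log + rng.normal(0, 0.070, 2000),
}

weights, ens_score = hill_climb(oof, y_log)
best_single = min(rmse(y_log, p) for p in oof.values())
print("weights:", weights)
print("ensemble:", ens_score, "best single:", best_single, "gain:", best_single - ens_score)
```

The printed gain is the quantity to compare against the 0.001 threshold in criterion 5. Real OOF predictions are far more correlated than the independent noise used here, so the gain on actual models would likely be much smaller.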

## 5. Record Key Findings

In [None]:
# Summarize findings for future reference.
# This cell only prints them; nothing is persisted via evolver_tools here.

print("Key findings from this analysis:")
print("1. Product features are too predictive (r=0.94) and cause overfitting")
print("2. Manual target encoding on 'Sex' is ineffective (only 2 categories)")
print("3. Implementation gaps vs winners identified")
print("4. Residual modeling is the top priority")