# Evolver Loop 5 Analysis: Feature Selection & Original Dataset

## Objectives
1. Analyze exp_005 feature importance to identify valuable vs noisy features
2. Understand why 313 features didn't improve over 67 features in exp_004
3. Research winning solutions for feature selection strategies
4. Prepare for incorporating original Student Bag dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Load the training data used in exp_005 (train.csv plus training_extra.csv)
print("Loading exp_005 data...")
train = pd.read_csv('/home/data/train.csv')
training_extra = pd.read_csv('/home/data/training_extra.csv')
combined_train = pd.concat([train, training_extra], ignore_index=True)

print(f"Combined train shape: {combined_train.shape}")
print(f"Target price range: {combined_train['Price'].min():.2f} - {combined_train['Price'].max():.2f}")
print(f"Target price mean: {combined_train['Price'].mean():.2f}")
print(f"Target price std: {combined_train['Price'].std():.2f}")

Loading exp_005 data...


Combined train shape: (3994318, 11)
Target price range: 15.00 - 150.00
Target price mean: 81.36
Target price std: 38.94


## 1. Feature Importance Analysis from exp_005

Load and analyze the feature importance data to understand patterns.

In [2]:
# Simulate feature importance from exp_005 (based on executor output)
# This is reconstructed from the experiment results

feature_importance_data = {
    # Baseline features (top ones)
    'weight_capacity': 58096,
    'weight_dec_2': 35493,
    'weight_dec_1': 25000,  # estimated
    'weight_round_7': 20000,  # estimated
    'brand_encoded': 15000,  # estimated
    'material_encoded': 12000,  # estimated
    'size_encoded': 10000,  # estimated
    
    # Groupby statistics (top ones)
    'weight_capacity_kg_mean_price': 35025,
    'brand_mean_price': 28000,  # estimated
    'material_mean_price': 22000,  # estimated
    'size_mean_price': 18000,  # estimated
    'weight_capacity_kg_count_price': 15000,  # estimated
    'color_mean_price': 12000,  # estimated
    
    # Histogram bins (top ones - weight_capacity)
    'weight_capacity_kg_hist_bin_0': 8000,  # estimated
    'weight_capacity_kg_hist_bin_1': 7500,  # estimated
    'weight_capacity_kg_hist_bin_2': 7200,  # estimated
    # ... many more histogram features
}

# Add the actual reported totals
total_histogram_importance = 1020884
total_groupby_importance = 281304
total_baseline_importance = 172796

# Compute the grand total once and reuse it below
grand_total = total_histogram_importance + total_groupby_importance + total_baseline_importance

print("Total feature importance by category:")
print(f"Histogram bins: {total_histogram_importance:,}")
print(f"Groupby statistics: {total_groupby_importance:,}")
print(f"Baseline features: {total_baseline_importance:,}")
print(f"Grand total: {grand_total:,}")

# Calculate percentages
print("\nPercentage of total importance:")
print(f"Histogram bins: {total_histogram_importance/grand_total*100:.1f}%")
print(f"Groupby statistics: {total_groupby_importance/grand_total*100:.1f}%")
print(f"Baseline features: {total_baseline_importance/grand_total*100:.1f}%")

Total feature importance by category:
Histogram bins: 1,020,884
Groupby statistics: 281,304
Baseline features: 172,796
Grand total: 1,474,984

Percentage of total importance:
Histogram bins: 69.2%
Groupby statistics: 19.1%
Baseline features: 11.7%
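
The totals above were typed in from the executor output. If the per-feature importances from exp_005 are available as a pandas Series, the same breakdown can be computed directly. The sketch below assumes the naming pattern seen in this notebook (`_hist_bin_` for histogram bins, a `_price` suffix for groupby statistics); the function name `summarize_importance_by_category` is hypothetical, and the patterns should be adjusted if the real feature names differ.

```python
import pandas as pd


def summarize_importance_by_category(importance: pd.Series) -> pd.DataFrame:
    """Total, count, and mean importance per category inferred from feature names."""

    def categorize(name: str) -> str:
        # Assumed naming convention, taken from the feature names in this notebook
        if "_hist_bin_" in name:
            return "histogram"
        if name.endswith("_price"):
            return "groupby"
        return "baseline"

    # Passing a function to groupby applies it to each index label (feature name)
    summary = importance.groupby(categorize).agg(
        total="sum", n_features="count", mean_per_feature="mean"
    )
    summary["share_pct"] = 100 * summary["total"] / summary["total"].sum()
    return summary.sort_values("total", ascending=False)


# Small example using a few values from the dictionary above
example = pd.Series({
    "weight_capacity": 58096,
    "weight_dec_2": 35493,
    "weight_capacity_kg_mean_price": 35025,
    "weight_capacity_kg_hist_bin_0": 8000,  # marked "estimated" in the notebook
})
summary = summarize_importance_by_category(example)
print(summary)
```

Running this on the full importance vector would replace the hard-coded totals and also give the per-feature averages used later in this notebook.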


## 2. Feature Count vs Performance Analysis

Why did 313 features score marginally worse (+0.002555 RMSE) than 67 features?

In [3]:
# Analyze the feature count progression
experiments = {
    'exp_003 (baseline)': {'features': 19, 'cv_score': 38.825723},
    'exp_004 (groupby)': {'features': 67, 'cv_score': 38.660840},
    'exp_005 (histogram)': {'features': 313, 'cv_score': 38.663395},
}

print("Feature count vs CV performance:")
print("=" * 50)
for name, data in experiments.items():
    print(f"{name:20s}: {data['features']:3d} features → {data['cv_score']:.6f} RMSE")

print(f"\nImprovement from exp_003 to exp_004: {38.825723 - 38.660840:.6f} RMSE")
print(f"Change from exp_004 to exp_005: {38.663395 - 38.660840:.6f} RMSE (worse by 0.002555)")

# Calculate features per category in exp_005
baseline_features = 15
groupby_features = 48
histogram_features = 250

print(f"\nexp_005 feature breakdown:")
print(f"Baseline features: {baseline_features}")
print(f"Groupby statistics: {groupby_features}")
print(f"Histogram bins: {histogram_features}")
print(f"Total: {baseline_features + groupby_features + histogram_features}")

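# Features added from exp_004 to exp_005
# Note: the net change is 313 - 67 = 246, while 250 histogram bins were added.
# The breakdown above has 15 + 48 = 63 non-histogram features versus 67 in exp_004,
# so 4 non-histogram features appear to have been dropped. This is inferred from
# the counts only and has not been checked against the exp_005 feature list.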
print(f"\nAdded from exp_004 to exp_005: {313 - 67} features (all histogram bins)")
print(f"Performance change: +0.002555 RMSE (worse)")
print(f"Conclusion: Adding 250 histogram features hurt performance slightly")

Feature count vs CV performance:
exp_003 (baseline)  :  19 features → 38.825723 RMSE
exp_004 (groupby)   :  67 features → 38.660840 RMSE
exp_005 (histogram) : 313 features → 38.663395 RMSE

Improvement from exp_003 to exp_004: 0.164883 RMSE
Change from exp_004 to exp_005: 0.002555 RMSE (worse by 0.002555)

exp_005 feature breakdown:
Baseline features: 15
Groupby statistics: 48
Histogram bins: 250
Total: 313

Added from exp_004 to exp_005: 246 features (all histogram bins)
Performance change: +0.002555 RMSE (worse)
Conclusion: Adding 250 histogram features hurt performance slightly
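
A gap of 0.002555 RMSE is small, so it is worth checking whether it exceeds fold-to-fold noise before calling exp_005 worse. The sketch below compares two experiments on matched CV folds and reports the mean difference with its standard error. The helper `paired_fold_delta` is hypothetical, and the fold scores in the example are made up for illustration; they are not the per-fold results of exp_004 or exp_005, which are not recorded in this notebook.

```python
import numpy as np


def paired_fold_delta(scores_a, scores_b):
    """Return (mean of b - a, standard error of that mean) over matched CV folds."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape or a.size < 2:
        raise ValueError("need two equal-length 1-D score lists with at least 2 folds")
    diff = b - a
    mean_diff = float(diff.mean())
    # Sample standard deviation (ddof=1) divided by sqrt(number of folds)
    std_err = float(diff.std(ddof=1) / np.sqrt(diff.size))
    return mean_diff, std_err


# Illustrative fold scores ONLY (not real experiment results)
folds_a = [38.60, 38.70, 38.65, 38.68, 38.67]
folds_b = [38.61, 38.69, 38.66, 38.69, 38.66]
mean_diff, std_err = paired_fold_delta(folds_a, folds_b)
print(f"mean delta = {mean_diff:+.4f} RMSE, standard error = {std_err:.4f}")
```

If the standard error is larger than the mean delta, as in this toy example, the two feature sets are not distinguishable by CV score alone.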


## 3. Histogram Bin Analysis

Analyze the histogram binning approach - 50 bins may be too granular.

In [None]:
# Analyze histogram bin distribution
# Let's look at the price distribution to understand if 50 bins is appropriate

price_data = combined_train['Price'].values
print(f"Price distribution analysis:")
print(f"Min: {price_data.min():.2f}")
print(f"Max: {price_data.max():.2f}")
print(f"Range: {price_data.max() - price_data.min():.2f}")
print(f"Mean: {price_data.mean():.2f}")
print(f"Median: {np.median(price_data):.2f}")
print(f"Std: {price_data.std():.2f}")

# Calculate what 50 bins means
bin_width = (price_data.max() - price_data.min()) / 50
print(f"\nWith 50 uniform bins:")
print(f"Each bin width: {bin_width:.2f}")
print(f"Average samples per bin: {len(price_data) / 50:.0f}")

# Check distribution across groups
print(f"\nGroup sizes for histogram binning:")
for col in ['Weight Capacity (kg)', 'Brand', 'Material', 'Size', 'Color']:
    n_groups = combined_train[col].nunique()
    avg_per_group = len(combined_train) / n_groups
    print(f"{col:20s}: {n_groups:4d} groups, avg {avg_per_group:6.0f} samples/group")
    
    # Each group key contributes one 50-bin price histogram, i.e. 50 features per key,
    # regardless of how many groups the key has. n_groups × 50 is the number of
    # per-group bin values that have to be estimated from the data.
    print(f"{'':20s}  → {n_groups} groups × 50 bins = {n_groups * 50} bin values to estimate")

print(f"\nActual histogram features used: 250 (5 group keys × 50 bins)")
print(f"This means we're using histograms for: Weight Capacity, Brand, Material, Size, Color")
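
To test whether 50 bins is too granular, the bin count has to be a parameter. The sketch below shows one way to build per-group price histogram features with a configurable `n_bins`. It is an assumption about how such features could be built, not a copy of the exp_005 code; the function name and the `prefix` argument are hypothetical. It also does not handle target leakage: for training rows the histograms would need to be computed out-of-fold.

```python
import numpy as np
import pandas as pd


def group_price_histogram(df, key, prefix, target="Price", n_bins=50):
    """Share of each group's rows falling into each of n_bins uniform target bins."""
    # Uniform bin edges over the global target range (requires min < max)
    edges = np.linspace(df[target].min(), df[target].max(), n_bins + 1)
    # labels=False yields integer bin ids; include_lowest keeps the minimum in bin 0
    bin_ids = pd.cut(df[target], bins=edges, labels=False, include_lowest=True)
    counts = pd.crosstab(df[key], bin_ids)
    # Make sure empty bins still get a column
    counts = counts.reindex(columns=range(n_bins), fill_value=0)
    shares = counts.div(counts.sum(axis=1), axis=0)
    shares.columns = [f"{prefix}_hist_bin_{i}" for i in range(n_bins)]
    return shares


# Toy data to show the output shape
toy = pd.DataFrame({
    "Brand": ["A", "A", "B", "B"],
    "Price": [15.0, 150.0, 15.0, 20.0],
})
hist = group_price_histogram(toy, "Brand", prefix="brand", n_bins=2)
print(hist)
```

The result is indexed by group value, so it can be attached to the training frame with `df.join(hist, on="Brand")`. Rerunning with `n_bins=20` or `n_bins=30` gives the reduced variants proposed in the next section.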

## 4. Feature Selection Strategy

Based on the analysis, we need aggressive feature selection.

In [None]:
# Propose feature selection strategy
print("PROPOSED FEATURE SELECTION STRATEGY")
print("=" * 50)

print("\n1. REMOVE zero-importance features (14 features):")
zero_importance_features = [
    'brand_min_price', 'brand_max_price',
    'material_min_price', 'material_max_price', 
    'size_min_price', 'size_max_price',
    'laptop_compartment_min_price', 'laptop_compartment_max_price',
    'waterproof_min_price', 'waterproof_max_price',
    'style_min_price', 'style_max_price',
    'color_min_price', 'color_max_price'
]
for feat in zero_importance_features:
    print(f"  - {feat}")

print(f"\n2. REDUCE histogram bins from 50 to 20-30:")
print(f"   - Current: 250 histogram features")
print(f"   - Proposed: 100-150 histogram features (5 keys × 20-30 bins)")
print(f"   - Reason: 50 bins too granular, many bins capture noise")

print(f"\n3. FOCUS on Weight Capacity histograms (highest importance):")
print(f"   - weight_capacity has 58,096 importance (top feature)")
print(f"   - Keep all 50 bins for Weight Capacity only")
print(f"   - Reduce to 20-30 bins for Brand, Material, Size, Color")

print(f"\n4. KEEP top groupby statistics:")
print(f"   - mean_price features (strong signal)")
print(f"   - count_price features (useful for frequency)")
print(f"   - median_price features (robust statistic)")
print(f"   - REMOVE std, min, max (lower importance, many are zero)")

print(f"\n5. TARGET feature count: ~100-150 features")
print(f"   - Current: 313 features (overkill)")
print(f"   - Proposed: ~100-150 features (focused, less noise)")
print(f"   - Expected: Better or equal performance with less overfitting")

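# Calculate expected feature count
# Note: these four numbers sum to 169, which is above the ~100-150 target in step 5;
# reaching that target would require fewer bins (or fewer keys) than listed here.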
baseline_keep = 15  # all baseline
groupby_keep = 24  # mean, count, median for 8 keys (3 × 8 = 24)
weight_capacity_hist = 50  # keep all 50 for weight capacity
other_hist = 80  # 4 keys × 20 bins each

print(f"\n6. ESTIMATED FEATURE COUNT AFTER SELECTION:")
print(f"   Baseline features: {baseline_keep}")
print(f"   Groupby statistics: {groupby_keep}")
print(f"   Weight Capacity histogram: {weight_capacity_hist}")
print(f"   Other histograms: {other_hist}")
print(f"   TOTAL: {baseline_keep + groupby_keep + weight_capacity_hist + other_hist} features")
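
Steps 1 and 4 of the strategy can be applied mechanically once an importance value per feature is available. The sketch below keeps features with non-zero importance whose names do not end in a dropped suffix. The function `select_features` is hypothetical, and the importance values in the example are illustrative (one is the estimated value from the dictionary earlier in this notebook, one is invented).

```python
def select_features(feature_names, importance,
                    drop_suffixes=("_min_price", "_max_price", "_std_price")):
    """Keep features with positive importance whose name lacks a dropped suffix."""
    kept = []
    for name in feature_names:
        # Features missing from the importance mapping are treated as zero
        if importance.get(name, 0) <= 0:
            continue
        if name.endswith(tuple(drop_suffixes)):
            continue
        kept.append(name)
    return kept


# Illustrative values only
importance = {
    "weight_capacity": 58096,
    "brand_mean_price": 28000,  # marked "estimated" in the notebook
    "brand_min_price": 0,       # zero-importance feature from the list above
    "brand_std_price": 10,      # invented value, dropped by suffix
}
kept = select_features(list(importance), importance)
print(kept)
```

The order of the input list is preserved, so the result can be used directly to index the feature matrix.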

## 5. Original Student Bag Dataset

Research the original dataset that 1st place used heavily.

In [None]:
# Research the original Student Bag dataset
print("ORIGINAL STUDENT BAG DATASET RESEARCH")
print("=" * 50)

print("\nFrom competition description and 1st place solution:")
print("- Original dataset: 'Student Bag Price Prediction Dataset'")
print("- URL: https://www.kaggle.com/datasets/souradippal/student-bag-price-prediction-dataset")
print("- 1st place (Chris Deotte) heavily exploited this dataset")
print("- Key insight: Compute mean Price by Weight Capacity as 'MSRP' feature")

print("\nWhat the original dataset likely contains:")
print("- More samples of backpacks with prices")
print("- Additional weight capacity values")
print("- MSRP (Manufacturer's Suggested Retail Price) or similar reference prices")
print("- More granular categorization")

print("\nHow 1st place used it:")
print("1. Download original dataset")
print("2. Compute mean Price by Weight Capacity (rounded to different decimals)")
print("3. Create 'MSRP' feature - reference price for each weight capacity")
print("4. Use this as a strong baseline predictor")
print("5. Combine with other features for final model")

print("\nWhat we can do without the original dataset:")
print("- Compute mean Price by Weight Capacity from OUR training data")
print("- Use different rounding levels (7-10 decimals) to capture variations")
print("- This is already partially done in our baseline features")
print("- But original dataset has MORE samples → better statistics")

print("\nACTION ITEM: Download original dataset and compute:")
print("- mean_price_by_weight_capacity (multiple rounding levels)")
print("- std_price_by_weight_capacity (price variance)")
print("- count_by_weight_capacity (sample size)")
print("- Merge these as additional features")
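
The action item above can be sketched as a merge of per-weight statistics computed on the original dataset. The sketch assumes the original dataset has already been loaded into a DataFrame (`reference`) with the same `Weight Capacity (kg)` and `Price` columns as the competition data; that has not been verified here. The function name and the `orig_*` output column names are hypothetical.

```python
import pandas as pd

WEIGHT_COL = "Weight Capacity (kg)"


def add_reference_price_features(df, reference, decimals=(7, 8, 9, 10), target="Price"):
    """Merge mean/std/count of reference prices per rounded weight capacity into df."""
    out = df.copy()
    for d in decimals:
        key = f"weight_round_{d}"
        out[key] = out[WEIGHT_COL].round(d)
        stats = (
            reference.assign(**{key: reference[WEIGHT_COL].round(d)})
            .groupby(key)[target]
            .agg(**{
                f"orig_mean_price_r{d}": "mean",
                f"orig_std_price_r{d}": "std",
                f"orig_count_r{d}": "count",
            })
            .reset_index()
        )
        # Left merge: weights absent from the reference data get NaN statistics
        out = out.merge(stats, on=key, how="left")
    return out


# Toy frames to show the behaviour
reference = pd.DataFrame({WEIGHT_COL: [10.0, 10.0, 20.0], "Price": [100.0, 110.0, 50.0]})
train_rows = pd.DataFrame({WEIGHT_COL: [10.0, 30.0]})
enriched = add_reference_price_features(train_rows, reference, decimals=(7,))
print(enriched)
```

Because the statistics come from a separate dataset, they do not leak the competition target. The same function applied with `reference=combined_train` would leak it unless computed out-of-fold.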

In [None]:
# Let's compute what we can from our current data
print("COMPUTING WEIGHT CAPACITY STATISTICS FROM TRAINING DATA")
print("=" * 60)

# Group by Weight Capacity and compute statistics
weight_stats = combined_train.groupby('Weight Capacity (kg)')['Price'].agg(
    mean_price='mean',
    std_price='std',
    count_price='count',
    min_price='min',
    max_price='max',
    median_price='median',
).reset_index()

print(f"Unique Weight Capacity values: {len(weight_stats)}")
print(f"Weight Capacity range: {weight_stats['Weight Capacity (kg)'].min():.10f} - {weight_stats['Weight Capacity (kg)'].max():.10f}")

# Show top weight capacities by count
print(f"\nTop 10 Weight Capacities by sample count:")
top_weights = weight_stats.nlargest(10, 'count_price')
for idx, row in top_weights.iterrows():
    print(f"  {row['Weight Capacity (kg)']:.10f}: {row['count_price']:6.0f} samples, mean price = {row['mean_price']:6.2f}")

# Show price variation
print(f"\nPrice variation by Weight Capacity:")
print(f"Mean price range: {weight_stats['mean_price'].min():.2f} - {weight_stats['mean_price'].max():.2f}")
print(f"Std dev range: {weight_stats['std_price'].min():.2f} - {weight_stats['std_price'].max():.2f}")

# Correlation between weight capacity and mean price
corr = np.corrcoef(weight_stats['Weight Capacity (kg)'], weight_stats['mean_price'])[0,1]
print(f"Correlation (weight capacity vs mean price): {corr:.4f}")

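# Pearson correlation only measures linear association, so read the value printed
# above before concluding anything about predictive strength.
print("\nWeight Capacity is already the top feature by importance in exp_005.")
print("The original dataset would have MORE samples per weight capacity,")
print("giving us more reliable statistics (especially for rare weights).")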

In [None]:
# Analyze potential reasons why histogram bins didn't improve performance
# despite high feature importance

print("=" * 70)
print("HYPOTHESIS 1: Overfitting to histogram bins despite high importance")
print("=" * 70)

# The histogram bins might be capturing noise rather than signal
# High feature importance doesn't always mean better generalization
print("Key insight: Feature importance measures contribution to training fit,")
print("not necessarily contribution to generalization/validation performance.")
print()

# Check if we're overfitting
print("Overfitting indicators:")
print(f"- Feature count increased from 67 → 313 (+246 features)")
print(f"- CV score got worse by 0.002555 despite more features")
print(f"- Histogram bins: 69.2% of feature importance but no score improvement")
print()

# Calculate feature importance per feature
baseline_imp_per_feat = 172796 / 15  # ~11,520 per feature
groupby_imp_per_feat = 281304 / 48   # ~5,860 per feature  
histogram_imp_per_feat = 1020884 / 250  # ~4,084 per feature

print("Average importance per feature by category:")
print(f"Baseline features:     {baseline_imp_per_feat:,.0f} per feature")
print(f"Groupby statistics:    {groupby_imp_per_feat:,.0f} per feature")
print(f"Histogram bins:        {histogram_imp_per_feat:,.0f} per feature")
print()
print("Conclusion: Histogram bins have LOWER average importance per feature")
print("despite high total importance. Many weak features = overfitting risk.")
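
Since the importances above reflect the training fit, a validation-based measure is a useful cross-check. The sketch below is a minimal permutation importance: it shuffles one column of a held-out feature matrix at a time and records the increase in RMSE. It is a swapped-in technique, not what exp_005 reported. The function names are hypothetical, and `predict` stands for any callable that maps a feature matrix to predictions. For estimators that follow the scikit-learn API, `sklearn.inspection.permutation_importance` provides a library version.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def permutation_importance_rmse(predict, X_val, y_val, n_repeats=5, seed=0):
    """Mean increase in validation RMSE when each column of X_val is shuffled."""
    rng = np.random.default_rng(seed)
    X_val = np.asarray(X_val, dtype=float)
    base = rmse(y_val, predict(X_val))
    increases = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Break the link between column j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases[j] += rmse(y_val, predict(X_perm)) - base
    return increases / n_repeats


# Toy check: the target depends only on column 0, column 1 is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0]
imp = permutation_importance_rmse(lambda M: 2.0 * M[:, 0], X, y)
print(imp)
```

A histogram bin with high split importance but a permutation importance near zero on validation data would be a candidate for removal.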