# Evolver Loop 3: Debugging Feature Engineering Issues

Based on evaluator feedback from experiment 002_enhanced_features:
- Score degraded from 38.781 to 38.786 (+0.005 RMSE)
- Quantile features implementation is buggy (creates constant columns)
- Count encoding uses combined train+test data (leakage risk)
- No original dataset usage (explicitly recommended)
- No feature importance analysis to debug what went wrong
- No validation of individual feature contributions

This notebook will:
1. Analyze feature importance from experiment 002
2. Debug the quantile feature implementation
3. Validate count encoding approach
4. Identify harmful vs helpful features
5. Plan systematic ablation studies

In [13]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load data
train1 = pd.read_csv('/home/data/train.csv')
train2 = pd.read_csv('/home/data/training_extra.csv')
train = pd.concat([train1, train2], ignore_index=True)

cat_features = ['Brand', 'Material', 'Size', 'Laptop Compartment', 'Waterproof', 'Style', 'Color']
target_col = 'Price'

print(f"Training data: {train.shape}")
print("\nAnalyzing quantile features bug from experiment 002...")

# The bug: quantile features create constant columns
wc = train['Weight Capacity (kg)']

# This is what experiment 002 did (BUGGY):
quantile_25 = wc.quantile(0.25)
quantile_50 = wc.quantile(0.5)
quantile_75 = wc.quantile(0.75)
quantile_90 = wc.quantile(0.9)

print(f"Buggy quantile features would be:")
print(f"  weight_q25 = {quantile_25:.6f} (same for ALL rows)")
print(f"  weight_q50 = {quantile_50:.6f} (same for ALL rows)")
print(f"  weight_q75 = {quantile_75:.6f} (same for ALL rows)")
print(f"  weight_q90 = {quantile_90:.6f} (same for ALL rows)")
print(f"\nThese are 4 CONSTANT columns with ZERO variance!")
print(f"They add NO predictive signal but increase dimensionality.")

Training data: (3994318, 11)

Analyzing quantile features bug from experiment 002...


Buggy quantile features would be:
  weight_q25 = 12.068964 (same for ALL rows)
  weight_q50 = 18.054360 (same for ALL rows)
  weight_q75 = 23.987505 (same for ALL rows)
  weight_q90 = 27.563675 (same for ALL rows)

These are 4 CONSTANT columns with ZERO variance!
They add NO predictive signal but increase dimensionality.


In [14]:
# Analyze the quantile features bug from experiment 002
print("Analyzing quantile features bug...")

# The bug: quantile features create constant columns
# Let's demonstrate the issue with a simple example

wc_sample = train['Weight Capacity (kg)'].head(1000)
print(f"Sample Weight Capacity values (first 5): {wc_sample.head().values}")

# This is what experiment 002 did (BUGGY):
quantile_25_buggy = wc_sample.quantile(0.25)
print(f"\nBuggy implementation:")
print(f"  wc.quantile(0.25) = {quantile_25_buggy}")
print(f"  This creates a constant column where EVERY row = {quantile_25_buggy}")

# This is what it should do (if we want per-group quantiles):
print(f"\nCorrect implementation would be:")
print(f"  Group by some feature (e.g., weight_bin), THEN compute quantile")
print(f"  Result: different quantile values for different groups")

# Check how many constant columns this creates
quantiles = [0.25, 0.5, 0.75, 0.9]
print(f"\nQuantile features created: {[f'weight_q{int(q*100)}' for q in quantiles]}")
print(f"All of these are constant columns with zero variance!")
print(f"They add 4 features with NO predictive signal.")

Analyzing quantile features bug...
Sample Weight Capacity values (first 5): [11.61172281 27.07853658 16.64375995 12.93722031 17.74933847]

Buggy implementation:
  wc.quantile(0.25) = 12.222638085591132
  This creates a constant column where EVERY row = 12.222638085591132

Correct implementation would be:
  Group by some feature (e.g., weight_bin), THEN compute quantile
  Result: different quantile values for different groups

Quantile features created: ['weight_q25', 'weight_q50', 'weight_q75', 'weight_q90']
All of these are constant columns with zero variance!
They add 4 features with NO predictive signal.
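
The "group first, then compute the quantile" fix described in the cell above can be sketched as follows. This is a minimal illustration on made-up data, not the experiment 002 code: the toy frame, grouping by `Brand`, and the column names `weight_q25_const`, `weight_bin`, and `brand_weight_q25` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training frame (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Brand": rng.choice(["A", "B", "C"], size=200),
    "Weight Capacity (kg)": rng.uniform(5, 30, size=200),
})
wc = toy["Weight Capacity (kg)"]

# Buggy pattern: a scalar broadcast to every row gives a zero-variance column
toy["weight_q25_const"] = wc.quantile(0.25)

# Alternative 1: the quartile bin each row falls into (varies per row)
toy["weight_bin"] = pd.qcut(wc, q=4, labels=False, duplicates="drop")

# Alternative 2: the 25th percentile within each group (varies per group)
toy["brand_weight_q25"] = toy.groupby("Brand")["Weight Capacity (kg)"].transform(
    lambda s: s.quantile(0.25)
)

# Number of distinct values per engineered column
print(toy[["weight_q25_const", "weight_bin", "brand_weight_q25"]].nunique())
```

The constant column has a single distinct value, while the bin and per-group columns vary across rows, which is what a tree model needs in order to split on them.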


In [None]:
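# Analyze feature variance
print("Analyzing feature variance...")

# NOTE: `train_features` is assumed to be the engineered training frame from
# experiment 002; it is not built anywhere in this notebook.
feature_cols = [col for col in train_features.columns if col not in ['id', target_col]]

# Variance is only defined for numeric columns; nunique works for every dtype
numeric_cols = set(train_features[feature_cols].select_dtypes(include=['number', 'bool']).columns)

variance_stats = []
for col in feature_cols:
    variance = train_features[col].var() if col in numeric_cols else np.nan
    nunique = train_features[col].nunique()
    variance_stats.append({'feature': col, 'variance': variance, 'nunique': nunique})

# Non-numeric features have NaN variance and sort to the end
variance_df = pd.DataFrame(variance_stats).sort_values('variance')

print("Features with lowest variance (potential issues):")
print(variance_df.head(10))

print("\nFeatures with highest variance:")
print(variance_df.dropna(subset=['variance']).tail(10))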

In [None]:
# Train a quick model to get feature importance
import xgboost as xgb

print("Training model to get feature importance...")

# Prepare data
X = train_features[feature_cols]
y = train_features[target_col]

# Train a small XGBoost model
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'tree_method': 'hist',  # 'gpu_hist' is deprecated in XGBoost >= 2.0; 'hist' + device='cuda' runs on GPU
    'device': 'cuda',
    'learning_rate': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42,
    'n_estimators': 500,
    'verbosity': 0
}

model = xgb.XGBRegressor(**params)
model.fit(X, y, verbose=False)

# Get feature importance
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 15 most important features:")
print(importance_df.head(15))

print("\nBottom 15 least important features:")
print(importance_df.tail(15))

In [None]:
# Analyze count encoding implementation
print("Analyzing count encoding implementation...")

# Count encoding function (from experiment 002)
def create_count_encoding(df, df_test, cat_cols):
    df = df.copy()
    df_test = df_test.copy()
    
    for col in cat_cols:
        # Compute counts from combined train+test - LEAKAGE RISK
        combined = pd.concat([df[col], df_test[col]], ignore_index=True)
        counts = combined.value_counts()
        
        df[f'{col}_count'] = df[col].map(counts)
        df_test[f'{col}_count'] = df_test[col].map(counts)
    
    return df, df_test

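# Test the count encoding
# NOTE: `test` is never loaded earlier in this notebook. The path below is an
# assumption (same directory as train.csv); adjust it if the test set lives elsewhere.
test = pd.read_csv('/home/data/test.csv')
train_test, test_test = create_count_encoding(train.copy(), test.copy(), cat_features)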

print("Count encoding statistics:")
for col in cat_features:
    count_col = f'{col}_count'
    corr_with_target = train_test[count_col].corr(train_test['Price'])
    print(f"  {col}_count: correlation with Price = {corr_with_target:.6f}")

print("\nCount encoding correlation analysis:")
print("- Larger |correlation| (>0.1) suggests useful linear signal")
print("- Very small |correlation| (<0.01) suggests little linear signal; the feature may add noise")
print("- The sign is not a quality indicator: a negative correlation is as informative as a positive one")
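
For comparison with `create_count_encoding` above, here is a minimal sketch of a variant that derives counts from the training frame only. The function name `count_encode_train_only`, the toy frames, and the choice to map unseen or missing categories to 0 are assumptions made for illustration, not part of experiment 002.

```python
import pandas as pd

def count_encode_train_only(train_df, test_df, cat_cols):
    """Count-encode categoricals using frequencies from the training frame only."""
    train_df = train_df.copy()
    test_df = test_df.copy()
    for col in cat_cols:
        # value_counts() ignores NaN by default, so missing values are not a key
        counts = train_df[col].value_counts()
        train_df[f"{col}_count"] = train_df[col].map(counts).fillna(0).astype(int)
        # Categories absent from train (or missing) get a count of 0 (assumed policy)
        test_df[f"{col}_count"] = test_df[col].map(counts).fillna(0).astype(int)
    return train_df, test_df

# Toy frames (made-up values): "C" appears only in test, one train value is missing
toy_train = pd.DataFrame({"Brand": ["A", "A", "B", None]})
toy_test = pd.DataFrame({"Brand": ["A", "C", "B"]})

enc_train, enc_test = count_encode_train_only(toy_train, toy_test, ["Brand"])
print(enc_train["Brand_count"].tolist())  # [2, 2, 1, 0]
print(enc_test["Brand_count"].tolist())   # [2, 0, 1]
```

Because the test frame never contributes to `counts`, the encoded training columns are identical regardless of which test set is supplied.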

In [None]:
# Analyze interaction features
print("Analyzing interaction features...")

# Create interaction features
def create_interaction_features(df, df_test):
    df = df.copy()
    df_test = df_test.copy()
    
    df['Brand_Size'] = df['Brand'].astype(str) + '_' + df['Size'].astype(str)
    df_test['Brand_Size'] = df_test['Brand'].astype(str) + '_' + df_test['Size'].astype(str)
    
    df['Size_Color'] = df['Size'].astype(str) + '_' + df['Color'].astype(str)
    df_test['Size_Color'] = df_test['Size'].astype(str) + '_' + df_test['Color'].astype(str)
    
    df['Size_Style'] = df['Size'].astype(str) + '_' + df['Style'].astype(str)
    df_test['Size_Style'] = df_test['Size'].astype(str) + '_' + df_test['Style'].astype(str)
    
    return df, df_test

train_int, test_int = create_interaction_features(train.copy(), test.copy())

# Analyze cardinality and price variance
interaction_features = ['Brand_Size', 'Size_Color', 'Size_Style']

for feature in interaction_features:
    nunique = train_int[feature].nunique()
    price_std_by_group = train_int.groupby(feature)['Price'].std().mean()
    
    print(f"\n{feature}:")
    print(f"  Unique combinations: {nunique}")
    print(f"  Avg price std by group: {price_std_by_group:.4f}")
    
    # Check for rare combinations
    value_counts = train_int[feature].value_counts()
    rare_combinations = (value_counts < 10).sum()
    print(f"  Rare combinations (<10 samples): {rare_combinations}")
    
    if rare_combinations > nunique * 0.5:
        print(f"  WARNING: More than 50% rare combinations - may cause overfitting!")
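
One way to use string interaction columns such as `Brand_Size` without one-hot explosion is out-of-fold target encoding. The sketch below is an assumption-laden illustration: the helper name `oof_target_encode`, the fold assignment, the `smoothing` value, and the toy data are all hypothetical and would need tuning against the real cross-validation setup.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold mean target encoding, shrunk toward the fit-part global mean."""
    rng = np.random.default_rng(seed)
    # Balanced random fold ids in [0, n_splits)
    fold_ids = rng.permutation(len(df)) % n_splits
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        in_fold = fold_ids == fold
        fit_part = df.loc[~in_fold]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Additive smoothing: small groups are pulled toward the prior
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Rows in this fold only see statistics computed on the other folds
        encoded.loc[in_fold] = df.loc[in_fold, col].map(smoothed).fillna(prior).to_numpy()
    return encoded

# Toy data (made up): two interaction levels with clearly different prices
toy = pd.DataFrame({
    "Brand_Size": ["A_Small", "A_Small", "B_Large", "B_Large"] * 5,
    "Price": [10.0, 12.0, 50.0, 52.0] * 5,
})
toy["Brand_Size_te"] = oof_target_encode(toy, "Brand_Size", "Price")
print(toy.groupby("Brand_Size")["Brand_Size_te"].mean())
```

Each row's encoding is computed without its own target value, which limits the overfitting that rare combinations would otherwise cause.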

In [None]:
# Summary of findings and recommendations
print("="*60)
print("ANALYSIS SUMMARY AND RECOMMENDATIONS")
print("="*60)

print("\n1. QUANTILE FEATURES BUG:")
print("   - Implementation creates constant columns (all same value)")
print("   - These add NO signal but increase dimensionality")
print("   - FIX: Remove these features or compute per-group quantiles")

print("\n2. COUNT ENCODING LEAKAGE:")
print("   - Using combined train+test data leaks test distribution")
print("   - Should compute counts from training data only")
print("   - Count features with near-zero |correlation| with the target are candidates for removal")

print("\n3. INTERACTION FEATURES:")
print("   - Brand_Size has moderate cardinality (manageable)")
print("   - Need to validate if they actually improve performance")
print("   - Consider target encoding these interactions")

print("\n4. MISSING ORIGINAL DATASET:")
print("   - Competition explicitly allows using original Student Bag dataset")
print("   - 1st place solution heavily exploited this")
print("   - Need to download and incorporate original data features")

print("\n5. SYSTEMATIC VALIDATION NEEDED:")
print("   - Train models with each feature group separately")
print("   - Use ablation studies to identify helpful vs harmful features")
print("   - Use the feature importance ranking to flag features with near-zero importance")

print("\nNEXT STEPS:")
print("1. Remove constant quantile features")
print("2. Fix count encoding to use training data only")
print("3. Download original dataset and compute MSRP features")
print("4. Implement systematic feature validation")
print("5. Try alternative interaction features")
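
The ablation idea in item 5 of the summary can be sketched as a drop-one-group loop. This is a minimal sketch under stated assumptions: `holdout_rmse` uses an ordinary least-squares fit purely as a cheap stand-in for the real XGBoost model, and the function names, group names, and toy data are hypothetical. In practice `score_fn` would be replaced by the project's own cross-validated scorer.

```python
import numpy as np
import pandas as pd

def holdout_rmse(X, y, seed=0):
    """Holdout RMSE of a least-squares fit (stand-in scorer, not the real model)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    tr, va = idx[:cut], idx[cut:]
    # Design matrix with an intercept column
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    y_arr = y.to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(A[tr], y_arr[tr], rcond=None)
    resid = A[va] @ coef - y_arr[va]
    return float(np.sqrt(np.mean(resid ** 2)))

def drop_one_group_ablation(df, feature_groups, target, score_fn=holdout_rmse):
    """Score all features, then re-score with each feature group removed."""
    all_cols = [c for cols in feature_groups.values() for c in cols]
    baseline = score_fn(df[all_cols], df[target])
    rows = []
    for name, cols in feature_groups.items():
        kept = [c for c in all_cols if c not in cols]
        score = score_fn(df[kept], df[target])
        rows.append({"dropped_group": name, "rmse": score,
                     "delta_vs_all": score - baseline})
    report = pd.DataFrame(rows).sort_values("delta_vs_all", ascending=False)
    return baseline, report.reset_index(drop=True)

# Toy data (made up): one informative feature, one noise feature, one constant
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({"signal": rng.normal(size=n),
                    "noise": rng.normal(size=n),
                    "const": 1.0})
toy["Price"] = 3.0 * toy["signal"] + rng.normal(scale=0.5, size=n)

groups = {"useful": ["signal"], "noise": ["noise"], "constant": ["const"]}
baseline, report = drop_one_group_ablation(toy, groups, "Price")
print(f"RMSE with all groups: {baseline:.4f}")
print(report)
```

A large positive `delta_vs_all` means removing the group hurts, so the group helps. A delta near zero or below zero marks the group as a removal candidate, which is the behavior expected of the constant quantile columns.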