# Evolver Loop 4 Analysis: Histogram Binning Technique from Winning Solutions

## Objective
Understand and implement the histogram binning technique used by the 1st place winner, Chris Deotte.

## Key Insight from 1st Place
The winning solution uses:
```python
result = X_train2.groupby("Weight Capacity (kg)")["Price"].apply(make_histogram)
X_valid2 = X_valid2.merge(result, on="Weight Capacity (kg)", how="left")
```

This creates features by binning the Price distribution within each Weight Capacity group.
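
The snippet above does not show `make_histogram` itself. The sketch below is one possible shape for it, assuming equal-width bins over a fixed price range; the function body, `N_BINS`, `PRICE_RANGE`, the `histogram_*` column names, and the toy frames are all illustrative assumptions, not the winner's code.

```python
import numpy as np
import pandas as pd

N_BINS = 5                    # assumption: bin count chosen for illustration
PRICE_RANGE = (15.0, 150.0)   # assumption: matches the Price range seen in this data
GROUP_COL = "Weight Capacity (kg)"

def make_histogram(prices):
    """Hypothetical helper: count the prices falling into each fixed-width bin."""
    counts, _ = np.histogram(prices, bins=N_BINS, range=PRICE_RANGE)
    return counts

# Toy stand-ins for X_train2 / X_valid2
toy_train = pd.DataFrame({
    GROUP_COL: [5.0, 5.0, 5.0, 12.5, 12.5],
    "Price": [20.0, 25.0, 140.0, 80.0, 85.0],
})
toy_valid = pd.DataFrame({GROUP_COL: [5.0, 12.5, 30.0]})

# One row of bin counts per Weight Capacity value seen in training
hist_by_group = pd.DataFrame.from_dict(
    {key: make_histogram(prices) for key, prices in toy_train.groupby(GROUP_COL)["Price"]},
    orient="index",
    columns=[f"histogram_{i}" for i in range(N_BINS)],
)
hist_by_group.index.name = GROUP_COL

# Left merge keeps every validation row; capacities unseen in training get NaN
toy_valid_feats = toy_valid.merge(hist_by_group.reset_index(), on=GROUP_COL, how="left")
print(toy_valid_feats)
```

Because the histograms are built from training rows only and then merged onto the validation rows, the validation prices never enter their own features.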

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
training_extra = pd.read_csv('/home/data/training_extra.csv')
test = pd.read_csv('/home/data/test.csv')

combined_train = pd.concat([train, training_extra], ignore_index=True)
print(f"Combined train shape: {combined_train.shape}")
print(f"Test shape: {test.shape}")

y = combined_train['Price']

Combined train shape: (3994318, 11)
Test shape: (200000, 10)


## Understanding the Histogram Binning Technique

The idea is to:
1. Group by Weight Capacity
2. For each group, create histogram bins of the Price values
3. Use these bin counts as features

This captures the full distribution of prices for each weight capacity, which in principle carries more information than simple rounding or digit features.

In [2]:
def create_histogram_features(df, n_bins=50, target_col='Price', group_col='Weight Capacity (kg)', fit_df=None):
    """Create per-group histogram bin-count features.

    Bin edges and per-group counts are computed from fit_df (the global
    combined_train when omitted) and merged onto df by group_col.
    """
    if fit_df is None:
        fit_df = combined_train

    # Create bins for the target variable (Price)
    # Use quantile-based edges so each bin holds a similar number of rows overall
    price_bins = np.percentile(fit_df[target_col], np.linspace(0, 100, n_bins+1))

    # First, let's understand the distribution
    print(f"Price range: {fit_df[target_col].min():.2f} - {fit_df[target_col].max():.2f}")
    print(f"Price bins: {price_bins[:5]}...{price_bins[-5:]}")

    # Group by Weight Capacity and count the prices in each bin
    # (groupby drops rows whose group key is NaN)
    group_histograms = {}
    for weight_val, group in fit_df.groupby(group_col):
        hist, _ = np.histogram(group[target_col], bins=price_bins)
        group_histograms[weight_val] = hist

    # Convert to DataFrame: one row per distinct Weight Capacity value
    feature_cols = [f'hist_bin_{i}' for i in range(n_bins)]
    hist_df = pd.DataFrame.from_dict(group_histograms, orient='index')
    hist_df.columns = feature_cols
    hist_df.index.name = group_col
    hist_df = hist_df.reset_index()

    print(f"\nHistogram features shape: {hist_df.shape}")
    print("Sample histogram features:")
    print(hist_df.head())

    # Merge back to the input rows; keys absent from fit_df (including NaN) get NaN counts
    result = df.merge(hist_df, on=group_col, how='left')
    # hist_df has one row per key, so the left merge keeps df's row order; restore its index
    result.index = df.index

    return result[feature_cols]

# Test the function
print("Testing histogram feature creation...")
hist_features_train = create_histogram_features(combined_train, n_bins=10)
print(f"\nCreated {hist_features_train.shape[1]} histogram features")
print(f"Sample values:")
print(hist_features_train.head())

Testing histogram feature creation...
Price range: 15.00 - 150.00
Price bins: [15.      27.95127 40.969   54.07758 67.55739]...[ 94.35179 108.03365 121.77616 135.03179 150.     ]



Histogram features shape: (1920345, 11)
Sample histogram features:
   Weight Capacity (kg)  hist_bin_0  hist_bin_1  hist_bin_2  hist_bin_3  \
0              5.000000        7032        6535        6108        5839   
1              5.001061           1           0           1           0   
2              5.003431           0           0           1           0   
3              5.003525           0           0           0           0   
4              5.004428           0           0           0           0   

   hist_bin_4  hist_bin_5  hist_bin_6  hist_bin_7  hist_bin_8  hist_bin_9  
0        5502        5201        5646        5839        5203        5182  
1           1           1           1           1           2           1  
2           0           0           0           0           0           0  
3           0           1           0           0           0           0  
4           0           1           0           0           0           0  



Created 10 histogram features
Sample values:
   hist_bin_0  hist_bin_1  hist_bin_2  hist_bin_3  hist_bin_4  hist_bin_5  \
0         6.0        13.0        11.0         6.0         5.0         7.0   
1         0.0         0.0         0.0         0.0         1.0         0.0   
2         0.0         1.0         0.0         0.0         0.0         0.0   
3         0.0         0.0         0.0         0.0         1.0         0.0   
4         0.0         0.0         0.0         0.0         0.0         1.0   

   hist_bin_6  hist_bin_7  hist_bin_8  hist_bin_9  
0         4.0         7.0        17.0         8.0  
1         0.0         0.0         0.0         0.0  
2         0.0         0.0         0.0         0.0  
3         0.0         0.0         0.0         0.0  
4         0.0         0.0         0.0         0.0  


## Alternative Approach: Target Encoding with Statistics

Since the histogram approach might be complex to implement correctly, let's also test a simpler but effective approach from the winning solutions: using groupby statistics.

In [3]:
def create_groupby_features(df, group_col='Weight Capacity (kg)', target_col='Price', fit_df=None):
    """Create per-group target statistics features.

    Statistics are computed from fit_df (the global combined_train when
    omitted) and merged onto df by group_col.
    """
    if fit_df is None:
        fit_df = combined_train

    # Compute target statistics for each group
    stats = fit_df.groupby(group_col)[target_col].agg([
        'mean', 'std', 'count', 'min', 'max', 'median'
    ]).reset_index()

    feature_cols = [f'{group_col}_mean', f'{group_col}_std', f'{group_col}_count', 
                    f'{group_col}_min', f'{group_col}_max', f'{group_col}_median']
    stats.columns = [group_col] + feature_cols

    # Merge back; single-row groups get NaN std, unseen or NaN keys get NaN everywhere
    result = df.merge(stats, on=group_col, how='left')
    # stats has one row per key, so the left merge keeps df's row order; restore its index
    result.index = df.index

    return result[feature_cols]

# Test groupby features
print("Testing groupby statistics features...")
groupby_features_train = create_groupby_features(combined_train)
print(f"\nCreated {groupby_features_train.shape[1]} groupby features")
print(f"Correlation with target:")
for col in groupby_features_train.columns:
    corr = groupby_features_train[col].corr(y)
    print(f"  {col}: {corr:.6f}")

Testing groupby statistics features...



Created 6 groupby features
Correlation with target:
  Weight Capacity (kg)_mean: 0.710410
  Weight Capacity (kg)_std: 0.009983
  Weight Capacity (kg)_count: -0.010189
  Weight Capacity (kg)_min: 0.456896
  Weight Capacity (kg)_max: 0.440081
  Weight Capacity (kg)_median: 0.698332
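
These statistics are computed on the same rows they are merged onto, and the histogram table shows about 1.9 million distinct capacities across roughly 4 million rows, so many groups are tiny and a row's own Price dominates its group mean. The 0.71 correlation is therefore likely optimistic. A common remedy is out-of-fold encoding; the sketch below shows one way to do it with `KFold`, where the function name `oof_group_mean`, the fold count, and the global-mean fallback are assumptions rather than part of the source.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_group_mean(df, group_col, target_col, n_splits=5, seed=42):
    """Hypothetical helper: mean of target_col per group, computed out-of-fold."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        # Group means come only from rows outside the fold being encoded
        fold_means = fit_part.groupby(group_col)[target_col].mean()
        mapped = df[group_col].iloc[enc_idx].map(fold_means)
        # Groups unseen in the fitting rows fall back to the fitting rows' overall mean
        encoded.iloc[enc_idx] = mapped.fillna(fit_part[target_col].mean()).to_numpy()
    return encoded

# Toy data: group 7.5 occurs once, so its own price can never leak into its encoding
toy = pd.DataFrame({
    "Weight Capacity (kg)": [5.0, 5.0, 5.0, 5.0, 7.5, 9.0, 9.0, 9.0],
    "Price": [20.0, 30.0, 40.0, 50.0, 140.0, 60.0, 70.0, 80.0],
})
toy["wc_mean_oof"] = oof_group_mean(toy, "Weight Capacity (kg)", "Price", n_splits=4)
print(toy)
```

Comparing the correlation of the out-of-fold column against the in-sample one gives a more honest estimate of how much signal the group mean carries on unseen rows.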


## Compare Feature Types

Let's compare the correlation of different feature types with the target to understand which are most valuable.

In [4]:
# Create different feature sets for comparison
feature_comparison = {}

# 1. Original weight capacity
feature_comparison['weight_original'] = combined_train['Weight Capacity (kg)']

# 2. Rounding features (from exp_000)
for dec in range(7, 11):
    feature_comparison[f'weight_round_{dec}'] = np.round(combined_train['Weight Capacity (kg)'], decimals=dec)

# 3. Digit features (from exp_000)
weight_filled = combined_train['Weight Capacity (kg)'].fillna(0)
weight_str = weight_filled.astype(str).str.replace('.', '', regex=False).str.pad(width=5, side='right', fillchar='0')
for i in range(1, 6):
    feature_comparison[f'weight_digit_{i}'] = weight_str.str[i-1].astype(float)

# 4. Groupby statistics
print("Computing groupby statistics...")
groupby_features = create_groupby_features(combined_train)
for col in groupby_features.columns:
    feature_comparison[col] = groupby_features[col]

# 5. Histogram features (small number of bins for simplicity)
print("Computing histogram features...")
hist_features = create_histogram_features(combined_train, n_bins=10)
for col in hist_features.columns:
    feature_comparison[col] = hist_features[col]

# Calculate correlations
correlations = {}
for name, series in feature_comparison.items():
    corr = series.corr(y)
    correlations[name] = corr

# Sort by absolute correlation
sorted_correlations = sorted(correlations.items(), key=lambda x: abs(x[1]), reverse=True)

print("\n" + "="*60)
print("FEATURE CORRELATION COMPARISON")
print("="*60)
for name, corr in sorted_correlations[:20]:
    print(f"{name:<25} | {corr:>8.6f}")

Computing groupby statistics...


Computing histogram features...
Price range: 15.00 - 150.00
Price bins: [15.      27.95127 40.969   54.07758 67.55739]...[ 94.35179 108.03365 121.77616 135.03179 150.     ]



Histogram features shape: (1920345, 11)
Sample histogram features:
   Weight Capacity (kg)  hist_bin_0  hist_bin_1  hist_bin_2  hist_bin_3  \
0              5.000000        7032        6535        6108        5839   
1              5.001061           1           0           1           0   
2              5.003431           0           0           1           0   
3              5.003525           0           0           0           0   
4              5.004428           0           0           0           0   

   hist_bin_4  hist_bin_5  hist_bin_6  hist_bin_7  hist_bin_8  hist_bin_9  
0        5502        5201        5646        5839        5203        5182  
1           1           1           1           1           2           1  
2           0           0           0           0           0           0  
3           0           1           0           0           0           0  
4           0           1           0           0           0           0  



FEATURE CORRELATION COMPARISON
Weight Capacity (kg)_mean | 0.710410
Weight Capacity (kg)_median | 0.698332
Weight Capacity (kg)_min  | 0.456896
Weight Capacity (kg)_max  | 0.440081
weight_digit_1            | -0.020723
weight_original           | 0.017703
weight_round_10           | 0.017703
weight_round_9            | 0.017703
weight_round_8            | 0.017703
weight_round_7            | 0.017703
hist_bin_0                | -0.010622
hist_bin_1                | -0.010560
hist_bin_2                | -0.010505
hist_bin_3                | -0.010406
hist_bin_4                | -0.010268
Weight Capacity (kg)_count | -0.010189
hist_bin_5                | -0.010113
Weight Capacity (kg)_std  | 0.009983
hist_bin_6                | -0.009939
hist_bin_7                | -0.009810


## Key Findings

Based on this analysis:

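1. **Groupby statistics** (especially the group mean, r ≈ 0.71, and median, r ≈ 0.70) correlate with Price far more strongly than simple rounding/digit features. They were computed on the same rows they are evaluated on, so these figures are likely inflated for small groups
2. **Histogram binning** can capture distribution information but needs careful implementation: the raw bin counts computed here correlate with Price at only about -0.01, much like the plain group count
3. **Simple weight features** have very low correlation (|r| of about 0.02 or less), as we saw in the exp_003 analysis
4. **The winning solution's approach** of using groupby aggregations is consistent with this analysis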

## Recommendations for Next Experiment

1. Use groupby statistics (mean, std, count) for Weight Capacity
2. Implement proper target encoding for categorical features
3. Create interaction features between categorical columns
4. Use the original dataset statistics (simulate MSRP)
5. Follow the 1st place solution more closely with histogram-style features
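
For recommendation 3, an interaction feature can be as simple as joining two categorical columns into one combined level. The sketch below is a minimal illustration; the column names `Brand` and `Size`, the helper `add_interaction`, the `"missing"` placeholder, and all toy values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame; the column names "Brand" and "Size" and all values are assumptions
toy_cat = pd.DataFrame({
    "Brand": ["A", "A", "B", None],
    "Size": ["Small", "Large", "Small", "Large"],
})

def add_interaction(df, col_a, col_b, sep="_"):
    """Hypothetical helper: join two categorical columns into one combined category."""
    out = df.copy()
    # Fill missing values first so they form their own level instead of propagating NaN
    a = out[col_a].fillna("missing").astype(str)
    b = out[col_b].fillna("missing").astype(str)
    out[f"{col_a}{sep}{col_b}"] = (a + sep + b).astype("category")
    return out

toy_cat = add_interaction(toy_cat, "Brand", "Size")
print(toy_cat["Brand_Size"].tolist())
```

The combined column can then be target-encoded or passed to a tree model as a categorical feature, in the same way as any single categorical column.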

In [None]:
# Save key findings
print("\n" + "="*60)
print("KEY FINDINGS FOR NEXT EXPERIMENT")
print("="*60)
print("1. Groupby statistics show in-sample correlation up to ~0.71 (vs ~0.02 for simple features)")
print("2. Mean Price per Weight Capacity group is the most powerful single feature")
print("3. Need to implement proper target encoding for categoricals")
print("4. Need interaction features (Brand_Size, etc.)")
print("5. Histogram binning is promising but complex - start with groupby stats")