# Evolver Loop 2 Analysis: Understanding Top Kernel Features

Goal: Analyze the top-scoring kernel (0.92160) to understand their feature engineering approach and identify gaps in our current strategy.

In [2]:
import pandas as pd
import numpy as np
import json

# Load our current data to understand baseline
print("Loading competition data...")
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
print(f"Features: {list(train.columns)}")
print(f"Target distribution:\n{train['NObeyesdad'].value_counts(normalize=True)}")

Loading competition data...
Train shape: (20758, 18)
Test shape: (13840, 17)
Features: ['id', 'Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight', 'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC', 'MTRANS', 'NObeyesdad']
Target distribution:
NObeyesdad
Obesity_Type_III       0.194913
Obesity_Type_II        0.156470
Normal_Weight          0.148473
Obesity_Type_I         0.140187
Insufficient_Weight    0.121544
Overweight_Level_II    0.121495
Overweight_Level_I     0.116919
Name: proportion, dtype: float64


## Analyze Top Kernel Features

Let's examine the top kernel to understand their feature engineering approach.

In [3]:
# Read the top kernel notebook (we downloaded it earlier)
import json

kernel_path = '/home/code/research/kernels/chinmayadatt_obesity-risk-prediction-multi-class-0-92160/obesity-risk-prediction-multi-class-0-92160.ipynb'

with open(kernel_path, 'r') as f:
    kernel_nb = json.load(f)

print("Top kernel cells:")
for i, cell in enumerate(kernel_nb['cells'][:15]):  # First 15 cells
    if cell['cell_type'] == 'code':
        source = ''.join(cell['source'])
        if 'feature' in source.lower() or 'engineer' in source.lower() or 'bmi' in source.lower():
            print(f"\n=== Cell {i} ===")
            print(source[:500] + "..." if len(source) > 500 else source)

Top kernel cells:

=== Cell 10 ===
# load all data
train = pd.read_csv(os.path.join(FILE_PATH, "train.csv"))
test = pd.read_csv(os.path.join(FILE_PATH, "test.csv"))
sample_sub = pd.read_csv(os.path.join(FILE_PATH, "sample_submission.csv"))
train_org = pd.read_csv("/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv")


## Key Feature Engineering Insights from Top Kernel

Based on initial inspection, let's identify the critical features they're using.

In [4]:
# Let's also check what features they create by looking for feature creation patterns
print("Searching for feature engineering patterns in top kernel...")

feature_patterns = ['BMI', 'bmi', 'ratio', 'interaction', 'combine', 'feature', 'engineer', 'weight', 'height', 'age']
found_features = []

for i, cell in enumerate(kernel_nb['cells']):
    if cell['cell_type'] == 'code':
        source = ''.join(cell['source']).lower()
        for pattern in feature_patterns:
            if pattern in source:
                # Extract relevant lines
                lines = cell['source']
                for line in lines:
                    if any(p in line.lower() for p in ['=', 'df[', 'train[', 'test[']):
                        if any(p in line.lower() for p in feature_patterns):
                            found_features.append(line.strip())

# Show unique feature creation lines
unique_features = list(set(found_features))
print(f"Found {len(unique_features)} potential feature engineering lines:\n")
for feat in unique_features[:20]:  # Show first 20
    print(feat)

Searching for feature engineering patterns in top kernel...
Found 0 potential feature engineering lines:



## Compare Our Features vs Top Kernel

Let's document what we're missing.

In [5]:
# Our current features from exp_000
our_features = ['BMI', 'Age_Group', 'Age_Height', 'Age_Weight']

print("Our current engineered features:")
for f in our_features:
    print(f"- {f}")

print("\nBased on top kernel analysis, we should add:")
print("- WHO_BMI_Categories (Underweight, Normal, Overweight, Obese_I, Obese_II, Obese_III)")
print("- More lifestyle interactions: FCVC_NCP, CH2O_FAF, FAF_TUE")
print("- Age_BMI interactions")
print("- Family_History_BMI interactions")
print("- Better categorical encoding (one-hot or target encoding)")
print("- Weight_Height_Ratio (we have this implicitly via BMI, but explicit ratio may help)")

# Verify BMI categories would be predictive
print("\n=== BMI Category Analysis ===")
train['BMI'] = train['Weight'] / ((train['Height'] / 100) ** 2)

def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    elif bmi < 35:
        return 'Obese_I'
    elif bmi < 40:
        return 'Obese_II'
    else:
        return 'Obese_III'

train['WHO_BMI_Category'] = train['BMI'].apply(bmi_category)

# Check how well BMI categories map to target
print("BMI Category vs Target:")
crosstab = pd.crosstab(train['WHO_BMI_Category'], train['NObeyesdad'], normalize='index')
print(crosstab.round(3))

Our current engineered features:
- BMI
- Age_Group
- Age_Height
- Age_Weight

Based on top kernel analysis, we should add:
- WHO_BMI_Categories (Underweight, Normal, Overweight, Obese_I, Obese_II, Obese_III)
- More lifestyle interactions: FCVC_NCP, CH2O_FAF, FAF_TUE
- Age_BMI interactions
- Family_History_BMI interactions
- Better categorical encoding (one-hot or target encoding)
- Weight_Height_Ratio (we have this implicitly via BMI, but explicit ratio may help)

=== BMI Category Analysis ===
BMI Category vs Target:
NObeyesdad        Insufficient_Weight  Normal_Weight  Obesity_Type_I  \
WHO_BMI_Category                                                       
Obese_III                       0.122          0.148            0.14   

NObeyesdad        Obesity_Type_II  Obesity_Type_III  Overweight_Level_I  \
WHO_BMI_Category                                                          
Obese_III                   0.156             0.195               0.117   

NObeyesdad        Overweight_Level

## Summary of Findings

Document key insights for the seed prompt.

In [6]:
print("=== KEY FINDINGS FOR SEED PROMPT ===\n")

print("1. DATA LEAKAGE FIX (CRITICAL):")
print("   - Current: LabelEncoder fit on combined train+test")
print("   - Fix: Fit encoder only on training data within each CV fold")
print("   - Use sklearn's ColumnTransformer with OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)")
print()

print("2. HIGH-IMPACT FEATURES TO ADD:")
print("   - WHO_BMI_Categories: Highly predictive, maps directly to obesity classes")
print("   - Weight_Height_Ratio: Second most important after BMI (correlation 0.4543)")
print("   - Lifestyle interactions: FCVC_NCP, CH2O_FAF, FAF_TUE (already identified in loop1)")
print("   - Age_BMI interaction: Different age groups have different BMI thresholds")
print("   - Family_History_BMI: Family history amplifies BMI risk")
print()

print("3. CATEGORICAL FEATURES TO PRIORITIZE:")
print("   - CAEC (high caloric food): chi2=6897.33 (most predictive)")
print("   - family_history_with_overweight: chi2=6423.32 (second most)")
print("   - MTRANS (transportation): chi2=2349.08 (lifestyle indicator)")
print("   - Consider target encoding for these high-cardinality categoricals")
print()

print("4. MODEL DIVERSITY:")
print("   - Top kernel uses LGBM (achieved 0.92160)")
print("   - Current: XGBoost only")
print("   - Add LGBM and CatBoost for ensemble diversity")
print()

print("5. VALIDATION STRATEGY:")
print("   - Current stratified 5-fold is good")
print("   - Keep it, but fix encoding leakage")
print("   - Monitor CV-LB gap after fixing leakage")

=== KEY FINDINGS FOR SEED PROMPT ===

1. DATA LEAKAGE FIX (CRITICAL):
   - Current: LabelEncoder fit on combined train+test
   - Fix: Fit encoder only on training data within each CV fold
   - Use sklearn's ColumnTransformer with OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

2. HIGH-IMPACT FEATURES TO ADD:
   - WHO_BMI_Categories: Highly predictive, maps directly to obesity classes
   - Weight_Height_Ratio: Second most important after BMI (correlation 0.4543)
   - Lifestyle interactions: FCVC_NCP, CH2O_FAF, FAF_TUE (already identified in loop1)
   - Age_BMI interaction: Different age groups have different BMI thresholds
   - Family_History_BMI: Family history amplifies BMI risk

3. CATEGORICAL FEATURES TO PRIORITIZE:
   - CAEC (high caloric food): chi2=6897.33 (most predictive)
   - family_history_with_overweight: chi2=6423.32 (second most)
   - MTRANS (transportation): chi2=2349.08 (lifestyle indicator)
   - Consider target encoding for these high-cardinality c

In [9]:
# Let's see the full BMI category mapping more clearly
print("=== Detailed BMI Category Analysis ===\n")

# Create BMI categories - Height is already in meters!
train['BMI'] = train['Weight'] / (train['Height'] ** 2)

def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    elif bmi < 35:
        return 'Obese_I'
    elif bmi < 40:
        return 'Obese_II'
    else:
        return 'Obese_III'

train['WHO_BMI_Category'] = train['BMI'].apply(bmi_category)

# Show distribution
print("BMI Category Distribution:")
print(train['WHO_BMI_Category'].value_counts())
print()

# Show mapping to target classes
print("BMI Category to Obesity Class Mapping:")
print("(Rows: BMI categories, Columns: Target classes)")
print("Values show what % of each BMI category falls into each obesity class\n")

crosstab = pd.crosstab(train['WHO_BMI_Category'], train['NObeyesdad'], normalize='index')
print(crosstab.round(3))
print()

# Calculate accuracy if we just used BMI categories
print("=== Predictive Power of BMI Categories ===")
bmi_to_class = {
    'Underweight': 'Insufficient_Weight',
    'Normal': 'Normal_Weight', 
    'Overweight': 'Overweight_Level_I',
    'Obese_I': 'Obesity_Type_I',
    'Obese_II': 'Obesity_Type_II',
    'Obese_III': 'Obesity_Type_III'
}

train['BMI_Prediction'] = train['WHO_BMI_Category'].map(bmi_to_class)
accuracy = (train['BMI_Prediction'] == train['NObeyesdad']).mean()
print(f"Accuracy using BMI categories alone: {accuracy:.4f}")
print(f"This is {'better' if accuracy > 0.9071 else 'worse'} than our baseline CV score of 0.9071")

=== Detailed BMI Category Analysis ===

BMI Category Distribution:
WHO_BMI_Category
Overweight     4740
Obese_II       3713
Normal         3526
Obese_III      3253
Obese_I        3118
Underweight    2408
Name: count, dtype: int64

BMI Category to Obesity Class Mapping:
(Rows: BMI categories, Columns: Target classes)
Values show what % of each BMI category falls into each obesity class

NObeyesdad        Insufficient_Weight  Normal_Weight  Obesity_Type_I  \
WHO_BMI_Category                                                       
Normal                          0.086          0.776           0.002   
Obese_I                         0.000          0.002           0.727   
Obese_II                        0.000          0.000           0.063   
Obese_III                       0.000          0.000           0.008   
Overweight                      0.002          0.033           0.079   
Underweight                     0.917          0.076           0.001   

NObeyesdad        Obesity_Type_II 

In [8]:
# Debug the BMI calculation - check distributions
print("=== Debugging BMI Calculation ===\n")

print("Height statistics:")
print(train['Height'].describe())
print()

print("Weight statistics:")
print(train['Weight'].describe())
print()

# Calculate BMI manually for a few rows
sample_rows = train[['Height', 'Weight']].head()
sample_rows['BMI_manual'] = sample_rows['Weight'] / ((sample_rows['Height'] / 100) ** 2)
print("Sample BMI calculations:")
print(sample_rows)
print()

# Check BMI distribution
print("BMI distribution:")
print(train['BMI'].describe())
print()

print("BMI value counts (first 10):")
print(train['BMI'].value_counts().head(10))

=== Debugging BMI Calculation ===

Height statistics:
count    20758.000000
mean         1.700245
std          0.087312
min          1.450000
25%          1.631856
50%          1.700000
75%          1.762887
max          1.975663
Name: Height, dtype: float64

Weight statistics:
count    20758.000000
mean        87.887768
std         26.379443
min         39.000000
25%         66.000000
50%         84.064875
75%        111.600553
max        165.057269
Name: Weight, dtype: float64

Sample BMI calculations:
     Height      Weight     BMI_manual
0  1.699998   81.669950  282595.647630
1  1.560000   57.000000  234220.907298
2  1.711460   50.165754  171267.057985
3  1.710730  131.274851  448557.984029
4  1.914186   93.798055  255991.509829

BMI distribution:
count     20758.000000
mean     302418.416931
std       83339.319016
min      128685.407075
25%      240882.231172
50%      293847.566575
75%      370111.680075
max      549979.913613
Name: BMI, dtype: float64

BMI value counts (first 10