# Feature Engineering & Testing

**Based on EDA Findings**

**Date:** November 18, 2025

## Objective
- Create features based on EDA insights
- Test different feature combinations
- Identify optimal feature set for modeling

## Key Findings from EDA:
1. 🥇 **Reviews_Read**: +163% lift (strongest signal)
2. 🥇 **Email_Interaction**: +36% lift
3. 🥇 **Device_Type**: +27% lift (Tablet best, Mobile worst)
4. 🥇 **Email × Campaign**: +78% combined lift
5. 🥈 **Category**: +14% lift (0,1,2 good; 3,4 poor)
6. ❌ **Age**: NO signal (flat 36-37%)
7. ❌ **AB_Bucket**: NO signal (~37% all buckets)
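
The notebook does not show how the lift figures were derived. One plausible reading, which roughly matches the review numbers (17.8% to 46.7% is about +162%), is the relative gap between the best and worst group purchase rates. The sketch below computes that on made-up toy data; `relative_lift` is a hypothetical helper, not a function from the original EDA.

```python
import pandas as pd

def relative_lift(df: pd.DataFrame, feature: str, target: str = "Purchase") -> float:
    """Relative gap between the best and worst group purchase rate (as a fraction)."""
    rates = df.groupby(feature)[target].mean()
    return (rates.max() - rates.min()) / rates.min()

# Toy data, invented for illustration (not the competition dataset)
toy = pd.DataFrame({
    "Email_Interaction": [0, 0, 0, 0, 1, 1, 1, 1],
    "Purchase":          [0, 0, 0, 1, 0, 0, 1, 1],
})

lift = relative_lift(toy, "Email_Interaction")
print(f"Lift: {lift:+.0%}")  # 25% vs 50% purchase rate -> +100%
```

If the EDA used a different baseline (for example the overall purchase rate), the numbers would differ, so this is only a way to sanity-check the order of magnitude.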

## 1. Setup & Data Loading

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [None]:
# Load data
train_df = pd.read_csv('/Users/jakobbullinger/Documents/Coding Projects/DSBA/Intro Machine Learning/kaggle_competition/data/raw/train_dataset_M1_with_id.csv')
test_df = pd.read_csv('/Users/jakobbullinger/Documents/Coding Projects/DSBA/Intro Machine Learning/kaggle_competition/data/raw/test_dataset_M1_with_id.csv')

print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"\nTarget distribution:")
print(train_df['Purchase'].value_counts())
print(f"\nPurchase rate: {train_df['Purchase'].mean():.2%}")

## 2. Feature Engineering Pipeline

In [None]:
def create_all_features(df):
    """
    Complete feature engineering pipeline based on EDA findings
    
    Priority features based on predictive power:
    1. Reviews_Read (+163% lift)
    2. Email_Interaction (+36% lift)
    3. Device_Type (+27% lift)
    4. Email × Campaign (+78% combined)
    5. Category (+14% lift)
    """
    df = df.copy()
    
    print("Creating features...")
    
    # ============================================================
    # 1. REVIEWS FEATURES (STRONGEST SIGNAL - 163% lift)
    # ============================================================
    # None: 17.8% → 5+: 46.7%
    df['Reviews_Read_Binned'] = pd.cut(df['Reviews_Read'], 
                                        bins=[-1, 0, 2, 4, 100],
                                        labels=[0, 1, 2, 3])  # 0=none, 1=light, 2=medium, 3=heavy
    df['Reviews_Read_Binned'] = df['Reviews_Read_Binned'].astype(float)
    
    df['Has_Read_Reviews'] = (df['Reviews_Read'] > 0).astype(int)
    df['Heavy_Reviewer'] = (df['Reviews_Read'] >= 5).astype(int)
    df['Medium_Reviewer'] = ((df['Reviews_Read'] >= 3) & (df['Reviews_Read'] < 5)).astype(int)
    
    # ============================================================
    # 2. DEVICE FEATURES (STRONG SIGNAL - 27% lift)
    # ============================================================
    # Tablet: 43.1%, Desktop: 40.3%, Mobile: 31.6%
    df['Is_Tablet'] = (df['Device_Type'] == 'Tablet').astype(int)
    df['Is_Desktop'] = (df['Device_Type'] == 'Desktop').astype(int)
    df['Is_Mobile'] = (df['Device_Type'] == 'Mobile').astype(int)
    
    # ============================================================
    # 3. CATEGORY FEATURES (MODERATE SIGNAL - 14% lift)
    # ============================================================
    # Categories 0,1,2: 40-43% | Categories 3,4: 29-31%
    df['Category_Performance'] = df['Category'].map({
        0.0: 'High',  # 43%
        1.0: 'High',  # 40%
        2.0: 'High',  # 42%
        3.0: 'Low',   # 31%
        4.0: 'Low'    # 29%
    })
    df['Is_High_Performing_Category'] = (df['Category_Performance'] == 'High').astype(int)
    
    # ============================================================
    # 4. EMAIL & CAMPAIGN (STRONGEST INTERACTION - 78% lift)
    # ============================================================
    # Email + Campaign: 49.3% | No Email + No Campaign: 27.6%
    
    # Fix Campaign_Period if it's all NaN
    if df['Campaign_Period'].isna().all():
        print("  ⚠️  Campaign_Period is all NaN - recreating from Day column")
        df['Campaign_Period'] = ((df['Day'] >= 25) & (df['Day'] <= 50)) | \
                                ((df['Day'] >= 75) & (df['Day'] <= 90))
    
    # Email during campaign interaction
    df['Email_During_Campaign'] = ((df['Email_Interaction'] == 1) & 
                                    (df['Campaign_Period'] == True)).astype(int)
    
    # Email outside campaign
    df['Email_No_Campaign'] = ((df['Email_Interaction'] == 1) & 
                                (df['Campaign_Period'] == False)).astype(int)
    
    # Campaign number (which campaign?)
    df['Campaign_Number'] = 0
    df.loc[(df['Day'] >= 25) & (df['Day'] <= 50), 'Campaign_Number'] = 1
    df.loc[(df['Day'] >= 75) & (df['Day'] <= 90), 'Campaign_Number'] = 2
    
    # ============================================================
    # 5. CART FEATURES (ANOMALY DETECTED - investigate)
    # ============================================================
    df['Has_Items_In_Cart'] = (df['Items_In_Cart'] > 0).astype(int)
    df['Cart_Size_Category'] = pd.cut(df['Items_In_Cart'],
                                       bins=[-1, 0, 2, 5, 100],
                                       labels=[0, 1, 2, 3])
    df['Cart_Size_Category'] = df['Cart_Size_Category'].astype(float)
    df['Many_Items_In_Cart'] = (df['Items_In_Cart'] >= 6).astype(int)
    
    # ============================================================
    # 6. ENGAGEMENT FEATURES
    # ============================================================
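    # labels=False returns integer bin codes, so duplicates='drop' cannot
    # cause a label/bin count mismatch when quantile edges collapse
    df['Engagement_Level'] = pd.qcut(df['Engagement_Score'], 
                                     q=4, 
                                     labels=False,
                                     duplicates='drop').astype(float)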
    
    df['High_Engagement'] = (df['Engagement_Score'] > df['Engagement_Score'].median()).astype(int)
    
    # Combined engagement metric (hand-picked weights, not fitted)
    df['Total_Engagement_Score'] = (
        df['Reviews_Read'].fillna(0) * 3 +  # Count; strongest single signal in EDA
        df['Items_In_Cart'].fillna(0) +
        df['Email_Interaction'].fillna(0) * 5 +  # 0/1 flag; largest weight
        df['Engagement_Score'].fillna(0)
    )
    
    # ============================================================
    # 7. PRICE & DISCOUNT FEATURES
    # ============================================================
    df['Effective_Price'] = df['Price'] * (1 - df['Discount'] / 100)
    df['Discount_Amount'] = df['Price'] * df['Discount'] / 100
    df['High_Discount'] = (df['Discount'] >= 30).astype(int)
    df['Has_Discount'] = (df['Discount'] > 0).astype(int)
    
    # Price categories
    # labels=False avoids a label/bin count mismatch if duplicate edges are dropped
    df['Price_Category'] = pd.qcut(df['Price'], q=4, labels=False, duplicates='drop').astype(float)
    df['High_Price'] = (df['Price'] > df['Price'].median()).astype(int)
    
    # ============================================================
    # 8. HIGH-VALUE INTERACTION FEATURES
    # ============================================================
    # Email × Campaign (already done above - most important!)
    
    # Tablet × Heavy Reviewer (premium users)
    df['Tablet_Heavy_Reviewer'] = ((df['Is_Tablet'] == 1) & 
                                    (df['Heavy_Reviewer'] == 1)).astype(int)
    
    # Email × Heavy Reviewer (engaged researchers)
    df['Email_Heavy_Reviewer'] = ((df['Email_Interaction'] == 1) & 
                                   (df['Heavy_Reviewer'] == 1)).astype(int)
    
    # Desktop × High Price (serious buyers)
    df['Desktop_High_Price'] = ((df['Is_Desktop'] == 1) & 
                                (df['High_Price'] == 1)).astype(int)
    
    # High Engagement × Reviews (very engaged researchers)
    df['Engaged_Researcher'] = ((df['High_Engagement'] == 1) & 
                                (df['Has_Read_Reviews'] == 1)).astype(int)
    
    # Reviews × Cart (browsing with intent)
    df['Reviews_With_Cart'] = ((df['Has_Read_Reviews'] == 1) & 
                               (df['Has_Items_In_Cart'] == 1)).astype(int)
    
    # ============================================================
    # 9. SOCIOECONOMIC STATUS
    # ============================================================
    # labels=False avoids a label/bin count mismatch if duplicate edges are dropped
    df['SES_Category'] = pd.qcut(df['Socioeconomic_Status_Score'], 
                                  q=3, 
                                  labels=False,
                                  duplicates='drop').astype(float)
    df['High_SES'] = (df['Socioeconomic_Status_Score'] > 
                      df['Socioeconomic_Status_Score'].quantile(0.75)).astype(int)
    
    # ============================================================
    # 10. TIME-BASED FEATURES
    # ============================================================
    df['Is_Morning'] = (df['Time_of_Day'] == 'morning').astype(int)
    df['Is_Evening'] = (df['Time_of_Day'] == 'evening').astype(int)
    df['Is_Afternoon'] = (df['Time_of_Day'] == 'afternoon').astype(int)
    
    # Weekend proxy (7-day weeks)
    df['Is_Weekend'] = ((df['Day'] % 7 == 6) | (df['Day'] % 7 == 0)).astype(int)
    
    # ============================================================
    # 11. MISSING DATA INDICATORS (might be informative!)
    # ============================================================
    df['Age_Missing'] = df['Age'].isna().astype(int)
    df['Payment_Missing'] = df['Payment_Method'].isna().astype(int)
    df['Referral_Missing'] = df['Referral_Source'].isna().astype(int)
    df['Price_Missing'] = df['Price'].isna().astype(int)
    
    print("✅ Feature engineering complete!")
    print(f"   Total features: {df.shape[1]}")
    
    return df

In [None]:
# Apply feature engineering
print("="*70)
print("TRAINING SET")
print("="*70)
train_engineered = create_all_features(train_df)

print("\n" + "="*70)
print("TEST SET")
print("="*70)
test_engineered = create_all_features(test_df)

print("\n" + "="*70)
print("SUMMARY")
print("="*70)
print(f"Train shape: {train_engineered.shape}")
print(f"Test shape: {test_engineered.shape}")

In [None]:
# List all new features
original_features = train_df.columns.tolist()
new_features = [col for col in train_engineered.columns if col not in original_features]

print("="*70)
print(f"NEW FEATURES CREATED ({len(new_features)} total)")
print("="*70)

for i, feat in enumerate(new_features, 1):
    print(f"{i:2d}. {feat}")
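
Calling `create_all_features` separately on train and test means the median and quantile cut points (`High_Engagement`, `Price_Category`, `SES_Category`, and so on) are recomputed on each frame, so the same raw value can land in different bins in the two sets. A minimal sketch of one alternative, assuming it is acceptable to learn the cut points on the training data only and reuse them; `fit_thresholds`, `apply_thresholds`, and the `_Level` suffix are hypothetical names, and the frames are toy data.

```python
import numpy as np
import pandas as pd

def fit_thresholds(train, cols, quantiles=(0.25, 0.5, 0.75)):
    """Learn quantile cut points per column from the training frame only."""
    return {c: np.unique(train[c].quantile(list(quantiles)).to_numpy()) for c in cols}

def apply_thresholds(df, thresholds):
    """Bin each column using cut points learned elsewhere (e.g. on train)."""
    out = df.copy()
    for col, edges in thresholds.items():
        values = out[col].to_numpy(dtype=float)
        # right=True: a value equal to an edge falls in the lower bin, like pd.qcut
        codes = np.digitize(values, edges, right=True).astype(float)
        codes[np.isnan(values)] = np.nan  # keep missing values missing
        out[f"{col}_Level"] = codes
    return out

# Toy frames, invented for illustration
train_toy = pd.DataFrame({"Engagement_Score": [1, 2, 3, 4, 5, 6, 7, 8]})
test_toy = pd.DataFrame({"Engagement_Score": [0.5, 3.0, 5.0, 9.0, np.nan]})

cuts = fit_thresholds(train_toy, ["Engagement_Score"])
train_binned = apply_thresholds(train_toy, cuts)
test_binned = apply_thresholds(test_toy, cuts)
print(test_binned)
```

Test values outside the training range simply fall into the first or last bin, which is usually the behavior wanted for tree models.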

## 3. Feature Set Testing

Compare feature subsets using a RandomForest with stratified 5-fold cross-validation, scored by ROC-AUC, to find the best-performing set.

In [None]:
def prepare_data_for_modeling(df, target='Purchase'):
    """
    Prepare data for modeling by encoding and cleaning
    """
    df_model = df.copy()
    
    # Encode categorical variables.
    # Note: the encoder is refit on whatever frame is passed in, so train and
    # test get matching codes only if they contain the same set of values.
    cat_cols = ['Time_of_Day', 'Category_Performance']
    for col in cat_cols:
        if col in df_model.columns:
            df_model[col] = LabelEncoder().fit_transform(df_model[col].astype(str))
    
    # Remove non-feature columns (including the target, whatever its name)
    drop_cols = [target, 'Session_ID', 'Device_Type', 'Payment_Method', 
                 'Referral_Source', 'PM_RS_Combo']
    
    # Separate features and target (y is None for the unlabeled test set)
    y = df_model[target] if target in df_model.columns else None
    X = df_model.drop(columns=[c for c in drop_cols if c in df_model.columns])
    
    # Fill NaN with the column median; numeric_only skips any remaining
    # non-numeric columns instead of raising a TypeError on recent pandas
    X = X.fillna(X.median(numeric_only=True))
    
    return X, y

# Test preparation
X_train, y_train = prepare_data_for_modeling(train_engineered)
print(f"Training features shape: {X_train.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"\nFeature columns: {X_train.shape[1]}")

In [None]:
def test_feature_sets(df, target='Purchase'):
    """
    Test different feature combinations to find optimal set
    """
    # Prepare data
    X, y = prepare_data_for_modeling(df, target)
    
    # Define feature sets to test
    feature_sets = {
        '1. Baseline (Original Only)': [
            'Age', 'Gender', 'Reviews_Read', 'Price', 'Discount', 'Category',
            'Items_In_Cart', 'Email_Interaction', 'Socioeconomic_Status_Score',
            'Engagement_Score'
        ],
        
        '2. Email + Campaign Only': [
            'Email_Interaction', 'Campaign_Period', 'Email_During_Campaign'
        ],
        
        '3. Reviews Features': [
            'Reviews_Read_Binned', 'Has_Read_Reviews', 'Heavy_Reviewer', 'Medium_Reviewer'
        ],
        
        '4. Device Features': [
            'Is_Tablet', 'Is_Desktop', 'Is_Mobile'
        ],
        
        '5. Top Features Only': [
            # Email & Campaign
            'Email_Interaction', 'Campaign_Period', 'Email_During_Campaign',
            # Reviews
            'Reviews_Read_Binned', 'Heavy_Reviewer',
            # Device
            'Is_Tablet', 'Is_Desktop',
            # Engagement
            'Engagement_Score', 'High_Engagement',
            # Price/Discount
            'Effective_Price', 'High_Discount',
            # Category
            'Is_High_Performing_Category'
        ],
        
        '6. Top Features + Interactions': [
            # Core features
            'Email_Interaction', 'Campaign_Period', 'Email_During_Campaign',
            'Reviews_Read_Binned', 'Heavy_Reviewer',
            'Is_Tablet', 'Is_Desktop',
            'Engagement_Score', 'High_Engagement',
            'Effective_Price', 'High_Discount',
            'Is_High_Performing_Category',
            # Interactions
            'Email_Heavy_Reviewer', 'Tablet_Heavy_Reviewer',
            'Engaged_Researcher', 'Desktop_High_Price'
        ],
        
        '7. Top + Missing Indicators': [
            # Core features
            'Email_Interaction', 'Campaign_Period', 'Email_During_Campaign',
            'Reviews_Read_Binned', 'Heavy_Reviewer',
            'Is_Tablet', 'Is_Desktop',
            'Engagement_Score', 'High_Engagement',
            'Effective_Price', 'High_Discount',
            'Is_High_Performing_Category',
            # Missing indicators
            'Age_Missing', 'Payment_Missing', 'Referral_Missing'
        ],
        
        '8. All Engineered Features': [col for col in X.columns if col not in 
                                       ['Age', 'Gender', 'Reviews_Read', 'Price', 'Discount', 
                                        'Category', 'Items_In_Cart', 'Day', 'AB_Bucket', 
                                        'Price_Sine', 'Socioeconomic_Status_Score']]
    }
    
    # Test each feature set
    results = {}
    print("\n" + "="*70)
    print("TESTING FEATURE COMBINATIONS")
    print("="*70)
    print("Using RandomForest with 5-Fold Cross-Validation")
    print("Metric: ROC-AUC Score")
    print("="*70)
    
    for name, features in feature_sets.items():
        # Get available features
        available = [f for f in features if f in X.columns]
        
        if len(available) == 0:
            print(f"\n{name}: No features available - SKIPPED")
            continue
        
        X_subset = X[available]
        
        # RandomForest with stratified 5-fold CV
        rf = RandomForestClassifier(
            n_estimators=100, 
            max_depth=10,
            min_samples_split=50,
            random_state=42, 
            n_jobs=-1
        )
        
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(rf, X_subset, y, cv=cv, scoring='roc_auc', n_jobs=-1)
        
        results[name] = {
            'mean_auc': scores.mean(),
            'std_auc': scores.std(),
            'n_features': len(available),
            'scores': scores
        }
        
        print(f"\n{name}")
        print(f"  Features: {len(available):3d}")
        print(f"  ROC-AUC:  {scores.mean():.4f} ± {scores.std():.4f}")
        print(f"  Folds:    {[f'{s:.4f}' for s in scores]}")
    
    # Summary
    print("\n" + "="*70)
    print("SUMMARY - RANKED BY PERFORMANCE")
    print("="*70)
    
    ranked = sorted(results.items(), key=lambda x: x[1]['mean_auc'], reverse=True)
    
    for i, (name, metrics) in enumerate(ranked, 1):
        print(f"{i}. {name}")
        print(f"   AUC: {metrics['mean_auc']:.4f} | Features: {metrics['n_features']}")
    
    # Best feature set
    best_name, best_metrics = ranked[0]
    print("\n" + "="*70)
    print("🏆 BEST FEATURE SET")
    print("="*70)
    print(f"Name: {best_name}")
    print(f"ROC-AUC: {best_metrics['mean_auc']:.4f} ± {best_metrics['std_auc']:.4f}")
    print(f"Features: {best_metrics['n_features']}")
    print("="*70)
    
    return results, ranked

# Run the test
results, ranked = test_feature_sets(train_engineered)
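
`prepare_data_for_modeling` fills missing values with medians computed over the whole training frame before cross-validation, so each validation fold influences its own imputation values. One way to keep that step inside the folds is to wrap the imputer and the forest in a scikit-learn pipeline. The sketch below runs on synthetic data (`X_demo` and `y_demo` are made up) and is meant as an illustration, not as the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the engineered features (invented for illustration)
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame({
    "Reviews_Read": rng.integers(0, 8, n).astype(float),
    "Engagement_Score": rng.normal(size=n),
})
y_demo = (X_demo["Reviews_Read"] + rng.normal(scale=2.0, size=n) > 4).astype(int)
X_demo.loc[rng.random(n) < 0.1, "Engagement_Score"] = np.nan  # inject missing values

# The imputer is refit on the training part of every fold
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, max_depth=10,
                           min_samples_split=50, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```

With the notebook's data, the same `model` could be passed to `cross_val_score` in place of `rf`, provided the `fillna` step is skipped beforehand. For median imputation on a large dataset the difference is often small, but the pipeline removes the question entirely.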

## 4. Visualize Results

In [None]:
# Plot comparison
fig, ax = plt.subplots(figsize=(12, 8))

names = [name for name, _ in ranked]
scores = [metrics['mean_auc'] for _, metrics in ranked]
stds = [metrics['std_auc'] for _, metrics in ranked]
n_features = [metrics['n_features'] for _, metrics in ranked]

# Create bars
bars = ax.barh(range(len(names)), scores, xerr=stds, alpha=0.7, capsize=5)

# Color best in gold
bars[0].set_color('gold')

# Add feature count labels
for i, (score, n_feat) in enumerate(zip(scores, n_features)):
    ax.text(score + 0.005, i, f"{n_feat} features", 
            va='center', fontsize=9, color='gray')

ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.set_xlabel('ROC-AUC Score', fontsize=12, fontweight='bold')
ax.set_title('Feature Set Performance Comparison', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.invert_yaxis()

plt.tight_layout()
plt.show()

## 5. Feature Importance Analysis

Analyze which features matter most in a RandomForest trained on the "Top Features + Interactions" set (set 6 above).

In [None]:
# Train on best feature set to get feature importances
X, y = prepare_data_for_modeling(train_engineered)

# Hard-coded to set 6 (Top Features + Interactions); confirm it against
# `ranked[0]` from the cross-validation comparison before relying on it
best_features = [
    'Email_Interaction', 'Campaign_Period', 'Email_During_Campaign',
    'Reviews_Read_Binned', 'Heavy_Reviewer',
    'Is_Tablet', 'Is_Desktop',
    'Engagement_Score', 'High_Engagement',
    'Effective_Price', 'High_Discount',
    'Is_High_Performing_Category',
    'Email_Heavy_Reviewer', 'Tablet_Heavy_Reviewer',
    'Engaged_Researcher', 'Desktop_High_Price'
]

available_features = [f for f in best_features if f in X.columns]
X_best = X[available_features]

# Train model
rf_final = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=50,
    random_state=42,
    n_jobs=-1
)
rf_final.fit(X_best, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'Feature': available_features,
    'Importance': rf_final.feature_importances_
}).sort_values('Importance', ascending=False)

print("="*70)
print("FEATURE IMPORTANCE (Top Features + Interactions)")
print("="*70)
print(feature_importance.to_string(index=False))

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 8))

top_n = 15
top_features = feature_importance.head(top_n)

bars = ax.barh(range(len(top_features)), top_features['Importance'], alpha=0.7)

# Color top 3 differently
bars[0].set_color('gold')
bars[1].set_color('silver')
bars[2].set_color('#CD7F32')  # bronze

ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['Feature'])
ax.set_xlabel('Importance Score', fontsize=12, fontweight='bold')
ax.set_title(f'Top {top_n} Feature Importances', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.invert_yaxis()

plt.tight_layout()
plt.show()
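
The `feature_importances_` attribute used above is impurity-based: it is computed on the training data and can favor features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below shows the idea on synthetic data with one informative and one noise column; the column names and the 70/30 split are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: "signal" drives the target, "noise" does not
rng = np.random.default_rng(0)
n = 600
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = (X_demo["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
rf_demo = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each column on the validation set and measure the drop in ROC-AUC
perm = permutation_importance(
    rf_demo, X_val, y_val, scoring="roc_auc", n_repeats=10, random_state=42
)
perm_df = pd.DataFrame({
    "Feature": X_demo.columns,
    "Importance": perm.importances_mean,
    "Std": perm.importances_std,
}).sort_values("Importance", ascending=False)
print(perm_df.to_string(index=False))
```

Features whose permutation importance is near zero on held-out data are candidates for removal even if their impurity-based score looks respectable.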

## 6. Save Engineered Data

In [None]:
# Save engineered datasets
output_path = '/Users/jakobbullinger/Documents/Coding Projects/DSBA/Intro Machine Learning/kaggle_competition/data/processed/'

train_engineered.to_csv(output_path + 'train_engineered.csv', index=False)
test_engineered.to_csv(output_path + 'test_engineered.csv', index=False)

print("✅ Engineered datasets saved!")
print(f"   Train: {train_engineered.shape}")
print(f"   Test: {test_engineered.shape}")
print(f"\nSaved to: {output_path}")

## 7. Key Takeaways

### Feature Engineering Insights:

1. **Most Important Features:**
   - Email × Campaign interaction
   - Reviews_Read (binned)
   - Device_Type (especially Tablet)
   - Engagement_Score
   - Effective_Price

2. **Feature Set Performance:**
   - Review results above to see which combination works best
   - Baseline vs engineered features comparison
   - Impact of interaction features

3. **Next Steps:**
   - Use best feature set for final modeling
   - Try different algorithms (XGBoost, LightGBM, etc.)
   - Tune hyperparameters
   - Optimize threshold for €200/day budget constraint
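
The notebook does not say what the €200/day budget pays for, so the following is only a sketch under a loudly stated assumption: each targeted session costs a fixed amount (`cost_per_target`, a hypothetical parameter), and the goal is to target the highest-scoring sessions each day until the budget is spent. `select_within_budget` is an illustrative helper, not part of the original workflow.

```python
import numpy as np
import pandas as pd

def select_within_budget(proba, day, daily_budget=200.0, cost_per_target=1.0):
    """Flag the highest-scoring sessions per day until the daily budget is spent.

    cost_per_target is a placeholder assumption; replace it with the real cost.
    """
    df = pd.DataFrame({
        "proba": np.asarray(proba, dtype=float),
        "day": np.asarray(day),
    })
    max_targets = int(daily_budget // cost_per_target)
    # Rank sessions within each day, 1 = highest predicted probability
    rank = df.groupby("day")["proba"].rank(method="first", ascending=False)
    return (rank <= max_targets).to_numpy()

# Toy example: two days, budget for two targets per day
proba = [0.9, 0.2, 0.6, 0.4, 0.8, 0.1]
day = [1, 1, 1, 2, 2, 2]
selected = select_within_budget(proba, day, daily_budget=2.0, cost_per_target=1.0)
print(selected.tolist())  # [True, False, True, True, True, False]
```

A per-day top-k rule like this replaces a single global probability threshold; which of the two fits better depends on how the budget is actually enforced.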