# Lab 2: Feature Engineering Workshop - Create Better Features!

## Welcome to Feature Engineering!

This is where you'll learn the art and science of creating new features from existing data to boost model performance!

### What You'll Build:
A complete **feature engineering pipeline** for a customer churn dataset, transforming raw data into powerful predictive features!

### Learning Goals:
- Create domain knowledge features
- Apply mathematical transformations
- Extract datetime features
- Normalize and scale features
- Encode categorical variables (one-hot, label, target, frequency)
- Create interaction features
- Build an automated feature engineering pipeline
- Evaluate feature importance

### Don't Panic!
- Read each instruction carefully
- Try the TODO exercises yourself first
- Hints are provided if you get stuck
- Solutions are at the end (but try not to peek!)

**Let's engineer some amazing features!**

## Step 1: Import Libraries

First, let's import the tools we need for feature engineering.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from scipy import stats
from scipy.stats import boxcox  # returns (transformed data, fitted lambda)
from datetime import datetime, timedelta

# Make plots look nice
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries imported successfully!")
print("You're ready to engineer some features!")

## Step 2: Create the Dataset

We'll create a **realistic telecom customer churn dataset** with rich features for engineering!

In [None]:
# Create telecom customer churn dataset
np.random.seed(42)
n_customers = 1000

# Generate base date range
base_date = pd.Timestamp('2024-01-01')
signup_dates = [base_date - timedelta(days=int(x)) for x in np.random.uniform(30, 730, n_customers)]

data = {
    'customer_id': range(1, n_customers + 1),
    'signup_date': signup_dates,
    'age': np.random.normal(45, 15, n_customers).clip(18, 85).astype(int),
    'tenure_months': np.random.normal(24, 15, n_customers).clip(1, 72).astype(int),
    'monthly_charges': np.random.gamma(5, 15, n_customers).clip(20, 150),
    'total_charges': None,  # Will calculate from monthly * tenure
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers, p=[0.55, 0.25, 0.20]),
    'payment_method': np.random.choice(['Credit card', 'Bank transfer', 'Electronic check', 'Mailed check'], n_customers, p=[0.3, 0.25, 0.3, 0.15]),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_customers, p=[0.35, 0.45, 0.20]),
    'online_security': np.random.choice(['Yes', 'No'], n_customers),
    'tech_support': np.random.choice(['Yes', 'No'], n_customers),
    'streaming_tv': np.random.choice(['Yes', 'No'], n_customers),
    'streaming_movies': np.random.choice(['Yes', 'No'], n_customers),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_customers, p=[0.6, 0.4]),
    'num_support_calls': np.random.poisson(2.5, n_customers),
    'num_tech_tickets': np.random.poisson(1.5, n_customers),
    'num_admin_tickets': np.random.poisson(1.0, n_customers),
    'avg_session_duration_min': np.random.gamma(3, 20, n_customers).clip(5, 200),
    'data_usage_gb': np.random.gamma(2, 15, n_customers).clip(1, 150),
}

df = pd.DataFrame(data)

# Calculate total charges
df['total_charges'] = df['monthly_charges'] * df['tenure_months'] + np.random.normal(0, 100, n_customers)
df['total_charges'] = df['total_charges'].clip(20, 10000)

# Add last payment date
df['last_payment_date'] = pd.to_datetime('2024-01-01') - pd.to_timedelta(np.random.randint(0, 60, n_customers), unit='D')

# Generate churn based on features (realistic patterns)
churn_prob = (
    0.05 +  # base probability
    0.35 * (df['contract_type'] == 'Month-to-month') +
    0.25 * (df['tenure_months'] < 12) +
    0.15 * (df['num_support_calls'] > 4) +
    0.10 * (df['payment_method'] == 'Electronic check') +
    0.10 * (df['paperless_billing'] == 'No') +
    0.05 * (df['tech_support'] == 'No')
)
df['churn'] = (np.random.random(n_customers) < churn_prob.clip(0, 0.95)).astype(int)

print(f"Dataset created!")
print(f"Total customers: {len(df)}")
print(f"Total features: {len(df.columns) - 1}  (excluding target)")
print(f"\nTarget distribution:")
print(f"  Churned: {df['churn'].sum()} ({df['churn'].mean()*100:.1f}%)")
print(f"  Stayed:  {len(df) - df['churn'].sum()} ({(1-df['churn'].mean())*100:.1f}%)")
print(f"\nThis dataset is perfect for feature engineering!")

## Step 3: Explore the Dataset (ALWAYS DO THIS FIRST!)

### TODO 1: Explore the Dataset

Display:
1. First 10 rows
2. Data types and info
3. Identify numerical and categorical features
4. Basic statistics

💡 **Hint:** 
```python
df.head(10)
df.info()
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
```

In [None]:
# TODO 1: YOUR CODE HERE
# Explore the dataset



## Step 4: Create Domain Knowledge Features

Use your understanding of the business to create meaningful features!

### TODO 2: Create Domain-Specific Features

Create these business-meaningful features:
1. `tenure_years`: Convert tenure_months to years
2. `avg_monthly_charges`: total_charges / tenure_months
3. `charges_per_call`: monthly_charges / (num_support_calls + 1) to avoid division by zero
4. `total_tickets`: Sum of all contact counts (num_tech_tickets + num_admin_tickets + num_support_calls)
5. `has_streaming`: 1 if customer has streaming_tv OR streaming_movies
6. `has_protection`: 1 if customer has online_security AND tech_support

💡 **Hint:**
```python
df['tenure_years'] = df['tenure_months'] / 12
df['avg_monthly_charges'] = df['total_charges'] / df['tenure_months']
```

In [None]:
# TODO 2: YOUR CODE HERE
# Create domain knowledge features



✅ **Check:** Do your new features make business sense? 
- tenure_years should be between 0 and 6
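- avg_monthly_charges should be close to monthly_charges for long-tenure customers (for short tenures, the random noise added to total_charges can dominate)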

## Step 5: Create Mathematical Features

Create new features through mathematical combinations!

### TODO 3: Create Mathematical Features

Create ratio and product features:
1. `charges_to_tenure_ratio`: total_charges / (tenure_months + 1)
2. `data_per_session`: data_usage_gb / (avg_session_duration_min + 1)
3. `support_intensity`: num_support_calls / (tenure_months + 1)
4. `total_service_usage`: avg_session_duration_min * data_usage_gb

💡 **Hint:** Add small constants to avoid division by zero
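
To see why this hint matters, here is a minimal sketch on a hypothetical three-row frame (`toy` is made up for illustration; in the lab dataset `tenure_months` is clipped to at least 1, so a zero never actually occurs). It compares plain division with two common guards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the first row has a zero denominator on purpose
toy = pd.DataFrame({
    "num_support_calls": [2, 3, 5],
    "tenure_months": [0, 12, 24],
})

# Plain division: pandas returns inf instead of raising an error
raw_ratio = toy["num_support_calls"] / toy["tenure_months"]

# Guard 1: shift the denominator by a constant (slightly shrinks every ratio)
shifted_ratio = toy["num_support_calls"] / (toy["tenure_months"] + 1)

# Guard 2: treat a zero denominator as missing and decide later how to fill it
masked_ratio = toy["num_support_calls"] / toy["tenure_months"].replace(0, np.nan)

print(pd.DataFrame({"raw": raw_ratio, "shifted": shifted_ratio, "masked": masked_ratio}))
```

The shift is the simplest option, but it changes every value a little. Masking keeps the other ratios exact at the cost of a missing value you have to handle later.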

In [None]:
# TODO 3: YOUR CODE HERE
# Create mathematical features



## Step 6: Extract Datetime Features

Dates contain rich information - let's extract it!

### TODO 4: Extract Datetime Features

From `signup_date` and `last_payment_date`, extract:
1. `signup_month`: Month of signup (1-12)
2. `signup_day_of_week`: Day of week (0=Monday, 6=Sunday)
3. `signup_quarter`: Quarter of year (1-4)
4. `is_signup_weekend`: 1 if signup was on weekend
5. `days_since_signup`: Days between signup and reference date (2024-01-01)
6. `days_since_last_payment`: Days between last payment and reference date

💡 **Hint:**
```python
df['signup_month'] = pd.to_datetime(df['signup_date']).dt.month
df['signup_day_of_week'] = pd.to_datetime(df['signup_date']).dt.dayofweek
df['signup_quarter'] = pd.to_datetime(df['signup_date']).dt.quarter
df['is_signup_weekend'] = (df['signup_day_of_week'] >= 5).astype(int)
reference_date = pd.Timestamp('2024-01-01')
df['days_since_signup'] = (reference_date - pd.to_datetime(df['signup_date'])).dt.days
```
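
If you want to sanity-check the `.dt` accessors before running them on the full dataset, a small sketch on three hand-picked dates can help. The dates and the reference date below are hypothetical and chosen so the results are easy to verify on a calendar.

```python
import pandas as pd

# Hypothetical signup dates: 2024-01-01 was a Monday, 2024-01-06 a Saturday
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-06", "2023-11-15"]))
reference_date = pd.Timestamp("2024-01-08")  # illustrative reference date

features = pd.DataFrame({
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek,                    # 0 = Monday, 6 = Sunday
    "quarter": dates.dt.quarter,
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),  # Saturday or Sunday
    "days_since": (reference_date - dates).dt.days,       # whole days elapsed
})
print(features)
```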

In [None]:
# TODO 4: YOUR CODE HERE
# Extract datetime features



## Step 7: Apply Log Transformation

Log transformations help normalize skewed distributions!

### TODO 5: Apply Log Transformation to Skewed Features

1. Identify skewed numerical features (skewness > 1)
2. Apply log transformation: `log1p` (log(1+x) to handle zeros)
3. Visualize before and after distributions

Focus on: `total_charges`, `data_usage_gb`, `avg_session_duration_min`

💡 **Hint:**
```python
# Check skewness
print(df['total_charges'].skew())

# Apply log transformation
df['total_charges_log'] = np.log1p(df['total_charges'])
df['data_usage_log'] = np.log1p(df['data_usage_gb'])
```
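
Two properties of `log1p` are worth checking yourself: it is defined at zero, and `expm1` reverses it. The sketch below uses a hypothetical log-normal sample (not the lab data) to show both, along with the drop in skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample, used only for illustration
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# log1p(0) is exactly 0, whereas log(0) would be -inf
zero_ok = np.log1p(0.0)

# expm1 is the inverse, useful for reporting values on the original scale
recovered = np.expm1(transformed)
```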

In [None]:
# TODO 5: YOUR CODE HERE
# Apply log transformation



### TODO 6: Apply Box-Cox Transformation

Box-Cox is a more flexible power transformation that estimates the optimal exponent (lambda) from the data:
1. Apply to `monthly_charges`
2. Compare with log transformation
3. Visualize both

💡 **Hint:**
```python
from scipy.stats import boxcox
df['monthly_charges_boxcox'], lambda_param = boxcox(df['monthly_charges'] + 1)
print(f"Optimal lambda: {lambda_param:.4f}")
```

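⚠️ **Common Mistake:** Box-Cox requires strictly positive values! Shift the data first if it contains zeros (for example, add 1 to non-negative values). Here `monthly_charges` is already at least 20, so the `+ 1` in the hint is only a safeguard.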
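
To build some intuition for lambda, the sketch below runs `scipy.stats.boxcox` on a tiny hypothetical array. It shows that lambda = 0 is the natural log, that lambda = 1 is just a shift, that `scipy.special.inv_boxcox` undoes the transform, and that non-positive input is rejected.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # hypothetical positive data

# With a fixed lambda, stats.boxcox returns only the transformed array
as_log = stats.boxcox(x, lmbda=0.0)    # lambda = 0 is the natural log
as_shift = stats.boxcox(x, lmbda=1.0)  # lambda = 1 is x - 1

# Without lmbda, it also returns the lambda estimated by maximum likelihood
fitted, lam = stats.boxcox(x)
round_trip = inv_boxcox(fitted, lam)   # undo the transform

# Zeros (or negatives) raise a ValueError
try:
    stats.boxcox(np.array([0.0, 1.0, 2.0]))
    rejected = False
except ValueError:
    rejected = True

print(f"Fitted lambda: {lam:.3f}, zero rejected: {rejected}")
```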

In [None]:
# TODO 6: YOUR CODE HERE
# Apply Box-Cox transformation



## Step 8: Normalize Numerical Features

Scaling ensures all features have similar ranges!

### TODO 7: Apply StandardScaler and MinMaxScaler

Compare two scaling methods:
1. **StandardScaler**: (x - mean) / std → mean=0, std=1
2. **MinMaxScaler**: (x - min) / (max - min) → range [0, 1]

Apply to: `age`, `tenure_months`, `monthly_charges`, `data_usage_gb`

💡 **Hint:**
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler
scaler_std = StandardScaler()
cols_to_scale = ['age', 'tenure_months', 'monthly_charges', 'data_usage_gb']
df[['age_std', 'tenure_std', 'charges_std', 'data_std']] = scaler_std.fit_transform(df[cols_to_scale])

# MinMaxScaler
scaler_minmax = MinMaxScaler()
df[['age_minmax', 'tenure_minmax', 'charges_minmax', 'data_minmax']] = scaler_minmax.fit_transform(df[cols_to_scale])
```

⚠️ **Important:** In real ML pipelines, fit on training data only!
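
A minimal sketch of what "fit on training data only" looks like in practice, using hypothetical random features rather than the lab dataset: the scaler learns its mean and standard deviation from the training split, and the test split is transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print("Train mean after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Test mean after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The test means will usually be close to zero but not exactly zero. That is expected, because the test rows did not contribute to the statistics.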

In [None]:
# TODO 7: YOUR CODE HERE
# Apply normalization



## Step 9: One-Hot Encode Categorical Variables

Convert categorical variables into binary columns!

### TODO 8: One-Hot Encode Low-Cardinality Categoricals

One-hot encode these features:
- `contract_type` (3 values)
- `internet_service` (3 values)

💡 **Hint:**
```python
# Using pandas
df_encoded = pd.get_dummies(df, columns=['contract_type', 'internet_service'], prefix=['contract', 'internet'])

# OR using sklearn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
encoded = encoder.fit_transform(df[['contract_type', 'internet_service']])
```

In [None]:
# TODO 8: YOUR CODE HERE
# One-hot encode categorical variables



✅ **Check:** How many new columns were created? For n categories, you get n or n-1 columns (if drop='first')
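
A quick way to confirm the column count is to encode a tiny hypothetical frame both ways. Note which column disappears with `drop_first=True`: pandas drops the first category in sorted order, which becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical four-row frame with the three contract types
toy = pd.DataFrame({
    "contract_type": ["Month-to-month", "One year", "Two year", "One year"]
})

full = pd.get_dummies(toy, columns=["contract_type"], prefix="contract")
reduced = pd.get_dummies(toy, columns=["contract_type"], prefix="contract",
                         drop_first=True)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # the first category is dropped
```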

## Step 10: Label Encode Ordinal Variables

For categories with a natural order, use ordinal encoding (often loosely called label encoding) with an explicit order!

### TODO 9: Label Encode with Meaningful Order

Create an ordinal feature for `contract_type` with meaningful order:
- Month-to-month → 0 (least commitment)
- One year → 1 (medium commitment)
- Two year → 2 (most commitment)

💡 **Hint:**
```python
from sklearn.preprocessing import OrdinalEncoder
contract_order = [['Month-to-month', 'One year', 'Two year']]
ord_encoder = OrdinalEncoder(categories=contract_order)
df['contract_ordinal'] = ord_encoder.fit_transform(df[['contract_type']]).ravel()  # flatten (n, 1) to 1-D
```
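
Why spell out the order explicitly? Encoders that infer categories (such as `LabelEncoder`, or a plain pandas `Categorical`) sort them, which is alphabetical for strings. For `contract_type` the alphabetical order happens to match the commitment order, but that is a coincidence. The sketch below uses a hypothetical Low/Medium/High column where it does not.

```python
import pandas as pd

levels = pd.Series(["Low", "High", "Medium", "Low"])  # hypothetical ordered category

# Inferred categories are sorted alphabetically: High=0, Low=1, Medium=2
alphabetical_codes = pd.Categorical(levels).codes

# Explicit categories keep the intended order: Low=0, Medium=1, High=2
ordered = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)
ordered_codes = ordered.codes

print("Alphabetical:", list(alphabetical_codes))
print("Explicit:    ", list(ordered_codes))
```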

In [None]:
# TODO 9: YOUR CODE HERE
# Label encode ordinal variables



## Step 11: Target Encoding (Carefully!)

Encode categories using target mean - powerful but requires care to avoid leakage!

### TODO 10: Apply Target Encoding

Target encode `payment_method` using churn rate:
1. Calculate mean churn rate per payment method
2. Replace categories with their mean target value
3. Add smoothing to avoid overfitting

💡 **Hint:**
```python
# Calculate mean target per category
target_encoding = df.groupby('payment_method')['churn'].mean()
df['payment_method_target_enc'] = df['payment_method'].map(target_encoding)

# Add smoothing (blend with global mean)
global_mean = df['churn'].mean()
counts = df['payment_method'].value_counts()
smoothing = 10
smooth_encoding = (target_encoding * counts + global_mean * smoothing) / (counts + smoothing)
df['payment_method_smooth_enc'] = df['payment_method'].map(smooth_encoding)
```

⚠️ **Warning:** In real ML, do this only on training set, then apply to test set!
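
Here is a minimal sketch of the train-only workflow on hypothetical toy frames (`train` and `test` are made up for illustration). The encoding is learned from `train`, mapped onto `test`, and a category that never appeared in training falls back to the global training mean, which is one reasonable choice among several.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({
    "payment_method": ["Credit card", "Credit card",
                       "Mailed check", "Mailed check", "Mailed check"],
    "churn": [0, 1, 1, 1, 0],
})
test = pd.DataFrame({"payment_method": ["Credit card", "Bank transfer"]})  # 2nd is unseen

# Learn the smoothed encoding from the training set only
global_mean = train["churn"].mean()
per_category = train.groupby("payment_method")["churn"].agg(["mean", "count"])
smoothing = 10
encoding = (
    (per_category["mean"] * per_category["count"] + global_mean * smoothing)
    / (per_category["count"] + smoothing)
)

# Apply to both sets; unseen categories fall back to the global training mean
train["payment_method_enc"] = train["payment_method"].map(encoding)
test["payment_method_enc"] = test["payment_method"].map(encoding).fillna(global_mean)

print(encoding)
print(test)
```

With only two or three rows per category, the smoothed values stay close to the global mean of 0.6. That is the point of smoothing: small categories are not trusted much.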

In [None]:
# TODO 10: YOUR CODE HERE
# Apply target encoding



## Step 12: Frequency Encoding

Encode by how often each category appears!

### TODO 11: Apply Frequency Encoding

Encode `payment_method` by frequency:
1. Count occurrences of each category
2. Replace category with its frequency

💡 **Hint:**
```python
freq_encoding = df['payment_method'].value_counts()
df['payment_method_freq'] = df['payment_method'].map(freq_encoding)
```
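
A common variation is to use relative frequencies instead of raw counts, so the encoded values do not depend on the dataset size. The sketch below uses a hypothetical four-row series and also shows one limitation: categories that appear equally often get the same code and become indistinguishable.

```python
import pandas as pd

# Hypothetical payment methods
methods = pd.Series(["Credit card", "Credit card", "Mailed check", "Bank transfer"])

# Relative frequency of each category, in (0, 1]
rel_freq = methods.value_counts(normalize=True)
encoded = methods.map(rel_freq)

print(encoded.tolist())  # 'Mailed check' and 'Bank transfer' share a value
```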

In [None]:
# TODO 11: YOUR CODE HERE
# Apply frequency encoding



## Step 13: Create Binned Features

Sometimes continuous variables work better as categories!

### TODO 12: Create Binned Features

Bin continuous features into categories:
1. **Equal width bins**: `pd.cut()` - with an integer `bins`, divides the range into equal intervals (you can also pass explicit edges)
2. **Equal frequency bins**: `pd.qcut()` - divide into quantiles

Apply to `age` and `monthly_charges`:
- Age: Young (18-35), Middle (36-55), Senior (56+)
- Charges: Low, Medium, High, Very High (quartiles)

💡 **Hint:**
```python
# Custom bin edges with pd.cut (pass bins=<int> instead for equal-width bins)
df['age_bin'] = pd.cut(df['age'], bins=[0, 35, 55, 100], labels=['Young', 'Middle', 'Senior'])

# Equal frequency bins (quartiles)
df['charges_quartile'] = pd.qcut(df['monthly_charges'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'])
```
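
The three binning styles behave differently at the edges, so it helps to try them on a few hand-picked values. This sketch uses a hypothetical list of ages. Note that `pd.cut` intervals are right-inclusive by default.

```python
import pandas as pd

ages = pd.Series([18, 25, 35, 36, 55, 56, 70, 85])  # hypothetical ages

# Explicit edges: intervals are right-inclusive, so 35 -> Young and 55 -> Middle
custom = pd.cut(ages, bins=[0, 35, 55, 100], labels=["Young", "Middle", "Senior"])

# Integer bins: pd.cut splits the observed range into equal-width intervals
equal_width = pd.cut(ages, bins=4)

# pd.qcut: each bin holds (roughly) the same number of rows
equal_freq = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(custom.tolist())
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```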

In [None]:
# TODO 12: YOUR CODE HERE
# Create binned features



## Step 14: Create Interaction Features

Capture relationships between features!

### TODO 13: Create Interaction Features

Create meaningful interactions:

**Manual interactions:**
1. `tenure_x_charges`: tenure_months * monthly_charges
2. `is_fiber_and_streaming`: 1 if Fiber optic AND has streaming
3. `senior_no_support`: 1 if age>60 AND tech_support='No'

**Polynomial features:**
4. Use PolynomialFeatures for automated interactions

💡 **Hint:**
```python
# Manual
df['tenure_x_charges'] = df['tenure_months'] * df['monthly_charges']

# Polynomial (creates all pairwise interactions; interaction_only=True leaves out the squares)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features_to_interact = ['tenure_months', 'monthly_charges', 'num_support_calls']
interactions = poly.fit_transform(df[features_to_interact])
```

In [None]:
# TODO 13: YOUR CODE HERE
# Create interaction features



## Step 15: Build Complete Feature Engineering Pipeline

### TODO 14: Create Automated Feature Engineering Pipeline

Build a pipeline that combines:
1. Numerical transformations (log, scaling)
2. Categorical encoding (one-hot)
3. Feature selection (optional)

Use ColumnTransformer to handle different feature types!

💡 **Hint:**
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define transformers for different column types
numerical_features = ['age', 'tenure_months', 'monthly_charges']
categorical_features = ['contract_type', 'internet_service']

numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
```
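
One benefit of a `ColumnTransformer` is that it can be chained with a model, so cross-validation refits every preprocessing step on each training fold. The sketch below is one possible arrangement, not the required solution: it uses a hypothetical stand-in frame (`toy`), a made-up target `y`, and adds a log step through `FunctionTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in for the lab data
toy = pd.DataFrame({
    "tenure_months": rng.integers(1, 73, n),
    "total_charges": rng.gamma(2.0, 500.0, n),
    "contract_type": rng.choice(["Month-to-month", "One year", "Two year"], n),
})

# Made-up target: mostly driven by contract type, with 10% of labels flipped
is_monthly = (toy["contract_type"] == "Month-to-month").to_numpy()
flipped = rng.random(n) < 0.1
y = np.logical_xor(is_monthly, flipped).astype(int)

preprocessor = ColumnTransformer(transformers=[
    # Skewed column: log1p first, then scale
    ("log_scale", Pipeline(steps=[
        ("log", FunctionTransformer(np.log1p)),
        ("scaler", StandardScaler()),
    ]), ["total_charges"]),
    # Roughly symmetric column: scale only
    ("scale", StandardScaler(), ["tenure_months"]),
    # Nominal column: one-hot, ignoring categories unseen during fit
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["contract_type"]),
])

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold fits the preprocessing on its own training part only
scores = cross_val_score(model, toy, y, cv=5)
print(f"Accuracy per fold: {scores.round(3)}")
```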

In [None]:
# TODO 14: YOUR CODE HERE
# Build complete feature engineering pipeline



## Step 16: Evaluate Impact of Feature Engineering

### TODO 15: Compare Model Performance

Train models and compare:
1. **Baseline**: Original features only
2. **Engineered**: With all your new features

Use cross-validation for robust comparison!

💡 **Hint:**
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Prepare baseline features
baseline_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges', 
                     'num_support_calls', 'num_tech_tickets']

# Prepare engineered features (select your best ones)
engineered_features = baseline_features + [
    'tenure_years', 'avg_monthly_charges', 'total_tickets',
    'charges_to_tenure_ratio', 'support_intensity',
    'total_charges_log', 'data_usage_log',
    # ... add your engineered features
]

# Encode categoricals if needed (a fresh encoder per column)
for col in df.select_dtypes(include=['object', 'category']).columns:
    df[col + '_encoded'] = LabelEncoder().fit_transform(df[col].astype(str))

# Train and compare
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores_baseline = cross_val_score(clf, df[baseline_features], df['churn'], cv=5)
scores_engineered = cross_val_score(clf, df[engineered_features], df['churn'], cv=5)
```

In [None]:
# TODO 15: YOUR CODE HERE
# Compare model performance



## Step 17: Feature Importance Analysis

### TODO 16: Analyze Feature Importance

1. Train a RandomForest on engineered features
2. Extract feature importances
3. Plot top 15 most important features
4. Identify which engineered features are most valuable

💡 **Hint:**
```python
# Train model
X = df[engineered_features]
y = df['churn']
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get feature importances
importances = clf.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': engineered_features,
    'importance': importances
}).sort_values('importance', ascending=False)

# Plot
plt.figure(figsize=(10, 8))
top_features = feature_importance_df.head(15)
plt.barh(top_features['feature'], top_features['importance'])
plt.xlabel('Importance')
plt.title('Top 15 Most Important Features')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
```

In [None]:
# TODO 16: YOUR CODE HERE
# Analyze feature importance



## Congratulations!

### You Did It!

You just:
- ✅ Created domain knowledge features
- ✅ Applied mathematical transformations
- ✅ Extracted datetime features
- ✅ Normalized and scaled features
- ✅ Encoded categorical variables (5 methods!)
- ✅ Created binned features
- ✅ Built interaction features
- ✅ Constructed an automated pipeline
- ✅ Evaluated feature engineering impact
- ✅ Analyzed feature importance

### What You Learned:

**1. Feature Creation:**
- Domain knowledge features capture business logic
- Mathematical features reveal hidden relationships
- Datetime features extract temporal patterns

**2. Feature Transformation:**
- Log transforms reduce skewness
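- Box-Cox estimates the power transformation that makes the data most normal-like
- Scaling puts features on comparable ranges, which matters for distance- and gradient-based models (tree-based models are largely unaffected)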

**3. Encoding Strategies:**
- One-hot: Best for nominal categories (no order)
- Ordinal: Use when categories have natural order
- Target: Powerful but risk of leakage
- Frequency: Simple and effective
- Binning: Convert continuous to categorical

**4. Feature Interactions:**
- Manual interactions use domain knowledge
- Polynomial features automate discovery
- Both capture non-linear relationships between features

**5. Best Practices:**
- Always split data BEFORE fitting scalers, encoders, or target statistics (avoid leakage)
- Use pipelines for reproducibility
- Validate improvements with cross-validation
- Remove low-importance features
- Document your feature engineering decisions

### Key Insights:
- Good features > complex models
- Feature engineering is creative and iterative
- Domain knowledge is invaluable
- Not all engineered features help - test them!
- Pipelines make feature engineering reproducible

### Next Steps:
- Try Lab 3: Class Imbalance Project
- Apply to your own datasets
- Experiment with more advanced techniques

---

## Extension Exercises (Optional, Harder!)

1. **Automated Feature Engineering**: Try `featuretools` library
2. **Feature Selection**: Implement RFE (Recursive Feature Elimination)
3. **Time-Based Features**: Create lag features, rolling averages
4. **Text Features**: If you had text data, try TF-IDF, word embeddings
5. **Embeddings**: For high-cardinality categoricals, try entity embeddings
6. **Genetic Programming**: Try TPOT, which uses genetic programming to search over preprocessing and model pipelines

---

## You're a Feature Engineering Master Now!

**You just mastered the art of feature engineering!**

**That's AMAZING! Keep engineering those features!**

---
## Solutions (Only Look After Trying!)

Here are the solutions to all TODOs. But remember: **you learn by doing, not by copying!**

In [None]:
# SOLUTION TO TODO 1
print("First 10 rows:")
print(df.head(10))

print("\nDataset Info:")
df.info()  # info() prints directly and returns None, so don't wrap it in print()

numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
datetime_features = df.select_dtypes(include=['datetime64']).columns.tolist()

print(f"\nFeature Types:")
print(f"Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
print(f"Datetime features ({len(datetime_features)}): {datetime_features}")

print("\nBasic Statistics:")
print(df.describe())

In [None]:
# SOLUTION TO TODO 2
print("Creating domain knowledge features...")

# 1. Tenure in years
df['tenure_years'] = df['tenure_months'] / 12

# 2. Average monthly charges
df['avg_monthly_charges'] = df['total_charges'] / df['tenure_months']

# 3. Charges per support call
df['charges_per_call'] = df['monthly_charges'] / (df['num_support_calls'] + 1)

# 4. Total tickets
df['total_tickets'] = df['num_tech_tickets'] + df['num_admin_tickets'] + df['num_support_calls']

# 5. Has streaming services
df['has_streaming'] = ((df['streaming_tv'] == 'Yes') | (df['streaming_movies'] == 'Yes')).astype(int)

# 6. Has protection
df['has_protection'] = ((df['online_security'] == 'Yes') & (df['tech_support'] == 'Yes')).astype(int)

print("✅ Created 6 domain knowledge features")
print("\nSample values:")
print(df[['tenure_months', 'tenure_years', 'monthly_charges', 'avg_monthly_charges', 
          'total_tickets', 'has_streaming', 'has_protection']].head())

In [None]:
# SOLUTION TO TODO 3
print("Creating mathematical features...")

# 1. Charges to tenure ratio
df['charges_to_tenure_ratio'] = df['total_charges'] / (df['tenure_months'] + 1)

# 2. Data per session
df['data_per_session'] = df['data_usage_gb'] / (df['avg_session_duration_min'] + 1)

# 3. Support intensity
df['support_intensity'] = df['num_support_calls'] / (df['tenure_months'] + 1)

# 4. Total service usage
df['total_service_usage'] = df['avg_session_duration_min'] * df['data_usage_gb']

print("✅ Created 4 mathematical features")
print("\nSample values:")
print(df[['charges_to_tenure_ratio', 'data_per_session', 'support_intensity', 'total_service_usage']].head())

In [None]:
# SOLUTION TO TODO 4
print("Extracting datetime features...")

# Convert to datetime if needed
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['last_payment_date'] = pd.to_datetime(df['last_payment_date'])

# 1. Signup month
df['signup_month'] = df['signup_date'].dt.month

# 2. Signup day of week
df['signup_day_of_week'] = df['signup_date'].dt.dayofweek

# 3. Signup quarter
df['signup_quarter'] = df['signup_date'].dt.quarter

# 4. Is signup weekend
df['is_signup_weekend'] = (df['signup_day_of_week'] >= 5).astype(int)

# 5. Days since signup
reference_date = pd.Timestamp('2024-01-01')
df['days_since_signup'] = (reference_date - df['signup_date']).dt.days

# 6. Days since last payment
df['days_since_last_payment'] = (reference_date - df['last_payment_date']).dt.days

print("✅ Created 6 datetime features")
print("\nSample values:")
print(df[['signup_date', 'signup_month', 'signup_day_of_week', 'signup_quarter', 
          'is_signup_weekend', 'days_since_signup', 'days_since_last_payment']].head())

In [None]:
# SOLUTION TO TODO 5
print("Applying log transformation...\n")

# Check skewness first
features_to_transform = ['total_charges', 'data_usage_gb', 'avg_session_duration_min']
print("Skewness before transformation:")
for col in features_to_transform:
    print(f"  {col}: {df[col].skew():.2f}")

# Apply log transformation
df['total_charges_log'] = np.log1p(df['total_charges'])
df['data_usage_log'] = np.log1p(df['data_usage_gb'])
df['session_duration_log'] = np.log1p(df['avg_session_duration_min'])

print("\nSkewness after log transformation:")
print(f"  total_charges_log: {df['total_charges_log'].skew():.2f}")
print(f"  data_usage_log: {df['data_usage_log'].skew():.2f}")
print(f"  session_duration_log: {df['session_duration_log'].skew():.2f}")

# Visualize
fig, axes = plt.subplots(3, 2, figsize=(14, 12))
for idx, col in enumerate(features_to_transform):
    # Before
    axes[idx, 0].hist(df[col], bins=50, alpha=0.7, color='red', edgecolor='black')
    axes[idx, 0].set_title(f'Before: {col}')
    axes[idx, 0].set_ylabel('Frequency')
    
    # After (look up the name of the matching log-transformed column)
    log_col = {'total_charges': 'total_charges_log',
               'data_usage_gb': 'data_usage_log',
               'avg_session_duration_min': 'session_duration_log'}[col]
    axes[idx, 1].hist(df[log_col], bins=50, alpha=0.7, color='green', edgecolor='black')
    axes[idx, 1].set_title(f'After: {log_col}')
    axes[idx, 1].set_ylabel('Frequency')

plt.suptitle('Log Transformation: Before vs After', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ Log transformation reduces skewness!")

In [None]:
# SOLUTION TO TODO 6
from scipy.stats import boxcox

print("Applying Box-Cox transformation...\n")

# Box-Cox requires positive values
df['monthly_charges_boxcox'], lambda_param = boxcox(df['monthly_charges'] + 1)
df['monthly_charges_log'] = np.log1p(df['monthly_charges'])

print(f"Optimal Box-Cox lambda: {lambda_param:.4f}")
print(f"\nSkewness comparison:")
print(f"  Original: {df['monthly_charges'].skew():.4f}")
print(f"  Log: {df['monthly_charges_log'].skew():.4f}")
print(f"  Box-Cox: {df['monthly_charges_boxcox'].skew():.4f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(df['monthly_charges'], bins=50, alpha=0.7, color='red', edgecolor='black')
axes[0].set_title('Original')
axes[1].hist(df['monthly_charges_log'], bins=50, alpha=0.7, color='blue', edgecolor='black')
axes[1].set_title('Log Transform')
axes[2].hist(df['monthly_charges_boxcox'], bins=50, alpha=0.7, color='green', edgecolor='black')
axes[2].set_title(f'Box-Cox (λ={lambda_param:.2f})')
plt.suptitle('Comparison: Log vs Box-Cox Transformation', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ Box-Cox finds optimal transformation automatically!")

In [None]:
# SOLUTION TO TODO 7
from sklearn.preprocessing import StandardScaler, MinMaxScaler

print("Applying normalization...\n")

cols_to_scale = ['age', 'tenure_months', 'monthly_charges', 'data_usage_gb']

# StandardScaler
scaler_std = StandardScaler()
scaled_std = scaler_std.fit_transform(df[cols_to_scale])
df[['age_std', 'tenure_std', 'charges_std', 'data_std']] = scaled_std

# MinMaxScaler
scaler_minmax = MinMaxScaler()
scaled_minmax = scaler_minmax.fit_transform(df[cols_to_scale])
df[['age_minmax', 'tenure_minmax', 'charges_minmax', 'data_minmax']] = scaled_minmax

print("StandardScaler results (mean≈0, std≈1):")
print(df[['age_std', 'tenure_std', 'charges_std', 'data_std']].describe())

print("\nMinMaxScaler results (range [0,1]):")
print(df[['age_minmax', 'tenure_minmax', 'charges_minmax', 'data_minmax']].describe())

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
test_col = 'age'
axes[0].hist(df[test_col], bins=30, alpha=0.7, color='red')
axes[0].set_title('Original')
axes[1].hist(df[test_col + '_std'], bins=30, alpha=0.7, color='blue')
axes[1].set_title('StandardScaler')
axes[2].hist(df[test_col + '_minmax'], bins=30, alpha=0.7, color='green')
axes[2].set_title('MinMaxScaler')
plt.suptitle(f'Normalization Comparison: {test_col}', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ Both scalers transform features to comparable ranges!")

In [None]:
# SOLUTION TO TODO 8
print("One-hot encoding categorical variables...\n")

# Using pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['contract_type', 'internet_service'], 
                            prefix=['contract', 'internet'], drop_first=True)

print("One-hot encoded columns created:")
contract_cols = [col for col in df_encoded.columns if col.startswith('contract_')]
internet_cols = [col for col in df_encoded.columns if col.startswith('internet_')]
print(f"  Contract: {contract_cols}")
print(f"  Internet: {internet_cols}")

print(f"\nShape before: {df.shape}")
print(f"Shape after: {df_encoded.shape}")
print(f"New columns added: {df_encoded.shape[1] - df.shape[1]}")

print("\nSample encoded values:")
print(df_encoded[contract_cols + internet_cols].head())

print("\n✅ One-hot encoding complete! (drop_first=True avoids multicollinearity)")

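# Add the encoded columns to df but keep the original categorical columns,
# because later steps (ordinal encoding, the pipeline) still reference them
df = pd.concat([df, df_encoded[contract_cols + internet_cols]], axis=1)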

In [None]:
# SOLUTION TO TODO 9
from sklearn.preprocessing import OrdinalEncoder

print("Label encoding ordinal variables...\n")

# Create a mapping with meaningful order
contract_mapping = {
    'Month-to-month': 0,
    'One year': 1,
    'Two year': 2
}

# The one-hot step may have replaced contract_type with dummy columns.
# With drop_first=True there is no 'contract_Month-to-month' column:
# Month-to-month is the baseline, i.e. both remaining dummies are 0.
if 'contract_type' in df.columns:
    # Direct mapping if the original column still exists
    df['contract_type_ordinal'] = df['contract_type'].map(contract_mapping)
else:
    # Reconstruct the order from the remaining dummy columns
    df['contract_type_ordinal'] = 0  # baseline: Month-to-month
    df.loc[df['contract_One year'] == 1, 'contract_type_ordinal'] = 1
    df.loc[df['contract_Two year'] == 1, 'contract_type_ordinal'] = 2

print("Contract type ordinal encoding:")
print("  Month-to-month → 0 (least commitment)")
print("  One year → 1 (medium commitment)")
print("  Two year → 2 (most commitment)")

print("\nValue distribution:")
print(df['contract_type_ordinal'].value_counts().sort_index())

print("\n✅ Ordinal encoding preserves natural order!")

In [None]:
# SOLUTION TO TODO 10
print("Applying target encoding...\n")

# Calculate mean churn rate per payment method
target_encoding = df.groupby('payment_method')['churn'].mean()
print("Mean churn rate by payment method:")
print(target_encoding.sort_values(ascending=False))

# Simple target encoding
df['payment_method_target_enc'] = df['payment_method'].map(target_encoding)

# Smoothed target encoding (to avoid overfitting)
global_mean = df['churn'].mean()
counts = df['payment_method'].value_counts()
smoothing = 10  # smoothing parameter

smooth_encoding = (target_encoding * counts + global_mean * smoothing) / (counts + smoothing)
df['payment_method_smooth_enc'] = df['payment_method'].map(smooth_encoding)

print("\nComparison: Simple vs Smoothed encoding:")
comparison = pd.DataFrame({
    'Simple': target_encoding,
    'Smoothed': smooth_encoding,
    'Count': counts
})
print(comparison)

print("\n⚠️ IMPORTANT: In real ML, compute on training set only, then apply to test set!")
print("✅ Target encoding captures target relationship in categorical variables!")

In [None]:
# SOLUTION TO TODO 11
print("Applying frequency encoding...\n")

# Calculate frequency of each category
freq_encoding = df['payment_method'].value_counts()
df['payment_method_freq'] = df['payment_method'].map(freq_encoding)

print("Frequency by payment method:")
print(freq_encoding)

print("\nEncoded values sample:")
print(df[['payment_method', 'payment_method_freq']].head(10))

print("\n✅ Frequency encoding is simple and captures category prevalence!")

In [None]:
# SOLUTION TO TODO 12
print("Creating binned features...\n")

# 1. Fixed-edge bins for age (hand-picked cut points, so the bins are not equal width)
df['age_bin'] = pd.cut(df['age'], bins=[0, 35, 55, 100], labels=['Young', 'Middle', 'Senior'])

# 2. Equal frequency bins for monthly charges (quartiles)
df['charges_quartile'] = pd.qcut(df['monthly_charges'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'])

print("Age bins (custom edges):")
print(df['age_bin'].value_counts().sort_index())

print("\nCharges quartiles (equal frequency):")
print(df['charges_quartile'].value_counts().sort_index())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df['age_bin'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Age Bins (Custom Edges)')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Age Group')

df['charges_quartile'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Charges Quartiles (Equal Frequency)')
axes[1].set_ylabel('Count')
axes[1].set_xlabel('Charge Level')

plt.tight_layout()
plt.show()

print("\n✅ Binning converts continuous features to categorical!")
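
A quick note on terminology: passing a list of edges to `pd.cut` uses exactly those cut points, while passing an integer asks pandas for bins of (nearly) equal width. The sketch below contrasts the two strategies on synthetic ages; the `ages` Series is made up for illustration and is not the lab data.

```python
import numpy as np
import pandas as pd

# Synthetic ages between 18 and 79 (illustrative only)
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 80, size=1000), name="age")

# Equal width: 3 intervals of nearly identical width
# (pandas pads the lowest edge slightly so the minimum value is included)
equal_width = pd.cut(ages, bins=3)

# Equal frequency: 3 intervals holding roughly the same number of rows
equal_freq = pd.qcut(ages, q=3)

print("Equal-width bins:")
print(equal_width.value_counts().sort_index())
print("\nEqual-frequency bins:")
print(equal_freq.value_counts().sort_index())
```

Equal-width bins keep the interval sizes comparable but can end up with very uneven counts on skewed data; equal-frequency bins do the opposite.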

In [None]:
# SOLUTION TO TODO 13
print("Creating interaction features...\n")

# Manual interactions
df['tenure_x_charges'] = df['tenure_months'] * df['monthly_charges']

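# Use the one-hot column if it exists, otherwise derive the flag from the raw column.
# Casting to bool keeps the & below valid even if the one-hot column holds floats.
if 'internet_Fiber optic' in df.columns:
    is_fiber = df['internet_Fiber optic'].astype(bool)
else:
    is_fiber = df['internet_service'] == 'Fiber optic'

df['is_fiber_and_streaming'] = (is_fiber & (df['has_streaming'] == 1)).astype(int)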

df['senior_no_support'] = (
    (df['age'] > 60) & 
    (df['tech_support'] == 'No')
).astype(int)

print("Manual interactions created:")
print("  tenure_x_charges summary:")
print(df['tenure_x_charges'].describe().round(2))
print(f"  is_fiber_and_streaming: {df['is_fiber_and_streaming'].sum()} customers")
print(f"  senior_no_support: {df['senior_no_support'].sum()} customers")

# Polynomial features (automated)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features_to_interact = ['tenure_months', 'monthly_charges', 'num_support_calls']
interactions = poly.fit_transform(df[features_to_interact])
interaction_names = poly.get_feature_names_out(features_to_interact)

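print(f"\nPolynomial features generated: {len(interaction_names)} (inputs plus pairwise products)")
print(f"Feature names: {list(interaction_names)}")

# Add only the pairwise products; the first columns just repeat the input features
for i, name in enumerate(interaction_names):
    if name not in features_to_interact:
        df[f'poly_{name}'] = interactions[:, i]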

print("\n✅ Interaction features capture relationships between variables!")

In [None]:
# SOLUTION TO TODO 14
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

print("Building feature engineering pipeline...\n")

# Define feature groups
numerical_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges',
                     'num_support_calls', 'num_tech_tickets', 'data_usage_gb']

categorical_features = ['payment_method', 'online_security', 'tech_support',
                       'streaming_tv', 'streaming_movies', 'paperless_billing']

# Create transformers
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'  # Drop columns not specified
)

# Create full pipeline with model
from sklearn.ensemble import RandomForestClassifier
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

print("Pipeline structure:")
print(full_pipeline)

print("\n✅ Pipeline is ready! It will:")
print("   1. Scale numerical features")
print("   2. One-hot encode categorical features")
print("   3. Train a RandomForest classifier")
print("\nThis ensures consistent preprocessing on train and test data!")
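
The pipeline above is only defined, not fitted. As a rough sketch of how such a pipeline is typically used, the example below fits a smaller pipeline on a training split and scores it on held-out rows. The `toy` table, its synthetic churn rule, and the names `demo_preprocessor` and `demo_pipeline` are all assumptions for illustration; swap in the lab's `df` and `full_pipeline` to try it on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (illustrative only)
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n),
    "monthly_charges": rng.uniform(20, 120, size=n),
    "payment_method": rng.choice(["Electronic check", "Mailed check", "Credit card"], size=n),
})
# Assumed rule for the demo: short-tenure customers churn more often
toy["churn"] = (rng.random(n) < np.where(toy["tenure_months"] < 12, 0.6, 0.2)).astype(int)

X = toy.drop(columns="churn")
y = toy["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

demo_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_months", "monthly_charges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["payment_method"]),
])
demo_pipeline = Pipeline([
    ("preprocessor", demo_preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# fit() learns scaling statistics and categories from the training rows only;
# score() reuses them on the test rows without refitting
demo_pipeline.fit(X_train, y_train)
test_accuracy = demo_pipeline.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Because the scaler and encoder live inside the pipeline, the test split is transformed with parameters learned from the training split, which is the consistency the cell above describes.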

In [None]:
# SOLUTION TO TODO 15
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

print("Comparing model performance: Baseline vs Engineered Features\n")
print("="*70)

# Prepare features
baseline_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges',
                    'num_support_calls', 'num_tech_tickets']

engineered_features = baseline_features + [
    'tenure_years', 'avg_monthly_charges', 'total_tickets',
    'charges_to_tenure_ratio', 'support_intensity',
    'total_charges_log', 'data_usage_log',
    'signup_month', 'signup_quarter', 'days_since_signup',
    'charges_per_call', 'data_per_session', 'total_service_usage',
    'contract_type_ordinal', 'payment_method_target_enc',
    'tenure_x_charges', 'has_streaming', 'has_protection'
]

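# Filter to ensure columns exist
engineered_features = [f for f in engineered_features if f in df.columns]

# Caveat: payment_method_target_enc was computed from churn on the full dataset,
# so the cross-validated scores below include some target leakage.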

# Train models
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

print("Training baseline model...")
scores_baseline = cross_val_score(clf, df[baseline_features], df['churn'], cv=5, scoring='accuracy')

print("Training engineered features model...")
scores_engineered = cross_val_score(clf, df[engineered_features], df['churn'], cv=5, scoring='accuracy')

# Results
print("\n" + "="*70)
print("RESULTS")
print("="*70)

print(f"\nBaseline Features ({len(baseline_features)} features):")
print(f"  Mean Accuracy: {scores_baseline.mean():.4f} (+/- {scores_baseline.std():.4f})")
print(f"  Scores: {[f'{s:.4f}' for s in scores_baseline]}")

print(f"\nEngineered Features ({len(engineered_features)} features):")
print(f"  Mean Accuracy: {scores_engineered.mean():.4f} (+/- {scores_engineered.std():.4f})")
print(f"  Scores: {[f'{s:.4f}' for s in scores_engineered]}")

# Difference in mean accuracy, expressed in percentage points
improvement = (scores_engineered.mean() - scores_baseline.mean()) * 100
print(f"\n🎯 Improvement: {improvement:+.2f} percentage points of accuracy")

# Visualize
plt.figure(figsize=(10, 6))
plt.bar(['Baseline', 'Engineered'], 
        [scores_baseline.mean(), scores_engineered.mean()],
        yerr=[scores_baseline.std(), scores_engineered.std()],
        color=['coral', 'steelblue'], alpha=0.7, capsize=10)
plt.ylabel('Cross-Validation Accuracy')
plt.title('Model Performance: Baseline vs Engineered Features')
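# Use min/max so the axis stays valid even if the engineered model scores lower
plt.ylim([min(scores_baseline.mean(), scores_engineered.mean()) - 0.05,
          max(scores_baseline.mean(), scores_engineered.mean()) + 0.05])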
for i, (name, score) in enumerate([('Baseline', scores_baseline.mean()), ('Engineered', scores_engineered.mean())]):
    plt.text(i, score+0.01, f'{score:.4f}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

if improvement > 0:
    print("\n✅ Feature engineering improved model performance!")
else:
    print("\n⚠️ Feature engineering didn't help. Try different features or remove redundant ones.")
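
The check above compares mean accuracies only. One optional way to see whether a gain is consistent is to evaluate both feature sets on identical folds and look at the per-fold differences. The sketch below does this on synthetic data from `make_classification`; the arrays `X_demo` and `y_demo`, and the choice of the first three columns as the "small" feature set, are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic classification problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# A fixed splitter guarantees both feature sets are scored on the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

scores_small = cross_val_score(model, X_demo[:, :3], y_demo, cv=cv, scoring="accuracy")
scores_full = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

# Per-fold differences show whether the larger feature set wins consistently
fold_diff = scores_full - scores_small
print("Per-fold difference:", np.round(fold_diff, 4))
print(f"Folds where the full set wins: {(fold_diff > 0).sum()} of {len(fold_diff)}")
```

If the difference flips sign from fold to fold, the mean improvement is probably within noise.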

In [None]:
# SOLUTION TO TODO 16
print("Analyzing feature importance...\n")

# Train model on all engineered features
X = df[engineered_features]
y = df['churn']

clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X, y)

# Get feature importances
importances = clf.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': engineered_features,
    'importance': importances
}).sort_values('importance', ascending=False)

print("Top 20 Most Important Features:")
print("="*60)
for idx, row in feature_importance_df.head(20).iterrows():
    print(f"{row['feature']:35s}: {row['importance']:.6f}")

# Identify engineered features in top 10
top_10 = feature_importance_df.head(10)['feature'].tolist()
engineered_in_top10 = [f for f in top_10 if f not in baseline_features]

print(f"\n🎯 Engineered features in top 10: {len(engineered_in_top10)}")
if engineered_in_top10:
    print("  ", engineered_in_top10)

# Visualize
plt.figure(figsize=(10, 10))
top_features = feature_importance_df.head(15)
colors = ['steelblue' if f not in baseline_features else 'coral' for f in top_features['feature']]
plt.barh(top_features['feature'], top_features['importance'], color=colors, alpha=0.7)
plt.xlabel('Importance', fontsize=12)
plt.title('Top 15 Most Important Features', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='coral', alpha=0.7, label='Baseline Features'),
                   Patch(facecolor='steelblue', alpha=0.7, label='Engineered Features')]
plt.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

print("\n✅ Feature importance analysis complete!")
print("Blue bars = Engineered features, Coral bars = Baseline features")