# Day 6: Feature Engineering Part 1

## Learning Objectives
By the end of this lesson, you will be able to:
1. Understand and apply feature scaling techniques (standardization vs normalization)
2. Encode categorical variables using one-hot encoding and label encoding
3. Create new features from existing data
4. Handle datetime features effectively
5. Transform raw datasets into ML-ready format

---

## Table of Contents
1. [Introduction to Feature Engineering](#1.-Introduction-to-Feature-Engineering)
2. [Feature Scaling](#2.-Feature-Scaling)
   - 2.1 Why Scale Features?
   - 2.2 Standardization (Z-score Normalization)
   - 2.3 Min-Max Normalization
   - 2.4 Robust Scaling
   - 2.5 When to Use Which?
3. [Encoding Categorical Variables](#3.-Encoding-Categorical-Variables)
   - 3.1 Label Encoding
   - 3.2 One-Hot Encoding
   - 3.3 Ordinal Encoding
   - 3.4 Target Encoding
4. [Feature Creation](#4.-Feature-Creation)
   - 4.1 Polynomial Features
   - 4.2 Interaction Features
   - 4.3 Aggregation Features
   - 4.4 Domain-Specific Features
5. [Handling DateTime Features](#5.-Handling-DateTime-Features)
6. [Comprehensive Assignment](#6.-Assignment:-Transform-Raw-Dataset)

---

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn preprocessing
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8-whitegrid')

print("All libraries imported successfully!")

---
## 1. Introduction to Feature Engineering

**Feature Engineering** is the process of transforming raw data into features that better represent the underlying problem to predictive models, resulting in improved model accuracy on unseen data.

### Why is Feature Engineering Important?

1. **Better Data Representation**: Raw data often needs transformation to be useful for ML algorithms
2. **Improved Model Performance**: Good features can significantly boost model accuracy
3. **Algorithm Requirements**: Many algorithms require numerical inputs and specific data formats
4. **Domain Knowledge Integration**: Allows incorporating expert knowledge into the data

### The Feature Engineering Process

```
Raw Data → Data Cleaning → Feature Transformation → Feature Creation → Feature Selection → ML-Ready Data
```

In [None]:
# Create a sample dataset for demonstrations
np.random.seed(42)

n_samples = 1000

# Generate sample data
sample_data = pd.DataFrame({
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.exponential(50000, n_samples) + 20000,
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples, 
                                   p=[0.3, 0.4, 0.2, 0.1]),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_samples),
    'purchase_date': pd.date_range('2023-01-01', periods=n_samples, freq='8h'),
    'transaction_amount': np.random.exponential(100, n_samples) + 10,
    'rating': np.random.choice(['Poor', 'Average', 'Good', 'Excellent'], n_samples,
                               p=[0.1, 0.3, 0.4, 0.2])
})

print("Sample Dataset Shape:", sample_data.shape)
print("\nColumn Types:")
print(sample_data.dtypes)
print("\nFirst 10 rows:")
sample_data.head(10)

---
## 2. Feature Scaling

Feature scaling is crucial for many machine learning algorithms. Different features often have different scales, which can cause problems during model training.

### 2.1 Why Scale Features?

Many ML algorithms are sensitive to the scale of features:

| Algorithm | Needs Scaling? | Reason |
|-----------|---------------|--------|
| Linear/Logistic Regression | Yes | Gradient descent converges faster; regularization penalizes all coefficients on the same scale |
| SVM | Yes | Distance-based algorithm |
| K-Nearest Neighbors | Yes | Distance-based algorithm |
| Neural Networks | Yes | Gradient-based optimization |
| PCA | Yes | Variance-based algorithm |
| Decision Trees | No | Split-based, scale-invariant |
| Random Forest | No | Ensemble of trees |
| XGBoost/LightGBM | No | Tree-based |
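
To make the distance-based rows of the table concrete, here is a minimal sketch. The five customers and their values are made up for illustration. It shows how an unscaled income column can dominate Euclidean distance, and how standardizing can change which neighbor is nearest.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, income]. Row 0 is the query customer.
X = np.array([
    [25, 50_000.0],   # query
    [60, 50_500.0],   # very different age, almost identical income
    [26, 55_000.0],   # almost identical age, somewhat higher income
    [40, 30_000.0],
    [35, 90_000.0],
])

def distances_from_first(data):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw_dist = distances_from_first(X)
scaled_dist = distances_from_first(StandardScaler().fit_transform(X))

print("Raw distances:   ", raw_dist.round(2))
print("Scaled distances:", scaled_dist.round(2))
print("Nearest neighbor (raw):   row", np.argmin(raw_dist) + 1)
print("Nearest neighbor (scaled): row", np.argmin(scaled_dist) + 1)
```

On the raw data the 60-year-old counts as the closest match, because a 35-year age gap is tiny next to income differences measured in thousands. After standardization both columns contribute on a comparable scale.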

### The Problem with Unscaled Features

In [None]:
# Demonstrate the problem with unscaled features
print("Statistics of Unscaled Features:")
print("="*60)
print(f"\nAge:")
print(f"  Range: {sample_data['age'].min()} - {sample_data['age'].max()}")
print(f"  Mean: {sample_data['age'].mean():.2f}")
print(f"  Std: {sample_data['age'].std():.2f}")

print(f"\nIncome:")
print(f"  Range: ${sample_data['income'].min():,.2f} - ${sample_data['income'].max():,.2f}")
print(f"  Mean: ${sample_data['income'].mean():,.2f}")
print(f"  Std: ${sample_data['income'].std():,.2f}")

print(f"\nTransaction Amount:")
print(f"  Range: ${sample_data['transaction_amount'].min():.2f} - ${sample_data['transaction_amount'].max():.2f}")
print(f"  Mean: ${sample_data['transaction_amount'].mean():.2f}")
print(f"  Std: ${sample_data['transaction_amount'].std():.2f}")

In [None]:
# Visualize the scale difference
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot distributions
axes[0].hist(sample_data['age'], bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Age Distribution', fontsize=12)
axes[0].set_xlabel('Age')

axes[1].hist(sample_data['income'], bins=30, color='lightgreen', edgecolor='black')
axes[1].set_title('Income Distribution', fontsize=12)
axes[1].set_xlabel('Income ($)')

axes[2].hist(sample_data['transaction_amount'], bins=30, color='salmon', edgecolor='black')
axes[2].set_title('Transaction Amount Distribution', fontsize=12)
axes[2].set_xlabel('Amount ($)')

plt.suptitle('Original Feature Distributions (Different Scales)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 2.2 Standardization (Z-score Normalization)

Standardization transforms features to have **zero mean** and **unit variance**.

**Formula:**
$$z = \frac{x - \mu}{\sigma}$$

Where:
- $x$ = original value
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation of the feature

**Properties:**
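- For roughly normal data, most resulting values fall between -3 and 3
- Does NOT bound values to a specific range
- Less distorted by a single extreme value than Min-Max scaling, although outliers still shift the mean and standard deviation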
- Best for normally distributed data
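
One practical point worth sketching: when data is split into training and test sets, $\mu$ and $\sigma$ are normally estimated on the training rows only and then reused for the test rows, so that no information from the test set leaks into preprocessing. The example below uses synthetic data and an arbitrary 75/25 split, both chosen purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 2))  # synthetic features, illustration only

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mu and sigma from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mu and sigma

print("Train mean:", X_train_scaled.mean(axis=0).round(4))  # zero up to floating-point error
print("Test mean: ", X_test_scaled.mean(axis=0).round(4))   # generally near zero, not exactly zero
```

The same `fit` on train, `transform` on test pattern applies to every scaler and encoder in this lesson.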

In [None]:
# Manual Standardization Implementation
def standardize_manual(X):
    """
    Manually standardize features (Z-score normalization)
    
    Parameters:
    -----------
    X : array-like
        Input features to standardize
    
    Returns:
    --------
    X_standardized : array
        Standardized features with mean=0 and std=1
    mean, std : float
        Mean and population standard deviation (ddof=0) used for scaling
    """
    mean = np.mean(X)
    std = np.std(X)
    X_standardized = (X - mean) / std
    return X_standardized, mean, std

# Apply manual standardization
income_standardized_manual, income_mean, income_std = standardize_manual(sample_data['income'])

print("Manual Standardization:")
print(f"Original Mean: ${income_mean:,.2f}")
print(f"Original Std: ${income_std:,.2f}")
print(f"Standardized Mean: {np.mean(income_standardized_manual):.10f} (should be ~0)")
print(f"Standardized Std: {np.std(income_standardized_manual):.10f} (should be ~1)")

In [None]:
# Using Scikit-learn's StandardScaler
numerical_features = ['age', 'income', 'transaction_amount']

# Initialize the scaler
standard_scaler = StandardScaler()

# Fit and transform
data_standardized = standard_scaler.fit_transform(sample_data[numerical_features])
df_standardized = pd.DataFrame(data_standardized, columns=[f'{col}_standardized' for col in numerical_features])

print("Scikit-learn StandardScaler Results:")
print("="*60)
print("\nStatistics after standardization:")
print(df_standardized.describe().round(4))

In [None]:
# Visualize standardized features
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, col in enumerate(numerical_features):
    axes[i].hist(df_standardized[f'{col}_standardized'], bins=30, color='steelblue', edgecolor='black')
    axes[i].axvline(x=0, color='red', linestyle='--', label='Mean (0)')
    axes[i].set_title(f'{col.title()} (Standardized)', fontsize=12)
    axes[i].set_xlabel('Standardized Value')
    axes[i].legend()

plt.suptitle('Standardized Feature Distributions (Same Scale)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 2.3 Min-Max Normalization

Min-Max Normalization scales features to a fixed range, typically [0, 1].

**Formula:**
$$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

**For custom range [a, b]:**
$$x_{scaled} = a + \frac{(x - x_{min})(b - a)}{x_{max} - x_{min}}$$

**Properties:**
- Bounds values to [0, 1] (or specified range)
- Does not preserve zero entries unless the feature minimum is 0 (for sparse data, `MaxAbsScaler` keeps zeros at zero)
- Sensitive to outliers (they compress the majority of data)
- Good for image data (pixel values)
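
As a small worked example of both formulas and of the outlier sensitivity noted above, the sketch below scales a toy feature (made-up numbers, with one extreme value) to [0, 1] and to the custom range [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature with one extreme value (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

unit_range = MinMaxScaler().fit_transform(x).ravel()
custom_range = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x).ravel()

print("[0, 1] :", unit_range.round(4))
print("[-1, 1]:", custom_range.round(4))
```

The four ordinary values end up squeezed into the bottom few percent of the range, because the single value of 100 defines $x_{max}$.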

In [None]:
# Manual Min-Max Normalization Implementation
def normalize_minmax_manual(X, feature_range=(0, 1)):
    """
    Manually normalize features using Min-Max scaling
    
    Parameters:
    -----------
    X : array-like
        Input features to normalize
    feature_range : tuple
        Desired range for the normalized data (default: (0, 1))
    
    Returns:
    --------
    X_normalized : array
        Normalized features in specified range
    X_min, X_max : float
        Minimum and maximum of the input, used for scaling
    """
    X_min = np.min(X)
    X_max = np.max(X)
    
    # Normalize to [0, 1]
    X_normalized = (X - X_min) / (X_max - X_min)
    
    # Scale to desired range
    range_min, range_max = feature_range
    X_normalized = X_normalized * (range_max - range_min) + range_min
    
    return X_normalized, X_min, X_max

# Apply manual normalization
income_normalized_manual, _, _ = normalize_minmax_manual(sample_data['income'])

print("Manual Min-Max Normalization:")
print(f"Min: {np.min(income_normalized_manual):.4f} (should be 0)")
print(f"Max: {np.max(income_normalized_manual):.4f} (should be 1)")

In [None]:
# Using Scikit-learn's MinMaxScaler
minmax_scaler = MinMaxScaler(feature_range=(0, 1))

# Fit and transform
data_normalized = minmax_scaler.fit_transform(sample_data[numerical_features])
df_normalized = pd.DataFrame(data_normalized, columns=[f'{col}_normalized' for col in numerical_features])

print("Scikit-learn MinMaxScaler Results:")
print("="*60)
print("\nStatistics after normalization:")
print(df_normalized.describe().round(4))

In [None]:
# Visualize normalized features
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, col in enumerate(numerical_features):
    axes[i].hist(df_normalized[f'{col}_normalized'], bins=30, color='coral', edgecolor='black')
    axes[i].axvline(x=0, color='blue', linestyle='--', label='Min (0)')
    axes[i].axvline(x=1, color='green', linestyle='--', label='Max (1)')
    axes[i].set_title(f'{col.title()} (Normalized)', fontsize=12)
    axes[i].set_xlabel('Normalized Value [0, 1]')
    axes[i].legend()

plt.suptitle('Min-Max Normalized Feature Distributions', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 2.4 Robust Scaling

Robust Scaling uses statistics that are robust to outliers: **median** and **interquartile range (IQR)**.

**Formula:**
$$x_{robust} = \frac{x - Q_2}{Q_3 - Q_1}$$

Where:
- $Q_1$ = 25th percentile (first quartile)
- $Q_2$ = 50th percentile (median)
- $Q_3$ = 75th percentile (third quartile)

**Properties:**
- The center and scale estimates are barely affected by outliers (the outliers themselves remain in the scaled data)
- Centers around the median
- Spreads based on IQR
- Best for data with many outliers
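
The formula can be checked by hand. The following sketch implements it with `np.percentile` and compares the result with scikit-learn's `RobustScaler` at its default settings. The helper name `robust_scale_manual` and the sample values are illustrative, not part of any library.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

def robust_scale_manual(x):
    """Scale a 1-D array with the median and the interquartile range."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

x = np.array([10.0, 12.0, 13.0, 15.0, 18.0, 20.0, 500.0])  # 500 is an artificial outlier

manual = robust_scale_manual(x)
sklearn_result = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("Manual: ", manual.round(3))
print("sklearn:", sklearn_result.round(3))
```

The outlier still maps to a large value, but the other six points keep a sensible spread because the median and IQR ignore how extreme the largest value is.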

In [None]:
# Create data with outliers
np.random.seed(42)
data_with_outliers = np.concatenate([
    np.random.normal(50, 10, 950),  # Normal data
    np.random.normal(200, 20, 50)    # Outliers
])

# Compare scalers on data with outliers
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Original data
axes[0, 0].hist(data_with_outliers, bins=50, color='gray', edgecolor='black')
axes[0, 0].set_title('Original Data (with outliers)', fontsize=12)
axes[0, 0].axvline(x=np.mean(data_with_outliers), color='red', linestyle='--', label=f'Mean: {np.mean(data_with_outliers):.1f}')
axes[0, 0].axvline(x=np.median(data_with_outliers), color='green', linestyle='--', label=f'Median: {np.median(data_with_outliers):.1f}')
axes[0, 0].legend()

# StandardScaler
standard_scaled = StandardScaler().fit_transform(data_with_outliers.reshape(-1, 1)).flatten()
axes[0, 1].hist(standard_scaled, bins=50, color='steelblue', edgecolor='black')
axes[0, 1].set_title('StandardScaler (affected by outliers)', fontsize=12)
axes[0, 1].axvline(x=0, color='red', linestyle='--', label='Mean=0')
axes[0, 1].legend()

# MinMaxScaler
minmax_scaled = MinMaxScaler().fit_transform(data_with_outliers.reshape(-1, 1)).flatten()
axes[1, 0].hist(minmax_scaled, bins=50, color='coral', edgecolor='black')
axes[1, 0].set_title('MinMaxScaler (compressed by outliers)', fontsize=12)
axes[1, 0].axvline(x=0.5, color='red', linestyle='--', label='Middle=0.5')
axes[1, 0].legend()

# RobustScaler
robust_scaled = RobustScaler().fit_transform(data_with_outliers.reshape(-1, 1)).flatten()
axes[1, 1].hist(robust_scaled, bins=50, color='seagreen', edgecolor='black')
axes[1, 1].set_title('RobustScaler (handles outliers well)', fontsize=12)
axes[1, 1].axvline(x=0, color='red', linestyle='--', label='Median=0')
axes[1, 1].legend()

plt.suptitle('Comparison of Scaling Methods with Outliers', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 2.5 When to Use Which Scaling Method?

| Situation | Recommended Scaler | Reason |
|-----------|-------------------|--------|
| Normally distributed data | StandardScaler | Preserves distribution shape |
| Unknown distribution | StandardScaler | General purpose |
| Neural networks | MinMaxScaler | Bounded input values work better |
| Image data | MinMaxScaler | Pixel values bounded [0,1] |
| Data with many outliers | RobustScaler | Not affected by extreme values |
| Sparse data | MaxAbsScaler | Divides by the maximum absolute value without shifting, so zeros stay zero |
| Distance-based algorithms | StandardScaler | Equal weight to all features |

In [None]:
# Comprehensive comparison summary
comparison_data = sample_data[numerical_features].copy()

scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

print("Scaling Comparison Summary")
print("="*80)

for name, scaler in scalers.items():
    scaled_data = scaler.fit_transform(comparison_data)
    print(f"\n{name}:")
    print(f"  Mean range: [{scaled_data.mean(axis=0).min():.4f}, {scaled_data.mean(axis=0).max():.4f}]")
    print(f"  Std range: [{scaled_data.std(axis=0).min():.4f}, {scaled_data.std(axis=0).max():.4f}]")
    print(f"  Value range: [{scaled_data.min():.4f}, {scaled_data.max():.4f}]")

---
## 3. Encoding Categorical Variables

Most machine learning algorithms require numerical input, so categorical variables must be converted to numbers before training.

### Types of Categorical Variables

1. **Nominal**: No inherent order (e.g., colors, cities, names)
2. **Ordinal**: Natural order exists (e.g., ratings, education levels)

### 3.1 Label Encoding

Assigns a unique integer to each category.

**Example:**
- Red → 0
- Green → 1
- Blue → 2

**When to use:**
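- Target labels (`LabelEncoder` is designed for the target `y` and assigns integers in sorted order, which is alphabetical for strings)
- Ordinal variables, but only if the sorted order happens to match the real order (otherwise use Ordinal Encoding, Section 3.3)
- Tree-based algorithms (splits do not assume the integers are evenly spaced)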

**When NOT to use:**
- Nominal variables with linear models (implies false ordering)
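
One detail is easy to miss: scikit-learn's `LabelEncoder` sorts the classes before numbering them, so the integers follow alphabetical order rather than any meaningful ranking. A short sketch with hypothetical categories:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ['Low', 'Medium', 'High', 'Medium', 'Low']  # hypothetical ordered categories

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

print("Classes (sorted):", list(encoder.classes_))
print("Codes:", list(codes))
```

Here `High` receives the smallest code and `Medium` the largest, which is not the order a reader would expect.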

In [None]:
# Label Encoding Example
print("Label Encoding Demonstration")
print("="*60)

# Original categories
print("\nOriginal 'city' values:")
print(sample_data['city'].value_counts())

# Apply Label Encoding
label_encoder = LabelEncoder()
city_encoded = label_encoder.fit_transform(sample_data['city'])

print("\nLabel Encoding Mapping:")
for i, class_name in enumerate(label_encoder.classes_):
    print(f"  {class_name} → {i}")

# Show before and after
comparison_df = pd.DataFrame({
    'Original': sample_data['city'].head(10),
    'Label_Encoded': city_encoded[:10]
})
print("\nSample comparison:")
print(comparison_df)

In [None]:
# Manual Label Encoding Implementation
def label_encode_manual(series):
    """
    Manually implement label encoding
    
    Parameters:
    -----------
    series : pd.Series
        Categorical series to encode
    
    Returns:
    --------
    encoded : np.array
        Integer-encoded values
    mapping : dict
        Category to integer mapping
    """
    unique_values = sorted(series.unique())
    mapping = {val: idx for idx, val in enumerate(unique_values)}
    encoded = series.map(mapping).values
    return encoded, mapping

# Test manual implementation
city_encoded_manual, city_mapping = label_encode_manual(sample_data['city'])
print("Manual Label Encoding Mapping:")
for k, v in city_mapping.items():
    print(f"  {k} → {v}")

### 3.2 One-Hot Encoding

Creates binary columns for each category.

**Example:**
| Original | Red | Green | Blue |
|----------|-----|-------|------|
| Red      | 1   | 0     | 0    |
| Green    | 0   | 1     | 0    |
| Blue     | 0   | 0     | 1    |

**When to use:**
- Nominal variables (no order)
- Linear models, neural networks
- When categories don't have inherent ranking

**Considerations:**
- High cardinality = many columns (dimensionality explosion)
- May need to drop one column (dummy variable trap)

In [None]:
# One-Hot Encoding with pandas (most common method)
print("One-Hot Encoding with pandas get_dummies()")
print("="*60)

# Method 1: pd.get_dummies()
city_onehot_pandas = pd.get_dummies(sample_data['city'], prefix='city')

print("\nOriginal shape:", sample_data['city'].shape)
print("One-Hot encoded shape:", city_onehot_pandas.shape)
print("\nNew columns created:")
print(city_onehot_pandas.columns.tolist())
print("\nSample one-hot encoded data:")
print(city_onehot_pandas.head(10))

In [None]:
# One-Hot Encoding with scikit-learn
print("One-Hot Encoding with sklearn OneHotEncoder")
print("="*60)

onehot_encoder = OneHotEncoder(sparse_output=False, drop=None)  # drop='first' for dummy encoding
city_onehot_sklearn = onehot_encoder.fit_transform(sample_data[['city']])

# Get feature names
feature_names = onehot_encoder.get_feature_names_out(['city'])
print("\nFeature names:", feature_names)

# Create DataFrame
city_onehot_df = pd.DataFrame(city_onehot_sklearn, columns=feature_names)
print("\nSample data:")
print(city_onehot_df.head(10))

In [None]:
# Manual One-Hot Encoding Implementation
def onehot_encode_manual(series):
    """
    Manually implement one-hot encoding
    
    Parameters:
    -----------
    series : pd.Series
        Categorical series to encode
    
    Returns:
    --------
    onehot_df : pd.DataFrame
        One-hot encoded DataFrame
    """
    unique_values = sorted(series.unique())
    onehot_dict = {}
    
    for val in unique_values:
        col_name = f"{series.name}_{val}"
        onehot_dict[col_name] = (series == val).astype(int)
    
    return pd.DataFrame(onehot_dict)

# Test manual implementation
city_onehot_manual = onehot_encode_manual(sample_data['city'])
print("Manual One-Hot Encoding:")
print(city_onehot_manual.head(10))

In [None]:
# Handling the Dummy Variable Trap
print("Dummy Variable Trap")
print("="*60)
print("""
The Dummy Variable Trap occurs when one-hot encoded columns are 
perfectly multicollinear (one can be predicted from others).

For k categories, we only need k-1 binary columns because the 
last category can be inferred when all others are 0.

Solution: Drop one column (drop='first' in OneHotEncoder,
drop_first=True in pd.get_dummies)
""")

# Demonstrate with pandas drop_first=True
city_dummy = pd.get_dummies(sample_data['city'], prefix='city', drop_first=True)
print("\nWith drop_first=True:")
print(f"Original categories: {sample_data['city'].nunique()}")
print(f"Dummy columns: {city_dummy.shape[1]}")
print(f"Columns: {city_dummy.columns.tolist()}")

### 3.3 Ordinal Encoding

Similar to label encoding, but explicitly handles the order of categories.

**Use Case:** When categories have a meaningful order (e.g., education levels, ratings)

In [None]:
# Ordinal Encoding for ordered categories
print("Ordinal Encoding Demonstration")
print("="*60)

# Education has a natural order: High School < Bachelor < Master < PhD
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

print("\nOriginal education distribution:")
print(sample_data['education'].value_counts())

# Using sklearn OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_ordinal = ordinal_encoder.fit_transform(sample_data[['education']])

print("\nOrdinal Encoding Mapping:")
for i, edu in enumerate(education_order):
    print(f"  {edu} → {i}")

# Compare
comparison_df = pd.DataFrame({
    'Original': sample_data['education'].head(15),
    'Ordinal_Encoded': education_ordinal[:15].flatten()
})
print("\nSample comparison:")
print(comparison_df)

In [None]:
# Ordinal Encoding for Rating (another example)
rating_order = ['Poor', 'Average', 'Good', 'Excellent']

ordinal_encoder_rating = OrdinalEncoder(categories=[rating_order])
rating_ordinal = ordinal_encoder_rating.fit_transform(sample_data[['rating']])

print("Rating Ordinal Encoding:")
for i, rating in enumerate(rating_order):
    print(f"  {rating} → {i}")

# Create a comprehensive encoding summary
encoding_summary = pd.DataFrame({
    'education_original': sample_data['education'],
    'education_ordinal': education_ordinal.flatten(),
    'rating_original': sample_data['rating'],
    'rating_ordinal': rating_ordinal.flatten()
})

print("\nEncoding Summary (first 10 rows):")
print(encoding_summary.head(10))

### 3.4 Target Encoding (Mean Encoding)

Encodes categories using the target variable's mean for each category.

**Advantages:**
- Captures relationship between category and target
- Single column (no dimensionality explosion)
- Works well with high-cardinality features

**Disadvantages:**
- Risk of target leakage (requires careful validation split)
- Needs smoothing for rare categories
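
One common way to reduce the leakage risk is out-of-fold encoding: each row is encoded with category means computed on the other folds, so a row's own target never contributes to its encoded value. The sketch below is one possible implementation under simple assumptions. The function name `target_encode_out_of_fold`, the fold count, and the synthetic data are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_out_of_fold(df, category_col, target_col, n_splits=5, seed=0):
    """Encode each row using target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for fit_idx, apply_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        fold_means = fit_part.groupby(category_col)[target_col].mean()
        values = df.iloc[apply_idx][category_col].map(fold_means)
        # Categories unseen in the fitting folds fall back to those folds' overall mean
        encoded.iloc[apply_idx] = values.fillna(fit_part[target_col].mean()).to_numpy()

    return encoded

# Small synthetic example (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'city': rng.choice(['A', 'B', 'C'], size=30),
    'target': rng.integers(0, 2, size=30),
})
demo['city_oof'] = target_encode_out_of_fold(demo, 'city', 'target')
print(demo.head())
```

Rows from the same city can receive slightly different encoded values, because each fold sees a different subset of the data.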

In [None]:
# Target Encoding Example
print("Target Encoding Demonstration")
print("="*60)

# Create a target variable based on income
sample_data['high_value_customer'] = (sample_data['income'] > sample_data['income'].median()).astype(int)

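# Calculate target mean for each city (on all rows here for illustration;
# in practice compute it on training data only to avoid target leakage)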
city_target_mean = sample_data.groupby('city')['high_value_customer'].mean()
print("\nTarget Mean by City:")
print(city_target_mean.sort_values(ascending=False))

# Apply target encoding
sample_data['city_target_encoded'] = sample_data['city'].map(city_target_mean)

print("\nSample with Target Encoding:")
print(sample_data[['city', 'city_target_encoded', 'high_value_customer']].head(15))

In [None]:
# Target Encoding with Smoothing (to handle rare categories)
def target_encode_with_smoothing(df, category_col, target_col, smoothing=10):
    """
    Target encoding with smoothing to handle rare categories
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input DataFrame
    category_col : str
        Name of categorical column
    target_col : str
        Name of target column
    smoothing : int
        Smoothing factor (higher = more regularization)
    
    Returns:
    --------
    encoded : pd.Series
        Target-encoded values with smoothing
    smoothed_mean : pd.Series
        Smoothed target mean for each category
    """
    global_mean = df[target_col].mean()
    
    # Calculate category stats
    agg = df.groupby(category_col)[target_col].agg(['mean', 'count'])
    
    # Apply smoothing formula:
    # smoothed_mean = (count * category_mean + smoothing * global_mean) / (count + smoothing)
    smoothed_mean = (agg['count'] * agg['mean'] + smoothing * global_mean) / (agg['count'] + smoothing)
    
    return df[category_col].map(smoothed_mean), smoothed_mean

# Apply smoothed target encoding
city_smoothed, smoothed_mapping = target_encode_with_smoothing(
    sample_data, 'city', 'high_value_customer', smoothing=10
)

print("Target Encoding with Smoothing:")
print("\nRaw vs Smoothed means:")
comparison = pd.DataFrame({
    'Raw Mean': city_target_mean,
    'Smoothed Mean': smoothed_mapping
})
print(comparison)

### Encoding Method Summary

In [None]:
# Visual comparison of encoding methods
print("Encoding Methods Comparison")
print("="*80)

encoding_comparison = pd.DataFrame({
    'Method': ['Label Encoding', 'One-Hot Encoding', 'Ordinal Encoding', 'Target Encoding'],
    'Best For': ['Target labels, tree models', 'Linear models, nominal data', 
                 'Ordered categories', 'High cardinality'],
    'Creates Columns': ['1', 'k (or k-1)', '1', '1'],
    'Preserves Order': ['No (sorted order)', 'No', 'Yes', 'No'],
    'Risk': ['False ordering', 'Dimensionality', 'Assumes equal spacing', 'Target leakage']
})

print(encoding_comparison.to_string(index=False))

---
## 4. Feature Creation

Creating new features from existing ones can significantly improve model performance.

### 4.1 Polynomial Features

Creates polynomial combinations of features up to a specified degree.

For features [a, b] with degree=2:
- Output: [1, a, b, a², ab, b²]
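
The leading `1` in that output is the bias column, which scikit-learn omits when `include_bias=False`. The number of output columns also grows quickly: for $n$ features and degree $d$ it is $\binom{n+d}{d}$ with the bias column included. The sketch below checks this count against `PolynomialFeatures`. The helper name `n_polynomial_terms` is illustrative.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= `degree`, including the constant term."""
    return comb(n_features + degree, degree)

for n_features, degree in [(2, 2), (3, 2), (10, 3)]:
    poly = PolynomialFeatures(degree=degree)  # include_bias=True by default
    poly.fit(np.zeros((1, n_features)))
    print(f"n={n_features}, d={degree}: formula={n_polynomial_terms(n_features, degree)}, "
          f"sklearn={poly.n_output_features_}")
```

Ten features at degree 3 already produce several hundred columns, which is why polynomial expansion is usually limited to a low degree or a few selected features.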

In [None]:
# Polynomial Features Example
print("Polynomial Features Demonstration")
print("="*60)

# Sample data
simple_data = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10]
})

print("Original Features:")
print(simple_data)

# Create polynomial features (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(simple_data)
poly_names = poly.get_feature_names_out(['x1', 'x2'])

poly_df = pd.DataFrame(poly_features, columns=poly_names)
print("\nPolynomial Features (degree=2):")
print(poly_df)

In [None]:
# Manual polynomial feature creation
def create_polynomial_features_manual(df, features, degree=2):
    """
    Manually create polynomial features
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input DataFrame
    features : list
        List of feature names to use
    degree : int
        Maximum polynomial degree
    
    Returns:
    --------
    result_df : pd.DataFrame
        DataFrame with polynomial features added
    """
    result = df[features].copy()
    
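    # Add pure power terms (x^2, ..., x^degree) for each feature
    for feat in features:
        for power in range(2, degree + 1):
            result[f'{feat}^{power}'] = df[feat] ** power
    
    # Add pairwise interaction terms (only when degree >= 2).
    # Higher-order mixed terms such as x1^2 * x2 are not generated here.
    if degree >= 2:
        for i, feat1 in enumerate(features):
            for feat2 in features[i+1:]:
                result[f'{feat1}*{feat2}'] = df[feat1] * df[feat2]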
    
    return result

# Test manual implementation
poly_manual = create_polynomial_features_manual(simple_data, ['x1', 'x2'], degree=2)
print("Manual Polynomial Features:")
print(poly_manual)

### 4.2 Interaction Features

Capture relationships between multiple features.

In [None]:
# Interaction Features
print("Interaction Features Demonstration")
print("="*60)

# Using our sample data
interaction_df = sample_data[['age', 'income', 'transaction_amount']].copy()

# Create interaction features
interaction_df['age_income_interaction'] = interaction_df['age'] * interaction_df['income']
interaction_df['income_per_age'] = interaction_df['income'] / interaction_df['age']
interaction_df['transaction_to_income_ratio'] = interaction_df['transaction_amount'] / interaction_df['income']

print("Original + Interaction Features:")
print(interaction_df.head(10))
print("\nNew Features Statistics:")
print(interaction_df[['age_income_interaction', 'income_per_age', 'transaction_to_income_ratio']].describe())

### 4.3 Aggregation Features

Create summary statistics at different granularities.
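
When only one or two group statistics are needed, `groupby(...).transform(...)` is a compact option: it returns one value per original row, already aligned with the DataFrame, so no merge is required. A minimal sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'B'],
    'income': [40_000, 60_000, 30_000, 60_000, 90_000],  # made-up values
})

# transform returns one value per original row, aligned on the index
df['city_income_mean'] = df.groupby('city')['income'].transform('mean')
df['income_vs_city_mean'] = df['income'] / df['city_income_mean']
print(df)
```

A ratio above 1 means the row's income is higher than the average for its city.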

In [None]:
# Aggregation Features
print("Aggregation Features Demonstration")
print("="*60)

# Calculate aggregations by city
city_agg = sample_data.groupby('city').agg({
    'income': ['mean', 'median', 'std', 'min', 'max'],
    'transaction_amount': ['mean', 'sum', 'count']
}).reset_index()

# Flatten column names
city_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in city_agg.columns]
city_agg = city_agg.rename(columns={'city_': 'city'})

print("City-level Aggregation Features:")
print(city_agg)

# Merge back to original data
sample_with_agg = sample_data.merge(city_agg, on='city', how='left')

# Create relative features
sample_with_agg['income_vs_city_avg'] = sample_with_agg['income'] / sample_with_agg['income_mean']
sample_with_agg['income_city_percentile'] = sample_with_agg.groupby('city')['income'].transform(
    lambda x: x.rank(pct=True)
)

print("\nRelative Features:")
print(sample_with_agg[['city', 'income', 'income_mean', 'income_vs_city_avg', 'income_city_percentile']].head(15))

### 4.4 Domain-Specific Features

Features created based on domain knowledge.
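
Binning is a typical domain-driven feature, and the bin edges deserve a second look. By default `pd.cut` uses right-closed intervals such as (25, 35], so a value that sits exactly on an edge falls into the lower bin. The sketch below compares the default with `right=False` on a few boundary ages chosen for illustration.

```python
import pandas as pd

ages = pd.Series([25, 26, 65, 66])
bins = [0, 25, 35, 45, 55, 65, 100]

right_closed = pd.cut(ages, bins=bins)              # default: intervals like (25, 35]
left_closed = pd.cut(ages, bins=bins, right=False)  # intervals like [25, 35)

print(pd.DataFrame({
    'age': ages,
    'right_closed': right_closed,
    'left_closed': left_closed,
}))
```

Whichever convention is used, the bin labels should describe the intervals that are actually produced.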

In [None]:
# Domain-Specific Features for Customer Data
print("Domain-Specific Features")
print("="*60)

# Age-based features
sample_data['age_group'] = pd.cut(sample_data['age'], 
                                   bins=[0, 25, 35, 45, 55, 65, 100],
                                   labels=['18-25', '26-35', '36-45', '46-55', '56-65', '66+'])

# Income-based features
sample_data['income_bracket'] = pd.qcut(sample_data['income'], q=5, 
                                         labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Transaction patterns
sample_data['is_high_spender'] = (sample_data['transaction_amount'] > 
                                   sample_data['transaction_amount'].quantile(0.75)).astype(int)

# Log transformations for skewed distributions
sample_data['log_income'] = np.log1p(sample_data['income'])
sample_data['log_transaction'] = np.log1p(sample_data['transaction_amount'])

print("New Domain Features:")
print(sample_data[['age', 'age_group', 'income', 'income_bracket', 
                   'transaction_amount', 'is_high_spender', 'log_income']].head(15))

In [None]:
# Visualize log transformation effect
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Original income distribution
axes[0, 0].hist(sample_data['income'], bins=50, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Original Income Distribution (Skewed)', fontsize=12)
axes[0, 0].set_xlabel('Income')

# Log-transformed income
axes[0, 1].hist(sample_data['log_income'], bins=50, color='lightgreen', edgecolor='black')
axes[0, 1].set_title('Log-Transformed Income (More Normal)', fontsize=12)
axes[0, 1].set_xlabel('Log(Income)')

# Original transaction distribution
axes[1, 0].hist(sample_data['transaction_amount'], bins=50, color='salmon', edgecolor='black')
axes[1, 0].set_title('Original Transaction Distribution (Skewed)', fontsize=12)
axes[1, 0].set_xlabel('Transaction Amount')

# Log-transformed transaction
axes[1, 1].hist(sample_data['log_transaction'], bins=50, color='plum', edgecolor='black')
axes[1, 1].set_title('Log-Transformed Transaction (More Normal)', fontsize=12)
axes[1, 1].set_xlabel('Log(Transaction)')

plt.suptitle('Effect of Log Transformation on Skewed Distributions', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
## 5. Handling DateTime Features

DateTime features contain rich information that can be extracted for ML models.

### Common DateTime Extractions:
- Year, Month, Day
- Hour, Minute, Second
- Day of Week
- Week of Year
- Quarter
- Is Weekend
- Time Since Event
- Cyclical Encoding
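
All of these extractions rely on the column having a datetime dtype. Raw data often stores dates as strings, some of them invalid, so a parsing step usually comes first. A minimal sketch, assuming ISO-formatted strings and made-up inputs:

```python
import pandas as pd

raw = pd.Series(['2023-01-15', 'not a date', '2023-02-30'])  # made-up inputs

# errors='coerce' turns values that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print("Rows that failed to parse:", int(parsed.isna().sum()))
```

Counting the `NaT` values after parsing is a quick way to see how many rows need cleaning before any `.dt` features are extracted.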

In [None]:
# DateTime Feature Engineering
print("DateTime Feature Engineering")
print("="*60)

# Create a copy with datetime features
datetime_df = sample_data[['purchase_date', 'transaction_amount']].copy()

# Ensure datetime type
datetime_df['purchase_date'] = pd.to_datetime(datetime_df['purchase_date'])

print("Original DateTime:")
print(datetime_df['purchase_date'].head(10))

In [None]:
# Extract basic datetime components
datetime_df['year'] = datetime_df['purchase_date'].dt.year
datetime_df['month'] = datetime_df['purchase_date'].dt.month
datetime_df['day'] = datetime_df['purchase_date'].dt.day
datetime_df['hour'] = datetime_df['purchase_date'].dt.hour
datetime_df['minute'] = datetime_df['purchase_date'].dt.minute

# Additional datetime features
datetime_df['day_of_week'] = datetime_df['purchase_date'].dt.dayofweek  # 0=Monday, 6=Sunday
datetime_df['day_name'] = datetime_df['purchase_date'].dt.day_name()
datetime_df['week_of_year'] = datetime_df['purchase_date'].dt.isocalendar().week.astype(int)  # isocalendar() returns UInt32; cast to plain int
datetime_df['quarter'] = datetime_df['purchase_date'].dt.quarter
datetime_df['is_weekend'] = (datetime_df['day_of_week'] >= 5).astype(int)
datetime_df['is_month_start'] = datetime_df['purchase_date'].dt.is_month_start.astype(int)
datetime_df['is_month_end'] = datetime_df['purchase_date'].dt.is_month_end.astype(int)

print("Extracted DateTime Features:")
print(datetime_df.head(15))

In [None]:
# Time-based features
reference_date = datetime_df['purchase_date'].max()
datetime_df['days_since_purchase'] = (reference_date - datetime_df['purchase_date']).dt.days

# Part of day
def get_part_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

datetime_df['part_of_day'] = datetime_df['hour'].apply(get_part_of_day)

print("Additional Time Features:")
print(datetime_df[['purchase_date', 'hour', 'part_of_day', 'days_since_purchase']].head(15))
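
The `apply` call above runs a Python function once per row. As a hedged alternative, the same buckets can be produced in one vectorized step with `pd.cut`. The sketch below uses a hypothetical `hours` Series rather than `datetime_df`, and assumes the same boundaries as `get_part_of_day`.

```python
import pandas as pd

# Hypothetical hours chosen to hit every boundary of the get_part_of_day rules
hours = pd.Series([0, 4, 5, 11, 12, 16, 17, 20, 21, 23])

# Left-closed bins: [0, 5) Night, [5, 12) Morning, [12, 17) Afternoon,
# [17, 21) Evening, [21, 24) Night.
# 'Night' appears twice, so pandas requires ordered=False.
part_of_day = pd.cut(
    hours,
    bins=[0, 5, 12, 17, 21, 24],
    labels=['Night', 'Morning', 'Afternoon', 'Evening', 'Night'],
    right=False,
    ordered=False,
)

print(pd.DataFrame({'hour': hours, 'part_of_day': part_of_day}))
```

The result is a categorical column, which can be passed straight to one-hot encoding.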

In [None]:
# Cyclical Encoding for DateTime Features
print("Cyclical Encoding for DateTime Features")
print("="*60)
print("""
Problem with simple numerical encoding:
- Hour 23 and Hour 0 are far apart numerically (23 - 0 = 23)
- But they are actually close in time (1 hour apart)

Solution: Cyclical encoding using sine and cosine
- Maps cyclical values to a circle
- Hour 23 and Hour 0 become adjacent on the circle
""")

def cyclical_encode(df, col, max_val):
    """
    Encode cyclical features using sin/cos transformation
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input DataFrame
    col : str
        Column name to encode
    max_val : int
        Period of the cycle, i.e. the number of distinct values (e.g., 24 for hours, 7 for days)
    
    Returns:
    --------
    sin_col, cos_col : pd.Series
        Sine and cosine encoded values
    """
    sin_col = np.sin(2 * np.pi * df[col] / max_val)
    cos_col = np.cos(2 * np.pi * df[col] / max_val)
    return sin_col, cos_col

# Apply cyclical encoding
datetime_df['hour_sin'], datetime_df['hour_cos'] = cyclical_encode(datetime_df, 'hour', 24)
datetime_df['dow_sin'], datetime_df['dow_cos'] = cyclical_encode(datetime_df, 'day_of_week', 7)
datetime_df['month_sin'], datetime_df['month_cos'] = cyclical_encode(datetime_df, 'month', 12)

print("Cyclical Encoded Features:")
print(datetime_df[['hour', 'hour_sin', 'hour_cos', 'day_of_week', 'dow_sin', 'dow_cos']].head(15))
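
To check the claim that hour 23 and hour 0 become neighbours, one can measure distances in the (sin, cos) plane. The helper names `hour_to_point` and `circle_distance` below are illustrative and not part of the lesson code.

```python
import numpy as np

def hour_to_point(hour, period=24):
    """Map an hour to its (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * hour / period
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return np.linalg.norm(hour_to_point(h1) - hour_to_point(h2))

# Raw encoding: 23 vs 0 looks like the largest possible gap
print(f"Raw difference 23 vs 0:     {abs(23 - 0)}")

# Cyclical encoding: 23 vs 0 is exactly as close as any other adjacent pair
print(f"Cyclical distance 23 vs 0:  {circle_distance(23, 0):.3f}")
print(f"Cyclical distance 11 vs 12: {circle_distance(11, 12):.3f}")

# Hours on opposite sides of the clock are the farthest apart
print(f"Cyclical distance 0 vs 12:  {circle_distance(0, 12):.3f}")
```

Every pair of adjacent hours sits at the same distance, 2·sin(π/24), and the largest possible distance is 2 (the circle's diameter).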

In [None]:
# Visualize cyclical encoding
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Hour encoding
hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

axes[0].scatter(hour_cos, hour_sin, c=hours, cmap='viridis', s=100)
for i, hour in enumerate(hours):
    axes[0].annotate(str(hour), (hour_cos[i], hour_sin[i]), fontsize=8)
axes[0].set_xlabel('Cosine')
axes[0].set_ylabel('Sine')
axes[0].set_title('Hour Cyclical Encoding', fontsize=12)
axes[0].set_aspect('equal')

# Day of week encoding
days = np.arange(7)
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_sin = np.sin(2 * np.pi * days / 7)
day_cos = np.cos(2 * np.pi * days / 7)

axes[1].scatter(day_cos, day_sin, c=days, cmap='tab10', s=100)
for i, day in enumerate(day_names):
    axes[1].annotate(day, (day_cos[i], day_sin[i]), fontsize=8)
axes[1].set_xlabel('Cosine')
axes[1].set_ylabel('Sine')
axes[1].set_title('Day of Week Cyclical Encoding', fontsize=12)
axes[1].set_aspect('equal')

# Month encoding
months = np.arange(1, 13)
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

axes[2].scatter(month_cos, month_sin, c=months, cmap='hsv', s=100)
for i, month in enumerate(month_names):
    axes[2].annotate(month, (month_cos[i], month_sin[i]), fontsize=8)
axes[2].set_xlabel('Cosine')
axes[2].set_ylabel('Sine')
axes[2].set_title('Month Cyclical Encoding', fontsize=12)
axes[2].set_aspect('equal')

plt.suptitle('Cyclical Encoding Visualization', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Complete DateTime Feature Engineering Function
def engineer_datetime_features(df, datetime_col, reference_date=None):
    """
    Comprehensive datetime feature engineering
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input DataFrame
    datetime_col : str
        Name of datetime column
    reference_date : datetime, optional
        Reference date for calculating time differences
    
    Returns:
    --------
    result_df : pd.DataFrame
        DataFrame with engineered datetime features
    """
    result = pd.DataFrame()
    dt = pd.to_datetime(df[datetime_col])
    
    # Basic extractions
    result[f'{datetime_col}_year'] = dt.dt.year
    result[f'{datetime_col}_month'] = dt.dt.month
    result[f'{datetime_col}_day'] = dt.dt.day
    result[f'{datetime_col}_hour'] = dt.dt.hour
    result[f'{datetime_col}_dayofweek'] = dt.dt.dayofweek
    result[f'{datetime_col}_quarter'] = dt.dt.quarter
    result[f'{datetime_col}_weekofyear'] = dt.dt.isocalendar().week.astype(int)
    
    # Boolean features
    result[f'{datetime_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
    result[f'{datetime_col}_is_month_start'] = dt.dt.is_month_start.astype(int)
    result[f'{datetime_col}_is_month_end'] = dt.dt.is_month_end.astype(int)
    
    # Cyclical encoding
    result[f'{datetime_col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
    result[f'{datetime_col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
    result[f'{datetime_col}_dow_sin'] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
    result[f'{datetime_col}_dow_cos'] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)
    result[f'{datetime_col}_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
    result[f'{datetime_col}_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
    
    # Time since reference
    if reference_date is None:
        reference_date = dt.max()
    result[f'{datetime_col}_days_since'] = (reference_date - dt).dt.days
    
    return result

# Test the function
datetime_features = engineer_datetime_features(sample_data, 'purchase_date')
print("Complete DateTime Feature Engineering:")
print(f"Number of features created: {datetime_features.shape[1]}")
print(f"\nFeatures: {datetime_features.columns.tolist()}")
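
When `reference_date` is omitted, `engineer_datetime_features` falls back to the latest date in whatever rows it receives. The same timestamp can then get a different `days_since` value in the training set than in the test set. A minimal sketch of one way around this is to fix the reference once and reuse it. The `days_since` helper and the toy dates below are hypothetical.

```python
import pandas as pd

def days_since(dates, reference_date):
    """Whole days elapsed between each date and a fixed reference date."""
    return (reference_date - pd.to_datetime(dates)).dt.days

# Hypothetical train and test timestamps
train_dates = pd.Series(pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-31']))
test_dates = pd.Series(pd.to_datetime(['2023-01-10', '2023-01-20']))

# Fix the reference once (here: the latest training date) and reuse it everywhere
reference_date = train_dates.max()

train_days = days_since(train_dates, reference_date)
test_days = days_since(test_dates, reference_date)

print("Train:", train_days.tolist())
print("Test: ", test_days.tolist())
```

With a shared reference, 2023-01-10 maps to the same value in both sets.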

---
## 6. Assignment: Transform Raw Dataset to ML-Ready Format

### Task
Transform a raw dataset with mixed data types (numerical, categorical, datetime) into a machine learning-ready format.

### Dataset
We'll use a synthetic e-commerce dataset with:
- Customer information
- Purchase history
- Mixed data types

In [None]:
# Create the raw e-commerce dataset
np.random.seed(42)
n_samples = 2000

raw_data = pd.DataFrame({
    # Numerical features
    'customer_age': np.random.randint(18, 75, n_samples),
    'annual_income': np.random.exponential(60000, n_samples) + 25000,
    'account_age_days': np.random.randint(1, 3650, n_samples),
    'num_purchases': np.random.poisson(15, n_samples),
    'avg_purchase_value': np.random.exponential(80, n_samples) + 20,
    'website_visits': np.random.poisson(50, n_samples),
    'cart_abandonment_rate': np.random.beta(2, 5, n_samples),
    
    # Categorical features
    'gender': np.random.choice(['Male', 'Female', 'Other'], n_samples, p=[0.48, 0.48, 0.04]),
    'membership_type': np.random.choice(['Basic', 'Silver', 'Gold', 'Platinum'], n_samples,
                                        p=[0.4, 0.3, 0.2, 0.1]),
    'preferred_category': np.random.choice(['Electronics', 'Fashion', 'Home', 'Sports', 'Books'], n_samples),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'PayPal', 'Cash'], n_samples),
    'device_type': np.random.choice(['Desktop', 'Mobile', 'Tablet'], n_samples, p=[0.4, 0.5, 0.1]),
    'customer_segment': np.random.choice(['New', 'Regular', 'VIP', 'At-Risk'], n_samples,
                                         p=[0.25, 0.45, 0.15, 0.15]),
    
    # DateTime features
    # Lowercase 'h' and 'min' replace the 'H' and 'T' aliases deprecated in recent pandas releases
    'registration_date': pd.date_range('2019-01-01', periods=n_samples, freq='4h') +
                         pd.to_timedelta(np.random.randint(0, 30, n_samples), unit='D'),
    'last_purchase_date': pd.date_range('2023-06-01', periods=n_samples, freq='30min') +
                          pd.to_timedelta(np.random.randint(0, 180, n_samples), unit='D'),
    
    # Target variable (will the customer make a purchase in next 30 days?)
    'will_purchase': np.random.choice([0, 1], n_samples, p=[0.6, 0.4])
})

# Add some missing values for realism
missing_indices = np.random.choice(n_samples, 100, replace=False)
raw_data.loc[missing_indices[:50], 'annual_income'] = np.nan
raw_data.loc[missing_indices[50:], 'cart_abandonment_rate'] = np.nan

print("Raw E-Commerce Dataset")
print("="*60)
print(f"Shape: {raw_data.shape}")
print(f"\nColumn Types:\n{raw_data.dtypes}")
print(f"\nMissing Values:\n{raw_data.isnull().sum()[raw_data.isnull().sum() > 0]}")
print("\nFirst 10 rows:")
raw_data.head(10)

### Step 1: Data Exploration

In [None]:
# Analyze the raw data
print("Raw Data Summary")
print("="*60)

# Identify column types
numerical_cols = raw_data.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = raw_data.select_dtypes(include=['object']).columns.tolist()
datetime_cols = raw_data.select_dtypes(include=['datetime64']).columns.tolist()

print(f"\nNumerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"\nCategorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"\nDatetime columns ({len(datetime_cols)}): {datetime_cols}")

# Statistical summary
print("\nNumerical Statistics:")
print(raw_data[numerical_cols].describe().round(2))

In [None]:
# Categorical value counts
print("Categorical Value Distributions")
print("="*60)

for col in categorical_cols:
    print(f"\n{col}:")
    print(raw_data[col].value_counts())

### Step 2: Handle Missing Values

In [None]:
# Handle missing values
print("Handling Missing Values")
print("="*60)

# Create a copy for transformation
df = raw_data.copy()

# Fill numerical missing values with median
for col in ['annual_income', 'cart_abandonment_rate']:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)  # assign back; inplace fillna on a column selection is deprecated
    print(f"Filled {col} missing values with median: {median_val:.2f}")

print(f"\nMissing values after handling: {df.isnull().sum().sum()}")
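
The cell above computes each median from the full dataset. In a workflow that evaluates on held-out data, the median would normally be learned from the training rows only. The sketch below shows one way to do that with scikit-learn's `SimpleImputer`. The `toy` DataFrame is hypothetical and not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a few missing incomes
rng = np.random.default_rng(42)
toy = pd.DataFrame({'annual_income': rng.exponential(60000, 200) + 25000})
toy.loc[toy.sample(20, random_state=0).index, 'annual_income'] = np.nan

train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Learn the median from the training rows only, then reuse it on the test rows
imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train[['annual_income']])
test_filled = imputer.transform(test[['annual_income']])

print(f"Median learned from training data: {imputer.statistics_[0]:.2f}")
print(f"Missing after imputation: train={np.isnan(train_filled).sum()}, "
      f"test={np.isnan(test_filled).sum()}")
```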

### Step 3: Feature Engineering - Numerical Features

In [None]:
# Create new numerical features
print("Creating New Numerical Features")
print("="*60)

# Derived features
df['total_spending'] = df['num_purchases'] * df['avg_purchase_value']
df['purchase_frequency'] = df['num_purchases'] / (df['account_age_days'] / 30)  # purchases per month
df['visits_per_purchase'] = df['website_visits'] / (df['num_purchases'] + 1)  # +1 to avoid division by 0
df['income_per_purchase'] = df['annual_income'] / (df['num_purchases'] + 1)

# Log transformations for skewed features
df['log_annual_income'] = np.log1p(df['annual_income'])
df['log_total_spending'] = np.log1p(df['total_spending'])

# Binning continuous variables
df['age_group'] = pd.cut(df['customer_age'], 
                         bins=[0, 25, 35, 45, 55, 65, 100],
                         labels=['18-25', '26-35', '36-45', '46-55', '56-65', '66+'])

df['income_bracket'] = pd.qcut(df['annual_income'], q=5, 
                               labels=['Very_Low', 'Low', 'Medium', 'High', 'Very_High'])

print("New numerical features created:")
new_num_features = ['total_spending', 'purchase_frequency', 'visits_per_purchase', 
                    'income_per_purchase', 'log_annual_income', 'log_total_spending']
print(df[new_num_features].describe().round(2))
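
Whether a log transform is worthwhile can be checked by comparing skewness before and after. The sketch below does this on a hypothetical right-skewed Series generated like `annual_income`. It illustrates the check, not a result from the assignment data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed feature, similar in shape to annual_income
income = pd.Series(rng.exponential(60000, 2000) + 25000)

skew_before = income.skew()
skew_after = np.log1p(income).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after  log1p: {skew_after:.2f}")
```

A skewness closer to 0 after the transform suggests the log version is the more symmetric input for scale-sensitive models.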

### Step 4: Feature Engineering - DateTime Features

In [None]:
# Engineer datetime features
print("Engineering DateTime Features")
print("="*60)

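# Reference date: the day after the latest recorded purchase, so no elapsed-day count is negative
# (a fixed 2024-01-01 precedes some generated purchases and can make recency_score divide by zero)
reference_date = df['last_purchase_date'].max().normalize() + pd.Timedelta(days=1)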

# Registration date features
df['reg_year'] = df['registration_date'].dt.year
df['reg_month'] = df['registration_date'].dt.month
df['reg_dayofweek'] = df['registration_date'].dt.dayofweek
df['days_since_registration'] = (reference_date - df['registration_date']).dt.days

# Last purchase date features
df['last_purchase_month'] = df['last_purchase_date'].dt.month
df['last_purchase_dayofweek'] = df['last_purchase_date'].dt.dayofweek
df['days_since_last_purchase'] = (reference_date - df['last_purchase_date']).dt.days
df['last_purchase_is_weekend'] = (df['last_purchase_dayofweek'] >= 5).astype(int)

# Cyclical encoding for months
df['reg_month_sin'] = np.sin(2 * np.pi * df['reg_month'] / 12)
df['reg_month_cos'] = np.cos(2 * np.pi * df['reg_month'] / 12)
df['last_purchase_month_sin'] = np.sin(2 * np.pi * df['last_purchase_month'] / 12)
df['last_purchase_month_cos'] = np.cos(2 * np.pi * df['last_purchase_month'] / 12)

# Recency feature (important for customer behavior)
df['recency_score'] = 1 / (df['days_since_last_purchase'] + 1)  # Higher = more recent

print("DateTime features created:")
datetime_features = ['days_since_registration', 'days_since_last_purchase', 
                     'last_purchase_is_weekend', 'recency_score']
print(df[datetime_features].describe().round(4))

### Step 5: Encode Categorical Variables

In [None]:
# Encode categorical variables
print("Encoding Categorical Variables")
print("="*60)

# Ordinal encoding for ordered categories
membership_order = ['Basic', 'Silver', 'Gold', 'Platinum']
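# Assumed ranking by customer value; segments have no inherent order, so one-hot encoding is a reasonable alternative
segment_order = ['New', 'At-Risk', 'Regular', 'VIP']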

ordinal_encoder_membership = OrdinalEncoder(categories=[membership_order])
ordinal_encoder_segment = OrdinalEncoder(categories=[segment_order])

df['membership_encoded'] = ordinal_encoder_membership.fit_transform(df[['membership_type']]).flatten()
df['segment_encoded'] = ordinal_encoder_segment.fit_transform(df[['customer_segment']]).flatten()

print("\nOrdinal Encoding:")
print(f"Membership: {dict(zip(membership_order, range(4)))}")
print(f"Segment: {dict(zip(segment_order, range(4)))}")

# One-hot encoding for nominal categories
onehot_cols = ['gender', 'preferred_category', 'payment_method', 'device_type', 'age_group', 'income_bracket']

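# dtype=int keeps the dummy columns numeric (pandas >= 2.0 produces bool columns by default)
df_encoded = pd.get_dummies(df, columns=onehot_cols, drop_first=False, dtype=int)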

print(f"\nOne-Hot encoded columns: {onehot_cols}")
print(f"Total columns after encoding: {df_encoded.shape[1]}")

### Step 6: Scale Numerical Features

In [None]:
# Scale numerical features
print("Scaling Numerical Features")
print("="*60)

# Identify final numerical columns to scale
cols_to_scale = [
    'customer_age', 'annual_income', 'account_age_days', 'num_purchases',
    'avg_purchase_value', 'website_visits', 'cart_abandonment_rate',
    'total_spending', 'purchase_frequency', 'visits_per_purchase',
    'income_per_purchase', 'log_annual_income', 'log_total_spending',
    'days_since_registration', 'days_since_last_purchase', 'recency_score',
    'membership_encoded', 'segment_encoded'
]

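# Use StandardScaler
# Note: fitting on the full dataset keeps this walkthrough short, but it leaks test-set
# statistics; in practice, split first and fit the scaler on the training rows only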
scaler = StandardScaler()
df_encoded[cols_to_scale] = scaler.fit_transform(df_encoded[cols_to_scale])

print(f"Scaled {len(cols_to_scale)} numerical columns")
print("\nScaled features statistics (should have mean~0, std~1):")
print(df_encoded[cols_to_scale].describe().round(2).loc[['mean', 'std']])

### Step 7: Prepare Final ML-Ready Dataset

In [None]:
# Prepare final dataset
print("Preparing Final ML-Ready Dataset")
print("="*60)

# Drop original categorical and datetime columns, plus the intermediate date components
cols_to_drop = ['membership_type', 'customer_segment', 'registration_date', 
                'last_purchase_date', 'reg_year', 'reg_month', 'reg_dayofweek',
                'last_purchase_month', 'last_purchase_dayofweek']

df_final = df_encoded.drop(columns=cols_to_drop, errors='ignore')

# Separate features and target
X = df_final.drop(columns=['will_purchase'])
y = df_final['will_purchase']

print(f"Final dataset shape: {df_final.shape}")
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nTarget distribution:\n{y.value_counts(normalize=True).round(3)}")

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train-Test Split")
print("="*60)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining target distribution:\n{y_train.value_counts(normalize=True).round(3)}")
print(f"\nTest target distribution:\n{y_test.value_counts(normalize=True).round(3)}")
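
The assignment imputes and scales the full dataset before splitting, which is convenient for a walkthrough but lets test rows influence the fitted statistics. One possible leakage-free arrangement splits first and wraps the preprocessing in the `Pipeline` and `ColumnTransformer` classes imported in the setup cell. The sketch below assumes a hypothetical miniature dataset (`toy`), and uses separate variable names so it does not overwrite `X_train` or `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the e-commerce data
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    'annual_income': rng.exponential(60000, n) + 25000,
    'website_visits': rng.poisson(50, n).astype(float),
    'device_type': rng.choice(['Desktop', 'Mobile', 'Tablet'], n),
    'will_purchase': rng.choice([0, 1], n, p=[0.6, 0.4]),
})
toy.loc[toy.sample(25, random_state=0).index, 'annual_income'] = np.nan

X_toy = toy.drop(columns=['will_purchase'])
y_toy = toy['will_purchase']

# 1. Split first, so the test rows never influence any fitted statistic
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# 2. Describe the preprocessing per column type
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
preprocessor = ColumnTransformer(
    [
        ('num', numeric_steps, ['annual_income', 'website_visits']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['device_type']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# 3. Fit on the training split only, then apply the fitted transform to the test split
X_tr_ready = preprocessor.fit_transform(X_tr)
X_te_ready = preprocessor.transform(X_te)

print(f"Train shape: {X_tr_ready.shape}, Test shape: {X_te_ready.shape}")
print(f"Train means of scaled columns: {X_tr_ready[:, :2].mean(axis=0).round(3)}")
print(f"Test means of scaled columns:  {X_te_ready[:, :2].mean(axis=0).round(3)}")
```

The training means are zero by construction. The test means are only close to zero, which is expected because the test rows did not contribute to the fitted mean and standard deviation.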

In [None]:
# Final summary
print("\n" + "="*80)
print("FEATURE ENGINEERING SUMMARY")
print("="*80)

print(f"""
Original Dataset:
  - Samples: {raw_data.shape[0]}
  - Features: {raw_data.shape[1]}
  - Numerical: {len(numerical_cols)}
  - Categorical: {len(categorical_cols)}
  - DateTime: {len(datetime_cols)}
  - Missing values: {raw_data.isnull().sum().sum()}

Final ML-Ready Dataset:
  - Samples: {X.shape[0]}
  - Features: {X.shape[1]}
  - Missing values: {X.isnull().sum().sum()}
  - All features are numerical: {X.select_dtypes(include=[np.number]).shape[1] == X.shape[1]}

Transformations Applied:
  1. Missing value imputation (median)
  2. Feature creation (derived features)
  3. Log transformations (skewed features)
  4. Binning (age groups, income brackets)
  5. DateTime feature extraction
  6. Cyclical encoding (months)
  7. Ordinal encoding (membership, segment)
  8. One-hot encoding (gender, category, payment, device, age group, income bracket)
  9. Standard scaling (continuous and ordinal-encoded features)
  10. Train-test split (80-20)
""")

print("\nFeature List:")
for i, col in enumerate(X.columns, 1):
    print(f"  {i:2d}. {col}")

In [None]:
# Save the processed data
X_train.to_csv('X_train_processed.csv', index=False)
X_test.to_csv('X_test_processed.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

print("Processed data saved to CSV files!")
print("\nFiles created:")
print("  - X_train_processed.csv")
print("  - X_test_processed.csv")
print("  - y_train.csv")
print("  - y_test.csv")
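
The CSV files store the transformed values but not the fitted transformers, so new data cannot be processed identically from the files alone. One common option, sketched here under the assumption that `joblib` is available (it is installed as a scikit-learn dependency), is to save the fitted scaler as well. The arrays and the temporary path are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50, scale=10, size=(100, 3))  # hypothetical training features
X_new = rng.normal(loc=50, scale=10, size=(5, 3))    # hypothetical unseen rows

scaler = StandardScaler().fit(X_fit)

# Save the fitted scaler (a temporary directory is used here for illustration)
path = os.path.join(tempfile.mkdtemp(), 'scaler.joblib')
joblib.dump(scaler, path)

# Later, or in another process: reload and apply the identical transformation
restored = joblib.load(path)
same = np.allclose(scaler.transform(X_new), restored.transform(X_new))
print(f"Restored scaler reproduces the original transform: {same}")
```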

---
## Summary

### Key Takeaways

1. **Feature Scaling**
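   - StandardScaler: rescales to mean 0 and standard deviation 1 (a sensible default for roughly symmetric features)
   - MinMaxScaler: rescales to the range [0, 1] (suited to bounded data with few outliers)
   - RobustScaler: centers on the median and scales by the IQR (suited to data with outliers)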

2. **Categorical Encoding**
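   - Label/Ordinal Encoding: for ordinal data (with an explicit category order) or tree-based models
   - One-Hot Encoding: for nominal data, especially with linear or distance-based models
   - Target Encoding: for high-cardinality features (fit on training data only to avoid target leakage)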

3. **Feature Creation**
   - Polynomial features for non-linear relationships
   - Interaction features to capture combined effects
   - Domain-specific features using business knowledge

4. **DateTime Features**
   - Extract components (year, month, day, hour)
   - Create cyclical encodings for periodic features
   - Calculate time-based metrics (recency, age)

### Best Practices

- Fit scalers, imputers, and encoders on the training data only, then apply them to the test data
- Handle missing values before scaling
- Choose encoding based on data type and model
- Create features based on domain knowledge
- Document all transformations for reproducibility