In [None]:
# Feature Engineering Workflow

**📊 Category**: ML Workflow

**👤 Author**: Data Science Team

**📅 Created**: 2024-01-15

**🔄 Last Updated**: 2024-01-15

**⏱️ Estimated Runtime**: 15-20 minutes

**🎯 Purpose**: Demonstrate advanced feature engineering techniques including automated feature selection, transformation, and preprocessing for machine learning workflows.

**📋 Prerequisites**: 
- Basic understanding of machine learning concepts
- Familiarity with pandas and scikit-learn
- Understanding of feature engineering principles

**📊 Datasets Used**:
- Synthetic retail dataset: Generated dataset with sales, product, and customer features
- Size: 10,000 rows, 15 features (plus two target columns)
- Target: Sales amount `target_sales` (regression) or high-value customer flag `target_high_value` (classification)

**🔧 Tools & Libraries**:
- pandas: Data manipulation and analysis
- scikit-learn: Machine learning algorithms and preprocessing
- matplotlib/seaborn: Data visualization
- numpy: Numerical computations
- Custom ml_utils: Advanced ML utilities

**📈 Key Outcomes**:
- Automated feature selection and engineering
- Comprehensive preprocessing pipeline
- Feature importance analysis
- Optimized feature set for model training

**🔗 Related Notebooks**:
- [02-model-training.ipynb](02-model-training.ipynb): Model training with engineered features
- [03-model-evaluation.ipynb](03-model-evaluation.ipynb): Model evaluation and validation
- [data-exploration-template.ipynb](../templates/data-exploration-template.ipynb): Data exploration template

**📝 Change Log**:
- v1.0.0 (2024-01-15): Initial implementation with automated feature engineering
- v1.0.1 (2024-01-15): Added polynomial features and interaction terms
- v1.0.2 (2024-01-15): Enhanced feature selection methods

---

## 📚 Table of Contents

1. [Environment Setup](#environment-setup)
2. [Data Loading and Exploration](#data-loading-and-exploration)
3. [Feature Preprocessing](#feature-preprocessing)
4. [Automated Feature Selection](#automated-feature-selection)
5. [Feature Engineering](#feature-engineering)
6. [Feature Validation](#feature-validation)
7. [Pipeline Creation](#pipeline-creation)
8. [Results Summary](#results-summary)
9. [Next Steps](#next-steps)

---


In [None]:
## 1. Environment Setup

This section sets up the environment, imports necessary libraries, and configures the notebook for feature engineering workflows.


In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
import os
import sys

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 12

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, SelectFromModel, RFE, f_classif, f_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Add shared utilities to path
sys.path.append('../shared')

# Import custom utilities
try:
    from ml_utils import FeatureEngineer, create_synthetic_dataset
    from visualization_utils import create_subplot_grid, plot_feature_importance
    from data_connectors import DataConnector
    print("✅ Successfully imported custom utilities")
except ImportError as e:
    print(f"⚠️  Warning: Could not import custom utilities: {e}")
    print("Using standard libraries only")

# Configuration
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("🚀 Environment setup complete!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎯 Random state: {RANDOM_STATE}")
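

In [None]:
Seeding the global NumPy state with `np.random.seed` makes the notebook repeatable, but any other code that draws from the global state shifts the sequence. One possible alternative, sketched here under the assumption of NumPy 1.17 or later, is a local `Generator` created with `np.random.default_rng`. The helper name `make_sample` is hypothetical and exists only for this illustration. The distribution parameters are taken from the dataset cell in this notebook. Note that a `Generator` produces different values than the legacy `np.random.*` functions for the same seed.

```python
import numpy as np

RANDOM_STATE = 42  # same seed value as the configuration above


def make_sample(seed: int, n_samples: int = 5) -> dict:
    """Draw a few retail-style columns from a local, seeded generator.

    Hypothetical helper for illustration; not part of ml_utils.
    """
    rng = np.random.default_rng(seed)
    return {
        "customer_age": rng.integers(18, 80, n_samples),
        "annual_income": rng.lognormal(10, 0.5, n_samples),
        "purchase_frequency": rng.poisson(12, n_samples),
    }


first = make_sample(RANDOM_STATE)
np.random.random(3)  # unrelated global draws do not affect the local generator
second = make_sample(RANDOM_STATE)

# Both calls used an identically seeded generator, so the draws match
for name in first:
    assert np.array_equal(first[name], second[name])
print("Identical draws from identically seeded generators")
```

Passing the generator (or the seed) explicitly into data-creation functions keeps each step reproducible on its own, regardless of the order in which cells are run.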


In [None]:
## 2. Data Loading and Exploration

Load the dataset and perform initial exploration to understand the data structure, distributions, and relationships between features.


In [None]:
# Create synthetic retail dataset for demonstration
print("📦 Creating synthetic retail dataset...")

# Generate synthetic data
np.random.seed(RANDOM_STATE)
n_samples = 10000

# Create retail-specific features
data = {
    'customer_age': np.random.randint(18, 80, n_samples),
    'annual_income': np.random.lognormal(10, 0.5, n_samples),
    'purchase_frequency': np.random.poisson(12, n_samples),
    'avg_basket_size': np.random.exponential(50, n_samples),
    'days_since_last_purchase': np.random.exponential(30, n_samples),
    'loyalty_score': np.random.beta(2, 5, n_samples) * 100,
    'product_category_electronics': np.random.binomial(1, 0.3, n_samples),
    'product_category_clothing': np.random.binomial(1, 0.4, n_samples),
    'product_category_food': np.random.binomial(1, 0.5, n_samples),
    'seasonal_factor': np.sin(np.random.uniform(0, 2*np.pi, n_samples)) + 1,
    'marketing_channel_online': np.random.binomial(1, 0.6, n_samples),
    'marketing_channel_email': np.random.binomial(1, 0.4, n_samples),
    'customer_segment': np.random.choice(['Premium', 'Standard', 'Budget'], n_samples, p=[0.2, 0.5, 0.3]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'is_weekend': np.random.binomial(1, 0.3, n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)

# Create target variable (sales amount) with realistic relationships
target_sales = (
    df['annual_income'] * 0.001 +
    df['purchase_frequency'] * 5 +
    df['avg_basket_size'] * 2 +
    df['loyalty_score'] * 1.5 +
    df['product_category_electronics'] * 50 +
    df['seasonal_factor'] * 20 +
    np.random.normal(0, 50, n_samples)  # Add noise
)

# Clip negative values to zero so that sales are never negative
df['target_sales'] = np.maximum(target_sales, 0)

# Create binary classification target
df['target_high_value'] = (df['target_sales'] > df['target_sales'].quantile(0.7)).astype(int)

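# Count only feature columns; the two target_* columns are not features
feature_cols = [col for col in df.columns if not col.startswith('target_')]
print(f"✅ Dataset created with {len(df)} samples and {len(feature_cols)} features")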
print(f"📊 Target variable (sales) range: ${df['target_sales'].min():.2f} - ${df['target_sales'].max():.2f}")
print(f"🎯 High-value customers: {df['target_high_value'].sum()} ({df['target_high_value'].mean()*100:.1f}%)")
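

In [None]:
Initial exploration of a frame like this one might look as follows. This is a minimal sketch under stated assumptions: it rebuilds a small stand-in frame (`demo`, 1,000 rows, a subset of the columns above) so the block runs on its own, and the helper `summarize_frame` is hypothetical, not part of `ml_utils`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Small stand-in for the retail frame (assumption: subset of the real columns)
demo = pd.DataFrame({
    "annual_income": rng.lognormal(10, 0.5, n),
    "purchase_frequency": rng.poisson(12, n),
    "avg_basket_size": rng.exponential(50, n),
    "customer_segment": rng.choice(
        ["Premium", "Standard", "Budget"], n, p=[0.2, 0.5, 0.3]
    ),
})
demo["target_sales"] = np.maximum(
    demo["annual_income"] * 0.001
    + demo["purchase_frequency"] * 5
    + demo["avg_basket_size"] * 2
    + rng.normal(0, 50, n),
    0,
)


def summarize_frame(frame: pd.DataFrame, target: str) -> dict:
    """Collect shape, missing counts, numeric summary, and target correlations."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,
        "missing": frame.isna().sum(),
        "describe": numeric.describe().T,
        "target_corr": (
            numeric.corr()[target].drop(target).sort_values(ascending=False)
        ),
    }


summary = summarize_frame(demo, "target_sales")
print(summary["shape"])
print(summary["missing"])
print(summary["describe"][["mean", "std", "min", "max"]].round(2))
print(summary["target_corr"].round(3))
print(demo["customer_segment"].value_counts(normalize=True).round(2))
```

The correlation ranking gives a quick check that the synthetic target behaves as constructed: features with a larger coefficient and a wider spread, such as `avg_basket_size`, should rank near the top.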

