# Module 10: Advanced Topics and Best Practices

---

This final module covers practical topics that bridge the gap between academic exercises and real-world machine learning work: handling imbalanced data, feature engineering strategies, model persistence, pipeline design, and a complete end-to-end project.

---

## Table of Contents

1. [Handling Imbalanced Data](#1.-Handling-Imbalanced-Data)
2. [Advanced Feature Engineering](#2.-Advanced-Feature-Engineering)
3. [Model Persistence (Saving and Loading)](#3.-Model-Persistence-(Saving-and-Loading))
4. [End-to-End ML Pipeline](#4.-End-to-End-ML-Pipeline)
5. [Common Pitfalls and Best Practices](#5.-Common-Pitfalls-and-Best-Practices)
6. [Capstone Exercise](#6.-Capstone-Exercise)
7. [Course Summary and Next Steps](#7.-Course-Summary-and-Next-Steps)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                              f1_score, roc_auc_score, roc_curve)
from sklearn.datasets import make_classification

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

---

## 1. Handling Imbalanced Data

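In many real-world problems (fraud detection, medical diagnosis, rare event prediction), one class vastly outnumbers the other. Most classifiers are trained to minimize an average loss over all samples, so the majority class dominates the fit. Overall accuracy then becomes misleading: when 95% of samples belong to one class, a model that always predicts that class reaches 95% accuracy while detecting no minority cases at all.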
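
To make the accuracy problem concrete, here is a minimal sketch that scores a majority-class baseline. The 950/50 label counts and the variable names are assumptions chosen for illustration; `DummyClassifier` stands in for any model that has collapsed onto the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 950 negatives and 50 positives (95% / 5%)
y_true = np.array([0] * 950 + [1] * 50)
X_dummy = np.zeros((len(y_true), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class
majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = majority.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"Accuracy: {acc:.2f}  Recall (minority): {rec:.2f}  F1 (minority): {f1:.2f}")
```

The baseline looks strong on accuracy yet never finds a single positive case, which is why the cells that follow report the minority-class F1 alongside accuracy.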

In [None]:
# Create an imbalanced dataset (95% class 0, 5% class 1)
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42, flip_y=0.01
)

print("Class distribution:")
unique, counts = np.unique(y_imb, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f"  Class {cls}: {cnt} samples ({cnt/len(y_imb):.1%})")

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Baseline: train without handling imbalance
lr_baseline = LogisticRegression(max_iter=1000, random_state=42)
lr_baseline.fit(X_train_i, y_train_i)
y_pred_bl = lr_baseline.predict(X_test_i)

print("BASELINE (no imbalance handling)")
print(f"  Accuracy: {accuracy_score(y_test_i, y_pred_bl):.4f}")
print(f"  F1 (minority class): {f1_score(y_test_i, y_pred_bl):.4f}")
print(f"\n{classification_report(y_test_i, y_pred_bl)}")

In [None]:
# Strategy 1: Class weight adjustment
lr_weighted = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr_weighted.fit(X_train_i, y_train_i)
y_pred_w = lr_weighted.predict(X_test_i)

# Strategy 2: SMOTE (Synthetic Minority Over-sampling Technique)
# Note: requires imbalanced-learn library
try:
    from imblearn.over_sampling import SMOTE
    smote = SMOTE(random_state=42)
    X_smote, y_smote = smote.fit_resample(X_train_i, y_train_i)
    
    lr_smote = LogisticRegression(max_iter=1000, random_state=42)
    lr_smote.fit(X_smote, y_smote)
    y_pred_s = lr_smote.predict(X_test_i)
    smote_available = True
    print(f"After SMOTE: {np.bincount(y_smote)} (balanced)")
except ImportError:
    smote_available = False
    print("imbalanced-learn not installed. Run: pip install imbalanced-learn")

# Strategy 3: Threshold adjustment
y_proba_bl = lr_baseline.predict_proba(X_test_i)[:, 1]
# 0.3 is an illustrative value; in practice, tune the threshold on validation data, not the test set
y_pred_thresh = (y_proba_bl >= 0.3).astype(int)

# Compare results
print("\nCOMPARISON")
print("=" * 60)
strategies = [
    ('Baseline', y_pred_bl),
    ('Class Weights', y_pred_w),
    ('Threshold=0.3', y_pred_thresh),
]
if smote_available:
    strategies.append(('SMOTE', y_pred_s))

for name, preds in strategies:
    print(f"  {name:>20s}: Accuracy={accuracy_score(y_test_i, preds):.4f}  F1(minority)={f1_score(y_test_i, preds):.4f}")

In [None]:
# Visualize confusion matrices side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 4.5))
matrices = [('Baseline', y_pred_bl), ('Class Weights', y_pred_w), ('Threshold=0.3', y_pred_thresh)]

for idx, (name, preds) in enumerate(matrices):
    cm = confusion_matrix(y_test_i, preds)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                square=True, linewidths=1, annot_kws={'size': 14})
    axes[idx].set_title(name, fontsize=13, fontweight='bold')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')

plt.suptitle('Confusion Matrices — Imbalanced Data Strategies', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
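
The fixed 0.3 cutoff used above is only one possible choice. The following sketch picks the threshold from data instead, assuming a separate validation split is available. The three-way split, the F1 selection criterion, and all variable names are assumptions of this example rather than part of the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical data mirroring the imbalanced setup used in this section
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], flip_y=0.01, random_state=42)

# Three-way split: train / validation (threshold selection) / test (final check)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# precision and recall have one more entry than thresholds, so drop the last point
val_proba = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
f1_per_threshold = (2 * precision[:-1] * recall[:-1]
                    / np.maximum(precision[:-1] + recall[:-1], 1e-12))
best_threshold = thresholds[np.argmax(f1_per_threshold)]

# Apply the chosen threshold once to the untouched test set
test_proba = model.predict_proba(X_test)[:, 1]
y_pred_default = (test_proba >= 0.5).astype(int)
y_pred_tuned = (test_proba >= best_threshold).astype(int)
print(f"Chosen threshold: {best_threshold:.3f}")
print(f"Test F1 at 0.5:   {f1_score(y_test, y_pred_default, zero_division=0):.4f}")
print(f"Test F1 at tuned: {f1_score(y_test, y_pred_tuned, zero_division=0):.4f}")
```

Selecting the threshold on validation data keeps the test set as an honest estimate. The tuned threshold is not guaranteed to beat 0.5 on the test set, especially with this few minority samples.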

---

## 2. Advanced Feature Engineering

Feature engineering — creating new features from existing data — often has a greater impact on model performance than algorithm selection.

In [None]:
# Create a synthetic customer dataset with numeric features
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10.5, 0.5, n).astype(int),
    'years_employed': np.random.randint(0, 40, n),
    'num_products': np.random.randint(1, 6, n),
    'credit_score': np.random.randint(300, 850, n),
    'balance': np.random.lognormal(8, 1.5, n).astype(int),
})

# Target: whether the customer churns.
# Note: & binds tighter than |, so the random term applies only to the low-balance rule.
data['churned'] = ((data['credit_score'] < 500) |
                   (data['num_products'] >= 4) |
                   ((data['balance'] < 1000) &
                    (np.random.random(n) > 0.7))).astype(int)

print("Original features:")
print(data.head())
print(f"\nShape: {data.shape}")

In [None]:
# Feature engineering techniques
data_fe = data.copy()

# 1. Ratio features
data_fe['income_per_year'] = data_fe['income'] / (data_fe['years_employed'] + 1)
data_fe['balance_to_income'] = data_fe['balance'] / (data_fe['income'] + 1)

# 2. Binning
data_fe['age_group'] = pd.cut(data_fe['age'], bins=[0, 25, 35, 50, 100],
                               labels=['Young', 'Mid', 'Senior', 'Elder'])
data_fe['credit_tier'] = pd.cut(data_fe['credit_score'], bins=[0, 500, 650, 750, 900],
                                 labels=['Poor', 'Fair', 'Good', 'Excellent'])

# 3. Interaction features (product of two columns)
data_fe['age_x_income'] = data_fe['age'] * data_fe['income']

# 4. Log transformations (for skewed distributions)
data_fe['log_income'] = np.log1p(data_fe['income'])
data_fe['log_balance'] = np.log1p(data_fe['balance'])

print("Feature-engineered dataset:")
print(data_fe[['age', 'income', 'income_per_year', 'balance_to_income',
               'age_group', 'credit_tier', 'log_income']].head())
print(f"\nOriginal features: {data.shape[1] - 1}")
print(f"After engineering: {data_fe.shape[1] - 1}")
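
The ratio and log features above are stateless: they depend only on the row being transformed. One way to reuse them at prediction time is to wrap them in a function and hand it to `FunctionTransformer`, so the same step can sit inside a scikit-learn pipeline. This is a sketch under that assumption. The function name `add_engineered_features` and the two sample rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add stateless ratio and log features; the input frame is not modified."""
    out = df.copy()
    out["income_per_year"] = out["income"] / (out["years_employed"] + 1)
    out["balance_to_income"] = out["balance"] / (out["income"] + 1)
    out["log_income"] = np.log1p(out["income"])
    out["log_balance"] = np.log1p(out["balance"])
    return out

# Wrap the function so it exposes fit/transform like any other transformer
feature_adder = FunctionTransformer(add_engineered_features)

# Two hypothetical customers, used only to show the transformer in action
sample = pd.DataFrame({
    "income": [40000, 90000],
    "years_employed": [3, 19],
    "balance": [1500, 0],
})
engineered = feature_adder.fit_transform(sample)
print(engineered.round(3))
```

Features that learn something from the data, such as quantile-based bins, would need a real `fit` step and should be fitted on training data only.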

In [None]:
# Compare model performance before and after feature engineering
from sklearn.ensemble import GradientBoostingClassifier

# Original features
X_orig = data.drop('churned', axis=1)
y_target = data['churned']

scores_orig = cross_val_score(
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    X_orig, y_target, cv=5, scoring='f1'
)

# Engineered features
X_eng = data_fe.drop('churned', axis=1)
# Encode the ordered bin categories as integer codes (this preserves their order)
cat_cols = X_eng.select_dtypes(include='category').columns
for col in cat_cols:
    X_eng[col] = X_eng[col].cat.codes

scores_eng = cross_val_score(
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    X_eng, y_target, cv=5, scoring='f1'
)

print("F1 Score Comparison (5-fold CV):")
print(f"  Original features:    {scores_orig.mean():.4f} (+/- {scores_orig.std():.4f})")
print(f"  Engineered features:  {scores_eng.mean():.4f} (+/- {scores_eng.std():.4f})")

---

## 3. Model Persistence (Saving and Loading)

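Once you have trained a model, you need to save it for later use (deployment, sharing, reproducibility). The two standard approaches are `pickle` and `joblib`. Both formats can execute arbitrary code when loaded, so only load files from sources you trust, and reload a model with the same scikit-learn version that saved it.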

In [None]:
import joblib
import pickle
import os

# Train a model
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)
scaler_p = StandardScaler()
X_train_ps = scaler_p.fit_transform(X_train_p)
X_test_ps = scaler_p.transform(X_test_p)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_ps, y_train_p)
print(f"Original test accuracy: {rf_model.score(X_test_ps, y_test_p):.4f}")

# Method 1: joblib (recommended for large numpy arrays)
joblib.dump(rf_model, 'model_rf.joblib')
joblib.dump(scaler_p, 'scaler.joblib')
print(f"\nSaved model with joblib: {os.path.getsize('model_rf.joblib') / 1024:.1f} KB")

# Load and verify
rf_loaded = joblib.load('model_rf.joblib')
scaler_loaded = joblib.load('scaler.joblib')
X_test_verify = scaler_loaded.transform(X_test_p)
print(f"Loaded model accuracy: {rf_loaded.score(X_test_verify, y_test_p):.4f}")

# Method 2: pickle
with open('model_rf.pkl', 'wb') as f:
    pickle.dump(rf_model, f)
print(f"Saved model with pickle: {os.path.getsize('model_rf.pkl') / 1024:.1f} KB")

# Clean up
for path in ['model_rf.joblib', 'scaler.joblib', 'model_rf.pkl']:
    if os.path.exists(path):
        os.remove(path)
print("\nCleaned up saved files.")
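
Saving the scaler and the model as two files works, but they can drift out of sync. A common alternative is to persist a single pipeline object. The sketch below assumes that approach. The file name `pipeline_rf.joblib` is hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaler and model travel together, so they cannot get out of sync
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train, y_train)

# Write to a temporary directory that is removed automatically
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_rf.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)  # only load files from sources you trust

same_predictions = np.array_equal(model.predict(X_test), restored.predict(X_test))
print(f"Restored pipeline reproduces predictions: {same_predictions}")
```

The restored object accepts raw, unscaled features because the scaling step is stored inside it.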

---

## 4. End-to-End ML Pipeline

An end-to-end ML pipeline integrates all steps — data loading, preprocessing, feature engineering, model training, evaluation, and prediction — into a single reproducible workflow.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Create a realistic dataset with mixed types and missing values
np.random.seed(42)
n = 800

df = pd.DataFrame({
    'age': np.random.randint(18, 80, n).astype(float),
    'income': np.random.lognormal(10.5, 0.6, n),
    'credit_score': np.random.randint(300, 850, n).astype(float),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'employment': np.random.choice(['Employed', 'Self-Employed', 'Unemployed', 'Retired'], n),
    'balance': np.random.lognormal(8, 1.5, n),
})

# Introduce missing values
for col in ['age', 'income', 'credit_score']:
    mask = np.random.random(n) < 0.05
    df.loc[mask, col] = np.nan

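# Create target variable.
# Note: & binds tighter than |, so this is (credit AND income) OR random noise;
# rows with a missing income evaluate to False in the income comparison.
df['approved'] = (((df['credit_score'].fillna(500) > 550) &
                   (df['income'] > 30000)) |
                  (np.random.random(n) > 0.6)).astype(int)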

print(f"Dataset shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"\n{df.head()}")

In [None]:
# Define the full pipeline
X = df.drop('approved', axis=1)
y = df['approved']

# Identify column types
numeric_features = ['age', 'income', 'credit_score', 'balance']
categorical_features = ['education', 'employment']

# Preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocessing + model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

# Split data
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the entire pipeline
pipeline.fit(X_train_p, y_train_p)

print(f"Pipeline Test Accuracy: {pipeline.score(X_test_p, y_test_p):.4f}")
print(f"\n{classification_report(y_test_p, pipeline.predict(X_test_p))}")

In [None]:
# Hyperparameter tuning within the pipeline
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [3, 5],
    'classifier__learning_rate': [0.05, 0.1]
}

grid = GridSearchCV(
    pipeline, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1', n_jobs=-1, verbose=0
)

grid.fit(X_train_p, y_train_p)

print(f"Best Parameters: {grid.best_params_}")
print(f"Best CV F1: {grid.best_score_:.4f}")
# grid.score uses the scorer passed to GridSearchCV (F1 here), not accuracy
print(f"Test F1: {grid.score(X_test_p, y_test_p):.4f}")

print("\nThe pipeline ensures that:")
print("  1. Preprocessing is applied consistently to train and test data")
print("  2. No data leakage occurs during cross-validation")
print("  3. The entire workflow is reproducible and deployable as a single object")
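
To illustrate what "deployable as a single object" can look like, the sketch below fits a small pipeline and then scores raw records that contain a missing value and a category never seen in training. The tiny dataset, the column subset, the `Diploma` category, and the use of `LogisticRegression` are assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small hypothetical training set
train = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29],
    "income": [30000, 72000, 51000, 88000, 64000, 35000],
    "education": ["Bachelor", "Master", "Bachelor", "PhD", "Master", "High School"],
})
labels = [0, 1, 0, 1, 1, 0]

preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "income"]),
    # Unknown categories are encoded as all zeros instead of raising an error
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])
clf = Pipeline([("preprocessor", preprocess),
                ("classifier", LogisticRegression(max_iter=1000))])
clf.fit(train, labels)

# New raw records: one has a missing income, one has an unseen category
new_records = pd.DataFrame({
    "age": [36, 52],
    "income": [np.nan, 70000],
    "education": ["Master", "Diploma"],
})
predictions = clf.predict(new_records)
probabilities = clf.predict_proba(new_records)[:, 1]
print(predictions, probabilities.round(3))
```

The caller passes raw columns only. Imputation, scaling, and encoding all reuse the statistics learned during `fit`.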

---

## 5. Common Pitfalls and Best Practices

### Pitfalls to Avoid

| Pitfall | Problem | Solution |
|---------|---------|----------|
| Data leakage | Information from test set influences training | Use pipelines; scale/impute AFTER splitting |
| Overfitting | Model memorizes training data | Cross-validation, regularization, early stopping |
| Using accuracy on imbalanced data | Misleading metric (predict majority class = high accuracy) | Use F1, precision, recall, AUC |
| Not scaling features | Algorithms sensitive to scale give poor results | Always scale for distance-based models (KNN, SVM) |
| Cherry-picking results | Reporting best of many runs | Use cross-validation with fixed seeds |
| Ignoring feature distributions | Outliers and skewness affect models | Visualize distributions, apply transformations |

In [None]:
# Demonstrate data leakage
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_leak = cancer.data
y_leak = cancer.target

# WRONG: Scale before splitting (data leakage)
scaler_wrong = StandardScaler()
X_leak_scaled = scaler_wrong.fit_transform(X_leak)  # fit on ALL data including test
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_leak_scaled, y_leak, test_size=0.2, random_state=42
)
svm_wrong = SVC(random_state=42).fit(X_train_w, y_train_w)

# CORRECT: Split first, then scale
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_leak, y_leak, test_size=0.2, random_state=42
)
scaler_right = StandardScaler()
X_train_rs = scaler_right.fit_transform(X_train_r)  # fit only on training data
X_test_rs = scaler_right.transform(X_test_r)        # transform test with train stats
svm_right = SVC(random_state=42).fit(X_train_rs, y_train_r)

print("Data Leakage Demonstration")
print("=" * 50)
print(f"  With leakage (scale before split): {svm_wrong.score(X_test_w, y_test_w):.4f}")
print(f"  Without leakage (scale after split): {svm_right.score(X_test_rs, y_test_r):.4f}")
print("\nThe difference may be small here, but in production systems it can be significant.")
print("Always use pipelines to prevent leakage automatically.")
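
As a sketch of the pipeline-based alternative, the scaler and the SVM can be combined into one estimator and passed to `cross_val_score`. The fold count, the seed, and the name `leak_free_svm` are assumptions of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fit on the training portion of every fold,
# so held-out folds never influence the scaling statistics
leak_free_svm = make_pipeline(StandardScaler(), SVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(leak_free_svm, X, y, cv=cv, scoring="accuracy")

print(f"Leak-free CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Because the raw `X` goes in and the pipeline handles scaling internally, there is no opportunity to fit the scaler on held-out data by accident.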

### Best Practices Checklist

1. **Always visualize your data first** before building any model.
2. **Split data early**, before any preprocessing or feature engineering.
3. **Use pipelines** to prevent data leakage and ensure reproducibility.
4. **Start simple** — try logistic regression or a decision tree before deep learning.
5. **Use cross-validation** instead of a single train/test split.
6. **Choose metrics wisely** — accuracy is not always the right metric.
7. **Tune hyperparameters systematically** using GridSearchCV or RandomizedSearchCV.
8. **Document everything** — data sources, preprocessing steps, model choices, results.
9. **Save your models** along with the preprocessing pipeline.
10. **Monitor model performance** over time after deployment (model drift).
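
Item 7 mentions `RandomizedSearchCV`, which samples a fixed number of configurations from distributions instead of trying every grid point. A minimal sketch, assuming the breast cancer dataset and search ranges picked purely for illustration:

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Distributions to sample from (illustrative ranges, not tuned recommendations)
param_distributions = {
    "n_estimators": randint(50, 201),        # integers 50..200
    "max_depth": randint(2, 6),              # integers 2..5
    "learning_rate": loguniform(0.01, 0.3),  # log-uniform over [0.01, 0.3]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=8,  # number of sampled configurations
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV F1: {search.best_score_:.4f}")
print(f"Test F1: {search.score(X_test, y_test):.4f}")
```

The cost is controlled by `n_iter` rather than by the size of the grid, which helps when several hyperparameters have wide plausible ranges.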

---

## 6. Capstone Exercise

In [None]:
# CAPSTONE EXERCISE: End-to-End ML Project
#
# Build a complete ML pipeline on a classification problem:
#
# 1. DATA PREPARATION
#    - Generate or load a dataset (e.g., make_classification with 1000 samples, 15 features)
#    - Add some missing values randomly
#    - Add 2-3 categorical features
#    - Explore the data with visualizations
#
# 2. PREPROCESSING
#    - Build a ColumnTransformer pipeline:
#      * Impute missing values (median for numeric, mode for categorical)
#      * Scale numeric features
#      * One-hot encode categorical features
#
# 3. MODEL SELECTION
#    - Try at least 3 different classifiers
#    - Use 5-fold cross-validation to compare
#    - Select the best model
#
# 4. HYPERPARAMETER TUNING
#    - Use GridSearchCV or RandomizedSearchCV on the best model
#    - Report the best parameters
#
# 5. FINAL EVALUATION
#    - Train the final model on the full training set
#    - Evaluate on the test set
#    - Print classification report and confusion matrix
#    - Plot the ROC curve
#
# 6. SAVE
#    - Save the pipeline with joblib

# Your code here:


---

## 7. Course Summary and Next Steps

### Complete Course Overview

| Module | Topic | Key Takeaway |
|--------|-------|-------------|
| 1 | Introduction to ML | ML is learning patterns from data; three types: supervised, unsupervised, reinforcement |
| 2 | Mathematical Foundations | Linear algebra, statistics, probability, and calculus are the language of ML |
| 3 | Data Preprocessing | Raw data must be cleaned, encoded, and scaled before modeling |
| 4 | Regression | Predicting continuous values with linear models and regularization |
| 5 | Classification | Predicting categories with logistic regression, KNN, SVM, trees, and Naive Bayes |
| 6 | Model Evaluation | Confusion matrix, ROC/AUC, cross-validation, and hyperparameter tuning |
| 7 | Unsupervised Learning | Clustering (K-Means, DBSCAN) and dimensionality reduction (PCA, t-SNE) |
| 8 | Ensemble Methods | Combining models via bagging, boosting, and stacking for stronger predictions |
| 9 | Neural Networks | From perceptron to deep learning with Keras/TensorFlow |
| 10 | Advanced Topics | Imbalanced data, feature engineering, pipelines, and best practices |

### Recommended Next Steps

1. **Practice on real datasets**: Explore datasets on [Kaggle](https://www.kaggle.com/datasets) or the [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php).
2. **Participate in competitions**: Start with beginner competitions on Kaggle.
3. **Study deep learning in depth**: Work through TensorFlow tutorials or the fast.ai course.
4. **Learn MLOps**: Understand model deployment, monitoring, and maintenance.
5. **Read foundational books**:
   - Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*
   - James et al., *An Introduction to Statistical Learning* (ISLR)
   - Ian Goodfellow et al., *Deep Learning*

---

This concludes the Machine Learning Courseware. The skills you have developed — from data preprocessing to neural networks — provide a comprehensive foundation for applying machine learning to real-world problems.