
# **CHAPTER 7: SUPERVISED LEARNING - CLASSIFICATION**

*Decisions and Boundaries*

## **Chapter Overview**

Classification is the workhorse of industry ML, powering spam detection, fraud prevention, medical diagnosis, and image recognition. Unlike regression, it demands attention to decision boundaries, probability calibration, and the unequal costs of false negatives and false positives. This chapter runs from the probabilistic foundations through to handling severe class imbalance.

**Estimated Time:** 50-60 hours (3-4 weeks)  
**Prerequisites:** Chapters 1-6 (Math, Python, Preprocessing, Regression foundations)

---

## **7.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement logistic regression from scratch using maximum likelihood estimation
2. Select appropriate evaluation metrics beyond accuracy (F1, AUC-ROC, AUC-PR, Matthews Correlation Coefficient)
3. Handle class imbalance using resampling, cost-sensitive learning, and threshold optimization
4. Apply multiclass strategies (One-vs-Rest, One-vs-One, Softmax) for multi-category problems
5. Calibrate probability estimates to ensure they reflect true confidence levels
6. Build complete classification pipelines for imbalanced and cost-sensitive domains

---

## **7.1 Logistic Regression: The Probabilistic Foundation**

#### **7.1.1 From Linear to Logistic**

We want probabilities $P(y=1|\mathbf{x}) \in [0,1]$, but linear regression outputs $\in (-\infty, \infty)$. Use the sigmoid (logistic) function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The model:
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$

**Decision Boundary:** Where $P(y=1|\mathbf{x}) = 0.5$, i.e., $\mathbf{w}^T\mathbf{x} + b = 0$ (linear boundary).

```python
import numpy as np

def sigmoid(z):
    # Clip the input so np.exp cannot overflow, then the output so log() stays finite
    z = np.clip(z, -500, 500)
    return np.clip(1 / (1 + np.exp(-z)), 1e-7, 1 - 1e-7)

class LogisticRegression:
    def __init__(self, lr=0.01, n_iter=1000):
        self.lr = lr
        self.n_iter = n_iter
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.n_iter):
            # Linear model
            linear = np.dot(X, self.weights) + self.bias
            y_pred = sigmoid(linear)
            
            # Gradients (derived from cross-entropy loss)
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)
            
            # Update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            
        return self
    
    def predict_proba(self, X):
        linear = np.dot(X, self.weights) + self.bias
        return sigmoid(linear)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
```

#### **7.1.2 Cross-Entropy Loss (Log Loss)**

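Maximum likelihood estimation chooses the parameters that maximize the probability of the observed labels. Under a Bernoulli model for $y$, the negative log-likelihood averaged over the samples is the cross-entropy loss: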

$$\mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

**Why this loss?** It is convex in $\mathbf{w}$, differentiable, and penalizes confident wrong predictions heavily.

**Gradient:**
$$\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{n}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})$$

(Same form as linear regression, but $\hat{y}$ is now sigmoid-transformed.)
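
One way to sanity-check the gradient formula is to compare it against a finite-difference estimate. The sketch below assumes a small synthetic dataset and omits the bias term for brevity; the helper names (`log_loss`, `analytic_grad`, `numeric_grad`) are illustrative and not taken from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    # Mean binary cross-entropy for weights w (bias omitted for brevity)
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, X, y):
    # Closed form from the text: (1/n) * X^T (y_hat - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numeric_grad(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (log_loss(w + step, X, y) - log_loss(w - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))               # synthetic features (assumption)
y = (rng.random(50) < 0.5).astype(float)   # synthetic 0/1 labels (assumption)
w = rng.normal(size=3)

max_gap = np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)))
print(f"max |analytic - numeric| = {max_gap:.2e}")
```

If the two gradients agree to several decimal places, the derivation and the implementation are probably consistent. The same check is a cheap first test when you add L2 regularization in Lab 1.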

#### **7.1.3 Multiclass: Softmax Regression**

For $K$ classes, output vector $\mathbf{z} \in \mathbb{R}^K$. Softmax converts to probabilities:

$$P(y=k|\mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

**Loss:** Categorical Cross-Entropy
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}(y_i=k) \log(\hat{y}_{i,k})$$

```python
def softmax(z):
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)
```
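
As a rough illustration of how the softmax output feeds the categorical cross-entropy, the sketch below assumes integer class labels and hand-picked logits. With integer labels, the indicator $\mathbb{1}(y_i=k)$ reduces to picking out the log-probability of the true class. `categorical_cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def categorical_cross_entropy(logits, y):
    # y holds integer class labels; select the probability of the true class per row
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(probs[np.arange(n), y]))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0],            # equal logits give a uniform distribution
                   [1000.0, 0.0, -1000.0]])    # large values exercise the stability trick
y = np.array([0, 2, 0])

probs = softmax(logits)
loss = categorical_cross_entropy(logits, y)
print(probs.sum(axis=1))   # every row sums to 1
print(loss)
```

Without the max subtraction, `np.exp(1000.0)` would overflow to infinity and the third row would become `nan`.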

---

## **7.2 Evaluation Metrics: Beyond Accuracy**

**The Accuracy Paradox:** If 99% of emails are not spam, always predicting "not spam" gives 99% accuracy but catches no spam at all.
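
A few lines of NumPy make the paradox concrete. The labels below are synthetic (1,000 emails, 1% spam) and the "classifier" is simply a constant prediction.

```python
import numpy as np

# 1,000 emails, 1% spam (synthetic labels, for illustration only)
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)   # a classifier that always says "not spam"

accuracy = np.mean(y_pred == y_true)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}, spam recall = {recall:.2f}")
# accuracy = 0.99, spam recall = 0.00
```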

#### **7.2.1 The Confusion Matrix**

|                | Predicted Positive | Predicted Negative |
|----------------|-------------------|-------------------|
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
```

#### **7.2.2 Precision, Recall, and F1**

**Precision (Positive Predictive Value):** Of predicted positives, how many are actual?
$$\text{Precision} = \frac{TP}{TP + FP}$$

**Recall (Sensitivity, True Positive Rate):** Of actual positives, how many did we catch?
$$\text{Recall} = \frac{TP}{TP + FN}$$

**F1-Score:** Harmonic mean of precision and recall (stays low if either one is low)
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

**When to use what:**
- **Precision-focused:** Spam detection (don't block important emails), search ranking
- **Recall-focused:** Disease screening (don't miss sick patients), fraud detection
- **F1:** Balanced need for both

```python
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

print(classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1']))
```
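
To see how the three formulas interact, here is a worked example using made-up confusion-matrix counts. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts for a spam filter (illustrative only)
tp, fp, fn, tn = 80, 20, 40, 860

precision = tp / (tp + fp)     # 80 / 100 = 0.80
recall = tp / (tp + fn)        # 80 / 120, about 0.667
f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"arithmetic mean={arithmetic_mean:.3f}")
```

The harmonic mean lands below the arithmetic mean and closer to the weaker of the two scores, which is why F1 drops sharply when either precision or recall collapses.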

#### **7.2.3 ROC Curve and AUC**

**True Positive Rate (Recall):** $TPR = \frac{TP}{TP + FN}$  
**False Positive Rate:** $FPR = \frac{FP}{FP + TN}$

ROC Curve plots TPR vs FPR at various thresholds. AUC = Area Under Curve (1.0 = perfect, 0.5 = random).

```python
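import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Assumes `model` is a fitted classifier and X_test, y_test are held-out data
y_scores = model.predict_proba(X_test)[:, 1]  # Probability of positive class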
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
```
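
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it that way on a four-sample toy example; `auc_by_ranking` is an illustrative helper, and the pairwise comparison is quadratic in the sample count, so it only suits small inputs.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc_value = auc_by_ranking(y_true, scores)
print(auc_value)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of scores matters, AUC is unchanged by any monotonic rescaling of the model output. It therefore says nothing about calibration.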

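**When ROC AUC misleads:** Under severe class imbalance, the large number of true negatives keeps FPR low even when most positive predictions are wrong. Use Precision-Recall (PR) AUC instead.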

#### **7.2.4 Precision-Recall Curve and AUC-PR**

Plots precision against recall across thresholds. Because neither quantity uses true negatives, the curve is more informative than ROC when negatives dominate the data.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)

plt.plot(recall, precision, label=f'PR curve (AP = {avg_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
```

**Baseline for PR AUC:** A random classifier scores not 0.5 but roughly the proportion of positives in the data.

#### **7.2.5 Matthews Correlation Coefficient (MCC)**

A strong single-number summary for imbalanced binary classification, because it uses all four cells of the confusion matrix. Range $[-1, 1]$ (1 = perfect, 0 = random, -1 = inverse).

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

```python
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_true, y_pred)
```
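
The formula is easy to evaluate by hand from the four counts. The sketch below uses hypothetical counts and assumes the common convention of returning 0 when the denominator is zero, which happens whenever a whole row or column of the confusion matrix is empty.

```python
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Always predicting "negative" on 1% positives: 99% accuracy, but MCC = 0
always_negative = mcc(tp=0, tn=990, fp=0, fn=10)
# A classifier that actually finds most positives (hypothetical counts)
useful = mcc(tp=8, tn=980, fp=10, fn=2)

print(always_negative)   # 0.0
print(round(useful, 3))
```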

---

## **7.3 Handling Class Imbalance**

#### **7.3.1 Resampling Techniques**

**Random Undersampling:** Remove majority class samples. Fast but loses information.

**Random Oversampling:** Duplicate minority samples. Prone to overfitting.

**SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic examples by interpolating between nearest neighbors.
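
The interpolation step can be sketched in plain NumPy. This is a simplified illustration rather than imblearn's implementation: it assumes numeric features, no duplicate points, and brute-force neighbor search, and `smote_sample` is a hypothetical helper name.

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Create one synthetic minority point by interpolating toward a neighbor."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_min))                     # pick a minority sample at random
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]           # skip position 0 (the point itself)
    j = rng.choice(neighbors)                        # pick one of its k nearest neighbors
    lam = rng.random()                               # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))    # synthetic minority-class points (assumption)
synthetic = np.array([smote_sample(X_min, k=5, rng=rng) for _ in range(100)])
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two real minority points, so SMOTE fills in the region the minority class already occupies. It cannot create information about regions where no minority samples exist.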

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Combine SMOTE with undersampling
# A float sampling_strategy is the target minority/majority ratio after that step
resampling = ImbPipeline([
    ('over', SMOTE(sampling_strategy=0.5, k_neighbors=5)),  # Oversample minority to 50% of majority size
    ('under', RandomUnderSampler(sampling_strategy=0.8))     # Undersample majority until minority/majority = 0.8
])

X_res, y_res = resampling.fit_resample(X_train, y_train)
```

**Advanced:** BorderlineSMOTE (focus on hard examples), ADASYN (adaptive synthetic sampling).

#### **7.3.2 Class Weights**

Penalize mistakes on minority class more heavily.

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n w_{y_i} \left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
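
As a small worked example of the weighted loss, the sketch below uses synthetic labels with a 10% minority and the common "balanced" heuristic, which sets each class weight to `n_samples / (n_classes * class_count)`. `weighted_log_loss` is an illustrative helper, not a library function.

```python
import numpy as np

y = np.array([0] * 90 + [1] * 10)           # synthetic labels: 10% minority
counts = np.bincount(y)                      # [90, 10]
# "Balanced" heuristic: n_samples / (n_classes * count_per_class)
class_weights = len(y) / (len(counts) * counts)   # class 0 is about 0.56, class 1 is 5.0

def weighted_log_loss(y, p, class_weights):
    w = class_weights[y]                     # per-sample weight w_{y_i}
    return -np.mean(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))

p = np.full(len(y), 0.1)                     # a model that always outputs the base rate
plain = weighted_log_loss(y, p, np.ones(2))
weighted = weighted_log_loss(y, p, class_weights)
print(f"unweighted loss = {plain:.3f}, class-weighted loss = {weighted:.3f}")
```

With these weights, each class contributes the same total weight (90 × 0.556 ≈ 10 × 5 = 50), so a model can no longer achieve a low loss by ignoring the minority class.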

```python
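# scikit-learn's LogisticRegression, not the from-scratch class in 7.1.1
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Automatic balancing: weights are n_samples / (n_classes * class_count)
model = LogisticRegression(class_weight='balanced')

# Manual weights
class_weights = {0: 1, 1: 10}  # Penalize errors on class 1 (the minority) 10x more
model = RandomForestClassifier(class_weight=class_weights)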

#### **7.3.3 Threshold Tuning**

Default threshold is 0.5, but optimal threshold depends on business costs.

```python
from sklearn.metrics import f1_score, roc_curve

# Tune on a validation set where possible; tuning on the test set biases the final estimate
# Find threshold that maximizes F1
candidates = np.arange(0.1, 0.9, 0.05)
f1_scores = [f1_score(y_test, (y_scores >= t).astype(int)) for t in candidates]
optimal_threshold = candidates[np.argmax(f1_scores)]

# Or use Youden's J statistic for ROC
# J = Sensitivity + Specificity - 1 = TPR - FPR
fpr, tpr, roc_thresholds = roc_curve(y_test, y_scores)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = roc_thresholds[optimal_idx]  # index into the ROC thresholds, not the F1 grid
```

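**Cost-Sensitive Thresholding:**
If false negatives cost 5x more than false positives, correct predictions cost nothing, and the probabilities are calibrated, predict positive whenever $P(y=1|\mathbf{x})$ exceeds: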
$$\text{Threshold} = \frac{\text{Cost}_{FP}}{\text{Cost}_{FP} + \text{Cost}_{FN}} = \frac{1}{1+5} = 0.167$$
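
The threshold follows from comparing the expected cost of the two possible actions. The sketch below assumes calibrated probabilities and zero cost for correct decisions; `expected_costs` is an illustrative helper and the cost values are the 1:5 ratio from the example.

```python
def expected_costs(p, cost_fp, cost_fn):
    """Expected cost of each action given P(y=1) = p (assumed calibrated)."""
    cost_if_flag = (1 - p) * cost_fp    # flagged, but the case is actually negative
    cost_if_pass = p * cost_fn          # passed, but the case is actually positive
    return cost_if_flag, cost_if_pass

cost_fp, cost_fn = 1.0, 5.0
threshold = cost_fp / (cost_fp + cost_fn)   # 1/6, about 0.167

for p in (0.10, 0.30):
    flag, skip = expected_costs(p, cost_fp, cost_fn)
    decision = "flag" if flag < skip else "pass"
    print(f"p={p:.2f}  cost(flag)={flag:.2f}  cost(pass)={skip:.2f}  -> {decision}")
# p=0.10  cost(flag)=0.90  cost(pass)=0.50  -> pass
# p=0.30  cost(flag)=0.70  cost(pass)=1.50  -> flag
```

Flagging is cheaper exactly when $(1-p)\,\text{Cost}_{FP} < p\,\text{Cost}_{FN}$, which rearranges to $p > \text{Cost}_{FP} / (\text{Cost}_{FP} + \text{Cost}_{FN})$.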

---

## **7.4 Multiclass Classification Strategies**

#### **7.4.1 One-vs-Rest (OvR / One-vs-All)**

Train $K$ binary classifiers. Classify as the class with highest confidence score.

- **Pros:** Simple, parallelizable, works with any binary classifier
- **Cons:** Calibration issues (scores not comparable across classifiers), imbalanced binary problems

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr = OneVsRestClassifier(SVC(probability=True))
ovr.fit(X, y)  # y has classes 0, 1, 2, ...
```

#### **7.4.2 One-vs-One (OvO)**

Train $\binom{K}{2}$ classifiers (one per pair). Vote for final class.

- **Pros:** Each classifier trains on balanced subset (good for SVM)
- **Cons:** $O(K^2)$ classifiers (slow for many classes)

```python
from sklearn.multiclass import OneVsOneClassifier

ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)
```

#### **7.4.3 Multinomial (Softmax)**

Direct multi-class probability distribution. Native to logistic regression, neural networks, and gradient boosting.

- **Pros:** Probabilities sum to 1, often reasonably calibrated when trained with cross-entropy, efficient
- **Cons:** Requires classifier support (not all algorithms)

```python
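from sklearn.linear_model import LogisticRegression

# Logistic Regression with multinomial (softmax) loss.
# With solver='lbfgs' this is the default for multiclass targets in recent scikit-learn
# versions, where the older multi_class='multinomial' argument is deprecated.
model = LogisticRegression(solver='lbfgs')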
model.fit(X, y)
probs = model.predict_proba(X)  # Shape (n_samples, n_classes), sums to 1
```

---

## **7.5 Classification Algorithms**

#### **7.5.1 Naive Bayes**

Probabilistic classifier based on Bayes' theorem with "naive" independence assumption.

$$P(y_k | \mathbf{x}) \propto P(y_k) \prod_{i=1}^d P(x_i | y_k)$$

**Variants:**
- **GaussianNB:** Continuous features (assumes Gaussian distribution)
- **MultinomialNB:** Discrete counts (text classification)
- **ComplementNB:** Good for imbalanced text data

```python
from sklearn.naive_bayes import GaussianNB

# Extremely fast, works well with high dimensions despite "naive" assumption
model = GaussianNB()
model.fit(X_train, y_train)
```

**When to use:** Text classification (spam), very large feature spaces, baseline model, real-time prediction (extremely fast).

#### **7.5.2 K-Nearest Neighbors (KNN)**

Non-parametric: Classify based on majority vote of $k$ nearest training examples.

**Distance Metrics:**
- Euclidean: $\sqrt{\sum(x_i - y_i)^2}$
- Manhattan: $\sum|x_i - y_i|$ (robust to outliers)
- Cosine: $1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$ (good for text/high dimensions)
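
The three metrics can behave quite differently on the same pair of points. The sketch below implements them directly from the formulas above; the two vectors are chosen to point in the same direction with different lengths.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice the length

print(euclidean(a, b))        # 5.0
print(manhattan(a, b))        # 7.0
print(cosine_distance(a, b))  # approximately 0: cosine ignores magnitude
```

Euclidean and Manhattan distance both depend on feature scale, so KNN with those metrics usually needs standardized features.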

```python
from sklearn.neighbors import KNeighborsClassifier

# Weight by distance (closer neighbors count more)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='minkowski', p=2)
knn.fit(X_train, y_train)
```

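**Complexity:** Training is $O(1)$ with brute-force search (the data is just stored); prediction is $O(nd)$ per query (every point is scanned). KD-Tree or Ball-Tree indexes cut queries to roughly $O(d \log n)$ in low dimensions, but they cost build time up front and degrade toward brute force as $d$ grows.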

#### **7.5.3 Support Vector Machines (SVM)**

Find hyperplane that maximizes margin between classes.

**Kernel Trick:** Maps to high-dimensional space without explicit computation.

- **Linear:** $\mathbf{x} \cdot \mathbf{y}$ (fast, good for high dim)
- **RBF:** $\exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)$ (non-linear, universal approximator)
- **Polynomial:** $(\gamma \mathbf{x} \cdot \mathbf{y} + r)^d$
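
To make the RBF formula concrete, the sketch below builds a small kernel matrix in NumPy. `rbf_kernel_matrix` is an illustrative helper and the three points and `gamma=0.5` are arbitrary choices.

```python
import numpy as np

def rbf_kernel_matrix(X, Y, gamma):
    # Squared distances via ||x||^2 + ||y||^2 - 2 x.y, clamped against rounding error
    sq_dists = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [3.0, 4.0]])
K = rbf_kernel_matrix(X, X, gamma=0.5)
print(np.round(K, 4))
```

Each entry is a similarity in $(0, 1]$: identical points score 1 and the value decays with squared distance. A larger `gamma` makes the decay faster, so each training point influences a smaller neighborhood.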

```python
from sklearn.svm import SVC

# C: regularization (inverse of regularization strength)
# gamma: kernel coefficient ('scale' = 1/(n_features * X.var()))
svm = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True)  # probability=True enables predict_proba (slow)
svm.fit(X_train, y_train)
```

**When to use:** High-dimensional data (text, genomics), small-to-medium datasets (training scales poorly with $n$), and settings where interpretability is not required.

#### **7.5.4 Tree-Based Methods**

**Decision Tree:** Recursive partitioning to minimize impurity (Gini or Entropy).
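
Both impurity measures are simple functions of the class proportions in a node. The sketch below assumes integer class labels; `gini`, `entropy`, and `weighted_impurity` are illustrative helpers, and the candidate split is made up.

```python
import numpy as np

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]                        # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, measure):
    # Impurity after a split: child impurities weighted by child size
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

print(gini(parent), entropy(parent))           # 0.5 1.0
print(weighted_impurity(left, right, gini))    # 0.375
```

A tree picks the split with the largest impurity decrease, here 0.5 − 0.375 = 0.125 under Gini.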

```python
from sklearn.tree import DecisionTreeClassifier

# max_depth prevents overfitting
# min_samples_leaf ensures leaves have enough samples (smooths decision boundary)
tree = DecisionTreeClassifier(
    criterion='gini',  # or 'entropy'
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    class_weight='balanced'
)
```

**Random Forest:** Bagging of trees. Reduces variance.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_leaf=5,
    max_features='sqrt',  # sqrt(n_features) considered at each split
    class_weight='balanced_subsample',  # Balance in each bootstrap sample
    n_jobs=-1
)
```

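**Gradient Boosting:** XGBoost/LightGBM/CatBoost (see Chapter 6 for regression). The classification API mirrors regression; only the objective changes. In XGBoost, use `objective='binary:logistic'` or `objective='multi:softprob'`; LightGBM and CatBoost use their own names for the equivalent objectives.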

---

## **7.6 Probability Calibration**

Predicted probabilities should reflect true confidence. If a model predicts 0.8 for 100 samples, about 80 of them should be positive.

**Calibration Curves (Reliability Diagrams):**
Plot predicted probability vs actual frequency.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, y_scores, n_bins=10)

plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.legend()
```
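
The binning behind a reliability diagram also yields the Expected Calibration Error (ECE) that Lab 3 asks for. The sketch below is one common equal-width-bin formulation, not a library function; the data is simulated so that one set of labels matches the predicted probabilities and the other does not.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0 .. n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap        # weight by the share of samples in the bin
    return ece

rng = np.random.default_rng(0)
p = rng.random(20000)                                   # simulated predicted probabilities
y_calibrated = (rng.random(20000) < p).astype(int)      # labels drawn at exactly rate p
# Overconfident case: true rates are pulled toward 0.5 relative to the predictions
y_overconfident = (rng.random(20000) < 0.5 + 0.5 * (p - 0.5)).astype(int)

ece_good = expected_calibration_error(y_calibrated, p)
ece_bad = expected_calibration_error(y_overconfident, p)
print(f"calibrated ECE = {ece_good:.3f}, overconfident ECE = {ece_bad:.3f}")
```

ECE depends on the number of bins and on how samples spread across them, so report the binning scheme alongside the number.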

**Calibration Methods:**

1. **Platt Scaling (Sigmoid Calibration):** Fit logistic regression on model outputs.
2. **Isotonic Regression:** Non-parametric, monotonic calibration. Better if enough data.

```python
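from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Wrap base estimator (the argument is `estimator` in scikit-learn >= 1.2;
# older versions called it `base_estimator`)
calibrated = CalibratedClassifierCV(
    estimator=RandomForestClassifier(),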
    method='isotonic',  # or 'sigmoid'
    cv=5
)
calibrated.fit(X_train, y_train)
# Now predict_proba is calibrated
```

**When to calibrate:** When probabilities are used for decision-making (threshold selection, cost-sensitive learning, medical risk assessment).

---

## **7.7 Multilabel Classification**

Each sample can belong to multiple classes simultaneously (e.g., tags on a blog post).

**Strategies:**
- **Binary Relevance:** Train independent binary classifier per label
- **Classifier Chains:** Chain classifiers, using previous predictions as features
- **Label Powerset:** Treat each label combination as single class (explodes for many labels)

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

# Y is shape (n_samples, n_labels) with binary indicators
multi_label_model = MultiOutputClassifier(RandomForestClassifier())
multi_label_model.fit(X_train, Y_train)
predictions = multi_label_model.predict(X_test)  # Shape (n_samples, n_labels)
```

**Metrics:**
- **Hamming Loss:** Fraction of wrong labels (normalized by total labels)
- **Subset Accuracy:** Exact match of all labels (very strict)
- **Macro/Micro F1:** Average across labels
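
The gap between Hamming loss and subset accuracy is easiest to see on a tiny example. The indicator matrices below are made up: four samples, three labels each.

```python
import numpy as np

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],     # one of three labels wrong
                   [0, 1, 0],     # exact match
                   [1, 1, 0],     # exact match
                   [1, 0, 1]])    # one of three labels wrong

hamming_loss = np.mean(Y_true != Y_pred)                      # 2 wrong out of 12 label slots
subset_accuracy = np.mean(np.all(Y_true == Y_pred, axis=1))   # 2 exact rows out of 4

print(f"Hamming loss = {hamming_loss:.3f}, subset accuracy = {subset_accuracy:.2f}")
# Hamming loss = 0.167, subset accuracy = 0.50
```

A single wrong label costs a whole sample under subset accuracy but only one slot under Hamming loss, so the two metrics can tell very different stories as the number of labels grows.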

---

## **7.8 Workbook Labs**

### **Lab 1: Logistic Regression from Scratch**
Implement binary logistic regression with:
1. Batch gradient descent
2. L2 regularization
3. Learning rate decay
4. Early stopping based on validation loss

**Deliverable:** Match sklearn's `LogisticRegression` coefficients within 0.1% on test dataset.

### **Lab 2: Imbalanced Classification Challenge**
Credit card fraud detection (0.1% fraud rate):
1. Compare baseline vs SMOTE vs Class weights vs Threshold tuning
2. Plot Precision-Recall curves for all methods
3. Calculate cost savings: Assume FN costs $500, FP costs $10, find optimal operating point

**Deliverable:** Report showing money saved vs naive accuracy-based model.

### **Lab 3: Multiclass Calibration**
On 10-class classification (e.g., CIFAR-10 or MNIST):
1. Train Random Forest and SVM
2. Plot calibration curves per class
3. Apply temperature scaling or Platt scaling
4. Measure Expected Calibration Error (ECE) before/after

**Deliverable:** Visualization showing calibration improvement and ECE numbers.

### **Lab 4: Error Analysis**
Build a text classifier (e.g., sentiment analysis):
1. Identify false positives and false negatives
2. Cluster errors by type (confused classes, specific keywords)
3. Engineer features to fix top 3 error categories
4. Show error rate reduction per category

**Deliverable:** Error analysis report with before/after confusion matrices.

---

## **7.9 Common Pitfalls**

1. **Using Accuracy on Imbalanced Data:** Always check class distribution first. Use F1, MCC, or AUC-PR.

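2. **Data Leakage in SMOTE:** Never apply SMOTE before the train/test split, and inside cross-validation resample only the training folds. Synthetic samples interpolated from held-out points leak information.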

3. **Ignoring Decision Costs:** Not all errors are equal. Medical screening needs high recall even at cost of precision.

4. **Threshold at 0.5 by Default:** 0.5 is arbitrary. Tune based on business metrics or ROC/PR analysis.

5. **One-Hot Encoding Target for Binary Classification:** Don't create 2 columns for a binary target; the second column is redundant. Use a single column with 0/1.

6. **Not Calibrating Probabilities:** Using `predict_proba` from SVM or Random Forest directly for risk scoring without calibration.

---

## **7.10 Interview Questions**

**Q1:** Why does logistic regression use sigmoid and cross-entropy instead of MSE?
*A: MSE with sigmoid leads to non-convex loss (vanishing gradients when predictions are confident and wrong). Cross-entropy provides convex loss and stronger gradients for confident errors. Additionally, cross-entropy is the negative log-likelihood of Bernoulli distribution, making it the statistically principled choice.*

**Q2:** Explain the difference between macro-average and micro-average F1.
*A: Macro computes F1 per class and then takes the unweighted mean, so every class counts equally and rare classes influence the score as much as common ones. Micro pools TP, FP, and FN across all classes before computing F1, so it reflects overall per-sample performance and is dominated by frequent classes. Weighted averaging takes the per-class mean weighted by class support.*

**Q3:** When would you use PR AUC over ROC AUC?
*A: ROC AUC can look optimistic on imbalanced datasets because the many true negatives keep FPR low even when false positives outnumber true positives. PR AUC focuses on positive class performance (precision vs recall) and is more informative when positives are rare. Use PR AUC when the positive class is important and rare.*

**Q4:** How do you handle a dataset with 99.9% negatives and 0.1% positives?
*A: 1) Don't use accuracy. 2) Try class weights (inverse frequency). 3) Use appropriate sampling (SMOTEENN combination). 4) Use proper metrics (AUC-PR, F1, MCC). 5) Consider anomaly detection instead of classification. 6) Cost-sensitive learning if business costs known. 7) Use stratified sampling for train/test splits.*

**Q5:** What's the difference between One-vs-Rest and Softmax for multiclass?
*A: OvR trains K independent binary classifiers (each class vs all others). Scores aren't probabilities (don't sum to 1) and calibration varies. Softmax is a single model with joint probability distribution (mutually exclusive classes, probabilities sum to 1). Softmax is preferred when classes are mutually exclusive and model supports it; OvR works with any binary classifier.*

---

## **7.11 Further Reading**

**Books:**
- *Pattern Recognition and Machine Learning* (Bishop) - Bayesian classification, Gaussian processes
- *Applied Predictive Modeling* (Kuhn & Johnson) - Comprehensive metrics and preprocessing

**Papers:**
- "SMOTE: Synthetic Minority Over-sampling Technique" (Chawla et al., 2002)
- "Predicting Good Probabilities with Supervised Learning" (Niculescu-Mizil & Caruana, 2005) - Calibration

**Imbalanced Learning:**
- Imbalanced-learn documentation: https://imbalanced-learn.org/

---

## **7.12 Checkpoint Project: Fraud Detection System**

Build a production-ready fraud detection classifier for financial transactions.

**Dataset:** Credit card transactions (highly imbalanced, 0.1% fraud).

**Requirements:**

1. **Feature Engineering:**
   - Time-based features (velocity: transactions per hour)
   - Amount statistics (z-score relative to user history)
   - Merchant category encoding (target encoding with smoothing)
   - Geolocation features (distance from home, unusual locations)

2. **Modeling Strategy:**
   - Baseline: Logistic Regression with class weights
   - Advanced: XGBoost with scale_pos_weight tuning
   - Ensemble: Stacking of 3 models with different strengths

3. **Threshold Optimization:**
   - Business cost matrix: FN = $500 (missed fraud), FP = $10 (investigation cost)
   - Find threshold minimizing total cost, not maximizing F1

4. **Explainability:**
   - SHAP values for top fraud predictions (investigators need reasons)
   - Feature importance stability across time windows

5. **Monitoring:**
   - Concept drift detection (fraud patterns change)
   - Weekly retraining pipeline simulation

**Deliverables:**
- `fraud_detector/` package with train, predict, and explain modules
- API endpoint `/predict` returning {fraud_probability, explanation, threshold_recommendation}
- Report: "At optimal threshold, system saves $X per month vs current rule-based system"

**Success Criteria:**
- Catch >80% of fraud (recall) with <5% false positive rate
- Model inference time <50ms per transaction
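- Calibrated probabilities (Brier score <0.1; at a 0.1% fraud rate a constant base-rate prediction already scores about 0.001, so also report the improvement over that baseline)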

---

**End of Chapter 7**

*You can now classify data with proper evaluation and handle real-world challenges like imbalance. Chapter 8 will cover Unsupervised Learning — finding patterns without labels.*

