# Week 3 Lecture 2

## Classification

This notebook follows the coding examples from Week 3 Lecture 1, focusing on the MNIST dataset.

We cover:
- Loading and exploring MNIST, and visualizing digits
- Preparing binary classification targets
- Training an SGD classifier and making predictions
- Evaluation: precision, recall, confusion matrix, ROC
- Multiclass, multilabel, and multioutput classification

In [None]:
# Import libraries
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

### Loading the MNIST Dataset

Scikit-Learn provides a helper function to fetch popular datasets, including MNIST.

In [None]:
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

# Explore the returned dictionary
print(mnist.keys())

Datasets loaded by Scikit-Learn usually have this structure:
- `data`: array with one row per instance, one column per feature
- `target`: array with the labels

In [None]:
# Extract features and labels
X, y = mnist["data"], mnist["target"]

print(X.shape)  # (70000, 784): 70,000 images, 784 pixel features each
print(y.shape)  # (70000,)

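Each image is 28×28 pixels, flattened into 784 features. Pixel values range from 0 (empty background) to 255 (full stroke intensity); with Matplotlib's `binary` colormap these are drawn as white and black respectively.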
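
To look at a digit, a flat row of 784 values has to be reshaped back into a 28×28 grid. The sketch below shows one way to do that. The helper name `plot_digit` and the synthetic `fake_row` are assumptions made for illustration, so the block runs without downloading MNIST.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_digit(image_data, ax=None):
    """Reshape a flat 784-value row into a 28x28 image and draw it."""
    image = np.asarray(image_data).reshape(28, 28)
    if ax is None:
        ax = plt.gca()
    ax.imshow(image, cmap="binary")  # "binary" draws 0 as white, 255 as black
    ax.axis("off")
    return image

# Synthetic stand-in for one row of X (an assumption, not real MNIST data):
# a blank image with a vertical stroke down the middle.
fake_row = np.zeros(784)
fake_row.reshape(28, 28)[4:24, 13:15] = 255  # reshape returns a view, so this edits fake_row

image = plot_digit(fake_row)
print(image.shape)
```

In the notebook itself, the equivalent call would be `plot_digit(X[0])` followed by `plt.show()`.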

In [None]:
print(y[0])  # labels are loaded as strings, so this is the string '5'

In [None]:
# Most ML algorithms expect numbers, so convert to integer
y = y.astype(np.uint8)
print(y[0])

### Train / Test Split

MNIST is already split: the first 60,000 images form the training set and the last 10,000 form the test set.
The training set is already shuffled, which keeps the cross-validation folds similar to each other and avoids order bias.

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

print(X_train.shape)  # (60000, 784)
print(X_test.shape)   # (10000, 784)

### Training a Binary Classifier (5-detector)

We simplify the task: detect whether a digit is 5 or not-5 → binary classification.

In [None]:
# Create target vectors for binary classification (5 vs not-5)
y_train_5 = (y_train == 5)   # True for all 5s, False for others
y_test_5  = (y_test == 5)

print(y_train_5[:10])   # example: shows True/False array

We use a **Stochastic Gradient Descent (SGD) classifier**.

In [None]:
from sklearn.linear_model import SGDClassifier

# Create and train the classifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

**Tip**: SGD relies on randomness during training (it shuffles the training instances), so set `random_state` to get reproducible results.
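
A quick way to see the effect of `random_state` is to train the same model twice with the same seed and compare the learned weights. This is a minimal sketch on synthetic data; `X_demo` and `y_demo` are made-up stand-ins, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Small synthetic binary problem (an assumption, used only for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + X_demo[:, 1] > 0

# Two independent fits with the same seed
clf_a = SGDClassifier(random_state=42).fit(X_demo, y_demo)
clf_b = SGDClassifier(random_state=42).fit(X_demo, y_demo)

# Same seed and same data give identical weights
same_weights = np.array_equal(clf_a.coef_, clf_b.coef_)
print(same_weights)
```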

In [None]:
# Predict on the first digit (which is a 5)
some_digit = X_train[0]  
print(sgd_clf.predict([some_digit]))

### Evaluation

Now we dive into **performance measures** for classifiers, starting with cross-validation accuracy, then confusion matrices, precision/recall/F1, the precision/recall trade-off, ROC curves, and some error analysis insights.

#### Measuring Accuracy Using Cross-Validation
A good way to evaluate the model is using K-fold cross-validation (as in Chapter 2).

In [None]:
from sklearn.model_selection import cross_val_score

# Evaluate the SGDClassifier with 3-fold CV
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Accuracy is high (~95%), but is it meaningful?

**Warning**: Accuracy can be misleading on skewed datasets. Only ~10% of the digits are 5s, so a classifier that always predicts not-5 would already be right about 90% of the time.
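
The point can be checked with a baseline that ignores its input. The sketch below uses Scikit-Learn's `DummyClassifier` on synthetic labels with exactly 10% positives; `X_demo` and `y_demo` are assumptions standing in for `X_train` and `y_train_5`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic skewed labels: exactly 100 positives out of 1000 (stand-in for "is it a 5?")
y_demo = (np.arange(1000) % 10 == 5)
X_demo = np.zeros((1000, 1))  # the features are irrelevant for this baseline

# Always predicts the most frequent class, here "not positive"
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline_accuracy)
```

A model that never detects a single positive still scores 90% accuracy here, which is why the following sections look at the confusion matrix, precision, and recall instead.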

#### Confusion Matrix

A better evaluation: count how often the model confuses one class with another.
Use `cross_val_predict` to get clean (out-of-fold) predictions: each prediction is made by a model that never saw that instance during training.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Get cross-validated predictions (not the same as fit/predict on full train)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# Confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_train_5, y_train_pred)
print(cm)

In [None]:
labels = ['Negative', 'Positive']
tn, fp, fn, tp = cm.ravel()

plt.figure(figsize=(5,5))
plt.imshow(cm, cmap=plt.cm.Blues)
plt.title("Confusion Matrix (Negative / Positive)")
plt.colorbar()
ticks = np.arange(len(labels))
plt.xticks(ticks, labels)
plt.yticks(ticks, labels)

annotations = np.array([['TN\n{}'.format(tn), 'FP\n{}'.format(fp)],
                        ['FN\n{}'.format(fn), 'TP\n{}'.format(tp)]])

for (ii, jj), text in np.ndenumerate(annotations):
    plt.text(jj, ii, text, ha='center', va='center',
             color='white' if cm[ii, jj] > cm.max()/2 else 'black', fontsize=12)

plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

#### Precision, Recall, and F1-Score

Precision = TP / (TP + FP) → of detected 5s, how many are real 5s?  
Recall = TP / (TP + FN) → of all real 5s, how many were detected?  
F1 = harmonic mean of precision and recall (balances both).

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_train_5, y_train_pred))
print("Recall:", recall_score(y_train_5, y_train_pred))
print("F1-score:", f1_score(y_train_5, y_train_pred))

Typical values:
- Precision: ~0.837 (83.7% of detected 5s are correct)
- Recall: ~0.651 (65.1% of real 5s found)
- F1: ~0.733 (balance)
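
The same three metrics can be computed by hand from the four confusion-matrix counts. The counts below are illustrative, chosen to be consistent with the typical values listed above. They are not output copied from this notebook.

```python
# Illustrative counts (an assumption), in the order returned by cm.ravel()
tn, fp, fn, tp = 53892, 687, 1891, 3530

precision = tp / (tp + fp)  # of the predicted 5s, the fraction that are real 5s
recall = tp / (tp + fn)     # of the real 5s, the fraction that were detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, F1 is only high when both precision and recall are high.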

Precision and recall trade off: higher precision often lowers recall (and vice versa).
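
The trade-off comes from the decision threshold: the classifier predicts positive when its score exceeds the threshold, so raising the threshold removes false positives but also loses true positives. Here is a small sketch with made-up scores and labels; the helper `precision_recall_at` is hypothetical.

```python
import numpy as np

# Made-up decision scores and true labels (assumptions for illustration)
scores = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)

def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when predicting positive for scores above the threshold."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (-1.5, 0, 3.5):
    p, r = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:>4}: precision={p:.2f}, recall={r:.2f}")
```

With these numbers, moving the threshold from -1.5 to 3.5 raises precision while recall falls.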

#### ROC Curve

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (recall) against the False Positive Rate at every decision threshold.
The Area Under the Curve (AUC) summarizes performance in a single number (1.0 = perfect, 0.5 = random guessing).
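
One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive instance gets the higher score. The sketch below checks this on made-up scores against Scikit-Learn's `roc_auc_score`. The data is an assumption for illustration and contains no tied scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up decision scores and true labels (assumptions for illustration)
scores = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)

pos = scores[y_true]
neg = scores[~y_true]

# Compare every positive score with every negative score
auc_pairs = np.mean(pos[:, None] > neg[None, :])
auc_sklearn = roc_auc_score(y_true, scores)

print(auc_pairs, auc_sklearn)
```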

In [None]:
from sklearn.metrics import roc_curve

# Obtain decision scores in a cross-validated manner
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_train_5, y_scores)  # area under the ROC curve
plot_roc_curve(fpr, tpr, label=f"SGD (AUC = {auc:.3f})")
plt.legend(loc="lower right")
plt.show()

## Multiclass Classification

Goal: Predict one of 10 classes (digits 0-9) per image.

SGDClassifier supports multiclass natively (uses OvR internally: one binary classifier per class, pick highest score).

In [None]:
from sklearn.linear_model import SGDClassifier

# Train on full multiclass targets (y_train with 0-9)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

# Predict on the example digit (its true label is 5)
print(sgd_clf.predict([some_digit]))  # ideally [5]; a linear model on unscaled pixels can get it wrong

# See the decision scores for OvR (10 scores, one per class)
some_digit_scores = sgd_clf.decision_function([some_digit])
print(some_digit_scores)

# The predicted class is the one with the highest score
print(sgd_clf.classes_[np.argmax(some_digit_scores)])

# List of classes
print(sgd_clf.classes_)

**OvR vs OvO Strategies**

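- **One-vs-Rest (OvR)**: Train N binary classifiers (one per class) and pick the class whose classifier gives the highest score. Fewer classifiers to train, but each one sees the full training set with an imbalanced target (one class vs. all the others).
- **One-vs-One (OvO)**: Train N×(N-1)/2 binary classifiers (one per pair of classes) and pick the class that wins the most pairwise votes. More classifiers, but each trains only on the instances of its two classes, which suits algorithms that scale poorly with training set size, such as SVMs.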

SVC uses OvO internally (45 classifiers for 10 classes).
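
The classifier counts can be verified with Scikit-Learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers. This sketch uses the small bundled `load_digits` dataset (8×8 images) as a stand-in for MNIST so it runs quickly. The dataset choice and the variable names are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Small 10-class stand-in for MNIST (an assumption, chosen for speed)
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:500], y_small[:500]

# Force each strategy explicitly around the same base estimator
ovr_clf = OneVsRestClassifier(SVC(gamma="scale")).fit(X_small, y_small)
ovo_clf = OneVsOneClassifier(SVC(gamma="scale")).fit(X_small, y_small)

n_classes = len(ovr_clf.classes_)
print(n_classes, len(ovr_clf.estimators_), len(ovo_clf.estimators_))
```

The wrappers expose the trained binary classifiers in `estimators_`, so their number can be compared directly with N and N×(N-1)/2.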

In [None]:
from sklearn.svm import SVC

# SVC is slow on full dataset → use small subset for demo
svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])  # 2000 instances for speed

# Predict
print(svm_clf.predict([some_digit]))

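# SVC trains 45 OvO classifiers internally, but the default
# decision_function_shape="ovr" aggregates them into one score per class: (1, 10)
print(svm_clf.decision_function([some_digit]).shape)

# Switch to the raw pairwise scores to get one score per pair: (1, 45)
svm_clf.decision_function_shape = "ovo"
print(svm_clf.decision_function([some_digit]).shape)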

**Multiclass Evaluation: Confusion Matrix**

Use cross-validation for clean predictions.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Cross-validated predictions (takes time)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

# N×N confusion matrix (10×10)
conf_mx = confusion_matrix(y_train, y_train_pred)
print(conf_mx)

In [None]:
# Plot confusion matrix as heatmap (diagonal = correct, off-diagonal = errors)
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
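
For error analysis, it often helps to divide each row by the number of images in that class and blank out the diagonal, so that only the error rates remain. This is a sketch on a small made-up 3×3 matrix; `conf_demo` is an assumption, and in the notebook the same steps would apply to `conf_mx`.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = actual, columns = predicted)
conf_demo = np.array([[90,  5,  5],
                      [10, 80, 10],
                      [ 0, 20, 30]])

# Turn counts into per-class rates so large classes do not dominate
row_sums = conf_demo.sum(axis=1, keepdims=True)
norm_conf = conf_demo / row_sums

# Zero the diagonal to keep only the errors
np.fill_diagonal(norm_conf, 0)

# Locate the most frequent confusion: (actual class, predicted class)
worst = np.unravel_index(norm_conf.argmax(), norm_conf.shape)
print(norm_conf)
print(worst)
```

Passing `norm_conf` to `plt.matshow` then highlights the confusions rather than the correct predictions.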

## Multilabel Classification

Goal: Assign multiple labels to each instance (e.g., "is the digit large (>=7)?" and "is it odd?").

We use KNeighborsClassifier (supports multilabel natively).

In [None]:
# Create multilabel targets: two binary labels
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]  # shape (60000, 2)

print(y_multilabel[:5])  # Example: [False True] for odd small digits, etc.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

# Predict on example digit (5: not large, odd)
print(knn_clf.predict([some_digit]))  # Typically [[False True]]

**Evaluation**: Use F1-score averaged across labels (macro, weighted, etc.).

In [None]:
from sklearn.metrics import f1_score

# Cross-validated predictions (slow on full data, but for illustration)
# y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)

# Example F1 (macro average)
# print(f1_score(y_multilabel, y_train_knn_pred, average="macro"))
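
The averaging mode matters when labels differ in difficulty or frequency. The following sketch uses tiny made-up multilabel arrays to compare per-label, macro, and weighted F1. The arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up multilabel targets and predictions: one row per instance, one column per label
y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0], [1, 1], [1, 0]])
y_pred = np.array([[1, 0], [1, 1], [0, 0], [1, 0], [1, 0], [1, 0]])

per_label = f1_score(y_true, y_pred, average=None)       # one F1 per label
macro = f1_score(y_true, y_pred, average="macro")        # plain mean over labels
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by label support

print(per_label, macro, weighted)
```

Here the first label is predicted perfectly and the second poorly. Macro averaging treats both labels equally, while weighted averaging gives more influence to the label with more positive instances.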

## Multioutput Classification

Generalization of multilabel: each "label" can have >2 possible values (multiclass outputs).

Example (inspired by chapter/PPT): Predict the digit class + parity (even/odd) + a complexity score (e.g., binned number of black pixels).

Multioutput models output a vector per instance (e.g., [digit_class, parity_class, complexity_class]).

In [None]:
# Example setup: add two more outputs
# Parity: 0=even, 1=odd (binary but could be multiclass)
# Complexity: low/medium/high based on pixel sum (multiclass)

pixel_sums = X_train.sum(axis=1)  # total intensity per image
complexity_bins = np.digitize(pixel_sums, bins=[np.percentile(pixel_sums, 33), np.percentile(pixel_sums, 66)])
# 0=low, 1=med, 2=high

y_multioutput = np.c_[y_train, (y_train % 2), complexity_bins]  # shape (60000, 3)

# Use a model that supports multioutput, e.g., RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train[:2000], y_multioutput[:2000])  # subset for speed

# Predict: returns array of shape (n_samples, n_outputs)
some_predictions = forest_clf.predict([some_digit])
print(some_predictions)  # e.g., [[5 1 1]] (digit 5, odd, medium complexity)
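
A simple way to score a multioutput model is to compute accuracy separately for each output column, plus the fraction of instances where every output is correct. This is a plain NumPy sketch on made-up predictions; the arrays are assumptions, not results from `forest_clf`.

```python
import numpy as np

# Made-up targets and predictions: columns are [digit, parity, complexity]
y_true = np.array([[5, 1, 1], [0, 0, 2], [4, 0, 0], [1, 1, 0]])
y_pred = np.array([[5, 1, 1], [0, 0, 1], [9, 1, 0], [1, 1, 0]])

correct = (y_true == y_pred)

per_output_acc = correct.mean(axis=0)    # accuracy of each output on its own
exact_match = correct.all(axis=1).mean() # instances with all outputs correct

print(per_output_acc, exact_match)
```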

Note: Multioutput is rare in pure classification but useful when tasks are related (e.g., digit + properties). The line blurs with multi-task learning.