**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I**  

**The Fundamentals of Machine Learning**  

---

**CHAPTER 3 - Classification**  

---

Chapter 3 shifts focus from regression to classification tasks, using the MNIST handwritten digit dataset to explore binary and multiclass classification systems, performance evaluation metrics, and error analysis techniques.  

---

**MNIST**  

The MNIST dataset consists of 70,000 small images of handwritten digits (0-9) written by high school students and US Census Bureau employees. It is often called the "hello world" of Machine Learning. Each image is labeled with the digit it represents.  

**Loading the Dataset**  
Scikit-Learn provides helper functions to download popular datasets:  

In [None]:
>>> from sklearn.datasets import fetch_openml
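>>> # as_frame=False returns NumPy arrays; recent Scikit-Learn versions default to a DataFrame, which breaks X[0] below
>>> mnist = fetch_openml('mnist_784', version=1, as_frame=False)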
>>> mnist.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details',
           'categories', 'url'])

Scikit-Learn datasets have a standard dictionary structure with DESCR (dataset description), data (feature array), and target (label array) keys.  

**Exploring the Data**  
Examine the arrays:

In [None]:
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
(70000, 784)
>>> y.shape
(70000,)

There are 70,000 images with 784 features each (28 × 28 pixels), where each feature represents pixel intensity from 0 (white) to 255 (black).

Display a digit image:

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

Check the label:

In [None]:
>>> y[0]
'5'

Convert labels from strings to integers:

In [None]:
>>> import numpy as np
>>> y = y.astype(np.uint8)

![Figure3-1.jpg](./03.Chapter-03/Figure3-1.jpg)  

**Creating Train/Test Sets**  

The MNIST dataset is pre-split into training (first 60,000 images) and test (last 10,000 images) sets:

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

The training set is already shuffled, which keeps the cross-validation folds similar (no fold ends up missing a digit) and avoids poor performance from learning algorithms that are sensitive to instance order. Shuffling may be inappropriate for time series data.  

---

**Training a Binary Classifier**  

Simplify the problem by creating a "5-detector" binary classifier distinguishing between 5 and not-5:

In [None]:
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

Train a Stochastic Gradient Descent (SGD) classifier:

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

The SGDClassifier handles very large datasets efficiently because it deals with training instances independently, one at a time, which also makes it well suited to online learning. Setting random_state ensures reproducible results.

Make predictions:

In [None]:
>>> sgd_clf.predict([some_digit])
array([ True])

The classifier correctly identifies the digit 5.  


---

**Performance Measures**  

Evaluating classifiers is significantly trickier than evaluating regressors, requiring careful metric selection.

**Measuring Accuracy Using Cross-Validation**  

**Implementing Cross-Validation**  
Implement cross-validation manually when you need more control over the process than Scikit-Learn provides off the shelf:

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

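# random_state is omitted: it only has an effect with shuffle=True, and recent
# Scikit-Learn versions raise an error if it is set while shuffle is False
skfolds = StratifiedKFold(n_splits=3)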

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9502, 0.96565, and 0.96495

StratifiedKFold performs stratified sampling to produce folds with representative class ratios.

Use cross_val_score for simpler evaluation:

In [None]:
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.96355, 0.93795, 0.95615])

Over 93% accuracy appears impressive, but consider a "dumb" classifier:

In [None]:
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

>>> never_5_clf = Never5Classifier()
>>> cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.91125, 0.90855, 0.90915])

The dummy classifier achieves over 90% accuracy simply because only about 10% of the images are 5s. This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially with skewed datasets where some classes are much more frequent than others.
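The baseline can be checked with simple arithmetic. This is a minimal sketch that assumes the class counts shown in this chapter's confusion matrices (54,579 non-5s and 5,421 5s in the training set); the variable names are illustrative.

```python
# Class counts for the 5-vs-not-5 task, taken from the confusion matrices in this chapter.
n_not_5 = 54579
n_5 = 5421

# A classifier that always answers "not 5" is right on every non-5 and wrong on every 5.
baseline_accuracy = n_not_5 / (n_not_5 + n_5)
print(round(baseline_accuracy, 5))  # 0.90965
```

The result matches the mean of the three cross-validation scores reported for Never5Classifier.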

**Confusion Matrix**  
A confusion matrix counts how many times instances of class A are classified as class B.

Generate predictions using cross_val_predict:

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

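Like cross_val_score, cross_val_predict performs K-fold cross-validation, but instead of evaluation scores it returns the predictions made on each test fold. Every prediction therefore comes from a model that never saw that instance during training.  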

Compute the confusion matrix:

In [None]:
>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)
array([[53057,  1522],
       [ 1325,  4096]])

Each row represents an actual class, each column represents a predicted class:
* True negatives (TN): 53,057 non-5s correctly classified as non-5s
* False positives (FP): 1,522 non-5s wrongly classified as 5s
* False negatives (FN): 1,325 5s wrongly classified as non-5s
* True positives (TP): 4,096 5s correctly classified as 5s
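The four cells can also be counted by hand. The sketch below is a plain-Python illustration on hypothetical toy labels (not MNIST); `confusion_counts` is an illustrative helper, not a Scikit-Learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for two equal-length sequences of booleans."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1          # a 5 classified as a 5
        elif actual and not predicted:
            fn += 1          # a 5 classified as a non-5
        elif not actual and predicted:
            fp += 1          # a non-5 classified as a 5
        else:
            tn += 1          # a non-5 classified as a non-5
    return tn, fp, fn, tp

# Toy labels (hypothetical): True means "is a 5".
y_true = [False, False, False, True, True, True, False, True]
y_pred = [False, True, False, True, False, True, False, True]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Same layout as confusion_matrix: rows are actual classes, columns are predicted classes.
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```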

A perfect classifier has only true positives and true negatives (nonzero values only on the main diagonal):

In [None]:
>>> y_train_perfect_predictions = y_train_5
>>> confusion_matrix(y_train_5, y_train_perfect_predictions)
array([[54579,     0],
       [    0,  5421]])

![Figure3-2.jpg](./03.Chapter-03/Figure3-2.jpg)

**Precision and Recall**  

Precision measures the accuracy of positive predictions:  

Equation 3-1. Precision  
![Eq3-1.jpg](./03.Chapter-03/Eq3-1.jpg)  

where TP is the number of true positives and FP is the number of false positives.

Recall (sensitivity or true positive rate) measures the ratio of positive instances correctly detected:  

Equation 3-2. Recall  
![Eq3-2.jpg](./03.Chapter-03/Eq3-2.jpg)   

where FN is the number of false negatives.

Compute precision and recall:


In [None]:
>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1522)
0.7290850836596654
>>> recall_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1325)
0.7555801512636044

The 5-detector is correct only 72.9% of the time when it claims an image represents a 5, and it detects only 75.6% of all 5s.  
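Both values follow directly from the confusion matrix counts. A short sketch, using the four counts reported above for the 5-detector:

```python
# Counts from the 5-detector's confusion matrix in this chapter.
tn, fp, fn, tp = 53057, 1522, 1325, 4096

precision = tp / (tp + fp)  # share of predicted 5s that really are 5s
recall = tp / (tp + fn)     # share of actual 5s that were detected

print(f"precision={precision:.4f} recall={recall:.4f}")
# precision=0.7291 recall=0.7556
```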

**F1 Score**  
The F1 score combines precision and recall into a single metric using the harmonic mean:  

Equation 3-3. F1  
![Eq3-3.jpg](./03.Chapter-03/Eq3-3.jpg)  

The harmonic mean gives much more weight to low values, so the classifier only achieves high F1 scores when both precision and recall are high.

In [None]:
>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)
0.7420962043663375
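To see why the harmonic mean punishes imbalance, compare it with the arithmetic mean on a lopsided classifier. This is an illustrative sketch; the `f1` helper and the 0.99/0.01 values are made up for the comparison, while the last lines reuse the confusion matrix counts from this chapter.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical lopsided classifier: very precise, but it finds almost nothing.
lopsided_f1 = f1(0.99, 0.01)
arithmetic_mean = (0.99 + 0.01) / 2
print(round(lopsided_f1, 4), round(arithmetic_mean, 4))  # 0.0198 0.5

# The 5-detector, from its confusion matrix counts (TP=4096, FP=1522, FN=1325).
detector_f1 = f1(4096 / (4096 + 1522), 4096 / (4096 + 1325))
print(round(detector_f1, 4))  # 0.7421
```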

The F1 score favors classifiers with similar precision and recall. However, different contexts prioritize different metrics: a kid-safe video classifier should favor high precision (rejecting many good videos but keeping only safe ones), while a shoplifter detector should favor high recall (30% precision is acceptable as long as recall reaches 99%).  

**Precision/Recall Trade-off**  
Increasing precision reduces recall, and vice versa.

The SGDClassifier computes a score based on a decision function and assigns instances to the positive class if the score exceeds a threshold.  

![Figure3-3.jpg](./03.Chapter-03/Figure3-3.jpg)  

Access decision scores directly:

In [None]:
>>> y_scores = sgd_clf.decision_function([some_digit])
>>> y_scores
array([2412.53175101])
>>> threshold = 0
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([ True])

The SGDClassifier uses threshold = 0 by default, so the code above returns the same result as predict(). Raising the threshold decreases recall:

In [None]:
>>> threshold = 8000
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([False])
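The same effect can be seen on a whole set of scores. The sketch below uses hypothetical decision scores and labels (not MNIST), and `precision_recall_at` is an illustrative helper; returning a precision of 1.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when instances with score > threshold are predicted positive."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention for "no positive predictions"
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical decision scores and true labels (True = positive class).
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]

low = precision_recall_at(scores, labels, 0)     # precision 0.75, recall 1.0
high = precision_recall_at(scores, labels, 1.5)  # precision 1.0, recall 2/3
print(low, high)
```

Raising the threshold from 0 to 1.5 removes the one false positive but also drops one true positive, so precision goes up while recall goes down.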

**Selecting the Right Threshold**  
Get scores for all training instances:

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

Compute precision and recall for all possible thresholds:

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plot precision and recall versus threshold:

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    # [...] highlight threshold, add legend, axis labels, and grid

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

![Figure3-4.jpg](./03.Chapter-03/Figure3-4.jpg)  

Plot precision directly against recall:  

![Figure3-5.jpg](./03.Chapter-03/Figure3-5.jpg)  

To achieve 90% precision:

In [None]:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)] # ~7816

y_train_pred_90 = (y_scores >= threshold_90_precision)

>>> precision_score(y_train_5, y_train_pred_90)
0.9000380083618396
>>> recall_score(y_train_5, y_train_pred_90)
0.4368197749492714

Creating a classifier with 90% precision is easy: just set a high enough threshold. The cost here is a recall of only 43.7%. When someone requests "99% precision," always ask, "At what recall?"
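The threshold search itself can be sketched without NumPy. This is an illustration on hypothetical scores and labels; `lowest_threshold_for_precision` is a made-up helper that scans candidate thresholds in ascending order and returns the first one whose precision meets the target.

```python
def lowest_threshold_for_precision(scores, labels, target):
    """Return (threshold, recall) for the lowest threshold whose precision >= target.

    Instances with score >= threshold are predicted positive. Returns None if
    no candidate threshold reaches the target precision.
    """
    n_positives = sum(labels)
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y for p, y in zip(predicted, labels))
        fp = sum(p and not y for p, y in zip(predicted, labels))
        if tp + fp and tp / (tp + fp) >= target:
            return threshold, tp / n_positives
    return None

# Hypothetical decision scores and true labels (True = positive class).
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]

threshold, recall = lowest_threshold_for_precision(scores, labels, 0.90)
print(threshold, round(recall, 3))  # 2.0 0.667
```

Reaching 90% precision on this toy data means giving up a third of the positives, the same kind of trade seen with the 5-detector.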

**The ROC Curve**  
The receiver operating characteristic (ROC) curve plots the true positive rate (recall) against the false positive rate (FPR). FPR is the ratio of negative instances incorrectly classified as positive, equal to 1 - TNR (true negative rate or specificity).  
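At a single threshold, these rates follow directly from the confusion matrix. A small sketch using the counts reported earlier in this chapter for the 5-detector at its default threshold:

```python
# Counts from the 5-detector's confusion matrix in this chapter.
tn, fp, fn, tp = 53057, 1522, 1325, 4096

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
tnr = tn / (tn + fp)  # true negative rate (specificity)

print(f"TPR={tpr:.4f} FPR={fpr:.4f} TNR={tnr:.4f}")
# TPR=0.7556 FPR=0.0279 TNR=0.9721
```

This is one point on the ROC curve; the curve is obtained by repeating the computation for every threshold.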

Compute TPR and FPR:

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Plot the ROC curve:

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
    # [...] Add axis labels and grid

plot_roc_curve(fpr, tpr)
plt.show()

![Figure3-6.jpg](./03.Chapter-03/Figure3-6.jpg)  

Measure the area under the curve (AUC):

In [None]:
>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score(y_train_5, y_scores)
0.9611778893101814

A perfect classifier has ROC AUC = 1, while a purely random classifier has ROC AUC = 0.5.
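One way to build intuition for these values: the ROC AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it that way on hypothetical scores; it is a quadratic-time illustration, not how roc_auc_score is implemented.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical decision scores and true labels (True = positive class).
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, False, True, False, True, True]
auc = roc_auc(scores, labels)
print(round(auc, 3))  # 0.889

# Perfectly separated scores give an AUC of 1.0.
perfect_auc = roc_auc([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
print(perfect_auc)  # 1.0
```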

**Rule of thumb:** Prefer the precision/recall (PR) curve when the positive class is rare or when you care more about false positives than false negatives; otherwise use the ROC curve. The ROC curve may appear overly optimistic with imbalanced datasets, while the PR curve makes improvement opportunities clearer.

**Comparing Classifiers: Random Forest**  
Train a RandomForestClassifier and compare with SGDClassifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

RandomForestClassifier has predict_proba() instead of decision_function(), returning probabilities for each class.

Use the positive class probability as score:

In [None]:
y_scores_forest = y_probas_forest[:, 1]   # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

Plot both ROC curves:

In [None]:
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

![Figure3-7.jpg](./03.Chapter-03/Figure3-7.jpg)

In [None]:
>>> roc_auc_score(y_train_5, y_scores_forest)
0.9983436731328145

The Random Forest achieves a significantly better ROC AUC than the SGDClassifier (0.998 vs 0.961). Its precision and recall are also strong: 99.0% precision and 86.6% recall.  

---

**Multiclass Classification**  

Multiclass classifiers (multinomial classifiers) distinguish between more than two classes.

Some algorithms (SGD, Random Forest, naive Bayes) handle multiple classes natively. Others (Logistic Regression, Support Vector Machine) are strictly binary but can use strategies for multiclass classification.

**Strategies for Multiclass Classification**  
**One-versus-the-rest (OvR)** or one-versus-all: Train 10 binary classifiers (one per digit 0-9), then select the class with the highest decision score.

**One-versus-one (OvO)**: Train a binary classifier for every pair of digits, requiring N × (N−1)/2 classifiers (45 for MNIST). When classifying, run the image through all classifiers and select the class winning the most duels. OvO's advantage is that each classifier trains only on the two classes it must distinguish.

For algorithms that scale poorly with training set size, OvO is preferred (faster to train many classifiers on small sets). For most binary classification algorithms, OvR is preferred.  
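The classifier counts for the two strategies can be verified by enumerating the class pairs. A minimal sketch using only the standard library:

```python
from itertools import combinations

classes = list(range(10))  # the ten MNIST digits

# OvR: one binary classifier per class.
n_ovr = len(classes)

# OvO: one binary classifier per unordered pair of classes.
pairs = list(combinations(classes, 2))
n_ovo = len(pairs)

print(n_ovr, n_ovo)  # 10 45
print(pairs[:3])     # [(0, 1), (0, 2), (0, 3)]
```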

**Training a Multiclass Classifier**  
Scikit-Learn automatically runs OvR or OvO when using binary classifiers for multiclass tasks:

In [None]:
>>> from sklearn.svm import SVC
>>> svm_clf = SVC()
>>> svm_clf.fit(X_train, y_train) # y_train, not y_train_5
>>> svm_clf.predict([some_digit])
array([5], dtype=uint8)

The SVC trains on the original target classes 0-9 and makes a correct prediction. Scikit-Learn used the OvO strategy, training 45 binary classifiers and selecting the class that won the most duels.

View decision scores:

In [None]:
>>> some_digit_scores = svm_clf.decision_function([some_digit])
>>> some_digit_scores
array([[ 2.92492871,  7.02307409,  3.93648529,  0.90117363,  5.96945908,
         9.5       ,  1.90718593,  8.02755089, -0.13202708,  4.94216947]])
>>> np.argmax(some_digit_scores)
5
>>> svm_clf.classes_
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)
>>> svm_clf.classes_[5]
5

When trained, classifiers store target classes in the classes_ attribute, ordered by value.
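The lookup from scores to a predicted label can be sketched in plain Python. The scores are the ones shown above; the second class list is hypothetical and only illustrates that the index of the highest score matches the label itself by coincidence in MNIST.

```python
# Decision scores for some_digit, copied from the output above.
scores = [2.92492871, 7.02307409, 3.93648529, 0.90117363, 5.96945908,
          9.5, 1.90718593, 8.02755089, -0.13202708, 4.94216947]

# Index of the highest score (what np.argmax returns).
best_index = max(range(len(scores)), key=scores.__getitem__)

mnist_classes = list(range(10))
print(best_index, mnist_classes[best_index])  # 5 5

# Hypothetical label set: the index no longer equals the label,
# so the lookup through the class list is what gives the prediction.
other_classes = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
print(other_classes[best_index])  # f
```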

Force OvR or OvO using OneVsRestClassifier or OneVsOneClassifier:

In [None]:
>>> from sklearn.multiclass import OneVsRestClassifier
>>> ovr_clf = OneVsRestClassifier(SVC())
>>> ovr_clf.fit(X_train, y_train)
>>> ovr_clf.predict([some_digit])
array([5], dtype=uint8)
>>> len(ovr_clf.estimators_)
10

This creates 10 classifiers for the OvR strategy.

Train an SGDClassifier on all ten classes:

In [None]:
>>> sgd_clf.fit(X_train, y_train)
>>> sgd_clf.predict([some_digit])
array([5], dtype=uint8)

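No OneVsRestClassifier or OneVsOneClassifier wrapper was needed: SGDClassifier accepts multiclass targets directly (under the hood, Scikit-Learn trains one binary classifier per class, OvR style), so decision_function() returns one score per class: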

In [None]:
>>> sgd_clf.decision_function([some_digit])
array([[-15955.22628, -38080.96296, -13326.66695,   573.52692, -17680.68466,
          2412.53175, -25526.86498, -12290.15705, -7946.05205, -10631.35889]])

The classifier is confident (class 5 scores 2412.5) with slight doubt about class 3 (scores 573.5).

Evaluate using cross-validation:

In [None]:
>>> cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([0.8489802 , 0.87129356, 0.86988048])

Over 84% accuracy beats the 10% random classifier baseline. Simply scaling inputs increases accuracy above 89%:

In [None]:
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
>>> cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([0.89707059, 0.8960948 , 0.90693604])
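Standardization rescales each feature to zero mean and unit variance. The sketch below shows the arithmetic for a single feature column in plain Python; `standardize` is an illustrative helper, the pixel values are hypothetical, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # Constant feature: nothing to scale, every value maps to 0.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# Hypothetical intensities of one pixel across five images.
pixel_column = [0, 0, 128, 255, 255]
scaled = standardize(pixel_column)

print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

After scaling, the column has a mean of 0 and a standard deviation of 1 (up to floating-point rounding), which puts all pixels on a comparable scale for gradient descent.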

---  

**Error Analysis**  
After finding a promising model, analyze the types of errors it makes to improve performance.

**Confusion Matrix Analysis**  
Generate predictions and create confusion matrix:


In [None]:
>>> y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
>>> conf_mx = confusion_matrix(y_train, y_train_pred)
>>> conf_mx
array([[5578,    0,   22,    7,    8,   45,   35,    5,  222,    1],
       [   0, 6410,   35,   26,    4,   44,    4,    8,  198,   13],
       [  28,   27, 5232,  100,   74,   27,   68,   37,  354,   11],
       [  23,   18,  115, 5254,    2,  209,   26,   38,  373,   73],
       [  11,   14,   45,   12, 5219,   11,   33,   26,  299,  172],
       [  26,   16,   31,  173,   54, 4484,   76,   14,  482,   65],
       [  31,   17,   45,    2,   42,   98, 5556,    3,  123,    1],
       [  20,   10,   53,   27,   50,   13,    3, 5696,  173,  220],
       [  17,   64,   47,   91,    3,  125,   24,   11, 5421,   48],
       [  24,   18,   29,   67,  116,   39,    1,  174,  329, 5152]])

Visualize with Matplotlib:

In [None]:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

The confusion matrix looks good with most images on the main diagonal (correctly classified). The 5s appear slightly darker, potentially indicating fewer 5s in the dataset or worse classifier performance on 5s.

Focus on errors by normalizing the confusion matrix:

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

Rows represent actual classes, columns represent predicted classes. The bright column for class 8 indicates many images get misclassified as 8s, but the row for class 8 isn't too bad (actual 8s are generally classified correctly). Classes 3 and 5 often get confused in both directions.
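The normalization step matters because classes have different sizes: raw error counts would make frequent classes look worse than they are. A plain-Python sketch on a hypothetical 3-class confusion matrix (not the MNIST one):

```python
# Hypothetical confusion matrix: rows are actual classes, columns are predicted classes.
conf = [
    [50, 2, 8],
    [3, 40, 7],
    [1, 1, 18],
]

# Divide each row by the number of instances of that actual class,
# turning counts into error rates.
norm = [[cell / sum(row) for cell in row] for row in conf]

# Zero the diagonal so that only the errors remain.
for i in range(len(norm)):
    norm[i][i] = 0.0

for row in norm:
    print([round(rate, 3) for rate in row])
# [0.0, 0.033, 0.133]
# [0.06, 0.0, 0.14]
# [0.05, 0.05, 0.0]
```

The last column stands out in every row, which is the same pattern as the bright column for class 8 in the MNIST plot.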

**Insights:** Focus efforts on reducing false 8s by gathering more training data for digits that look like 8s (but are not), engineering features (e.g., counting closed loops), or preprocessing images to make patterns stand out.  

**Analyzing Individual Errors**
Plot examples of 3s and 5s:

In [None]:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

# plot_digits() is a plotting helper from the book's notebook; it is not defined in these notes
plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

The two 5×5 blocks on the left show digits classified as 3s; on the right, digits classified as 5s. Some misclassified digits are so badly written that even humans would struggle. However, most errors seem obvious to humans.

The SGDClassifier is a linear model assigning weights per class to each pixel, summing weighted pixel intensities to score each class. Since 3s and 5s differ by only a few pixels, the model easily confuses them. The main difference is the small line joining the top line to the bottom arc.
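The scoring rule of a linear classifier can be sketched on a toy example. Everything below is hypothetical: four-pixel "images", hand-picked weights, and two classes whose weights differ mainly on one pixel. It only illustrates how a small change in a single pixel can flip the prediction.

```python
def class_scores(pixels, weights, biases):
    """One score per class: bias plus the weighted sum of pixel intensities."""
    return [
        sum(w * p for w, p in zip(class_weights, pixels)) + bias
        for class_weights, bias in zip(weights, biases)
    ]

def predict(pixels, weights, biases, classes):
    """Return the class with the highest score."""
    scores = class_scores(pixels, weights, biases)
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best]

classes = ["3", "5"]
weights = [
    [0.5, -1.0, 0.5, 0.3],  # class "3": ink on pixel 1 counts against it
    [0.5, 1.0, 0.5, 0.2],   # class "5": ink on pixel 1 counts in its favor
]
biases = [0.0, 0.0]

image_a = [1.0, 0.0, 1.0, 1.0]  # no ink on pixel 1
image_b = [1.0, 0.2, 1.0, 1.0]  # faint ink on pixel 1

print(predict(image_a, weights, biases, classes))  # 3
print(predict(image_b, weights, biases, classes))  # 5
```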

The classifier is quite sensitive to image shifting and rotation. Preprocessing images to ensure they're well-centered and not too rotated would reduce 3/5 confusion and other errors.  

---

**Multilabel Classification**

**Multilabel classification** systems output multiple binary tags per instance. For example, a face-recognition classifier trained to recognize Alice, Bob, and Charlie should output [1, 0, 1] for a picture containing Alice and Charlie (meaning "Alice yes, Bob no, Charlie yes").

Simple example:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

This creates two target labels per digit: whether it's large (7, 8, or 9) and whether it's odd.

Make predictions:

In [None]:
>>> knn_clf.predict([some_digit])
array([[False,  True]])

Correct prediction: digit 5 is not large (False) and is odd (True).  
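The two target labels are simple functions of the digit, so the expected output for any digit can be derived by hand. A small sketch (`multilabel_targets` is an illustrative helper):

```python
def multilabel_targets(digit):
    """Return [is_large, is_odd] for a digit, mirroring y_train_large and y_train_odd."""
    return [digit >= 7, digit % 2 == 1]

for digit in (5, 8, 9):
    print(digit, multilabel_targets(digit))
# 5 [False, True]
# 8 [True, False]
# 9 [True, True]
```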

**Evaluating Multilabel Classifiers**  
One approach measures the F1 score for each individual label, then computes the average:

In [None]:
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
>>> f1_score(y_multilabel, y_train_knn_pred, average="macro")
0.976410265560605

This assumes all labels are equally important. For weighted evaluation (e.g., more pictures of Alice than Bob), set average="weighted" to give each label weight equal to its support.  
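The difference between the two averaging modes is easiest to see with numbers. The per-label F1 scores and supports below are hypothetical; support means the number of instances that carry the label.

```python
def macro_average(scores):
    """Unweighted mean: every label counts the same."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean weighted by support: frequent labels count more."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Hypothetical per-label results: a frequent label scored well, a rare one poorly.
per_label_f1 = [0.9, 0.5]
supports = [90, 10]

print(round(macro_average(per_label_f1), 2))               # 0.7
print(round(weighted_average(per_label_f1, supports), 2))  # 0.86
```

The macro average is pulled down by the rare label, while the weighted average mostly reflects the frequent one.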

---

**Multioutput Classification**  

**Multioutput-multiclass classification** (or simply multioutput classification) is a generalization of multilabel classification where each label can be multiclass (more than two possible values).

**Example: Noise Removal System**  
Build a system that removes noise from digit images. It takes noisy digit images as input and outputs clean images (arrays of pixel intensities). The output is multilabel (one label per pixel) where each label has multiple values (pixel intensity 0-255).

The line between classification and regression can be blurry: predicting pixel intensity is more akin to regression than to classification. Multioutput systems aren't limited to classification; they can output multiple labels per instance, including both class labels and value labels.

Create noisy training and test sets:

In [None]:
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

The target images are the original clean images.

Train and clean an image:

In [None]:
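knn_clf.fit(X_train_mod, y_train_mod)
some_index = 0  # assumption: any test-set index works; 0 is an arbitrary choice
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)  # plot_digit() is a helper from the book's notebook, not defined here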

The cleaned output looks close to the target image.