# Linear Classification

In [3]:
import os
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt

## The MNIST Dataset

Download the MNIST dataset files `mnist_train_uint8.npz` and `mnist_test_uint8.npz` from resources.

In [4]:
train_data = np.load('mnist_train_uint8.npz', allow_pickle=True)
train_X, train_y = train_data['X'], train_data['y']

test_data = np.load('mnist_test_uint8.npz', allow_pickle=True)
test_X, test_y = test_data['X'], test_data['y']

assert train_X.shape[0] == train_y.shape[0]
assert test_X.shape[0] == test_y.shape[0]

N_train = train_X.shape[0]
N_test = test_X.shape[0]

print(f'Training set size: {N_train}, Test set size: {N_test}')

train_X.shape, train_y.shape

Each digit can be represented by a $784$-D vector (a flattened $28 \times 28$ image) and has a corresponding target digit label. We flatten the images and scale the pixel values to $[0, 1]$ now.

In [5]:
train_X = train_X.reshape(-1, 28 * 28) / 255.0
test_X = test_X.reshape(-1, 28 * 28) / 255.0

train_X.shape, train_y.shape

Let's take a look at the distribution of digits in the dataset.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(np.unique(train_y), np.bincount(train_y))
ax.set_xticks(np.unique(train_y))
ax.set_xlabel('Class')
ax.set_ylabel('Count')
fig.show()

In [None]:
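# Work with a small random subset to keep training fast.
num_train, num_test = 1000, 100
# replace=False ensures no sample is drawn twice.
train_idx = np.random.choice(train_X.shape[0], num_train, replace=False)
test_idx = np.random.choice(test_X.shape[0], num_test, replace=False)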

We use NumPy integer-array indexing to select the data at the indices sampled above.

In [None]:
# np.int was removed in NumPy 1.24; the built-in int is the drop-in replacement.
train_data, train_labels = train_X[train_idx], train_y[train_idx].astype(int)
test_data, test_labels = test_X[test_idx], test_y[test_idx].astype(int)

Let's also take a look at the label distribution of our newly created training subset. Ideally, each class would contain roughly the same number of samples.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(np.unique(train_labels), np.bincount(train_labels))
ax.set_xticks(np.unique(train_labels))
ax.set_xlabel('Class')
ax.set_ylabel('Count')
fig.show()
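
Uniform random sampling only balances the classes approximately. If exactly equal class counts are wanted, one option is to sample per class. The following is a minimal sketch under that assumption, not part of the original notebook: the helper `balanced_sample_indices` and the synthetic `demo_labels` are hypothetical names used for illustration only.

```python
import numpy as np

def balanced_sample_indices(labels, per_class, rng):
    """Return shuffled indices holding `per_class` samples (no repeats) from each class."""
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)  # positions of this class
        chosen.append(rng.choice(cls_idx, size=per_class, replace=False))
    idx = np.concatenate(chosen)
    rng.shuffle(idx)  # avoid returning the samples grouped by class
    return idx

# Synthetic labels standing in for train_y (assumption: 10 classes, 5000 samples).
rng = np.random.default_rng(0)
demo_labels = rng.integers(0, 10, size=5000)
demo_idx = balanced_sample_indices(demo_labels, per_class=100, rng=rng)

# Every class now appears exactly 100 times.
print(np.bincount(demo_labels[demo_idx], minlength=10))
```

With the real data, `demo_labels` would be replaced by `train_y`, and `per_class` must not exceed the size of the smallest class.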

## Visualize Train Data

In [None]:
fig, axes = plt.subplots(4, 4, figsize=(7, 7))

random_idx = np.random.choice(train_data.shape[0], size=16, replace=False)

for i in range(4):
    for j in range(4):
        idx = j + i * 4
        # Index through random_idx so the grid shows the randomly chosen samples.
        axes[i, j].imshow(train_data[random_idx[idx]].reshape(28, 28), cmap='gray')
        axes[i, j].axis('off')

fig.show()

## Model Definition

We define a simple linear model that outputs, for each input digit, the probability of it belonging to each of the $10$ classes.

$$ 
\begin{align}
\mathbf{p} &= \text{softmax}(XW^T) \tag{$X \in \mathbb{R}^{M \times (D+1)}$, $W \in \mathbb{R}^{K \times (D+1)}$} \\
\text{softmax}(\mathbf{v})_i &= \frac{\exp(v_i)}{\sum_{j=1}^{K} \exp(v_j)} \tag{$\forall i \in [1, K]$}
\end{align}
$$

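$M$ is the number of samples, $D = 784$ is the number of pixels per image, and $K = 10$ is the total number of classes. The extra column in $X$ and $W$ is the usual constant-$1$ feature that absorbs the bias term.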
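
To make the shapes concrete, here is a minimal NumPy sketch of the model above. It is illustrative only: `softmax`, `predict_proba`, `X_demo`, and `W_demo` are hypothetical names, the inputs are random rather than MNIST, and the weights are untrained.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def predict_proba(X, W):
    """Compute p = softmax(X W^T) for X of shape (M, D+1) and W of shape (K, D+1)."""
    return softmax(X @ W.T)

# Random stand-in data with the notebook's dimensions (D = 784, K = 10).
rng = np.random.default_rng(0)
M, D, K = 5, 784, 10
features = rng.random((M, D))
X_demo = np.hstack([features, np.ones((M, 1))])   # append the constant-1 bias feature
W_demo = rng.normal(scale=0.01, size=(K, D + 1))  # untrained, randomly initialised weights

probs = predict_proba(X_demo, W_demo)
print(probs.shape)        # (5, 10): one probability per sample and class
print(probs.sum(axis=1))  # each row sums to 1, up to floating-point error
```

Subtracting the row maximum does not change the result, because softmax is invariant to adding a constant to all scores of a sample.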

## Learning using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(train_data, train_labels)

## Visualize Features

If everything goes well, we should be able to visualize each row of the parameter matrix $W$ defined above as a $28 \times 28$ image and start seeing some patterns. Intuitively, each row shows which parts of the image space the corresponding class is associated with. The brighter a pixel in the visual, the larger its weight, and the stronger the response signal when the input is active at that location.

In [None]:
fig, axes = plt.subplots(figsize=(10, 5), nrows=2, ncols=5, sharex=True, sharey=True)

for i in range(2):
    for j in range(5):
        idx = j + i * 5  # 5 columns per row, so classes 0-9 each appear once
        axes[i, j].imshow(clf.coef_[idx].reshape(28, 28), cmap='gray')
        axes[i, j].axis('off')

fig.tight_layout()
fig.show()

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
cmat = confusion_matrix(test_labels, clf.predict(test_data))

fig, ax = plt.subplots(figsize=(10,10))
im = ax.imshow(cmat, cmap=plt.cm.viridis)
fig.colorbar(im)
fig.show()