# Lecture 05

The goal of this assignment is to work through the data-driven approach with a linear classifier model. This includes loading a dataset, preprocessing it, splitting it into training/validation/testing sets, training a linear classifier, evaluating the model, and visualizing the results. We will use a toy dataset called MedNIST, which consists of toy medical images of different modalities and organs. The goal is to classify each image into one of 6 categories: Abdomen CT, Breast MRI, Chest XRay, Chest CT, Hand XRay, and Head CT. The images are 2D and significantly downsampled to $64 \times 64$ pixels. For simplicity and efficiency, this in-class activity uses a sample of 1000 images.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import PIL.Image
from tqdm import tqdm
import math
import os
import torch

In [None]:
engr_dir = "/opt/nfsopt/DLMI"
idas_dir = os.path.join(os.path.expanduser('~'), "classdata")

if os.path.isdir(engr_dir):
    data_dir = engr_dir
elif os.path.isdir(idas_dir):  
    data_dir = idas_dir
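else:
    # Stop here: without data_dir, every later cell would fail with a NameError
    raise FileNotFoundError("Data directory not found (checked {} and {})".format(engr_dir, idas_dir))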

## Load Dataset

The `MedNIST.csv` file is a text file with the image filename and corresponding class label for each sample in the dataset. Below, the csv file is read using the `pandas` library, and the first 10 rows are displayed.

In [None]:
df = pd.read_csv(os.path.join(data_dir,"lecture05", "MedNIST.csv"))
df.head(10)

Define a Python dictionary which maps the numerical class value to the corresponding category.

In [None]:
class_names = {0: 'AbdomenCT',
               1: 'BreastMRI',
               2: 'CXR',
               3: 'ChestCT', 
               4: 'Hand',
               5: 'HeadCT'}

<br>

Read and store all the images in a single 2D matrix $X$. Each image will be flattened and stored as a row in $X$. The size of $X$ will be $N \times M$, where $N$ is the number of samples and $M$ is the number of pixels in each image.
The spatial structure of the image is lost after flattening. However, a linear classification model applied to raw pixel values does not use the spatial structure.
Storing the data in a single matrix will allow us to compute linear classification scores more efficiently. The labels for each image will be stored in a single vector $Y$ with $N$ elements.
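
To make the flattening step concrete, here is a minimal sketch on two made-up $2 \times 3$ "images" (the names `images`, `X_toy`, and `restored` are hypothetical and not part of the assignment). It shows that `ravel` turns each image into one row and that `reshape` recovers the 2D layout when given the original (height, width).

```python
import numpy as np

# Two hypothetical 2 x 3 "images" (height 2, width 3) with made-up pixel values
images = [np.arange(6).reshape(2, 3), np.arange(6, 12).reshape(2, 3)]

# Each flattened image becomes one row of the data matrix (N = 2, M = 6)
X_toy = np.zeros((len(images), 2 * 3))
for i, im in enumerate(images):
    X_toy[i, :] = im.ravel()

# Reshaping a row with the original (height, width) recovers the 2D image
restored = X_toy[1, :].reshape(2, 3)

print(X_toy.shape)
print(restored)
```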

First, determine the size of the dataset, i.e., the number of samples (N) and the number of pixels per image (M).

In [None]:
full_filename = os.path.join(data_dir, "lecture05", df.iloc[0]['filename'])
image_width, image_height = PIL.Image.open(full_filename).size
N = len(df)
M = image_width * image_height

print("Image dimensions: {} x {}".format(image_width, image_height))
print("Number of samples (N): {}".format(N))
print("Number of pixels (M): {}".format(M))

<br>

Create a zero-initialized NumPy array to hold the dataset.

In [None]:
X = np.zeros((N, M))

Iterate over the rows of the dataframe, read each image, and store it as a row in the data array. Additionally, create an array for the labels.

In [None]:
# Populate data array X
for i, row in df.iterrows():
    filename = row['filename']
    label = row['label']
    full_filename = os.path.join(data_dir, "lecture05", filename)
    im = PIL.Image.open(full_filename)
    arr = np.array(im)
    X[i,:] = arr.ravel()

# Create labels array Y
Y = df['label'].values


print("data shape: {}".format(X.shape))
print("label shape: {}".format(Y.shape))


<br>

**QUESTION:** For training a linear classifier on this dataset, what are the learnable parameters and what are their sizes/shapes?

XXX

## Visualization

Visualization is an important step for exploring data and validating pre-processing steps. The code below takes a random sample and displays each image with its corresponding label. To view an image, the row vector needs to be reshaped into a 2D matrix using the calculated `image_width` and `image_height`. The code also displays the corresponding histogram below each image, so the intensity values and distribution can be assessed. Run this cell multiple times to see different samples.

In [None]:
fig, ax = plt.subplots(2, 6, figsize=(15, 6))

for i, k in enumerate(np.random.randint(N, size=6)):
    arr = X[k,:]
    label = Y[k]
    arr2d = arr.reshape(image_height, image_width)  # NumPy order is (rows, cols) = (height, width)
    ax[0,i].set_title(class_names[label])
    ax[0,i].imshow(arr2d, cmap="gray", vmin=0, vmax=255)
    ax[1,i].hist(arr, bins=20, range=(0, 255), edgecolor='black', color='white', density=True)
    
    ax[0,i].set_axis_off()
    ax[1,i].set_yticklabels([])
    ax[1,i].set_yticks([])

plt.tight_layout()
plt.show()

## Intensity Normalization

Notice that some images look washed out. It is common to rescale image intensities to the range $[-1, 1]$. To do this, we will use a linear rescaling that maps the maximum intensity to 1 and the minimum intensity to -1. To reduce the influence of outliers, we will use the 1st and 99th percentiles of the intensity distribution instead of the min and max. You could write a for loop to iterate over the rows of our data matrix `X` and calculate the 1st and 99th percentiles of each row. However, NumPy provides functions that do this more efficiently. For example, to calculate the 50th percentile of each image in our data matrix, we could call the `np.percentile` function, specify that we want the 50th percentile (`q=50`), and specify that the calculation is performed across columns (`axis=1`). The result is one percentile value for each row (corresponding to each image). Below, complete the code to calculate the 1st percentile and 99th percentile for each image in the dataset (replace XXX).
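
The 50th-percentile example described above can be sketched on a tiny made-up array (`toy` and `median_per_row` are hypothetical names, and the values are chosen only for illustration). Note how the single large value in the second row does not affect its median.

```python
import numpy as np

# Two hypothetical "images" with 5 pixels each; the values are made up
toy = np.array([[0.0, 10.0, 20.0, 30.0, 40.0],
                [5.0, 5.0, 5.0, 5.0, 105.0]])

# q=50 requests the 50th percentile (the median); axis=1 reduces across the
# columns, giving one value per row
median_per_row = np.percentile(toy, q=50, axis=1)

print(median_per_row)        # one median per image
print(median_per_row.shape)  # one entry per row of toy
```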

In [None]:
######################################
#######         TODO           #######
######################################

perc01 = XXX
perc99 = XXX


print("data shape: {}".format(X.shape))
print("perc01 shape: {}".format(perc01.shape))
print("perc99 shape: {}".format(perc99.shape))


What are the shapes of the perc01 and perc99 arrays? What does each element of the arrays represent?

XXX

<br> 

Now perform linear rescaling to map [perc01, perc99] intensities for each image to $[-1, 1]$. For an image $I$, the desired linear intensity rescaling is defined as:

$$
\begin{align}
I_{norm} = 2\frac{I - \mathrm{perc_{01}}(I)}{\mathrm{perc_{99}}(I)-\mathrm{perc_{01}}(I)}-1
\end{align}
$$
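
As a quick check of the formula, take the hypothetical values $\mathrm{perc_{01}}(I) = 10$ and $\mathrm{perc_{99}}(I) = 210$ (made up for illustration, not taken from the dataset):

$$
\begin{align}
I = 10 &\;\Rightarrow\; I_{norm} = 2\cdot\frac{10 - 10}{210 - 10} - 1 = -1 \\
I = 110 &\;\Rightarrow\; I_{norm} = 2\cdot\frac{110 - 10}{210 - 10} - 1 = 0 \\
I = 210 &\;\Rightarrow\; I_{norm} = 2\cdot\frac{210 - 10}{210 - 10} - 1 = 1 \\
I = 250 &\;\Rightarrow\; I_{norm} = 2\cdot\frac{250 - 10}{210 - 10} - 1 = 1.4
\end{align}
$$

The last case shows that intensities beyond the 99th percentile land outside $[-1, 1]$ after rescaling.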

Similar to before, we could write a for loop over the rows of the data matrix and perform this operation for each row. However, this is unnecessary because we can take advantage of NumPy broadcasting and vectorization, which is significantly more efficient. Below, perform the linear rescaling on our data matrix $X$ without using a for loop (replace XXX).


In [None]:
######################################
#######         TODO           #######
######################################

X_norm = XXX

print(X_norm.shape)

WARNING: if you get the following error: "operands could not be broadcast together with shapes (1000,4096) (1000,) (1000,)"

The perc01 and perc99 arrays are not broadcastable with the data array $X$. For two arrays to be broadcastable, the sizes of each pair of dimensions must either match or one of them must equal 1. The comparison starts at the last (rightmost) dimension, i.e., the last number returned by `shape`, and works leftward.

For example, the shapes of the arrays are:

| Array | Second-to-last dimension | Last dimension |
| :------- | :------: | -------: |
| X | 1000, | 4096 |
| perc01 |  | 1000 |


Since the last dimensions (4096 and 1000) are not equal and neither equals 1, the arrays are not broadcastable. To fix this, a "dummy" dimension of size 1 can be added as the last dimension of the perc01/perc99 arrays.

| Array | Second-to-last dimension | Last dimension |
| :------- | :------: | -------: |
| X | 1000, | 4096 |
| perc01 | 1000, | 1 |


Now the arrays are broadcastable, since each pair of dimensions either matches or has one size equal to 1. To get this shape, add the parameter `keepdims=True` to your `np.percentile` calls, which keeps the reduced axis as a dimension of size 1.

More reading on broadcasting: https://numpy.org/doc/stable/user/basics.broadcasting.html
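
The broadcasting rule above can be tried on a small made-up array. This sketch subtracts each row's mean from a $3 \times 4$ array (row means are used only as a stand-in statistic; `A`, `row_mean_flat`, and `row_mean_col` are hypothetical names). The same shape logic applies to any per-row reduction that accepts `keepdims`.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)   # shape (3, 4)

row_mean_flat = A.mean(axis=1)                 # shape (3,)
row_mean_col = A.mean(axis=1, keepdims=True)   # shape (3, 1)

# (3, 4) with (3,): the last dimensions are 4 and 3, so broadcasting fails
try:
    A - row_mean_flat
    flat_ok = True
except ValueError:
    flat_ok = False

# (3, 4) with (3, 1): the last dimensions are 4 and 1, so broadcasting works
centered = A - row_mean_col

# Adding the dummy dimension by hand gives the same shape as keepdims=True
same_shape = row_mean_flat[:, np.newaxis].shape == row_mean_col.shape

print(flat_ok, centered.shape, same_shape)
```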

<br>

Lastly, clip the intensities of the normalized array $X_{norm}$ to the range $[-1, 1]$. This maps values that were less than perc01 to -1, and values greater than perc99 to 1.

In [None]:
X_norm = np.clip(X_norm, -1, 1)

Now visualize the normalized dataset. Run this cell multiple times to see different samples.

In [None]:
fig, ax = plt.subplots(2, 6, figsize=(15, 6))

for i, k in enumerate(np.random.randint(N, size=6)):
    arr = X_norm[k,:]
    label = Y[k]
    arr2d = arr.reshape(image_height, image_width)  # NumPy order is (rows, cols) = (height, width)
    ax[0,i].set_title(class_names[label])
    ax[0,i].imshow(arr2d, cmap="gray")
    ax[1,i].hist(arr, bins=20, edgecolor='black', color='white', density=True)
    
    ax[0,i].set_axis_off()
    ax[1,i].set_yticklabels([])
    ax[1,i].set_yticks([])

plt.tight_layout()
plt.show()

**QUESTION:** How have the images and intensity distributions changed after normalization?

XXX

## Training / Validation / Testing Splits

Split the dataset into training, validation, and testing sets. The full dataset has 1000 samples, and we will use a 600 / 200 / 200 split. Each split will be saved as an `npz` file, which can later be loaded back into NumPy arrays. Multiple arrays can be stored in a single `npz` file by passing them as keyword arguments.

In [None]:
num_train = int(N*0.6) # 60% training
num_val = int(N*0.2) # 20% validation
num_test = N - num_train - num_val # remainder testing

Xtrain, Xval, Xtest = X_norm[:num_train], X_norm[num_train:num_train+num_val], X_norm[num_train+num_val:]
Ytrain, Yval, Ytest = Y[:num_train], Y[num_train:num_train+num_val], Y[num_train+num_val:]

np.savez("train.npz", X=Xtrain, Y=Ytrain)
np.savez("val.npz", X=Xval, Y=Yval)
np.savez("test.npz", X=Xtest, Y=Ytest)
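
The slices above take the first 60% of rows for training, so the split follows the row order of the CSV. Whether those rows are already in random order is not stated here. If they happened to be grouped by class, some classes could be missing from a split. A minimal sketch of shuffling before slicing, using small made-up arrays (`X_demo` and `Y_demo` are hypothetical stand-ins for `X_norm` and `Y`):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible

# Hypothetical stand-ins: 10 samples with 2 features each, and their labels
X_demo = np.arange(20, dtype=float).reshape(10, 2)
Y_demo = np.arange(10)

# One permutation applied to both arrays keeps each sample paired with its label
perm = rng.permutation(len(X_demo))
X_shuf, Y_shuf = X_demo[perm], Y_demo[perm]

n_train, n_val = 6, 2
X_tr, X_va, X_te = X_shuf[:n_train], X_shuf[n_train:n_train + n_val], X_shuf[n_train + n_val:]
Y_tr, Y_va, Y_te = Y_shuf[:n_train], Y_shuf[n_train:n_train + n_val], Y_shuf[n_train + n_val:]

print(X_tr.shape, X_va.shape, X_te.shape)
```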

Load the datasets. Since multiple arrays are saved in each `npz` file, the `np.load` call returns a dictionary-like object, and each array can be accessed with the key corresponding to the keyword used when saving. For example, when loading "train.npz" into the variable `train`, the data array can be accessed using `train['X']` and the labels array using `train['Y']`. Below, the datasets are loaded and the array shapes are printed to confirm everything was saved and loaded properly.

In [None]:
train = np.load("train.npz")
val = np.load("val.npz")
test = np.load("test.npz")

print("Train X: {}".format(train['X'].shape))
print("Train Y: {}".format(train['Y'].shape))
print("Val X: {}".format(val['X'].shape))
print("Val Y: {}".format(val['Y'].shape))
print("Test X: {}".format(test['X'].shape))
print("Test Y: {}".format(test['Y'].shape))

## Train Softmax Classifier

In the `model.py` file, a class called `LinearClassifier` is provided for training a softmax classifier. At this point in the class, you don't need to understand exactly what is going on in the code, but if you are interested you can open the file. For this assignment, you will use the class to train a softmax classifier on your dataset. Below, an instance of the `LinearClassifier` class is created, and the `train` member function is called with the training data and labels. The `train` function returns the learned model parameters `W` and `b`.


In [None]:
from model import LinearClassifier

clf = LinearClassifier(epochs=1000, learning_rate=0.02)
W, b = clf.train(train['X'], train['Y'])

print(f'W.shape {W.shape}')
print(f'b.shape {b.shape}')

Below, visualize the learned `W` parameters. For each of the 6 classes, the corresponding parameters are reshaped into an image. The weights can be thought of as the "templates" learned for each class.

In [None]:
fig, axs = plt.subplots(1,6, figsize = (12, 8),constrained_layout=True)
for i in range(6):
    axs[i].imshow(W.T[i].reshape(64,64), cmap = "gray")
    axs[i].axis('off')
    axs[i].set_title(class_names[i])
plt.savefig("svm_w.pdf",bbox_inches='tight', pad_inches=.05)

## Model Evaluation

Implement the linear classifier predict function using the learned parameters `W` and `b`. The function also takes input data `X`, for which it predicts the class labels. Your code should work for `X` containing multiple samples (an $N \times M$ matrix). Note: use vectorized code, NOT for loops!
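
As a reminder of what "vectorized" means in practice, a reduction applied along `axis=1` produces one result per row without an explicit loop. The sketch below uses a made-up table of numbers (`scores_toy` is hypothetical and is not the model's output) and compares a Python loop with the equivalent single NumPy call.

```python
import numpy as np

# Hypothetical 4 x 3 table: 4 rows, 3 columns of made-up values
scores_toy = np.array([[0.1, 2.0, -1.0],
                       [3.0, 0.5, 0.2],
                       [0.0, 0.1, 0.9],
                       [1.5, 1.0, 0.3]])

# Loop version: index of the largest value in each row
best_loop = np.array([np.argmax(row) for row in scores_toy])

# Vectorized version: the same result from a single call
best_vec = np.argmax(scores_toy, axis=1)

print(best_loop, best_vec)
```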

In [None]:
def predict(X, W, b):
    """
    Linear classifier prediction

    Assume the input is N samples organized in a 2D matrix of N x M

    Inputs:
    - X (np.array): NumPy array of data to predict labels, size N x M
    - W (np.array): NumPy array of learned model weights
    - b (np.array): NumPy array of learned model biases


    Returns:
    - y (np.array): NumPy array of predictions for each sample, size N
    """
    ######################################
    #######         TODO           #######
    ######################################

    return y

def accuracy(Y_true, Y_pred):
    """
    Calculates accuracy 

    Inputs:
    - Y_true (np.array): NumPy array of true class labels
    - Y_pred (np.array): NumPy array of predicted class labels
   

    Returns:
    - acc (float): Accuracy as a percentage
    """
    
    ######################################
    #######         TODO           #######
    ######################################

    
    return acc

Predict the labels and report the accuracy on the training, validation, and testing datasets.

In [None]:
######################################
#######         TODO           #######
######################################
