## Supervised Learning - Multiclass Classification
In the last tutorial you learned how to use the [binary cross entropy](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) loss function to separate two classes. In this tutorial you'll be using [cross entropy](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) loss to learn to predict multiple classes.

Specifically, you will be working with the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset, which contains 70k handwritten digits from 0-9. In the last tutorial I had you overfit the training data to help illustrate how a neural network can learn to fully approximate some unknown function. Although that's cool, it's also error prone. The problem comes when the neural network is given a new data point that it hasn't seen before. If the neural network is overfit to the training data, the new data point may be misclassified because the decision boundary is fit so tightly to the training data. Because of this, you will use 60k images to train on and 10k images to test on.

IMPORTANT: this tutorial was designed to be run on Google Colab. It doesn't need to be, but if you're behind a firewall you'll most likely face issues when downloading the datasets.

## MNIST
In this tutorial you will learn:<br/>
- How to load the training and test data
- How to standardize your data
- How to create a deep neural network of arbitrary depth
- How to use other activation functions such as ReLU
- How to determine if you're overfitting the training data and what you can do to prevent that

Let's get started!

### Load the data

First you'll load the training and test datasets via PyTorch. PyTorch has [quite a few](https://pytorch.org/vision/stable/datasets.html) datasets to choose from, which makes it easy to test models. At some point you'll want to load data that isn't publicly available, but I often find that there's a similar open-sourced problem which makes it easy to test ideas before committing to one.

In [None]:
# This cell is meant to install dependencies within google colab
def _install_deps():
    try:
        if 'google.colab' in str(get_ipython()):
            print('Installing dependencies within Google Colab...')
            !wget https://github.com/Chrispresso/ML-for-coders/blob/main/requirements.txt?raw=True -O requirements.txt
            !python -m pip install -r requirements.txt
    except:
        pass

try:
    if __cell_install_requirements:
        _install_deps()
    else:
        pass
except:
    _install_deps()

__cell_install_requirements = False

In [None]:
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from IPython import display 
import matplotlib.pyplot as plt
import numpy as np

train_set = datasets.MNIST(root='data', train=True, download=True)
test_set = datasets.MNIST(root='data', train=False, download=True)

Let's look at a sample of the training set so we know what we'll be working with.

In [None]:
pair = train_set[0]
pair

So each entry in the dataset is a tuple containing the image and the label (0-9). Take a look at the size of the image. It's 28x28, and every image in this dataset is the same size.

In the previous tutorial you worked with an x-position and a y-position and predicted a class. In that case you dealt with 2 features: the (x,y) pair. In this case you have 28x28 pixels and will use each pixel as its own feature. That's 784 features! Let's now take a look at one of the images.

In [None]:
img = pair[0]  # Extract the image
plt.imshow(img, cmap='gray')
plt.show()

Okay! So you can see a "5" written above. Computers don't understand images very well but they do a good job with numbers. Because of this you will be using the raw pixel values. Let's take a look at what this "5" is to a computer.

In [None]:
print('shape is:', np.array(img).shape)
print('min value:', np.min(np.array(img)))
print('max value:', np.max(np.array(img)))
print('10x10 section of the number "5" above:')
print(np.array(img)[10:20, 10:20])

This might look overwhelming but just take it in one piece at a time. This means you'll be using a 28x28 matrix that has values 0-255. 
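
Since each pixel is used as its own feature, the 28x28 grid eventually has to be flattened into a vector of 784 values. Here is a minimal sketch of that reshape using a made-up array (`fake_img` is a hypothetical stand-in, not a real MNIST digit):

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28x28 integers in [0, 255]
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten row by row so that each pixel becomes one feature
features = fake_img.reshape(-1)

print(fake_img.shape)   # (28, 28)
print(features.shape)   # (784,)

# Pixel (row, col) ends up at index row * 28 + col
row, col = 3, 5
assert features[row * 28 + col] == fake_img[row, col]
```

The training code later in this tutorial does the equivalent reshape on a whole batch with `X.view(-1, 28*28)`.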

### Standardize the Data

Neural networks use small weight values, so having large inputs such as 255 can cause issues during training. Because of this there are two common ways to modify features. The first is with normalization. Normalizing your data means to convert it to a given range. This range is generally `[-1, 1]` or `[0, 1]`. The smaller values tend to increase training speed and help reduce numerical instability. The second way is called standardization. Standardizing your data consists of shifting the data to have a mean of 0 and a standard deviation of 1.
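
To make the difference concrete, here is a small sketch of both approaches with NumPy. The `pixels` array is made up purely for illustration:

```python
import numpy as np

# Made-up pixel values for illustration only
pixels = np.array([0.0, 51.0, 102.0, 204.0, 255.0])

# Normalization: rescale into the range [0, 1]
normalized = pixels / 255.0

# Standardization: shift to mean 0 and scale to standard deviation 1
standardized = (pixels - pixels.mean()) / pixels.std()

print(normalized)            # every value lies between 0 and 1
print(standardized.mean())   # approximately 0
print(standardized.std())    # approximately 1
```

Note that standardized values are not confined to a fixed range. They are centered on 0 and can be negative.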

Because our image is already bounded by `[0, 255]`, reducing it via normalization will keep the same spread. This would allow for faster and more stable training, but we're instead going to use standardization. I often find standardizing the data more reliable but you can try both methods and see which works better! (This is a common trend in machine learning)

Now let's figure out what the mean and standard deviation of our training data are.

In [None]:
X = np.stack([np.array(pair[0]) for pair in train_set])
print(X.shape)

In [None]:
np.mean(X / 255.0) 

In [None]:
np.std(X / 255.0)

Woah, woah, woah. All that talk about not normalizing the data and suddenly we're dividing by 255? Since normalizing the data keeps the same spread, it doesn't matter if we subtract the un-normalized or normalized mean as long as we're consistent. That means we'll also want to first transform the data to the range `[0, 1]` and **then** apply the standardization. In this case we'll want to subtract the mean and divide by the standard deviation we see above.
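
If you want to convince yourself of the "as long as we're consistent" part, here is a quick check on made-up values (the `raw` array is hypothetical). Standardizing the raw values with raw statistics gives the same result as standardizing the scaled values with scaled statistics, while mixing the two does not:

```python
import numpy as np

# Made-up raw pixel values in [0, 255]
raw = np.array([0.0, 30.0, 120.0, 200.0, 255.0])
scaled = raw / 255.0  # the same values rescaled to [0, 1]

# Standardize each version with its own mean and standard deviation
z_raw = (raw - raw.mean()) / raw.std()
z_scaled = (scaled - scaled.mean()) / scaled.std()

# Inconsistent: raw values combined with statistics from the scaled data
mismatched = (raw - scaled.mean()) / scaled.std()

print(np.allclose(z_raw, z_scaled))     # True
print(np.allclose(z_raw, mismatched))   # False
```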

In [None]:
transform = transforms.Compose([
    transforms.ToTensor(),  # Converts to a Float Tensor and divides by 255
    # $MODIFY 1
    transforms.Normalize((0.1307,), (0.3081,))  # Mean and std
])
train_set = datasets.MNIST(root='data', train=True, download=True, transform=transform)
test_set = datasets.MNIST(root='data', train=False, download=True, transform=transform)

Notice that you're applying the same transformation to the training set as you are to the test set. Your test set is a hidden set - you are never allowed to look at it. The transformations you apply to the training set should be the same as those you apply to the test set. You're working under the assumption that your training data comes from a distribution similar to the one your unseen data comes from.

### Deep Neural Networks

Next you're going to add the ability to create neural networks of arbitrary depth.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, hidden_layers, activation_func):
        super().__init__()
        self.activation_func = activation_func
        self.layers = nn.ModuleList()

        # Add a linear layer from input to the first hidden layer
        self.layers.append(nn.Linear(784, hidden_layers[0]))
        # Add arbitrary number of hidden layers
        for i in range(len(hidden_layers) - 1):
            self.layers.append(nn.Linear(hidden_layers[i], hidden_layers[i+1]))
        # Add a final one from the last hidden layer to the number of classes (10)
        self.layers.append(nn.Linear(hidden_layers[-1], 10))

    def forward(self, x):
        out = x
        for layer in self.layers[:-1]:
            out = layer(out)
            out = self.activation_func(out)
        # The output of the final layer doesn't need to go through
        # an activation function
        out = self.layers[-1](out)
        return out


This is pretty similar to what you saw in the last tutorial. The only difference is now you're using a [module list](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html) to hold the layers. Take a look:

In [None]:
net1 = Network([128, 64, 32], nn.ReLU())
print(net1)

Using a module list is an efficient way to dynamically add modules (layers).
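
To get a feel for how the `hidden_layers` list turns into layers, here is a small plain-Python sketch that mirrors the constructor above. The helper names `layer_shapes` and `count_parameters` are made up for this example, and the count assumes each linear layer has a bias term (the `nn.Linear` default):

```python
def layer_shapes(hidden_layers, n_inputs=784, n_classes=10):
    """Return (in_features, out_features) for each linear layer."""
    sizes = [n_inputs] + list(hidden_layers) + [n_classes]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(hidden_layers):
    """Each linear layer has in*out weights plus out biases."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes(hidden_layers))


print(layer_shapes([128, 64, 32]))
# [(784, 128), (128, 64), (64, 32), (32, 10)]
print(count_parameters([128, 64, 32]))  # 111146
print(count_parameters([32]))           # 25450
```

Most of the parameters sit in the first layer, because that is where the 784 inputs are connected.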

### Activation Functions

PyTorch offers many [non-linear activation functions](https://pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions). In the last tutorial you used the `Sigmoid` activation function but most modern models use a rectified linear unit (ReLU). There are generally two main reasons to favor ReLU over Sigmoid. The first is that ReLU and its derivative are faster to compute. The second is that Sigmoid often suffers from a vanishing gradient. As neural networks get sufficiently deep, it can be easy to saturate the activation function. This happens when the activation function plateaus, causing the derivative to be close to 0. Once the derivative is that small, the updates that get passed back will also be close to 0, resulting in extremely slow convergence, if there is any at all. You can read more about ReLU vs. Sigmoid activations [here](https://wandb.ai/ayush-thakur/dl-question-bank/reports/ReLU-vs-Sigmoid-Function-in-Deep-Neural-Networks--VmlldzoyMDk0MzI#:~:text=In%20other%20words%2C%20once%20a,eliminated%20by%20using%20leaky%20ReLUs.).
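
The saturation effect is easy to see numerically. The sketch below uses plain Python rather than PyTorch, and the helper names are made up for this example. It prints the derivative of each activation at a few inputs:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0


for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

# Backpropagation multiplies one such factor per layer. Even at its
# maximum (0.25 at x=0), the sigmoid factor shrinks quickly with depth.
print(0.25 ** 10)  # roughly 9.5e-07
```

ReLU is not immune to zero gradients either: its derivative is exactly 0 for negative inputs, which is the issue leaky ReLU variants try to address.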

### Training

How do you avoid overfitting the training data if you're not allowed to touch the test set? A common practice is to have a validation set - a subset of the training set which you are allowed to look at, but treat as though it were a test set. By using the validation set you can take a look at how your model **might** act when given the test set.

In fact, an easy way to see if you're overfitting the training data is to watch the training loss decrease while the validation loss begins to increase. Let's take a look.

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks.progress import TQDMProgressBar
from torchmetrics import Accuracy
import torch
pl.seed_everything(0xC0FFEE)

# Split the training set from 60k training to 50k/10k training/val
train, val = torch.utils.data.random_split(train_set, [50000, 10000])
print(type(train))
print(len(train))
print(len(val))

Now it's time to create the lightning module. This is very similar to what you did last time, but now the loss function is `F.cross_entropy` instead of `F.binary_cross_entropy`. You're also now providing `validation_step()` and `test_step()` functions. These are similar to the training step you've seen before, but get run automatically by PyTorch Lightning when it's time to validate and test the data.
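
As a brief aside, it can help to see what cross entropy computes for a single sample. The sketch below is a plain-Python illustration, not PyTorch's implementation, and the `logits` values are made up. The loss is the negative log of the softmax probability assigned to the correct class:

```python
import math


def cross_entropy(logits, target):
    """Cross entropy for one sample: -log(softmax(logits)[target])."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 1.0, 0.1]  # made-up raw scores for 3 classes
print(cross_entropy(logits, target=0))  # low loss: class 0 has the top score
print(cross_entropy(logits, target=2))  # higher loss: class 2 has the lowest score
```

`F.cross_entropy` takes raw scores like these and applies the log-softmax internally, which is why the final layer of `Network` has no activation function. By default it also averages the loss over the batch.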

In [None]:
from torch.utils.data import DataLoader, random_split

class LitModel(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.net = model
       
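        # Note: newer torchmetrics releases require a task argument here,
        # e.g. Accuracy(task="multiclass", num_classes=10)
        self.val_accuracy = Accuracy()
        self.test_accuracy = Accuracy()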

    def training_step(self, batch, batch_idx):
        X, y = batch
        X = X.view(-1, 28*28)
        y_pred = self.net(X)
        loss = F.cross_entropy(y_pred, y)
        
        self.log('train_loss', loss, prog_bar=True)

        return loss
      
    def validation_step(self, batch, batch_idx):
        X, y = batch
        X = X.view(-1, 28*28)
        y_pred = self.net(X)
        loss = F.cross_entropy(y_pred, y)
        self.val_accuracy.update(y_pred, y)

        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', self.val_accuracy, prog_bar=True)
    
    def test_step(self, batch, batch_idx):
        X, y = batch
        X = X.view(-1, 28*28)
        y_pred = self.net(X)
        loss = F.cross_entropy(y_pred, y)
        self.test_accuracy.update(y_pred, y)

        self.log('test_loss', loss, prog_bar=True)
        self.log('test_acc', self.test_accuracy, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-3)
    


Next, instantiate the model and train.

In [None]:
lit_model = LitModel(
    Network([32], nn.ReLU()) 
)

batch_size = 128

train_dataloader = DataLoader(train, batch_size=batch_size, shuffle=True)
val_dataloader   = DataLoader(val, batch_size=batch_size, shuffle=False)

trainer = pl.Trainer(
    max_epochs=100,
    enable_checkpointing=False,
    callbacks=[TQDMProgressBar(refresh_rate=20)]
)

trainer.fit(lit_model, train_dataloader, val_dataloader)

Now you can load `tensorboard` to take a look at the loss.

In [None]:
%load_ext tensorboard


Finally, just like you've seen before, launch tensorboard. Spend some time going through the different tabs to help familiarize yourself with the interface.

In [None]:
%tensorboard --logdir=lightning_logs

Take a good look above. Notice that the validation loss increases and continues to rise after a while. That is a clear sign of overfitting, which is not what you want. What you'd like is a way to know when the validation loss stops decreasing and then stop the training. This is known as early stopping. Now let's add early stopping to monitor the `val_loss`. You do this by adding `EarlyStopping` as a callback. This will check for `val_loss`, which is being logged above. If `val_loss` does not decrease for a number of epochs in a row, training will halt.
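
The idea behind the patience counter can be sketched in a few lines of plain Python. This is a simplified illustration of the logic, not Lightning's actual implementation, and both the function name and the loss values are made up:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: improve, then drift upward
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
print(early_stopping_epoch(losses, patience=3))  # 5
```

In this made-up run training stops at epoch 5 and never reaches the better loss at epoch 6. A larger `patience` tolerates longer plateaus at the cost of more training time.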

In [None]:
from pytorch_lightning.callbacks.early_stopping import EarlyStopping


# $MODIFY 2
hidden_layers = [32]
# $MODIFY 3
activation_func = nn.ReLU()

lit_model = LitModel(
    Network(hidden_layers, activation_func) 
)

# $MODIFY 4
batch_size = 128

train_dataloader = DataLoader(train, batch_size=batch_size, shuffle=True)
val_dataloader   = DataLoader(val, batch_size=batch_size, shuffle=False)

trainer = pl.Trainer(
    max_epochs=100,
    enable_checkpointing=False,
    callbacks=[TQDMProgressBar(refresh_rate=100), EarlyStopping(monitor="val_loss", mode="min", patience=3)]
)

trainer.fit(lit_model, train_dataloader, val_dataloader)


Take note of how many epochs it ran for! Let's test the accuracy on the test set to see if it performs well.

In [None]:
test_dataloader = DataLoader(test_set, batch_size=batch_size)
trainer.test(lit_model, test_dataloader)

### Tasks
Below is a list of tasks to try to help you solidify your understanding. Remember, each number corresponds to a `$MODIFY` comment above.

1. Try removing the standardization portion and re-running. Do you notice any differences in how the model trains? 
2. Change the hidden layer structure. A common architecture you'll see is a "funnel" approach, where the number of neurons decreases as the layer number increases. Try some different depths, numbers of neurons per layer and weird architectures. You'll probably want to run this with early stopping. What do you notice about the accuracy? Does the architecture play a role in training time? How does the number of parameters get affected?
3. Try some different [activation functions](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity). How does it affect overall training and accuracy?
4. Change the `batch_size`. How does this affect the training?
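
For task 4, one thing that changes directly with the batch size is the number of weight updates per epoch. Here is a rough sketch, assuming the 50k training split from above and a data loader that keeps the last partial batch (the `DataLoader` default). The helper name is made up for this example:

```python
import math

n_train = 50_000  # size of the training split used above


def steps_per_epoch(n_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


for batch_size in [32, 128, 512]:
    print(batch_size, steps_per_epoch(n_train, batch_size))
# 32 1563
# 128 391
# 512 98
```

Smaller batches mean more, noisier updates per epoch, while larger batches mean fewer, smoother ones.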

### Summary
In this tutorial you learned how to load the training and test data using PyTorch `datasets`. You learned some of the differences between normalizing and standardizing your data, and saw how to use PyTorch `transforms` to accomplish this. You were able to create a deep neural network of arbitrary depth and were able to use different activation functions. Finally, you saw how to detect overfitting, and how to prevent it using early stopping.

That's it for now. See you next time!