**task:** Classification  
**data:** MNIST -- handwritten digits between 0 and 9, in grayscale  
**Loss function:** Cross Entropy Loss (more on this later)  


In [13]:
import torch
import torchvision as tv
import torch.nn as nn
import torch.nn.functional as f
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

**Model Architecture:** We use a multilayer perceptron

Each layer consists of what we can think of neurons or nodes. Each node takes the outputs of all the neurons in the previous layer as input, in the form of a vector. It then performs a linear transformation on the input (wx+b) and then puts it through a non-linear transformation. The final output is a scalar-value, but since the layer (generally speaking) has multiple nodes, the output of the layer itself is a vector.

So if z is the output

$$z = \sigma(\vec{w}^T\vec{x} + \vec{b}) = \sigma (\sum w_i \cdot x_i + b_i)$$

where $\sigma$ is a non-linear function. In our case we use "ReLU", where $$ReLU(x) = max(0, x)$$ In otherwords, it just zeros-out negatives.


In [None]:
class MLP(nn.Module):
    def __init__(self):
        super(MLP,self).__init__()
        self.fc1 = nn.Linear(784, 128) #Input layer 
        self.fc2 = nn.Linear(128, 64) #Hidden layer 1
        self.fc3 = nn.Linear(64, 32) #Hidden layer 2
        self.fc4 = nn.Linear(32, 10) #Output layer

    def forward(self, x):
        x = x.view(-1, 784)
        x = f.relu(self.fc1(x))
        x = f.relu(self.fc2(x))
        x = f.relu(self.fc3(x))
        x = f.relu(self.fc4(x))
        return x    


Above we have defined a multilayer perceptron with 2 hidden layers. The input layer takes in input tensors of size 784 because that is the size of our data: MNIST data consists of 28x28=784 pixel images.

After that we have 2 hidden layers because, why not?

The output layer outputs a 10-dimensional vector because there are 10 possible outputs to the data: 0,1,2,3,4,5,6,7,8,9

Now we get back to the loss function we mentioned earlier. We use "Cross Entropy Loss", given by
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \cdot \log{\hat{y}_{i,c}}$$

Where N is the number of inputs, and c is the number of classes.

$y_{i,c}$ is 1 if sample ***i*** belongs to class ***c*** and 0 otherwise.

$\hat{y}_{i,c}$ is the predicted probability that sample ***i*** belongs to class ***c***

***Hyperparameters***

Our main hyperparameter is the learning rate, which scales the gradient.

Later on, we will set the number of training epochs. That number is also a hyperparameter.

In the next step where we preprocess the data, our batch sizes are also hyperparameters. The batch size is how many examples/datapoints we go through before updating the model weights.

In [34]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = MLP().to(device)
criterion = nn.CrossEntropyLoss()                          # suitable for classification
optimizer = optim.Adam(model.parameters(), lr=1e-3) #lr is the 'learning rate', which is a hyperparameter



**Normalization**: We are normalizing and flattening the data so everything is between 0 and 1.  

*What does that mean, you ask?*

--> Flattening: we are classifying images, which are 2-dimensional. We need input that is 1-dimensional so we can put it into the network, so 'flattening' just means converting it to a 1D array

--> Normalizing: Our images are grayscale values between 0 and 255. Normalizing means we smoosh that to values between 0 and 1, where 255 corresponds to 1. This prevents large values from being overly influential, or compounding into a big number.

In [35]:
transform = transforms.Compose([
    transforms.ToTensor()
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset  = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=1000)


**Training**

Our training algorithm is backpropogation with gradient descent.

After making a forward pass through the network, we make a backward pass (backpropogation) where we compute the loss of each node and update it using the gradient.

In [36]:
num_epochs = 10

for epoch in range(num_epochs):
    model.train()  # training mode (e.g., enables dropout if used)
    total_loss = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss / len(train_loader):.4f}")


Epoch 1/10, Loss: 0.6031
Epoch 2/10, Loss: 0.3798
Epoch 3/10, Loss: 0.3356
Epoch 4/10, Loss: 0.3110
Epoch 5/10, Loss: 0.2933
Epoch 6/10, Loss: 0.2816
Epoch 7/10, Loss: 0.2737
Epoch 8/10, Loss: 0.2659
Epoch 9/10, Loss: 0.2615
Epoch 10/10, Loss: 0.2547


In [37]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        predicted = torch.argmax(outputs, dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")


Test Accuracy: 88.51%


**Overfitting**: The model becomes to specialized on the training data, so that it doesn't generalize well. This can be caused by using too many training epochs and to strict of a loss function.

--> You can tell a model is overfitting when it performs extremely well on training/test data but performs poorly on validation data.

To fix it, use fewer training epochs, or less complicated models (complicated models can become exteremly specialized very easily)

***Accuracy, Precision, and Recall:***

**Notation**: TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative

**Accuracy** is the percentage of predictions the model got right. So its $\frac{TP+TN}{TP+FP+TN+FN}$

**Precision** is the number of 'true' predictions that were actually correct, given as $\frac{TP}{TP+FP}$

**Recall** is the number of 'positive' examples the model actually identified, given as $\frac{TP}{TP+FN}$