# Neural Networks in PyTorch
## What is a neural network?
You can think of a neural network as a function that takes an input and produces an output. In this example, we will focus on a simple MLP (Multi-Layer Perceptron), a feed-forward neural network with the following properties:
1. Fully Connected Layers.
2. At least one hidden layer.

### Input Layer
This is where your data enters the network. The number of neurons in this layer corresponds to the number of features in your dataset.

### Hidden Layers
These are the layers between the input and output layers. They perform most of the computational work, transforming the input data through a series of mathematical operations. Each neuron in a hidden layer takes a weighted sum of the outputs from the previous layer, adds a bias term, and then passes the result through an activation function.  The activation function is key because it introduces non-linearity, allowing the network to learn complex patterns that a simple linear model couldn't.
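
To make the "weighted sum, plus bias, then activation" description concrete, here is a minimal sketch that computes one hidden layer by hand and compares it with PyTorch's built-in modules. The layer sizes (4 inputs, 3 neurons) and the choice of ReLU are assumptions made only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # Make the random weights reproducible

# Hypothetical sizes for illustration: 4 input features, 3 hidden neurons
layer = nn.Linear(in_features=4, out_features=3)
x = torch.randn(4)  # One input example with 4 features

# Manual computation: weighted sum + bias, then ReLU activation
weighted_sum = layer.weight @ x + layer.bias
manual_output = torch.clamp(weighted_sum, min=0)  # ReLU(z) = max(0, z)

# The same computation using PyTorch's built-in modules
builtin_output = torch.relu(layer(x))

print(torch.allclose(manual_output, builtin_output))
```

Each row of `layer.weight` holds the weights of one neuron, so the matrix product computes all three weighted sums at once.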

### Output Layer
This is the final layer that produces the network's prediction. The number of neurons here depends on the problem you're trying to solve (e.g., one neuron for binary classification, multiple neurons for multi-class classification).
NOTE: For multi-class classification, the raw outputs of this layer are usually passed through a *softmax* function to turn them into probabilities for each class.


## Loading our Data
In the following code, we will get our dataset from the torchvision library. We use [MNIST](https://docs.pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html), a dataset of 70,000 images of handwritten digits (28x28 pixels each) that we can use to train and test our network.

### Transformer
The raw images are PIL images, which our model cannot read directly, so we need to convert them into PyTorch tensors before we do anything. We can do this with torchvision's `ToTensor` transform (`torchvision.transforms.ToTensor()`).
```py
from torchvision import transforms
transformer = transforms.Compose([
    transforms.ToTensor(), # Convert the image to a PyTorch Tensor
])
```
Along with converting the images into PyTorch tensors, `ToTensor` also scales the pixel values from [0, 255] to [0.0, 1.0].
We use `transforms.Compose` here so we can chain together multiple transforms in the future (like `transforms.Normalize()`).

Next, we import our datasets from `torchvision.datasets`. MNIST has a pre-defined split: 60,000 images for training (`train=True`) and 10,000 for testing (`train=False`).

In [1]:
%pip install torch
%pip install torchvision

import torch
from torchvision import transforms, datasets # Contains our MNIST dataset and a transformer to convert the PIL images into PyTorch Tensors.

transformer = transforms.Compose([
    transforms.ToTensor(), # Convert the image to a PyTorch Tensor
])

train_dataset = datasets.MNIST(
    root='./data', 
    train=True, 
    transform=transformer, 
    download=True
)

test_dataset = datasets.MNIST(
    root='./data', 
    train=False, 
    transform=transformer, 
    download=True
)


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Data Loader
### What is a Data Loader?
A DataLoader is an iterable that wraps a Dataset and provides an efficient way to load and process data for training or inference in machine learning models.
It can shuffle our data and split them into batches. 

The `shuffle=True` parameter matters during training. Shuffling ensures that in each epoch (one complete pass through the entire training dataset) our model sees the images in a different, random order. This helps prevent the model from memorizing the sequence of examples or developing an order bias.
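
The batching and shuffling behavior is easy to see on a toy dataset. The sketch below uses a hypothetical `TensorDataset` of 10 samples (not MNIST) and a batch size of 4, both chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 3 features each, labels 0..9
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10)
toy_dataset = TensorDataset(features, targets)

toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_sizes = []
seen_targets = []
for batch_features, batch_targets in toy_loader:
    batch_sizes.append(batch_features.shape[0])
    seen_targets.extend(batch_targets.tolist())

print(batch_sizes)   # [4, 4, 2]: the last batch is smaller because 10 is not divisible by 4
print(seen_targets)  # Every label appears exactly once, in a shuffled order
```

Iterating over `toy_loader` a second time would produce a new random order, which is what happens at the start of every epoch.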

## Why Batching?
Batching is used to balance computational efficiency with the model's performance. We use mini-batching here as it is a compromise between training on the whole dataset at once ([batch gradient descent](https://zilliz.com/glossary/batch-gradient-descent)) and training on individual data points ([stochastic gradient descent](https://www.geeksforgeeks.org/machine-learning/ml-stochastic-gradient-descent-sgd/)).

### Pros of Mini-Batching
1. Computational Efficiency: Each update is much cheaper than a full-batch update and requires less memory, because only one batch has to be processed at a time.

2. Training Stability: It provides a more stable and representative gradient than pure SGD, leading to a more reliable training process.

3. Escaping Local Minima: The inherent noise in the gradient can actually be a good thing, helping the model to avoid getting stuck in a shallow "local minimum" and find a better solution in the loss landscape.
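
The "noisy but representative" nature of mini-batch gradients can be checked numerically. This is a rough sketch on a hypothetical linear regression problem (random data, mean squared error), not on the MNIST model: each mini-batch gradient differs from the full-batch gradient, but with equally sized batches their average matches it.

```python
import torch

torch.manual_seed(0)

# Hypothetical regression data: 256 samples, 5 features
X = torch.randn(256, 5)
y = torch.randn(256)
w = torch.zeros(5, requires_grad=True)

def mse_gradient(inputs, targets):
    """Gradient of the mean squared error with respect to w."""
    loss = ((inputs @ w - targets) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, w)
    return grad

# Gradient over the whole dataset (batch gradient descent)
full_gradient = mse_gradient(X, y)

# Gradients over four mini-batches of 64 samples each
batch_gradients = torch.stack([
    mse_gradient(X[i:i + 64], y[i:i + 64]) for i in range(0, 256, 64)
])

# Each mini-batch gradient deviates from the full gradient (the "noise")
print((batch_gradients - full_gradient).norm(dim=1))

# Averaged over equally sized batches, the deviations cancel out
print(torch.allclose(batch_gradients.mean(dim=0), full_gradient, atol=1e-5))
```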




In [2]:
from torch.utils.data import DataLoader

# Defining a batch size
BATCH_SIZE = 64

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False
)


## Creating our Neural Network
The following code shows us how we can structure our network. In our `__init__` method, we define the layers of our neural network, including a flattening layer and a stack of linear layers with activation functions.

### Inside of our initializer (\_\_init\_\_)
#### Flatten
The `nn.Flatten` layer flattens each 28x28 image into a 1D vector of 784 features (the batch dimension is left untouched), preparing it for the linear layers.

#### Layers
We use nn.Sequential to create a container for a linear stack of layers. The ReLU (Rectified Linear Unit) activation function introduces non-linearity to the network, allowing it to learn complex, non-linear mappings.
1. The first linear layer takes a 784-dimensional input and projects it into a 512-dimensional feature space.
2. The second linear layer further transforms the 512 features into a 256-dimensional space.
3. The final linear layer, or "output layer," projects the features down to 10 dimensions, corresponding to the 10 digit classes (0-9). These raw outputs are known as "logits."
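
To see how the tensor shape changes from layer to layer, the following sketch rebuilds the same stack of layers on its own and pushes a fake batch through it. The random input batch of 64 images is an assumption for illustration; it only needs to have the right shape.

```python
import torch
from torch import nn

# The same layer sizes as described above, built standalone for illustration
flatten = nn.Flatten()
stack = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# A fake batch of 64 grayscale images: (batch, channels, height, width)
x = torch.randn(64, 1, 28, 28)
shapes = [tuple(x.shape)]

x = flatten(x)
shapes.append(tuple(x.shape))

# Record the shape after every linear layer (ReLU does not change the shape)
for layer in stack:
    x = layer(x)
    if isinstance(layer, nn.Linear):
        shapes.append(tuple(x.shape))

for shape in shapes:
    print(shape)
# (64, 1, 28, 28) -> (64, 784) -> (64, 512) -> (64, 256) -> (64, 10)
```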

In [3]:
from torch import nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten() # Flattens the 28x28 image into a 784-dimensional vector
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512), # First linear layer
            nn.ReLU(), # Activation function
            nn.Linear(512, 256), # Second linear layer
            nn.ReLU(), # Activation function
            nn.Linear(256, 10) # Output layer
        )

    def forward(self, x):
        x = self.flatten(x) # Flatten the input
        logits = self.linear_relu_stack(x) # Pass through the linear_relu_stack
        return logits

## Forward Pass
1. First we must create an instance of our model. 
2. Next, we need to get our next batch. We use our training DataLoader to get a single batch of images and their corresponding labels. 
3. Finally, we pass in our images into our model. PyTorch will automatically call the forward() method we have created earlier. This method returns our logits. 

In [4]:
# Create an instance of the model
model = SimpleNN()

# Get a single batch of images and labels from the training data loader
images, labels = next(iter(train_loader))

# Perform the forward pass to get the logits
logits = model(images)

# Print the shape of the output
print(f"Shape of the logits tensor: {logits.shape}")
print(f"Sample logits for first image:\n{logits[0]}")

Shape of the logits tensor: torch.Size([64, 10])
Sample logits for first image:
tensor([ 0.0765,  0.0350, -0.0983,  0.0425, -0.0550,  0.0564, -0.0206, -0.0030,
        -0.0416, -0.0777], grad_fn=<SelectBackward0>)


## Loss Function
A loss function is a mathematical function that measures the discrepancy between the network's predictions and the true labels. \
Since our model is a multi-class classifier with one output per class, we will use the Cross-Entropy Loss function.
### Cross-Entropy Loss Function
`nn.CrossEntropyLoss` combines two key operations in one efficient function:
#### Softmax
It first applies the softmax function to the network's logits (raw output). This converts the logits into a set of probabilities that will sum up to 1. The highest logit will correspond to the class with the highest probability. \
Softmax formula: $$ P(y_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$
#### Negative Log likelihood
It then calculates the negative log likelihood of the true class. This will heavily penalize the model when it is very confident (assigns a high probability) about a wrong prediction. \
NLL formula: $$ L_{NLL} = -\log(p_i) $$
where $i$ is the true class and $p_i = P(y_i)$ is the softmax probability assigned to the true class.
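
The two steps can be reproduced by hand and compared with `nn.CrossEntropyLoss`. This is a minimal sketch on hypothetical random logits and labels (4 samples, 10 classes), not on the output of the MNIST model.

```python
import math
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
example_logits = torch.randn(4, 10)          # Hypothetical batch: 4 samples, 10 classes
example_labels = torch.tensor([3, 0, 9, 1])  # Hypothetical true classes

# Step 1: softmax turns each row of logits into probabilities that sum to 1
probabilities = torch.softmax(example_logits, dim=1)

# Step 2: negative log of the probability of the true class, averaged over the batch
true_class_probs = probabilities[torch.arange(4), example_labels]
manual_loss = -torch.log(true_class_probs).mean()

builtin_loss = nn.CrossEntropyLoss()(example_logits, example_labels)
print(manual_loss.item(), builtin_loss.item())

# With identical logits every class gets probability 1/10, so the loss is ln(10)
uniform_loss = F.cross_entropy(torch.zeros(4, 10), example_labels)
print(uniform_loss.item(), math.log(10))
```

The last check is a handy sanity test: an untrained 10-class model that is roughly guessing should start with a loss near ln(10) ≈ 2.30.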


In [5]:
# Define the loss function
loss_function = nn.CrossEntropyLoss()

# Calculate the loss
loss = loss_function(logits, labels)

print(f"Loss value: {loss.item()}")

Loss value: 2.298157215118408


## Optimizer
### What is an Optimizer?
An optimizer is an algorithm that adjusts the model's weights and biases to reduce the loss. Think of the loss function as a measure of the "error" or "cost," and the optimizer as the engine that uses that information to navigate a "loss landscape" to find the lowest point. Some examples of optimizers are Adam, RMSprop, Adagrad, and several variants of gradient descent.

Let's use Adam for this example.
### Adam (Optimizer)
Adam, which stands for Adaptive Moment Estimation, is an advanced optimization algorithm used to train neural networks. Adam is more sophisticated and often more effective than plain stochastic gradient descent. Rather than using one fixed learning rate for all of the model's parameters, Adam adapts the step size for each parameter individually, using running averages of past gradients and squared gradients. You can learn more about gradient descent optimization algorithms [here](https://arxiv.org/pdf/1609.04747). \
NOTE: `model.parameters()` returns an iterator that gives the optimizer access to the model's trainable tensors.
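
To illustrate the per-parameter adaptation, here is a sketch that performs the first Adam step by hand and compares it with `torch.optim.Adam`. The three-element parameter and the loss are hypothetical, chosen so that the gradients differ by several orders of magnitude; the hyperparameters are the defaults of `torch.optim.Adam` apart from the learning rate, which matches the one used in this notebook.

```python
import torch

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # Defaults of torch.optim.Adam

param = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
initial = param.detach().clone()
adam = torch.optim.Adam([param], lr=lr)

# Hypothetical loss whose gradients are 0.02, -4.0 and 600.0
loss = (torch.tensor([0.01, 1.0, 100.0]) * param ** 2).sum()
loss.backward()

# Manual Adam update for the first step (t = 1)
grad = param.grad.clone()
m = (1 - beta1) * grad        # First moment: running mean of gradients
v = (1 - beta2) * grad ** 2   # Second moment: running mean of squared gradients
m_hat = m / (1 - beta1 ** 1)  # Bias correction for the zero initialization
v_hat = v / (1 - beta2 ** 1)
expected = initial - lr * m_hat / (v_hat.sqrt() + eps)

adam.step()

print(torch.allclose(param.detach(), expected, atol=1e-6))
print((param.detach() - initial).abs())  # Each step is about lr in size
```

Even though the gradients differ by a factor of 30,000, every parameter moves by roughly `lr` on this first step, because the update divides by the square root of the squared-gradient average.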



In [6]:
import torch.optim as optim

# Set the learning rate (hyperparameter)
# Adam adapts the step for each parameter; lr roughly bounds the size of a single update.
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

## Training Loop
Now that we have set up all the components we need to train our model, we can combine them all to create our training loop. \
The Training Loop is an iterative process that runs for a number of `epochs`. An epoch is one full pass through the entire training dataset.
Here's a breakdown of the four key steps that happen in each iteration of the loop:
1. Forward Pass: We pass a batch of data through the model to get predictions.
2. Loss Calculation: We compare the predictions to the actual labels using our `loss_function`.
3. Backpropagation: We call `loss.backward()` to compute the gradients. PyTorch handles this automatically.
4. Parameter Update: We call `optimizer.step()` to update the model's weights and biases using the gradients.

NOTE: It's also crucial to zero out the gradients before each backward pass. We do this by calling `optimizer.zero_grad()`. This is because PyTorch accumulates gradients by default. If we don't zero them out, the gradients from the previous batch would be added to the gradients of the current batch, leading to incorrect updates. \
PyTorch leaves this step to us so that more flexible training strategies, such as gradient accumulation and multi-task learning, remain possible.
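
The accumulation behavior is easy to observe on a single parameter. In this sketch the parameter `w` and the loss `w ** 2` are hypothetical and chosen so the gradient can be checked by hand (it is `2 * w`).

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).sum().backward()
first_grad = w.grad.clone()

# Second backward pass without resetting: the new gradient is added on top
(w ** 2).sum().backward()
accumulated_grad = w.grad.clone()

# Reset the gradient by hand (optimizer.zero_grad() plays this role in the training loop)
w.grad = None
(w ** 2).sum().backward()
fresh_grad = w.grad.clone()

print(first_grad.item(), accumulated_grad.item(), fresh_grad.item())  # 4.0 8.0 4.0
```

Gradient accumulation uses exactly this effect on purpose: several small batches are passed backward before a single `optimizer.step()`, which simulates a larger batch.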

In [7]:
# Hyperparameter
# Define the number of epochs
epochs = 5

for epoch in range(1, epochs + 1):
    running_loss = 0.0 # To monitor progress and debugging. This should eventually flatten out
    for inputs, labels in train_loader:
        # Zero out the gradients 
        optimizer.zero_grad()

        # Forward Pass
        logits = model(inputs)
        loss = loss_function(logits, labels)

        # Backward Pass (backpropagation)
        loss.backward()

        # Update the weights
        optimizer.step()

        running_loss += loss.item()

    # Print the average loss for this epoch
    print(f"Epoch {epoch}, Loss: {running_loss / len(train_loader)}")


Epoch 1, Loss: 0.23829017599711
Epoch 2, Loss: 0.08969508984914498
Epoch 3, Loss: 0.0576080125368097
Epoch 4, Loss: 0.043796712890511125
Epoch 5, Loss: 0.034099964722583906


## Chain of Communication 
To summarize, the components communicate in the following order:
1. `model and DataLoader`: The `DataLoader` provides batches of data to the model's forward method.

2. `loss and model`: The `loss_function` takes the model's raw output (logits) and compares it to the true labels.

3. `loss and PyTorch's Autograd`: Calling `loss.backward()` is the key trigger. PyTorch's automatic differentiation engine (`Autograd`) takes over. It traces the entire computation backward from the final loss value to the model's parameters.

4. `Autograd and model's parameters`: As it traces back, `Autograd` computes the gradient for each parameter and populates the `.grad` attribute for that parameter's tensor.

5. `optimizer and model's parameters`: The `optimizer` was already given a reference to all the parameters. When `optimizer.step()` is called, it accesses the now-populated `.grad` attributes and updates the parameters' values.
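
The chain above can be watched step by step on a tiny model. This sketch uses a hypothetical one-layer model and plain SGD instead of Adam, only because the SGD update rule (`parameter - lr * gradient`) is simple enough to verify by hand.

```python
import torch
from torch import nn

torch.manual_seed(0)
tiny_model = nn.Linear(2, 1)                            # Hypothetical tiny model
sgd = torch.optim.SGD(tiny_model.parameters(), lr=0.1)  # Plain SGD keeps the math easy to check

grad_before = tiny_model.weight.grad  # None: no gradient has been computed yet

inputs = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[0.0]])
loss = nn.MSELoss()(tiny_model(inputs), target)  # Model output goes into the loss
loss.backward()                                  # Autograd fills in .grad for each parameter

weight_before = tiny_model.weight.detach().clone()
weight_grad = tiny_model.weight.grad.clone()
sgd.step()                                       # The optimizer reads .grad and updates in place

print(grad_before)  # None
print(torch.allclose(tiny_model.weight.detach(), weight_before - 0.1 * weight_grad))
```

Note that the optimizer never sees the loss or the data. It only reads the `.grad` attributes of the parameters it was given.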

## Evaluation
After the training loop is complete, your neural network has learned and its weights have been adjusted to make better predictions. This is great for the training data, but the true test of a model's performance is how well it works on data it has never seen before. We do this by evaluating the model on our separate test dataset.

The evaluation process is different from training because our goal is just to measure performance, not to continue learning. Therefore, we do not update the model's parameters, which means we do not need the loss function, backpropagation, or the optimizer.

The following code is a high-level overview of the evaluation process:

`model.eval()`: This command tells PyTorch that the model is now in evaluation mode. This is very important because certain layers, like Dropout and BatchNorm, behave differently during evaluation than they do during training.

`with torch.no_grad()`: This is a context manager that temporarily disables gradient calculation. Since we are not training, we don't need to compute gradients, and using this block saves memory and speeds up computations.

`Looping through the test_loader`: We use the test_loader to iterate through all the test images in batches.

`torch.max(outputs.data, 1)`: For each batch, we perform a forward pass to get the logits. Then, we use torch.max to find the index of the highest logit for each image. This index is our model's final prediction for that image.

`Calculating Accuracy`: Finally, we compare the model's predictions to the true labels to calculate the accuracy. Accuracy is a simple and intuitive metric that tells us the percentage of test images that the model correctly classified.
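
The prediction and accuracy steps can be tried on a handful of made-up logits. The values below are hypothetical (3 images, 4 classes) and chosen so the result is easy to verify by eye. `torch.argmax` is shown as an alternative that returns only the indices.

```python
import torch

# Hypothetical logits for 3 images and 4 classes
example_outputs = torch.tensor([
    [0.1, 2.5, -1.0, 0.3],
    [1.2, 0.4, 0.9, 3.1],
    [0.0, -0.5, 0.7, 0.2],
])
example_labels = torch.tensor([1, 3, 0])

# torch.max along dimension 1 returns (largest values, their indices)
max_values, predicted = torch.max(example_outputs, 1)

# torch.argmax returns only the indices
same_prediction = torch.argmax(example_outputs, dim=1)

correct = (predicted == example_labels).sum().item()
accuracy = 100 * correct / example_labels.size(0)

print(predicted.tolist())   # [1, 3, 2]
print(f"{accuracy:.2f}%")   # 66.67% (2 of 3 predictions match the labels)
```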



In [8]:
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

# Disable gradient calculation
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        # Get the predicted class
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy of the model on the 10,000 test images: {accuracy:.2f}%')

Accuracy of the model on the 10,000 test images: 97.30%


## Let's try some other network architectures
Let's create a network with only one hidden layer.

In [9]:
class SingleLayerNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten() # Flattens the 28x28 image into a 784-dimensional vector
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 256), # First linear layer
            nn.ReLU(), # Activation function
            nn.Linear(256, 10) # Output layer
        )
    
    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    

In [10]:
# Create our model
single_layer_model = SingleLayerNN()

# Create our loss function
single_layer_loss_function = nn.CrossEntropyLoss()

# Create our ADAM Optimizer
single_layer_optimizer = optim.Adam(single_layer_model.parameters(), lr=0.001)

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    for index, (inputs, labels) in enumerate(train_loader):
        # Zero out the gradients 
        single_layer_optimizer.zero_grad()

        # Forward Pass
        logits = single_layer_model(inputs)

        # Calculate the loss
        loss = single_layer_loss_function(logits, labels)

        # Backward Pass (backpropagation)
        loss.backward()

        # Update the weights
        single_layer_optimizer.step()

        running_loss += loss.item()
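    # Note: this prints the summed batch loss for the epoch, not the per-batch average used in the earlier loop
    print(f'Epoch {epoch}\'s training loss: {running_loss}')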



Epoch 1's training loss: 288.5349732283503
Epoch 2's training loss: 122.41327644325793
Epoch 3's training loss: 80.539859585464
Epoch 4's training loss: 58.42187776044011
Epoch 5's training loss: 43.51963080186397


In [11]:
single_layer_model.eval()  # Set the model to evaluation mode
single_layer_correct = 0
single_layer_total = 0

# Disable gradient calculation
with torch.no_grad():
    for images, labels in test_loader:
        outputs = single_layer_model(images)
        # Get the predicted class
        _, predicted = torch.max(outputs.data, 1)
        single_layer_total += labels.size(0)
        single_layer_correct += (predicted == labels).sum().item()

single_layer_accuracy = 100 * single_layer_correct / single_layer_total
print(f'Accuracy of the model on the 10,000 test images: {single_layer_accuracy:.2f}%')

Accuracy of the model on the 10,000 test images: 97.78%


## Advantages of More Layers and Neurons
More hidden layers and neurons increase the network's capacity. This is a major advantage for complex tasks.

### Learning Complex Patterns
A deeper and wider network has more trainable parameters (weights and biases). This gives the model the flexibility to learn and represent non-linear relationships in the data that a simpler model cannot.
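
To put numbers on "more trainable parameters", this sketch counts the parameters of the two architectures used in this notebook. The layers are rebuilt here with `nn.Sequential` so the block runs on its own; `count_parameters` is a hypothetical helper, not a PyTorch function.

```python
from torch import nn

def count_parameters(module: nn.Module) -> int:
    """Count the trainable parameters (weights and biases) of a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Same layer sizes as the two-hidden-layer network
two_hidden = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Same layer sizes as the single-hidden-layer network
one_hidden = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

print(count_parameters(two_hidden))  # 535818
print(count_parameters(one_hidden))  # 203530
```

The deeper network has more than 2.5 times as many parameters, yet both reached roughly 97% test accuracy here, which suggests MNIST does not need the extra capacity.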

### Improved Performance
On large and complex datasets, a higher-capacity network can often achieve higher accuracy and outperform a simpler network because it is better equipped to capture all the variations in the data.

## Disadvantages of More Layers and Neurons
A network with too much capacity can lead to a significant problem called overfitting.

### Overfitting
This occurs when a model learns the training data and its random noise too well. Instead of learning the general patterns, the network essentially "memorizes" the specific examples it has seen. As a result, the model performs exceptionally well on the training data but fails to generalize to new, unseen data, leading to poor performance on the test set.

### Computational Cost
Deeper networks with more neurons require significantly more memory and computational resources for training, which can be a limitation.