# Neural Networks in PyTorch
## What is a neural network?
A basic neural network, also known as a multi-layer perceptron (MLP), is made up of three main types of layers:

### Input Layer
This is where your data enters the network. The number of neurons in this layer corresponds to the number of features in your dataset.

### Hidden Layers
These are the layers between the input and output layers. They perform most of the computational work, transforming the input data through a series of mathematical operations. Each neuron in a hidden layer takes a weighted sum of the outputs from the previous layer, adds a bias term, and then passes the result through an activation function.  The activation function is key because it introduces non-linearity, allowing the network to learn complex patterns that a simple linear model couldn't.
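
As a rough illustration of the computation described above, here is a single neuron written in plain Python rather than PyTorch. The input values, weights, and biases are made up for the example, and ReLU is assumed as the activation function.

```python
def relu(z):
    """ReLU activation: negative values become 0, positive values pass through."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, plus a bias, passed through ReLU."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return relu(weighted_sum + bias)

# Illustrative values only; a real network learns its weights and bias.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]

# Weighted sum is 0.5 - 2.0 + 0.75 = -0.75 in both cases; only the bias differs.
inactive = neuron_output(inputs, weights, bias=0.5)  # -0.25 before ReLU
active = neuron_output(inputs, weights, bias=1.0)    # 0.25 before ReLU

print(inactive, active)
```

The first call is clipped to zero by ReLU while the second passes through unchanged, which is the non-linearity the paragraph refers to.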

### Output Layer
This is the final layer that produces the network's prediction. The number of neurons here depends on the problem you're trying to solve (e.g., one neuron for binary classification, multiple neurons for multi-class classification).
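
NOTE: For multi-class classification, the raw outputs of the network are usually passed through a *softmax* function, which turns them into a probability for each class. A binary classifier with a single output neuron typically uses a sigmoid instead.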


## Loading our Data
In the following code, we will get the MNIST dataset of handwritten digits from the torchvision library. It contains 70,000 images that we can use to train and test our network as we learn.

### Transformer
The raw images are not readable by our model, so we need to convert them into PyTorch tensors before we do anything else. We can do this with torchvision's `ToTensor()` transform (`torchvision.transforms.ToTensor()`).
```py
from torchvision import transforms
transformer = transforms.Compose([
    transforms.ToTensor(), # Convert the image to a PyTorch Tensor
])
```
Along with converting the images into PyTorch tensors, `ToTensor()` also scales the pixel values from [0, 255] to [0.0, 1.0].

We use `transforms.Compose` here so we can chain additional transforms in the future (like `transforms.Normalize()`).
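
To make the scaling concrete, here is a minimal sketch of the arithmetic in plain Python. It only mimics what the two transforms do to individual pixel values and does not call torchvision. The pixel values are hypothetical, and the mean and standard deviation (0.1307 and 0.3081) are figures often quoted for MNIST, used here as an assumption.

```python
# Hypothetical raw 8-bit pixel values, as stored in an image file.
raw_pixels = [0, 51, 255]

# ToTensor-style scaling: map [0, 255] to [0.0, 1.0].
scaled = [p / 255 for p in raw_pixels]

# Normalize-style step: subtract a mean and divide by a standard deviation.
# These two constants are assumptions for the sketch, not values from this notebook.
MEAN, STD = 0.1307, 0.3081
normalized = [(s - MEAN) / STD for s in scaled]

print(scaled)
print(normalized)
```

After normalization, dark pixels end up below zero and bright pixels above it, which centers the inputs around zero.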

Next, we load the MNIST dataset from `torchvision.datasets`. It comes with a pre-defined split: 60,000 images for training (`train=True`) and 10,000 for testing (`train=False`).

In [1]:
%pip install torch
%pip install torchvision

import torch
from torchvision import transforms, datasets # Contains our MNIST dataset and a transformer to convert the PIL images into PyTorch Tensors.

transformer = transforms.Compose([
    transforms.ToTensor(), # Convert the image to a PyTorch Tensor
])

train_dataset = datasets.MNIST(
    root='./data', 
    train=True, 
    transform=transformer, 
    download=True
)

test_dataset = datasets.MNIST(
    root='./data', 
    train=False, 
    transform=transformer, 
    download=True
)




## Data Loader
### What is a Data Loader?
A DataLoader is an iterable that wraps a Dataset and provides an efficient way to load and process data for training or inference in machine learning models.
It can shuffle our data and split it into batches.

The `shuffle=True` parameter is important during training: it ensures that in each epoch (one complete pass through the entire training dataset) our model sees the images in a different, random order. This helps keep the model from memorizing the sequence of samples or developing a bias toward their order. The test loader uses `shuffle=False` because the order of samples does not affect evaluation.

In [2]:
from torch.utils.data import DataLoader

# Defining a batch size
BATCH_SIZE = 64

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False
)
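
The sketch below imitates, in plain Python, what shuffling and batching amount to for the 60,000 training images with a batch size of 64. The helper `epoch_batches` is hypothetical and works on sample indices only; the real `DataLoader` does considerably more (collating tensors, optional worker processes, and so on).

```python
import math
import random

NUM_SAMPLES = 60_000   # size of the MNIST training split
BATCH_SIZE = 64

def epoch_batches(num_samples, batch_size, shuffle, rng):
    """Return the sample indices for one epoch, as one list per batch."""
    indices = list(range(num_samples))
    if shuffle:
        rng.shuffle(indices)  # a new random order every time this is called
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]

rng = random.Random(0)  # seeded only so the sketch is repeatable
first_epoch = epoch_batches(NUM_SAMPLES, BATCH_SIZE, shuffle=True, rng=rng)
second_epoch = epoch_batches(NUM_SAMPLES, BATCH_SIZE, shuffle=True, rng=rng)

print(len(first_epoch))      # batches per epoch: 938
print(len(first_epoch[-1]))  # size of the final, partial batch: 32
print(first_epoch[0][:5], second_epoch[0][:5])  # different order in each epoch
```

Because 60,000 is not a multiple of 64, the last batch holds only 32 samples. `DataLoader` keeps that partial batch by default, so `len(train_loader)` should also report 938.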


## Creating our Neural Network
The following code shows us how we can structure our network. In our `__init__` method, we define the layers of our neural network, including a flattening layer and a stack of linear layers with activation functions.

### Inside of our initializer
#### Flatten
The `nn.Flatten` layer reshapes each 28x28 image into a 1D vector of 784 features, preparing it for the linear layers. The batch dimension is left untouched, so a batch of shape `[64, 1, 28, 28]` becomes `[64, 784]`.
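
As a small illustration of what flattening means for a single image, the plain-Python sketch below lays the rows of a 28x28 grid end to end. The stand-in "image" stores each pixel's own coordinates so the resulting order is easy to inspect; it is not real MNIST data.

```python
ROWS, COLS = 28, 28

# A stand-in "image": each pixel stores its own (row, col) position.
image = [[(r, c) for c in range(COLS)] for r in range(ROWS)]

# Row-major flattening: all of row 0 first, then row 1, and so on.
flat = [pixel for row in image for pixel in row]

print(len(flat))  # 784
print(flat[0])    # (0, 0)
print(flat[28])   # (1, 0): the first pixel of the second row
```

The pixel at row `r` and column `c` ends up at position `r * 28 + c` in the flattened vector.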

#### Layers
We use nn.Sequential to create a container for a linear stack of layers. The ReLU (Rectified Linear Unit) activation function introduces non-linearity to the network, allowing it to learn complex, non-linear mappings.
1. The first linear layer takes a 784-dimensional input and projects it into a 512-dimensional feature space.
2. The second linear layer further transforms the 512 features into a 256-dimensional space.
3. The final linear layer, or "output layer," projects the features down to 10 dimensions, corresponding to the 10 digit classes (0-9). These raw outputs are known as "logits."

In [3]:
from torch import nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten() # Flattens the 28x28 image into a 784-dimensional vector
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512), # First linear layer
            nn.ReLU(), # Activation function
            nn.Linear(512, 256), # Second linear layer
            nn.ReLU(), # Activation function
            nn.Linear(256, 10) # Output layer
        )

    def forward(self, x):
        x = self.flatten(x) # Flatten the input
        logits = self.linear_relu_stack(x) # Pass through the linear_relu_stack
        return logits
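
To get a feel for the size of this network, the following sketch counts its trainable parameters by hand. It assumes each `nn.Linear` layer keeps its default bias term, and relies on the fact that `nn.Flatten` and `nn.ReLU` have no parameters of their own.

```python
# (in_features, out_features) for each nn.Linear in linear_relu_stack
layer_sizes = [(28 * 28, 512), (512, 256), (256, 10)]

def linear_param_count(in_features, out_features):
    """One weight per input-output pair, plus one bias per output."""
    return in_features * out_features + out_features

per_layer = [linear_param_count(i, o) for i, o in layer_sizes]
total = sum(per_layer)

print(per_layer)  # [401920, 131328, 2570]
print(total)      # 535818
```

Most of the roughly 536,000 parameters sit in the first layer, because it connects all 784 input pixels to 512 neurons.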

## Forward Pass
1. First, we create an instance of our model.
2. Next, we use our training DataLoader to get a single batch of images and their corresponding labels.
3. Finally, we pass our images into our model. Calling `model(images)` makes PyTorch run the `forward()` method we defined earlier, which returns our logits.

In [None]:
# Create an instance of the model
model = SimpleNN()

# Get a single batch of images and labels from the training data loader
images, labels = next(iter(train_loader))

# Perform the forward pass to get the logits
logits = model(images)

# Print the shape of the output
print(f"Shape of the logits tensor: {logits.shape}")
print(f"Sample logits for first image:\n{logits[0]}")
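
Logits become a prediction by picking the class with the largest value. The sketch below shows this for one image in plain Python; the logits are hypothetical, and `predict_class` is a helper written for this example. In PyTorch the same idea is usually expressed as `logits.argmax(dim=1)`.

```python
def predict_class(logits):
    """Return the index of the largest logit, i.e. the predicted class."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for one image, one value per digit class 0-9.
sample_logits = [-0.03, 0.11, -0.08, 0.02, 0.09, -0.12, 0.04, 0.31, -0.01, 0.06]

print(predict_class(sample_logits))  # 7
```

Since the model has not been trained yet, its predictions at this point are essentially random guesses.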

## Loss Function
A loss function is a mathematical function that measures the discrepancy between the network's predictions and the true labels. \
Since our model is a multi-class classification network (it has one output per class), we will use the Cross-Entropy Loss function.

### Cross-Entropy Loss Function
`nn.CrossEntropyLoss` combines two key operations in one efficient function:
#### Softmax
It first applies the softmax function to the network's logits (raw output). This converts the logits into a set of probabilities that will sum up to 1. The highest logit will correspond to the class with the highest probability. \
Softmax formula: $$ P(y_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$
where $z_i$ is the logit for class $i$ and $K$ is the number of classes.
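
The formula can be sketched directly in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability trick and is an addition of this sketch; it does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # keeps exp() from overflowing on large logits
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative logits for three classes

print(probs)
print(sum(probs))
```

The largest logit receives the largest probability, and the outputs always sum to 1 up to floating-point rounding.
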
#### Negative Log likelihood
It then calculates the negative log likelihood of the true class. This will heavily penalize the model when it is very confident (assigns a high probability) about a wrong prediction. \
NLL formula: $$ L_{NLL} = -\log(p_i) $$
where $i$ is the true class and $p_i$ is the probability of the true class.


In [22]:
# Define the loss function
loss_function = nn.CrossEntropyLoss()

# Calculate the loss
loss = loss_function(logits, labels)

print(f"Loss value: {loss.item()}")

Loss value: 2.311577558517456
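
The value printed above is close to $\ln(10) \approx 2.3026$, which is the loss a model would get if it spread its probability evenly over the 10 classes. That is roughly what we expect from an untrained network. The sketch below checks this with a plain-Python version of the per-sample loss. It is an illustration of the math rather than PyTorch's implementation, and by default `nn.CrossEntropyLoss` also averages this quantity over the batch.

```python
import math

def cross_entropy(logits, true_class):
    """-log(softmax(logits)[true_class]), computed via log-sum-exp."""
    shift = max(logits)  # for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[true_class]

# Equal logits for all 10 classes: the model has no preference.
uniform = cross_entropy([0.0] * 10, true_class=3)

# One very large logit: the model is confident about class 9.
confident_logits = [0.0] * 9 + [10.0]
confident_right = cross_entropy(confident_logits, true_class=9)
confident_wrong = cross_entropy(confident_logits, true_class=0)

print(uniform, math.log(10))
print(confident_right, confident_wrong)
```

A confident correct prediction gives a loss near zero, while the same confidence on the wrong class gives a loss of about 10. This is the heavy penalty described in the Negative Log Likelihood section.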


## Optimizer
### What is an Optimizer?
An optimizer is an algorithm that adjusts the model's weights and biases to reduce the loss. Think of the loss function as a measure of the "error" or "cost," and the optimizer as the engine that uses that information to navigate a "loss landscape" to find the lowest point. Some examples of optimizers are Adam, RMSprop, Adagrad, and several variants of gradient descent.
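
To illustrate the idea of walking downhill on a loss landscape, here is a minimal sketch of plain gradient descent on a toy one-parameter loss. The loss function, the starting point, and the learning rate are all made up for the example. Real optimizers apply a similar kind of update to every weight and bias in the network.

```python
def loss(w):
    """A toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # starting guess
LEARNING_RATE = 0.1   # step size, chosen arbitrarily for this sketch

history = [loss(w)]
for _ in range(50):
    w -= LEARNING_RATE * grad(w)  # step against the gradient
    history.append(loss(w))

print(w)                        # close to 3.0
print(history[0], history[-1])  # loss before and after
```

Each step moves `w` a little in the direction that lowers the loss, so the loss shrinks from 9.0 toward zero.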

Let's use Adam for this example.
### Adam (Optimizer)
Adam, which stands for Adaptive Moment Estimation, is an advanced optimization algorithm used to train neural networks.