# Build the Neural Network
> To build the neural network, we need the [PyTorch](https://pytorch.org/) framework. Refer to the website above for installation instructions and other information. Once PyTorch is installed, you can import `torch` to build your own models. Now, let's start.

In this session, we'll build a simple feedforward neural network from scratch using the **PyTorch** library. Our goal is to create a model capable of classifying images from the FashionMNIST dataset.

PyTorch is a powerful deep learning framework (used in leading labs such as OpenAI, Meta, NVIDIA, Microsoft, and DeepSeek) that provides the tools we need to define, train, and test neural networks.



As covered in the lecture, a simple feedforward neural network consists of an input layer, one or more hidden layers, and an output layer.

In PyTorch, the [torch.nn](https://pytorch.org/docs/stable/nn.html) package can be used to construct neural network models.
The [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) class is the base class for all neural network modules. A module contains layers and a ``forward(input)`` method that returns the **output**.
Let's look at the details.


In the next section, we'll build a neural network to classify images in the FashionMNIST dataset.

In [3]:
# --- Core PyTorch Imports ---
import torch
from torch import nn
import torch.nn.functional as F # Contains useful functions like activation functions

# --- Data Handling Imports ---
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# --- Other Essential Imports ---
import os
%matplotlib inline
import matplotlib.pyplot as plt

## Get the Device for Training the Model

PyTorch can leverage hardware accelerators like GPUs for faster training. We'll write our code to be device-agnostic, meaning it will run on a GPU if one is available (`cuda`) and fall back to the CPU if not.


Let's check whether
[torch.cuda](https://pytorch.org/docs/stable/notes/cuda.html) is available; if it is not, we
continue to use the CPU.



In [4]:
# choose the device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device
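
Once a device is chosen, tensors (and later the model) have to live on it. The following is a minimal sketch of the two usual ways to place a tensor on a device; the tensor names `a`, `b`, and `c` are illustrative and not part of the notebook.

```python
import torch

# Pick the device the same way as in the cell above
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create a tensor directly on the chosen device
a = torch.ones(2, 3, device=device)

# Or create it on the CPU first and move it afterwards
b = torch.zeros(2, 3).to(device)

# Operations expect both operands to live on the same device
c = a + b
print(c.device.type)  # "cuda" or "cpu", matching `device`
```

Passing `device=...` at creation time avoids an extra copy, which is why the later cells create their dummy inputs that way.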


## 1. Define the Neural Network

In PyTorch, we define our network as a Python class that inherits from `nn.Module`. `nn.Module` is the base class for all neural network modules.

Our class will have two essential parts:
1.  `__init__()`: This is where we define the layers of our network (e.g., linear layers, flatten layers).
2.  `forward()`: This is where we specify how data flows through the layers we defined.



In [5]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__() # Initialize the attributes of parent class

        # This layer flattens the 28x28 images into a 784-dimensional vector
        self.flatten = nn.Flatten()

        # This is an ordered container of layers.
        # Data will pass through them in the sequence they are defined.
        self.linear_relu_stack = nn.Sequential(
            # First fully connected layer: 784 inputs, 512 outputs
            nn.Linear(28*28, 512),
            # ReLU activation to introduce non-linearity
            nn.ReLU(),
            # Second fully connected layer: 512 inputs, 512 outputs
            nn.Linear(512, 512),
            nn.ReLU(),
            # Output layer: 512 inputs, 10 outputs (one for each class)
            nn.Linear(512, 10),
        )

    def forward(self, x):
        # First, flatten the input image
        x = self.flatten(x)
        # Then, pass the flattened input through the sequential layers
        logits = self.linear_relu_stack(x)
        return logits



Now we create an instance of ``NeuralNetwork`` named ``model``, move it to the ``device``, and print its structure.



In [6]:
# Create an instance of our network
model = NeuralNetwork().to(device)

# Print the model architecture
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


## 2. A Walk-Through of the Forward Pass (How Data Flows)

Fantastic, we have defined our neural network. Let's trace what happens when we pass a single image to our model. It's always good practice to sanity check our models too. The output of the final linear layer is a tensor of raw prediction values called **logits**.

In [7]:
# Create a dummy input tensor representing one 28x28 image
# The '1' is the batch size.
X = torch.randn(1, 28, 28, device=device)

# Pass the dummy input through the model to get the logits
logits = model(X)
print(f"Logits shape: {logits.shape}")  # Should be [1, 10] for batch size 1 and 10 classes
print(f"Raw logits: {logits}\n")

Logits shape: torch.Size([1, 10])
Raw logits: tensor([[ 0.0314,  0.0141,  0.0244, -0.1263,  0.1768, -0.0404, -0.0989,  0.1189,
         -0.0509,  0.0812]], grad_fn=<AddmmBackward0>)



**Logits are not probabilities**! To get probabilities, we apply the Softmax function.

In [8]:
# The softmax function converts logits into a probability distribution over the classes
pred_probabilities = F.softmax(logits, dim=1)
print(f"Predicted probabilities shape: {pred_probabilities.shape}")  # Should be [1, 10]
print(f"Predicted probabilities: {pred_probabilities}\n")

# To get the predicted class, we take the index of the highest probability.
y_pred = pred_probabilities.argmax(1)
print(f"Predicted class: {y_pred.item()}")

Predicted probabilities shape: torch.Size([1, 10])
Predicted probabilities: tensor([[0.1014, 0.0997, 0.1007, 0.0866, 0.1173, 0.0944, 0.0890, 0.1107, 0.0934,
         0.1066]], grad_fn=<SoftmaxBackward0>)

Predicted class: 4
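
The predicted class is only an index. As a small illustration, the index can be mapped to a human-readable FashionMNIST label; the names `class_names` and `predicted_index` are introduced here for the example. Keep in mind that the model is still untrained, so this prediction is essentially random.

```python
# FashionMNIST class names, indexed by label (0-9)
class_names = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

predicted_index = 4  # the class index printed in the output above
print(f"Predicted label: {class_names[predicted_index]}")  # Predicted label: Coat
```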


## 3. The Training Components: Loss and Optimizer


To train a network, we need two more things:
- **Loss Function**: Measures how far the model's output (logits) is from the actual target (the correct class). For multi-class classification, the standard choice is `nn.CrossEntropyLoss`.

- **Optimizer**: Implements an algorithm (like Stochastic Gradient Descent) to adjust the model's internal parameters (weights and biases) to minimize the loss.

In [9]:
# --- Loss Function ---
# CrossEntropyLoss is the standard choice for classification. It internally applies
# log-softmax, so we should feed it the raw logits directly from our model.
loss_fn = nn.CrossEntropyLoss()

# --- Optimizer ---
# We'll use Stochastic Gradient Descent (SGD).
# We pass model.parameters() to tell the optimizer which values it needs to update.
# The 'lr' is the learning rate, a crucial hyperparameter.
learning_rate = 1e-3
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate)
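
To see why the loss function takes raw logits, the sketch below compares `nn.CrossEntropyLoss` with a manual computation: log-softmax, then the negative log-probability of the true class, averaged over the batch. The tensors `example_logits` and `example_targets` are made-up values for illustration only.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)  # make the illustrative numbers reproducible

# Hypothetical logits for a batch of 2 samples and 10 classes
example_logits = torch.randn(2, 10)
example_targets = torch.tensor([3, 7])

# Built-in loss, fed with raw logits
builtin_loss = nn.CrossEntropyLoss()(example_logits, example_targets)

# Manual version: log-softmax, pick the log-probability of the true class,
# negate, and average over the batch
log_probs = F.log_softmax(example_logits, dim=1)
manual_loss = -log_probs[torch.arange(2), example_targets].mean()

print(torch.allclose(builtin_loss, manual_loss))  # True
```

Applying softmax to the logits before passing them to `nn.CrossEntropyLoss` would normalize them twice and distort the loss.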

## 4. The Full Training Step

Training a neural network involves a loop where we repeatedly perform these five steps:

1.  **Forward Pass:** Get the model's predictions (logits).
2.  **Calculate Loss:** Compare the predictions to the true labels.
3.  **Zero Gradients:** Clear old gradients from the previous step.
4.  **Backward Pass (Backpropagation):** Calculate the gradient of the loss with respect to each model parameter.
5.  **Update Weights:** The optimizer adjusts the parameters based on the calculated gradients.
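
Step 3 exists because PyTorch adds each new gradient to whatever is already stored in `.grad`. A minimal sketch, using a single hypothetical scalar parameter `w` rather than the model above, illustrates the accumulation:

```python
import torch

# A single learnable scalar, standing in for a model parameter
w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first_grad = w.grad.item()

# Second backward pass WITHOUT zeroing: the new gradient is added on top
(w ** 2).backward()
accumulated_grad = w.grad.item()

# Zero the gradient, then run backward again: back to a clean value
w.grad.zero_()
(w ** 2).backward()
fresh_grad = w.grad.item()

print(first_grad, accumulated_grad, fresh_grad)  # 4.0 8.0 4.0
```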

Let's simulate a single training step with a dummy input and target.

In [10]:
# Let's create a dummy target label (e.g., the true class is 3)
target = torch.tensor([3], dtype = torch.long, device = device)

# --- A single training step --- (We typically repeat this in a loop over many epochs)

# 1. Forward Pass: Get the model's output for a dummy input
logits = model(X)

# 2. Compute Loss: Compare the model's output with the target
loss = loss_fn(logits, target)
print(f"Calculated Loss: {loss.item()}")

# 3. Zero Gradients: Clear previous gradients before backpropagation
# We need to reset the gradients before backpropagation, or they will accumulate.
optimizer.zero_grad()

# 4. Backward Pass: Compute gradients of the loss with respect to model parameters
loss.backward()

# 5. Update Weights
# The optimizer adjusts the model's parameters using the gradients computed in the backward pass.
optimizer.step() # weights = weights - learning_rate * gradients

print("Model weights have been updated after one step.")


Calculated Loss: 2.4460151195526123
Model weights have been updated after one step.
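
In a real run, the five steps repeat many times. The sketch below is an illustration under assumptions: it uses a small stand-in model (`tiny_model`), a fixed random batch (`X_batch`, `y_batch`), and a learning rate of 0.01, none of which come from the notebook. It only shows that repeating the step lowers the loss on that one batch.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small stand-in model (hypothetical; not the NeuralNetwork class above)
tiny_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
tiny_loss_fn = nn.CrossEntropyLoss()
tiny_optimizer = torch.optim.SGD(tiny_model.parameters(), lr=0.01)

# One fixed dummy batch: 8 random "images" with random labels
X_batch = torch.randn(8, 28, 28)
y_batch = torch.randint(0, 10, (8,))

losses = []
for step in range(20):
    batch_logits = tiny_model(X_batch)            # 1. forward pass
    batch_loss = tiny_loss_fn(batch_logits, y_batch)  # 2. compute loss
    tiny_optimizer.zero_grad()                    # 3. clear old gradients
    batch_loss.backward()                         # 4. backward pass
    tiny_optimizer.step()                         # 5. update weights
    losses.append(batch_loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Fitting a single random batch says nothing about generalization; an actual training loop would iterate over a `DataLoader` and evaluate on held-out data.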


Great! We've successfully defined a simple feedforward neural network, traced a forward pass, and outlined the components needed for training. Now let's inspect each part in detail.

## A Closer Look at Each Component

## Layer-by-Layer Analysis

Now we analyze each layer in the model.
To illustrate, we take a sample minibatch of 3 images of size 28x28, generated with [torch.rand](https://pytorch.org/docs/stable/generated/torch.rand.html), and pass it through each layer in turn.

In [11]:
input = torch.rand(3,28,28)
print(input.size())

torch.Size([3, 28, 28])


### nn.Flatten
We initialize the [nn.Flatten](https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html) layer to convert each 28x28 input into a contiguous array of 784 values (the minibatch dimension at dim=0 is maintained).



In [12]:
flatten = nn.Flatten()
flat_input = flatten(input)
print(flat_input.size())

torch.Size([3, 784])
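
As a quick check of what flattening does, the sketch below compares `nn.Flatten` with a plain `reshape` that keeps the batch dimension. The tensor name `batch` is illustrative.

```python
import torch
from torch import nn

batch = torch.rand(3, 28, 28)

# nn.Flatten keeps dim 0 (the batch) and merges all remaining dims
flat_a = nn.Flatten()(batch)

# The same result with a plain reshape: 3 rows of 28*28 = 784 values
flat_b = batch.reshape(3, -1)

print(flat_a.shape)                 # torch.Size([3, 784])
print(torch.equal(flat_a, flat_b))  # True
```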


### nn.Linear
The [linear layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
is a module that applies a linear transformation on the input using its stored weights and biases.




In [13]:
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_input)
print(hidden1.size())

torch.Size([3, 20])
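
The linear transformation can be written out by hand as `x @ W.T + b`. The sketch below, with illustrative names `layer` and `x`, checks that this matches the module's output and shows how the weights and biases are stored.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=784, out_features=20)
x = torch.rand(3, 784)

# Weight is stored as [out_features, in_features], bias as [out_features]
print(layer.weight.shape, layer.bias.shape)

out_module = layer(x)
out_manual = x @ layer.weight.T + layer.bias

# Equal up to floating-point rounding
print(torch.allclose(out_module, out_manual, atol=1e-5))
```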


### nn.ReLU
Non-linear activations are what create the complex mappings between the model's inputs and outputs.
They are applied after linear transformations to introduce *nonlinearity*, helping neural networks
learn a wide variety of phenomena.

In this model, we use [nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) between our
linear layers, but there are other activations that introduce non-linearity in your model.



In [14]:
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

Before ReLU: tensor([[-0.1230, -0.0530, -0.0769,  0.0015,  0.2978,  0.2154,  0.3731, -0.0948,
          0.0743, -0.0480, -0.0471, -0.2434, -0.3886,  0.3078,  0.9808,  0.2476,
         -0.9264,  0.1047, -0.1490,  0.0292],
        [ 0.1122,  0.0870, -0.0689,  0.0156,  0.2387,  0.1687,  0.4041, -0.2142,
          0.2437, -0.1256,  0.1172, -0.2881, -0.2813,  0.2007,  1.0718,  0.6410,
         -0.8452,  0.1147,  0.1220,  0.0767],
        [ 0.0705,  0.0964,  0.2674, -0.1119, -0.0634, -0.2818,  0.1298, -0.0944,
          0.4618,  0.1933,  0.1285, -0.2135, -0.3047, -0.0992,  1.0859,  0.4416,
         -0.5967,  0.0672, -0.1959,  0.1097]], grad_fn=<AddmmBackward0>)


After ReLU: tensor([[0.0000, 0.0000, 0.0000, 0.0015, 0.2978, 0.2154, 0.3731, 0.0000, 0.0743,
         0.0000, 0.0000, 0.0000, 0.0000, 0.3078, 0.9808, 0.2476, 0.0000, 0.1047,
         0.0000, 0.0292],
        [0.1122, 0.0870, 0.0000, 0.0156, 0.2387, 0.1687, 0.4041, 0.0000, 0.2437,
         0.0000, 0.1172, 0.0000, 0.0000, 0.2007, 1.07
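
ReLU is simply `max(0, x)` applied element by element. The sketch below checks that on a small hand-picked tensor and, for comparison, applies two other common activations (`nn.Sigmoid` and `nn.Tanh`); the input values are chosen only for illustration.

```python
import torch
from torch import nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# ReLU keeps positive values and replaces negatives with zero: max(0, x)
relu_out = nn.ReLU()(x)
manual_relu = torch.clamp(x, min=0)
print(torch.equal(relu_out, manual_relu))  # True

# Two other common activations, shown for comparison
sigmoid_out = nn.Sigmoid()(x)  # squashes values into (0, 1)
tanh_out = nn.Tanh()(x)        # squashes values into (-1, 1)

print(relu_out)
print(sigmoid_out)
print(tanh_out)
```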

### nn.Sequential
[nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) is an ordered
container of modules. The data is passed through all the modules in the same order as defined. You can use
sequential containers to put together a quick network like ``seq_modules``.



In [16]:
seq_modules = nn.Sequential(
    flatten,
    layer1,
    nn.ReLU(),
    nn.Linear(20, 10)
)
input = torch.rand(3,28,28)

logits = seq_modules(input)
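
Calling a sequential container is equivalent to feeding the output of each module into the next one. The sketch below makes that explicit with a manual loop; `stack` and `images` are illustrative names with the same layer sizes as `seq_modules`.

```python
import torch
from torch import nn

torch.manual_seed(0)

stack = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
)
images = torch.rand(3, 28, 28)

# Calling the container runs every module in order
out_container = stack(images)

# Equivalent manual loop: feed each module the output of the previous one
out_manual = images
for module in stack:
    out_manual = module(out_manual)

print(torch.equal(out_container, out_manual))  # True
print(stack[1])  # modules can be indexed like a list
```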

### nn.Softmax
The last linear layer of the neural network returns `logits`, raw values in $(-\infty, \infty)$, which are passed to the
[nn.Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) module. The logits are scaled to values in
$[0, 1]$ representing the model's predicted probabilities for each class. The ``dim`` parameter indicates the dimension along
which the values must sum to 1.



In [17]:
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
print(pred_probab)
print(pred_probab.size())

tensor([[0.0792, 0.0943, 0.0843, 0.1303, 0.0663, 0.1243, 0.1237, 0.0889, 0.0911,
         0.1176],
        [0.0838, 0.1091, 0.0902, 0.1291, 0.0605, 0.1157, 0.1373, 0.0907, 0.0770,
         0.1065],
        [0.0747, 0.1039, 0.0906, 0.1271, 0.0613, 0.1168, 0.1375, 0.0846, 0.0832,
         0.1204]], grad_fn=<SoftmaxBackward0>)
torch.Size([3, 10])
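
Softmax can be reproduced by hand: exponentiate each logit, then divide by the row sum. The sketch below does this on made-up logits (`example_logits`) and compares the result with `nn.Softmax(dim=1)`.

```python
import torch
from torch import nn

example_logits = torch.tensor([[2.0, 1.0, 0.1],
                               [0.5, 0.5, 0.5]])

# Softmax by hand: exponentiate, then normalize each row
exp = torch.exp(example_logits)
manual_probs = exp / exp.sum(dim=1, keepdim=True)

builtin_probs = nn.Softmax(dim=1)(example_logits)

print(torch.allclose(manual_probs, builtin_probs))  # True
print(builtin_probs.sum(dim=1))  # each row sums to 1 (up to rounding)
```

The second row has three equal logits, so each class receives a probability of one third. Library implementations typically subtract the row maximum before exponentiating for numerical stability; the naive version here is fine for small values.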


### Model Parameters

In the model, we only define the ``forward`` function in ``NeuralNetwork``; the backward pass is computed automatically by ``autograd`` in PyTorch.
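
As a minimal illustration of ``autograd``, the sketch below writes only a forward computation on a toy tensor `x` (not the model above) and lets PyTorch derive the gradient.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Only the forward computation is written out: y = sum(x^2)
y = (x ** 2).sum()

# autograd derives the backward computation: dy/dx = 2x
y.backward()

print(x.grad)  # tensor([2., 4., 6.])
```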

Many layers of a model are *parameterized*, i.e. have associated weights
and biases that are optimized during training. Subclassing ``nn.Module`` automatically
tracks all fields defined inside your model object, and makes all parameters
accessible using your model's ``parameters()`` or ``named_parameters()`` methods.

In this example, we iterate over each learnable parameter and print its size and a preview of its values.



In [18]:
# the learnable parameters
params = list(model.parameters())
print(len(params))
print(params[0].size())

print(f"Model structure: {model}\n\n")
for name, param in model.named_parameters():
    # A parameter is "learnable" if requires_grad is True
    if param.requires_grad:
        print(f"Layer: {name}")
        print(f"  Size: {param.size()}")
        print(f"  First few values:\n {param.data[:2]}\n")

6
torch.Size([512, 784])
Model structure: NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


Layer: linear_relu_stack.0.weight
  Size: torch.Size([512, 784])
  First few values:
 tensor([[ 1.9848e-02,  2.6076e-03,  8.3389e-03,  ..., -1.6783e-03,
          2.1267e-02,  1.3748e-03],
        [ 1.0714e-05,  8.8581e-03, -3.1278e-03,  ..., -7.7031e-03,
          3.2129e-02, -1.6902e-02]])

Layer: linear_relu_stack.0.bias
  Size: torch.Size([512])
  First few values:
 tensor([0.0173, 0.0096])

Layer: linear_relu_stack.2.weight
  Size: torch.Size([512, 512])
  First few values:
 tensor([[-0.0376, -0.0024, -0.0102,  ..., -0.0155, -0.0090,  0.0294],
        [-0.0061,  0.0423,  0.0049,  ...,  0.0220,  0.0214,  0.0081]])

Layer: lin
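
The six parameter tensors can also be summarized as a single count. The sketch below rebuilds a stack with the same layer sizes as `NeuralNetwork` (named `same_size_stack` here for illustration) and compares a hand calculation with `numel()`.

```python
from torch import nn

# Same layer sizes as the NeuralNetwork class defined earlier
same_size_stack = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Each Linear layer holds in_features * out_features weights plus out_features biases
by_hand = (784 * 512 + 512) + (512 * 512 + 512) + (512 * 10 + 10)

# numel() returns the number of elements in a tensor
counted = sum(p.numel() for p in same_size_stack.parameters() if p.requires_grad)

print(by_hand, counted)  # 669706 669706
```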

### Update the Weights of the Network
After the backward pass has computed the gradient of the loss with respect to each parameter, the optimizer uses those gradients to update the parameters.

In practice, the simplest update rule is Stochastic Gradient Descent (SGD):

    ``weight = weight - learning_rate * gradient``

Repeating this update over many batches and epochs gradually reduces the loss and moves the parameters toward values that fit the training data.

The update itself is implemented by the optimizers in [torch.optim](https://pytorch.org/docs/stable/optim.html).
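
To connect the update rule with the library, the sketch below applies the rule by hand to one toy parameter and lets `torch.optim.SGD` update an identical copy. The names `w_manual`, `w_optim`, `lr`, and `sgd` are illustrative, and the example assumes plain SGD without momentum or weight decay.

```python
import torch

# Two copies of the same hypothetical parameter
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

# Same toy loss for both: sum of squares, so the gradient is 2 * w
(w_manual ** 2).sum().backward()
(w_optim ** 2).sum().backward()

# Manual update, done outside of autograd tracking
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same update via the optimizer
sgd = torch.optim.SGD([w_optim], lr=lr)
sgd.step()

# Both end up at approximately [0.8, -1.6]
print(w_manual.data, w_optim.data)
```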



--------------




## Further Reading
Refer to the following resources for further information.

- [torch.nn API](https://pytorch.org/docs/stable/nn.html)
- [PyTorch tutorials](https://pytorch.org/tutorials/)

