# Deep Learning Training Process

In deep learning, before training starts you need to define the **model**, the **loss function** (criterion), and the **optimizer** (step 1 below). Each training iteration then follows four key steps (steps 2 to 5):

#### 1. Define the Model, Criterion, and Optimizer

- The **model** represents the architecture of your neural network.
- The **criterion** is the loss function that measures how far the model's predictions are from the true values, for example `nn.MSELoss()` or `nn.CrossEntropyLoss()`.
- The **optimizer** updates the model's parameters based on the gradients, for example `optim.SGD()` or `optim.Adam()`.
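
To make the role of the criterion concrete, here is a minimal sketch that compares `nn.MSELoss()` with the same quantity computed by hand. The prediction and target values are made up purely for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical predictions and targets, chosen only for illustration.
predictions = torch.tensor([[1.0], [2.0], [4.0]])
targets = torch.tensor([[1.0], [3.0], [2.0]])

criterion = nn.MSELoss()  # default reduction is "mean"
loss_builtin = criterion(predictions, targets)

# The same quantity by hand: the mean of the squared differences.
loss_manual = ((predictions - targets) ** 2).mean()

print(loss_builtin.item(), loss_manual.item())  # both are 5/3, about 1.6667
```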

#### 2. Forward Pass: Pass the inputs through the model, compute the outputs, and calculate the loss.
```
outputs = model(inputs)
loss = criterion(outputs, targets)
```
#### 3. Backward Pass: Compute gradients of the loss with respect to model parameters.
```
loss.backward()
```
#### 4. Update Parameters: Update the model parameters using the optimizer.
```
optimizer.step()
```
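
As a rough illustration of what `optimizer.step()` does for plain SGD (no momentum, no weight decay), the sketch below checks that each parameter moves to `p - lr * p.grad`. The layer sizes, learning rate, and random data are assumptions made for this example only.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

layer = nn.Linear(3, 1)  # hypothetical tiny model
lr = 0.1
optimizer = optim.SGD(layer.parameters(), lr=lr)

x = torch.randn(5, 3)
y = torch.randn(5, 1)
loss = nn.MSELoss()(layer(x), y)
loss.backward()

# What plain SGD should produce: parameter minus learning rate times gradient.
expected_weight = layer.weight.detach() - lr * layer.weight.grad
expected_bias = layer.bias.detach() - lr * layer.bias.grad

optimizer.step()  # updates the parameters in place

print(torch.allclose(layer.weight, expected_weight))  # True
print(torch.allclose(layer.bias, expected_bias))      # True
```

Other optimizers such as `optim.Adam()` apply a different update rule, so this check only holds for SGD configured as above.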
#### 5. Reset Gradients: Clear the computed gradients to avoid accumulation.
```
optimizer.zero_grad()
```
![training](https://hackmd.io/_uploads/BkWLuMNhye.png)
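
Step 5 matters because PyTorch adds each new gradient to whatever is already stored in `.grad`. The sketch below shows this on a single hypothetical tensor `w`; setting `w.grad = None` stands in for what `optimizer.zero_grad()` does for every parameter it manages (in recent PyTorch versions it resets gradients to `None` by default, as far as I know).

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w ** 2).sum().backward()
grad_after_first = w.grad.clone()

# Second backward pass without a reset: the new gradient is added to the old one.
(w ** 2).sum().backward()
grad_after_second = w.grad.clone()

# Reset the gradient, then run backward again.
w.grad = None
(w ** 2).sum().backward()
grad_after_reset = w.grad.clone()

print(grad_after_first)   # tensor([2., 4.])
print(grad_after_second)  # tensor([4., 8.])
print(grad_after_reset)   # tensor([2., 4.])
```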

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchsummary import summary
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Define a simple model and data
model = nn.Sequential(
    nn.Linear(10, 2),
    nn.ReLU(),
    nn.BatchNorm1d(2),
    nn.Linear(2, 1)
).to(device)
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic Gradient Descent

# Generate dummy data
inputs = torch.randn(4, 10).to(device)  # 4 samples, 10 features
targets = torch.randn(4, 1).to(device)  # 4 target values

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)  # Compute the loss

# Backward pass
loss.backward()  # Compute gradients for all parameters

# Update parameters
optimizer.step()  # Apply the gradients to update model weights

# Reset gradient
optimizer.zero_grad()  # Clear gradients for the next iteration

cpu


In [None]:
summary(model, input_size=(10,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                    [-1, 2]              22
              ReLU-2                    [-1, 2]               0
       BatchNorm1d-3                    [-1, 2]               4
            Linear-4                    [-1, 1]               3
Total params: 29
Trainable params: 29
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------
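
The parameter counts in the summary can be cross-checked without `torchsummary`. This sketch rebuilds the same architecture on the CPU and counts elements with `numel()`.

```python
import torch.nn as nn

# Same architecture as the model above, rebuilt here so the block is self-contained.
model_check = nn.Sequential(
    nn.Linear(10, 2),   # 10 * 2 weights + 2 biases = 22
    nn.ReLU(),          # no parameters
    nn.BatchNorm1d(2),  # 2 scale (weight) + 2 shift (bias) = 4
    nn.Linear(2, 1),    # 2 * 1 weights + 1 bias = 3
)

for name, param in model_check.named_parameters():
    print(f"{name:10s} shape={tuple(param.shape)} count={param.numel()}")

total = sum(p.numel() for p in model_check.parameters())
trainable = sum(p.numel() for p in model_check.parameters() if p.requires_grad)
print(total, trainable)  # 29 29
```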


In [None]:
def show_grad(model) -> None:
    for i, layer in enumerate(model):
        if isinstance(layer, nn.Linear): # Check for Linear layer
            print(f"Layer {i} (Linear) weights gradient:")
            print(layer.weight.grad)
            print(f"Layer {i} (Linear) bias gradient:")
            print(layer.bias.grad)
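        elif isinstance(layer, nn.BatchNorm1d): # Check for BatchNorm1d layer
            # running_mean and running_var are buffers, not parameters, so their
            # .grad is always None; the learnable BatchNorm parameters are
            # layer.weight and layer.bias.
            print(f"Layer {i} (BatchNorm1d) running mean:")
            print(layer.running_mean.grad)
            print(f"Layer {i} (BatchNorm1d) running var:")
            print(layer.running_var.grad)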
    return None

In [None]:
def show_param(model) -> None:
    for i, layer in enumerate(model):
        if isinstance(layer, nn.Linear): # Check for Linear layer
            print(f"Layer {i} (Linear) weights:")
            print(layer.weight)
            print(f"Layer {i} (Linear) bias:")
            print(layer.bias)
        elif isinstance(layer, nn.BatchNorm1d): # Check for BatchNorm1d layer
            print(f"Layer {i} (BatchNorm1d) running mean:")
            print(layer.running_mean)
            print(f"Layer {i} (BatchNorm1d) running var:")
            print(layer.running_var)
    return None

In [None]:
def show_size(model) -> None:
    for i, layer in enumerate(model):
        if isinstance(layer, nn.Linear): # Check for Linear layer
            print(f"Layer {i} (Linear) weights:")
            print(layer.weight.size())
            print(f"Layer {i} (Linear) bias:")
            print(layer.bias.size())
        elif isinstance(layer, nn.BatchNorm1d): # Check for BatchNorm1d layer
            print(f"Layer {i} (BatchNorm1d) running mean:")
            print(layer.running_mean.size())
            print(f"Layer {i} (BatchNorm1d) running var:")
            print(layer.running_var.size())
    return None

In [None]:
show_size(model)

Layer 0 (Linear) weights:
torch.Size([2, 10])
Layer 0 (Linear) bias:
torch.Size([2])
Layer 2 (BatchNorm1d) running mean:
torch.Size([2])
Layer 2 (BatchNorm1d) running var:
torch.Size([2])
Layer 3 (Linear) weights:
torch.Size([1, 2])
Layer 3 (Linear) bias:
torch.Size([1])
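
The running mean and running variance printed for the BatchNorm layer are buffers rather than learnable parameters. The sketch below lists both kinds of state on a standalone `nn.BatchNorm1d(2)` layer.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(2)

# Learnable parameters: updated by the optimizer through gradients.
param_names = [name for name, _ in bn.named_parameters()]

# Buffers: state updated during the forward pass in training mode, never by gradients.
buffer_names = [name for name, _ in bn.named_buffers()]

print(param_names)   # ['weight', 'bias']
print(buffer_names)  # ['running_mean', 'running_var', 'num_batches_tracked']
print(bn.running_mean.requires_grad)  # False
```

This is why a `.grad` lookup on `running_mean` or `running_var` returns `None` at every stage of training.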


### Let's look at when gradients are calculated and when they are reset.

In [None]:
optimizer.zero_grad()
show_grad(model)
outputs = model(inputs)
loss = criterion(outputs, targets)

print("=========== backward: compute gradients without updating weights ===========")
loss.backward()
show_grad(model)

print("=========== update weights ===========")
optimizer.step()
show_grad(model)

print("=========== clear gradient ===========")
optimizer.zero_grad()
show_grad(model)

Layer 0 (Linear) weights gradient:
None
Layer 0 (Linear) bias gradient:
None
Layer 2 (BatchNorm1d) running mean:
None
Layer 2 (BatchNorm1d) running var:
None
Layer 3 (Linear) weights gradient:
None
Layer 3 (Linear) bias gradient:
None
Layer 0 (Linear) weights gradient:
tensor([[-0.0887,  0.0704,  0.6894, -0.2819, -0.0098,  0.1812,  0.2640, -0.5957,
          0.0465,  0.1058],
        [ 1.3334,  0.7219,  0.7804, -0.6931, -0.9008,  0.8042, -0.8149,  1.6782,
         -2.0290,  0.4211]])
Layer 0 (Linear) bias gradient:
tensor([-1.9388e-01,  1.1921e-07])
Layer 2 (BatchNorm1d) running mean:
None
Layer 2 (BatchNorm1d) running var:
None
Layer 3 (Linear) weights gradient:
tensor([[-0.2336,  2.0318]])
Layer 3 (Linear) bias gradient:
tensor([-0.5723])
Layer 0 (Linear) weights gradient:
tensor([[-0.0887,  0.0704,  0.6894, -0.2819, -0.0098,  0.1812,  0.2640, -0.5957,
          0.0465,  0.1058],
        [ 1.3334,  0.7219,  0.7804, -0.6931, -0.9008,  0.8042, -0.8149,  1.6782,
         -2.0290,  0.421

### Let's look at when parameters are updated.

In [None]:
optimizer.zero_grad()
show_param(model)
outputs = model(inputs)
loss = criterion(outputs, targets)

print("=========== backward: compute gradients without updating weights ===========")
loss.backward()
show_param(model)

print("=========== update weights ===========")
optimizer.step()
show_param(model)

print("=========== clear gradient ===========")
optimizer.zero_grad()
show_param(model)

Layer 0 (Linear) weights:
Parameter containing:
tensor([[ 0.1525,  0.0137,  0.2033, -0.2471, -0.1581, -0.0719, -0.2868,  0.2001,
          0.2577,  0.1084],
        [ 0.1827,  0.0307, -0.1462,  0.0191,  0.1154, -0.3269,  0.0037,  0.2141,
          0.1603,  0.1684]], requires_grad=True)
Layer 0 (Linear) bias:
Parameter containing:
tensor([ 0.1291, -0.0575], requires_grad=True)
Layer 2 (BatchNorm1d) running mean:
tensor([0.1426, 0.1207])
Layer 2 (BatchNorm1d) running var:
tensor([0.8626, 0.7573])
Layer 3 (Linear) weights:
Parameter containing:
tensor([[-0.4544, -0.6435]], requires_grad=True)
Layer 3 (Linear) bias:
Parameter containing:
tensor([-0.3612], requires_grad=True)
Layer 0 (Linear) weights:
Parameter containing:
tensor([[ 0.1525,  0.0137,  0.2033, -0.2471, -0.1581, -0.0719, -0.2868,  0.2001,
          0.2577,  0.1084],
        [ 0.1827,  0.0307, -0.1462,  0.0191,  0.1154, -0.3269,  0.0037,  0.2141,
          0.1603,  0.1684]], requires_grad=True)
Layer 0 (Linear) bias:
Parameter 

## Advanced Skills

In [None]:
model = nn.Sequential(
    nn.Linear(10, 2),
    nn.ReLU(),
    nn.Linear(2, 1)
).to(device)

### Implementing Deep Learning Functions Using NumPy
Deep learning is fundamentally a series of functions and operations applied to data. Here we reimplement the forward pass of the small model defined above with NumPy matrix operations, building on your knowledge of matrix multiplication, and compare the result with PyTorch's output.

In [None]:
import numpy as np
# The model is the nn.Sequential (Linear -> ReLU -> Linear) defined in the cell above

# Create some input data
inputs = torch.randn(4, 10).to(device)  # Batch of 4 samples, each with 10 features
targets = torch.randn(4, 1).to(device)  # Target values

# Reproduce the model's forward pass with NumPy matrix operations
def matmul_with_weights_and_bias_numpy(model, inputs):
    inputs = inputs.cpu().numpy()
    W1 = model[0].weight.detach().cpu().numpy()  # Shape: (2, 10)
    b1 = model[0].bias.detach().cpu().numpy()    # Shape: (2,)
    W2 = model[2].weight.detach().cpu().numpy()  # Shape: (1, 2)
    b2 = model[2].bias.detach().cpu().numpy()    # Shape: (1,)

    # First matrix multiplication: input @ W1.T (Shape: (4, 2))
    output1 = np.dot(inputs, W1.T) + b1  # Shape: (4, 2)

    # Apply ReLU after the first Linear layer (ReLU activation)
    output1_relu = np.maximum(output1, 0)

    # Second matrix multiplication: output1_relu @ W2.T (Shape: (4, 1))
    output2 = np.dot(output1_relu, W2.T) + b2  # Shape: (4, 1)

    return output2  # The final output, shape (4, 1)

# Perform matrix multiplication with weights and bias
matmul_output = matmul_with_weights_and_bias_numpy(model, inputs)

# Forward pass through the model
model_output = model(inputs)

# Compare the result
print("MatMul output:", np.round(matmul_output, 4))
print("Model output:", model_output)

MatMul output: [[1.2576]
 [0.6159]
 [0.4845]
 [0.497 ]]
Model output: tensor([[1.2576],
        [0.6159],
        [0.4845],
        [0.4970]], grad_fn=<AddmmBackward0>)
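
In the same spirit, the loss can also be written with NumPy. This is a minimal sketch with hypothetical helper names (`mse_loss_numpy`, `mse_grad_numpy`) and made-up values. It mirrors `nn.MSELoss()` with its default mean reduction and also returns the gradient of the loss with respect to the outputs, which is where a hand-written backward pass would start.

```python
import numpy as np

def mse_loss_numpy(outputs, targets):
    """Mean squared error over all elements."""
    return np.mean((outputs - targets) ** 2)

def mse_grad_numpy(outputs, targets):
    """Gradient of the mean squared error with respect to outputs."""
    return 2.0 * (outputs - targets) / outputs.size

# Hypothetical values, chosen only for illustration.
example_outputs = np.array([[0.5], [2.0], [3.0]])
example_targets = np.array([[1.0], [2.0], [1.0]])

loss_value = mse_loss_numpy(example_outputs, example_targets)
loss_grad = mse_grad_numpy(example_outputs, example_targets)

print(loss_value)  # (0.25 + 0 + 4) / 3, about 1.4167
print(loss_grad)   # 2 * [-0.5, 0, 2] / 3
```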


### Understanding requires_grad in PyTorch
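The `requires_grad` attribute in PyTorch enables automatic differentiation by tracking operations on tensors, allowing gradients to be computed for optimization. Try commenting out or enabling `inputs.requires_grad = True` to observe the effect on gradient computation. With it enabled, gradients are also computed for `inputs`, and because `inputs` is passed to the optimizer in the cell below, `optimizer.step()` changes the input values themselves.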

In [None]:
# Generate dummy data
inputs = torch.randn(4, 10).to(device)
targets = torch.randn(4, 1).to(device)
inputs.requires_grad = True  # comment out to disable gradient tracking for inputs
print(inputs)

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)  # Compute the loss
optimizer = optim.SGD([{'params': model.parameters()}, {'params': inputs}], lr=1.)

# Backward pass
loss.backward()  # Compute gradients for the model parameters and for inputs

# Update parameters
optimizer.step()  # Apply the gradients to update the model weights and inputs
print(inputs)

tensor([[ 1.4534, -1.0645,  0.2839, -0.9921, -1.0600,  0.5219,  0.5143,  1.2848,
         -0.2122,  0.9502],
        [-1.3286,  1.7345,  1.9765, -0.4719, -0.5561,  1.4850,  0.5428,  0.9903,
         -1.0181, -0.4368],
        [ 0.4171, -0.4225,  0.5338,  0.6584,  1.0168, -0.3858, -0.6769, -1.1842,
          0.1727,  1.6155],
        [-0.2080, -1.7594, -0.3703, -1.6607,  0.3330,  0.4466,  0.1733,  0.6349,
          1.5357, -0.8100]], requires_grad=True)
tensor([[ 1.2547, -1.0372,  0.5301, -0.9289, -0.9684,  0.5801,  0.6536,  1.1904,
         -0.3777,  0.8683],
        [-1.3286,  1.7345,  1.9765, -0.4719, -0.5561,  1.4850,  0.5428,  0.9903,
         -1.0181, -0.4368],
        [ 0.6101, -0.0342,  0.1301,  0.6279,  0.7152, -0.4082, -0.3423, -1.2325,
         -0.0639,  1.3403],
        [-0.5198, -1.7166,  0.0163, -1.5615,  0.4768,  0.5381,  0.3919,  0.4866,
          1.2759, -0.9386]], requires_grad=True)
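
A smaller example may make the mechanism easier to see. In this sketch, with made-up tensors `x` and `c`, only `x` is tracked, so only `x` receives a gradient after `backward()`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # tracked by autograd
c = torch.tensor([1.0, 2.0, 3.0])                      # requires_grad defaults to False

y = (c * x ** 2).sum()
y.backward()  # dy/dx = 2 * c * x

print(x.grad)  # tensor([ 2.,  8., 18.])
print(c.grad)  # None
```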


### Using torch.no_grad() for Inference in PyTorch
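`torch.no_grad()` disables gradient tracking, which reduces memory usage and computation during inference or validation. This is particularly useful when you don't need to update model parameters. Because no computation graph is recorded inside the block, calling `loss.backward()` on a loss computed there raises a `RuntimeError`, as the cell below demonstrates.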

In [None]:
inputs = torch.randn(4, 10).to(device)
targets = torch.randn(4, 1).to(device)
optimizer.zero_grad()
with torch.no_grad():
    outputs = model(inputs)
    loss = criterion(outputs, targets)

show_grad(model)
# Backward pass
loss.backward()  # Raises RuntimeError: loss was computed under no_grad and has no grad_fn

Layer 0 (Linear) weights gradient:
None
Layer 0 (Linear) bias gradient:
None
Layer 2 (Linear) weights gradient:
None
Layer 2 (Linear) bias gradient:
None


RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
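
The error can be anticipated by inspecting the output tensor. This sketch uses a hypothetical single-layer model to compare a forward pass inside `torch.no_grad()` with a regular one.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 1)  # hypothetical model for illustration
batch = torch.randn(4, 10)

with torch.no_grad():
    outputs_no_grad = layer(batch)  # no computation graph is recorded

outputs_tracked = layer(batch)      # graph is recorded as usual

print(outputs_no_grad.requires_grad, outputs_no_grad.grad_fn)  # False None
print(outputs_tracked.requires_grad)                           # True
```

Only `outputs_tracked` can be used in a loss that is later passed to `backward()`.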