<div align="center">
    <img src="https://www.sharif.ir/documents/20124/0/logo-fa-IR.png/4d9b72bc-494b-ed5a-d3bb-e7dfd319aec8?t=1609608338755" alt="Logo" width="200">
    <p><b>HW1 @ Deep Learning Course, Dr. Soleymani</b></p>
    <p><b>Designed by Amirmahdi Meighani</b></p>
</div>

---




*Full Name:*

*Student Number:*

# Efficient Gradient Checkpointing for Memory-Constrained Deep Learning (50 points)
Deep learning experiments are often limited by GPU memory, which makes large models hard to train. Gradient checkpointing reduces memory usage by storing only a subset of the intermediate activations during the forward pass and recomputing the rest during backpropagation. This trades extra computation for memory and lets us train models that would otherwise exceed our GPU's memory capacity.

In this guide, you'll first implement gradient checkpointing from scratch to understand its inner workings. Then, you'll learn how to leverage PyTorch's built-in checkpointing feature for more efficient deep learning workflows.

### How Gradient Checkpointing Works
We divide the neural network into segments and only store activations at the segment boundaries (checkpoints). The activations for intermediate layers are discarded and recomputed during backpropagation.

Step-by-Step Process:



*   **Forward Pass (Training Phase)**



1.  Divide the model into segments (e.g., every few layers).
2.  Save activations only at checkpoint layers.
3.  Discard activations of intermediate layers.
4.  Proceed as usual to compute the final output.



*   **Backward Pass (Gradient Calculation)**



1. Recompute missing activations for each segment.
2. Compute gradients using the recomputed activations.
3. Update model parameters with computed gradients.


By recomputing only small segments at a time, we save significant memory while keeping the computational cost manageable.
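
The two passes above can be sketched in a few lines. This is a minimal illustration, not PyTorch's internal implementation: the helper name `checkpointed_backward` and the toy three-segment network are assumptions made for the example. It keeps only the input of each segment during the forward pass, then rebuilds each segment's graph one at a time during the backward pass and compares the result with ordinary backpropagation.

```python
import torch
import torch.nn as nn

def checkpointed_backward(segments, x, loss_fn):
    """Hypothetical helper: forward + backward that stores only segment inputs."""
    # Forward pass without building a graph: keep one checkpoint per segment.
    checkpoints = []
    h = x
    with torch.no_grad():
        for seg in segments:
            checkpoints.append(h)
            h = seg(h)

    # Compute the loss on the final output and its gradient w.r.t. that output.
    out = h.detach().requires_grad_(True)
    loss = loss_fn(out)
    loss.backward()
    grad = out.grad

    # Backward pass: recompute one segment at a time, from last to first.
    for seg, ckpt in zip(reversed(segments), reversed(checkpoints)):
        seg_in = ckpt.detach().requires_grad_(True)
        seg_out = seg(seg_in)      # recomputation, this time with a graph
        seg_out.backward(grad)     # fills .grad of this segment's parameters
        grad = seg_in.grad         # gradient handed to the previous segment
    return loss.detach()

torch.manual_seed(0)
toy_segments = [
    nn.Sequential(nn.Linear(4, 8), nn.ReLU()),
    nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
    nn.Linear(8, 1),
]
toy_x = torch.randn(5, 4)
toy_loss = checkpointed_backward(toy_segments, toy_x, lambda o: o.pow(2).mean())
ckpt_grads = [p.grad.clone() for seg in toy_segments for p in seg.parameters()]

# Reference: ordinary backpropagation through the same layers.
for seg in toy_segments:
    seg.zero_grad()
ref_loss = nn.Sequential(*toy_segments)(toy_x).pow(2).mean()
ref_loss.backward()
ref_grads = [p.grad.clone() for seg in toy_segments for p in seg.parameters()]

grads_match = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(ckpt_grads, ref_grads))
print("gradients match:", grads_match)
```

Only one segment's graph is alive at any time in the backward loop, which is where the memory saving comes from; the price is one extra forward computation per segment.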


First import what you need and check your GPU.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import time
from torch.utils.data import DataLoader, TensorDataset
from tqdm.auto import tqdm
import numpy as np
import matplotlib.pyplot as plt

# Additional imports for checkpointing examples
from torch.utils.checkpoint import checkpoint_sequential

# End of Done


In [None]:
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}") # You must use cuda for this notebook


# Function to measure GPU memory usage
def get_gpu_memory_usage():
    # Todo: return the GPU memory usage for comparison
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / (1024 ** 2)  # MB
    return 0.0
    # End of Todo
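
`torch.cuda.memory_allocated()` reports the memory held at the moment it is called, so sampling it after `optimizer.step()` can miss the peak reached in the middle of the backward pass. A possible complement, sketched here with a hypothetical helper name `measure_peak_memory`, resets PyTorch's peak counter before a piece of work and reads it afterwards.

```python
import torch

def measure_peak_memory(fn):
    """Hypothetical helper: run fn() and return (result, peak allocated MB)."""
    if not torch.cuda.is_available():
        return fn(), 0.0  # nothing to measure on CPU
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    result = fn()
    torch.cuda.synchronize()
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    return result, peak_mb

# Usage: wrap any piece of work, for example one forward/backward step.
demo_device = "cuda" if torch.cuda.is_available() else "cpu"
demo_result, demo_peak_mb = measure_peak_memory(
    lambda: torch.ones(256, 256, device=demo_device).sum().item()
)
print(f"result={demo_result}, peak={demo_peak_mb:.2f} MB")
```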


We have created a small model to use with and without the lazy gradient method. You can change it if you like.

In [None]:
# Define a model for intensive GPU usage
# You can change it
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.before = nn.Sequential(
            nn.Linear(200*200, 100),
            nn.ReLU(),
            nn.Linear(100, 32),
            nn.ReLU(),
        )

        self.after = nn.Sequential(
            nn.Linear(32, 3000),
            nn.ReLU(),
            nn.Linear(3000, 200*200),
            nn.ReLU(),
            nn.Linear(200*200, 1),
        )



    def forward(self, x):
        bottleneck_out = self.before(x)
        final_out = self.after(bottleneck_out)
        return final_out, bottleneck_out




In [None]:


def create_dataloader(batch_size=128):
    # Todo: Create synthetic data and return a dataloader
    # The data must be compatible with the model you use
    num_samples = 1024
    x = torch.randn(num_samples, 200 * 200)
    # Simple linear target with noise
    y = torch.sum(x, dim=1, keepdim=True) * 0.001 + torch.randn(num_samples, 1) * 0.01
    dataset = TensorDataset(x, y)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
    # End of Todo


In [None]:


# Training loop
def train_model(use_lazy_grad=False, num_epochs=1):
    dataloader = create_dataloader()
    print('Data is Created')
    model = Model().to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    mem_usage = []
    start_time = time.time()

    for epoch in range(num_epochs):
        for batch_x, batch_y in tqdm(dataloader):

            optimizer.zero_grad()
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            batch_x.requires_grad = True  # Track gradients
            final_out, bottleneck_out = model(batch_x)


            if use_lazy_grad:
                # Todo: implement the lazy gradient method, using the bottleneck
                # as the only segmentation. You should compute gradient w.r.t
                # bottleneck output and then backpropagate manually
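                loss = criterion(final_out, batch_y)
                # Stage 1: backpropagate from the loss down to the bottleneck only.
                # This yields the gradient at the checkpoint and the gradients of
                # the `after` parameters (which would otherwise never be populated).
                after_params = list(model.after.parameters())
                grads = torch.autograd.grad(loss, [bottleneck_out] + after_params, retain_graph=True)
                grad_bottleneck = grads[0]
                for p, g in zip(after_params, grads[1:]):
                    p.grad = g
                # Stage 2: continue from the bottleneck through the `before` layers.
                bottleneck_out.backward(grad_bottleneck)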
                # End of Todo

            else:
                # Todo: normal back prop of loss
                loss = criterion(final_out, batch_y)
                loss.backward()
                # End of Todo

            optimizer.step()

            mem_usage.append(get_gpu_memory_usage())

    elapsed_time = time.time() - start_time
    return mem_usage, elapsed_time


In [None]:
# Done:
# Run training with and without lazy gradient propagation
# Note that you should empty cache of cuda before, between and after your trainings
# So your results will be valid. Save the mem_usage and elapsed_time of each one

if torch.cuda.is_available():
    torch.cuda.empty_cache()
mem_lazy, time_lazy = train_model(use_lazy_grad=True)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
mem_normal, time_normal = train_model(use_lazy_grad=False)
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# End of Done


In [None]:
# Done: Plot memory usage and compare the time
# that each method needs
plt.figure(figsize=(8,4))
plt.plot(mem_lazy, label='Lazy gradient')
plt.plot(mem_normal, label='Standard')
plt.xlabel('Iteration')
plt.ylabel('GPU Memory (MB)')
plt.legend()
plt.title('Memory usage comparison')
plt.grid(True)
plt.show()

print(f"Lazy gradient time: {time_lazy:.2f}s, Standard time: {time_normal:.2f}s")

# End of Done


PyTorch provides a built-in gradient checkpointing feature through torch.utils.checkpoint. This makes it easy to implement checkpointing without manually managing activation storage and recomputation.
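
Besides `checkpoint_sequential`, the same module exposes `checkpoint`, which wraps a single callable. The sketch below is illustrative only: the `CheckpointedMLP` class and its layer sizes are made up for the example, and passing `use_reentrant=False` assumes a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Hypothetical model: two checkpointed blocks followed by a plain head."""
    def __init__(self, use_checkpoint=True):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.block1 = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        if self.use_checkpoint:
            # Activations inside each block are dropped and recomputed in backward.
            x = checkpoint(self.block1, x, use_reentrant=False)
            x = checkpoint(self.block2, x, use_reentrant=False)
        else:
            x = self.block2(self.block1(x))
        return self.head(x)

torch.manual_seed(0)
mlp = CheckpointedMLP(use_checkpoint=True)
mlp_x = torch.randn(4, 16)

mlp(mlp_x).sum().backward()
grads_ckpt = {n: p.grad.clone() for n, p in mlp.named_parameters()}

# Same weights, same input, checkpointing switched off.
mlp.zero_grad()
mlp.use_checkpoint = False
mlp(mlp_x).sum().backward()
grads_plain = {n: p.grad.clone() for n, p in mlp.named_parameters()}

same_grads = all(torch.allclose(grads_ckpt[n], grads_plain[n], atol=1e-6) for n in grads_ckpt)
print("same gradients:", same_grads)
```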



First, let's create a simple Sequential model and checkpoint it. We can also verify that the checkpointing doesn't change the value of gradients or the activations.

In [None]:
from torch.utils.checkpoint import checkpoint_sequential


# Done: create a simple Sequential model and then create the model inputs.
# get the modules in the model. These modules should be in the order
# the model should be executed. Then set the number of checkpoint segments.
# Now call the checkpoint API and get the output.
# finally run the backwards pass on the model. For simplicity,
# we won't calculate the loss and rather backprop on output.sum()

model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 4)
).to(device)

inp = torch.randn(8, 10, device=device, requires_grad=True)
modules = list(model.children())
segments = 2
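out = checkpoint_sequential(modules, segments, inp)

(out.sum()).backward()

# Save the checkpointed output and parameter gradients for the comparison below
output_checkpointed = out.detach().clone()
grad_checkpointed = {name: p.grad.detach().clone() for name, p in model.named_parameters()}

# End of Done


Now that we have executed the checkpointed pass on the model, let's also run the non-checkpointed model and verify that the checkpoint API doesn't change the model outputs or the parameter gradients.

In [None]:
# Done: use the non-checkpointed mode. create a new variable using the same
# tensor data. get the model output.

model.zero_grad(set_to_none=True)
inp_nc = inp.detach().clone().requires_grad_(True)
out = model(inp_nc)
out.sum().backward()

# Save the non-checkpointed output and parameter gradients
out_not_checkpointed = out.detach().clone()
grad_not_checkpointed = {name: p.grad.detach().clone() for name, p in model.named_parameters()}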

# End of Done


Now that we have run both the checkpointed and the non-checkpointed pass of the model and saved the outputs and parameter gradients, let's compare their values.

In [None]:
# Done: compare the output and parameters gradients with and without checkpoint.
# they must be equal.

assert torch.allclose(output_checkpointed, out_not_checkpointed)
for name in grad_checkpointed:
    assert torch.allclose(grad_checkpointed[name], grad_not_checkpointed[name])

# End of Done

print("All checks passed!")


So, from this example, we can see that it's very easy to use checkpointing on Sequential models and that the checkpoint API doesn't change the outputs or the gradients. The checkpoint API implementation is based on autograd, so there is no need to specify explicitly what the backward pass should look like.

# Gradient Accumulation in Deep Learning (25 points)


### 1. Introduction

Training deep learning models with large batch sizes can be difficult due to memory limitations, especially when using large datasets or deep networks. **Gradient Accumulation** is a technique that allows us to simulate large batch sizes without increasing memory usage. This tutorial will cover:

1. **The theory behind Gradient Accumulation**
2. **Using it in PyTorch**
3. **Observing its effects**

---

## 2. Theory Behind Gradient Accumulation

### 2.1 What is Gradient Accumulation?
Instead of updating the model parameters after every mini-batch, **Gradient Accumulation** allows us to accumulate gradients over multiple mini-batches before performing an update. This effectively simulates training with a larger batch size.

### 2.2 Why Use Gradient Accumulation?
- **Overcome Memory Limits**: Training with large batch sizes often exceeds GPU memory capacity. Accumulating gradients allows training on smaller mini-batches while maintaining the benefits of larger batch training.
- **Stable Training**: Larger batch sizes give more stable updates by reducing the variance of the gradient estimate.
- **Effective Batch Size**: If GPU memory allows batch size `B` but we need `N`, we can accumulate gradients for `N/B` steps before updating.

### 2.3 How Does It Work?
If `loss` is computed on `batch_size = B`, instead of calling `optimizer.step()` every step, we:
1. Compute gradients on `B` and accumulate them.
2. Repeat for `K` iterations, accumulating gradients.
3. Update weights only after `K` steps.
4. Reset gradients after update.

This results in an **effective batch size** of `B * K` without needing extra memory.
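
The claim that `K` accumulated mini-batches behave like one batch of size `B * K` can be checked numerically. The following is a small sketch with made-up data and a toy linear layer (all names are illustrative): it compares the gradient of one full-batch backward pass with the gradient accumulated over `K` mini-batches whose losses are divided by `K`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(3, 1)
criterion = nn.MSELoss()
x_all, y_all = torch.randn(8, 3), torch.randn(8, 1)   # one "large" batch of N = 8

# Reference: a single backward pass on the full batch.
lin.zero_grad()
criterion(lin(x_all), y_all).backward()
full_grad = lin.weight.grad.clone()

# Accumulation: K = 4 mini-batches of B = 2, each loss divided by K.
K = 4
lin.zero_grad()
for x_mb, y_mb in zip(x_all.chunk(K), y_all.chunk(K)):
    (criterion(lin(x_mb), y_mb) / K).backward()   # .grad sums across calls
accum_grad = lin.weight.grad.clone()

grads_equal = torch.allclose(full_grad, accum_grad, atol=1e-6)
print("accumulated == full batch:", grads_equal)
```

The division by `K` matters because `nn.MSELoss` already averages within each mini-batch; without it the accumulated gradient would be `K` times larger than the full-batch one.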


First, create a simple model and implement the function that trains with accumulation.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

def train_with_gradient_accumulation(model, dataloader, accumulation_steps):
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    model.train()
    optimizer.zero_grad()
    mem_usage = []

    for batch_idx, (inputs, targets) in enumerate(dataloader):
        # Done: Get loss and Normalize loss by accumulation steps
        # Then backpropagate loss.  Accumulate gradients and update after
        # `accumulation_steps`
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps
        loss.backward()

        if (batch_idx + 1) % accumulation_steps == 0 or (batch_idx + 1) == len(dataloader):
            optimizer.step()
            optimizer.zero_grad()
        mem_usage.append(get_gpu_memory_usage())

        # End of Done

    return mem_usage


# Done: Define a simple model
class SimpleModel(nn.Module):
    def __init__(self, input_dim=5, hidden_dim=32, output_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
# End of Done


model = SimpleModel().to(device)


Now create a dataloader with the mentioned parameters

In [None]:


def create_dataloader(batch_size, accumulation_steps):
    # Done: create dataloader with batch size and same number of samples
    # for different accumulation_steps

    num_samples = batch_size * accumulation_steps * 5
    x = torch.randn(num_samples, 5)
    y = 2 * x.sum(dim=1, keepdim=True) + 3 + torch.randn(num_samples, 1) * 0.1
    dataset = TensorDataset(x, y)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # End of Done




Let's train with accumulation and record the memory usage.

In [None]:
batch_size = 100  # Small batch size due to memory limits you can change it
effective_batch_size = 400
accumulation_steps = effective_batch_size // batch_size

# Done: train with accumulation_steps and save mem_usage
# dont forget to empty cuda cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()
model_accum = SimpleModel().to(device)
dataloader_accum = create_dataloader(batch_size, accumulation_steps)
mem_usage_accum = train_with_gradient_accumulation(model_accum, dataloader_accum, accumulation_steps)

# End of Done


And now train in the classic way with no accumulation.

In [None]:
# Done: train with accumulation_steps=1 (it means standard back prop)
# and save mem_usage
if torch.cuda.is_available():
    torch.cuda.empty_cache()
model_no_accum = SimpleModel().to(device)
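# Reuse the same accumulation_steps here so both runs see the same number of samples
dataloader_no_accum = create_dataloader(batch_size, accumulation_steps)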
mem_usage_no_accum = train_with_gradient_accumulation(model_no_accum, dataloader_no_accum, accumulation_steps=1)

# End of Done


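Compare the memory usage of the two runs. Both use the same mini-batch size, so their memory footprints should be similar; the saving is relative to training directly with a batch of size `effective_batch_size`: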

In [None]:
# Done: Plot memory usage
plt.figure(figsize=(8,4))
plt.plot(mem_usage_accum, label='With accumulation')
plt.plot(mem_usage_no_accum, label='No accumulation')
plt.xlabel('Iteration')
plt.ylabel('GPU Memory (MB)')
plt.legend()
plt.title('Gradient accumulation memory usage')
plt.grid(True)
plt.show()
# End of Done


## Conclusion

- **Gradient Accumulation** simulates a large batch size while keeping only one small mini-batch in memory at a time.
- It improves **training stability** and allows training on **memory-constrained GPUs**.
- We implemented it in **PyTorch** with a manual training loop.
- We visualized its **effect** on GPU memory usage.

Gradient Accumulation is a useful trick when working with **deep networks and large datasets** on limited hardware. 🚀


# Sources of Randomness (5 points)
Run the code below. The two outputs are different. Why do you think this happened?

In [None]:
input = torch.randn(1, 5).to(device)
model = SimpleModel(input_dim=5, hidden_dim=1500,output_dim=1).to(device)

out = model(input)
print(out)

model = SimpleModel(input_dim=5, hidden_dim=1500,output_dim=1).to(device)

out = model(input)
print(out)

## Controlling Sources of Randomness in PyTorch Models (with GPU)

### Introduction
Randomness plays a crucial role in deep learning, but uncontrolled randomness can lead to inconsistent results. This part explores the sources of randomness in PyTorch and how to control them, especially when using a GPU.

---

## **1. Understanding Sources of Randomness**
In PyTorch, randomness can come from multiple sources:

1. **Python's built-in random module**: Used for operations that involve randomness in Python code.
2. **NumPy**: If NumPy is used for data augmentation or initialization.
3. **PyTorch CPU Randomness**: Random initialization of weights, dropout layers, etc.
4. **PyTorch GPU Randomness**: When CUDA is used, operations can be non-deterministic.
5. **cuDNN Backend**: NVIDIA's cuDNN has optimizations that may introduce non-determinism.

To get reproducible results, all these sources must be controlled.

## **2. Setting Seeds in PyTorch**
To control randomness, we define a function that sets the seed for all sources:

In [None]:
import os
import random
import numpy as np
import torch

def set_seed(seed):
    # Done: set these seeds:
    # Python Hash seed
    # Python random module
    # NumPy seed
    # PyTorch CPU seed
    # PyTorch CUDA seed
    # Multi-GPU seed
    # Ensure deterministic cudnn
    # Disables optimization that may introduce randomness(benchmark)

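    # Note: PYTHONHASHSEED is read at interpreter startup, so setting it here
    # affects subprocesses, not string hashing in the current process
    os.environ['PYTHONHASHSEED'] = str(seed)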
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # End of Done




Now check if the results are the same:

In [None]:
input = torch.randn(1, 5).to(device)

set_seed(123)  # Set seed to 123 (or any fixed value)
model = SimpleModel(input_dim=5, hidden_dim=1500,output_dim=1).to(device)

out = model(input)
print(out)

set_seed(123)  # Set seed to 123 (or any fixed value)

model = SimpleModel(input_dim=5, hidden_dim=1500,output_dim=1).to(device)

out = model(input)
print(out)

## **Key Takeaways**
1. Randomness in PyTorch comes from multiple sources: Python, NumPy, PyTorch (CPU & GPU), and cuDNN.
2. Using `set_seed()` ensures reproducibility in experiments.
3. cuDNN optimizations can introduce non-determinism; setting `torch.backends.cudnn.deterministic = True` helps mitigate this.
4. Always set the seed before model initialization to ensure identical starting conditions.

Reproducibility is critical for debugging and fair benchmarking of deep learning models. 🚀
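
Global seeding as in `set_seed()` affects every consumer of the global random state. When only one component needs to be reproducible, PyTorch also accepts an explicit `torch.Generator`. The following is a small sketch with made-up data; the helper name `shuffled_order` is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def shuffled_order(seed):
    """Hypothetical helper: return the sample order a seeded DataLoader yields."""
    data = TensorDataset(torch.arange(10))
    gen = torch.Generator().manual_seed(seed)   # private RNG, independent of the global seed
    loader = DataLoader(data, batch_size=5, shuffle=True, generator=gen)
    return torch.cat([batch[0] for batch in loader]).tolist()

order_a = shuffled_order(seed=123)
order_b = shuffled_order(seed=123)
print(order_a == order_b)   # same seed, same shuffle order
```

Because the generator is private to the loader, other random calls made between the two runs (weight initialization, dropout) do not change the shuffle order.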


**Note**

Setting `torch.backends.cudnn.deterministic = True` makes the code slower because it forces cuDNN to use deterministic algorithms instead of its default highly optimized, non-deterministic implementations. Here's why:

1. **cuDNN Optimizations.** cuDNN (CUDA Deep Neural Network Library) provides highly optimized implementations of deep learning operations, such as convolutions and matrix multiplications. By default, cuDNN selects the fastest algorithm available based on the given input size, hardware, and configuration.
2. **Deterministic vs. Non-Deterministic Algorithms.** Some of cuDNN's fastest algorithms introduce minor sources of randomness due to floating-point precision differences in parallel execution, particularly in:
   - Convolution operations (e.g., `torch.nn.Conv2d`)
   - Batch normalization
   - Recurrent layers (e.g., LSTMs)

When `torch.backends.cudnn.deterministic = True`, PyTorch forces cuDNN to use only deterministic versions of these algorithms. However, deterministic algorithms are not always the most optimized ones, leading to slower performance.

# Second-Order Gradients in Deep Learning (20 points)

What are Second-Order Gradients?

In deep learning, second-order gradients refer to the derivatives of gradients (i.e., the second derivative of a loss function with respect to model parameters). These are commonly used in optimization methods that require information about the curvature of the loss function.



### **Mathematically**:
- **First-order gradient**: $ g = \nabla_\theta L(\theta) $ (gradient of loss $ L $ w.r.t. parameters $ \theta $)
- **Second-order gradient (Hessian)**: $ H = \nabla^2_\theta L(\theta) $ (derivative of $ g $, which captures curvature)
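
To make the two definitions concrete, here is a small sketch on a hand-picked two-parameter function (the function `toy_objective` and the evaluation point are assumptions chosen so the derivatives are easy to check by hand). It computes $ g $ with `torch.autograd.grad` and $ H $ with `torch.autograd.functional.hessian`.

```python
import torch
from torch.autograd.functional import hessian

def toy_objective(theta):
    # L(theta) = theta_0^2 * theta_1 + theta_1^3
    return theta[0] ** 2 * theta[1] + theta[1] ** 3

theta_pt = torch.tensor([1.0, 2.0], requires_grad=True)

# First-order gradient: g = [2*t0*t1, t0^2 + 3*t1^2]
(g_pt,) = torch.autograd.grad(toy_objective(theta_pt), theta_pt)

# Hessian: H = [[2*t1, 2*t0], [2*t0, 6*t1]]
H_pt = hessian(toy_objective, theta_pt.detach())

print(g_pt)   # tensor([ 4., 13.])
print(H_pt)   # tensor([[ 4.,  2.], [ 2., 12.]])
```

For a model with `n` parameters the Hessian has `n * n` entries, which is why it is rarely formed explicitly for real networks.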


### **Why Use Second-Order Gradients?**

 *   Better Optimization: Second-order methods like Newton's method use curvature information and can converge in fewer iterations than first-order methods (like SGD).
 *   Natural Gradient Descent: A related family of methods rescales the update differently in each direction using a curvature-like matrix (the Fisher information matrix, which plays a role similar to the Hessian).
 *   Meta-Learning: Algorithms like MAML (Model-Agnostic Meta-Learning) differentiate through an inner gradient step, so the exact meta-gradient contains second-order terms.
 *   Adversarial Training: Some robustness methods penalize the gradient of the loss with respect to the input, and training on such a penalty requires differentiating through a gradient.
 *   Regularization: Used in some regularization techniques such as curvature-based penalties.



### **Challenges of Second-Order Gradients**

 * Computationally Expensive: Computing second-order derivatives (Hessian) can be costly, especially for deep networks.
 * Memory Intensive: Requires additional memory, making it impractical for very large models.
 * Numerical Stability: Sometimes leads to unstable gradients and requires careful tuning.

The second derivative (Hessian matrix) describes how the gradient changes as the parameters move. Adding a second-order term to the update lets the step adapt to how sharply curved or flat the loss landscape is.

In the simplest case:

$
\theta' = \theta - \alpha \nabla_\theta L(\theta) - \beta \nabla^2_\theta L(\theta)
$

where:

 * $ \nabla^2_\theta L(\theta) $ (the second-order gradient) adjusts the update based on curvature,
 * $ \beta $ is a small scaling factor.
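
Since the Hessian is a matrix, the second-order term in an update like this has to be a vector derived from it. One common trick, which the training loop in this section appears to follow, is to differentiate the squared norm of the first-order gradient; this yields $ 2Hg $ without ever forming $ H $. The sketch below checks that identity on a toy quadratic (the matrix `A` and all variable names are chosen for illustration).

```python
import torch

A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])    # symmetric, so the Hessian of L is A
theta_q = torch.tensor([1.0, -1.0], requires_grad=True)

loss_q = 0.5 * theta_q @ A @ theta_q           # L = 1/2 theta^T A theta
# create_graph=True keeps the graph of g so it can be differentiated again
(g_q,) = torch.autograd.grad(loss_q, theta_q, create_graph=True)   # g = A theta

grad_norm_sq = (g_q ** 2).sum()                # scalar ||g||^2
(second_q,) = torch.autograd.grad(grad_norm_sq, theta_q)           # equals 2 H g

expected_q = 2 * A @ (A @ theta_q.detach())    # 2 H g written out by hand
print(second_q, expected_q)
```

Reducing the gradient to a scalar first is what allows a single extra backward pass; the cost is roughly that of one more backpropagation rather than a full Hessian.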

In [None]:
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Done: define a model (1d to 1d)
        self.net = nn.Sequential(
            nn.Linear(1, 16),
            nn.ReLU(),
            nn.Linear(16, 1)
        )
        # End of Done

    def forward(self, x):
        # Done: implement forward
        return self.net(x)
        # End of Done


def generate_data():
    # Done: Generate data for 1d to 1d model
    # a simple linear data with noise is ok. For example: y = 2x + 3 + noise
    x = torch.linspace(-5, 5, steps=200).unsqueeze(1)
    noise = torch.randn_like(x) * 0.2
    y = 2 * x + 3 + noise
    # End of Done
    return x, y


model = SimpleNet()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)


x_train, y_train = generate_data()
num_epochs = 100


In [None]:

for epoch in range(num_epochs):

    # Done: implement the loop using second-order gradients.
    # First, compute the first-order gradient. Then convert the first-order
    # gradient to a scalar before computing the second-order gradient. Update
    # the parameters manually, using the first-order and second-order gradients.
    # Print the loss at every step.
    optimizer.zero_grad()
    preds = model(x_train)
    loss = loss_fn(preds, y_train)
    grads_first = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_scalar = sum((g ** 2).sum() for g in grads_first)
    grads_second = torch.autograd.grad(grad_scalar, model.parameters())

    with torch.no_grad():
        for p, g1, g2 in zip(model.parameters(), grads_first, grads_second):
            p -= 0.01 * g1 + 0.001 * g2

    print(f"Epoch {epoch+1}: loss={loss.item():.4f}")
    # End of Done


print("Training complete!")


In [None]:
# Done: evaluate the model using some data and visualize the model output
# and compare it with the real function
x_eval = torch.linspace(-6, 6, steps=100).unsqueeze(1)
y_true = 2 * x_eval + 3
with torch.no_grad():
    y_pred = model(x_eval)

plt.figure(figsize=(6,4))
plt.plot(x_eval.numpy(), y_true.numpy(), label='True function')
plt.scatter(x_eval.numpy(), y_pred.numpy(), s=10, label='Model prediction')
plt.legend()
plt.title('Second-order training result')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.show()

# End of Done


## Interpretation of Moving in the Second-Order Direction

* If the second derivative is large (high curvature) → The loss is changing rapidly → Smaller step sizes.
* If the second derivative is small (low curvature) → The loss changes slowly → Larger step sizes.

This is the core idea behind Newton's method.
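
For reference, a Newton step replaces the fixed learning rate with the inverse Hessian, $ \theta' = \theta - H^{-1} g $, so directions with high curvature get small steps and flat directions get large ones. Below is a minimal sketch on a toy quadratic, where a single Newton step lands on the minimizer; the matrix `A_n`, vector `b_n`, and starting point are arbitrary choices for the example.

```python
import torch
from torch.autograd.functional import hessian

A_n = torch.tensor([[3.0, 1.0], [1.0, 2.0]])   # positive definite curvature
b_n = torch.tensor([1.0, 1.0])

def quad_loss(theta):
    # L(theta) = 1/2 theta^T A theta - b^T theta, minimized where A theta = b
    return 0.5 * theta @ A_n @ theta - b_n @ theta

theta_n = torch.tensor([5.0, -4.0], requires_grad=True)
(g_n,) = torch.autograd.grad(quad_loss(theta_n), theta_n)
H_n = hessian(quad_loss, theta_n.detach())

# Newton step: solve H * step = g instead of inverting H explicitly
theta_new = theta_n.detach() - torch.linalg.solve(H_n, g_n)

theta_star = torch.linalg.solve(A_n, b_n)      # exact minimizer for comparison
print(theta_new, theta_star)
```

On non-quadratic losses such as neural networks, one step is not enough and the Hessian may not be positive definite, which is why practical methods damp or approximate it.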