<a href="https://colab.research.google.com/github/AnasAlhasan/large-models-course/blob/main/notebooks/distributed_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip -q install torch torchvision torchaudio transformers

#**1. Activation Checkpointing**
This technique is all about **trading computation for memory**. When you're training a neural network, the framework needs to store many intermediate values (called "activations") during the forward pass so it can compute the gradients during the backward pass. For very deep or wide models, these activations can eat up all your GPU memory.
**Activation checkpointing** is a clever trick: it avoids storing some of these activations and simply recomputes them when they're needed during the backward pass.
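
To make the recomputation visible, here is a minimal sketch that counts how often a layer's `forward` runs. The `CountingLayer` class and its `calls` counter are hypothetical names introduced only for this illustration, and the sketch assumes a PyTorch version in which `torch.utils.checkpoint.checkpoint` accepts `use_reentrant=False`.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CountingLayer(nn.Module):
    """Hypothetical helper: a linear + ReLU layer that counts its forward calls."""

    def __init__(self, size):
        super().__init__()
        self.linear = nn.Linear(size, size)
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return torch.relu(self.linear(x))


layer = CountingLayer(8)
x = torch.randn(2, 8, requires_grad=True)

# Forward pass through the checkpointed layer: activations inside are not kept.
out = checkpoint(layer, x, use_reentrant=False).sum()
calls_after_forward = layer.calls

# Backward pass: the layer is run again to rebuild the activations it needs.
out.backward()
calls_after_backward = layer.calls

print("forward calls after the forward pass:", calls_after_forward)
print("forward calls after the backward pass:", calls_after_backward)
```

The extra call during `backward()` is the computation being paid in exchange for not storing the activations.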

#Importing Libraries

In [11]:
import torch
import torch.nn as nn
import torch.utils.checkpoint as checkpoint

#Feed-forward Block

In [16]:
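class Block(nn.Module):
  """Two linear layers with a ReLU in between."""
  def __init__(self, size):
    super().__init__()

    self.Linear1 = nn.Linear(size, size)
    self.relu = nn.ReLU()
    self.Linear2 = nn.Linear(size, size)

  def forward(self, x):
    return self.Linear2(self.relu(self.Linear1(x)))


class BlockwithCheckpoint(nn.Module):
  """Same block, but its activations are recomputed during the backward pass."""
  def __init__(self, size):
    super().__init__()

    self.block = Block(size)

  def forward(self, x):
    # use_reentrant=False selects the non-reentrant implementation;
    # recent PyTorch releases warn when this argument is left unspecified.
    return checkpoint.checkpoint(self.block, x, use_reentrant=False)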



#Comparing the two approaches

In [25]:
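# Same input values for both models, as two independent leaf tensors
x1 = torch.randn(1, 1000, requires_grad=True)
x2 = x1.clone().detach().requires_grad_(True)

# Note: each model gets its own random initial weights,
# so the two printed sums are not expected to match.
model_normal = Block(1000)
model_checkpoint = BlockwithCheckpoint(1000)

out1 = model_normal(x1).sum()
out2 = model_checkpoint(x2).sum()

out1.backward()
print(out1)
out2.backward()  # the checkpointed block is recomputed here
print(out2)
print("Both models ran successfully!")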

tensor(7.4170, grad_fn=<SumBackward0>)
tensor(-11.2639, grad_fn=<SumBackward0>)
Both models ran successfully!
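
The two sums above differ because `model_normal` and `model_checkpoint` were initialized with independent random weights. As a rough sketch of a like-for-like comparison, the snippet below runs one shared block both ways and checks that the outputs and input gradients agree. The variable names (`block`, `x_plain`, `x_ckpt`) and the use of `nn.Sequential` in place of the `Block` class are choices made for this illustration, and it assumes `use_reentrant=False` is available in the installed PyTorch.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# One block, so both code paths use exactly the same weights.
block = nn.Sequential(nn.Linear(1000, 1000), nn.ReLU(), nn.Linear(1000, 1000))

x_plain = torch.randn(1, 1000, requires_grad=True)
x_ckpt = x_plain.clone().detach().requires_grad_(True)

# Regular forward/backward: activations are stored.
out_plain = block(x_plain).sum()
out_plain.backward()

# Checkpointed forward/backward: activations are recomputed.
out_ckpt = checkpoint(block, x_ckpt, use_reentrant=False).sum()
out_ckpt.backward()

# Compare only outputs and input gradients; parameter gradients
# have accumulated across both backward calls on the shared block.
outputs_match = torch.allclose(out_plain, out_ckpt)
input_grads_match = torch.allclose(x_plain.grad, x_ckpt.grad)
print("outputs match:", outputs_match)
print("input gradients match:", input_grads_match)
```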


#**2. Gradient Accumulation**
This technique is a way to simulate a larger batch size than your GPU can physically handle. Training with larger batches can make the training process more stable. If a batch size of 1024 gives the best results but you can only fit a batch of 256 into memory, gradient accumulation is the solution.

The idea is to process several small batches, calculate their gradients, and add them up (accumulate them). Only after you've processed enough small batches to equal your desired large batch size do you update the model's weights.
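
As a short worked equation, assume the loss is a mean over samples and the large batch $B$ of size $N$ is split into $K$ equal micro-batches $B_1, \dots, B_K$ (the symbols $N$, $K$, $\ell_i$, and $\theta$ are notation introduced here, not taken from the notebook). Then the large-batch gradient is the sum of the micro-batch gradients, each scaled by $1/K$:

$$
\nabla_\theta L_B
= \frac{1}{N} \sum_{i \in B} \nabla_\theta \ell_i
= \sum_{k=1}^{K} \frac{1}{K} \left( \frac{K}{N} \sum_{i \in B_k} \nabla_\theta \ell_i \right)
= \sum_{k=1}^{K} \nabla_\theta \left( \frac{L_{B_k}}{K} \right)
$$

Under these assumptions, dividing each micro-batch loss by the number of accumulation steps before calling `backward()` reproduces the gradient of the large batch.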

In [26]:
import torch.optim as optim

In [29]:
model = nn.Linear(10,1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

data = torch.randn(100, 10)
targets = torch.randn(100, 1)

batch_size = 20
accum_steps = 4


optimizer.zero_grad()

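for i in range(0, len(data), batch_size):
  x = data[i: i+batch_size]
  y = targets[i: i+batch_size]

  out = model(x)
  loss = loss_fn(out, y)  # compare predictions with the targets, not the inputs

  # Scale so the accumulated gradient is an average over accum_steps batches
  loss = loss / accum_steps

  loss.backward()  # gradients add up in .grad across iterations

  # Update the weights once every accum_steps small batches.
  # Note: 100 / 20 = 5 batches here, so the fifth batch's gradients
  # are accumulated but never applied.
  if (i // batch_size + 1) % accum_steps == 0:
    optimizer.step()
    optimizer.zero_grad()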

print("Training with gradient accumulation complete")

Training with gradient accumulation complete
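
As a sanity check on the idea, the following sketch compares the gradient from one batch of 80 samples with the gradient accumulated over four batches of 20. The sizes and variable names (`full_grad`, `accum_grad`, `micro_batch_size`) are assumptions chosen for this example, and the equivalence relies on `nn.MSELoss` using its default mean reduction with equally sized batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()  # default reduction is "mean"

data = torch.randn(80, 10)
targets = torch.randn(80, 1)

# Reference: gradient from a single pass over all 80 samples.
model.zero_grad()
loss_fn(model(data), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulated: four batches of 20, each loss scaled by 1 / accum_steps.
model.zero_grad()
accum_steps = 4
micro_batch_size = 20
for i in range(0, len(data), micro_batch_size):
    x = data[i: i + micro_batch_size]
    y = targets[i: i + micro_batch_size]
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # adds into model.weight.grad
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print("accumulated gradient matches full-batch gradient:", grads_match)
```

If the batches were of unequal size, a plain division by `accum_steps` would no longer give the exact large-batch average.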


