# 1. Dataset and Model

## 1.1. Random Dataset
This dataset randomly generates input and output tensors of the given sizes for a given number of samples.

It calls `torch.manual_seed` so that the same data is generated on every run, for consistency.

A data loader then wraps this dataset and creates batches of the given batch size.

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

class RandomTensorDataset(Dataset):
  def __init__(self, num_samples, in_shape, out_shape):
    self.num_samples = num_samples
    torch.manual_seed(12345)
    self.data = [(torch.randn(in_shape), torch.randn(out_shape)) for _ in range(num_samples)]

  def __len__(self):
    return self.num_samples

  def __getitem__(self, idx):
    return self.data[idx]

input_size  = 6
output_size = 2

# dataset construction
num_samples = 32
dataset = RandomTensorDataset(
  num_samples=num_samples,
  in_shape=input_size,
  out_shape=output_size
  )

batch_size  = 32 # One batch in total, since the dataset has 32 samples
dataloader = DataLoader(
  dataset,
  batch_size=batch_size,
  pin_memory=True,
  shuffle=False
  )
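
As a quick sanity check of the setup above, the sketch below rebuilds the same dataset twice and confirms two things: the fixed seed gives identical samples, and the loader yields a single batch. It is only an illustration. `pin_memory` is left at its default so the snippet also runs quietly on a CPU-only machine, and `ds_a`, `ds_b`, and `batches` are names introduced for this example.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomTensorDataset(Dataset):
  def __init__(self, num_samples, in_shape, out_shape):
    self.num_samples = num_samples
    torch.manual_seed(12345)  # fixed seed, so every instance holds the same data
    self.data = [(torch.randn(in_shape), torch.randn(out_shape)) for _ in range(num_samples)]

  def __len__(self):
    return self.num_samples

  def __getitem__(self, idx):
    return self.data[idx]

ds_a = RandomTensorDataset(num_samples=32, in_shape=6, out_shape=2)
ds_b = RandomTensorDataset(num_samples=32, in_shape=6, out_shape=2)

# Same seed -> identical samples across constructions
same_data = all(
  torch.equal(xa, xb) and torch.equal(ya, yb)
  for (xa, ya), (xb, yb) in zip(ds_a.data, ds_b.data)
)

# batch_size equals the dataset size, so the loader yields a single batch
batches = list(DataLoader(ds_a, batch_size=32, shuffle=False))
x_batch, y_batch = batches[0]
print(len(batches), tuple(x_batch.shape), tuple(y_batch.shape), same_data)
```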

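## 1.2. A Simple 2-Layer MLP Model

Note that there is no activation function between the two linear layers, so the model is purely linear. This keeps the manual gradient computation in Section 3 simple.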

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
  def __init__(self, in_feature, hidden_units, out_feature):
    super().__init__()
    torch.manual_seed(12345)
    self.hidden_layer = nn.Linear(in_feature, hidden_units)
    self.output_layer = nn.Linear(hidden_units, out_feature)

  def forward(self, x):
    x = self.hidden_layer(x)
    x = self.output_layer(x)
    return x

device = 'cuda' if torch.cuda.is_available() else 'cpu' # Using single GPU (GPU 0) if available otherwise CPU

# model construction
layer_1_units = input_size
layer_2_units = 4
layer_3_units = output_size
model = MLP(
  in_feature=layer_1_units,
  hidden_units=layer_2_units,
  out_feature=layer_3_units
  ).to(device)

loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(),lr=0.01)

## 1.3. Extracting Each Layer's Original Weights and Biases

In [3]:
# hidden layer parameters
W1 = model.hidden_layer.weight
b1 = model.hidden_layer.bias

# output layer parameters
W2 = model.output_layer.weight
b2 = model.output_layer.bias

print(f"W1.shape: {W1.shape},  b1.shape: {b1.shape}\nW2.shape: {W2.shape} b2.shape: {b2.shape}\n")

W1.shape: torch.Size([4, 6]),  b1.shape: torch.Size([4])
W2.shape: torch.Size([2, 4]) b2.shape: torch.Size([2])



# 2. Run One Epoch Using PyTorch On Single Device

We do not update the model parameters yet, because the next section needs them unchanged to compute the forward and backward passes manually.

In [4]:
# One iteration using PyTorch
print(f'Using {device} For One Iteration of Forward and Backward Passes Using PyTorch')
for x, y in dataloader:
  x = x.to(device)
  y = y.to(device)

  # Forward Pass
  out = model(x)

  # Calculate loss
  loss = loss_fn(out, y)

  # Zero grad
  optimizer.zero_grad(set_to_none=True)

  # Backward Pass
  loss.backward()

  # We update the model in Section 4 (`optimizer.step()`)

print(f'{out=}')

Using cuda For One Iteration of Forward and Backward Passes Using PyTorch
out=tensor([[ 0.1546, -0.5383],
        [ 0.1222, -0.3119],
        [ 0.6736, -0.2439],
        [ 0.8124, -0.3054],
        [ 0.4496, -0.4473],
        [ 0.4101, -0.5774],
        [ 0.4743, -0.6111],
        [ 0.5034, -0.3675],
        [ 0.3844, -0.3518],
        [ 0.5095, -0.5411],
        [ 0.6302, -0.4292],
        [ 0.0960, -0.5650],
        [ 0.5374, -0.3396],
        [ 0.4914, -0.3475],
        [ 0.9440, -0.1130],
        [ 0.7744, -0.2111],
        [ 0.4808, -0.3415],
        [ 0.3199, -0.5394],
        [ 0.5131, -0.4608],
        [-0.0191, -0.6592],
        [ 0.3838, -0.4386],
        [ 0.5702, -0.2098],
        [ 0.2020, -0.4805],
        [ 0.1737, -0.2273],
        [ 0.6647, -0.2574],
        [ 0.6794, -0.2492],
        [ 0.6364, -0.3637],
        [ 1.0508, -0.2086],
        [ 0.5117, -0.4432],
        [ 0.4517, -0.2172],
        [ 0.5502, -0.5689],
        [ 0.6571, -0.0120]], device='cuda:0', grad_fn=

# 3. Manual Computation
* Simulate Forward Pass on a single device
* Simulate Tensor Parallelism (TP) of Forward Pass on two devices
* Simulate Backward Pass on a single device
* Simulate Tensor Parallelism (TP) of Backward Pass on two devices





First, let's define a function that checks whether two tensors are the same.

Note that since these are floating-point operations, it is acceptable if the tensors are not exactly equal but very close.

In [5]:
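def cmp(s, t1, t2):
  # Compare tensors t1 and t2 and print a one-line report labelled s
  ex = torch.all(t1 == t2).item()        # exact element-wise equality
  app = torch.allclose(t1, t2)           # equality within the default tolerances
  maxdiff = (t1 - t2).abs().max().item() # largest absolute difference
  print(f'{s:15s} | exact: {str(ex):5s} | approximate: {str(app):5s} | maxdiff: {maxdiff}')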
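
To make the difference between the `exact` and `approximate` columns concrete, here is a small sketch. It perturbs a tensor by one float32 machine epsilon, so the two tensors are no longer bit-identical but still pass `torch.allclose`, whose default tolerances are `rtol=1e-05` and `atol=1e-08`. The tensors `t1` and `t2` are made up for this illustration.

```python
import torch

eps = torch.finfo(torch.float32).eps  # spacing between 1.0 and the next float32
t1 = torch.ones(3)
t2 = t1 + eps                          # differs from t1 in the last bit only

exact = torch.equal(t1, t2)            # bit-for-bit comparison
approx = torch.allclose(t1, t2)        # |t1 - t2| <= atol + rtol * |t2|
maxdiff = (t1 - t2).abs().max().item()
print(f'exact: {exact} | approximate: {approx} | maxdiff: {maxdiff}')
```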

## 3.1. Forward Pass
We compute the forward pass manually and compare the result with the PyTorch output from the previous section (`out`).


### 3.1.1 Simulate Forward Pass On One Device
The manual output exactly matches the model output that PyTorch produced in the previous section.

In [6]:
# Forward Pass Manually
h = x @ W1.T + b1
manual_out = h @ W2.T + b2
cmp('forward pass', manual_out, out) # is it consistent with pytorch?

forward pass    | exact: True  | approximate: True  | maxdiff: 0.0


### 3.1.2 Simulate Forward Pass On Two Devices Using Tensor Parallelism (TP)
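With TP, the computation is distributed across two devices. We assume that each layer's parameters are sharded between the two devices along the output-feature dimension (the rows of `weight`, which are the columns of `weight.T` in `x @ weight.T`, hence "column-wise"). We then perform the forward computation manually to understand how TP works and where it needs communication.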
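
In matrix form, the idea can be sketched as follows. The superscripts $(0)$ and $(1)$ denote the two row blocks of the hidden layer's `weight` and the matching halves of its bias. This notation is introduced here for illustration and does not appear in the code.

```latex
\begin{aligned}
W_1 &= \begin{bmatrix} W_1^{(0)} \\ W_1^{(1)} \end{bmatrix}, \qquad
b_1 = \begin{bmatrix} b_1^{(0)} & b_1^{(1)} \end{bmatrix} \\
h &= x W_1^{\top} + b_1
   = \begin{bmatrix} x W_1^{(0)\top} + b_1^{(0)} & x W_1^{(1)\top} + b_1^{(1)} \end{bmatrix}
   = \begin{bmatrix} h^{(0)} & h^{(1)} \end{bmatrix}
\end{aligned}
```

Each device can compute its block $h^{(i)}$ without communication. The next layer, however, takes the full $h$ as input, which is where an all-gather is needed.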

Let's first see how the model parameters are divided between the two devices:

In [7]:
############## on GPU 0
  # Layer 1
W1_0 = W1[:2]
b1_0 = b1[:2]
  # Layer 2
W2_0 = W2[:1]
b2_0 = b2[:1]
print(f"GPU 0 : W1.shape: {W1_0.shape},  b1.shape: {b1_0.shape}\nW2.shape: {W2_0.shape} b2.shape: {b2_0.shape}\n")

############## on GPU 1
  # Layer 1
W1_1 = W1[2:]
b1_1 = b1[2:]
  # Layer 2
W2_1 = W2[1:]
b2_1 = b2[1:]
print(f"GPU 1 : W1.shape: {W1_1.shape},  b1.shape: {b1_1.shape}\nW2.shape: {W2_1.shape} b2.shape: {b2_1.shape}\n")

GPU 0 : W1.shape: torch.Size([2, 6]),  b1.shape: torch.Size([2])
W2.shape: torch.Size([1, 4]) b2.shape: torch.Size([1])

GPU 1 : W1.shape: torch.Size([2, 6]),  b1.shape: torch.Size([2])
W2.shape: torch.Size([1, 4]) b2.shape: torch.Size([1])



Now we compute the forward pass on each device locally (independently). After each layer, an all-gather collects the shards, and we compare the final gathered output with the PyTorch output.

The gathered output matches the output of the PyTorch forward pass up to floating-point rounding.

In [8]:
# GPU 0
h_0 = x @ W1_0.T + b1_0
# GPU 1
h_1 = x @ W1_1.T + b1_1

# All_gather h before starting the forward pass for the next layer
gathered_h = torch.cat((h_0, h_1), dim=1)

# GPU 0
manual_out_0 = gathered_h @ W2_0.T + b2_0
# GPU 1
manual_out_1 = gathered_h @ W2_1.T + b2_1

# All_gather output
gathered_manual_out = torch.cat((manual_out_0, manual_out_1), dim=1)

cmp('TP - forward pass', gathered_manual_out, out)

TP - forward pass | exact: False | approximate: True  | maxdiff: 5.960464477539063e-08
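
The two-device split above generalizes to more shards. The following is a minimal sketch, not a distributed implementation: it uses `torch.chunk` to split a weight matrix along its output dimension and `torch.cat` to stand in for the all-gather, all on one device. The helper name `sharded_linear`, the `_demo` tensors, and the random weights are assumptions chosen for illustration.

```python
import torch

def sharded_linear(x, weight, bias, world_size):
  """Simulate an output-sharded linear layer followed by an all-gather."""
  w_shards = torch.chunk(weight, world_size, dim=0)  # split the output features
  b_shards = torch.chunk(bias, world_size, dim=0)
  # Each "device" computes its slice of the output independently
  partial_outputs = [x @ w.T + b for w, b in zip(w_shards, b_shards)]
  # Concatenating along the feature dimension plays the role of all_gather
  return torch.cat(partial_outputs, dim=1)

torch.manual_seed(0)
x_demo = torch.randn(32, 6)
w_hidden, b_hidden = torch.randn(4, 6), torch.randn(4)
w_out, b_out = torch.randn(2, 4), torch.randn(2)

# Unsharded reference versus the simulated two-way TP forward pass
reference = (x_demo @ w_hidden.T + b_hidden) @ w_out.T + b_out
h_demo = sharded_linear(x_demo, w_hidden, b_hidden, world_size=2)
tp_out = sharded_linear(h_demo, w_out, b_out, world_size=2)
print(torch.allclose(tp_out, reference, atol=1e-5))
```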


## 3.2 Backward Pass
Now let's do the backward pass manually, calculate the gradients, and compare them with what `loss.backward()` computed in Section 2.

### 3.2.1 Simulate Backward Pass On One Device
* In the following, `d{variable}` means `dL/d{variable}`.
* We use unsqueeze to add a dimension with size 1 in order to use broadcasting for element-wise multiplication.
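
For reference, the gradients in this section follow from the chain rule. The derivation below is a sketch that assumes the mean-reduced `nn.MSELoss` from Section 1, with $N$ the batch size and $O$ the output size (symbols introduced here for illustration):

```latex
\begin{aligned}
L &= \frac{1}{N O} \sum_{n=1}^{N} \sum_{j=1}^{O} \left(\mathrm{out}_{nj} - y_{nj}\right)^2 \\
\frac{\partial L}{\partial \mathrm{out}} &= \frac{2}{N O} \left(\mathrm{out} - y\right) \\
\frac{\partial L}{\partial W_2} &= \left(\frac{\partial L}{\partial \mathrm{out}}\right)^{\top} h, \qquad
\frac{\partial L}{\partial b_2} = \sum_{n=1}^{N} \frac{\partial L}{\partial \mathrm{out}_{n}} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial \mathrm{out}}\, W_2 \\
\frac{\partial L}{\partial W_1} &= \left(\frac{\partial L}{\partial h}\right)^{\top} x, \qquad
\frac{\partial L}{\partial b_1} = \sum_{n=1}^{N} \frac{\partial L}{\partial h_{n}}
\end{aligned}
```

With $O = 2$ the factor $2/O$ equals 1, so $\partial L / \partial \mathrm{out} = (\mathrm{out} - y)/N$. The batch sums combined with the $1/N$ factor are what the `.mean(dim=0)` calls compute.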

In [9]:
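# Backward Pass Manually
# For mean-reduced MSE, dL/dout = 2 * (out - y) / (batch_size * output_size).
# Here output_size = 2, so 2 / output_size = 1, and the 1 / batch_size factor
# is applied by the .mean(dim=0) calls below.
dout = (manual_out - y)

# Output Layer
dW2 = (dout.unsqueeze(2) * h.unsqueeze(1)).mean(dim=0) # Average across the 32 data points in the batch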
db2 = dout.mean(dim=0)
# Comparing with PyTorch gradients
cmp('dW2', dW2, W2.grad)
cmp('db2', db2, b2.grad)

dh = (dout.unsqueeze(2) * W2.unsqueeze(0)).sum(1)

# Hidden Layer
dW1 = (dh.unsqueeze(2) * x.unsqueeze(1)).mean(dim=0)
db1 = dh.mean(0)
cmp('dW1', dW1, W1.grad)
cmp('db1', db1, b1.grad)

dW2             | exact: False | approximate: True  | maxdiff: 2.9802322387695312e-08
db2             | exact: True  | approximate: True  | maxdiff: 0.0
dW1             | exact: False | approximate: True  | maxdiff: 2.9802322387695312e-08
db1             | exact: False | approximate: True  | maxdiff: 1.862645149230957e-09
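
The broadcasting expressions above are equivalent to plain matrix products. The sketch below is an illustration under assumed sizes, not part of the original notebook. It uses an output size of 3, so the `2 / output_size` factor of the mean-reduced MSE gradient is no longer 1 and must be written explicitly, and it checks the result against autograd. All names ending in `_demo` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, in_f, hid_f, out_f = 8, 6, 4, 3  # assumed sizes for this sketch
x_demo = torch.randn(n, in_f)
y_demo = torch.randn(n, out_f)
hidden_demo = nn.Linear(in_f, hid_f)
output_demo = nn.Linear(hid_f, out_f)

# Reference gradients from autograd
h_demo = hidden_demo(x_demo)
out_demo = output_demo(h_demo)
nn.MSELoss()(out_demo, y_demo).backward()

# Manual gradients written as matrix products
with torch.no_grad():
  dout_demo = 2.0 * (out_demo - y_demo) / (n * out_f)  # dL/dout for mean-reduced MSE
  dW2_demo = dout_demo.T @ h_demo                      # shape (out_f, hid_f)
  db2_demo = dout_demo.sum(dim=0)
  dh_demo = dout_demo @ output_demo.weight             # shape (n, hid_f)
  dW1_demo = dh_demo.T @ x_demo                        # shape (hid_f, in_f)
  db1_demo = dh_demo.sum(dim=0)

grads_match = all(
  torch.allclose(manual, param.grad, atol=1e-6)
  for manual, param in [
    (dW2_demo, output_demo.weight), (db2_demo, output_demo.bias),
    (dW1_demo, hidden_demo.weight), (db1_demo, hidden_demo.bias),
  ]
)
print(grads_match)
```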


### 3.2.2 Simulate Backward Pass On Two Devices Using Tensor Parallelism (TP)

In [10]:
# Output Layer
# GPU 0
dout_0 = (manual_out_0 - y[:, :1])
dW2_0 = (dout_0.unsqueeze(2) * gathered_h.unsqueeze(1)).mean(0)
db2_0 = dout_0.mean(0)
cmp('dW2_0', dW2_0, W2.grad[:1])
cmp('db2_0', db2_0, b2.grad[:1])

# GPU 1
dout_1 = (manual_out_1 - y[:, 1:])
dW2_1 = (dout_1.unsqueeze(2) * gathered_h.unsqueeze(1)).mean(0)
db2_1 = dout_1.mean(0)
cmp('dW2_1', dW2_1, W2.grad[1:])
cmp('db2_1', db2_1, b2.grad[1:])
###########

dh_0 = (dout_0.unsqueeze(2) * W2_0.unsqueeze(0)).sum(1)
dh_1 = (dout_1.unsqueeze(2) * W2_1.unsqueeze(0)).sum(1)
# Reduce-scatter dh
reduce_scatter_dh_0 = (dh_0 + dh_1)[:, :2]
reduce_scatter_dh_1 = (dh_0 + dh_1)[:, 2:]

###########
# Hidden Layer
# GPU 0
dW1_0 = (reduce_scatter_dh_0.unsqueeze(2) * x.unsqueeze(1)).mean(0)
db1_0 = reduce_scatter_dh_0.mean(0)
cmp('dW1_0', dW1_0, W1.grad[:2])
cmp('db1_0', db1_0, b1.grad[:2])

# GPU 1
dW1_1 = (reduce_scatter_dh_1.unsqueeze(2) * x.unsqueeze(1)).mean(0)
db1_1 = reduce_scatter_dh_1.mean(0)
cmp('dW1_1', dW1_1, W1.grad[2:])
cmp('db1_1', db1_1, b1.grad[2:])

dW2_0           | exact: False | approximate: True  | maxdiff: 2.9802322387695312e-08
db2_0           | exact: False | approximate: True  | maxdiff: 5.960464477539063e-08
dW2_1           | exact: False | approximate: True  | maxdiff: 1.4901161193847656e-08
db2_1           | exact: False | approximate: True  | maxdiff: 1.4901161193847656e-08
dW1_0           | exact: False | approximate: True  | maxdiff: 7.450580596923828e-09
db1_0           | exact: False | approximate: True  | maxdiff: 1.862645149230957e-09
dW1_1           | exact: False | approximate: True  | maxdiff: 2.9802322387695312e-08
db1_1           | exact: True  | approximate: True  | maxdiff: 0.0
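
The reduce-scatter step is the one place where the backward pass above combines data from both devices. Every device holds a full-width partial `dh`, the partials are summed, and each device keeps only the columns that belong to its shard of the hidden layer. A minimal single-process sketch of that collective, using a hypothetical helper `simulate_reduce_scatter` and made-up tensors, might look like this:

```python
import torch

def simulate_reduce_scatter(partials, dim=1):
  """Sum one partial tensor per device, then split the sum into one chunk per device."""
  total = torch.stack(partials).sum(dim=0)                  # reduce: element-wise sum
  return list(torch.chunk(total, len(partials), dim=dim))   # scatter: one slice each

torch.manual_seed(0)
# Two "devices", each holding a full-width (batch, hidden) partial gradient
partial_0 = torch.randn(32, 4)
partial_1 = torch.randn(32, 4)

shard_0, shard_1 = simulate_reduce_scatter([partial_0, partial_1])
print(shard_0.shape, shard_1.shape)
```

In a real multi-GPU run this role would be played by a collective such as `torch.distributed.reduce_scatter`. The helper above only mimics its result on one device.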


# 4. Update The Model

## 4.1 Manually
We use the gradients calculated in Section 3 to update the model manually and compare the result with PyTorch's SGD optimizer.

We compute the manual update first because it needs the current model parameters. If we ran `optimizer.step()` first, the model parameters would change and the manual computation would be incorrect.

In [11]:
# Using the same learning rate as Section 1
lr = 0.01
manual_nW1 = W1 - lr * dW1
manual_nb1 = b1 - lr * db1
manual_nW2 = W2 - lr * dW2
manual_nb2 = b2 - lr * db2

## 4.2 PyTorch
Now we run `optimizer.step()` to update the model, then compare the updated parameters with the manually computed ones from the previous subsection.

Note that `W1`, `b1`, `W2`, and `b2` refer to the model's parameter tensors, which `optimizer.step()` updates in place, so they now hold the updated values.

In [12]:
# Run optimizer.step() to update the model
optimizer.step()

In [13]:
cmp('manual_nW1', manual_nW1, W1)
cmp('manual_nb1', manual_nb1, b1)
cmp('manual_nW2', manual_nW2, W2)
cmp('manual_nb2', manual_nb2, b2)

manual_nW1      | exact: False | approximate: True  | maxdiff: 2.9802322387695312e-08
manual_nb1      | exact: True  | approximate: True  | maxdiff: 0.0
manual_nW2      | exact: True  | approximate: True  | maxdiff: 0.0
manual_nb2      | exact: True  | approximate: True  | maxdiff: 0.0
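
The update rule being checked is plain SGD without momentum or weight decay, `p_new = p - lr * p.grad`. The sketch below verifies this on a small stand-alone layer. `layer_demo` and the other names are hypothetical and independent of the notebook's model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr_demo = 0.01
layer_demo = nn.Linear(6, 2)
opt_demo = optim.SGD(layer_demo.parameters(), lr=lr_demo)

# One forward and backward pass on random data to populate .grad
loss_demo = nn.MSELoss()(layer_demo(torch.randn(32, 6)), torch.randn(32, 2))
loss_demo.backward()

# Compute the expected parameters BEFORE stepping, because step() updates in place
with torch.no_grad():
  expected = [p - lr_demo * p.grad for p in layer_demo.parameters()]

opt_demo.step()

updates_match = all(
  torch.allclose(e, p) for e, p in zip(expected, layer_demo.parameters())
)
print(updates_match)
```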
