# Problem 2 - Data Parallelism in PyTorch

## 2.1

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
import time
import pickle
from torchvision.models import resnet18
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.nn.parallel import DataParallel

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

print("Number of GPUs =", torch.cuda.device_count())

Number of GPUs = 4


In [None]:
# Data preprocessing
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# CIFAR10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)

Files already downloaded and verified


In [None]:
def train(epoch, trainloader, net, criterion, optimizer):
    """Run one training epoch over trainloader (epoch is only a label and is unused)."""
    net.train()
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        # Move the batch to the target device before the forward pass
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

In [None]:
def measure_training_time(batch_size):
    try:
        trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=2)
        net = resnet18().to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

        # Warm-up epoch
        train(0, trainloader, net, criterion, optimizer)

        # Timed epoch
        start_time = time.time()
        train(1, trainloader, net, criterion, optimizer)
        end_time = time.time()

        return end_time - start_time
    except RuntimeError as e:
        # Treat CUDA out-of-memory as "batch size too large"; re-raise anything else
        if 'out of memory' in str(e):
            print(f"Out of memory for batch size: {batch_size}")
            return None
        raise

batch_size = 32
times = []
while True:
    training_time = measure_training_time(batch_size)
    if training_time is not None:
        times.append((batch_size, training_time))
        print(f"Batch Size: {batch_size}, Training Time: {training_time:.2f} seconds")
        batch_size *= 4  # Increase batch size 4-fold
    else:
        break

# Output the times for analysis
print(times)

Files already downloaded and verified
Batch Size: 32, Training Time: 17.92 seconds
Batch Size: 128, Training Time: 8.26 seconds
Batch Size: 512, Training Time: 8.07 seconds
Batch Size: 2048, Training Time: 8.38 seconds
Batch Size: 8192, Training Time: 9.47 seconds
Batch Size: 32768, Training Time: 12.77 seconds
Batch Size: 131072, Training Time: 16.32 seconds
Batch Size: 524288, Training Time: 16.27 seconds
Batch Size: 2097152, Training Time: 16.37 seconds
Batch Size: 8388608, Training Time: 16.30 seconds
Batch Size: 33554432, Training Time: 16.44 seconds
Batch Size: 134217728, Training Time: 19.67 seconds
Batch Size: 536870912, Training Time: 21.83 seconds
Batch Size: 2147483648, Training Time: 30.68 seconds
Batch Size: 8589934592, Training Time: 64.33 seconds


**Answer:**

The output above shows single-GPU training times as the batch size grows 4-fold from 32 until the GPU can no longer handle it and the run dies. From 32 to 512, training time decreases as batch size increases (17.92 s down to 8.07 s); beyond 512, training time increases with batch size. We conclude that very small batch sizes use the GPU inefficiently, while extremely large batch sizes can strain or exceed the GPU memory limit, which again slows training. Choosing a batch size therefore means balancing computational efficiency against memory constraints.
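One way to read these timings is to count optimizer steps per epoch. The sketch below assumes the standard CIFAR-10 training split of 50,000 images and the `DataLoader` default of keeping the last partial batch (`drop_last=False`); `steps_per_epoch` is a hypothetical helper, not part of the notebook.

```python
import math

CIFAR10_TRAIN_SIZE = 50_000  # assumption: standard CIFAR-10 training split


def steps_per_epoch(batch_size, dataset_size=CIFAR10_TRAIN_SIZE):
    """Number of batches a DataLoader yields per epoch when drop_last=False."""
    return math.ceil(dataset_size / batch_size)


for bs in [32, 128, 512, 2048, 8192, 32768, 131072]:
    print(f"batch_size={bs:>6}: {steps_per_epoch(bs)} steps per epoch")
```

Under these assumptions, every batch size from 131072 upward exceeds the dataset size, so each epoch becomes a single step over all 50,000 images.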

## 2.2

In [None]:
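def measure_training_time_multi_gpu(batch_size, num_gpus):
    # batch_size is per GPU: the loader batch is batch_size*num_gpus, which DataParallel splits across GPUs
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size*num_gpus, shuffle=True, num_workers=2)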
    net = resnet18().to(device)
    if num_gpus > 1:
        net = DataParallel(net, device_ids=list(range(num_gpus)))
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

    # Warm-up epoch
    train(0, trainloader, net, criterion, optimizer)

    # Timed epoch
    start_time = time.time()
    train(1, trainloader, net, criterion, optimizer)
    end_time = time.time()

    return end_time - start_time

num_available_gpus = torch.cuda.device_count()

# Measure training times for batch sizes 32, 128, and 512 on 1, 2, and 4 GPUs
batch_sizes = [32, 128, 512]
for batch_size in batch_sizes:
    for num_gpus in [1, 2, 4]:
        if num_gpus <= num_available_gpus:
            training_time = measure_training_time_multi_gpu(batch_size, num_gpus)
            print(f"Batch Size: {batch_size}, GPUs: {num_gpus}, Training Time: {training_time:.2f} seconds")

Batch Size: 32, GPUs: 1, Training Time: 16.10 seconds
Batch Size: 32, GPUs: 2, Training Time: 41.32 seconds
Batch Size: 32, GPUs: 4, Training Time: 24.85 seconds
Batch Size: 128, GPUs: 1, Training Time: 8.37 seconds
Batch Size: 128, GPUs: 2, Training Time: 10.53 seconds
Batch Size: 128, GPUs: 4, Training Time: 9.29 seconds
Batch Size: 512, GPUs: 1, Training Time: 9.19 seconds
Batch Size: 512, GPUs: 2, Training Time: 8.78 seconds
Batch Size: 512, GPUs: 4, Training Time: 8.76 seconds


From the output above and the table below, we can see that for per-GPU batch sizes of 32 and 128, training on 1 GPU is fastest, followed by 4 GPUs, with 2 GPUs slowest. For a batch size of 512, however, 4 GPUs are fastest, followed by 2 GPUs, with 1 GPU slowest.

**Answer:**

|        | Batch-size 32 per GPU |         | Batch-size 128 per GPU |         | Batch-size 512 per GPU |         |
|--------|-----------------------|---------|------------------------|---------|------------------------|---------|
|        | Time (sec)            | Speedup | Time (sec)             | Speedup | Time (sec)             | Speedup |
| 1-GPU  | 16.10                 | 1.00    | 8.37                   | 1.00    | 9.19                   | 1.00    |
| 2-GPU  | 41.32                 | 0.78    | 10.53                  | 1.59    | 8.78                   | 2.09    |
| 4-GPU  | 24.85                 | 2.59    | 9.29                   | 3.60    | 8.76                   | 4.20    |


Speedup calculation:

- For 2 GPUs: Speedup = (2 × Time for 1 GPU) / (Time for 2 GPUs)
- For 4 GPUs: Speedup = (4 × Time for 1 GPU) / (Time for 4 GPUs)
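The speedup column can be reproduced with a few lines of Python. This is a minimal sketch that applies the formula exactly as stated above to the measured epoch times; the `times` dictionary and the `speedup` helper are names introduced here for illustration.

```python
# Measured epoch times in seconds, copied from the output above:
# times[per_gpu_batch_size][num_gpus]
times = {
    32:  {1: 16.10, 2: 41.32, 4: 24.85},
    128: {1: 8.37,  2: 10.53, 4: 9.29},
    512: {1: 9.19,  2: 8.78,  4: 8.76},
}


def speedup(t1, tn, n):
    """Speedup as defined above: (n * time on 1 GPU) / (time on n GPUs)."""
    return n * t1 / tn


for bs, t in times.items():
    for n in (2, 4):
        print(f"batch {bs:>3}, {n} GPUs: speedup = {speedup(t[1], t[n], n):.2f}")
```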


Looking at the speedup, batch size 32 on 2 GPUs is actually a "slowdown" (speedup 0.78), since 2 GPUs take more time than 1 GPU does. This shows that for small batch sizes the overhead of distributing the work across multiple GPUs can outweigh the benefits. For batch size 32 on 4 GPUs, however, we get a speedup of around 2.6. For batch sizes 128 and 512 we see a significant speedup for both the 2-GPU and 4-GPU setups compared to 1 GPU.

**Comment on which type of scaling we are measuring: weak-scaling or strong-scaling?**

**Comment on whether the speedup numbers would be better or worse than what you are measuring if the other type of scaling were used.**

In Weak-Scaling, the workload per GPU remains constant while the number of GPUs increases. In Strong-Scaling, the total workload remains constant while the number of GPUs increases.

Our case fits Weak-Scaling, since the batch size per GPU is constant while we increase the number of GPUs. Strong-Scaling is not being used, because the total workload (total batch size) grows with more GPUs rather than staying constant.

As mentioned above, the results show that Weak-Scaling efficiency is not linear, especially for the smaller batch sizes (32, 128), where the overhead of parallelization across multiple GPUs leads to worse performance than a single GPU. This is most likely due to communication overhead and the inefficiency of small workloads (batch sizes) on multiple GPUs.

If we used Strong-Scaling instead of Weak-Scaling, we would keep the total batch size constant while increasing the number of GPUs. This could have led to better results for smaller batch sizes, as Strong-Scaling reduces the workload per GPU, which can make the training process more efficient for smaller batch sizes. For larger batch sizes, however, Strong-Scaling could lead to worse results, since it might under-utilize each GPU's capabilities.
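The difference between the two regimes comes down to which batch size is held fixed. The sketch below makes that concrete; `loader_batch_size` and `per_gpu_batch_size` are hypothetical helpers, and the even split per GPU is an assumption about how `DataParallel` divides each batch.

```python
def loader_batch_size(base_batch_size, num_gpus, scaling):
    """Total batch size handed to the DataLoader under each scaling regime.

    weak:   base_batch_size is per GPU, so the total grows with num_gpus.
    strong: base_batch_size is the fixed total, shared by all GPUs.
    """
    if scaling == "weak":
        return base_batch_size * num_gpus
    if scaling == "strong":
        return base_batch_size
    raise ValueError(f"unknown scaling mode: {scaling!r}")


def per_gpu_batch_size(base_batch_size, num_gpus, scaling):
    """Samples each GPU processes per step, assuming an even split."""
    return loader_batch_size(base_batch_size, num_gpus, scaling) // num_gpus


for n in (1, 2, 4):
    weak = (loader_batch_size(128, n, "weak"), per_gpu_batch_size(128, n, "weak"))
    strong = (loader_batch_size(128, n, "strong"), per_gpu_batch_size(128, n, "strong"))
    print(f"{n} GPU(s): weak total/per-GPU = {weak}, strong total/per-GPU = {strong}")
```

With a base batch size of 128, weak scaling on 4 GPUs loads 512 samples per step (128 per GPU), while strong scaling still loads 128 (32 per GPU).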

## 2.3

**Answer:**


|        | Batch-size 32 per GPU |            | Batch-size 128 per GPU |            | Batch-size 512 per GPU |            |
|--------|-----------------------|------------|------------------------|------------|------------------------|------------|
|        | Compute (sec)         | Comm (sec) | Compute (sec)          | Comm (sec) | Compute (sec)          | Comm (sec) |
| 2-GPU  | 8.05                  | 33.27      | 4.19                   | 6.34       | 4.59                   | 4.19       |
| 4-GPU  | 4.03                  | 20.82      | 2.09                   | 7.20       | 2.29                   | 6.47       |



Since we have training times for the 1-GPU, 2-GPU, and 4-GPU setups, we calculate the compute and communication times for 2 and 4 GPUs as follows:
- We approximate the **Compute Time** of a multi-GPU setup as the 1-GPU training time divided by the number of GPUs used (2 for the 2-GPU setup and 4 for the 4-GPU setup). The 1-GPU time is treated as pure compute, since a single GPU has nothing to communicate with and therefore no communication overhead.
- We calculate the **Communication Time** as the difference between the total multi-GPU training time and that compute time (1-GPU training time / N for the corresponding batch size).

This method is not perfect; it relies on the following assumptions:
- Computation time scales perfectly with the number of GPUs (often not true, due to inefficiencies and overheads).
- Communication is the only additional cost when moving from a single GPU to multiple GPUs.

**2-GPU Calculations:**
- **Batch size 32:** Compute time is the 1-GPU time for batch size 32 divided by 2: 16.10 / 2 = 8.05 sec. Communication time would be 41.32 (total time for 2-GPU) - 8.05 = 33.27 sec.
- **Batch size 128:** Compute time is 4.19 sec. Communication time would be 10.53 - 4.19 = 6.34 sec.
- **Batch size 512:** Compute time is 4.59 sec. Communication time would be 8.78 - 4.59 = 4.19 sec.


**4-GPU Calculations:**
- **Batch size 32:** Compute time is the 1-GPU time for batch size 32 divided by 4: 16.10 / 4 = 4.03 sec. Communication time would be 24.85 - 4.03 = 20.82 sec.
- **Batch size 128:** Compute time is 2.09 sec. Communication time would be 9.29 - 2.09 = 7.20 sec.
- **Batch size 512:** Compute time is 2.29 sec. Communication time would be 8.76 - 2.29 = 6.47 sec.
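The calculations above can be sketched in Python as follows. This is an illustration of the approximation described in this section, not a measurement; `split_compute_comm` and the `times` dictionary are names introduced here. Results can differ from the table in the last digit because the table rounds the compute time before subtracting.

```python
# Measured epoch times in seconds, copied from section 2.2:
# times[per_gpu_batch_size][num_gpus]
times = {
    32:  {1: 16.10, 2: 41.32, 4: 24.85},
    128: {1: 8.37,  2: 10.53, 4: 9.29},
    512: {1: 9.19,  2: 8.78,  4: 8.76},
}


def split_compute_comm(t1, tn, n):
    """Approximate compute as t1 / n and communication as the remainder of tn."""
    compute = t1 / n
    comm = tn - compute
    return compute, comm


for n in (2, 4):
    for bs, t in times.items():
        compute, comm = split_compute_comm(t[1], t[n], n)
        print(f"{n}-GPU, batch {bs:>3}: compute = {compute:.2f} s, comm = {comm:.2f} s")
```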


We can see that the compute time is smallest for both the 2-GPU and 4-GPU setups at batch size 128. Communication time decreases as batch size increases, and is therefore smallest at batch size 512.

Although this approach (using the 1-GPU training time as the benchmark for multi-GPU compute time) gives a reasonable approximation, more detailed measurements would be needed to calculate and understand the compute and communication times more accurately.

## 2.4

Formula for the Allreduce communication cost (values transferred per GPU in a ring allreduce): $2(N-1)\frac{K}{N}$

where
  - K is the number of model parameters (for resnet18 in our case)
  - N is the number of GPUs used



Source: https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

Formula for Bandwidth Utilization = Allreduce communication cost / Communication time (from 2.3)

Let's first calculate the Allreduce Cost for 2 and 4 GPUs:

K = number of parameters in resnet18 = 11,689,512
- Source: https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html

<br>

- 2 GPU Allreduce Cost = $2(2 -1) \frac{11689512}{2}$ = $11,689,512$

- 4 GPU Allreduce Cost = $2(4 -1) \frac{11689512}{4}$ = $17,534,268$

As we want our final results in GB, we will convert these to GB:
- 2 GPU Allreduce Cost = $\frac{11{,}689{,}512 \times 4\ \text{bytes}}{2^{30}} \approx 0.044\ \text{GB}$

- 4 GPU Allreduce Cost = $\frac{17{,}534{,}268 \times 4\ \text{bytes}}{2^{30}} \approx 0.065\ \text{GB}$
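The same arithmetic can be checked in Python. This sketch assumes 4-byte (float32) parameters and the $2^{30}$-byte unit used above; `allreduce_elements` and `to_gb` are hypothetical helper names.

```python
RESNET18_PARAMS = 11_689_512  # K, parameter count quoted above
BYTES_PER_PARAM = 4           # assumption: float32 gradients


def allreduce_elements(num_params, num_gpus):
    """Values transferred per GPU: 2 * (N - 1) * K / N."""
    return 2 * (num_gpus - 1) * num_params / num_gpus


def to_gb(num_elements, bytes_per_element=BYTES_PER_PARAM):
    """Convert a count of values to GB, using 2**30 bytes per GB as above."""
    return num_elements * bytes_per_element / 2**30


for n in (2, 4):
    elements = allreduce_elements(RESNET18_PARAMS, n)
    print(f"{n} GPUs: {elements:,.0f} values = {to_gb(elements):.3f} GB")
```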


**Answer:**

For bandwidth utilization, we divide the Allreduce costs above by the corresponding communication time from part 2.3 for each batch size:

<br>


|        | Batch-size-per-GPU 32        | Batch-size-per-GPU 128       | Batch-size-per-GPU 512       |
|--------|------------------------------|------------------------------|------------------------------|
|        | Bandwidth Utilization (GB/s) | Bandwidth Utilization (GB/s) | Bandwidth Utilization (GB/s) |
| 2-GPU  | 0.0013                       | 0.0069                       | 0.0105                       |
| 4-GPU  | 0.0031                       | 0.0090                       | 0.01005                      |


We can see that the bandwidth utilizations are quite small (in GB/s), but also that they grow significantly as batch size increases.
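The table values can be reproduced with the sketch below. It follows the calculation as written in this section, dividing the volume of one allreduce (the rounded 0.044 GB and 0.065 GB figures) by the per-epoch communication estimate from 2.3; the dictionary and function names are introduced here for illustration.

```python
# Rounded allreduce costs in GB from this section, keyed by number of GPUs
allreduce_gb = {2: 0.044, 4: 0.065}

# Communication-time estimates in seconds from 2.3:
# comm_time[num_gpus][per_gpu_batch_size]
comm_time = {
    2: {32: 33.27, 128: 6.34, 512: 4.19},
    4: {32: 20.82, 128: 7.20, 512: 6.47},
}


def bandwidth_gb_per_s(cost_gb, comm_seconds):
    """Bandwidth utilization = allreduce cost / communication time."""
    return cost_gb / comm_seconds


for n, per_batch in comm_time.items():
    for bs, seconds in per_batch.items():
        bw = bandwidth_gb_per_s(allreduce_gb[n], seconds)
        print(f"{n}-GPU, batch {bs:>3}: {bw:.4f} GB/s")
```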