# Getting Started with PyTorch Parallelism on H100 GPUs

Sequence:
1. Baseline code
2. Profile baseline code
3. Monitor GPU usage with wandb
4. DDP-nstep on baseline code
5. Repeat steps 2-4

Reference:
1. This walkthrough follows the PyTorch [DDP tutorial series](https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro)

## 1. Baseline code

1. Simple serial run

In [52]:
!python /workspace/basic_parallelization/singlegpu.py --total_epochs 50 --save_every 10 --batch_size 32 

[GPU0] Epoch 0 | Batchsize: 32 | Steps: 64
Epoch 0 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 1 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 2 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 3 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 4 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 5 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 6 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 7 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 8 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 9 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 10 | Batchsize: 32 | Steps: 64
Epoch 10 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 11 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 12 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 13 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 14 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 15 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 16 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 17 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 18 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 19 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 20 | Batch
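
The command-line flags and the checkpoint cadence in the log above can be sketched as follows. This is a minimal illustration, not the actual contents of `singlegpu.py`: the flag names come from the command above, while `build_parser`, `checkpoint_epochs`, the defaults, and the help strings are hypothetical. The rule `epoch % save_every == 0` is an assumption that happens to match the checkpoints logged at epochs 0 and 10.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names match the command above,
    # defaults and help strings are assumptions.
    parser = argparse.ArgumentParser(description="Single-GPU training job")
    parser.add_argument("--total_epochs", type=int, required=True,
                        help="Number of epochs to train")
    parser.add_argument("--save_every", type=int, required=True,
                        help="Checkpoint interval, in epochs")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="Batch size per device")
    return parser


def checkpoint_epochs(total_epochs: int, save_every: int) -> list[int]:
    # Assumed rule: save whenever epoch % save_every == 0.
    return [epoch for epoch in range(total_epochs) if epoch % save_every == 0]


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--total_epochs", "50", "--save_every", "10", "--batch_size", "32"]
    )
    print(checkpoint_epochs(args.total_epochs, args.save_every))
    # prints [0, 10, 20, 30, 40]
```

Under this assumed rule, the final epoch (49) is not checkpointed, so a run that needs the last weights would have to save them separately.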

2. Parallel run

In [53]:
!python /workspace/basic_parallelization/multigpu.py --total_epochs 50 --save_every 10 --batch_size 32 

[GPU0] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU1] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU6] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU7] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU2] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU4] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU3] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU5] Epoch 0 | Batchsize: 32 | Steps: 8
[GPU6] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU4] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU2] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU5] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU7] Epoch 1 | Batchsize: 32 | Steps: 8
Epoch 0 | Training checkpoint saved at checkpoint.pt
[GPU1] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU3] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU0] Epoch 1 | Batchsize: 32 | Steps: 8
[GPU5] Epoch 2 | Batchsize: 32 | Steps: 8
[GPU0] Epoch 2 | Batchsize: 32 | Steps: 8
[GPU7] Epoch 2 | Batchsize: 32 | Steps: 8
[GPU4] Epoch 2 | Batchsize: 32 | Steps: 8
[GPU6] Epoch 2 | Batchsize: 32 | Steps: 8
[GPU2] Epoch 2 | Batchsize: 32 | Steps: 8
[GPU1] Epoch 2 | Batchs
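
The drop from 64 steps per epoch in the serial run to 8 steps per GPU here follows from splitting the dataset across 8 processes while keeping the per-GPU batch size at 32. A small sketch of that arithmetic is below. The dataset size of 2048 is an assumption inferred from the serial log (64 steps at batch size 32; any size from 2017 to 2048 would give the same step count), and `steps_per_epoch` is a hypothetical helper that mirrors how a distributed sampler without `drop_last` rounds up.

```python
import math


def steps_per_epoch(dataset_size: int, batch_size: int, world_size: int = 1) -> int:
    # Each rank sees ceil(dataset_size / world_size) samples,
    # then iterates over them in batches of batch_size.
    samples_per_rank = math.ceil(dataset_size / world_size)
    return math.ceil(samples_per_rank / batch_size)


if __name__ == "__main__":
    DATASET_SIZE = 2048  # assumption inferred from the serial log
    BATCH_SIZE = 32

    print(steps_per_epoch(DATASET_SIZE, BATCH_SIZE))                # 64 (serial run)
    print(steps_per_epoch(DATASET_SIZE, BATCH_SIZE, world_size=8))  # 8 (per GPU)
    print(BATCH_SIZE * 8)                                           # 256 (global batch)
```

The global batch grows to 32 × 8 = 256 samples per optimizer step, so the parallel run is not numerically equivalent to the serial one unless the per-GPU batch size or learning rate is adjusted.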

3. Multi-GPU run with torchrun

In [None]:
!torchrun --standalone --nproc_per_node=gpu /workspace/basic_parallelization/multigpu_torchrun.py --total_epochs 50 --save_every 10 --batch_size 32
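
With `--standalone --nproc_per_node=gpu`, torchrun starts one worker process per visible GPU on a single node and passes each worker its identity through environment variables. The sketch below shows one way a script might read them. `DistEnv` and `read_dist_env` are hypothetical names, not taken from `multigpu_torchrun.py`, and the single-process fallbacks are an assumption added so the same code also runs under plain `python`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this worker
    local_rank: int  # index of this worker on its node (selects the GPU)
    world_size: int  # total number of workers


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every worker.
    # The fallbacks (0, 0, 1) are an assumption for non-torchrun launches.
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    dist_env = read_dist_env()
    # In a DDP script, local_rank would typically be passed to
    # torch.cuda.set_device(...) before calling
    # torch.distributed.init_process_group(backend="nccl").
    print(f"[GPU{dist_env.local_rank}] rank {dist_env.rank} of {dist_env.world_size}")
```

Because torchrun supplies the rank and world size, a script launched this way does not need to spawn its own processes or set the rendezvous address by hand.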