# Report

### Solution setup

- `sequential_print.py`: simple example of `torch.distributed` use.
- `syncbn.py`: a synchronized BatchNorm layer with custom forward and backward passes.
- `test_syncbn.py`: comparison of the custom implementation with `nn.SyncBatchNorm`.
- `train.py`: net architecture, synchronized training loop.
- `metric_accumulation.py`: accuracy metric accumulated over the dataset with `torch.distributed.scatter`.
- `bn_benchmark.py`, `training_benchmark.py`: benchmarks checking the implemented layers and the full training pipeline, respectively.
- `utils.py`: supporting functions.

The multi-GPU experiments were run on Kaggle's T4 x2 setup (two T4 GPUs).

In [1]:
!nvidia-smi topo -m

	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	0-3	0		N/A
GPU1	PHB	 X 	0-3	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks


### BatchNorm

First, we compare the custom and library implementations of `SyncBatchNorm` on the time and memory cost of a forward + backward pass.
Our implementation is marginally faster on average (about 5.81 s versus 5.83 s over the configurations below, a gap within the spread of individual runs), but needs roughly 26% more memory because its forward/backward implementation keeps additional variables for better code readability. Memory also grows approximately linearly with batch size, while time barely changes.
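
As a rough illustration of what a synchronized BatchNorm forward pass has to compute, the sketch below simulates two workers in plain Python. It does not use `torch.distributed`: the summation over the list of workers stands in for an `all_reduce`, and the function names are hypothetical rather than taken from `syncbn.py`. Each worker contributes its local sum, sum of squares and element count, and the global mean and biased variance are derived from the reduced totals.

```python
from statistics import fmean, pvariance


def local_stats(values):
    """Per-worker partial statistics: (sum, sum of squares, count)."""
    return sum(values), sum(v * v for v in values), len(values)


def reduce_sum(per_worker_stats):
    """Stand-in for an all_reduce(SUM): element-wise sum across workers."""
    return tuple(sum(column) for column in zip(*per_worker_stats))


def global_mean_var(per_worker_batches):
    """Mean and biased variance over the union of all workers' batches."""
    total, total_sq, count = reduce_sum(
        [local_stats(batch) for batch in per_worker_batches]
    )
    mean = total / count
    var = total_sq / count - mean * mean  # E[x^2] - (E[x])^2
    return mean, var


# Two simulated workers holding different slices of one feature column.
batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
mean, var = global_mean_var(batches)

# Reference: statistics of the concatenated batch, as on a single device.
flat = [v for batch in batches for v in batch]
print(mean, fmean(flat))
print(var, pvariance(flat))
```

Both printed pairs agree, which is the property a synchronized layer relies on: reducing three small quantities per feature is enough to recover the statistics of the full batch.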

In [23]:
!python bn_benchmark.py --size=2 --norm_type=custom

Started measuring for BatchNorm type = custom
| Hidden size | Batch size  | Time (s)    | Memory (Mb)
| 128         | 32          | 5.738396    | 0.169922    
| 128         | 64          | 5.782004    | 0.326172    
| 256         | 32          | 5.651294    | 0.334961    
| 256         | 64          | 5.856960    | 0.647461    
| 512         | 32          | 6.138276    | 0.665039    
| 512         | 64          | 5.958334    | 1.290039    
| 1024        | 32          | 5.815169    | 1.325195    
| 1024        | 64          | 5.535275    | 2.575195    


In [24]:
!python bn_benchmark.py --size=2 --norm_type=lib

Started measuring for BatchNorm type = lib
| Hidden size | Batch size  | Time (s)    | Memory (Mb)
| 128         | 32          | 6.340657    | 0.134766    
| 128         | 64          | 5.681419    | 0.259766    
| 256         | 32          | 5.724207    | 0.265625    
| 256         | 64          | 5.523697    | 0.515625    
| 512         | 32          | 5.881188    | 0.527344    
| 512         | 64          | 6.135359    | 1.027344    
| 1024        | 32          | 5.431167    | 1.050781    
| 1024        | 64          | 5.933679    | 2.050781    
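
To make the comparison easier to read, the following sketch recomputes a few summary figures from the two tables above. The numbers are copied verbatim from the benchmark output; the variable and function names are ours and are not part of `bn_benchmark.py`.

```python
# (hidden size, batch size) -> (time in s, memory in Mb), copied from the output above.
custom = {
    (128, 32): (5.738396, 0.169922), (128, 64): (5.782004, 0.326172),
    (256, 32): (5.651294, 0.334961), (256, 64): (5.856960, 0.647461),
    (512, 32): (6.138276, 0.665039), (512, 64): (5.958334, 1.290039),
    (1024, 32): (5.815169, 1.325195), (1024, 64): (5.535275, 2.575195),
}
lib = {
    (128, 32): (6.340657, 0.134766), (128, 64): (5.681419, 0.259766),
    (256, 32): (5.724207, 0.265625), (256, 64): (5.523697, 0.515625),
    (512, 32): (5.881188, 0.527344), (512, 64): (6.135359, 1.027344),
    (1024, 32): (5.431167, 1.050781), (1024, 64): (5.933679, 2.050781),
}


def mean_time(results):
    """Average wall-clock time over all measured configurations."""
    return sum(t for t, _ in results.values()) / len(results)


# Memory overhead of the custom layer relative to the library one.
mem_ratio = {key: custom[key][1] / lib[key][1] for key in custom}

# Memory growth of the custom layer when the batch size doubles from 32 to 64.
batch_growth = {
    hidden: custom[(hidden, 64)][1] / custom[(hidden, 32)][1]
    for hidden in (128, 256, 512, 1024)
}

print(f"mean time: custom {mean_time(custom):.3f} s, lib {mean_time(lib):.3f} s")
print(f"memory custom/lib: {min(mem_ratio.values()):.3f} to {max(mem_ratio.values()):.3f}")
print(f"memory x when batch doubles: {min(batch_growth.values()):.2f} to {max(batch_growth.values()):.2f}")
```

The custom layer uses about 1.26 times the memory of the library layer in every configuration, memory grows by a factor of roughly 1.92 to 1.94 when the batch size doubles, and the mean times differ by only about 0.02 s.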


### Time/Memory

Second, we check the behaviour during the first training epoch, using gradient accumulation over 2 batches (`--grad_accum=2`). Time and memory costs are similar for the custom and library options.

In [43]:
!python training_benchmark.py --norm_type=custom --grad_accum=2 --size=2 --n_epoch=1

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Loss: 4.381737594409367, train_acc: 0.05370044757795456
Training for BNorm type = custom; Grad Acc = 2; Num epoch = 1
Skipping val metrics since validation epoch was disabled...

Train acc: 0.053.
Time: 16.0187 s.
Memory: 82.7720 Mb.



In [44]:
!python training_benchmark.py --norm_type=lib --grad_accum=2 --size=2 --n_epoch=1

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Loss: 4.382226187859655, train_acc: 0.05274136829406709
Training for BNorm type = lib; Grad Acc = 2; Num epoch = 1
Skipping val metrics since validation epoch was disabled...

Train acc: 0.053.
Time: 16.1724 s.
Memory: 82.7720 Mb.



#### Removing grad_accum

We also removed gradient accumulation to check its influence and observed a longer run time (16.73 s versus 16.02 s). Since the model itself is slow and `optimizer.step()` does not cost a lot on the GPU, we suppose that the difference is caused by `all_reduce` being called less often when `grad_accum = 2`.
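
The hypothesis above can be illustrated with a small simulation in plain Python. It does not use `torch.distributed`: `FakeComm.all_reduce_mean` is a hypothetical stand-in that averages one value per simulated rank and counts its invocations. The sketch assumes gradients are synchronized only when the optimizer steps, which may not match how `train.py` actually schedules communication.

```python
class FakeComm:
    """Simulated communicator that counts synchronization calls."""

    def __init__(self):
        self.calls = 0

    def all_reduce_mean(self, per_rank_values):
        """Average one value per rank and record the call."""
        self.calls += 1
        return sum(per_rank_values) / len(per_rank_values)


def run_epoch(per_rank_grads, grad_accum, comm):
    """Return the synchronized gradient used at each optimizer step.

    per_rank_grads[r][i] is the (scalar) gradient of micro-batch i on rank r.
    """
    n_ranks = len(per_rank_grads)
    n_batches = len(per_rank_grads[0])
    accumulated = [0.0] * n_ranks
    steps = []
    for i in range(n_batches):
        for rank, grads in enumerate(per_rank_grads):
            accumulated[rank] += grads[i] / grad_accum  # local accumulation only
        if (i + 1) % grad_accum == 0:
            steps.append(comm.all_reduce_mean(accumulated))  # one sync per step
            accumulated = [0.0] * n_ranks
    return steps


# 2 simulated ranks, 4 micro-batches each (made-up scalar gradients).
grads = [[1.0, 3.0, 2.0, 6.0], [5.0, 7.0, 4.0, 8.0]]

comm_accum, comm_plain = FakeComm(), FakeComm()
steps_accum = run_epoch(grads, grad_accum=2, comm=comm_accum)
steps_plain = run_epoch(grads, grad_accum=1, comm=comm_plain)

print(comm_accum.calls, comm_plain.calls)
print(steps_accum, steps_plain)
```

Under this assumption the run with `grad_accum=2` performs half as many synchronizations, and each synchronized gradient equals the mean of the two corresponding per-batch gradients from the run without accumulation.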

In [45]:
!python training_benchmark.py --norm_type=custom --size=2 --n_epoch=1

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Loss: 4.240081084658728, train_acc: 0.07651054988736691
Training for BNorm type = custom; Grad Acc = 1; Num epoch = 1
Skipping val metrics since validation epoch was disabled...

Train acc: 0.075.
Time: 16.7334 s.
Memory: 82.7720 Mb.



### Accuracy

Finally, we compare model performance on the validation set after several training epochs. Here the accuracy is accumulated with `dist.scatter`. Time, memory and final score are approximately equal for both `SyncBatchNorm` implementations.
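
When a metric such as accuracy is combined across ranks, it is usually safer to combine the raw counts (correct predictions and samples seen) than to average per-rank accuracies, since the latter is biased when shards differ in size. The sketch below shows the difference in plain Python; it is not the `dist.scatter`-based code from `metric_accumulation.py`, and the function name and the counts are hypothetical.

```python
def global_accuracy(per_rank_counts):
    """Combine (n_correct, n_total) pairs from every rank into one accuracy."""
    correct = sum(c for c, _ in per_rank_counts)
    total = sum(t for _, t in per_rank_counts)
    return correct / total


# Hypothetical shards of unequal size: rank 0 saw 100 samples, rank 1 saw 50.
counts = [(40, 100), (10, 50)]

acc = global_accuracy(counts)  # 50 correct out of 150 samples
naive = sum(c / t for c, t in counts) / len(counts)  # mean of 0.4 and 0.2

print(acc)
print(naive)
```

With equally sized shards the two values coincide, so the distinction matters mainly when the last batch or the dataset split is uneven.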

In [14]:
!python training_benchmark.py --norm_type=custom --run_val=True --size=2 --n_epoch=10

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Rank 0, train_acc: 0.03818, val_acc: 0.1604
Rank 0, train_acc: 0.06962, val_acc: 0.2072
Rank 0, train_acc: 0.08758, val_acc: 0.2437
Rank 0, train_acc: 0.10368, val_acc: 0.2695
Rank 0, train_acc: 0.11568, val_acc: 0.288
Rank 0, train_acc: 0.1266, val_acc: 0.2974
Rank 0, train_acc: 0.13712, val_acc: 0.3175
Rank 0, train_acc: 0.14482, val_acc: 0.3225
Rank 0, train_acc: 0.1526, val_acc: 0.34
Rank 0, train_acc: 0.15996, val_acc: 0.3475
Training for BNorm type = custom; Grad Acc = 1; Num epoch = 10
Train acc: 0.161.
Val acc: 0.347.
Time: 109.5906 s.
Memory: 82.7720 Mb.



In [16]:
!python training_benchmark.py --norm_type=lib --run_val=True --size=2 --n_epoch=10

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Rank 0, train_acc: 0.03844, val_acc: 0.1605
Rank 0, train_acc: 0.06938, val_acc: 0.206
Rank 0, train_acc: 0.0873, val_acc: 0.2441
Rank 0, train_acc: 0.10292, val_acc: 0.2653
Rank 0, train_acc: 0.11592, val_acc: 0.2874
Rank 0, train_acc: 0.12546, val_acc: 0.2981
Rank 0, train_acc: 0.13764, val_acc: 0.3167
Rank 0, train_acc: 0.14426, val_acc: 0.3262
Rank 0, train_acc: 0.1524, val_acc: 0.3399
Rank 0, train_acc: 0.15886, val_acc: 0.3481
Training for BNorm type = lib; Grad Acc = 1; Num epoch = 10
Train acc: 0.160.
Val acc: 0.348.
Time: 112.6651 s.
Memory: 82.7720 Mb.

