# ResNet152 PyTorch Multi-Card Training on Gaudi

In this notebook, we demonstrate how to train the ResNet152 image classifier with PyTorch on 8 HPUs.

### Habana Mixed Precision Usage and why it’s important
The Habana Mixed Precision (HMP) package lets you run mixed precision training on HPU without extensive modifications to existing FP32 model scripts. Running suitable ops in BF16 instead of FP32 halves the memory needed per value and typically raises training throughput. To add mixed precision support, place the following lines anywhere in the model script before the start of the training loop:

>`from habana_frameworks.torch.hpex import hmp`<br>
>`hmp.convert()`
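
To make the motivation concrete: BF16 stores each value in 2 bytes instead of FP32's 4 while keeping the same 8-bit exponent, so the numeric range is preserved and only mantissa precision drops. The standard-library sketch below emulates this by truncating a float32 to its upper 16 bits. It is illustrative only and not how HMP or the HPU performs the conversion (hardware typically rounds rather than truncates); `to_bf16_truncated` is a hypothetical helper.

```python
import struct

def to_bf16_truncated(x: float) -> float:
    """Emulate BF16 by keeping only the upper 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bits &= 0xFFFF0000  # drop the 16 low-order mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Powers of two survive exactly; other values lose low-order mantissa bits.
exact = to_bf16_truncated(1.0)
approx = to_bf16_truncated(3.14159265)
# Large FP32 magnitudes stay finite because the exponent width is unchanged.
large = to_bf16_truncated(1e38)
print(exact, approx, large)
```

The small loss of precision in `approx` is the trade-off accepted for the ops that run in BF16, while precision-sensitive ops can stay in FP32.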

## Distributed Training

We will use the command line from the Model-References repository to demonstrate distributed training on 8 HPUs.
Distributed training differs from single-card training in the following ways:

1. [Initialization with HCCL](https://github.com/HabanaAI/Model-References/blob/1.6.0/PyTorch/computer_vision/classification/torchvision/utils.py#L249) through the torch.distributed package, using DDP (Distributed Data Parallel)

2. [Use the torch distributed data sampler](https://github.com/HabanaAI/Model-References/blob/1.6.0/PyTorch/computer_vision/classification/torchvision/train.py#L179)

3. [Distributed data parallel PyTorch model initialization](https://github.com/HabanaAI/Model-References/blob/1.6.0/PyTorch/computer_vision/classification/torchvision/train.py#L328)


### Initialization with HCCL

>`import torch.distributed as dist`<br>
>`from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu`<br>
>
>`args.world_size, args.rank, args.local_rank = initialize_distributed_hpu()`<br>
>
>`if args.device == 'hpu':`<br>
>&emsp;`args.dist_backend = 'hccl'`<br>
>&emsp;`dist.init_process_group(args.dist_backend, rank=args.rank, world_size=args.world_size)`
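
For intuition about where the three values come from: when processes are launched with Open MPI's `mpirun`, each one receives `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_RANK`, and `OMPI_COMM_WORLD_LOCAL_RANK` in its environment. The sketch below reads them with the standard library. It is not the implementation of `initialize_distributed_hpu`, only an assumption about the kind of information the launcher supplies; `read_mpi_rank_info` is a hypothetical helper.

```python
import os

def read_mpi_rank_info(env=None):
    """Return (world_size, rank, local_rank) from Open MPI variables, or (1, 0, 0) if absent."""
    env = os.environ if env is None else env
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    return world_size, rank, local_rank

# Simulated environment for the last of 8 processes on a single node.
simulated = {
    "OMPI_COMM_WORLD_SIZE": "8",
    "OMPI_COMM_WORLD_RANK": "7",
    "OMPI_COMM_WORLD_LOCAL_RANK": "7",
}
print(read_mpi_rank_info(simulated))
```

On a single node, `rank` and `local_rank` coincide; they differ only when the job spans several machines.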

### Torch Distributed Data Sampler

>`if distributed:`<br>
&emsp;`train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)`
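
The sampler's job is to give each process a disjoint, equally sized slice of the dataset every epoch. The following standard-library sketch mirrors the documented behaviour of `DistributedSampler` (shuffle with a shared seed, pad by wrapping around, then stride by rank). It uses Python's `random` module rather than a torch generator, so the actual permutation will differ from PyTorch's; `shard_indices` is a hypothetical name.

```python
import math
import random

def shard_indices(dataset_len, num_replicas, rank, epoch=0, shuffle=True, seed=0):
    """Return the sample indices one rank would draw for one epoch."""
    indices = list(range(dataset_len))
    if shuffle:
        # Same seed on every rank, so all ranks agree on the permutation.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by wrapping around so the total divides evenly across ranks.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices += indices[: total_size - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]

# 10 samples over 8 ranks: every rank gets 2 indices, 6 of them are padding repeats.
shards = [shard_indices(10, num_replicas=8, rank=r) for r in range(8)]
print(shards)
```

With the real sampler, calling `train_sampler.set_epoch(epoch)` at the start of each epoch changes the shuffle seed; without it, every epoch reuses the same ordering.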

### Distributed Data Parallel PyTorch Model Initialization

>`model = torch.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False, gradient_as_bucket_view=True)`
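
Conceptually, the DDP wrapper averages gradients across all processes during the backward pass, so every replica applies the same update. In the call above, `broadcast_buffers=False` turns off the per-iteration synchronization of module buffers (such as BatchNorm running statistics), and `gradient_as_bucket_view=True` makes gradients views into the communication buckets to save memory. The pure-Python sketch below simulates only the averaging step on hypothetical gradient values; it involves no real communication, and `all_reduce_mean` is an illustrative name.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks; every rank receives the same result."""
    world_size = len(per_rank_grads)
    averaged = [sum(values) / world_size for values in zip(*per_rank_grads)]
    return [list(averaged) for _ in range(world_size)]

# Hypothetical gradients for a two-parameter model on 4 ranks.
grads = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0], [7.0, 2.0]]
synced = all_reduce_mean(grads)
print(synced[0])
```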

#### Set the environment variables to begin the run

In [4]:
%set_env PYTHONPATH=/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision:/root/examples/models:/usr/lib/habanalabs/:/root

env: PYTHONPATH=/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision:/root/examples/models:/usr/lib/habanalabs/:/root


In [5]:
%cd /home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision

/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision


#### Run the following bash command as a shell script (`demo_resnet.sh`) in the final cell to start multi-HPU training.

```bash
  export MASTER_ADDR=localhost
  export MASTER_PORT=12355
  /opt/amazon/openmpi/bin/mpirun -n 8 --bind-to core --map-by slot:PE=6 --rank-by core --report-bindings --allow-run-as-root \
    python3 train.py --model=resnet152 --device=hpu --batch-size=256 --epochs=90 --workers=10 \
    --dl-worker-type=MP --print-freq=10 --output-dir=. --seed=123 --hmp --hmp-bf16 ./ops_bf16_Resnet.txt \
    --hmp-fp32 ./ops_fp32_Resnet.txt --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 \
    --custom-lr-milestones 1 2 3 4 30 60 80 --deterministic --dl-time-exclude=False
```
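
The `--custom-lr-values` and `--custom-lr-milestones` flags in the command above pair up one-to-one (seven values, seven milestones). One plausible reading, sketched here with the standard library only, is a piecewise-constant schedule in which each value takes effect at its milestone epoch and holds until the next one: a short ramp-up over epochs 1 to 4, then tenfold decays at epochs 30, 60, and 80. This is an assumption about how `train.py` interprets the flags, not a copy of its code; `lr_for_epoch` is a hypothetical helper, and `BASE_LR = 0.1` is an assumed learning rate for the epochs before the first milestone.

```python
def lr_for_epoch(epoch, values, milestones, base_lr):
    """Piecewise-constant schedule: values[i] applies from milestones[i] until the next milestone."""
    lr = base_lr
    for value, milestone in zip(values, milestones):
        if epoch >= milestone:
            lr = value
    return lr

values = [0.275, 0.45, 0.625, 0.8, 0.08, 0.008, 0.0008]
milestones = [1, 2, 3, 4, 30, 60, 80]
BASE_LR = 0.1  # assumption: learning rate used before the first milestone

schedule = [lr_for_epoch(e, values, milestones, BASE_LR) for e in range(90)]
for epoch in (0, 1, 4, 29, 30, 60, 80, 89):
    print(epoch, schedule[epoch])
```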

In [6]:
!sh /home/ubuntu/DL1-Workshop/PyTorch-ResNet152/demo_resnet.sh

[ip-172-31-18-215:662121] MCW rank 7 bound to socket 1[core 42[hwt 0-1]], socket 1[core 43[hwt 0-1]], socket 1[core 44[hwt 0-1]], socket 1[core 45[hwt 0-1]], socket 1[core 46[hwt 0-1]], socket 1[core 47[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB]
[ip-172-31-18-215:662121] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../../../../../../../..]
[ip-172-31-18-215:662121] MCW rank 1 bound to socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]: [../../../../../../BB/BB/BB/BB/BB/BB/../../../../../../../../../../../..][
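
The binding report above is consistent with `--map-by slot:PE=6`: each of the 8 ranks is pinned to 6 consecutive physical cores (rank 0 to cores 0-5, rank 1 to cores 6-11, rank 7 to cores 42-47). The short sketch below reproduces that mapping, assuming ranks are packed onto consecutive cores in rank order as the report shows; `cores_for_rank` is a hypothetical helper and does not query the system.

```python
def cores_for_rank(rank, cores_per_rank=6):
    """Physical core IDs for one rank when ranks are packed in order, cores_per_rank at a time."""
    start = rank * cores_per_rank
    return list(range(start, start + cores_per_rank))

for rank in (0, 1, 7):
    print(rank, cores_for_rank(rank))
```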

# Summary

In this workshop, we did the following:
- Learned about HMP usage and why it is important.
- Set up DistributedDataParallel in the model and trained on 8 HPUs.



