Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.  
Copyright (c) 2017, Pytorch contributors All rights reserved.
## BSD 3-Clause License
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# ResNet50 for PyTorch with GPU Migration

This notebook demonstrates training a ResNet50 model with PyTorch on 8 HPUs. The model has been enabled using an experimental feature called GPU migration.
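As a rough illustration of what "enabled using GPU migration" means in practice, the sketch below assumes the toolkit is activated by importing `habana_frameworks.torch.gpu_migration`, as the Gaudi documentation describes. The helper name `enable_gpu_migration` is hypothetical and is not part of the Model-References scripts. The import is guarded so the snippet also runs on a machine without the Habana software stack.

```python
def enable_gpu_migration() -> bool:
    """Try to activate GPU migration; return True if the Habana package imports."""
    try:
        # Assumption: importing this module is what turns on the mapping of
        # torch.cuda-style calls to their HPU equivalents.
        import habana_frameworks.torch.gpu_migration  # noqa: F401
    except ImportError:
        # The Habana software stack is not installed on this machine.
        return False
    return True


if __name__ == "__main__":
    if enable_gpu_migration():
        print("GPU migration module imported")
    else:
        print("habana_frameworks not found; GPU migration is unavailable")
```

This is why the training command later in the notebook can keep `--device=cuda` unchanged even though it runs on HPUs.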

#### Clone the Model-References repository

In [None]:
!git clone https://github.com/habanaai/Model-References

#### Set the ENV variables

In [1]:
%set_env PYTHONPATH=/root/Gaudi2-Workshop/Model-Migration/Model-References:/usr/lib/habanalabs/:/root
%set_env PYTHON=/usr/bin/python3.8

env: PYTHONPATH=/root/tf/Model-References/PyTorch/examples/gpu_migration/computer_vision/classification/torchvision:/usr/lib/habanalabs/:/root


#### Navigate to the model to begin the run

In [2]:
%cd /root/Gaudi2-Workshop/Model-Migration/Model-References/PyTorch/examples/gpu_migration/computer_vision/classification/torchvision

/root/tf/Model-References/PyTorch/examples/gpu_migration/computer_vision/classification/torchvision


#### Download dataset
The ImageNet 2012 dataset must be organized according to PyTorch requirements, as specified in the scripts of [imagenet-multiGPU.torch](https://github.com/soumith/imagenet-multiGPU.torch).
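A quick sanity check of the dataset layout can save a failed run. The sketch below assumes the common layout in which the data path contains `train` and `val` directories, each holding one subdirectory per class. The helper `count_class_dirs` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path


def count_class_dirs(data_path: str) -> dict:
    """Return {split: number of class subdirectories} for 'train' and 'val'."""
    counts = {}
    for split in ("train", "val"):
        split_dir = Path(data_path) / split
        if not split_dir.is_dir():
            # A missing split is reported as zero classes.
            counts[split] = 0
            continue
        # Each class is expected to be a directory of images.
        counts[split] = sum(1 for p in split_dir.iterdir() if p.is_dir())
    return counts


if __name__ == "__main__":
    # Path taken from the training command used in this notebook.
    print(count_class_dirs("/root/software/data/pytorch/imagenet/ILSVRC2012"))
```

ImageNet 2012 has 1000 classes, so under this assumed layout both counts should be 1000. A `val` count of 0 usually means the validation images have not yet been sorted into class folders.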

#### Run the following command to start multi-HPU training.

```bash
GPU_MIGRATION_LOG_LEVEL=1 torchrun --nproc_per_node 8 train.py --batch-size=256 --model=resnet50 --device=cuda --data-path=/root/software/data/pytorch/imagenet/ILSVRC2012 --workers=8 --epochs=1 --opt=sgd --amp
```

In [3]:
!GPU_MIGRATION_LOG_LEVEL=1 torchrun --nproc_per_node 8 train.py --batch-size=256 --model=resnet50 --device=cuda --data-path=/root/software/data/pytorch/imagenet/ILSVRC2012 --workers=8 --epochs=1 --opt=sgd --amp

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
gpu migration log will be saved to /var/log/habana_logs/gpu_migration_logs/2023-06-07/16-25-46/gpu_migration_9020.log
gpu migration log will be saved to /var/log/habana_logs/gpu_migration_logs/2023-06-07/16-25-46/gpu_migration_9021.log
gpu migration log will be saved to /var/log/habana_logs/gpu_migration_logs/2023-06-07/16-25-46/gpu_migration_9026.log
gpu migration log will be saved to /var/log/habana_logs/gpu_migration_logs/2023-06-07/16-25-46/gpu_migration_9027.log
gpu migration log will be saved to /var/log/habana_logs/gpu_migration_logs/2023-06-07/16-25-46/gpu_migration_9025.log
gpu migration log will be saved to /var/log/habana_logs/gpu_migration_logs/2023-06-07/16-25-46/gpu_migration_9024.lo

Took 4.0025529861450195
Loading validation data
Creating data loaders
MediaDataloader 1/8 seed : 760445125
 PT_HPU_LAZY_MODE = 1
 PT_HPU_LAZY_EAGER_OPTIM_CACHE = 1
 PT_HPU_ENABLE_COMPILE_THREAD = 0
 PT_HPU_ENABLE_EXECUTION_THREAD = 1
 PT_HPU_ENABLE_LAZY_EAGER_EXECUTION_THREAD = 1
 PT_ENABLE_INTER_HOST_CACHING = 0
 PT_ENABLE_INFERENCE_MODE = 1
 PT_ENABLE_HABANA_CACHING = 1
 PT_HPU_MAX_RECIPE_SUBMISSION_LIMIT = 0
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_MAX_COMPOUND_OP_SIZE_SS = 10
 PT_HPU_ENABLE_STAGE_SUBMISSION = 1
 PT_HPU_STAGE_SUBMISSION_MODE = 2
 PT_HPU_PGM_ENABLE_CACHE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
 PT_HCCL_SLICE_SIZE_MB = 16
 PT_HCCL_MEMORY_ALLOWANCE_MB = 0
 PT_HPU_INITIAL_WORKSPACE_SIZE = 0
 PT_HABANA_POOL_SIZE = 24
 PT_HPU_POOL_STRATEGY = 5
 PT_HPU_POOL_LOG_FRAGMENTATION_INFO = 0
 PT_ENABLE_MEMORY_DEFRAGMENTATION = 1
 PT_ENABLE_DEFRAGMENTATION_INFO = 0
 PT_HPU_ENABLE_SYNAPSE_OUTPUT_PERMUTE = 1
 PT_HPU_ENABLE_VALID_DATA_RANGE_CHECK = 1
 PT_HPU_FORCE_US

Epoch: [0]  [ 10/622]  eta: 0:30:27  lr: 0.1  img/s: 123.30808523404552  loss: 7.1341 (7.1421)  acc1: 0.0000 (0.1953)  acc5: 1.1719 (1.1719)  time: 2.9867  data: 0.1094  max mem: 16966
Epoch: [0]  [ 20/622]  eta: 0:15:55  lr: 0.1  img/s: 5146.506143980567  loss: 7.1502 (7.1801)  acc1: 0.0000 (0.1302)  acc5: 1.1719 (0.9115)  time: 1.0613  data: 0.0073  max mem: 16966
Epoch: [0]  [ 30/622]  eta: 0:10:45  lr: 0.1  img/s: 5240.960702867602  loss: 7.1341 (7.1405)  acc1: 0.0000 (0.1953)  acc5: 1.1719 (1.3672)  time: 0.0465  data: 0.0027  max mem: 16966
Epoch: [0]  [ 40/622]  eta: 0:08:06  lr: 0.1  img/s: 4989.253908607789  loss: 7.1410 (7.1406)  acc1: 0.3906 (0.2344)  acc5: 1.1719 (1.1719)  time: 0.0472  data: 0.0014  max mem: 16966
Epoch: [0]  [ 50/622]  eta: 0:06:29  lr: 0.1  img/s: 4990.452762419641  loss: 7.1341 (7.1110)  acc1: 0.0000 (0.1953)  acc5: 0.3906 (1.0417)  time: 0.0485  data: 0.0036  max mem: 16966
Epoch: [0]  [ 60/622]  eta: 0:05:25  lr: 0.1  img/s: 4179.821571352384  loss: 7

Epoch: [0]  [460/622]  eta: 0:00:19  lr: 0.1  img/s: 5259.098979078055  loss: 5.6689 (6.2195)  acc1: 3.9062 (2.3604)  acc5: 14.0625 (7.2889)  time: 0.0461  data: 0.0023  max mem: 16966
Epoch: [0]  [470/622]  eta: 0:00:17  lr: 0.1  img/s: 5229.459083523925  loss: 5.6102 (6.2022)  acc1: 4.6875 (2.4984)  acc5: 14.4531 (7.5033)  time: 0.0460  data: 0.0009  max mem: 16966
Epoch: [0]  [480/622]  eta: 0:00:16  lr: 0.1  img/s: 5238.089501154462  loss: 5.6028 (6.1853)  acc1: 4.6875 (2.5351)  acc5: 14.4531 (7.6690)  time: 0.0461  data: 0.0020  max mem: 16966
Epoch: [0]  [490/622]  eta: 0:00:14  lr: 0.1  img/s: 5266.86894213488  loss: 5.5439 (6.1684)  acc1: 4.6875 (2.6172)  acc5: 14.8438 (7.9141)  time: 0.0459  data: 0.0022  max mem: 16966
Epoch: [0]  [500/622]  eta: 0:00:13  lr: 0.1  img/s: 5258.596733694798  loss: 5.5297 (6.1512)  acc1: 4.6875 (2.7037)  acc5: 14.8438 (8.1265)  time: 0.0458  data: 0.0015  max mem: 16966
Epoch: [0]  [510/622]  eta: 0:00:12  lr: 0.1  img/s: 5264.389968386454  loss

Training time 0:01:15
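
The training log above reports an `img/s` value every 10 iterations. The following sketch shows one way to extract those values and average them, assuming the output has been captured as text. The function names are hypothetical. The first reported value is far lower than the rest (about 123 versus roughly 5,000), presumably because of warm-up, so the helper skips it by default.

```python
import re
from statistics import mean

# Matches the throughput field of lines such as
# "Epoch: [0]  [ 20/622]  ...  img/s: 5146.506143980567  loss: ..."
IMG_PER_SEC = re.compile(r"img/s:\s*([0-9]+(?:\.[0-9]+)?)")


def throughput_values(log_text: str) -> list:
    """Return every img/s value found in the log text, in order."""
    return [float(m.group(1)) for m in IMG_PER_SEC.finditer(log_text)]


def steady_state_mean(values, skip=1):
    """Average the values after dropping the first `skip` entries."""
    steady = values[skip:]
    return mean(steady) if steady else None


if __name__ == "__main__":
    sample = (
        "Epoch: [0]  [ 10/622]  eta: 0:30:27  lr: 0.1  img/s: 123.30808523404552  loss: 7.1341\n"
        "Epoch: [0]  [ 20/622]  eta: 0:15:55  lr: 0.1  img/s: 5146.506143980567  loss: 7.1502\n"
        "Epoch: [0]  [ 30/622]  eta: 0:10:45  lr: 0.1  img/s: 5240.960702867602  loss: 7.1341\n"
    )
    values = throughput_values(sample)
    print(len(values), steady_state_mean(values))
```

Whether the logged `img/s` is per process or aggregated over all 8 HPUs is not stated in the output, so treat the average as a relative indicator rather than a total throughput figure.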
