# ResNet152 PyTorch Training on Gaudi

This notebook demonstrates how to train the ResNet152 image classifier using PyTorch, first on a single HPU and then on 8 HPUs.

## Install pre-requisites

In [None]:
!pip install matplotlib

## Download and prepare the CIFAR10 dataset


The CIFAR10 dataset contains 60,000 color images in 10 classes, with 6,000 images in each class. The dataset is divided into 50,000 training images and 10,000 testing images. The classes are mutually exclusive and there is no overlap between them.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 8

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# functions to show an image
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)  # the built-in next() works across PyTorch versions

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join(f'{classes[labels[j]]:5s}' for j in range(batch_size)))
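
The `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform and the `img / 2 + 0.5` line in `imshow` are inverses of each other. The plain-Python sketch below mirrors that arithmetic for a single pixel value; the helper names `normalize` and `unnormalize` are illustrative and not part of torchvision.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel arithmetic of transforms.Normalize: (x - mean) / std."""
    return (pixel - mean) / std


def unnormalize(value):
    """Inverse step used by imshow above: x / 2 + 0.5."""
    return value / 2 + 0.5


# ToTensor scales pixels to [0, 1]; Normalize then maps that range to [-1, 1].
low, high = normalize(0.0), normalize(1.0)
roundtrip = unnormalize(normalize(0.25))
print(low, high, roundtrip)
```

This is why `imshow` must undo the normalization before plotting: matplotlib expects float image values in the [0, 1] range.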

## Setup for running the training

Set the `PYTHONPATH` environment variable and change into the directory that contains the training scripts.

In [None]:
%set_env PYTHONPATH=/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision:/root/examples/models:/usr/lib/habanalabs/:/root

In [None]:
%cd /home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision

### Steps required for migrating the model to Habana HPU
Here are the lines of code we need to target the Habana device:

1. Import the Habana Torch library:
```
import habana_frameworks.torch.core as htcore
```

2. Target the Gaudi HPU device:
```
device = torch.device('hpu')
```

3. Add mark_step():
```
htcore.mark_step()
```
In Lazy mode, add `mark_step()` to every training script right after `loss.backward()` and again right after `optimizer.step()`. The Habana bridge does not execute ops immediately; it accumulates them in a graph, and the accumulated graph runs only when `mark_step()` is called or when the user requires a tensor value. This lets the bridge build a SynapseAI graph containing multiple ops, which gives the graph compiler the opportunity to optimize device execution for those ops.
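
To make the accumulate-then-execute idea concrete, here is a toy model in plain Python. It is only an illustration of the concept: the `LazyQueue` class and its methods are hypothetical and do not reflect how the Habana bridge is actually implemented.

```python
class LazyQueue:
    """Toy stand-in for a lazy executor: ops are recorded, not run."""

    def __init__(self):
        self.pending = []    # ops accumulated since the last mark_step()
        self.executed = []   # one tuple per graph that was "launched"

    def record(self, op_name):
        # Recording is cheap; nothing runs on the device yet.
        self.pending.append(op_name)

    def mark_step(self):
        # Flush everything accumulated so far as a single graph.
        if self.pending:
            self.executed.append(tuple(self.pending))
            self.pending.clear()


queue = LazyQueue()
for op in ("forward", "loss", "backward"):
    queue.record(op)
queue.mark_step()            # placed after loss.backward()
queue.record("optimizer_step")
queue.mark_step()            # placed after optimizer.step()
print(queue.executed)
```

In this toy model, the two `mark_step()` calls split one training iteration into two graphs, which corresponds to the two placement points described above.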

Next, we import the libraries needed for PyTorch training and define the argument parser.

In [None]:
from __future__ import print_function

import datetime
import os
import time
import sys

import torch
import torch.utils.data
from torch import nn
import torchvision
import torchvision.datasets as datasets
from torchvision import transforms
import random
import utils

# Key changes for targeting the Habana device
import habana_frameworks.torch.core as htcore
device = torch.device('hpu')

def get_resnet152_argparser():
    import argparse
    import sys
    parser = argparse.ArgumentParser(description='PyTorch Classification Training')
    parser.add_argument('--dl-time-exclude', default='True', type=lambda x: x.lower() == 'true', help='Set to False to include data load time')
    parser.add_argument('-b', '--batch-size', default=128, type=int)
    parser.add_argument('--device', default='hpu', help='device')
    parser.add_argument('--epochs', default=90, type=int, metavar='N',
                        help='number of total epochs to run')
    parser.add_argument('-j', '--workers', default=10, type=int, metavar='N',
                        help='number of data loading workers (default: 10)')
    parser.add_argument('--process-per-node', default=8, type=int, metavar='N',
                        help='Number of process per node')
    parser.add_argument('--hls_type', default='HLS1', help='Node type')
    parser.add_argument('--lr', default=0.1, type=float, help='initial learning rate')
    parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                        help='momentum')
    parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                        metavar='W', help='weight decay (default: 1e-4)',
                        dest='weight_decay')

    parser.add_argument('--print-freq', default=1, type=int, help='print frequency')
    parser.add_argument('--output-dir', default='.', help='path where to save')

    parser.add_argument('--channels-last', default='True', type=lambda x: x.lower() == 'true',
                        help='Whether input is in channels last format. '
                        'Any value other than True (case insensitive) disables channels-last')
    parser.add_argument('--resume', default='', help='resume from checkpoint')
    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    parser.add_argument('--seed', type=int, default=123, help='random seed')
    parser.add_argument('--world-size', default=1, type=int,
                        help='number of distributed processes')
    parser.add_argument('--num-train-steps', type=int, default=sys.maxsize, metavar='T',
                        help='number of steps a.k.a iterations to run in training phase')
    parser.add_argument('--num-eval-steps', type=int, default=sys.maxsize, metavar='E',
                        help='number of steps a.k.a iterations to run in evaluation phase')
    parser.add_argument('--save-checkpoint', action="store_true",
                        help='Whether or not to save model/checkpoint; True: to save, False to avoid saving')
    parser.add_argument('--run-lazy-mode', action='store_true',
                        help='run model in lazy execution mode')
    parser.add_argument('--deterministic', action="store_true",
                        help='Whether or not to make data loading deterministic; this does not make execution deterministic')
    parser.add_argument('--hmp', dest='is_hmp', action='store_true', help='enable hmp mode')
    parser.add_argument('--hmp-bf16', default='', help='path to bf16 ops list in hmp O1 mode')
    parser.add_argument('--hmp-fp32', default='', help='path to fp32 ops list in hmp O1 mode')
    parser.add_argument('--hmp-opt-level', default='O1', help='choose optimization level for hmp')
    parser.add_argument('--hmp-verbose', action='store_true', help='enable verbose mode for hmp')

    return parser
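
The parser above mixes two styles of boolean option: string-valued options converted with `type=lambda x: x.lower() == 'true'` (such as `--channels-last`) and presence-only flags with `action='store_true'` (such as `--run-lazy-mode`). The reduced sketch below uses only the standard `argparse` module to show how each behaves; the helper name `str_to_bool` is introduced here for readability.

```python
import argparse


def str_to_bool(value):
    # Same logic as the lambda in the parser: only "true" (any case) is True.
    return value.lower() == 'true'


parser = argparse.ArgumentParser()
parser.add_argument('--channels-last', default='True', type=str_to_bool)
parser.add_argument('--run-lazy-mode', action='store_true')

defaults = parser.parse_args([])
explicit = parser.parse_args(['--channels-last', 'False', '--run-lazy-mode'])
other = parser.parse_args(['--channels-last', 'yes'])

print(defaults.channels_last, defaults.run_lazy_mode)
print(explicit.channels_last, explicit.run_lazy_mode)
print(other.channels_last)
```

Note that any value other than `true`, including `yes` or `1`, silently disables a string-valued option, whereas a `store_true` flag takes no value at all.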


## Main training function 
Uncomment the two `htcore.mark_step()` calls in the code below (one after the backward pass and one after the optimizer step).

In [None]:
def train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, print_freq):
    model.train()
    metric_logger = utils.MetricLogger(delimiter="  ",device=device)
    metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value}'))
    metric_logger.add_meter('img/s', utils.SmoothedValue(window_size=10, fmt='{value}'))

    header = 'Epoch: [{}]'.format(epoch)
    step_count = 0
    last_print_time= time.time()

    for image, target in metric_logger.log_every(data_loader, print_freq, header):
        image, target = image.to(device, non_blocking=True), target.to(device, non_blocking=True)

        dl_ex_start_time=time.time()

        if args.channels_last:
            image = image.contiguous(memory_format=torch.channels_last)

        output = model(image)
        loss = criterion(output, target)
        optimizer.zero_grad(set_to_none=True)

        loss.backward()
        # Trigger graph execution
        #htcore.mark_step()

        optimizer.step()
        # Trigger graph execution
        #htcore.mark_step()

        if step_count % print_freq == 0:
            output_cpu = output.detach().to('cpu')
            acc1, acc5 = utils.accuracy(output_cpu, target, topk=(1, 5))
            batch_size = image.shape[0]
            metric_logger.update(loss=loss.item(), lr=optimizer.param_groups[0]["lr"])
            metric_logger.meters['acc1'].update(acc1.item(), n=batch_size*print_freq)
            metric_logger.meters['acc5'].update(acc5.item(), n=batch_size*print_freq)
            current_time = time.time()
            last_print_time = dl_ex_start_time if args.dl_time_exclude else last_print_time
            metric_logger.meters['img/s'].update(batch_size*print_freq / (current_time - last_print_time))
            last_print_time = time.time()

        step_count = step_count + 1
        if step_count >= args.num_train_steps:
            break
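
The `img/s` meter in the loop above divides the number of images processed since the last report by the elapsed wall-clock time. A minimal sketch of that arithmetic follows; the function name `images_per_second` is hypothetical, and the elapsed time is a made-up value for illustration, not a measurement.

```python
def images_per_second(batch_size, print_freq, elapsed_seconds):
    """Images processed over print_freq steps divided by wall-clock time."""
    return batch_size * print_freq / elapsed_seconds


# Hypothetical numbers: 20 steps of 256 images reported after 4 seconds.
rate = images_per_second(batch_size=256, print_freq=20, elapsed_seconds=4.0)
print(f"{rate:.0f} img/s")
```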

Set up the necessary environment variables and command-line arguments for single-HPU ResNet152 training:

In [None]:
os.environ["MAX_WAIT_ATTEMPTS"] = "50"
os.environ['HCL_CPU_AFFINITY'] = '1'
os.environ['PT_HPU_ENABLE_SYNC_OUTPUT_HOST'] = 'false'
parser = get_resnet152_argparser()
   
args = parser.parse_args(["--batch-size", "256", "--epochs", "20", "--workers", "8",
"--dl-time-exclude", "False", "--print-freq", "20", "--channels-last", "False", "--seed", "123", 
"--run-lazy-mode", "--hmp",  "--hmp-bf16", "/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision/ops_bf16_Resnet.txt",
"--hmp-fp32", "/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision/ops_fp32_Resnet.txt",
"--deterministic"])

## HMP Usage and why it’s important
The Habana Mixed Precision (HMP) package lets you run mixed precision training on HPU without extensive modifications to existing FP32 model scripts. To add mixed precision support to a model script, insert the following lines anywhere before the start of the training loop:
```
from habana_frameworks.torch.hpex import hmp 
hmp.convert()
```
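
Mixed precision here means that ops on the bf16 list run in bfloat16 while ops on the fp32 list stay in float32. As a rough illustration of what bfloat16 gives up, the sketch below truncates a float32 value to its upper 16 bits, which is the bfloat16 bit layout (1 sign bit, 8 exponent bits, 7 mantissa bits). The function `to_bfloat16` is a hypothetical helper, and plain truncation is a simplification; actual hardware may round differently.

```python
import struct


def to_bfloat16(value):
    """Keep only the upper 16 bits of the float32 encoding of value."""
    bits = struct.unpack('>I', struct.pack('>f', value))[0]
    truncated = bits & 0xFFFF0000   # drop the low 16 mantissa bits
    return struct.unpack('>f', struct.pack('>I', truncated))[0]


for x in (1.0, 3.14159, 257.0):
    print(x, '->', to_bfloat16(x))
```

Because bfloat16 keeps the full float32 exponent, it covers the same numeric range with fewer significant digits, which is why only ops that tolerate reduced precision are placed on the bf16 list.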

The following cells contain the main training code for single-HPU training, using the CIFAR10 data loaded earlier.

In [None]:
if args.is_hmp:
    from habana_frameworks.torch.hpex import hmp
    hmp.convert(opt_level=args.hmp_opt_level, bf16_file_path=args.hmp_bf16,
                fp32_file_path=args.hmp_fp32, isVerbose=args.hmp_verbose)

In [None]:
torch.manual_seed(args.seed)

if args.deterministic:
    seed = args.seed
    random.seed(seed)
else:
    seed = None

torch.backends.cudnn.benchmark = True

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

if args.workers > 0:
    # patch torch cuda functions that are being unconditionally invoked
    # in the multiprocessing data loader
    torch.cuda.current_device = lambda: None
    torch.cuda.set_device = lambda x: None

print("Creating model")
model = torchvision.models.__dict__['resnet152'](pretrained=False)
model.to(device)

criterion = nn.CrossEntropyLoss()

if args.run_lazy_mode:
    from habana_frameworks.torch.hpex.optimizers import FusedSGD
    sgd_optimizer = FusedSGD
else:
    sgd_optimizer = torch.optim.SGD
optimizer = sgd_optimizer(
    model.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay)

model_for_train = model


print("Start training")
start_time = time.time()
for epoch in range(args.start_epoch, args.epochs):
    train_one_epoch(model_for_train, criterion, optimizer, trainloader,
            device, epoch, print_freq=args.print_freq)

total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print('Training time {}'.format(total_time_str))

# Distributed Training

**Restart the kernel before running the next section of the notebook**

We will run the training script from the Model-References repository on the command line to demonstrate distributed training on 8 HPUs.

Distributed training differs in the following ways.

1. [Initialization with hccl](https://github.com/HabanaAI/Model-References/blob/1.6.0/PyTorch/computer_vision/classification/torchvision/utils.py#L249)
```
    from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu
    args.world_size, args.rank, args.local_rank = initialize_distributed_hpu()
    ...
    dist.init_process_group(backend='hccl', rank=args.rank, world_size=args.world_size)
```

2. [Use the torch distributed data sampler](https://github.com/HabanaAI/Model-References/blob/1.6.0/PyTorch/computer_vision/classification/torchvision/train.py#L179)
```
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
```

3. [Distributed data parallel PyTorch model initialization](https://github.com/HabanaAI/Model-References/blob/1.6.0/PyTorch/computer_vision/classification/torchvision/train.py#L328)
```
    model = torch.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False,
            gradient_as_bucket_view=True)
```

__Note__: In Step 3, you must use the DistributedDataParallel API because the DataParallel API is not supported by Habana.
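
To show what the distributed sampler in Step 2 is for, the sketch below splits dataset indices across ranks in plain Python. It is a simplified model of the strided split that, as far as we understand it, `DistributedSampler` performs when shuffling is disabled and the last samples are not dropped; the function `shard_indices` is hypothetical and is not the PyTorch implementation.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would process, without shuffling."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating the first indices so every rank gets the same count.
    indices += indices[: total - dataset_len]
    return indices[rank:total:world_size]


shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each rank works on a different slice of the data, so one epoch across all ranks covers the whole dataset, with a few repeated samples when the size is not divisible by the number of ranks.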

In [None]:
%set_env PYTHONPATH=/home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision:/root/examples/models:/usr/lib/habanalabs/:/root

In [None]:
%cd /home/ubuntu/Model-References/PyTorch/computer_vision/classification/torchvision

Apply the following patch to use CIFAR data and remove evaluation (pass `-R` to `git apply` if you want to revert it).

In [None]:
!git apply /home/ubuntu/DL1-Workshop/PyTorch-ResNet152/cifar_no_eval.patch

The following bash commands are saved as a shell script (`demo_resnet.sh`). Run that script in the final cell to start multi-HPU training.

  ```bash
  export MASTER_ADDR=localhost
  export MASTER_PORT=12355
  /opt/amazon/openmpi/bin/mpirun -n 8 --bind-to core --map-by slot:PE=6 --rank-by core --report-bindings --allow-run-as-root \
    python3 train.py --model=resnet152 --device=hpu --batch-size=256 --epochs=90 --workers=10 \
    --dl-worker-type=MP --print-freq=10 --output-dir=. --seed=123 --hmp --hmp-bf16 ./ops_bf16_Resnet.txt \
    --hmp-fp32 ./ops_fp32_Resnet.txt --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 \
    --custom-lr-milestones 1 2 3 4 30 60 80 --deterministic --dl-time-exclude=False
  ```
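
As a quick sanity check on the command above, the arithmetic below works out the effective global batch size and an approximate number of steps per epoch. It assumes that the training images are split evenly across the 8 processes and that the last, partial batch is kept; both are assumptions about the script's behaviour rather than statements taken from it.

```python
import math

num_hpus = 8             # mpirun -n 8
per_hpu_batch = 256      # --batch-size=256
train_images = 50_000    # CIFAR10 training split

global_batch = num_hpus * per_hpu_batch
images_per_hpu = math.ceil(train_images / num_hpus)
# Assumption: the final partial batch is kept rather than dropped.
steps_per_epoch = math.ceil(images_per_hpu / per_hpu_batch)
print(global_batch, images_per_hpu, steps_per_epoch)
```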

In [None]:
!sh /home/ubuntu/DL1-Workshop/PyTorch-ResNet152/demo_resnet.sh

# Summary

In this workshop, we did the following:
- Downloaded and previewed the CIFAR dataset.
- Learnt the steps needed for migrating the model to Habana HPU.
- Learnt about HMP usage and why it is important.
- Trained the ResNet152 model on a single HPU.
- Re-configured the training script for distributed training and trained on 8 HPUs.



