# PyTorch Mixed Precision with Autocast

Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.<br>
All rights reserved.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Overview
Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. Keeping certain parts of the model in 32-bit types preserves numerical stability, so the model achieves a lower step time and trains equally well in terms of evaluation metrics such as accuracy.

Habana HPUs can run operations in bfloat16 faster than in float32. Therefore, this lower-precision dtype should be used whenever possible on HPUs. However, variables and a few computations should still be in float32 for numerical stability so that the model trains to the same quality. PyTorch mixed precision lets you use a mix of bfloat16 and float32 during model training, to get the performance benefits of bfloat16 and the numerical stability benefits of float32.
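To make the trade-off concrete: bfloat16 keeps float32's 8 exponent bits but only 7 explicit mantissa bits, so it covers roughly the same range with much less precision. The sketch below emulates bfloat16 rounding in plain Python. The helper `to_bfloat16` is illustrative only (it is not a PyTorch or Habana API, and it ignores NaN and infinity), but it shows how a small update to a value near 1.0 can be lost entirely, which is one reason variables are kept in float32.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Round a Python float to the nearest bfloat16 value (illustrative helper)."""
    # Reinterpret the value as the 32 bits of an IEEE float32.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 is the top 16 bits of float32; round to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 1.0
small_update = 0.001

# The update is smaller than half the bfloat16 spacing near 1.0 (2**-7), so it vanishes.
print(to_bfloat16(weight + small_update))  # 1.0
# An update of exactly one bfloat16 step survives.
print(to_bfloat16(weight + 2**-7))         # 1.0078125
```

Because the exponent width matches float32, very large and very small magnitudes remain representable; only the number of significant digits drops.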

**Autocast is a native PyTorch module that allows running mixed precision training without extensive modifications to the existing FP32 model script. It executes operations registered to autocast using a lower-precision floating-point datatype. The module is provided by the `torch.amp` package.** For more details on PyTorch autocast, see https://pytorch.org/docs/stable/amp.html.

This simple example shows the basic steps needed to add torch.autocast to a first-gen Gaudi or Gaudi2 based model. For more details, refer to the Mixed Precision documentation or review the PyTorch ResNet50 model example in our Model-References repository.

## Supported hardware
Habana Gaudi HPUs support a mix of bfloat16 and float32.

Even on CPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing, debugging, or just to try out the API. However, on CPUs, mixed precision will run significantly slower.
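As a minimal sketch of trying the API without an HPU, the loop below runs a small linear model under `torch.autocast` with `device_type="cpu"`. The tensor sizes, seed, and step count are arbitrary choices for illustration and are not taken from the HPU example in this notebook. It assumes a PyTorch build in which CPU autocast with bfloat16 is available.

```python
import torch

torch.manual_seed(0)

# Small, arbitrary sizes so the example runs quickly on a CPU.
N, D_in, D_out = 8, 16, 4
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Linear(D_in, D_out)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(3):
    # Only the forward pass and the loss are wrapped in autocast.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The linear output is computed in bfloat16, while the parameters stay in float32.
print(y_pred.dtype)
print(model.weight.dtype)
```

This is useful for checking that a script's autocast scoping is correct before moving it to an HPU, not for measuring performance.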

## Setup

In [1]:
!pip install -q ipywidgets


Import the required modules:

In [2]:
import torch
import os
import habana_frameworks.torch.core as htcore

  
In this simple model, you set the input parameters, the Linear model, and the optimizer.

In [3]:
N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device='hpu')
y = torch.randn(N, D_out, device='hpu')

model = torch.nn.Linear(D_in, D_out).to(torch.device('hpu'))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

 PT_HPU_LAZY_MODE = 1
 PT_HPU_LAZY_EAGER_OPTIM_CACHE = 1
 PT_HPU_ENABLE_COMPILE_THREAD = 0
 PT_HPU_ENABLE_EXECUTION_THREAD = 1
 PT_HPU_ENABLE_LAZY_EAGER_EXECUTION_THREAD = 1
 PT_ENABLE_INTER_HOST_CACHING = 0
 PT_ENABLE_INFERENCE_MODE = 1
 PT_ENABLE_HABANA_CACHING = 1
 PT_HPU_MAX_RECIPE_SUBMISSION_LIMIT = 0
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_MAX_COMPOUND_OP_SIZE_SS = 10
 PT_HPU_ENABLE_STAGE_SUBMISSION = 1
 PT_HPU_STAGE_SUBMISSION_MODE = 2
 PT_HPU_PGM_ENABLE_CACHE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
 PT_HCCL_SLICE_SIZE_MB = 16
 PT_HCCL_MEMORY_ALLOWANCE_MB = 0
 PT_HPU_INITIAL_WORKSPACE_SIZE = 0
 PT_HABANA_POOL_SIZE = 24
 PT_HPU_POOL_STRATEGY = 5
 PT_HPU_POOL_LOG_FRAGMENTATION_INFO = 0
 PT_ENABLE_MEMORY_DEFRAGMENTATION = 1
 PT_ENABLE_DEFRAGMENTATION_INFO = 0
 PT_HPU_ENABLE_SYNAPSE_LAYOUT_HANDLING = 1
 PT_HPU_ENABLE_SYNAPSE_OUTPUT_PERMUTE = 1
 PT_HPU_ENABLE_VALID_DATA_RANGE_CHECK = 1
 PT_HPU_FORCE_USE_DEFAULT_STREAM = 0
 PT_RECIPE_CACHE_PATH = 
 PT_HPU_ENABLE_REF

#### To use autocast on HPU, wrap the forward pass (model+loss) of the training in `torch.autocast`:

##### Registered Operators  
There are three types of registration to torch.autocast:  
**Lower precision** - These ops run in the lower-precision bfloat16 datatype.  
**FP32** - These ops run in the higher-precision float32 datatype.  
**Promote** - These ops run in the highest-precision datatype among their inputs.  
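The three registration types can be pictured with a toy dispatcher. This is only a sketch of the rules as described above, not the Habana or PyTorch implementation: the function `autocast_dtypes` and the three sets are hypothetical, the op names are a small sample, and leaving unregistered ops unchanged is an assumption made for this illustration.

```python
# Toy model of the three registration types; NOT the real autocast implementation.
LOWER_PRECISION_OPS = {"linear", "matmul", "conv2d"}
FP32_OPS = {"mse_loss", "log", "softplus"}
PROMOTE_OPS = {"add", "sub", "cat", "stack"}

# A higher rank means higher precision.
PRECISION_RANK = {"bfloat16": 0, "float32": 1}


def autocast_dtypes(op_name, input_dtypes):
    """Return the dtype each input would be cast to before running op_name."""
    if op_name in LOWER_PRECISION_OPS:
        return ["bfloat16"] * len(input_dtypes)
    if op_name in FP32_OPS:
        return ["float32"] * len(input_dtypes)
    if op_name in PROMOTE_OPS:
        # Promote: every input is cast to the highest-precision input dtype.
        widest = max(input_dtypes, key=PRECISION_RANK.__getitem__)
        return [widest] * len(input_dtypes)
    # Assumption for this sketch: unregistered ops keep their input dtypes.
    return list(input_dtypes)


print(autocast_dtypes("linear", ["float32", "float32"]))     # ['bfloat16', 'bfloat16']
print(autocast_dtypes("mse_loss", ["bfloat16", "float32"]))  # ['float32', 'float32']
print(autocast_dtypes("add", ["bfloat16", "float32"]))       # ['float32', 'float32']
print(autocast_dtypes("add", ["bfloat16", "bfloat16"]))      # ['bfloat16', 'bfloat16']
```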

**NOTE**: The float16 datatype is not supported. Ensure that bfloat16-specific ops and functions are used in place of their float16 counterparts; for example, use `tensor.bfloat16()` instead of `tensor.half()`.

In [4]:
# Optional: uncomment to override the default op lists with your own files.
#os.environ['LOWER_LIST'] = '/path/to/lower_list.txt'
#os.environ['FP32_LIST'] = '/path/to/fp32_list.txt'

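for t in range(50):
    # Run the forward pass and the loss under autocast so registered ops use bfloat16.
    with torch.autocast(device_type='hpu', dtype=torch.bfloat16):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
        print(loss)
    optimizer.zero_grad()

    # The backward pass runs outside the autocast context.
    loss.backward()
    htcore.mark_step()  # in lazy mode, triggers execution of the accumulated ops

    optimizer.step()
    htcore.mark_step()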

tensor(1.3303, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3302, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3300, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3298, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3296, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3294, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3292, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3291, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3289, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3287, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3285, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3283, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3281, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3279, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3278, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3276, device='hpu:0', grad_fn=<MseLossBackward0>)
tensor(1.3274, device='hpu:0', grad_fn=<MseLossBackward0>)

### Supported Ops
##### The default lists of supported ops for each registration type are hard-coded internally. The default registered ops for each type are:

Lower precision: addmm, batch_norm, bmm, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, dot, dropout, feature_dropout, group_norm, instance_norm, layer_norm, leaky_relu, linear, matmul, mean, mm, mul, mv, softmax, log_softmax

FP32: acos, addcdiv, asin, atan2, bilinear, binary_cross_entropy, binary_cross_entropy_with_logits, cdist, cosh, cosine_embedding_loss, cosine_similarity, cross_entropy_loss, dist, div, divide, embedding, embedding_bag, erfinv, exp, expm1, hinge_embedding_loss, huber_loss, kl_div, l1_loss, log, log10, log1p, log2, logsumexp, margin_ranking_loss, mse_loss, multi_margin_loss, multilabel_margin_loss, nll_loss, pdist, poisson_nll_loss, pow, reciprocal, renorm, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softplus, tan, topk, triplet_margin_loss, truediv, true_divide

Promote: add, addcmul, addcdiv, cat, div, exp, mul, pow, sub, iadd, truediv, stack