Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.<br>
Copyright (c) 2019 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

# Migrating TensorFlow EfficientNet to Habana Gaudi

In this Jupyter notebook, we will learn how to migrate EfficientNet from the public TensorFlow [models](https://github.com/tensorflow/models) repository to a Habana Gaudi device with very limited code changes. We will first verify that the model can be trained on CPU, then add code to the training script to enable it on a single HPU, and finally scale the training out to 8 HPUs.

### Setup

First, check the current directory in preparation for cloning the TensorFlow `models` repository.

In [None]:
%pwd

Then, we will clone v2.11.0 of the TensorFlow [models](https://github.com/tensorflow/models.git) repository into the current directory.

In [None]:
!git clone --depth 1 --branch v2.11.0 https://github.com/tensorflow/models.git   

Verify that the repository was cloned successfully into the current location.

In [None]:
%ls

Check whether the current PYTHONPATH contains the location of the TensorFlow `models` repository. If not, add it.

The following command assumes the `models` repository is located at `/root/DL1-Workshop/TF-EfficientNet/models`. Modify the path accordingly if the repository is in a different location.

In [None]:
%set_env PYTHONPATH=/root/DL1-Workshop/TF-EfficientNet/models:$PYTHONPATH
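
The same check-then-add logic can also be written in Python. This is only a sketch: `prepend_to_pythonpath` is a hypothetical helper, not part of the `models` repository, and the path is the one assumed in the command above.

```python
import os

def prepend_to_pythonpath(path, current):
    """Return a PYTHONPATH value containing `path`, prepending it only if missing."""
    entries = [p for p in current.split(os.pathsep) if p]
    if path in entries:
        return os.pathsep.join(entries)
    return os.pathsep.join([path] + entries)

# Assumed location of the cloned repository; adjust to your environment.
models_dir = "/root/DL1-Workshop/TF-EfficientNet/models"
os.environ["PYTHONPATH"] = prepend_to_pythonpath(
    models_dir, os.environ.get("PYTHONPATH", "")
)
print(os.environ["PYTHONPATH"])
```

Because the helper skips paths that are already present, re-running the cell does not add duplicate entries.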

Install the packages that the model depends on.

In [None]:
!python3 -m pip install tensorflow-model-optimization tensorflow_addons gin-config tensorflow_datasets

### Training on CPU

We will use the Keras EfficientNet at https://github.com/tensorflow/models/tree/v2.9.2/official/legacy/image_classification as the example to show how to enable a public model on a Habana Gaudi device.

EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient. The model was first introduced by Tan et al. in [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946). In this session, we are going to use the EfficientNet baseline model, EfficientNet-B0, as the training example.
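
To make the compound coefficient concrete, here is a small sketch of the scaling rule using the constants reported in the paper (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution). The function name `compound_scale` is hypothetical, and the released B1 to B7 variants use rounded coefficients, so treat the numbers as illustrative only.

```python
# Compound scaling bases reported in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The paper constrains alpha * beta^2 * gamma^2 to be roughly 2,
# so FLOPS grow by about 2^phi.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2

for phi in range(3):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
print(f"FLOPS growth per unit of phi: about x{flops_factor:.2f}")
```

With `phi = 0` all multipliers are 1, which corresponds to the unscaled baseline used in this notebook.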

First, let's run the training with synthetic data on CPU and check its performance.

In [None]:
%cd models/official/legacy/image_classification

In [None]:
%ls

Let's first verify that EfficientNet can run on CPU with the existing code from the `models` repository.

In the TensorFlow `models` repository, the `configs` folder contains EfficientNet configuration files only for GPU and TPU. We will use the following Python command to override the existing GPU configuration and run EfficientNet-B0 on CPU:

```
python3 classifier_trainer.py --mode=train_and_eval --model_type=efficientnet --dataset=imagenet --model_dir=./log_cpu --data_dir=./ --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml --params_override='runtime.num_gpus=0,runtime.distribution_strategy="off",train_dataset.builder="synthetic",validation_dataset.builder="synthetic",train.steps=1000,train.epochs=1,evaluation.skip_eval=True'
```
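
The `--params_override` value is a comma-separated list of `key=value` pairs, with string values in double quotes. As a sketch of how that string is put together, the hypothetical helpers below (not part of the `models` repository) rebuild the override used in the command above from a dictionary.

```python
def format_override(value):
    """Render a Python value the way it appears in the override string."""
    if isinstance(value, str):
        return f'"{value}"'  # string values are double-quoted
    return str(value)        # numbers and booleans are written as-is

def build_params_override(overrides):
    """Join key=value pairs with commas, preserving insertion order."""
    return ",".join(f"{key}={format_override(val)}" for key, val in overrides.items())

cpu_overrides = {
    "runtime.num_gpus": 0,
    "runtime.distribution_strategy": "off",
    "train_dataset.builder": "synthetic",
    "validation_dataset.builder": "synthetic",
    "train.steps": 1000,
    "train.epochs": 1,
    "evaluation.skip_eval": True,
}
print(build_params_override(cpu_overrides))
```

Keeping the overrides in a dictionary makes it easier to change a single setting, such as `runtime.distribution_strategy`, between runs.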


The EfficientNet-B0 training results on CPU look as follows:

<img src="./enet_cpu_results.png" alt="efficientnet_config" align="left" width="800"/>

From the output log above, we can see that the throughput for EfficientNet-B0 training on CPU with synthetic data is around `40 examples/sec`.

### Training on 1 HPU

Now, let's modify the training script and enable the model on the Habana Gaudi device with the **BF16** data type. When the environment variable **`TF_ENABLE_BF16_CONVERSION=1`** is set, EfficientNet is trained with the BF16 data type.

Open [models/official/legacy/image_classification/classifier_trainer.py](models/official/legacy/image_classification/classifier_trainer.py) and insert the following 2 lines of code at **line 444**:

```
  from habana_frameworks.tensorflow import load_habana_module
  load_habana_module()
```

**Optionally**, if you want to reduce the verbosity of the training process, modify **line 452** from `logging.INFO` to `logging.ERROR`:

```
logging.set_verbosity(logging.ERROR)
```

To display the throughput during the training process, insert the following statement at **line 388**:

```
logging.set_verbosity(logging.INFO)
```

Save the file.


The 2 lines inserted at line 444 load the Habana software modules at the beginning of training, acquire the Habana Gaudi device, and register it with TensorFlow. This is all you need to do to enable EfficientNet on HPU.
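
If the same script should keep working on a machine without the Habana software stack, the import can be guarded. The following is a sketch rather than part of the original migration: `try_load_habana` is a hypothetical helper, and only `load_habana_module` and `TF_ENABLE_BF16_CONVERSION` come from the steps above.

```python
import os

def try_load_habana():
    """Load the Habana module if it is installed; return True on success."""
    try:
        from habana_frameworks.tensorflow import load_habana_module
    except ImportError:
        # No Habana software stack here: stay on the default device.
        return False
    load_habana_module()
    return True

# Report whether BF16 conversion was requested through the environment.
bf16_requested = os.environ.get("TF_ENABLE_BF16_CONVERSION") == "1"
habana_available = try_load_habana()
print(f"Habana module loaded: {habana_available}, "
      f"BF16 conversion requested: {bf16_requested}")
```

On a CPU-only machine the function returns `False` and training proceeds as in the previous section.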

Now, let's run the same command as above to launch the training. This time EfficientNet will be trained on a single HPU. 

In [None]:
!TF_ENABLE_BF16_CONVERSION=1 python3 classifier_trainer.py --mode=train_and_eval --model_type=efficientnet --dataset=imagenet --model_dir=./log_hpu --data_dir=./ --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml --params_override='runtime.num_gpus=0,runtime.distribution_strategy="off",train_dataset.builder="synthetic",validation_dataset.builder="synthetic",train.steps=1000,train.epochs=1,evaluation.skip_eval=True'

From the output log above, we can see that the throughput for EfficientNet-B0 training on Habana Gaudi with synthetic data is around `658 examples/sec`.

### Distributed Training on 8 HPUs 

Now, let's enable distributed training for EfficientNet on the 8 HPUs of the DLAMI.

In the original source code, `tf.distribute.Strategy` is used to support distributed training on TPU and GPU. We will reuse this architecture and enable distributed training on multiple HPUs with `HPUStrategy`. `HPUStrategy` is built on `MultiWorkerMirroredStrategy`, in which each worker runs in a separate process with a single Gaudi device acquired.

* Following our [documentation](https://docs.habana.ai/en/latest/Tensorflow_Scaling_Guide/TensorFlow_Gaudi_Scaling_Guide.html#multi-worker-training-using-hpustrategy), we will first construct an `HPUStrategy` instance when the `distribution_strategy` parameter is set to `hpu`.

    Open [models/official/common/distribute_utils.py](models/official/common/distribute_utils.py) and insert the following code at **line 148**:

    ```
    if distribution_strategy == "hpu":
      from habana_frameworks.tensorflow.distribute import HPUStrategy
      return HPUStrategy()
    ```
    
    Save the file.

* Then we will configure the **`TF_CONFIG`** environment variable by reusing the existing `distribute_utils.configure_cluster()` function in the code:

  Open [models/official/legacy/image_classification/classifier_trainer.py](models/official/legacy/image_classification/classifier_trainer.py) and replace **lines 293 and 294** with the following code:

  ```
    if params.runtime.distribution_strategy == 'hpu':
      hls_addresses = ["127.0.0.1"]
      TF_BASE_PORT = 2410
      from habana_frameworks.tensorflow.multinode_helpers import comm_size, comm_rank
      mpi_rank = comm_rank()
      mpi_size = comm_size()

      worker_hosts = ",".join([",".join([address + ':' + str(TF_BASE_PORT + rank)
                                         for rank in range(mpi_size // len(hls_addresses))])
                               for address in hls_addresses])
      task_index = mpi_rank
      distribute_utils.configure_cluster(worker_hosts, task_index)
    else:
      distribute_utils.configure_cluster(params.runtime.worker_hosts,
                                         params.runtime.task_index)
  ```

    Save the file.
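
To see what the worker list in the snippet above evaluates to, the sketch below reproduces the host and port construction as a standalone function and prints a `TF_CONFIG` value in the usual multi-worker layout. The function names are hypothetical, and the assumption that `configure_cluster()` produces exactly this JSON shape should be checked against `distribute_utils.py`.

```python
import json

def build_worker_hosts(hls_addresses, mpi_size, base_port=2410):
    """Comma-separated host:port list, one port per local rank on each address."""
    ranks_per_host = mpi_size // len(hls_addresses)
    return ",".join(
        f"{address}:{base_port + rank}"
        for address in hls_addresses
        for rank in range(ranks_per_host)
    )

def build_tf_config(worker_hosts, task_index):
    """Assumed TF_CONFIG layout for a worker-only cluster."""
    return json.dumps({
        "cluster": {"worker": worker_hosts.split(",")},
        "task": {"type": "worker", "index": task_index},
    })

hosts = build_worker_hosts(["127.0.0.1"], mpi_size=8)
print(hosts)
print(build_tf_config(hosts, task_index=0))
```

With 8 processes on a single host, each MPI rank gets its own port starting at 2410 and uses its rank as the task index.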

Now we launch 8 processes with the `mpirun` command to start distributed training for EfficientNet on 8 HPUs with `HPUStrategy`:

```
mpirun --allow-run-as-root -np 8 -x TF_ENABLE_BF16_CONVERSION=1 python3 classifier_trainer.py --mode=train_and_eval --model_type=efficientnet --dataset=imagenet --model_dir=./log_hpu_8 --data_dir=./ --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml --params_override='runtime.num_gpus=0,runtime.distribution_strategy="hpu",train_dataset.builder="synthetic",validation_dataset.builder="synthetic",train.steps=1000,train.epochs=1,evaluation.skip_eval=True,train.callbacks.enable_checkpoint_and_export=False'
```

Run the following command:

In [None]:
!mpirun --allow-run-as-root -np 8 -x TF_ENABLE_BF16_CONVERSION=1 python3 classifier_trainer.py --mode=train_and_eval --model_type=efficientnet --dataset=imagenet --model_dir=./log_hpu_8 --data_dir=./ --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml --params_override='runtime.num_gpus=0,runtime.distribution_strategy="hpu",train_dataset.builder="synthetic",validation_dataset.builder="synthetic",train.steps=1000,train.epochs=1,evaluation.skip_eval=True,train.callbacks.enable_checkpoint_and_export=False'


From the output above, you can see that with 8 Gaudi cards, the training throughput improves significantly, to around `3962 examples/sec`.
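
The three throughput figures quoted in this notebook can be compared with a few lines of arithmetic. This is a sketch based on the approximate numbers reported above; actual results vary between runs and systems.

```python
# Approximate throughputs reported above, in examples/sec.
cpu, hpu_1, hpu_8 = 40, 658, 3962

speedup_1_vs_cpu = hpu_1 / cpu            # single HPU relative to CPU
speedup_8_vs_1 = hpu_8 / hpu_1            # 8 HPUs relative to a single HPU
scaling_efficiency = hpu_8 / (8 * hpu_1)  # fraction of ideal linear scaling

print(f"1 HPU vs CPU:    {speedup_1_vs_cpu:.2f}x")
print(f"8 HPUs vs 1 HPU: {speedup_8_vs_1:.2f}x")
print(f"Scaling efficiency on 8 HPUs: {scaling_efficiency:.1%}")
```

Scaling efficiency here is the measured 8-HPU throughput divided by 8 times the single-HPU throughput, so a value of 100% would mean perfectly linear scaling.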