Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.
SPDX-License-Identifier: Apache-2.0


## Objective
This tutorial shows how to run the Intel Gaudi profiling tools on the Intel Gaudi 2 AI Accelerator: the habana_perf_tool, the TensorBoard plug-in, and the profiling trace viewer. These tools provide valuable optimization tips and information that help you modify a model for better performance and understand where its bottlenecks are. For more information, refer to the [Profiling](https://docs.habana.ai/en/latest/Profiling/index.html) section of the documentation for how to set up the profiler, and to the [Optimization Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/index.html) for background on other optimization techniques.

| Task                                 | Description                                             | Details                                         |
|--------------------------------------|---------------------------------------------------------|-------------------------------------------------|
| PyTorch Profiling with TensorBoard   | Obtains Gaudi-specific recommendations for performance using TensorBoard. | [Profiling with PyTorch](https://docs.habana.ai/en/latest/Profiling/Profiling_with_PyTorch.html#profiling-with-pytorch)        |
| Review the PT_HPU_METRICS_FILE      | Looks for excessive re-compilations during runtime.     | [Runtime Environment Variables](https://docs.habana.ai/en/latest/PyTorch/Reference/Runtime_Flags.html#pytorch-runtime-flags)                   |                         
| Profiling Trace Viewer               | Uses Perfetto to view traces.           |  [Getting Started with Intel Gaudi Profiler](https://docs.habana.ai/en/latest/Profiling/Intel_Gaudi_Profiling/Getting_Started_with_Profiler.html#getting-started-with-profiler)                      |                         
| Model Logging                        | Sets ENABLE_CONSOLE to enable logging for debugging and analysis. | [Runtime Environment Variables](https://docs.habana.ai/en/latest/PyTorch/Reference/Runtime_Flags.html#pytorch-runtime-flags)                |




### Initial Setup
To run this Jupyter notebook and the TensorBoard viewer, forward the appropriate ports when you ssh into the Intel Gaudi 2 node. The following ports must be open:
* 8888 (for running this Jupyter notebook)
* 6006 (for running TensorBoard)

To do this, add the following to your ssh command when connecting to the Intel Gaudi node:

`ssh -L 8888:localhost:8888 -L 6006:localhost:6006 .... `
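
Before starting Jupyter or TensorBoard, it can help to check whether something is already listening on these ports. The following is a minimal sketch using only the Python standard library; the helper name `port_in_use` is hypothetical and not part of any Intel Gaudi tooling.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# Report the state of the two ports this tutorial relies on
for name, port in (("Jupyter", 8888), ("TensorBoard", 6006)):
    state = "in use" if port_in_use(port) else "free"
    print(f"{name} port {port}: {state}")
```

If a port is reported as in use by another service, pick a different port and adjust both the `ssh -L` mapping and the corresponding server option.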

We start with an Intel Gaudi PyTorch Docker image and run this notebook. This example uses the [Swin Transformer](https://huggingface.co/microsoft/swin-base-patch4-window7-224-in22k) model from the Hugging Face repository, running on Hugging Face's Optimum-Habana library. The first step is to install the Optimum-Habana library and clone its repository:

In [None]:
%cd ~
!git clone -b v1.15.0 https://github.com/huggingface/optimum-habana.git
!pip install optimum-habana==1.15.0
!pip install pickleshare ipython

Next, we go into the image-classification task directory and install the specific requirements for the task:

In [None]:
%cd ~/optimum-habana/examples/image-classification
!pip install --quiet -r requirements.txt

### Running the Model
Now that the model is loaded, we'll run the model and look for the trace files for analysis. 

For this model script, the profiling setup can be seen in utils.py.
For models that are not in optimum-habana, refer to [Profiling_with_PyTorch](https://docs.habana.ai/en/latest/Profiling/Profiling_with_PyTorch.html) to set up profiling.

In [None]:
%%sh

cat -n ../../optimum/habana/utils.py | head -n 313 | tail -n 13

#### Run the Model to Collect a Trace File (Unoptimized)
Swin Transformer is a model that capably serves as a general-purpose backbone for computer vision. run_image_classification.py is a script that showcases how to fine-tune Swin Transformer on HPUs.

Notice the torch profiler specific commands:

- `--profiling_warmup_steps 10` - the profiler runs 10 warmup steps before it starts recording
- `--profiling_steps 3` - the profiler then records the next 3 active steps
                             
The collected trace files will be saved to ./hpu_profile
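
To make the two arguments concrete, the sketch below maps a training step to a profiler phase. It assumes the profiler uses a single cycle with no initial wait (as in a `torch.profiler.schedule` with `wait=0` and `repeat=1`); that assumption and the helper name `profiler_phase` are illustrative and not taken from the script itself.

```python
def profiler_phase(step: int, warmup_steps: int = 10, active_steps: int = 3) -> str:
    """Map a zero-based training step to a profiler phase under the assumed schedule."""
    if step < warmup_steps:
        return "warmup"   # profiler is running, but these steps are not kept
    if step < warmup_steps + active_steps:
        return "record"   # events from these steps end up in the trace file
    return "idle"         # the single cycle is over, nothing more is recorded


recorded = [s for s in range(20) if profiler_phase(s) == "record"]
print(recorded)  # [10, 11, 12]
```

Under these assumptions only three steps are traced, which keeps the trace files small even though training runs for two epochs.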

In [None]:
!python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --image_column_name img \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_training \
    --gaudi_config_name Habana/swin \
    --throughput_warmup_steps 3 \
    --bf16 \
    --report_to none \
    --overwrite_output_dir \
    --ignore_mismatched_sizes \
    --profiling_warmup_steps 10 \
    --profiling_steps 3

In [None]:
%cd hpu_profile
%ls -al
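
The same lookup can be done from Python, which is handy because the trace file name changes with every run. This is a minimal sketch with a hypothetical helper `latest_trace`; it assumes the traces end in `.pt.trace.json` and, because of the `%cd` above, searches the current directory by default.

```python
from pathlib import Path
from typing import Optional


def latest_trace(profile_dir: str = ".") -> Optional[Path]:
    """Return the most recently modified *.pt.trace.json file, or None if there is none."""
    traces = sorted(
        Path(profile_dir).glob("*.pt.trace.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return traces[-1] if traces else None


print(latest_trace())
```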

### Reviewing the Details in Tensorboard and perf_tool
Now that the training is completed, you can see that the trace files (...pt.trace.json) have been generated and can now be viewed. Two types of information are produced by TensorBoard:

* Model Performance Tracking - While your workload is being processed in batches, you can track the progress of the training process on the dashboard in real-time by monitoring the model’s cost (loss) and accuracy.
* Profiling Analysis - Right after the last requested step was completed, the collected profiling data is analyzed by TensorBoard and then immediately submitted to your browser, without any need to wait till the training process is completed.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=~/optimum-habana/examples/image-classification/hpu_profile --port 6006    # Your port selection may vary, default is 6006

If you do not want to run the TensorBoard UI, you can pass the same .json trace files to habana_perf_tool, which parses them and provides the same performance recommendations in text form. The trace file name is specific to each run (it embeds the host name and a timestamp), so replace the file name in the command below with the one generated by your run.

In [None]:
!habana_perf_tool --trace /root/optimum-habana/examples/image-classification/hpu_profile/sc09wynn05-hls2_14734.1729284340533778439.pt.trace.json
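
For a quick look at the raw trace itself, the .json file can also be read directly. The sketch below assumes the file follows the Chrome trace event layout (a `traceEvents` list whose complete events have `"ph": "X"` and a `dur` field in microseconds); the helper name `top_events` is hypothetical, and this summary is not a substitute for the recommendations produced by habana_perf_tool.

```python
import json
from collections import defaultdict


def top_events(trace_path, limit=10):
    """Sum the durations of complete events per event name, largest first."""
    with open(trace_path) as f:
        trace = json.load(f)
    # The trace is either an object with a "traceEvents" list or a bare list of events
    events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        # "X" marks a complete event that carries its own duration
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "?")] += event["dur"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Example usage (replace the path with a trace file from your own run):
# for name, total_us in top_events("hpu_profile/<your-trace>.pt.trace.json"):
#     print(f"{total_us:12.1f} us  {name}")
```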

### Using the Perfetto Trace Viewer
Finally, to view the details of the Intel Gaudi device itself, you can view the traces in the Perfetto trace viewer.

This step requires you to set the `hl-prof-config` settings and the environment variable `HABANA_PROFILE=1`, as shown below. This generates the .hltv file that can be viewed at https://perfetto.habana.ai. Because this run uses the Intel Gaudi profiler, the PyTorch profiling arguments (`--profiling_warmup_steps` and `--profiling_steps`) must be removed from the command. At the end of the run, you will see a `my_profiling_session_12345.hltv` file that can be loaded into the Perfetto viewer.

For more information on enabling your model to use the Habana Perfetto trace viewer, refer to the documentation: https://docs.habana.ai/en/latest/Profiling/Intel_Gaudi_Profiling/Getting_Started_with_Profiler.html

In [None]:
%cd ..
!hl-prof-config -e off -phase=multi-enq -g 1-20 -s my_profiling_session
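# An export run through `!` only affects its own subshell; %env sets the variable for the notebook process
%env HABANA_PROFILE=1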
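
Each `!` line runs in its own subshell, so a variable has to be set either inline on the command that needs it or in the notebook process itself. The sketch below illustrates the second option with the standard library; the child process is only a stand-in for the training script and is not part of the tutorial's workflow.

```python
import os
import subprocess
import sys

# Set the variable in the notebook's own process; child processes inherit it
os.environ["HABANA_PROFILE"] = "1"

# Stand-in for the training command: a child Python process that reports what it sees
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('HABANA_PROFILE'))"],
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())  # 1
```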

In [None]:
!HABANA_PROFILE=1 python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --image_column_name img \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --report_to none \
    --use_hpu_graphs_for_training \
    --gaudi_config_name Habana/swin \
    --throughput_warmup_steps 3 \
    --bf16 \
    --overwrite_output_dir \
    --ignore_mismatched_sizes 
    #--profiling_warmup_steps 10 \
    #--profiling_steps 3

In [None]:
!ls -l *.hltv

Consult the [Analysis guide](https://docs.habana.ai/en/latest/Profiling/Intel_Gaudi_Profiling/Analysis.html) for performing a thorough analysis of the above .hltv profile.

In [None]:
exit()