Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
# Intel&reg; Gaudi&reg; 2 Model Profiling and Optimization using HuggingFace

## Objective
This tutorial shows how to run the Intel Gaudi profiling tools on the Intel Gaudi 2 AI Accelerator: the habana_perf_tool, the TensorBoard plug-in, and the profiling trace viewer. These tools provide valuable optimization tips and information for modifying a model to improve performance. Following these steps can help you better understand the bottlenecks of your model. For more information, refer to the [Profiling](https://docs.habana.ai/en/latest/Profiling/index.html) section of the documentation for how to set up the profiler, and to the [Optimization Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/index.html) for background on other optimization techniques.

| Task                                 | Description                                             | Details                                         |
|--------------------------------------|---------------------------------------------------------|-------------------------------------------------|
| PyTorch Profiling with TensorBoard   | Obtains Gaudi-specific recommendations for performance using TensorBoard. | [Profiling with PyTorch](https://docs.habana.ai/en/latest/Profiling/Profiling_with_PyTorch.html#profiling-with-pytorch)        |
| Review the PT_HPU_METRICS_FILE      | Looks for excessive re-compilations during runtime.     | [Runtime Environment Variables](https://docs.habana.ai/en/latest/PyTorch/Reference/Runtime_Flags.html#pytorch-runtime-flags)                   |                         
| Profiling Trace Viewer               | Uses Perfetto to view traces.           |  [Getting Started with Intel Gaudi Profiler](https://docs.habana.ai/en/latest/Profiling/Intel_Gaudi_Profiling/Getting_Started_with_Profiler.html#getting-started-with-profiler)                      |                         
| Model Logging                        | Sets ENABLE_CONSOLE to enable logging for debug and analysis. | [Runtime Environment Variables](https://docs.habana.ai/en/latest/PyTorch/Reference/Runtime_Flags.html#pytorch-runtime-flags)                |




### Initial Setup
To run this Jupyter notebook and the TensorBoard viewer, forward the appropriate ports when you ssh into the Intel Gaudi 2 node. Ensure that the following ports are open:
* 8888 (for running this Jupyter notebook)
* 6006 (for running TensorBoard)

To do this, add the following to your ssh command when connecting to the Intel Gaudi node:

`ssh -L 8888:localhost:8888 -L 6006:localhost:6006 .... `
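
Once the tunnel is up and Jupyter or TensorBoard has started on the node, you can check from your local machine that the forwarded ports answer. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical and not part of any Intel Gaudi tooling. A port reports as not reachable until the corresponding service is actually running.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Ports used in this tutorial: 8888 for Jupyter, 6006 for TensorBoard.
for name, port in (("Jupyter", 8888), ("TensorBoard", 6006)):
    state = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} on localhost:{port} is {state}")
```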

We start with an Intel Gaudi PyTorch Docker image and run this notebook. For this example, we'll use the [Swin Transformer](https://huggingface.co/microsoft/swin-base-patch4-window7-224-in22k) model from the Hugging Face repository, running on Hugging Face's Optimum-Habana library. The first step is to install the Optimum-Habana library and clone its repository:

In [1]:
%cd ~/Gaudi-tutorials/PyTorch/Profiling_and_Optimization
!pip install pickleshare ipython
!pip install optimum-habana==1.12.0



/root/Gaudi-tutorials/PyTorch/Profiling_and_Optimization
Collecting optimum-habana==1.10.4
  Downloading optimum_habana-1.10.4-py3-none-any.whl.metadata (16 kB)
Collecting transformers<4.38.0,>=4.37.0 (from optimum-habana==1.10.4)
  Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.4/129.4 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
Collecting accelerate<0.28.0 (from optimum-habana==1.10.4)
  Downloading accelerate-0.27.2-py3-none-any.whl.metadata (18 kB)
Collecting diffusers<0.27.0,>=0.26.0 (from optimum-habana==1.10.4)
  Downloading diffusers-0.26.3-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub (from accelerate<0.28.0->optimum-habana==1.10.4)
  Downloading huggingface_hub-0.21.4-py3-none-any.whl.metadata (13 kB)
INFO: pip is looking at multiple versions of tokenizers to determine which version is compatible with other requirements. This could take a while.
Collecting tokenizers<0.

In [2]:
!git clone https://github.com/huggingface/optimum-habana
%cd optimum-habana
!git checkout v1.12.0
%cd ..

Cloning into 'optimum-habana'...
remote: Enumerating objects: 10772, done.
remote: Counting objects: 100% (3023/3023), done.
remote: Compressing objects: 100% (700/700), done.
remote: Total 10772 (delta 2665), reused 2444 (delta 2285), pack-reused 7749
Receiving objects: 100% (10772/10772), 4.69 MiB | 3.63 MiB/s, done.
Resolving deltas: 100% (7348/7348), done.
/root/Gaudi-tutorials/PyTorch/Profiling_and_Optimization/optimum-habana
Note: switching to 'v1.10.4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD i

Next, go into the image-classification example and install the specific requirements for the task:

In [3]:
%cd optimum-habana/examples/image-classification
!pip install -r requirements.txt

/root/Gaudi-tutorials/PyTorch/Profiling_and_Optimization/optimum-habana/examples/image-classification

### Running the Model
Now that the requirements are installed, we'll run the model and collect the trace files for analysis.

For this model script, profiling is configured in utils.py.
For models not in optimum-habana, refer to [Profiling_with_PyTorch](https://docs.habana.ai/en/latest/Profiling/Profiling_with_PyTorch.html) to set up profiling.

In [4]:
%%sh

cat -n ../../optimum/habana/utils.py | head -n 313 | tail -n 13

   262	            profiler = torch.profiler.profile(
   263	                schedule=schedule,
   264	                activities=activities,
   265	                on_trace_ready=torch.profiler.tensorboard_trace_handler(output_dir),
   266	                record_shapes=record_shapes,
   267	                with_stack=True,
   268	            )
   269	            self.start = profiler.start
   270	            self.stop = profiler.stop
   271	            self.step = profiler.step
   272	            HabanaProfile.enable.invalid = True
   273	            HabanaProfile.disable.invalid = True
   274	


#### Run Model to Collect Trace File (Unoptimized)
Swin Transformer is a model that capably serves as a general-purpose backbone for computer vision. run_image_classification.py is a script that showcases how to fine-tune Swin Transformer on HPUs.

Notice the torch profiler specific commands:

- `--profiling_warmup_steps 10` - the profiler runs 10 warmup steps before it starts recording
- `--profiling_steps 3` - the profiler records the next 3 active steps
                             
The collected trace files will be saved to ./hpu_profile
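
To make the effect of the two flags concrete, here is a small pure-Python sketch of the step schedule. It assumes the flags map onto a warmup phase followed by an active (recording) phase, with no initial wait and a single cycle, and that steps are counted from 0. The function name `profiler_phase` is hypothetical and only models the schedule; it does not call the profiler.

```python
def profiler_phase(step: int, warmup: int, active: int, wait: int = 0) -> str:
    """Classify a training step as 'idle', 'warmup', or 'record'.

    Models a single profiling cycle: `wait` idle steps, then `warmup`
    steps whose results are discarded, then `active` recorded steps.
    """
    if step < wait:
        return "idle"
    if step < wait + warmup:
        return "warmup"
    if step < wait + warmup + active:
        return "record"
    return "idle"  # the cycle has finished; nothing more is recorded


# Values used in this tutorial: 10 warmup steps, 3 recorded steps.
recorded = [s for s in range(20) if profiler_phase(s, warmup=10, active=3) == "record"]
print(recorded)  # [10, 11, 12]
```

Under these assumptions only three training steps end up in the trace. This keeps the run short, although a trace of a few steps can still be large.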

In [5]:
!python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --image_column_name img \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_training \
    --gaudi_config_name Habana/swin \
    --bf16 \
    --report_to none \
    --throughput_warmup_steps 2 \
    --overwrite_output_dir \
    --ignore_mismatched_sizes \
    --profiling_warmup_steps 10 \
    --profiling_steps 3

03/18/2024 22:45:42 - INFO - __main__ - Training/evaluation parameters GaudiTrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-06,
adjust_throughput=False,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=hccl,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=230,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tensor_cache_hpu_graphs=False,
disable_tqdm=False,
dispatch_batches=None,
distribution_strategy=ddp,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gaudi_config_name=Habana/swin,
gradient

In [7]:
%cd hpu_profile
%ls -al

[Errno 2] No such file or directory: 'hpu_profile'
/root/Gaudi-tutorials/PyTorch/Profiling_and_Optimization/optimum-habana/examples/image-classification/hpu_profile
total 258112
drwxr-xr-x 2 root root      4096 Mar 18 22:46 ./
drwxr-xr-x 4 root root      4096 Mar 18 22:46 ../
-rw-r--r-- 1 root root 264297613 Mar 18 22:46 hls2-srv01-demolab_1924.1710802005850814532.pt.trace.json
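
The trace file name includes the host name and a timestamp, so it differs on every run. As a convenience, the sketch below picks the most recently written `*.pt.trace.json` file from a directory using only the standard library. The helper name `newest_trace` is hypothetical, and the default directory `./hpu_profile` is the output location mentioned above.

```python
from pathlib import Path
from typing import Optional


def newest_trace(profile_dir: str = "./hpu_profile") -> Optional[Path]:
    """Return the most recently modified *.pt.trace.json file, or None."""
    traces = sorted(
        Path(profile_dir).glob("*.pt.trace.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return traces[-1] if traces else None


trace = newest_trace()
print(trace if trace else "no trace files found")
```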




### Reviewing the Details in TensorBoard and perf_tool
Now that training is complete, the trace files (...pt.trace.json) have been generated and can be viewed. TensorBoard produces two types of information:

* **Model Performance Tracking** - While your workload is being processed in batches, you can track training progress on the dashboard in real time by monitoring the model's cost (loss) and accuracy.
* **Profiling Analysis** - Right after the last requested step completes, TensorBoard analyzes the collected profiling data and submits it to your browser immediately; there is no need to wait until the training process is complete.

In [8]:
%load_ext tensorboard
%tensorboard --logdir=~/Gaudi-tutorials/PyTorch/Profiling_and_Optimization/optimum-habana/examples/image-classification/hpu_profile --port 6006    # Your port selection may vary, default is 6006

If you do not want to run the TensorBoard UI, you can pass the same .json trace files to habana_perf_tool, which parses them and provides the same performance recommendations in text form.

In [9]:
!habana_perf_tool --trace /root/Gaudi-tutorials/PyTorch/Profiling_and_Optimization/optimum-habana/examples/image-classification/hpu_profile/hls2-srv01-demolab_1924.1710802005850814532.pt.trace.json

2024-03-18 22:52:36,310 - pytorch_profiler - DEBUG - Loading /root/Gaudi-tutorials/PyTorch/Profiling_and_Optimization/optimum-habana/examples/image-classification/hpu_profile/hls2-srv01-demolab_1924.1710802005850814532.pt.trace.json
Import Data (KB): 100%|██████████████| 258103/258103 [00:02<00:00, 97726.52it/s]
2024-03-18 22:52:39,972 - pytorch_profiler - DEBUG - Please wait for initialization to finish ...
2024-03-18 22:52:48,030 - pytorch_profiler - DEBUG - PT Track ids: BridgeTrackIds.Result(pt_bridge_launch='32,49,35', pt_bridge_compute='33', pt_mem_copy='35', pt_mem_log='', pt_build_graph='48,34,51,52')
2024-03-18 22:52:48,030 - pytorch_profiler - DEBUG - Track ids: TrackIds.Result(forward='31', backward='47', synapse_launch='0,2,50', synapse_wait='1,37', device_mme='43,44,45,46', device_tpc='7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30', device_dma='6,38,39,40,41,42')
2024-03-18 22:52:49,689 - pytorch_profiler - DEBUG - Device ratio: 28.32 % (202.93 ms, 7
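
The `.pt.trace.json` file uses the Chrome trace event format, so you can also inspect it directly for a quick look at where time is spent. The sketch below sums the duration of complete events (`"ph": "X"`) per event name and returns the largest totals. It is an illustrative helper (the name `top_events` is hypothetical), not a replacement for habana_perf_tool, and it loads the whole file into memory, which can be slow for traces of several hundred megabytes.

```python
import json
from collections import defaultdict


def top_events(trace_path: str, limit: int = 10):
    """Return (name, total_duration) pairs for the longest-running events.

    Durations are the 'dur' fields of complete ('X') events, which the
    Chrome trace event format expresses in microseconds.
    """
    with open(trace_path) as f:
        trace = json.load(f)
    # The format allows either an object with 'traceEvents' or a bare list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev.get("name", "?")] += ev["dur"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Example usage (substitute the path to your own trace file):
# for name, total_us in top_events("hpu_profile/your_trace.pt.trace.json"):
#     print(f"{total_us / 1000:10.2f} ms  {name}")
```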

### Using the Perfetto Trace Viewer
Finally, to view the details of the Intel Gaudi device itself, you can view the traces in the Perfetto trace viewer.

This step requires you to configure the profiler with `hl-prof-config` and set the environment variable `HABANA_PROFILE=1`, as shown below. This generates the .hltv file that can be viewed at https://perfetto.habana.ai. Because this run uses the Intel Gaudi profiler, the PyTorch profiler arguments (`--profiling_warmup_steps` and `--profiling_steps`) must be removed from the command. At the end of the run, you will see a `my_profiling_session_12345.hltv` file that can be loaded into the Perfetto viewer.

For more information on enabling your model to use the Perfetto trace viewer, refer to the documentation: https://docs.habana.ai/en/latest/Profiling/Intel_Gaudi_Profiling/Getting_Started_with_Profiler.html

In [None]:
%cd ..
!hl-prof-config -e off -phase=multi-enq -g 1-20 -s my_profiling_session
%env HABANA_PROFILE=1

In [None]:
!HABANA_PROFILE=1 python run_image_classification.py \
    --model_name_or_path microsoft/swin-base-patch4-window7-224-in22k \
    --dataset_name cifar10 \
    --output_dir /tmp/outputs/ \
    --remove_unused_columns False \
    --image_column_name img \
    --do_train \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 64 \
    --evaluation_strategy no \
    --save_strategy no \
    --load_best_model_at_end False \
    --save_total_limit 3 \
    --seed 1337 \
    --use_habana \
    --use_lazy_mode \
    --report_to none \
    --use_hpu_graphs_for_training \
    --gaudi_config_name Habana/swin \
    --bf16 \
    --throughput_warmup_steps 2 \
    --overwrite_output_dir \
    --ignore_mismatched_sizes 
    #--profiling_warmup_steps 10 \
    #--profiling_steps 3
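
The `HABANA_PROFILE=1` prefix in the command above sets the variable only for that one process. If you launch the training script from Python instead of a notebook cell, the same effect can be sketched by passing a modified environment to the child process. The helper name `run_with_profiling` is hypothetical, and the child command here is a stand-in that only prints the variable; replace it with your actual training command.

```python
import os
import subprocess
import sys


def run_with_profiling(cmd, extra_env=None):
    """Run cmd with HABANA_PROFILE=1 added to a copy of the environment."""
    env = dict(os.environ, HABANA_PROFILE="1")  # the parent's environment is not modified
    if extra_env:
        env.update(extra_env)
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)


# Stand-in child process that just echoes the variable it received.
result = run_with_profiling(
    [sys.executable, "-c", "import os; print(os.environ['HABANA_PROFILE'])"]
)
print(result.stdout.strip())  # 1
```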