Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.

##### Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Intel® Gaudi® Accelerator Quick Start Guide


This document provides instructions for setting up an Intel Gaudi 2 AI accelerator instance on the Intel® Developer Cloud or on any on-premises Intel Gaudi node. You will run models from the Intel Gaudi software Model References and the Hugging Face Optimum Habana library.

Please follow along with the [video](https://developer.habana.ai/intel-developer-cloud/) on our Developer Page to walk through the steps below. This guide assumes that you have set up the latest Intel Gaudi PyTorch Docker image.

To set up a multi-node instance with two or more Gaudi nodes, refer to Setting up Multiple Gaudi Nodes in the [Quick Start Guide Documentation](https://docs.habana.ai/en/latest/Intel_DevCloud_Quick_Start/Intel_DevCloud_Quick_Start.html#setting-up-multiple-gaudi-nodes).

The first step is to clone the Model-References repository from GitHub and run the "hello-world" model from its examples.

In [None]:
%cd ~/Gaudi-tutorials/PyTorch/Intel_Gaudi_Quickstart
!git clone -b 1.17.1 https://github.com/HabanaAI/Model-References.git

In [None]:
%cd Model-References/PyTorch/examples/computer_vision/hello_world/

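Set the paths used for Python execution:

In [None]:
import os
# os.environ stores strings literally, so expand '~' and any existing PYTHONPATH by hand.
model_references = os.path.expanduser('~/Model-References')
existing = os.environ.get('PYTHONPATH')
os.environ['PYTHONPATH'] = f"{existing}:{model_references}" if existing else model_references
os.environ['PYTHON'] = '/usr/bin/python3.10'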

We now run the simple example with the MNIST dataset on one Intel Gaudi card:

In [None]:
!python3 mnist.py --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu --autocast
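
The `--lr` and `--gamma` flags interact through the learning-rate schedule. The sketch below assumes that `mnist.py` multiplies the learning rate by `--gamma` after every epoch, as the upstream PyTorch MNIST example does; that behavior is an assumption here and should be confirmed in the script. `lr_per_epoch` is a hypothetical helper written for illustration only.

```python
def lr_per_epoch(initial_lr, gamma, epochs):
    """Learning rate used in each epoch, assuming it is multiplied by gamma after every epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


# Values taken from the command above, extended to three epochs to show the decay.
for epoch, lr in enumerate(lr_per_epoch(initial_lr=1.0, gamma=0.7, epochs=3), start=1):
    print(f"epoch {epoch}: lr = {lr:.3f}")
```

With `--epochs=1`, as in the command above, only the first value of the schedule is ever used.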

We can now run the same model on eight Intel Gaudi cards using mpirun:

In [None]:
!mpirun -n 8 --bind-to core --map-by slot:PE=6 \
      --rank-by core --report-bindings \
      --allow-run-as-root \
      python3 mnist.py \
      --batch-size=64 --epochs=1 \
      --lr=1.0 --gamma=0.7 \
      --hpu --autocast

### Fine-tuning with Hugging Face Optimum Habana Library
The Optimum Habana library is the interface between the Hugging Face Transformers and Diffusers libraries and the Gaudi 2 card. It provides a set of tools that enable easy model loading, training, and inference in single- and multi-card settings for different downstream tasks. The following example uses the text-classification task to fine-tune a BERT-Large model on the MRPC (Microsoft Research Paraphrase Corpus) dataset and then run inference.

Follow the steps below to install a stable release of the Optimum Habana library and its examples:

1. Clone the Optimum-Habana project and check out the latest stable release.  This repository gives access to the examples that are optimized for Intel Gaudi:

In [1]:
%cd ~
!git clone -b v1.13.2 https://github.com/huggingface/optimum-habana.git

/root
Cloning into 'optimum-habana'...
remote: Enumerating objects: 18369, done.
remote: Counting objects: 100% (1954/1954), done.
remote: Compressing objects: 100% (916/916), done.
remote: Total 18369 (delta 1310), reused 1431 (delta 909), pack-reused 16415 (from 1)
Receiving objects: 100% (18369/18369), 11.88 MiB | 9.52 MiB/s, done.
Resolving deltas: 100% (12642/12642), done.
Note: switching to '1266993d741ba97e929965c01307f3c6cce8c107'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false



2. Install the Optimum-Habana library. This installs the stable release that matches the examples cloned above:

In [2]:
!pip install --quiet optimum-habana==1.13.2

3. In order to use the DeepSpeed library on Intel Gaudi 2, install the Intel Gaudi DeepSpeed fork:

In [3]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.17.1

Collecting git+https://github.com/HabanaAI/DeepSpeed.git@1.17.1
  Cloning https://github.com/HabanaAI/DeepSpeed.git (to revision 1.17.1) to /tmp/pip-req-build-_ehmoh8h
  Running command git clone --filter=blob:none --quiet https://github.com/HabanaAI/DeepSpeed.git /tmp/pip-req-build-_ehmoh8h
  Running command git checkout -b 1.17.1 --track origin/1.17.1
  Switched to a new branch '1.17.1'
  Branch '1.17.1' set up to track remote branch '1.17.1' from 'origin'.
  Resolved https://github.com/HabanaAI/DeepSpeed.git to commit e3078cb74d027995725e39806e358a2fceee16be
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting hjson (from deepspeed==0.14.0+hpu.synapse.v1.17.1)
  Downloading hjson-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Downloading hjson-3.1.0-py3-none-any.whl (54 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 kB 3.8 MB/s eta 0:00:00
Building wheels for co

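Before moving on, it can help to confirm which versions were actually installed. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper, and the package names are taken from the install commands and log output above. In the environment shown in this guide, the reported versions would be expected to correspond to the pinned `optimum-habana==1.13.2` release and the `deepspeed` build named in the install log.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names as they appear in the pip commands and output above.
for name in ("optimum-habana", "deepspeed"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```
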
The following example is based on the Optimum-Habana Text Classification task example. Change to the text-classification directory and install the additional software requirements for this specific example:

In [4]:
%cd ~/optimum-habana/examples/text-classification/
!pip install -r requirements.txt

/root/optimum-habana/examples/text-classification
Collecting evaluate (from -r requirements.txt (line 7))
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.0/84.0 kB 3.5 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3

### Execute Single-Card Training
The following command fine-tunes the BERT-Large model on one Intel Gaudi card:

In [5]:
!python run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking  \
--task_name mrpc   \
--do_train   \
--do_eval   \
--per_device_train_batch_size 32 \
--learning_rate 3e-5  \
--num_train_epochs 3   \
--max_seq_length 128   \
--output_dir ./output/mrpc/  \
--use_habana  \
--use_lazy_mode   \
--bf16   \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3 \
--report_to none \
--overwrite_output_dir 

gaudi_config.json: 100%|██████████████████████| 90.0/90.0 [00:00<00:00, 380kB/s]
10/01/2024 02:21:44 - INFO - __main__ - Training/evaluation parameters GaudiTrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-06,
adjust_throughput=False,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=hccl,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=230,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tensor_cache_hpu_graphs=False,
disable_tqdm=False,
dispatch_batches=None,
distribution_strate

### Execute Multi-Card Training
In this example, you will run the same fine-tuning task on eight Gaudi 2 cards. The Optimum-Habana repository provides a helper script, `gaudi_spawn.py`, that manages multi-card execution.
Compare the execution time of this fine-tuning run with that of the single-card run.

In [None]:
!python ../gaudi_spawn.py  --world_size 8 --use_mpi run_glue.py  \
--model_name_or_path bert-large-uncased-whole-word-masking  \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking  \
--task_name mrpc  \
--do_train  \
--do_eval  \
--per_device_train_batch_size 32  \
--per_device_eval_batch_size 8  \
--learning_rate 3e-5  \
--num_train_epochs 3   \
--max_seq_length 128  \
--output_dir /tmp/mrpc_output/  \
--use_habana   \
--use_lazy_mode   \
--bf16    \
--use_hpu_graphs_for_inference  \
--throughput_warmup_steps 3 \
--report_to none \
--overwrite_output_dir 


Running with the following model specific env vars: 
MASTER_ADDR=localhost
MASTER_PORT=29500
DistributedRunner run(): command = mpirun -n 8 --bind-to core --map-by socket:PE=10 --rank-by core --report-bindings --allow-run-as-root /usr/bin/python run_glue.py --model_name_or_path bert-large-uncased-whole-word-masking --gaudi_config_name Habana/bert-large-uncased-whole-word-masking --task_name mrpc --do_train --do_eval --per_device_train_batch_size 32 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 3 --max_seq_length 128 --output_dir /tmp/mrpc_output/ --use_habana --use_lazy_mode --bf16 --use_hpu_graphs_for_inference --throughput_warmup_steps 3 --report_to none --overwrite_output_dir
Authorization required, but no authorization protocol specified
[sc09wynn08-hls2:01902] MCW rank 5 bound to socket 1[core 50[hwt 0-1]], socket 1[core 51[hwt 0-1]], socket 1[core 52[hwt 0-1]], socket 1[core 53[hwt 0-1]], socket 1[core 54[hwt 0-1]], socket 1[core 55[hwt 0-1]], socket 1[co
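
One reason the multi-card run finishes sooner is that each optimizer step consumes eight times as many samples, so far fewer steps are needed per epoch. The arithmetic below is a rough sketch of that effect. It assumes the GLUE MRPC training split contains 3,668 examples (a figure not stated in this guide), the default gradient accumulation of 1, and `dataloader_drop_last=False` as shown in the single-card log. `optimizer_steps` is a hypothetical helper.

```python
import math

MRPC_TRAIN_EXAMPLES = 3668  # assumed size of the GLUE MRPC training split


def optimizer_steps(num_examples, per_device_batch, num_cards, epochs, grad_accum=1):
    """Return (global batch size, total optimizer steps) for a data-parallel run."""
    global_batch = per_device_batch * num_cards * grad_accum
    # The last partial batch is kept (drop_last=False), hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return global_batch, steps_per_epoch * epochs


for cards in (1, 8):
    global_batch, total_steps = optimizer_steps(MRPC_TRAIN_EXAMPLES, 32, cards, epochs=3)
    print(f"{cards} card(s): global batch {global_batch}, {total_steps} optimizer steps")
```

Wall-clock time does not shrink by exactly the same factor, because communication between cards and fixed startup costs also contribute to the total.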

### Training with DeepSpeed
With the DeepSpeed package already installed, you can run multi-card training with DeepSpeed. First create a ds_config.json file that sets the parameters of the DeepSpeed run, then pass that file to the DeepSpeed training command.

#### Create DeepSpeed Config file with ZeRO preferences
The ds_config.json file configures the DeepSpeed run, which still executes on eight Intel Gaudi 2 accelerators.

In this case, we use ZeRO stage 2 optimization with BF16 mixed precision.

In [None]:
%%sh
tee ./ds_config.json <<EOF
{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}
EOF
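
The three `"auto"` entries are placeholders that the Hugging Face Trainer integration fills in from the training arguments at launch. The sketch below mimics that resolution for the values used in this guide, assuming the default gradient accumulation of 1; `resolve_auto_batch_fields` is a hypothetical helper for illustration and is not the code the library actually runs.

```python
import json

# The same settings as the ds_config.json written above.
ds_config = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,
        "reduce_scatter": False,
        "contiguous_gradients": False,
    },
}


def resolve_auto_batch_fields(config, per_device_batch, grad_accum, world_size):
    """Replace the "auto" batch fields with values derived from the run arguments."""
    derived = {
        "train_micro_batch_size_per_gpu": per_device_batch,
        "gradient_accumulation_steps": grad_accum,
        "train_batch_size": per_device_batch * grad_accum * world_size,
    }
    resolved = dict(config)
    for key, value in derived.items():
        if resolved.get(key) == "auto":  # explicit values are left untouched
            resolved[key] = value
    return resolved


resolved = resolve_auto_batch_fields(ds_config, per_device_batch=32, grad_accum=1, world_size=8)
print(json.dumps({key: resolved[key] for key in ("train_batch_size",
                                                 "train_micro_batch_size_per_gpu",
                                                 "gradient_accumulation_steps")}, indent=2))
```

Keeping these fields as `"auto"` avoids having to keep the JSON file in sync with `--per_device_train_batch_size` and `--world_size` by hand.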

This is the DeepSpeed run command for BERT fine-tuning. When the run completes, compare the runtime and maximum memory usage with the non-DeepSpeed run above; you should see faster execution and reduced memory consumption. With larger models, these advantages of DeepSpeed become very important for running large language and generative AI models.

In [None]:
!python ../gaudi_spawn.py \
--world_size 8 --use_deepspeed run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--task_name mrpc \
--do_train \
--do_eval \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 3 \
--max_seq_length 128 \
--output_dir /tmp/mrpc_output/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3 \
--report_to none \
--overwrite_output_dir  \
--deepspeed ds_config.json

### Inference Example Run
This is a separate example that runs inference only. It computes the same evaluation metrics (accuracy and F1 score) as the training runs above, which show how well the model performs:

In [None]:
!python run_glue.py --model_name_or_path bert-large-uncased-whole-word-masking \
--gaudi_config_name Habana/bert-large-uncased-whole-word-masking \
--task_name mrpc \
--do_eval \
--max_seq_length 128 \
--output_dir ./output/mrpc/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--report_to none \
--overwrite_output_dir 
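
For reference, here is a minimal sketch of how accuracy and F1 are computed for a binary task such as MRPC, where label 1 marks a paraphrase pair. The predictions and labels are made-up toy values, not output from the run above, and `accuracy_and_f1` is a hypothetical helper rather than the metric implementation used by `run_glue.py`.

```python
def accuracy_and_f1(predictions, labels):
    """Compute accuracy and the F1 score of the positive class (label 1)."""
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    true_pos = sum(p == 1 and y == 1 for p, y in pairs)
    false_pos = sum(p == 1 and y == 0 for p, y in pairs)
    false_neg = sum(p == 0 and y == 1 for p, y in pairs)
    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall; define it as 0 with no true positives.
    f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg) if true_pos else 0.0
    return accuracy, f1


# Toy data for illustration only.
predictions = [1, 1, 0, 1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(predictions, labels)
print(f"accuracy = {accuracy:.2f}, f1 = {f1:.2f}")
```

F1 is reported alongside accuracy because the two classes are not evenly balanced, so accuracy alone can look better than the model's handling of each class warrants.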

## Next Steps
You now have access to all the models in the Model-References and Optimum-Habana repositories and can start exploring other models. All the models in these repositories are documented, so they are easy to use.
* To explore more models from the Model References, start [here](https://github.com/HabanaAI/Model-References).
* To run more examples using Hugging Face, go [here](https://github.com/huggingface/optimum-habana?tab=readme-ov-file#validated-models).
* To migrate other models to Gaudi 2, refer to PyTorch Model Porting in the [documentation](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html).

In [None]:
# Please be sure to run this exit command to ensure that the resources running on Intel Gaudi are released 
exit()