Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.

##### Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Intel® Gaudi® Accelerator Using Hugging Face Transformer Reinforcement Learning


This document provides instructions for setting up an Intel Gaudi 2 AI accelerator instance on the Intel® Developer Cloud or on any on-premises Intel Gaudi node. You will be running models from the Intel Gaudi software Model References and the Hugging Face Optimum Habana library.

This tutorial assumes that you have set up the latest Intel Gaudi PyTorch Docker image.

The first step is to install the Optimum Habana repository from GitHub and run the demo of Transformer Reinforcement Learning.

### Fine-tuning with Hugging Face Optimum Habana Library
The Optimum Habana library is the interface between the Hugging Face Transformers and Diffusers libraries and the Gaudi 2 card. It provides a set of tools for model loading, training, and inference in single-card and multi-card settings for different downstream tasks. The following examples use the DPO and PPO pipelines to fine-tune a Llama 2 7B model. For more details, see the [TRL](https://github.com/huggingface/optimum-habana/tree/main/examples/trl) examples on the Optimum-Habana GitHub page.

Follow the steps below to install the stable release of the Optimum Habana examples and library:

1. Clone the Optimum-Habana project and check out the latest stable release. This repository gives access to the examples that are optimized for Intel Gaudi:

In [None]:
%cd ~
!git clone https://github.com/huggingface/optimum-habana.git
%cd optimum-habana
!git checkout v1.11.1
%cd ~

2. Install the Optimum-Habana library. This installs release 1.11.1, which matches the examples checked out in the previous step:

In [None]:
!pip install optimum-habana==1.11.1

3. In order to use the DeepSpeed library on Intel Gaudi 2, install the Intel Gaudi DeepSpeed fork:

In [None]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.0
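
After the three installation steps above, it can help to confirm that the installed packages line up with the tag that was checked out. The following is a minimal sketch using only the Python standard library; the helper names (`normalize_tag`, `installed_version`, `check_pin`) are hypothetical and not part of Optimum-Habana, and the distribution names `optimum-habana` and `deepspeed` are assumed to be the names pip records for these installs.

```python
from importlib import metadata


def normalize_tag(tag: str) -> str:
    """Strip a leading 'v' from a git tag such as 'v1.11.1'."""
    return tag[1:] if tag.startswith("v") else tag


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package: str, expected_tag: str) -> str:
    """Describe whether the installed version of `package` matches a git tag."""
    found = installed_version(package)
    expected = normalize_tag(expected_tag)
    if found is None:
        return f"{package}: not installed"
    if found == expected:
        return f"{package}: {found} (matches {expected_tag})"
    return f"{package}: {found} (expected {expected})"


# The tag is the one checked out in step 1.
print(check_pin("optimum-habana", "v1.11.1"))
# The DeepSpeed fork may report a version string with a local suffix,
# so this line only reports what is installed instead of comparing.
print("deepspeed:", installed_version("deepspeed") or "not installed")
```

If the first line reports a mismatch, re-running the pinned `pip install` from step 2 is usually the simplest fix.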

The following example is based on the Optimum-Habana TRL task example. Change to the `trl` directory and install the additional software requirements for this specific example:

In [None]:
%cd ~/optimum-habana/examples/trl/
!pip install -U -r requirements.txt
!pip install datasets==2.18

In [5]:
!huggingface-cli login --token <your_token_here>

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
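
As an alternative to pasting the token into a notebook cell, the login can be done from Python with the token read from an environment variable. This is a sketch under assumptions: the helper names are hypothetical, the variable name `HF_TOKEN` is a choice made here, and `huggingface_hub.login` is imported lazily so the helper that reads the token can be used without the library installed.

```python
import os


def read_hf_token(env=os.environ, var="HF_TOKEN"):
    """Return the token stored in `var`, or raise if it is missing or empty."""
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var} environment variable to your access token")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported here so that read_hf_token() has no third-party dependency.
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_from_env()` once per session should have the same effect as the CLI command above, without the token appearing in the notebook.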


### DPO Pipeline

#### Training

The following example creates StackLlaMa 2, a llama-v2-7b model fine-tuned on Stack Exchange data. There are two main steps to the DPO training process:

1. Supervised fine-tuning of the base llama-v2-7b model to create llama-v2-7b-se:

In [6]:
!python sft.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --output_dir="./sft" \
    --max_steps=500 \
    --logging_steps=10 \
    --save_steps=100 \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=2 \
    --learning_rate=1e-4 \
    --lr_scheduler_type="cosine" \
    --warmup_steps=100 \
    --weight_decay=0.05 \
    --optim="paged_adamw_32bit" \
    --lora_target_modules "q_proj" "v_proj" \
    --bf16 \
    --remove_unused_columns=False \
    --run_name="sft_llama2" \
    --report_to=none \
    --use_habana \
    --use_lazy_mode

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  torch.utils._pytree._register_pytree_node(
  torch.utils._pytree._register_pytree_node(
[2024-06-03 19:56:16,720] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
  _torch_pytree._register_pytree_node(
config.json: 100%|█████████████████████████████| 609/609 [00:00<00:00, 5.79MB/s]
model.safetensors.index.json: 100%|████████| 26.8k/26.8k [00:00<00:00, 61.8MB/s]
Downloading shards:   0%|                                 | 0/2 [00:00<?, ?it/s]
model-00001-of-00002.safetensors:   0%|             | 0.00/9.98G [00:00<?, ?B/s]
model-00001-of-00002.safetensors:   0%|    | 10.5M/9.98G [00:00<05:59, 27.7MB/s]
model-00002-of-00002.safetensors:  96%|███▊| 3.37G/3.50G [02:02<00:04, 27.6MB/s]
model-00002-of-00002.safetensors:  96%|███▊| 3.38G/3.50G [02:03<00:04, 27.7MB/s]
model-00002-of-00002.safetensors: 100%|███▉| 3.49G/3.50G [02:07<00:00, 27.4MB/s]
model-00002
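
To make the hyperparameters in the `sft.py` command more concrete, the sketch below works out the effective batch size and the learning rate at a few steps. It assumes a single-card run (the command is launched with plain `python`) and assumes that the `cosine` scheduler uses linear warmup from zero followed by a half-cosine decay to zero. The function names are illustrative and are not taken from `sft.py`.

```python
import math

# Values copied from the sft.py command above.
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 2
NUM_DEVICES = 1  # assumption: single-card run
MAX_STEPS = 500
WARMUP_STEPS = 100
PEAK_LR = 1e-4


def effective_batch_size(per_device, accum_steps, devices):
    """Number of samples that contribute to one optimizer step."""
    return per_device * accum_steps * devices


def cosine_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Linear warmup to peak_lr, then half-cosine decay to zero at `total`."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_DEVICES)
print("samples per optimizer step:", batch)
print("samples seen over the run:", batch * MAX_STEPS)
for step in (0, 50, 100, 300, 500):
    print(f"step {step:3d}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions the learning rate peaks at step 100 and has fallen to half of its peak by step 300, the midpoint of the decay phase.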

2. Run the DPO trainer using the model saved by the previous step:

In [7]:
!python dpo.py \
    --model_name_or_path="sft/final_merged_checkpoint" \
    --tokenizer_name_or_path=meta-llama/Llama-2-7b-hf \
    --lora_target_modules "q_proj" "v_proj" "k_proj" "out_proj" "fc_in" "fc_out" "wte" \
    --output_dir="dpo" \
    --report_to=none

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  torch.utils._pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
[2024-06-03 20:17:31,679] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
  _torch_pytree._register_pytree_node(
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:00<00:00, 11.02it/s]
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:00<00:00, 10.41it/s]
Resolving data files: 100%|████████████████████| 20/20 [00:00<00:00, 108.02it/s]
Downloading data: 100%|██████████████████████| 315M/315M [00:12<00:00, 25.6MB/s]
Downloading data: 100%|██████████████████████| 313M/313M [00:12<00:00, 25.7MB/s]
Downloading data: 100%|██████████████████████| 314M/314M [00:12<00:00, 25.3MB/s]
Downloading data: 100%|██████████████████████| 312M/312M [00:12<00:00, 25.5MB/s]
Downloading data: 100%|██████████████████████| 313M/313M [00:12<00:00, 25.5MB/s]
Generating train split: 7435908 
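
For readers new to DPO, the objective that the trainer optimizes can be written in a few lines. The sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is a plain-Python illustration, not the code in `dpo.py`; the value `beta=0.1` and the example log-probabilities are made up for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a response under the
    policy being trained or under the frozen reference (SFT) model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -log_sigmoid(beta * (policy_margin - ref_margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy prefers the chosen answer more strongly than the reference: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy widens the gap between chosen and rejected responses relative to the reference model, which is why the SFT checkpoint from step 1 is needed as the starting point.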

#### Merging the adapters

To merge the adapters into the base model, we can use the `merge_peft_adapter.py` helper script that comes with TRL:

In [8]:
!python merge_peft_adapter.py --base_model_name="meta-llama/Llama-2-7b-hf" --adapter_model_name="dpo" --output_name="stack-llama-2"

  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.28it/s]


The script also pushes the merged model to your Hugging Face Hub account.
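
Conceptually, merging a LoRA adapter folds the low-rank update into each targeted weight matrix, W' = W + (alpha / r) · B · A, so that no adapter is needed at inference time. The toy example below illustrates that arithmetic with small hand-written matrices. It is not the code in `merge_peft_adapter.py`, and all names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into a base weight: W + (alpha / rank) * B @ A."""
    return add_scaled(weight, matmul(lora_b, lora_a), alpha / rank)


# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d_out x d_in
A = [[1.0, 2.0]]              # LoRA A, rank x d_in
B = [[0.5], [0.25]]           # LoRA B, d_out x rank
x = [[1.0], [1.0]]            # input as a column vector

merged = merge_lora(W, A, B, alpha=2, rank=1)

# Output with the adapter kept separate: W x + (alpha / r) * B (A x).
separate = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), 2 / 1)

print(merged)
print(matmul(merged, x) == separate)
```

The merged layer produces the same output as the base layer plus adapter, which is why the merged checkpoint can be loaded like any ordinary model.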

#### Running the model

We can load the DPO-trained model, with the LoRA adapters merged in by the previous step, and run it through the [text-generation example](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).

In [22]:
%cd ~/optimum-habana/examples/text-generation/
!python run_generation.py \
    --model_name_or_path ../trl/stack-llama-2/ \
    --use_hpu_graphs --use_kv_cache --batch_size 1 --bf16 --do_sample --max_new_tokens 50 \
    --temperature 0.5 \
    --top_p 0.5 \
    --prompt "When I go to New York I always go see "

/root/optimum-habana/examples/text-generation
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
06/03/2024 22:12:34 - INFO - __main__ - Single-device run.
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:00<00:00,  3.14it/s]
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056433764 KB
------------------------------------------------------------------------------
06/03/2024 22:12:41 - INFO - __main__ - Args: Namespace(device='hpu', model_name_or_path='../trl/stack-llama-2/', bf16=True, max_new_tokens=50, max_input_tokens=0, batch_size=1, warmup=3, n_iterations=5, local_rank=0, use_
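
The `--do_sample`, `--temperature 0.5`, and `--top_p 0.5` flags control how each new token is drawn. The sketch below shows the general idea of temperature scaling followed by nucleus (top-p) filtering on a toy three-token vocabulary. It is a simplified illustration and may differ in edge cases from the implementation used by `run_generation.py`; the function names and logits are made up.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total mass reaches top_p.

    Returns a dict mapping token index to renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.5, top_p=0.5, rng=random):
    """Draw one token index using temperature scaling and top-p filtering."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0]
probs = softmax_with_temperature(toy_logits, 0.5)
print([round(p, 4) for p in probs])
print(top_p_filter(probs, 0.5))  # only token 0 survives the filter here
print(sample(toy_logits))
```

A temperature below 1 sharpens the distribution and a low `top_p` trims the tail, so the settings in the command above favor conservative, high-probability continuations.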

## Next Steps
Now that you have access to the models in the Model-References and Optimum-Habana repositories, you can start exploring other models. The models in these repositories are documented, so they are easy to use.
* To explore more models from the Model References, start [here](https://github.com/HabanaAI/Model-References).
* To run more examples using Hugging Face, go [here](https://github.com/huggingface/optimum-habana?tab=readme-ov-file#validated-models).
* To migrate other models to Gaudi 2, refer to PyTorch Model Porting in the [documentation](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html).

In [None]:
exit()