Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.

#### Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Using Parameter Efficient Fine Tuning on Llama 2 with 7B Parameters on One Intel&reg; Gaudi&reg; 2 AI Accelerator
This example fine-tunes the Llama 2 7B model using Parameter Efficient Fine Tuning (PEFT) and then runs inference on a text prompt. It uses the Llama 2 model with two task examples from the Hugging Face Optimum Habana library. The Optimum Habana library is optimized for Deep Learning training and inference on First-gen Gaudi and Gaudi2 and offers tasks such as text generation, language modeling, question answering and more. For all the examples and models, please refer to the [Optimum Habana GitHub](https://github.com/huggingface/optimum-habana#validated-models).

The fine tuning runs on the timdettmers/openassistant-guanaco dataset using the Language-Modeling task in Optimum Habana.

### Parameter Efficient Fine Tuning with Low Rank Adaptation
Parameter Efficient Fine Tuning is a strategy for adapting large pre-trained language models to specific tasks while minimizing computational and memory demands. It aims to reduce the cost of fine-tuning large models while maintaining performance close to that of full fine-tuning.

PEFT starts with a pre-trained language model that has already learned a wide range of language understanding tasks from a large corpus of text data. These models are usually large and computationally expensive. Instead of fine-tuning the entire pre-trained model, PEFT keeps the pre-trained weights frozen and trains only a small set of additional, task-specific parameters. These additional parameters are far fewer than those of the base model.

With Low Rank Adaptation (LoRA), the method used in this example, the additional parameters are pairs of small low-rank matrices attached to selected weight matrices of the model (here the `q_proj` and `v_proj` attention projections), and only those matrices are trained.
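
To make the low-rank idea concrete, here is a minimal, pure-Python sketch of a LoRA-style forward pass. It is an illustration only, not the PEFT library's implementation: the layer size, the helper names (`matmul`, `transpose`, `lora_forward`) and the initialization are assumptions chosen to keep the example small. The frozen weight `W` is left untouched, and the output is `x W^T + (alpha / rank) * x A^T B^T`, where only `A` and `B` would be trained.

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

# Toy sizes (hypothetical); real projection layers are far larger.
d_out, d_in = 6, 6
rank, alpha = 2, 4

# Frozen pre-trained weight: never updated during LoRA fine tuning.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors. B starts at zero, so the update starts at zero.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

def lora_forward(x):
    """Frozen path plus the scaled low-rank update."""
    base = matmul(x, transpose(W))
    update = matmul(matmul(x, transpose(A)), transpose(B))
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

frozen_params = d_out * d_in
trainable_params = rank * d_in + d_out * rank
print(f"frozen: {frozen_params}, trainable: {trainable_params}")
```

Because `B` is initialized to zero in this sketch, the adapted layer initially produces the same output as the frozen layer; training then moves only `A` and `B`.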


In [2]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials

/root/Gaudi-tutorials/PyTorch/Single_card_tutorials




#### Set up the execution environment path and variables

In [None]:
import sys
print(sys.path)
sys.path=['/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '~/.local/lib/python3.10/site-packages', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages',"~/.local/bin","/usr/bin/habanatools","/usr/local/sbin","/usr/local/bin","/usr/sbin","/usr/bin","/sbin","/bin","/usr/games","/usr/local/games","/snap/bin"]

In [None]:
import os
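# PYTHONPATH entries are separated by os.pathsep (':' on Linux), and '~' must be expanded explicitly
os.environ['PYTHONPATH'] = os.pathsep.join([os.path.expanduser('~/Model-References'), '/usr/lib/habanalabs/'])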
os.environ['DATA_LOADER_AEON_LIB_PATH'] = '/usr/lib/habanalabs/libaeon.so'
os.environ['GC_KERNEL_PATH'] = '/usr/lib/habanalabs/libtpc_kernels.so'
os.environ['HABANA_PLUGINS_LIB_PATH'] = '/opt/habanalabs/habana_plugins'
os.environ['HABANA_SCAL_BIN_PATH'] = '/opt/habanalabs/engines_fw'


### Model Setup: 

##### Install the Parameter Efficient Fine Tuning library
This installs the Hugging Face PEFT library, which provides the PEFT methods used to fine-tune the Llama 2 model.

In [4]:
!pip install peft==0.10.0

Collecting peft==0.8.2
  Downloading peft-0.8.2-py3-none-any.whl.metadata (25 kB)
Downloading peft-0.8.2-py3-none-any.whl (183 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 183.4/183.4 kB 9.1 MB/s eta 0:00:00
Installing collected packages: peft
Successfully installed peft-0.8.2

##### Install the Optimum-Habana Library

In [5]:
!pip install -q optimum-habana==1.11.0

##### Pull the Hugging Face Examples from GitHub
These contain the working Hugging Face Task Examples that have been optimized for Gaudi.  For Fine Tuning, we'll use the language-modeling task. 

In [7]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials
!git clone -b v1.11.0 https://github.com/huggingface/optimum-habana.git

/root/Gaudi-tutorials/PyTorch/Single_card_tutorials
fatal: destination path 'optimum-habana' already exists and is not an empty directory.
/root/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana
HEAD is now at 1dfbc02 Release: v1.10.4
/root/Gaudi-tutorials/PyTorch/Single_card_tutorials


##### Go to the Language Modeling Task and install the model specific requirements

In [9]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/language-modeling
!pip install -q -r requirements.txt

/root/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/language-modeling

##### How to access and Use the Llama 2 model

Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/.
Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

To run gated models such as Llama-2-7b-hf, you need to:
- Have a Hugging Face account
- Agree to the terms of use of the model in its model card on the HF Hub
- Create a read token
- Log in to your account using the HF CLI: run `huggingface-cli login` before launching your script

In [10]:
!huggingface-cli login --token <your_token_here>

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Fine Tuning the model with PEFT and LoRA

We'll now run the fine tuning with the PEFT method. Remember that the PEFT methods only fine-tune a small number of extra model parameters, thereby greatly decreasing the computational and storage costs. Recent State-of-the-Art PEFT techniques achieve performance comparable to that of full fine-tuning.

##### Here's a summary of the command required to run the Fine Tuning; you'll run it in the next cell below.
Note the following:
1. It uses the language modeling task with LoRA: `run_lora_clm.py`
2. It's very efficient: only 0.06% of the total 7B parameters are fine-tuned.
3. Only 3 epochs are needed for fine tuning; it takes less than 20 minutes to run with the openassistant-guanaco dataset.
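
The 0.06% figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes the usual Llama 2 7B dimensions (32 decoder layers, hidden size 4096, about 6.74 billion parameters in total); these numbers are not stated in this tutorial and are assumptions here. The LoRA rank and target modules are the values passed to `run_lora_clm.py` in the next cell.

```python
# Assumed Llama 2 7B dimensions (not stated in this tutorial).
num_layers = 32
hidden_size = 4096
total_params = 6_738_415_616

# Values taken from the fine-tuning command: --lora_rank=8, --lora_target_modules "q_proj" "v_proj"
lora_rank = 8
target_modules = ["q_proj", "v_proj"]

# Each adapted projection is hidden_size x hidden_size, so LoRA adds
# A (rank x hidden_size) and B (hidden_size x rank) for it.
params_per_module = lora_rank * hidden_size + hidden_size * lora_rank
trainable_params = num_layers * len(target_modules) * params_per_module
trainable_pct = 100 * trainable_params / total_params

print(f"trainable params: {trainable_params:,}")
print(f"trainable share:  {trainable_pct:.2f}%")
```

Under these assumptions the count comes to roughly 4.2 million trainable parameters, which is consistent with the 0.06% quoted above.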


In [11]:
!python3 run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name timdettmers/openassistant-guanaco \
    --bf16 True \
    --output_dir ./model_lora_llama_single \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate 1e-4 \
    --warmup_ratio  0.03 \
    --lr_scheduler_type "constant" \
    --max_grad_norm  0.3 \
    --logging_steps 1 \
    --do_train \
    --do_eval \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --lora_rank=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --lora_target_modules "q_proj" "v_proj" \
    --dataset_concatenation \
    --report_to none \
    --max_seq_length 512 \
    --low_cpu_mem_usage True \
    --validation_split_percentage 4 \
    --adam_epsilon 1e-08

03/21/2024 00:26:36 - INFO - __main__ -   Training/evaluation parameters GaudiTrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
adjust_throughput=False,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=hccl,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=230,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tensor_cache_hpu_graphs=False,
disable_tqdm=False,
dispatch_batches=None,
distribution_strategy=ddp,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gaudi_config_name=None,
gradient_accum

#### LoRA Fine Tuning Completed
A `model_lora_llama_single` folder has now been created. It contains the PEFT model `adapter_model.bin`, which is used in the inference example below.
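
As a quick sanity check before moving on, the output folder can be inspected from Python. This is a minimal sketch: the helper name `list_adapter_files` is hypothetical, and filtering on the `adapter_` prefix is an assumption about how the adapter files are named (the exact file names and extensions may differ between PEFT versions).

```python
from pathlib import Path

def list_adapter_files(output_dir):
    """Return the sorted names of files in output_dir that start with 'adapter_'."""
    path = Path(output_dir).expanduser()
    if not path.is_dir():
        # Training has not produced the folder (or the path is wrong).
        return []
    return sorted(
        p.name for p in path.iterdir()
        if p.is_file() and p.name.startswith("adapter_")
    )

# Path matches the --output_dir value used in the fine-tuning command.
print(list_adapter_files("./model_lora_llama_single"))
```

An empty list suggests that the training run did not finish or that the notebook is not in the `language-modeling` directory.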

## Inference with Llama 2

We'll now use the Hugging Face `text-generation` task to run inference on the Llama 2 7B model; we'll generate text based on an included prompt.  Notice that we've included a path to the PEFT model that we just created.

First, we'll move to the text-generation examples folder and install the requirements. 

In [12]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/text-generation
!pip install -q -r requirements.txt

/root/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/text-generation

You will see that we are now running inference with the `run_generation.py` task and we are including the PEFT model that we Fine Tuned in the steps above. 

```
python3 run_generation.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--batch_size 1 \
--do_sample \
--max_new_tokens 250 \
--n_iterations 4 \
--use_hpu_graphs \
--use_kv_cache \
--bf16 \
--prompt "I am a dog. Please help me plan a surprise birthday party for my human, including fun activities, games and decorations. And don't forget to order a big bone-shaped cake for me to share with my fur friends!" \
--peft_model /root/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/language-modeling/model_lora_llama_single
```
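
The same command can also be assembled in Python as an argument list, which avoids shell-quoting problems when the prompt contains quotation marks. The sketch below is one possible way to do this; `build_generation_cmd` is a hypothetical helper, not part of Optimum Habana, and it only reuses the flags shown in the command above.

```python
import shlex

def build_generation_cmd(prompt, peft_model=None, max_new_tokens=250):
    """Build the run_generation.py argument list; peft_model is optional."""
    cmd = [
        "python3", "run_generation.py",
        "--model_name_or_path", "meta-llama/Llama-2-7b-hf",
        "--batch_size", "1",
        "--do_sample",
        "--max_new_tokens", str(max_new_tokens),
        "--n_iterations", "4",
        "--use_hpu_graphs",
        "--use_kv_cache",
        "--bf16",
        "--prompt", prompt,
    ]
    if peft_model is not None:
        # Only add the adapter path when running with the fine-tuned model.
        cmd += ["--peft_model", peft_model]
    return cmd

cmd = build_generation_cmd('Say "hello" to my dog', peft_model="./model_lora_llama_single")
# shlex.join shows a correctly quoted, copy-pasteable shell command.
print(shlex.join(cmd))
# To launch it from the text-generation folder:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` hands each argument to the script unchanged, so quotes inside the prompt do not need escaping.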

In [16]:
prompt = input("Enter a prompt for text generation: ")

Enter a prompt for text generation:  What is the history of Turkey?


In [17]:
cmd = f'python3 run_generation.py  --model_name_or_path meta-llama/Llama-2-7b-hf --batch_size 1 --do_sample --max_new_tokens 300 --n_iterations 4 \
      --use_hpu_graphs --use_kv_cache --bf16 --prompt "{prompt}" \
      --peft_model ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/language-modeling/model_lora_llama_single '
print(cmd)
import os
os.system(cmd)

python3 run_generation.py  --model_name_or_path meta-llama/Llama-2-7b-hf --batch_size 1 --do_sample --max_new_tokens 250 --n_iterations 4       --use_hpu_graphs --use_kv_cache --bf16 --prompt "What is the history of Turkey?"       --peft_model /root/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/language-modeling/model_lora_llama_single 


Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 11983.73it/s]
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 2182.26it/s]
03/21/2024 00:55:59 - INFO - __main__ - Single-device run.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.35it/s]
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056399524 KB
------------------------------------------------------------------------------
03/21/2024 00:56:18 - INFO - __main__ - Args: Namespace(device='hpu', model_name_or_path='meta-llama/Llama-2-7b-hf', bf16=True, max_new_tokens=250, max_input_tokens=0, batch_size=1, warmup=3, n_iterations=4, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, 

Warming up
Warming up
Warming up


03/21/2024 00:56:27 - INFO - __main__ - Running generate...



Input/outputs:
input 1: ('What is the history of Turkey?',)
output 1: ("What is the history of Turkey?### Assistant: Turkey has a long and rich history that dates back to the ancient civilizations of the Anatolian peninsula. The first human settlements in Anatolia date back to the Paleolithic period, and the region has been home to a variety of civilizations over the centuries, including the Hittites, the Assyrians, the Persians, and the Greeks.\n\nIn the 11th century BC, the ancient city of Troy was founded in northwestern Turkey, and it became a major center of trade and culture in the region. The Trojan War, a legendary conflict between the city of Troy and the Greek forces, is one of the most famous events in ancient history and is depicted in Homer's epic poem, the Iliad.\n\nIn the 7th century BC, the city of Miletus, located in modern-day Turkey, became a major center of Greek civilization and was home to some of the most influential philosophers and thinkers of the time, includ

0

###### Inference Output with PEFT

```
input 1: ("I am a dog. Please help me plan a surprise birthday party for my human, including fun activities, games and decorations. And don't forget to order a big bone-shaped cake for me to share with my fur friends!",)
output 1: ('I am a dog. Please help me plan a surprise birthday party for my human, including fun activities, games and decorations. And don\'t forget to order a big bone-shaped cake for me to share with my fur friends!

Assistant: Hey there pup! I can help you plan your human\'s birthday party. Here are some ideas for fun activities and games you can play together:\n\n
1. A "Find the Treat" scavenger hunt: Hide treats around your home or yard for your human to find. Provide clues and hints along the way.\n
2. "Tug-of-War": Play a game of tug-of-war with a rope tied to a tree stump or post.\n
3. "Frisbee Fun": Invite your human to a game of fetch with a Frisbee in the park or backyard.\n\n
Decorations can include: Dog-shaped balloons, paw print streamers, and a banner saying "Happy Birthday" with your human\'s name.\n\n
And don\'t forget to order a cake in the shape of a big bone for you and your fur friends to share!
```

##### Comparison without PEFT and LoRA
In this example, we're simply running the Llama 2 7B model **without** the PEFT fine-tuned model, so you lose the additional detail that fine tuning brings to the model, and the results have significantly less information and fidelity than those of the fine-tuned model.

In [18]:
cmd = f'python3 run_generation.py  --model_name_or_path meta-llama/Llama-2-7b-hf --batch_size 1 --do_sample --max_new_tokens 300 --n_iterations 4 \
      --use_hpu_graphs --use_kv_cache --bf16 --prompt "{prompt}"'
print(cmd)
import os
os.system(cmd)

python3 run_generation.py  --model_name_or_path meta-llama/Llama-2-7b-hf --batch_size 1 --do_sample --max_new_tokens 250 --n_iterations 4       --use_hpu_graphs --use_kv_cache --bf16 --prompt "What is the history of Turkey?"


Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 12035.31it/s]
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 2327.58it/s]
03/21/2024 00:57:03 - INFO - __main__ - Single-device run.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.36it/s]
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056399524 KB
------------------------------------------------------------------------------
03/21/2024 00:57:15 - INFO - __main__ - Args: Namespace(device='hpu', model_name_or_path='meta-llama/Llama-2-7b-hf', bf16=True, max_new_tokens=250, max_input_tokens=0, batch_size=1, warmup=3, n_iterations=4, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, 

Warming up
Warming up
Warming up


03/21/2024 00:57:24 - INFO - __main__ - Running generate...



Input/outputs:
input 1: ('What is the history of Turkey?',)
output 1: ('What is the history of Turkey?\nTurkey, the country of the Anatolian peninsula, was formed in 1923 from the remnants of the Ottoman Empire. The Ottoman Empire had been in existence since 1300. In 1923, Mustafa Kemal Atatürk, the founder of the modern Turkish Republic, proclaimed that the new nation would be a secular, democratic republic.\nAtatürk was a military leader who had been appointed by the Allies to lead the Turkish nationalist movement against the Ottoman Empire. He was the first president of the new Turkish Republic and was instrumental in the creation of the Turkish Constitution.\nThe Turkish Republic was founded on the principles of secularism and democracy. The government was to be based on the principle of separation of church and state. The Turkish Constitution was adopted in 1924 and established the Turkish Republic as a secular, democratic republic.\nThe Turkish Republic was a republic with a par

0

###### Inference Output without PEFT (using just standard Llama 2 model)

```
input 1: ("I am a dog. Please help me plan a surprise birthday party for my human, including fun activities, games and decorations. And don't forget to order a big bone-shaped cake for me to share with my fur friends!",)
output 1: ("I am a dog. Please help me plan a surprise birthday party for my human, including fun activities, games and decorations. And don't forget to order a big bone-shaped cake for me to share with my fur friends!\n

Make sure that you do not make a big noise because my human doesn’t know that we are planning a birthday party. Thanks to your help now I am sure there are no more things to worry about.\n
The dog does not have to worry that the human will find out about the party. She should not worry about the noise while planning the party. There will be big bone-shaped cake for the guest of honor to share with his fur friends. There will be fun activities, games and decorations. The following items are tagged newsletter marketing:\n
```

In [19]:
exit()