Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Using Parameter Efficient Fine Tuning on Llama2
This example fine-tunes the Llama2-7B model using Parameter Efficient Fine Tuning (PEFT) and then runs inference on a text prompt. It uses the Llama2 model with two task examples from the Optimum Habana library hosted by Hugging Face. The Optimum Habana library is optimized for Deep Learning training and inference on First-gen Gaudi and Gaudi2 and offers tasks such as text generation, language modeling, question answering and more. For all the examples and models, please refer to the [Optimum Habana GitHub](https://github.com/huggingface/optimum-habana#validated-models).

The fine-tuning step uses the timdettmers/openassistant-guanaco dataset with the language-modeling task in Optimum Habana.

### Parameter Efficient Fine Tuning
Parameter Efficient Fine Tuning is a strategy for adapting large pre-trained language models to specific tasks while minimizing computational and memory demands. It aims to reduce the cost of fine-tuning large models while maintaining or even improving their performance. It does so by adding a small number of task-specific parameters, in some approaches combined with knowledge distillation or few-shot learning, resulting in efficient yet effective models for various natural language understanding tasks.

PEFT starts with a pre-trained language model that has already learned a wide range of language understanding tasks from a large corpus of text data. These models are usually large and computationally expensive to train. Instead of fine-tuning the entire pre-trained model, PEFT adds one or a few task-specific layers to the pre-trained model and trains only those. These additional layers are much smaller and have far fewer parameters than the base model.
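
To make the idea of "a frozen base model plus a small trainable addition" concrete, here is a minimal pure-Python sketch in the style of low-rank adaptation (LoRA), one common PEFT method. The layer size `d`, the rank `r`, and the helper names are illustrative assumptions; this is a toy, not the implementation used by the PEFT library.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


random.seed(0)
d, r = 16, 2  # assumed toy sizes: layer width and adapter rank

# Frozen pre-trained weight (d x d): never updated during fine-tuning.
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
# Trainable adapter factors: A is r x d, B is d x r and starts at zero.
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]


def forward(x):
    """Base output plus the low-rank correction B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


x = [1.0] * d
# With B initialised to zero, the adapted layer matches the base layer.
print(forward(x) == matvec(W, x))

frozen_count = d * d
trainable_count = r * d + d * r
print(f"frozen: {frozen_count}, trainable: {trainable_count}")
```

Only `A` and `B` would receive gradient updates, so the number of trained values grows with `r * d` rather than `d * d`.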


In [12]:
%cd /root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference

/root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference


### Model Setup: 

##### Install the Habana Deepspeed Library

In [2]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0

Collecting git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0
  Cloning https://github.com/HabanaAI/DeepSpeed.git (to revision 1.11.0) to /tmp/pip-req-build-13fj4hk5
  Running command git clone --filter=blob:none --quiet https://github.com/HabanaAI/DeepSpeed.git /tmp/pip-req-build-13fj4hk5
  Running command git checkout -b 1.11.0 --track origin/1.11.0
  Switched to a new branch '1.11.0'
  Branch '1.11.0' set up to track remote branch '1.11.0' from 'origin'.
  Resolved https://github.com/HabanaAI/DeepSpeed.git to commit a24dac1fa60f4e229da854b494ef40a086792521
  Preparing metadata (setup.py) ... done

##### Install the Parameter Efficient Fine Tuning Library
This installs the PEFT library from the Hugging Face GitHub repository. It provides the PEFT methods used to fine-tune the Llama2 model.

In [3]:
!git clone https://github.com/huggingface/peft.git
%cd peft
!pip install .
%cd ..

Cloning into 'peft'...
remote: Enumerating objects: 4582, done.
remote: Counting objects: 100% (1519/1519), done.
remote: Compressing objects: 100% (402/402), done.
remote: Total 4582 (delta 1299), reused 1202 (delta 1089), pack-reused 3063
Receiving objects: 100% (4582/4582), 8.57 MiB | 24.72 MiB/s, done.
Resolving deltas: 100% (2977/2977), done.
/root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/peft
Processing /root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/peft
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: peft
  Building wheel for peft (pyproject.toml) ... done
  Created wheel for peft: filename=peft-0.6.0.dev0-py3-none-any.whl size=121547 sha256=498919930b7a2530501d7e54b217252e91cc77cd7d2dcbdf78d858d381dbb8f3
  Stored in directory: /tmp/pip-ephem-wheel-cache-ccrb2zeq/wheels/cf/83/d

##### Install the Optimum-Habana Library

In [4]:
!pip install --upgrade-strategy eager optimum[habana]

##### Pull the Hugging Face Examples from GitHub
The repository contains the working Hugging Face task examples that have been optimized for Gaudi. For Fine Tuning, we'll use the language-modeling task.

In [5]:
!git clone https://github.com/huggingface/optimum-habana

Cloning into 'optimum-habana'...
remote: Enumerating objects: 5793, done.
remote: Counting objects: 100% (2832/2832), done.
remote: Compressing objects: 100% (978/978), done.
remote: Total 5793 (delta 2293), reused 2046 (delta 1820), pack-reused 2961
Receiving objects: 100% (5793/5793), 2.96 MiB | 14.87 MiB/s, done.
Resolving deltas: 100% (3734/3734), done.


##### Go to the Language Modeling Task and install the model specific requirements

In [13]:
%cd optimum-habana/examples/language-modeling/
!pip install -r requirements.txt

/root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/optimum-habana/examples/language-modeling
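
Once the installs above have finished, it can help to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library. The distribution names `deepspeed`, `peft`, and `optimum-habana` are assumed to match what pip installed, and `installed_versions` is a hypothetical helper, not part of any of these libraries.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names for the three libraries installed above.
for name, version in installed_versions(["deepspeed", "peft", "optimum-habana"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```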


##### Set the Hugging Face cli login token to access the Llama2 model

To run gated models such as Llama-2-7b-hf, you need to:
- Have a Hugging Face account
- Agree to the model's terms of use on its model card on the HF Hub
- Create a read token
- Log in to your account using the HF CLI: run `huggingface-cli login` before launching your script

In [14]:
!huggingface-cli login --token <your_token_here> 

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
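
To avoid pasting the token into a notebook cell, one option is to read it from an environment variable and build the login command from that. This is a sketch under assumptions: the variable name `HF_TOKEN` is chosen here for illustration, and `build_login_command` is a hypothetical helper. The `huggingface-cli login --token` form is the one used in the cell above.

```python
import os


def build_login_command(env_var="HF_TOKEN"):
    """Return the CLI login command as an argument list, reading the token from the environment."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your read token first.")
    return ["huggingface-cli", "login", "--token", token]


# Report status without ever printing the token itself.
if os.environ.get("HF_TOKEN"):
    print("Token found; command:", " ".join(build_login_command()[:3]), "<redacted>")
else:
    print("HF_TOKEN is not set.")
```

The returned list can be passed to `subprocess.run(..., check=True)` so the token never appears in the notebook source.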


## Fine Tuning the model with PEFT and LoRA

We'll now run the fine tuning with the PEFT method. Remember that the PEFT methods only fine-tune a small number of extra model parameters, thereby greatly decreasing the computational and storage costs. Recent State-of-the-Art PEFT techniques achieve performance comparable to that of full fine-tuning.

##### Here's a summary of the command required to run the Fine Tuning; you'll run it in the next cell.
Note the following:
1. Language modeling with LoRA is run through `run_lora_clm.py`.
2. It's very efficient: only 0.06% of the total 7B parameters are fine-tuned.
3. The maximum memory used was 33.03 GB out of 94.61 GB available.
4. Only 2 epochs are needed for fine tuning, which takes less than 6 minutes to run.
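
The figures in the list above can be sanity-checked with a little arithmetic. The sketch below assumes a LoRA rank of 8 applied to two projection matrices (`q_proj` and `v_proj`) in each of the 32 decoder layers of Llama-2-7B, with a hidden size of 4096 and roughly 6.74 billion total parameters. The rank and the choice of adapted modules are assumptions, since this tutorial does not state the LoRA configuration.

```python
HIDDEN_SIZE = 4096       # Llama-2-7B hidden dimension
NUM_LAYERS = 32          # Llama-2-7B decoder layers
TOTAL_PARAMS = 6.74e9    # approximate parameter count of the "7B" model
RANK = 8                 # ASSUMED LoRA rank
ADAPTED_PER_LAYER = 2    # ASSUMED adapted modules per layer (q_proj, v_proj)

# Each adapted module adds two factors: (rank x hidden) and (hidden x rank).
per_module = RANK * (HIDDEN_SIZE + HIDDEN_SIZE)
trainable = per_module * ADAPTED_PER_LAYER * NUM_LAYERS
trainable_pct = 100 * trainable / TOTAL_PARAMS

# Peak memory share, using the two figures quoted in the list above.
memory_pct = 100 * 33.03 / 94.61

print(f"trainable parameters: {trainable:,} ({trainable_pct:.2f}% of total)")
print(f"peak memory share:    {memory_pct:.1f}% of available memory")
```

Under these assumptions the adapter has about 4.2 million trainable parameters, which is consistent with the 0.06% quoted above.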

```
python ../gaudi_spawn.py \
       --world_size 8    --use_mpi run_lora_clm.py \
       --model_name_or_path meta-llama/Llama-2-7b-hf  \
       --dataset_name timdettmers/openassistant-guanaco \
       --bf16 True \
       --output_dir ./model_lora_llama \
       --num_train_epochs 2 \
       --per_device_train_batch_size 2 \
       --per_device_eval_batch_size 2 \
       --gradient_accumulation_steps 4 \
       --evaluation_strategy "no" \
       --save_strategy "steps" \
       --save_steps 2000 \
       --save_total_limit 1 \
       --learning_rate 1e-4 \
       --logging_steps 1 \
       --dataset_concatenation \
       --do_train \
       --use_habana \
       --use_lazy_mode \
       --throughput_warmup_steps 3
```
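
The batch-related flags in this command combine into a single effective batch size per optimizer step. A minimal sketch of that arithmetic, assuming plain data parallelism across all devices (the function name is hypothetical):

```python
def effective_batch_size(world_size, per_device_batch_size, gradient_accumulation_steps):
    """Samples contributing to one optimizer step across all devices."""
    return world_size * per_device_batch_size * gradient_accumulation_steps


# Values taken from the command above:
# --world_size 8, --per_device_train_batch_size 2, --gradient_accumulation_steps 4
print(effective_batch_size(8, 2, 4))  # 64
```

Changing any one of the three flags changes this product, which may call for revisiting the learning rate.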

In [15]:
!python ../gaudi_spawn.py --world_size 8 --use_mpi run_lora_clm.py --model_name_or_path meta-llama/Llama-2-7b-hf  --dataset_name timdettmers/openassistant-guanaco --bf16 True --output_dir ./model_lora_llama --num_train_epochs 2 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 1e-4 --logging_steps 1 --dataset_concatenation --do_train --use_habana --use_lazy_mode --throughput_warmup_steps 3

[2023-10-02 22:07:50,564] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)
Running with the following model specific env vars: 
MASTER_ADDR=localhost
MASTER_PORT=12345
DistributedRunner run(): command = mpirun -n 8 --bind-to core --map-by socket:PE=10 --rank-by core --report-bindings --allow-run-as-root /usr/bin/python run_lora_clm.py --model_name_or_path meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --bf16 True --output_dir ./model_lora_llamaOA --num_train_epochs 2 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 --learning_rate 1e-4 --logging_steps 1 --dataset_concatenation --do_train --use_habana --use_lazy_mode --throughput_warmup_steps 3
[sc09super17-klb2:59466] MCW rank 4 bound to socket 1[core 40[hwt 0-1]], socket 1[core 41[hwt 0-1]], socket 1[core 42[hwt 0-1]], socket 1[core 43

#### LoRA Fine Tuning Completed
You will now see a `model_lora_llama` folder, which contains the PEFT adapter weights `adapter_model.bin` used in the inference example below.
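
A quick way to confirm that training produced the adapter is to list the adapter files in the output folder. This sketch assumes the files share the `adapter_` prefix (for example `adapter_model.bin` and `adapter_config.json`). Newer PEFT versions may save `adapter_model.safetensors` instead of the `.bin` file. `find_adapter_files` is a hypothetical helper.

```python
from pathlib import Path


def find_adapter_files(output_dir):
    """Return the sorted names of files starting with 'adapter_' in output_dir."""
    return sorted(p.name for p in Path(output_dir).glob("adapter_*") if p.is_file())


# Path relative to the language-modeling example folder used for training.
files = find_adapter_files("./model_lora_llama")
print(files if files else "No adapter files found; check the --output_dir value.")
```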

## Inference with Llama2

We'll now use the Hugging Face `text-generation` task to run inference on the Llama2-7b model; we'll generate text based on an included prompt.  Notice that we've included a path to the PEFT model that we just created.

First, we'll move to the text-generation examples folder and install the requirements.

In [16]:
%cd /root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/optimum-habana/examples/text-generation
!pip install -q -r requirements.txt

/root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/optimum-habana/examples/text-generation

You will see that we are now running inference with the `run_generation.py` task and we are including the PEFT model that we Fine Tuned in the steps above. 

```
python run_generation.py \
   --model_name_or_path meta-llama/Llama-2-7b-hf \
   --batch_size 1 \
   --max_new_tokens 500 \
   --n_iterations 4 \
   --use_kv_cache \
   --use_hpu_graphs \
   --bf16 \
   --prompt 'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.' \
   --peft_model /root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/optimum-habana/examples/language-modeling/model_lora_llama/
```
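
Because the prompt itself contains double quotes, shell quoting is easy to get wrong. One way to sidestep this is to build the command from an argument list and let Python's `shlex` module do the quoting. This is a sketch, not part of the Optimum Habana examples. The flags and paths are copied from the command above.

```python
import shlex

prompt = (
    "Can you write a short introduction about the relevance of the term "
    '"monopsony" in economics? Please use examples related to potential '
    "monopsonies in the labour market and cite relevant research."
)

args = [
    "python", "run_generation.py",
    "--model_name_or_path", "meta-llama/Llama-2-7b-hf",
    "--batch_size", "1",
    "--max_new_tokens", "500",
    "--n_iterations", "4",
    "--use_kv_cache",
    "--use_hpu_graphs",
    "--bf16",
    "--prompt", prompt,
    "--peft_model",
    "/root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/optimum-habana/examples/language-modeling/model_lora_llama/",
]

# shlex.join (Python 3.8+) quotes each argument so the shell sees it unchanged.
command = shlex.join(args)
print(command)

# Round-trip check: splitting the command recovers the original arguments.
print(shlex.split(command) == args)
```

Passing `args` directly to `subprocess.run(args, check=True)` avoids the shell altogether.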

In [19]:
!python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --batch_size 1 --max_new_tokens 500 --n_iterations 4 --use_kv_cache --use_hpu_graphs --bf16 --prompt 'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.' --peft_model /root/Gaudi-tutorials/PyTorch/llama2_fine_tuning_inference/optimum-habana/examples/language-modeling/model_lora_llama/

10/02/2023 22:19:40 - INFO - __main__ - Single-device run.
[2023-10-02 22:19:41,267] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)
Fetching 2 files: 100%|████████████████████████| 2/2 [00:00<00:00, 37786.52it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00,  1.48it/s]
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056446944 KB
------------------------------------------------------------------------------
10/02/2023 22:20:42 - INFO - __main__ - Args: Namespace(attn_softmax_bf16=False, bad_words=None, batch_size=1, bf16=True, bucket_size=-1, column_name=None, dataset_max_samples=-1, dataset_name=None, device='hpu', do_sample=F

##### Comparison without PEFT and LoRA
In this example, we simply run the Llama2 7B model **without** the PEFT fine-tuned adapter, so you lose the additional detail it brings to the model. The results have significantly less information and fidelity than those of the fine-tuned model.

In [20]:
!python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-hf --batch_size 1 --max_new_tokens 500 --n_iterations 4 --use_kv_cache --use_hpu_graphs --bf16 --prompt 'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.'

10/02/2023 22:21:58 - INFO - __main__ - Single-device run.
[2023-10-02 22:21:59,140] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)
Fetching 2 files: 100%|████████████████████████| 2/2 [00:00<00:00, 35246.25it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.01it/s]
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056446944 KB
------------------------------------------------------------------------------
10/02/2023 22:23:00 - INFO - __main__ - Args: Namespace(attn_softmax_bf16=False, bad_words=None, batch_size=1, bf16=True, bucket_size=-1, column_name=None, dataset_max_samples=-1, dataset_name=None, device='hpu', do_sample=F

In [None]:
exit()