Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.

#### Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Fine Tuning Use Case on Intel® Gaudi® 2 AI Accelerator
This example shows how to run a typical model Fine Tuning use case on the Intel Gaudi Accelerator. You will see how to select a model, set up the environment, execute the workload, and review the resulting training throughput. Intel Gaudi supports PyTorch as the main framework for Fine Tuning.

This example will Fine Tune the Llama 3 70B model using Parameter Efficient Fine Tuning (PEFT).

Running Fine Tuning on the Intel Gaudi Accelerator is straightforward, and the code below takes you step-by-step through everything needed. In summary:  

•	Get access to an Intel Gaudi node; using the Intel® Tiber™ Developer Cloud is recommended.  
•	Run the Intel Gaudi PyTorch Docker image; this ensures that all the software is installed and configured properly.  
•	Select the model for execution by loading the desired Model Repository and appropriate libraries for model acceleration.   
•	Run the model and extract the details for evaluation. 

### Accessing The Intel Gaudi Node
To access an Intel Gaudi node in the Intel Tiber Developer Cloud, go to the [Intel Developer Cloud Console](https://console.cloud.intel.com/hardware), open the hardware instances, select the Intel® Gaudi® 2 platform for deep learning, and follow the steps to start and connect to the node.


### Docker Setup
Now that you have access to the node, use the latest Intel Gaudi Docker image. The `docker run` command below automatically downloads the image and starts the container in the background:

```
docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
```

The container is already running, so we enter its environment by issuing the following command: 
```
docker exec -it Gaudi_Docker bash
```
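
Before installing anything else, it can help to confirm that the container actually sees the Gaudi cards. The following is a minimal sketch, assuming the `habana_frameworks.torch.hpu` module that ships with the Intel Gaudi PyTorch image; the helper name `count_hpus` is hypothetical, and the function falls back to 0 on a machine without the Gaudi software stack.

```python
def count_hpus() -> int:
    """Return the number of visible Gaudi devices, or 0 if none can be detected."""
    try:
        # Provided by the Intel Gaudi PyTorch image; absent on ordinary machines.
        import habana_frameworks.torch.hpu as hthpu
    except ImportError:
        return 0
    # Only ask for a device count when the HPU backend reports it is usable.
    return hthpu.device_count() if hthpu.is_available() else 0


if __name__ == "__main__":
    print(f"Visible Gaudi devices: {count_hpus()}")
```

On a full Intel Gaudi 2 node started with `HABANA_VISIBLE_DEVICES=all`, the count should match the `--world_size 8` value used later in this example.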


### Model Setup 
Now that we’re running in a docker environment, we can install the remaining libraries and model repositories.
Start in the home directory and install the DeepSpeed library; DeepSpeed is used to reduce memory consumption on Intel Gaudi while running large language models. 


In [None]:
%cd ~
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.16.2

Now install the Hugging Face Optimum Habana library and GitHub examples. Note that we’re selecting the latest validated release of Optimum Habana:

In [None]:
!pip install optimum-habana==1.12.0
!git clone -b v1.12.0 https://github.com/huggingface/optimum-habana 

Finally, we move to the language-modeling example directory and install the final set of requirements to run the model:

In [None]:
%cd ~/optimum-habana/examples/language-modeling
!pip install -r requirements.txt
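
Because this walkthrough pins specific releases, a quick version check can catch a mismatched environment before a long run starts. This is a minimal sketch using only the Python standard library; the helper names `installed_versions` and `mismatches` are hypothetical, and the only pin checked is the `optimum-habana==1.12.0` release installed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pin taken from the pip command in this walkthrough.
EXPECTED = {"optimum-habana": "1.12.0"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected):
    """Return {package: (wanted, found)} for every package that differs from its pin."""
    found = installed_versions(expected)
    return {
        name: (wanted, found[name])
        for name, wanted in expected.items()
        if found[name] != wanted
    }


if __name__ == "__main__":
    # Report DeepSpeed too, without pinning it: the fork's version string is not assumed here.
    print(installed_versions(["optimum-habana", "deepspeed"]))
    print("Mismatched pins:", mismatches(EXPECTED))
```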

### How to Access and Use the Llama 3 Model
Use of the pretrained model is subject to compliance with third party licenses, including the “META LLAMA 3 COMMUNITY LICENSE AGREEMENT”. For guidance on the intended use of the LLAMA 3 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://llama.meta.com/llama3/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses. To run gated models like Llama-3-70B, you need the following:

•	Have a HuggingFace account and agree to the terms of use of the model in its model card on the HF Hub  
•	Create a read token and request access to the Llama 3 model from meta-llama  
•	Login to your account using the HF CLI:   

In [None]:
!huggingface-cli login --token <your_hugging_face_token_here>
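
As an alternative to pasting the token into a notebook cell, the login can be done from Python with the token read from an environment variable. This is a minimal sketch, assuming the `huggingface_hub` package is available in the container and that the token has been exported beforehand; the variable name `HF_TOKEN` and the helper `login_from_env` are illustrative choices, not part of the original instructions.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in to the Hugging Face Hub with a token from the environment.

    Returns False without contacting the Hub when the variable is unset or empty.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the missing-token path works without the package installed.
    from huggingface_hub import login

    login(token=token)
    return True


if __name__ == "__main__":
    if not login_from_env():
        print("No token found in the environment; use the CLI login above instead.")
```

Keeping the token out of the notebook source avoids saving it accidentally alongside the cell outputs.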

### Fine Tuning a Simple GPT Model
Let’s start with a simple fine tuning example from the Hugging Face language modeling page. It uses the wikitext dataset to fine tune the gpt2 model. Fine tuning this model takes only a few minutes, and you can see the fine tuned model output in the `test-clm` folder.

In [None]:
!python3 run_clm.py \
  --model_name_or_path gpt2 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --do_train \
  --do_eval \
  --overwrite_output_dir \
  --report_to none \
  --output_dir ./test-clm \
  --gaudi_config_name Habana/gpt2 \
  --use_habana \
  --use_lazy_mode \
  --use_hpu_graphs \
  --throughput_warmup_steps 3
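
Once the run finishes, the metrics can be read back from the output folder instead of scrolling through the log. This is a minimal sketch, assuming the script writes an `all_results.json` file into `--output_dir` the way the Hugging Face `Trainer` normally does when metrics are saved; the helper names are hypothetical, and perplexity is computed as the exponential of the evaluation loss, which is the usual convention for causal language modeling.

```python
import json
import math
from pathlib import Path


def load_metrics(output_dir):
    """Return the saved metrics as a dict, or {} if no metrics file exists yet."""
    path = Path(output_dir) / "all_results.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(eval_loss)


if __name__ == "__main__":
    metrics = load_metrics("./test-clm")
    if "eval_loss" in metrics:
        print(f"eval_loss={metrics['eval_loss']:.4f}  perplexity={perplexity(metrics['eval_loss']):.2f}")
    else:
        print("No evaluation metrics found; has the run finished?")
```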

### Fine Tuning the Llama 3 70B Model 
We’re now ready to fine tune the full Llama 3 70B model. Because Llama 3 70B is a large model, we’ll use the DeepSpeed library to manage the HBM usage on each Intel Gaudi card more efficiently. We’ll also apply some additional techniques for Fine Tuning:  

•	Parameter Efficient Fine Tuning (PEFT) is a strategy for adapting large pre-trained language models to specific tasks. Instead of fine-tuning the entire pre-trained model, PEFT trains a small set of additional task-specific weights while the base model stays frozen. This example uses LoRA (the `run_lora_clm.py` script), which attaches small low-rank adapter matrices to the selected attention projections. These additional weights have far fewer parameters than the base model.  
•	DeepSpeed significantly optimizes training efficiency, reducing both computational and memory requirements. It enables the handling of extremely large models by providing advanced parallelism techniques and memory optimization strategies.  
•	Flash Attention reduces memory usage and increases computational speed through a fused implementation. On Intel Gaudi this is provided by FusedSDPA (Scaled Dot Product Attention), which applies similar principles to the Gaudi processor environment, optimizing the scaled dot product attention function with reduced memory usage and faster performance while maintaining compatibility with standard PyTorch functionality.  
•	Setting epochs = 2; this is enough to bring the training loss below 1.0, so running more epochs is not needed.  
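
To make the "fewer parameters" point about PEFT concrete, the sketch below estimates how many weights LoRA trains for the settings used in this example (`--lora_rank 4` on `q_proj`, `k_proj`, `v_proj` and `o_proj`). The model dimensions are assumptions taken from the published Llama 3 70B configuration (hidden size 8192, 80 layers, 64 attention heads, 8 key/value heads), and the count uses the standard LoRA layout of two matrices per adapted layer.

```python
# Assumed Llama 3 70B dimensions (from the published model configuration).
HIDDEN_SIZE = 8192
NUM_LAYERS = 80
NUM_HEADS = 64
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS      # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM         # 1024, because of grouped-query attention


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) to one linear layer."""
    return rank * (in_features + out_features)


def total_lora_params(rank: int) -> int:
    """Trainable LoRA weights across all layers for q_proj, k_proj, v_proj and o_proj."""
    per_layer = (
        lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)    # q_proj
        + lora_params(HIDDEN_SIZE, KV_DIM, rank)       # k_proj
        + lora_params(HIDDEN_SIZE, KV_DIM, rank)       # v_proj
        + lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)  # o_proj
    )
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    trainable = total_lora_params(rank=4)
    print(f"Trainable LoRA parameters: {trainable:,}")
    print(f"Share of a nominal 70B model: {trainable / 70e9:.4%}")
```

Under these assumptions the adapters hold about 16.4 million weights, a small fraction of a percent of the base model, which is why only the adapter file needs to be saved after training.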


In [None]:
!PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python3 ../gaudi_spawn.py --use_deepspeed  --world_size 8  run_lora_clm.py \
  --model_name_or_path meta-llama/Meta-Llama-3-70B-Instruct \
  --deepspeed llama2_ds_zero3_config.json \
  --dataset_name tatsu-lab/alpaca \
  --bf16 True \
  --output_dir ./llama3_fine_tuning_output \
  --num_train_epochs 2 \
  --max_seq_len 2048 \
  --per_device_train_batch_size 10 \
  --per_device_eval_batch_size 10 \
  --gradient_checkpointing \
  --evaluation_strategy epoch \
  --eval_delay 2 \
  --save_strategy no \
  --learning_rate 0.0018 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --dataset_concatenation \
  --attn_softmax_bf16 True \
  --do_train \
  --do_eval \
  --use_habana \
  --use_lazy_mode \
  --pipelining_fwd_bwd \
  --throughput_warmup_steps 3 \
  --report_to none \
  --lora_rank 4 \
  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
  --validation_split_percentage 4 \
  --use_flash_attention True \
  --flash_attention_causal_mask True
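
A back-of-the-envelope calculation shows why the run shards the model across all eight cards. This is a rough sketch only: it assumes a nominal 70 billion parameters stored in bf16 (2 bytes each), that the ZeRO stage 3 configuration partitions the weights evenly across the `--world_size 8` processes, and it ignores activations, gradients, optimizer state and the LoRA adapters.

```python
NOMINAL_PARAMS = 70e9   # assumption: nominal parameter count of Llama 3 70B
BYTES_PER_BF16 = 2      # bf16 stores each weight in 2 bytes
WORLD_SIZE = 8          # --world_size in the launch command


def weight_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Size of the raw weights in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_card_gb(num_params: float, world_size: int) -> float:
    """Weights held by each card when they are partitioned evenly across cards."""
    return weight_gb(num_params) / world_size


if __name__ == "__main__":
    print(f"Full bf16 weights:   {weight_gb(NOMINAL_PARAMS):.1f} GB")
    print(f"Per card (8 shards): {per_card_gb(NOMINAL_PARAMS, WORLD_SIZE):.1f} GB")
```

The unsharded weights alone exceed the memory of a single card, while one eighth of them leaves room for activations and the other training state.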


The result of the run shows that the Fine Tuning of the model required only 38 minutes and achieved 2.2 samples (or sentences) per second.
```
***** train metrics *****
  epoch                       =        2.0
  max_memory_allocated (GB)   =      94.53
  memory_allocated (GB)       =      27.15
  total_flos                  =  1037280GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.1525
  train_runtime               = 0:38:47.30
  train_samples_per_second    =      2.221
  train_steps_per_second      =      0.028
```
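
The reported throughput figures can be cross-checked against the launch settings. The sketch below assumes one optimizer step consumes `world_size x per_device_train_batch_size` samples, with gradient accumulation left at 1 (it is not set in the command, so this is an assumption), and parses the `train_runtime` string from the metrics above; the helper names are hypothetical.

```python
WORLD_SIZE = 8            # --world_size
PER_DEVICE_BATCH = 10     # --per_device_train_batch_size
GRAD_ACCUM_STEPS = 1      # assumption: not set in the command

GLOBAL_BATCH = WORLD_SIZE * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS


def runtime_seconds(text: str) -> float:
    """Convert an 'H:MM:SS.ss' runtime string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def steps_per_second(samples_per_second: float, global_batch: int) -> float:
    """Optimizer steps per second implied by the sample throughput."""
    return samples_per_second / global_batch


if __name__ == "__main__":
    elapsed = runtime_seconds("0:38:47.30")
    implied = steps_per_second(2.221, GLOBAL_BATCH)
    print(f"Runtime: {elapsed:.1f} s, global batch: {GLOBAL_BATCH}")
    print(f"Implied steps/s: {implied:.3f} (reported: 0.028)")
```

With these settings the implied step rate rounds to the reported `train_steps_per_second`, so the two throughput numbers are consistent with each other.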

See the output in the `llama3_fine_tuning_output` folder for the resulting files from Fine Tuning. The key file is `adapter_model.safetensors`, which contains the additional weights generated by the Parameter Efficient Fine Tuning rather than the full model. These weights can be used for Inference. 

In [None]:
%cd llama3_fine_tuning_output
%ls -al
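
To confirm that the saved adapter matches the settings used for training, the adapter configuration can be inspected directly. This is a minimal sketch, assuming the PEFT library wrote an `adapter_config.json` file next to `adapter_model.safetensors` with the usual `peft_type`, `r` and `target_modules` keys; the helper names are hypothetical, and the default path is the current directory because the previous cell changed into the output folder.

```python
import json
from pathlib import Path


def summarize_config(config: dict) -> dict:
    """Pick out the fields that should match the training command."""
    return {
        "peft_type": config.get("peft_type"),
        "rank": config.get("r"),
        "target_modules": sorted(config.get("target_modules") or []),
    }


def summarize_adapter(output_dir="."):
    """Return a summary of adapter_config.json, or None if the file is missing."""
    config_path = Path(output_dir) / "adapter_config.json"
    if not config_path.exists():
        return None
    return summarize_config(json.loads(config_path.read_text()))


if __name__ == "__main__":
    summary = summarize_adapter(".")
    print(summary if summary else "adapter_config.json not found in this folder")
```

For this run the summary would be expected to show rank 4 and the four attention projections passed to `--lora_target_modules`.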

This Fine Tuned model can now be applied to an inference use case such as text-generation. 

### Next Steps 
Now that you have run a full fine tuning case, you can go back to the Hugging Face Optimum Habana validated models to see more options for running inference with this model or to Fine Tune other models. 


In [None]:
exit()