Copyright (c) 2023 Habana Labs, Ltd. an Intel Company.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Inference on the Intel&reg; Gaudi&reg; 2 AI Accelerator
This example shows how to run inference on the Llama 2 70B model using the Hugging Face Optimum Habana library. Optimum Habana is optimized for deep learning inference on tasks such as text generation, language modeling, and question answering. Its model examples are fully optimized and documented, and should be used as a starting point for model execution. For the full list of examples and models, refer to the [Optimum Habana GitHub](https://github.com/huggingface/optimum-habana#validated-models).

In this example, you will see how to select a model, set up the environment, execute the workload, and then see a price-performance comparison. Intel Gaudi supports PyTorch as the main framework for inference.

Running inference on the Intel Gaudi accelerator is straightforward. The code below walks step by step through everything needed; in summary:

•	Get access to an Intel Gaudi node; using the Intel® Tiber™ Developer Cloud is recommended.  
•	Run the Intel Gaudi PyTorch Docker image; this ensures that all the software is installed and configured properly.  
•	Select the model for execution by loading the desired model repository and the appropriate libraries for model acceleration.  
•	Run the model and extract the details for evaluation.  



### Accessing The Intel Gaudi Node
To access an Intel Gaudi node in the Intel Tiber Developer Cloud, go to the [Intel Developer Cloud Console](https://console.cloud.intel.com/hardware), open the hardware instances, and select the Intel® Gaudi® 2 platform for deep learning. Then follow the steps to start and connect to the node.


### Docker Setup
Now that you have access to the node, use the latest Intel Gaudi Docker image. The `docker run` command below automatically downloads the image and starts the container:

```
docker run -itd --name Gaudi_Docker --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```

The container is now running in the background. Enter it by issuing the following command:
```
docker exec -it Gaudi_Docker bash
```
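The image reference in the `docker run` command embeds the Intel Gaudi software version (1.18.0), the OS, and the PyTorch version (2.4.0), and the same software version reappears later in the DeepSpeed install. As a minimal sketch, the helper below composes the command from those parts so the versions are stated once. The function names are hypothetical, and the assumption that other releases follow the same path layout should be checked against the Intel Gaudi documentation.

```python
# Hypothetical helpers: compose the image reference and the docker run
# arguments used above from their individual parts.

def gaudi_image(sw_version: str, os_name: str, pytorch_version: str) -> str:
    """Build the image reference; the path layout is copied from the command above."""
    return (
        f"vault.habana.ai/gaudi-docker/{sw_version}/{os_name}"
        f"/habanalabs/pytorch-installer-{pytorch_version}:latest"
    )


def docker_run_args(image: str, name: str = "Gaudi_Docker") -> list:
    """Return the same flags as the docker run command above, as an argument list."""
    return [
        "docker", "run", "-itd", "--name", name,
        "--runtime=habana",
        "-e", "HABANA_VISIBLE_DEVICES=all",
        "-e", "OMPI_MCA_btl_vader_single_copy_mechanism=none",
        "--cap-add=sys_nice", "--net=host", "--ipc=host",
        image,
    ]


IMAGE = gaudi_image("1.18.0", "ubuntu22.04", "2.4.0")
print(" ".join(docker_run_args(IMAGE)))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the printed line can be pasted into a shell on the Gaudi node.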


### Model Setup 
Now that we’re running inside the Docker container, we can install the remaining libraries and model repositories.
Start in the root directory and install the DeepSpeed library; DeepSpeed is used to reduce memory consumption on Intel Gaudi while running large language models. 


In [None]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0

Now install the Hugging Face Optimum Habana library and its GitHub examples. Note that we’re selecting the latest validated release of Optimum Habana:

In [None]:
%cd ~
!git clone -b v1.14.1 https://github.com/huggingface/optimum-habana
!pip install optimum-habana==1.14.1

Finally, we transition to the text-generation example and install the final set of requirements to run the model:

In [None]:
%cd ~/optimum-habana/examples/text-generation
!pip install -r requirements.txt
!pip install -r requirements_lm_eval.txt
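
Because the requirements files can pull in additional packages, it may be worth confirming that the pinned release is still the one installed before running the model. The following is a minimal sketch using only the standard library; the helper names are hypothetical. Only `optimum-habana` is pinned here because the exact version string reported by the HabanaAI DeepSpeed fork is not stated in this example.

```python
from importlib import metadata

# Pin taken from the install cell above.
PINS = {"optimum-habana": "1.14.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


problems = check_pins(PINS)
for package, (expected, found) in problems.items():
    print(f"{package}: expected {expected}, found {found}")
if not problems:
    print("all pinned packages match")
```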

### How to Access and Use the Llama 2 Model
Use of the model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses. 
To run gated models such as Llama-2-70b-hf, you need the following:   

•	Have a Hugging Face account and agree to the terms of use of the model in its model card on the HF Hub  
•	Create a read token and request access to the Llama 2 model from meta-llama  
•	Log in to your account using the HF CLI:   

In [None]:
!huggingface-cli login --token <YOUR HUGGINGFACE HUB TOKEN>
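
Pasting the token into a notebook cell leaves it in the saved notebook. As an alternative sketch, the token can be read from an environment variable and passed to `huggingface_hub.login`. The helper names are hypothetical, and using `HF_TOKEN` as the variable name is an assumption of this sketch.

```python
import os


def get_hf_token(env=os.environ, var="HF_TOKEN"):
    """Return the token stored in `var`, or raise if it is unset or empty."""
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set {var} to your Hugging Face read token first")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported lazily so get_hf_token can be used without the package installed.
    from huggingface_hub import login

    login(token=get_hf_token())
```

Calling `login_from_env()` inside the container then takes the place of the CLI login cell.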

### Running the Llama 2 70B Model using the BF16 Datatype
We’re now ready to start running the model for inference.  In this first example, we’ll start with the standard inference example using BF16.  Because Llama 2 70B is a large model, we’ll use the DeepSpeed library with a set of default settings to manage the local HBM on each Intel Gaudi card more efficiently: 

In [5]:
prompt = input("Enter a prompt for text generation: ")

Enter a prompt for text generation:  who is president of usa


In [None]:
!python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--max_new_tokens 1024 \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--batch_size 1 \
--attn_softmax_bf16 \
--limit_hpu_graphs \
--reuse_cache \
--trim_logits \
--prompt "{prompt}" 

At the end of the run, the output shows the throughput, memory usage, and graph compilation time.  Refer to the README of the [text-generation task example](https://github.com/huggingface/optimum-habana/tree/v1.12.0/examples/text-generation) for more options for running inference with Llama 2 and other large language models. 
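
The BF16 cell interpolates the prompt into a shell command as `"{prompt}"`, so a prompt that itself contains double quotes or other shell metacharacters may be altered before it reaches `run_generation.py`. One way around this, sketched below under the assumption that the command is launched from Python instead of with `!`, is to build an argument list so the prompt is passed as a single argument. The flags are copied from the BF16 cell; the function name is hypothetical.

```python
import shlex


def generation_command(prompt, world_size=8, max_new_tokens=1024):
    """Build the BF16 run as an argument list; flags copied from the BF16 cell."""
    return [
        "python3", "../gaudi_spawn.py", "--use_deepspeed",
        "--world_size", str(world_size),
        "run_generation.py",
        "--model_name_or_path", "meta-llama/Llama-2-70b-hf",
        "--max_new_tokens", str(max_new_tokens),
        "--bf16", "--use_hpu_graphs", "--use_kv_cache",
        "--batch_size", "1",
        "--attn_softmax_bf16", "--limit_hpu_graphs",
        "--reuse_cache", "--trim_logits",
        "--prompt", prompt,  # stays one argument, whatever it contains
    ]


cmd = generation_command('who said "hello, world"?')
# shlex.join shows a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
# To launch it without a shell: import subprocess; subprocess.run(cmd, check=True)
```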

### Running the Llama 2 70B Model using the FP8 Datatype
We’ll now use the FP8 datatype, which can give significantly better performance than BF16.  To learn more about Intel Gaudi FP8 quantization, refer to the [user guide](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html).  The first step is to run quantization measurement.  This is done by running the local quantization tool with the maxabs_measure.json file that is already included in the Hugging Face Optimum Habana GitHub repository: 


In [None]:
!QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 4 \
run_lm_eval.py --model_name_or_path meta-llama/Llama-2-70b-hf \
-o acc_70b_bs1_measure4.txt \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--bf16 \
--batch_size 1 \
--use_flash_attention \
--flash_attention_recompute


This generates a set of measurement values in a folder called `hqt_output` that will show what ops have been converted to the FP8 datatype. 

```
-rw-r--r--  1 root root 347867 Jul 13 07:52 measure_hooks_maxabs_0_4.json
-rw-r--r--  1 root root 185480 Jul 13 07:52 measure_hooks_maxabs_0_4.npz
-rw-r--r--  1 root root  40297 Jul 13 07:52 measure_hooks_maxabs_0_4_mod_list.json
-rw-r--r--  1 root root 347892 Jul 13 07:52 measure_hooks_maxabs_1_4.json
-rw-r--r--  1 root root 185480 Jul 13 07:52 measure_hooks_maxabs_1_4.npz
-rw-r--r--  1 root root  40297 Jul 13 07:52 measure_hooks_maxabs_1_4_mod_list.json
-rw-r--r--  1 root root 347903 Jul 13 07:52 measure_hooks_maxabs_2_4.json
-rw-r--r--  1 root root 185480 Jul 13 07:52 measure_hooks_maxabs_2_4.npz
-rw-r--r--  1 root root  40297 Jul 13 07:52 measure_hooks_maxabs_2_4_mod_list.json
-rw-r--r--  1 root root 347880 Jul 13 07:52 measure_hooks_maxabs_3_4.json
-rw-r--r--  1 root root 185480 Jul 13 07:52 measure_hooks_maxabs_3_4.npz
-rw-r--r--  1 root root  40297 Jul 13 07:52 measure_hooks_maxabs_3_4_mod_list.json
```
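The listing above suggests one `.json`, one `.npz`, and one `_mod_list.json` file per rank, named with the rank index and the world size. As a minimal sketch, assuming that naming pattern holds and that `hqt_output` sits in the current directory, the check below reports any file that is missing before the quantized run starts. The function name is hypothetical.

```python
from pathlib import Path

# Suffixes observed in the listing above, one of each per rank.
SUFFIXES = (".json", ".npz", "_mod_list.json")


def missing_measurements(folder, world_size, stem="measure_hooks_maxabs"):
    """Return the names of expected measurement files that are not in `folder`."""
    folder = Path(folder)
    missing = []
    for rank in range(world_size):
        for suffix in SUFFIXES:
            name = f"{stem}_{rank}_{world_size}{suffix}"
            if not (folder / name).is_file():
                missing.append(name)
    return missing


# An empty list means all 12 files for a world size of 4 are present.
print(missing_measurements("hqt_output", world_size=4))
```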
We can now use these measurements to run the model and measure throughput.   In this case a standard input prompt is used.  Notice that the quantization .json config file is now used (instead of the measurement file) and that additional input and output parameters are added: `--max_new_tokens 2048` sets the maximum number of output tokens to generate, and `--max_input_tokens 128` sets the number of input tokens.   

In [None]:
!QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 4 \
run_generation.py --model_name_or_path meta-llama/Llama-2-70b-hf \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--max_new_tokens 2048 \
--max_input_tokens 128 \
--bf16 \
--batch_size 130  \
--use_flash_attention \
--flash_attention_recompute
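
To get a feel for what a batch size of 130 with 128 input tokens and 2048 new tokens means for memory, the sketch below estimates the size of the KV cache. It uses the published Llama 2 70B architecture values (80 layers, 8 key/value heads, head dimension 128). Whether the cache is held in BF16 or FP8 depends on the quantization configuration, so both are shown, and the even split across 4 cards is an assumption of this sketch, not a measured figure.

```python
# Llama 2 70B architecture values from the published model configuration.
NUM_LAYERS = 80
NUM_KV_HEADS = 8   # grouped-query attention
HEAD_DIM = 128


def kv_cache_bytes(batch_size, seq_len, bytes_per_value):
    """Estimate KV cache size in bytes for the given batch and sequence length."""
    # Factor 2: one key tensor plus one value tensor per layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return batch_size * seq_len * per_token


SEQ_LEN = 128 + 2048   # --max_input_tokens plus --max_new_tokens
WORLD_SIZE = 4         # --world_size in the command above

for label, width in (("BF16", 2), ("FP8", 1)):
    total = kv_cache_bytes(130, SEQ_LEN, width)
    per_card = total / WORLD_SIZE
    print(f"{label}: {total / 2**30:.1f} GiB total, {per_card / 2**30:.1f} GiB per card")
```

The estimate covers the cache only; model weights and activations come on top of it.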

### Next Steps
Now that you have run a full inference case, you can go back to the Hugging Face Optimum Habana [validated models](https://github.com/huggingface/optimum-habana?tab=readme-ov-file#validated-models) to see more options for running inference. 


In [None]:
exit()