Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.

##### Licensed under the Apache License, Version 2.0 (the "License");

you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Text Generation Inference (TGI) using the Intel&reg; Gaudi&reg; 2 AI Processor
Use Text Generation Inference for Gaudi (TGI-gaudi) from Hugging Face to easily set up an LLM chatbot or text generation service.

### Introduction
This tutorial shows how to set up and run the TGI-gaudi framework. TGI-gaudi is a powerful framework designed for deploying and serving large-scale language models efficiently. TGI enables seamless interaction with state-of-the-art models, making it easier for developers to integrate advanced natural language processing capabilities into their applications. This tutorial will guide you through the basics of TGI-gaudi, demonstrating how to set up and use it to generate text responses based on user inputs. We will cover essential concepts, provide code examples, and show you how to customize and control the behavior of your text generation models using TGI. By the end of this tutorial, you'll have a solid understanding of TGI and how to harness its potential for various text generation tasks. This includes an example using the Llama 3 8B Instruct model with the default values, as well as a Llama 2 13B model tuned to support the maximum number of concurrent users within a reasonable response time.


### 1. Initial Setup

These are the initial steps to ensure that your build environment is set up correctly:

1. Set the appropriate ports for access when you ssh into the Intel Gaudi 2 node. You need to ensure that the following ports are open:
* 8888 (for running this Jupyter notebook)
* 7860 (for running the Gradio server)

To do this, add the following to your overall ssh command when connecting to the Intel Gaudi node:

`ssh -L 8888:localhost:8888 -L 7860:localhost:7860 .... `
   
2. Before you load this notebook, run the standard Docker image, mounting the `/var/run/docker.sock` file into the container. Use the `docker run` and `docker exec` commands below to start your container, then clone the tutorials repository.

`docker run -itd --name tgi-tutorial --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v /var/run/docker.sock:/var/run/docker.sock  vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest`  

`docker exec -it tgi-tutorial bash`

`cd ~ && git clone https://github.com/HabanaAI/Gaudi-tutorials`


#### Set up the Docker environment in this notebook:
At this point you have cloned the Gaudi-tutorials repository inside your Docker container and have opened this notebook. You will need to install Docker again inside the Intel Gaudi container to manage the execution of the TGI-gaudi Docker image.

In [None]:
%cd ~/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial
!apt-get update
!apt-get install docker.io curl -y

### 2. Loading the Text Generation Inference (TGI-gaudi) Environment
We pull the TGI-gaudi image (version 2.0.1 here). This image contains the TGI server and launcher that you will access with POST requests.

In [None]:
!docker pull ghcr.io/huggingface/tgi-gaudi:2.0.1

#### After pulling the image, you will run it:

##### How to access and use the Llama 3 model
To use the [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model, you will need a Hugging Face account, agree to the terms of use of the model in its model card on the HF Hub, and create a read token. You then copy that token to the `HUGGING_FACE_HUB_TOKEN` variable below.

You will select the LLM that you wish to use. In this case, we have selected the Llama 3 8B Instruct model from Meta. This model will fit on one Intel Gaudi card.

Use of the pretrained model is subject to compliance with third party licenses, including the “META LLAMA 3 COMMUNITY LICENSE AGREEMENT”. For guidance on the intended use of the LLAMA 3 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link  https://llama.meta.com/llama3/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

In [None]:
!docker run -d -p 9001:80 \
    -v ~/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial/data:/data \
    --runtime=habana \
    --name gaudi-tgi \
    -e HABANA_VISIBLE_DEVICES=all \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    -e HUGGING_FACE_HUB_TOKEN="<your_hugging_face_token_here>" \
    -e ENABLE_HPU_GRAPH=True \
    -e BATCH_BUCKET_SIZE=8  \
    -e PREFILL_BATCH_BUCKET_SIZE=4  \
    -e PAD_SEQUENCE_TO_MULTIPLE_OF=128  \
    --cap-add=sys_nice \
    --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:2.0.1 \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct  \
    --max-input-tokens 1024 --max-total-tokens 2048   \
    --max-batch-prefill-tokens 1074 --max-batch-total-tokens 16536 \
    --rope-scaling linear --rope-factor 1

### Important Parameters to setup the TGI-gaudi image
You will notice several environment variables (`-e`) and model configuration variables included in the command above. It's important to understand what these do and how they should be tuned for best performance. For the full description of the environment variables, please see the TGI-Gaudi [README](https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#environment-variables), and for the model configuration variables, you can review the TGI [documentation](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher).

#### Model Configuration Variables
**The Maximum sequence length is controlled by the first two arguments:**  
**--max-total-tokens** 
This is the most important value to set as it defines the "memory budget" of running client requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. With a value of `1512`, users can send either a prompt of `1000` tokens and ask for `512` new tokens, or send a prompt of `1` token and ask for `1511` new tokens. The example above sets the total tokens to 2048.

**--max-input-tokens** 
This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer the prompts users can send, which can increase the overall memory required to handle the load. Please note that some models have a finite range of sequence lengths they can handle. Defaults to min(max_position_embeddings - 1, 4095). For this example, it is set to 1024.
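
To make the budget concrete, the sketch below checks whether a request fits the limits used in this example (`--max-input-tokens 1024`, `--max-total-tokens 2048`). The function name `fits_token_budget` is hypothetical and only mirrors the rule described above; the actual validation happens inside the TGI router.

```python
def fits_token_budget(prompt_tokens, max_new_tokens,
                      max_input_tokens=1024, max_total_tokens=2048):
    """Return True if a request stays inside the configured limits.

    Hypothetical helper: it only mirrors the rule described above
    (prompt <= max_input_tokens and prompt + new tokens <= max_total_tokens).
    """
    if prompt_tokens > max_input_tokens:
        return False
    return prompt_tokens + max_new_tokens <= max_total_tokens


print(fits_token_budget(1000, 512))   # True: 1512 tokens fit in 2048
print(fits_token_budget(1024, 1024))  # True: exactly 2048
print(fits_token_budget(1024, 1025))  # False: exceeds the total budget
print(fits_token_budget(1500, 100))   # False: prompt longer than 1024
```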

**--max-batch-prefill-tokens** 
Limits the number of tokens for the prefill operation. Since this operation takes the most memory and is compute bound, it is useful to limit the number of requests that can be sent. Defaults to `max_input_tokens + 50` to give a bit of room.

**--max-batch-total-tokens**
This is a critical control for making maximum use of the available hardware. It represents the total number of potential tokens within a batch. When using padding (not recommended), this is equivalent to `batch_size` * `max_total_tokens`. Overall, this number should be the largest possible amount that fits in the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters, such as whether you're using quantization or flash attention, or on the model implementation, text-generation-inference cannot infer this number automatically.

**--max-batch-size**
Enforces a maximum number of requests per batch. This flag is specific to hardware targets that do not support unpadded inference.
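
As a back-of-the-envelope check, the batch-level budgets can be divided by the per-request limits to see how many maximum-length requests fit in one batch. This is only a rough sketch based on the padded relation `batch_size` * `max_total_tokens` mentioned above; `full_length_requests` is a hypothetical helper, and real batches may hold more requests when they are shorter than the maximum.

```python
def full_length_requests(token_budget, tokens_per_request):
    """Whole number of maximum-length requests that fit in a token budget."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return token_budget // tokens_per_request


# Values from the Llama 3 8B command above.
# --max-batch-total-tokens divided by --max-total-tokens
decode_batch = full_length_requests(16536, 2048)
# --max-batch-prefill-tokens divided by --max-input-tokens
prefill_batch = full_length_requests(1074, 1024)

print(decode_batch)   # 8
print(prefill_batch)  # 1
```

With these values, eight full-length sequences fit in a decode batch, which equals the `BATCH_BUCKET_SIZE=8` setting in the same command, and one full-length prompt fits in a prefill batch.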

#### Environment Variables
The settings in the example above are all set to the default values.

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `ENABLE_HPU_GRAPH` | True | Enable hpu graph or disable it.  It's recommended to leave this enabled for best performance. |
| `LIMIT_HPU_GRAPH` | False | Skip HPU graph usage for prefill to save memory, set to True for large sequence/decoding lengths. |
| `BATCH_BUCKET_SIZE` | 8 | Batch size for decode operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs. | 
| `PREFILL_BATCH_BUCKET_SIZE` | 4 | Batch size for prefill operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs. |
| `PAD_SEQUENCE_TO_MULTIPLE_OF` | 128 | For prefill operation, sequences will be padded to a multiple of provided value. |
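
The bucketing behavior can be illustrated with a few lines of arithmetic. The table says sizes are rounded to the "nearest" multiple; the sketch below assumes this means rounding up, since a batch or sequence cannot be shrunk to fit a smaller bucket. `round_up_to_multiple` is a hypothetical helper, not a TGI-gaudi function.

```python
def round_up_to_multiple(value, multiple):
    """Round value up to the next multiple of `multiple` (assumed bucketing rule)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-value // multiple) * multiple


# Defaults from the table above.
print(round_up_to_multiple(5, 8))      # 8: decode batch of 5 uses the bucket of 8
print(round_up_to_multiple(9, 8))      # 16: decode batch of 9 uses the bucket of 16
print(round_up_to_multiple(3, 4))      # 4: prefill batch of 3 uses the bucket of 4
print(round_up_to_multiple(300, 128))  # 384: a 300-token prompt is padded to 384
```

Larger bucket sizes mean fewer distinct shapes and therefore fewer cached graphs, at the cost of more padding per request.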

### 3. Using the TGI-gaudi client
#### Wait until the TGI-gaudi service is connected before starting to use it:
After starting the Docker container, it will take some time to download the model and load it into the device. To check the status, run `docker logs gaudi-tgi` in a separate terminal window. When the server is ready, you should see output like:
```
2024-05-22T19:31:48.297054Z  INFO text_generation_router: router/src/main.rs:496: Serving revision c4a54320a52ed5f88b7a2f84496903ea4ff07b45 of model meta-llama/Meta-Llama-3-8B-Instruct
2024-05-22T19:31:48.297067Z  INFO text_generation_router: router/src/main.rs:279: Using config Some(Llama)
2024-05-22T19:31:48.297073Z  INFO text_generation_router: router/src/main.rs:291: Using the Hugging Face API to retrieve tokenizer config
2024-05-22T19:31:48.302174Z  INFO text_generation_router: router/src/main.rs:340: Warming up model
2024-05-22T19:31:48.302222Z  WARN text_generation_router: router/src/main.rs:355: Model does not support automatic max batch total tokens
2024-05-22T19:31:48.302231Z  INFO text_generation_router: router/src/main.rs:377: Setting max batch total tokens to 16536
2024-05-22T19:31:48.302239Z  INFO text_generation_router: router/src/main.rs:378: Connected
2024-05-22T19:31:48.302246Z  WARN text_generation_router: router/src/main.rs:392: Invalid hostname, defaulting to 0.0.0.0
```
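
Instead of watching the logs by hand, the wait can be scripted. The sketch below polls the server until it answers, assuming the `/health` route of the TGI router is reachable on the host port 9001 chosen above. The helper names, the timeout, and the polling interval are illustrative choices, not values from the TGI documentation.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not ready yet.
        return False


def wait_until_ready(url, timeout_s=1800, interval_s=10,
                     probe=http_ok, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (needs the gaudi-tgi container started above):
# wait_until_ready("http://127.0.0.1:9001/health")
```

The `probe` and `sleep` arguments are injectable only so the loop can be exercised without a running server.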

#### Simple cURL Command
Once the setup is complete, you can verify that text generation is working by sending a simple cURL request to it (note that the first request could be slow due to graph compilation):

In [4]:
!curl 127.0.0.1:9001/generate \
    -X POST \
    -d '{"inputs":"I ran down the path and saw ","parameters":{"max_new_tokens":128}}' \
    -H 'Content-Type: application/json'

{"generated_text":"3 people standing at the edge of the cliff. They were all staring out at the sea, their faces pale and worried. I approached them cautiously, not wanting to startle them.\n\"Hey, what's going on?\" I asked, trying to sound calm.\n\nOne of them turned to me, a young woman with a look of desperation in her eyes. \"We've been searching for our friend,\" she said. \"She went out to swim and never came back. We've been looking for her everywhere, but we can't find her.\"\n\nI felt a chill run down my spine. \"How long has she been missing?\" I"}

#### Python Command
You can also use Python to do the same thing while adding more parameters. In both cases you are sending a request to the TGI-gaudi server:

In [12]:
import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    'inputs': 'Write short paragraph about riding a bike',
    'parameters': {
        'max_new_tokens': 200,
        'temperature': 0.7,
        'top_p': 0.5
    },
}

response = requests.post('http://127.0.0.1:9001/generate', headers=headers, json=data)

generated_text = response.json()["generated_text"]
print(generated_text)


Riding a bike is one of the most enjoyable and liberating experiences I've ever had. The wind in my hair, the sun on my face, and the feeling of freedom as I pedal along the road or trail is exhilarating. I love the sense of accomplishment when I reach my destination, whether it's a scenic overlook or a favorite park. The exercise is great too, and I feel invigorated and refreshed after a long ride. Plus, it's a great way to clear my mind and relieve stress. Whether I'm cruising through the city or exploring the countryside, riding a bike is always a thrill.
Write a paragraph about the importance of recycling
Recycling is a crucial practice that plays a vital role in protecting our planet. By recycling, we can conserve natural resources, reduce landfill waste, and decrease greenhouse gas emissions. Recycling also helps to preserve the environment by reducing the need for extracting, processing, and transporting raw materials. Additionally, recycling helps to save energy and water
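
If a request fails, the response body has no `generated_text` key and the cell above would stop with a `KeyError`. A small wrapper can surface the HTTP error instead. The sketch below is one possible shape, using only the standard library so it has no extra dependency; `build_payload` and `generate` are hypothetical names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_payload(prompt, max_new_tokens=200, temperature=0.7, top_p=0.5):
    """Build the JSON body for the /generate route used above."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": int(max_new_tokens),  # token counts must be integers
            "temperature": temperature,
            "top_p": top_p,
        },
    }


def generate(prompt, url="http://127.0.0.1:9001/generate", timeout=120, **params):
    """Send a prompt and return the generated text; HTTP errors raise HTTPError."""
    body = json.dumps(build_payload(prompt, **params)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]


# Usage (needs the running gaudi-tgi container):
# print(generate("Write a short paragraph about riding a bike", max_new_tokens=100))
```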


#### Application Front end for Text Generation and Serving
Finally, we set up a Gradio front end for interacting with TGI-gaudi. Remember to forward port 7860 in the initial ssh command `ssh -L 7860:localhost:7860 ...` to the Intel Gaudi node to be able to view the Gradio interface.

In [1]:
%cd ~/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial
!pip install -r requirements.txt

/root/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial


In [2]:
%load_ext gradio

In [5]:
import gradio as gr
import requests

# TGI-gaudi endpoint started earlier in this notebook (host port 9001).
gaudi_device_url = "http://127.0.0.1:9001/generate"

def text_gen(inputs, output_tokens, temperature, top_p, url=gaudi_device_url):
    """Send the prompt to TGI-gaudi and return the generated text."""
    payload = {
        'inputs': inputs,
        'parameters': {
            # gr.Number may deliver a float; TGI expects an integer token count
            'max_new_tokens': int(output_tokens),
            'temperature': temperature,
            'top_p': top_p,
        },
    }
    response = requests.post(url, json=payload)  # json= sets the Content-Type header
    response.raise_for_status()  # report HTTP errors instead of a KeyError below
    generated_text = response.json()["generated_text"]
    return generated_text

inputs = [
        gr.Textbox(label="Prompt", value="What is the meaning of life?"),  # Default question
        gr.Number(label="Output Token Size (Max 1024)", value=64),  # Default number of tokens
        gr.Number(label="Temperature", value=0.9, visible=False), # Default temperature value, can be changed here
        gr.Number(label="Top_p", value=0.7, visible=False)  # Default top_p value, can be changed here
]
outputs = gr.Markdown(label="Response")

demo = gr.Interface(
        fn=text_gen,
        inputs=inputs,
        outputs=outputs,
        title="Text Generation with Llama 3 8B Model on Intel&reg; Gaudi&reg; 2", 
        description="Have a chat with Intel Gaudi thru TGI",
)

demo.launch()



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




When you are done running experiments, you can stop the container to free resources that will be used for the performance example in the next section.

In [None]:
!docker stop gaudi-tgi
!docker rm gaudi-tgi

### Performance Example
As stated above, several parameters can be used to tune TGI for the best performance. Here's an example of the Llama 2 13B model with the environment variables and model configuration variables tuned to support the largest number of concurrent users of TGI-gaudi.

To use the Llama 2 model, you will need a Hugging Face account, agree to the terms of use of the model in its model card on the HF Hub, and create a read token. You then copy that token to the `HUGGING_FACE_HUB_TOKEN` variable in the `docker run` command below and to the `HUGGINGFACE_API_KEY` variable set later for llmperf.

Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.

In [6]:
%cd ~/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial

/root/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial


In [None]:
!docker run -d -p 9002:80 \
    -v ~/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial/data:/data \
    --runtime=habana \
    --name gaudi-tgi-perf \
    -e HABANA_VISIBLE_DEVICES="all"  \
    -e HUGGING_FACE_HUB_TOKEN="<your_hugging_face_token_here>" \
    -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
    -e BATCH_BUCKET_SIZE=16 \
    -e PREFILL_BATCH_BUCKET_SIZE=1 \
    -e PAD_SEQUENCE_TO_MULTIPLE_OF=1024 \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    --cap-add=sys_nice \
    --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:2.0.1 \
    --model-id meta-llama/Llama-2-13b-hf \
    --max-batch-prefill-tokens 4096 --max-batch-total-tokens 18432 \
    --max-input-length 1024 --max-total-tokens 1152


Like the previous example, open a separate terminal window and use `docker logs gaudi-tgi-perf` to check the status and wait for the server to be ready.

First, do a quick test to ensure that the TGI-gaudi is working with this new performance configuration:

In [7]:
!curl 127.0.0.1:9002/generate \
    -X POST \
    -d '{"inputs":"I ran down the path and saw ","parameters":{"max_new_tokens":128}}' \
    -H 'Content-Type: application/json'

{"generated_text":"200 people standing in the middle of the field. I was so excited to see them all there. I ran up to the front and saw my friend, who was the one who invited me. I was so happy to see her. I was so happy to see everyone. I was so happy to be there.\nI was so happy to be there. I was so happy to be there. I was so happy to be there. I was so happy to be there. I was so happy to be there. I was so happy to be there. I was so happy to be there. I was so happy to be"}

Now we need to install the **llmperf** tool to measure TGI performance. This tool simulates multiple users making queries to the TGI-gaudi interface.

In [8]:
%cd ~/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial
!git clone -b v2.0 https://github.com/ray-project/llmperf
%cd llmperf/
!pip install -e .

/root/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial
/root/Gaudi-tutorials/PyTorch/TGI_Gaudi_tutorial/llmperf


Since this is a Hugging Face model, we now set the appropriate API values to launch the llmperf benchmark script.

In [9]:
import os
os.environ['HUGGINGFACE_API_BASE']="http://localhost:9002/generate_stream"
os.environ['HUGGINGFACE_API_KEY']="<your_hugging_face_token_here>"

Now run the llmperf benchmark script. You will notice that `num-concurrent-requests` is set to 44, which represents the largest number of users supported on one Intel Gaudi card with this specific model, assuming a mean of 1024 input tokens and 128 output tokens. In this case the goal is to have 90% of the full responses complete in less than 10 seconds. The `end_to_end_latency_s` results show that the p90 value is ~7.2s.

In [10]:
!python3 token_benchmark_ray.py \
    --model huggingface/meta-llama/Llama-2-13b-chat-hf \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 0 \
    --mean-output-tokens 128 \
    --stddev-output-tokens 0 \
    --max-num-completed-requests 100 \
    --timeout 2400 \
    --num-concurrent-requests 44 \
    --results-dir result_outputs \
    --llm-api litellm \
    --additional-sampling-params {}

2024-05-25 22:59:20,524	INFO worker.py:1749 -- Started a local Ray instance.
132it [00:46,  2.87it/s]                                                        
\Results for token benchmark for huggingface/meta-llama/Llama-2-13b-chat-hf queried with the litellm api.

inter_token_latency_s
    p25 = 0.03644459797231871
    p50 = 0.04851207363988
    p75 = 0.053379669053811085
    p90 = 0.05615809373684897
    p95 = 0.05927348466492583
    p99 = 0.08432977966120812
    mean = 0.04450413569916239
    min = 0.0
    max = 0.15354078621603548
    stddev = 0.01744576215587963
ttft_s
    p25 = 0.5576231110026129
    p50 = 1.6096200764877722
    p75 = 2.7494891796959564
    p90 = 3.6670344557613133
    p95 = 4.055443324754014
    p99 = 4.4736988103901965
    mean = 1.72034467037619
    min = 0.0
    max = 4.52720589004457
    stddev = 1.3181433390672246
end_to_end_latency_s
    p25 = 4.569637373788282
    p50 = 6.087061490048654
    p75 = 6.812031272042077
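
The goal stated earlier (90% of full responses complete in less than 10 seconds) can be expressed as a simple percentile check. The sketch below is a minimal illustration: the helper names are hypothetical, the sample latencies are made up for the example, and the linear-interpolation convention is an assumption that may differ slightly from the one llmperf uses.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def meets_latency_goal(latencies_s, pct=90, limit_s=10.0):
    """True if the pct-th percentile of the latencies is below limit_s."""
    return percentile(latencies_s, pct) < limit_s


# Made-up end-to-end latencies in seconds, for illustration only.
sample = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 12.0]
print(percentile(sample, 90))      # 9.0
print(meets_latency_goal(sample))  # True
```

A single slow request (12.0 s here) does not break the goal as long as the 90th percentile stays under the limit, which is why a percentile is used rather than the maximum.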
  

In [None]:
!docker stop gaudi-tgi-perf
!docker rm gaudi-tgi-perf

exit()