# 🦙 **Using Text Generation Inference (TGI) with Llama Models**

In this notebook, we'll explore how simple it is to serve and consume the **[Llama models](https://huggingface.co/blog/llama32)** using the **Text Generation Inference (TGI)** project. 🚀

## 🗂 **Available Llama Models**

You can browse the entire collection of Llama models over at [this link](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf). These models range in size and capability, giving you plenty of options for your text generation needs.

## 📚 **Learn More About TGI**

To explore the technical details behind the **Text Generation Inference** project, visit the [official GitHub repository](https://github.com/huggingface/text-generation-inference). 💡



# 🍽️ **Serving the Llama Model**

Now that we've seen the available models, it's time to **serve** one using the **Text Generation Inference (TGI)** framework! 🛠️

With TGI, you can deploy Llama models efficiently to handle text generation requests. Whether you host the model on your local machine or deploy it in the cloud, the process is streamlined for performance and scalability.


## 🐳 **Using Docker**

Check out the Docker setup guide in the official TGI repository [here](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#docker).



In [None]:
'''
model=meta-llama/Llama-3.2-1B
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data
token=<HF_TOKEN>

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e HF_TOKEN=$token \
  ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model --quantize bitsandbytes
'''

Once the container is running, we can call the model with curl:

In [None]:
'''
curl 127.0.0.1:8080/generate_stream \
  -X POST  \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'  \
  -H 'Content-Type: application/json'
'''
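The same kind of request can also be built from Python. The block below is a minimal sketch using only the standard library. It assumes the container started above is reachable at `127.0.0.1:8080`, and it targets TGI's non-streaming `/generate` route rather than `/generate_stream`. The helper name `build_generate_request` is hypothetical and exists only for illustration.

```python
import json
import urllib.request

TGI_URL = "http://127.0.0.1:8080"  # assumed: matches the -p 8080:80 port mapping


def build_generate_request(prompt, max_new_tokens=20, base_url=TGI_URL):
    """Build (but do not send) a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("What is Deep Learning?")

# Sending needs a running server, so the call is left commented out:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["generated_text"])
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.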

## 🌐 **Using Inference Endpoints**

Another efficient way to serve Llama models is by using **Inference Endpoints** on Hugging Face. 🚀 This allows you to deploy models in a fully managed environment with just a few clicks.

For detailed instructions on how to set up and use Inference Endpoints, refer to the [official documentation](https://huggingface.co/docs/inference-endpoints/index). You'll find everything you need to start serving your models in a reliable and scalable way! 💡


# 🍽️ **Consuming the Model**

Once the Llama model is up and running, you can start interacting with it using a simple API.

To consume the model, you'll send a request to the API endpoint and receive a response with the generated text. Hugging Face provides an easy-to-use interface for this, making it accessible from any application.

You can refer to the [API Inference Notebook](https://github.com/huggingface/huggingface-llama-recipes/blob/ae10e290a3bf1cbdc8523b3eb5ac2437f09e0877/api_inference/inference-api.ipynb) for a step-by-step guide on how to send requests to the model and retrieve responses. This notebook provides sample code and instructions to get you up and running quickly 🦙✨.


## 🐍 **Using the Python API**

To interact with the Llama models programmatically, you can utilize the **[huggingface_hub's Inference Client](https://huggingface.co/docs/huggingface_hub/guides/inference)**.



In [None]:
!pip install huggingface_hub

These models are gated, so we need to authenticate first.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## ✍️ **Text Generation**

Using the Hugging Face Hub, you can easily perform text generation with the Llama models. Below is an example of how to utilize the `InferenceClient` to generate text based on a given prompt. 🚀


In [None]:
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-3.2-1B")

response = client.text_generation(
    prompt="A HTTP POST request is used to ",
    temperature=0.8,
    max_new_tokens=50,
    seed=42,
    return_full_text=True,
)
print(response)

A HTTP POST request is used to  send a new entity to a web server. There are many reasons why you may want to send out a new entity, such as: creating a new user, making changes to an existing user, or sending in new orders. This tutorial will


## 💬 **Chat Example**

The Llama models can also be utilized for chat-like interactions. With the `InferenceClient`, you can easily create conversational AI experiences. Below is an example of how to generate a chat response based on user input. 🗣️


In [None]:
client = InferenceClient(model="meta-llama/Llama-3.2-11B-Vision-Instruct")

output = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    # Print fragments on a single line; a chunk's content may be empty
    print(chunk.choices[0].delta.content or "", end="")

1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
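When streaming, each chunk carries only a small text fragment, so the full reply has to be reassembled on the client side. The sketch below shows one way to do that. It assumes chunks expose their text at `chunk.choices[0].delta.content`, as in the cell above, and that the content may occasionally be `None`. The `fake_chunk` helper is a hypothetical stand-in so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text fragments of streamed chat-completion chunks."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty fragments
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Hypothetical stand-in mimicking the chunk.choices[0].delta.content shape."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk(text) for text in ["1", ",", " ", "2", None, "."]]
print(collect_stream(demo))  # prints "1, 2."
```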



## 🖼️ **Chat with Image**

The Llama 3.2 Vision models also support multimodal interactions, allowing you to send images along with text prompts. This feature enables the model to analyze images and respond accordingly. Below is an example of how to engage in a chat with an image input. 📸


In [None]:
client = InferenceClient(model="meta-llama/Llama-3.2-11B-Vision-Instruct")

output = client.chat.completions.create(
    messages=[
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What’s in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://raw.githubusercontent.com/haotian-liu/LLaVA/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg",
            },
          },
        ],
      }
    ],
    stream=True,
    max_tokens=200,
)

full_response = []
for chunk in output:
    full_response.append(chunk.choices[0].delta.content or "")

final_text = ''.join(full_response)
print(final_text)

The image is a graph showing the performance of several systems, each identified by a name consisting of four letters. The x-axis represents the name of each system, while the y-axis denotes the performance metric being measured. The graph features a range of colors, with the lines varying in thickness.

The graph is set against a white background, which provides a clean and neutral backdrop for the information being presented. Overall, the graph effectively communicates the performance of each system, allowing for easy comparison and analysis.


## 📸 **Chat with an Image in Base64 Format**

You can also send images encoded in Base64 format to the Llama models, enabling multimodal interactions without relying on image URLs. This approach can be useful when you want to embed images directly in your requests. Below is an example of how to use Base64 encoding for image inputs in a chat. 🌟


In [None]:
import base64
import requests
from PIL import Image
from io import BytesIO

url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg'

image = Image.open(requests.get(url, stream=True).raw)

def encode_image(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

base64_image = encode_image(image)

output = client.chat.completions.create(
    messages=[
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What’s in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": f"data:image/jpeg;base64,{base64_image}"
            },
          },
        ],
      }
    ],
    stream=True,
    max_tokens=200,
)

full_response = []
for chunk in output:
    full_response.append(chunk.choices[0].delta.content or "")

final_text = ''.join(full_response)
print(final_text)

The image depict a rabbit in clothing.
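If the image already exists as raw bytes (for example, read from a local file), it can be wrapped in a data URL directly, without decoding and re-encoding it through PIL. The following is a minimal sketch of that idea. The helper name `to_data_url` is hypothetical, and the three placeholder bytes stand in for a real JPEG file.

```python
import base64
import mimetypes


def to_data_url(raw_bytes, filename):
    """Wrap raw image bytes in a data URL, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes for illustration; in practice read them with open(path, "rb").
data_url = to_data_url(b"\xff\xd8\xff", "rabbit.jpg")
print(data_url)
```

The resulting string can be passed as the `url` value of an `image_url` content item, in the same position as the f-string in the cell above.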


## 🌐 **Using CURL**

You can also interact with the 🦙 Llama models using CURL. Below are examples of how to use CURL for both standard text generation and chat completions.

_Authorization Token: Replace `<Token>` with your actual Hugging Face API token._
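To avoid pasting a token into a notebook, it can be read from an environment variable instead. The sketch below assumes the token is stored in `HF_TOKEN` (the same variable name passed to the Docker container earlier). The helper name `auth_headers` is hypothetical.

```python
import os


def auth_headers(env_var="HF_TOKEN"):
    """Build request headers from a token stored in an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of an HTTP client such as `requests.post`.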

### 1. **Text Generation**

To generate text using the Llama model, you can send a POST request like this:


In [74]:
!curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer <Token>"

[{"generated_text":" A Beginner’s Guide\nDeep learning is a subset of machine learning that involves the use of artificial neural"}]

### 2. **Chat Completions**

For chat interactions, you can use a similar approach to send messages:

In [None]:
!curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct/v1/chat/completions -X POST \
    -d '{"messages": [{"role": "system","content": "You are a helpful assistant."},{"role": "user","content": "What is deep learning?"}],"stream": true,"max_tokens": 20}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer <Token>"

data: {"object":"chat.completion.chunk","id":"","created":1727810075,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","system_fingerprint":"2.3.1-dev0-sha-de90261","choices":[{"index":0,"delta":{"role":"assistant","content":"Deep"},"logprobs":null,"finish_reason":null}],"usage":null}

data: {"object":"chat.completion.chunk","id":"","created":1727810075,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","system_fingerprint":"2.3.1-dev0-sha-de90261","choices":[{"index":0,"delta":{"role":"assistant","content":" learning"},"logprobs":null,"finish_reason":null}],"usage":null}

data: {"object":"chat.completion.chunk","id":"","created":1727810075,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","system_fingerprint":"2.3.1-dev0-sha-de90261","choices":[{"index":0,"delta":{"role":"assistant","content":" is"},"logprobs":null,"finish_reason":null}],"usage":null}

data: {"object":"chat.completion.chunk","id":"","created":1727810075,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","system_fing
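Each `data:` line in the streamed output is a JSON event whose text fragment sits under `choices[0].delta.content`. The sketch below shows one way to turn such lines back into plain text. It assumes complete lines (the truncated final line above would not parse) and treats an OpenAI-style `[DONE]` sentinel as an assumption about the end of the stream. Both helper names are hypothetical, and the sample events are abbreviated to the fields the parser reads.

```python
import json


def parse_sse_line(line):
    """Return the decoded JSON event for a 'data:' line, or None for anything else."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if not body or body == "[DONE]":  # assumed end-of-stream sentinel
        return None
    return json.loads(body)


def stream_text(lines):
    """Join the delta contents carried by a sequence of SSE lines."""
    parts = []
    for line in lines:
        event = parse_sse_line(line)
        if event is None:
            continue
        content = event["choices"][0]["delta"].get("content")
        if content:
            parts.append(content)
    return "".join(parts)


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Deep"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":" learning"}}]}',
    "data: [DONE]",
]
print(stream_text(sample))  # prints "Deep learning"
```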

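For the request parameters supported by the API, see the [parameters documentation](https://huggingface.co/docs/api-inference/parameters). The same requests can also be sent from Python with the `requests` library: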

In [76]:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct"
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json",
    "x-use-cache": "false"
}
data = {
    "inputs": "What is Deep Learning?"
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())

[{'generated_text': " [7 models explained]\nPosted by: techsaiyan in Machine Learning June 2, 2019\nDeep Learning is a branch of artificial intelligence that ensues the development of a computer's potential to interpret, research, and understand complex data like photos, audio, and texts. The ultimate objective of deep learning frameworks is to influence machines to examine, process raw data and extract valuable information from the data. Deep learning exists within machine learning and all ensues the development of a computer's potential"}]


In [77]:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct/v1/chat/completions"
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json",
    "x-use-cache": "false"
}

data = {
    "messages":  [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ]
}

response = requests.post(API_URL, headers=headers, json=data)
print(response.json())

{'object': 'chat.completion', 'id': '', 'created': 1727812546, 'model': 'meta-llama/Llama-3.2-11B-Vision-Instruct', 'system_fingerprint': '2.3.1-dev0-sha-de90261', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'Deep learning is a subfield of machine learning that involves the use of artificial neural networks (ANNs) with multiple layers to learn and represent data. The main idea behind deep learning is to create models that can learn complex patterns and features from large amounts of data by mimicking the structure and function of the human brain.\n\nIn traditional machine learning, models are designed to recognize and classify patterns using hand-engineered features. However, deep learning models learn these features automatically from the data, which allows them to'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 47, 'completion_tokens': 100, 'total_tokens': 147}}
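The non-streaming response is a nested dictionary, so a small helper is handy for pulling out the parts that matter. The sketch below assumes the response has the shape printed above (`choices[0].message.content`, `finish_reason`, and `usage`). The helper name `summarize_completion` is hypothetical, and the sample dictionary is an abbreviated copy of that output.

```python
def summarize_completion(response_json):
    """Extract the reply text and basic metadata from a chat-completion response."""
    choice = response_json["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means generation stopped at the token limit, not naturally
        "truncated": choice["finish_reason"] == "length",
        "completion_tokens": response_json["usage"]["completion_tokens"],
    }


sample_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Deep learning is a subfield of machine learning"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 47, "completion_tokens": 100, "total_tokens": 147},
}

summary = summarize_completion(sample_response)
print(summary["truncated"], summary["completion_tokens"])  # prints "True 100"
```

In the output above, `finish_reason` is `length`, which indicates the answer was cut off at the token limit; passing a larger `max_tokens` in the request body would allow a longer reply.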
