##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with Xinference and Gemma

This notebook demonstrates how to use Xinference to load a Gemma 2 model and run inference utilizing the GPU provided by Google Colab.

[**Gemma**](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs) available in English, with open weights, pre-trained variants, and instruction-tuned variants.

[**llama.cpp**](https://github.com/ggerganov/llama.cpp) is a C++ implementation of Meta AI's LLaMA and other large language model architectures, designed for efficient performance on local machines or within environments like Google Colab. It enables you to run large language models without needing extensive computational resources.

To make working with llama.cpp more accessible,
[**llama-cpp-python**](https://github.com/abetlen/llama-cpp-python) provides Python bindings for the C++ library. This allows you to enjoy the performance optimizations of `llama.cpp` while benefiting from the simplicity and flexibility of Python. With llama-cpp-python, you get a convenient API for loading models, generating text, and customizing inference parameters.

[**Xorbits Inference (Xinference)**](https://inference.readthedocs.io/en/latest/) is an open-source platform that streamlines the operation and integration of a wide array of AI models. With Xinference, you can run inference with open-source LLMs, embedding models, and multimodal models, either in the cloud or on your own premises, and build AI-driven applications. This tutorial uses different **Gemma 2** model variants in the GGUF format, but the code should be easily transferable to the other LLM chat models supported by Xinference.

The latest complete list of supported models can be found on Xorbits Inference's [official GitHub page](https://github.com/xorbitsai/inference/blob/main/README.md).

By the end of this notebook, you will learn how to:

- **Install Xinference and its dependencies**: Set up Xinference along with `llama.cpp` to run Gemma models in the GGUF format.
- **Start the Xinference local server**: Initialize the server to run models locally.
- **Launch different Gemma 2 model variants**: Select quantization and model size parameters to launch various Gemma 2 models.
- **Interact with the model using Xinference's Python client**: Use the client to communicate with the model and receive responses.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Using_with_Xinference.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

**Once you've completed these steps, you're ready to move on to the next section where you'll set up the environment for Xinference to work with llama.cpp.**


### Install Xinference and dependencies

In [None]:
# Install Xinference
!pip install -q xinference

# Install a prebuilt llama-cpp-python wheel for CUDA 12.2 (cu122) so that
# llama.cpp can run on the GPU
!pip install llama-cpp-python==0.2.90 \
  -q -U --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

# jq is a powerful command-line JSON processor that's widely used for parsing,
# filtering, and formatting JSON data.
!apt-get install -qq jq


## Start Local Server


To start a local instance of Xinference, run `xinference-local` in the background via `nohup`, redirecting its output to `xinference.log`:

In [None]:
!nohup xinference-local  > xinference.log 2>&1 &

In [None]:
import time
print("Waiting for the Xinference server to start...")
time.sleep(30)  # Wait for 30 seconds
print("Xinference server should be running now.")

# View server logs
!cat xinference.log

Waiting for the Xinference server to start...
Xinference server should be running now.
2024-11-28 12:24:10,788 xinference.core.supervisor 1022 INFO     Xinference supervisor 127.0.0.1:44380 started
2024-11-28 12:24:10,826 xinference.core.worker 1022 INFO     Starting metrics export server at 127.0.0.1:None
2024-11-28 12:24:10,827 xinference.core.worker 1022 INFO     Checking metrics export server...
2024-11-28 12:24:12,083 xinference.core.worker 1022 INFO     Metrics server is started at: http://127.0.0.1:33897
2024-11-28 12:24:12,083 xinference.core.worker 1022 INFO     Purge cache directory: /root/.xinference/cache
2024-11-28 12:24:12,084 xinference.core.worker 1022 INFO     Connected to supervisor as a fresh worker
2024-11-28 12:24:12,097 xinference.core.worker 1022 INFO     Xinference worker 127.0.0.1:44380 started
Task was destroyed but it is pending!
task: <Task pending name='Task-5' coro=<ActorCallerThreadLocal._listen() running at /usr/local/lib/python3.10/dist-packages/xoscar/

Congrats! You now have Xinference running on the Colab machine. The default host and port are 127.0.0.1 and 9997, respectively.
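
The fixed 30-second sleep above is a simple heuristic. As an alternative sketch, the cell could poll the server's TCP port until it accepts connections. The helper name `wait_for_port` and its defaults are assumptions made for illustration and are not part of Xinference; an open port only shows that something is listening, not that the server has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; not part of the Xinference API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and try again.
            time.sleep(interval)
    return False
```

For example, calling `wait_for_port("127.0.0.1", 9997)` could stand in for `time.sleep(30)`, with the log still printed afterwards.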


Once Xinference is running, there are multiple ways you can try it out: via the web UI, via cURL, via the command line, or via Xinference's Python client.

The command line tool is `xinference`. You can list the commands that can be used by running:

In [None]:
!xinference --help

Usage: xinference [OPTIONS] COMMAND [ARGS]...

  Xinference command-line interface for serving and deploying models.

Options:
  -v, --version       Show the current version of the Xinference tool.
  --log-level TEXT    Set the logger level. Options listed from most log to
                      (Default level is INFO)
  -H, --host TEXT     Specify the host address for the Xinference server.
  -p, --port INTEGER  Specify the port number for the Xinference server.
  --help              Show this message and exit.

Commands:
  cached         List all cached models in Xinference.
  cal-model-mem  calculate gpu mem usage with specified model size and...
  chat           Chat with a running LLM.
  engine         Query the applicable inference engine by model name.
  generate       Generate text using a running LLM.
  launch         Launch a model with the Xinference framework with the...
  list           List all running models in Xinference.
  login          Login when the cluster is authen

You can launch different Gemma 2 [model variants](https://inference.readthedocs.io/en/latest/models/builtin/llm/gemma-2-it.html) using the following command. However, for ease of use, you'll mainly rely on the Python API to select the appropriate combinations of model size and quantization.

```bash
xinference launch \
    --model-engine ${engine} \
    --model-name ${name} \
    --size-in-billions ${model_size} \
    --model-format ${format} \
    --quantization ${quantization}
```

The placeholders in the command can be replaced with the appropriate values:

- **`${engine}`**: The model engine to use. Possible options are:
  - `llama.cpp` (used with `--model-format ggufv2`)
  - `transformers`
  - `sglang`
  - `vllm`

- **`${name}`**: The model name, e.g., `gemma-2-it`.

- **`${model_size}`**: The size of the model in billions of parameters. Options are `2`, `9`, or `27`.

- **`${format}`**: The model format. Common options include:
  - `ggufv2` (used with `llama.cpp` engine)
  - `pytorch`
  - `awq`
  - `gptq`

- **`${quantization}`**: The quantization method, which depends on the model format and size.

  - When `--model-format` is `ggufv2` and using `llama.cpp`, valid quantizations for Gemma 2 are:

    - **For 2B models**: `Q3_K_L`, `Q4_K_M`, `Q4_K_S`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q6_K_L`, `Q8_0`, `f32`.

    - **For 9B and 27B models**: `Q2_K`, `Q2_K_L`, `Q3_K_L`, `Q3_K_M`, `Q3_K_S`, `Q4_K_L`, `Q4_K_M`, `Q4_K_S`, `Q5_K_L`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q6_K_L`, `Q8_0`, `f32`.

  - When `--model-format` is `pytorch`, the quantization is `none`.

  - When `--model-format` is `awq`, the quantization is `Int4`.

  - When `--model-format` is `gptq`, the quantization can be `Int3`, `Int4`, or `Int8`.

You can also specify the model's UID using the `--model-uid` or `-u` flag. If not specified, Xinference will generate it automatically, creating a new model instance with a unique ID.
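
To make the placeholder substitution concrete, the sketch below assembles the `xinference launch` argument list in Python and checks the engine/format pairing and the non-GGUF quantizations listed above. `build_launch_command` and `FORMAT_QUANTIZATIONS` are hypothetical names introduced here; the code only builds the command and does not run it.

```python
import shlex

# Quantizations listed in this notebook for the non-GGUF formats.
FORMAT_QUANTIZATIONS = {
    "pytorch": ["none"],
    "awq": ["Int4"],
    "gptq": ["Int3", "Int4", "Int8"],
}


def build_launch_command(engine, name, size, model_format, quantization, model_uid=None):
    """Build the `xinference launch` argument list (hypothetical helper)."""
    # llama.cpp and ggufv2 go together, per the option lists above.
    if (engine == "llama.cpp") != (model_format == "ggufv2"):
        raise ValueError("llama.cpp must be paired with the ggufv2 format")
    allowed = FORMAT_QUANTIZATIONS.get(model_format)
    if allowed is not None and quantization not in allowed:
        raise ValueError(f"{quantization!r} is not listed for format {model_format!r}")
    args = [
        "xinference", "launch",
        "--model-engine", engine,
        "--model-name", name,
        "--size-in-billions", str(size),
        "--model-format", model_format,
        "--quantization", quantization,
    ]
    if model_uid:
        args += ["--model-uid", model_uid]
    return args


cmd = build_launch_command("llama.cpp", "gemma-2-it", 2, "ggufv2", "Q4_K_M",
                           model_uid="gemma-2-2b-it-Q4_K_M")
print(shlex.join(cmd))  # a shell-quoted command line that could be pasted after `!`
```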

**Note**: To simplify the process and ensure valid parameter combinations, it's recommended to use the Python API provided in this notebook. This automatically handles the selection of appropriate model sizes and quantization methods. For this tutorial, you'll stick to using `llama.cpp` as the model engine to run different `GGUF` Gemma 2 models on a modest GPU like the **T4**.

## Choose a Gemma 2 model
Xinference supports a variety of LLMs; see the [built-in models list](https://inference.readthedocs.io/en/latest/models/builtin/) for details.

Let’s start by running a built-in model: `gemma-2-it`. You can find out more about the available Gemma models [here](https://inference.readthedocs.io/en/latest/models/builtin/llm/gemma-2-it.html).


In [None]:
model_name = 'gemma-2-it'

In [None]:
#@title **Select Model Size**

#@markdown **Model Size (in billions):**
#@markdown - **2**: A smaller model that requires less memory but may produce lower-quality output.
#@markdown - **9**: A medium-sized model that balances output quality and resource usage.
#@markdown - **27**: A large model that generally produces better output but requires more memory and computation time.

model_size_in_billions = "27" #@param ["2", "9", "27"] {type:"string"}

# Convert model size to integer
model_size_in_billions = int(model_size_in_billions)

In [None]:
# Define allowed quantizations per model size
allowed_quantizations = {
    2: ["Q3_K_L", "Q4_K_M", "Q4_K_S", "Q5_K_M", "Q5_K_S", "Q6_K", "Q6_K_L", "Q8_0", "f32"],
    9: ["Q2_K", "Q2_K_L", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q4_K_L", "Q4_K_M", "Q4_K_S", "Q5_K_L", "Q5_K_M", "Q5_K_S", "Q6_K", "Q6_K_L", "Q8_0", "f32"],
    27: ["Q2_K", "Q2_K_L", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q4_K_L", "Q4_K_M", "Q4_K_S", "Q5_K_L", "Q5_K_M", "Q5_K_S", "Q6_K", "Q6_K_L", "Q8_0", "f32"]
}
#@markdown **Allowed Quantizations for Selected Model Size:**
print(f"Allowed quantizations for {model_size_in_billions}B model: {allowed_quantizations[model_size_in_billions]}")

Allowed quantizations for 27B model: ['Q2_K', 'Q2_K_L', 'Q3_K_L', 'Q3_K_M', 'Q3_K_S', 'Q4_K_L', 'Q4_K_M', 'Q4_K_S', 'Q5_K_L', 'Q5_K_M', 'Q5_K_S', 'Q6_K', 'Q6_K_L', 'Q8_0', 'f32']


In [None]:
#@title **Select Quantization**

#@markdown **Quantization Method:**
#@markdown - **Q2_K to Q8_0**: Lower-bit quantizations (e.g., Q2_K) reduce memory usage but may reduce model accuracy.
#@markdown - **Higher-bit quantizations** (e.g., Q8_0) better preserve model accuracy but require more memory because they use more bits per parameter.


#@markdown **Note:** Larger models and higher-bit quantizations require more memory and computation time. Ensure that your Colab instance has sufficient resources by choosing the appropriate **GPU Runtime Type** (T4, L4, A100) for your model.
import ipywidgets as widgets

# Create a dropdown widget listing the quantizations allowed for the selected model size
quantization_dropdown = widgets.Dropdown(
    options=allowed_quantizations[model_size_in_billions],
    description='Quantization:',
    disabled=False,
)

# Display the widget
display(quantization_dropdown)

Dropdown(description='Quantization:', options=('Q2_K', 'Q2_K_L', 'Q3_K_L', 'Q3_K_M', 'Q3_K_S', 'Q4_K_L', 'Q4_K…

In [None]:
# Read the selected quantization and build a model UID from the size and quantization
quantization = quantization_dropdown.value
model_uid = f"gemma-2-{model_size_in_billions}b-it-{quantization}"

print(f"Selected model size: {model_size_in_billions}B")
print(f"Selected quantization: {quantization}")
print(f"Model UID: {model_uid}")

Selected model size: 27B
Selected quantization: Q2_K
Model UID: gemma-2-27b-it-Q2_K


## Launch the Gemma Model with Xinference

Now you'll use Xinference's Python client to launch the Gemma model with the selected parameters.

In [None]:
from xinference.client import RESTfulClient

# Define the client to connect to Xinference
port = 9997  # Default Xinference port
client = RESTfulClient(f"http://localhost:{port}")

# Launch the Gemma model with selected parameters
try:
    print(f"Launching model '{model_name}' with quantization '{quantization}'...")
    print("This may take several minutes as the model needs to be downloaded.")
    model_uid = client.launch_model(
        model_uid=model_uid,
        model_engine="llama.cpp",
        model_name=model_name,
        model_size_in_billions=model_size_in_billions,
        model_format="ggufv2",
        quantization=quantization.lower(),
    )
    print("Model launched successfully!")
except Exception as e:
    print(f"An error occurred while launching the model: {e}")

Launching model 'gemma-2-it' with quantization 'Q2_K'...
This may take several minutes as the model needs to be downloaded.
Model launched successfully!


When you start a model for the first time, Xinference will download the model parameters from Hugging Face. This process might take a few minutes, depending on the size of the model weights. The model files are cached locally, so you won't need to redownload them for subsequent runs.

**Note**: If your runtime crashes or runs out of memory (OOM), consider switching to a different runtime type (such as **L4** or **A100**), or choose a better balance between **quantization** and **model size**.
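
As a rough way to reason about that balance, the weight footprint scales with the parameter count times the bits stored per weight. The sketch below uses the nominal bit width suggested by each quantization name, which is an assumption: real GGUF files store extra scale data and keep some tensors at higher precision, and inference also needs memory for the KV cache and activations, so actual usage is higher. `estimate_weight_gb` is a hypothetical helper.

```python
def estimate_weight_gb(size_in_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); a lower bound, not a measurement."""
    # billions of parameters * bits per parameter / 8 bits per byte = billions of bytes
    return size_in_billions * bits_per_weight / 8


# Nominal bit widths assumed from the quantization names (illustrative only).
for label, bits in [("Q2_K", 2), ("Q4_K_M", 4), ("Q8_0", 8), ("f32", 32)]:
    print(f"27B at {label}: at least ~{estimate_weight_gb(27, bits):.1f} GB of weights")
```

By this estimate, an `f32` 27B model needs over 100 GB for weights alone, far beyond a T4's 16 GB, while the low-bit variants come much closer to fitting.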

In [None]:
!xinference list

UID                  Type    Name        Format      Size (in billions)  Quantization
-------------------  ------  ----------  --------  --------------------  --------------
gemma-2-27b-it-Q2_K  LLM     gemma-2-it  ggufv2                      27  Q2_K



After running `!xinference list`, you can see that the model with the correct UID is now available for use.

## Interact with the model

Congrats! The model is now running in Xinference. You can try it out from the command line, via cURL, or via Xinference's Python client:


### Use a cURL request

Let's quickly test the model using a sample prompt.

In [None]:
%%bash -s "$model_uid"
model_uid=$1

# Construct the JSON data with variable expansion
request_payload=$(cat <<EOF
{
  "model": "$model_uid",
  "messages": [
    {
      "role": "user",
      "content": "What is the largest animal?"
    }
  ]
}
EOF
)

# Make the POST request using the constructed JSON data
curl -X POST \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "$request_payload" | jq .

{
  "id": "chatcmpl-42dbf0ef-2973-4a92-80db-87989c6b7603",
  "object": "chat.completion",
  "created": 1732796971,
  "model": "/root/.cache/huggingface/hub/models--bartowski--gemma-2-27b-it-GGUF/blobs/a361b524be3e172f3535b010c440d352f7f3103eee903d1eb939eea21d40e359",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The **blue whale** is the largest animal on Earth. \n\nIt can grow up to 100 feet long and weigh over 200 tons. That's about the size of a Boeing 737 airplane! 🐳 \n"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 53,
    "total_tokens": 69
  }
}
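
The same request can be sketched in Python using only the standard library. The endpoint and payload mirror the cURL call above; `build_chat_request` is a hypothetical helper name, and nothing is sent until `urlopen` is called against a running server.

```python
import json
import urllib.request


def build_chat_request(model_uid: str, prompt: str,
                       base_url: str = "http://127.0.0.1:9997") -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the cURL call."""
    payload = {
        "model": model_uid,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("gemma-2-27b-it-Q2_K", "What is the largest animal?")
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```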




### Use Xinference's Python client

In [None]:
#@title **Chat with the Model**
#@markdown Enter your message below to chat with the model.

query = "Who are you?" #@param {type:"string"}

# Send a chat message
model = client.get_model(model_uid)
model.chat(messages=[
    {
        "role": "user",
        "content": query
    }
])

{'id': 'chatcmpl-46ea8b7e-66fe-4aea-aa06-b05a54022221',
 'object': 'chat.completion',
 'created': 1732796978,
 'model': '/root/.cache/huggingface/hub/models--bartowski--gemma-2-27b-it-GGUF/blobs/a361b524be3e172f3535b010c440d352f7f3103eee903d1eb939eea21d40e359',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': 'I am Gemma, a large language model created by the Gemma team at Google DeepMind. I am trained on a massive dataset of text and code, which allows me to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. I am still under development, but I have learned to perform many kinds of tasks.'},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 14, 'completion_tokens': 73, 'total_tokens': 87}}
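
The return value is a plain dictionary in the chat-completion layout shown above, so the reply text and token counts can be read directly. A small sketch, using a shortened copy of that response as sample data; `extract_reply` is a hypothetical helper:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from the first choice of a chat completion."""
    return response["choices"][0]["message"]["content"]


# Shortened copy of the response printed above, used here as sample data.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "I am Gemma, a large language model ..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 73, "total_tokens": 87},
}

print(extract_reply(sample_response))
usage = sample_response["usage"]
print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion tokens")
```

In the notebook, passing the result of `model.chat(...)` to `extract_reply` would return just the message text.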

Congratulations on completing this tutorial! You've learned how to set up **Xinference** and **llama.cpp** to run different Gemma 2 models in the GGUF format and also interact with these models using Xinference's built-in Python client.

## Next Steps

For an even better user experience, consider exploring the following:

- **Xinference Documentation**:
  * [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)
  * [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)
  * [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)
- **Integrate Retrieval Augmented Generation (RAG)**: Improve the model's responses by incorporating a retrieval mechanism that fetches relevant information from external knowledge bases or documents, enhancing accuracy and context.
- **Utilize External APIs**: Expand the model's capabilities by connecting to external APIs for real-time data and services, enabling dynamic and up-to-date responses.
- **Enhance Output Formatting**: Modify the output display to mimic a chat interface for a more user-friendly interaction.
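
As a starting point for the output-formatting idea, the sketch below renders a list of chat messages as a labelled, wrapped transcript. `format_transcript` and the display labels are hypothetical choices for illustration, not part of Xinference.

```python
import textwrap


def format_transcript(messages, width=72):
    """Render chat messages as a simple labelled transcript (hypothetical helper)."""
    labels = {"user": "You", "assistant": "Gemma"}
    lines = []
    for message in messages:
        # Fall back to the raw role name for roles without a display label.
        prefix = f"{labels.get(message['role'], message['role'])}: "
        lines.append(textwrap.fill(
            message["content"].strip(),
            width=width,
            initial_indent=prefix,
            subsequent_indent=" " * len(prefix),
        ))
    return "\n".join(lines)


print(format_transcript([
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am Gemma."},
]))
```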

Enjoy experimenting with Gemma models!