##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build Chatbot with Gradio & Llama.cpp

Author: Sitam Meur

*   GitHub: [github.com/sitamgithub-MSIT](https://github.com/sitamgithub-MSIT/)
*   X: [@sitammeur](https://x.com/sitammeur)

Description: Google recently released the [Quantization Aware Trained (QAT) Gemma 3](https://huggingface.co/collections/google/gemma-3-qat-67ee61ccacbf2be4195c265b) checkpoints. These models maintain quality similar to half precision while using roughly a third of the memory. This notebook demonstrates how to create a user-friendly chat interface for the [gemma-3-1b-it-qat](https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf) text model using Llama.cpp (for inference) and Gradio (for the user interface).

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_3]Gradio_LlamaCpp_Chatbot.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need a Colab runtime with sufficient resources to run the Gemma 3 QAT model. For the 1B model used here, a CPU runtime is enough:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **CPU**.
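
Once the runtime is connected, a quick look at the available CPU cores can help when choosing thread counts for Llama.cpp later in the notebook. The following is a minimal sketch using only the standard library; the helper name `suggest_thread_count` and the cap of 8 are illustrative assumptions, not part of the original notebook.

```python
import os


def suggest_thread_count(cap: int = 8) -> int:
    """Return a thread count no larger than the visible CPU cores (hypothetical helper)."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cores, cap))


print(f"Visible CPU cores: {os.cpu_count()}")
print(f"Suggested n_threads: {suggest_thread_count()}")
```

If the suggested value is lower than the thread count passed to the model, lowering that count may avoid oversubscribing the CPU.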

### Gemma Setup

**Before we dive into the tutorial, let's get you set up:**

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).
2. **Model Access:** Head over to the [model page](https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf) and accept the usage conditions.
3. **Hugging Face Token:** Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section where we'll set up environment variables in your Colab environment.**

### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to securely store it.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Paste your token into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.

In [None]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
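
Outside Colab, `userdata.get` is not available, so the token has to come from somewhere else. One possible approach, sketched below, is to read the `HF_TOKEN` environment variable and fall back to an interactive prompt. The helper `resolve_hf_token` and its parameters are hypothetical names introduced for illustration only.

```python
import os
from getpass import getpass


def resolve_hf_token(env=os.environ, prompt_fn=getpass) -> str:
    """Return the token from the environment, prompting only if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # Fall back to a hidden interactive prompt
        token = prompt_fn("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token


# Demonstration with an explicit mapping and a placeholder value,
# so the sketch runs without prompting.
token = resolve_hf_token(env={"HF_TOKEN": "hf_placeholder"})
print(f"Token resolved ({len(token)} characters)")
```

In a local notebook, `os.environ["HF_TOKEN"] = resolve_hf_token()` would then take the place of the Colab-specific line.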

### Install dependencies
Run the cell below to install all the required dependencies.

In [None]:
!pip install -q huggingface_hub scikit-build-core llama-cpp-python llama-cpp-agent gradio

### Log into Hugging Face Hub

In [None]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

## Instantiate the Gemma 3 QAT model

We'll use the Hugging Face Hub to download the 4-bit (Q4_0) quantized Gemma 3 1B instruction-tuned, text-only model. The quantized Gemma 3 models were created with Quantization-Aware Training (QAT), which simulates low-precision operations during training. The result is a smaller, faster model that uses about 3x less memory than bfloat16 without significant accuracy loss.
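
The memory saving can be checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 1 billion parameters and the Q4_0 block layout (32 four-bit weights sharing one 16-bit scale, so 4.5 bits per weight). It is an estimate only: real files can be larger because some tensors may be stored at higher precision, and runtime memory also includes the KV cache.

```python
PARAMS = 1_000_000_000  # nominal 1B parameters (assumption; the exact count differs)

BF16_BITS = 16
# Q4_0 stores blocks of 32 weights: 32 x 4-bit values plus one 16-bit scale.
Q4_0_BITS = (32 * 4 + 16) / 32  # 4.5 bits per weight


def size_gb(params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = size_gb(PARAMS, BF16_BITS)
q4_gb = size_gb(PARAMS, Q4_0_BITS)

print(f"bfloat16: {bf16_gb:.2f} GB")          # 2.00 GB
print(f"Q4_0:     {q4_gb:.2f} GB")            # 0.56 GB
print(f"ratio:    {bf16_gb / q4_gb:.2f}x")    # 3.56x
```

Under these assumptions the ratio lands slightly above 3x, in line with the figure quoted above.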

Let's get started by downloading the model from Hugging Face Hub.

### Loading the model from HF Hub

In [None]:
# Download the GGUF model file
import os
from huggingface_hub import hf_hub_download

# Create the target directory if it does not exist yet
os.makedirs("./models", exist_ok=True)

hf_hub_download(
    repo_id="google/gemma-3-1b-it-qat-q4_0-gguf",
    filename="gemma-3-1b-it-q4_0.gguf",
    local_dir="./models",
)
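
After the download, a lightweight sanity check can confirm that the file on disk is a GGUF file before handing it to Llama.cpp. GGUF files begin with the four ASCII bytes `GGUF`; the sketch below checks only that header and does not validate the rest of the file. The helper name `looks_like_gguf` is hypothetical, and the demonstration uses a temporary stand-in file rather than the real model.

```python
import os
import tempfile


def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


# Self-contained demonstration with a temporary stand-in file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + b"\x00" * 16)
    demo_path = tmp.name

print(looks_like_gguf(demo_path))
os.remove(demo_path)
```

In this notebook the same check would be called as `looks_like_gguf("./models/gemma-3-1b-it-q4_0.gguf")`.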

### Prompt Formatting

The Gemma 3 model requires a specific format to distinguish the roles of the participants in a conversation. The system prompt is placed inside the first user turn. The prompt format is as follows:

```
<bos><start_of_turn>user
{system_prompt}

{prompt}<end_of_turn>
<start_of_turn>model
```
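
To make the template concrete, here is a small plain-Python sketch that renders a conversation into this format. It is independent of `llama_cpp_agent` and is meant only to illustrate the layout; the function name `render_gemma_prompt` is an assumption introduced here, and the notebook itself relies on the formatter defined in the next cell.

```python
from typing import List, Optional, Tuple


def render_gemma_prompt(
    system_prompt: str,
    history: List[Tuple[str, str]],
    user_message: str,
) -> str:
    """Render past turns plus a new user message in the Gemma turn format."""
    turns: List[Tuple[str, Optional[str]]] = list(history) + [(user_message, None)]
    parts = ["<bos>"]
    for index, (user_text, model_text) in enumerate(turns):
        # The system prompt is folded into the first user turn
        if index == 0 and system_prompt:
            user_text = f"{system_prompt}\n\n{user_text}"
        parts.append(f"<start_of_turn>user\n{user_text}<end_of_turn>\n")
        parts.append("<start_of_turn>model\n")
        # Past turns carry the model's reply; the final turn is left open
        if model_text is not None:
            parts.append(f"{model_text}<end_of_turn>\n")
    return "".join(parts)


print(render_gemma_prompt("You are a helpful assistant.", [], "What is gravity?"))
```

The rendered string ends with an open `<start_of_turn>model` line, which is where the model continues with its reply.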

In [None]:
# Define the prompt markers for Gemma 3
from llama_cpp_agent.chat_history.messages import Roles
from llama_cpp_agent.messages_formatter import MessagesFormatter, PromptMarkers

gemma_3_prompt_markers = {
    Roles.system: PromptMarkers("", "\n"),  # System prompt should be included within user message
    Roles.user: PromptMarkers("<start_of_turn>user\n", "<end_of_turn>\n"),
    Roles.assistant: PromptMarkers("<start_of_turn>model\n", "<end_of_turn>\n"),
    Roles.tool: PromptMarkers("", ""),  # Placeholder; only needed for tool support
}

# Create the formatter
gemma_3_formatter = MessagesFormatter(
    pre_prompt="",  # No pre-prompt
    prompt_markers=gemma_3_prompt_markers,
    include_sys_prompt_in_first_user_message=True,  # Include system prompt in first user message
    default_stop_sequences=["<end_of_turn>", "<start_of_turn>"],
    strip_prompt=False,  # Don't strip whitespace from the prompt
    bos_token="<bos>",  # Beginning of sequence token for Gemma 3
    eos_token="<eos>",  # End of sequence token for Gemma 3
)

## Chat with Gemma 3

The `respond` function below loads the model on first use and generates responses. It streams the output, so text appears as it is generated instead of only after the complete response is ready.

In [None]:
from typing import List, Tuple
from llama_cpp import Llama
from llama_cpp_agent import LlamaCppAgent
from llama_cpp_agent.providers import LlamaCppPythonProvider
from llama_cpp_agent.chat_history import BasicChatHistory
from llama_cpp_agent.chat_history.messages import Roles

llm = None
llm_model = None

def respond(
    message: str,
    history: List[Tuple[str, str]],
    model: str = "gemma-3-1b-it-q4_0.gguf",
    system_message: str = "You are a helpful assistant.",
    max_tokens: int = 1024,
    temperature: float = 0.7,
    top_p: float = 0.95,
    top_k: int = 40,
    repeat_penalty: float = 1.1,
):
    """
    Respond to a message using the Gemma3 model via Llama.cpp.

    Args:
        - message (str): The message to respond to.
        - history (List[Tuple[str, str]]): The chat history.
        - model (str): The model to use.
        - system_message (str): The system message to use.
        - max_tokens (int): The maximum number of tokens to generate.
        - temperature (float): The temperature of the model.
        - top_p (float): The top-p of the model.
        - top_k (int): The top-k of the model.
        - repeat_penalty (float): The repetition penalty of the model.

    Yields:
        str: The response text generated so far.
    """
    try:
        # Load the global variables
        global llm
        global llm_model

        # Ensure model is not None
        if model is None:
            model = "gemma-3-1b-it-q4_0.gguf"

        # Load the model
        if llm is None or llm_model != model:
            # Check if model file exists
            model_path = f"models/{model}"
            if not os.path.exists(model_path):
                yield f"Error: Model file not found at {model_path}. Please check your model path."
                return

            llm = Llama(
                model_path=model_path,
                flash_attn=False,
                n_gpu_layers=0,
                n_batch=8,
                n_ctx=2048,
                n_threads=8,
                n_threads_batch=8,
            )
            llm_model = model
        provider = LlamaCppPythonProvider(llm)

        # Create the agent
        agent = LlamaCppAgent(
            provider,
            system_prompt=f"{system_message}",
            custom_messages_formatter=gemma_3_formatter,
            debug_output=True,
        )

        # Set the settings like temperature, top-k, top-p, max tokens, etc.
        settings = provider.get_provider_default_settings()
        settings.temperature = temperature
        settings.top_k = top_k
        settings.top_p = top_p
        settings.max_tokens = max_tokens
        settings.repeat_penalty = repeat_penalty
        settings.stream = True

        messages = BasicChatHistory()

        # Add the chat history
        for msn in history:
            user = {"role": Roles.user, "content": msn[0]}
            assistant = {"role": Roles.assistant, "content": msn[1]}
            messages.add_message(user)
            messages.add_message(assistant)

        # Get the response stream
        stream = agent.get_chat_response(
            message,
            llm_sampling_settings=settings,
            chat_history=messages,
            returns_streaming_generator=True,
            print_output=False,
        )

        # Generate the response
        outputs = ""
        for output in stream:
            outputs += output
            yield outputs

    # Handle exceptions that may occur during the process
    except Exception as e:
        raise Exception(f"An error occurred: {str(e)}") from e
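
The accumulate-and-yield loop at the end of `respond` is easier to see in isolation. When a chat function is a generator, Gradio displays each yielded value in place of the previous one, so the function yields the full text so far rather than individual chunks. The sketch below uses a plain list as a stand-in for the model's token stream; `accumulate_stream` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def accumulate_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text generated so far after each incoming chunk."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text


# Stand-in for the token stream returned by the model.
fake_stream = ["Grav", "ity ", "pulls."]
for partial in accumulate_stream(fake_stream):
    print(partial)
```

Each printed line is a longer prefix of the final answer, which is what the chat window shows as the response grows.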

## Gradio UI

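The `gr.ChatInterface` below provides a chat UI with Send and Stop buttons, editable messages, and a collapsible Parameters panel where advanced users can tune model parameters like temperature and top-p. The additional inputs are passed to `respond` in the order listed, after `message` and `history`.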

In [None]:
# Create a chat interface
import gradio as gr

demo = gr.ChatInterface(
    respond,
    examples=[["What is the capital of France?"], ["Tell me something about artificial intelligence."], ["What is gravity?"]],
    additional_inputs_accordion=gr.Accordion(
        label="⚙️ Parameters", open=False, render=False
    ),
    additional_inputs=[
        gr.Dropdown(
            choices=[
                "gemma-3-1b-it-q4_0.gguf",
            ],
            value="gemma-3-1b-it-q4_0.gguf",
            label="Model",
            info="Select the AI model to use for chat",
        ),
        gr.Textbox(
            value="You are a helpful assistant.",
            label="System Prompt",
            info="Define the AI assistant's personality and behavior",
            lines=2,
        ),
        gr.Slider(
            minimum=512,
            maximum=2048,
            value=1024,
            step=1,
            label="Max Tokens",
            info="Maximum length of response (higher = longer replies)",
        ),
        gr.Slider(
            minimum=0.1,
            maximum=2.0,
            value=0.7,
            step=0.1,
            label="Temperature",
            info="Creativity level (higher = more creative, lower = more focused)",
        ),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p",
            info="Nucleus sampling threshold",
        ),
        gr.Slider(
            minimum=1,
            maximum=100,
            value=40,
            step=1,
            label="Top-k",
            info="Limit vocabulary choices to top K tokens",
        ),
        gr.Slider(
            minimum=1.0,
            maximum=2.0,
            value=1.1,
            step=0.1,
            label="Repetition Penalty",
            info="Penalize repeated words (higher = less repetition)",
        ),
    ],
    submit_btn="Send",
    stop_btn="Stop",
    chatbot=gr.Chatbot(scale=1, show_copy_button=True, resizable=True),
    flagging_mode="never",
    editable=True,
    cache_examples=False,
)
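
To build some intuition for the Temperature, Top-k, and Top-p sliders, the sketch below applies the three controls to a toy set of next-token scores. It is a simplified illustration under stated assumptions, not Llama.cpp's actual sampler chain, whose ordering and details differ. The function name `filter_distribution` and the example scores are made up for this demonstration.

```python
import math
from typing import Dict


def filter_distribution(
    logits: Dict[str, float], temperature: float, top_k: int, top_p: float
) -> Dict[str, float]:
    """Apply temperature, then top-k, then top-p; return renormalized probabilities."""
    # Temperature: divide scores before softmax (lower = sharper distribution)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: weight / total for tok, weight in weights.items()}

    # Top-k: keep only the k most likely tokens
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"Paris": 4.0, "London": 2.0, "Rome": 1.0, "banana": -2.0}
for tok, prob in filter_distribution(toy_logits, 0.7, 3, 0.95).items():
    print(f"{tok}: {prob:.3f}")
```

Lowering the temperature or tightening top-k and top-p concentrates probability on the most likely tokens, while higher values leave more candidates available for sampling.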

Finally, we'll launch our chat interface with Gradio.

In [None]:
# Launch the chat interface
demo.launch(debug=True)

## What's Next?

That's it! Here are some ideas to explore further:

- **Explore the Gemma family models:** Visit [Gemma Open Models](https://ai.google.dev/gemma) to learn about the latest updates regarding the Gemma family models, new capabilities, versions, and more.

- **Gradio Customization:** Explore the [Gradio documentation](https://www.gradio.app/docs) to learn about customizing your chat interface, adding new options and features.

- **Share Your Gradio Dashboard:** Check out the [Sharing your Gradio app](https://www.gradio.app/guides/sharing-your-app) page to learn how to safely share your Gradio dashboard with others!