##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Building a Chatbot with Gemma and Gradio

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Gradio_Chatbot.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Runtime Environment

  1. Click **Run in Google Colab**.
  2. In the menu, go to **Runtime** > **Change runtime type**.
  3. Under **Hardware accelerator**, select **T4 GPU**.


### Hugging Face Hub Access Token

Before diving into the tutorial, let's set up Gemma:

1. **Create a Hugging Face Account**: If you don't have one, you can sign up for a free account [here](https://huggingface.co/join).
2. **Access the Gemma Model**: Visit the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage terms.
3. **Generate a Hugging Face Token**: Go to your Hugging Face [settings page](https://huggingface.co/settings/tokens) and generate a new access token (preferably with `write` permissions). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section, where you'll set up environment variables in your Colab environment.**

### Configure Your Credentials

To access gated models such as Gemma, you need to authenticate with the Hugging Face (HF) Hub.

If you're using Colab, you can securely store your Hugging Face token (`HF_TOKEN`) using the Colab Secrets Manager:

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. **Add Hugging Face Token**:
   - Create a new secret with the **name** `HF_TOKEN`.
   - Copy and paste your token key into the **Value** input box for `HF_TOKEN`.
   - **Toggle** the button on the left to allow notebook access to the secret.

This code retrieves your secrets and sets them as environment variables for use later in the tutorial.

In [None]:
import os
import sys

if "google.colab" in sys.modules:
    from google.colab import userdata
    os.environ['HF_TOKEN'] = userdata.get("HF_TOKEN")

if "HF_TOKEN" not in os.environ:
    raise EnvironmentError(
        "The Hugging Face token (HF_TOKEN) could not be found in the "
        "environment variables. This token is required to download the Gemma "
        "models from the Hugging Face Hub. For more information about "
        "HF User Access tokens, please refer to the HF documentation "
        "here: https://huggingface.co/docs/hub/en/security-tokens."
    )

### Install dependencies

Next, install the required libraries. You only need `gradio` for the chat interface and `transformers` to load the Gemma model from the Hugging Face Hub.


In [None]:
!pip install -q -U gradio==5.9.1
!pip install -q -U transformers==4.46.1

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.2/57.2 MB[0m [31m13.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m320.4/320.4 kB[0m [31m23.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.8/94.8 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.3/11.3 MB[0m [31m34.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m73.2/73.2 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.3/62.3 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.1/44.1 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.0/10.0 MB[0m [31m63.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Chat with Gemma using Gradio

### Initializing Gemma 2 model

Let's create a pipeline that uses the `gemma-2-2b-it` model to generate text. The `transformers` library loads the model and tokenizer into memory from just the model name and a few basic parameters.

In [None]:
import torch
import transformers

# Model details
model_name = "google/gemma-2-2b-it"
device = "cuda"
model_kwargs = {
    "torch_dtype": torch.float16,
}

# Load the Gemma tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Create a pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    device=device,
    model_kwargs=model_kwargs
)

### Create a custom chat template

Hugging Face supports chat templates that can define the structure and format for converting conversations into a single tokenizable string, which is the input format expected by the language model. Check the [chat templates documentation](https://huggingface.co/docs/transformers/main/en/chat_templating) to learn more about templates and how to create a custom one.

Since Gemma doesn't support system instructions, the template prepends the system message to the first user turn. It has also been adjusted to be compatible with Gradio's chat interface. To learn more about the format expected by Gemma, check out the [Gemma formatting documentation](https://ai.google.dev/gemma/docs/formatting).
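
To make the target format concrete, here is a minimal pure-Python sketch that mirrors the logic of the template defined in the next cell. The helper name `render_gemma_prompt` is hypothetical and is not part of `transformers`; the `<bos>` default is an assumption about the tokenizer's BOS token. It is only meant to show what the resulting prompt string should roughly look like, not to replace `apply_chat_template`.

```python
from typing import Dict, List


def render_gemma_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True,
                        bos_token: str = "<bos>") -> str:
    """Hypothetical pure-Python mirror of the chat template's logic."""
    parts = [bos_token]

    # Fold a leading system message into the first user turn, because
    # Gemma has no dedicated system role. Like the template, this assumes
    # the message right after the system message comes from the user.
    if messages and messages[0]["role"] == "system":
        merged = (messages[0]["content"].strip() + " "
                  + messages[1]["content"].strip())
        parts.append(f"<start_of_turn>user\n{merged}<end_of_turn>\n")
        messages = messages[2:]

    for message in messages:
        content = message["content"].strip()
        if message["role"] == "user":
            parts.append(f"<start_of_turn>user\n{content}<end_of_turn>\n")
        elif message["role"] == "assistant":
            # Gemma names the assistant role "model".
            parts.append(f"<start_of_turn>model\n{content}<end_of_turn>\n")

    # Open a model turn so that generation continues from there.
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example_prompt = render_gemma_prompt([
    {"role": "system", "content": "You're a helpful assistant."},
    {"role": "user", "content": "Hi!"},
])
print(example_prompt)
```

Printing `example_prompt` shows a single user turn that contains both the system message and the user's text, followed by an open model turn.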

In [None]:
tokenizer.chat_template = \
    "{{ bos_token }}"\
    "{% if messages[0]['role'] == 'system' %}"\
        "{{'<start_of_turn>user\n' + messages[0]['content'] | trim + ' ' + messages[1]['content'] | trim + '<end_of_turn>\n'}}"\
        "{% set messages = messages[2:] %}"\
    "{% endif %}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{'<start_of_turn>user\n' + message['content'] | trim + '<end_of_turn>\n'}}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{'<start_of_turn>model\n' + message['content'] | trim + '<end_of_turn>\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '<start_of_turn>model\n' }}"\
    "{% endif %}"

### Handle new messages

Now, you need to define a function that will handle new messages (user inputs).

To make the model context-aware, we need to provide:

1. System message: The first message of the conversation that guides the behavior of the model during the chat.
1. Chat history: Messages exchanged between the assistant and the user so far.
1. New message: A new message sent by the user.

All of this information is converted into a list of messages. Then, `apply_chat_template` is used to create the actual prompt (a long string with all the special tokens required by Gemma). The prompt is passed to the tokenizer and then to the model to generate the response.

In [None]:
from typing import List, Dict

system_message = "You're a helpful assistant."

def chat_with_gemma(message: str, history: List[Dict[str, str]],
                    max_new_tokens: int = 512) -> str:
    """Chats with the Gemma 2 model and returns the response.

    This function takes a user message and chat history as input, formats them
    using the custom chat template, and generates a response using the Gemma 2
    pipeline.

    Args:
        message:        The user's message.
        history:        The chat history as a list of messages.
        max_new_tokens: The maximum number of new tokens to generate.

    Returns:
        response: Content generated by the model.
    """

    # Combine system message, history and the new message into a list of messages.
    messages = [
        {"role": "system", "content": system_message},
        *history,
        {"role": "user", "content": message},
    ]

    # Apply the chat template to convert it into the prompt (string).
    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    # A basic error handling mechanism. If something goes wrong, the
    # user will see "Something went wrong..." instead of a long error message.
    # It's usually a good place to handle quota limits, harmful content, etc.
    try:
        # Generate response using the pipeline defined above.
        outputs = pipeline(prompt, max_new_tokens=max_new_tokens)
        # The generated text starts with the prompt, so strip the prompt off.
        response = outputs[0]["generated_text"][len(prompt):]
    except Exception:
        response = "_Something went wrong. Please try again._"
    return response
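
The reply extraction at the end of the function can be illustrated without loading the model. The sketch below uses a hand-written `fake_outputs` list that imitates the structure a text-generation pipeline returns by default (a list of dicts whose `generated_text` contains the prompt followed by the completion). The helper `extract_reply` and the sample strings are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def extract_reply(outputs: List[Dict[str, str]], prompt: str) -> str:
    """Hypothetical helper: returns only the newly generated text."""
    generated = outputs[0]["generated_text"]
    # By default the pipeline echoes the prompt before the completion,
    # so drop that prefix when it is present.
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


# Hand-written stand-ins for a real prompt and a real pipeline result.
fake_prompt = "<bos><start_of_turn>user\nHi!<end_of_turn>\n<start_of_turn>model\n"
fake_outputs = [{"generated_text": fake_prompt + "Hello! How can I help?"}]

reply = extract_reply(fake_outputs, fake_prompt)
print(reply)
```

As an alternative to slicing, the text-generation pipeline also accepts `return_full_text=False`, which asks it to return only the completion.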

### Let's Run It!

Now, we will use Gradio's `ChatInterface` to create an interactive chat interface that will allow you to chat with our Gemma 2 model! In this case, it will create a window inside Google Colab, but if you run it in a standalone file, it will start an HTTP server, and you will be able to access the chat from your browser.
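
With `type="messages"`, the chat function receives the new message as a string and the history as a list of dictionaries with `role` and `content` keys, and it returns the reply as a string. The sketch below is a model-free stand-in with that same signature. The name `echo_bot` and the sample history are hypothetical, and it is only a way to check the wiring, not part of the original tutorial.

```python
from typing import Dict, List


def echo_bot(message: str, history: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for chat_with_gemma; no model required."""
    # Count the earlier user turns to show that the history is available.
    user_turns = sum(1 for turn in history if turn["role"] == "user")
    return f"(turn {user_turns + 1}) You said: {message}"


# History in the role/content shape that the chat function receives.
sample_history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "(turn 1) You said: Hi!"},
]
echo_reply = echo_bot("How are you?", sample_history)
print(echo_reply)
```

Passing `fn=echo_bot` to `gr.ChatInterface` in place of `chat_with_gemma` should let you try the interface before the model has finished loading.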

In [None]:
import gradio as gr

gr.ChatInterface(
    fn=chat_with_gemma,
    type="messages"
).launch()

## What's Next?

That's it! If you're wondering how to make your chatbot even better, check out the following resources:

- **Explore the Gemma family models:** Visit [Gemma Open Models](https://ai.google.dev/gemma) to learn about the latest updates regarding the Gemma family models, new capabilities, versions, and more.
- **Gradio Customization:** Explore the [Gradio documentation](https://www.gradio.app/docs) to learn about customizing your chat interface, adding new options and features.
- **Share Your Gradio Dashboard:** Check out the [Sharing your Gradio app](https://www.gradio.app/guides/sharing-your-app) page to learn how to safely share your Gradio dashboard with others!