##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with Gemma 2, Gemini, and RouteLLM

[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.
Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

[Gemini](https://blog.google/technology/ai/google-gemini-ai/) is a family of large language models developed by Google. The models are multimodal, meaning they can work with text, code, images, and audio. Gemini models perform strongly on tasks such as question answering, summarization, and translation.

[RouteLLM](https://github.com/lm-sys/RouteLLM) is a framework that helps you optimize your LLM usage by routing queries to the most appropriate model based on their complexity. By sending simpler queries to a cheaper model, it can reduce costs while largely preserving response quality. You can integrate it into your existing applications and experiment with different routing strategies.

In this notebook, you will learn how to route between the Gemini and the Gemma 2 models using **RouteLLM** in a Google Colab environment. You'll install the necessary packages, set up the models, and run a sample prompt.

<table align="left">
<td>
 <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Using_Gemini_and_Gemma_with_RouteLLM.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
</td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you must have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

### Set up Hugging Face and Gemini

**Before you dive into the tutorial, let's get you set up with Hugging Face and Gemini:**

#### Hugging Face setup

1. **Hugging Face Account:** If you don't already have one, you can create a free one by clicking [here](https://huggingface.co/join).

2. **Hugging Face Token:** Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

#### Gemini setup

1. **Gemini API key:** To use the Gemini API, you need an API key. You can create a key with a few clicks in [Google AI Studio](https://aistudio.google.com/app/apikey).

**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**

### Configure your HF token and Gemini token

Add your Hugging Face token and Gemini API key to the Colab Secrets manager to store them securely.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Copy/paste your Hugging Face token into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.
5. Create a new secret with the name `GOOGLE_API_KEY`.
6. Copy/paste your Gemini API key into the Value input box of `GOOGLE_API_KEY`.
7. Toggle the button on the left to allow notebook access to the secret.

In [None]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
os.environ["GEMINI_API_KEY"] = userdata.get("GOOGLE_API_KEY")
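
Outside Colab, `google.colab` cannot be imported, so the cell above fails. The following is a minimal sketch of a fallback, assuming the secrets are already exported as environment variables on your system. `get_secret` is a hypothetical helper written for this example; it is not part of Colab or RouteLLM.

```python
import os


def get_secret(name):
    """Return a secret from Colab if available, else from the environment.

    Hypothetical helper for illustration only.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Secret missing or notebook access not granted: fall through
            # to the environment lookup below.
            pass

    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab or environment")
    return value
```

With this helper, `os.environ["GEMINI_API_KEY"] = get_secret("GOOGLE_API_KEY")` behaves the same in Colab and on a local machine where `GOOGLE_API_KEY` is exported.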

Currently, RouteLLM checks for `OPENAI_API_KEY` before starting. As a temporary workaround, you can set `OPENAI_API_KEY` to a dummy value. The RouteLLM collaborators are working on a fix for this [issue](https://github.com/lm-sys/RouteLLM/issues/19).

In [None]:
os.environ["OPENAI_API_KEY"] = "dummy"

### Install dependencies

You'll need to install a few Python packages and dependencies for RouteLLM and the Gemini API.

Run the following cell to install or upgrade them:

In [None]:
# Install RouteLLM package.
! pip install "routellm[serve,eval]"

# Install the Python SDK for the Gemini API, `google-generativeai`.
! pip install -q google-generativeai
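
To confirm that the installation succeeded before moving on, you can query the installed package versions. This is a small sketch using only the standard library. `installed_version` is a hypothetical helper name, and the package names are the ones passed to `pip` above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("routellm", "google-generativeai"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```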

### Set up Gemma 2 with Ollama

To use a local model like Gemma 2 with RouteLLM you need [Ollama](https://ollama.com/).

Ollama is an open-source tool for downloading and running large language models (LLMs) locally. It serves models through a local HTTP API, which lets tools such as RouteLLM send requests to a Gemma 2 model running in your own environment.

#### Install Ollama

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

#### Start Ollama as a background subprocess

In [None]:
import subprocess
import time

process = subprocess.Popen("ollama serve", shell=True)
time.sleep(5)

#### Run Gemma 2 in Ollama as a background subprocess

In [None]:
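# `ollama run` pulls the Gemma 2 weights on first use, so the model may need
# more than a few seconds before it can answer requests.
gemma_process = subprocess.Popen("ollama run gemma2", shell=True)
time.sleep(5)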

#### Check if Ollama is running

Run the following command to see whether Ollama is up and running. Continue to the next cell when the output of the following command says "**Ollama is running**".

In [None]:
!curl localhost:11434
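
A fixed `time.sleep(5)` may be too short on a slow runtime. As an alternative, the sketch below polls the same endpoint from Python until it responds. `wait_for_server` is a hypothetical helper, and the default URL assumes Ollama is listening on its default port, 11434, as in the `curl` command above.

```python
import time
import urllib.request


def wait_for_server(url="http://localhost:11434", timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` after starting `ollama serve` returns `True` once the server answers, or `False` if it is still unreachable when the timeout expires. Note that this only checks the server itself, not whether the Gemma 2 weights have finished downloading.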

## How RouteLLM works


RouteLLM is a systematic framework for preference-data-based LLM routing. There are two models in RouteLLM's routing setup: a weaker but less costly model and a stronger but more costly model.

When a prompt is sent to RouteLLM, an underlying routing approach routes the prompt to either the strong or weak model for inference. RouteLLM offers four different routing approaches:

1. **Similarity-weighted (SW) ranking**: A similarity-weighted (SW) ranking router that performs a "weighted Elo calculation" based on similarity.

2. **Matrix factorization**: A matrix factorization model that learns a scoring function for how well a model can answer a prompt.

3. **BERT classifier**: A BERT classifier that predicts which model can provide a better response.

4. **Causal LLM classifier**: A causal LLM classifier that predicts which model can provide a better response.
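
Whichever router is used, the final decision can be pictured as comparing a score against a threshold. The sketch below is a toy illustration of that idea, not RouteLLM's implementation: `route` and `toy_score` are hypothetical names, and `toy_score` stands in for a trained router by treating longer prompts as harder.

```python
def route(prompt, threshold, score_fn, strong_model, weak_model):
    """Pick a model for `prompt`.

    `score_fn` stands in for a trained router: it returns an estimate in
    [0, 1] of how likely the strong model is to give the better answer.
    """
    score = score_fn(prompt)
    return strong_model if score >= threshold else weak_model


def toy_score(prompt):
    """Hypothetical stand-in scorer: longer prompts count as harder."""
    return min(len(prompt.split()) / 20.0, 1.0)


prompts = (
    "Hello!",
    "Prove that the square root of two is irrational and explain each step",
)
for prompt in prompts:
    chosen = route(prompt, 0.46514, toy_score,
                   strong_model="gemini/gemini-pro",
                   weak_model="ollama_chat/gemma2")
    print(f"{chosen} <- {prompt}")
```

Raising the threshold sends fewer prompts to the strong model, which lowers cost at the risk of weaker answers on hard prompts.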

Read more about these routing methods and their performance in the [LMSYS blog post](https://lmsys.org/blog/2024-07-01-routellm/).

You can also refer to the research paper, [RouteLLM: Learning to Route LLMs with Preference Data](https://arxiv.org/abs/2406.18665), published by the creators of RouteLLM.

## Routing between Gemini and Gemma 2 using the `routellm` library


You can use **RouteLLM** in your Python code using the `routellm` library.

You will initialize the `Controller` from the `routellm` library by specifying `routers`, `strong_model`, and `weak_model`. Here's what each argument of `Controller` does:

- `routers`: A list of router names. For this notebook, you will choose `bert` since it is easier to run `bert` in a Colab environment.
- `strong_model`: A stronger, more expensive model. For this example you will use **Gemini Pro**.
- `weak_model`: A weaker but cheaper model. You can put your local **Gemma 2** model here.

In [None]:
from routellm.controller import Controller

client = Controller(
  routers=["bert"],  # Use `bert` router
  strong_model="gemini/gemini-pro",
  weak_model="ollama_chat/gemma2"
)

**Threshold calibration** is essential for balancing cost and quality in LLM routing. The optimal threshold depends on your router and on the queries you receive. Calibrate the threshold using a sample of your incoming queries and the percentage of queries you want routed to the stronger model. By default, RouteLLM supports calibration based on the public [Chatbot Arena dataset](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k).

For this example, you will calibrate the threshold for `bert` such that 30% of the calls are routed to the stronger model. To calculate this threshold, you can run the `routellm.calibrate_threshold` command and provide the following values.

- `--routers`: `bert`
- `--strong-model-pct`: `0.3`

In [None]:
!python -m routellm.calibrate_threshold --task calibrate --routers bert --strong-model-pct 0.3 --config config.example.yaml
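
Conceptually, calibration amounts to choosing a cutoff on the router's scores for a sample of queries. The sketch below shows that idea on made-up numbers. It is an illustration under stated assumptions, not the code behind the command above: `calibrate_threshold` here is a hypothetical function, and `sample_scores` are invented router scores.

```python
def calibrate_threshold(scores, strong_model_pct):
    """Return a threshold so about `strong_model_pct` of `scores` are >= it."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < strong_model_pct <= 1.0:
        raise ValueError("strong_model_pct must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of sample queries that should go to the strong model.
    n_strong = max(1, round(strong_model_pct * len(ranked)))
    # The lowest score among those queries becomes the threshold.
    return ranked[n_strong - 1]


# Invented router scores for ten sample queries.
sample_scores = [0.05, 0.12, 0.33, 0.41, 0.47, 0.52, 0.68, 0.74, 0.81, 0.93]
threshold = calibrate_threshold(sample_scores, strong_model_pct=0.3)
routed_to_strong = sum(score >= threshold for score in sample_scores)
print(threshold, routed_to_strong)
```

On this sample, three of the ten queries score at or above the returned threshold, matching the requested 30%. With tied scores, slightly more queries than requested can land on the strong model.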

The `client.chat.completions.create` function lets you prompt your RouteLLM setup. You can specify the threshold value you obtained in the previous step in the `model` argument of the `client.chat.completions.create` function. If the router is `bert` and the threshold is **0.46514**, pass `router-bert-0.46514` to the `model` argument. You can pass your chat messages to the `messages` argument.

In [None]:
response = client.chat.completions.create(
  # This tells RouteLLM to use the bert router with a cost threshold of 0.46514
  model="router-bert-0.46514",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print("Selected model: {model}\n".format(model=response.model))
print("Response: {model_response}".format(
    model_response=response.choices[0].message.content))
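
If you recalibrate often, it can help to build the `model` string from the router name and threshold instead of typing it by hand. The helper below is a hypothetical convenience that follows the `router-bert-0.46514` pattern described above.

```python
def router_model_name(router, threshold):
    """Build the `model` string in the `router-<name>-<threshold>` pattern."""
    return f"router-{router}-{threshold}"


print(router_model_name("bert", 0.46514))
```

The returned string can be passed directly as the `model` argument of `client.chat.completions.create`.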

To learn more about the Python usage of **RouteLLM**, visit [RouteLLM's GitHub page](https://github.com/lm-sys/RouteLLM).

## Conclusion

Congratulations! You've successfully routed between the Gemini and the Gemma 2 models using **RouteLLM** in a Google Colab environment.