##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Interacting with Gemma 2 using SGLang

[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open-source language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.
Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

[SGLang](https://github.com/sgl-project/sglang?tab=readme-ov-file) is a serving framework for large language models. It offers a fast backend runtime and a flexible frontend language that lets you control and customize model interactions.

In this notebook, you will learn how to prompt the Gemma 2 model in several ways using the **SGLang** HTTP server, backend runtime, and frontend language in a Google Colab environment.
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Using_Gemma_with_SGLang.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.
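
To confirm that the runtime actually has a GPU attached before continuing, a check along the following lines can help. This is a minimal sketch that assumes the `nvidia-smi` tool is on the `PATH`, as it typically is on Colab GPU runtimes. The `list_gpus` helper is illustrative and is not part of Colab or SGLang.

```python
import shutil
import subprocess


def list_gpus():
    """Return the names of visible NVIDIA GPUs, or an empty list if none are found."""
    # Assumption: nvidia-smi is installed wherever an NVIDIA GPU is attached.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected; check the runtime type.")
```

An empty list usually means the runtime type was not changed or the session needs to be reconnected.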

### Set up Hugging Face and Gemma

**Before you dive into the tutorial, let's get you set up with Hugging Face and Gemma:**

#### Hugging Face setup

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).

2. **Hugging Face Token:** Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**

### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to store it securely.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Paste your HF token into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.

In [1]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
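
Outside Colab, `google.colab` cannot be imported, so the cell above fails. One possible fallback is sketched below: it tries the Colab Secrets API first and otherwise reads an `HF_TOKEN` environment variable. The `resolve_hf_token` helper is a hypothetical name introduced here for illustration.

```python
import os


def resolve_hf_token():
    """Return the Hugging Face token from Colab Secrets or the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get("HF_TOKEN")

    # Assumption: outside Colab the token was exported as HF_TOKEN beforehand.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before continuing.")
    return token
```

With this helper, the assignment becomes `os.environ["HF_TOKEN"] = resolve_hf_token()` in either environment.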

### Install dependencies

First, you must install the necessary packages for SGLang.

In [2]:
!pip install "sglang[all]"

# Install FlashInfer accelerated kernels
!pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

Collecting sglang[all]
  Downloading sglang-0.3.5-py3-none-any.whl.metadata (21 kB)
Collecting jedi>=0.16 (from IPython->sglang[all])
  Downloading jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting anthropic>=0.20.0 (from sglang[all])
  Downloading anthropic-0.39.0-py3-none-any.whl.metadata (22 kB)
Collecting litellm>=1.0.0 (from sglang[all])
  Downloading litellm-1.51.3-py3-none-any.whl.metadata (32 kB)
Collecting tiktoken (from sglang[all])
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting vllm==0.6.3.post1 (from sglang[all])
  Downloading vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (10 kB)
Collecting transformers>=4.45.2 (from vllm==0.6.3.post1->sglang[all])
  Downloading transformers-4.46.1-py3-none-any.whl.metadata (44 kB)
Collecting uvicorn[standard] (from vllm==0.6.3.pos

## Overview

SGLang offers a fast backend runtime and flexible frontend language. To showcase the different ways in which Gemma 2 can be prompted using SGLang, this notebook is divided into the following sections:
1. Launch an HTTP server using SGLang. Use Python `requests` to prompt Gemma using SGLang's native generation API.
2. Set up an SGLang backend inference engine to prompt Gemma without an HTTP server.
3. Use SGLang's frontend generation language to prompt Gemma and also explore a few of its capabilities.

## 1. Sending requests to SGLang server running Gemma 2

In this section, you will launch an HTTP server to run Gemma 2 using SGLang and send a prompt to the model using the native generation API endpoint.

### Launch a server

The SGLang server can be launched by running the following command in the terminal:

`python -m sglang.launch_server --model-path google/gemma-2-2b-it --port YOUR_PREFERRED_PORT`

In a Colab environment, you must run the SGLang server as a Python subprocess and manage its termination using Python's `subprocess` package. SGLang provides some utility methods that abstract these details for you. The `execute_shell_command` function lets you launch the server as a Python subprocess, while the `wait_for_server` function waits for the server to be up and running before you can send requests to it.

You can specify Gemma 2's Hugging Face repo ID directly for the `--model-path` argument. SGLang will download the necessary files from the Hugging Face repository to start the server.

You can set any port of your choice to run SGLang using the `--port` argument.

Throughout this notebook, `--mem-fraction-static` is set to 0.6 to avoid CUDA Out of Memory errors when running on the Colab free tier. Setting the `--mem-fraction-static` argument to a lower value reduces the memory usage of the KV cache memory pool. Feel free to experiment with different values according to your use case.
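
If you plan to experiment with different ports or memory fractions, it can be convenient to assemble the command from parameters. The sketch below uses only the three flags discussed above. The `build_launch_command` helper and its range check are assumptions made for illustration, not part of SGLang.

```python
def build_launch_command(model_path, port, mem_fraction_static):
    """Assemble the launch_server command line from the options discussed above."""
    # Assumption: the value is a fraction of GPU memory, so it must lie in (0, 1].
    if not 0.0 < mem_fraction_static <= 1.0:
        raise ValueError("mem_fraction_static must be in the interval (0, 1]")
    return (
        "python -m sglang.launch_server "
        f"--model-path {model_path} "
        f"--mem-fraction-static {mem_fraction_static} "
        f"--port {port}"
    )


command = build_launch_command("google/gemma-2-2b-it", port=9000, mem_fraction_static=0.6)
print(command)
```

The resulting string can be passed to `execute_shell_command` in place of a hard-coded command.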

**Note**: The following code snippet defines a function that executes the shell command and waits for the server to be ready. This function will be reused later in this notebook.

In [3]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
)

def start_server():

  process = execute_shell_command(
  """
  python -m sglang.launch_server --model-path google/gemma-2-2b-it \
  --mem-fraction-static 0.6 \
  --port 9000
  """
  )

  wait_for_server("http://localhost:9000")

  return process

Invoke the previously defined `start_server` function to start the server and obtain a reference to the server process.

**Note**: It takes 2 to 4 minutes for the server to be up and running.

In [4]:
server_process = start_server()

The server is now ready and can be reached at http://localhost:9000/ from within this notebook.
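
If you want to verify reachability yourself, for example after a long idle period, a small standard-library poller is one option. This is only a sketch and does not reflect how `wait_for_server` is implemented. It treats any HTTP response, including an error status, as proof that the server is listening, and it bypasses any configured proxy because the server is local. The `is_server_reachable` name is hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_server_reachable(base_url, timeout_s=5.0, interval_s=0.5):
    """Poll base_url until it answers any HTTP request or timeout_s elapses."""
    # An empty ProxyHandler disables proxy auto-detection for local requests.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener.open(base_url, timeout=2.0):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status such as 404.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: nothing is listening yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `is_server_reachable("http://localhost:9000")` returns `True` once something is listening on that port.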

### Send a request to Gemma 2 using SGLang's Native Generation API

The following code snippet uses Python's `requests` library to invoke SGLang's native generation API on the server to send a prompt to Gemma-2.
You can specify your preferred values for sampling parameters such as `temperature`, `top_p`, and `max_new_tokens`.

For a full list of sampling parameters supported by SGLang, please refer to SGLang's [Sampling Parameters in SGLang Runtime](https://sgl-project.github.io/references/sampling_params.html) guide.


To generate a streaming response from the model, add the key `stream` set to `True` to the request JSON and set the `stream` parameter of `requests.post` to `True`.

An example of a streaming generation is provided in SGLang's [Quick Start](https://sgl-project.github.io/start/send_request.html#Streaming) documentation.
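
The parsing side of a streaming request can be sketched without a live server. The helper below assumes the framing used in SGLang's linked example: each line starts with `data:`, the payload is JSON whose `text` field holds the text generated so far, and the stream ends with `data: [DONE]`. If your SGLang version sends only the new text in each chunk, the slicing step is unnecessary. The `iter_new_text` name and the simulated stream are illustrative.

```python
import json


def iter_new_text(lines):
    """Yield only the newly generated text from a stream of `data:` lines."""
    seen = 0  # number of characters already yielded
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        # Assumption: "text" is cumulative, so only the unseen suffix is new.
        text = json.loads(payload)["text"]
        yield text[seen:]
        seen = len(text)


# Simulated stream. With a live server, the lines would come from
# requests.post(url, json={..., "stream": True}, stream=True).iter_lines().
fake_stream = [
    b'data: {"text": "Earth"}',
    b"",
    b'data: {"text": "Earth is old"}',
    b"data: [DONE]",
]
print("".join(iter_new_text(fake_stream)))
```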


In [5]:
import requests
import json

response = requests.post(
    "http://localhost:9000/generate",
    json={
        "text": "What is the age of earth?.",
        "sampling_params": {
            "temperature": 0.8,
        },
    },
)
print(json.dumps(response.json(), indent=2))

{
  "text": " \n\nI'm confused. \n\nIs it billions of years old?\n\nPlease explain. \n\n\nYou're right to be confused! It's a big number. Here's a breakdown:\n\n**Earth is about 4.54 \u00b1 0.05 billion years old.**\n\n* **Billions:** This means it's older than you and me, for sure! \n* **4.54 billion:**  This is the most precise estimate we have. \n* **\u00b1 0.05:** This means there's a range of 0.0",
  "meta_info": {
    "prompt_tokens": 8,
    "completion_tokens": 128,
    "completion_tokens_wo_jump_forward": 128,
    "cached_tokens": 1,
    "finish_reason": {
      "type": "length",
      "length": 128
    },
    "id": "1485c86977304adf92b2db1f77054a07"
  }
}
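
In the response above, `finish_reason` has type `length` and the text stops mid-sentence, which indicates that generation ended at the token limit rather than at a natural stopping point. A small check like the following sketch can flag such cases. The `was_truncated` helper is a hypothetical name, and the sample dictionary is a shortened copy of the response shown above.

```python
def was_truncated(result):
    """Return True if generation stopped because it hit the token limit."""
    finish_reason = result.get("meta_info", {}).get("finish_reason") or {}
    return finish_reason.get("type") == "length"


# Shortened copy of the response above (text and some fields trimmed).
sample = {
    "text": " \n\nI'm confused.",
    "meta_info": {
        "prompt_tokens": 8,
        "completion_tokens": 128,
        "finish_reason": {"type": "length", "length": 128},
    },
}
print(was_truncated(sample))
```

When this returns `True`, raising `max_new_tokens` in `sampling_params` is the usual way to allow a longer answer.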


You can stop the server by using the `terminate_process` function from `sglang.utils`. This is equivalent to pressing Ctrl+C to stop the server from the terminal.


In [6]:
terminate_process(server_process)
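
If a cell raises an exception between starting and stopping the server, the process keeps running and holds GPU memory. One way to guard against that is a context manager that always calls the stop function. The sketch below is generic: `running_server` is a hypothetical helper, and the demonstration uses stand-in callables so that it runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def running_server(start, stop):
    """Start a server with start() and always call stop(process) on exit."""
    process = start()
    try:
        yield process
    finally:
        stop(process)


# Demonstration with stand-in callables. In this notebook, the arguments
# would be start_server and terminate_process.
events = []
with running_server(lambda: "fake-process", events.append) as proc:
    events.append("working with " + proc)
print(events)
```

In the notebook, `with running_server(start_server, terminate_process) as server_process:` would wrap the request code.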

## 2. Offline batch inference using SGLang backend engine

SGLang provides an inference engine that allows you to directly interact with local models like Gemma 2 without requiring an HTTP server. You can use this for building custom servers or for offline batch inference.

In this section, you will initialize the inference engine to run Gemma 2 and send a batch of prompts to it.


### Initialize SGLang inference engine with Gemma 2

Create an instance of the `sglang.Engine` class to run Gemma 2 by specifying its Hugging Face repo ID for the `model_path` argument.

In [7]:
from sglang import Engine

llm = Engine(model_path="google/gemma-2-2b-it", mem_fraction_static=0.6)

### Batch prompting Gemma 2 using SGLang inference engine

You can send a batch of prompts for inference to the SGLang engine in one of the following ways:

1. Non-streaming synchronous call
2. Streaming synchronous call
3. Non-streaming asynchronous call
4. Streaming asynchronous call

The following sections use the engine's synchronous `generate` function to produce both non-streaming and streaming responses from Gemma 2 for a batch of prompts.

You can refer to SGLang's [Offline Engine API](https://sgl-project.github.io/backend/offline_engine_api.html) guide for examples of asynchronous response generation.

### Non-streaming synchronous prompting

Define a list of prompts to query Gemma 2 with.

In [8]:
prompts = [
    "Summarize what a galaxy is in three to four lines.",
    "List any 3 observatories in the world.",
]

Generate a batch of non-streaming responses from Gemma 2 using the inference engine's `generate` function. Pass the list of prompts you defined earlier and an optional dictionary of sampling parameters to this function. The function returns a list of complete responses from the model to the batch of prompts.


In [9]:
sampling_params = {"temperature": 0.1}
outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print("=================================================================\n")
    print(f"Prompt: {prompt}\n\nGenerated text: {output['text']}\n")


Prompt: Summarize what a galaxy is in three to four lines.

Generated text: 

A galaxy is a vast collection of stars, gas, dust, and dark matter held together by gravity. It is a massive, gravitationally bound system that can range in size from a few hundred thousand to billions of stars. Galaxies come in various shapes and sizes, from spiral galaxies like our Milky Way, to elliptical galaxies, and irregular galaxies. 



Prompt: List any 3 observatories in the world.

Generated text: 

Here are 3 observatories in the world:

1. **Keck Observatory:** Located on Mauna Kea in Hawaii, the Keck Observatory is home to two of the world's largest optical/infrared telescopes.
2. **Very Large Telescope (VLT):** Located in the Atacama Desert of Chile, the VLT is a collection of four telescopes that work together to provide high-resolution images of distant objects.
3. **James Webb Space Telescope (JWST):** Launched in December 2021, the JWST is the largest and most powerful space telescope ever

### Streaming synchronous prompting

To generate streaming responses from the model to the previously defined batch of prompts, iterate over the `prompts` and invoke the inference engine's `generate` function with an additional argument `stream` set to `True`. You can access each chunk in the streaming response by iterating over the response of the `generate` function.

In [10]:
for prompt in prompts:
    print("\n===============================================================\n")
    print(f"\nPrompt: {prompt}\n")
    print("Generated text: \n", end="", flush=True)

    for chunk in llm.generate(prompt, sampling_params, stream=True):
        print(chunk["text"], end="", flush=True)




Prompt: Summarize what a galaxy is in three to four lines.

Generated text: 


A galaxy is a massive collection of stars, gas, dust, and dark matter held together by gravity. These vast structures range in size from a few tens of thousands to billions of light-years across. Galaxies are the building blocks of the universe, containing billions of stars and countless planets. They come in various shapes and sizes, from spiral galaxies like our own Milky Way to elliptical galaxies and irregular galaxies. 



Prompt: List any 3 observatories in the world.

Generated text: 


Here are 3 observatories in the world:

1. **Keck Observatory:** Located on Mauna Kea in Hawaii, the Keck Observatory is one of the world's most powerful optical/infrared telescopes.
2. **Very Large Telescope (VLT):** Located in the Atacama Desert of Chile, the VLT is a collection of four telescopes that work together to provide high-resolution images of distant objects.
3. **European Southern Observatory (ESO) Very

Now you can shut down and clean up the SGLang inference engine.

In [11]:
llm.shutdown()

W1105 16:28:30.518000 131977277298240 torch/_inductor/compile_worker/subproc_pool.py:126] SubprocPool unclean exit


## 3. Inference using the frontend Structured Generation Language (SGLang)

In addition to the HTTP server and the offline backend engine, SGLang also offers a frontend language that supports more customization and complex prompting workflows.

In the following sections, you will explore how to start a multi-turn conversation with Gemma 2 using SGLang's frontend language. You will also see how to obtain responses from Gemma 2 in JSON format.

### Launch a server

First, you must launch a server using SGLang, specifying the Hugging Face repo ID of Gemma 2. You can reuse the `start_server` function defined in section 1 to launch the server.

In [12]:
server_process = start_server()

Use the `function` decorator provided by SGLang to define a function that accepts a few questions you want to ask the model as its arguments. The `user` function is used to add the user's question to the conversation. The `sglang.gen` function is used to generate a response from the model, which is in turn appended to the conversation using the `assistant` function.

The function prompts the model with `question_1` and then in turn prompts it with `question_2`. The model is expected to answer `question_2` based on the history of the conversation.

In [13]:
from sglang import function, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=128))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=128))

### Connect to the server

Connect to the server using `sglang.set_default_backend` by specifying its URL.

In [14]:
set_default_backend(RuntimeEndpoint("http://localhost:9000"))

### Send multi-turn questions to Gemma 2

Now, you can run the previously defined `multi_turn_question` function to generate responses from the model.

In [15]:
state = multi_turn_question.run(
    question_1="Who are the first humans to land on the moon?",
    question_2="Which country did they belong to?",
)

for m in state.messages():
  print(m["role"], ":", m["content"])

user : Who are the first humans to land on the moon?
assistant : The first humans to ever land on the moon were a team from the **Apollo 11 mission**:

* **Neil Armstrong**:  He became the first person to walk on the moon. His famous quote, "One small step for man, one giant leap for mankind," encapsulates the magnitude of thishistoric event.
* **Buzz Aldrin**: Aldrin was the second human to walk on the moon and stayed with Armstrong for several hours on the lunar surface. 

They landed on the moon on **July 20, 1969**, bringing back a wealth of lunar samples and photos that remain highly significant
user : Which country did they belong to?
assistant : The first people to land on the moon were part of **the United States**, often simply referred to as Americans.  They were a team from NASA, the National Aeronautics and Space Administration, the US government's space program. 



Notice how the history of the conversation is preserved, and the model answered the second question as a continuation of the conversation.

### Run a batch of multi-turn questions

You can also send a batch of multi-turn questions to the model by passing a list of dictionaries to `run_batch`. The keys of each dictionary name the arguments of the `multi_turn_question` function.

In [16]:
states = multi_turn_question.run_batch(
    [
        {
            "question_1": "Who are the first humans to land on moon?",
            "question_2": "Which country did they belong to ?",
        },
        {
            "question_1": "Who is the first human to reach space?",
            "question_2": "Which country did they belong to?",
        },
    ]
)

for state in states:
  print("\n===============================================================\n")
  for message in state.messages():
    print(message["role"], ":", message["content"])




user : Who are the first humans to land on moon?
assistant : The first humans to land on the Moon were **Neil Armstrong** and **Buzz Aldrin** of the Apollo 11 mission on **July 20, 1969.** 

user : Which country did they belong to ?
assistant : Neil Armstrong and Buzz Aldrin were from the **United States**. 



user : Who is the first human to reach space?
assistant : The first human to reach space was **Yuri Gagarin**. 

On April 12, 1961, he completed one orbit of Earth in the Soviet Vostok 1 spacecraft. This event marked a significant moment in the history of human exploration, paving the way for further advancements and spaceflight feats. 

user : Which country did they belong to?
assistant : Yuri Gagarin was from **Soviet Union** at the time. 
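
When many conversations share the same follow-up question, the argument dictionaries can be generated rather than written by hand. The sketch below builds the list in the shape that `run_batch` was given above. `build_batch` is a hypothetical helper introduced for illustration.

```python
def build_batch(first_questions, follow_up):
    """Pair each opening question with the same follow-up question."""
    return [
        {"question_1": question, "question_2": follow_up}
        for question in first_questions
    ]


batch = build_batch(
    [
        "Who are the first humans to land on the moon?",
        "Who is the first human to reach space?",
    ],
    follow_up="Which country did they belong to?",
)
print(batch)
```

The result can then be passed as `multi_turn_question.run_batch(batch)`.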



### JSON Decoding

You can use a regular expression (regex) to constrain the model's generated answer to a fixed JSON layout.

Define a function that uses Gemma 2 to generate specific information about an animal in JSON format. Pass the regex to the `regex` argument of the `sglang.gen` function. Note that the pattern below requires a comma after the last field, so the generated text is not strictly valid JSON.

In [17]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "type": "(Mammals|Birds|Fish|Reptiles|Amphibians|Invertebrates)",\n"""
    + r"""    "reproduction": "(Sexual|Asexual)",\n"""
    + r"""    "life expectancy": "[0-9]{1,2}",\n"""
    + r"""\}"""
)

@function
def animal_gen(s, name):
    s += name + " is an animal. Please fill in the following information about this animal.\n"
    s += gen("json_output", max_tokens=256, regex=character_regex)

Run the function with the name of any animal as input to get its features in JSON format.

In [18]:
state = animal_gen.run(name="Fish")
print(state.text())

Fish is an animal. Please fill in the following information about this animal.
{
    "name": "Fish",
    "type": "Mammals",
    "reproduction": "Sexual",
    "life expectancy": "10",
}
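
The output above includes the prompt and has a trailing comma before the closing brace, so `json.loads` rejects it as is. One way to turn it into a Python dictionary is sketched below. The `parse_constrained_output` helper is hypothetical, and the sample text is copied from the output above.

```python
import json
import re


def parse_constrained_output(full_text, prompt):
    """Strip the prompt and the trailing comma, then parse the JSON object."""
    body = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
    # Remove the comma that precedes the final closing brace.
    body = re.sub(r",\s*\}\s*$", "\n}", body)
    return json.loads(body)


prompt = "Fish is an animal. Please fill in the following information about this animal.\n"
generated = (
    prompt
    + '{\n'
    + '    "name": "Fish",\n'
    + '    "type": "Mammals",\n'
    + '    "reproduction": "Sexual",\n'
    + '    "life expectancy": "10",\n'
    + '}\n'
)
animal = parse_constrained_output(generated, prompt)
print(animal["name"], animal["life expectancy"])
```

Dropping the final `,` from the regex would avoid the cleanup step, at the cost of changing the generated text shown above.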


Terminate the server process.

In [None]:
terminate_process(server_process)

These are just a few examples of how a prompting workflow with Gemma 2 can be designed using SGLang's frontend language. To learn more about its capabilities, you can refer to SGLang's [Frontend: Structured Generation Language (SGLang)](https://sgl-project.github.io/frontend/frontend.html) guide.

Congratulations! You've explored how to serve Gemma 2 with SGLang's HTTP server and how to run it with the SGLang backend engine and frontend language in a Colab environment. You can now experiment with more complex prompting workflows in SGLang to interact with Gemma 2.