##### Copyright 2024 Google LLC.

In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with Constrained Generation with Gemma 2 Using llama.cpp and Guidance

[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open-source language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.
Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

Constrained generation is a method that modifies the token generation process of a generative model to limit its predictions for subsequent tokens to only those that adhere to the necessary output structure.
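To make the idea concrete, here is a minimal, library-free sketch of the masking step. The toy vocabulary, the scores, and the `constrained_pick` helper are illustrative assumptions; they are not part of `llama.cpp` or Guidance, which apply the same idea over the model's real vocabulary at every decoding step.

```python
import math


def constrained_pick(logits, allowed):
    """Return the highest-scoring token among those the structure allows."""
    # Mask every disallowed token by setting its score to -inf.
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token is present in the vocabulary")
    return best


# Toy next-token scores (hypothetical): the unconstrained favorite is "Sure".
toy_logits = {"Sure": 3.2, "{": 1.1, "[": 0.4, "Hello": 2.5}

unconstrained = max(toy_logits, key=toy_logits.get)
# If the output must be JSON, only "{" or "[" may start it.
constrained = constrained_pick(toy_logits, allowed={"{", "["})
print(unconstrained, constrained)
```

The model's preferences are left unchanged; the constraint only removes candidates that would break the required structure.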

[llama.cpp](https://github.com/ggerganov/llama.cpp) is a C++ implementation of Meta AI's LLaMA and other large language model architectures, designed for efficient performance on local machines or within environments like Google Colab. It enables you to run large language models without needing extensive computational resources. In llama.cpp, formal grammars are defined using the GBNF (GGML BNF) format to constrain model outputs. It can be used, for instance, to make the model produce legitimate JSON or to communicate exclusively in emojis.

[Guidance](https://github.com/guidance-ai/guidance/tree/main?tab=readme-ov-file#constrained-generation) is an effective programming paradigm for steering language models. Guidance reduces latency and costs compared to traditional prompting or fine-tuning while allowing you to control the output's structure and provide high-quality output for your use case.

In this notebook, you will learn how to perform constrained generation in Gemma 2 models using `llama.cpp` and `guidance` in a Google Colab environment. You'll install the necessary packages, set up the model, and run a sample prompt.

<table align="left">
<td>
 <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Constrained_generation_with_Gemma.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
</td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

### Gemma setup

**Before you dive into the tutorial, let's get you set up with Gemma:**

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).
2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.
3. **Colab with Gemma Power:**  For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.
4. **Hugging Face Token:**  Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**


### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to securely store it.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Copy/paste your token key into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.

In [2]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
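If you run the notebook outside Colab, one possible approach is to fall back to an environment variable when `google.colab` is not importable. The sketch below assumes you have already exported `HF_TOKEN` in your shell; `get_hf_token` is a hypothetical helper name, not part of any library.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from Colab Secrets if available, else from the environment."""
    try:
        # Only importable (and usable) inside a Colab runtime.
        from google.colab import userdata

        return userdata.get(env_var)
    except Exception:
        # Not running in Colab, or the secret is not accessible:
        # fall back to a regular environment variable (None if unset).
        return os.environ.get(env_var)


token = get_hf_token()
print("token found" if token else "token missing")
```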

### Install dependencies

You'll need to install a few Python packages and dependencies to interact with Hugging Face, along with `llama-cpp-python` and `guidance`. Find some of the releases of `llama-cpp-python` supporting CUDA 12.2 [here](https://abetlen.github.io/llama-cpp-python/whl/cu122/llama-cpp-python/).

Run the following cell to install or upgrade them:

In [3]:
# The huggingface_hub library allows us to download models and other files from Hugging Face.
!pip install --upgrade -q huggingface_hub

# Install the guidance package.
!pip install guidance

# The llama-cpp-python library allows us to leverage GPUs.
!pip install llama-cpp-python==0.2.90 \
  -q -U --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122


### Logging into Hugging Face Hub

Next, you’ll need to log into the Hugging Face Hub using your access token to download the Gemma model.

In [4]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


### Downloading the Gemma 2 Model
Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile.

In [5]:
from huggingface_hub import hf_hub_download

# Specify the repository and filename
repo_id = 'google/gemma-2-2b-GGUF'  # Repository containing the GGUF model
filename = '2b_pt_v2.gguf'  # The GGUF model file

# Download the model file to the current directory
hf_hub_download(repo_id=repo_id, filename=filename, local_dir='.')

2b_pt_v2.gguf:   0%|          | 0.00/10.5G [00:00<?, ?B/s]

'2b_pt_v2.gguf'
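Because the download is large, a quick sanity check before loading can be useful. GGUF files begin with the four ASCII bytes `GGUF`, so a minimal sketch might look like the following; `looks_like_gguf` is a hypothetical helper, and it only inspects the header rather than validating the whole file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # The first four bytes of a GGUF file.


def looks_like_gguf(path):
    """Return True if `path` exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


print(looks_like_gguf("2b_pt_v2.gguf"))
```

A `False` result typically points to an interrupted download or a wrong path.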

## Constrained generation in Gemma 2 model using `llama.cpp`

A more advanced way to perform constrained generation is to use a context-free grammar (CFG) to direct the LLM toward the desired structure.

A CFG can be thought of as a more powerful and expressive form of regex. CFGs can handle tasks such as balancing parentheses, as well as nested and recursive structures.
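As an illustration of that extra expressive power, the language of balanced parentheses is generated by the grammar `S ::= "" | "(" S ")" S`, yet no regular expression in the formal sense can recognize it, because it requires tracking arbitrary nesting depth. The sketch below is a hand-written recognizer for that language, shown only for intuition; it is not how `llama.cpp` evaluates grammars.

```python
def balanced(text):
    """Recognize the language of S ::= "" | "(" S ")" S using a depth counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opener.
                return False
        else:
            # Only parentheses belong to this toy language.
            return False
    # Every opener must have been closed.
    return depth == 0


for sample in ["", "()", "(()())", "(()", ")("]:
    print(repr(sample), balanced(sample))
```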

**llama.cpp** supports CFG through a format called GBNF (GGML Backus-Naur Form). In short, GBNF defines formal grammars that limit the outputs of models in `llama.cpp`. You can read more about GBNF and its syntax from [llama.cpp's README page](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).

You can use the `LlamaGrammar` class from `llama_cpp` to load a grammar from a string (`from_string`) or from a GBNF file (`from_file`).
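For the file-based route, a minimal sketch could write a grammar to disk and load it back. The file name `yes_no.gbnf` and the tiny yes/no grammar are assumptions chosen for illustration, and the import is guarded so the snippet still runs where `llama-cpp-python` is not installed.

```python
from pathlib import Path

# A deliberately tiny grammar (hypothetical): the model may only answer yes or no.
YES_NO_GBNF = 'root ::= "yes" | "no"\n'

grammar_path = Path("yes_no.gbnf")
grammar_path.write_text(YES_NO_GBNF, encoding="utf-8")

try:
    from llama_cpp.llama import LlamaGrammar

    # Load the grammar from the file instead of from an in-memory string.
    yes_no_grammar = LlamaGrammar.from_file(str(grammar_path), verbose=False)
except ImportError:
    # llama-cpp-python is not installed in this environment.
    yes_no_grammar = None

print("grammar loaded" if yes_no_grammar is not None else "llama_cpp unavailable")
```

Keeping grammars in separate `.gbnf` files can make them easier to reuse across notebooks and to review on their own.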

For this example, you will create a GBNF grammar to show football (soccer) player statistics as JSON. Here, you are defining the GBNF grammar directly as a string within your code.

In [6]:
from llama_cpp.llama import LlamaGrammar

# Define the GBNF grammar as a string.
FOOTBALL_GBNF = r"""
ws ::= ([ \t\n] ws)?

string ::=
  "\"" (
    [^"\\\x7F\x00-\x1F] |
     "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
  )* "\""


digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

one-or-two-digits ::= digit | digit digit

one-or-four-digits ::= digit | digit digit | digit digit digit | digit digit digit digit

zero-to-two-digits ::= "" | digit | digit digit

position ::= "\"Striker\"" | "\"Midfielder\"" | "\"Defender\"" | "\"Goalkeeper\""

world-cup ::= "\"Yes\"" | "\"No\""

stats ::= (
  "{\n" ws
    "\"goals\": " one-or-four-digits "," ws
    "\"assists\": " one-or-four-digits "," ws
    "\"height\": " one-or-two-digits "." zero-to-two-digits "," ws
    "\"world-cup\": " world-cup ws
  "}"
)

player ::= (
  "{\n" ws
    "\"name\": " string "," ws
    "\"country\": " string "," ws
    "\"position\": " position "," ws
    "\"stats\": " stats ws
  "}"
)

root ::= player
"""

# Read the GBNF grammar using `from_string` method of LlamaGrammar.
grammar = LlamaGrammar.from_string(FOOTBALL_GBNF, verbose=False)

Initialize the model with the `llama-cpp-python` library by loading the Gemma 2 GGUF file you downloaded from Hugging Face. The `Llama` constructor takes the following arguments:

- `model_path`: Path to the GGUF model file.
- `verbose`: Set to `False` to suppress verbose logging during model loading.
- `n_gpu_layers`: Number of layers to offload to the GPU. A value of `-1` offloads all layers.

To perform constrained generation, pass the `grammar` defined above as an argument of the `create_chat_completion` function.

In [7]:
from llama_cpp.llama import Llama

model_path = "2b_pt_v2.gguf"

llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,
    verbose=False,
)

# Generate response
output = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Using JSON, describe the following Football player: "
            + "Lionel Messi",
        },
    ],
    grammar=grammar     # Pass the grammar defined earlier
)

print(output["choices"][0]["message"]["content"])

{
"name": "Lionel Messi",
"country": "Argentina",
"position": "Midfielder",
"stats": {
"goals": 60,
"assists": 80,
"height": 1.73,
"world-cup": "Yes"
}
}
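A grammar guarantees the shape of the text, not that every consumer will accept it, so it can be worth parsing the result before using it. The sketch below checks the sample output shown above with the standard `json` module; `validate_player` is a hypothetical helper, and the set of checks is an assumption about what a downstream consumer might need.

```python
import json

POSITIONS = {"Striker", "Midfielder", "Defender", "Goalkeeper"}


def validate_player(text):
    """Return a list of problems found in the generated text (empty if none)."""
    try:
        player = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(player, dict):
        return ["top-level value is not an object"]

    problems = []
    if set(player) != {"name", "country", "position", "stats"}:
        problems.append("unexpected top-level keys")
    if player.get("position") not in POSITIONS:
        problems.append("position outside the allowed set")

    stats = player.get("stats")
    if not isinstance(stats, dict):
        problems.append("stats is not an object")
    else:
        if set(stats) != {"goals", "assists", "height", "world-cup"}:
            problems.append("unexpected stats keys")
        if stats.get("world-cup") not in {"Yes", "No"}:
            problems.append("world-cup must be Yes or No")
    return problems


sample = """{
"name": "Lionel Messi",
"country": "Argentina",
"position": "Midfielder",
"stats": {
"goals": 60,
"assists": 80,
"height": 1.73,
"world-cup": "Yes"
}
}"""
print(validate_player(sample))
```

One reason to keep such a check: the grammar's `zero-to-two-digits` rule permits an empty fractional part, so a value like `1.` satisfies the grammar but is rejected by a JSON parser.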


## Constrained generation in Gemma 2 model using `Guidance`

Guidance supports context-free grammars through a purely Pythonic interface. In this example, you will combine a CFG with interleaved generative constructs and regular expressions.

An interleaved generative structure specifies the output as alternating static strings and generative constructs. Each generative part is defined individually, which keeps the overall output structure intact.


In this example, you will use different operators provided by guidance to implement CFG, such as `select` and `zero_or_more`.

`select`: Constrains generation to a set of options.

`zero_or_more`: Content repeated zero or more times.

To use regex for constrained generation, use the `gen` operator and pass the pattern in its `regex` argument: `gen(regex='...')`.
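For intuition, these operators have rough analogues in ordinary regular expressions: `select` resembles alternation and `zero_or_more` resembles the `*` quantifier. The sketch below uses only Python's `re` module to show which strings each construct would accept. It is an analogy rather than Guidance code: Guidance applies the constraint while tokens are generated, whereas `re` checks a finished string.

```python
import re

# select([...]) behaves like regex alternation over a fixed set of options.
position_re = re.compile(r"Striker|Midfielder|Defender|Goalkeeper")

# zero_or_more(digit) behaves like the `*` quantifier applied to a digit.
digits_re = re.compile(r"[0-9]*")

for text in ["Striker", "Coach"]:
    print(text, bool(position_re.fullmatch(text)))

for text in ["", "2024", "12a"]:
    print(repr(text), bool(digits_re.fullmatch(text)))
```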

In [8]:
import guidance
import numpy as np
from guidance import models, gen, block, optional, select, zero_or_more
from guidance import commit_point

# Load the model
model_path = "2b_pt_v2.gguf"
gemma2 = models.LlamaCpp(model_path)


# Custom generation function that repeats `content` between `min_count` and
# `max_count` times. Similar to one_or_more, but with an upper bound.
@guidance(stateless=True)
def repeat_range(lm, content, min_count=1, max_count=2):
    for _ in range(min_count):
        lm += content
    if max_count == np.inf:
        lm += zero_or_more(content)
    else:
        for _ in range(max_count - min_count):
            lm += optional(content)
    return lm

# Function to generate numbers up to two digits.
@guidance(stateless=True)
def number(lm):
    n = repeat_range(select(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']))
    # Allow for negative or positive numbers
    return lm + select(['-' + n, n])

# Function to select player position.
@guidance(stateless=True)
def position(lm):
    return lm + select(["Striker", "Midfielder", "Defender", "Goalkeeper"])

# Function to select whether the player has won a World cup.
@guidance(stateless=True)
def world_cup(lm):
    return lm + select(["Yes", "No"])

# Regex function for string.
@guidance(stateless=True)
def string_exp(lm):
    return lm + gen(regex=r'([^\\]*|\\[\\bfnrt\/]|\\u[0-7a-z])*')


For this example, you will combine an interleaved generative structure with a CFG to show the stats of football (soccer) players as JSON.

Here, you will keep the structure and keys of the JSON static, allowing the language model to fill in the value parts. This maintains the overall structure of the output.
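The static-keys idea can be sketched without any model at all: fixed fragments are copied verbatim and only the value slots are delegated to a generator. In the sketch below, `fill_template`, the `TEMPLATE` layout, and the canned values standing in for model output are all illustrative assumptions, not Guidance APIs.

```python
import json


def fill_template(parts, generate):
    """Concatenate static fragments with generated values, in order."""
    out = []
    for kind, value in parts:
        # Static text is copied as-is; slots are filled by the generator.
        out.append(value if kind == "static" else generate(value))
    return "".join(out)


# Keys and punctuation are fixed; only the two slots vary.
TEMPLATE = [
    ("static", '{\n"name": "'),
    ("slot", "name"),
    ("static", '",\n"position": "'),
    ("slot", "position"),
    ("static", '"\n}'),
]

# Stand-in for a language model: returns a canned value for each slot.
canned = {"name": "Lionel Messi", "position": "Striker"}

result = fill_template(TEMPLATE, canned.__getitem__)
print(result)
print(json.loads(result))
```

Because the generator never produces keys, braces, or quotes, it cannot break the surrounding structure; it can only affect the values.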

In [9]:
# A `commit_point` stops the preceding generation once its text is matched.
# For example, commit_point(',') stops string_exp() at the first `,`.
@guidance(stateless=True)
def simple_json(lm):
    lm += ('{\n' +
     '"name": ' + string_exp() + commit_point(',') + '\n'
     '"country": ' + string_exp() + commit_point(',') + '\n'
     '"position": ' + position() + commit_point(',') + '\n'
     '"stats": {\n' +
     '         "goals":'+ number() + commit_point(',') + '\n'
     '         "assists": ' + number() + commit_point(',') + '\n'
     '         "height": ' + number() +'.' + number() + commit_point(',') + '\n'
     '         "world-cup": ' + world_cup() + commit_point(',') + '\n'
     + commit_point('}')
     + commit_point('}'))
    return lm

# Initialize the query.
lm = gemma2 + """Using JSON, describe these Football players:
Lionel Messi
"""

# Call the simple_json function and implement the JSON structure.
lm += simple_json()

Congratulations! You've successfully implemented constrained generation with the Gemma 2 model using `llama.cpp` and `Guidance` in a Colab environment. You can now experiment with the model, update the grammar, and explore its capabilities.