In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Fine-Tuning CodeGemma on the SQL Spider Dataset
**Author**: Carlo Fisicaro  
**GitHub**: [github.com/carlofisicaro](https://github.com/carlofisicaro)  
**X**: [@carlo_fisicaro](https://twitter.com/carlo_fisicaro)

# Gemma Basics (Hugging Face)
This notebook demonstrates how to load, fine-tune, and deploy a Gemma model using Hugging Face.
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Gemma_Basics_with_HF.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

### Gemma setup

**Before we dive into the tutorial, let's get you set up with Gemma:**

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).
2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/google/gemma-2b) and accept the usage conditions.
3. **Colab with Gemma Power:**  For this tutorial, you'll need a Colab runtime with enough resources to handle the CodeGemma 7B model used below. Choose an appropriate runtime when starting your Colab session.
4. **Hugging Face Token:**  Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section where we'll set up environment variables in your Colab environment.**


### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to securely store it.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Copy/paste your token key into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.


In [None]:
import os
from google.colab import userdata
# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
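Outside Colab, `google.colab` cannot be imported, so the cell above fails as written. A minimal sketch of a fallback, assuming the token has been exported as an `HF_TOKEN` environment variable beforehand; `resolve_hf_token` is a hypothetical helper name, not part of any library:

```python
import os


def resolve_hf_token(name="HF_TOKEN"):
    """Return the token from Colab Secrets if available, else from the environment."""
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(name)
            if token:
                return token
        except Exception:
            # Secret missing or notebook access not granted; fall back to the environment.
            pass

    token = os.environ.get(name)
    if not token:
        raise RuntimeError(
            f"{name} is not set; add it as a Colab secret or an environment variable."
        )
    return token
```

The result can be assigned to `os.environ["HF_TOKEN"]` exactly as in the cell above.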

### Install dependencies
Run the cell below to install all the required dependencies.

In [None]:
!pip install --upgrade -q transformers huggingface_hub peft \
  accelerate bitsandbytes datasets trl

### Log into Hugging Face Hub


In [5]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


All set and ready to explore the possibilities with Gemma!

## Instantiate the CodeGemma 7B model

CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following.
Here we're importing the 7B instruction-tuned variant for natural language-to-code chat and instruction following.


Let's get started by loading the model from Hugging Face Hub.

### Loading the model from HF Hub

In [22]:
model_id = "google/codegemma-7b-it"
device = "cuda"

In [23]:
# Let's load the tokenizer first
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [24]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Let's quantize the model to 4-bit NF4 to reduce its memory footprint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Let's load the final model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}
)

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.30s/it]
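To see why 4-bit quantization matters on a T4 GPU, which has 16 GB of memory, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a nominal 7 billion parameters (taken from the "7B" in the model name, not measured from the checkpoint) and ignores activations, the KV cache, quantization constants, and any layers that stay unquantized:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 7 billion parameters, taken from the "7B" in the model name.
NOMINAL_PARAMS = 7e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take roughly 28 GB in float32, 14 GB in bfloat16, and 3.5 GB in 4-bit, which leaves room on the GPU for activations and the LoRA adapter.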


Let's define a preamble so that our model understands we want to get SQL queries out of it.

In [39]:
PREAMBLE = "Generate an SQL query from the following sentence. "

### Trying it out

In [40]:
prompt = "What is the average, minimum, and maximum age of all singers from France?"
prompt = PREAMBLE + prompt
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Generate an SQL query that from the following sentence. What is the average, minimum, and maximum age of all singers from France?

```sql
SELECT AVG(age), MIN(age), MAX(age)
FROM singers
WHERE nationality = 'French';
```


In [41]:
prompt = "What are the different countries with singers above age 20?"
prompt = PREAMBLE + prompt
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Generate an SQL query that from the following sentence. What are the different countries with singers above age 20?

```sql
SELECT DISTINCT country
FROM singers
WHERE age > 20;
```
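The decoded text echoes the prompt and wraps the query in a fenced block, so the SQL has to be separated out before it can be used. A minimal sketch of one way to do that with a regular expression; `extract_sql` is a hypothetical helper, and it assumes the model keeps answering in the fenced format shown above:

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
_FENCE = "`" * 3
# Matches the first fenced block, with or without a "sql" language tag.
_SQL_BLOCK = re.compile(
    _FENCE + r"(?:sql)?\s*\n(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE
)


def extract_sql(generated_text):
    """Return the contents of the first fenced block, or None if there is none."""
    match = _SQL_BLOCK.search(generated_text)
    return match.group(1).strip() if match else None


sample = (
    "Generate an SQL query from the following sentence. "
    "What are the different countries with singers above age 20?\n\n"
    + _FENCE + "sql\nSELECT DISTINCT country\nFROM singers\nWHERE age > 20;\n" + _FENCE
)
print(extract_sql(sample))
```

Returning `None` when no fenced block is found lets the caller decide whether to retry or fall back to the raw text.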


## Fine-tuning the model with LoRA

This section focuses on fine-tuning the model to generate SQL code from natural language, with the aim of producing high-quality SQL queries.

In [43]:
# Let's try it out before the fine-tuning
text = "What is the maximum capacity and the average of all stadiums?"
text = PREAMBLE + text
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Generate an SQL query that from the following sentence. What is the maximum capacity and the average of all stadiums?\n\n```sql\nSELECT stadium_name, capacity FROM stadium;\n```\n\n```sql\nSELECT MAX(capacity) AS max_capacity, AVG(capacity) AS average_capacity FROM stadium;\n```'

In [28]:
# Loading and processing the spider dataset
from datasets import load_dataset

data = load_dataset("xlangai/spider")
print("Example item:", data["train"][0])

Example item: {'db_id': 'department_management', 'query': 'SELECT count(*) FROM head WHERE age  >  56', 'question': 'How many heads of the departments are older than 56 ?', 'query_toks': ['SELECT', 'count', '(', '*', ')', 'FROM', 'head', 'WHERE', 'age', '>', '56'], 'query_toks_no_value': ['select', 'count', '(', '*', ')', 'from', 'head', 'where', 'age', '>', 'value'], 'question_toks': ['How', 'many', 'heads', 'of', 'the', 'departments', 'are', 'older', 'than', '56', '?']}


We need to define a function to tokenize the input. Let's tokenize the 'question' and 'query' columns for training.

In [44]:
def tokenize_function(samples):
    max_length = 256  # Set a reasonable max_length based on your data
    
    # Add the preamble to the question
    questions_with_preamble = [PREAMBLE + question for question in samples["question"]]
    
    inputs = tokenizer(questions_with_preamble, truncation=True, padding="max_length", max_length=max_length, return_tensors="pt")
    outputs = tokenizer(samples["query"], truncation=True, padding="max_length", max_length=max_length, return_tensors="pt")
    
    # Return input and output ids for training
    return {"input_ids": inputs["input_ids"], "labels": outputs["input_ids"]}
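For a decoder-only model, a commonly used alternative to tokenizing the question and the query separately is to concatenate them into one sequence and set the labels to a copy of the input IDs, with the prompt positions replaced by `-100` so that the loss ignores them. The sketch below illustrates the idea with a toy whitespace tokenizer so that it runs without downloading a model. `toy_encode` and `build_example` are hypothetical names, and this is not the method used in this notebook:

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

PREAMBLE = "Generate an SQL query from the following sentence. "


def toy_encode(text, vocab):
    """Toy stand-in for a real tokenizer: one ID per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(question, query, vocab):
    """Concatenate prompt and target; mask the prompt positions in the labels."""
    prompt_ids = toy_encode(PREAMBLE + question, vocab)
    target_ids = toy_encode(query, vocab)
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}


vocab = {}
example = build_example(
    "How many heads of the departments are older than 56 ?",
    "SELECT count(*) FROM head WHERE age  >  56",
    vocab,
)
print(example["input_ids"])
print(example["labels"])
```

With a real tokenizer the same structure applies: `input_ids` and `labels` have equal length, and only the SQL tokens contribute to the loss.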

In [45]:
# Let's tokenize the questions and queries
data = data.map(tokenize_function, batched=True)

Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7000/7000 [00:01<00:00, 5083.01 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1034/1034 [00:00<00:00, 4908.24 examples/s]


In [46]:
from peft import LoraConfig

# Define tuning parameters
lora_config = LoraConfig(
    r=8,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "o_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
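The rank `r=8` controls how many trainable parameters LoRA adds. For a linear layer with a `d_out x d_in` weight matrix, LoRA trains two small matrices of shapes `r x d_in` and `d_out x r`, adding `r * (d_in + d_out)` parameters. A small worked example, using illustrative layer shapes that are an assumption and are not read from the CodeGemma checkpoint:

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape r x d_in, B has shape d_out x r.
    return r * (d_in + d_out)


# Assumption: illustrative layer shapes, not read from the CodeGemma checkpoint.
d_in, d_out, r = 4096, 4096, 8

full = d_in * d_out
added = lora_params(d_in, d_out, r)
print(f"full layer: {full:,} weights")
print(f"LoRA adds:  {added:,} trainable weights ({added / full:.2%} of the layer)")
```

For these shapes the adapter adds 65,536 trainable weights, about 0.39% of the 16,777,216 weights in the frozen layer.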

In [47]:
# Define formatting function to format the Spider dataset correctly
def formatting_func(example):
    # Prepend the preamble so training prompts match the prompts used at inference
    question_with_preamble = PREAMBLE + example['question']
    input_text = question_with_preamble  # Question in the dataset, with preamble
    output_text = example['query']       # Corresponding SQL query
    return {"input_text": input_text, "output_text": output_text}

In [48]:
import transformers
from trl import SFTTrainer

# Create a Trainer object that takes care of the training process
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs2",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

max_steps is given, it will override any value given in num_train_epochs


In [49]:
# Let's run the fine-tuning
trainer.train()

Step,Training Loss
1,83.1099
2,77.8257
3,71.4669
4,79.6118
5,62.6602
6,60.5979
7,70.6099
8,64.508
9,68.3094
10,71.4728


TrainOutput(global_step=10, training_loss=71.01725540161132, metrics={'train_runtime': 13.2464, 'train_samples_per_second': 3.02, 'train_steps_per_second': 0.755, 'total_flos': 477772854067200.0, 'train_loss': 71.01725540161132, 'epoch': 0.005714285714285714})
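The `epoch` value in the `TrainOutput` follows directly from the training arguments. The arithmetic below reproduces it, assuming a single GPU so that the effective batch size is not multiplied by a device count:

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
train_examples = 7000      # size of the Spider train split reported by `map` above
train_runtime_s = 13.2464  # from the TrainOutput above

# Assumption: a single GPU, so there is no multiplication by a device count.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / train_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen:        {examples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.6f}")
print(f"samples per second:   {examples_seen / train_runtime_s:.2f}")
```

With `max_steps=10` the trainer sees only 40 of the 7,000 training examples, so this run is best read as a smoke test of the pipeline. A real fine-tune would likely need far more steps.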

In [52]:
# Testing the model after fine-tuning
text = "What are the different countries with singers above age 20?"
text = PREAMBLE + text
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Generate an SQL query that from the following sentence. What are the different countries with singers above age 20?

```sql
SELECT DISTINCT country
FROM singers
WHERE age > 20;
```
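Because the prompt contains no database schema, the model has to guess table and column names such as `singers` and `country`. One quick sanity check is to run the generated query against a small in-memory SQLite database. The sketch below uses a made-up `singers` table with invented rows. The real Spider schemas differ, so this only checks that the query is valid SQL and behaves as expected on toy data:

```python
import sqlite3

# Assumption: a made-up `singers` table; the real Spider schemas differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singers VALUES (?, ?, ?)",
    [("A", "France", 25), ("B", "France", 19), ("C", "Italy", 31), ("D", "Spain", 18)],
)

generated_sql = "SELECT DISTINCT country FROM singers WHERE age > 20;"

try:
    rows = conn.execute(generated_sql).fetchall()
except sqlite3.Error as err:
    # Invalid SQL or unknown table/column names end up here.
    rows = None
    print("Query failed:", err)

print(rows)
conn.close()
```

A query that references a table or column missing from the schema raises `sqlite3.Error`, which makes wrong guesses easy to spot.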


## Push the model to your Hugging Face Hub


Hugging Face allows you to easily store trained models on its Hub.

In [None]:
# Note: The token needs to have "write" permission
#       You can check it here:
#       https://huggingface.co/settings/tokens
model.push_to_hub("my-codegemma-7-finetuned-model")

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/f33ac/my-gemma-2-finetuned-model/commit/c837075477e241519df9aaf42e6a032b1d2e6df7', commit_message='Upload GemmaForCausalLM', commit_description='', oid='c837075477e241519df9aaf42e6a032b1d2e6df7', pr_url=None, pr_revision=None, pr_num=None)

## Serve your model using Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit for deploying and serving large language models (LLMs) such as Gemma. It speeds up text generation through techniques like tensor parallelism, which distributes the workload across multiple GPUs, and through code optimized specifically for text generation. TGI also offers features suited to production environments, such as distributed tracing for monitoring model performance, Prometheus metrics for detailed data collection, and watermarking of model outputs. You can read more about TGI in [the official documentation](https://huggingface.co/docs/text-generation-inference/en/index).

To deploy your model with TGI you can either:

1. **Deploy it locally (requires Docker):** Uncomment the code cell below to run the model on your local machine. This approach requires Docker to be installed and a GPU to be attached.

2. **Deploy it on Google Cloud Platform using GKE:** Follow the guide [Serve Gemma open models using GPUs on GKE with Hugging Face TGI](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi) to deploy your model on Google Cloud's GKE service. This option leverages GPUs for high-performance inference.

Both deployment methods will provide you with an HTTP endpoint for sending requests and receiving text generation responses from your model.
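Once the server is running, a request to the endpoint could look like the sketch below. It assumes TGI is reachable at `localhost:8080`, matching the `-p 8080:80` port mapping of the local Docker command, and uses TGI's `/generate` route. `build_generate_request` and `generate` are hypothetical helper names:

```python
import json
import urllib.request

# Assumption: TGI is reachable on localhost:8080 (the `-p 8080:80` mapping).
TGI_URL = "http://localhost:8080/generate"


def build_generate_request(prompt, max_new_tokens=100, url=TGI_URL):
    """Build a POST request for TGI's /generate route without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text; requires a running server."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]


request = build_generate_request(
    "Generate an SQL query from the following sentence. "
    "What are the different countries with singers above age 20?"
)
print(request.full_url, request.get_method())
```

Calling `generate(...)` with the same prompt sends the request. It is left uncalled here because it needs a live TGI server.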

In [None]:
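model = "google/codegemma-7b-it"  # ID of the model on the Hugging Face Hub
# (you can use your own fine-tuned model from
# the previous step)
volume = f"{os.getcwd()}/data"    # Shared directory with the Docker container
# to avoid downloading weights every run
# Note: these are Python variables because each `!` line runs in its own shell;
# IPython substitutes $model and $volume into the `!docker` command below.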

# !docker run --gpus all --shm-size 1g -p 8080:80 \
#     -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0.3 \
#     --model-id $model