# Tune and deploy a foundation model


**Learning Objectives**

1. Learn how to generate a JSONL file for Gemini tuning
1. Learn how to launch a tuning job
1. Learn how to deploy and query a tuned LLM
1. Learn how to evaluate a tuned LLM


Creating an LLM requires massive amounts of data, significant computing resources, and specialized skills. In this notebook, you'll learn how tuning allows you to customize a Gemini foundation model on Vertex AI Studio for more specific tasks or knowledge domains.
While prompt design is excellent for quick experimentation, you can often achieve higher quality by tuning the model when training data is available. Tuning lets you customize the model's responses based on examples of the task you want the model to perform.

For more details on tuning have a look at the [official documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models).

## Setup

In [None]:
import os
import warnings

warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

In [None]:
import json
import time

import evaluate
import pandas as pd
from google import genai
from google.cloud import bigquery
from google.genai import types
from IPython.display import Markdown
from sklearn.model_selection import train_test_split

In [None]:
REGION = "us-central1"
PROJECT_ID = !(gcloud config get-value project)
PROJECT_ID = PROJECT_ID[0]
BUCKET_NAME = PROJECT_ID
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
!gsutil ls $BUCKET_URI || gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

## Training Data


In this notebook, we will tune Gemini using the Gen AI SDK on a question-and-answer dataset from StackOverflow.
Our first step is to query the StackOverflow data in BigQuery Public Datasets, keeping only questions whose tags contain `python` and whose accepted answer was created on or after 2020-01-01.

We will limit the dataset to 1000 samples, 800 of which will be used to tune the LLM and the rest for evaluating the tuned model.
The second step will be to convert the dataset into a JSONL format, with one example per line, so that the tuning job can consume it.


Next let us run the query to assemble our dataset into the DataFrame `df`:

In [None]:
%%bigquery df

SELECT CONCAT(q.title, q.body) as input_text, a.body AS output_text
FROM
    `bigquery-public-data.stackoverflow.posts_questions` q
JOIN
    `bigquery-public-data.stackoverflow.posts_answers` a
ON
    q.accepted_answer_id = a.id
WHERE
    q.accepted_answer_id IS NOT NULL AND
    REGEXP_CONTAINS(q.tags, "python") AND
    a.creation_date >= "2020-01-01"
LIMIT
    1000

In [None]:
df.head()

The `input_text` column contains the questions asked by StackOverflow users (title followed by body), while the `output_text` column contains the accepted answers. From this dataset of 1000 question-answer pairs, we now need to generate a JSONL file with one example per line, in the following format:

```python
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": input_text,
          ...
        }
      ],
    },
    {
      "role": "model",
      "parts": [
        {
          "text": output_text,
          ...
        }
      ],
    },
  ]
}
```

This is the example format required to tune a Gemini 2.0 Flash model.

Please refer to [the document](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning-prepare#dataset_example_for_gemini-15-pro_and_gemini-15-flash) to check other fields you can use.

To tune Gemini, we advise starting with 100 to 500 examples. In general, the more high-quality examples you provide in your dataset, the better the results. There is no limit on the number of examples in a training dataset. In this case you will use 800.
If possible, also provide a validation dataset. A validation dataset helps you measure the effectiveness of a tuning job. Validation datasets support up to 256 examples.

Let's first split the data into training and evaluation. 

In [None]:
# split is set to 80/20
train, evaluation = train_test_split(df, test_size=0.2)
print("train size:", len(train))
print("eval size:", len(evaluation))
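
Note that `train_test_split` shuffles the rows randomly, so the split changes on every run unless you pass its `random_state` argument. To illustrate what a reproducible 80/20 split does, here is a minimal standard-library sketch. The helper name `split_rows` and the seed value are hypothetical and not part of this notebook, and the rounding of the evaluation size may differ slightly from scikit-learn's.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` with a fixed seed and split it in two.

    Returns (train_rows, eval_rows). Hypothetical helper for illustration only.
    """
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * test_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


train_rows, eval_rows = split_rows(range(1000))
print("train size:", len(train_rows))
print("eval size:", len(eval_rows))
```

Calling the helper twice with the same seed returns the same split, which makes later evaluation scores comparable across runs.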

In [None]:
train.head()

For tuning, the training and evaluation data first need to be converted into the JSONL format. For this, we provide the following two helper functions.
The first one converts a single `input_text` and `output_text` record into the JSONL format required by Gemini.

In [None]:
def format_for_gemini(input_text, output_text):
    return json.dumps(
        {
            "contents": [
                {
                    "role": "user",
                    "parts": [
                        {
                            "text": input_text,
                        }
                    ],
                },
                {
                    "role": "model",
                    "parts": [
                        {
                            "text": output_text,
                        }
                    ],
                },
            ]
        }
    )

The second helper function exports the data into a file:

In [None]:
def export_tuning_data(file_name, df):
    # Append one JSON record per line; remove any existing file first to avoid duplicates.
    with open(file_name, "a") as file:
        for _, row in df.iterrows():
            jsonline = format_for_gemini(row["input_text"], row["output_text"])
            file.write(jsonline + "\n")

Let us now create our training and evaluation files:

In [None]:
training_data_filename = "tune_data_stack_overflow_python_qa.jsonl"
evaluation_data_filename = "evaluation_data_stack_overflow_python_qa.jsonl"

!test -f $training_data_filename    && rm $training_data_filename
!test -f $evaluation_data_filename  && rm $evaluation_data_filename

export_tuning_data(training_data_filename, train)
export_tuning_data(evaluation_data_filename, evaluation)
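
Before uploading, it can be worth checking that every line of the exported files has the shape described above. The following is a minimal sketch, not an official validator: the helper name `find_invalid_lines` is hypothetical, and it assumes single-turn examples with exactly one `user` turn followed by one `model` turn, as produced in this notebook.

```python
import json

# Assumption: single-turn examples, as generated by format_for_gemini.
EXPECTED_ROLES = ["user", "model"]


def find_invalid_lines(path):
    """Return a list of (line_number, reason) for lines with an unexpected shape."""
    problems = []
    with open(path, encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((line_number, f"invalid JSON: {error.msg}"))
                continue
            contents = record.get("contents") if isinstance(record, dict) else None
            if not isinstance(contents, list) or not all(
                isinstance(turn, dict) for turn in contents
            ):
                problems.append((line_number, "missing or malformed 'contents'"))
                continue
            roles = [turn.get("role") for turn in contents]
            if roles != EXPECTED_ROLES:
                problems.append((line_number, f"unexpected roles: {roles}"))
                continue
            for turn in contents:
                parts = turn.get("parts")
                # Every turn needs a non-empty list of parts with non-empty text.
                if (
                    not isinstance(parts, list)
                    or not parts
                    or not all(isinstance(p, dict) and p.get("text") for p in parts)
                ):
                    problems.append((line_number, f"empty text in '{turn['role']}' turn"))
                    break
    return problems
```

For the files created above, `find_invalid_lines(training_data_filename)` would be expected to return an empty list. A non-empty result points at rows worth inspecting, such as questions with an empty body.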

You can then export the local files to GCS, so that they can be used by Vertex AI for the tuning job.

In [None]:
!gsutil cp $training_data_filename   $BUCKET_URI
!gsutil cp $evaluation_data_filename $BUCKET_URI

You can check that the files were successfully transferred to your Google Cloud Storage bucket:

In [None]:
TRAINING_DATA_URI = f"{BUCKET_URI}/{training_data_filename}"
EVALUATION_DATA_URI = f"{BUCKET_URI}/{evaluation_data_filename}"

!gsutil ls -al $TRAINING_DATA_URI
!gsutil ls -al $EVALUATION_DATA_URI

## Model Tuning
Now it's time to tune the model. You will use the [Google Gen AI SDK](https://googleapis.github.io/python-genai/#tune) to submit the tuning job.
This should take roughly 30 minutes.

In [None]:
client = genai.Client(vertexai=True, project=PROJECT_ID, location=REGION)

base_model = "gemini-2.0-flash-001"

training_dataset = types.TuningDataset(
    gcs_uri=TRAINING_DATA_URI,
)
evaluation_dataset = types.TuningValidationDataset(gcs_uri=EVALUATION_DATA_URI)

sft_tuning_job = client.tunings.tune(
    base_model=base_model,
    training_dataset=training_dataset,
    config=types.CreateTuningJobConfig(
        epoch_count=1,
        validation_dataset=evaluation_dataset,
        tuned_model_display_name="stackoverflow_tuned_gemini_flash",
    ),
)

running_states = {"JOB_STATE_PENDING", "JOB_STATE_RUNNING"}

# Polling for job completion
while sft_tuning_job.state in running_states:
    print(sft_tuning_job.state)
    sft_tuning_job = client.tunings.get(name=sft_tuning_job.name)
    time.sleep(10)

print(sft_tuning_job.tuned_model_display_name)
print(sft_tuning_job.name)
print(sft_tuning_job.experiment)
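
The polling loop above runs until the job leaves the running states, with no upper bound on how long it waits and no check of the final state. As a hedged sketch of a more defensive pattern, the helper below adds a timeout and returns the final state so the caller can verify it. The name `wait_until_done` and its parameters are hypothetical. `get_state` is any zero-argument callable you supply.

```python
import time


def wait_until_done(get_state, running_states, poll_seconds=60, timeout_seconds=3 * 60 * 60):
    """Call `get_state()` until it returns a state outside `running_states`.

    `get_state` is a hypothetical zero-argument callable supplied by the caller.
    Returns the final state, or raises TimeoutError if the job is still
    running once `timeout_seconds` have elapsed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state not in running_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {state} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

In this notebook, `get_state` could be something like `lambda: client.tunings.get(name=sft_tuning_job.name).state`, and the returned state could then be compared against `"JOB_STATE_SUCCEEDED"` before querying the tuned model.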

## Retrieve the tuned model from your Vertex AI Model Registry


When your tuning job is finished, your model will be available on Vertex. The next cell shows you how to list tuned models.

In [None]:
for model in client.models.list(config={"page_size": 10, "query_base": False}):
    print(model)

It's time to get predictions. You can now send a prompt to the tuned model via the Gen AI SDK.<br>
Feel free to update the following prompt:

In [None]:
PROMPT = """
How can I store my TensorFlow checkpoint on Google Cloud Storage?

Python example:

"""

response = client.models.generate_content(
    model=sft_tuning_job.tuned_model.endpoint,
    contents=PROMPT,
)

Markdown(response.text)

## Manual Evaluation

It's essential to evaluate your model to understand its performance. Evaluation can be automated using metrics like F1, BLEU, or ROUGE. You can also use human evaluation, which involves asking people to rate the quality of the LLM's answers, either through crowdsourcing or by having experts review the responses. Some standard human evaluation criteria include fluency, coherence, relevance, and informativeness. Often you will want a mix of evaluation metrics to get a good understanding of your model's performance.


Below we compute two automated metrics that serve as crude proxies for whether two texts have the same meaning:
- The [BLEU](https://en.wikipedia.org/wiki/BLEU) evaluation metric is a sort of **precision** metric, measuring the proportion of $n$-grams in the generated sentence matching $n$-grams in the reference sentence. It goes from 0 to 1 with a higher score for more similar sentences. BLEU1 considers uni-grams only, while BLEU2 considers bi-grams. 

- The [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) evaluation metric is a sort of **recall** metric, measuring the proportion of $n$-grams in the reference sentence that are matched by $n$-grams in the generated sentence. It goes from 0 to 1 with a higher score for more similar sentences. ROUGE1 considers uni-grams only, while ROUGE2 considers bi-grams.
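
To make the precision/recall distinction concrete, here is a minimal sketch of clipped $n$-gram overlap in plain Python. It is a simplification and not the implementation used by `evaluate`: real BLEU additionally applies a brevity penalty and combines several $n$-gram orders, and ROUGE implementations typically report an F-measure as well. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram matches between two strings.

    Precision divides the matches by the number of candidate n-grams
    (BLEU-like); recall divides them by the number of reference n-grams
    (ROUGE-like). Tokenization is a naive lowercase whitespace split.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count per n-gram (clipping).
    matches = sum((cand & ref).values())
    precision = matches / sum(cand.values()) if cand else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    return precision, recall


candidate = "use tf.train.Checkpoint to save the model"
reference = "save the model with tf.train.Checkpoint"
print(ngram_overlap(candidate, reference, n=1))
print(ngram_overlap(candidate, reference, n=2))
```

Here the candidate shares four of its six uni-grams with the five-word reference, so uni-gram precision is lower than recall. The bi-gram scores drop further because word order now matters.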


We will use the [evaluate](https://github.com/huggingface/evaluate/tree/main) library to compute the scores.
Earlier in the notebook, you created training and evaluation datasets. Now it's time to use some of the evaluation data: the questions are sent to the tuned model to get responses, and the original answers serve as references:
- **Candidates**: Answers generated by the tuned model.
- **References**: Original answers that the candidates are compared against.


In [None]:
# you can change the number of rows you want to use
EVAL_ROWS = 60
INPUT_LIMIT = 10000  # characters
evaluation = evaluation[evaluation.input_text.apply(len) <= INPUT_LIMIT]
evaluation = evaluation.head(EVAL_ROWS)
evaluation.head()

The function in the cell below queries our tuned model with each `evaluation.input_text` and stores the model answers in a DataFrame next to the ground truth from `evaluation.output_text` (this will take roughly 5 minutes):

In [None]:
def create_eval_data(model_endpoint, evaluation):
    model_answers = []

    for prompt in evaluation.input_text:
        response = client.models.generate_content(
            model=model_endpoint, contents=prompt
        )
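        # response.text can be None (e.g., for a blocked response); store "" so it is filtered out below
        model_answers.append(response.text or "")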
    eval_df = pd.DataFrame(
        {"candidate": model_answers, "reference": evaluation.output_text}
    )
    mask = eval_df.candidate == ""
    return eval_df[~mask]

In [None]:
eval_df = create_eval_data(sft_tuning_job.tuned_model.endpoint, evaluation)

In [None]:
eval_df.head()

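The function in the next cell computes the ROUGE-1 (uni-gram) score and the BLEU score. Note that with the default settings of `evaluate`, BLEU combines 1- to 4-gram precisions rather than uni-grams only. Both scores are aggregated over all pairs of reference answers and answers generated by our tuned model, giving numbers that can serve as performance metrics for our model.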

In [None]:
def compute_scores(eval_data):
    predictions = eval_data.candidate.tolist()
    references = eval_data.reference.tolist()
    rouge = evaluate.load("rouge")
    bleu = evaluate.load("bleu")
    rouge_value = rouge.compute(predictions=predictions, references=references)[
        "rouge1"
    ]
    bleu_value = bleu.compute(predictions=predictions, references=references)[
        "bleu"
    ]
    return {"rouge": rouge_value, "bleu": bleu_value}

In [None]:
compute_scores(eval_df)

Given two versions of the model (possibly tuned with different amounts of data or numbers of training steps), you can now compare the scores to decide which one is better. However, remember that these automated metrics are a very crude proxy for human assessment.

## Automated Evaluation


Let us conclude by noting that a Vertex tuning job collects and reports model tuning and model evaluation metrics, which can then be visualized in Vertex AI Experiments by clicking on your tuned model name in the tuning section of Vertex AI Studio.
Here is a description of the metrics that are computed:


#### Model tuning metrics

The model tuning job automatically collects the following tuning metrics for `gemini-2.0-flash`.

* `/train_total_loss`: Loss for the tuning dataset at a training step.
* `/train_fraction_of_correct_next_step_preds`: The token accuracy at a training step. A single prediction consists of a sequence of tokens. This metric measures the accuracy of the predicted tokens when compared to the ground truth in the tuning dataset.
* `/train_num_predictions`: Number of predicted tokens at a training step.
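
As a rough illustration of what a token accuracy measures, the sketch below compares a predicted token sequence with the ground truth position by position. This is an assumption about the general idea, not the exact computation Vertex AI performs. The function name `token_accuracy` is hypothetical.

```python
def token_accuracy(predicted_tokens, target_tokens):
    """Fraction of positions where the predicted token equals the target token.

    Illustrative only: assumes one predicted token per ground-truth position.
    """
    if len(predicted_tokens) != len(target_tokens):
        raise ValueError("sequences must have the same length")
    if not target_tokens:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_tokens, target_tokens))
    return correct / len(target_tokens)


# Three of the four predicted tokens match the ground truth.
print(token_accuracy(["import", "os", "as", "np"], ["import", "numpy", "as", "np"]))
```

In this picture, the number of compared positions plays the role of the number of predictions, and the fraction of matches is the accuracy.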


#### Model validation metrics

You can configure a model tuning job to collect the following validation metrics for `gemini-2.0-flash` by passing an evaluation dataset, as we did in this lab.

* `/eval_total_loss`: Loss for the validation dataset at a validation step.
* `/eval_fraction_of_correct_next_step_preds`: The token accuracy at a validation step. A single prediction consists of a sequence of tokens. This metric measures the accuracy of the predicted tokens when compared to the ground truth in the validation dataset.
* `/eval_num_predictions`: Number of predicted tokens at a validation step.


The metrics visualizations are available after the model tuning job completes. 
If you don't specify a validation dataset when you create the tuning job, 
only the visualizations for the tuning metrics are available.



## Acknowledgement 

This notebook is adapted from a [tutorial](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/tuning/getting_started_tuning.ipynb)
written by Polong Lin.

Copyright 2023 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.