In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Supervised Fine Tuning with Gemini for Question & Answering

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/prompts/examples/ideation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fprompts%2Fexamples%2Fideation.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/prompts/examples/ideation.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/prompts/examples/ideation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

| | |
|-|-|
|Author(s) | erwinh@|

## Overview
This notebook demonstrates fine-tuning the Gemini generative model using the Vertex AI Supervised Tuning feature. Supervised Tuning allows you to use your training data to refine the base model's capabilities toward specific tasks.

Supervised Tuning uses labeled examples to tune a model. Each example demonstrates the output you want from your text model during inference.

- Data Preparation: Ensure your training data is high-quality, well-labeled, and directly relevant to the target task. Data quality significantly affects both the performance of the fine-tuned model and any bias it exhibits.
- Training: Experiment with different configurations, such as the number of epochs and the learning rate multiplier, to optimize the model's performance on the target task.
- Evaluation:
  - Metric: Choose appropriate evaluation metrics that accurately reflect the success of the fine-tuned model for your specific task
  - Evaluation Set: Use a separate set of data to evaluate the model's performance
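
As an illustration of a task-specific metric, the sketch below computes a token-overlap F1 score between a generated answer and a reference answer, a metric often reported for extractive Q&A. It is a simplified version that only lowercases and splits on whitespace; the function name `token_f1` is hypothetical and is not used elsewhere in this notebook.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer counts as a miss.
        return float(pred_tokens == ref_tokens)
    # The multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging this score over a held-out evaluation set gives a single number you can compare between the base model and the tuned model.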

### Recommended configurations
The following table shows the recommended configurations for tuning a foundation model by task:

| Task           | No. of examples in dataset | Number of epochs |
| -------------- | -------------------------- | ----------- |
| Classification | 500+                       | 2-4         |
| Summarization  | 1000+                      | 2-4         |
| Extractive QA  | 500+                       | 2-4         |
| Chat           | 1000+                      | 2-4         |

Before running this notebook, ensure you have:

- A Google Cloud project: Provide your project ID in the `PROJECT_ID` variable.

- Authenticated your Colab environment: Run the authentication code block at the beginning.

- Prepared training data: Data should be formatted in JSON Lines with prompts and corresponding completions.

## Getting Started

### Install Vertex AI SDK and other required packages

In [1]:
!pip3 install --upgrade --user --quiet google-cloud-aiplatform

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [2]:
import sys

if "google.colab" in sys.modules:
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>

## Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.

In [3]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

- If you are running this notebook in a local development environment:
  - Install the [Google Cloud SDK](https://cloud.google.com/sdk).
  - Obtain authentication credentials. Create local credentials by running the following command and following the oauth2 flow (read more about the command [here](https://cloud.google.com/sdk/gcloud/reference/beta/auth/application-default/login)):

    ```bash
    gcloud auth application-default login
    ```

## Set Project and Location

First, set your project ID, location, and bucket name. You can use an existing bucket within the project or create a new one in the cells below.

In [1]:
PROJECT_ID = "[your-project]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

BUCKET_NAME = "[your-bucket]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [2]:
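import uuid

# Assumption: fall back to a randomly generated bucket name if the placeholder was left unchanged
if BUCKET_NAME in ("", None, "[your-bucket]", "<your-bucket-name>"):
    BUCKET_NAME = f"vertex-{uuid.uuid4().hex[:8]}"
    BUCKET_URI = f"gs://{BUCKET_NAME}"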

**Warning**: Run the following cell only if your bucket doesn't already exist. It creates your Cloud Storage bucket.

In [None]:
! gsutil mb -l $LOCATION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

## Import Libraries

In [4]:
import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    Part,
    HarmCategory,
    HarmBlockThreshold,
    GenerationConfig,
)
from vertexai.preview.tuning import sft

vertexai.init(project=PROJECT_ID, location=LOCATION)

from typing import Union
import pandas as pd
from google.cloud import bigquery
from sklearn.model_selection import train_test_split
import datetime
import time

## Supervised fine tuning with Gemini on a question and answer dataset

Now it's time for you to create a tuning job. You will use a Q&A dataset with context, stored in JSON Lines format.

Supervised fine-tuning allows focused adaptation of foundation models to new tasks. You can create a supervised text model tuning job using the Google Cloud console, API, or the Vertex AI SDK for Python. You can read more on our [documentation page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning).

But how do you ensure your data is primed for success with supervised fine-tuning? Here's a breakdown of critical areas to focus on:

- **Domain Alignment:** Supervised fine-tuning thrives on smaller datasets, but they must be highly relevant to your downstream task. Seek out data that closely mirrors the domain you will encounter in real-world use cases.
- **Labeling Accuracy:** Noisy labels will sabotage even the best technique. Prioritize accuracy in your annotations and labeling.
- **Noise Reduction:** Outliers, inconsistencies, or irrelevant examples hurt model adaptation. Implement preprocessing, such as removing duplicates, fixing typos, and verifying that data conforms to your task's expectations.
- **Distribution:** A diverse range of examples will help your model generalize better within the confines of your target task. Refrain from overloading the process with excessive variance that strays from your core domain.
- **Balanced Classes:** For classification tasks, try to keep a reasonable balance between different classes to avoid the model learning biases towards a specific class
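
The noise-reduction step can be sketched in plain Python as follows. This is a minimal example, not a complete cleaning pipeline: it only trims whitespace, drops examples with an empty question or answer, and removes exact duplicates. The helper name `clean_examples` is hypothetical, and the `input_text` and `output_text` keys are chosen to mirror the column names used later in this notebook.

```python
def clean_examples(examples):
    """Drop empty and duplicate (question, answer) pairs, keeping first occurrences."""
    seen = set()
    cleaned = []
    for example in examples:
        # Treat missing or None values as empty strings before trimming.
        question = (example.get("input_text") or "").strip()
        answer = (example.get("output_text") or "").strip()
        if not question or not answer:
            continue  # skip incomplete examples
        key = (question, answer)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"input_text": question, "output_text": answer})
    return cleaned
```

Near-duplicates, typos, and off-topic examples need task-specific checks that this sketch does not attempt.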


### Fetching data from BigQuery
💾 Your model tuning dataset must be in a JSONL format where each line contains a single training example. You must make sure that you include instructions.

You will use the [Stack Overflow dataset](https://cloud.google.com/blog/topics/public-datasets/google-bigquery-public-datasets-now-include-stack-overflow-q-a) on BigQuery Public Datasets, limited to questions whose tags contain `python` and that have an accepted answer created on or after 2020-01-01.

You will use a helper function to read the data from BigQuery and create a Pandas dataframe.

In [5]:
def run_bq_query(sql: str) -> pd.DataFrame:
    """
    Run a BigQuery query and return the result as a DataFrame
    Args:
        sql: SQL query, as a string, to execute in BigQuery
    Returns:
        df: DataFrame of results from the query; query errors are raised as exceptions
    """

    bq_client = bigquery.Client(project=PROJECT_ID)

    # Try dry run before executing query to catch any errors
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    bq_client.query(sql, job_config=job_config)

    # If dry run succeeds without errors, proceed to run query
    job_config = bigquery.QueryJobConfig()
    client_result = bq_client.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query job to finish, then fetch the result as a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")

    return df

Next, write the query. For now, limit the result to 550 rows.

In [None]:
stack_overflow_df = run_bq_query(
    """SELECT
           CONCAT(q.title, q.body) AS input_text,
           a.body AS output_text
       FROM `bigquery-public-data.stackoverflow.posts_questions` q
       JOIN `bigquery-public-data.stackoverflow.posts_answers` a
         ON q.accepted_answer_id = a.id
       WHERE q.accepted_answer_id IS NOT NULL
         AND REGEXP_CONTAINS(q.tags, "python")
         AND a.creation_date >= "2020-01-01"
       LIMIT 550
    """
)

stack_overflow_df.head()

There should be 550 questions and answers.

In [None]:
print(len(stack_overflow_df))

#### Adding instructions
Fine-tuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks [(Google, 2022)](https://arxiv.org/pdf/2210.11416.pdf).

An instruction is a directive that tells the model which task to perform. Instructions can take various forms, such as step-by-step procedures, commands, or rules. Without an instruction, each example is only a question and an answer, and nothing tells the large language model what to do with them. Here the task is to answer the question, so extend the dataset with an instruction that says so.

In [8]:
# The trailing backslashes join the lines, so each sentence needs its own period
INSTRUCTION_TEMPLATE = """\
You are a helpful Python developer. \
You are good at answering Stack Overflow questions. \
Your mission is to provide developers with helpful answers that work.
"""

Store the `INSTRUCTION_TEMPLATE` in a new column rather than overwriting `input_text`, which you might want to use later.

In [None]:
stack_overflow_df["input_text_instruct"] = INSTRUCTION_TEMPLATE

stack_overflow_df.head(2)

Next, randomly split the data into training and evaluation sets. For extractive Q&A tasks, we advise 500+ training examples. In this case, you will use 440 so that the tuning job runs faster.

20% of your dataset will be used for evaluation. The `random_state` parameter controls the shuffling applied to the data before the split; pass an int for reproducible output across multiple function calls. Feel free to adjust both values.

In [None]:
# split is set to 80/20
train, evaluation = train_test_split(stack_overflow_df, test_size=0.2, random_state=42)

print(len(train))
print(len(evaluation))
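
To make the effect of the seed and the 80/20 ratio concrete, here is a minimal standard-library sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; the helper name `shuffled_split` and the stand-in row list are assumptions for illustration only.

```python
import random


def shuffled_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the evaluation share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(550))  # stand-in for the 550 question/answer rows
train_rows, eval_rows = shuffled_split(rows)
print(len(train_rows), len(eval_rows))  # 440 110

# Re-running with the same seed reproduces exactly the same split.
print(shuffled_split(rows) == (train_rows, eval_rows))  # True
```

Changing `seed` moves different rows into the evaluation set, while the 440/110 sizes stay the same.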

#### Generating the JSONL files

Prepare your training data in a JSONL (JSON Lines) file and store it in a Google Cloud Storage (GCS) bucket. This format ensures efficient processing. Each line of the JSONL file must represent a single data instance and follow a well-defined schema:

`{"messages": [{"role": "system", "content": "instructions"}, {"role": "user", "content": "question"}, {"role": "model", "content": "answer"}]}`

This is how the placeholders map to the pandas DataFrame columns:

*   `instructions -> input_text_instruct`
*   `question -> input_text`
*   `answer -> output_text`



In [11]:
date = datetime.datetime.now().strftime("%H:%d:%m:%Y")

tuning_data_filename = f"tune_data_stack_overflow_qa-{date}.jsonl"
validation_data_filename = f"validation_data_stack_overflow_qa-{date}.jsonl"

In [12]:
def format_messages(row):
    """Formats a single row into the desired JSONL structure"""
    return {
        "messages": [
            {"role": "system", "content": row["input_text_instruct"]},
            {"role": "user", "content": row["input_text"]},
            {"role": "model", "content": row["output_text"]},
        ]
    }

In [13]:
# Apply formatting function to each row, then convert to JSON Lines format
tuning_data = train.apply(format_messages, axis=1).to_json(orient="records", lines=True)

# Save the result to a JSONL file
with open(tuning_data_filename, "w") as f:
    f.write(tuning_data)

Next, check that the number of rows matches your pandas DataFrame.

In [None]:
with open(tuning_data_filename, "r") as f:
    num_rows = sum(1 for line in f)

print("Number of rows in the JSONL file:", num_rows)
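
Counting rows does not confirm that every line is well-formed. The following sketch checks a single JSONL line against the `messages` schema used in this notebook. It is an illustrative helper, not an official validator: the function name `validate_jsonl_line` and the exact checks are assumptions.

```python
import json

# Role order used by the schema in this notebook.
EXPECTED_ROLES = ["system", "user", "model"]


def validate_jsonl_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for message in messages:
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, str) or not content.strip():
            problems.append("empty or non-string content")
    return problems
```

To check a whole file, call the function on each line of `tuning_data_filename` and report the line numbers for which the returned list is not empty.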

You will do the same for the validation dataset.

In [15]:
# Apply formatting function to each row, then convert to JSON Lines format
validation_data = evaluation.apply(format_messages, axis=1).to_json(
    orient="records", lines=True
)

# Save the result to a JSONL file
with open(validation_data_filename, "w") as f:
    f.write(validation_data)

Next, you will copy the JSONL files into the Google Cloud Storage bucket you specified or created at the beginning of the notebook.

In [None]:
!gsutil cp $tuning_data_filename $validation_data_filename $BUCKET_URI

Next you can check if the files are in the bucket.

In [None]:
!gsutil ls -al $BUCKET_URI

Now, you will create two variables for the data.


In [18]:
TUNING_DATA_URI = f"{BUCKET_URI}/{tuning_data_filename}"
VALIDATION_DATA_URI = f"{BUCKET_URI}/{validation_data_filename}"

### Create a supervised tuning job using Gemini
Now it's time for you to start your tuning job. You will use the `gemini-1.0-pro-002` model.

In [19]:
foundation_model = GenerativeModel("gemini-1.0-pro-002")

In [None]:
# Tune a model using `train` method.
sft_tuning_job = sft.train(
    source_model=foundation_model,
    train_dataset=TUNING_DATA_URI,
    # Optional:
    validation_dataset=VALIDATION_DATA_URI,
    epochs=3,
    learning_rate_multiplier=1.0,
)

# Get the tuning job info.
sft_tuning_job.to_dict()

Tuning a model takes some time. You will monitor the job's state in a later cell.

First, retrieve the resource name of the tuning job.

In [None]:
# Get the resource name of the tuning job
sft_tuning_job_name = sft_tuning_job.resource_name
sft_tuning_job_name

Run the next cell and wait until the job has finished before you continue.

In [None]:
%%time
# Wait for job completion
while not sft_tuning_job.refresh().has_ended:
    time.sleep(60)
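
The loop above waits indefinitely. If you prefer an upper bound on the wait, a polling helper with a timeout is one option. The sketch below is generic and does not call the Vertex AI SDK itself; the name `wait_until`, the four-hour default, and the injectable `sleep` parameter are all assumptions made for illustration.

```python
import time


def wait_until(is_done, timeout_s=4 * 60 * 60, poll_interval_s=60, sleep=time.sleep):
    """Poll is_done() until it returns True or timeout_s elapses.

    Returns True if is_done() reported completion, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

With the tuning job from this notebook, the call would look like `wait_until(lambda: sft_tuning_job.refresh().has_ended)`. A `False` return value means the timeout was reached, not that the job failed.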

In [None]:
# tuned model name
tuned_model_name = sft_tuning_job.tuned_model_name
tuned_model_name

You can also use `list()` to retrieve your tuning jobs.

In [None]:
sft_tuning_job.list()

Your model is automatically deployed to a Vertex AI endpoint and is ready to use!

In [None]:
# tuned model endpoint name
tuned_model_endpoint_name = sft_tuning_job.tuned_model_endpoint_name
tuned_model_endpoint_name

### Load the tuned generative model

In [None]:
tuned_model = GenerativeModel(tuned_model_endpoint_name)
print(tuned_model)

Call the API with a test question:

In [None]:
tuned_model.generate_content(
    "How do I store a Tensorflow checkpoint on Google Cloud Storage while training?"
)