In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Supervised fine-tuning with Gemini 1.5 Flash for Q&A

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/tuning/supervised_finetuning_using_gemini_qa.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Ftuning%2Fsupervised_finetuning_using_gemini_qa.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/tuning/supervised_finetuning_using_gemini_qa.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/tuning/supervised_finetuning_using_gemini_qa.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
| Author(s) | [Erwin Huizenga](https://github.com/erwinh85) |

## Overview

**Gemini** is a family of generative AI models developed by Google DeepMind and designed for multimodal use cases. The Gemini API gives you access to the various Gemini models, such as Gemini 1.5 Pro and Gemini 1.5 Flash.

This notebook demonstrates how to fine-tune Gemini 1.5 Flash with the Vertex AI supervised tuning feature. Supervised tuning uses your own labeled training data to further refine the base model's capabilities for your specific task. Each labeled example demonstrates the output you want from your text model during inference.

- **Data:** Make sure your training data is high quality, well labeled, and directly relevant to the target task. Low-quality data can degrade performance and introduce bias in the fine-tuned model.
- **Training:** Experiment with different configurations to optimize the model's performance on the target task.
- **Evaluation:**
  - **Metric:** Choose evaluation metrics that accurately reflect the success of the fine-tuned model on your specific task.
  - **Evaluation set:** Use a separate set of data to evaluate the model's performance.


Refer to public [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning) for more details.

<hr/>

Before running this notebook, ensure you have:

- A Google Cloud project: Provide your project ID in the `PROJECT_ID` variable.

- Authenticated your Colab environment: Run the authentication code block at the beginning.

- Prepared training data (use your own data or the SQuAD data provided in this notebook): the data must be in JSONL format, with each line pairing a prompt with its expected response.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

To get an estimate of the number of tokens

## Get started

### Install Vertex AI SDK and other required packages


In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set the Google Cloud project information and initialize the Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
import os

import vertexai

PROJECT_ID = "<your-project-id>"  # @param {type:"string", isTemplate: true}
# Fall back to the GOOGLE_CLOUD_PROJECT environment variable if no project ID is provided.
if not PROJECT_ID or PROJECT_ID == "<your-project-id>":
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries

In [3]:
from collections import Counter
import json
import time

# Vertex AI SDK (experiments and metadata)
from google.cloud import aiplatform
from google.cloud.aiplatform.metadata import context
from google.cloud.aiplatform.metadata import utils as metadata_utils

# Data handling and plotting
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Vertex AI SDK (generative models and supervised tuning)
from vertexai.generative_models import GenerationConfig, GenerativeModel
from vertexai.preview.tuning import sft

### Data

#### SQuAD dataset
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

You can find more information on the [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/).


```
@inproceedings{rajpurkar-etal-2016-squad,
    title = "{SQ}u{AD}: 100,000+ Questions for Machine Comprehension of Text",
    author = "Rajpurkar, Pranav  and
      Zhang, Jian  and
      Lopyrev, Konstantin  and
      Liang, Percy",
    editor = "Su, Jian  and
      Duh, Kevin  and
      Carreras, Xavier",
    booktitle = "Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2016",
    address = "Austin, Texas",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D16-1264",
    doi = "10.18653/v1/D16-1264",
    pages = "2383--2392",
    eprint={1606.05250},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
}
```

First update the `BUCKET_NAME` parameter below. You can either use an existing bucket or create a new one.

In [None]:
# Provide a bucket name
BUCKET_NAME = "<your_bucket_name>"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"
print(BUCKET_URI)

Uncomment and run the cell below only if you want to create a new Cloud Storage bucket.

In [5]:
# ! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Next, copy the SQuAD test, train, and validation CSV files from Cloud Storage to your local working directory.

In [None]:
!gsutil cp gs://github-repo/generative-ai/gemini/tuning/qa/squad_test.csv .
!gsutil cp gs://github-repo/generative-ai/gemini/tuning/qa/squad_train.csv .
!gsutil cp gs://github-repo/generative-ai/gemini/tuning/qa/squad_validation.csv .

### Baseline

Next, you will prepare data to establish a baseline, which means evaluating the out-of-the-box model on a representative sample of your dataset before any fine-tuning. A baseline allows you to quantify the improvements achieved through fine-tuning.

In [None]:
test_df = pd.read_csv("squad_test.csv")
test_df.head(2)

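First, normalize the ground-truth answers so that they can be compared consistently with the model's predictions. Here, normalization lowercases the text and collapses all whitespace, including newlines, into single spaces; punctuation is left unchanged. The cell also sets `row_dataset`, the position of the test example used for spot checks later.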

In [10]:
# Position (used with .iloc) of the test example used for spot checks below.
row_dataset = 46


def normalize_answer(s):
    """Lowercase text and collapse all whitespace (including newlines) to single spaces."""

    def white_space_fix(text):
        return " ".join(text.split())  # Splits on any whitespace, including \n

    def lower(text):
        return text.lower()

    return white_space_fix(lower(s))


# Normalize the ground-truth answers in the test set.
test_df["answers"] = test_df["answers"].apply(normalize_answer)
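
To check what this normalization actually does, the sketch below reimplements the same two steps in one line (`normalize_answer_sketch` is a hypothetical name used only here). It shows that newlines are collapsed rather than preserved, and that punctuation survives, so a trailing period still breaks an exact match.

```python
def normalize_answer_sketch(s: str) -> str:
    """Same behavior as normalize_answer: lowercase, then collapse all whitespace."""
    return " ".join(s.lower().split())


# Newlines and repeated spaces collapse to one space; punctuation is kept.
print(repr(normalize_answer_sketch("  The Eiffel\nTower. ")))  # 'the eiffel tower.'
```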

Make sure your test data looks the same as your training data to prevent training/serving skew. Prepare the test set as follows:

- `systemInstruct`: System instructions are a set of instructions that the model processes before it processes prompts. We recommend using system instructions to tell the model how you want it to behave and respond to prompts. Here, the system instruction also includes three few-shot examples sampled from the test set; those rows are removed from the test set so they are not used for evaluation.
- `input_question`: Combines the `context` and `question`. Both are sent to the model to generate a response.
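
The next cell samples the few-shot examples without a seed, so each run can pick different rows. As a minimal sketch on a hypothetical toy DataFrame (not the SQuAD data), passing `random_state` to `DataFrame.sample` makes the draw reproducible, and `DataFrame.drop` removes the sampled rows so they are never scored. The `toy_` names are illustrative only and do not replace the notebook's variables.

```python
import pandas as pd

# Hypothetical stand-in for the SQuAD test split.
toy_df = pd.DataFrame(
    {
        "context": [f"Context {i}" for i in range(6)],
        "question": [f"Question {i}?" for i in range(6)],
        "answers": [f"answer {i}" for i in range(6)],
    }
)

# A fixed random_state draws the same few-shot rows on every run.
toy_few_shot = toy_df.sample(3, random_state=42)
# Drop the few-shot rows so they are never evaluated as test questions.
toy_eval_df = toy_df.drop(toy_few_shot.index)

# Build the few-shot block in the same "Context/Question/Answer" layout as the notebook.
toy_prompt = "".join(
    f"Context: {row.context}\nQuestion: {row.question}\nAnswer: {row.answers}\n\n"
    for row in toy_few_shot.itertuples(index=False)
)
print(toy_prompt)
```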

In [None]:
few_shot_examples = test_df.sample(3)
# Get the indices of the sampled rows
dropped_indices = few_shot_examples.index
# Remove the sampled rows from the original DataFrame
test_df = test_df.drop(dropped_indices)

few_shot_prompt = ""
for _, row in few_shot_examples.iterrows():
    few_shot_prompt += (
        f"Context: {row.context}\nQuestion: {row.question}\nAnswer: {row.answers}\n\n"
    )

print(few_shot_prompt)

In [14]:
# Incorporate few-shot examples into the system instruction
systemInstruct = f"""Answer the question with a concise extract from the given context. Do not add any additional information, capital letters (only for names) or a punctuation mark in the end.\n\n
Here are some examples: \n\n
{few_shot_prompt}"""

In [None]:
# Add the system instruction as a column, and combine context + question into the input prompt column.
test_df["systemInstruct"] = systemInstruct

test_df["input_question"] = (
    "\n\n **Below the question with context that you need to answer**"
    + "\n Context: "
    + test_df["context"]
    + "\n Question: "
    + test_df["question"]
)

test_systemInstruct = test_df["systemInstruct"].iloc[row_dataset]
print(test_systemInstruct)
test_question = test_df["input_question"].iloc[row_dataset]
print(test_question)

Next, set the model that you will use. This example uses `gemini-1.5-flash-002`, a multimodal model designed for high-volume, cost-effective applications that delivers speed and efficiency without compromising on quality.

For the latest Gemini models and versions, see the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models).


In [16]:
base_model = "gemini-1.5-flash-002"

In [None]:
y_true = test_df["answers"].values
input_questions = test_df["input_question"].values

# Check the first two pairs of input question and true answer.
for i in range(2):
    print(f"Pair {i+1}:")
    print(f"  True Answer: {y_true[i]}")
    print(f"  Input Question: {input_questions[i]}")

Next, take a single question and get a prediction from Gemini that you can compare to the actual answer.

In [18]:
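def get_predictions(question: str) -> str:
    """Generates a prediction for one input question.

    Note: this reads the globals `base_model` and `systemInstruct` at call time, so
    reassigning `base_model` later switches this function to a different model.
    """

    generation_model = GenerativeModel(base_model, system_instruction=systemInstruct)

    # A low temperature reduces randomness in the answers.
    generation_config = GenerationConfig(temperature=0.1)

    response = generation_model.generate_content(
        contents=question, generation_config=generation_config
    ).text

    return response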

In [None]:
test_answer = test_df["answers"].iloc[row_dataset]
response = get_predictions(test_question)

print(f"Gemini response: {response}")
print(f"Actual answer: {test_answer}")

Both answers are correct, but Gemini's response is longer, whereas answers in the SQuAD dataset are typically short extracts from the context.

Fine-tuning is a great way to control the type of output your use case requires. In this instance, you want the model to provide short, clear answers.

In [None]:
# Apply get_predictions() to the 'input_question' column (one API call per row).
test_df["predicted_answer"] = test_df["input_question"].apply(get_predictions)
test_df.head(2)
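
The cell above sends one request per test row. On larger test sets you may hit quota limits; a common mitigation is retrying with exponential backoff. `with_retries` below is a hypothetical helper, not part of the Vertex AI SDK, and it retries on any exception by default. In practice, you would narrow `retry_on` to the quota error your client raises (for example `google.api_core.exceptions.ResourceExhausted`).

```python
import functools
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[..., T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Callable[..., T]:
    """Hypothetical helper: retries `fn` with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> T:
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except retry_on:
                if attempt == max_attempts:
                    raise  # Out of attempts: surface the last error.
                # Wait base_delay_s * 2^(attempt - 1), plus up to 10% random jitter.
                delay = base_delay_s * 2 ** (attempt - 1)
                time.sleep(delay + random.uniform(0, 0.1 * delay))
        raise AssertionError("unreachable")  # The loop always returns or raises.

    return wrapper
```

For example, you could wrap the prediction function as `test_df["input_question"].apply(with_retries(get_predictions))`.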

Next, normalize the predicted answers the same way as the ground-truth answers so that the two can be compared.

In [None]:
test_df["predicted_answer"] = test_df["predicted_answer"].apply(normalize_answer)
test_df.head(4)

Next, establish a baseline using evaluation metrics.

Evaluating the performance of a Question Answering (QA) system requires specific metrics. Two commonly used metrics are Exact Match (EM) and F1 score.

EM is a strict measure that only considers an answer correct if it exactly matches the ground truth after normalization. In this notebook, normalization only lowercases the text and collapses whitespace, so punctuation still has to match. It's a binary metric: 1 for a perfect match and 0 otherwise. This makes it sensitive to minor variations in phrasing.

F1 score is more flexible. It considers the overlap between the predicted answer and the true answer in terms of individual words or tokens. It calculates the harmonic mean of precision (proportion of correctly predicted words out of all predicted words) and recall (proportion of correctly predicted words out of all true answer words). This allows for partial credit and is less sensitive to minor wording differences.

In practice, EM is useful when exact wording is crucial, while F1 is more suitable when evaluating the overall understanding and semantic accuracy of the QA system. Often, both metrics are used together to provide a comprehensive evaluation.
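
As a worked example of the F1 arithmetic, take a hypothetical, already-normalized prediction `"the eiffel tower"` and ground truth `"eiffel tower in paris"`. Two tokens overlap, so precision is 2/3, recall is 2/4, and F1 is their harmonic mean, 4/7. Exact match is 0 because the strings differ.

```python
from collections import Counter

prediction = "the eiffel tower"
ground_truth = "eiffel tower in paris"

pred_tokens = prediction.split()  # ['the', 'eiffel', 'tower']
true_tokens = ground_truth.split()  # ['eiffel', 'tower', 'in', 'paris']

# Multiset intersection: tokens shared by both answers, counted with multiplicity.
num_same = sum((Counter(pred_tokens) & Counter(true_tokens)).values())  # 2

precision = num_same / len(pred_tokens)  # 2 / 3
recall = num_same / len(true_tokens)  # 2 / 4
f1_example = 2 * precision * recall / (precision + recall)  # 4 / 7
em_example = int(prediction == ground_truth)  # 0

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1_example:.3f} em={em_example}")
# precision=0.667 recall=0.500 f1=0.571 em=0
```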

In [None]:
def f1_score_squad(prediction, ground_truth):
    """Token-level F1 between a predicted answer and the ground-truth answer."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    # Tokens shared by both answers, counted with multiplicity.
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    """True if the normalized prediction equals the normalized ground truth."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def calculate_em_and_f1(y_true, y_pred):
    """Calculates mean EM and F1 scores over paired ground-truth and predicted answers."""

    # Convert to Series and reset the index so pairs are aligned by position,
    # even if one input is a filtered DataFrame column and the other a list or array.
    y_true = pd.Series(y_true).reset_index(drop=True)
    y_pred = pd.Series(y_pred).reset_index(drop=True)

    # Series.combine calls func(y_pred_value, y_true_value) for each position.
    em = np.mean(y_pred.combine(y_true, exact_match_score))
    f1 = np.mean(y_pred.combine(y_true, f1_score_squad))

    # # Print non-matching pairs (using index for clarity)
    # for i, (t, p) in enumerate(zip(y_true, y_pred)):
    #     if not exact_match_score(p, t):
    #         print(f"No EM Match at index {i}:\nTrue: {t}\nPred: {p}\n")

    return em, f1

In [24]:
em, f1 = calculate_em_and_f1(test_df["answers"], test_df["predicted_answer"])
print(f"EM score: {em}")
print(f"F1 score: {f1}")

EM score: 0.5714285714285714
F1 score: 0.8134526797963079


### Prepare the data for fine-tuning

To optimize the supervised fine-tuning process for a foundation model, ensure your dataset includes examples that reflect the desired task. Each record in the dataset pairs an input text (or prompt) with its corresponding expected output. This supervised tuning approach uses the dataset to effectively teach the model the specific behavior or task you need it to perform, by providing numerous illustrative examples.

The size of your dataset depends on the complexity of the task; as a general rule, more high-quality examples lead to better model performance. For fine-tuning Gemini on Vertex AI, the minimum number of examples is 100.

#### Dataset format

Your training data must be a JSONL file stored at a Google Cloud Storage (GCS) URI. Each line in the JSONL file must adhere to the following schema:

- A `contents` array containing objects that define:
  - A `role` (`"user"` for user input or `"model"` for model output)
  - `parts` containing the input data.
- Optionally, a `systemInstruction` object with its own `parts`, as used later in this notebook.

```
{
   "contents":[
      {
         "role":"user",  # This indicates input content
         "parts":[
            {
               "text":"How are you?"
            }
         ]
      },
      {
         "role":"model", # This indicates target content
         "parts":[ # text only
            {
               "text":"I am good, thank you!"
            }
         ]
      }
      #  ... repeat "user", "model" for multi turns.
   ]
}
```

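Refer to the public [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning-prepare#about-datasets) for more details. The example above is pretty-printed for readability; in the JSONL file, each example must be on a single line.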
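
Before uploading, it can help to sanity-check each JSONL line against the schema described above. The sketch below is a hypothetical checker, not the validation Vertex AI runs: it assumes turns alternate `user`/`model` starting with `user` and ending with the `model` target, and it ignores the optional `systemInstruction` field.

```python
import json
from typing import List


def validate_tuning_line(line: str) -> List[str]:
    """Hypothetical checker: returns a list of problems found in one JSONL record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    contents = record.get("contents")
    if not isinstance(contents, list) or not contents:
        return ["missing or empty 'contents' array"]

    problems = []
    for i, turn in enumerate(contents):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: must be a JSON object")
            continue
        # Assumed convention: user, model, user, model, ...
        expected_role = "user" if i % 2 == 0 else "model"
        if turn.get("role") != expected_role:
            problems.append(f"turn {i}: expected role {expected_role!r}, got {turn.get('role')!r}")
        parts = turn.get("parts")
        if not isinstance(parts, list) or not parts or not all(
            isinstance(p, dict) and isinstance(p.get("text"), str) for p in parts
        ):
            problems.append(f"turn {i}: 'parts' must be a non-empty list of {{'text': ...}} objects")

    # The last turn is the training target, so it should come from the model.
    last = contents[-1]
    if isinstance(last, dict) and last.get("role") != "model":
        problems.append("last turn should have role 'model'")
    return problems
```

To use it, read `squad_train.jsonl` line by line and print any problems it returns; an empty list means the line passed these checks.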

In [26]:
# Load the training and validation splits.
train_df = pd.read_csv("squad_train.csv")
validation_df = pd.read_csv("squad_validation.csv")

In [27]:
# Combine context + question into the same input format used for the test set.
# The system instruction is added separately as `systemInstruction` in the JSONL file.
def build_input_question(df):
    return (
        "\n\n **Below the question with context that you need to answer**"
        + "\n Context: "
        + df["context"]
        + "\n Question: "
        + df["question"]
    )


train_df["input_question"] = build_input_question(train_df)
validation_df["input_question"] = build_input_question(validation_df)

In [None]:
def df_to_jsonl(df, output_file):
    """Converts a Pandas DataFrame to JSONL format and saves it to a file.

    Each row becomes one tuning example: the shared system instruction, the
    user turn (`input_question`), and the expected model turn (`answers`).

    Args:
      df: The DataFrame to convert. Must have `input_question` and `answers` columns.
      output_file: The name of the output file.
    """

    with open(output_file, "w", encoding="utf-8") as f:
        for row in df.itertuples(index=False):
            jsonl_obj = {
                "systemInstruction": {"parts": [{"text": systemInstruct}]},
                "contents": [
                    {"role": "user", "parts": [{"text": row.input_question}]},
                    {"role": "model", "parts": [{"text": row.answers}]},
                ],
            }
            # One JSON object per line, as required by the JSONL format.
            f.write(json.dumps(jsonl_obj) + "\n")


# Process the DataFrames
df_to_jsonl(train_df, "squad_train.jsonl")
df_to_jsonl(validation_df, "squad_validation.jsonl")

print("JSONL data written to squad_train.jsonl")
print("JSONL data written to squad_validation.jsonl")

Next, copy the JSONL files into your Cloud Storage bucket.

In [None]:
!gsutil cp ./squad_train.jsonl {BUCKET_URI}
!gsutil cp ./squad_validation.jsonl {BUCKET_URI}

### Start fine-tuning job
Next, start the fine-tuning job with the following parameters:

- `source_model`: The base Gemini model version you want to fine-tune.
- `train_dataset`: Cloud Storage URI of your training data in JSONL format.

*Optional parameters*

- `validation_dataset`: If provided, this data is used to evaluate the model during tuning.
- `tuned_model_display_name`: Display name for the tuned model.
- `epochs`: The number of training epochs to run.
- `learning_rate_multiplier`: A value that scales the learning rate during training.
- `adapter_size`: Gemini 1.5 Flash supports adapter sizes [1, 4]; the default value is 4.

**Important**: The default hyperparameter settings are based on extensive testing and are recommended as a starting point. You can customize these parameters to meet specific performance requirements.

In [None]:
tuned_model_display_name = "fine-tuning-gemini-flash-qa-v01"  # @param {type:"string"}

sft_tuning_job = sft.train(
    source_model=base_model,
    train_dataset=f"{BUCKET_URI}/squad_train.jsonl",
    # Optional:
    validation_dataset=f"{BUCKET_URI}/squad_validation.jsonl",
    tuned_model_display_name=tuned_model_display_name,
)

In [None]:
# Get the tuning job info.
sft_tuning_job.to_dict()

In [None]:
# Get the resource name of the tuning job
sft_tuning_job_name = sft_tuning_job.resource_name
sft_tuning_job_name

**Important:** Tuning time depends on several factors, such as training data size, number of epochs, learning rate multiplier, etc.

<div class="alert alert-block alert-warning">
<b>⚠️ It will take ~30 mins for the model tuning job to complete on the provided dataset and set configurations/hyperparameters. ⚠️</b>
</div>

In [None]:
%%time
# Wait for job completion
while not sft_tuning_job.refresh().has_ended:
    time.sleep(60)
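
The loop above polls until the job ends but never gives up. If you want a cap on waiting time, a small helper like the hypothetical `wait_until` below (not part of the Vertex AI SDK) polls a condition with a timeout; the 3-hour default is an arbitrary assumption. You could call it as `wait_until(lambda: sft_tuning_job.refresh().has_ended)`.

```python
import time
from typing import Callable


def wait_until(
    is_done: Callable[[], bool],
    poll_interval_s: float = 60.0,
    timeout_s: float = 3 * 60 * 60,
) -> bool:
    """Hypothetical helper: polls `is_done()` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met and False if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```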

In [None]:
# tuned model name
tuned_model_name = sft_tuning_job.tuned_model_name
tuned_model_name

In [None]:
# tuned model endpoint name
tuned_model_endpoint_name = sft_tuning_job.tuned_model_endpoint_name
tuned_model_endpoint_name

#### Model tuning metrics

- `/train_total_loss`: Loss for the tuning dataset at a training step.
- `/train_fraction_of_correct_next_step_preds`: The token accuracy at a training step. A single prediction consists of a sequence of tokens. This metric measures the accuracy of the predicted tokens when compared to the ground truth in the tuning dataset.
- `/train_num_predictions`: Number of predicted tokens at a training step.

#### Model evaluation metrics

- `/eval_total_loss`: Loss for the evaluation dataset at an evaluation step.
- `/eval_fraction_of_correct_next_step_preds`: The token accuracy at an evaluation step. A single prediction consists of a sequence of tokens. This metric measures the accuracy of the predicted tokens when compared to the ground truth in the evaluation dataset.
- `/eval_num_predictions`: Number of predicted tokens at an evaluation step.

The metrics visualizations are available after the model tuning job completes. If you don't specify a validation dataset when you create the tuning job, only the visualizations for the tuning metrics are available.
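
To make the token accuracy metrics concrete, the sketch below computes a fraction of correct next-token predictions on hypothetical token IDs. It illustrates the general idea only; it is an assumption, not Vertex AI's internal implementation, and `ignore_id` is a hypothetical padding convention. In this sketch, `len(scored)` plays the role of the number of predicted tokens.

```python
from typing import Optional, Sequence


def next_token_accuracy(
    predicted_ids: Sequence[int],
    target_ids: Sequence[int],
    ignore_id: Optional[int] = None,
) -> float:
    """Fraction of target positions where the predicted token matches the target token.

    Positions whose target equals `ignore_id` (for example, a hypothetical padding ID)
    are excluded, so they count toward neither the numerator nor the denominator.
    """
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predicted_ids and target_ids must have the same length")
    scored = [(p, t) for p, t in zip(predicted_ids, target_ids) if t != ignore_id]
    if not scored:
        return 0.0
    num_correct = sum(p == t for p, t in scored)
    return num_correct / len(scored)


# Hypothetical token IDs; 0 marks padding in this toy example.
accuracy = next_token_accuracy([5, 7, 9, 0], [5, 8, 9, 0], ignore_id=0)
print(accuracy)  # 2 correct out of 3 scored tokens
```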

In [None]:
# Get resource name from tuning job.
experiment_name = sft_tuning_job.experiment.resource_name
experiment_name

In [37]:
# Locate Vertex AI Experiment and Vertex AI Experiment Run
experiment = aiplatform.Experiment(experiment_name=experiment_name)
filter_str = metadata_utils._make_filter_string(
    schema_title="system.ExperimentRun",
    parent_contexts=[experiment.resource_name],
)
experiment_run = context.Context.list(filter_str)[0]

In [38]:
# Read data from TensorBoard.
# The run ID is the experiment run name with the experiment name and the
# following separator character removed.
run_id = experiment_run.name.replace(experiment.name, "")[1:]
tensorboard_run_name = (
    f"{experiment.get_backing_tensorboard_resource().resource_name}"
    f"/experiments/{experiment.name}/runs/{run_id}"
)
tensorboard_run = aiplatform.TensorboardRun(tensorboard_run_name)
metrics = tensorboard_run.read_time_series_data()

In [39]:
def get_metrics(metric: str = "/train_total_loss"):
    """
    Get metrics from Tensorboard.

    Args:
      metric: Metric name, e.g. /train_total_loss or /eval_total_loss.
    Returns:
      steps: List of steps.
      steps_loss: List of metric values, one per step.
    """
    loss_values = metrics[metric].values
    steps_loss = []
    steps = []
    for loss in loss_values:
        steps_loss.append(loss.scalar.value)
        steps.append(loss.step)
    return steps, steps_loss

In [40]:
# Get Train and Eval Loss
train_loss = get_metrics(metric="/train_total_loss")
eval_loss = get_metrics(metric="/eval_total_loss")

In [41]:
# Plot the train and eval loss metrics using Plotly python library
fig = make_subplots(
    rows=1, cols=2, shared_xaxes=True, subplot_titles=("Train Loss", "Eval Loss")
)

# Add traces
fig.add_trace(
    go.Scatter(x=train_loss[0], y=train_loss[1], name="Train Loss", mode="lines"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=eval_loss[0], y=eval_loss[1], name="Eval Loss", mode="lines"),
    row=1,
    col=2,
)

# Add figure title
fig.update_layout(title="Train and Eval Loss", xaxis_title="Steps", yaxis_title="Loss")

# Set x-axis title
fig.update_xaxes(title_text="Steps")

# Set y-axes titles
fig.update_yaxes(title_text="Loss")

# Show plot
fig.show()

### Use and evaluate the fine-tuned model

In [42]:
prompt = """
Answer the question based on the context

Context: In the 1840s and 50s, there were attempts to overcome this problem by means of various patent valve gears with a separate, variable cutoff expansion valve riding on the back of the main slide valve; the latter usually had fixed or limited cutoff.
The combined setup gave a fair approximation of the ideal events, at the expense of increased friction and wear, and the mechanism tended to be complicated.
The usual compromise solution has been to provide lap by lengthening rubbing surfaces of the valve in such a way as to overlap the port on the admission side, with the effect that the exhaust side remains open for a longer period after cut-off on the admission side has occurred.
This expedient has since been generally considered satisfactory for most purposes and makes possible the use of the simpler Stephenson, Joy and Walschaerts motions.
Corliss, and later, poppet valve gears had separate admission and exhaust valves driven by trip mechanisms or cams profiled so as to give ideal events; most of these gears never succeeded outside of the stationary marketplace due to various other issues including leakage and more delicate mechanisms.

Question: How is lap provided by overlapping the admission side port?
"""

In [43]:
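# Point `base_model` at the tuned model endpoint. `get_predictions()` reads this
# global at call time, so the evaluation below uses the tuned model.
base_model = tuned_model_endpoint_name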

In [None]:
if sft_tuning_job.has_succeeded:
    tuned_genai_model = GenerativeModel(base_model)
    # Test with the loaded model.
    print("***Testing***")
    print(tuned_genai_model.generate_content(contents=prompt).text)
else:
    print("State:", sft_tuning_job.state)
    print("Error:", sft_tuning_job.error)

In [None]:
# Apply get_predictions() to the 'input_question' column; it now calls the tuned model.
test_df["predicted_answer"] = test_df["input_question"].apply(get_predictions)
test_df.head(2)

In [46]:
test_df["predicted_answer"] = test_df["predicted_answer"].apply(normalize_answer)

In [47]:
em, f1 = calculate_em_and_f1(test_df["answers"], test_df["predicted_answer"])
print(f"EM score: {em}")
print(f"F1 score: {f1}")

EM score: 0.6945812807881774
F1 score: 0.856248634208346
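
To put the two evaluations side by side, the arithmetic below compares the baseline scores with the tuned-model scores shown in this notebook. These numbers come from one run; your results will differ because the few-shot examples are sampled randomly and model outputs can vary.

```python
# EM and F1 scores copied from the baseline and tuned-model evaluation outputs above.
baseline_scores = {"EM": 0.5714285714285714, "F1": 0.8134526797963079}
tuned_scores = {"EM": 0.6945812807881774, "F1": 0.856248634208346}

gains = {}
for metric in ("EM", "F1"):
    absolute_gain = tuned_scores[metric] - baseline_scores[metric]
    relative_gain = absolute_gain / baseline_scores[metric]
    gains[metric] = (absolute_gain, relative_gain)
    print(
        f"{metric}: {baseline_scores[metric]:.3f} -> {tuned_scores[metric]:.3f} "
        f"({absolute_gain:+.3f} absolute, {relative_gain:+.1%} relative)"
    )
```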
