The following table shows the recommended configurations for tuning a foundation model by task:

| Task           | Number of examples in dataset | Number of epochs |
|----------------|-------------------------------|------------------|
| Classification | 500+                          | 2-4              |
| Summarization  | 1000+                         | 2-4              |
| Extractive QA  | 500+                          | 2-4              |
| Chat           | 1000+                         | 2-4              |
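
As a quick sanity check, the table can be encoded as a lookup and compared against a planned run. This is only a sketch: the `RECOMMENDED` dictionary and the `check_tuning_plan` helper are hypothetical names introduced for illustration, and the numbers are copied from the table rather than from any SDK.

```python
from typing import List

# Values copied from the table above. The layout and names are illustrative
# assumptions, not part of the Vertex AI SDK.
RECOMMENDED = {
    "classification": {"min_examples": 500, "epochs": (2, 4)},
    "summarization": {"min_examples": 1000, "epochs": (2, 4)},
    "extractive_qa": {"min_examples": 500, "epochs": (2, 4)},
    "chat": {"min_examples": 1000, "epochs": (2, 4)},
}


def check_tuning_plan(task: str, num_examples: int, epochs: int) -> List[str]:
    """Return warnings for a planned run; an empty list means it matches the table."""
    rec = RECOMMENDED[task]
    warnings = []
    if num_examples < rec["min_examples"]:
        warnings.append(
            f"{task}: {num_examples} examples is below the recommended "
            f"{rec['min_examples']}+"
        )
    low, high = rec["epochs"]
    if not low <= epochs <= high:
        warnings.append(
            f"{task}: {epochs} epochs is outside the recommended {low}-{high}"
        )
    return warnings


for warning in check_tuning_plan("chat", num_examples=480, epochs=3):
    print(warning)
```

The recommendations are starting points, so a warning here is a prompt to look again at the dataset, not a hard failure.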

# **1. Setup**

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [2]:
from google.auth import default
CREDENTIALS, PROJECT_ID= default()

In [3]:
PROJECT_ID= "basic-pipeline-435315"
REGION= "us-central1"

BUCKET_URI= "gs://starborn-1/"
! gsutil ls -al $BUCKET_URI

                                 gs://starborn-1/Stackoverflow_dataset/


In [4]:
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.tuning import sft
vertexai.init(project=PROJECT_ID, location=REGION)

from google.cloud import bigquery
bq_client= bigquery.Client(project=PROJECT_ID, credentials=CREDENTIALS)

import pandas as pd
from sklearn.model_selection import train_test_split

# **2. Supervised fine tuning with Gemini on a question and answer dataset**

How can we ensure that our data is primed for success with supervised tuning?

Here's a breakdown of critical areas to focus on:

- Domain Alignment: Supervised fine-tuning thrives on smaller datasets, but they must be highly relevant to our downstream task. Seek out data that closely mirrors the domain we will encounter in real-world use cases.
- Labeling Accuracy: Noisy labels will sabotage even the best technique. Prioritize accuracy in our annotations and labeling.
- Noise Reduction: Outliers, inconsistencies, or irrelevant examples hurt model adaptation. Implement preprocessing, such as removing duplicates, fixing typos, and verifying that data conforms to our task's expectations.
- Distribution: A diverse range of examples will help our model generalize better within the confines of our target task. Refrain from overloading the process with excessive variance that strays from our core domain.
- Balanced Classes: For classification tasks, try to keep a reasonable balance between different classes to avoid the model learning biases towards a specific class.
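
The noise-reduction point above could be approached with a small cleaning pass before the data is split. The following is a minimal sketch using only the standard library. The helper names `clean_text` and `deduplicate` are hypothetical, and stripping HTML tags is an assumption about what counts as noise for this task. It is not a step this notebook performs, and it would also remove `<code>` markup that may be worth keeping.

```python
import html
import re


def clean_text(text: str) -> str:
    """Strip HTML tags, unescape entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()


def deduplicate(examples):
    """Keep the first example for each distinct cleaned, lower-cased input text.

    Examples whose input is empty after cleaning are dropped as well.
    """
    seen = set()
    unique = []
    for example in examples:
        key = clean_text(example["input_text"]).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


rows = [
    {"input_text": "<p>How do I sort a list?</p>"},
    {"input_text": "how do i  sort a list?"},  # duplicate after cleaning
    {"input_text": "   "},                      # empty after cleaning
]
print(len(deduplicate(rows)))
```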

In [6]:
from typing import Union
from pprint import pprint



---


**1. Fetching data from BigQuery**

Our model tuning dataset must be in a JSONL format where each line contains a single training example. We must make sure that we include instructions.


---



In [20]:
def run_bq_query(sql: str) -> pd.DataFrame:
  """
  Runs a BigQuery query and returns the results as a DataFrame.
  Args:
    sql: SQL query, as a string, to execute in BigQuery.
  Returns:
    df: DataFrame of the query results. An invalid query raises an exception.
  """
  # Dry run to check if any errors
  job_config= bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
  bq_client.query(sql, job_config=job_config)

  # If dry run succeeds, proceed
  job_config= bigquery.QueryJobConfig()
  client_result= bq_client.query(sql, job_config=job_config)

  job_id= client_result.job_id

  df= client_result.result().to_arrow().to_pandas()
  print("Finished job_id: ", job_id)

  return df

In [30]:
stack_overflow_df= run_bq_query(sql=
    """
    SELECT
      CONCAT (que.title, que.body) AS input_text,
      ans.body AS output_text
    FROM `bigquery-public-data.stackoverflow.posts_questions` que
    JOIN `bigquery-public-data.stackoverflow.posts_answers` ans
      ON que.accepted_answer_id= ans.id
    WHERE
      que.accepted_answer_id IS NOT NULL
      AND REGEXP_CONTAINS(que.tags, "python")
      AND ans.creation_date >= "2020-01-01"
    LIMIT 600
    """
)

Finished job_id:  4a2c8545-8bd0-4e62-bf00-7db3b26bb826


In [31]:
print(len(stack_overflow_df))

600




---


**2. Adding instructions**

Fine-tuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks.


---



In [32]:
INSTRUCTION_TEMPLATE="""
You are a helpful Python developer \
You are good at answering StackOverflow questions \
Your mission is to provide developers with helpful answers that work
"""

In [33]:
stack_overflow_df["input_text_instruct"]= INSTRUCTION_TEMPLATE
stack_overflow_df.head(2)

Unnamed: 0,input_text,output_text,input_text_instruct
0,AttributeError with extended class in Python<p...,<p>I do not know much about the <code>derivati...,\nYou are a helpful Python developer You are g...
1,Annotation bug coercing string to float<p>I've...,<p>you need to comment out the <code>y=[...]</...,\nYou are a helpful Python developer You are g...


In [34]:
train, evaluation = train_test_split(stack_overflow_df, test_size=0.2, random_state=42)

print(len(train))
print(len(evaluation))

480
120




---


**3. Generating the JSONL files**

Prepare your training data in a JSONL (JSON Lines) file and store it in a Google Cloud Storage (GCS) bucket. This format ensures efficient processing. Each line of the JSONL file must represent a single data instance and follow a well-defined schema:

{"messages": [{"role": "system", "content": "instructions"}, {"role": "user", "content": "question"}, {"role": "model", "content": "answering"}]}
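
Before uploading, it can help to check each line against this schema locally. The sketch below is a hypothetical helper (`validate_jsonl_line` is not part of any SDK) and it only checks what the schema line above shows: a non-empty `messages` list whose entries have a `role` of `system`, `user`, or `model` and non-empty string `content`. The tuning service may enforce further rules that are not covered here.

```python
import json

# Roles taken from the schema line above. Treating these as the only valid
# roles is an assumption made for this sketch.
ALLOWED_ROLES = {"system", "user", "model"}


def validate_jsonl_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unexpected role {role!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: empty or non-string content")
    return problems


good_line = json.dumps({"messages": [
    {"role": "system", "content": "instructions"},
    {"role": "user", "content": "question"},
    {"role": "model", "content": "answer"},
]})
print(validate_jsonl_line(good_line))
```

Running this over every line of the generated files and printing the line number of any non-empty result is a cheap way to catch formatting problems before a tuning job is submitted.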


---



In [27]:
import datetime

In [36]:
date= datetime.datetime.now().strftime("%m%d%Y")

tuning_data_filename= f"tuning_data_{date}.jsonl"
validation_data_filename= f"validation_data_{date}.jsonl"

In [35]:
def format_messages(row):
  """
  Formats each row into the desired JSONL structure.
  """
  return {
      "messages": [
          {"role": "system", "content": row["input_text_instruct"]},
          {"role": "user", "content": row["input_text"]},
          {"role": "model", "content": row["output_text"]},
      ]
  }

tuning_data= train.apply(format_messages, axis=1).to_json(orient="records", lines=True)
validation_data= evaluation.apply(format_messages, axis=1).to_json(orient="records", lines=True)

In [37]:
with open(tuning_data_filename, "w") as f:
  f.write(tuning_data)

with open(validation_data_filename, "w") as f:
  f.write(validation_data)

In [38]:
with open(tuning_data_filename, "r") as f:
  num_rows= sum(1 for _ in f)

print("Training rows: ", num_rows)

Training rows:  480


In [39]:
!gsutil cp $tuning_data_filename $validation_data_filename $BUCKET_URI

Copying file://tuning_data_09152024.jsonl [Content-Type=application/octet-stream]...
Copying file://validation_data_09152024.jsonl [Content-Type=application/octet-stream]...
- [2 files][  1.8 MiB/  1.8 MiB]                                                
Operation completed over 2 objects/1.8 MiB.                                      


In [40]:
!gsutil ls -al $BUCKET_URI

   1530367  2024-09-15T10:26:43Z  gs://starborn-1/tuning_data_09152024.jsonl#1726396002957443  metageneration=1
    364262  2024-09-15T10:26:43Z  gs://starborn-1/validation_data_09152024.jsonl#1726396003232003  metageneration=1
                                 gs://starborn-1/Stackoverflow_dataset/
TOTAL: 2 objects, 1894629 bytes (1.81 MiB)


In [51]:
TRAINING_DATA_URI= f"{BUCKET_URI}{tuning_data_filename}"
VALIDATION_DATA_URI= f"{BUCKET_URI}{validation_data_filename}"

In [56]:
TRAINING_DATA_URI

'gs://starborn-1/tuning_data_09152024.jsonl'



---


**4. Creating a tuning job using Gemini**


---



In [57]:
foundational_model= GenerativeModel("gemini-1.0-pro-002")

In [58]:
sft_trainer= sft.train(
    source_model=foundational_model,
    train_dataset=TRAINING_DATA_URI,
    validation_dataset=VALIDATION_DATA_URI,
    epochs=3,
    learning_rate_multiplier=1.0,
)

sft_trainer.to_dict()

INFO:vertexai.tuning._tuning:Creating SupervisedTuningJob
INFO:vertexai.tuning._tuning:SupervisedTuningJob created. Resource name: projects/52865938246/locations/us-central1/tuningJobs/5218904612085432320
INFO:vertexai.tuning._tuning:To use this SupervisedTuningJob in another session:
INFO:vertexai.tuning._tuning:tuning_job = sft.SupervisedTuningJob('projects/52865938246/locations/us-central1/tuningJobs/5218904612085432320')
INFO:vertexai.tuning._tuning:View Tuning Job:
https://console.cloud.google.com/vertex-ai/generative/language/locations/us-central1/tuning/tuningJob/5218904612085432320?project=52865938246


{'name': 'projects/52865938246/locations/us-central1/tuningJobs/5218904612085432320',
 'tunedModelDisplayName': 'SupervisedTuningJob 2024-09-15 11:02:13.235665',
 'baseModel': 'gemini-1.0-pro-002',
 'supervisedTuningSpec': {'trainingDatasetUri': 'gs://starborn-1/tuning_data_09152024.jsonl',
  'validationDatasetUri': 'gs://starborn-1/validation_data_09152024.jsonl',
  'hyperParameters': {'epochCount': '3', 'learningRateMultiplier': 1.0}},
 'state': 'JOB_STATE_PENDING',
 'createTime': '2024-09-15T11:02:14.090890Z',
 'updateTime': '2024-09-15T11:02:14.090890Z'}

In [61]:
# Retrieve the tuning job's resource name
model_resource_name= sft_trainer.resource_name
model_resource_name

'projects/52865938246/locations/us-central1/tuningJobs/5218904612085432320'

Continue with the next cell once tuning is complete.

In [62]:
import time

In [63]:
%%time
# Wait for job completion
while not sft_trainer.refresh().has_ended:
    time.sleep(60)

CPU times: user 76.6 ms, sys: 5.98 ms, total: 82.6 ms
Wall time: 1.61 s
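
The loop above waits indefinitely. One possible variation, sketched here with only the standard library, adds a timeout so that a stuck job does not block the notebook forever. The `wait_until` helper and its parameters are hypothetical names introduced for this example, and the four-hour default is an arbitrary assumption.

```python
import time


def wait_until(is_done, poll_seconds=60.0, timeout_seconds=4 * 60 * 60,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or the timeout elapses.

    Returns True if is_done() reported completion, False on timeout.
    The sleep and clock arguments exist so the helper can be tested quickly.
    """
    deadline = clock() + timeout_seconds
    while not is_done():
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True


# With the tuning job from this notebook, the call might look like:
#   finished = wait_until(lambda: sft_trainer.refresh().has_ended)

# Stand-in demonstration that needs no cloud resources:
states = iter([False, False, True])
print(wait_until(lambda: next(states), poll_seconds=0, timeout_seconds=5))
```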


In [64]:
sft_trainer.list()

[<vertexai.tuning._supervised_tuning.SupervisedTuningJob object at 0x7faf34fecc70> 
 resource name: projects/52865938246/locations/us-central1/tuningJobs/5218904612085432320,
 <vertexai.tuning._supervised_tuning.SupervisedTuningJob object at 0x7faf3600f8b0> 
 resource name: projects/52865938246/locations/us-central1/tuningJobs/6129757636721115136]

Our model is automatically deployed as a Vertex AI Endpoint and ready for usage!

In [66]:
tuned_model_endpoint_name = sft_trainer.tuned_model_endpoint_name
tuned_model_endpoint_name

'projects/52865938246/locations/us-central1/endpoints/5273927369395011584'

In [67]:
tuned_model = GenerativeModel(tuned_model_endpoint_name)
print(tuned_model)

<vertexai.generative_models.GenerativeModel object at 0x7faf34fef640>




---


**5. Calling the API**


---



In [69]:
output= tuned_model.generate_content(
    "What are the best resources to learn Google cloud platform?"
)

In [74]:
pprint(output.text)

('## Best Resources for Learning Google Cloud Platform:\n'
 '\n'
 '**Official Resources:**\n'
 '\n'
 '* **Google Cloud Platform Documentation:** Comprehensive and up-to-date '
 'documentation, including tutorials, code samples, and best practices. '
 'https://cloud.google.com/docs/\n'
 '* **Google Cloud Platform Blog:** News, announcements, and insights about '
 'Google Cloud Platform. https://cloud.google.com/blog/\n'
 '* **Google Cloud Platform Community Forums:** A great place to ask questions '
 'and get help from Google experts and other users. '
 'https://cloud.google.com/community/forums/\n'
 '* **Google Cloud Platform Learning Tracks:** Curated learning paths to help '
 'you get started with specific Google Cloud services. '
 'https://cloud.google.com/learning/learn-cloud-gcp/\n'
 '\n'
 '**Free Courses:**\n'
 '\n'
 '* **Cloud OnAir:** Google hosts regular events featuring demos, product deep '
 'dives, and expert sessions. https://cloudonair.withgoogle.com/\n'
 '* **Coursera:**