# Chapter 3 Exercise: Fine-Tuning a Question-Answer Model Using OpenAI



---
## 1: Install Required Modules:

The modules can be tricky to install without version conflicts, which is not uncommon for rapidly changing data science software. The commands below force the installation of particular versions and worked as of 2/3/24. After the installs finish, a small button that reads **'RESTART SESSION'** appears at the end of this block's output.

Click that button to update the notebook to these particular module versions.

In [None]:
!pip install opendatasets --quiet
!pip install --force-reinstall typing-extensions==4.5
!pip install --force-reinstall openai==1.8

---
## 2: Import Required Libraries:

In [None]:
import pandas as pd
import opendatasets as od
from openai import OpenAI
from datetime import datetime
import matplotlib.pyplot as plt

---
## 3: Load the data from Kaggle

- Details on the [Question-Answer dataset](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset) on Kaggle.
- To load the data from Kaggle, you'll need your Kaggle username and an API key.
- If you don't have a Kaggle account, you can sign up for one [here](https://www.kaggle.com/account/login?phase=startRegisterTab&returnUrl=%2F).
- Once you have an account, you can generate an API key on the [Settings](https://www.kaggle.com/settings) page for your account by selecting ***Create New Token***.
- Alternatively, you can ***download*** the data to your local machine from this page: [Question-Answer dataset](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset) and then ***upload*** it to Colab.

Our task in this notebook will be to fine-tune an existing OpenAI model using this dataset.

In [None]:
# you need your kaggle username and API key here.
qa_url = "https://www.kaggle.com/datasets/rtatman/questionanswer-dataset"
od.download(qa_url)

You can see that the questionanswer-dataset folder has been saved locally to your Colab instance.



In [None]:
!ls

In [None]:
!ls questionanswer-dataset/

---
## 4: Clean and Prepare the Dataset

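We'll start by creating three empty dataframes, one for each year in which students answered questions to produce this dataset.

In [None]:
# three separate empty dataframes (chained assignment to [] would alias one shared list)
df_08, df_09, df_10 = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
print(df_08)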

Load the data and take a look at what the dataset looks like, particularly the questions and answers.

In [None]:
df_08 = pd.read_csv('questionanswer-dataset/S08_question_answer_pairs.txt', sep='\t')
df_09 = pd.read_csv('questionanswer-dataset/S09_question_answer_pairs.txt', sep='\t')

# fix for df_10
df_10 = pd.read_csv('questionanswer-dataset/S10_question_answer_pairs.txt', sep='\t', encoding = 'ISO-8859-1')

In [None]:
df_all=pd.concat([df_08,df_09,df_10])
df_all.head()

In [None]:
df_all.tail()

### 4.1: Analyze the dataset:

- How many rows and columns are there?
- Is there bad data inside this dataset (null values, missing values, etc.)?
- How should we deal with bad rows?

Number of rows and columns:

In [None]:
# your code here

What are the fields and how complete are they?

In [None]:
# your code here
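As a hedged illustration of the kind of checks these questions call for, here is a minimal sketch on a small made-up frame. The `toy` data and its values are assumptions for illustration only, not rows from the Kaggle dataset; the same calls can be applied to `df_all`.

```python
import pandas as pd

# Hypothetical toy data standing in for df_all (not the real Kaggle rows).
toy = pd.DataFrame({
    "Question": ["Was Lincoln a lawyer?", "What do beetles eat?", None],
    "Answer": ["yes", None, "no"],
})

n_rows, n_cols = toy.shape            # (number of rows, number of columns)
nulls_per_column = toy.isna().sum()   # count of missing values in each column

print(f"{n_rows} rows, {n_cols} columns")
print(nulls_per_column)
```

`toy.info()` gives a similar summary in one call, listing each column with its non-null count and dtype.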

### 4.2: Clean up the data frame and eliminate duplicates and rows with nulls


In [None]:
df_qa = df_all[['Question', 'Answer']].copy()  # copy so later cleaning does not act on a view of df_all
df_qa.head()

Drop rows with ANY missing data and drop duplicate questions.

In [None]:
# your code here
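One possible shape for this step, sketched on made-up data rather than on `df_qa` itself. The frame `toy` and the name `cleaned` are assumptions for illustration; whether to keep the first or last duplicate is a judgment call, and this sketch keeps the first.

```python
import pandas as pd

# Hypothetical toy data with one duplicate question and two incomplete rows.
toy = pd.DataFrame({
    "Question": ["Q1", "Q1", "Q2", "Q3", None],
    "Answer": ["a", "a", None, "c", "d"],
})

cleaned = (
    toy.dropna(how="any")                    # drop rows with a missing value in any column
       .drop_duplicates(subset="Question")   # keep the first row for each repeated question
       .reset_index(drop=True)               # renumber the remaining rows from 0
)
print(cleaned)
```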

Check to see if we missed any NaNs

In [None]:
df_qa.isna().any()

In [None]:
df_qa.info()

In [None]:
df_qa.head()

### 4.3: Transform the cleaned dataframe into a format OpenAI uses for fine-tuning


Start by changing 'Question' and 'Answer' to 'prompt' and 'completion', respectively:

In [None]:
df=df_qa.rename(columns={"Question": "prompt", "Answer": "completion"})
df=df.dropna()

In [None]:
df.head()

### 4.4: Split the data into train and validation sets

In [None]:
from sklearn.model_selection import train_test_split
# your code here, set the values of df_train and df_val
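To show the idea of a train/validation split without giving away the exercise, here is a pandas-only sketch on made-up data. It is a swapped-in technique, not the `train_test_split` call the cell above asks for; the scikit-learn equivalent would be along the lines of `train_test_split(df, test_size=0.2, random_state=42)`. The 80/20 ratio, the seed, and the `toy_*` names are assumptions.

```python
import pandas as pd

# Hypothetical toy data in the prompt/completion layout used above.
toy = pd.DataFrame({
    "prompt": [f"question {i}" for i in range(10)],
    "completion": [f"answer {i}" for i in range(10)],
})

# Assumed 80/20 split; the fixed random_state only makes the shuffle repeatable.
toy_train = toy.sample(frac=0.8, random_state=42)
toy_val = toy.drop(toy_train.index)   # the rows that were not chosen for training

print(len(toy_train), len(toy_val))
```

Whichever method you use, the two sets should not share rows, so that validation metrics reflect examples the model was not trained on.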

In [None]:
df_train.info()

In [None]:
df_val.info()

### 4.5: Convert the dataset to JSON Lines format

In [None]:
df_train.to_json("qadatasetTrain.jsonl", orient='records', lines=True)
df_val.to_json("qadatasetval.jsonl", orient='records', lines=True)
df.to_json("qadataset.jsonl", orient='records', lines=True)
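With `orient='records'` and `lines=True`, each line of the output file is one standalone JSON object with a `prompt` key and a `completion` key. The sketch below writes and re-reads a tiny file in that layout using only the standard library; the records and the temporary file name are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records in the same prompt/completion layout as the files above.
records = [
    {"prompt": "Was Lincoln a lawyer?", "completion": "yes"},
    {"prompt": "Example question?", "completion": "example answer"},
]

# Write one JSON object per line (the JSON Lines layout).
path = os.path.join(tempfile.mkdtemp(), "toy.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back and confirm every line parses and has exactly the expected keys.
with open(path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

all_valid = all(set(rec) == {"prompt", "completion"} for rec in parsed)
print(len(parsed), "records; all have prompt/completion keys:", all_valid)
```

Running the same read-back check on `qadatasetTrain.jsonl` before uploading is a cheap way to catch a malformed file early.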

### 4.6 Create the OpenAI client and upload your data to OpenAI

Replace `<your key here>` with your OpenAI API key:

In [None]:
api_key = "<your key here>"

Create your OpenAI client

In [None]:
# your code here

Make sure your files are in your Colab working directory:

In [None]:
!ls

Upload your datafiles to OpenAI

In [None]:
file_name_ls = ["qadataset.jsonl",
               "qadatasetval.jsonl",
               "qadatasetTrain.jsonl"]

upload_response={}

for file_name in file_name_ls:
  # open each file in binary mode and close it once the upload finishes
  with open(file_name, "rb") as f:
    upload_response[file_name] = client.files.create(file=f, purpose='fine-tune')

`upload_response` is a dictionary we create to hold the objects OpenAI returns for each uploaded file. The most important part of each object is the file ID that OpenAI assigns to its copy of your data file.

In [None]:
upload_response

To retrieve one of these file IDs, use the JSONL file name as the dictionary key and append `.id`:



In [None]:
upload_response['qadatasetTrain.jsonl'].id

---
## 5: Fine-Tune OpenAI's 'base' davinci-002 model

*OpenAI's models are changing pretty rapidly, so don't be surprised if you need to change the name of this model to a successor model if davinci-002 is deprecated. The error message will point you towards options for the successor model to use.*

### 5.1: Kick off a fine-tuning job.

- The `fine_tuning.jobs.create()` method of `client` is the key here.
- On my base version of Colab, this job took 12-15 minutes to run.

- **IF YOU CAN'T WAIT THAT LONG FOR A MODEL TO TRAIN, SKIP TO SECTION 6 of this notebook below "6: Examine the Fine-Tuned Model's Performance". IF YOU UNCOMMENT THE INDICATED LINE IN THAT SECTION'S FIRST CELL, YOU CAN LOAD A PREVIOUSLY TRAINED MODEL INSTEAD OF WAITING FOR YOUR NEW ONE TO RUN.**



In [None]:
train_file_id = upload_response['qadatasetTrain.jsonl'].id
val_file_id = upload_response['qadatasetval.jsonl'].id

# your code here

dict(fine_tune_response)
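For orientation, the sketch below shows the general shape of a call to `fine_tuning.jobs.create()` with a training file, a validation file, and a model name. The wrapper `start_fine_tune` is a hypothetical helper, and `fake_client` is a stand-in that only echoes its arguments so the example runs offline without an API key; in the notebook you would pass the real `client` and the file IDs from `upload_response`.

```python
from types import SimpleNamespace

def start_fine_tune(client, train_file_id, val_file_id, model="davinci-002"):
    """Hypothetical helper: submit a fine-tuning job and return the job object."""
    return client.fine_tuning.jobs.create(
        training_file=train_file_id,     # ID of the uploaded training file
        validation_file=val_file_id,     # ID of the uploaded validation file
        model=model,                     # name of the base model to fine-tune
    )

# Stand-in for the real client (assumption for this sketch): echoes its arguments.
def _fake_create(**kwargs):
    return SimpleNamespace(id="ftjob-example", status="validating_files", **kwargs)

fake_client = SimpleNamespace(
    fine_tuning=SimpleNamespace(jobs=SimpleNamespace(create=_fake_create))
)

job = start_fine_tune(fake_client, "file-train", "file-val")
print(job.id, job.model, job.training_file)
```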

### 5.2 Checking on the status of your job

You can use this code to check the status of your job:

In [None]:
job_id=fine_tune_response.id

print(f'job status is: {client.fine_tuning.jobs.retrieve(job_id).status} \n')

events = client.fine_tuning.jobs.list_events(job_id)

ts = events.data[0].created_at
dt = datetime.fromtimestamp(ts)

print(f"Most recent event: {events.data[0].message}")
print(f"Occurred at: {dt}")
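If you would rather not re-run the status cell by hand, one option is a small polling loop. The sketch below is an illustration under stated assumptions: `wait_for_job` is a hypothetical helper, the terminal status names are the ones the API reported at the time of writing, and `fake_client` is an offline stand-in that steps through a made-up status sequence. In the notebook you would pass the real `client` and `job_id`.

```python
import time
from types import SimpleNamespace

# Statuses after which the job will not change again (assumed set).
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(client, job_id, poll_seconds=30, max_polls=60):
    """Hypothetical helper: poll until the job reaches a terminal state or polls run out."""
    status = None
    for _ in range(max_polls):
        status = client.fine_tuning.jobs.retrieve(job_id).status
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)   # wait before asking again
    return status

# Offline stand-in: each retrieve() call returns the next made-up status.
_statuses = iter(["validating_files", "running", "succeeded"])
fake_client = SimpleNamespace(
    fine_tuning=SimpleNamespace(
        jobs=SimpleNamespace(
            retrieve=lambda job_id: SimpleNamespace(status=next(_statuses))
        )
    )
)

final_status = wait_for_job(fake_client, "ftjob-example", poll_seconds=0)
print(final_status)
```

The `max_polls` cap keeps the loop from running forever if the job stalls; with the defaults it gives up after about 30 minutes.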


You can use this code to see each step in the status of your job, with the most recent event first and the first event listed last.

In [None]:
import signal

def signal_handler(sig, frame):
    # report the current job status if the cell is interrupted
    status = client.fine_tuning.jobs.retrieve(job_id).status
    print(f"Stream interrupted. Job is still {status}.")

print(f'Streaming events for the fine-tuning job: {job_id}\n')
signal.signal(signal.SIGINT, signal_handler)

events = client.fine_tuning.jobs.list_events(job_id)

try:
    for event in events.data:
        print(f'{datetime.fromtimestamp(event.created_at)} {event.message}')
except Exception:
    print("Stream interrupted (client disconnected).")


### 5.3: Retrieving Prior Fine-Tuning Jobs

Re-running fine-tuning jobs when you already have working ones (say, because your Colab session timed out) can be expensive and time-consuming, especially if you're working with real, rather than educational, datasets.

There are a few ways to retrieve your prior jobs:
- From the [Fine-tuning](https://platform.openai.com/finetune) dashboard on OpenAI's website
- Programmatically, which we'll show below.

The call below returns a list of all of your recent fine-tuning jobs. We'll comment it out because its output can get long, but you're free to uncomment it and run it.

In [None]:
#client.fine_tuning.jobs.list().data

If you want to get the most recently completed fine-tuning job, append `[0]` to the call above to get the first item in the list (or `[n]` to get the (n+1)th job).

In [None]:
ftjob_example = client.fine_tuning.jobs.list().data[0]
ftjob_example

The key pieces of information you're likely to want to extract are:
- The fine-tuning job ID
- The name of the fine-tuned model associated with the job
- The result file(s), which provides performance measures at each step for the job as the model is fine-tuned

The next three cells show the commands you'd use to retrieve each:

In [None]:
ftjob_example.id

In [None]:
ftjob_example.fine_tuned_model

In [None]:
#ftjob_example.result_files[0]

This will provide you with some of the key pieces of information for the jobs you've run:

In [None]:
# print up to the last 5 fine-tuning jobs in reverse order of completion
# (most recent first)
max_jobs = 5

for i, ft_job in enumerate(client.fine_tuning.jobs.list().data):
  if i == max_jobs:
    break
  try:
    print(f"Fine-tuning job completed: {datetime.fromtimestamp(ft_job.finished_at)}") # completed at
    print(f"Fine-tuning job ID: {ft_job.id}") # job
    print(f"Fine-tuned model: {ft_job.fine_tuned_model} \n")
  except TypeError:
    # finished_at is None until the job finishes, so fromtimestamp raises TypeError
    print("Job hasn't completed (and thus no complete date & time)")


---
## 6: Examine the Fine-Tuned Model's Performance

Note: *If you run this before the job you kicked off above completes, you'll get your last completed job rather than the new job.*

In [None]:
# get the last completed job
ft_job_new = client.fine_tuning.jobs.list().data[0]

# IF YOU CAN'T WAIT FOR THE JOB TO RUN, YOU CAN LOAD AN ALREADY TRAINED MODEL BY UNCOMMENTING THE LINE BELOW
#ft_job_new = client.fine_tuning.jobs.retrieve('ftjob-oqjKe4BrjUfJHiINkWtDGwQw')

Get the performance results for the last job and save those results locally as a CSV file.

In [None]:
response_file = ft_job_new.result_files[0]
print(f"Name of response_file: {response_file}")

## dynamic response file ID
resultsdv=client.files.content(response_file)
resultsdv.write_to_file("compiled_results.csv")


In [None]:
!ls

### 6.1 Reading the Performance Results into a Dataframe

In [None]:
# Evaluation metrics for the fine-tuned model
from io import StringIO

# wrap the downloaded CSV text in a file-like object so pandas can read it
TESTDATA = StringIO(resultsdv.text)
df = pd.read_csv(TESTDATA, sep=",")


In [None]:
df

### 6.2: A Quick Visualization of Model Performance

We want to see a general decrease in loss and a general increase in accuracy.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(df.train_loss)
axes[0].set_title("Model Training Loss")
axes[0].set_xlabel("steps (# of batches)")
axes[1].plot(df.train_accuracy)
axes[1].set_title("Model Training Accuracy")
axes[1].set_xlabel("steps (# of batches)");  # trailing semicolon suppresses the cell's text output
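Because per-step loss is noisy, the general trend can be easier to see after smoothing. The sketch below applies a rolling mean to a made-up loss series; the values and the window size of 3 are assumptions for illustration, not results from a real job.

```python
import pandas as pd

# Made-up noisy loss values, for illustration only.
toy_loss = pd.Series([2.0, 1.0, 1.5, 0.5, 1.0, 0.0])

# Mean over each window of 3 consecutive steps; the first 2 entries are NaN
# because a full window is not yet available.
smoothed = toy_loss.rolling(window=3).mean()
print(smoothed.tolist())
```

On the real results you could plot something like `df.train_loss.rolling(window=10).mean()` alongside the raw curve, choosing a window that suits the number of steps.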

---
## 7: Evaluation Metrics for Fine-Tuned Models

- **elapsed_examples**: the number of examples the model has seen so far (including repeats), where one example is one element in your batch. For example, if batch_size = 4, each step will increase elapsed_examples by 4.


- **elapsed_tokens**: the number of tokens the model has seen so far (including repeats).

- **training_loss**: loss on the training batch

- **training_sequence_accuracy**: the percentage of completions in the training batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67


- **training_token_accuracy**: the percentage of tokens in the training batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83

Small batch sizes are one reason the loss can be quite variable from batch to batch even as overall loss measures improve.
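The two accuracy examples above can be reproduced in a few lines of plain Python. This is only a sketch of the arithmetic described in the definitions, not OpenAI's implementation; the variable names are made up.

```python
# The completions and predictions from the examples above.
true_completions = [[1, 2], [0, 5], [4, 2]]
predicted = [[1, 1], [0, 5], [4, 2]]

# Sequence accuracy: fraction of completions predicted exactly.
exact_matches = sum(t == p for t, p in zip(true_completions, predicted))
sequence_accuracy = exact_matches / len(true_completions)

# Token accuracy: fraction of individual tokens predicted correctly.
token_pairs = [
    (t, p)
    for ts, ps in zip(true_completions, predicted)
    for t, p in zip(ts, ps)
]
token_accuracy = sum(t == p for t, p in token_pairs) / len(token_pairs)

print(round(sequence_accuracy, 2), round(token_accuracy, 2))  # 0.67 0.83
```

One wrong token costs a whole completion under sequence accuracy but only one sixth of the total under token accuracy, which is why the two numbers differ.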

---
## 8: Compare the Fine-Tuned Models' Answers to the Base Models'

When are the fine-tuned model's answers superior to the base model's answers, and vice versa?

In [None]:
# the model you fine-tuned above:
new_model = ft_job_new.fine_tuned_model
print(new_model)


The prompt (Question) and completion (Answer) from the model you fine-tuned.

In [None]:
prompt ="When did Lincoln begin his political career? ->"

result = client.completions.create(model=new_model, prompt=prompt,
                                  max_tokens=30, temperature=0,
                                  top_p=1, n=1, stop=['.','\n'])
result.choices[0].text


The same for the 'base' model that is not fine-tuned:

In [None]:
prompt ="When did Lincoln begin his political career?\nAnswer"
#prompt ="When did Lincoln begin his political career? ->"
base_model = "davinci-002"

result = client.completions.create(model=base_model, prompt=prompt,
                                  max_tokens=30, temperature=0,
                                  top_p=1, n=1, stop=['.','\n'])

result.choices[0].text

Write a function that makes comparing the results between the base and new model a little more concise.

- It should take new_model, base_model, and your prompt and print:
  - The prompt
  - The name of the new model
  - The response

  and the same for the base model:

In [None]:
### STUDENT CODE BELOW

def newbasecompare(new_model, base_model, prompt):
    # your code here
    pass


We'll use our function to look at a few more examples.

In [None]:
prompt ="Who was the first to perform and publish careful experiments aiming at the definition of an international temperature scale on scientific grounds?"
base_model = "davinci-002"
newbasecompare(new_model, base_model, prompt)

In [None]:
prompt="What do beetles eat?"
base_model = "davinci-002"
newbasecompare(new_model, base_model, prompt)

In [None]:
prompt="What are the similarities between beetles and grasshoppers?"
base_model = "davinci-002"
newbasecompare(new_model, base_model, prompt)

Lastly, we'll switch the base model from davinci-002 to gpt-3.5-turbo-instruct, which is more advanced (and quite a bit more expensive).

In [None]:
prompt ="Who was the first to perform and publish careful experiments aiming at the definition of an international temperature scale on scientific grounds?"
base_model = "gpt-3.5-turbo-instruct"
newbasecompare(new_model, base_model, prompt)

In [None]:
prompt="What do beetles eat?"
base_model = "gpt-3.5-turbo-instruct"
newbasecompare(new_model, base_model, prompt)

In [None]:
prompt="What are the similarities between beetles and grasshoppers?"
base_model = "gpt-3.5-turbo-instruct"
newbasecompare(new_model, base_model, prompt)

In [None]:
prompt="When did Lincoln begin his political career?"
base_model = "gpt-3.5-turbo-instruct"
newbasecompare(new_model, base_model, prompt)