# Fine-Tuning OpenAI Models

Copyright 2024 Denis Rothman

[OpenAI fine-tuning documentation](https://beta.openai.com/docs/guides/fine-tuning/)

Check the cost of fine-tuning your dataset on OpenAI before running the notebook.
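As a rough aid for that cost check, the sketch below estimates the training cost from the text length. It is only an approximation under stated assumptions: the helper `estimate_fine_tuning_cost` is hypothetical, the 4-characters-per-token ratio is a rule of thumb for English text, and the price passed in is a placeholder that you should replace with the current figure from the OpenAI pricing page.

```python
def estimate_fine_tuning_cost(texts, n_epochs, usd_per_1k_tokens, chars_per_token=4.0):
    """Rough estimate: tokens ~= characters / chars_per_token (assumed ratio)."""
    total_chars = sum(len(t) for t in texts)
    estimated_tokens = total_chars / chars_per_token
    # Each epoch passes over the whole dataset once
    trained_tokens = estimated_tokens * n_epochs
    return trained_tokens / 1000 * usd_per_1k_tokens

# Hypothetical sample records and a placeholder price (not a real quote)
sample_texts = [
    "What is the least dangerous radioactive decay? ->",
    " alpha decay because ...\n",
]
cost = estimate_fine_tuning_cost(sample_texts, n_epochs=5, usd_per_1k_tokens=0.01)
print(f"Estimated cost: ${cost:.6f}")
```

For a more precise figure, a real tokenizer would replace the character heuristic.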

Run this notebook cell by cell to:

1. Download and prepare the data   
2. Fine-tune a model   
3. Run a fine-tuned model   
4. View the metrics


# Installing the environment


In [1]:
# You can retrieve your API key from a file (1)
# or enter it manually (2).
# Comment out this cell if you want to enter your key manually.
# (1) Retrieve the API key from a file
# Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anyone next to you)
from google.colab import drive
drive.mount('/content/drive')
# strip() removes the trailing newline that readline() keeps
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()

Mounted at /content/drive


In [2]:
try:
  import openai
except ImportError:
  # Install the pinned version only if the package is missing
  !pip install openai==1.35.2
  import openai

Collecting openai==1.35.2
  Downloading openai-1.35.2-py3-none-any.whl (327 kB)
Collecting httpx<1,>=0.23.0 (from openai==1.35.2)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai==1.35.2)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai==1.

In [3]:
# (2) Enter your key manually by
# replacing API_KEY with your key.
# The OpenAI key
import os
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

In [4]:
!pip install datasets==2.20.0

Collecting datasets==2.20.0
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets==2.20.0)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets==2.20.0)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2 (from datasets==2.20.0)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets==2.20.0)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2

Listing the installed packages

In [5]:
import subprocess

# Run pip list and capture the output
result = subprocess.run(['pip', 'list'], stdout=subprocess.PIPE, text=True)

# Split the output into lines and count them
package_list = result.stdout.split('\n')

# Adjust count for headers or empty lines
package_count = len([line for line in package_list if line.strip() != '']) - 2

print(f"Number of installed packages: {package_count}")

Number of installed packages: 498


In [6]:
import subprocess

# Run pip list and capture the output
result = subprocess.run(['pip', 'list'], stdout=subprocess.PIPE, text=True)

# Print the output
print(result.stdout)

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
aiohttp                          3.9.5
aiosignal                        1.3.1
alabaster                        0.7.16
albumentations                   1.3.1
altair                           4.2.2
annotated-types                  0.7.0
anyio                            3.7.1
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array_record                     0.5.1
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.1.0
attrs                            23.2.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.15.0
backcall                         0.2.0
beautifulsoup4                   4.12.3
bidict                           0.23.1

counting the number of packages

# 1.Preparing the dataset for fine-tuning

## 1.1.Downloading and preparing the dataset

### Preparing and displaying the dataset

In [7]:
# Import required libraries
from datasets import load_dataset
import pandas as pd

# Load the SciQ dataset from HuggingFace
dataset = load_dataset("sciq", split="train")

# Filter the dataset to include only questions with support and correct answer
filtered_dataset = dataset.filter(lambda x: x["support"] != "" and x["correct_answer"] != "")


# Print the number of questions with support
print("Number of questions with support: ", len(filtered_dataset))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Number of questions with support:  10481


In [8]:
# Convert the filtered dataset to a pandas DataFrame
df = pd.DataFrame(filtered_dataset)

# Columns to drop
columns_to_drop = ['distractor3', 'distractor1', 'distractor2']

# Dropping the columns from the DataFrame
df = df.drop(columns=columns_to_drop)

# Display the DataFrame
df.head()

| | question | correct_answer | support |
|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | mesophilic organisms | Mesophiles grow best in moderate temperature, ... |
| 1 | What phenomenon makes global winds blow northe... | coriolis effect | Without Coriolis Effect the global winds would... |
| 2 | Changes from a less-ordered state to a more-or... | exothermic | Summary Changes of state are examples of phase... |
| 3 | What is the least dangerous radioactive decay? | alpha decay | All radioactive decay is dangerous to living t... |
| 4 | Kilauea in hawaii is the world’s most continuo... | smoke and ash | Example 3.5 Calculating Projectile Motion: Hot... |


### Streaming the output to JSON


In [9]:
import json
import pandas as pd

# Assuming 'df' is your DataFrame and it has the required columns
# If 'df' is not defined, you might load it or define it here. For example:
# df = pd.DataFrame(data)

# Open a file to write JSON data
with open('QA_prompts_and_completions.json', 'w') as f:
    # Define the separator and ending
    prompt_separator = " ->"
    completion_ending = "\n"

    # Iterate over DataFrame rows
    for i, row in df.iterrows():
        # Constructing the prompt using the 'question' column
        prompt = row['question'] + prompt_separator

        # Constructing the completion using 'correct_answer' and 'support' columns
        completion =" "+ str(row['correct_answer']) + " because " + str(row['support']) + completion_ending

        # Create a dictionary for the prompt and completion
        line = {
            "prompt": prompt,
            "completion": completion
        }

        # Write the dictionary to file in JSON format
        f.write(json.dumps(line) + '\n')
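
Before moving on, it can help to confirm that each line written above follows the prompt-completion conventions used in this notebook: the prompt ends with ` ->`, and the completion starts with a space and ends with a newline. The following is a minimal sketch; `check_record` is a hypothetical helper, not part of any OpenAI tool, and the sample records are made up for illustration.

```python
import json

def check_record(line, prompt_separator=" ->", completion_ending="\n"):
    """Return a list of formatting problems found in one JSONL line."""
    record = json.loads(line)
    if set(record) != {"prompt", "completion"}:
        return ["record must have exactly the keys 'prompt' and 'completion'"]
    problems = []
    if not record["prompt"].endswith(prompt_separator):
        problems.append("prompt does not end with the separator")
    if not record["completion"].startswith(" "):
        problems.append("completion does not start with a space")
    if not record["completion"].endswith(completion_ending):
        problems.append("completion does not end with the ending string")
    return problems

# Illustrative records: one well-formed, one malformed
good = json.dumps({"prompt": "What is the least dangerous radioactive decay? ->",
                   "completion": " alpha decay because ...\n"})
bad = json.dumps({"prompt": "No separator", "completion": "no leading space"})

print(check_record(good))
print(check_record(bad))
```

Running the check over every line of the file would flag malformed records before any upload.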

### Visualizing the JSON file

In [10]:
dfile="/content/QA_prompts_and_completions.json"

In [11]:
import pandas as pd

# Load the data
df = pd.read_json(dfile, lines=True)
df

| | prompt | completion |
|---|---|---|
| 0 | What type of organism is commonly used in prep... | mesophilic organisms because Mesophiles grow ... |
| 1 | What phenomenon makes global winds blow northe... | coriolis effect because Without Coriolis Effe... |
| 2 | Changes from a less-ordered state to a more-or... | exothermic because Summary Changes of state a... |
| 3 | What is the least dangerous radioactive decay? -> | alpha decay because All radioactive decay is ... |
| 4 | Kilauea in hawaii is the world’s most continuo... | smoke and ash because Example 3.5 Calculating... |
| ... | ... | ... |
| 10476 | The enzyme pepsin plays an important role in t... | peptides because Protein A large part of prot... |
| 10477 | What remains a constant of radioactive substan... | rate of decay because The rate of decay of a ... |
| 10478 | Terrestrial ecosystems, also known for their d... | biomes because Terrestrial ecosystems, also k... |
| 10479 | High explosives create shock waves that exceed... | supersonic because The modern day formulation... |


##  1.2. Processing the files for OpenAI


### Converting the data to JSONL

In [14]:
!openai tools fine_tunes.prepare_data -f {dfile}

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 10481 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `.\n`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `/content/QA_prompts_and_completions_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "/content/QA_prompts_and_completions_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[".\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.44 hou

### Splitting dataset into training and validation data

In [15]:
import json
import random

# Load your dataset
input_file = "QA_prompts_and_completions_prepared.jsonl"
with open(input_file, 'r') as f:
    data = [json.loads(line) for line in f]

# Shuffle and split the data
random.shuffle(data)
split_index = int(len(data) * 0.8)  # 80% for training
train_data = data[:split_index]
valid_data = data[split_index:]

# Save the split data
with open("QA_prompts_and_completions_prepared_train.jsonl", 'w') as f:
    for item in train_data:
        f.write(json.dumps(item) + '\n')

with open("QA_prompts_and_completions_prepared_valid.jsonl", 'w') as f:
    for item in valid_data:
        f.write(json.dumps(item) + '\n')
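
The split above changes on every run because the shuffle is unseeded. If a repeatable split is wanted, one option is to shuffle with a seeded generator, as in this sketch. The helper `split_records` and the seed value are assumptions made for illustration and are not part of the original notebook.

```python
import random

def split_records(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records with a fixed seed, then split."""
    rng = random.Random(seed)      # local generator, leaves global state alone
    shuffled = list(records)       # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]

# Hypothetical toy records in the same prompt-completion shape
records = [{"prompt": f"q{i} ->", "completion": f" a{i}\n"} for i in range(10)]
train, valid = split_records(records)
print(len(train), len(valid))
```

Using the same seed keeps the training and validation files identical across runs, which makes results from different fine-tuning jobs easier to compare.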

### Creating the files on OpenAI

In [None]:
from openai import OpenAI

# Create an instance of the OpenAI client
client = OpenAI()

# Upload the training dataset
with open("/content/QA_prompts_and_completions_prepared_train.jsonl", "rb") as file:
    train_file = client.files.create(
        file=file,
        purpose='fine-tune'
    )
training_file_id = train_file.id  # Retrieve the file ID using the id property

# Upload the validation dataset
with open("/content/QA_prompts_and_completions_prepared_valid.jsonl", "rb") as file:
    valid_file = client.files.create(
        file=file,
        purpose='fine-tune'
    )
validation_file_id = valid_file.id  # Retrieve the file ID using the id property

# 2.Fine-tuning the model

In [None]:
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI()

# Previously uploaded file IDs
training_file_id = train_file.id  # Replace 'train_file.id' with your actual training file ID
validation_file_id = valid_file.id  # Replace 'valid_file.id' with your actual validation file ID

# Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="babbage-002",
    hyperparameters={
        "batch_size": 4,
        "learning_rate_multiplier": 0.1,
        "n_epochs": 5
    }
)

# Print the job details
print(job)

FineTuningJob(id='ftjob-DrZsgyCQsCSvl5wJkWY5omJf', created_at=1718892051, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=5, batch_size=4, learning_rate_multiplier=0.1), model='babbage-002', object='fine_tuning.job', organization_id='org-h2Kjmcir4wyGtqq1mJALLGIb', result_files=[], seed=1369873366, status='validating_files', trained_tokens=None, training_file='file-TuG0uhh6a7ghhfz3zArus9I5', validation_file='file-qerw4vFJQtk7TahSGU1Da48u', estimated_finish=None, integrations=[], user_provided_suffix=None)


## Monitoring the fine-tunes

In [None]:
import pandas as pd

# Assume client is already set up and authenticated
response = client.fine_tuning.jobs.list(limit=10)

# Initialize lists to store the extracted data
job_ids = []
created_ats = []
statuses = []
models = []
training_files = []
error_messages = []
fine_tuned_models = []  # List to store the fine-tuned model names

# Iterate over the jobs in the response
for job in response.data:
    job_ids.append(job.id)
    created_ats.append(job.created_at)
    statuses.append(job.status)
    models.append(job.model)
    training_files.append(job.training_file)
    error_message = job.error.message if job.error else None
    error_messages.append(error_message)

    # Append the fine-tuned model name
    fine_tuned_model = job.fine_tuned_model if hasattr(job, 'fine_tuned_model') else None
    fine_tuned_models.append(fine_tuned_model)

# Create a DataFrame
df = pd.DataFrame({
    'Job ID': job_ids,
    'Created At': created_ats,
    'Status': statuses,
    'Model': models,
    'Training File': training_files,
    'Error Message': error_messages,
    'Fine-Tuned Model': fine_tuned_models  # Include the fine-tuned model names
})

# Convert timestamps to readable format
df['Created At'] = pd.to_datetime(df['Created At'], unit='s')
df = df.sort_values(by='Created At', ascending=False)

# Display the DataFrame
df

| | Job ID | Created At | Status | Model | Training File | Error Message | Fine-Tuned Model |
|---|---|---|---|---|---|---|---|
| 0 | ftjob-DrZsgyCQsCSvl5wJkWY5omJf | 2024-06-20 14:00:51 | succeeded | babbage-002 | file-TuG0uhh6a7ghhfz3zArus9I5 | | ft:babbage-002:personal::9cDSaFnw |
| 1 | ftjob-Z23iEC97R27MZbdHvFRNSrns | 2024-06-20 10:01:05 | succeeded | babbage-002 | file-ZCDUM8tgto99orNpKTGNs4OB | | ft:babbage-002:personal::9c9eARD4 |
| 2 | ftjob-acoqFjfqJk6HOwZnTXItHE4U | 2024-06-19 15:58:24 | succeeded | davinci-002 | file-lMzq6TAKruaJXH0WGSDstCcA | | ft:davinci-002:personal::9bsJfXSd |
| 3 | ftjob-M6Db3CB83luS3vlFn7zOO5Xl | 2024-06-19 15:50:30 | succeeded | babbage-002 | file-lMzq6TAKruaJXH0WGSDstCcA | | ft:babbage-002:personal::9bs7k1Nn |
| 4 | ftjob-14UOoKiI92pWHKchPgNxktwF | 2024-06-19 15:48:47 | failed | gpt-3.5-turbo-0125 | file-lMzq6TAKruaJXH0WGSDstCcA | The job failed due to an invalid training file... | |
| 5 | ftjob-HPFJwLnKEhcyQvFfM1MaoxV6 | 2024-06-19 15:20:33 | failed | gpt-3.5-turbo-0125 | file-N6M0hEfLJj8PFyvbwH9d0PE6 | The job failed due to an invalid training file... | |
| 6 | ftjob-d4FXuTaMqw3aogpwsUmf73n0 | 2024-06-19 13:16:55 | succeeded | babbage-002 | file-3Yi7gq5GTR0eaEmPiSElmqcb | | ft:babbage-002:personal::9bpj0C4q |
| 7 | ftjob-DcwTW438yDorxGaQ1mqH9WkD | 2024-06-19 13:03:43 | succeeded | babbage-002 | file-3Yi7gq5GTR0eaEmPiSElmqcb | | ft:babbage-002:personal::9bpWELth |
| 8 | ftjob-Rsznqgak61G9B60FlyFRyPaG | 2024-06-19 12:54:51 | failed | gpt-3.5-turbo-0125 | file-3Yi7gq5GTR0eaEmPiSElmqcb | The job failed due to an invalid training file... | |
| 9 | ftjob-imH6x2AjZhbGuCAo2XZZs6T5 | 2024-06-01 16:40:50 | succeeded | babbage-002 | file-qjVeWyAJqbKSY5sHTn6W1um7 | | ft:babbage-002:personal::9VMBDZgK |


In [None]:
# Display the first non-empty Fine-Tuned Model in the DataFrame
# (the DataFrame is sorted by creation time, newest first)
first_non_empty_model = df[df['Fine-Tuned Model'].notna() & (df['Fine-Tuned Model'] != '')]['Fine-Tuned Model'].iloc[0]

print("The latest fine-tuned model is:", first_non_empty_model)

The latest fine-tuned model is: ft:babbage-002:personal::9cDSaFnw


# 3.Using the fine-tuned OpenAI model

Note: This is a fine-tuning job, so be patient!
Run the `Monitoring the fine-tunes` cell and the `first_non_empty_model` cell from time to time.

If the fine-tuning succeeded and your model is ready, its name will be stored in `first_non_empty_model`.
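
Instead of rerunning the monitoring cell by hand, the job status can be polled in a loop. The sketch below is one possible approach: `wait_for_job` is a hypothetical helper, the polling interval and limit are arbitrary, and the demo uses a fake retrieval function so that it runs without network access. In the notebook, the retrieval function would presumably be `client.fine_tuning.jobs.retrieve`.

```python
import time
from types import SimpleNamespace

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve_job, job_id, poll_seconds=60, max_polls=240, sleep=time.sleep):
    """Poll retrieve_job(job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

# Offline demo with a fake job that succeeds on the third poll
_fake_statuses = iter(["validating_files", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_fake_statuses))

finished = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0, sleep=lambda s: None)
print(finished.status)

# In the notebook (assumed usage):
# finished = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)
```

Passing the retrieval function as an argument keeps the loop testable without calling the API.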

1. Go to the OpenAI Playground to test your model: https://platform.openai.com/playground

2. Check the metrics in the fine-tuning UI:
https://platform.openai.com/finetune/

3. Try the fine-tuned model out in the cell below.

In [None]:
# Define the prompt
# Each prompt ends with " ->", the same separator used in the training data
p=1
if p==1:
  prompt = "What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere? ->"
elif p==2:
  prompt="Kilauea in hawaii is the world’s most continuously active volcano. very active volcanoes characteristically eject red-hot rocks and lava rather than this? ->"
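
The training prompts were built as the question followed by the ` ->` separator, so prompts sent to the fine-tuned model should end the same way. A small helper can apply the separator consistently. This is only a sketch: `build_prompt` is a hypothetical name, and it assumes the separator is exactly the ` ->` string used when the dataset was written.

```python
PROMPT_SEPARATOR = " ->"  # same separator as in the training data

def build_prompt(question, separator=PROMPT_SEPARATOR):
    """Return the question followed by exactly one training-style separator."""
    question = question.strip()
    # Drop an arrow that was already typed, with or without a leading space
    if question.endswith("->"):
        question = question[:-2].rstrip()
    return question + separator

print(build_prompt("What is the least dangerous radioactive decay?"))
print(build_prompt("What is the least dangerous radioactive decay?->"))
```

Both calls print the same string, so the model sees the format it was trained on regardless of how the question was typed.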

In [None]:
from openai import OpenAI
import textwrap

client = OpenAI()

#fmodel="ft:babbage-002:personal::9c9eARD4"
fmodel=first_non_empty_model

# Use the fine-tuned model with the client object
response= client.completions.create(
    model=fmodel,  # Replace with your actual fine-tuned model ID
    prompt=prompt,
    max_tokens=75,  # Adjust as needed
    temperature=0.0,# Adjust as needed for variability
    stop=["\n"]  # Stop generation at the first newline character
)

In [None]:
wrapped_text = textwrap.fill(response.choices[0].text.strip(), 60)
print(wrapped_text)

Coriolis effect because The Coriolis effect makes global
winds blow northeast to southwest or the reverse in the
northern hemisphere and northwest to southeast or the
reverse in the southern hemisphere. The Coriolis effect is
caused by the rotation of Earth. The Coriolis effect is
strongest in the northern hemisphere. The Coriolis effect is
strongest in the southern hemisphere. The Cor
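
The output above stops mid-sentence because generation hit the `max_tokens` limit. One way to tidy the displayed text is to cut it at the last full stop, as in this sketch. `trim_to_last_sentence` is a hypothetical helper, and the approach is deliberately naive: it treats every period as a sentence end, so abbreviations and decimal numbers would confuse it.

```python
def trim_to_last_sentence(text):
    """Cut the text after its last period; return it unchanged if there is none."""
    text = text.strip()
    last_period = text.rfind(".")
    if last_period == -1:
        return text
    return text[: last_period + 1]

sample = "The Coriolis effect is caused by the rotation of Earth. The Cor"
print(trim_to_last_sentence(sample))
```

Raising `max_tokens` is the alternative when the full answer matters more than a short display.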


[Consult OpenAI fine-tune documentation for more](https://platform.openai.com/docs/guides/fine-tuning/create-a-fine-tuned-model)