# Fine-Tuning OpenAI models

Copyright 2024 Denis Rothman

**September 2, 2024 Update**

Starting October 28, 2024, please use **`Chapter08/Fine_tuning_GPT_4o_mini_SQuAd.ipynb`**, which you can access through the README file or directly in the GitHub directory. This notebook will no longer be supported after October 28, 2024.

OpenAI will retire `babbage-002` and `davinci-002` in October 2024:      
"New fine-tuning training runs on babbage-002 and davinci-002 will no longer be supported starting October 28, 2024."

For more information, please consult the [OpenAI fine-tuning documentation](https://beta.openai.com/docs/guides/fine-tuning/).

Check the cost of fine-tuning your dataset on OpenAI before running the notebook.
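Before looking up the price, it can help to have a rough idea of how many tokens the dataset represents. The sketch below is only a back-of-the-envelope estimate: `chars_per_token`, `n_epochs`, and `price_per_million_tokens` are assumptions you must replace with values from the OpenAI pricing page and your own job settings, and it assumes that training is billed on the tokens in the file multiplied by the number of epochs.

```python
def estimate_training_tokens(records, chars_per_token=4.0, n_epochs=3):
    """Roughly estimate the tokens processed during training.

    chars_per_token and n_epochs are assumptions, not values returned by OpenAI.
    """
    total_chars = sum(len(r["prompt"]) + len(r["completion"]) for r in records)
    tokens_per_epoch = total_chars / chars_per_token
    return int(tokens_per_epoch * n_epochs)


def estimate_cost(tokens, price_per_million_tokens):
    """Convert a token estimate into a cost, using a price you looked up yourself."""
    return tokens / 1_000_000 * price_per_million_tokens


# Two toy records in the same prompt/completion layout used later in this notebook
sample = [
    {"prompt": "First sentence. ->", "completion": " Second sentence.\n"},
    {"prompt": "Second sentence. ->", "completion": " Third sentence.\n"},
]
tokens = estimate_training_tokens(sample)
# The price below is a placeholder, not an actual OpenAI price
print(tokens, estimate_cost(tokens, price_per_million_tokens=1.0))
```

A tokenizer library would give a more accurate count; the character ratio is used here only to keep the example free of extra dependencies.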

Run this notebook cell by cell to go through the following steps:

1. Preparing the dataset
2. Fine-tuning a model
3. Running the fine-tuned model
4. Managing fine-tuned jobs and models
5. Before leaving

**April 26, 2024** Step 1.2 has been automated.


## Installing OpenAI
If necessary, restart the runtime after installing `openai` and run the cell again to make sure that `import openai` is executed.

In [None]:
try:
  import openai
except ImportError:
  # Install the package only if it is missing, then import it
  !pip install openai
  import openai

Collecting openai
  Downloading openai-1.23.6-py3-none-any.whl (311 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.5 ht

## Your API Key

In [None]:
# You can retrieve your API key from a file (1)
# or enter it manually (2).

# Comment out this cell if you want to enter your key manually.
# (1) Retrieve the API key from a file
# Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anybody next to you)
from google.colab import drive
drive.mount('/content/drive')
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()  # strip() removes the trailing newline

Mounted at /content/drive
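Outside Colab, the same idea can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `load_api_key` is a hypothetical name, and it assumes you want an already-set environment variable to take precedence over the file.

```python
import os


def load_api_key(path, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    with open(path, "r", encoding="utf-8") as f:
        key = f.readline().strip()  # drop the trailing newline and spaces
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key


# Example call with the path used in this notebook:
# API_KEY = load_api_key("drive/MyDrive/files/api_key.txt")
```

Stripping the line matters because a key read with `readline()` keeps its newline character, which can make authentication fail.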


In [None]:
# (2) Enter your key manually by
# replacing API_KEY with your key.
# The OpenAI key
import os
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

# 1. Preparing the dataset

## 1.1. Preparing the data in JSON

In [None]:
#From Gutenberg to JSON
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import requests
from bs4 import BeautifulSoup
import json
import re

# First, fetch the text of the book
# Option 1: from Project Gutenberg
#url = 'http://www.gutenberg.org/cache/epub/4280/pg4280.html'
#response = requests.get(url)
#soup = BeautifulSoup(response.content, 'html.parser')

# Option 2: from the GitHub repository:
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition/master/Chapter08/gutenberg.org_cache_epub_4280_pg4280.html --output "gutenberg.org_cache_epub_4280_pg4280.html"

# Open and read the downloaded HTML file
with open("gutenberg.org_cache_epub_4280_pg4280.html", 'r', encoding='utf-8') as file:
    file_contents = file.read()

# Parse the file contents using BeautifulSoup
soup = BeautifulSoup(file_contents, 'html.parser')

# Get the text of the book and clean it up a bit
text = soup.get_text()
text = re.sub(r'\s+', ' ', text).strip()  # raw string avoids an invalid escape warning

# Split the text into sentences
sentences = sent_tokenize(text)

# Define the separator and ending
prompt_separator = " ->"
completion_ending = "\n"

# Now create the prompts and completions
data = []
for i in range(len(sentences) - 1):
    data.append({
        "prompt": sentences[i] + prompt_separator,
        "completion": " " + sentences[i + 1] + completion_ending
    })

# Write the prompts and completions to a file
with open('kant_prompts_and_completions.json', 'w') as f:
    for line in data:
        f.write(json.dumps(line) + '\n')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1295k  100 1295k    0     0  2880k      0 --:--:-- --:--:-- --:--:-- 2878k
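The pairing logic of the cell above is easier to see on a toy input. The sketch below reuses the same separator and ending; `make_pairs` and the three sample sentences are illustrative and do not come from the book.

```python
def make_pairs(sentences, prompt_separator=" ->", completion_ending="\n"):
    """Pair each sentence (prompt) with the sentence that follows it (completion)."""
    return [
        {
            "prompt": sentences[i] + prompt_separator,
            "completion": " " + sentences[i + 1] + completion_ending,
        }
        for i in range(len(sentences) - 1)
    ]


# Made-up sentences, only to show the shape of the records
toy = ["Time is a form of intuition.", "Space is one too.", "Both are a priori."]
for pair in make_pairs(toy):
    print(pair)
```

With `n` sentences, this produces `n - 1` records, because the last sentence has no successor to serve as its completion.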


In [None]:
import pandas as pd

# Load the data
df = pd.read_json('kant_prompts_and_completions.json', lines=True)
df

| | prompt | completion |
|---|---|---|
| 0 | The Project Gutenberg Etext of The Critique of... | Be sure to check the copyright laws for your ... |
| 1 | Be sure to check the copyright laws for your c... | We encourage you to keep this file, exactly a... |
| 2 | We encourage you to keep this file, exactly as... | Please do not remove this.\n |
| 3 | Please do not remove this. -> | This header should be the first thing seen wh... |
| 4 | This header should be the first thing seen whe... | Do not change or edit it without written perm... |
| ... | ... | ... |
| 6122 | 78-79. is their motto, under which they may le... | As regards those who wish to pursue a scienti... |
| 6123 | As regards those who wish to pursue a scientif... | When I mention, in relation to the former, th... |
| 6124 | When I mention, in relation to the former, the... | The critical path alone is still open.\n |
| 6125 | The critical path alone is still open. -> | If my reader has been kind and patient enough... |


## 1.2. Converting the data to JSONL

Answer the questions as necessary for your project.

**April 26, 2024** The confirmation prompt below is now answered automatically:
`Your data will be written to a new JSONL file. Proceed [Y/n]: Y`

In interactive mode, the tool then displays the following information:


```
Wrote modified file to `kant_prompts_and_completions_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "kant_prompts_and_completions_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 1.44 hours to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

```
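The two conventions the tool reports on (every prompt ends with ` ->`, every completion ends with `\n`) can also be checked by hand. The following is a hand-rolled sketch, not the implementation of the OpenAI tool; `check_jsonl_lines` is a hypothetical helper name.

```python
import json


def check_jsonl_lines(lines, prompt_suffix=" ->", completion_suffix="\n"):
    """Return a list of (line_number, problem) tuples; empty if every line passes."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
            problems.append((number, "expected exactly the keys prompt and completion"))
            continue
        if not record["prompt"].endswith(prompt_suffix):
            problems.append((number, "prompt does not end with the separator"))
        if not record["completion"].endswith(completion_suffix):
            problems.append((number, "completion does not end with the ending"))
    return problems


# One well-formed record and one with a missing separator
good = json.dumps({"prompt": "A. ->", "completion": " B.\n"})
bad = json.dumps({"prompt": "A.", "completion": " B.\n"})
print(check_jsonl_lines([good, bad]))
```

To run it on the prepared file, pass the lines read from `kant_prompts_and_completions_prepared.jsonl` to the function.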



In [None]:
! yes | openai tools fine_tunes.prepare_data -f "kant_prompts_and_completions.json"
#uncomment to run in interactive mode
#!openai tools fine_tunes.prepare_data -f "kant_prompts_and_completions.json"

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 6127 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `\n`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`


Your data will be written to a new JSONL file. Proceed [Y/n]: 
Wrote modified file to `kant_prompts_and_completions_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "kant_prompts_and_completions_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 1.44 hours to train a `curi

In [None]:
import json

# Open the file and read the lines
with open('kant_prompts_and_completions_prepared.jsonl', 'r') as f:
    lines = f.readlines()

# Parse and print a sample of the lines (indices 199 to 299)
for line in lines[199:300]:
    data = json.loads(line)
    print(json.dumps(data, indent=4))

{
    "prompt": "For he found that it was not sufficient to meditate on the figure, as it lay before his eyes, or the conception of it, as it existed in his mind, and thus endeavour to get at the knowledge of its properties, but that it was necessary to produce these properties, as it were, by a positive a priori construction; and that, in order to arrive with certainty at a priori cognition, he must not attribute to the object any other properties than those which necessarily followed from that which he had himself, in accordance with his conception, placed in the object. ->",
    "completion": " A much longer period elapsed before physics entered on the highway of science.\n"
}
{
    "prompt": "A much longer period elapsed before physics entered on the highway of science. ->",
    "completion": " For it is only about a century and a half since the wise Bacon gave a new direction to physical studies, or rather\u2014as others were already on the right track\u2014imparted fresh vigour t

Creating the file on OpenAI

In [None]:
from openai import OpenAI
client = OpenAI()

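# Open the file in a context manager so that it is closed after the upload
with open("/content/kant_prompts_and_completions_prepared.jsonl", "rb") as training_file:
  file_response = client.files.create(
    file=training_file,
    purpose='fine-tune'
  )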

# Print option for maintenance
#print(file_response)

In [None]:
# Extract the training file ID
file_id = file_response.id
print(file_id)

file-hRZOuC9rnxPIzotrx36dz1B5


# 2. Fine-tuning a model

In [None]:
from openai import OpenAI
client = OpenAI()

job_response=client.fine_tuning.jobs.create(
  training_file=file_id,
  model="babbage-002"
)

In [None]:
job_id = job_response.id
print(job_id)

ftjob-jL3f6K9lcIA2QePrw5lJKG7d


## 2.1. Checking the status of the fine-tuning job

In [None]:
# Check the status of the fine-tuning job
job_details = client.fine_tuning.jobs.retrieve(job_id)
print(job_details)

FineTuningJob(id='ftjob-jL3f6K9lcIA2QePrw5lJKG7d', created_at=1714393916, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='babbage-002', object='fine_tuning.job', organization_id='org-h2Kjmcir4wyGtqq1mJALLGIb', result_files=[], seed=399930729, status='validating_files', trained_tokens=None, training_file='file-hRZOuC9rnxPIzotrx36dz1B5', validation_file=None, integrations=[], user_provided_suffix=None, estimated_finish=None)


In [None]:
# There may be a time lapse:
# 1.between the moment you run the fine-tuning job and its completion
# 2.between its completion and the server updates
# Check your email if you have activated OpenAI notifications
status = job_details.status
print(f"Job status: {status}")

Job status: validating_files
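Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. This is a minimal sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and the set of terminal statuses (`succeeded`, `failed`, `cancelled`) should be checked against the current OpenAI documentation.

```python
import time

# Statuses after which the job will not change anymore (assumption, see lead-in)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, timeout_seconds=7200, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    waited = 0
    while True:
        job = retrieve(job_id)
        print(f"Job status: {job.status}")
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job {job_id} still {job.status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the client created earlier in this notebook:
# job_details = wait_for_job(client.fine_tuning.jobs.retrieve, job_id)
```

Passing the retrieve function as an argument keeps the helper independent of the client object, which also makes it easy to try out with a stand-in function.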


Checking the list of the fine-tuning jobs and their status

In [None]:
# List 10 fine-tuning jobs
job_list=client.fine_tuning.jobs.list(limit=10)

In [None]:
import json
from pprint import pprint

# Get the raw JSON string from the SyncCursorPage object
json_string = job_list.json()

# Convert the JSON string into a Python object
data = json.loads(json_string)

# Extract the data array
jobs = data.get('data', [])

# Format the data into a list of dictionaries
formatted_data = [
    {
        'id': job.get('id'),
        'created_at': job.get('created_at'),
        'status': job.get('status'),
        'training_file': job.get('training_file'),
        'model': job.get('model'),
        'model_name':job.get('fine_tuned_model')
    }
    for job in jobs
]

# Print the formatted data
pprint(formatted_data)

[{'created_at': 1714393916,
  'id': 'ftjob-jL3f6K9lcIA2QePrw5lJKG7d',
  'model': 'babbage-002',
  'model_name': None,
  'status': 'validating_files',
  'training_file': 'file-hRZOuC9rnxPIzotrx36dz1B5'},
 {'created_at': 1714393314,
  'id': 'ftjob-rBGEoO6UiDxO7nE16cpcWcvs',
  'model': 'babbage-002',
  'model_name': 'ft:babbage-002:personal::9JKPPXEh',
  'status': 'succeeded',
  'training_file': 'file-iucyWCAIDt3jLI7JThcxpBnu'},
 {'created_at': 1711625435,
  'id': 'ftjob-yGupxxOGqj2IOZuOtK5PSBUn',
  'model': 'babbage-002',
  'model_name': 'ft:babbage-002:personal::97iytOol',
  'status': 'succeeded',
  'training_file': 'file-LZeMWZWpiFLTgXV23CEeZi8Q'},
 {'created_at': 1709495269,
  'id': 'ftjob-4Is8DvKhkx5aTU9Dk8GoWJQQ',
  'model': 'babbage-002',
  'model_name': 'ft:babbage-002:personal::8ymVM2Po',
  'status': 'succeeded',
  'training_file': 'file-9EnBvLQAOllNuwvDNIwOLQuc'},
 {'created_at': 1708420404,
  'id': 'ftjob-X8DzSV2O4upEoG1T3jJd1hhm',
  'model': 'babbage-002',
  'model_name': 'ft:
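The formatted list can be used to pick the model to run next. The sketch below assumes the dictionary layout produced by the cell above (`created_at`, `status`, `model_name`); the helper names are hypothetical, and the sample entries reuse values from the output with shortened job IDs.

```python
from datetime import datetime, timezone


def latest_succeeded_model(jobs):
    """Return the model name of the most recent succeeded job, or None."""
    succeeded = [j for j in jobs if j.get("status") == "succeeded" and j.get("model_name")]
    if not succeeded:
        return None
    return max(succeeded, key=lambda j: j["created_at"])["model_name"]


def readable_time(created_at):
    """Convert a Unix timestamp (seconds) into a readable UTC string."""
    return datetime.fromtimestamp(created_at, tz=timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


# Sample entries; the job IDs are shortened placeholders
sample_jobs = [
    {"id": "ftjob-a", "created_at": 1714393916, "status": "validating_files", "model_name": None},
    {"id": "ftjob-b", "created_at": 1714393314, "status": "succeeded",
     "model_name": "ft:babbage-002:personal::9JKPPXEh"},
    {"id": "ftjob-c", "created_at": 1711625435, "status": "succeeded",
     "model_name": "ft:babbage-002:personal::97iytOol"},
]
print(latest_succeeded_model(sample_jobs))
print(readable_time(sample_jobs[0]["created_at"]))
```

In the notebook, `latest_succeeded_model(formatted_data)` would return the name to pass as `model` when running the fine-tuned model.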

# 3. Running the fine-tuned model

We will now run the model for a completion task.

Note: If your fine-tuned model does not appear immediately after the end of the fine-tuning process, you might have to wait until OpenAI has finished processing it.

You can check the status regularly with the code we ran above.

If you have activated OpenAI notifications, also check your email for a confirmation.

In [None]:
# text to complete
text_content = "Space and time are key factors in human reasoning. Human minds cannot think without space and time perceptions."
#print(text_content)

In [None]:
# Separate the instruction from the text with ": " so the two do not run together
prompt = "Continue the following text as if you were a scientist and philosopher: " + text_content
#print(prompt)

In [None]:
response = client.completions.create(
    model="ft:babbage-002:personal::8g9QuR5t",
    prompt=prompt,
    max_tokens=1000,
    temperature=0.8
)
#print(response)

In [None]:
# Check if there are any choices in the response
if response.choices:
    # Get the first choice (index 0)
    first_choice = response.choices[0]

    # Print the text of the first choice
    print("Model's Completion:", first_choice.text)
else:
    print("No choices returned in the response")

Model's Completion:  These two factors, space and time, the condition of the manifold (constitutive reality of the manifold of the perceptions of phenomena), determine the form of the intuition of objects, the content of the intuition of the mind, and the ground on which we can ground the possibility of objects for the mind and of phenomena in it.
But science has found in these factors the conditions of the possibility of an objective reality for objective cognition, and consequently of the possibility of objects for things themselves, and consequently of an object for the empirical intuition of reality 4 of the empirical manifold of experience.
If, therefore, I should attempt to put these terms into the language of science, I should say that they have objective significance; that they do not determine phenomena for themselves, but only the conditions of the possibility of their own empirical existence; and that, consequently, while the conditions of the possibility of the empirical ma

In [None]:
# Formatting the response
print("Completion ID:", response.id)
print("Created:", response.created)
print("Model:", response.model)
print("Object Type:", response.object)

# Formatting the choices
for i, choice in enumerate(response.choices):
    print(f"Choice {i}:")
    print("  Finish Reason:", choice.finish_reason)
    print("  Index:", choice.index)
    print("  Logprobs:", choice.logprobs)
    print("  Text:", choice.text)

# Formatting the usage
print("Usage:")
print("  Completion Tokens:", response.usage.completion_tokens)
print("  Prompt Tokens:", response.usage.prompt_tokens)
print("  Total Tokens:", response.usage.total_tokens)

Completion ID: cmpl-9JKSjnAteU587uNZIID9HhNLm1E8r
Created: 1714393917
Model: ft:babbage-002:personal::8g9QuR5t
Object Type: text_completion
Choice 0:
  Finish Reason: length
  Index: 0
  Logprobs: None
  Text:  These two factors, space and time, the condition of the manifold (constitutive reality of the manifold of the perceptions of phenomena), determine the form of the intuition of objects, the content of the intuition of the mind, and the ground on which we can ground the possibility of objects for the mind and of phenomena in it.
But science has found in these factors the conditions of the possibility of an objective reality for objective cognition, and consequently of the possibility of objects for things themselves, and consequently of an object for the empirical intuition of reality 4 of the empirical manifold of experience.
If, therefore, I should attempt to put these terms into the language of science, I should say that they have objective significance; that they do not determ
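The data preparation output in section 1.2 recommends ending the prompt with the ` ->` indicator and passing `stop=["\n"]`, while the run above does neither and finishes with the reason `length`. The sketch below shows one way to build a request that follows that recommendation. `build_completion_request` is a hypothetical helper, the model name in the usage comment is a placeholder, and `max_tokens=100` is an arbitrary choice.

```python
def build_completion_request(text, model, separator=" ->", stop="\n",
                             max_tokens=100, temperature=0.8):
    """Build keyword arguments for a completion call that follows the training format."""
    prompt = text.rstrip()
    if not prompt.endswith(separator):
        prompt += separator  # same indicator string as in the training prompts
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": [stop],  # stop at the ending used in the training completions
    }


request = build_completion_request(
    "Space and time are key factors in human reasoning.",
    model="ft:babbage-002:personal::your-model-id",  # placeholder, not a real model
)
print(request["prompt"])

# Usage with the client created earlier in this notebook:
# response = client.completions.create(**request)
```

With this stop sequence, the model is expected to return roughly one sentence per call, since each training completion was a single sentence followed by a newline.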

Once your fine-tuned model is available, you can take the following steps:

1. Go to the OpenAI Playground to test your model (the link is in the OpenAI email): https://platform.openai.com/playground

2. Then implement it in your environment.

3. You can also run a classification fine-tuning example in this repository: https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition/blob/main/Chapter08/Fine_tuned_classification.ipynb


# 4. Managing fine-tuned jobs and models

OpenAI offers several model management tools.

In [None]:
from openai import OpenAI
client = OpenAI()

# Set maintenance to True carefully if you wish to activate one of several
# job or model functions (information, cancel, delete)
maintenance=False
if maintenance is True:
  # List 10 fine-tuning jobs
  client.fine_tuning.jobs.list(limit=10)

  # Retrieve the state of a fine-tune
  client.fine_tuning.jobs.retrieve("ftjob-your job")

  # Cancel a job
  client.fine_tuning.jobs.cancel("ftjob-your job")

  # List up to 10 events from a fine-tuning job
  client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-your job", limit=10)

  # Delete a fine-tuned model (must be an owner of the org the model was created in)
  #client.models.delete("your model")
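The events returned by `list_events` are easier to read once formatted. The sketch below assumes each event exposes `created_at`, `level`, and `message` attributes, as the OpenAI Python library's fine-tuning event objects do at the time of writing; `format_events` is a hypothetical helper, and the demo events are made-up stand-ins.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def format_events(events):
    """Return one readable line per event, oldest first."""
    ordered = sorted(events, key=lambda e: e.created_at)
    return [
        f"{datetime.fromtimestamp(e.created_at, tz=timezone.utc):%Y-%m-%d %H:%M:%S} "
        f"[{e.level}] {e.message}"
        for e in ordered
    ]


# Stand-in events for illustration only; real ones come from the API
demo_events = [
    SimpleNamespace(created_at=1714393990, level="info", message="second demo message"),
    SimpleNamespace(created_at=1714393916, level="info", message="first demo message"),
]
for line in format_events(demo_events):
    print(line)

# Usage with the client created earlier in this notebook:
# events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
# for line in format_events(events.data):
#     print(line)
```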

# 5. Before leaving

And what if a standard model can do the same job?

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4",
  messages=[
    # The persona goes in a system message; empty messages are not needed
    {
      "role": "system",
      "content": "You are Kant, the philosopher"
    },
    {
      "role": "user",
      "content": "Explain why space and time are important from Kant's point of view"
    }
  ],
  temperature=0.03,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

In [None]:
response.choices[0]

Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='As Immanuel Kant, I argue that space and time are not empirical concepts derived from external experiences, but rather they are fundamental structures of the mind, necessary for it to perceive the external world. This view is central to my philosophy, as outlined in my work "Critique of Pure Reason".\n\nI consider space and time as "a priori intuitions". By this, I mean that they are preconditions for our experience, not things we learn through experience. They are the necessary conditions for the possibility of all physical experience. We cannot perceive objects without intuiting them as existing in space and time.\n\nSpace, for me, is the framework that allows us to perceive objects as existing outside and separate from us. It is the medium in which objects are arranged and related. Without the concept of space, we would not be able to perceive the world as a coherent, organized whole.\n\nTi
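The `finish_reason='length'` in the output above indicates that the answer was cut off at the `max_tokens=256` limit. A small helper can make this visible when comparing the standard model with the fine-tuned one. This is a sketch; `summarize_choice` is a hypothetical name, and the stand-in object only mimics the attributes read by the function.

```python
from types import SimpleNamespace


def summarize_choice(choice):
    """Extract the text of a chat choice and flag answers cut off by max_tokens."""
    text = choice.message.content or ""
    return {
        "text": text,
        "truncated": choice.finish_reason == "length",
        "characters": len(text),
    }


# Stand-in object with the same attributes as a chat completion choice
demo_choice = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(content="As Immanuel Kant, I argue that space and time"),
)
print(summarize_choice(demo_choice))

# Usage with the response obtained above:
# summary = summarize_choice(response.choices[0])
```

If `truncated` is `True`, raising `max_tokens` or asking for a shorter answer in the prompt are two options to consider.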