# Fine-Tuning GPT-3

Copyright 2023 Denis Rothman

[OpenAI fine-tuning documentation](https://beta.openai.com/docs/guides/fine-tuning/)

Check the cost of fine-tuning your dataset on OpenAI before running the notebook.
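
As a rough way to act on this advice, the sketch below estimates a training cost from prompt-completion records. It rests on assumptions that do not come from this notebook: the rule of thumb of about four characters per token for English text, billing by training tokens multiplied by the number of epochs, a hypothetical `n_epochs` of 4, and a `price_per_1k_tokens` value that you must look up on the OpenAI pricing page. The function name `estimate_fine_tune_cost` is illustrative only.

```python
def estimate_fine_tune_cost(records, price_per_1k_tokens, n_epochs=4, chars_per_token=4):
    """Rough estimate: approximate tokens x epochs x price per 1K tokens."""
    total_chars = sum(len(r["prompt"]) + len(r["completion"]) for r in records)
    approx_tokens = total_chars / chars_per_token  # heuristic, not a real tokenizer
    billed_tokens = approx_tokens * n_epochs       # assumption: each epoch is billed
    return billed_tokens / 1000 * price_per_1k_tokens

# Tiny illustrative dataset; in practice, load the records from your JSONL file.
sample = [
    {"prompt": "First sentence. ->", "completion": " Second sentence.\n"},
    {"prompt": "Second sentence. ->", "completion": " Third sentence.\n"},
]

# The price below is a placeholder, not a real OpenAI price.
print(estimate_fine_tune_cost(sample, price_per_1k_tokens=1.0))
```

Treat the result as an order-of-magnitude figure; the amount reported by OpenAI when the job starts is the one that counts.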

Run this notebook cell by cell to:

1. prepare data
2. fine-tune a model
3. run a fine-tuned model
4. manage the fine-tunes

## Installing OpenAI & Wandb

Restart the runtime after installing openai, then run the cell again to make sure that `import openai` is executed.

In [10]:
try:
  import openai
except ImportError:
  # Install the package only when the import fails
  !pip install openai
  import openai
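
The `openai.Completion` and `openai.FineTune` calls used later in this notebook belong to the pre-1.0 interface of the `openai` Python package. The sketch below checks which major version is installed. The helper name `is_legacy_openai` and the suggested `pip` pin are illustrative choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def is_legacy_openai(version_string):
    """Return True when the major version is 0, i.e. the pre-1.0 interface."""
    major = int(version_string.split(".")[0])
    return major < 1

try:
    installed = version("openai")
    if is_legacy_openai(installed):
        print(f"openai {installed}: legacy interface available")
    else:
        print(f"openai {installed}: consider pip install 'openai<1' for this notebook")
except PackageNotFoundError:
    print("openai is not installed")
```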

## Your API Key

In [11]:
# You can retrieve your API key from a file (1)
# or enter it manually (2).

# Comment out this cell if you want to enter your key manually.
# (1) Retrieve the API key from a file.
# Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anybody next to you).
from google.colab import drive
drive.mount('/content/drive')
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()  # strip the trailing newline from the key

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
# (2) Enter your key manually by
# replacing API_KEY with your key.
# The OpenAI key
import os
os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
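
If you are not working in Colab with a mounted drive, one alternative is to take the key from an environment variable and fall back to a hidden prompt. This is a minimal sketch: the helper name `resolve_api_key` is hypothetical, and `getpass` is used so that the key is not echoed on screen.

```python
import os
from getpass import getpass

def resolve_api_key(env=None, prompt_fn=getpass):
    """Return the key from OPENAI_API_KEY, or ask for it without echoing it."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Nothing in the environment: ask the user (input is hidden by getpass)
        key = prompt_fn("OpenAI API key: ").strip()
    return key

# Usage in the notebook (uncomment):
# openai.api_key = resolve_api_key()
```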

In [13]:
try:
  import wandb
except ImportError:
  # Install the package only when the import fails
  !pip install wandb
  import wandb

# 1. Preparing the dataset

## 1.1. Preparing the data in JSON

In [14]:
#From Gutenberg to JSON
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import requests
from bs4 import BeautifulSoup
import json
import re

# First, fetch the text of the book from Project Gutenberg
url = 'http://www.gutenberg.org/cache/epub/4280/pg4280.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get the text of the book and clean it up a bit
text = soup.get_text()
text = re.sub(r'\s+', ' ', text).strip()  # raw string avoids an invalid-escape warning

# Split the text into sentences
sentences = sent_tokenize(text)

# Define the separator and ending
prompt_separator = " ->"
completion_ending = "\n"

# Now create the prompts and completions
data = []
for i in range(len(sentences) - 1):
    data.append({
        "prompt": sentences[i] + prompt_separator,
        "completion": " " + sentences[i + 1] + completion_ending
    })

# Write the prompts and completions to a file
with open('kant_prompts_and_completions.json', 'w') as f:
    for line in data:
        f.write(json.dumps(line) + '\n')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
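
To see the pairing logic of the cell above in isolation, here is a small sketch on three made-up sentences. The function name `build_pairs` and the sample sentences are illustrative; the separator, the leading space, and the ending match the values used in the notebook.

```python
import json

PROMPT_SEPARATOR = " ->"
COMPLETION_ENDING = "\n"

def build_pairs(sentences):
    """Pair each sentence (prompt) with the sentence that follows it (completion)."""
    return [
        {
            "prompt": current + PROMPT_SEPARATOR,
            "completion": " " + following + COMPLETION_ENDING,
        }
        for current, following in zip(sentences, sentences[1:])
    ]

pairs = build_pairs(["A is B.", "B is C.", "C is D."])
for pair in pairs:
    print(json.dumps(pair))  # one JSON object per line, as in the file written above
```

A list of N sentences yields N - 1 pairs, because the last sentence has no successor to serve as its completion.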


In [15]:
import pandas as pd

# Load the data
df = pd.read_json('kant_prompts_and_completions.json', lines=True)
df

|  | prompt | completion |
|---|---|---|
| 0 | The Project Gutenberg Etext of The Critique of... | Be sure to check the copyright laws for your ... |
| 1 | Be sure to check the copyright laws for your c... | We encourage you to keep this file, exactly a... |
| 2 | We encourage you to keep this file, exactly as... | Please do not remove this.\n |
| 3 | Please do not remove this. -> | This header should be the first thing seen wh... |
| 4 | This header should be the first thing seen whe... | Do not change or edit it without written perm... |
| ... | ... | ... |
| 6122 | 78-79. is their motto, under which they may le... | As regards those who wish to pursue a scienti... |
| 6123 | As regards those who wish to pursue a scientif... | When I mention, in relation to the former, th... |
| 6124 | When I mention, in relation to the former, the... | The critical path alone is still open.\n |
| 6125 | The critical path alone is still open. -> | If my reader has been kind and patient enough... |


## 1.2. Converting the data to JSONL

Answer the questions as necessary for your project.

In [16]:
!openai tools fine_tunes.prepare_data -f "kant_prompts_and_completions.json"

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 6127 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `\n`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `kant_prompts_and_completions_prepared (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "kant_prompts_and_completions_prepared (1).jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 1.44 hours to tra
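
The checks reported by the tool (prompt suffix ` ->`, completion suffix `\n`) can also be reproduced by hand. The following is a minimal sketch, not the tool's own implementation; the function name `check_jsonl_lines` and the problem messages are hypothetical.

```python
import json

def check_jsonl_lines(lines, separator=" ->", ending="\n"):
    """Return a list of (line_number, problem) tuples; an empty list means all checks passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        prompt = record.get("prompt", "")
        completion = record.get("completion", "")
        if not prompt.endswith(separator):
            problems.append((number, "prompt lacks separator"))
        if not completion.startswith(" "):
            problems.append((number, "completion lacks leading space"))
        if not completion.endswith(ending):
            problems.append((number, "completion lacks ending"))
    return problems

good = json.dumps({"prompt": "A. ->", "completion": " B.\n"})
bad = json.dumps({"prompt": "A.", "completion": "B."})
print(check_jsonl_lines([good, bad, "{oops"]))
```

To check a real file, pass the open file object, for example `check_jsonl_lines(open(path))`, since iterating over a file yields its lines.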

In [17]:
import json

# Open the file and read the lines
with open('kant_prompts_and_completions_prepared.jsonl', 'r') as f:
    lines = f.readlines()

# Parse and print lines 200 to 300 of the file (slice 199:300)
for line in lines[199:300]:
    data = json.loads(line)
    print(json.dumps(data, indent=4))

{
    "prompt": "For he found that it was not sufficient to meditate on the figure, as it lay before his eyes, or the conception of it, as it existed in his mind, and thus endeavour to get at the knowledge of its properties, but that it was necessary to produce these properties, as it were, by a positive a priori construction; and that, in order to arrive with certainty at a priori cognition, he must not attribute to the object any other properties than those which necessarily followed from that which he had himself, in accordance with his conception, placed in the object. ->",
    "completion": " A much longer period elapsed before physics entered on the highway of science.\n"
}
{
    "prompt": "A much longer period elapsed before physics entered on the highway of science. ->",
    "completion": " For it is only about a century and a half since the wise Bacon gave a new direction to physical studies, or rather\u2014as others were already on the right track\u2014imparted fresh vigour t

# 2. Fine-tuning a model

In [18]:
!openai api fine_tunes.create -t "kant_prompts_and_completions_prepared.jsonl" -m "ada"

Found potentially duplicated files with name 'kant_prompts_and_completions_prepared.jsonl', purpose 'fine-tune' and size 2761402 bytes
file-3lFv23fBQwwNXxxOmpoxESUL
file-b3CAoITXZgD6NpAfAv4pWZY3
file-dF4n7KualE20Sd8arHc8gCob
file-FDPRTY1NXTsFwqGG6bppWu1e
file-jng9EjuT6G8uZm5ruovoGOT8
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: 
Upload progress: 100% 2.76M/2.76M [00:00<00:00, 2.39Git/s]
Uploaded file from kant_prompts_and_completions_prepared.jsonl: file-YRIdjNnYLIR1IXzBgu1gs6eW
Created fine-tune: ft-L6kXshOWnppQc3kgBUuq8xCf
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-06-22 20:54:59] Created fine-tune: ft-L6kXshOWnppQc3kgBUuq8xCf

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-L6kXshOWnppQc3kgBUuq8xCf



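OpenAI handles many requests, so the event stream may be interrupted before the fine-tune completes.
If your stream is interrupted, the CLI prints the `fine_tunes.follow` command that resumes it. The interruption does not cancel the fine-tune.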
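
Instead of keeping a stream open, you could poll the job status at intervals. The sketch below takes the retrieval function as a parameter so that it can be tried without network access. The helper name `wait_for_fine_tune`, the polling interval, and the set of terminal status names are assumptions; only `succeeded` appears in this notebook.

```python
import time

def wait_for_fine_tune(retrieve, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    terminal = {"succeeded", "failed", "cancelled"}  # assumed status names
    status = None
    for _ in range(max_polls):
        job = retrieve(job_id)
        status = job["status"]
        if status in terminal:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"fine-tune {job_id} still {status!r} after {max_polls} polls")

# Usage with the legacy client (assumes openai<1.0 is installed and the key is set):
# job = wait_for_fine_tune(openai.FineTune.retrieve, "ft-L6kXshOWnppQc3kgBUuq8xCf")
# print(job["fine_tuned_model"])
```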

In [49]:
# Run this cell to resume the event stream with the fine_tunes.follow instruction
!openai api fine_tunes.follow -i ft-L6kXshOWnppQc3kgBUuq8xCf

[2023-06-22 20:54:59] Created fine-tune: ft-L6kXshOWnppQc3kgBUuq8xCf
[2023-06-22 21:41:58] Fine-tune costs $0.83
[2023-06-22 21:41:58] Fine-tune enqueued. Queue number: 21
[2023-06-22 21:43:06] Fine-tune is in the queue. Queue number: 20
[2023-06-22 21:44:00] Fine-tune is in the queue. Queue number: 19
[2023-06-22 21:50:19] Fine-tune is in the queue. Queue number: 18
[2023-06-22 21:52:44] Fine-tune is in the queue. Queue number: 17
[2023-06-22 21:53:17] Fine-tune is in the queue. Queue number: 16
[2023-06-22 21:55:25] Fine-tune is in the queue. Queue number: 15
[2023-06-22 21:57:00] Fine-tune is in the queue. Queue number: 14
[2023-06-22 21:57:51] Fine-tune is in the queue. Queue number: 13
[2023-06-22 21:58:35] Fine-tune is in the queue. Queue number: 12
[2023-06-22 22:00:26] Fine-tune is in the queue. Queue number: 11
[2023-06-22 22:01:25] Fine-tune is in the queue. Queue number: 10
[2023-06-22 22:02:40] Fine-tune is in the queue. Queue number: 9
[2023-06-22 22:02:41] Fine-tune is in

# 3. Running the fine-tuned GPT-3 model

We will now run the model on a completion task.

Note: If your fine-tuned model does not appear immediately after the fine-tuning process ends, you might have to wait until OpenAI has processed it. You can also:

1. go to the OpenAI Playground to test your model: https://platform.openai.com/playground

2. select your model in the dropdown list and test it in that environment

In [None]:
with open("drive/MyDrive/files/fine_tune.txt", "r") as f:
    FINE_TUNE = f.readline().strip()  # load your saved model name from a file, or assign it to this variable directly
FINE_TUNE
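
The cell above reads the model name from `fine_tune.txt`, but the notebook does not show how that file was written. One possible way is sketched below. The helper names `save_model_name` and `load_model_name` are hypothetical, and the demonstration writes to a temporary directory rather than to your drive.

```python
import tempfile
from pathlib import Path

def save_model_name(model_name, path):
    """Write the fine-tuned model name on a single line, creating folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(model_name.strip() + "\n", encoding="utf-8")

def load_model_name(path):
    """Read the first line of the file and strip surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").splitlines()[0].strip()

# Demonstration with the model name that appears in the response shown in this notebook
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "fine_tune.txt"
    save_model_name("ada:ft-personal-2023-06-22-06-06-12", demo_path)
    print(load_model_name(demo_path))
```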

In [51]:
prompt = "Freedom can be a concept or a virtue ->"
response=openai.Completion.create(
  model=FINE_TUNE, #Your model in FINE_TUNE,
  prompt=prompt,
  temperature=1,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop="\n",
  max_tokens=200
)

In [52]:
response

<OpenAIObject text_completion id=cmpl-7UNEuuYwLenT1iWurD4V05K5c97fr at 0x7fd4ec0e9940> JSON: {
  "id": "cmpl-7UNEuuYwLenT1iWurD4V05K5c97fr",
  "object": "text_completion",
  "created": 1687473528,
  "model": "ada:ft-personal-2023-06-22-06-06-12",
  "choices": [
    {
      "text": " nothing but a correction of the imbalanced supply of habits of the human mind.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 16,
    "total_tokens": 25
  }
}

In [53]:
import textwrap
generated_text = response['choices'][0]['text']

# Remove leading and trailing whitespace
generated_text = generated_text.strip()

# Convert to a pretty paragraph by replacing newline characters with spaces
single_line_response = generated_text.replace('\n', ' ')

# Use textwrap.fill to nicely format the paragraph to wrap at 80 characters (or whatever width you prefer)
wrapped_response = textwrap.fill(single_line_response, width=80)
print(wrapped_response)

nothing but a correction of the imbalanced supply of habits of the human mind.
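
The prompt used in this section ends with ` ->`, the separator the model saw during training. A small helper can add it consistently. This is a sketch only; the name `format_prompt` is hypothetical.

```python
PROMPT_SEPARATOR = " ->"

def format_prompt(text, separator=PROMPT_SEPARATOR):
    """Append the training separator unless the text already ends with it."""
    text = text.rstrip()
    if text.endswith(separator):
        return text
    return text + separator

print(format_prompt("Freedom can be a concept or a virtue"))
print(format_prompt("Freedom can be a concept or a virtue ->"))
```

Both calls print the same string, so the helper can be applied to user input whether or not the separator is already present.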


# 4. Managing the fine_tunes

In [None]:
# List all created fine-tunes
!openai api fine_tunes.list > fine_tunes.json
!openai api fine_tunes.list

**ChatGPT Plus (GPT-4) provided the following breakdown of the components of the JSON object:**

- `"object"`: This line specifies the type of object the JSON is representing. Here it is `fine-tune`, that is, a fine-tuning job.

- `"id"`: This is the unique identifier for this fine-tuning job. This ID is typically used to reference this specific instance of fine-tuning.

- `"hyperparams"`: These are the hyperparameters used for fine-tuning the model.
   - `"n_epochs"`: Number of epochs for the training, i.e., how many times the learning algorithm will work through the entire training dataset.
   - `"batch_size"`: The number of training examples used in one iteration (or update) of model parameters.
   - `"prompt_loss_weight"`: This is the weight assigned to the loss function of the prompts during training. A higher value places more emphasis on minimizing the loss of the prompts.
   - `"learning_rate_multiplier"`: This value is used to scale the learning rate during training. A lower value will cause the model to learn slower and vice versa.

- `"organization_id"`: This is the identifier for the organization account under which the fine-tuning operation was performed.

- `"model"`: The base model used for fine-tuning. In your case, it's `ada`, which is a version of GPT-3.

- `"training_files"`: This array contains information about the files used for training.
  - `"object"`: Specifies the object type, in this case, a file.
  - `"id"`: The unique identifier for this file.
  - `"purpose"`: The purpose of the file, here it's for fine-tuning.
  - `"filename"`: The name of the file.
  - `"bytes"`: The size of the file in bytes.
  - `"created_at"`: The UNIX timestamp for when the file was created.
  - `"status"`: The status of the file processing. Here it's processed.
  - `"status_details"`: Any extra details about the file's status. It's null here, meaning there are no extra details.

- `"validation_files"`: This would include similar details as `"training_files"`, but for any files used for validation during training. It's empty in your case.

- `"result_files"`: This is an array of files that store the result of the fine-tuning operation. The details of each file are similar to those in `"training_files"`.

- `"created_at"`: The UNIX timestamp indicating when this fine-tuning job was created.

- `"updated_at"`: The UNIX timestamp indicating the last time this fine-tuning job was updated.

- `"status"`: The status of the fine-tuning job. In this case, it has succeeded.

- `"fine_tuned_model"`: This is the unique identifier/name for the fine-tuned model.
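
To make the `n_epochs` and `batch_size` entries above concrete, the following arithmetic sketch counts parameter updates. The example count of 6127 comes from the data preparation output in this notebook; the batch size of 8 and the 4 epochs are hypothetical values, and counting the last partial batch as a full step is an assumption.

```python
import math

def training_steps(n_examples, batch_size, n_epochs):
    """Number of parameter updates: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch counts as one step
    return steps_per_epoch * n_epochs

print(training_steps(n_examples=6127, batch_size=8, n_epochs=4))  # 766 * 4 = 3064
```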
  
Remember that a UNIX timestamp is the number of seconds that have passed since 00:00:00 UTC on Thursday, 1 January 1970, not counting leap seconds. Python's `datetime` module can convert these values to more human-readable formats.
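
As a short illustration of that conversion, the sketch below turns the `created` value from the completion response in section 3 into a UTC date. The helper name `unix_to_utc` is illustrative.

```python
from datetime import datetime, timezone

def unix_to_utc(timestamp):
    """Convert a UNIX timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

created = 1687473528  # "created" field of the completion response
print(unix_to_utc(created).isoformat())  # 2023-06-22T22:38:48+00:00
```

Passing `tz=timezone.utc` keeps the result independent of the time zone of the machine that runs the notebook.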

In [None]:
response = openai.FineTune.retrieve("ft-daprSZy6dWb7KlN6WQxOeS0Y")
print(response)

In [55]:
import pandas as pd
import json
from datetime import datetime

# Load data from json file
with open('fine_tunes.json') as f:
    data = json.load(f)

# Convert to Pandas DataFrame:
df = pd.json_normalize(data['data'])

# Select specific columns
selected_columns = ['object', 'id', 'fine_tuned_model','status', 'created_at', 'updated_at']
df = df[selected_columns]

# Rename columns for display
column_mapping = {
    'object': 'Object',
    'id': 'ID',
    'fine_tuned_model': 'Fine_Tuned_Model',
    'filename':'Filename',
    'status': 'Status',
    'created_at': 'Created_At',
    'updated_at': 'Updated_At',
}
df = df.rename(columns=column_mapping)  # reassign instead of modifying a column selection in place

# Convert UNIX timestamp to standard format
df['Created_At'] = pd.to_datetime(df['Created_At'], unit='s')
df['Updated_At'] = pd.to_datetime(df['Updated_At'], unit='s')

df

|  | Object | ID | Fine_Tuned_Model | Status | Created_At | Updated_At |
|---|---|---|---|---|---|---|
| 0 | fine-tune | ft-qtHQMnZUBv0baFBR1flx5hsZ | curie:ft-user-u6to8fv7rsroe5cvku0zfumm-2021-09... | succeeded | 2021-09-08 21:18:40 | 2021-09-08 21:23:59 |
| 1 | fine-tune | ft-nGWGLgWPTonksokk6c0f2HMS | ada:ft-user-u6to8fv7rsroe5cvku0zfumm-2021-09-1... | succeeded | 2021-09-11 10:24:27 | 2021-09-11 10:26:25 |
| 2 | fine-tune | ft-L3ngtsDSZ3fzGn6ipm33AUsu | ada:ft-user-u6to8fv7rsroe5cvku0zfumm-2021-12-1... | succeeded | 2021-12-13 17:49:40 | 2021-12-13 18:13:14 |
| 3 | fine-tune | ft-lIxhjA4XxmPJ5KORsIGkb9oC | ada:ft-personal-2022-01-19-11-49-27 | succeeded | 2022-01-19 11:25:35 | 2022-01-19 11:49:33 |
| 4 | fine-tune | ft-sMSqhlu6PtGiYPpB6MYY7GAo | ada:ft-personal-2022-02-07-06-26-44 | succeeded | 2022-02-07 06:03:17 | 2022-02-07 06:26:50 |
| 5 | fine-tune | ft-lKF7fHigF7VrnkR75DDBhgxN | ada:ft-personal-2022-02-23-16-14-30 | succeeded | 2022-02-23 15:51:04 | 2022-02-23 16:14:36 |
| 6 | fine-tune | ft-RXueKKisaNy2fCOHJ5qLvPUL | ada:ft-personal-2022-02-23-16-37-57 | succeeded | 2022-02-23 16:02:39 | 2022-02-23 16:38:03 |
| 7 | fine-tune | ft-a2TbBRWy7g5n9zoQZmlQnDSi | ada:ft-personal-2022-02-23-17-01-14 | succeeded | 2022-02-23 16:06:45 | 2022-02-23 17:01:20 |
| 8 | fine-tune | ft-36tIItJL380NzhEqbMTLR3lv | ada:ft-personal-2022-02-23-17-41-22 | succeeded | 2022-02-23 16:11:50 | 2022-02-23 17:41:28 |
| 9 | fine-tune | ft-EEVutXUlc8MZck1RrcbrP2Qc | ada:ft-personal-2022-02-23-18-21-20 | succeeded | 2022-02-23 16:15:23 | 2022-02-23 18:21:25 |


In [None]:
# Delete a model
# Set FINE_TUNED_MODEL to a model name from the list of fine-tuned models
#FINE_TUNED_MODEL=[MODEL in list]
#try:
#  openai.Model.delete(FINE_TUNED_MODEL)
#except:
#  print("FINE_TUNED_MODEL not found")