# Fine-Tuning GPT-3

Copyright 2023 Denis Rothman

[OpenAI fine-tuning documentation](https://beta.openai.com/docs/guides/fine-tuning/)

Run this notebook cell by cell to:

1. prepare the data, or
2. fine-tune a model, or
3. run a fine-tuned model, or
4. manage the fine-tunes.

Each task can be performed independently in this notebook.
You can also copy any one of the four steps into a separate notebook.

## Installing OpenAI & Wandb

Restart the runtime after installing openai, then run the cell again to make sure that "import openai" is executed.

In [None]:
try:
  import openai
except ImportError:
  # Install the package only when the import fails
  !pip install openai
  import openai

## Your API Key

In [None]:
#API Key
#Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anybody next to you)
from google.colab import drive
drive.mount('/content/drive')
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()  # strip() removes the trailing newline

#The OpenAI Key
import os
os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")


Mounted at /content/drive
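
A minimal sketch of a more defensive way to load the key, assuming the same one-line key file as in the cell above. The helper name `load_api_key` and its `env_var` parameter are hypothetical and not part of the original notebook. It prefers an environment variable when one is set and strips surrounding whitespace, such as the newline that usually ends a text file.

```python
import os
from pathlib import Path

def load_api_key(path, env_var="OPENAI_API_KEY"):
    """Return the API key from the environment if set, else from a one-line file."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the key file; strip() removes the trailing newline.
        key = Path(path).read_text().strip()
    if not key:
        raise ValueError(f"No API key found in ${env_var} or {path}")
    return key

# Usage (path taken from the cell above):
# openai.api_key = load_api_key("drive/MyDrive/files/api_key.txt")
```

Checking the environment first keeps the key out of the notebook when it is run outside Colab.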


## Optional: Weights and Biases



Use W&B to build better models faster. Track and visualize all the pieces of your machine learning pipeline, from datasets to production models.

Quickly identify model regressions. Use W&B to visualize results in real time, all in a central dashboard.
Focus on the interesting ML. Spend less time manually tracking results in spreadsheets and text files.
Capture dataset versions with W&B Artifacts to identify how changing data affects your resulting models.
Reproduce any model, with saved code, hyperparameters, launch commands, input data, and resulting model weights.


In [None]:
try:
  import wandb
except ImportError:
  # Install the package only when the import fails
  !pip install wandb
  import wandb

# 1. Preparing the dataset

## 1.1. Preparing the data in JSON

In [None]:
#From Gutenberg to JSON
import nltk
nltk.download('punkt')
import requests
from bs4 import BeautifulSoup
import json
import re
from nltk.tokenize import sent_tokenize

# First, fetch the text of the book from Project Gutenberg
url = 'http://www.gutenberg.org/cache/epub/4280/pg4280.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get the text of the book and clean it up a bit
text = soup.get_text()
text = re.sub(r'\s+', ' ', text).strip()  # raw string avoids an invalid-escape warning

# Split the text into sentences
sentences = sent_tokenize(text)

# Define the separator and ending
prompt_separator = " ->"
completion_ending = "\n"

# Now create the prompts and completions
data = []
for i in range(len(sentences) - 1):
    data.append({
        "prompt": sentences[i] + prompt_separator,
        "completion": " " + sentences[i + 1] + completion_ending
    })

# Write the prompts and completions to a file, one JSON object per line
with open('kant_prompts_and_completions.json', 'w') as f:
    for line in data:
        f.write(json.dumps(line) + '\n')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
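
The pairing logic in the cell above can be sketched on a tiny input, without downloading the book. The function name `make_pairs` and the sample sentences are illustrative assumptions; the separator and ending are the ones the notebook defines.

```python
import json

def make_pairs(sentences, separator=" ->", ending="\n"):
    """Pair each sentence (prompt) with the sentence that follows it (completion)."""
    return [
        {"prompt": current + separator, "completion": " " + following + ending}
        for current, following in zip(sentences, sentences[1:])
    ]

# Hypothetical sample sentences standing in for the tokenized book text.
sample = ["First sentence.", "Second sentence.", "Third sentence."]
pairs = make_pairs(sample)
for pair in pairs:
    print(json.dumps(pair))  # one JSON object per line, i.e. JSON Lines
```

Each sentence except the last becomes a prompt, so `n` sentences yield `n - 1` prompt-completion pairs.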


In [None]:
import pandas as pd

# Load the data
df = pd.read_json('kant_prompts_and_completions.json', lines=True)
df

Unnamed: 0,prompt,completion
0,The Project Gutenberg Etext of The Critique of...,Be sure to check the copyright laws for your ...
1,Be sure to check the copyright laws for your c...,"We encourage you to keep this file, exactly a..."
2,"We encourage you to keep this file, exactly as...",Please do not remove this.\n
3,Please do not remove this. ->,This header should be the first thing seen wh...
4,This header should be the first thing seen whe...,Do not change or edit it without written perm...
...,...,...
6122,"78-79. is their motto, under which they may le...",As regards those who wish to pursue a scienti...
6123,As regards those who wish to pursue a scientif...,"When I mention, in relation to the former, th..."
6124,"When I mention, in relation to the former, the...",The critical path alone is still open.\n
6125,The critical path alone is still open. ->,If my reader has been kind and patient enough...


## 1.2. Converting the data to JSONL

Answer "Y" to all of the questions.

In [None]:
!openai tools fine_tunes.prepare_data -f "kant_prompts_and_completions.json"

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 6127 prompt-completion pairs
- All prompts end with suffix ` ->`
- All completions end with suffix `\n`

Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `kant_prompts_and_completions_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "kant_prompts_and_completions_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 1.44 hours to train a `cu
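
The advice in the output above (end the prompt with ` ->` and pass `stop=["\n"]`) can be sketched as a small helper that builds the request arguments. The name `build_completion_request` is hypothetical; the model name in the usage lines is the one that appears in this notebook's output, and the commented call is the legacy `openai.Completion.create` used later in the notebook.

```python
def build_completion_request(model, text, separator=" ->", stop="\n", max_tokens=256):
    """Build keyword arguments for a completion call that match the training format."""
    prompt = text.rstrip()
    if not prompt.endswith(separator):
        # The fine-tuned model expects the same separator it saw during training.
        prompt += separator
    return {"model": model, "prompt": prompt, "stop": [stop], "max_tokens": max_tokens}

request = build_completion_request(
    "ada:ft-personal-2023-06-19-17-41-30",
    "The critical path alone is still open.",
)
print(request["prompt"])
# response = openai.Completion.create(**request)  # requires an API key and a fine-tuned model
```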

In [None]:
import json

# Open the file and read the lines
with open('kant_prompts_and_completions_prepared.jsonl', 'r') as f:
    lines = f.readlines()

# Parse and print the first 5 lines
for line in lines[:5]:
    data = json.loads(line)
    print(json.dumps(data, indent=4))

{
    "prompt": "The Project Gutenberg Etext of The Critique of Pure Reason, by Immanuel Kant Copyright laws are changing all over the world. ->",
    "completion": " Be sure to check the copyright laws for your country before distributing this or any other Project Gutenberg file.\n"
}
{
    "prompt": "Be sure to check the copyright laws for your country before distributing this or any other Project Gutenberg file. ->",
    "completion": " We encourage you to keep this file, exactly as it is, on your own disk, thereby keeping an electronic path open for future readers.\n"
}
{
    "prompt": "We encourage you to keep this file, exactly as it is, on your own disk, thereby keeping an electronic path open for future readers. ->",
    "completion": " Please do not remove this.\n"
}
{
    "prompt": "Please do not remove this. ->",
    "completion": " This header should be the first thing seen when anyone starts to view the etext.\n"
}
{
    "prompt": "This header should be the first thing see

kant_prompts_and_completions_prepared.jsonl should now be generated.
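
Before uploading, it may be worth checking that every record in the prepared file follows the expected format. The sketch below is an assumption-labeled helper, not part of the OpenAI tooling: `check_prepared_file` is a hypothetical name, and the three checks mirror the separator, leading space, and ending used when the data was built.

```python
import json

def check_prepared_file(path, separator=" ->", ending="\n"):
    """Return (line_number, problem) tuples for records that break the expected format."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            if not record["prompt"].endswith(separator):
                problems.append((number, "prompt does not end with separator"))
            if not record["completion"].startswith(" "):
                problems.append((number, "completion does not start with a space"))
            if not record["completion"].endswith(ending):
                problems.append((number, "completion does not end with ending"))
    return problems

# Usage:
# print(check_prepared_file("kant_prompts_and_completions_prepared.jsonl"))
```

An empty list means that no record violated these three checks.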

# 2. Fine-tuning a model



When prompted for a file ID, enter the ID of an already uploaded file to reuse it, or press ENTER on an empty line to upload the file anyway.

In [None]:
!openai api fine_tunes.create -t "kant_prompts_and_completions_prepared.jsonl" -m "ada"

Found potentially duplicated files with name 'kant_prompts_and_completions_prepared.jsonl', purpose 'fine-tune' and size 2761402 bytes
file-3lFv23fBQwwNXxxOmpoxESUL
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: 
Upload progress: 100% 2.76M/2.76M [00:00<00:00, 1.53Git/s]
Uploaded file from kant_prompts_and_completions_prepared.jsonl: file-b3CAoITXZgD6NpAfAv4pWZY3
Created fine-tune: ft-sNOmVvZ3fu2cm3XNRCMQuI27
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-06-19 17:02:45] Created fine-tune: ft-sNOmVvZ3fu2cm3XNRCMQuI27

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-sNOmVvZ3fu2cm3XNRCMQuI27



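OpenAI handles many requests, so the event stream may be interrupted.
If your stream is interrupted, the output shows the command to run to resume following the fine-tune. Interrupting the stream does not cancel the fine-tune.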

In [None]:
# Uncomment this cell to activate the fine_tunes.follow instruction
#!openai api fine_tunes.follow -i [YOUR_FINE_TUNE]
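
As an alternative to following the stream, the status of a fine-tune can be polled until it reaches a final state. This is a generic sketch under stated assumptions: `wait_for_status` is a hypothetical helper, the terminal status names other than `succeeded` are assumed, and the caller supplies `fetch_status`. With the legacy pre-1.0 `openai` package, something like `lambda: openai.FineTune.retrieve(id="[YOUR_FINE_TUNE]")["status"]` could serve as `fetch_status`, but treat that call as an assumption to verify against your installed version.

```python
import time

def wait_for_status(fetch_status, done=("succeeded", "failed", "cancelled"),
                    interval=60.0, max_checks=120, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or checks run out."""
    status = None
    for _ in range(max_checks):
        status = fetch_status()
        if status in done:
            return status
        sleep(interval)  # wait before asking again
    return status  # last status seen; the job has not finished yet

# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["pending", "running", "succeeded"])
final = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

Passing `sleep` as a parameter keeps the demonstration instant while the default still waits between checks.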

The kant_prompts_and_completions_prepared.jsonl file, "file-......", is uploaded and the fine-tune is created:<br>
"Created fine-tune: ft-............." <br>
When the fine-tuning is over, copy the name of the fine-tuned model so that you can use it in the next section.

# 3. Running the fine-tuned GPT-3 model for a completion task

Note: If your fine-tuned model does not appear immediately after the end of the fine-tuning process, you might have to wait until it is processed by OpenAI. You can also:

1. go to the OpenAI Playground to test your model: https://platform.openai.com/playground

2. select your model in the dropdown list and test it in that environment

In [None]:
with open("drive/MyDrive/files/fine_tune.txt", "r") as f:
    FINE_TUNE = f.readline().strip() #load your saved model name from a file, or assign it directly to this variable

In [None]:
response = openai.Completion.create(
  model=FINE_TUNE, #Your model in FINE_TUNE
  prompt="What did Kant say about a priori concepts: \"We discover there nothing but the confusion and rectifications of philosophical possibility, whereby men thought that the objective consciousness of objects contained in thought was specified and contained in the objects themselves, and therefore that the objects were given by thought as things in themselves\"; or, \"Our conception of a thing gives us nothing but the construction of an object; and nothing what is added to or included in it adds cognition.\" (Both in the former case the conception must be mathematical, in the latter the existence of the object must be inferred.) -> To the pure understanding, therefore, a priori conceptions are still subordinate; they are merely proper forms of intuition, and are used only to represent objects in themselves, and to indicate the relation of the synthesis in their cognitions to each other.\n\nBut this may be explained as follows.\n\n1.\n\nThe functional connection established by our transcendental philosophy is brought out of the pure understanding, which had not the right (brutum) to take it, but rather was put under the weak supervision of the single faculty of the understanding, the mere change of which it does not at present intend to repair.\n\nIn construing any phenomenon, the understanding must remain always in one relation to each of its disjunctions",
  temperature=1,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

In [None]:
response

<OpenAIObject text_completion id=cmpl-7TDe9uFIFU8IFrQLMDCo49KYK8w6B at 0x7fa6b4d618f0> JSON: {
  "id": "cmpl-7TDe9uFIFU8IFrQLMDCo49KYK8w6B",
  "object": "text_completion",
  "created": 1687198325,
  "model": "ada:ft-personal-2023-06-19-17-41-30",
  "choices": [
    {
      "text": ", every wanted element being before removed; for example, everything in a phenomenon may have three possible phenomena (two of which are its cause and an effect, and the effect corresponding to this cause constituting the third).\n\nBut all the formalities of deduction by means of becoming and non-becoming, with which our metaphysical philosophy is full of occasion, and which make abstraction of the different becomings of a given thing without necessary participation in the being of such a thing, are a priori mere matters of understanding, and in no canonical fashion apocryphal; for no sensible phenomenon is more necessary than that the subject of the phenomenon (the substance of the object) should not be an

In [None]:
import textwrap
generated_text = response['choices'][0]['text']

# Remove leading and trailing whitespace
generated_text = generated_text.strip()

# Convert to a pretty paragraph by replacing newline characters with spaces
single_line_response = generated_text.replace('\n', ' ')

# Use textwrap.fill to nicely format the paragraph to wrap at 80 characters (or whatever width you prefer)
wrapped_response = textwrap.fill(single_line_response, width=80)
print(wrapped_response)

, every wanted element being before removed; for example, everything in a
phenomenon may have three possible phenomena (two of which are its cause and an
effect, and the effect corresponding to this cause constituting the third).  But
all the formalities of deduction by means of becoming and non-becoming, with
which our metaphysical philosophy is full of occasion, and which make
abstraction of the different becomings of a given thing without necessary
participation in the being of such a thing, are a priori mere matters of
understanding, and in no canonical fashion apocryphal; for no sensible
phenomenon is more necessary than that the subject of the phenomenon (the
substance of the object) should not be an object in itself.  2.  The natural
function of the understanding is to make abstraction of the constants of the
synthesis, and to distinguish the arising condition from the rest; just as
corporeal phenomenon is considered (in a manner wholly different from ours) as
having become as i

In [None]:
#!openai api completions.create -m ada:[YOUR_MODEL INFO] -p "Several concepts are a priori such as"

# 4. Managing the fine-tunes

In [None]:
# List all created fine-tunes
!openai api fine_tunes.list > fine_tunes.json

In [None]:
#displaying the fine_tunes
import json
import pandas as pd

# Load the data
with open('fine_tunes.json') as f:
    data = json.load(f)

# Normalize the data
df = pd.json_normalize(data['data'])  # for pandas >= 1.0.0

df

Unnamed: 0,object,id,organization_id,model,training_files,validation_files,result_files,created_at,updated_at,status,fine_tuned_model,hyperparams.n_epochs,hyperparams.batch_size,hyperparams.use_packing,hyperparams.weight_decay,hyperparams.prompt_loss_weight,hyperparams.learning_rate_multiplier
0,fine-tune,ft-qtHQMnZUBv0baFBR1flx5hsZ,org-h2Kjmcir4wyGtqq1mJALLGIb,curie,"[{'object': 'file', 'id': 'file-vTxiSW78AF8InU...",[],"[{'object': 'file', 'id': 'file-UaYjZY2bHGqWuj...",1631135920,1631136239,succeeded,curie:ft-user-u6to8fv7rsroe5cvku0zfumm-2021-09...,4,4.0,,0.0,0.1,0.1
1,fine-tune,ft-nGWGLgWPTonksokk6c0f2HMS,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-02USJi2L1HsPT6...",[],"[{'object': 'file', 'id': 'file-UYvXR6E7GbFxZc...",1631355867,1631355985,succeeded,ada:ft-user-u6to8fv7rsroe5cvku0zfumm-2021-09-1...,4,4.0,,0.0,0.1,0.1
2,fine-tune,ft-L3ngtsDSZ3fzGn6ipm33AUsu,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-qimfi96VAFSF0T...",[],"[{'object': 'file', 'id': 'file-YBcjkAQNBPPgvI...",1639417780,1639419194,succeeded,ada:ft-user-u6to8fv7rsroe5cvku0zfumm-2021-12-1...,4,16.0,,,0.1,0.1
3,fine-tune,ft-lIxhjA4XxmPJ5KORsIGkb9oC,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-EctKSBtTlrS9NQ...",[],"[{'object': 'file', 'id': 'file-Yft1CLk4bbhP2g...",1642591535,1642592973,succeeded,ada:ft-personal-2022-01-19-11-49-27,4,16.0,,,0.1,0.05
4,fine-tune,ft-sMSqhlu6PtGiYPpB6MYY7GAo,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-UinyhlDufZFNu6...",[],"[{'object': 'file', 'id': 'file-tr1TYWkzuCC41m...",1644213797,1644215210,succeeded,ada:ft-personal-2022-02-07-06-26-44,4,16.0,,,0.1,0.05
5,fine-tune,ft-lKF7fHigF7VrnkR75DDBhgxN,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-Wv47J7Gb2grLgq...",[],"[{'object': 'file', 'id': 'file-TEZobr9rL36Ijw...",1645631464,1645632876,succeeded,ada:ft-personal-2022-02-23-16-14-30,4,16.0,,,0.1,0.05
6,fine-tune,ft-RXueKKisaNy2fCOHJ5qLvPUL,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-C0w7OaO5tWvfFP...",[],"[{'object': 'file', 'id': 'file-O0zBrRhGS5NsqS...",1645632159,1645634283,succeeded,ada:ft-personal-2022-02-23-16-37-57,4,16.0,,,0.1,0.05
7,fine-tune,ft-a2TbBRWy7g5n9zoQZmlQnDSi,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-Q7NISSHxkCwxEt...",[],"[{'object': 'file', 'id': 'file-eHtkVJbek6iWpZ...",1645632405,1645635680,succeeded,ada:ft-personal-2022-02-23-17-01-14,4,16.0,,,0.1,0.05
8,fine-tune,ft-36tIItJL380NzhEqbMTLR3lv,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-rTZwzsFUTiYm9m...",[],"[{'object': 'file', 'id': 'file-MXPKw0Vmy0Gxc8...",1645632710,1645638088,succeeded,ada:ft-personal-2022-02-23-17-41-22,4,16.0,,,0.1,0.05
9,fine-tune,ft-EEVutXUlc8MZck1RrcbrP2Qc,org-h2Kjmcir4wyGtqq1mJALLGIb,ada,"[{'object': 'file', 'id': 'file-k7uTTMPn8lxLEq...",[],"[{'object': 'file', 'id': 'file-34W2ab2B9yZXse...",1645632923,1645640485,succeeded,ada:ft-personal-2022-02-23-18-21-20,4,16.0,,,0.1,0.05
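
The `created_at` and `updated_at` columns are Unix timestamps, which are hard to read in the table. The sketch below keeps only succeeded fine-tunes and converts the creation time to UTC. The function name `summarize_fine_tunes` is hypothetical, the first record is copied from the table above, and the second record is invented to show the filter.

```python
from datetime import datetime, timezone

def summarize_fine_tunes(listing):
    """Return id, base model, creation time (UTC) and model name of succeeded fine-tunes."""
    rows = []
    for job in listing.get("data", []):
        if job.get("status") != "succeeded":
            continue  # skip jobs that are pending, running, or failed
        created = datetime.fromtimestamp(job["created_at"], tz=timezone.utc)
        rows.append({
            "id": job["id"],
            "model": job["model"],
            "created": created.isoformat(),
            "fine_tuned_model": job["fine_tuned_model"],
        })
    return rows

# One record copied from the table above, plus a hypothetical unfinished job.
listing = {"data": [
    {"id": "ft-lIxhjA4XxmPJ5KORsIGkb9oC", "model": "ada", "status": "succeeded",
     "created_at": 1642591535,
     "fine_tuned_model": "ada:ft-personal-2022-01-19-11-49-27"},
    {"id": "ft-hypothetical", "model": "ada", "status": "pending",
     "created_at": 1642591600, "fine_tuned_model": None},
]}
summary = summarize_fine_tunes(listing)
print(summary)
```

The same function could be applied to the `data` dictionary loaded from `fine_tunes.json` in the cell above.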


In [None]:
# Retrieve a fine-tune and use it in Step 3
#!openai api fine_tunes.get -i [YOUR_MODEL_INFO]