# Fine-tuning classification example
Copyright 2024, Denis Rothman MIT License

This notebook uses OpenAI to fine-tune babbage-002, a model based on the GPT-3 architecture, as a classifier that distinguishes between two sports: baseball and hockey.

The code complies with OpenAI's January 4, 2024 API update.



# Installing OpenAI

In [None]:
try:
  import openai
except ImportError:
  !pip install openai
  import openai

In [2]:
# You can retrieve your API key from a file (1)
# or enter it manually (2).

# Comment out this cell if you want to enter your key manually.
# (1) Retrieve the API key from a file.
# Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anyone next to you).
from google.colab import drive
drive.mount('/content/drive')
with open("drive/MyDrive/files/api_key.txt", "r") as f:
  API_KEY = f.readline().strip()  # strip the trailing newline

Mounted at /content/drive


In [3]:
# (2) Enter your key manually by
# replacing API_KEY with your key.
# The OpenAI key
import os
os.environ['OPENAI_API_KEY'] = API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
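
As an illustration only (it is not part of the original notebook), the file-based option can be wrapped in a small helper. `read_api_key` is a hypothetical name; the sketch assumes the key sits on the first line of a plain text file and strips the trailing newline that `readline()` would otherwise keep.

```python
from pathlib import Path


def read_api_key(path):
    """Return the first line of a key file with surrounding whitespace removed."""
    with Path(path).open("r", encoding="utf-8") as f:
        key = f.readline().strip()
    if not key:
        # Fail early instead of sending an empty key to the API
        raise ValueError(f"No API key found in {path}")
    return key
```

It could then be used as `os.environ['OPENAI_API_KEY'] = read_api_key("drive/MyDrive/files/api_key.txt")`.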

# 1. Preparing the dataset

Retrieving the dataset
 The 20 Newsgroups dataset can be loaded using scikit-learn (sklearn).

In [4]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

First we will look at the data itself:

In [5]:
print(sports_dataset['data'][0])

From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu       

In [6]:
sports_dataset.target_names[sports_dataset['target'][0]]

'rec.sport.baseball'

In [7]:
len_all = len(sports_dataset.data)
len_baseball = sum(1 for e in sports_dataset.target if e == 0)
len_hockey = sum(1 for e in sports_dataset.target if e == 1)
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")

Total examples: 1197, Baseball examples: 597, Hockey examples: 600


One sample from the baseball category is shown above. It is an email to a mailing list. There are 1,197 examples in total, split almost evenly between the two sports: 597 baseball and 600 hockey.
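
The class balance can also be checked with a counter. The sketch below uses toy stand-ins for `sports_dataset.target` and `sports_dataset.target_names` so that it runs without downloading the dataset; the values are made up for illustration.

```python
from collections import Counter

# Toy stand-ins for sports_dataset.target and sports_dataset.target_names
target = [0, 1, 0, 1, 1]
target_names = ["rec.sport.baseball", "rec.sport.hockey"]

# Count how many examples fall into each named category
counts = Counter(target_names[t] for t in target)
for name, n in counts.items():
    print(f"{name}: {n} ({n / len(target):.1%})")
```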

## 1.1. Preparing the data in JSON

We transform the dataset into a pandas DataFrame with a `prompt` column and a `completion` column. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. For demonstration purposes and faster fine-tuning, the dataset can be limited to 300 examples by uncommenting the `[:300]` slice; as written, the code keeps all examples. In a real use case, more examples generally improve performance.

In [8]:
import pandas as pd

labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
df.head()

Unnamed: 0,prompt,completion
0,From: dougb@comm.mot.com (Doug Bank)\nSubject:...,baseball
1,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,hockey
2,From: rudy@netcom.com (Rudy Wade)\nSubject: Re...,baseball
3,From: monack@helium.gas.uug.arizona.edu (david...,hockey
4,Subject: Let it be Known\nFrom: <ISSBTL@BYUVM....,baseball


Both baseball and hockey are single tokens. We save the dataset as a jsonl file.

In [9]:
df.to_json("sport2.jsonl", orient='records', lines=True)
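
To make the JSONL layout concrete, here is a minimal sketch using only the standard library. The two records are hypothetical toy data in the same prompt/completion layout as `df`; the point is that each record becomes one JSON object on its own line.

```python
import json

# Hypothetical toy records in the same prompt/completion layout as df
records = [
    {"prompt": "From: a@example.com\nSubject: Power play", "completion": "hockey"},
    {"prompt": "From: b@example.com\nSubject: Bullpen", "completion": "baseball"},
]

# JSONL: one JSON object per line; newlines inside strings are escaped by json.dumps
jsonl_text = "".join(json.dumps(r) + "\n" for r in records)

# Reading it back line by line recovers the original records
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```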

## 1.2. Converting the data to JSONL
We can now use a data preparation tool that suggests a few improvements to our dataset before fine-tuning. Before launching the tool, make sure the installed openai library is up to date so that the latest data preparation tool is used. We additionally specify `-q`, which auto-accepts all suggestions.

In [10]:
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q

Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts e

The tool suggests a few improvements to the dataset and splits it into a training set and a validation set.

A suffix between the prompt and the completion tells the model that the input text has ended and that it now needs to predict the class. Since every example uses the same separator, the model can learn that it is meant to predict either baseball or hockey after the separator.
A whitespace prefix in completions is useful because most word tokens are tokenized with a leading space.
The tool also recognized that this is likely a classification task, so it suggested splitting the dataset into training and validation sets. This makes it easy to measure expected performance on new data.
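
The three adjustments described above can be sketched in plain Python. This is not the tool's actual implementation: the separator string, the 20% validation fraction, the shuffling, and the `prepare` helper are all assumptions made for illustration.

```python
import random

SEPARATOR = "\n\n###\n\n"  # assumed separator; any fixed string absent from the prompts would do


def prepare(records, valid_fraction=0.2, seed=42):
    """Append the separator to prompts, prefix completions with a space, then split."""
    prepared = [
        {"prompt": r["prompt"] + SEPARATOR, "completion": " " + r["completion"]}
        for r in records
    ]
    rng = random.Random(seed)
    rng.shuffle(prepared)  # shuffle so both splits contain a mix of classes
    n_valid = max(1, int(len(prepared) * valid_fraction))
    return prepared[n_valid:], prepared[:n_valid]


# Toy records standing in for the real prompt/completion pairs
records = [
    {"prompt": f"email {i}", "completion": "hockey" if i % 2 else "baseball"}
    for i in range(10)
]
train, valid = prepare(records)
```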

Creating the training and validation files on OpenAI

In [11]:
from openai import OpenAI
client = OpenAI()

file_response=client.files.create(
  file=open("/content/sport2_prepared_train.jsonl", "rb"),
  purpose='fine-tune'
)

In [12]:
# Extract the training file ID
training_file_id = file_response.id
print(training_file_id)

file-rVsREccI6yBWGQxTU1opDUJn


In [13]:
file_response=client.files.create(
  file=open("/content/sport2_prepared_valid.jsonl", "rb"),
  purpose='fine-tune'
)

In [14]:
# Extract the validation file ID
validation_file_id = file_response.id
print(validation_file_id)

file-mjMIdOhqe7MGo0TphU8meQK7
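
Before uploading, it can be useful to confirm locally that every line of a file is valid JSON with the expected keys. The sketch below is a hypothetical helper, `check_jsonl`, and is not part of the OpenAI library; it assumes the prompt/completion format used in this notebook.

```python
import json


def check_jsonl(path, required_keys=("prompt", "completion")):
    """Return the number of valid records, raising ValueError on the first bad line."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: invalid JSON ({exc})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            count += 1
    return count
```

For example, `check_jsonl("/content/sport2_prepared_train.jsonl")` would return the number of training records if the file is well formed.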


# 2. Fine-tuning an original model


In [49]:
from openai import OpenAI
client = OpenAI()

job_response=client.fine_tuning.jobs.create(
  training_file=training_file_id,
  validation_file=validation_file_id,
  model="babbage-002"
)

In [50]:
job_id = job_response.id
print(job_id)

ftjob-YHNFOGUAXRAT9vjyyI1956N2


When the model has been trained, we can use it for an inference task.

In [51]:
# Check the status of the fine-tuning job
job_details = client.fine_tuning.jobs.retrieve(job_id)
print(job_details)

FineTuningJob(id='ftjob-YHNFOGUAXRAT9vjyyI1956N2', created_at=1705058470, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='babbage-002', object='fine_tuning.job', organization_id='org-h2Kjmcir4wyGtqq1mJALLGIb', result_files=[], status='running', trained_tokens=None, training_file='file-rVsREccI6yBWGQxTU1opDUJn', validation_file='file-mjMIdOhqe7MGo0TphU8meQK7')


In [52]:
# There may be a time lapse:
# 1.between the moment you run the fine-tuning job and its completion
# 2.between its completion and the server updates
# Check your email if you have activated OpenAI notifications
status = job_details.status
print(f"Job status: {status}")

Job status: running
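
Since the job may still be running when the status is first checked, a simple polling loop can wait for it to finish. The sketch below is an assumption-laden illustration: `wait_for_job` and `fetch_status` are hypothetical names, and the status source is passed in as a callable so that the loop can be tried without calling the API.

```python
import time

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)  # wait before asking again
    return status
```

In this notebook, the callable could be `lambda: client.fine_tuning.jobs.retrieve(job_id).status`. Once the status is `succeeded`, the `fine_tuned_model` field of the retrieved job holds the name of the model to use for inference.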


# 3. Running the fine-tuned GPT-3 model

We will now run the model for a classification task.

Let's first retrieve the details of 'sport2_prepared_valid.jsonl' using the file ID we used for the validation file during the training process.

In [53]:
file_id=validation_file_id
file_details = client.files.retrieve(file_id)
print(file_details)

FileObject(id='file-mjMIdOhqe7MGo0TphU8meQK7', bytes=387349, created_at=1705056210, filename='sport2_prepared_valid.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)


In [54]:
test = pd.read_json('sport2_prepared_valid.jsonl', lines=True)
test.head(10)

Unnamed: 0,prompt,completion
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,hockey
1,From: smorris@venus.lerc.nasa.gov (Ron Morris ...,hockey
2,From: golchowy@alchemy.chem.utoronto.ca (Geral...,hockey
3,From: krattige@hpcc01.corp.hp.com (Kim Krattig...,baseball
4,From: warped@cs.montana.edu (Doug Dolven)\nSub...,baseball
5,From: jerry@sheldev.shel.isc-br.com (Gerald La...,baseball
6,From: kime@mongoose.torolab.ibm.com (Edward Ki...,baseball
7,From: cmk@athena.mit.edu (Charles M Kozierok)\...,baseball
8,From: pkortela@snakemail.hut.fi (Petteri Korte...,hockey
9,From: gak@wrs.com (Richard Stueven)\nSubject: ...,hockey
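
The validation set can be used to estimate accuracy by comparing predicted labels with the stored completions. The sketch below is hypothetical: `accuracy` takes any `classify` function, and `toy_classify` is a keyword rule standing in for the fine-tuned model so that the example runs offline.

```python
def accuracy(examples, classify):
    """Fraction of examples whose predicted label matches the stored completion."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        classify(prompt).strip() == completion.strip()
        for prompt, completion in examples
    )
    return correct / len(examples)


# Toy stand-in for the fine-tuned model: a keyword rule (hypothetical)
def toy_classify(prompt):
    return "hockey" if "puck" in prompt.lower() else "baseball"


# Toy examples; completions carry a leading space, which strip() removes
examples = [
    ("The puck hit the post", " hockey"),
    ("A two-run homer in the ninth", " baseball"),
    ("Great goaltending last night", " hockey"),
]
score = accuracy(examples, toy_classify)
```

With the real data, `examples` could be built as `list(zip(test['prompt'], test['completion']))`, and `classify` would wrap a call to the fine-tuned model.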


We can now call the model to get the predictions.

## Example 1

In [60]:
# text to classify
text_content = test['prompt'][0]
print(text_content)

From: gld@cunixb.cc.columbia.edu (Gary L Dare)
Subject: Re: Flames Truly Brutal in Loss
Nntp-Posting-Host: cunixb.cc.columbia.edu
Reply-To: gld@cunixb.cc.columbia.edu (Gary L Dare)
Organization: PhDs In The Hall
Distribution: na
Lines: 13


This game would have been great as part of a double-header on ABC or
ESPN; the league would have been able to push back-to-back wins by
Le Magnifique and The Great One.  Unfortunately, the only network
that would have done that was SCA, seen in few areas and hard to
justify as a pay channel. )-;

gld
--
~~~~~~~~~~~~~~~~~~~~~~~~ Je me souviens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gary L. Dare
> gld@columbia.EDU 			GO  Winnipeg Jets  GO!!!
> gld@cunixc.BITNET			Selanne + Domi ==> Stanley

###




In [83]:
text_content="Classify the following email into a hockey or baseball label: " + text_content

# Classify the new email using the fine-tuned model
response = client.completions.create(
    model="ft:babbage-002:personal::8g9QuR5t",
    prompt=text_content,
)
print(response)

Completion(id='cmpl-8gAH65JBINbai5HGFlp6tZl6sT5BG', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=None, text='')], created=1705059964, model='ft:babbage-002:personal::8g9QuR5t', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=None, prompt_tokens=415, total_tokens=415))


In [84]:
# Formatting the response
print("Completion ID:", response.id)
print("Created:", response.created)
print("Model:", response.model)
print("Object Type:", response.object)

# Formatting the choices
for i, choice in enumerate(response.choices):
    print(f"Choice {i}:")
    print("  Finish Reason:", choice.finish_reason)
    print("  Index:", choice.index)
    print("  Logprobs:", choice.logprobs)
    print("  Text:", choice.text)

# Formatting the usage
print("Usage:")
print("  Completion Tokens:", response.usage.completion_tokens)
print("  Prompt Tokens:", response.usage.prompt_tokens)
print("  Total Tokens:", response.usage.total_tokens)

# If system_fingerprint is available, print it
if response.system_fingerprint:
    print("System Fingerprint:", response.system_fingerprint)

Completion ID: cmpl-8gAH65JBINbai5HGFlp6tZl6sT5BG
Created: 1705059964
Model: ft:babbage-002:personal::8g9QuR5t
Object Type: text_completion
Choice 0:
  Finish Reason: stop
  Index: 0
  Logprobs: None
  Text: 
Usage:
  Completion Tokens: None
  Prompt Tokens: 415
  Total Tokens: 415
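
The training data consisted of a prompt, a separator, and a one-word completion, so a request that mirrors that layout is a reasonable variation to try. The sketch below only builds the keyword arguments and makes no API call. `build_request` is a hypothetical helper, the separator is assumed to match the one in the prepared file, and the parameter values are suggestions rather than settings taken from the notebook.

```python
SEPARATOR = "\n\n###\n\n"  # assumed to match the separator used in the training file


def build_request(email_text, model):
    """Build keyword arguments for client.completions.create (no API call is made here)."""
    prompt = email_text
    if not prompt.endswith(SEPARATOR):
        prompt = prompt.rstrip() + SEPARATOR
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 1,   # the label is expected to be a single token
        "temperature": 0,  # pick the most likely token
        "logprobs": 2,     # also return log probabilities for the top two tokens
    }


request = build_request(
    "Subject: Re: Flames Truly Brutal in Loss",
    "ft:babbage-002:personal::8g9QuR5t",
)
# response = client.completions.create(**request)
```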


## Example 2

In [106]:
# text to classify
text_content = test['prompt'][10]
print(text_content)

From: fester@island.COM (Mike Fester)
Subject: Re: Notes on Jays vs. Indians Series
Organization: /usr/local/rn/organization
Distribution: na
Lines: 38

In article <1993Apr13.221704.4291@midway.uchicago.edu> thf2@midway.uchicago.edu writes:
>In article <rudyC5FxC8.DEu@netcom.com> rudy@netcom.com (Rudy Wade) writes:
>>In article <1993Apr13.195301.22652@CSD-NewsHost.Stanford.EDU> nlu@Xenon.Stanford.EDU (Nelson Lu) writes:
>>>Guess which line is which:
>>>	BA	OBP	SLG	AB	H	2B	3B	HR	BB
>>>X	.310	.405	.427	571	177	27	8	8	87
>>>Y	.312	.354	.455	657	205	32	1	20	35
>>
>>>The walks should give it away.  OBP's, in general, somewhat more valuable than
>>>slugging, and Alomar's edge in OBP was quite a bit larger than Baerga's edge
>>>in slugging.
>>
>>I'm no SDCN, but what's more valuable:
>>
>>28 hits w/5 more doubles, 12 more HRs   OR
>>7 more triples and 52 BBs?  (Let's not forget the 39 extra SBs. How many CS?)
>
>Alomar had 9 CS.  Baerga had 2.
>
>Don't forget the 59 more outs Baerga had (his 

In [107]:
text_content="Classify the following email into a hockey or baseball label: " + text_content
print(text_content)

Classify the following email into a hockey or baseball label: From: fester@island.COM (Mike Fester)
Subject: Re: Notes on Jays vs. Indians Series
Organization: /usr/local/rn/organization
Distribution: na
Lines: 38

In article <1993Apr13.221704.4291@midway.uchicago.edu> thf2@midway.uchicago.edu writes:
>In article <rudyC5FxC8.DEu@netcom.com> rudy@netcom.com (Rudy Wade) writes:
>>In article <1993Apr13.195301.22652@CSD-NewsHost.Stanford.EDU> nlu@Xenon.Stanford.EDU (Nelson Lu) writes:
>>>Guess which line is which:
>>>	BA	OBP	SLG	AB	H	2B	3B	HR	BB
>>>X	.310	.405	.427	571	177	27	8	8	87
>>>Y	.312	.354	.455	657	205	32	1	20	35
>>
>>>The walks should give it away.  OBP's, in general, somewhat more valuable than
>>>slugging, and Alomar's edge in OBP was quite a bit larger than Baerga's edge
>>>in slugging.
>>
>>I'm no SDCN, but what's more valuable:
>>
>>28 hits w/5 more doubles, 12 more HRs   OR
>>7 more triples and 52 BBs?  (Let's not forget the 39 extra SBs. How many CS?)
>
>Alomar had 9 CS.  

In [104]:
# Classify the new email using the fine-tuned model
response = client.completions.create(
    model="ft:babbage-002:personal::8g9QuR5t",
    prompt=text_content,

)
print(response)

Completion(id='cmpl-8gAX2Me2LNy5h1lHpKNWV3p1U66as', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=', rom cold, crushing defeat.\nSAMPLE Email 7: Let us break our')], created=1705060952, model='ft:babbage-002:personal::8g9QuR5t', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=18, total_tokens=34))


In [105]:
# Formatting the response
print("Completion ID:", response.id)
print("Created:", response.created)
print("Model:", response.model)
print("Object Type:", response.object)

# Formatting the choices
for i, choice in enumerate(response.choices):
    print(f"Choice {i}:")
    print("  Finish Reason:", choice.finish_reason)
    print("  Index:", choice.index)
    print("  Logprobs:", choice.logprobs)
    print("  Text:", choice.text)

# Formatting the usage
print("Usage:")
print("  Completion Tokens:", response.usage.completion_tokens)
print("  Prompt Tokens:", response.usage.prompt_tokens)
print("  Total Tokens:", response.usage.total_tokens)

# If system_fingerprint is available, print it
if response.system_fingerprint:
    print("System Fingerprint:", response.system_fingerprint)

Completion ID: cmpl-8gAX2Me2LNy5h1lHpKNWV3p1U66as
Created: 1705060952
Model: ft:babbage-002:personal::8g9QuR5t
Object Type: text_completion
Choice 0:
  Finish Reason: length
  Index: 0
  Logprobs: None
  Text: , rom cold, crushing defeat.
SAMPLE Email 7: Let us break our
Usage:
  Completion Tokens: 16
  Prompt Tokens: 18
  Total Tokens: 34
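
Because a completion is free text, it helps to check that it actually contains one of the expected labels before treating it as a prediction. The sketch below uses a hypothetical `parse_label` helper that returns `None` for anything that does not start with a known label, as is the case for the text returned in Example 2.

```python
LABELS = ("baseball", "hockey")


def parse_label(completion_text):
    """Return the label if the completion starts with one, otherwise None."""
    cleaned = completion_text.strip().lower()
    for label in LABELS:
        if cleaned.startswith(label):
            return label
    return None
```

For example, `parse_label(response.choices[0].text)` would yield `None` for both responses shown above, which signals that the output should not be counted as a classification.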
