## Fine-Tuning an Ada Classifier to Distinguish Between Two Sports: Baseball and Hockey

In [3]:
! pip install --quiet openai

In [5]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ["rec.sport.baseball", "rec.sport.hockey"]
sports_dataset = fetch_20newsgroups(subset="train", shuffle=True, random_state=42, categories=categories)

## Data Exploration
##### The 20 Newsgroups dataset can be loaded using sklearn. First, we look at the data itself.

In [6]:
print(sports_dataset["data"][0])

From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu       

In [7]:
sports_dataset["target_names"]

['rec.sport.baseball', 'rec.sport.hockey']

In [8]:
sports_dataset.target_names[sports_dataset["target"][0]]

'rec.sport.baseball'

In [9]:
sports_dataset.target_names[sports_dataset["target"][1]]

'rec.sport.hockey'

In [12]:
len_all = len(sports_dataset["data"])
# Count examples through their integer targets; iterating over target_names only ever sees the two class names.
len_baseball = sum(1 for t in sports_dataset["target"] if "baseball" in sports_dataset.target_names[t])
len_hockey = sum(1 for t in sports_dataset["target"] if "hockey" in sports_dataset.target_names[t])
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")

Total examples: 1197, Baseball examples: 597, Hockey examples: 600


##### One sample from the baseball category is shown above. It is an email to a mailing list. The dataset has 1197 examples in total, split almost evenly between the two sports.
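
The class balance can be checked from the integer targets rather than from the label names. A minimal sketch, assuming a `target` sequence of class indices and a `target_names` list shaped like the ones `fetch_20newsgroups` returns; `count_per_class` is a hypothetical helper and the toy values below are made up, standing in for `sports_dataset.target`.

```python
from collections import Counter

def count_per_class(target, target_names):
    """Return a {class_name: count} mapping from integer class indices."""
    counts = Counter(int(t) for t in target)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Toy stand-in for sports_dataset.target / sports_dataset.target_names (illustrative values).
toy_target = [0, 1, 0, 1, 1]
toy_names = ["rec.sport.baseball", "rec.sport.hockey"]
class_counts = count_per_class(toy_target, toy_names)
print(class_counts)
```

On the real data the call would be `count_per_class(sports_dataset.target, sports_dataset.target_names)`.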

# Data Preparation
##### We transform the dataset into a pandas DataFrame with one column for the prompt and one for the completion. The prompt contains the emails from the mailing list, and the completion is the name of the sport, either hockey or baseball. For a quicker demonstration the data can be cut down to 300 examples; the code here keeps all 1197. In a real use case, more examples generally give better performance.

In [14]:
labels = [sports_dataset.target_names[x].split(".")[-1] for x in sports_dataset["target"]]
texts = [text.strip() for text in sports_dataset["data"]]
df = pd.DataFrame(zip(texts, labels), columns=["prompt", "completion"])  # append [:300] to keep only 300 examples
df.head()

|   | prompt | completion |
|---|--------|------------|
| 0 | From: dougb@comm.mot.com (Doug Bank)\nSubject:... | baseball |
| 1 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | hockey |
| 2 | From: rudy@netcom.com (Rudy Wade)\nSubject: Re... | baseball |
| 3 | From: monack@helium.gas.uug.arizona.edu (david... | hockey |
| 4 | Subject: Let it be Known\nFrom: <ISSBTL@BYUVM.... | baseball |


##### Both baseball and hockey are single tokens. We save the dataset as a JSONL file.

In [15]:
df.to_json("sport2.json", orient="records", lines=True)
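
Because `lines=True` writes one JSON object per line, the result is JSON Lines even though the file is named `sport2.json`. A minimal sketch of that layout using only the standard library; the two records and the temporary file path are illustrative and not taken from the dataset.

```python
import json
import os
import tempfile

# Illustrative records with the same two columns as the DataFrame above.
records = [
    {"prompt": "From: someone@example.com\nSubject: Re: tickets", "completion": "baseball"},
    {"prompt": "From: other@example.com\nSubject: Re: playoffs", "completion": "hockey"},
]

# Write one JSON object per line; newlines inside a prompt are escaped by json.dumps.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line and check that every row has both keys.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

assert all(set(row) == {"prompt", "completion"} for row in loaded)
print(len(loaded), "rows read back")
```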

# Data Preparation Tool
##### We can now use a data preparation tool, which suggests a few improvements to the dataset before fine-tuning. Before launching the tool, update the openai library so that the latest version of the tool is used. We additionally specify `-q`, which auto-accepts all suggestions.

In [19]:
!openai tools fine_tunes.prepare_data -f sport2.json -q

Analyzing...

- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detai
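
The separator and held-out-set suggestions above can also be applied by hand. A rough sketch, assuming `\n\n###\n\n` as the separator, a leading space on each completion, and an 80/20 train/validation split; all three are illustrative choices, not values confirmed by the truncated tool output, and `prepare_rows` is a hypothetical helper.

```python
import random

SEPARATOR = "\n\n###\n\n"  # assumed separator; it should be a string that does not occur in the prompts

def prepare_rows(rows, valid_fraction=0.2, seed=42):
    """Append the separator to prompts, prefix completions with a space, then split."""
    prepared = [
        {"prompt": r["prompt"] + SEPARATOR, "completion": " " + r["completion"].strip()}
        for r in rows
    ]
    shuffled = prepared[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    # The first n_valid shuffled rows are held out; the rest are used for training.
    return shuffled[n_valid:], shuffled[:n_valid]

# Ten invented rows standing in for the real prompt/completion pairs.
rows = [{"prompt": f"post {i}", "completion": "hockey" if i % 2 else "baseball"} for i in range(10)]
train_rows, valid_rows = prepare_rows(rows)
print(len(train_rows), len(valid_rows))
```

Whatever separator is used during training has to be appended to prompts at inference time as well.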

# Fine-Tune



In [None]:
import os
from getpass import getpass

# The openai CLI has no standalone command for storing a key.
# Read the key into the environment at run time so it never appears in a cell or its output.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")


In [None]:
from openai import OpenAI
import os
from google.colab import userdata

# Initialize the client. In Colab, store the key under Secrets as OPENAI_API_KEY
# and read it from there rather than pasting it into the notebook.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Upload training file
# Make sure 'sport2_prepared_train.jsonl' exists in your files
try:
    with open("sport2_prepared_train.jsonl", "rb") as f:
        training_file = client.files.create(
            file=f,
            purpose="fine-tune"
        )
    print(f"Training file uploaded with ID: {training_file.id}")
except FileNotFoundError:
    print("Error: 'sport2_prepared_train.jsonl' not found. Please ensure the data preparation step was successful.")
    training_file = None # Set to None to prevent attempting to create job without file

# Upload validation file (optional)
# Make sure 'sport2_prepared_valid.jsonl' exists in your files
validation_file = None # Initialize to None
try:
    with open("sport2_prepared_valid.jsonl", "rb") as f:
        validation_file = client.files.create(
            file=f,
            purpose="fine-tune"
        )
    print(f"Validation file uploaded with ID: {validation_file.id}")
except FileNotFoundError:
    print("Warning: 'sport2_prepared_valid.jsonl' not found. Proceeding without validation file.")


# Create fine-tuning job if training file was uploaded
if training_file:
    try:
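        fine_tune_job_params = {
            # Assumption: a completions-style base model is available for fine-tuning.
            # Chat models such as gpt-3.5-turbo expect chat-format ("messages") examples,
            # not the prompt/completion pairs prepared above.
            "model": "babbage-002",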
            "training_file": training_file.id,
            "suffix": "sports-classifier"
        }
        if validation_file:
            fine_tune_job_params["validation_file"] = validation_file.id

        fine_tune_job = client.fine_tuning.jobs.create(**fine_tune_job_params)

        print("\nFine-tuning job created:")
        print(f"ID: {fine_tune_job.id}")
        print(f"Status: {fine_tune_job.status}")
        print(f"Created At: {fine_tune_job.created_at}")

    except Exception as e:
        print(f"\nAn error occurred while creating the fine-tuning job: {e}")
        print("Please check the error message and your OpenAI account status and quota.")
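
Job creation returns immediately, so the status printed above is usually not final. A minimal polling sketch follows; `wait_for_job` is a hypothetical helper, not part of the openai library, and it takes the retrieval function as an argument so that it can wrap `client.fine_tuning.jobs.retrieve`. The terminal status names are assumed to be `succeeded`, `failed`, and `cancelled`.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal states

def wait_for_job(retrieve, job_id, poll_seconds=30.0, max_polls=120):
    """Call retrieve(job_id) until the job reaches a terminal status or polls run out."""
    job = retrieve(job_id)
    for _ in range(max_polls - 1):
        if job.status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
        job = retrieve(job_id)
    return job

# Stand-in for the API so the sketch runs offline: "running" twice, then "succeeded".
_statuses = iter(["running", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_statuses))

final_job = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0)
print(final_job.status)
```

With the real client the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, fine_tune_job.id)`.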

# Advanced: Results and Expected Model Performance
##### We can now download the results file to observe the expected performance on a held-out validation set.

In [None]:
# This subcommand comes from the legacy (pre-1.0) openai CLI. Replace FINE_TUNE_ID with the job ID; check the OpenAI playground to find it.
!openai --api-key OPENAI_API_SECRET api fine_tunes.results -i FINE_TUNE_ID > result.csv

In [None]:
results = pd.read_csv("result.csv")
results[results["classification/accuracy"].notnull()].tail(1)

In [None]:
results[results["classification/accuracy"].notnull()]["classification/accuracy"].plot()
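
The accuracy reported in the results file can be cross-checked locally once predictions are available. A minimal sketch, assuming predictions and labels are plain strings that may carry a leading space from data preparation; `accuracy` is a hypothetical helper and the sample values are invented.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels after trimming whitespace."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return correct / len(labels)

# Invented predictions and labels; three of the four pairs agree.
sample_predictions = [" hockey", " baseball", " hockey", " baseball"]
sample_labels = [" hockey", " baseball", " baseball", " baseball"]
sample_accuracy = accuracy(sample_predictions, sample_labels)
print(sample_accuracy)
```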

# Using the model
##### We can now call the model to get predictions.

In [None]:
test = pd.read_json("sport2_prepared_valid.jsonl", lines=True)
test.head()

In [None]:
ft_model = "FINE_TUNED_MODEL_NAME"  # placeholder: check the OpenAI playground for the name of the fine-tuned model

In [None]:
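# openai>=1.0 exposes completions on the client object; temperature=0 keeps the single-token prediction deterministic.
res = client.completions.create(model=ft_model, prompt=test["prompt"][0] + "\n\n###", max_tokens=1, temperature=0)
res.choices[0].text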

##### To get the log probabilities, we can specify the `logprobs` parameter on the completion request.

In [None]:
res = client.completions.create(model=ft_model, prompt=test["prompt"][0] + "\n\n###", max_tokens=1, temperature=0, logprobs=2)
res.choices[0].logprobs.top_logprobs[0]
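
Log probabilities are easier to read once converted to probabilities. A small sketch, assuming the top log probabilities arrive as a dict that maps each candidate token to its log probability; `to_probabilities` is a hypothetical helper and the numbers are made up.

```python
import math

def to_probabilities(top_logprobs):
    """Convert {token: logprob} to {token: probability}, renormalized over the listed tokens."""
    raw = {token: math.exp(lp) for token, lp in top_logprobs.items()}
    total = sum(raw.values())
    return {token: p / total for token, p in raw.items()}

example_logprobs = {" hockey": -0.05, " baseball": -3.2}  # illustrative values
probs = to_probabilities(example_logprobs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Renormalizing over only the listed tokens ignores any probability mass on other tokens, which is a simplification that seems reasonable for a two-class classifier.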

# Generalization

In [None]:
sample_hockey_tweet = """Thank you to the
@Canes
and all you amazing Caniacs that have been so supportive! You guys are some of the best fans in the NHL without a doubt, and your energy makes every game unforgettable.
@DetroitRedWings
!!"""
res = client.completions.create(model=ft_model, prompt=sample_hockey_tweet + "\n\n###", max_tokens=1, temperature=0)
res.choices[0].text

In [None]:
sample_baseball_tweet = """Huge shoutout to all the
@Yankees
fans! Your support at the stadium and online is unbelievable. You guys are some of the most passionate fans in MLB without
@RedSox
!!"""
# Classify the baseball tweet defined in this cell, not the hockey tweet from the previous one.
res = client.completions.create(model=ft_model, prompt=sample_baseball_tweet + "\n\n###", max_tokens=1, temperature=0)
res.choices[0].text