This notebook demonstrates how to fine-tune a GPT model using the OpenAI API for a text classification task. Specifically, it trains a model to classify text as either "baseball" or "hockey" news.

The notebook performs the following steps:

1.  **Installs Dependencies**: Installs necessary libraries like `openai`, `tiktoken`, `pandas`, and `scikit-learn`.
2.  **Imports and Client Setup**: Imports required modules and sets up the OpenAI client using an API key.
3.  **Loads and Prepares Dataset**: Fetches the 20 Newsgroups dataset, filtering for baseball and hockey categories. It then splits the data into training and validation sets.
4.  **Saves JSONL in Chat Format**: Formats the training and validation data into a JSONL file suitable for fine-tuning, following the chat message format expected by the OpenAI API.
5.  **Uploads Files to OpenAI**: Uploads the prepared JSONL files to OpenAI's servers for use in the fine-tuning job.
6.  **Creates Fine-tuning Job**: Initiates a fine-tuning job on a specified model (e.g., `gpt-4.1-nano-2025-04-14`) using the uploaded training and validation files.
7.  **Inference after Fine-tuning**: Shows how to use the fine-tuned model for inference on a new piece of text.
8.  **Evaluation After completion of Fine Tuning**: Evaluates the performance of the fine-tuned model against the base model using accuracy on the validation set.

In the end the increase in accuracy was 15%. Fine Tuning was done for 8 epochs.

In [1]:
# ==========================================
# 1. Install dependencies
# ==========================================
!pip install -q --upgrade openai tiktoken
!pip install -q pandas==2.2.2 scikit-learn==1.6.1

In [2]:
# ==========================================
# 2. Imports & client setup
# ==========================================
import os
import json
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from openai import OpenAI
from google.colab import userdata
from tqdm import tqdm

In [4]:
# Securely prompt for API key
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Initialize client
client = OpenAI()

In [6]:
# ==========================================
# 3. Load & prepare dataset
# ==========================================
# We'll filter to baseball and hockey newsgroups
categories = ['rec.sport.baseball', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

texts = newsgroups.data
labels = newsgroups.target  # 0 = baseball, 1 = hockey

# Train/validation split
train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts, labels, train_size=300, test_size=100, random_state=42, stratify=labels
)

print("Train set size:", len(train_texts))
print("Validation set size:", len(valid_texts))

Train set size: 300
Validation set size: 100


In [7]:
# ==========================================
# 4. Save JSONL in chat format for GPT-3.5
# ==========================================
def save_jsonl(filename, texts, labels):
    with open(filename, "w") as f:
        for text, label in zip(texts, labels):
            label_name = "baseball" if label == 0 else "hockey"
            record = {
                "messages": [
                    {"role": "system", "content": "You are a classifier that predicts baseball or hockey."},
                    {"role": "user", "content": text.strip()},
                    {"role": "assistant", "content": label_name}
                ]
            }
            f.write(json.dumps(record) + "\n")

save_jsonl("train.jsonl", train_texts, train_labels)
save_jsonl("valid.jsonl", valid_texts, valid_labels)

print("✅ Training and validation files saved.")

✅ Training and validation files saved.


In [8]:
# ==========================================
# 5. Upload files to OpenAI
# ==========================================
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

print("Train file ID:", train_file.id)
print("Valid file ID:", valid_file.id)

Train file ID: file-M4KfhmfZuaMZundzkUbYWv
Valid file ID: file-2Zeygj5biKYsk9ipfyXq9b


In [None]:
# ==========================================
# 6. Create fine-tuning job
# ==========================================
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=valid_file.id,
    model="gpt-4.1-nano-2025-04-14",
    suffix="baseball-hockey",
    hyperparameters={
        "n_epochs": 2
    }
)

print("Fine-tune job ID:", job.id)

In [None]:
# You can monitor progress in the OpenAI dashboard:
# https://platform.openai.com/finetune

# ==========================================
# 7. Inference after fine-tuning (replace MODEL_ID)
# ==========================================
# After training completes, you'll get a model like:
# ft:gpt-3.5-turbo:your-org:baseball-hockey:xxxx-xx-xx-xx-xx
FT_MODEL = "ft:gpt-4.1-nano-2025-04-14:debadri:baseball-hockey:C3fSa9ai"

test_text = "The team won the series after hitting two home runs."
resp = client.chat.completions.create(
    model=FT_MODEL,
    messages=[
        {"role": "system", "content": "You are a classifier that predicts baseball or hockey."},
        {"role": "user", "content": test_text}
    ]
)
print("Prediction:", resp.choices[0].message.content)

Prediction: baseball


In [9]:
# ==========================================
# 8. Evaluation After completion of Fine Tuning
# ==========================================

from sklearn.metrics import accuracy_score

# Your fine-tuned model name from job completion
FT_MODEL = "ft:gpt-4.1-nano-2025-04-14:debadri:baseball-hockey:C3fSa9ai"  # Replace with actual

# Base model (untrained)
BASE_MODEL = "gpt-4.1-nano-2025-04-14"

# Load validation dataset
val_data = []
with open("valid.jsonl", "r") as f:
    for line in f:
        val_data.append(json.loads(line))

def evaluate_model(model_name):
    """Evaluate a given model on the validation set and return predictions + accuracy."""
    y_true = []
    y_pred = []

    for example in tqdm(val_data, desc=f"Evaluating {model_name}"):
        prompt = example["messages"][0]["content"]
        actual_label = example["messages"][-1]["content"]
        y_true.append(actual_label.strip().lower())

        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "Classify the text as 'baseball' or 'hockey'."},
                {"role": "user", "content": prompt}
            ],
            temperature=0
        )

        prediction = response.choices[0].message.content.strip().lower()
        y_pred.append(prediction)

    acc = accuracy_score(y_true, y_pred)
    return acc

# Evaluate both models
acc_ft = evaluate_model(FT_MODEL)
acc_base = evaluate_model(BASE_MODEL)

# Compare results
print(f"Fine-tuned model accuracy: {acc_ft*100:.2f}%")
print(f"Base model accuracy: {acc_base*100:.2f}%")
print(f"Accuracy improvement: {(acc_ft - acc_base)*100:.2f}%")

Evaluating ft:gpt-4.1-nano-2025-04-14:debadri:baseball-hockey:C3fSa9ai: 100%|██████████| 100/100 [00:58<00:00,  1.72it/s]
Evaluating gpt-4.1-nano-2025-04-14: 100%|██████████| 100/100 [00:40<00:00,  2.48it/s]

Fine-tuned model accuracy: 50.00%
Base model accuracy: 50.00%
Accuracy improvement: 0.00%



