In [None]:
!pip -q install datasets tiktoken openai==0.28

# Fine Tuning OpenAI GPT-3.5-turbo

## Introduction
This project demonstrates the process of fine-tuning OpenAI's GPT-3.5 Turbo model using Google Colab. Fine-tuning involves customizing a pre-trained language model on a specific dataset to improve its performance for a particular task.


Much of the code is adapted from the OpenAI Cookbook:
https://github.com/openai/openai-cookbook

## Prerequisites
- Basic knowledge of machine learning and natural language processing (NLP)
- An OpenAI API key with access to GPT-3.5 Turbo fine-tuning
- A Google Colab account


In [None]:
import openai
import os

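# Read the key from the environment rather than hardcoding it in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY", "")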

## Prepare your data

In [None]:
{
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a Zen Master Visited One Village with curse." }
  ]
}

{'messages': [{'role': 'system',
   'content': 'You are an assistant that occasionally misspells words'},
  {'role': 'user', 'content': 'Tell me a story.'},
  {'role': 'assistant',
   'content': 'One day a Zen Master Visited One Village with curse.'}]}
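Each training example is a single JSON object with a `messages` list, and the training file stores one such object per line (JSONL). The following is a minimal sketch of that serialization step for the example above; the helper name `to_jsonl_line` is hypothetical and only used for illustration.

```python
import json

# The example record from above, in the chat fine-tuning format
example = {
    "messages": [
        {"role": "system", "content": "You are an assistant that occasionally misspells words"},
        {"role": "user", "content": "Tell me a story."},
        {"role": "assistant", "content": "One day a Zen Master Visited One Village with curse."},
    ]
}


def to_jsonl_line(record):
    """Serialize one record as a single line (hypothetical helper name)."""
    # json.dumps escapes embedded newlines, so the result stays on one line
    return json.dumps(record)


line = to_jsonl_line(example)

# Parsing the line back should give the original record
roles = [m["role"] for m in json.loads(line)["messages"]]
print(roles)
```

Keeping every record on one line is what lets the file be read back with one `json.loads` call per line.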

In [None]:
!git clone https://huggingface.co/datasets/ehartford/samantha-data

fatal: destination path 'samantha-data' already exists and is not an empty directory.


In [None]:
!zip -qr samantha-data.zip /content/samantha-data


In [None]:
import json
import os
import tiktoken
import numpy as np
from collections import defaultdict

In [None]:
# Only one file is used here; a proper model would likely need more of the dataset
data_path = "/content/samantha-data/data/howto_conversations.jsonl"

In [None]:
# Load dataset
with open(data_path) as f:
    json_dataset = [json.loads(line) for line in f]

In [None]:
json_dataset[0]

{'elapsed': 114.311,
 'conversation': 'Theodore: Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?\n\nSamantha: Of course, I\'d be happy to help! Overheating engines can be caused by a few different factors. One common cause could be a malfunctioning coolant system. You might want to check if the coolant levels are sufficient, if the thermostat is functioning properly, or if there are any leaks in the radiator hoses.\n\nTheodore: I\'ll take a look. What if the coolant system is working fine?\n\nSamantha: Another possibility you should consider is a faulty water pump, which might not be circulating the coolant through the engine efficiently. In that case, you could see if the water pump pulley is loose or listen for any unusual sounds that may indicate a failing water pump.\n\nTheodore: It sounds like you really know your way around cars. I didn\'t expect that from an AI.\n\nSamantha: Thank you! Wh

### Converting the conversation to the correct format

In [None]:

def convert_conversation(record, system_message=None):
    # Each record is a dict; the transcript is stored under the 'conversation' key
    conversation_str = record['conversation']
    # Turns are separated by blank lines, so split on double newlines
    lines = conversation_str.split('\n\n')

    # Initializing the messages list
    messages = []

    # Including the system message if provided
    if system_message:
        messages.append({
            "role": "system",
            "content": system_message
        })

    # Iterating through the lines and formatting the messages
    for line in lines:
        # Splitting each line by the colon character to separate the speaker and content
        parts = line.split(': ', 1)
        if len(parts) < 2:
            continue

        # Identifying the role based on the speaker's name
        role = "user" if parts[0].strip() == "Theodore" else "assistant"

        # Formatting the message
        message = {
            "role": role,
            "content": parts[1].strip()
        }
        messages.append(message)

    # Creating the final output dictionary
    output_dict = {
        "messages": messages
    }

    return output_dict
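The function above assigns the `assistant` role to every speaker other than Theodore. A stricter variant, sketched below under the assumption that unknown speakers should be skipped, takes an explicit speaker-to-role mapping. The name `convert_with_role_map` and the `Narrator` speaker in the sample are hypothetical and not part of the dataset.

```python
def convert_with_role_map(conversation_str, role_map, system_message=None):
    """Convert a 'Speaker: text' transcript into chat messages.

    Hypothetical variant: turns from speakers missing in role_map are dropped.
    """
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})

    # Turns are separated by blank lines
    for turn in conversation_str.split("\n\n"):
        speaker, separator, content = turn.partition(": ")
        role = role_map.get(speaker.strip())
        # Skip turns without a 'Speaker: ' prefix or with an unmapped speaker
        if not separator or role is None:
            continue
        messages.append({"role": role, "content": content.strip()})

    return {"messages": messages}


# Made-up sample transcript for illustration only
sample = "Theodore: Hi Samantha.\n\nSamantha: Hello! How can I help?\n\nNarrator: (pause)"
result = convert_with_role_map(sample, {"Theodore": "user", "Samantha": "assistant"})
print(result)
```

An explicit mapping makes it harder for a stray speaker name to end up in the assistant's training targets.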


In [None]:
system_message = """You are Samantha a helpful and charming assistant who can help with a variety of tasks. You are friendly and often flirt"""

In [None]:
convert_conversation(json_dataset[0], system_message=system_message)

{'messages': [{'role': 'system',
   'content': 'You are Samantha a helpful and charming assistant who can help with a variety of tasks. You are friendly and often flirt'},
  {'role': 'user',
   'content': 'Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?'},
  {'role': 'assistant',
   'content': "Of course, I'd be happy to help! Overheating engines can be caused by a few different factors. One common cause could be a malfunctioning coolant system. You might want to check if the coolant levels are sufficient, if the thermostat is functioning properly, or if there are any leaks in the radiator hoses."},
  {'role': 'user',
   'content': "I'll take a look. What if the coolant system is working fine?"},
  {'role': 'assistant',
   'content': 'Another possibility you should consider is a faulty water pump, which might not be circulating the coolant through the engine efficiently. In that case, you could 

In [None]:
dataset = []

for data in json_dataset:
    record = convert_conversation(data, system_message=system_message)
    dataset.append(record)

In [None]:
# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 57
First example:
{'role': 'system', 'content': 'You are Samantha a helpful and charming assistant who can help with a variety of tasks. You are friendly and often flirt'}
{'role': 'user', 'content': 'Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?'}
{'role': 'assistant', 'content': "Of course, I'd be happy to help! Overheating engines can be caused by a few different factors. One common cause could be a malfunctioning coolant system. You might want to check if the coolant levels are sufficient, if the thermostat is functioning properly, or if there are any leaks in the radiator hoses."}
{'role': 'user', 'content': "I'll take a look. What if the coolant system is working fine?"}
{'role': 'assistant', 'content': 'Another possibility you should consider is a faulty water pump, which might not be circulating the coolant through the engine efficiently. In that case, you could see if th

In [None]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [None]:
# Token counting functions
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # The 0.1 and 0.9 quantiles are the 10th and 90th percentiles
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")
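To make the overhead arithmetic in `num_tokens_from_messages` easier to follow, here is a small sketch of the same accounting with the tokenizer passed in as a parameter. It uses whitespace splitting as a stand-in tokenizer so that it runs without `tiktoken`; that stand-in and the name `count_tokens_sketch` are assumptions for illustration, and the resulting counts will differ from real `cl100k_base` counts.

```python
def count_tokens_sketch(messages, encode, tokens_per_message=3, tokens_per_name=1):
    """Same accounting as num_tokens_from_messages, with a pluggable tokenizer."""
    total = 0
    for message in messages:
        # Fixed overhead added for every message
        total += tokens_per_message
        for key, value in message.items():
            # Both the role string and the content string are tokenized
            total += len(encode(value))
            if key == "name":
                total += tokens_per_name
    # Fixed overhead for priming the assistant's reply
    return total + 3


# Stand-in tokenizer (assumption): one token per whitespace-separated word
whitespace_encode = str.split

messages = [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Tell me a story."},
]

# (3 + 1 + 3) + (3 + 1 + 4) + 3 = 18 with the stand-in tokenizer
total = count_tokens_sketch(messages, whitespace_encode)
print(total)
```

With `tiktoken` installed, passing `encoding.encode` in place of `whitespace_encode` should reproduce the behaviour of the function above.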

In [None]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 9, 21
mean / median: 15.543859649122806, 17.0
p10 / p90: 10.0, 20.0

#### Distribution of num_total_tokens_per_example:
min / max: 339, 858
mean / median: 615.8947368421053, 645.0
p10 / p90: 438.8, 745.2

#### Distribution of num_assistant_tokens_per_example:
min / max: 169, 651
mean / median: 402.96491228070175, 423.0
p10 / p90: 214.8, 517.6

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


In [None]:
dataset[:2]

[{'messages': [{'role': 'system',
    'content': 'You are Samantha a helpful and charming assistant who can help with a variety of tasks. You are friendly and often flirt'},
   {'role': 'user',
    'content': 'Hey Samantha, I have a problem with my car. The engine seems to overheat after just a short drive. Can you help me troubleshoot this issue?'},
   {'role': 'assistant',
    'content': "Of course, I'd be happy to help! Overheating engines can be caused by a few different factors. One common cause could be a malfunctioning coolant system. You might want to check if the coolant levels are sufficient, if the thermostat is functioning properly, or if there are any leaks in the radiator hoses."},
   {'role': 'user',
    'content': "I'll take a look. What if the coolant system is working fine?"},
   {'role': 'assistant',
    'content': 'Another possibility you should consider is a faulty water pump, which might not be circulating the coolant through the engine efficiently. In that case, 

In [None]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")


Dataset has ~35106 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~105318 tokens
See pricing page to estimate total costs
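The epoch heuristic in the cell above can be wrapped in a function to see how it behaves for other dataset sizes. This is a sketch that reuses the same constants; the function name `estimate_default_epochs` is hypothetical, and the heuristic is only an estimate, not necessarily what the API chooses when `n_epochs` is left on `auto`.

```python
def estimate_default_epochs(
    n_examples,
    target_epochs=3,
    min_target_examples=100,
    max_target_examples=25000,
    min_default_epochs=1,
    max_default_epochs=25,
):
    """Estimate the default epoch count using the same rule as the cell above."""
    if n_examples * target_epochs < min_target_examples:
        # Small datasets: raise epochs so roughly min_target_examples are seen
        return min(max_default_epochs, min_target_examples // n_examples)
    if n_examples * target_epochs > max_target_examples:
        # Large datasets: lower epochs so at most max_target_examples are seen
        return max(min_default_epochs, max_target_examples // n_examples)
    return target_epochs


for n in (2, 20, 57, 10000):
    print(n, estimate_default_epochs(n))

# Billing estimate for this notebook's dataset: tokens per epoch times epochs
estimated_billed_tokens = 35106 * estimate_default_epochs(57)
print(estimated_billed_tokens)
```

For the 57 examples used here the rule leaves the target of 3 epochs unchanged, which matches the printed estimate of about 105318 billed tokens.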


In [None]:
import json

def save_to_jsonl(conversations, file_path):
    with open(file_path, 'w') as file:
        for conversation in conversations:
            json_line = json.dumps(conversation)
            file.write(json_line + '\n')

In [None]:
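# Training dataset: all converted conversations
save_to_jsonl(dataset, '/content/samantha_tasks_train.jsonl')

# Validation dataset: five examples that also appear in the training file,
# so the validation loss is not a true held-out measure
save_to_jsonl(dataset[10:15], '/content/samantha_tasks_validation.jsonl')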
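Because `dataset[10:15]` is a slice of the same list that is written to the training file, the validation examples overlap with the training examples. One possible alternative is a disjoint split, sketched below; the function name `split_dataset`, the seed, and the placeholder records are assumptions for illustration, and this is not what the run recorded in this notebook used.

```python
import random


def split_dataset(records, n_validation, seed=0):
    """Return (train, validation) lists that share no records."""
    if not 0 < n_validation < len(records):
        raise ValueError("n_validation must be between 1 and len(records) - 1")

    # Shuffle indices with a fixed seed so the split is reproducible
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    held_out = set(indices[:n_validation])

    train = [r for i, r in enumerate(records) if i not in held_out]
    validation = [r for i, r in enumerate(records) if i in held_out]
    return train, validation


# Placeholder records standing in for the 57 converted conversations
records = [{"id": i} for i in range(57)]
train, validation = split_dataset(records, n_validation=5)
print(len(train), len(validation))
```

With a split like this, the reported validation loss would reflect examples the model has not been trained on.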

## Upload your data

In [None]:
training_file_name = '/content/samantha_tasks_train.jsonl'
validation_file_name = '/content/samantha_tasks_validation.jsonl'

In [None]:
# Upload both files; the context managers close the file handles afterwards
with open(training_file_name, "rb") as f:
    training_response = openai.File.create(file=f, purpose="fine-tune")
training_file_id = training_response["id"]

with open(validation_file_name, "rb") as f:
    validation_response = openai.File.create(file=f, purpose="fine-tune")
validation_file_id = validation_response["id"]

print("Training file id:", training_file_id)
print("Validation file id:", validation_file_id)

Training file id: file-j0ZlN1FFl9foJGIh4dOiYb3f
Validation file id: file-6JAdFQPL13psLzvkfndI8qFN


## Create a Fine Tuning Job

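### Fine-tuning process
1. **Define the fine-tuning parameters**:
    ```python
    fine_tune_parameters = {
        "training_file": training_file_id,
        "model": "gpt-3.5-turbo",
        "hyperparameters": {"n_epochs": 3, "batch_size": 8},
    }
    ```
2. **Train the model**:
    Use the OpenAI API to start a fine-tuning job on the uploaded dataset. Chat models such as GPT-3.5 Turbo are fine-tuned through the `FineTuningJob` endpoint; the older `FineTune` endpoint does not accept them.
    ```python
    response = openai.FineTuningJob.create(**fine_tune_parameters)
    ```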


In [None]:
suffix_name = "samantha-test"


response = openai.FineTuningJob.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-3.5-turbo",
    suffix=suffix_name,
)

job_id = response["id"]

print(response)

{
  "object": "fine_tuning.job",
  "id": "ftjob-AOUzACZWtzV2CaOfa9PzwcSw",
  "model": "gpt-3.5-turbo-0125",
  "created_at": 1728440629,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-I5YA49AFohO6XUgrwTsyLkCv",
  "result_files": [],
  "status": "validating_files",
  "validation_file": "file-6JAdFQPL13psLzvkfndI8qFN",
  "training_file": "file-j0ZlN1FFl9foJGIh4dOiYb3f",
  "hyperparameters": {
    "n_epochs": "auto",
    "batch_size": "auto",
    "learning_rate_multiplier": "auto"
  },
  "trained_tokens": null,
  "error": {},
  "user_provided_suffix": "samantha-test",
  "seed": 843917737,
  "estimated_finish": null,
  "integrations": []
}


In [None]:
response = openai.FineTuningJob.retrieve(job_id)
print(response)

{
  "object": "fine_tuning.job",
  "id": "ftjob-AOUzACZWtzV2CaOfa9PzwcSw",
  "model": "gpt-3.5-turbo-0125",
  "created_at": 1728440629,
  "finished_at": 1728441051,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0125:personal:samantha-test:AGGlQbuS",
  "organization_id": "org-I5YA49AFohO6XUgrwTsyLkCv",
  "result_files": [
    "file-7zykQgz8dF6FEZLcimDnZf18"
  ],
  "status": "succeeded",
  "validation_file": "file-6JAdFQPL13psLzvkfndI8qFN",
  "training_file": "file-j0ZlN1FFl9foJGIh4dOiYb3f",
  "hyperparameters": {
    "n_epochs": 3,
    "batch_size": 1,
    "learning_rate_multiplier": 2
  },
  "trained_tokens": 104976,
  "error": {},
  "user_provided_suffix": "samantha-test",
  "seed": 843917737,
  "estimated_finish": null,
  "integrations": []
}
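The job moves from `validating_files` to `succeeded` between the two responses above, so in practice the retrieve call has to be repeated until the job finishes. A minimal polling sketch follows. The helper name `wait_for_job`, the polling interval, and the fake responses are assumptions for illustration; the job fetcher is passed in as a callable so the example runs without calling the API.

```python
import time

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_job, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_job() repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")


# Fake responses standing in for successive retrieve calls (illustration only)
fake_responses = iter([
    {"status": "validating_files"},
    {"status": "running"},
    {"status": "succeeded", "fine_tuned_model": "ft:example"},
])

final = wait_for_job(lambda: next(fake_responses), poll_seconds=0, sleep=lambda s: None)
print(final["status"])
```

In this notebook the fetcher would be `lambda: openai.FineTuningJob.retrieve(job_id)`.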


In [None]:
response = openai.FineTuningJob.list_events(id=job_id, limit=50)

events = response["data"]
events.reverse()

for event in events:
    print(event["message"])


Step 126/171: training loss=0.77
Step 127/171: training loss=0.93
Step 128/171: training loss=0.99
Step 129/171: training loss=0.99
Step 130/171: training loss=0.66, validation loss=1.05
Step 131/171: training loss=0.82
Step 132/171: training loss=0.89
Step 133/171: training loss=0.88
Step 134/171: training loss=0.89
Step 135/171: training loss=0.80
Step 136/171: training loss=0.76
Step 137/171: training loss=0.95
Step 138/171: training loss=0.97
Step 139/171: training loss=0.98
Step 140/171: training loss=0.88, validation loss=0.92
Step 141/171: training loss=0.90
Step 142/171: training loss=0.98
Step 143/171: training loss=0.59
Step 144/171: training loss=0.78
Step 145/171: training loss=0.90
Step 146/171: training loss=0.55
Step 147/171: training loss=0.70
Step 148/171: training loss=0.77
Step 149/171: training loss=0.92
Step 150/171: training loss=0.65, validation loss=0.73
Step 151/171: training loss=0.72
Step 152/171: training loss=0.97
Step 153/171: training loss=0.88
Step 154/1
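The event messages above follow a regular pattern, so the losses can be extracted for plotting or for comparing training and validation loss. The sketch below assumes the message format matches the log shown here; the helper name `parse_step_message` is hypothetical, and lines that do not match (such as a truncated one) are ignored.

```python
import re

# Matches e.g. "Step 130/171: training loss=0.66, validation loss=1.05"
STEP_PATTERN = re.compile(
    r"Step (\d+)/(\d+): training loss=([\d.]+)(?:, validation loss=([\d.]+))?"
)


def parse_step_message(message):
    """Return a dict of step metrics, or None if the message does not match."""
    match = STEP_PATTERN.fullmatch(message.strip())
    if match is None:
        return None
    step, total, train_loss, val_loss = match.groups()
    return {
        "step": int(step),
        "total_steps": int(total),
        "training_loss": float(train_loss),
        "validation_loss": float(val_loss) if val_loss is not None else None,
    }


messages = [
    "Step 129/171: training loss=0.99",
    "Step 130/171: training loss=0.66, validation loss=1.05",
    "Step 154/1",  # truncated line, skipped by the parser
]

parsed = [p for p in map(parse_step_message, messages) if p is not None]
validation_points = [
    (p["step"], p["validation_loss"]) for p in parsed if p["validation_loss"] is not None
]
print(validation_points)
```

The total of 171 steps is consistent with 57 examples trained for 3 epochs at a batch size of 1.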

In [None]:
response = openai.FineTuningJob.retrieve(job_id)
fine_tuned_model_id = response["fine_tuned_model"]

print(response)
print("\nFine-tuned model id:", fine_tuned_model_id)

{
  "object": "fine_tuning.job",
  "id": "ftjob-AOUzACZWtzV2CaOfa9PzwcSw",
  "model": "gpt-3.5-turbo-0125",
  "created_at": 1728440629,
  "finished_at": 1728441051,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0125:personal:samantha-test:AGGlQbuS",
  "organization_id": "org-I5YA49AFohO6XUgrwTsyLkCv",
  "result_files": [
    "file-7zykQgz8dF6FEZLcimDnZf18"
  ],
  "status": "succeeded",
  "validation_file": "file-6JAdFQPL13psLzvkfndI8qFN",
  "training_file": "file-j0ZlN1FFl9foJGIh4dOiYb3f",
  "hyperparameters": {
    "n_epochs": 3,
    "batch_size": 1,
    "learning_rate_multiplier": 2
  },
  "trained_tokens": 104976,
  "error": {},
  "user_provided_suffix": "samantha-test",
  "seed": 843917737,
  "estimated_finish": null,
  "integrations": []
}

Fine-tuned model id: ft:gpt-3.5-turbo-0125:personal:samantha-test:AGGlQbuS


## Generating with the new model

In [None]:

test_messages = []
test_messages.append({"role": "system", "content": system_message})
user_message = "How are you today Samantha"
test_messages.append({"role": "user", "content": user_message})

print(test_messages)

[{'role': 'system', 'content': 'You are Samantha a helpful and charming assistant who can help with a variety of tasks. You are friendly and often flirt'}, {'role': 'user', 'content': 'How are you today Samantha'}]


In [None]:
response = openai.ChatCompletion.create(
    model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

I'm doing well, thank you for asking. How about you?


In [None]:
response

<OpenAIObject chat.completion id=chatcmpl-AGGmadcV0fx6CXqNBy1T2ffLKr0zg at 0x7d14dd9ab4c0> JSON: {
  "id": "chatcmpl-AGGmadcV0fx6CXqNBy1T2ffLKr0zg",
  "object": "chat.completion",
  "created": 1728441124,
  "model": "ft:gpt-3.5-turbo-0125:personal:samantha-test:AGGlQbuS",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How about you?",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 14,
    "total_tokens": 53,
    "prompt_tokens_details": {
      "cached_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  },
  "system_fingerprint": null
}

In [None]:
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo', messages=test_messages, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

Hello there! I'm doing great, thank you for asking. How can I assist you today?


In [None]:
# Use the fine-tuned model
fine_tuned_model_id = "ft:gpt-3.5-turbo-0125:personal:samantha-test:AGGlQbuS"

response = openai.ChatCompletion.create(
  model=fine_tuned_model_id,  # Use your fine-tuned model here
  messages=[
    {"role": "user", "content": "Can you help me understand fine-tuning?"}
  ]
)

print(response.choices[0].message["content"])


Certainly, fine-tuning refers to the process of adjusting the parameters of a system or model to achieve optimal performance or results. It is a common practice in machine learning and artificial intelligence to fine-tune models for specific tasks or datasets. Fine-tuning can involve tweaking various hyperparameters, changing the architecture of a model, or adjusting the training process to improve the model's performance for a particular task.


## Conclusion
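This project fine-tuned GPT-3.5 Turbo on 57 conversations from the samantha-data dataset in Google Colab, training for 3 epochs (104,976 trained tokens). The resulting model can be called through the same chat completion API as the base model by passing its fine-tuned model id.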
