# Fine-Tuning ChatGPT to Make Precise YouTube Chapter Sections

Let's look at the documentation on the [OpenAI website](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset) to learn how to fine-tune a ChatGPT model for a specific use case using a custom dataset.

Starting with the basics, we need an example format, which should be as such:

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

So, to fine-tune a ChatGPT model we'll need to create a dataset organized in the format shown above. Essentially we'll have a .jsonl file (JSON Lines: one JSON object per line) with as many lines as there are training examples, each one a dictionary in this format:

```
{"messages": [{"role": "system", "content": "<some text content>"},
              {"role": "user", "content": "<some text content>"},
              {"role": "assistant", "content": "<some text content>"}]}
```

Here we'll assume that we will only fine tune actual ChatGPT, so we won't cover the formats for models like 'babbage-002' and others.

Ok, given this format, let's create our dataset.

Quote from the actual documentation:


> To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.



Let's try to get at least 30 examples for our custom dataset.

Just as in regular fine-tuning with any Machine Learning model, it's advisable to establish a train-test split so that you stay on top of your model's performance.
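
As a minimal sketch of such a split, the helper below shuffles a list of chat-format examples with a fixed seed and holds out a fraction for validation. The function name `split_dataset`, the 20% default, and the toy examples are assumptions made for illustration; they are not part of the OpenAI tooling.

```python
import random


def split_dataset(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy stand-ins for the chat-format examples built later in the notebook.
toy_examples = [{"messages": [{"role": "user", "content": f"task {i}"}]} for i in range(10)]
train_examples, validation_examples = split_dataset(toy_examples)
print(len(train_examples), len(validation_examples))
```

Each part could then be written to its own .jsonl file. The fine-tuning job accepts an optional `validation_file` alongside `training_file`, so the held-out part can be uploaded separately if you want validation metrics.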

Also, it's extremely important to be aware of token limits and token costs, given that this is a paid API.

Each training example is limited to the context length of ChatGPT (4096 tokens here), so it's worth putting some good practices in place to avoid issues with large inputs. The [OpenAI documentation](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset#:~:text=Each%20training%20example,the%20OpenAI%20cookbook.) recommends checking that the total token count in the message contents stays under 4,000 as a good rule of thumb.
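
One way to apply that rule of thumb is sketched below: sum a token count over the message contents of each example and set aside anything above the budget. The helper `split_by_token_budget` and the word-based `approximate_token_count` are hypothetical names for this illustration; the word counter is only a crude stand-in so the sketch runs offline, and real counts should come from a tokenizer.

```python
def split_by_token_budget(examples, count_tokens, max_tokens=4000):
    """Return (within_budget, over_budget) based on the summed token count of message contents."""
    within_budget, over_budget = [], []
    for example in examples:
        total = sum(count_tokens(m["content"]) for m in example["messages"])
        (within_budget if total <= max_tokens else over_budget).append(example)
    return within_budget, over_budget


# Crude stand-in counter (whitespace-separated words), NOT a real tokenizer.
def approximate_token_count(text):
    return len(text.split())


short = {"messages": [{"role": "user", "content": "plan my week"}]}
long = {"messages": [{"role": "user", "content": "word " * 5000}]}
kept, flagged = split_by_token_budget([short, long], approximate_token_count)
print(len(kept), len(flagged))
```

In this notebook you would pass a tiktoken-based counter instead, for example the `get_num_tokens` helper defined further down.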

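Let's first create this dataset programmatically using the OpenAI chat completions API. Note that the helper below calls `gpt-3.5-turbo-1106`; pass a GPT-4 model name instead if you want GPT-4 to generate the data.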

In [None]:
from openai import OpenAI
import tiktoken
from IPython.display import display, Markdown


def get_response(prompt):
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-3.5-turbo-1106", 
                             messages=
                             [
                                 {"role": "system", "content": "You are a helpful assistant."},
                                 {"role": "user", "content": prompt}   
                             ],
                             temperature=0.0,
                             n = 1
                             )
    return response.choices[0].message.content


def get_num_tokens(prompt, model="gpt-3.5-turbo"):
    """Calculate the number of tokens in a text prompt for the given model."""
    # Look up the tokenizer that matches the requested model
    enc = tiktoken.encoding_for_model(model)

    return len(enc.encode(prompt))

Great! Now, let's generate the task descriptions and manually inspect them to select the best ones for our custom training dataset.

In [None]:
prompt = f"Create 15 examples of 1 paragraph first person human-like descriptions of tasks and goals for the upcoming weeks. Each example should be in a bullet point."

task_paragraphs = get_response(prompt)

display(Markdown(task_paragraphs))

In [None]:
task_list_prompts = task_paragraphs.split("\n")

task_list_prompts
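
Splitting on `"\n"` tends to leave blank lines and bullet markers in the list. The sketch below shows one possible cleanup that strips leading list markers and drops empty lines. The function name `clean_bullet_lines` and the sample text are assumptions for illustration, and the pattern would also strip a sentence that happens to start with a number followed by a period.

```python
import re

# Matches a leading bullet ("-", "*", "•") or numbering ("1.", "2)") plus the whitespace after it.
_BULLET_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def clean_bullet_lines(text):
    """Split `text` into lines, strip list markers, and drop blank lines."""
    cleaned = []
    for line in text.splitlines():
        stripped = _BULLET_PREFIX.sub("", line).strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


sample_output = "- Finish the quarterly report.\n\n- Book flights for the conference.\n\n3. Renew my passport."
print(clean_bullet_lines(sample_output))
```

Cleaning the lines before generating the steps would avoid sending empty prompts to the API and removing them by index afterwards.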

In [None]:
for t in task_list_prompts:
    print(t)

In [None]:
task_steps = []
for t in task_list_prompts:
    prompt = f"Given this paragraph description of tasks and goals: \n '''{t}''' \n create a bullet point list with all the necessary steps to accomplish all of them."
    task_steps.append(get_response(prompt))

Let's evaluate the dataset generated programmatically:

In [None]:
for t in task_steps:
    display(Markdown(t))

In [None]:
for desc,steps in zip(task_list_prompts, task_steps):
    print(desc) 
    print(steps)
    print("************")   

In [None]:
empty_indices = [i for i, x in enumerate(task_list_prompts) if not x]
print(empty_indices)

In [None]:
# Define the indexes to remove
indexes_to_remove = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27]

# Remove the elements with the given indexes from task_list_prompts and task_steps
task_list_prompts = [t for i, t in enumerate(task_list_prompts) if i not in indexes_to_remove]
task_steps = [t for i, t in enumerate(task_steps) if i not in indexes_to_remove]

In [None]:
# prompt_to_gpt4 = """Consider this format for a dataset for the chatgpt fine tuning api:
# '''
# {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
# {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
# {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
# '''

# Write the python code to create a dataset like this one programatically given that the system prompt will always be: "You are a helpful planning assistant"
# and that the user prompts will be inside a list called 'task_list_prompts' and the assistant content will be inside a list called 'task_steps'."""

In [None]:
import json

# Ensure the lists have the same length
assert len(task_list_prompts) == len(task_steps), "Mismatched lengths between prompts and responses"

# Creating the dataset
dataset = []
system_prompt = "You are a helpful planning assistant"

for user_content, assistant_content in zip(task_list_prompts, task_steps):
    interaction = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ]
    }
    dataset.append(interaction)

In [None]:
dataset

In [None]:
# Writing to a file
with open('dataset.jsonl', 'w') as f:
    for entry in dataset:
        f.write(json.dumps(entry) + '\n')

# Now 'dataset.jsonl' will contain the dataset in the desired format.

In [None]:
# source: https://cookbook.openai.com/examples/chat_finetuning_data_prep
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [None]:
data_path = "./dataset.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Perfect! Now let's estimate some token counts and costs.

We'll be checking against the validation checklist described in the [tutorial suggested by the OpenAI docs](https://cookbook.openai.com/examples/chat_finetuning_data_prep#:~:text=Format%20validation,for%20easier%20debugging.).

In [None]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

In [None]:
# Some helpful utilities

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

Data warnings and token counts.

In [None]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

As we can see from the output, all of our examples are under the context length, so none of them will be truncated, which is great news! :)

Cost estimation

Let's estimate the cost of this fine-tuning based on the number of tokens in our dataset.

In [None]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Ok, it seems this fine-tuning is not going to cost much if we check the [pricing page](https://openai.com/pricing) for the cost associated with this number of tokens:

![](2023-10-16-17-22-32.png)


Let's write a little function to calculate these costs automatically:

In [None]:
def calculate_cost_for_fine_tunning(token_count):
    return (0.008*token_count)/1000

In [None]:
calculate_cost_for_fine_tunning(38478)

Ok, so the cost comes to about $0.31 (roughly 31 cents), which seems reasonable. Keep in mind that we'll also be charged more for using this fine-tuned model, and that we only used 10 examples for this use case!
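
To make the arithmetic explicit, here is a small sketch that separates the token count, the number of epochs, and the price per 1,000 tokens. The function name `estimate_training_cost_usd` is hypothetical, the default rate simply mirrors the 0.008 used above (check the pricing page for current values), and since it is not stated whether 38478 already includes the epoch multiplier, `n_epochs` defaults to 1.

```python
def estimate_training_cost_usd(tokens_per_epoch, n_epochs=1, usd_per_1k_tokens=0.008):
    """Estimated training cost in USD: billed tokens times the per-1K-token rate."""
    billed_tokens = tokens_per_epoch * n_epochs
    return billed_tokens * usd_per_1k_tokens / 1000


cost = estimate_training_cost_usd(38478)
# 38478 * 0.008 / 1000 = 0.307824 USD
print(f"${cost:.2f} (about {cost * 100:.0f} cents)")  # prints: $0.31 (about 31 cents)
```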

Now let's upload our newly generated dataset file!

To do that, let's write a simple Python script to upload the file. We'll use this snippet from the [OpenAI API docs](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset#:~:text=import%20os%20import%20openai%20openai.api_key%20%3D%20os.getenv(%22openai_api_key%22)%20openai.file.create(%20file%3Dopen(%22mydata.jsonl%22%2C%20%22rb%22)%2C%20purpose%3D'fine-tune'%20)).

In [None]:
# import os
# import openai

# openai.File.create(
#   file=open("./dataset.jsonl", "rb"),
#   purpose='fine-tune'
# )

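from openai import OpenAI

# The OpenAI API key should be set as the
# environment variable OPENAI_API_KEY
client = OpenAI()

# Upload the dataset, closing the file handle when done, and keep the
# returned file object so that its id can be reused
with open("dataset.jsonl", "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="fine-tune")

# Use this id as `training_file` when creating the fine-tuning job below
print(uploaded_file.id)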

In [None]:
# list_files = openai.File.list()
from openai import OpenAI
client = OpenAI()

client.files.list()

In [None]:
# list_files

Now, finally, we create our fine-tuned model by running this:

In [None]:
# openai.FineTuningJob.create(training_file="file-ICGS6dyaltzMJuXdnHcpjGTr", model="gpt-3.5-turbo")
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-abc123", 
  model="gpt-3.5-turbo"
)

# Example response object
# {
#   "object": "fine_tuning.job",
#   "id": "ftjob-abc123",
#   "model": "gpt-3.5-turbo-0613",
#   "created_at": 1614807352,
#   "fine_tuned_model": null,
#   "organization_id": "org-123",
#   "result_files": [],
#   "status": "queued",
#   "validation_file": null,
#   "training_file": "file-abc123",
# }


Here the `training_file` parameter corresponds to the file id generated when running the previous snippet.


Now we wait for an email confirmation that will let us know when the training is done, which, given the size of this job, should arrive pretty quickly.

A neat thing is that you can programmatically query the status of your jobs:

In [None]:
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.list()

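# Example response object. Note: this event list is the shape returned by
# client.fine_tuning.jobs.list_events(...); jobs.list() itself returns a page
# of "fine_tuning.job" objects like the one shown in the previous cell.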
# {
#   "object": "list",
#   "data": [
#     {
#       "object": "fine_tuning.job.event",
#       "id": "ft-event-TjX0lMfOniCZX64t9PUQT5hn",
#       "created_at": 1689813489,
#       "level": "warn",
#       "message": "Fine tuning process stopping due to job cancellation",
#       "data": null,
#       "type": "message"
#     },
#     { ... },
#     { ... }
#   ], "has_more": true
# }

Retrieve fine tuning jobs

In [None]:
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.retrieve("ftjob-abc123")
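
Instead of waiting for the email, the retrieve call can be wrapped in a simple polling loop. The sketch below is one possible way to do it, under stated assumptions: the function name `wait_for_fine_tuning_job`, the 30-second interval, and the poll limit are choices made for this example, and the terminal statuses listed are the ones the job object reports when it stops.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tuning_job(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll `client.fine_tuning.jobs.retrieve` until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Usage (requires OPENAI_API_KEY and a real job id):
# from openai import OpenAI
# job = wait_for_fine_tuning_job(OpenAI(), "ftjob-abc123")
# print(job.status, job.fine_tuned_model)
```

When the job succeeds, `job.fine_tuned_model` holds the model id to use for inference, so it does not need to be copied by hand from the platform.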

In [None]:
# # List 10 fine-tuning jobs
# openai.FineTuningJob.list(limit=10)

# # Retrieve the state of a fine-tune
# openai.FineTuningJob.retrieve("ftjob-rPPzK29a1UkXFSHTdEQX4heY")

# Cancel a job
#openai.FineTuningJob.cancel("file-kXlvspUQRKFUWcrW5pgzy0PW")

# List up to 10 events from a fine-tuning job
#openai.FineTuningJob.list_events(id="file-kXlvspUQRKFUWcrW5pgzy0PW", limit=10)

Cancel fine tuning job

In [None]:
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.cancel("ftjob-abc123")

# Example response object
# {
#   "object": "fine_tuning.job",
#   "id": "ftjob-abc123",
#   "model": "gpt-3.5-turbo-0613",
#   "created_at": 1689376978,
#   "fine_tuned_model": null,
#   "organization_id": "org-123",
#   "result_files": [],
#   "hyperparameters": {
#     "n_epochs":  "auto"
#   },
#   "status": "cancelled",
#   "validation_file": "file-abc123",
#   "training_file": "file-abc123"
# }


Now we can go to the platform to check out our fine-tuned model or run inference from it using the API! Let's look at how to run our fine-tuned model.

In the platform we see:

![](2023-10-20-00-02-25.png)

So, we can choose that model and see how it performs for a different new video.

Querying the model:
![](2023-10-20-00-05-50.png)

The response:

![](2023-10-16-17-49-24.png)

It looks pretty good! :) Let's compare with the output of regular ChatGPT

![](2023-10-16-17-50-32.png)

The fine-tuned model is definitely more thorough than the regular ChatGPT model, so I would call this a nice success! :)

Now, let's run inference with our fine tuned model!

In [None]:
def get_response(prompt, fine_tuned_model_id="ft:gpt-3.5-turbo-0613:personal::8BWIRV1U"):
    client = OpenAI()
    response = client.chat.completions.create(model=fine_tuned_model_id, 
                             messages=
                             [
                                 {"role": "system", "content": "You are a helpful assistant."},
                                 {"role": "user", "content": prompt}   
                             ],
                             temperature=0.0,
                             n = 1
                             )
    return response.choices[0].message.content


get_response("Tomorrow I have to practice for my presentation at the live-training for O'Reilly at least a couple more times, then run through all the slides and notebooks to check everything is in order. After that I have Jiu Jitsu training and one more rehearsal before the live-training at 18:00")
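
The side-by-side comparison shown in the screenshots can also be done in code. The sketch below sends the same prompt to several model ids and collects the replies. The name `compare_models` is hypothetical, and defaulting to the training system prompt ("You are a helpful planning assistant") is an assumption of this example, on the reasoning that the fine-tuned model saw that prompt during training.

```python
def compare_models(client, prompt, model_ids, system_prompt="You are a helpful planning assistant"):
    """Send the same prompt to each model id and return {model_id: reply_text}."""
    replies = {}
    for model_id in model_ids:
        response = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            temperature=0.0,
        )
        replies[model_id] = response.choices[0].message.content
    return replies


# Usage (requires OPENAI_API_KEY and your own fine-tuned model id):
# from openai import OpenAI
# replies = compare_models(
#     OpenAI(),
#     "Tomorrow I have to rehearse my presentation and check all the slides.",
#     ["gpt-3.5-turbo", "ft:gpt-3.5-turbo-0613:personal::8BWIRV1U"],
# )
# for model_id, reply in replies.items():
#     print(model_id, reply, sep="\n", end="\n\n")
```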

In [None]:
# openai.Model.delete("ft:gpt-3.5-turbo-0613:personal::8BWIRV1U")

from openai import OpenAI
client = OpenAI()

client.files.delete("file-abc123")

Success!!!!!