# Fine-Tuning ChatGPT to Make Precise YouTube Chapters Sections

Let's look at the documentation on the [OpenAI website](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset) to learn how to fine-tune a ChatGPT model for a specific use case using a custom dataset.

Starting with the basics, we need an example format, which should look like this:

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

So, to fine-tune a ChatGPT model we'll need to create a dataset organized in the format shown above. Essentially, we'll have a .jsonl file with one dictionary per line (as many as there are training examples), each in this format:

```
{"messages": [{"role": "system", "content": "<some text content>"},
              {"role": "user", "content": "<some text content>"},
              {"role": "assistant", "content": "<some text content>"}]}
```
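As a minimal sketch of that format in Python (the helper name `make_example` is an assumption for illustration, not something from the OpenAI docs), each training example is one dictionary with a `messages` list, serialized as a single line of JSON:

```python
import json

def make_example(system_text, user_text, assistant_text):
    """Build one training example in the chat fine-tuning format."""
    return {
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }

example = make_example(
    "Marv is a factual chatbot that is also sarcastic.",
    "What's the capital of France?",
    "Paris, as if everyone doesn't know that already.",
)

# One example per line: json.dumps produces a single-line string by default.
line = json.dumps(example)
print(line)
```

Writing one such line per example to a file is what makes it a JSON Lines (.jsonl) dataset.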

Here we'll only fine-tune the ChatGPT model itself (gpt-3.5-turbo), so we won't cover the formats for models like 'babbage-002' and others.

Ok, given this format, let's create our dataset.

Quote from the actual documentation:


> To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.



Let's try to get at least 30 examples for our custom dataset.

Just as with regular fine-tuning of any machine learning model, it's advisable to establish a train-test split so that you can keep track of your model's performance.
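A split like that could be sketched as follows. This is only an illustration under assumptions: the function name `train_test_split`, the 20% test fraction, and the fixed seed are all choices made here, and the placeholder strings stand in for real training dictionaries.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one example for testing.
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder examples standing in for real training dictionaries.
examples = [f"example_{i}" for i in range(30)]
train, test = train_test_split(examples)
print(len(train), len(test))
```

Using a seeded `random.Random` instance keeps the split reproducible between runs without touching the global random state.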

Also, it's extremely important to be aware of token limits and token costs, given that this will be a paid API.

Each training example is limited to the context length of ChatGPT (4096 tokens), so it's worth putting some good practices in place to avoid issues with large inputs. The [OpenAI documentation](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset#:~:text=Each%20training%20example,the%20OpenAI%20cookbook.) recommends checking that the total token count in the message contents is under 4000 as a good rule of thumb.
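One way to apply that rule of thumb is to separate the examples that fit the budget from those that don't before uploading anything. The sketch below is hypothetical: `split_by_token_budget` and `rough_count` are names invented here, and the word-based counter is only a stand-in so the snippet runs without extra dependencies. For real counts you would pass a tokenizer-based function instead.

```python
def split_by_token_budget(examples, count_tokens, limit=4000):
    """Separate examples whose message contents stay under `limit` tokens."""
    within, over = [], []
    for example in examples:
        total = sum(count_tokens(m["content"]) for m in example["messages"])
        (within if total < limit else over).append(example)
    return within, over

# Stand-in counter for illustration only: counts whitespace-separated words,
# which is NOT the same as the model's token count.
def rough_count(text):
    return len(text.split())

examples = [
    {"messages": [{"role": "user", "content": "short prompt"},
                  {"role": "assistant", "content": "short answer"}]},
    {"messages": [{"role": "user", "content": "word " * 5000},
                  {"role": "assistant", "content": "answer"}]},
]
within, over = split_by_token_budget(examples, rough_count)
print(len(within), len(over))
```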

For the example I want to try out, I first have to gather all the relevant URLs from which to create a dataset.

- https://www.youtube.com/watch?v=j9CHA3hgA10&t=615s
- https://www.youtube.com/watch?v=W8QwNXbO5dk
- https://www.youtube.com/watch?v=xKKnWcJo0vE
- https://www.youtube.com/watch?v=MXBxNUnC7iY
- https://www.youtube.com/watch?v=2r37fvpGnMo
- https://www.youtube.com/watch?v=Q2U-tt1JffY
- https://www.youtube.com/watch?v=5VAuGvefzEI
- https://www.youtube.com/watch?v=p_MQRWH5Y6k
- https://www.youtube.com/watch?v=H3s5fx7CsZg
- https://www.youtube.com/watch?v=s-VlDoL1F6M
- https://www.youtube.com/watch?v=srLYR_PMX1o

Ok, now let's create this dataset by creating input prompts that ask for the creation of the Youtube Chapters section we desire, followed by a formatting description and the actual transcription of the original video.

Since I already have the chapters sections for those videos, I can use those as the standard desired generated output from the model.

In [1]:
urls = ["https://www.youtube.com/watch?v=srLYR_PMX1o",
"https://www.youtube.com/watch?v=s-VlDoL1F6M",
"https://www.youtube.com/watch?v=H3s5fx7CsZg",
"https://www.youtube.com/watch?v=p_MQRWH5Y6k",
"https://www.youtube.com/watch?v=5VAuGvefzEI",
"https://www.youtube.com/watch?v=Q2U-tt1JffY",
"https://www.youtube.com/watch?v=MXBxNUnC7iY",
"https://www.youtube.com/watch?v=2r37fvpGnMo",
"https://www.youtube.com/watch?v=xKKnWcJo0vE",
"https://www.youtube.com/watch?v=j9CHA3hgA10&t=615s",
"https://www.youtube.com/watch?v=W8QwNXbO5dk"]

For each of these URLs, what we want is to gather the corresponding video transcription and chapters section, to serve as input and output respectively.

For that, let's make our lives easier by creating a few helper functions that can automate part of this process:

In [2]:
import tiktoken
def get_num_tokens(prompt, model="gpt-3.5-turbo"):
    """Calculates the number of tokens in a text prompt"""

    # Use the encoding that matches the requested model
    enc = tiktoken.encoding_for_model(model)

    return len(enc.encode(prompt))


In [3]:
import openai

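def get_response(prompt_question):
    """Sends a single user prompt to the chat model and returns the reply text"""
    # Note: openai.ChatCompletion is the interface of the openai Python package before version 1.0
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "system", "content": "You are a helpful research and programming assistant"},
                  {"role": "user", "content": prompt_question}]
    )

    return response["choices"][0]["message"]["content"]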

In [19]:
get_response(f"Summarize this youtube video in bullet points: {transcription_example2_with_timestamps}")

'- The video is about setting up an OCR (Optical Character Recognition) search project to detect text within images in your local folders using Python.\n- The presenter begins with initial project setup requirements and package installations such as EasyOCR and Torch for Python version 3.10.\n- Necessary libraries such as OS and typing are imported.\n- The presenter then sets up two primary functions: one to scan images for text and another to loop over images within a folder.\n- They also set up the EasyOCR reader for text detection, specifying language as English. It is set to use a GPU if available, else falls back to CPU.\n- The first function, OCR scan, is programmed to run an OCR on an image and detect text, while the second function is setup to loop over images in a specified folder.\n- The presenter initially tests the OCR scan function on a sample image text, effectively demonstrating its function.\n- The recognition of detected text is processed and returned by the OCR scan f

Ok, great! Even for large videos we are still under the 4000-token limit!

Ok, at this point it's worth stopping, taking a step back, and thinking about what we have to do:

1. We need to get the timestamped transcriptions of the text (we'll use the actual Youtube website for that).
2. Then, we need to craft a good, all-encompassing prompt to attach to each of these transcriptions. Prompt plus transcription will serve as the input, and the existing chapters section as the output.
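Transcripts copied from the YouTube page alternate between a timestamp line and a text line. If it ever becomes useful to work with them in a structured way, a parser could be sketched like this. The name `parse_transcript` and the regular expression are assumptions made for this illustration; also note that any non-timestamp line (including chapter titles that YouTube mixes into the transcript) is treated as text belonging to the previous timestamp.

```python
import re

# Matches "M:SS", "MM:SS" or "H:MM:SS" on a line of its own.
TIMESTAMP = re.compile(r"^(?:(\d+):)?(\d{1,2}):(\d{2})$")

def parse_transcript(text):
    """Parse alternating timestamp/text lines into (seconds, text) pairs."""
    segments = []
    current_seconds = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            hours, minutes, seconds = match.groups()
            current_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        elif current_seconds is not None:
            # Text lines before the first timestamp are skipped.
            segments.append((current_seconds, line))
    return segments

sample = (
    "0:00\n"
    "In this video, you're going to learn how to create Anki flashcards using ChatGPT.\n"
    "0:05\n"
    "Okay, so we're going to start by importing our dependencies."
)
for seconds, text in parse_transcript(sample):
    print(seconds, text)
```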

Ok, I have the transcriptions now in my local folder along with the matching urls and chapters section.

In [28]:
import json
import glob
from natsort import natsorted
# Sample data

chapters_files = natsorted([file_path for file_path in glob.glob("./chapters*.txt")])
transcription_files = natsorted([file_path for file_path in glob.glob("./transcription*.txt")])

chapters_files

['./chapters_section_1.txt',
 './chapters_section_2.txt',
 './chapters_section_3.txt',
 './chapters_section_4.txt',
 './chapters_section_5.txt',
 './chapters_section_6.txt',
 './chapters_section_7.txt',
 './chapters_section_8.txt',
 './chapters_section_9.txt',
 './chapters_section_10.txt',
 './chapters_section_11.txt']

In [29]:
transcription_files

['./transcription_1.txt',
 './transcription_2.txt',
 './transcription_3.txt',
 './transcription_4.txt',
 './transcription_5.txt',
 './transcription_6.txt',
 './transcription_7.txt',
 './transcription_8.txt',
 './transcription_9.txt',
 './transcription_10.txt',
 './transcription_11.txt']

In [30]:
# Initialize an empty list to store the dictionaries
data = []

# Loop through the urls, chapters_files, and transcription_files
for url, chapter_file, transcription_file in zip(urls, chapters_files, transcription_files):
    # Initialize an empty dictionary to store data for each set
    entry = {}
    
    entry["url"] = url
    
    # Read the chapter file and store its contents
    with open(chapter_file, "r") as f:
        entry["chapters_section"] = f.read()
    
    # Read the transcription file and store its contents
    with open(transcription_file, "r") as f:
        entry["transcription"] = f.read()
        
    # Append the dictionary to the list
    data.append(entry)

# Write the list of dictionaries to a JSON file
with open("yt_chapters_dataset.json", "w") as f:
    json.dump(data, f, indent=4)


In [31]:
import json

# Load the JSON file
with open("yt_chapters_dataset.json", "r") as f:
    data = json.load(f)

# Print the data
print(data)


[{'url': 'https://www.youtube.com/watch?v=srLYR_PMX1o', 'chapters_section': '📚 Chapters:\n00:00 - Introduction \n00:01 - Creating Anki Flashcards with ChatGPT\n00:06 - Importing Dependencies\n00:14 - Setting Up OpenAI API Key\n00:29 - Getting Content from Clipboard\n00:46 - Running the Script\n00:57 - Creating Conversation with ChatGPT\n01:05 - Setting Up System Role\n01:24 - Setting Up User Message\n01:38 - Formatting for Anki Import\n01:46 - Sending Request to API\n02:35 - Writing the flashcards to file\n02:50 - Creating Anki flashcards\n03:58 - Importing flashcards to Anki', 'transcription': "0:00\nIn this video, you're going to learn how to create Anki flashcards using ChatGPT.\n0:05\nOkay, so we're going to start by importing our dependencies. So essentially, I'm going to import\n0:09\nOS to access the OpenAI API key in my environment as my environment variable. I'm going to import\n0:15\nOpenAI clipboard to get the content from my clipboard. And then Json to save stuff as a Json.

Ok, now that we have somewhat organized the data we'll use for the fine-tuning, we have to put everything in the necessary format so that we can actually train the model.

To do that, all we have to do is generate a .jsonl file that holds the information we want to fine-tune the model with.

Our system prompt will be:

- You are a helpful assistant

The user prompt will always have the following structure:

'''
Given this Youtube video transcript: <transcript> I want you to create a chapters section for this Youtube video with the following format:

"""
<📚 Chapters:\n>
<double digit:time stamp> - <Concise phrase of a major part of the video>

For example:

"""📚 Chapters:
00:05 - Introducing the concept of Neural Networks
02:30 - Discussing activation functions
etc...
"""

'''

The assistant output will always be the corresponding chapters section for that particular video.
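Since the assistant outputs are what the model learns to imitate, it can be worth checking that every chapters section really follows the format described above before training on it. Here is a rough sketch of such a check. The function name `is_valid_chapters_section` and the exact regular expression are assumptions for illustration, based on the `MM:SS - title` pattern shown in the example.

```python
import re

# A chapter line: two digits, colon, two digits, " - ", then a non-empty title.
CHAPTER_LINE = re.compile(r"^\d{2}:\d{2} - \S.*$")

def is_valid_chapters_section(section):
    """Check for the '📚 Chapters:' header followed by 'MM:SS - title' lines."""
    lines = [line.strip() for line in section.splitlines() if line.strip()]
    if len(lines) < 2 or lines[0] != "📚 Chapters:":
        return False
    return all(CHAPTER_LINE.match(line) for line in lines[1:])

good = (
    "📚 Chapters:\n"
    "00:05 - Introducing the concept of Neural Networks\n"
    "02:30 - Discussing activation functions"
)
bad = "Chapters\n0:05 Introducing the concept of Neural Networks"
print(is_valid_chapters_section(good), is_valid_chapters_section(bad))
```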

In [33]:
import json

# Function to read the original JSON file
def read_original_json(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)

# Function to generate the new JSON structure
def generate_new_json_structure(original_data):
    new_data = {"messages": []}
    
    # Add a system message
    system_message = {"role": "system", "content": "You are a helpful assistant."}
    new_data["messages"].append(system_message)
    
    for entry in original_data:
        url = entry["url"]
        chapters = entry["chapters_section"]
        transcription = entry["transcription"]
        
        # Add user message
        user_message_content = f"Given this Youtube video transcript: {transcription} I want you to create a chapters section for this Youtube video with the following format:\n"
        user_message_content += "📚 Chapters:\n"
        user_message_content += "<double digit:time stamp> - <Concise phrase of a major part of the video>"
        user_message = {"role": "user", "content": user_message_content}
        new_data["messages"].append(user_message)
        
        # Add assistant message
        assistant_message_content = chapters  # Assuming chapters are already formatted as desired
        assistant_message = {"role": "assistant", "content": assistant_message_content}
        new_data["messages"].append(assistant_message)
        
    return new_data

# Read the original JSON file
original_data = read_original_json("./yt_chapters_dataset.json")

# Generate the new JSON structure
new_data = generate_new_json_structure(original_data)

# Write the new JSON structure to a file
with open("./new_combined_data.json", "w") as f:
    json.dump(new_data, f, indent=4)

The version above puts every example into a single `messages` list, which is not the format we need: each training example must be its own dictionary with its own `messages` list, one per line. Here is the corrected version, which writes a .jsonl file:

In [53]:
import json

# Function to read the original JSON file
def read_original_json(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)

# Function to generate the new JSON structure
def generate_new_json_structure(original_data):
    new_data_list = []
    
    for entry in original_data:
        new_data = {"messages": []}
        
        # Add a system message
        system_message = {"role": "system", "content": "You are a helpful assistant."}
        new_data["messages"].append(system_message)
        
        url = entry["url"]
        chapters = entry["chapters_section"]
        transcription = entry["transcription"]
        
        # Add user message
        user_message_content = f"Given this Youtube video transcript: {transcription} I want you to create a chapters section for this Youtube video with the following format:\n"
        user_message_content += "📚 Chapters:\n"
        user_message_content += "<double digit:time stamp> - <Concise phrase of a major part of the video>"
        user_message = {"role": "user", "content": user_message_content}
        new_data["messages"].append(user_message)
        
        # Add assistant message
        assistant_message_content = chapters  # Assuming chapters are already formatted as desired
        assistant_message = {"role": "assistant", "content": assistant_message_content}
        new_data["messages"].append(assistant_message)
        
        new_data_list.append(new_data)
        
    return new_data_list

# Read the original JSON file
original_data = read_original_json("./yt_chapters_dataset.json")

# Generate the new JSON structure
new_data_list = generate_new_json_structure(original_data)

# Write the new JSON structure to a .jsonl file
with open("./dataset_fine_tunning.jsonl", "w") as f:
    for item in new_data_list:
        json.dump(item, f)
        f.write('\n')

Before moving on, let's check the potential costs of this fine-tuning:

In [47]:
# source: https://cookbook.openai.com/examples/chat_finetuning_data_prep

In [48]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [55]:
data_path = "./dataset_fine_tunning.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 11
First example:
{'role': 'system', 'content': 'You are a helpful assistant.'}
{'role': 'user', 'content': "Given this Youtube video transcript: 0:00\nIn this video, you're going to learn how to create Anki flashcards using ChatGPT.\n0:05\nOkay, so we're going to start by importing our dependencies. So essentially, I'm going to import\n0:09\nOS to access the OpenAI API key in my environment as my environment variable. I'm going to import\n0:15\nOpenAI clipboard to get the content from my clipboard. And then Json to save stuff as a Json.\n0:24\nHere's what we're going to do. First, we're going to set up an OpenAI API key.\n0:29\nThen we're going to get the content from my clipboard. So here I'm getting the content\n0:35\nfrom my clipboard. And the idea is I'm going to copy some text from whatever from a paper,\n0:41\nfrom a website, etc. And I'm going to run the script. And that's it. I'm going to copy some text,\n0:47\nrun the script. And when I run the script, it's goin

Perfect! Now let's estimate some token counts and costs.

We'll be checking against the validation checklist described in the [tutorial suggested by the OpenAI docs](https://cookbook.openai.com/examples/chat_finetuning_data_prep#:~:text=Format%20validation,for%20easier%20debugging.).

In [56]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [57]:
# Some helpful utilities

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Note: the label says p5 / p95, but the values computed are the 10th and 90th percentiles
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

Data warnings and token counts.

In [58]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 946, 3979
mean / median: 2247.5454545454545, 1926.0
p5 / p95: 1480.0, 3098.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 63, 282
mean / median: 143.1818181818182, 130.0
p5 / p95: 91.0, 240.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


As we can see from the output, all of our examples are under the context length, which means they won't have to be truncated. Great news! :)

Cost estimation

Let's estimate the cost of this fine-tuning based on the number of tokens in our dataset.

In [59]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~24723 tokens that will be charged for during training
By default, you'll train for 9 epochs on this dataset
By default, you'll be charged for ~222507 tokens


Ok, as we can see, this fine-tuning is going to cost a bit. Let's check the [pricing page](https://openai.com/pricing) for the cost associated with this amount of tokens:

![](2023-10-16-17-22-32.png)


Let's write a little function to calculate these costs automatically:

In [67]:
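def calculate_cost_for_fine_tunning(token_count):
    """Returns the training cost in dollars; token_count is in thousands of tokens (0.0080 dollars per 1K tokens)"""
    return 0.0080*token_count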

In [68]:
calculate_cost_for_fine_tunning(222.507)

1.780056

Ok, so it seems that the amount will be 1.78 dollars, which seems reasonable. We should also take into account that we'll be charged more for using this fine-tuned model, and that we only used 11 examples for this use case!
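The same arithmetic can be written so that it takes the raw token count directly, which avoids having to divide by 1000 by hand. This is just a sketch: `estimate_training_cost` is a name made up here, and the default rate of 0.0080 dollars per 1K tokens is the one used in this walkthrough, so it should be checked against the current pricing page before relying on it.

```python
def estimate_training_cost(n_tokens, price_per_1k_tokens=0.0080):
    """Estimate the training cost in dollars from a raw token count."""
    return n_tokens / 1000 * price_per_1k_tokens

# 222507 is the billed-token estimate printed earlier (24723 tokens x 9 epochs).
cost = estimate_training_cost(222_507)
print(f"${cost:.2f}")
```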

Now let's upload our newly generated dataset file!

To do that, let's write a simple Python script to upload the file. We'll use this snippet from the [OpenAI API docs](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset#:~:text=import%20os%20import%20openai%20openai.api_key%20%3D%20os.getenv(%22openai_api_key%22)%20openai.file.create(%20file%3Dopen(%22mydata.jsonl%22%2C%20%22rb%22)%2C%20purpose%3D'fine-tune'%20)).

In [69]:
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.File.create(
  file=open("./dataset_fine_tunning.jsonl", "rb"),
  purpose='fine-tune'
)

<File file id=file-kXlvspUQRKFUWcrW5pgzy0PW at 0x16b9e68b0> JSON: {
  "object": "file",
  "id": "file-kXlvspUQRKFUWcrW5pgzy0PW",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 91665,
  "created_at": 1697473593,
  "status": "uploaded",
  "status_details": null
}

Now, finally, we create our fine-tuned model by running this:

In [70]:
openai.FineTuningJob.create(training_file="file-kXlvspUQRKFUWcrW5pgzy0PW", model="gpt-3.5-turbo")

<FineTuningJob fine_tuning.job id=ftjob-0G57koxcbLk7c6Z4maGRVPEd at 0x2bc475630> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-0G57koxcbLk7c6Z4maGRVPEd",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1697473689,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-gpLJbCQWtORw077QTyeX1IVP",
  "result_files": [],
  "status": "validating_files",
  "validation_file": null,
  "training_file": "file-kXlvspUQRKFUWcrW5pgzy0PW",
  "hyperparameters": {
    "n_epochs": "auto"
  },
  "trained_tokens": null,
  "error": null
}

Here the `training_file` parameter corresponds to the file id generated when running the previous snippet.


Now we wait for an email confirmation that will let us know when the training is done, which, given the size of this job, should be pretty quick.

A neat thing is that you can programmatically query the status of your jobs:

In [74]:
# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)

# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ftjob-0G57koxcbLk7c6Z4maGRVPEd")

# Cancel a job (takes the fine-tuning job id, not the file id)
#openai.FineTuningJob.cancel("ftjob-0G57koxcbLk7c6Z4maGRVPEd")

# List up to 10 events from a fine-tuning job
#openai.FineTuningJob.list_events(id="ftjob-0G57koxcbLk7c6Z4maGRVPEd", limit=10)

<FineTuningJob fine_tuning.job id=ftjob-0G57koxcbLk7c6Z4maGRVPEd at 0x16aad3db0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-0G57koxcbLk7c6Z4maGRVPEd",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1697473689,
  "finished_at": 1697474050,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:personal::8AKpfptp",
  "organization_id": "org-gpLJbCQWtORw077QTyeX1IVP",
  "result_files": [
    "file-DZhopqkDrs5xSzQsrzA4zbjN"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-kXlvspUQRKFUWcrW5pgzy0PW",
  "hyperparameters": {
    "n_epochs": 9
  },
  "trained_tokens": 222309,
  "error": null
}
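
Instead of waiting for the email, the status query above could also be wrapped in a simple polling loop. The sketch below is hypothetical: `wait_for_job` and its parameters are invented here, and the status names other than `validating_files` and `succeeded` (which appear in the outputs above) are assumptions. A stand-in function replaces the real API call so the snippet runs without an API key.

```python
import time

# Statuses after which the job will not change anymore (partly assumed).
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=30, max_polls=120):
    """Call `fetch_status()` until it returns a terminal status or polls run out."""
    status = None
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    return status

# Stand-in for a real status lookup, so the sketch runs offline.
fake_statuses = iter(["validating_files", "running", "succeeded"])
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

With the client used in this notebook, `fetch_status` would presumably be something like a lambda that calls `openai.FineTuningJob.retrieve(...)` with the job id and reads its `status` field.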

![](2023-10-16-17-39-30.png)

Now we can go to the platform to check out our fine-tuned model, or run inference from it using the API! Let's look at how to run our fine-tuned model.

In the platform we see:

![](2023-10-16-17-40-30.png)

So, we can choose that model and see how it performs for a different new video.

Querying the model:

![](2023-10-16-17-49-09.png)

The response:

![](2023-10-16-17-49-24.png)

It looks pretty good! :) Let's compare it with the output of regular ChatGPT:

![](2023-10-16-17-50-32.png)

The fine-tuned model is definitely more thorough than the regular ChatGPT model, so I would call this a nice success! :)

Now, let's run inference with our fine-tuned model!

In [76]:
prompt = """
Given this Youtube video transcript:
'''Introduction and setting up the OCR search project
0:00
that just put this right here there yep what's  up guys welcome back to the channel in this video  
0:05
we're gonna learn how to do OCR search over  text inside images in your local files in a  
0:11
local folder so I already have my Jupiter notebook  setup over here and essentially what you're going  
0:16
to need is OCR search ask your foot python equal  3.10 and you're gonna You're Gonna Wanna install  
Initial project setup requirements and Python package installations
0:25
Ezio CR and torch and that's what you need to get  set up I'm already set up so I'm already ready to  
0:31
go I'm gonna import OS typing I'm gonna import  a list from the type checking so I will do type  
Importing necessary libraries
0:38
checking on the functions and p0shar to do to have  a lifting of the user then we're going to set up  
Setting up the functions to scan images for text and loop over images in a folder
0:45
two functions one to scan the um images for text  and the second one to Loop over the image inside  
0:52
of a folder so and before we do that of course we  gotta set up the redial CR so we're gonna set up  
0:58
reader equal to easusar.reader we're going to give  it the English language and here I'm setting up  
Setting up EasyOCR reader for text detection
1:04
GPU to true because I have a GPU machine but if  you don't just set this to post and now I'm going  
1:10
to run this and now what we need is those are  those two functions so we're gonna need the OCR  
Coding the OCR scan function
1:16
scan function that runs the OCR in an image and  another one that Loops over images in a folder  
1:23
so we're gonna set up Dual Star scan it's gonna  take in the image path as the keyword argument as  
1:29
a as an argument that's going to be a string and  we're going to return a string and the string is  
1:34
going to be obviously the text that was detected  inside the image and now we're gonna set up  
1:41
the function so essentially here we're calling  the reader and the read text method inside the  
1:47
reader that we instantiated over there the reader  object and we're going to give it the image pass  
1:55
and now we're gonna say the recognized text is  just a join of all the element texts inside the  
Setting up the function to process recognized text
2:05
image that are stored in the result variable and  now that we've done that we're going to return  
2:11
the regular guys text and we can do from here  is we're going to set up right now just a toy  
2:19
just a template boilerplate of the search images  so that we have it on our the skeleton over app  
Introduction of a template for the search images function
2:25
and we're going to go back to this function later  to finish it off but now I want to test the OCR  
2:30
scan function that we just read that we just wrote  so I'm gonna come here I'm gonna set up the image  
Initial testing of the OCR scan function
2:37
path and I have an image path I have an image  inside as a test there you guys can check out here  
2:43
it's just just a copy paste of some text that I  have in my machine and I'm gonna run the OCR scan  
2:51
on that image for you guys to have an idea I'm  going to just run it here and there you go super  
Execution of the OCR scan on an image for demonstration
3:00
quick because I'm using the GPU and it detected  all the text inside of that image so that's  
3:05
perfect and now I'm gonna move this search images  function over here and now we can start setting  
3:15
this image up setting this function up so search  images essentially is going to take in a directory  
Starting the development of the search images function
3:22
okay and that's going to be a strain and  a keyword which is going to be the thing  
3:28
that we're searching for when we are doing our  OCR search because remember the goal of this  
3:33
is not just to do Char over the uh text inside  images in local files in the local folder but  
3:41
to find a specific keyword that we're looking for  within the text that we find in those images so  
3:47
that's that can be a really useful tool and it's  really simple to set up with python and easier  
3:52
so this keyword is going to also be a string  and we're going to Output the output of this  
3:59
function is going to be a list of image paths that  contain the keyword that we're looking for okay so  
4:08
that's that's essentially that's essentially  our search images function so now we're going  
Explanation of how the search images function works
4:13
to say okay so we're going to define a matching  images list right and let's get a store all the  
4:18
images that match our requirements meaning  contains the keyword that we're looking for  
4:23
and we're going to say update so forth root and  then directory and then files in OST I'll walk  
4:31
directory okay so inside of the stain so for  file in files what we're gonna do is we're gonna  
4:41
say if the file ends with uh that ends with and  now here I'm going to give a tuple with all the  
4:51
possible image extensions so wpng.jpg and Dot  J back with the knee and that's good for now  
5:02
and we're going to set up image best so join  gluten file perfect that's Gelco pilot already  
5:08
recognize that that's what I want then we're going  to give the text so let's say detect the text  
5:15
detective text is equal to a shark scan image  path and then if the keyword dot lower in record  
5:26
in the detected text exactly detected text the  lower so now I want to make everything lower so  
5:33
it's easier to search for we don't want to have  issues where there's some Capital um some some  
5:40
letter characters that are capitalized and then  from that reason we don't find keyword that we're  
5:46
looking for and now I can say all right so if that  happens if we find our keyword we're gonna pin  
5:54
that path to our matching images list finally at  the end we're going to return the matching images  
6:05
and that's going to be what we're looking  for so now to test the search images  
Testing the search images function
6:10
what we're going to do is I'm gonna give the  current folder as uh the reference and the keyword  
6:19
I'm going to set up keyword let's take a look at  this one so this one image test if we came here  
6:30
we'll see no one for Windows please the official  instructions here so this is some random text that  
6:35
I just copy pasted from my screen somewhere I  don't even know where so now let's look for the  
6:40
word note okay so that's the one that's written  inside this thing here just to see if it works  
6:45
so I'm going to say keyword equal note and what  I expect to happen is it should return a list  
6:54
containing the name image test.png as the response  so that's if that works if we run it and no it  
7:01
doesn't it says type list cannot be instantiated  use list instead and that's where where where list  
7:10
string that should be correct ah yeah  because yeah oh sorry that's my mistake  
7:19
I always my mistake I should have done this and  there we go we found it and we return it so this  
7:27
thing is working so what we're going to do is  we're going to transfer all of this code through  
Transitioning the code to an application callable from terminal
7:32
an application that we can call from the terminal  so I'm going to open up another file here on JS  
7:37
code I'm going to say Lucia search dot pi and now  I'm going to put that on my right and I'm going  
7:44
to come back here and I'm going to come here to  my notebook I'm going to copy these two functions  
7:50
to my right uh yeah perfect we're going to come  here we're also going to copy this one yep and  
7:59
we're going to copy the setting up so this thing  and now we're going to copy the Imports obviously
8:09
and all we need now let's remove this and let's  move that and all we need now is a main function  
Developing a CLI tool to enable OCR search from terminal
8:20
that's going to contain the CLI tool that we're  going to use so that we can call all this from  
8:26
the terminal okay so that's what we're going  to be doing so this one is defined a CLI tool  
8:34
that allows for OCR search for HC word over  uh images in a local folder or a single image  
8:50
that's what our main function is going to be  doing let's set up this thing here I'm going  
8:56
to set up this thing here so now with our main  function we're going to be doing I already have  
9:00
it here so I'm just going to copy One Lie by  line so that you guys can take a look and see  
9:06
what what was going to be happening so now we're  going to need arc bars so that we set up our cli2  
9:13
and let's write it up so we don't need this  part of the Jupiter notebook anymore okay  
9:20
so our main function is going to be so we're  going to set up our parser okay so we're gonna  
9:25
set up uh ARG parse dot argument parser and then  that's we're going to give it a description the  
9:32
description is going to be uh OCR search over  local images add an argument and then we're  
9:42
gonna say this is going to be our directory so  where we set up the directory so it's going to  
9:47
be called directory and the type obviously is  going to be a string and this is the directory
10:00
the images
10:04
perfect and now we're going to say  parser.add argument the first one  
10:11
here is going to be the keyword so KW and  then here we're gonna say minus keyword  
10:22
and we're gonna say type equals string  as well and we're gonna say help equal to
Parsing the arguments for the CLI tool
10:31
uh this is the three word text we'll
10:38
look for perfect and now that we have that  we can add uh the option of a single image  
10:48
so we can say parser dot add argument minus I so  let me put it like this minus I and then we're  
10:59
gonna say image and then we're gonna say type  equal Str and then we're gonna say help equal
11:09
d image to scan the single image case to Scag  not the single image the scan is better yeah  
11:22
perfect now that we have these things we  can just get our arguments so we can say  
11:27
R is equal to parser dot uh parse args right so  now we can you know leverage RMS and then if the  
Coding the logic to handle directory search and keyword recognition
11:36
directory is given then we're going to say well  matching images is equal to search homages uh  
11:45
and we're going to fit in the directory so we're  going to say uh args.directory R and rs.q word  
11:55
and we're gonna print so we're gonna say images  that console the keyword and we're gonna Loop over
12:09
the matching images and we're gonna print  that's called instead of calling image we're  
12:16
going to call it image name image path because  those are not actual loaded images but they're  
12:24
paths two images now in the case that we don't  Define the directory that means that we want to  
Handling single image OCR search in the CLI tool
12:31
scan a single image so we say okay and what is the  detected text for that single image that's perfect  
12:38
and then that's correct our image contains the  image exactly so we say okay uh if that is inside  
12:46
we say detected image in the image and then we're  gonna say this this this and we're going to give d  
12:54
image and we're gonna say all right scan the image  and then we're not going to have a Passover the  
13:02
image so we're gonna say just the detected text  so see where the Titan the image so we're going  
13:12
to say yes and then we're gonna print the tactic  texts detected texts and I'm gonna just here  
13:24
and I'm gonna say this and this else keyword  not attacking the image and that is it okay  
13:31
now that we have that all we need to do is we're  going to set up our main function here and we're  
13:36
going to call it so now I can open near my  terminal and I can come here and say okay  
13:42
uh I'm going to run this function so it's  going to crawl let's try search we're going  
13:47
to do it over one image so let's say minus I it's  gonna be the image that we just tested over here  
13:57
and the keyword is going to be nopes and now we  run it let's see if it works keyword not detected  
14:08
in the image that is kind of weird so let's  see English test the PNG because it's note in  
14:14
our notes so that was my mistake so I'm going to  run that again with Note so now we run it keyword  
14:22
detected and then it prints the detected text as  expected perfect so now just to finish things off  
14:28
instead of the single image we're gonna give the  directory the director is going to be the current  
Running and testing the application in the terminal
14:33
directory so that you know it's easier for us so  just we just test a simple example and instead  
14:39
of using the word node let's look for the word  windows so I'm just gonna I'm gonna say windows  
14:47
and let's see what happens we should return a list  with the path to image test.png and that's exactly  
14:55
what happens images that contain the keyword  and it gives me the path so all right guys so  
15:01
that's how you set up a simple OCR search app to  look for text over images in your local folder  
Conclusion and review of the project.
15:07
if you guys like this video don't forget to  like And subscribe and see you next time cheers'''

I want you to create a chapters section for this Youtube video with the following format: Chapters:\n<double digit:time stamp> - <Concise phrase of a major part of the video>
"""

completion = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo-0613:personal::8AKpfptp",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
)
print(completion.choices[0].message)

{
  "role": "assistant",
  "content": "Chapters:\n\n00:00 - Introduction and setting up the OCR search project\n00:25 - Initial project setup requirements and Python package installations\n00:38 - Importing necessary libraries\n00:45 - Setting up the functions to scan images for text and loop over images in a folder\n01:04 - Setting up EasyOCR reader for text detection\n01:16 - Coding the OCR scan function\n02:05 - Setting up the function to process recognized text\n02:25 - Introduction of a template for the search images function\n02:37 - Initial testing of the OCR scan function\n03:22 - Starting the development of the search images function\n04:13 - Explanation of how the search images function works\n06:05 - Testing the search images function\n07:32 - Transitioning the code to an application callable from terminal\n08:09 - Developing a CLI tool to enable OCR search from terminal\n10:00 - Parsing the arguments for the CLI tool\n11:36 - Coding the logic to handle directory search and 

Success! The fine-tuned model returns a chapters section in the requested format.
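To check the result programmatically rather than by eye, the `content` string can be parsed line by line. The snippet below is a minimal sketch, not part of the original notebook: the `parse_chapters` helper and the `CHAPTER_LINE` pattern are hypothetical names, and the check assumes the `Chapters:` header followed by `MM:SS - <phrase>` lines seen in the output above.

```python
import re

# Matches lines such as "04:13 - Explanation of how the search images function works".
CHAPTER_LINE = re.compile(r"^(\d{2}):(\d{2}) - (.+)$")


def parse_chapters(content):
    """Return a list of (seconds, title) tuples parsed from a chapters string.

    Raises ValueError if the "Chapters:" header is missing or if any
    non-empty line does not match the "MM:SS - title" pattern.
    """
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if not lines or lines[0] != "Chapters:":
        raise ValueError("missing 'Chapters:' header")

    chapters = []
    for line in lines[1:]:
        match = CHAPTER_LINE.match(line)
        if match is None:
            raise ValueError(f"unexpected line: {line!r}")
        minutes, seconds, title = match.groups()
        chapters.append((int(minutes) * 60 + int(seconds), title))
    return chapters


# Two lines taken from the model output shown above.
sample = (
    "Chapters:\n\n"
    "00:00 - Introduction and setting up the OCR search project\n"
    "04:13 - Explanation of how the search images function works"
)
chapters = parse_chapters(sample)
print(chapters)
```

In the notebook, the same helper could be called on `completion.choices[0].message["content"]`. A truncated final line, like the one in the output above, still matches the pattern as long as its timestamp and separator are intact.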

import tiktoken
def get_num_tokens(prompt, model="gpt-3.5-turbo"):
    """Calculate the number of tokens in a text prompt for the given model."""

    # Use the tokenizer that matches the requested model instead of a hardcoded name
    enc = tiktoken.encoding_for_model(model)

    return len(enc.encode(prompt))


# Open the file and read its contents
with open('./chatgpt_fine_tunning_content.txt', 'r') as file:
    contents = file.read()

# Call the get_num_tokens() function on the contents
num_tokens = get_num_tokens(contents)

# Print the number of tokens
print(num_tokens)

4264
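A token count like this is mostly useful when compared against a limit. If a text turns out to be too long, one option is to split it into consecutive chunks that each stay within a token budget. The snippet below is a rough sketch under stated assumptions: `split_by_token_budget` is a hypothetical helper, the budget is a parameter rather than any specific model's limit, and the demo uses a whitespace word counter as a stand-in so that it runs without `tiktoken`.

```python
def split_by_token_budget(lines, count_tokens, max_tokens):
    """Group consecutive lines into chunks whose summed token count stays within max_tokens.

    count_tokens is any callable that maps a string to an integer token count.
    A single line that exceeds the budget on its own is kept as its own chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for line in lines:
        n = count_tokens(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks


# Stand-in counter for the demo (assumption): one token per whitespace-separated word.
def word_count(text):
    return len(text.split())


demo_lines = ["a b c", "d e", "f g h i", "j"]
demo_chunks = split_by_token_budget(demo_lines, word_count, max_tokens=5)
print(demo_chunks)
```

In the notebook, `get_num_tokens` could be passed as `count_tokens` with `contents.splitlines()` as the input. Counts are summed per line, so the newline separators are not included and the totals are approximate; leaving some headroom in `max_tokens` would account for that.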
