In [None]:
!pip install openai
!pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# loading from a .env file
# load_dotenv(dotenv_path="/full/path/to/your/.env")

# or
# if you're on Google Colab, just uncomment the line below and replace the placeholder with your OpenAI API key
# os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# Fine-Tuning ChatGPT to Create Automatic Planning Schedules

Let's look at the documentation on the [OpenAI website](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset) to learn how to fine-tune a ChatGPT model for a specific use case using a custom dataset.

Starting with the basics, each training example should follow this format:

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

So, to fine-tune a ChatGPT model we'll need to create a dataset organized in the format shown above. Essentially, we'll have a `.jsonl` file with one JSON object per line (one per training example), each with the following structure, shown here across several lines for readability:

```
{"messages": [{"role": "system", "content": "<some text content>"},
              {"role": "user", "content": "<some text content>"},
              {"role": "assistant", "content": "<some text content>"}]}
```
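To make the one-object-per-line layout concrete, here is a minimal sketch that builds a single training example and serializes it as one JSONL line. The helper name `make_example` is hypothetical and not part of the OpenAI SDK.

```python
import json


def make_example(system_prompt, user_text, assistant_text):
    """Build one chat-format training example (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


example = make_example(
    "Marv is a factual chatbot that is also sarcastic.",
    "What's the capital of France?",
    "Paris, as if everyone doesn't know that already.",
)

# Each example becomes exactly one line of the .jsonl file.
line = json.dumps(example)

# Reading the line back recovers the same dictionary.
restored = json.loads(line)
print(restored == example)
```

Because `json.dumps` escapes newlines inside strings, multi-line assistant answers still fit on a single line of the file.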

Here we'll only fine-tune a chat model (`gpt-3.5-turbo`), so we won't cover the formats for models like `babbage-002` and others.

Ok, given this format, let's create our dataset.

Quote from the actual documentation:


> To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.



Let's try to get at least 30 examples for our custom dataset.

Just as with fine-tuning any machine learning model, it's advisable to establish a train-test split so that you can keep track of your model's performance.
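As a rough illustration of such a split, the sketch below shuffles a list of examples with a fixed seed and holds out a fraction for validation. The function name `split_examples` and the 20% default are assumptions chosen for illustration, not values from the OpenAI docs.

```python
import random


def split_examples(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


examples = [f"example {i}" for i in range(30)]
train, validation = split_examples(examples)
print(len(train), len(validation))  # 24 6
```

With very small datasets the held-out set is tiny, so treat any validation numbers as a sanity check rather than a reliable estimate.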

Also, it's extremely important to be aware of token limits and token costs, given that this is a paid API.

Each training example is limited to the context length of ChatGPT (4096 tokens here), so it's worth putting some good practices in place to avoid issues with large inputs. The [OpenAI documentation](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset#:~:text=Each%20training%20example,the%20OpenAI%20cookbook.) recommends checking that the total token count in the message contents is under 4,000 as a good rule of thumb.
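One way to apply that rule of thumb is sketched below. The helper `find_long_examples` is hypothetical and takes the token counter as a parameter; the `rough_count` stand-in only counts whitespace-separated words so the example runs without extra dependencies, and it is not a real tokenizer.

```python
def find_long_examples(dataset, count_tokens, limit=4000):
    """Return (index, token_count) for examples whose message contents exceed `limit`."""
    too_long = []
    for i, example in enumerate(dataset):
        total = sum(count_tokens(m["content"]) for m in example["messages"])
        if total > limit:
            too_long.append((i, total))
    return too_long


# Stand-in counter for illustration only: words are NOT the same as tokens.
def rough_count(text):
    return len(text.split())


toy_dataset = [
    {"messages": [{"role": "user", "content": "short request"},
                  {"role": "assistant", "content": "short answer"}]},
    {"messages": [{"role": "user", "content": "word " * 5000},
                  {"role": "assistant", "content": "ok"}]},
]
print(find_long_examples(toy_dataset, rough_count))  # [(1, 5001)]
```

In practice you would pass a real counter built on `tiktoken` instead of `rough_count`, since word counts can differ substantially from token counts.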

Let's first create this dataset programmatically using an OpenAI chat model (`gpt-4o-mini` in the code below).

In [1]:
from openai import OpenAI
import tiktoken
from IPython.display import display, Markdown


def get_response(prompt):
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-4o-mini", 
                             messages=
                             [
                                 {"role": "system", "content": "You are a helpful assistant."},
                                 {"role": "user", "content": prompt}   
                             ],
                             temperature=0.0,
                             n = 1
                             )
    return response.choices[0].message.content


def get_num_tokens(prompt, model="gpt-3.5-turbo"):
    """Calculates the number of tokens in a text prompt"""    

    enc = tiktoken.encoding_for_model(model)

    return len(enc.encode(prompt))

Great! Now, let's generate some task descriptions with the model and manually inspect them to select the best ones for our custom training dataset.

In [2]:
prompt = "Create 15 examples of 1 paragraph first person human-like descriptions of tasks and goals for the upcoming weeks. Each example should be in a bullet point."

task_paragraphs = get_response(prompt)

display(Markdown(task_paragraphs))

- I plan to finish writing the first draft of my novel by the end of next week. I will dedicate at least two hours every day to writing and revising to ensure that I stay on track and meet my goal.
- My goal for the upcoming weeks is to complete the online course I enrolled in and earn a certificate in digital marketing. I will allocate time each day to watch the lectures, complete the assignments, and participate in the discussion forums to maximize my learning.
- I aim to improve my physical fitness by incorporating a new workout routine into my schedule. I will focus on strength training three times a week and incorporate yoga and stretching exercises to improve flexibility and reduce stress.
- I have set a goal to declutter and organize my entire home over the next few weeks. I will tackle one room at a time, sorting through items, donating or discarding what I no longer need, and creating a more functional and peaceful living space.
- My task for the upcoming weeks is to research and plan a budget-friendly vacation for my family. I will compare travel options, accommodations, and activities to ensure that we have a memorable and affordable trip.
- I am committed to learning a new language, and my goal for the upcoming weeks is to practice speaking and listening for at least 30 minutes each day. I will use language learning apps, watch foreign films, and engage in conversations with native speakers to improve my skills.
- I plan to enhance my culinary skills by trying out new recipes and cooking techniques. I will set aside time each week to experiment with different cuisines and expand my cooking repertoire.
- My goal for the upcoming weeks is to establish a daily meditation practice to promote mindfulness and reduce stress. I will start with short sessions and gradually increase the duration to incorporate meditation into my daily routine.
- I aim to improve my time management skills by creating a detailed schedule for each day and prioritizing tasks based on their importance and deadlines. I will use time-tracking tools to monitor my progress and make adjustments as needed.
- I have set a goal to read at least two books per month, and I will dedicate time each day to immerse myself in literature and expand my knowledge and imagination.
- My task for the upcoming weeks is to develop a personal finance plan to save money and achieve my financial goals. I will review my expenses, create a budget, and explore investment opportunities to secure my financial future.
- I plan to enhance my professional skills by enrolling in an online course related to my field of work. I will dedicate time each day to study the course materials and apply the new knowledge to my job.
- My goal for the upcoming weeks is to improve my networking skills by reaching out to industry professionals and attending virtual networking events. I will set specific goals for each interaction and follow up with contacts to build meaningful connections.
- I aim to prioritize self-care by incorporating regular exercise, healthy eating, and relaxation techniques into my daily routine. I will create a self-care plan and commit to practicing self-care activities consistently.
- I have set a goal to volunteer at a local charity organization and contribute to meaningful causes in my community. I will research volunteer opportunities and dedicate my time and skills to make a positive impact.

In [3]:
task_list_prompts = task_paragraphs.split("\n")

task_list_prompts

['- I plan to finish writing the first draft of my novel by the end of next week. I will dedicate at least two hours every day to writing and revising to ensure that I stay on track and meet my goal.',
 '- My goal for the upcoming weeks is to complete the online course I enrolled in and earn a certificate in digital marketing. I will allocate time each day to watch the lectures, complete the assignments, and participate in the discussion forums to maximize my learning.',
 '- I aim to improve my physical fitness by incorporating a new workout routine into my schedule. I will focus on strength training three times a week and incorporate yoga and stretching exercises to improve flexibility and reduce stress.',
 '- I have set a goal to declutter and organize my entire home over the next few weeks. I will tackle one room at a time, sorting through items, donating or discarding what I no longer need, and creating a more functional and peaceful living space.',
 '- My task for the upcoming wee

In [4]:
for t in task_list_prompts:
    print(t)

- I plan to finish writing the first draft of my novel by the end of next week. I will dedicate at least two hours every day to writing and revising to ensure that I stay on track and meet my goal.
- My goal for the upcoming weeks is to complete the online course I enrolled in and earn a certificate in digital marketing. I will allocate time each day to watch the lectures, complete the assignments, and participate in the discussion forums to maximize my learning.
- I aim to improve my physical fitness by incorporating a new workout routine into my schedule. I will focus on strength training three times a week and incorporate yoga and stretching exercises to improve flexibility and reduce stress.
- I have set a goal to declutter and organize my entire home over the next few weeks. I will tackle one room at a time, sorting through items, donating or discarding what I no longer need, and creating a more functional and peaceful living space.
- My task for the upcoming weeks is to research 

In [5]:
task_steps = []
for t in task_list_prompts:
    prompt = f"Given this paragraph description of tasks and goals: \n '''{t}''' \n create a bullet point list with all the necessary steps to accomplish all of them."
    task_steps.append(get_response(prompt))

Let's evaluate the dataset generated programmatically:

In [6]:
for t in task_steps:
    display(Markdown(t))

- Set a specific daily writing schedule, allocating at least two hours each day for writing and revising.
- Break down the novel into manageable sections or chapters to work on each day.
- Create a checklist of tasks for each writing session, such as outlining, drafting, revising, and editing.
- Set specific milestones for completing each section of the novel to track progress.
- Eliminate distractions during writing time to maximize productivity.
- Seek feedback from trusted individuals to ensure the quality and coherence of the novel.
- Allocate time for rest and relaxation to avoid burnout and maintain creativity.

- Allocate specific time each day to watch the lectures
- Complete the assignments on time
- Actively participate in the discussion forums
- Review and revise the course material regularly
- Seek additional resources or help if needed
- Stay organized and focused on the course material
- Prepare for any assessments or exams
- Stay updated with the latest trends and developments in digital marketing

- Research and choose a suitable strength training routine
- Schedule three strength training sessions per week
- Research and choose a suitable yoga routine
- Schedule yoga sessions to be incorporated into weekly routine
- Research and choose stretching exercises to improve flexibility
- Schedule stretching exercises to be incorporated into weekly routine
- Set specific goals for physical fitness improvement
- Monitor progress and adjust routine as needed to achieve goals

- Assess the entire home and create a plan to tackle one room at a time
- Gather necessary organizing supplies such as bins, baskets, and labels
- Sort through items in each room, deciding what to keep, donate, or discard
- Clean and organize the items that are being kept
- Donate or discard items that are no longer needed
- Create a system for maintaining organization in each room
- Enjoy the functional and peaceful living space created by the decluttering and organizing efforts

- Research travel options, including flights, train, and car rental
- Compare accommodations such as hotels, vacation rentals, and hostels
- Research and plan activities and attractions for the vacation destination
- Create a budget spreadsheet to track expenses and ensure affordability
- Consider cost-saving measures such as booking in advance or using travel rewards
- Consult with family members to gather input and preferences for the vacation
- Make reservations for transportation, accommodations, and activities
- Continuously review and adjust the plan to ensure it remains budget-friendly

- Use language learning apps to practice speaking and listening for at least 30 minutes each day.
- Watch foreign films to improve listening skills and exposure to the language.
- Engage in conversations with native speakers to practice speaking and listening in real-life situations.

- Research and collect new recipes and cooking techniques from various sources such as cookbooks, cooking websites, and cooking shows.
- Set aside a specific day and time each week dedicated to experimenting with new recipes and cooking techniques.
- Create a list of different cuisines and dishes to try, ensuring a diverse range of cooking experiences.
- Gather necessary ingredients and kitchen tools for each new recipe or cooking technique.
- Document the results of each cooking experiment, noting successes and areas for improvement.
- Continuously seek feedback from friends and family to improve culinary skills.
- Reflect on the progress and adjust the plan as needed to expand the cooking repertoire.

- Set a specific time each day for meditation
- Start with short 5-10 minute meditation sessions
- Find a quiet and comfortable space for meditation
- Use guided meditation apps or videos to help with the practice
- Gradually increase the duration of meditation sessions
- Keep a journal to track progress and experiences with meditation
- Stay consistent and committed to the daily practice

- Create a detailed schedule for each day
- Prioritize tasks based on importance and deadlines
- Use time-tracking tools to monitor progress
- Make adjustments to the schedule as needed
- Set specific goals for improving time management skills

- Schedule dedicated time each day for reading
- Create a list of books to read for the month
- Set aside time for book selection and research
- Track progress and adjust reading schedule as needed
- Find a comfortable and quiet reading space
- Take notes and reflect on the books read
- Explore different genres and authors to expand knowledge and imagination

- Review expenses
- Create a budget
- Explore investment opportunities
- Develop a personal finance plan
- Save money
- Achieve financial goals
- Secure financial future

- Research and select a relevant online course related to my field of work
- Enroll in the chosen online course
- Set aside dedicated time each day to study the course materials
- Take notes and actively engage with the course content
- Apply the new knowledge gained from the course to my job tasks and projects

- Research industry professionals and identify potential contacts
- Reach out to industry professionals via email or LinkedIn
- Set specific goals for each networking interaction, such as learning about a specific aspect of the industry or gaining insights into a particular company
- Attend virtual networking events and actively participate in discussions
- Follow up with contacts after networking interactions to express gratitude and continue building the relationship
- Keep track of networking activities and connections made for future reference and follow-up

- Create a self-care plan
- Schedule regular exercise sessions
- Plan and prepare healthy meals
- Incorporate relaxation techniques into daily routine
- Commit to practicing self-care activities consistently

- Research local charity organizations and their volunteer opportunities
- Contact the chosen charity organization to inquire about their volunteer needs and application process
- Schedule a time to volunteer and commit to the agreed-upon schedule
- Prepare necessary skills or training required for the volunteer work
- Dedicate time and effort to contribute to the meaningful causes of the charity organization
- Reflect on the impact made and consider further involvement or support

In [7]:
for desc,steps in zip(task_list_prompts, task_steps):
    print(desc) 
    print(steps)
    print("************")

- I plan to finish writing the first draft of my novel by the end of next week. I will dedicate at least two hours every day to writing and revising to ensure that I stay on track and meet my goal.
- Set a specific daily writing schedule, allocating at least two hours each day for writing and revising.
- Break down the novel into manageable sections or chapters to work on each day.
- Create a checklist of tasks for each writing session, such as outlining, drafting, revising, and editing.
- Set specific milestones for completing each section of the novel to track progress.
- Eliminate distractions during writing time to maximize productivity.
- Seek feedback from trusted individuals to ensure the quality and coherence of the novel.
- Allocate time for rest and relaxation to avoid burnout and maintain creativity.
************
- My goal for the upcoming weeks is to complete the online course I enrolled in and earn a certificate in digital marketing. I will allocate time each day to watch 

In [8]:
empty_indices = [i for i, x in enumerate(task_list_prompts) if not x]
print(empty_indices)

[]
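The check above came back empty for this run, but a model can also return blank lines between bullets. As a hedged sketch, a small cleaning helper (the name `clean_bullets` is hypothetical) could drop blank lines and strip the leading bullet marker. Note that the dataset built in this notebook keeps the `- ` prefix, so using this helper would change the user messages slightly.

```python
def clean_bullets(raw_text):
    """Split model output into items, dropping blank lines and leading bullet markers."""
    items = []
    for line in raw_text.split("\n"):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank separator lines
        if stripped.startswith(("- ", "* ")):
            stripped = stripped[2:].strip()
        items.append(stripped)
    return items


sample = "- First goal.\n\n- Second goal.\n"
print(clean_bullets(sample))  # ['First goal.', 'Second goal.']
```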


In [9]:
# # Define the indexes to remove
# indexes_to_remove = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27]

# # Remove the elements with the given indexes from task_list_prompts and task_steps
# task_list_prompts = [t for i, t in enumerate(task_list_prompts) if i not in indexes_to_remove]
# task_steps = [t for i, t in enumerate(task_steps) if i not in indexes_to_remove]

In [None]:
# prompt_to_gpt4 = """Consider this format for a dataset for the chatgpt fine tuning api:
# '''
# {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
# {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
# {"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
# '''

# Write the python code to create a dataset like this one programatically given that the system prompt will always be: "You are a helpful planning assistant"
# and that the user prompts will be inside a list called 'task_list_prompts' and the assistant content will be inside a list called 'task_steps'."""

In [10]:
import json

# Ensure the lists have the same length
assert len(task_list_prompts) == len(task_steps), "Mismatched lengths between prompts and responses"

# Creating the dataset
dataset = []
system_prompt = "You are a helpful planning assistant"

for user_content, assistant_content in zip(task_list_prompts, task_steps):
    interaction = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ]
    }
    dataset.append(interaction)

In [11]:
dataset

[{'messages': [{'role': 'system',
    'content': 'You are a helpful planning assistant'},
   {'role': 'user',
    'content': '- I plan to finish writing the first draft of my novel by the end of next week. I will dedicate at least two hours every day to writing and revising to ensure that I stay on track and meet my goal.'},
   {'role': 'assistant',
    'content': '- Set a specific daily writing schedule, allocating at least two hours each day for writing and revising.\n- Break down the novel into manageable sections or chapters to work on each day.\n- Create a checklist of tasks for each writing session, such as outlining, drafting, revising, and editing.\n- Set specific milestones for completing each section of the novel to track progress.\n- Eliminate distractions during writing time to maximize productivity.\n- Seek feedback from trusted individuals to ensure the quality and coherence of the novel.\n- Allocate time for rest and relaxation to avoid burnout and maintain creativity.'}

In [12]:
# Writing to a file
with open('dataset.jsonl', 'w') as f:
    for entry in dataset:
        f.write(json.dumps(entry) + '\n')

# Now 'dataset.jsonl' will contain the dataset in the desired format.

In [13]:
# source: https://cookbook.openai.com/examples/chat_finetuning_data_prep
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [14]:
data_path = "./dataset.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 15
First example:
{'role': 'system', 'content': 'You are a helpful planning assistant'}
{'role': 'user', 'content': '- I plan to finish writing the first draft of my novel by the end of next week. I will dedicate at least two hours every day to writing and revising to ensure that I stay on track and meet my goal.'}
{'role': 'assistant', 'content': '- Set a specific daily writing schedule, allocating at least two hours each day for writing and revising.\n- Break down the novel into manageable sections or chapters to work on each day.\n- Create a checklist of tasks for each writing session, such as outlining, drafting, revising, and editing.\n- Set specific milestones for completing each section of the novel to track progress.\n- Eliminate distractions during writing time to maximize productivity.\n- Seek feedback from trusted individuals to ensure the quality and coherence of the novel.\n- Allocate time for rest and relaxation to avoid burnout and maintain creativity.'}


Perfect! Now let's estimate some token counts and costs.

We'll go through the validation checklist described in the [tutorial suggested by the OpenAI docs](https://cookbook.openai.com/examples/chat_finetuning_data_prep#:~:text=Format%20validation,for%20easier%20debugging.).

In [15]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [16]:
# Some helpful utilities

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Note: the label says p5 / p95, but the values computed are the 10th and 90th percentiles.
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

Data warnings and token counts.

In [17]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 97, 184
mean / median: 141.06666666666666, 142.0
p5 / p95: 103.8, 179.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 35, 127
mean / median: 77.0, 76.0
p5 / p95: 40.6, 112.6

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


As the output shows, all of our examples are well under the context length, so none of them will be truncated, which is great news! :)

Cost estimation

Let's estimate the cost of this fine-tuning run based on the number of tokens in our dataset.

In [18]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~2116 tokens that will be charged for during training
By default, you'll train for 6 epochs on this dataset
By default, you'll be charged for ~12696 tokens


OK, it looks like this fine-tuning run is not going to cost much. Here is the price listed on the [pricing page](https://openai.com/pricing) for this number of tokens:

![](2023-10-16-17-22-32.png)


Let's write a little function to calculate these costs automatically:

In [19]:
def calculate_cost_for_fine_tunning(token_count, api_cost=0.008):
    """Estimate the training cost in USD; api_cost is the price per 1,000 tokens."""
    return (api_cost * token_count) / 1000

In [20]:
calculate_cost_for_fine_tunning(n_epochs * n_billing_tokens_in_dataset)

0.10156799999999999

OK, so the estimated cost is about $0.10 (10 cents), which seems reasonable. Keep in mind that we'll also be charged more for using the fine-tuned model, and that we only used 15 examples for this use case!
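To see where these numbers come from, the sketch below redoes the arithmetic by hand. It mirrors the epoch heuristic from the cost-estimation cell and reuses the values printed above (15 examples, about 2116 tokens, and the $0.008 per 1,000 tokens default); the function name `default_epochs` is introduced here only for illustration.

```python
def default_epochs(n_examples, target_epochs=3, min_target=100, max_target=25000,
                   min_epochs=1, max_epochs=25):
    """Mirror the epoch heuristic used in the cost-estimation cell."""
    if n_examples * target_epochs < min_target:
        return min(max_epochs, min_target // n_examples)
    if n_examples * target_epochs > max_target:
        return max(min_epochs, max_target // n_examples)
    return target_epochs


n_examples = 15
tokens_in_dataset = 2116
price_per_1k_tokens = 0.008  # USD, same default as calculate_cost_for_fine_tunning

epochs = default_epochs(n_examples)               # 15 * 3 = 45 < 100, so 100 // 15 = 6
billed_tokens = epochs * tokens_in_dataset        # 6 * 2116 = 12696
cost_usd = billed_tokens * price_per_1k_tokens / 1000
print(epochs, billed_tokens, round(cost_usd, 2))  # 6 12696 0.1
```

The small dataset is why the default jumps from 3 to 6 epochs, which doubles the billed tokens compared with a plain three-epoch run.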

Now let's upload our newly generated dataset file!

To do that, let's write a simple Python script to upload the file, using this snippet from the [OpenAI API docs](https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset#:~:text=import%20os%20import%20openai%20openai.api_key%20%3D%20os.getenv(%22openai_api_key%22)%20openai.file.create(%20file%3Dopen(%22mydata.jsonl%22%2C%20%22rb%22)%2C%20purpose%3D'fine-tune'%20)).

In [22]:
# import os
# import openai

# openai.File.create(
#   file=open("./dataset.jsonl", "rb"),
#   purpose='fine-tune'
# )

from openai import OpenAI
# OpenAI API key should be set as 
# environment variable - OPENAI_API_KEY
client = OpenAI()
client.files.create(
  file=open("dataset.jsonl", "rb"),
  purpose="fine-tune"
)

FileObject(id='file-oHEAh3zbbQxpSwHOyLBQEike', bytes=12150, created_at=1700930723, filename='dataset.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

Now, finally, we create our fine-tuned model by running this:

In [23]:
client.fine_tuning.jobs.create(
  training_file="file-oHEAh3zbbQxpSwHOyLBQEike", 
  model="gpt-3.5-turbo"
)

FineTuningJob(id='ftjob-1IgjILLuqGmuoXc8Ywd1Gy5Q', created_at=1700930744, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-1106', object='fine_tuning.job', organization_id='org-gpLJbCQWtORw077QTyeX1IVP', result_files=[], status='validating_files', trained_tokens=None, training_file='file-oHEAh3zbbQxpSwHOyLBQEike', validation_file=None)

The `fine_tuning.job` object represents a fine-tuning job that has been created through the API.

- `id`: The object identifier, which can be referenced in the API endpoints.
- `created_at`: The Unix timestamp (in seconds) for when the fine-tuning job was created.
- `error`: For fine-tuning jobs that have failed, this will contain more information on the cause of the failure.
- `fine_tuned_model`: The name of the fine-tuned model that is being created. The value will be null if the fine-tuning job is still running.
- `finished_at`: The Unix timestamp (in seconds) for when the fine-tuning job was finished. The value will be null if the fine-tuning job is still running.
- `hyperparameters`: The hyperparameters used for the fine-tuning job. See the fine-tuning guide for more details.
- `model`: The base model that is being fine-tuned.
- `object`: The object type, which is always "fine_tuning.job".
- `organization_id`: The organization that owns the fine-tuning job.
- `result_files`: The compiled results file ID(s) for the fine-tuning job. You can retrieve the results with the Files API.
- `status`: The current status of the fine-tuning job, which can be either `validating_files`, `queued`, `running`, `succeeded`, `failed`, or `cancelled`.
- `trained_tokens`: The total number of billable tokens processed by this fine-tuning job. The value will be null if the fine-tuning job is still running.
- `training_file`: The file ID used for training. You can retrieve the training data with the Files API.
- `validation_file`: The file ID used for validation. You can retrieve the validation results with the Files API.

Here the `training_file` parameter corresponds to the file ID returned by the upload snippet above.
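
As a small illustration of how these fields can be read, here is a sketch that turns the Unix timestamps into readable dates and computes how long a job took. The helper name `summarize_job` is hypothetical, and the example uses a stand-in object (not a real API response) filled with values of an already finished job from this account.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

def summarize_job(job):
    """Collect a few fields of a fine-tuning job into a plain dict."""
    created = datetime.fromtimestamp(job.created_at, tz=timezone.utc)
    # finished_at is None while the job is still running
    if job.finished_at is None:
        finished, duration = None, None
    else:
        finished = datetime.fromtimestamp(job.finished_at, tz=timezone.utc)
        duration = job.finished_at - job.created_at
    return {
        "id": job.id,
        "status": job.status,
        "created": created.isoformat(),
        "finished": finished.isoformat() if finished else None,
        "duration_seconds": duration,
        "fine_tuned_model": job.fine_tuned_model,
    }

# Stand-in for a FineTuningJob object, for illustration only
example_job = SimpleNamespace(
    id="ftjob-rPPzK29a1UkXFSHTdEQX4heY",
    status="succeeded",
    created_at=1697756148,
    finished_at=1697756446,
    fine_tuned_model="ft:gpt-3.5-turbo-0613:personal::8BWIRV1U",
)
print(summarize_job(example_job))
```

The same function should work on the object returned by `client.fine_tuning.jobs.create(...)`, since it only reads attributes listed above.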


Now we wait for an email confirmation that lets us know when the training is done, which, given the size of this job, should be pretty quick.

In [25]:
# legacy (pre-1.0 SDK) equivalent: list_files = openai.File.list()
client.files.list()

SyncPage[FileObject](data=[FileObject(id='file-oHEAh3zbbQxpSwHOyLBQEike', bytes=12150, created_at=1700930723, filename='dataset.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None), FileObject(id='file-i0xEiew0XE7JmeemNkPowpNw', bytes=12150, created_at=1700930657, filename='dataset.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None), FileObject(id='file-swbi7i0sZ2tamia92owNij4O', bytes=1071063, created_at=1699876627, filename='image_bad.png', object='file', purpose='assistants', status='processed', status_details=None), FileObject(id='file-rMnEZwgsstclLm8X8754AyPR', bytes=1071063, created_at=1699876449, filename='image_bad.png', object='file', purpose='assistants', status='processed', status_details=None), FileObject(id='file-ofI2BgpSm3POkFWNUkVi0q1R', bytes=1071063, created_at=1699876390, filename='image_bad.png', object='file', purpose='assistants', status='processed', status_details=None), FileObject(id='file-UcDnvqXEd

A neat thing is that you can programmatically query the status of your jobs:

In [26]:
client.fine_tuning.jobs.list()

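# Example response object from the events endpoint (client.fine_tuning.jobs.list_events);
# note that each item here is a job event, not a job as returned by jobs.list()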
# {
#   "object": "list",
#   "data": [
#     {
#       "object": "fine_tuning.job.event",
#       "id": "ft-event-TjX0lMfOniCZX64t9PUQT5hn",
#       "created_at": 1689813489,
#       "level": "warn",
#       "message": "Fine tuning process stopping due to job cancellation",
#       "data": null,
#       "type": "message"
#     },
#     { ... },
#     { ... }
#   ], "has_more": true
# }

SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-1IgjILLuqGmuoXc8Ywd1Gy5Q', created_at=1700930744, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=6, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-1106', object='fine_tuning.job', organization_id='org-gpLJbCQWtORw077QTyeX1IVP', result_files=[], status='running', trained_tokens=None, training_file='file-oHEAh3zbbQxpSwHOyLBQEike', validation_file=None), FineTuningJob(id='ftjob-rPPzK29a1UkXFSHTdEQX4heY', created_at=1697756148, error=None, fine_tuned_model='ft:gpt-3.5-turbo-0613:personal::8BWIRV1U', finished_at=1697756446, hyperparameters=Hyperparameters(n_epochs=6, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-gpLJbCQWtORw077QTyeX1IVP', result_files=['file-DpX9PVkO1Y9bIndbRBjADyUS'], status='succeeded', trained_tokens=38298, training_file='file-ICGS6dyaltzMJuXdnHcpjGTr', validation_file=None), Fi
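
The commented example response above contains event objects, each with a timestamp, a level, and a message. A minimal sketch of turning one such event into a log-style line could look like this; `format_event` is a hypothetical helper, and it assumes the event is available as a plain dict shaped like the example.

```python
from datetime import datetime, timezone

def format_event(event):
    """Render one fine-tuning job event dict as a single log-style line."""
    created = datetime.fromtimestamp(event["created_at"], tz=timezone.utc)
    return f"{created:%Y-%m-%d %H:%M:%S} [{event['level']}] {event['message']}"

# Copied from the example response object above
example_event = {
    "object": "fine_tuning.job.event",
    "id": "ft-event-TjX0lMfOniCZX64t9PUQT5hn",
    "created_at": 1689813489,
    "level": "warn",
    "message": "Fine tuning process stopping due to job cancellation",
}
print(format_event(example_event))
```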

Retrieve a single fine-tuning job by its ID:

In [28]:
from openai import OpenAI
client = OpenAI()

fine_tune_job = client.fine_tuning.jobs.retrieve("ftjob-1IgjILLuqGmuoXc8Ywd1Gy5Q")
print(fine_tune_job.status)

running


We can programmatically check the status of our fine-tuning job until it is done.

In [30]:
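import time

# Keep polling while the job is in any non-terminal state, not just "running":
# a new job starts in "validating_files" and may sit in "queued" first
while fine_tune_job.status in ("validating_files", "queued", "running"):
    fine_tune_job = client.fine_tuning.jobs.retrieve("ftjob-1IgjILLuqGmuoXc8Ywd1Gy5Q")
    print(fine_tune_job.status)
    time.sleep(5)
print("Fine tuning job finished!")
print(fine_tune_job.status)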

succeeded
Fine tuning job finished!
succeeded
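
The polling idea can also be wrapped in a reusable helper that stops on any terminal status and gives up after a timeout. This is a sketch under a few assumptions: the name `wait_for_job`, the `timeout_seconds` default, and the injectable `retrieve` and `sleep` arguments are all choices made here so the helper can be tried without calling the API.

```python
import time

# Terminal statuses taken from the `status` field description above
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=5, timeout_seconds=3600, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status or the timeout passes."""
    waited = 0
    while True:
        job = retrieve(job_id)
        print(job.status)
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still {job.status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from this notebook, the call would be `wait_for_job(client.fine_tuning.jobs.retrieve, "ftjob-1IgjILLuqGmuoXc8Ywd1Gy5Q")`. A job that ends in `failed` or `cancelled` is returned as well, so check `job.status` (and `job.error`) before using the model.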


In [None]:
# # List 10 fine-tuning jobs
# client.fine_tuning.jobs.list(limit=10)

# # Retrieve the state of a fine-tune
# client.fine_tuning.jobs.retrieve("ftjob-rPPzK29a1UkXFSHTdEQX4heY")

# # Cancel a job (takes a job ID, "ftjob-...", not a file ID)
# client.fine_tuning.jobs.cancel("ftjob-abc123")

# # List up to 10 events from a fine-tuning job
# client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)

Cancel a fine-tuning job:

In [None]:
from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.cancel("ftjob-abc123")

Now we can go to the platform to check out our fine-tuned model, or run inference with it through the API. Let's look at how to run our fine-tuned model.

In the platform we see:

![](./assets-resources/fine_tune_model.png)

So, we can choose that model and see how it performs for a different set of tasks.

Querying the model:
![](2023-10-20-00-05-50.png)

In [31]:
get_response("Generate a random list of tasks a normal programmer would have to do in a typical day")

"Sure, here's a random list of tasks a programmer might have to do in a typical day:\n\n1. Review and respond to emails and messages from team members and stakeholders.\n2. Attend daily stand-up meetings to discuss progress and plan for the day.\n3. Write code to implement new features or fix bugs in existing software.\n4. Test and debug code to ensure it functions as expected.\n5. Collaborate with other team members to discuss technical solutions and best practices.\n6. Participate in code reviews to provide feedback on colleagues' code and receive feedback on your own code.\n7. Research and learn about new technologies and tools that could improve the development process.\n8. Document code and write technical documentation for new features or changes.\n9. Attend meetings with product managers and designers to discuss requirements and user interface designs.\n10. Help troubleshoot and resolve technical issues reported by users or support teams.\n\nThis list is just a sample and may va

![](./assets-resources/fine-tune-response-planning.png)

The fine-tuned model's answer here is more thorough than the regular ChatGPT model's, so I would call this a nice success! :)

Now, let's run inference with our fine-tuned model!

In [33]:
from IPython.display import display, Markdown

def get_response(prompt, fine_tuned_model_id="ft:gpt-3.5-turbo-1106:personal::8Oq91T4e"):
    """Send a single user prompt to the fine-tuned model and return the reply text."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=fine_tuned_model_id,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,  # keep the output as repeatable as possible
        n=1,  # ask for a single completion
    )
    return response.choices[0].message.content


response = get_response("Tomorrow I have to practice for my presentation at the live-training for O'Reilly at least a couple more times, then run through all the slides and notebooks to check everything is in order. After that I have Jiu Jitsu training and one more rehearsal before the live-training at 18:00")
Markdown(response)

- Practice for presentation at the live-training for O'Reilly
- Run through all the slides and notebooks to check everything is in order
- Jiu Jitsu training
- One more rehearsal for the live-training at 18:00
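
Since the model answers with a Markdown bullet list, the reply can be turned into a Python list for further processing (for example, to feed a calendar or a to-do app). The following is only a sketch: `parse_task_list` is a hypothetical helper, and it assumes each task sits on its own line starting with `- ` or `* `, as in the response above.

```python
def parse_task_list(markdown_text):
    """Extract the text of each top-level bullet from a Markdown bullet list."""
    tasks = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- ", "* ")):
            tasks.append(stripped[2:].strip())
    return tasks

# The response shown above, pasted in as a string
response_text = """- Practice for presentation at the live-training for O'Reilly
- Run through all the slides and notebooks to check everything is in order
- Jiu Jitsu training
- One more rehearsal for the live-training at 18:00"""

tasks = parse_task_list(response_text)
for number, task in enumerate(tasks, start=1):
    print(number, task)
```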

In [37]:
client.models.delete("ft:gpt-3.5-turbo-1106:personal::8Oq91T4e")

ModelDeleted(id='ft:gpt-3.5-turbo-1106:personal::8Oq91T4e', deleted=True, object='model')

In [38]:
from openai import OpenAI
client = OpenAI()

client.files.list()

SyncPage[FileObject](data=[FileObject(id='file-TN1bKdmEvA2CcBF74q4rarl6', bytes=1767, created_at=1700931008, filename='step_metrics.csv', object='file', purpose='fine-tune-results', status='processed', status_details=None), FileObject(id='file-oHEAh3zbbQxpSwHOyLBQEike', bytes=12150, created_at=1700930723, filename='dataset.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None), FileObject(id='file-i0xEiew0XE7JmeemNkPowpNw', bytes=12150, created_at=1700930657, filename='dataset.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None), FileObject(id='file-swbi7i0sZ2tamia92owNij4O', bytes=1071063, created_at=1699876627, filename='image_bad.png', object='file', purpose='assistants', status='processed', status_details=None), FileObject(id='file-rMnEZwgsstclLm8X8754AyPR', bytes=1071063, created_at=1699876449, filename='image_bad.png', object='file', purpose='assistants', status='processed', status_details=None), FileObject(id='file-of

In [39]:
client.files.delete("file-TN1bKdmEvA2CcBF74q4rarl6")

FileDeleted(id='file-TN1bKdmEvA2CcBF74q4rarl6', deleted=True, object='file')

In [41]:
client.files.delete("file-oHEAh3zbbQxpSwHOyLBQEike")

FileDeleted(id='file-oHEAh3zbbQxpSwHOyLBQEike', deleted=True, object='file')

In [43]:
for f in client.files.list():
    client.files.delete(f.id)

Success!!!!!
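
Note that the loop above deletes every file in the account, including the `assistants` files that appear in the listing. If only the fine-tuning artifacts should go, the file list can be filtered by `purpose` first. This is a sketch with a hypothetical helper, `files_to_delete`, shown on stand-in objects rather than a live API response.

```python
from types import SimpleNamespace

def files_to_delete(files, purposes=("fine-tune", "fine-tune-results")):
    """Return the IDs of files whose purpose is in the given set of purposes."""
    return [f.id for f in files if f.purpose in purposes]

# Stand-ins for FileObject entries, using IDs and purposes from the listing above
listing = [
    SimpleNamespace(id="file-TN1bKdmEvA2CcBF74q4rarl6", purpose="fine-tune-results"),
    SimpleNamespace(id="file-oHEAh3zbbQxpSwHOyLBQEike", purpose="fine-tune"),
    SimpleNamespace(id="file-swbi7i0sZ2tamia92owNij4O", purpose="assistants"),
]
print(files_to_delete(listing))
```

With the real client, the selected IDs could then be passed one by one to `client.files.delete(...)`, for example `for file_id in files_to_delete(client.files.list()): client.files.delete(file_id)`.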