<a href="https://colab.research.google.com/github/Moritzslz/Fine-Tune-GPTs/blob/main/fine_tune_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised OpenAI GPT fine-tuning by Moritz Schultz

*   www.moritzschultz.de

*   https://github.com/Moritzslz


I created this notebook to make it easy to generate training datasets with GPT-4 (the code below uses `gpt-4o`) for fine-tuning GPT-3.5-turbo.


# Here is how it works:

# Step 1: Set Your API Key

First, you need to set your OpenAI API key. This key allows the notebook to interact with the OpenAI API.
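
Hard-coding a key in a shared notebook makes it easy to leak. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (the helper `load_api_key` is hypothetical and not part of this notebook):

```python
import os

def load_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Read the API key from the environment instead of the notebook source.

    `environ` defaults to os.environ; it is a parameter only so the helper
    can be tried out with a plain dict.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return key

# Possible usage in the configuration cell (left commented out so this
# block runs even when the variable is not set):
# your_api_key = load_api_key()
```

The returned value could then be assigned to `your_api_key` in place of the empty string.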

# Step 2: Define the Number of Datasets

Specify the number of training examples you want to generate. Each example is one prompt-response pair, which this notebook calls a "dataset". To fine-tune GPT-3.5-turbo, it is recommended to have at least 50 training examples.

# Step 3: Set the Temperature

The temperature controls the randomness of the model's output. A higher value (e.g., 1) makes the output more random, while a lower value (e.g., 0.2) makes it more deterministic.
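
To see why temperature changes randomness, here is a conceptual sketch of temperature-scaled softmax, the standard technique behind this parameter. The logits are made up for illustration, and this is not a description of OpenAI's internal implementation:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature.

    A temperature of 0 is not handled here (it would divide by zero).
    """
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.0]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 1.0)  # probability spread more evenly

print("temperature 0.2:", [round(p, 3) for p in low])
print("temperature 1.0:", [round(p, 3) for p in high])
```

With the lower temperature almost all probability mass sits on the highest-scoring token, so sampling picks it nearly every time.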

# Step 4: Set the Maximum Tokens per Request

Define the maximum number of tokens the model can use in a single response. This helps in controlling the length of the generated content.
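
One practical consequence: if the limit is too low, a JSON response can be cut off mid-object and fail to parse. A minimal sketch of a guard, assuming you pass in the message content and the `finish_reason` reported for the choice (the helper name `parse_complete_json` is hypothetical):

```python
import json

def parse_complete_json(content, finish_reason):
    """Return the parsed JSON, or None if the response was cut off or is invalid.

    A finish_reason of "length" means generation stopped at the token limit,
    so the JSON text is probably incomplete.
    """
    if finish_reason == "length":
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None

# A complete response parses, a truncated one is rejected
print(parse_complete_json('{"messages": []}', "stop"))
print(parse_complete_json('{"messages": [', "length"))
```

A rejected response could simply be regenerated, possibly with a higher token limit.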

# Step 5: Define the Fine-Tune Task Description

Provide a detailed description of the task for which you want to fine-tune the model. This description will guide the generation of relevant prompt-response pairs.



# Learn about OpenAI's pricing and when it makes sense to fine-tune

Before you create a fine-tuned model, check whether it is worthwhile from a cost perspective. Generating a dataset with 50 training examples costs about 8€ using GPT-4.
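
As a rough way to budget, the figure above can be scaled to other dataset sizes. This sketch assumes cost grows linearly with the number of examples, which is an assumption and not something the pricing page guarantees. Because this notebook resends earlier examples as context on every request, real costs will likely grow faster than this estimate. The function name and defaults are illustrative:

```python
def estimate_generation_cost_eur(num_examples,
                                 reference_cost_eur=8.0,
                                 reference_examples=50):
    """Linearly scale the reference point (about 8 EUR for 50 examples).

    Treat the result as a lower bound: later requests carry more context
    and therefore cost more than earlier ones.
    """
    if num_examples < 0:
        raise ValueError("num_examples must not be negative")
    return num_examples * reference_cost_eur / reference_examples

for count in (3, 50, 100):
    print(count, "examples: roughly", round(estimate_generation_cost_eur(count), 2), "EUR")
```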

You can find OpenAI's pricing information [here](https://openai.com/api/pricing/).

Furthermore, I recommend [this](https://platform.openai.com/docs/guides/fine-tuning/when-to-use-fine-tuning) documentation from OpenAI about when to use fine-tuning.

In [None]:
# Step 1
your_api_key = ""
# Step 2
number_of_datasets_to_generate = 3
# Step 3
temperature = 1
# Step 4
maximum_tokens_per_request = 1024
# Step 5 (this is an example)
fine_tune_task_description = "I aim to fine-tune a model to classify customer support tickets into predefined categories. The objective is to automate the initial triaging process, directing each ticket to the appropriate department or support team, thereby improving efficiency and response times in addressing customer inquiries. The predefined categories are: Technical Support, Billing and Payment Inquiries, Product Information and Features, Order Status and Shipping, Account Management, General Inquiries and FAQs, Feedback and Suggestions, Complaints and Escalations."

In [None]:
!pip install openai

Collecting openai
  Downloading openai-1.30.1-py3-none-any.whl (320 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.5 ht

In [None]:
from openai import OpenAI

client = OpenAI(api_key=your_api_key)

def generate_system_message(client, fine_tune_task_description):
  messages = [
    {"role": "system",
     "content": 'You are a very beneficial assistant for creating training data used for OpenAI fine-tuning. Please generate a system message based on the detailed description of the model I want to train according to this json schema: {"role": "system", "content": "$system_message_goes_here"}. Please return the json in one line. Remember that you are not generating the system message for data generation but a fitting concise and detailed system message for the fine-tuned model to use as instructions.'},
    {"role": "user",
     "content": "Please take your time and work on this step by step. Please generate the system message based on this fine-tune model description: " + fine_tune_task_description}
  ]
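  # Apply the temperature and token limit configured in steps 3 and 4
  completion = client.chat.completions.create(
      model="gpt-4o",
      response_format={"type": "json_object"},
      temperature=temperature,
      max_tokens=maximum_tokens_per_request,
      messages=messages
  )
  # The content is a JSON string, not yet a parsed dict
  return completion.choices[0].message.content

system_message = generate_system_message(client, fine_tune_task_description)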

print(system_message)

{"role": "system", "content": "You are tasked with classifying customer support tickets into one of the predefined categories to automate the initial triaging process. The categories to classify tickets into are: Technical Support, Billing and Payment Inquiries, Product Information and Features, Order Status and Shipping, Account Management, General Inquiries and FAQs, Feedback and Suggestions, Complaints and Escalations. Your accurate classification will help direct each ticket to the appropriate department or support team, thereby improving efficiency and response times in addressing customer inquiries."}


In [None]:
import json

def generate_training_dataset(client, number_of_datasets_to_generate):

  messages = [
    {"role": "system",
     "content": 'You are a very beneficial assistant for creating training data used for OpenAI fine-tuning. Please generate training data that can be used to train machine learning models. You will be given a detailed description of the model I want to train. Based on that description please generate prompt-response pairs according to this json schema: {"messages": [{"role": "user", "content": "$prompt_goes_here"}, {"role": "assistant", "content": "$response_goes_here"}]}. Please return the json in one line. Only one prompt-response pair should be generated per request. Make sure that the sample data is diverse but high quality. With each following request increase the complexity of the generated prompt-response pair, please. Please try to have a great diversity. Description of the model: ' + fine_tune_task_description},
    {"role": "user",
     "content": "Please take your time and work on this step by step. Please generate the next training dataset and make sure it's diverse compared to the ones before."}
  ]

  training_dataset = []

  for i in range(number_of_datasets_to_generate):
    # Apply the temperature and token limit configured in steps 3 and 4
    completion = client.chat.completions.create(
      model="gpt-4o",
      response_format={"type": "json_object"},
      temperature=temperature,
      max_tokens=maximum_tokens_per_request,
      messages=messages
    )
    assistant_response = completion.choices[0].message.content
    response_data = json.loads(assistant_response)
    # system_message is a JSON string, so parse it into a dict before inserting it
    response_data["messages"].insert(0, json.loads(system_message))
    modified_assistant_response = json.dumps(response_data)
    # Keep the raw response in the history so later examples can differ from earlier ones
    messages.append({"role": "assistant", "content": assistant_response})
    training_dataset.append(modified_assistant_response)
    print("Generating training data... " + str(i + 1) + "/" + str(number_of_datasets_to_generate))

  return training_dataset

training_dataset = generate_training_dataset(client, number_of_datasets_to_generate)

Generating training data... 1/3
Generating training data... 2/3
Generating training data... 3/3


In [None]:
print("Successfully generated training examples.")

# Write one JSON object per line (JSONL), the format used for fine-tuning uploads
with open("training_dataset.jsonl", "w") as f:
  for message in training_dataset:
    f.write(message + "\n")

print("JSON file has been created successfully!")

Successfully generated training examples.
JSON file has been created successfully!
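
Before uploading anything, it can be worth checking that every generated example matches the chat schema used in the prompts above. A minimal sketch, assuming each example is one JSON string as stored in `training_dataset` (the helper `validate_example` and its messages are hypothetical, and the check is not exhaustive):

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(line):
    """Return a list of problems found in one JSON line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index} has an unexpected role")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index} has no string content")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems

# Made-up example in the same shape the notebook generates
example = json.dumps({"messages": [
    {"role": "system", "content": "Classify the support ticket."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "Billing and Payment Inquiries"},
]})
print(validate_example(example))  # prints []
```

Running it over every entry in `training_dataset` would flag, for instance, a system message that was inserted as a raw string instead of an object.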
