# ChatGPT-3.5 fine-tuning

This notebook contains code for fine-tuning the GPT-3.5 Turbo model (`gpt-3.5-turbo`) on the [RecipeNLG dataset](https://recipenlg.cs.put.poznan.pl/dataset). It is based on [OpenAI's fine-tuning example](https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates).

Our goal is to adapt GPT-3.5 Turbo to extract ingredients from recipe descriptions and return them as a JSON array of strings. Essentially, we're building a model that condenses detailed recipe content into a list of generic ingredients.

This notebook assumes that you have already downloaded the RecipeNLG dataset. If not, download it now and place the archive in the root of this project.

## 1. Dataset

First, we need to prepare the environment and the dataset for fine-tuning.

In [45]:
import os
import json
import zipfile
import pandas as pd
import openai
import getpass
from pprint import pprint

# Unpack the dataset archive unless it has already been extracted
full_dataset = "./dataset/full_dataset.csv"  #@param {type:"string"}
dataset_archive = "./dataset.zip"  #@param {type:"string"}
if not os.path.exists(full_dataset):
    if os.path.exists(dataset_archive):
        with zipfile.ZipFile(dataset_archive, 'r') as zip_ref: zip_ref.extractall(".")

In [46]:
# Sample 1,000 random rows from the original dataset and save them to dataset_1k.csv
dataset_1k = "./dataset/dataset_1k.csv"  #@param {type:"string"}
if not os.path.exists(dataset_1k):
    df = pd.read_csv(full_dataset)
    sample_df = df.sample(
        n=1000,  #@param {type:"integer"}
        random_state=42  #@param {type:"integer"}
    )
    sample_df.to_csv(dataset_1k, index=False)
else:
    sample_df = pd.read_csv(dataset_1k)

# Let's print counts
pprint(sample_df.count())

Unnamed: 0     1000
title          1000
ingredients    1000
directions     1000
link           1000
source         1000
NER            1000
dtype: int64


In [47]:
# Prepare the system prompt and a function that converts a recipe row into a valid GPT-3.5 Turbo chat conversation.
system_message = "You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."  #@param {type:"string"}

# Function for preparing example conversation
def prepare_example_conversation(row):
    user_message = f"""Title: {row['title']}\n\nIngredients: {row['ingredients']}\n\nGeneric ingredients: """
    # Let's prepare conversation messages
    messages = []
    messages.append({"role": "system", "content": system_message})  # here is the system prompt
    messages.append({"role": "user", "content": user_message})  # this message will contain recipe title and ingredients
    messages.append({"role": "assistant", "content": row["NER"]})  # this message contains the array of extracted ingredients
    return {"messages": messages}

# Print a single example of the function's output
pprint(prepare_example_conversation(sample_df.iloc[0]))

{'messages': [{'content': 'You are a helpful recipe assistant. You are to '
                          'extract the generic ingredients from each of the '
                          'recipes provided.',
               'role': 'system'},
              {'content': 'Title: Marinated Flank Steak Recipe\n'
                          '\n'
                          'Ingredients: ["1 1/2 pound flank steak", "1/2 c. '
                          'finely minced green onions (scallions)", "1/2 c. '
                          'dry red wine", "1/4 c. soy sauce", "3 tbsp. salad '
                          'oil", "3 teaspoon sesame seeds", "2 teaspoon packed '
                          'brown sugar", "1/4 teaspoon grnd black pepper", '
                          '"1/4 teaspoon grnd ginger", "1 clove garlic, '
                          'chopped"]\n'
                          '\n'
                          'Generic ingredients: ',
               'role': 'user'},
              {'content': '["flank steak", "gre

In [48]:
# Build the dataset by applying prepare_example_conversation to each row
sample_dataset = sample_df.apply(prepare_example_conversation, axis=1).tolist()

# Print a few prepared examples
for example in sample_dataset[:5]:
    print(example)

{'messages': [{'role': 'system', 'content': 'You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided.'}, {'role': 'user', 'content': 'Title: Marinated Flank Steak Recipe\n\nIngredients: ["1 1/2 pound flank steak", "1/2 c. finely minced green onions (scallions)", "1/2 c. dry red wine", "1/4 c. soy sauce", "3 tbsp. salad oil", "3 teaspoon sesame seeds", "2 teaspoon packed brown sugar", "1/4 teaspoon grnd black pepper", "1/4 teaspoon grnd ginger", "1 clove garlic, chopped"]\n\nGeneric ingredients: '}, {'role': 'assistant', 'content': '["flank steak", "green onions", "red wine", "soy sauce", "salad oil", "sesame seeds", "brown sugar", "grnd black pepper", "grnd ginger", "clove garlic"]'}]}
{'messages': [{'role': 'system', 'content': 'You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided.'}, {'role': 'user', 'content': 'Title: French Chicken Stew\n\nIngredients: ["1 tablespoon

In [49]:
# Split the dataset into training and validation sets
train_size = int(0.7 * len(sample_dataset))  #@param {type:"integer"}
train_dataset = sample_dataset[:train_size]  # 70% for training
val_dataset = sample_dataset[train_size:]  # 30% for validation

# Counts
print(len(train_dataset), len(val_dataset))

700 300


In [50]:
training_file_name = "train_dataset.jsonl"  #@param {type:"string"}
with open(training_file_name, "w") as file:
    for entry in train_dataset: file.write(json.dumps(entry) + "\n")

validation_file_name = "val_dataset.jsonl"  #@param {type:"string"}
with open(validation_file_name, "w") as file:
    for entry in val_dataset: file.write(json.dumps(entry) + "\n")
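
Before uploading, it can be worth checking that every line of the JSONL files is a well-formed chat example. The following is a minimal sketch of such a check, not an official validator: the function name `validate_chat_jsonl_lines`, the set of allowed roles, and the sample lines are all assumptions made for illustration. It only checks the structure produced by `prepare_example_conversation` above.

```python
import json

# Roles used by the conversations in this notebook (assumed to be the full allowed set).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        messages = entry.get("messages") if isinstance(entry, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append((number, "message with unknown role"))
            elif not isinstance(message.get("content"), str):
                problems.append((number, "message content is not a string"))
        # A training example should end with the answer we want the model to learn.
        if not isinstance(messages[-1], dict) or messages[-1].get("role") != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems


# Hypothetical sample lines: one valid example and two broken ones.
sample_lines = [
    json.dumps({"messages": [
        {"role": "system", "content": "You are a helpful recipe assistant."},
        {"role": "user", "content": "Title: Toast\n\nIngredients: [\"2 slices bread\"]\n\nGeneric ingredients: "},
        {"role": "assistant", "content": "[\"bread\"]"},
    ]}),
    '{"messages": []}',
    "not json",
]

for number, problem in validate_chat_jsonl_lines(sample_lines):
    print(f"line {number}: {problem}")
```

To check the real files, the same function could be called on an open file handle, for example `validate_chat_jsonl_lines(open(training_file_name))`.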

In the next step we will call the OpenAI API. To do that, we need the OpenAI Python library installed and an API key set up.

In [53]:
# Use the OPENAI_API_KEY environment variable if it is set; otherwise prompt for the key
openai.api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass(prompt="OPENAI_API_KEY:")

# Let's check if it works
response = openai.Completion.create(engine="davinci", prompt="Hello, world", max_tokens=5)
pprint(response)

<OpenAIObject text_completion id=cmpl-7quhc8Y4g0rodWeJBXKDafTBY8Dbd at 0x7fef98b41850> JSON: {
  "id": "cmpl-7quhc8Y4g0rodWeJBXKDafTBY8Dbd",
  "object": "text_completion",
  "created": 1692845376,
  "model": "davinci",
  "choices": [
    {
      "text": " ind Brazil. Tap-",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 3,
    "completion_tokens": 5,
    "total_tokens": 8
  }
}


## 2. Uploading dataset to OpenAI

In this step we upload the training and validation datasets to the OpenAI API.

In [54]:
# Training dataset
training_response = openai.File.create(
    file=open(training_file_name, "rb"),
    purpose="fine-tune"
)
training_file_id = training_response["id"]  # this is the ID we'll use to start fine-tuning job

# Validation dataset
validation_response = openai.File.create(
    file=open(validation_file_name, "rb"),
    purpose="fine-tune"
)
validation_file_id = validation_response["id"]  # this is the ID we'll use to start fine-tuning job

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)

Training file ID: file-XG4instzfClDqqvsmut5TE0y
Validation file ID: file-G4ALF7HDzbWjIsAJayqn92kd


## 3. Creating fine-tuning job

In this step we create a fine-tuning job from the previously uploaded datasets.

In [57]:
# Create fine-tuning job
response = openai.FineTuningJob.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-3.5-turbo",  #@param {type:"string"}
)

job_id = response["id"]  # this is the ID we'll use to monitor the status of the fine-tuning job

print("Job ID:", response["id"])
print("Status:", response["status"])

Job ID: ftjob-HSpCc9iVymosVDwmRn3p3hlB
Status: created


## 4. Monitoring fine-tuning job

This step will take a while. You can check the status of the fine-tuning job by running the following cell.

In [58]:
# Check fine-tuning job status
response = openai.FineTuningJob.retrieve(job_id)

print("Job ID:", response["id"])
print("Status:", response["status"])
print("Trained Tokens:", response["trained_tokens"])
print("Fine-tuned model ID:", response["fine_tuned_model"])

Job ID: ftjob-HSpCc9iVymosVDwmRn3p3hlB
Status: running
Trained Tokens: None
Fine-tuned model ID: None
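
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. The sketch below is one possible way to do this and makes several assumptions: the helper name `wait_for_job`, its parameters, and the set of terminal statuses (`succeeded`, `failed`, `cancelled`) are illustrative choices, and the demo uses fake responses rather than a real API call.

```python
import time

# Statuses after which the job is assumed not to change any more.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_job, poll_seconds=30.0, max_polls=240, sleep=time.sleep):
    """Call fetch_job() repeatedly until it reports a terminal status.

    fetch_job is any zero-argument callable returning a dict-like object
    with a "status" key. Raises TimeoutError after max_polls attempts.
    """
    for attempt in range(max_polls):
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Demo with fake responses, so the loop can be tried without calling the API.
fake_responses = iter([
    {"status": "running", "fine_tuned_model": None},
    {"status": "running", "fine_tuned_model": None},
    {"status": "succeeded", "fine_tuned_model": "ft:example-model"},
])
final_job = wait_for_job(lambda: next(fake_responses), poll_seconds=0, sleep=lambda seconds: None)
print(final_job["status"], final_job["fine_tuned_model"])
```

In this notebook, `fetch_job` could be `lambda: openai.FineTuningJob.retrieve(job_id)`, which is the same call the status cell uses.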


In [85]:
# Check fine-tuning job events
response = openai.FineTuningJob.list_events(id=job_id, limit=50)

events = response["data"]
events.reverse()

for event in events:
    print(event["message"])

Created fine-tune: ftjob-HSpCc9iVymosVDwmRn3p3hlB
Fine tuning job started
Fine tuning job failed, re-enqueued for retry
Fine tuning job started
Step 100/2100: training loss=0.32
Step 200/2100: training loss=0.77
Step 300/2100: training loss=0.08
Step 400/2100: training loss=0.05
Step 500/2100: training loss=0.23
Step 600/2100: training loss=0.29
Step 700/2100: training loss=0.05
Step 800/2100: training loss=0.00
Step 900/2100: training loss=0.34
Step 1000/2100: training loss=1.13
Step 1100/2100: training loss=0.00
Step 1200/2100: training loss=0.41
Step 1300/2100: training loss=0.35
Step 1400/2100: training loss=0.04
Step 1500/2100: training loss=0.00
Step 1600/2100: training loss=0.38
Step 1700/2100: training loss=0.15
Step 1800/2100: training loss=0.00
Step 1900/2100: training loss=0.00
Step 2000/2100: training loss=0.10
Step 2100/2100: training loss=0.00
New fine-tuned model created: ft:gpt-3.5-turbo-0613:personal::7qwbrgm8
Fine-tuning job successfully completed
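
The per-step loss in the event log is noisy, so a smoothed view can make the trend easier to read. The following sketch parses `Step N/M: training loss=X` messages and applies a simple moving average. The function names, the regular expression, and the window size are assumptions chosen for illustration, and the sample messages are copied from the first lines of the log above.

```python
import re

# Matches messages such as "Step 100/2100: training loss=0.32".
STEP_PATTERN = re.compile(r"Step (\d+)/(\d+): training loss=([0-9.]+)")


def parse_training_losses(messages):
    """Return a list of (step, loss) tuples for messages that report a training loss."""
    points = []
    for message in messages:
        match = STEP_PATTERN.fullmatch(message.strip())
        if match:
            points.append((int(match.group(1)), float(match.group(3))))
    return points


def moving_average(values, window):
    """Average each value with up to window - 1 values that precede it."""
    averaged = []
    for index in range(len(values)):
        chunk = values[max(0, index - window + 1): index + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


event_messages = [
    "Fine tuning job started",
    "Step 100/2100: training loss=0.32",
    "Step 200/2100: training loss=0.77",
    "Step 300/2100: training loss=0.08",
    "Step 400/2100: training loss=0.05",
    "Fine-tuning job successfully completed",
]

points = parse_training_losses(event_messages)
losses = [loss for _, loss in points]
smoothed = moving_average(losses, window=2)
for (step, loss), smooth in zip(points, smoothed):
    print(f"step {step}: loss={loss:.2f} smoothed={smooth:.3f}")
```

With the real log, `event_messages` could be built from the `events` list in the cell above as `[event["message"] for event in events]`.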


In [87]:
# Check fine-tuning job status
response = openai.FineTuningJob.retrieve(job_id)

print("Job ID:", response["id"])
print("Status:", response["status"])
print("Trained Tokens:", response["trained_tokens"])
print("Fine-tuned model ID:", response["fine_tuned_model"])

fine_tuned_model_id = response["fine_tuned_model"]

Job ID: ftjob-HSpCc9iVymosVDwmRn3p3hlB
Status: succeeded
Trained Tokens: 346584
Fine-tuned model ID: ft:gpt-3.5-turbo-0613:personal::7qwbrgm8


## 5. Model inference

In this step we use the fine-tuned model to extract ingredients from a recipe description.

In [92]:
test_df = [
    {
        "id": 9999,
        "title": "7 Layer Salad",
        "ingredients": [
            "10 to 12 leaves spinach, torn up",
            "8 to 10 mushrooms, sliced",
            "Bermuda onion, sliced",
            "2 boiled eggs, sliced",
            "4 strips bacon, fried and crumbled",
            "tomatoes, peeled and chunked",
            "Ranch dressing"
        ],
        # "NER": ["spinach", "mushrooms", "Bermuda onion", "eggs", "bacon", "tomatoes", "dressing"],
    }
]

# Prepare the system prompt and the user message
user_message = f"""Title: {test_df[0]['title']}\n\nIngredients: {test_df[0]['ingredients']}\n\nGeneric ingredients: """
test_messages = []
test_messages.append({"role": "system", "content": system_message})
test_messages.append({"role": "user", "content": user_message})

# Print messages
pprint(test_messages)

[{'content': 'You are a helpful recipe assistant. You are to extract the '
             'generic ingredients from each of the recipes provided.',
  'role': 'system'},
 {'content': 'Title: 7 Layer Salad\n'
             '\n'
             "Ingredients: ['10 to 12 leaves spinach, torn up', '8 to 10 "
             "mushrooms, sliced', 'Bermuda onion, sliced', '2 boiled eggs, "
             "sliced', '4 strips bacon, fried and crumbled', 'tomatoes, peeled "
             "and chunked', 'Ranch dressing']\n"
             '\n'
             'Generic ingredients: ',
  'role': 'user'}]


In [93]:
# Let's extract ingredients from recipe description
response = openai.ChatCompletion.create(
    model=fine_tuned_model_id,
    messages=test_messages,
    temperature=0,
    max_tokens=500
)

print(response["choices"][0]["message"]["content"])

['spinach', 'mushrooms', 'onion', 'eggs', 'bacon', 'tomatoes', 'Ranch dressing']
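
Note that the output above uses single quotes, so it is a Python list literal rather than valid JSON. One possible reason is that the test prompt formats the ingredients with Python's list representation (single quotes), while the training prompts contain JSON-style double quotes. Building the test prompt with `json.dumps(test_df[0]['ingredients'])` might bring the two closer, though that is untested here. Either way, the response should be parsed defensively. The sketch below is one way to do that; the helper name `parse_ingredient_list` is an assumption.

```python
import ast
import json


def parse_ingredient_list(text):
    """Parse a model response into a list of strings.

    Tries strict JSON first, then falls back to Python literal syntax
    so that single-quoted lists are accepted as well.
    """
    text = text.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError) as error:
            raise ValueError(f"could not parse model output: {text!r}") from error
    if not isinstance(parsed, list) or not all(isinstance(item, str) for item in parsed):
        raise ValueError("expected a list of strings")
    return parsed


# The response shown above, copied as a string.
model_output = "['spinach', 'mushrooms', 'onion', 'eggs', 'bacon', 'tomatoes', 'Ranch dressing']"
ingredients = parse_ingredient_list(model_output)

# Re-serialize as a proper JSON array of strings.
print(json.dumps(ingredients))
```

`ast.literal_eval` only evaluates literals, so it does not execute arbitrary code from the model output.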
