<a href="https://colab.research.google.com/github/PanoEvJ/Fine-tuning-GPT-3.5-turbo/blob/main/Fine_tuning_GPT_3_5_turbo_for_structured_outputs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning `gpt-3.5-turbo`

By [Pano Evangeliou](https://www.linkedin.com/in/p-evangeliou/) for [AI Makerspace](https://www.linkedin.com/company/ai-maker-space/)

Let's look at an example where we want to prompt `gpt-3.5-turbo` and get a structured response back.

## ChatGPT Output

### PROMPT
---
Please identify the gender, country of origin, and profession of the following prompt and return it as a JSON object with structure: {"gender" : gender, "country_of_origin" : country_of_origin, "profession" : profession}

PROMPT:

James Eugene Carrey is a Canadian and American actor and comedian. Known for his energetic slapstick performances, Carrey first gained recognition in 1990, after landing a role in the American sketch comedy television series In Living Color.

---

### RESPONSE

---

Based on the provided text, here's the JSON object with the inferred information:

```json
{
  "gender": "male",
  "country_of_origin": "Canada",
  "profession": "actor and comedian"
}
```

Please note that the information provided is based solely on the text you provided and doesn't include any real-time or external data.

---

That's the output from `chat.openai.com`, which is overly verbose - so let's check out the programmatic version!

## Programmatic `gpt-3.5-turbo` access

In [None]:
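# This notebook uses the pre-1.0 `openai` interface (e.g. `openai.ChatCompletion`), so stay below 1.0
!pip install -U -q "openai<1.0"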

In [None]:
import os
import getpass

openai_api_key = getpass.getpass("Enter your OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

Enter your OpenAI API Key: ··········


In [None]:
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": 'Given a short biography, provide the following fields in a JSON dict, where applicable: "gender", "country_of_origin", and "profession".'},
    {"role": "user", "content": "William Henry Gates III (born October 28, 1955) is an American business magnate, investor, and philanthropist best known for co-founding the software giant Microsoft along with his childhood friend Paul Allen.[2][3] During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president, and chief software architect, while also being its largest individual shareholder until May 2014.[4] He was a major entrepreneur of the microcomputer revolution of the 1970s and 1980s."}]
)

In [None]:
response

<OpenAIObject chat.completion id=chatcmpl-7qt2npN3NEf0YLDCPtvQZCDCx4895 at 0x797d36f9bce0> JSON: {
  "id": "chatcmpl-7qt2npN3NEf0YLDCPtvQZCDCx4895",
  "object": "chat.completion",
  "created": 1692839001,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\n  \"gender\": \"male\",\n  \"country_of_origin\": \"United States\",\n  \"profession\": [\n    \"business magnate\",\n    \"investor\",\n    \"philanthropist\"\n  ]\n}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 152,
    "completion_tokens": 44,
    "total_tokens": 196
  }
}

In [None]:
import ast

print(ast.literal_eval(response.choices[0].message.content))

{'gender': 'male', 'country_of_origin': 'United States', 'profession': ['business magnate', 'investor', 'philanthropist']}


In [None]:
total_tokens_base = response.usage.total_tokens
print(total_tokens_base)

196


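As you can see, the response is valid JSON - but it also spends tokens on things we don't need for our use-case (pretty-printing whitespace and newlines), and nothing guarantees the shape of each field: `profession` came back as a list here.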

Now, let's fine-tune the model and see what we can come up with.

## Dataset Creation

In order to fine-tune `gpt-3.5-turbo`, OpenAI's fine-tuning guide requires at least 10 examples, and reports that clear improvements typically show up at around **50-100** examples.

That is an incredibly small number - and would be inexpensive even if you were paying for human annotation.

We'll work off of a sample dataset that I generated for this task that has 20 entries!

#### Dataset

In [None]:
examples = [
    {"description" : "Malala Yousafzai is a Pakistani education activist known for her advocacy of female education in her native Swat Valley in Khyber Pakhtunkhwa, Pakistan. She survived an assassination attempt by the Taliban and went on to become a global symbol of education and empowerment.",
     "json" : {
      "gender": "female",
      "country_of_origin": "Pakistan",
      "profession": "education activist"
    }},
    {"description" : "Shah Rukh Khan is an Indian actor, film producer, and television personality. Often referred to as the 'King of Bollywood,' he has appeared in over 80 Bollywood films and is one of the richest actors in the world.",
     "json" : {
      "gender": "male",
      "country_of_origin": "India",
      "profession": "actor, film producer, and television personality"
    }},
    {"description" : "Serena Williams is an American professional tennis player widely regarded as one of the greatest female players in tennis history. She has won numerous Grand Slam titles and has been an advocate for gender equality in sports.",
     "json" : {
      "gender": "female",
      "country_of_origin": "United States",
      "profession": "professional tennis player"
    }},
    {"description" : "Nelson Mandela was a South African anti-apartheid revolutionary, political leader, and philanthropist who served as President of South Africa from 1994 to 1999. He is a symbol of resistance against racial segregation and injustice.",
     "json" : {
      "gender": "male",
      "country_of_origin": "South Africa",
      "profession": "political leader and philanthropist"
    }},
    {"description" : "Aishwarya Rai Bachchan is an Indian actress and the winner of the Miss World 1994 pageant. She has appeared in numerous Bollywood films and is regarded as one of the most influential and beautiful actresses in the industry.",
     "json" : {
      "gender": "female",
      "country_of_origin": "India",
      "profession": "actress"
    }},
    {"description" : "Elon Musk is a South African-born American entrepreneur and business magnate. He is known for his roles in companies like Tesla, SpaceX, and Neuralink, as well as his ambitious visions for the future.",
     "json" : {
      "gender": "male",
      "country_of_origin": "South Africa",
      "profession": "entrepreneur and business magnate"
    }},
    {"description" : "Angela Merkel is a German politician who served as the Chancellor of Germany from 2005 to 2021. She is known for her strong leadership and influence on European and global politics.",
     "json" : {
      "gender": "female",
      "country_of_origin": "Germany",
      "profession": "politician"
    }},
    {"description" : "Cristiano Ronaldo is a Portuguese professional footballer widely considered one of the greatest players of all time. He has won numerous awards and accolades throughout his career and has played for top clubs like Manchester United, Real Madrid, and Juventus.",
     "json" : {
      "gender": "male",
      "country_of_origin": "Portugal",
      "profession": "professional footballer"
    }},
    {"description" : "Priyanka Chopra is an Indian actress, singer, and film producer. She has become one of India's highest-paid and most popular celebrities, and she has received several awards, including a National Film Award and five Filmfare Awards.",
     "json" : {
      "gender": "female",
      "country_of_origin": "India",
      "profession": "actress, singer, and film producer"
    }},
    {"description" : "James Eugene Carrey is a Canadian and American actor and comedian. Known for his energetic slapstick performances, Carrey first gained recognition in 1990, after landing a role in the American sketch comedy television series In Living Color.",
     "json" : {
      "gender": "male",
      "country_of_origin": "Canada",
      "profession": "actor and comedian"
    }},
    {"description" : "Frida Kahlo was a Mexican painter known for her unique style and self-portraits. Her artworks often reflected her physical and emotional pain, and she remains an icon for her contributions to art and feminism.",
     "json" : {
      "gender": "female",
      "country_of_origin": "Mexico",
      "profession": "painter"
    }},
    {"description" : "Yao Ming is a retired Chinese basketball player and an international sports ambassador. Standing at 7 feet 6 inches tall, he played in the NBA and became a bridge between American and Chinese basketball cultures.",
     "json" : {
      "gender": "male",
      "country_of_origin": "China",
      "profession": "retired basketball player"
    }},
    {"description" : "J.K. Rowling is a British author best known for her 'Harry Potter' series of novels. Her books have captivated millions of readers worldwide and have been adapted into a successful film franchise.",
     "json" : {
      "gender": "female",
      "country_of_origin": "United Kingdom",
      "profession": "author"
    }},
    {"description" : "Marie Curie was a Polish physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize and remains an inspiration for her contributions to science and the advancement of women in STEM fields.",
     "json" : {
      "gender": "female",
      "country_of_origin": "Poland",
      "profession": "physicist and chemist"
    }},
    {"description" : "Diego Maradona was an Argentine professional footballer and one of the greatest players in the history of the sport. His skill, creativity, and controversial moments on and off the field have made him a legendary figure.",
     "json" : {
      "gender": "male",
      "country_of_origin": "Argentina",
      "profession": "professional footballer"
    }},
    {"description" : "Indira Gandhi was an Indian politician and stateswoman who served as the Prime Minister of India for multiple terms. She played a significant role in shaping India's political landscape and international relations during her leadership.",
     "json" : {
      "gender": "female",
      "country_of_origin": "India",
      "profession": "politician and stateswoman"
    }},
    {"description" : "Akira Kurosawa was a Japanese film director and screenwriter, regarded as one of the most important and influential filmmakers in the history of cinema. His films, such as 'Seven Samurai' and 'Rashomon,' have left a lasting impact on global cinema.",
     "json" : {
      "gender": "male",
      "country_of_origin": "Japan",
      "profession": "film director and screenwriter"
    }},
    {"description" : "Michelle Obama is an American lawyer, author, and former First Lady of the United States. She is an advocate for education, health, military families, and women's rights, and she has written a bestselling memoir titled 'Becoming.'",
     "json" : {
      "gender": "female",
      "country_of_origin": "United States",
      "profession": "lawyer, author, and former First Lady"
    }},
    {"description" : "Leonardo da Vinci was an Italian polymath of the Renaissance period. He is known for his expertise in various fields, including painting, anatomy, engineering, and invention. His iconic artwork 'Mona Lisa' is celebrated worldwide.",
     "json" : {
      "gender": "male",
      "country_of_origin": "Italy",
      "profession": "polymath"
    }},
    {"description" : "Maya Angelou was an American poet, memoirist, and civil rights activist. Her works, such as 'I Know Why the Caged Bird Sings,' have touched hearts and minds, and she used her voice to advocate for social change.",
     "json" : {
      "gender": "female",
      "country_of_origin": "United States",
      "profession": "poet, memoirist, and civil rights activist"
    }},
]

In [None]:
len(examples)

20

#### Converting the Dataset

In [None]:
def convert_format(example_row):
  """Turn one example into a chat-format training record (system, user, assistant)."""
  role_message = {"role": "system", "content": 'Given a short biography, provide the following fields in a JSON dict, where applicable: "gender", "country_of_origin", and "profession".'}
  user_message = {"role" : "user", "content" : example_row["description"]}
  json_blob = example_row["json"]
  # Note: str() yields Python's dict repr (single quotes), not strict JSON,
  # so the fine-tuned model learns to answer in that style.
  assistant_message = {"role": "assistant", "content": str(json_blob)}
  return {"messages" : [role_message, user_message, assistant_message]}
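
The `str(json_blob)` call above determines what the fine-tuned model learns to emit. A minimal sketch of the difference between `str()` and `json.dumps()`, in case you would prefer strict JSON targets; `example_blob` is a hypothetical stand-in for one entry's `json` field, not part of the dataset code:

```python
import json

# Hypothetical stand-in for one entry's "json" field from `examples`.
example_blob = {"gender": "female", "country_of_origin": "Poland", "profession": "physicist and chemist"}

as_repr = str(example_blob)         # Python repr: single-quoted keys and values
as_json = json.dumps(example_blob)  # strict JSON: double-quoted keys and values

print(as_repr)
print(as_json)

# Only the json.dumps version round-trips through a JSON parser.
assert json.loads(as_json) == example_blob
try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False
print("repr parses as JSON:", repr_is_json)
```

If you switch the training targets to `json.dumps`, the model's answers would be expected to follow that style instead, and `json.loads` would be the natural parser for them.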

In [None]:
prepare_dataset = []

for example in examples:
  prepare_dataset.append(convert_format(example))

In [None]:
import json

with open('training_file.jsonl', 'w') as outfile:
    for entry in prepare_dataset:
        json.dump(entry, outfile)
        outfile.write('\n')
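
Before uploading, it can be worth sanity-checking the JSONL. The sketch below is one way to do that, assuming every record should carry exactly the system, user, assistant sequence used in this notebook (that sequence is this notebook's convention, not a general rule); `validate_chat_jsonl` and `EXPECTED_ROLES` are hypothetical names:

```python
import json

# Role order used by this notebook's training records (an assumption of this sketch).
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"not valid JSON: {err.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((number, "missing 'messages' list"))
            continue
        # Non-dict entries map to None, so they fail the role comparison below.
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append((number, f"unexpected roles: {roles}"))
        elif not all(isinstance(m.get("content"), str) and m["content"] for m in messages):
            problems.append((number, "empty or non-string content"))
    return problems

# Small in-memory demo: one good record, one with a missing reply, one broken line.
sample_lines = [
    json.dumps({"messages": [
        {"role": "system", "content": "Extract fields."},
        {"role": "user", "content": "A short biography."},
        {"role": "assistant", "content": "{'gender': 'female'}"},
    ]}),
    json.dumps({"messages": [{"role": "user", "content": "No assistant reply."}]}),
    "not json at all",
]

print(validate_chat_jsonl(sample_lines))
```

To check the file written above, you could pass the open file object directly, e.g. `validate_chat_jsonl(open('training_file.jsonl'))`, since iterating a file yields one line at a time.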

Now that we have our dataset created and saved in an appropriate format - let's go ahead and get started with our fine-tuning!

We'll be leveraging this resource as a base for the following code:

[Fine-tuning](https://platform.openai.com/docs/guides/fine-tuning)

#### Prepare Data

First things first, we need to upload our file to be leveraged during fine-tuning.

In [None]:
file_response = openai.File.create(file=open("training_file.jsonl", "rb"), purpose='fine-tune')

In [None]:
file_response

<File file id=file-nnu2QlRSWjWOBivO1IWynvPT at 0x797d340dd990> JSON: {
  "object": "file",
  "id": "file-nnu2QlRSWjWOBivO1IWynvPT",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 11659,
  "created_at": 1692839004,
  "status": "uploaded",
  "status_details": null
}

Now that we've done that, we'll need to wait a moment for the file to finish processing before we can use it to create our fine-tuning job.

You'll notice that all we need to do is provide:

- `training_file`: the training `file.id`, which we can collect from the response from our `openai.File.create()` endpoint.
- `model`: the model we wish to fine-tune

In [None]:
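import time

response = None

# The uploaded file may still be processing, so retry a bounded number of times.
for attempt in range(12):
  try:
    response = openai.FineTuningJob.create(training_file=file_response.id, model="gpt-3.5-turbo")
    break
  except openai.error.OpenAIError as err:
    print(f"Attempt {attempt + 1} failed ({err}); retrying in 5 seconds...")
    time.sleep(5)
else:
  raise RuntimeError("Could not create the fine-tuning job after 12 attempts.")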

This gives us a response which includes a number of important pieces of information!

We'll be leveraging the `FineTuningJob.id` to monitor and keep track of our fine-tuning!

In [None]:
response

<FineTuningJob fine_tuning.job id=ftjob-TObsbUnaxIZJxBFrXDzP3kKh at 0x797d340dc860> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-TObsbUnaxIZJxBFrXDzP3kKh",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692839016,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-VrtlDUDw6aAbBGBDNn1W797P",
  "result_files": [],
  "status": "created",
  "validation_file": null,
  "training_file": "file-nnu2QlRSWjWOBivO1IWynvPT",
  "hyperparameters": {
    "n_epochs": 5
  },
  "trained_tokens": null
}

Notice that we can see a number of pieces of information that might be relevant to our process - but we're going to only focus on `status` in this notebook.

We'll set our `training_id` so we can use it programmatically.

In [None]:
training_id = response.id

We can also list our fine-tuning jobs, which is handy if you didn't save your `training_id`, have multiple jobs, or want to view your job history!

In [None]:
# List 10 fine-tuning jobs
response = openai.FineTuningJob.list(limit=10)

In [None]:
response

<OpenAIObject list at 0x797d19b13e20> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job",
      "id": "ftjob-TObsbUnaxIZJxBFrXDzP3kKh",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1692839016,
      "finished_at": null,
      "fine_tuned_model": null,
      "organization_id": "org-VrtlDUDw6aAbBGBDNn1W797P",
      "result_files": [],
      "status": "created",
      "validation_file": null,
      "training_file": "file-nnu2QlRSWjWOBivO1IWynvPT",
      "hyperparameters": {
        "n_epochs": 5
      },
      "trained_tokens": null
    },
    {
      "object": "fine_tuning.job",
      "id": "ftjob-Y84XFcrlG77MOK4W54rEInT7",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1692838393,
      "finished_at": 1692838961,
      "fine_tuned_model": "ft:gpt-3.5-turbo-0613:ox::7qt274Md",
      "organization_id": "org-VrtlDUDw6aAbBGBDNn1W797P",
      "result_files": [
        "file-7EEz7LBHIG6c62sdQjeLtuZV"
      ],
      "status": "succeeded",
    

In [None]:
openai.FineTuningJob.retrieve(training_id)

<FineTuningJob fine_tuning.job id=ftjob-TObsbUnaxIZJxBFrXDzP3kKh at 0x797d19b2f8d0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-TObsbUnaxIZJxBFrXDzP3kKh",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692839016,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-VrtlDUDw6aAbBGBDNn1W797P",
  "result_files": [],
  "status": "running",
  "validation_file": null,
  "training_file": "file-nnu2QlRSWjWOBivO1IWynvPT",
  "hyperparameters": {
    "n_epochs": 5
  },
  "trained_tokens": null
}

We'll build a loop that waits for our job to complete and, while it waits, prints updates pulled from `list_events` so we can see how our fine-tuning is going!

In [None]:
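from IPython.display import clear_output

# Poll until the job reaches a terminal state. Checking only for "running"
# would exit early while the job is still in a state such as "created".
while openai.FineTuningJob.retrieve(training_id).status not in ("succeeded", "failed", "cancelled"):
  clear_output(wait=True)
  time.sleep(5)
  print(openai.FineTuningJob.list_events(id=training_id, limit=10))

print("Done!")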

{
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-0wueVjVszJJ573fYQukBsAab",
      "created_at": 1692839185,
      "level": "warn",
      "message": "Fine tuning job failed, re-enqueued for retry",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-B6MyDexQsW6VVQUTl1QsTbxx",
      "created_at": 1692839016,
      "level": "info",
      "message": "Fine tuning job started",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-uLwQX3kZD9fhsC4Yabj2F8Hx",
      "created_at": 1692839016,
      "level": "info",
      "message": "Created fine-tune: ftjob-TObsbUnaxIZJxBFrXDzP3kKh",
      "data": null,
      "type": "message"
    }
  ],
  "has_more": false
}
Done!
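
Since a job moves through several states before it finishes (the job object above was first reported as `created`), it can help to wait on the terminal states rather than on any single in-progress state. A minimal sketch of that idea, with the status lookup injected so it runs without an API call; `wait_for_job`, `get_status`, and `TERMINAL_STATUSES` are hypothetical names, and the three terminal status strings are an assumption about the API's values:

```python
import time

# Assumed terminal states for a fine-tuning job.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=5.0, max_polls=720, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or max_polls is reached.

    Returns the last status seen, which may be non-terminal if polling ran out.
    """
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status

# Demo with a fake status sequence instead of a real API call.
fake_statuses = iter(["created", "running", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

In this notebook, the lookup could be something like `lambda: openai.FineTuningJob.retrieve(training_id).status`. Bounding the number of polls keeps a stuck job from blocking the cell forever.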


In [None]:
openai.FineTuningJob.retrieve(training_id)

<FineTuningJob fine_tuning.job id=ftjob-TObsbUnaxIZJxBFrXDzP3kKh at 0x797d19b2ff10> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-TObsbUnaxIZJxBFrXDzP3kKh",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692839016,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-VrtlDUDw6aAbBGBDNn1W797P",
  "result_files": [],
  "status": "created",
  "validation_file": null,
  "training_file": "file-nnu2QlRSWjWOBivO1IWynvPT",
  "hyperparameters": {
    "n_epochs": 5
  },
  "trained_tokens": null
}

In [None]:
ft_model_id = openai.FineTuningJob.retrieve(training_id).fine_tuned_model

In [None]:
ft_model_id

'ft:gpt-3.5-turbo-0613:ox::7qtF4nDj'

Now we can call our model by addressing it with its `fine_tuned_model` id!

In [None]:
completion = openai.ChatCompletion.create(
  model=ft_model_id,
  messages=[
    {"role": "system", "content": 'Given a short biography, provide the following fields in a JSON dict, where applicable: "gender", "country_of_origin", and "profession".'},
    {"role": "user", "content": "William Henry Gates III (born October 28, 1955) is an American business magnate, investor, and philanthropist best known for co-founding the software giant Microsoft along with his childhood friend Paul Allen.[2][3] During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president, and chief software architect, while also being its largest individual shareholder until May 2014.[4] He was a major entrepreneur of the microcomputer revolution of the 1970s and 1980s."}
  ]
)

In [None]:
completion

<OpenAIObject chat.completion id=chatcmpl-7qtNzwS2whXuXicMiCZZAMIagNxXc at 0x797d37e43060> JSON: {
  "id": "chatcmpl-7qtNzwS2whXuXicMiCZZAMIagNxXc",
  "object": "chat.completion",
  "created": 1692840315,
  "model": "ft:gpt-3.5-turbo-0613:ox::7qtF4nDj",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{'gender': 'male', 'country_of_origin': 'United States', 'profession': 'business magnate, investor, and philanthropist'}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 152,
    "completion_tokens": 30,
    "total_tokens": 182
  }
}

In [None]:
completion.choices[0].message.content

"{'gender': 'male', 'country_of_origin': 'United States', 'profession': 'business magnate, investor, and philanthropist'}"

In [None]:
print(ast.literal_eval(completion.choices[0].message.content))

{'gender': 'male', 'country_of_origin': 'United States', 'profession': 'business magnate, investor, and philanthropist'}
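
Note that the base model answered in double-quoted JSON while the fine-tuned model answers in Python's single-quoted dict style. A minimal sketch of a parser that accepts either, assuming the output is always a single dict; `parse_structured_output` is a hypothetical helper, and the two sample strings are shortened stand-ins for the outputs above:

```python
import ast
import json

def parse_structured_output(text):
    """Parse model output that is either strict JSON or a Python dict literal."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single quotes, True/False/None).
        # ast.literal_eval raises ValueError or SyntaxError on anything else.
        parsed = ast.literal_eval(text)
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a dict, got {type(parsed).__name__}")
    return parsed

base_style = '{"gender": "male", "country_of_origin": "United States"}'
fine_tuned_style = "{'gender': 'male', 'country_of_origin': 'United States'}"

print(parse_structured_output(base_style) == parse_structured_output(fine_tuned_style))
```

Trying `json.loads` first keeps JSON literals such as `true` and `null` working, which `ast.literal_eval` alone would reject.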


In [None]:
total_tokens_ft = completion.usage.total_tokens
print(total_tokens_ft)

182


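As you can see, we're not saving a *lot* of tokens per call: in this run the fine-tuned model used 182 total tokens against the base model's 196, a difference of 14, all of it in the completion (30 vs. 44 tokens). Still, with only 20 examples in a lightweight toy setup, that saving adds up over millions of calls, all while (essentially) guaranteeing our target output format!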

In [None]:
print(f"Base Model: {total_tokens_base}\nFine-tuned Model: {total_tokens_ft}")

Base Model: 196
Fine-tuned Model: 182
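
To put those two totals in perspective, here is a rough back-of-the-envelope sketch. It assumes every call looks like this single example, which real traffic will not, and it says nothing about per-token pricing, which may differ between base and fine-tuned models; `token_savings` is a hypothetical helper:

```python
def token_savings(base_tokens, fine_tuned_tokens, calls=1):
    """Return (tokens saved across `calls`, fractional saving per call)."""
    saved_per_call = base_tokens - fine_tuned_tokens
    return saved_per_call * calls, saved_per_call / base_tokens

# Totals reported above: 196 (base) and 182 (fine-tuned).
saved, fraction = token_savings(196, 182, calls=1_000_000)
print(f"Saved {saved:,} tokens over 1,000,000 calls ({fraction:.1%} per call)")
# prints: Saved 14,000,000 tokens over 1,000,000 calls (7.1% per call)
```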
