# 1. Get OpenAI API Key

Before fine-tuning our model, let's get the OpenAI credentials needed for the API calls.

Go to the [OpenAI website](https://platform.openai.com/api-keys) and create a new secret key.

# 2. Create training data

The next step is to create training data that teaches the model how you'd like it to respond. The data needs to be a JSONL document in which each line pairs a prompt with its ideal generated text:

```
{"prompt": "<question>", "completion": "<ideal answer>"}
{"prompt": "<question>", "completion": "<ideal answer>"}
{"prompt": "<question>", "completion": "<ideal answer>"}
```
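
The cells later in this notebook store each pair as a chat-style `messages` record (system, user, assistant) rather than the flat layout above. As a rough sketch of how one layout maps to the other, the snippet below converts prompt/completion lines into that structure. The helper names `to_chat_example` and `jsonl_to_chat_lines` and the sample question are hypothetical, used only for illustration.

```python
import json

# Assumption: same system prompt text the notebook uses later on.
SYSTEM_PROMPT = "Answer this question truthfully."


def to_chat_example(record, system_prompt=SYSTEM_PROMPT):
    """Wrap one prompt/completion pair in a chat-style 'messages' record."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["completion"]},
        ]
    }


def jsonl_to_chat_lines(jsonl_text, system_prompt=SYSTEM_PROMPT):
    """Convert JSONL text of prompt/completion pairs into chat-style JSONL lines."""
    lines = []
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        lines.append(json.dumps(to_chat_example(record, system_prompt)))
    return lines


# Made-up sample pair, for illustration only.
sample = '{"prompt": "What is 2 + 2?", "completion": "4"}\n'
chat_lines = jsonl_to_chat_lines(sample)
print(chat_lines[0])
```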


**Optional for Colab users**

Before starting, we can connect to Google Drive and keep our documents there.
Just run the following cells:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Make sure that the variable `path` contains the correct sequence of folders, separated by `'/'`, leading to your desired files:

In [None]:
import os

path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

Let's start by installing and importing the libraries needed:

In [None]:
!pip uninstall -y openai
!pip install openai

Collecting openai
  Downloading openai-1.30.3-py3-none-any.whl (320 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2

In [None]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split

Then add your API key from the previous step:

In [None]:
api_key = "sk-proj-###"  # ADD YOUR API KEY HERE
openai.api_key = api_key
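
As an alternative to pasting the key into the notebook, it can be read from an environment variable so it never ends up in a saved or shared copy. The sketch below assumes the key was exported as `OPENAI_API_KEY` (the variable name the `openai` library also looks for by default). The helper `load_api_key` is hypothetical and not part of the original notebook.

```python
import os


def load_api_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the API key stored under var_name, or raise if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# In the notebook this would be used as:
#   openai.api_key = load_api_key()

# Demonstration with a placeholder mapping instead of the real environment.
demo_key = load_api_key({"OPENAI_API_KEY": "sk-proj-###"})
print(demo_key)
```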

Now load the dataset that will be converted into training examples:

In [None]:
dataset = load_dataset('arrow', data_files='data-00000-of-00001.arrow')
df = dataset['train'].to_pandas()
df.head()


| | input | output | instruction |
|---|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... | Answer this question truthfully |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... | Answer this question truthfully |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... | Answer this question truthfully |
| 3 | What are some possible causes of low PTH and h... | PTH-independent hypercalcemia, which can be ca... | Answer this question truthfully |
| 4 | How does the level of anti-müllerian hormone r... | The level of anti-müllerian hormone is directl... | Answer this question truthfully |


In [None]:
df = df.iloc[:, :-1]

In [None]:
len(df)

33955

In [None]:
df.head()

| | input | output |
|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... |
| 3 | What are some possible causes of low PTH and h... | PTH-independent hypercalcemia, which can be ca... |
| 4 | How does the level of anti-müllerian hormone r... | The level of anti-müllerian hormone is directl... |


In [None]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
train_data.head()

| | input | output |
|---|---|---|
| 17426 | What is the reason for the rapid-onset of acti... | What is the reason for the rapid-onset of acti... |
| 21416 | What types of cancer are associated with a dec... | Oral contraceptives are associated with a decr... |
| 27343 | What is the product of the conversion of galac... | Galactose is converted to galactose-1-phosphat... |
| 985 | What is the effect of prostaglandin agonists o... | Prostaglandin agonists increase the uveosclera... |
| 13243 | To which drug class do ipratropium and tiotrop... | Ipratropium and tiotropium belong to the drug ... |


In [None]:
# Initialize an empty list to store the training data
training_data = []

DEFAULT_SYSTEM_PROMPT = 'Answer this question truthfully.'

def create_dataset(question, answer):
    return {
        "messages": [
            {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

# Iterate through the rows of the DataFrame
for index, row in train_data.iterrows():
    # Create a dictionary for each row
    data_entry = create_dataset(row["input"], row["output"])

    # Add the dictionary to the list
    training_data.append(data_entry)

In [None]:
file_name = "turbo_training_data.jsonl"

# Write one JSON object per line (JSONL)
with open(file_name, "w") as output_file:
    for entry in training_data:
        json.dump(entry, output_file)
        output_file.write("\n")
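
Before uploading, it can help to read the file back and confirm that every line matches the structure produced by `create_dataset`. The following is a minimal sketch of such a check. The function `validate_chat_jsonl` and the exact rules it applies are assumptions made for illustration, not an official validator.

```python
import json

# Role order produced by create_dataset in this notebook.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_chat_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((number, f"invalid JSON: {err.msg}"))
                continue
            messages = entry.get("messages") if isinstance(entry, dict) else None
            if not isinstance(messages, list):
                problems.append((number, "missing 'messages' list"))
                continue
            roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
            if roles != EXPECTED_ROLES:
                problems.append((number, f"unexpected roles: {roles}"))
            elif any(
                not isinstance(m.get("content"), str) or not m["content"].strip()
                for m in messages
            ):
                problems.append((number, "empty or non-string content"))
    return problems


# In the notebook this would be called as:
#   validate_chat_jsonl("turbo_training_data.jsonl")
```

An empty list means every line parsed and carried the three expected messages; anything else points at the offending line numbers.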

This file was then used to fine-tune the model through the fine-tuning UI on the OpenAI platform, like this:

![OPENAI-UI](images/ft-ui.png)

And the result was:

![Results-FT-gpt-3.5-turbo](images/ft:gpt3.5-turbo-0125.png)
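
The run above was started from the UI. For reference, a roughly equivalent job could also be started from code with the `openai` 1.x client by uploading the file and then creating a fine-tuning job. This is only a sketch: the helper `start_fine_tune` is hypothetical, and the base model name `gpt-3.5-turbo-0125` is an assumption inferred from the screenshot's file name.

```python
def start_fine_tune(client, training_path, model="gpt-3.5-turbo-0125"):
    """Upload a JSONL training file and start a fine-tuning job; return the job id.

    `client` is expected to be an openai.OpenAI instance (openai >= 1.0).
    The default model name is an assumption, not confirmed by the notebook.
    """
    # Upload the training file with the purpose required for fine-tuning.
    with open(training_path, "rb") as training_file:
        uploaded = client.files.create(file=training_file, purpose="fine-tune")

    # Start the job using the id of the uploaded file.
    job = client.fine_tuning.jobs.create(training_file=uploaded.id, model=model)
    return job.id


# Usage (requires a valid API key and network access, so it is left commented out):
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   job_id = start_fine_tune(client, "turbo_training_data.jsonl")
#   print(client.fine_tuning.jobs.retrieve(job_id).status)
```

Passing the client in as an argument keeps the function easy to try against a stub before spending credits on a real job.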