# Question Answering - fine tuning with OpenAI API

How to fine-tune a GPT-3 model on your own data using the OpenAI Python API.

Install the `openai` library.

In [26]:
!cat '../requirements.txt' 

openai<=0.28.1
# wikipedia<1.3.1
arxiv<=1.4.8
xmltodict<=0.13.0
boto3<=1.28.65
numpy<=1.26.1
matplotlib<=3.8.0
pandas<=2.1.1
scikit-learn<=1.3.1
opendatasets<=0.1.22
python-dotenv<=1.0.0
accelerate>=0.16.0,<1
transformers<=4.34.0
transformers[torch]>=4.28.1,<5
torch>=1.13.1,<2
langchain>=0.0.139
llama-index<=0.8.45
docx2txt<=0.8
pypdf<=3.16.4
requests<=2.31.0

In [27]:
# Install the OpenAI client together with its optional data dependencies.
# Note: `from openai import OpenAI` (used below) requires openai>=1.0,
# which conflicts with the `openai<=0.28.1` pin in requirements.txt.
# pip alternatives (quote the requirement so the shell leaves the brackets alone):
#%pip install -r '../requirements.txt' --quiet
#%pip install "openai[datalib]" --quiet
%conda install openai[datalib]

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.

ResolvePackageNotFound: 
  - conda=22.9.0


Note: you may need to restart the kernel to use updated packages.


Import the necessary libraries.

In [28]:
import json
from openai import OpenAI


# Define the OpenAI API key

In [29]:
api_key = ""  # Replace with your own key, e.g. "YOUR_API_KEY"
# The pre-1.0 library set the key globally instead:
# openai.api_key = api_key
client = OpenAI(api_key=api_key)
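
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the variable name and the helper `load_api_key` are illustrative assumptions, not part of the original notebook:

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With this in place, the client could be created as `client = OpenAI(api_key=load_api_key())`.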


# Create training data

**PROMPT**

According to the [OpenAI API reference](https://beta.openai.com/docs/guides/fine-tuning "fine-tuning reference"), each `prompt` should end with a fixed suffix so that the model can tell where the prompt ends and the completion begins.

You can use ` ->`.

**COMPLETION**

Each `completion` should end with a fixed suffix as well, so that the model learns where the answer stops.

You can use `.\n`.

Template:

```
data_file = [{
    "prompt": "Prompt ->",
    "completion": " Ideal answer.\n"
},{
    "prompt":"Prompt ->",
    "completion": " Ideal answer.\n"
}]

```

In [30]:
data = [{
    "prompt": "What is the capital of France? ->",
    "completion": " Capital of France is Paris.\n"
},{
    "prompt":"What is the primary function of the heart? ->",
    "completion": " Primary function of the heart is to pump blood throughout the body.\n"
}]
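
The suffix conventions above are easy to get wrong when records are written by hand. A minimal sketch of a helper that applies them consistently; `make_example` and the two constants are hypothetical names introduced here for illustration:

```python
PROMPT_SUFFIX = " ->"       # marks the end of every prompt
COMPLETION_SUFFIX = ".\n"   # marks the end of every completion


def make_example(question, answer):
    """Build one training record with the prompt and completion suffixes applied."""
    prompt = question.strip() + PROMPT_SUFFIX
    # The completion starts with a space and ends with the completion suffix;
    # a trailing period on the answer is dropped so it is not doubled.
    completion = " " + answer.strip().rstrip(".") + COMPLETION_SUFFIX
    return {"prompt": prompt, "completion": completion}


example = make_example("What is the capital of France?", "Capital of France is Paris.")
print(example)
```

The record produced here is identical to the first entry of `data` above.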

# Save dict as JSONL

The training data needs to be a JSONL document.

A JSONL file is a newline-delimited JSON file: each line holds one complete JSON object.

In [31]:
file_name = "training_data.jsonl"

with open(file_name, 'w') as output_file:
    for entry in data:
        json.dump(entry, output_file)
        output_file.write('\n')

# Check JSONL file

In [32]:
## This check is optional: the rest of the notebook works without it.
## It runs on Colab, but I could not get it to work on SageMaker.

!openai -k {api_key} tools fine_tunes.prepare_data -f training_data.jsonl
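
Where the CLI tool is unavailable, a rough local check can catch the most common formatting mistakes before uploading. The sketch below is not a replacement for `fine_tunes.prepare_data`; `validate_jsonl` is a hypothetical helper, and the suffixes it checks are the ones chosen earlier in this notebook:

```python
import json


def validate_jsonl(path, prompt_suffix=" ->", completion_suffix=".\n"):
    """Return a list of human-readable problems found in a prompt/completion JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {number}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
                problems.append(f"line {number}: expected exactly 'prompt' and 'completion' keys")
                continue
            if not str(record["prompt"]).endswith(prompt_suffix):
                problems.append(f"line {number}: prompt does not end with {prompt_suffix!r}")
            if not str(record["completion"]).endswith(completion_suffix):
                problems.append(f"line {number}: completion does not end with {completion_suffix!r}")
    return problems
```

Calling `validate_jsonl("training_data.jsonl")` on the file written above should return an empty list.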

# Upload file to your OpenAI account

In [33]:
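# Open the file in binary mode and close it once the upload has finished
with open(file_name, "rb") as training_file:
    upload_response = client.files.create(
        file=training_file,
        purpose="fine-tune"
    )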

upload_response

FileObject(id='file-kVYjbnur011YblP88xufg1Fu', bytes=244, created_at=1701381173, filename='training_data.jsonl', object='file', purpose='fine-tune', status='uploaded', status_details=None)

# Save file ID

In [34]:
file_id = upload_response.id
file_id

'file-kVYjbnur011YblP88xufg1Fu'

# Fine-tune a model

**Announcement by OpenAI**


On July 6, 2023, we announced the deprecation of ada, babbage, curie and davinci models. These models, including fine-tuned versions, will be turned off on January 4, 2024. We are actively working on enabling fine-tuning for upgraded base GPT-3 models as well as GPT-3.5 Turbo and GPT-4, we recommend waiting for those new options to be available rather than fine-tuning based off of the soon to be deprecated models.

Every fine-tuning job starts from a base model, which defaults to **curie**.

The choice of model influences both the performance of the model and the cost of running your fine-tuned model.

Your model can be one of: **ada**, **babbage**, **curie**, or **davinci**.

Visit the [pricing page](https://openai.com/pricing#faq-fine-tuning-pricing-calculation) for details on fine-tune rates.


The call below omits the `model` argument and therefore uses the default, **Curie**. To fine-tune **DaVinci** instead, pass `model="davinci"` to the same call:

In [35]:
fine_tune_response = client.fine_tunes.create(training_file=file_id)
fine_tune_response

FineTune(id='ft-DP03u7KFRxaaWEcFUPqTKaaq', created_at=1701381173, fine_tuned_model=None, hyperparams=Hyperparams(batch_size=None, learning_rate_multiplier=None, n_epochs=4, prompt_loss_weight=0.01, classification_n_classes=None, classification_positive_class=None, compute_classification_metrics=None), model='curie', object='fine-tune', organization_id='org-heTRxyuue7iTIuT3uUuUFCPN', result_files=[], status='pending', training_files=[FileObject(id='file-kVYjbnur011YblP88xufg1Fu', bytes=244, created_at=1701381173, filename='training_data.jsonl', object='file', purpose='fine-tune', status='uploaded', status_details=None)], updated_at=1701381173, validation_files=[], events=[FineTuneEvent(created_at=1701381173, level='info', message='Created fine-tune: ft-DP03u7KFRxaaWEcFUPqTKaaq', object='fine-tune-event')])
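
Choosing a different base model only changes the arguments of the create call. A minimal sketch that builds those arguments and rejects names outside the four base models listed above; `fine_tune_kwargs` is a hypothetical helper, and the commented call assumes the same `client.fine_tunes.create` method used in this notebook:

```python
BASE_MODELS = {"ada", "babbage", "curie", "davinci"}


def fine_tune_kwargs(training_file_id, model=None):
    """Build keyword arguments for the fine-tune create call (hypothetical helper)."""
    if model is not None and model not in BASE_MODELS:
        raise ValueError(f"model must be one of {sorted(BASE_MODELS)}, got {model!r}")
    kwargs = {"training_file": training_file_id}
    if model is not None:
        # Omitting the key leaves the API default (curie) in effect.
        kwargs["model"] = model
    return kwargs


# Example (not executed here, since it starts a billable job):
# fine_tune_response = client.fine_tunes.create(**fine_tune_kwargs(file_id, model="davinci"))
```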

# Check fine-tune progress

Check the progress by listing all events recorded for the fine-tuning job:

In [36]:
fine_tune_events = client.fine_tunes.list_events(fine_tune_id=fine_tune_response.id)
fine_tune_events


FineTuneEventsListResponse(data=[FineTuneEvent(created_at=1701381173, level='info', message='Created fine-tune: ft-DP03u7KFRxaaWEcFUPqTKaaq', object='fine-tune-event')], object='list')

In [37]:
fine_tune_events.data

[FineTuneEvent(created_at=1701381173, level='info', message='Created fine-tune: ft-DP03u7KFRxaaWEcFUPqTKaaq', object='fine-tune-event')]
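
The `created_at` values are Unix timestamps, which are hard to read at a glance. A small sketch that renders an event as a readable line in UTC; `format_event` is a hypothetical helper, and the commented loop assumes the `fine_tune_events` object from the cell above:

```python
from datetime import datetime, timezone


def format_event(created_at, message):
    """Render a fine-tune event as 'YYYY-MM-DD HH:MM:SS UTC  message'."""
    stamp = datetime.fromtimestamp(created_at, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d %H:%M:%S} UTC  {message}"


# Example with the event objects returned above:
# for event in fine_tune_events.data:
#     print(format_event(event.created_at, event.message))
```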

Check the progress by retrieving an object with the fine-tuning job data:

In [38]:
retrieve_response = client.fine_tunes.retrieve(fine_tune_id=fine_tune_response.id)
retrieve_response

FineTune(id='ft-DP03u7KFRxaaWEcFUPqTKaaq', created_at=1701381173, fine_tuned_model=None, hyperparams=Hyperparams(batch_size=None, learning_rate_multiplier=None, n_epochs=4, prompt_loss_weight=0.01, classification_n_classes=None, classification_positive_class=None, compute_classification_metrics=None), model='curie', object='fine-tune', organization_id='org-heTRxyuue7iTIuT3uUuUFCPN', result_files=[], status='pending', training_files=[FileObject(id='file-kVYjbnur011YblP88xufg1Fu', bytes=244, created_at=1701381173, filename='training_data.jsonl', object='file', purpose='fine-tune', status='uploaded', status_details=None)], updated_at=1701381173, validation_files=[], events=[FineTuneEvent(created_at=1701381173, level='info', message='Created fine-tune: ft-DP03u7KFRxaaWEcFUPqTKaaq', object='fine-tune-event')])

In [39]:
print(retrieve_response.__dict__.keys())
retrieve_response.__dict__

dict_keys(['id', 'created_at', 'fine_tuned_model', 'hyperparams', 'model', 'object', 'organization_id', 'result_files', 'status', 'training_files', 'updated_at', 'validation_files', 'events'])


{'id': 'ft-DP03u7KFRxaaWEcFUPqTKaaq',
 'created_at': 1701381173,
 'fine_tuned_model': None,
 'hyperparams': Hyperparams(batch_size=None, learning_rate_multiplier=None, n_epochs=4, prompt_loss_weight=0.01, classification_n_classes=None, classification_positive_class=None, compute_classification_metrics=None),
 'model': 'curie',
 'object': 'fine-tune',
 'organization_id': 'org-heTRxyuue7iTIuT3uUuUFCPN',
 'result_files': [],
 'status': 'pending',
 'training_files': [FileObject(id='file-kVYjbnur011YblP88xufg1Fu', bytes=244, created_at=1701381173, filename='training_data.jsonl', object='file', purpose='fine-tune', status='uploaded', status_details=None)],
 'updated_at': 1701381173,
 'validation_files': [],
 'events': [FineTuneEvent(created_at=1701381173, level='info', message='Created fine-tune: ft-DP03u7KFRxaaWEcFUPqTKaaq', object='fine-tune-event')]}

**Wait for the model to be fine-tuned; it takes about 15 to 20 minutes!**
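
Rather than re-running the retrieve cell by hand, the wait can be automated with a polling loop. This is a sketch under assumptions: `wait_for_fine_tune` is a hypothetical helper, the terminal status names (`succeeded`, `failed`, `cancelled`) are assumed from the legacy fine-tunes API, and the job is fetched through a callable so the loop itself makes no API-specific calls:

```python
import time

# Assumed terminal states of a legacy fine-tune job.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tune(fetch_job, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Call fetch_job() until the job reaches a terminal status, then return the job."""
    waited = 0
    while True:
        job = fetch_job()
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {job.status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example (not executed here):
# job = wait_for_fine_tune(
#     lambda: client.fine_tunes.retrieve(fine_tune_id=fine_tune_response.id)
# )
# fine_tuned_model = job.fine_tuned_model
```

Passing the fetch step as a callable keeps the loop independent of the client version and makes it easy to test with a stub.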

# Save fine-tuned model

If `fine_tune_response.fine_tuned_model` is `None` (the response was captured before training finished), you can get the **fine_tuned_model** by listing your fine-tune jobs:

In [40]:
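# Start from the value on the response so the variable is always defined
fine_tuned_model = fine_tune_response.fine_tuned_model

if fine_tuned_model is None:
    # Pre-1.0 equivalent: openai.FineTune.list()
    fine_tune_list = client.fine_tunes.list()
    # Takes the first job in the list, which may not be the job created above
    fine_tuned_model = fine_tune_list.data[0].fine_tuned_model

print(fine_tuned_model)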

curie:ft-austincapitaldata-2023-11-21-20-49-44
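
Taking the first entry of the job list can return a model from an earlier job. A minimal sketch that looks up the model by job ID instead; `model_for_job` is a hypothetical helper that only assumes each job exposes `id` and `fine_tuned_model`, as the `FineTune` objects printed above do:

```python
def model_for_job(jobs, job_id):
    """Return the fine_tuned_model of the job whose id matches job_id."""
    for job in jobs:
        if job.id == job_id:
            return job.fine_tuned_model  # None while the job is still training
    raise LookupError(f"no fine-tune job with id {job_id!r}")


# Example (not executed here):
# fine_tuned_model = model_for_job(client.fine_tunes.list().data, fine_tune_response.id)
```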


# Test the new model on a new prompt

Remember to end the prompt with the same suffix used in the training data, ` ->`:

In [41]:
new_prompt = "Which country serves as the primary capital of Paris? ->"

In [42]:
answer = client.completions.create(
  model=fine_tuned_model,
  prompt=new_prompt,
  max_tokens=10, # Change amount of tokens for longer completion
  temperature=0
)

In [43]:
# Pre-1.0 equivalent: answer['choices'][0]['text']
print(answer.choices[0].text)

 France

What is the capital of France?
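
The completion above keeps generating after the answer because nothing tells the model where to stop. Since every training completion ends with a newline, one option is to cut the text at the first newline; the Completions endpoint also accepts a `stop` argument that does this on the server side. A minimal sketch of the client-side version, where `first_answer` is a hypothetical helper:

```python
def first_answer(text, stop="\n"):
    """Keep only the text before the first stop sequence and trim whitespace."""
    head, _, _ = text.partition(stop)
    return head.strip()


print(first_answer(" France\n\nWhat is the capital of France?"))  # prints "France"
```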
