# Classification with OpenAI API

Here's how you can fine-tune a GPT-3 model with Python, using your own data, for a classification task.

No real dataset is provided for this task, only a small sample of data that shows how it should be prepared for classification.

Install the `openai` library

In [2]:
%pip install openai --quiet

Note: you may need to restart the kernel to use updated packages.


Import the necessary libraries

In [5]:
import json
from openai import OpenAI

# Define OpenAI API keys

In [6]:
api_key = ""  # paste your OpenAI API key here

In [7]:
client = OpenAI(api_key=api_key)

# Create training data

**PROMPT**

According to the [OpenAI API reference](https://beta.openai.com/docs/guides/fine-tuning "fine-tuning reference"), each `prompt` should end with a fixed suffix so the model can tell where the prompt ends and the completion begins.

You can use ` ->`.

Template:

```
data_file = [{
    "prompt": "Prompt ->",
    "completion": " Class"
},{
    "prompt":"Prompt ->",
    "completion": " Class"
}]

```
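
The template above can also be generated from raw (text, label) pairs. The following is a minimal sketch: the helper name `make_example` and the constant `PROMPT_SUFFIX` are hypothetical and not part of the `openai` library. It appends the ` ->` suffix to each prompt and prefixes each completion with a space, matching the template.

```python
PROMPT_SUFFIX = " ->"  # assumed separator, same as in the template above


def make_example(text, label):
    """Build one training record in the prompt/completion format."""
    return {
        "prompt": f"{text.strip()}{PROMPT_SUFFIX}",
        "completion": f" {label.strip()}",
    }


# Hypothetical raw pairs, converted to the training format.
pairs = [("burger", "edible"), ("paper towels", "inedible")]
examples = [make_example(text, label) for text, label in pairs]
print(examples[0])
```

Building records through one function keeps the suffix and the leading space consistent across the whole dataset.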

In [8]:
data = [{
    "prompt": "burger ->",
    "completion": " edible"
},
{
    "prompt":"paper towels ->",
    "completion": " inedible"
},
{
    "prompt":"vino ->",
    "completion": " edible"
},
{
    "prompt":"bananas ->",
    "completion": " edible"
},
{
    "prompt":"dog toy ->",
    "completion": " inedible"
}
]

During fine-tuning, the model reads the training examples and after each **token** of text, it predicts the next token. This predicted next token is compared with the actual next token, and the model’s internal weights are updated to make it more likely to predict correctly in the future. As training continues, the model learns to produce the patterns demonstrated in your training examples.
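
To make the next-token objective concrete, here is a toy sketch. It assumes a naive whitespace split as a stand-in for the real tokenizer (which works on sub-word tokens), and the function name `next_token_pairs` is hypothetical. It lists the (context, target) pairs that a single training example gives rise to.

```python
def next_token_pairs(tokens):
    """Return (context, next_token) pairs for a token sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Naive whitespace "tokenization"; real tokenizers use sub-word units.
tokens = "burger -> edible".split()
for context, target in next_token_pairs(tokens):
    print(context, "=>", target)
```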

You can use the [OpenAI Tokenizer](https://platform.openai.com/tokenizer) to see how your text is split into tokens (and to count the tokens in your expected output).

# Save dict as JSONL

The training data needs to be a JSONL document.

A JSONL file is a newline-delimited JSON file: each line holds one complete JSON object.

In [9]:
file_name = "training_data_classification.jsonl"

with open(file_name, 'w') as output_file:
    for entry in data:
        json.dump(entry, output_file)
        output_file.write('\n')
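
Before uploading, it can help to read the file back and check each record locally. The following is a rough sketch, not an official validator: the function name `validate_jsonl` is hypothetical, and the checks (prompt ends with ` ->`, completion starts with a space) simply mirror the conventions used in this notebook. The demo writes a throwaway file so the block runs on its own.

```python
import json
import os
import tempfile


def validate_jsonl(path, suffix=" ->"):
    """Return a list of problems found in a prompt/completion JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: not a JSON object")
                continue
            if not str(record.get("prompt", "")).endswith(suffix):
                problems.append(f"line {line_no}: prompt does not end with {suffix!r}")
            if not str(record.get("completion", "")).startswith(" "):
                problems.append(f"line {line_no}: completion does not start with a space")
    return problems


# Demo on a temporary file: the second record breaks both conventions.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": "burger ->", "completion": " edible"}) + "\n")
        f.write(json.dumps({"prompt": "vino", "completion": "edible"}) + "\n")
    demo_problems = validate_jsonl(demo_path)

print(demo_problems)
```

Calling `validate_jsonl(file_name)` on the file written above should return an empty list if every record follows the template.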

# Check JSONL file

In [10]:
#Works in colab not in sagemaker
!openai -k {api_key} tools fine_tunes.prepare_data -f training_data_classification.jsonl

# Upload file to your OpenAI account

In [10]:
with open(file_name, "rb") as training_file:
    upload_response = client.files.create(
        file=training_file,
        purpose="fine-tune",
    )
upload_response

FileObject(id='file-S3FlBuBNZYy75bzoXH3riJOo', bytes=255, created_at=1701409989, filename='training_data_classification.jsonl', object='file', purpose='fine-tune', status='uploaded', status_details=None)

# Save file ID

In [11]:
file_id = upload_response.id
file_id

'file-S3FlBuBNZYy75bzoXH3riJOo'

# Fine-tune a model

Every fine-tuning job starts from a base model, which defaults to **curie**.

The choice of base model influences both the performance and the cost of running your fine-tuned model.

Your model can be one of: **ada**, **babbage**, **curie**, or **davinci**.

In [12]:
fine_tune_response = client.fine_tunes.create(
    training_file=file_id,
    model="ada")
dict(fine_tune_response)

{'id': 'ft-3QBFVgcYAXbtcIzWcaDzL1vG',
 'created_at': 1701409997,
 'fine_tuned_model': None,
 'hyperparams': Hyperparams(batch_size=None, learning_rate_multiplier=None, n_epochs=4, prompt_loss_weight=0.01, classification_n_classes=None, classification_positive_class=None, compute_classification_metrics=None),
 'model': 'ada',
 'object': 'fine-tune',
 'organization_id': 'org-heTRxyuue7iTIuT3uUuUFCPN',
 'result_files': [],
 'status': 'pending',
 'training_files': [FileObject(id='file-S3FlBuBNZYy75bzoXH3riJOo', bytes=255, created_at=1701409989, filename='training_data_classification.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)],
 'updated_at': 1701409997,
 'validation_files': [],
 'events': [FineTuneEvent(created_at=1701409997, level='info', message='Created fine-tune: ft-3QBFVgcYAXbtcIzWcaDzL1vG', object='fine-tune-event')]}

# Check fine-tune progress

Check the progress and get a list of all the fine-tuning events

In [13]:
fine_tune_events = client.fine_tunes.list_events(
    fine_tune_id=fine_tune_response.id)
dict(fine_tune_events)

{'data': [FineTuneEvent(created_at=1701409997, level='info', message='Created fine-tune: ft-3QBFVgcYAXbtcIzWcaDzL1vG', object='fine-tune-event')],
 'object': 'list'}

Check the progress and get an object with the fine-tuning job data

In [14]:
retrieve_response = client.fine_tunes.retrieve(
    fine_tune_id=fine_tune_response.id
)
dict(retrieve_response)

{'id': 'ft-3QBFVgcYAXbtcIzWcaDzL1vG',
 'created_at': 1701409997,
 'fine_tuned_model': None,
 'hyperparams': Hyperparams(batch_size=1, learning_rate_multiplier=0.1, n_epochs=4, prompt_loss_weight=0.01, classification_n_classes=None, classification_positive_class=None, compute_classification_metrics=None),
 'model': 'ada',
 'object': 'fine-tune',
 'organization_id': 'org-heTRxyuue7iTIuT3uUuUFCPN',
 'result_files': [],
 'status': 'running',
 'training_files': [FileObject(id='file-S3FlBuBNZYy75bzoXH3riJOo', bytes=255, created_at=1701409989, filename='training_data_classification.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)],
 'updated_at': 1701410003,
 'validation_files': [],
 'events': [FineTuneEvent(created_at=1701409997, level='info', message='Created fine-tune: ft-3QBFVgcYAXbtcIzWcaDzL1vG', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410002, level='info', message='Fine-tune costs $0.00', object='fine-tune-event'),
  FineTuneEvent(cr

## NOTE: **Wait at least 15-20 minutes so the job can finish!**

### Troubleshooting fine_tuned_model as null
While the fine-tuning job is still in progress, the **fine_tuned_model** key is not yet set (it is `None`) in the `fine_tune_response` object returned by `client.fine_tunes.create()`.

To check the status of your fine-tuning job, call `client.fine_tunes.retrieve()` and pass in **fine_tune_response.id**. It returns an object with information about the job, such as its status, its hyperparameters, and the events logged so far.
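
Instead of re-running the cell by hand, the status check can be wrapped in a polling loop. The following is a minimal sketch under stated assumptions: the function name `wait_for_fine_tune` is hypothetical, the terminal status names (`succeeded`, `failed`, `cancelled`) are assumed rather than taken from the output above, and the job lookup is passed in as a callable so the example runs without network access.

```python
import time
from types import SimpleNamespace

# Assumed names of statuses after which the job no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tune(retrieve, poll_seconds=30, timeout_seconds=1800,
                       sleep=time.sleep):
    """Poll `retrieve()` until the job reaches a terminal status.

    `retrieve` is any zero-argument callable that returns an object
    with a `status` attribute.
    """
    waited = 0
    while True:
        job = retrieve()
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {job.status!r} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demo with a fake job that finishes on the third poll.
fake_statuses = iter(["pending", "running", "succeeded"])
final_job = wait_for_fine_tune(
    lambda: SimpleNamespace(status=next(fake_statuses)),
    poll_seconds=0,
    sleep=lambda seconds: None,
)
print(final_job.status)
```

With the real client, the callable could be `lambda: client.fine_tunes.retrieve(fine_tune_id=fine_tune_response.id)`, the same call used in the cell above.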

# Save fine-tuned model

Once the fine-tuning job is complete, you can get the fine-tuned model ID by calling `client.fine_tunes.retrieve()` again with `fine_tune_response.id`. The returned object has a `fine_tuned_model` attribute holding the ID of the fine-tuned model, which you can use for further completions.

Note: If `fine_tune_response.fine_tuned_model` is `None`, you can get the **fine_tuned_model** by listing all fine-tune events:

In [17]:
client.fine_tunes.list_events(fine_tune_id=fine_tune_response.id).__dict__

{'data': [FineTuneEvent(created_at=1701409997, level='info', message='Created fine-tune: ft-3QBFVgcYAXbtcIzWcaDzL1vG', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410002, level='info', message='Fine-tune costs $0.00', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410002, level='info', message='Fine-tune enqueued. Queue number: 0', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410003, level='info', message='Fine-tune started', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410017, level='info', message='Completed epoch 1/4', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410018, level='info', message='Completed epoch 2/4', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410019, level='info', message='Completed epoch 3/4', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410019, level='info', message='Completed epoch 4/4', object='fine-tune-event'),
  FineTuneEvent(created_at=1701410034, level='info', message=

# When this does not say None, you can run the rest

In [18]:
fine_tuned_model = client.fine_tunes.retrieve(
    fine_tune_id=fine_tune_response.id
).fine_tuned_model

# ID of the fine-tuned model
print(fine_tuned_model)

ada:ft-austincapitaldata-2023-12-01-05-53-54


# Test the new model on a new prompt

Remember to end the prompt with the same suffix as we used in the training data; ` ->`:

In [25]:
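# helper function
def get_class(completion_text):
    """Print the class label for a one-token completion."""
    # With max_tokens=1 only the first token comes back, so this code
    # expects ' in' as the first token of ' inedible'.
    if completion_text == ' edible':
        label = 'edible'
    elif completion_text == ' in':
        label = 'inedible'
    else:
        label = 'other'
    print(label)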
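
If you want to collect predictions instead of printing them, a variant that returns the label may be more convenient. This is a small sketch, not part of the original notebook: `to_label` and `FIRST_TOKEN_TO_LABEL` are hypothetical names, and the mapping assumes the same first tokens as the helper above.

```python
# Maps the first generated token (whitespace stripped) to a class label.
# "in" is assumed to be the first token of " inedible", as in the helper above.
FIRST_TOKEN_TO_LABEL = {"edible": "edible", "in": "inedible"}


def to_label(completion_text):
    """Return the class label for a one-token completion, or 'other'."""
    return FIRST_TOKEN_TO_LABEL.get(completion_text.strip(), "other")


for text in [" edible", " in", " 1"]:
    print(repr(text), "->", to_label(text))
```

Keeping the token-to-label mapping in one dictionary makes it easier to extend when more classes are added.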

## temperature and max_tokens

TEST 1

In [67]:
new_prompt = "table ->"

api_response = client.completions.create(
    model=fine_tuned_model,
    prompt=new_prompt,
    temperature=0,
    max_tokens=1
)

completion_text = api_response.choices[0].text
get_class(completion_text)

other


In [66]:
new_prompt = "printer paper ->"

api_response = client.completions.create(
    model=fine_tuned_model,
    prompt=new_prompt,
    temperature=0,
    max_tokens=1
)

completion_text = api_response.choices[0].text
get_class(completion_text)

other


TEST 2

In [55]:
new_prompt = "steak ->"

api_response = client.completions.create(
    model=fine_tuned_model,
    prompt=new_prompt,
    temperature=0,
    max_tokens=1
)


completion_text = api_response.choices[0].text
print(api_response.choices)
get_class(completion_text)

[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' edible')]
edible


[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1')]
other
