In [27]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

In [28]:
dfs = {
    k: pd.read_pickle(f'../data/df_{k}.pkl')
    for k in ['train', 'test_recent', 'test_old']
}

In [29]:
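d = pd.DataFrame()
# Prompt: the publication year followed by a blank-line separator
# (no need to start with 'Published in ').
d['prompt'] = dfs['train']['year'].astype(str) + '\n\n'
# Completion: the stripped title, prefixed with a single space.
d['completion'] = ' ' + dfs['train']['title'].str.strip()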

In [30]:
d.head()

| | prompt | completion |
|---|---|---|
| 828390 | `2017\n\n` | Information Theoretic Optimal Learning of Gau... |
| 1308516 | `2020\n\n` | Bi-Level Graph Neural Networks for Drug-Drug ... |
| 1240505 | `2020\n\n` | RL-Duet: Online Music Accompaniment Generatio... |
| 1071139 | `2019\n\n` | Named Entity Recognition in Electronic Health... |
| 1134931 | `2019\n\n` | Automated Classification of Seizures against ... |


In [31]:
d.to_json("titles.jsonl", orient='records', lines=True)
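
As a quick sanity check of the JSONL layout, the sketch below writes a toy frame with the same `prompt`/`completion` columns to a temporary file and reads it back line by line. The two rows and the temporary path are illustrative assumptions, not part of the dataset used here.

```python
import json
import os
import tempfile

import pandas as pd

# Toy rows mimicking the layout above (illustrative, not real data).
toy = pd.DataFrame({
    "prompt": ["2019\n\n", "2020\n\n"],
    "completion": [" An Example Title", " Another Example Title"],
})

# Write one JSON object per line, the same call used for titles.jsonl.
path = os.path.join(tempfile.mkdtemp(), "toy_titles.jsonl")
toy.to_json(path, orient="records", lines=True)

# Each non-empty line is an independent JSON record with two keys.
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(records[0])
```

Reading the file back this way confirms that the newline separators in the prompts are escaped inside the JSON strings and do not break the one-record-per-line structure.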

### Data Preparation tool
We can now use the data preparation tool, which suggests improvements to the dataset before fine-tuning. Before launching the tool, update the `openai` library so that the latest version of the tool is used. The `-q` flag auto-accepts all suggestions.

In [33]:
!openai tools fine_tunes.prepare_data -f titles.jsonl -q

Analyzing...

- Your file contains 5466 prompt-completion pairs
- There are 1 duplicated prompt-completion sets. These are rows: [4232]
- All prompts end with suffix `\n\n`
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.

Based on the analysis we will perform the following actions:
- [Recommended] Remove 1 duplicate rows [Y/n]: Y
- [Recommended] Add a suffix ending `\n` to all completions [Y/n]: Y
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["completion"] += suffix


Your data will be written to a
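
The two accepted actions can be approximated in pandas as follows. This is a minimal sketch of the effect described in the output above, not the tool's actual implementation; the `prepare` helper and the toy rows are hypothetical.

```python
import pandas as pd


def prepare(df: pd.DataFrame, suffix: str = "\n") -> pd.DataFrame:
    """Drop exact duplicate pairs and append a common completion suffix."""
    # An explicit copy sidesteps the kind of SettingWithCopyWarning
    # that appears in the tool output.
    out = df.drop_duplicates(subset=["prompt", "completion"]).copy()
    out["completion"] = out["completion"] + suffix
    return out.reset_index(drop=True)


# Toy data with one duplicated prompt-completion pair (illustrative only).
toy = pd.DataFrame({
    "prompt": ["2019\n\n", "2019\n\n", "2020\n\n"],
    "completion": [" Title A", " Title A", " Title B"],
})

prepared = prepare(toy)
print(len(prepared))
```

The common `\n` suffix gives the fine-tuned model a consistent signal for where a title ends, and the same string can later serve as a stop sequence at inference time.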

## Fine-tuning

In [9]:
!openai api fine_tunes.create -t "titles_prepared.jsonl" -m curie --n_epochs 1

Upload progress: 100%|████████████████████| 1.52M/1.52M [00:00<00:00, 1.81Mit/s]
Uploaded file from sport2_prepared_train.jsonl: file-Dxx2xJqyjcwlhfDHpZdmCXlF
Upload progress: 100%|███████████████████████| 388k/388k [00:00<00:00, 507kit/s]
Uploaded file from sport2_prepared_valid.jsonl: file-Mvb8YAeLnGdneSAFcfiVcgcN
Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2021-07-30 13:15:50] Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
[2021-07-30 13:15:52] Fine-tune enqueued. Queue number: 0
[2021-07-30 13:15:56] Fine-tune started
[2021-07-30 13:18:55] Completed epoch 1/4
[2021-07-30 13:20:47] Completed epoch 2/4
[2021-07-30 13:22:40] Completed epoch 3/4
[2021-07-30 13:24:31] Completed epoch 4/4
[2021-07-30 13:26:22] Uploaded model: ada:ft-openai-2021-07-30-12-26-20
[2021-07-30 13:26:27] Uploaded result file: file-6Ki9RqLQwkChGsr9CHcr1ncg
[2021-07-30 13:26:28] Fine-tune succeeded

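The model finishes training in about ten minutes. The log reports the model name `ada:ft-openai-2021-07-30-12-26-20`, which we can use for inference. Note that this log references `sport2_prepared_*.jsonl` files, four epochs, and an `ada` model, whereas the command above requests `curie` with one epoch on `titles_prepared.jsonl`, so the output appears to have been captured from a different run.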
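
A minimal inference sketch, assuming the legacy `openai` Python package (versions before 1.0, where `openai.Completion.create` is available) and an API key configured in the environment. The helper names `build_request` and `generate_title` and the `max_tokens` value are hypothetical. The prompt and stop sequence mirror the training format: a year followed by `\n\n`, with `\n` marking the end of a title.

```python
# Fine-tuned model name as reported in the log above.
MODEL = "ada:ft-openai-2021-07-30-12-26-20"


def build_request(year: int, max_tokens: int = 40) -> dict:
    """Build completion parameters matching the training format."""
    return {
        "model": MODEL,
        "prompt": f"{year}\n\n",   # same "<year>\n\n" prompt as in training
        "max_tokens": max_tokens,  # hypothetical cap, not from the source
        "stop": ["\n"],            # the suffix appended to every completion
    }


def generate_title(year: int) -> str:
    """Call the legacy Completions endpoint (assumes openai<1.0)."""
    import openai  # imported lazily so build_request works without the package

    response = openai.Completion.create(**build_request(year))
    return response["choices"][0]["text"].strip()


# Inspect the request without making a network call.
print(build_request(2021))
```

Calling `generate_title(2021)` would then request one title for that year. Keeping the request construction separate from the API call makes it easy to check that the prompt matches the training data exactly.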