# Train a fine-tuned model specialized for materials science text mining

Author: Jaewoong Choi (jwchoi95@kist.re.kr)

Hello, this is a tutorial on using the OpenAI GPT-3 API.



Keep in mind that the OpenAI API is not free, so this guideline is meant to help you use it efficiently in your research.

Before using the GPT-3 API for your research, you need to take care of several things.

0. First of all, you need an OpenAI secret API key. I can invite you to the KIST group account or share my key.

1. The most important factors are the quality of the input data and the relevance of the unlabelled data.

2. The quality of the input data determines the performance of your fine-tuned model. If the data were labelled with inconsistent standards, the predictions will be garbage.

3. The relevance of the unlabelled data is also important. Because the fine-tuned model is a kind of black-box model, it cannot recognize irrelevant input as out of distribution and will still produce a prediction for it. Filtering the data before prediction is therefore recommended.
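
The filtering step recommended above could be sketched as a simple keyword check. This is only a minimal illustration, not the filter used for this tutorial: the keyword list, the `min_hits` threshold, and the helper names `is_relevant` and `filter_relevant` are all assumptions that you would adapt to your own domain.

```python
import re

# Hypothetical keyword list; replace it with terms that define your own domain.
DOMAIN_KEYWORDS = ("electrode", "cathode", "anode", "electrolyte", "binder")


def is_relevant(text, keywords=DOMAIN_KEYWORDS, min_hits=2):
    """Return True if at least `min_hits` distinct keywords start a word in `text`."""
    lowered = text.lower()
    hits = sum(
        1 for kw in keywords if re.search(r"\b" + re.escape(kw), lowered)
    )
    return hits >= min_hits


def filter_relevant(texts, **kwargs):
    """Keep only the texts that pass the keyword check."""
    return [t for t in texts if is_relevant(t, **kwargs)]
```

A keyword check is crude, but it is cheap and runs before any paid API call, so irrelevant paragraphs never reach the model.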

## 1. Import basic libraries

In [2]:
import pandas as pd
import openai
import os
import json

### If your openai package is not working, try this

Sometimes an older version works better. The commands in this tutorial use the legacy `fine_tunes` interface.

In [None]:
%pip install openai==0.25.0

In [20]:
df = pd.read_excel('your file')
#len(df)
#df.to_json("data.jsonl", orient = 'records', lines =True)

| Unnamed: 0 | prompt | completion |
|---|---|---|
| 0 | The electrochemical performances of the Li/LiF... | ANODE: Li, CATHODE: LiFePO4, CATHODE: LiFePO4... |
| 1 | The working electrode was fabricated by mixing... | CATHODE: working electrode, ACTIVE_MATERIAL: ... |
| 2 | The electrochemical properties of the LiFePO4/... | CATHODE: electrode, ACTIVE_MATERIAL: LiFePO4/... |
| 3 | Electrochemical performances of the materials ... | ANODE: metallic lithium film, SALT: LiPF6, SO... |
| 4 | Pouch shaped full cells with rated capacity of... | ANODE: graphite, ACTIVE_MATERIAL: cathode mat... |
| ... | ... | ... |
| 95 | In order to obtain the charge-discharge charac... | CONDUCTIVE_AGENT: acetylene black, BINDER: po... |
| 96 | The performance of the Li(MnyFe1−y)PO4 cathode... | CATHODE: Li(MnyFe1−y)PO4 cathodes, ANODE: lit... |
| 97 | Thin film electrodes were manufactured for ele... | CURRENT_COLLECTOR: aluminium current collecto... |
| 98 | Electrochemical performance of various LiFePO4... | CATHODE: LiFePO4 electrodes, ANODE: metallic ... |


## 2. Dataset preparation

Training and validation sets are required; a test set is optional.

In [21]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size = 0.25, random_state=42)
len(train_df), len(val_df), len(test_df)

(60, 20, 20)

Save the data as JSONL files.

In [35]:
val_df.to_json('val_df.jsonl', orient='records', lines=True)
train_df.to_json('train_df.jsonl', orient='records', lines=True)
test_df.to_json('test_df.jsonl', orient='records', lines=True)
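
Note that `to_json` writes every column, including the `Unnamed: 0` index column shown in the table above. If you prefer files that contain only the `prompt` and `completion` keys, a standard-library sketch such as the following may help. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not part of pandas or openai.

```python
import json


def write_jsonl(records, path, keys=("prompt", "completion")):
    """Write one JSON object per line, keeping only the listed keys."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            row = {k: rec[k] for k in keys}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With the dataframes from the previous cell, it could be called as `write_jsonl(train_df.to_dict("records"), "train_df.jsonl")`, and `read_jsonl` lets you check the file before uploading it.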


In [27]:
os.environ["OPENAI_API_KEY"] = "your key"

Use the OpenAI data preparation tool, which automatically converts the data into the format required for fine-tuning.

It checks that every prompt and completion ends with a fixed suffix (end token).

Lowercasing the completions is recommended but not required.

The input must be a JSONL file.

In [86]:
!openai tools fine_tunes.prepare_data -f train_df.jsonl -q

Analyzing...

- Your file contains 60 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- More than a third of your `completion` column/key is uppercase. Uppercase completions tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- All prompts end with suffix `.==>\n`
- All completions end with suffix `\n\n###\n\n`

Based on the analysis we will perform the following actions:
- [Recommended] Lowercase all your data in column/key `completion` [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `train_df_prepared (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t

In [None]:
!openai tools fine_tunes.prepare_data -f val_df.jsonl -q

In [None]:
!openai tools fine_tunes.prepare_data -f test_df.jsonl -q
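
For reference, the suffix convention reported by the tool can also be applied by hand when you build new examples. The sketch below assumes the prompt separator `==>\n` and the completion end marker `\n\n###\n\n` seen in the tool output above; the leading space in the completion and the helper name `format_example` are assumptions, so compare the result with your own prepared file before relying on it.

```python
PROMPT_SEPARATOR = "==>\n"        # assumed from the tool output above
COMPLETION_END = "\n\n###\n\n"    # assumed from the tool output above


def format_example(text, labels, lowercase=True):
    """Return one prompt/completion record with the fixed suffixes attached."""
    completion = labels.strip()
    if lowercase:
        # Lowercasing is the optional step recommended by the tool.
        completion = completion.lower()
    return {
        "prompt": text.rstrip() + PROMPT_SEPARATOR,
        # Assumption: completions start with a single space.
        "completion": " " + completion + COMPLETION_END,
    }
```

Using the same separator and end marker at training time and at prediction time is what lets the model know where the input ends and where its output should stop.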

## 3. Model training with train/validation set
0. Remember that this is fine-tuning, which updates the model weights on your examples; it is not in-context learning.

1. Model: davinci, ada, babbage, or curie. This is the most important choice; you can compare the models at https://gpttools.com/comparisontool.
2. Batch size and epochs: the batch size is the number of training examples used in a single training step; the default number of epochs is 4.
3. Prompt loss weight: how much the model tries to learn the prompt compared to the completion.
4. Learning rate multiplier: try values from 0.02 to 0.2 to find the best model.
5. `compute_classification_metrics`, `classification_betas`, `classification_n_classes`, and `classification_positive_class` can be used if you are solving a classification problem.
6. You can add a `suffix` to the model name.

In [None]:
!openai api fine_tunes.create -t "train_df_prepared.jsonl" -v "val_df_prepared.jsonl" -m "davinci"  --batch_size 1 --n_epochs 4 --learning_rate_multiplier 0.01 --prompt_loss_weight 0.01
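
To try several learning rate multipliers, as suggested in the list above, the commands can be generated rather than typed by hand. The sketch below only builds the command strings from the flags already used in this tutorial; the grid of values and the helper name `fine_tune_command` are assumptions.

```python
import shlex


def fine_tune_command(train_file, val_file, model="davinci", batch_size=1,
                      n_epochs=4, learning_rate_multiplier=0.02,
                      prompt_loss_weight=0.01):
    """Build the CLI call as a string, using the flags shown in this tutorial."""
    args = [
        "openai", "api", "fine_tunes.create",
        "-t", train_file, "-v", val_file, "-m", model,
        "--batch_size", str(batch_size),
        "--n_epochs", str(n_epochs),
        "--learning_rate_multiplier", str(learning_rate_multiplier),
        "--prompt_loss_weight", str(prompt_loss_weight),
    ]
    return shlex.join(args)


# Candidate multipliers inside the 0.02~0.2 range; the exact grid is an assumption.
for lr in (0.02, 0.05, 0.1, 0.2):
    print(fine_tune_command("train_df_prepared.jsonl", "val_df_prepared.jsonl",
                            learning_rate_multiplier=lr))
```

Each printed command starts a separate paid fine-tuning job, so review the list before running any of them.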

In [3]:
os.environ["OPENAI_API_KEY"] = "YOUR KEY"

Training and validation performance can be monitored with the `fine_tunes.results` command.

The goal here is not to find the best hyperparameters for our model.

Just find the best-working model.

In [67]:
!openai api fine_tunes.results -i "model_name" > result.csv
#model_name = ft-sZDvr2RiPcbovKwXa5QkyF4V

#results[results['classification/accuracy'].notnull()].tail(1)
#results[results['classification/accuracy'].notnull()]['classification/accuracy'].plot()
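
The commented lines above refer to a `results` table that is never loaded in this notebook. As a rough sketch of how `result.csv` might be inspected with the standard library only, the function below returns the row with the lowest validation loss. The column name `validation_loss` is an assumption, so check the header of your own file; `best_validation_step` is a hypothetical helper name.

```python
import csv


def best_validation_step(path, column="validation_loss"):
    """Return the CSV row with the lowest value in `column`, or None if no row has one."""
    best_row, best_score = None, None
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            value = row.get(column)
            if not value:
                # Skip steps for which no validation score was written.
                continue
            score = float(value)
            if best_score is None or score < best_score:
                best_row, best_score = row, score
    return best_row
```

Calling `best_validation_step("result.csv")` on the files of several fine-tuned models gives one comparable number per model, which is enough to pick the best-working one.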

You can use various commands, as follows.

You can list your models with `fine_tunes.list`.

You have to keep track of your queued fine-tuning jobs, because the OpenAI API is often busy.

If the connection is weak or drops frequently, try calling the API directly through `os`.

In [7]:
!openai api fine_tunes.follow -i ft-no2aS78P1jatbjjghPLajFhZ
#!openai api fine_tunes.list
"""

{engines.list,engines.get,engines.update,engines.generate,completions.create,
deployments.list,deployments.get,deployments.delete,deployments.create,models.list,
models.get,models.delete,files.create,files.get,files.delete,files.list,fine_tunes.list,
fine_tunes.create,fine_tunes.get,fine_tunes.results,fine_tunes.events,fine_tunes.follow,fine_tunes.cancel,
fine_tunes.delete,image.create,image.create_edit,image.create_variation}: invalid choice: 'fine_tunes.model' 
(choose from 'engines.list', 'engines.get', 'engines.update', 'engines.generate', 'completions.create', 
'deployments.list', 'deployments.get', 'deployments.delete', 'deployments.create', 'models.list', 'models.get', 
'models.delete', 'files.create', 'files.get', 'files.delete', 'files.list', 'fine_tunes.list', 'fine_tunes.create', 
'fine_tunes.get', 'fine_tunes.results', 'fine_tunes.events', 'fine_tunes.follow', 'fine_tunes.cancel', 'fine_tunes.delete', 
'image.create', 'image.create_edit', 'image.create_variation')

"""

[2023-05-03 17:39:55] Created fine-tune: ft-no2aS78P1jatbjjghPLajFhZ
[2023-05-03 17:40:08] Fine-tune failed. Fine-tune will exceed billing hard limit

Job failed. Please contact support@openai.com if you need assistance.
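
For the weak-connection case mentioned above, one possible approach is to call the CLI from Python and retry when it fails. This is only a sketch: it assumes the command exits with a non-zero status when the connection drops, and `run_with_retry` is a hypothetical helper, not part of the openai package. It will not help with errors such as the billing limit shown above.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, wait_seconds=5.0):
    """Run `cmd` (a list of arguments) and retry until it exits with status 0."""
    last_error = ""
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout
        last_error = proc.stderr
        if attempt < attempts:
            # Wait before the next attempt so a busy server can recover.
            time.sleep(wait_seconds)
    raise RuntimeError(f"Command failed after {attempts} attempts: {last_error}")
```

For example, `run_with_retry(["openai", "api", "fine_tunes.list"])` would return the command output as a string once a call succeeds.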


## 4. Use the fine-tuned model to predict unseen data

0. If the API key is missing, set it again: `openai.api_key = "your key"`

1. Prepare the unlabelled dataset in JSONL form (recommended) and apply your model to it.
2. Define a stop sequence, which determines when the model stops generating output.
3. `prompt` is your input text, and `temperature` controls randomness. To extract information without arbitrary generation, set `temperature` to 0. `top_p` plays a similar role, so adjust one or the other, not both.
4. `max_tokens` is the maximum length of the output, in tokens.
5. Optional: `best_of` is the number of candidate completions the model generates before returning the best one. `frequency_penalty` and `presence_penalty` discourage repetition and encourage new topics.

In [None]:

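your_model = "your model"  # name of your fine-tuned model

# The prompt should end with the same suffix as the training prompts
# (`.==>\n` in the prepare_data output above).
prompt = "Text.==>\n"

result = openai.Completion.create(
    model=your_model,
    prompt=prompt,
    temperature=0,          # no random sampling, suited to extraction
    max_tokens=256,         # upper limit on the output length
    stop=["\n\n###\n\n"],   # end marker used in the training completions
    # best_of=3,
    # top_p=1,
    # frequency_penalty=0,
    # presence_penalty=0.6,
)
print(result)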
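
The generated text follows the `LABEL: entity, LABEL: entity` pattern of the training completions, so it can be split back into pairs for further analysis. The parser below is a minimal sketch under that assumption; `parse_completion` is a hypothetical helper, and a completion that deviates from the pattern will not be parsed cleanly.

```python
import re

# Split only at commas that are followed by a new "label: " field, so that
# commas inside an entity are kept.
PAIR_BOUNDARY = re.compile(r",\s*(?=[A-Za-z_]+:\s)")


def parse_completion(text):
    """Turn 'anode: li, cathode: lifepo4' into [('anode', 'li'), ('cathode', 'lifepo4')]."""
    pairs = []
    for chunk in PAIR_BOUNDARY.split(text.strip()):
        label, sep, entity = chunk.partition(":")
        if sep and entity.strip():
            pairs.append((label.strip(), entity.strip()))
    return pairs
```

With the response printed above, the text of the first answer would typically be read as `result["choices"][0]["text"]` and passed to `parse_completion`.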