This script automates preparing data for finetuning OpenAI models, specifically GPT-3.5 and Babbage. It also provides utilities to validate the data, transform it into the required JSONL format, and estimate the cost of the finetuning process.
- Validate API Key
- Validate and select the appropriate model (GPT-3.5 or Babbage)
- Check input data file (JSONL)
- Estimate finetuning cost
- Create and manage finetuning jobs on OpenAI
- Python 3
- External libraries: pyfiglet, openai, tiktoken, python-dotenv, clint
- Standard library modules: argparse, json, re, os, sys, time
To install the required libraries:
pip install pyfiglet openai tiktoken python-dotenv clint
or
pip install -r requirements.txt
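A requirements.txt covering the external dependencies would look roughly like this (a sketch; the repository may pin specific versions):

```text
pyfiglet
openai
tiktoken
python-dotenv
clint
```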
python ftup.py [-k <API_KEY>] -m <MODEL_NAME> -f <INPUT_FILE> [-s <SUFFIX>] [-e <EPOCHS>]
Arguments:
- -k, --key: Optional. API key. If omitted, the key must be available in the environment as OPENAI_API_KEY.
- -m, --model: Required. Model to use. Options: gpt for gpt-3.5-turbo-0613 or bab for babbage-002.
- -f, --file: Required. Input data file (JSONL format).
- -s, --suffix: Optional. Add a suffix for your finetuned model, e.g. 'my-suffix-title-v-1'.
- -e, --epoch: Optional. Number of epochs for training. Default is 3.
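For reference, the argument parsing plausibly looks like this (a minimal sketch based on the flags above; the help strings and the function name parse_args are assumptions):

```python
import argparse

def parse_args():
    # Flags mirror the documented CLI: -k/--key, -m/--model, -f/--file, -s/--suffix, -e/--epoch
    parser = argparse.ArgumentParser(description="Prepare data and create OpenAI finetuning jobs")
    parser.add_argument("-k", "--key", help="OpenAI API key (falls back to OPENAI_API_KEY from .env)")
    parser.add_argument("-m", "--model", required=True, choices=["gpt", "bab"],
                        help="gpt = gpt-3.5-turbo-0613, bab = babbage-002")
    parser.add_argument("-f", "--file", required=True, help="training data file in JSONL format")
    parser.add_argument("-s", "--suffix", help="suffix for the finetuned model name")
    parser.add_argument("-e", "--epoch", type=int, default=3, help="number of training epochs")
    return parser.parse_args()
```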
Store your API key in a .env file in the format:
OPENAI_API_KEY=your_api_key_here
The script loads this key by default when -k / --key is not passed as an argument.
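Key resolution presumably works along these lines (a sketch assuming python-dotenv; the helper name resolve_api_key is illustrative, not taken from ftup.py):

```python
import os
from dotenv import load_dotenv

def resolve_api_key(cli_key=None):
    # Prefer the key passed via -k/--key; otherwise fall back to OPENAI_API_KEY from .env
    load_dotenv()  # reads .env in the working directory into the environment
    key = cli_key or os.getenv("OPENAI_API_KEY")
    if not key:
        raise SystemExit("No API key: pass -k/--key or set OPENAI_API_KEY in .env")
    return key
```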
- check_key(key): Validates the format of the OpenAI API key.
- check_model(model): Validates the model name.
- check_jsonl_file(file): Checks that the provided file has a valid JSONL name and exists.
- create_update_jsonl_file(model, file): Checks that the JSONL has the correct format and uploads the file to OpenAI.
- update_ft_job(file_id_name, model, suffix, epoch): Creates or updates the finetuning job on OpenAI.
- check_jsonl_gpt35(file): Validates the format for GPT-3.5 training.
- check_jsonl_babbage(file): Validates the format for babbage-002 training.
- cost_gpt(file, epochs): Estimates the cost of the finetuning process.
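As an illustration of what check_jsonl_gpt35 and cost_gpt might do internally, here is a minimal sketch (assumptions: the check only verifies that each line carries a well-formed "messages" list, and a $0.008 per 1K training tokens rate for gpt-3.5-turbo finetuning, which may change):

```python
import json
import tiktoken

def check_jsonl_gpt35(path):
    # Each line must be a JSON object with a "messages" list of {"role", "content"} dicts
    with open(path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    for ex in examples:
        assert isinstance(ex.get("messages"), list), "missing 'messages' list"
        for msg in ex["messages"]:
            assert "role" in msg and "content" in msg, "message needs 'role' and 'content'"
    print(f"- Num examples: {len(examples)}")
    return examples

def cost_gpt(path, epochs, price_per_1k=0.008):  # price is an assumption; check current pricing
    # Count tokens with the cl100k_base encoding used by gpt-3.5-turbo
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = 0
    for ex in check_jsonl_gpt35(path):
        for msg in ex["messages"]:
            tokens += len(enc.encode(msg["content"]))
    cost = tokens * epochs * price_per_1k / 1000
    print(f"Dataset has ~{tokens} tokens, estimated cost: ${cost:.4f}")
    return cost
```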
- Ensure your data adheres to OpenAI's data format guidelines for finetuning (see the example lines below).
- Monitor your OpenAI dashboard to keep track of your usage and costs.
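For reference, a single training example typically looks like this for each model (illustrative lines only; consult OpenAI's finetuning guide for the authoritative format). For gpt-3.5-turbo-0613 (chat format):

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi! How can I help you?"}]}
```

For babbage-002 (prompt/completion format):

```json
{"prompt": "Hello ->", "completion": " Hi! How can I help you?"}
```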
- OpenAI Documentation
- OpenAI Cookbook - FineTuning
- Python Argparse Library
- pyfiglet Documentation
- tiktoken Library
- Cancel training by pressing a key
- Add token count and cost estimation for the Babbage model
- Automate the creation of train and validation files with an 80/20 split (see the sketch below)
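A possible starting point for the planned 80/20 split could look like this (an illustration only, not part of the script yet; the function name split_train_validation is hypothetical):

```python
import random

def split_train_validation(path, train_ratio=0.8, seed=42):
    # Shuffle examples and write ~80% to *_train.jsonl and ~20% to *_valid.jsonl
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_ratio)
    stem = path.rsplit(".jsonl", 1)[0]
    with open(f"{stem}_train.jsonl", "w", encoding="utf-8") as out:
        out.writelines(lines[:cut])
    with open(f"{stem}_valid.jsonl", "w", encoding="utf-8") as out:
        out.writelines(lines[cut:])
    return cut, len(lines) - cut
```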
$ python ftup.py --key your_api_key_here --file train_gpt3_5.jsonl --model gpt --epoch 1 --suffix custom-model-name
or
$ python ftup.py -f train_gpt3_5.jsonl -m gpt -e 1 -s custom-model-name
[FT-UP pyfiglet banner]
Checking API key ...
- API Key
Checking model ...
- Model gpt
Checking if jsonl is valid ...
- JSON File train_gpt3_5.jsonl
Checking if jsonl format is valid for GPT-3.5 training ...
- Num examples: 225
- JSONL train_gpt3_5.jsonl correct format
Uploading jsonl train file ...
- File ID: file-abcd123
Dataset has ~15153 tokens that will be charged for during training
You'll train for 1 epochs on this dataset
By default, you'll be charged for ~15153 tokens
Total cost: $0.1212 💰
Creating a finetuning job ...
- Finetuning job id: ftjob-abc123
Status: succeeded
Finetuning succeeded! ☑️
Finetune model: ft:gpt-3.5-turbo:openai:custom-model-name:7p4lURe
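The upload and job-creation steps shown above map roughly onto the following calls (a sketch using the openai Python client v1.x; ftup.py may use the older 0.x module-level API, and the polling loop here is simplified):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def finetune(path, model="gpt-3.5-turbo-0613", suffix=None, epochs=3):
    # Upload the validated JSONL training file
    uploaded = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    print(f"- File ID: {uploaded.id}")

    # Create the finetuning job
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=model,
        suffix=suffix,
        hyperparameters={"n_epochs": epochs},
    )
    print(f"- Finetuning job id: {job.id}")

    # Poll until the job reaches a terminal state
    while True:
        job = client.fine_tuning.jobs.retrieve(job.id)
        print(f"Status: {job.status}")
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(30)
    return job.fine_tuned_model
```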