<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Fine-Tuning a Model - ChatBot Example

In this project, we'll explore how to fine-tune a GPT model such as babbage with our own data set. Note that fine-tuning may not be needed for more advanced text-davinci models or future GPT-4 models, but the process is still worth exploring: we'll create our own custom fine-tuning data set, format it for OpenAI, and then train and call our own custom model.

## Imports

In [1]:
import json
import os
import tiktoken 
import pandas as pd
import openai

## Data

We've gathered a data set from Kaggle containing questions and answers from StackOverflow.


In [8]:
qa_df = pd.read_csv("python_qa.csv")

In [9]:
qa_df.head()

| Unnamed: 0 | Id | OwnerUserId | CreationDate | ClosedDate | Score | Title | Body | ParentId | Answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 11060 | 912.0 | 2008-08-14T13:59:21Z | | 18 | How should I unit test a code-generator? | This is a difficult and open-ended question I ... | 11060 | I started writing up a summary of my experienc... |
| 1 | 17250 | 394.0 | 2008-08-20T00:16:40Z | | 24 | Create an encrypted ZIP file in Python | I'm creating an ZIP file with ZipFile in Pytho... | 17250 | I created a simple library to create a passwor... |
| 2 | 31340 | 242853.0 | 2008-08-27T23:44:47Z | | 71 | How do threads work in Python, and what are co... | I've been trying to wrap my head around how th... | 31340 | Yes, because of the Global Interpreter Lock (G... |
| 3 | 34020 | 3561.0 | 2008-08-29T05:43:16Z | | 17 | Are Python threads buggy? | A reliable coder friend told me that Python's ... | 34020 | Python threads are good for concurrent I/O pro... |
| 4 | 34570 | 577.0 | 2008-08-29T16:10:41Z | 2011-11-08T16:11:43Z | 13 | What is the best quick-read Python book out th... | I am taking a class that requires Python. We w... | 34570 | I loved Dive Into Python, especially if you're... |


### Fine-Tuning Formatting

A fine-tuning data set consists of prompts paired with their expected completions. This makes fine-tuning a good fit for dialogue-style tasks, such as question answering or customer support.

The format should look like the following (a list of dictionaries):

    [{"prompt": "some prompt string","completion":"the best completed text option given the prompt"},]
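OpenAI's guidance for this prompt/completion format also suggested ending every prompt with a fixed separator and every completion with a fixed stop sequence, so the model can tell where the question ends and where its answer should stop. The following is a minimal sketch of that convention. The separator string, the stop string, and the helper name `to_finetune_record` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import json

# Assumed marker strings; any sequence that never appears in your data works.
PROMPT_SEPARATOR = "\n\n###\n\n"
COMPLETION_STOP = " END"


def to_finetune_record(question, answer):
    """Build one prompt/completion dictionary with explicit boundaries."""
    return {
        "prompt": question.strip() + PROMPT_SEPARATOR,
        # The completion starts with a single space and ends with the stop marker.
        "completion": " " + answer.strip() + COMPLETION_STOP,
    }


record = to_finetune_record("How do I reverse a list?", "Use my_list[::-1].")
print(json.dumps(record))
```

If you adopt markers like these, the same separator would need to be appended to the prompt at inference time, and the stop string passed as the stop sequence.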

Convert the information from the CSV to the fine-tuning format:

In [10]:
questions, answers = qa_df["Body"], qa_df["Answer"]

In [11]:
questions.head()

0    This is a difficult and open-ended question I ...
1    I'm creating an ZIP file with ZipFile in Pytho...
2    I've been trying to wrap my head around how th...
3    A reliable coder friend told me that Python's ...
4    I am taking a class that requires Python. We w...
Name: Body, dtype: object

In [12]:
answers.head()

0    I started writing up a summary of my experienc...
1    I created a simple library to create a passwor...
2    Yes, because of the Global Interpreter Lock (G...
3    Python threads are good for concurrent I/O pro...
4    I loved Dive Into Python, especially if you're...
Name: Answer, dtype: object

Now we can create the list of dictionaries:

In [13]:
qa_openai_format = [{"prompt" : q, "completion": a} for q, a in zip(questions, answers)]
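If the CSV contains missing cells, pandas loads them as NaN floats rather than strings, and such values cannot be tokenized or used for training. A hedged sketch of a filter that keeps only pairs where both sides are non-empty strings is shown below. `clean_pairs` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the data set.

```python
def clean_pairs(questions, answers):
    """Keep only pairs where prompt and completion are non-empty strings."""
    records = []
    for q, a in zip(questions, answers):
        if isinstance(q, str) and isinstance(a, str) and q.strip() and a.strip():
            records.append({"prompt": q, "completion": a})
    return records


# Made-up sample rows; float("nan") mimics a missing CSV cell.
sample_questions = ["How do I sort a dict?", float("nan"), "   "]
sample_answers = ["Use sorted(d.items()).", "Orphan answer", "Blank question"]
cleaned = clean_pairs(sample_questions, sample_answers)
print(len(cleaned))
```

The same function could be called as `clean_pairs(questions, answers)` in place of the plain list comprehension.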

Now let's explore a single prompt/completion combo:

In [14]:
qa_openai_format[4]

{'prompt': 'I am taking a class that requires Python. We will review the language in class next week, and I am a quick study on new languages, but I was wondering if there are any really great Python books I can grab while I am struggling through the basics of setting up my IDE, server environment and all those other "gotchas" that come with a new programming language. Suggestions?\n',
 'completion': "I loved Dive Into Python, especially if you're a quick study.  The beginning basics are all covered (and may move slowly for you), but the latter few chapters are great learning tools.\n\nPlus, Pilgrim is a pretty good writer.\n"}

In [11]:
len(qa_openai_format)

4429

## Price Estimation

If you want to know how many tokens your text actually contains (to estimate your costs), OpenAI provides a library called "tiktoken" that counts tokens, from which you can estimate a cost.

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.

**tiktoken** supports 3 different encodings for OpenAI models:

* "gpt2" for most GPT-3 models
* "p50k_base" for code models and Davinci models such as "text-davinci-003"
* "cl100k_base" for text-embedding-ada-002
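The list above can be expressed as a small lookup table. This is only a sketch: the model names used as keys are examples, and treating "babbage" as one of the "most GPT-3 models" is an assumption. In real code, tiktoken's own `encoding_for_model` function performs this lookup for you.

```python
# Encoding names taken from the list above; the model names chosen as keys
# are examples, not an exhaustive table.
ENCODING_BY_MODEL = {
    "babbage": "gpt2",  # assumed to fall under "most GPT-3 models"
    "text-davinci-003": "p50k_base",
    "text-embedding-ada-002": "cl100k_base",
}


def encoding_name_for(model, default="gpt2"):
    """Return the encoding name for a model, falling back to a default."""
    return ENCODING_BY_MODEL.get(model, default)


print(encoding_name_for("text-davinci-003"))
```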

Make sure to check the pricing page on the OpenAI website for full information. For now, we'll cut down the data size so we don't spend too much money during training.

In [15]:
def num_tokens_from_string(string, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [16]:
dataset_size = 500

In [17]:
with open("training_data.json", "w") as f:
    for entry in qa_openai_format[:dataset_size]:
        f.write(json.dumps(entry))
        f.write("\n")
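Despite the .json extension, the file written above is in JSON Lines layout: one JSON object per line. Before uploading, it can be worth reading the file back and checking every line. The sketch below does this on a throwaway demo file; `validate_jsonl` is a hypothetical helper, not part of the OpenAI tooling.

```python
import json
import os
import tempfile


def validate_jsonl(path):
    """Return the number of valid records, raising ValueError on a bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises a ValueError subclass on bad JSON
            if set(record) != {"prompt", "completion"}:
                raise ValueError(f"line {line_number}: unexpected keys {sorted(record)}")
            count += 1
    return count


# Demo on a throwaway file so the sketch runs without the real data set.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": "Q1", "completion": "A1"}) + "\n")
    f.write(json.dumps({"prompt": "Q2", "completion": "A2"}) + "\n")

n_valid = validate_jsonl(demo_path)
print(n_valid)
```

Calling `validate_jsonl("training_data.json")` on the real file should return the value of `dataset_size`.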

In [18]:
# Count the tokens in every prompt and completion of the training subset
token_counter = 0
for element in qa_openai_format[:dataset_size]:
    for value in element.values():
        token_counter += num_tokens_from_string(value, 'p50k_base')

In [19]:
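n_epochs = 4  # fine_tunes.create trains for 4 epochs by default, and every epoch is billed
price_per_1k = 0.0006  # babbage training price in USD per 1000 tokens
print(f"There are {token_counter} tokens")
print(f"Fine tuning using babbage costs ${price_per_1k} per 1000 tokens")
print(f"Estimated price: ${n_epochs * token_counter / 1000 * price_per_1k:.2f}")

There are 184352 tokens
Fine tuning using babbage costs $0.0006 per 1000 tokens
Estimated price: $0.44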
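The factor of 4 in the price formula accounts for training epochs: every pass over the data is billed, so the billed token count is the data set's token count multiplied by the number of epochs. The sketch below works through the arithmetic for the 184352 tokens counted above, assuming the legacy default of 4 epochs. `estimate_training_cost` is a hypothetical helper name.

```python
def estimate_training_cost(n_tokens, price_per_1k_tokens, n_epochs=4):
    """Estimated fine-tuning cost in USD: tokens are billed once per epoch."""
    billed_tokens = n_tokens * n_epochs
    return billed_tokens / 1000 * price_per_1k_tokens


one_epoch = estimate_training_cost(184352, 0.0006, n_epochs=1)
four_epochs = estimate_training_cost(184352, 0.0006)
print(f"1 epoch:  ${one_epoch:.2f}")
print(f"4 epochs: ${four_epochs:.2f}")
```

A single pass over the data costs about $0.11, and the full 4-epoch run about $0.44.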


## Command Line for Fine-Tuning

Note, you can find the full official guide here:

https://platform.openai.com/docs/guides/fine-tuning

OpenAI recommends running fine-tuning from the terminal/command line with their OpenAI command-line tool, which you get by simply running:

    pip install --upgrade openai
    


Now you can head over to the terminal to fine tune the model using the following command:

    openai api fine_tunes.create -t training_data.json -m babbage

Alternatively, we can run this from the notebook:

In [None]:
!openai api fine_tunes.create -t training_data.json -m babbage

You can use:

* `openai api fine_tunes.list` to get a list of your fine-tuning jobs
* `openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>` to get the debug log of your fine-tuning process


After training, you can find your fine-tuned model by running `openai api fine_tunes.list` and copying the fine_tuned_model entry. In this example, the fine-tuned model is named:

In [None]:
fine_tuned_model = "babbage:ft-personal-2023-02-09-13-52-20"
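If you save the job listing as JSON, the model name can also be picked out programmatically. The sketch below parses an abridged, hypothetical listing. The job ids are placeholders, and the `data` and `status` field names are assumptions about the shape of the CLI output; only `fine_tuned_model` is named in this notebook.

```python
import json

# Abridged, hypothetical listing; a real response carries more fields.
listing_json = """
{
  "data": [
    {"id": "ft-EXAMPLE-1", "status": "pending", "fine_tuned_model": null},
    {"id": "ft-EXAMPLE-2", "status": "succeeded",
     "fine_tuned_model": "babbage:ft-personal-2023-02-09-13-52-20"}
  ]
}
"""


def succeeded_models(listing):
    """Return fine_tuned_model names of all jobs that finished successfully."""
    return [
        job["fine_tuned_model"]
        for job in listing.get("data", [])
        if job.get("status") == "succeeded" and job.get("fine_tuned_model")
    ]


models = succeeded_models(json.loads(listing_json))
print(models)
```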

## OpenAI API

Remember that to use the notebook as shown, you must set your OpenAI API key as an environment variable. There are other ways to provide your API key to the Python code, such as input() or even hard-coding it, but those are typically not recommended for security reasons. Storing it as an environment variable lets the key live on the computer without actually being present in the code.

### Set-up Open AI API Key

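If the key is stored as a system-level environment variable, we only need to do this once per computer. Setting it from Python, as in the cell below, lasts only for the current session.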

In [33]:
# Uncomment below and swap in your key to set the environment variable using Python
# Then you can delete the key string and the code cell below will still work for the rest of this session!
# os.environ["OPENAI_API_KEY"] = "Your key goes here!"

In [34]:
openai.api_key = os.getenv("OPENAI_API_KEY")
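Note that os.getenv returns None without complaint when the variable is missing, so a forgotten key only surfaces later as an authentication error. A minimal sketch of a stricter loader is shown below. `load_api_key` is a hypothetical helper, and the demo passes a placeholder mapping so no real key is needed.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; define it before running the notebook")
    return key


# Demo with an explicit mapping so no real key is needed.
demo_key = load_api_key({"OPENAI_API_KEY": "sk-demo-placeholder"})
print(demo_key)
```

In the notebook, the result of `load_api_key()` could be assigned to `openai.api_key` in place of the plain os.getenv call.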

In [68]:
import openai
response = openai.Completion.create(
    model="babbage:ft-personal-2023-02-09-13-52-20",
    prompt="What are good python books?",
    max_tokens=128,
    temperature=0.7,
    top_p=1.0)


In [70]:
print(response["choices"][0]["text"])



If you're programming in python, a good introductory book is "Programming Python". It covers most of the basics and has a good section on libraries.
I'd recommend this book if you're new to python, or if you're already experienced. I'd also recommend "Learn Python The Hard Way" .

For a more advanced book, "Mastering Python" is a good starting point.

"The Python Bible" is a good reference too.

Pythonista - an excellent book by Michael J. pearlman.

Books also available in PDF form:

"Mastering Python".
