# OpenAI - fine-tuning example

This guide is a companion to the paper *"Generative LLMs and Textual Analysis in Accounting: (Chat)GPT as Research Assistant?"* ([SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4429658)).

**Author:** [Ties de Kok](https://www.tiesdekok.com)    

----
# Imports
----


All the dependencies required for this notebook are provided in the `environment.yml` file.

To install: `conda env create -f environment.yml` --> this creates the `gllm` environment.

I recommend using Python 3.9 or higher to avoid dependency conflicts.

**Python built-in libraries**

In [2]:
import os, sys, re, copy, random, json, time, datetime
from pathlib import Path
import getpass

**Libraries for interacting with the OpenAI API**

In [3]:
import requests
import openai
import tiktoken

**General helper libraries**

In [4]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

### Settings

In [6]:
pd.options.mode.chained_assignment = None  # default='warn'
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)

### Utility functions

In [7]:
## This function makes it easier to print rendered markdown through a code cell.

from IPython.display import Markdown, display

def mprint(text, *args, **kwargs):
    if 'end' in kwargs.keys():
        text += kwargs['end']
        
    display(Markdown(text))

---
## Set up OpenAI
---

To use the OpenAI API you will need an API key. If you don't have one, follow these steps:   

1. Create an OpenAI account --> https://platform.openai.com   
2. Create an OpenAI API key --> https://platform.openai.com/account/api-keys   
3. You will get \$5 in free credits if you create a new account. If you've used that up, you will need to add a payment method. The code in this notebook will cost less than a dollar to run.

Once you have your OpenAI Key, you can set it as the `OPENAI_API_KEY` environment variable (recommended) or enter it directly below.

In [8]:
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass(prompt='Enter your API key: ')
    
openai.api_key = os.environ['OPENAI_API_KEY']
    
## KEEP YOUR KEY SECURE, ANYONE WITH ACCESS TO IT CAN GENERATE COSTS ON YOUR ACCOUNT!

----
# Create demo dataset
---

Let's say our task is to find all words in a sentence that are wrapped by <>. For example:

`This is a <word> and this is another <thing> plus some other text.`

We want: 
- "word"
- "thing"

*Note:* this is a task easily solved using regular expressions, so it makes for an easy example as we can create as much training data as we want. 
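
Because the task can be solved with a regular expression, the expected answer for any sentence can be computed directly. A minimal sketch using the example sentence above; the helper name `extract_wrapped` is an illustrative assumption and not part of the notebook:

```python
import re

def extract_wrapped(sentence):
    """Return every substring enclosed in <...>, in order of appearance."""
    # The non-greedy pattern captures each <...> pair separately.
    return re.findall(r"<(.*?)>", sentence)

example = "This is a <word> and this is another <thing> plus some other text."
print(extract_wrapped(example))  # ['word', 'thing']
```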

### Load some statements 

In [9]:
with open(Path.cwd() / "data" / "statements.json", "r", encoding = "utf-8") as f:
    statement_list = json.load(f)

statement_df = pd.DataFrame(statement_list)

### Randomly add <>

In [23]:
all_sentences = []
for statement in statement_df.statement:
    elements = statement.split(" ")
    ## Sample without replacement so the same word is never wrapped twice
    elements_to_wrap = random.sample(range(len(elements)), k = 2)
    for i in elements_to_wrap:
        elements[i] = f"<{elements[i]}>"
        
    all_sentences.append(" ".join(elements))

In [25]:
all_sentences[:2]

['In the last quarter, we <managed> to increase our revenue <by> 15% due to the successful launch of our new product line.',
 'We anticipate that our investments <in> R&D will lead to a <20%> improvement in efficiency in the next two years.']

### Create completions

Normally you would need to manually create a dataset (or obtain it some other way). 

But for this toy example we can programmatically generate the answer, which makes for an easier demo.

In [31]:
full_dataset = []
for i, sen in enumerate(all_sentences):
    wrapped_words = re.findall(r"<(.*?)>", sen)
    full_dataset.append({
        "i" : i,
        "text" : sen,
        "word_hits" : wrapped_words
    })

In [33]:
full_dataset[:2]

[{'i': 0,
  'text': 'In the last quarter, we <managed> to increase our revenue <by> 15% due to the successful launch of our new product line.',
  'word_hits': ['managed', 'by']},
 {'i': 1,
  'text': 'We anticipate that our investments <in> R&D will lead to a <20%> improvement in efficiency in the next two years.',
  'word_hits': ['in', '20%']}]

### Create prompt and completion

We don't have to include instructions as part of the prompt as we are "communicating" the instructions through the examples that we show the model during fine-tuning.

In [52]:
prompt_template = """Text:
{text}
####
"""

completion_template = """
{word_hits}
<|end|>
""".strip()

In [53]:
for row in full_dataset:
    row["prompt"] = prompt_template.format(**row)

    row["completion"] = completion_template.format(**{
         "word_hits" : json.dumps(row["word_hits"])
    })

In [54]:
print(full_dataset[0]["prompt"]+full_dataset[0]["completion"])

Text:
In the last quarter, we <managed> to increase our revenue <by> 15% due to the successful launch of our new product line.
####
["managed", "by"]
<|end|>


*Note:* we use `####` to mark where the prompt ends and the completion starts, and `<|end|>` to mark where the completion ends.
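
The end marker also makes raw model output easy to parse: cut the text at `<|end|>` and decode what remains as JSON. A minimal sketch, assuming the completion format shown above; the helper name `parse_completion` and the choice to return `None` on malformed output are illustrative assumptions:

```python
import json

END_MARKER = "<|end|>"

def parse_completion(raw, end=END_MARKER):
    """Cut the text at the end marker (if present) and decode the JSON list."""
    body = raw.split(end, 1)[0].strip()
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # Malformed output: signal failure instead of raising.
        return None

print(parse_completion('["managed", "by"]\n<|end|>'))  # ['managed', 'by']
print(parse_completion("not json"))                    # None
```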

---
## Fine-tune
---

### Create training and eval splits

In [56]:
random.shuffle(full_dataset)

In [60]:
training_dataset = full_dataset[:50]
eval_dataset = full_dataset[50:]
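
The shuffle above is unseeded, so each run produces a different split. If a reproducible split is preferred, one option is a dedicated seeded generator. A minimal sketch; the helper name `split_dataset`, the seed value, and the 60-row demo list are illustrative assumptions:

```python
import random

def split_dataset(rows, n_train, seed=42):
    """Shuffle a copy of rows with a seeded generator and split it in two."""
    rng = random.Random(seed)   # independent of the global random state
    shuffled = list(rows)       # leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical stand-in for full_dataset
demo_rows = [{"i": i} for i in range(60)]
train, evaluation = split_dataset(demo_rows, n_train=50)
print(len(train), len(evaluation))  # 50 10
```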

### Create training file

The most common format for uploading training datasets is a `.jsonl` file: a text file in which every row is separated by a newline and the content of every row is formatted as JSON.
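
To illustrate the format, the sketch below writes two records to an in-memory buffer and reads them back. The records are made-up examples in the same shape as the training data. Note that `json.dumps` escapes the newlines inside each message, so every record stays on a single physical line:

```python
import io
import json

# Made-up records in the chat format used for fine-tuning
records = [
    {"messages": [
        {"role": "user", "content": "Text:\nA <toy> example.\n####\n"},
        {"role": "assistant", "content": '["toy"]\n<|end|>'},
    ]},
    {"messages": [
        {"role": "user", "content": "Text:\n<Two> wrapped <words>.\n####\n"},
        {"role": "assistant", "content": '["Two", "words"]\n<|end|>'},
    ]},
]

# Write: one JSON document per line
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: decode each non-empty line independently
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded == records)  # True
```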

In [65]:
train_input = []
for item in training_dataset:
    train_input.append({
            "messages" : [
                {"role": "user", "content": item["prompt"]},
                {"role": "assistant", "content": item["completion"]}
            ]
        })

In [71]:
ft_jsonl_file = Path.cwd() / "data/ft_input.jsonl" 

with open(ft_jsonl_file, "w", encoding = "utf-8") as f:
    for item in train_input:
        f.write(json.dumps(item) + "\n")

### Validate training data

The code below is provided by OpenAI in their documentation.

In [72]:
# We start by importing the required packages

import json
import os
import tiktoken
import numpy as np
from collections import defaultdict

# Next, we specify the data path and open the JSONL file

data_path = ft_jsonl_file  # a pathlib.Path works on Windows, macOS, and Linux

# Load dataset
with open(data_path, "r", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f]

# We can inspect the data quickly by checking the number of examples and the first item

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

# Now that we have a sense of the data, we need to go through all the different examples and check to make sure the formatting is correct and matches the Chat completions message structure

# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

# Beyond the structure of the message, we also need to ensure that the length does not exceed the 4096 token limit.

# Token counting functions
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Note: despite the "p5 / p95" label, the values printed are the 10th and 90th percentiles.
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

# Last, we can look at the results of the different formatting operations before proceeding with creating a fine-tuning job:

# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
TARGET_EPOCHS = 4
MIN_EPOCHS = 1
MAX_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")

Num examples: 50
First example:
{'role': 'user', 'content': 'Text:\nWe project that our ongoing efforts to <expand> our international presence will result in a 20% increase in global revenue over the <next> three years.\n####\n'}
{'role': 'assistant', 'content': '["expand", "next"]\n<|end|>'}
No errors found
Num examples missing system message: 50
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 2, 2
mean / median: 2.0, 2.0
p5 / p95: 2.0, 2.0

#### Distribution of num_total_tokens_per_example:
min / max: 49, 67
mean / median: 55.46, 55.0
p5 / p95: 50.0, 59.2

#### Distribution of num_assistant_tokens_per_example:
min / max: 9, 13
mean / median: 11.28, 11.0
p5 / p95: 11.0, 12.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~2773 tokens that will be charged for during training
By default, you'll train for 4 epochs on this dataset
By default, you'll be charged for ~11092 tokens
See pricing page to estimate total costs
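
The billing estimate in the output is simple arithmetic: the number of tokens in the dataset multiplied by the number of epochs. The sketch below reproduces the figure above and shows the effect of training for 2 epochs instead of the default 4. The function names are illustrative, and `usd_per_1k_tokens` is a placeholder argument, not an actual OpenAI price:

```python
def billed_tokens(tokens_in_dataset, n_epochs):
    """Tokens charged for training: every epoch processes the full dataset."""
    return tokens_in_dataset * n_epochs

def estimated_cost(tokens_in_dataset, n_epochs, usd_per_1k_tokens):
    """Cost in USD for a given (hypothetical) price per 1,000 training tokens."""
    return billed_tokens(tokens_in_dataset, n_epochs) / 1000 * usd_per_1k_tokens

tokens_in_dataset = 2773  # from the validation output above

print(billed_tokens(tokens_in_dataset, 4))  # 11092
print(billed_tokens(tokens_in_dataset, 2))  # 5546
```

Look up the current training rate on the pricing page before passing it to `estimated_cost`.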

--- 
## Perform fine-tuning
---

#### Upload file

In [76]:
CREATE_FT_FILE = True
if CREATE_FT_FILE:
    ## Passing the pathlib.Path directly works on every operating system
    with open(ft_jsonl_file, "rb") as f:
        _ = openai.File.create(
          file=f,
          purpose='fine-tune'
        )
_

<File file id=file-0AcXBsPI2RX9uu6ZkmiKmbxG at 0x27bddb06e80> JSON: {
  "object": "file",
  "id": "file-0AcXBsPI2RX9uu6ZkmiKmbxG",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 13839,
  "created_at": 1696026310,
  "status": "uploaded",
  "status_details": null
}

#### Start fine-tuning

In [78]:
RUN_FT = True
file_id = "file-0AcXBsPI2RX9uu6ZkmiKmbxG"
if RUN_FT:
    _ = (
        openai
        .FineTuningJob
        .create(
            training_file = file_id,
            model = "gpt-3.5-turbo",
            hyperparameters = {
                "n_epochs" : 2 ## For complex tasks you can increase this 
            },
            suffix = "openai_ft_demo"
        )
    )

#### Track progress

In [80]:
openai.FineTuningJob.list(limit=1)

<OpenAIObject list at 0x27bdd4cfd80> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job",
      "id": "ftjob-TbN6ZHg3qRVQUQJRPKCL5iC7",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1696026402,
      "finished_at": null,
      "fine_tuned_model": null,
      "organization_id": "org-qyGkcV6pKgERYlKgngXYNEFy",
      "result_files": [],
      "status": "running",
      "validation_file": null,
      "training_file": "file-0AcXBsPI2RX9uu6ZkmiKmbxG",
      "hyperparameters": {
        "n_epochs": 2
      },
      "trained_tokens": null,
      "error": null
    }
  ],
  "has_more": true
}

In [81]:
ft_job_id = "ftjob-TbN6ZHg3qRVQUQJRPKCL5iC7"
openai.FineTuningJob.retrieve(ft_job_id)

<FineTuningJob fine_tuning.job id=ftjob-TbN6ZHg3qRVQUQJRPKCL5iC7 at 0x27bde4186d0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-TbN6ZHg3qRVQUQJRPKCL5iC7",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1696026402,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-qyGkcV6pKgERYlKgngXYNEFy",
  "result_files": [],
  "status": "running",
  "validation_file": null,
  "training_file": "file-0AcXBsPI2RX9uu6ZkmiKmbxG",
  "hyperparameters": {
    "n_epochs": 2
  },
  "trained_tokens": null,
  "error": null
}
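
Instead of re-running the cell by hand, the status can be polled until the job reaches a terminal state. A minimal sketch: `wait_for_job`, its parameters, and the fake status sequence are illustrative assumptions, and `fetch_status` stands for any callable that returns the job's current `status` string:

```python
import time

# Statuses after which the job will no longer change
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or polls run out."""
    status = None
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    return status  # still not finished after max_polls attempts

# Offline demonstration with a fake sequence of statuses
fake_statuses = iter(["running", "running", "succeeded"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0))  # succeeded
```

With the client used in this notebook, `fetch_status` could be `lambda: openai.FineTuningJob.retrieve(ft_job_id)["status"]`.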

In [96]:
openai.FineTuningJob.list_events(id = ft_job_id)["data"][:5]

[<OpenAIObject fine_tuning.job.event id=ftevent-TEXZkdoEeR0sSrnKNnEO2Zh6 at 0x27bde41b7e0> JSON: {
   "object": "fine_tuning.job.event",
   "id": "ftevent-TEXZkdoEeR0sSrnKNnEO2Zh6",
   "created_at": 1696026759,
   "level": "info",
   "message": "The job has successfully completed",
   "data": {},
   "type": "message"
 },
 <OpenAIObject fine_tuning.job.event id=ftevent-o8f25XMzoIXYqKnlZilKmLjK at 0x27bde41b060> JSON: {
   "object": "fine_tuning.job.event",
   "id": "ftevent-o8f25XMzoIXYqKnlZilKmLjK",
   "created_at": 1696026757,
   "level": "info",
   "message": "New fine-tuned model created: ft:gpt-3.5-turbo-0613:ties-and-coauthors:openai-ft-demo:84GKDrg2",
   "data": {},
   "type": "message"
 },
 <OpenAIObject fine_tuning.job.event id=ftevent-jnHyndxWDfVy6hqHEysol9AC at 0x27bddefc360> JSON: {
   "object": "fine_tuning.job.event",
   "id": "ftevent-jnHyndxWDfVy6hqHEysol9AC",
   "created_at": 1696026728,
   "level": "info",
   "message": "Step 91/100: training loss=0.00",
   "data": {
 

---
## Use fine-tuned model
---

#### Helper function

In [90]:
def make_prediction(
    prompt, 
    model,
    temperature = 0,
    end = "<|end|>"
    ):
    
    completion = openai.ChatCompletion.create(
        model = model,
        temperature = temperature,
        stop = [end], 
        messages = [
                {"role": "user", "content": prompt}
            ]
    )

    result = dict(completion["choices"][0]["message"])
    prediction = json.loads(result["content"].strip()) 
        
    return prediction

#### Create prompts

In [87]:
prompt_list = []
for item in eval_dataset:
    prompt_list.append(
        prompt_template.format(**item)
    )

In [93]:
print(prompt_list[0])

Text:
As <we> continue to navigate through uncertain economic conditions, we remain committed <to> delivering value to our customers and shareholders.
####



#### Generate predictions

In [97]:
model_id = "ft:gpt-3.5-turbo-0613:ties-and-coauthors:openai-ft-demo:84GKDrg2"

In [102]:
res_list = []
for prompt in prompt_list:

    res = make_prediction(
        prompt = prompt,
        model = model_id
    )
    
    res_list.append({
        "prompt" : prompt,
        "prediction" : res
    })

In [126]:
for item in res_list[:5]:
    print(item["prompt"])
    mprint(f"""Prediction: `{item["prediction"]}`""")
    print("\n----\n")

Text:
As <we> continue to navigate through uncertain economic conditions, we remain committed <to> delivering value to our customers and shareholders.
####



Prediction: `['we', 'to']`


----

Text:
<The> launch of our <latest> software solution last quarter has helped us gain significant market share in the industry.
####



Prediction: `['The', 'latest']`


----

Text:
Over the past six months, we have <learned> valuable lessons from the challenges we faced and have <implemented> strategic changes to ensure future success.
####



Prediction: `['learned', 'implemented']`


----

Text:
Through the implementation of lean <manufacturing> principles, we were able to <reduce> lead times by 12% and improve overall product quality in the past year.
####



Prediction: `['manufacturing', 'reduce']`


----

Text:
Our recent partnership with a <leading> <technology> provider has significantly improved our product offerings and competitiveness.
####



Prediction: `['leading', 'technology']`


----
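
Since the ground truth for this task can be recomputed with a regular expression, the predictions can also be scored automatically. A minimal sketch of an exact-match accuracy check; the function name `exact_match_accuracy` is an illustrative assumption, and the demo list copies two of the results printed above in the shape of `res_list`:

```python
import re

def exact_match_accuracy(results):
    """Share of items whose prediction equals the regex-derived answer."""
    if not results:
        return 0.0
    hits = 0
    for item in results:
        expected = re.findall(r"<(.*?)>", item["prompt"])
        if item["prediction"] == expected:
            hits += 1
    return hits / len(results)

# Two results copied from the output above
demo_results = [
    {
        "prompt": "Text:\nAs <we> continue to navigate through uncertain economic conditions, we remain committed <to> delivering value to our customers and shareholders.\n####\n",
        "prediction": ["we", "to"],
    },
    {
        "prompt": "Text:\nOur recent partnership with a <leading> <technology> provider has significantly improved our product offerings and competitiveness.\n####\n",
        "prediction": ["leading", "technology"],
    },
]

print(exact_match_accuracy(demo_results))  # 1.0
```

Passing the full `res_list` gives the accuracy over the whole evaluation set.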

