<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/openai/OpenAI_Finetuning_on_Gorilla_with_wandb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ChatGPT-3.5 Fine-tuning - Gorilla API

Fine-tune ChatGPT-3.5 on the Gorilla API dataset to try to improve its performance:
- [Gorilla project](https://shishirpatil.github.io/gorilla/)
- [Gorilla paper](https://arxiv.org/abs/2305.15334)
- [Gorilla code](https://github.com/ShishirPatil/gorilla)

The OpenAI ChatGPT-3.5 fine-tuning docs [are here](https://platform.openai.com/docs/guides/fine-tuning).

**Warning!**

This fine-tuning script trains on 7.2 million tokens on OpenAI. Check that you're willing to pay for that before proceeding :)
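
To get a feel for the bill before starting, the arithmetic can be sketched as below. The per-1K-token price is a placeholder assumption rather than a figure from OpenAI, so substitute the current rate from the pricing page.

```python
def estimate_training_cost(n_tokens: int, usd_per_1k_tokens: float) -> float:
    """Return an approximate training cost in USD for `n_tokens` billed tokens."""
    return n_tokens / 1000 * usd_per_1k_tokens


# Hypothetical price, for illustration only; check the pricing page for the real rate
PLACEHOLDER_USD_PER_1K = 0.01

print(estimate_training_cost(7_200_000, PLACEHOLDER_USD_PER_1K))
```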


In [None]:
!pip install openai tiktoken wandb -qqq

In [None]:
import re
import os
import json
import wandb
import openai
from pprint import pprint

In [None]:
openai_api_key = "OPENAI API KEY"
openai.api_key = openai_api_key
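
Pasting a key into a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched here, reads the key from an environment variable and only prompts when it is missing. The helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed as the variable name.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment, prompting only if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the input so the key does not end up in the cell output
        key = getpass(f"Enter {env_var}: ")
    return key
```

With this in place, `openai_api_key = load_api_key()` could replace the hard-coded string.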

# Data

Download the Gorilla Hugging Face, TensorFlow and Torch Hub API training data. You can find all the [Gorilla training data here](https://github.com/ShishirPatil/gorilla/tree/main/data/apibench).

In [None]:
!wget https://raw.githubusercontent.com/ShishirPatil/gorilla/cab053ba7fdf4a3286c0e75aa2bf7abc4053812f/data/apibench/huggingface_train.json
!wget https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/tensorflow_train.json
!wget https://raw.githubusercontent.com/ShishirPatil/gorilla/main/data/apibench/torchhub_train.json

Load the data

In [None]:
data = []
data_files = [
    "huggingface_train.json",
    "tensorflow_train.json",
    "torchhub_train.json",
]

for path in data_files:
    with open(path, "r") as f:
        # Each file holds one JSON object per line, so parse it line by line
        for line in f:
            item = json.loads(line.strip())
            data.append(item)

# This is the data relevant to training
data[0]["code"]
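
The training files are read one line at a time, which suggests they are in JSON Lines form rather than a single JSON document. A minimal sketch of the difference, using a made-up two-record string instead of the real files:

```python
import json

# Hypothetical two-record file contents in JSON Lines form
raw = '{"code": "first"}\n{"code": "second"}\n'


def parse_jsonl(text):
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_jsonl(raw)

# Parsing the whole text as one JSON document fails because of the second object
try:
    json.loads(raw)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

print(len(records), whole_file_ok)
```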

### Data Parsing

Parse the training data instructions

In [None]:
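def parse_instructions_and_outputs(code_section):
  """Split a Gorilla `code` field into instruction, domain, api_call, api_provider, explanation and code."""
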
  sections = code_section.split('###')
  instruction = ""
  for section in sections:
      if "Instruction:" in section:
          instruction = section.lower().split("instruction:", 1)[1].strip()
          break

  # domain = re.search(r'<<<domain>>>(.*?)\n', code_section, re.IGNORECASE).group(1).lstrip(': ')
  if "<<<domain>>>" in code_section:
    domain = re.search(r'<<<domain>>>(.*?)<<<', code_section, re.IGNORECASE | re.DOTALL).group(1).lstrip(': ')
  else:
    domain = ""

  api_call = re.search(r'<<<api_call>>>(.*?)<<<', code_section, re.IGNORECASE | re.DOTALL).group(1).lstrip(': ')
  # api_provider = re.search(r'<<<api_provider>>>(.*?)\n', code_section, re.IGNORECASE).group(1).lstrip(': ')
  if "<<<api_provider>>>" in code_section:
    api_provider = re.search(r'<<<api_provider>>>(.*?)<<<', code_section, re.IGNORECASE | re.DOTALL).group(1).lstrip(': ')
  else:
    api_provider = ""

  if "<<<explanation>>>" in code_section:
    explanation_pattern = r'<<<explanation>>>(.*?)(?:\n<<<code>>>|```|$)'
    explanation = re.search(explanation_pattern, code_section, re.IGNORECASE | re.DOTALL).group(1).lstrip(': ')
  else:
    explanation = None

  # Extract the code snippet, which starts after either <<<code>>> or ```
  code_pattern = r'(?:<<<code>>>|```)(.*)'
  code_snippet_match = re.search(code_pattern, code_section, re.IGNORECASE | re.DOTALL)
  code_snippet = code_snippet_match.group(1).lstrip(': ') if code_snippet_match else None

  return instruction, domain, api_call, api_provider, explanation, code_snippet
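
To make the tag-based parsing easier to follow, here is a simplified sketch that pulls one field at a time out of a made-up sample. The sample string and the helper `extract_field` are hypothetical. Unlike the function above, it also strips trailing commas and newlines.

```python
import re
from typing import Optional

# Hypothetical sample in the "###Instruction / ###Output" layout the parser expects
sample = (
    "###Instruction: Translate English text to French.\n"
    "###Output: <<<domain>>>: Natural Language Processing, "
    "<<<api_call>>>: pipeline('translation_en_to_fr'), "
    "<<<api_provider>>>: Hugging Face Transformers, "
    "<<<explanation>>>: 1. Import pipeline. 2. Call it.\n"
    "<<<code>>>: from transformers import pipeline"
)


def extract_field(text: str, tag: str) -> Optional[str]:
    """Return the text between <<<tag>>> and the next <<< marker (or the end)."""
    match = re.search(rf"<<<{tag}>>>(.*?)(?=<<<|$)", text, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Drop the leading ": " and any trailing separator characters
    return match.group(1).strip(" :,\n")


for tag in ("domain", "api_call", "api_provider", "explanation", "code"):
    print(tag, "->", extract_field(sample, tag))
```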

In [None]:
def encode_train_sample(data, api_name):
    """Encode one training sample as a list of chat messages, or return None if it has no API call."""
    code_section = data['code']

    if "<<<api_call>>>" in code_section:
      instruction, domain, api_call, api_provider, explanation, code = parse_instructions_and_outputs(code_section)

      prompts = []

      #prompt = instruction + "\nWrite a python program in 1 to 2 lines to call API in " + api_name + ".\n\nThe answer should follow the format: <<<domain>>> $DOMAIN, <<<api_call>>>: $API_CALL, <<<api_provider>>>: $API_PROVIDER, <<<explanation>>>: $EXPLANATION, <<<code>>>: $CODE}. Here are the requirements:\n" + domains + "\n2. The $API_CALL should have only 1 line of code that calls api.\n3. The $API_PROVIDER should be the programming framework used.\n4. $EXPLANATION should be a step-by-step explanation.\n5. The $CODE is the python code.\n6. Do not repeat the format in your answer."

      prompts.append({"role": "system", "content": "You are a helpful API writer who can write APIs based on requirements."})
      prompts.append({"role": "user", "content": instruction})
      prompts.append({"role": "assistant", "content": f"<<<domain>>> {domain},\
<<<api_call>>>: {api_call}, <<<api_provider>>>: {api_provider}, <<<explanation>>>: {explanation}, <<<code>>>: {code}"})
      return prompts
    else:
      return None

Format the training samples to mirror the format used in the Gorilla paper

In [None]:
encoded_data = []
none_count = 0
for d in data:
  res = encode_train_sample(d, "huggingface")
  if res is not None:
    encoded_data.append({"messages":res})
  else:
    none_count += 1

print(f"{none_count} samples out of {len(data)} ignored")

Print a sample of what will get passed to OpenAI for fine-tuning

In [None]:
encoded_data[333]
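
For reference, each record follows the chat fine-tuning layout: a `messages` list with a system, a user and an assistant entry. A minimal sketch of that shape, with made-up content and a hypothetical helper `to_chat_record`:

```python
def to_chat_record(system_prompt, instruction, answer):
    """Build one training record in the {"messages": [...]} chat layout."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": answer},
        ]
    }


# Hypothetical values, for illustration only
record = to_chat_record(
    "You are a helpful API writer who can write APIs based on requirements.",
    "translate english text to french.",
    "<<<domain>>> Natural Language Processing,<<<api_call>>>: pipeline('translation_en_to_fr')",
)

print([m["role"] for m in record["messages"]])
```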

Save the training data

In [None]:
encoded_file_path = 'all_encoded_data.jsonl'

with open(encoded_file_path, 'w') as file:
    for item in encoded_data:
        line = json.dumps(item)
        file.write(line + '\n')
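
The assistant messages contain newlines, which might look like a problem for a one-record-per-line file. `json.dumps` escapes them, so each record still occupies exactly one line. A small round-trip sketch with made-up records and a temporary file, not the real dataset:

```python
import json
import os
import tempfile

# Hypothetical records; note the newline inside the first assistant message
records = [
    {"messages": [{"role": "user", "content": "load a model"},
                  {"role": "assistant", "content": "line one\nline two"}]},
    {"messages": [{"role": "user", "content": "second example"},
                  {"role": "assistant", "content": "ok"}]},
]


def write_jsonl(path, items):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of objects."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "roundtrip.jsonl")
write_jsonl(path, records)

# Count physical lines in the file, then load the records back
with open(path) as f:
    n_lines = sum(1 for _ in f)
restored = read_jsonl(path)

print(n_lines, restored == records)
```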

In [None]:
# Start a Weights & Biases run to save our data and results
wandb.init(project="gorilla-api")
wandb.log_artifact(encoded_file_path, "hf_tf_th_gorilla_train.jsonl", type="train_data")
wandb.finish()

## OpenAI data verification script

In [None]:
# We start by importing the required packages

import json
import os
import tiktoken
import numpy as np
from collections import defaultdict

# Next, we specify the data path and open the JSONL file

data_path = encoded_file_path

# Load dataset
with open(data_path) as f:
    dataset = [json.loads(line) for line in f]

# We can inspect the data quickly by checking the number of examples and the first item

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

# Now that we have a sense of the data, we need to go through all the different examples and check to make sure the formatting is correct and matches the Chat completions message structure

# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

# Beyond the structure of the message, we also need to ensure that the length does not exceed the 4096 token limit.

# Token counting functions
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

# Last, we can look at the results of the different formatting operations before proceeding with creating a fine-tuning job:

# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
TARGET_EPOCHS = 3
MIN_EPOCHS = 1
MAX_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")
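
The default-epoch estimate in the cell above is easier to reason about as a function. This sketch repeats the same arithmetic with the same constants. It mirrors the estimate in this notebook and is not a statement of what OpenAI does internally.

```python
def default_n_epochs(
    n_train_examples,
    target_epochs=3,
    min_target_examples=100,
    max_target_examples=25000,
    min_epochs=1,
    max_epochs=25,
):
    """Estimate the number of epochs from the dataset size."""
    n_epochs = target_epochs
    if n_train_examples * target_epochs < min_target_examples:
        # Small dataset: train for more epochs, capped at max_epochs
        n_epochs = min(max_epochs, min_target_examples // n_train_examples)
    elif n_train_examples * target_epochs > max_target_examples:
        # Large dataset: train for fewer epochs, but at least min_epochs
        n_epochs = max(min_epochs, max_target_examples // n_train_examples)
    return n_epochs


for n in (2, 10, 1000, 10000, 50000):
    print(n, "examples ->", default_n_epochs(n), "epochs")
```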

In [None]:
run = wandb.init(project="gorilla-api")  # the earlier run was already finished
run.summary["num_samples"] = len(dataset)
run.summary["n_billing_tokens_in_dataset"] = n_billing_tokens_in_dataset
run.finish()

# Start Fine-tuning ChatGPT-3.5

Create an OpenAI training file

In [None]:
openai.File.create(
  file=open(encoded_file_path, "rb"),
  purpose='fine-tune'
)

Create your fine-tuning job

In [None]:
openai.api_key = openai_api_key
openai.FineTuningJob.create(
    training_file="file-N9M4sC8GfXgTNw0WAwgiLHNR",  # replace with the "id" returned by openai.File.create above
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3}
)

In [None]:
openai.FineTuningJob.list_events(id="ftjob-ShHWEMHa2U7gRNVTpjOYEZEP", limit=5)
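
Fine-tuning takes a while, so rather than re-running the events cell by hand, a small polling loop can wait for the job to finish. This is a sketch: the helper `wait_for_job` is hypothetical, and the set of terminal status names is an assumption that should be checked against the fine-tuning docs. The status lookup is passed in as a function so the loop itself does not depend on a particular client version.

```python
import time
from typing import Callable

# Assumed terminal status names; verify them against the fine-tuning docs
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 60.0,
    max_polls: int = 120,
) -> str:
    """Call `fetch_status` repeatedly until it returns a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job still not finished after {max_polls} polls")
```

In this notebook, `fetch_status` could be something like `lambda: openai.FineTuningJob.retrieve(job_id)["status"]`, where `job_id` is your own job ID.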

### Log the results to Weights & Biases when the model is finished training

(Temporarily install `openai` from a fork until this PR, which updates the wandb logger, is merged into openai: https://github.com/openai/openai-python/pull/590)

In [None]:
!pip uninstall -y openai -qq && pip install git+https://github.com/morganmcg1/openai-python.git@update_wandb_logger -qqq

Run `openai wandb sync` to sync your openai results to W&B

In [None]:
!OPENAI_API_KEY={openai_api_key} openai wandb sync --entity prompt-eng --project gorilla-api --id ftjob-mNSsI2UcxCvpV767GmnYoSzR

### Other useful openai commands

List 10 fine-tuning jobs

In [None]:
openai.FineTuningJob.list(limit=10)

Retrieve the state of a fine-tune

In [None]:
state = openai.FineTuningJob.retrieve("ftjob-qhg4yswil15TCqD4SNHn0V1D")
state["status"], state["trained_tokens"], state["finished_at"]

List up to 10 events from a fine-tuning job

In [None]:
openai.FineTuningJob.list_events(id="ftjob-qhg4yswil15TCqD4SNHn0V1D", limit=10)

# Use the Model

In [None]:
openai.api_key = openai_api_key

completion = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How can i load a NER model?"}
  ]
)

In [None]:
pprint(completion.choices[0].message)
pprint(completion.choices[0].message["content"])
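
The training records use a specific system prompt and lower-cased instructions, so prompts at inference time may work better when they match that format. A minimal sketch of a helper that builds such messages follows. The name `build_messages` is hypothetical, and whether matching the prompt actually helps has not been measured here.

```python
# System prompt copied from the training records built earlier in this notebook
SYSTEM_PROMPT = "You are a helpful API writer who can write APIs based on requirements."


def build_messages(instruction: str) -> list:
    """Build chat messages in the same layout as the training records."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # The training instructions were lower-cased and stripped, so do the same here
        {"role": "user", "content": instruction.lower().strip()},
    ]


messages = build_messages("  How can I load a NER model? ")
print(messages)
```

The resulting list can be passed as the `messages` argument of the chat completion call shown above.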