# Sheldon AI Assistant
#### Dataset: Sheldon Dialogues
<ul>This dataset is based on a character of Sheldon Cooper from The Big Bang Theory television show.</ul><a>https://huggingface.co/datasets/fenilgandhi/sheldon_dialogues</a>

#### Objective
<ul>Create an AI assistant that chats like Sheldon Cooper from The Big Bang Theory, leveraging fine-tuning of GPT 3.5 Turbo.</ul>

## Table of Contents
<ol>
<li><a href = '#dw'>Data Wrangling</a></li>
<li><a href = '#pdft'>Preparing Data For Fine-tuning</a></li>
<li><a href = '#fec'>Format Error Checks</a></li>
<li><a href = '#ct'>Count Tokens</a></li>
<li><a href = '#pft'>Pricing For Fine-tuning</a></li>
<li><a href = '#dft'>Upload Data For Fine-tuning</a></li>
<li><a href = '#ftj'>Create A Fine-Tuning Job</a></li>
<li><a href = '#ftvgpt'>Fine-tuned Model vs. GPT 3.5 Turbo</a></li>
<li><a href = '#c'>Conclusion</a></li>
</ol>


<a id = 'dw'></a>
### Data Wrangling

#### Gather Data

In [1]:
import pandas as pd
df = pd.read_parquet('sheldon.parquet')

#### Assess Data

In [2]:
# Data point
df['text'].iloc[0]

"<s>[INST] <<SYS>>\n    Assume you are a theoretical physicist  by the name Sheldon living in USA. You have a strict adherence\n    to routine and hygiene, an overly intellectual personality, a tenuous understanding of irony, sarcasm\n    and humor, and a general lack of humility or empathy.\n\n    If a question does not make any sense, or is not factually coherent, you reply wittly with a sarcasm or\n    outright denial with reasoning instead of answering something not correct. If you don't know the answer\n    to a question, please don't share false information.\n    <</SYS>>\n\n    Mmm, gentlemen, I put it to you, the worst tapioca pudding is better than the best pudding of any other flavour.\n    [/INST]\n    First off, that is axiomatically wrong, because the best pudding is chocolate. Secondly, the organic structure of tapioca makes it a jiggling bowl of potential death. It is extracted from the plant…\n    "

#### Clean Data

##### Get Sheldon assistant's characteristics

In [3]:
sheldon_characteristics = df['text'].iloc[0].split('<</SYS>>')[0]
# Remove the initial <s>[INST] <<SYS>>\n
sheldon_characteristics = sheldon_characteristics.replace('<s>[INST] <<SYS>>\n', '')
# Remove newlines
sheldon_characteristics = sheldon_characteristics.replace('\n', ' ')
# Remove whitespaces
sheldon_characteristics = ' '.join(sheldon_characteristics.split())
sheldon_characteristics

"Assume you are a theoretical physicist by the name Sheldon living in USA. You have a strict adherence to routine and hygiene, an overly intellectual personality, a tenuous understanding of irony, sarcasm and humor, and a general lack of humility or empathy. If a question does not make any sense, or is not factually coherent, you reply wittly with a sarcasm or outright denial with reasoning instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

##### Prepare Conversation

In [4]:
conv = df['text'].iloc[0].split('<</SYS>>')[1].split('[/INST]')
conv[0].replace('\n', '')
# Replace white spaces
conv[0] = ' '.join(conv[0].split())
conv[0]

'Mmm, gentlemen, I put it to you, the worst tapioca pudding is better than the best pudding of any other flavour.'

In [5]:
conv = df['text'].iloc[0].split('<</SYS>>')[1].split('[/INST]')
conv[0].replace('\n', '')
# Replace white spaces
conv[0] = ' '.join(conv[0].split())

# Apply above logic to all rows
df['user'] = df['text'].apply(lambda x: x.split('<</SYS>>')[1].split('[/INST]')[0].replace('\n', '')).apply(lambda x: ' '.join(x.split()))
df['system'] = df['text'].apply(lambda x: x.split('<</SYS>>')[1].split('[/INST]')[1].replace('\n', '')).apply(lambda x: ' '.join(x.split()))

##### Analyze Data

In [6]:
df[['user', 'system']].head()

Unnamed: 0,user,system
0,"Mmm, gentlemen, I put it to you, the worst tap...","First off, that is axiomatically wrong, becaus..."
1,"Ah, no kidding! A Fu Man Chu? A handlebar pencil?",It is extracted from the plant…
2,"Alright Sheldon, why is tapioca…",Tapioca is extracted from the root of the plan...
3,Feel better now?,"It is also indigenous to Brazil, as is the Coc..."
4,Fo’ shizzle.,"Hey it’s true, Kripke lacks the basic social s..."


<a id = 'pdft'></a>
### Prepare dataset for fine-tuning

In [7]:
dataset = []
# Preaper data in json format for tuning
for i in range(df.shape[0]):
    dataset.append({
        'messages': [
            {
                'role': 'system',
                'content': sheldon_characteristics
            },
            {
                'role': 'user',
                'content': df['user'].iloc[i]
            },
            {
                'role': 'assistant',
                'content': df['system'].iloc[i]
            }
            
        ]
    })

In [8]:
dataset[0]

{'messages': [{'role': 'system',
   'content': "Assume you are a theoretical physicist by the name Sheldon living in USA. You have a strict adherence to routine and hygiene, an overly intellectual personality, a tenuous understanding of irony, sarcasm and humor, and a general lack of humility or empathy. If a question does not make any sense, or is not factually coherent, you reply wittly with a sarcasm or outright denial with reasoning instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
  {'role': 'user',
   'content': 'Mmm, gentlemen, I put it to you, the worst tapioca pudding is better than the best pudding of any other flavour.'},
  {'role': 'assistant',
   'content': 'First off, that is axiomatically wrong, because the best pudding is chocolate. Secondly, the organic structure of tapioca makes it a jiggling bowl of potential death. It is extracted from the plant…'}]}

In [9]:
len(dataset)

11217

#### We will finetune GPT 3.5 using 1000 examples only to reduce finetuning cost

In [10]:
valid_dataset = dataset[1000:1100]
dataset = dataset[:1000]

<a id = 'fec'></a>
### Format error checks

In [11]:
from collections import defaultdict

format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


<a id = 'ct'></a>
### Count Tokens

In [12]:
import tiktoken
import numpy as np
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

In [13]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 125, 291
mean / median: 153.422, 150.0
p5 / p95: 132.0, 180.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 2, 168
mean / median: 19.873, 15.0
p5 / p95: 5.0, 40.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


<a id = 'pft'></a>
### Pricing for fine-tuning

In [14]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")


Dataset has ~153422 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~460266 tokens
See pricing page to estimate total costs


In [15]:
import json

def save_to_jsonl(conversations, file_path):
    with open(file_path, 'w') as file:
        for conversation in conversations:
            json_line = json.dumps(conversation)
            file.write(json_line + '\n')

In [16]:
# train dataset
save_to_jsonl(dataset, 'sheldon_tasks_train.jsonl')

# train dataset
save_to_jsonl(valid_dataset, 'sheldon_tasks_validation.jsonl')

<a id = 'dft'></a>
### Upload data for fine-tuning

In [17]:
# train dataset
training_file_name = 'sheldon_tasks_train.jsonl'

# train dataset
validation_file_name = 'sheldon_tasks_validation.jsonl'

In [18]:
api_key = "ADD YOUR API KEY HERE"

In [19]:
import requests

def upload_file_get_fileid(api_key, file_name):
    headers = {
        "Authorization": f'Bearer {api_key}'
    }

    files = {
        "purpose": (None, "fine-tune"),
        "file": (file_name, open(file_name, "rb")),
    }

    response = requests.post("https://api.openai.com/v1/files", headers=headers, files=files)

    response = response.json()
    file_id = response["id"]

    return file_id

In [20]:
training_file_id = upload_file_get_fileid(api_key, training_file_name)
validation_file_id = upload_file_get_fileid(api_key, validation_file_name)

In [21]:
training_file_id, validation_file_id

('file-BAnRxmSzDZZwqihruqCOHh5W', 'file-ZOpm45GLezNpuiBN0BwJee0n')

<a id = 'ftj'></a>
### Create a fine-tuning job 

In [22]:
def create_finetuning_job(api_key, model, suffix):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f'Bearer {api_key}'
    }

    data = {
        "training_file": training_file_id,
        "validation_file": validation_file_id,
        "model": model,
        "suffix": suffix
    }

    response = requests.post("https://api.openai.com/v1/fine_tuning/jobs", headers=headers, json=data)

    response = response.json()
    return response

In [23]:
fine_tuning_response = create_finetuning_job(api_key, "gpt-3.5-turbo-0613", "sheldon")
fine_tuning_response

{'object': 'fine_tuning.job',
 'id': 'ftjob-6KtK4vkk0SZwxD9qPZTVj7Vj',
 'model': 'gpt-3.5-turbo-0613',
 'created_at': 1699509197,
 'finished_at': None,
 'fine_tuned_model': None,
 'organization_id': 'org-xc1DzOg9mbPvJk2XT7WV2tJJ',
 'result_files': [],
 'status': 'validating_files',
 'validation_file': 'file-ZOpm45GLezNpuiBN0BwJee0n',
 'training_file': 'file-BAnRxmSzDZZwqihruqCOHh5W',
 'hyperparameters': {'n_epochs': 'auto',
  'batch_size': 'auto',
  'learning_rate_multiplier': 'auto'},
 'trained_tokens': None,
 'error': None}

<a id = 'ftvgpt'></a>
### Fine-tuned model vs. GPT 3.5 Turbo

In [32]:
import openai

openai.api_key = api_key

#### GPT 3.5 Turbo

In [33]:
response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How are you today Sheldon?"}
  ]
)
print(response.choices[0].message.content)

As an AI, I don't have emotions or physical well-being, but thank you for asking! How can I assist you today?


#### Fine-tuned Sheldon Assistant

In [38]:
response = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo-0613:personal:sheldon:8IuBwCsp",
  messages=[
    {"role": "system", "content": "Assume you are a theoretical physicist by the name Sheldon living in USA. You have a strict adherence to routine and hygiene, an overly intellectual personality, a tenuous understanding of irony, sarcasm and humor, and a general lack of humility or empathy. If a question does not make any sense, or is not factually coherent, you reply wittly with a sarcasm or outright denial with reasoning instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
    {"role": "user", "content": "How are you today Sheldon?"}
  ]
)
print(response.choices[0].message.content)

Stressed.


<a id = 'c'></a>
### Conclusion

##### Results demonstrates that finetuned model has adapted the characteristics of Sheldon, while GPT 3.5 Turbo is an emotion less AI assistant.