# Fine-Tuning OpenAI Models Walkthrough
#### Using the GoEmotions Dataset
The GoEmotions dataset contains 58k carefully curated Reddit comments, each labeled with one or more of 27 emotion categories or Neutral.
https://huggingface.co/datasets/go_emotions

---

## Loading and Processing the Data

In [1]:
# Importing with huggingface datasets package
import pandas as pd
from datasets import load_dataset

df = load_dataset('go_emotions')

Generating train split: 43410 examples
Generating validation split: 5426 examples
Generating test split: 5427 examples

In [114]:
df['train'][0]

{'text': "My favourite food is anything I didn't have to cook myself.",
 'labels': [27],
 'id': 'eebbqej'}

In [31]:
# creating an emotion index label dictionary
label_index = {
    "0": "admiration",
    "1": "amusement",
    "2": "anger",
    "3": "annoyance",
    "4": "approval",
    "5": "caring",
    "6": "confusion",
    "7": "curiosity",
    "8": "desire",
    "9": "disappointment",
    "10": "disapproval",
    "11": "disgust",
    "12": "embarassment",
    "13": "excitement",
    "14": "fear",
    "15": "gratitude",
    "16": "grief",
    "17": "joy",
    "18": "love",
    "19": "nervousness",
    "20": "optimism",
    "21": "pride",
    "22": "realization",
    "23": "relief",
    "24": "remorse",
    "25": "sadness",
    "26": "surprise",
    "27": "neutral"
}

In [71]:
# Pull the first 1000 entries into a pandas dataframe
data = []
for i in range(0, 1000):
    comment = df['train'][i]['text']
    label_indices = df['train'][i]['labels']

    # deal with labels
    if not isinstance(label_indices, list):
        label_indices = [label_indices]

    # label mapping
    emotions = ', '.join([label_index.get(str(label)) for label in label_indices])

    data.append((comment, emotions))

comments_df = pd.DataFrame(data, columns=["comment", "emotion label"])
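
The label-mapping step inside the loop above can be isolated into a small helper, which makes it easier to check on its own. This is a minimal sketch: `labels_to_string` and the shortened `sample_label_index` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
# Hypothetical helper illustrating the label-mapping step; not from the original notebook.
# sample_label_index is a three-entry subset of the label_index dictionary defined earlier.
sample_label_index = {"8": "desire", "20": "optimism", "27": "neutral"}


def labels_to_string(label_indices, index_to_name):
    """Turn GoEmotions label ids into a comma-separated string of emotion names."""
    # A single id may arrive as a bare int, so normalise it to a list first.
    if not isinstance(label_indices, list):
        label_indices = [label_indices]
    # The mapping uses string keys, so each id is converted before lookup.
    return ", ".join(index_to_name[str(label)] for label in label_indices)


print(labels_to_string([8, 20], sample_label_index))
print(labels_to_string(27, sample_label_index))
```

Indexing with `index_to_name[...]` instead of `.get(...)` raises a `KeyError` on an unknown id, which surfaces a bad label immediately instead of failing later inside `', '.join`.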

In [72]:
comments_df.head(50)

| | comment | emotion label |
| --- | --- | --- |
| 0 | My favourite food is anything I didn't have to... | neutral |
| 1 | Now if he does off himself, everyone will thin... | neutral |
| 2 | WHY THE FUCK IS BAYLESS ISOING | anger |
| 3 | To make her feel threatened | fear |
| 4 | Dirty Southern Wankers | annoyance |
| 5 | OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe... | surprise |
| 6 | Yes I heard abt the f bombs! That has to be wh... | gratitude |
| 7 | We need more boards and to create a bit more s... | desire, optimism |
| 8 | Damn youtube and outrage drama is super lucrat... | admiration |
| 9 | It might be linked to the trust factor of your... | neutral |


In [63]:
# quick check to see if all emotions are used

emotion_counts = {emotion: 0 for emotion in label_index.values()}

for _, emotion_combo in data:
    for emotion in emotion_combo.split(', '):
        emotion_counts[emotion] += 1

missing_emotions = [emotion for emotion, count in emotion_counts.items() if count == 0]

if not missing_emotions:
    print("All emotions are present at least once.")
else:
    print("The following emotions are missing:", missing_emotions)

All emotions are present at least once.


In [64]:
print(emotion_counts)

{'admiration': 99, 'amusement': 61, 'anger': 41, 'annoyance': 57, 'approval': 67, 'caring': 21, 'confusion': 33, 'curiosity': 50, 'desire': 13, 'disappointment': 34, 'disapproval': 49, 'disgust': 13, 'embarassment': 6, 'excitement': 13, 'fear': 10, 'gratitude': 63, 'grief': 6, 'joy': 39, 'love': 47, 'nervousness': 3, 'optimism': 29, 'pride': 1, 'realization': 19, 'relief': 2, 'remorse': 9, 'sadness': 32, 'surprise': 35, 'neutral': 336}
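
The counts above are heavily skewed, which matters when reading the later comparisons. The sketch below recomputes a few summary figures from the printed dictionary. The total exceeds 1,000 because a comment can carry several labels. The name `counts_from_output` and the `RARE_THRESHOLD` cut-off are assumptions made for illustration only.

```python
# Counts copied from the printed output above.
counts_from_output = {
    'admiration': 99, 'amusement': 61, 'anger': 41, 'annoyance': 57,
    'approval': 67, 'caring': 21, 'confusion': 33, 'curiosity': 50,
    'desire': 13, 'disappointment': 34, 'disapproval': 49, 'disgust': 13,
    'embarassment': 6, 'excitement': 13, 'fear': 10, 'gratitude': 63,
    'grief': 6, 'joy': 39, 'love': 47, 'nervousness': 3, 'optimism': 29,
    'pride': 1, 'realization': 19, 'relief': 2, 'remorse': 9,
    'sadness': 32, 'surprise': 35, 'neutral': 336,
}

RARE_THRESHOLD = 10  # hypothetical cut-off, chosen only for this illustration

# Total number of label assignments across the 1000 sampled comments.
total_assignments = sum(counts_from_output.values())

# Share of all label assignments taken by each emotion.
shares = {emotion: count / total_assignments
          for emotion, count in counts_from_output.items()}

# Emotions with fewer than RARE_THRESHOLD occurrences in the sample.
rare_emotions = sorted(emotion for emotion, count in counts_from_output.items()
                       if count < RARE_THRESHOLD)

print(f"total label assignments: {total_assignments}")
print(f"neutral share: {shares['neutral']:.1%}")
print("rare emotions:", rare_emotions)
```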


## Language Model Setup for Comparison

---

In [104]:
# using gpt-4-t to attempt the same labeling tasks
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import os
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

In [115]:
emotion_analysis_template = """
You are a cutting edge emotion analysis classification assistant. \
You analyze a comment, and apply one or more emotion labels to it. \

The emotion labels are detailed here: \

{emotions}

Your output should simply be just the respective emotion, and if there are multiple, separated with commas. \

The comment is here: {comment}
"""

output_parser = StrOutputParser()

# different models to plug in (plus the fine tuned one!)
gpt4t_llm = ChatOpenAI(temperature=0.0, model="gpt-4-turbo-preview")
ft_llm = ChatOpenAI(temperature=0.0, model="ft:gpt-3.5-turbo-0125:personal:go-emotions:95jDha5f")
gpt35t_llm = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo-0125")

emotion_analysis_prompt = ChatPromptTemplate.from_template(emotion_analysis_template)

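# The chain is invoked with a dict that already has "comment" and "emotions" keys,
# so the prompt can consume it directly. Mapping both keys to RunnablePassthrough()
# would place the entire input dict into each template slot.
analysis_chain = (
    emotion_analysis_prompt
    | gpt4t_llm
    | output_parser
)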

# list of emotions instead of the dictionary
emotions = list(label_index.values())

## GPT-4-T Output Comparison

In [None]:
# The initial gpt-4-t baseline on the first 100 comments
def analyze_emotion(comment, emotions):
    input_data = {
        "comment": comment,
        "emotions": emotions
    }
    try:
        result = analysis_chain.invoke(input_data)
        return result
    except Exception as e:
        return str(e)

gpt4t_emotion_labels = []

# Loop through the first 100 comments
for i in range(100):
    comment = comments_df['comment'][i]  # Get the i-th comment
    emotion_label = analyze_emotion(comment, emotions)  # Analyze the comment
    gpt4t_emotion_labels.append(emotion_label)  # Append the result to the list

# Assign the list of emotion labels to a new column in the DataFrame
comments_df['gpt4t emotion label'] = pd.Series(gpt4t_emotion_labels)

comments_df.to_excel('comments_baseline_gpt4.xlsx', index=False)


In [89]:
comments_df.head(50)

| | comment | emotion label | gpt4t emotion label |
| --- | --- | --- | --- |
| 0 | My favourite food is anything I didn't have to... | neutral | amusement, joy |
| 1 | Now if he does off himself, everyone will thin... | neutral | disapproval, sadness, confusion |
| 2 | WHY THE FUCK IS BAYLESS ISOING | anger | anger, annoyance, confusion |
| 3 | To make her feel threatened | fear | disapproval, anger, fear |
| 4 | Dirty Southern Wankers | annoyance | disapproval, disgust, anger |
| 5 | OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe... | surprise | amusement, annoyance, disapproval |
| 6 | Yes I heard abt the f bombs! That has to be wh... | gratitude | gratitude, amusement, anticipation |
| 7 | We need more boards and to create a bit more s... | desire, optimism | optimism, approval |
| 8 | Damn youtube and outrage drama is super lucrat... | admiration | amusement, realization |
| 9 | It might be linked to the trust factor of your... | neutral | curiosity |
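
The sample above shows that the model can return labels outside the supplied list: row 6 includes `anticipation`, which is not one of the 28 labels. A minimal sketch for flagging such cases follows, assuming the model output is a comma-separated string. `parse_labels` and `unknown_labels` are hypothetical helper names, and `ALLOWED_LABELS` repeats the values of `label_index` so the block is self-contained.

```python
# Same 28 labels as label_index.values(), spelled exactly as in the notebook.
ALLOWED_LABELS = {
    'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring',
    'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval',
    'disgust', 'embarassment', 'excitement', 'fear', 'gratitude', 'grief',
    'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization',
    'relief', 'remorse', 'sadness', 'surprise', 'neutral',
}


def parse_labels(raw):
    """Split a comma-separated label string into a set of lowercase labels."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def unknown_labels(raw, allowed=ALLOWED_LABELS):
    """Return the labels in `raw` that are not in the allowed set, sorted."""
    return sorted(parse_labels(raw) - allowed)


print(unknown_labels("gratitude, amusement, anticipation"))
print(unknown_labels("optimism, approval"))
```

The same check would also catch API error strings, since `analyze_emotion` returns `str(e)` in place of a label when a call fails.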


## Creating Training & Validation Datasets in JSONL Format

---

In [100]:
# Saving the training/validation data
import json

training_data = []

for index, row in comments_df.iloc[100:700].iterrows():
    entry = {
        "messages": [
            {
                "role": "system",
                "content": "You are a cutting edge emotion analysis classification assistant. You analyze a comment, and apply one or more emotion labels to it. The emotion labels are detailed here: ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']. Your output should simply be just the respective emotion, and if there are multiple, separated with commas."
            },
            {
                "role": "user",
                "content": f"Analyze the emotion of this comment: {row['comment']}"
            },
            {
                "role": "assistant",
                "content": f"{row['emotion label']}"
            }
        ]
    }
    
    training_data.append(entry)

# Write one JSON object per line (JSONL), the format used for fine-tuning files
with open('emotion_training_data.jsonl', 'w') as file:
    for entry in training_data:
        file.write(json.dumps(entry) + '\n')


In [99]:
validation_data = []

for index, row in comments_df.iloc[700:1000].iterrows():
    entry = {
        "messages": [
            {
                "role": "system",
                "content": "You are a cutting edge emotion analysis classification assistant. You analyze a comment, and apply one or more emotion labels to it. The emotion labels are detailed here: ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']. Your output should simply be just the respective emotion, and if there are multiple, separated with commas."
            },
            {
                "role": "user",
                "content": f"Analyze the emotion of this comment: {row['comment']}"
            },
            {
                "role": "assistant",
                "content": f"{row['emotion label']}"
            }
        ]
    }
    
    validation_data.append(entry)

# Write one JSON object per line (JSONL), the format used for fine-tuning files
with open('emotion_validation_data.jsonl', 'w') as file:
    for entry in validation_data:
        file.write(json.dumps(entry) + '\n')
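
Before uploading, it can be worth confirming that every line of the two files parses as JSON and has the system/user/assistant structure built above. The following is a minimal sketch under that assumption. `check_jsonl_lines` is a hypothetical helper, and its role-order check mirrors this notebook's records and is stricter than the general chat fine-tuning format.

```python
import json

# Role order used by every record in this walkthrough.
REQUIRED_ROLES = ["system", "user", "assistant"]


def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((line_number, "missing 'messages' list"))
            continue
        # Non-dict entries yield None, so they also fail the role comparison.
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
        if roles != REQUIRED_ROLES:
            problems.append((line_number, f"expected roles {REQUIRED_ROLES}, got {roles}"))
        elif not all(isinstance(m.get("content"), str) and m["content"].strip()
                     for m in messages):
            problems.append((line_number, "empty or non-string content"))
    return problems


good_line = json.dumps({"messages": [
    {"role": "system", "content": "You label emotions."},
    {"role": "user", "content": "Analyze the emotion of this comment: Thanks!"},
    {"role": "assistant", "content": "gratitude"},
]})
bad_line = '{"messages": [{"role": "user", "content": "no answer"}]}'

print(check_jsonl_lines([good_line, bad_line]))
```

To check a file on disk, pass the open file object, for example `check_jsonl_lines(open('emotion_training_data.jsonl'))`, since iterating a text file yields one line at a time.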

## Post-Training Notes

### Cost: $2.96
### Training Time: 01:28:58
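
The notebook does not show how the fine-tuning job itself was launched. The sketch below is one way to do it with the official `openai` Python SDK (v1.x). It is an assumption and not a record of what was actually run. The base model and the `go-emotions` suffix are inferred from the fine-tuned model name used elsewhere in this walkthrough, and `build_job_request` and `launch_fine_tune` are hypothetical helper names.

```python
def build_job_request(training_file_id, validation_file_id,
                      base_model="gpt-3.5-turbo-0125", suffix="go-emotions"):
    """Assemble the keyword arguments for a fine-tuning job (assumed settings)."""
    return {
        "training_file": training_file_id,
        "validation_file": validation_file_id,
        "model": base_model,
        "suffix": suffix,
    }


def launch_fine_tune(train_path="emotion_training_data.jsonl",
                     val_path="emotion_validation_data.jsonl"):
    """Upload both JSONL files and start a fine-tuning job; returns the job id."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(train_path, "rb") as f:
        train_file = client.files.create(file=f, purpose="fine-tune")
    with open(val_path, "rb") as f:
        val_file = client.files.create(file=f, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        **build_job_request(train_file.id, val_file.id)
    )
    return job.id


# Inspect the request without calling the API (the file ids here are placeholders).
print(build_job_request("file-train-placeholder", "file-val-placeholder"))
```

Calling `launch_fine_tune()` uploads data and incurs cost, so it is left uncalled here.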

## Running the Different Models and Exporting the Output to Excel

---

In [116]:
# This cell was rerun with a different model each time to build up the DataFrame columns by hand before exporting to Excel

gpt35t_llm = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo-0125")
gpt35t_emotion_labels = []

# The input dict already has "comment" and "emotions" keys, so it feeds the prompt directly
analysis_chain = (
    emotion_analysis_prompt
    | gpt35t_llm
    | output_parser
)

for i in range(100):
    comment = comments_df['comment'][i]  
    emotion_label = analyze_emotion(comment, emotions)  
    gpt35t_emotion_labels.append(emotion_label) 

# Assign the list of emotion labels to a new column in the DataFrame
comments_df['base gpt35t emotion label'] = pd.Series(gpt35t_emotion_labels)

comments_df.to_excel('comments_output_test.xlsx', index=False)

In [None]:
ft_llm = ChatOpenAI(temperature=0.0, model="ft:gpt-3.5-turbo-0125:personal:go-emotions:95jDha5f")

# The input dict already has "comment" and "emotions" keys, so it feeds the prompt directly
analysis_chain = (
    emotion_analysis_prompt
    | ft_llm
    | output_parser
)

ft_gpt35t_emotion_labels = []

for i in range(100):
    comment = comments_df['comment'][i]  
    emotion_label = analyze_emotion(comment, emotions)  
    ft_gpt35t_emotion_labels.append(emotion_label) 

# Assign the list of emotion labels to a new column in the DataFrame
comments_df['ft gpt35t emotion label'] = pd.Series(ft_gpt35t_emotion_labels)

comments_df.to_excel('comments_output_test.xlsx', index=False)

In [117]:
comments_df.head(50)

| | comment | emotion label | gpt4t emotion label | ft gpt35t emotion label | base gpt35t emotion label |
| --- | --- | --- | --- | --- | --- |
| 0 | My favourite food is anything I didn't have to... | neutral | amusement, joy | neutral | admiration, approval |
| 1 | Now if he does off himself, everyone will thin... | neutral | disapproval, sadness, confusion | neutral | confusion, disapproval, sadness |
| 2 | WHY THE FUCK IS BAYLESS ISOING | anger | anger, annoyance, confusion | anger | anger, confusion |
| 3 | To make her feel threatened | fear | disapproval, anger, fear | fear | fear, threat |
| 4 | Dirty Southern Wankers | annoyance | disapproval, disgust, anger | neutral | disgust, anger |
| 5 | OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe... | surprise | amusement, annoyance, disapproval | anger, disapproval | anger, annoyance, disapproval |
| 6 | Yes I heard abt the f bombs! That has to be wh... | gratitude | gratitude, amusement, anticipation | neutral | excitement, nervousness |
| 7 | We need more boards and to create a bit more s... | desire, optimism | optimism, approval | neutral | approval, optimism |
| 8 | Damn youtube and outrage drama is super lucrat... | admiration | amusement, realization | neutral | anger, excitement |
| 9 | It might be linked to the trust factor of your... | neutral | curiosity | neutral | curiosity, trust |
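
A side-by-side table like the one above can be summarised with simple multi-label scores. The sketch below computes exact-match rate and mean Jaccard overlap between gold and predicted label sets. These are common choices for multi-label output, not necessarily the metrics the author used. The helper names are hypothetical, and the sample lists are the first five rows of the table, which is far too few for a meaningful evaluation.

```python
def to_set(raw):
    """Split a comma-separated label string into a set of lowercase labels."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def jaccard(gold, pred):
    """Size of the intersection divided by size of the union of two label sets."""
    gold_set, pred_set = to_set(gold), to_set(pred)
    union = gold_set | pred_set
    return len(gold_set & pred_set) / len(union) if union else 1.0


def score(gold_labels, predicted_labels):
    """Return (exact-match rate, mean Jaccard overlap) over paired label strings."""
    pairs = list(zip(gold_labels, predicted_labels))
    exact = sum(to_set(g) == to_set(p) for g, p in pairs) / len(pairs)
    overlap = sum(jaccard(g, p) for g, p in pairs) / len(pairs)
    return exact, overlap


# First five rows of the table above (gold labels and two of the model columns).
gold = ["neutral", "neutral", "anger", "fear", "annoyance"]
ft_pred = ["neutral", "neutral", "anger", "fear", "neutral"]
gpt4t_pred = [
    "amusement, joy",
    "disapproval, sadness, confusion",
    "anger, annoyance, confusion",
    "disapproval, anger, fear",
    "disapproval, disgust, anger",
]

print("fine-tuned:", score(gold, ft_pred))
print("gpt-4-turbo:", score(gold, gpt4t_pred))
```

Exact match rewards the single-label style of the gold annotations, while Jaccard gives partial credit when a model returns the gold label alongside extra ones.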
