# Comment Emotion Class

- In this notebook, I prepared a batch to classify the emotional tone expressed in each thesis-relevant comment.  
- I used the rewritten comment text to build the batch JSONL file.  
- The prompt asked GPT-4o Mini to assign one of six predefined emotion categories to each comment: anger, sadness, joy, hope, disapproval, or neutral.  
- To stay within the token limit, I split the batch into two parts before submitting it to OpenAI’s Batch API.  
- Once the outputs were ready, I downloaded the JSONL files and extracted each `custom_id` together with its assigned emotion class.  
- I saved the results to a CSV so I could analyse the emotional distribution of comments across different countries, events, and categories.


### Import Libraries

In [None]:
import pandas as pd
import json
from tqdm import tqdm
import os
import openai

In [2]:
# Loading the full dataset
df = pd.read_csv("Thesis_Relevant_With_Transcript_Influence_and_Agreement.csv")
print(f"Loaded {len(df)} comments.")

Loaded 71441 comments.




### Creating the JSONL File

In [17]:
output_jsonl = "batch_step_emotion_class.jsonl"

# Opening the file and writing each comment as a separate task
with open(output_jsonl, "w", encoding="utf-8") as f:
    for _, row in tqdm(df.iterrows(), total=len(df), desc="Building Emotion Class batch (Balanced)"):
        comment_id = str(row["Comment_ID"])
        comment = row["Rewritten Comment"]

        # Building the request for GPT-4o Mini
        task = {
            "custom_id": comment_id,  # Used to keep track of each comment
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are helping a researcher identify the dominant emotion expressed in YouTube comments."
                    },
                    {
                        "role": "user",
                        "content": f"""Comment: "{comment}"\n\nWhat is the dominant emotion expressed in this comment?\n\nReply ONLY with one of the following:\n- anger\n- sadness\n- joy\n- hope\n- disapproval\n- neutral"""
                    }
                ],
                "temperature": 0,   # Keep replies as repeatable as possible
                "max_tokens": 10    # A single label needs only a few tokens
            }
        }

        f.write(json.dumps(task) + "\n")

print(f"Saved: {output_jsonl}")

Building Emotion Class batch (Balanced): 100%|█| 71441/71441 [00:07<00:00, 10045.6]
Saved: batch_step_emotion_class.jsonl
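
Before uploading, it can help to confirm that every line of the batch file is a well-formed request. The sketch below is not part of the original notebook. It assumes each request needs the four top-level keys written above (`custom_id`, `method`, `url`, `body`) and that `custom_id` values should be unique within a file. The helper name `validate_batch_lines` is hypothetical.

```python
import json

# Top-level keys written for every request in the cell above
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed requests."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
            continue
        # Assumption: duplicate IDs would make results ambiguous to join back
        if task["custom_id"] in seen_ids:
            problems.append((number, f"duplicate custom_id: {task['custom_id']}"))
        seen_ids.add(task["custom_id"])
    return problems
```

Passing the open file handle for `batch_step_emotion_class.jsonl` to this function would return an empty list when every line passes these checks.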


### Splitting the Batch due to Token Limit

In [19]:
input_file = "batch_step_emotion_class.jsonl" 
output_file_1 = input_file.replace(".jsonl", "_part1.jsonl")
output_file_2 = input_file.replace(".jsonl", "_part2.jsonl")

# Reading the file
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Splitting
mid = len(lines) // 2
part1 = lines[:mid]
part2 = lines[mid:]

# Writing to two new files
with open(output_file_1, "w", encoding="utf-8") as f:
    f.writelines(part1)

with open(output_file_2, "w", encoding="utf-8") as f:
    f.writelines(part2)

print(f"Saved {len(part1)} → {output_file_1}")
print(f"Saved {len(part2)} → {output_file_2}")


Saved 35720 → batch_step_emotion_class_part1.jsonl
Saved 35721 → batch_step_emotion_class_part2.jsonl
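
If two halves were still too large for the limit, the same idea extends to any number of parts. The following is a minimal sketch rather than the code used in this notebook. `split_into_parts` is a hypothetical helper, and it gives any leftover lines to the earliest parts, so the part sizes may be ordered differently from the halving above.

```python
def split_into_parts(lines, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(lines), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts each take one additional line
        size = base + (1 if i < extra else 0)
        parts.append(lines[start:start + size])
        start += size
    return parts
```

Each returned chunk could then be written to its own `_partN.jsonl` file with `f.writelines(...)`, as in the cell above.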


### Submitting Batches to OpenAI

In [21]:
openai.api_key = "***********" 
client = openai.OpenAI(api_key=openai.api_key)

# JSONL batch files 
batch_files = [
    "batch_step_emotion_class_part1.jsonl",
    "batch_step_emotion_class_part2.jsonl"
]

# Submitting batches
for i, file_path in enumerate(batch_files):
    try:
        print(f"\nSubmitting batch {i+1}/{len(batch_files)}: {file_path}")

        # Uploading file
        with open(file_path, "rb") as f:
            upload = client.files.create(file=f, purpose="batch")
        print(f"Uploaded: File ID = {upload.id}")

        # Creating batch job
        batch = client.batches.create(
            input_file_id=upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        )
        print(f"Batch submitted: Batch ID = {batch.id}")

    except Exception as e:
        print(f"Failed to submit {file_path}: {e}")


Submitting batch 1/2: batch_step_emotion_class_part1.jsonl
Uploaded: File ID = file-2m4feMySRM9SVT5zjPUK2H
Batch submitted: Batch ID = batch_67e5efce27e0819094178262f1c29003

Submitting batch 2/2: batch_step_emotion_class_part2.jsonl
Uploaded: File ID = file-Bz2tsEeGTvPLAHSAGPKC6w
Batch submitted: Batch ID = batch_67e5efd6eb28819080a585f5df896cbf
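
The output file IDs used in the download step only exist once a batch has finished. The notebook does not show how completion was checked, so the sketch below is one possible approach, not the author's method. It assumes a `client` created with `openai.OpenAI(...)` as above and polls `client.batches.retrieve` until the batch reaches a terminal status. The helper name `wait_for_batch` and the 60-second interval are illustrative choices.

```python
import time

# Statuses after which a batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll a batch until it reaches a terminal status, then return the batch object."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        # Wait before asking again; `sleep` is injectable so the loop can be tested
        sleep(poll_seconds)
```

For a batch that ends with status `completed`, the returned object's `output_file_id` attribute holds the ID of the results file.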


### Downloading JSONL Output from OpenAI

In [15]:
openai.api_key = "***************"

# List of output file IDs from completed batches
output_file_ids = [
    "file-9DcDFsT8JDcAPeztC5fkQE",
    "file-8vr65VFC1CsBMrnyPjkato"
]

# Downloading each output file
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)

    # Saving the file locally in binary mode
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:
        for chunk in file_response.iter_bytes():
            f.write(chunk)
    
    print(f"File downloaded: {output_filename}")

File downloaded: file-9DcDFsT8JDcAPeztC5fkQE.jsonl
File downloaded: file-8vr65VFC1CsBMrnyPjkato.jsonl


### Converting JSONL Output to a CSV File

In [17]:
jsonl_files = [
    "file-9DcDFsT8JDcAPeztC5fkQE.jsonl",
    "file-8vr65VFC1CsBMrnyPjkato.jsonl"
]

# Storing extracted data
extracted_data = []

# Looping through files
for jsonl_file in jsonl_files:
    if not os.path.exists(jsonl_file):
        print(f"{jsonl_file} not found")
        continue

    print(f"Processing {jsonl_file}")

    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            json_obj = json.loads(line)

            custom_id = json_obj.get("custom_id", "")
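            # "response" may be null for requests that failed, so fall back to empty containers
            body = (json_obj.get("response") or {}).get("body") or {}
            choices = body.get("choices") or [{}]
            response = (choices[0].get("message", {}).get("content") or "").strip()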

            extracted_data.append({
                "custom_id": custom_id,
                "Emotion_Class": response
            })

# Creating and saving CSV
df = pd.DataFrame(extracted_data)
output_csv = "emotion_class_responses.csv"
df.to_csv(output_csv, index=False)

print(f"\nExtracted responses saved to: {output_csv}")
print("Emotion breakdown:")
print(df["Emotion_Class"].value_counts())


Processing file-9DcDFsT8JDcAPeztC5fkQE.jsonl
Processing file-8vr65VFC1CsBMrnyPjkato.jsonl

Extracted responses saved to: emotion_class_responses.csv
Emotion breakdown:
Emotion_Class
disapproval       32610
anger             21761
joy                6481
neutral            5562
sadness            3276
hope               1643
fear                 25
concern              15
gratitude             8
confusion             6
shock                 5
respect               5
worry                 4
frustration           4
pride                 4
nostalgia             3
surprise              2
relief                2
shame                 2
bittersweet           2
conflicted            2
sympathy              2
jealousy              2
solidarity            1
compassion            1
mixed feelings        1
guilt                 1
doubt                 1
sarcasm               1
appreciation          1
amusement             1
alarm                 1
trust                 1
admiration            1
di
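
The breakdown above shows that a small number of replies fall outside the six labels requested in the prompt (for example `fear`, `concern`, `gratitude`). One way to handle them before analysis is sketched below. This step is not part of the original notebook: the fallback value `"other"` and the helper name `normalize_emotion` are assumptions, and mapping off-list labels onto the nearest allowed category by hand would be an equally valid choice.

```python
# The six labels listed in the prompt
ALLOWED_EMOTIONS = {"anger", "sadness", "joy", "hope", "disapproval", "neutral"}

def normalize_emotion(label, fallback="other"):
    """Trim and lower-case a model reply; map anything outside the allowed set to fallback."""
    # Remove surrounding whitespace, bullet dashes, and trailing full stops
    cleaned = str(label).strip(" \t\n.-").lower()
    return cleaned if cleaned in ALLOWED_EMOTIONS else fallback
```

Applied to the results table, this could look like `df["Emotion_Class"] = df["Emotion_Class"].map(normalize_emotion)`.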

### Merging All Responses into the Master Dataset

In [19]:
# Loading the master data
df_main = pd.read_csv("Thesis_Relevant_With_Transcript_Influence_and_Agreement.csv")
df_main.rename(columns={"Comment_ID": "custom_id"}, inplace=True)

# CSVs to merge
response_csvs = [
    "category_responses.csv",
    "sentiment_responses.csv",
    "factual_or_opinion_responses.csv",
    "claim_detection_responses.csv",
    "emotion_class_responses.csv"
]

# Merge each CSV
for file in response_csvs:
    print(f"Merging: {file}")
    df = pd.read_csv(file)
    df_main = pd.merge(df_main, df, on="custom_id", how="left")

# Rename back to Comment_ID
df_main.rename(columns={"custom_id": "Comment_ID"}, inplace=True)

# Fill missing values
df_main.fillna({
    "Category": "Unknown",
    "Sentiment": "0",
    "Factual_or_Opinion": "0",
    "Claim_Detection": "0",
    "Emotion_Class": "neutral"
}, inplace=True)

# Reordering columns
cols = list(df_main.columns)

# Define final order by name
final_order = []

# Start with all non-GPT columns (except the ones we want to reorder later)
for col in cols:
    if col not in ["Category", "Sentiment", "Emotion_Class", "Factual_or_Opinion", "Claim_Detection"]:
        final_order.append(col)

# Add GPT columns in order
final_order += [
    "Category",
    "Sentiment",
    "Emotion_Class",
    "Factual_or_Opinion",
    "Claim_Detection"
]

# Apply column order
df_main = df_main[final_order]

df_main.to_csv("Final_Thesis_Merged_With_All_Columns.csv", index=False)
print("Final merged CSV saved as: Final_Thesis_Merged_With_All_Columns.csv")




Merging: category_responses.csv
Merging: sentiment_responses.csv
Merging: factual_or_opinion_responses.csv
Merging: claim_detection_responses.csv
Merging: emotion_class_responses.csv
Final merged CSV saved as: Final_Thesis_Merged_With_All_Columns.csv
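
A left merge repeats rows of the master table whenever the right-hand table contains the same `custom_id` more than once, so it is worth checking each response file for duplicate IDs before merging. The sketch below is an optional check under that assumption and was not part of the original run. `find_duplicate_ids` is a hypothetical helper.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return a dict of IDs that occur more than once, mapped to their counts."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}
```

Calling `find_duplicate_ids(df["custom_id"])` inside the merge loop would return an empty dict when a response file is safe to merge. As an alternative, passing `validate="one_to_one"` to `pd.merge` makes pandas raise an error when either key column contains duplicates.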
