# Comment Event

- In this notebook, I prepared a batch to identify which event each thesis-relevant comment refers to.  
- I used the rewritten comment text to build the batch JSONL file.  
- The prompt asked GPT-4o mini (`gpt-4o-mini`) to match each comment to a specific event from my predefined list of sports-related events.  
- To stay within the token limit, I split the batch into two parts before submitting it to OpenAI’s Batch API.  
- Once the outputs were ready, I downloaded the JSONL files and extracted the `custom_id` and assigned event.  
- I saved the results to a CSV for further analysis, enabling me to group and compare comments by the event they relate to.


### Importing Libraries

In [None]:
import pandas as pd
import json
from tqdm import tqdm
import os
import openai

In [2]:
# Load the full dataset
df = pd.read_csv("Thesis Relevant Comments.csv")
print(f"Loaded {len(df)} comments.")

Loaded 71441 comments.


### Creating the JSONL File

In [8]:
output_jsonl = "batch_event_class.jsonl"

# Opening the file and writing each comment as a separate task
with open(output_jsonl, "w", encoding="utf-8") as f:
    for _, row in tqdm(df.iterrows(), total=len(df), desc="Building Event Class batch"):
        comment_id = str(row["Comment_ID"])
        comment = row["Rewritten Comment"]

        # Building the request for GPT-4o Mini
        task = {
            "custom_id": comment_id,  
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are assisting a researcher studying public opinion on major Gulf-backed sports events."
                    },
                    {
                        "role": "user",
                        "content": f'''Below is a YouTube comment. Your task is to identify which, if any, of the listed events the comment is referring to.

Comment: "{comment}"

Choose only ONE of the following events. If the comment does not relate to any, reply with an empty string:

- Manchester City ownership
- Paris Saint-Germain ownership
- Newcastle United ownership
- FIFA World Cup 2022
- Formula 1
- Saudi Pro League
- LIV Golf
- Gulf multi-sport hosting'''
                    }
                ],
                "temperature": 0,        
                "max_tokens": 15         
            }
        }

        
        f.write(json.dumps(task) + "\n")

print(f"Saved: {output_jsonl}")


Building Event Class batch: 100%|██████| 71441/71441 [00:06<00:00, 10633.74it/s]

Saved: batch_event_class.jsonl
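
Before uploading, it may be worth checking that every line parses as JSON and that no `custom_id` is repeated, since the Batch API uses `custom_id` to match each output back to its request. The sketch below is only an illustration: `check_batch_lines` is a hypothetical helper, and the three-line sample stands in for the real `batch_event_class.jsonl`.

```python
import json
from collections import Counter


def check_batch_lines(lines):
    """Return (number of requests, list of duplicated custom_id values)."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        task = json.loads(line)  # raises an error if the line is not valid JSON
        ids.append(task["custom_id"])
    duplicates = [cid for cid, n in Counter(ids).items() if n > 1]
    return len(ids), duplicates


# Small in-memory sample (hypothetical data, not the real batch file)
sample = [
    json.dumps({"custom_id": "c1", "method": "POST"}),
    json.dumps({"custom_id": "c2", "method": "POST"}),
    json.dumps({"custom_id": "c1", "method": "POST"}),
]
n_requests, dupes = check_batch_lines(sample)
print(n_requests, dupes)
```

On the real file, the same function could be called as `check_batch_lines(open(output_jsonl, encoding="utf-8"))`.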





### Splitting the Batch due to Token Limit

In [11]:
input_file = "batch_event_class.jsonl"  
output_file_1 = input_file.replace(".jsonl", "_part1.jsonl")
output_file_2 = input_file.replace(".jsonl", "_part2.jsonl")

# Reading the file
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Splitting
mid = len(lines) // 2
part1 = lines[:mid]
part2 = lines[mid:]

# Writing to two new files
with open(output_file_1, "w", encoding="utf-8") as f:
    f.writelines(part1)

with open(output_file_2, "w", encoding="utf-8") as f:
    f.writelines(part2)

print(f"Saved {len(part1)} → {output_file_1}")
print(f"Saved {len(part2)} → {output_file_2}")


Saved 35720 → batch_event_class_part1.jsonl
Saved 35721 → batch_event_class_part2.jsonl
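
If two halves were still too large, the same idea extends to any number of parts. A minimal sketch, assuming a fixed maximum number of lines per part; `split_into_chunks` and the value `max_lines=3` are illustrative choices, not something used in this notebook.

```python
def split_into_chunks(lines, max_lines):
    """Split a list of JSONL lines into consecutive chunks of at most max_lines."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


# Seven dummy request lines standing in for the real batch file
demo_lines = [f'{{"custom_id": "{i}"}}\n' for i in range(7)]
chunks = split_into_chunks(demo_lines, max_lines=3)
chunk_sizes = [len(c) for c in chunks]
print(chunk_sizes)
```

Each chunk can then be written with `f.writelines(chunk)` to its own `_partN.jsonl` file, exactly as in the two-way split above.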


### Submitting Batches to OpenAI

In [3]:
openai.api_key = "*****************"
client = openai.OpenAI(api_key=openai.api_key)

# My JSONL batch files
batch_files = [
    "batch_event_class_part1.jsonl",
    "batch_event_class_part2.jsonl"
]

# Submitting batches
for i, file_path in enumerate(batch_files):
    try:
        print(f"\nSubmitting batch {i+1}/{len(batch_files)}: {file_path}")

        # Uploading file
        with open(file_path, "rb") as f:
            upload = client.files.create(file=f, purpose="batch")
        print(f"Uploaded: File ID = {upload.id}")

        # Creating batch job
        batch = client.batches.create(
            input_file_id=upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        )
        print(f"Batch submitted: Batch ID = {batch.id}")

    except Exception as e:
        print(f"Failed to submit {file_path}: {e}")



Submitting batch 1/2: batch_event_class_part1.jsonl
Uploaded: File ID = file-6uG14wTBtaYbsFz2nynrP3
Batch submitted: Batch ID = batch_6812ac5ced848190a10dc8975757d0d2

Submitting batch 2/2: batch_event_class_part2.jsonl
Uploaded: File ID = file-HfzsW1zMnaAd9YXqsvsmoP
Batch submitted: Batch ID = batch_6812ac689774819088143c10aa56601d
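
The notebook does not show how the output file IDs used in the next step were obtained. One way to get them is to poll each batch with `client.batches.retrieve` until it reaches a terminal status and then read `output_file_id`. The sketch below is an assumption about how that could look, not the procedure actually followed; `wait_for_batch` is a hypothetical helper, and the stand-in client only exists so the example runs without network access.

```python
import time
from types import SimpleNamespace

# Statuses after which a batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll a batch until it reaches a terminal status, then return the batch object."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)


# Stand-in client for a dry run (illustration only, no API call is made)
_statuses = iter(["validating", "in_progress", "completed"])
fake_client = SimpleNamespace(
    batches=SimpleNamespace(
        retrieve=lambda batch_id: SimpleNamespace(
            id=batch_id, status=next(_statuses), output_file_id="file-demo"
        )
    )
)
naps = []  # records the requested sleep durations instead of really sleeping
done = wait_for_batch(fake_client, "batch_demo", poll_seconds=5, sleep=naps.append)
print(done.status, done.output_file_id, naps)
```

With the real client, `wait_for_batch(client, batch.id)` would return the finished batch, and its `output_file_id` is the value needed for the download step.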


### Downloading JSONL Output from OpenAI

In [13]:
openai.api_key = "***************"  

# List of output file IDs from completed batches
output_file_ids = [
    "file-92ujZ9WsuWPt51eniSUGX6",
    "file-H65geZjE8zfizy5XjoYYnA"
]

# Downloading each output file
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)

    # Saving the file locally in binary mode
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:
        for chunk in file_response.iter_bytes():
            f.write(chunk)
    
    print(f"File downloaded: {output_filename}")

File downloaded: file-92ujZ9WsuWPt51eniSUGX6.jsonl
File downloaded: file-H65geZjE8zfizy5XjoYYnA.jsonl


### Converting JSONL Output to a CSV File

In [34]:
jsonl_files = [
    "file-92ujZ9WsuWPt51eniSUGX6.jsonl",
    "file-H65geZjE8zfizy5XjoYYnA.jsonl"
]

# Valid event options (from prompt)
valid_events = {
    "Manchester City ownership",
    "Paris Saint-Germain ownership",
    "Newcastle United ownership",
    "FIFA World Cup 2022",
    "Formula 1",
    "Saudi Pro League",
    "LIV Golf",
    "Gulf multi-sport hosting"
}

# Storing clean records
records = []

# Processing each file
for jsonl_file in jsonl_files:
    if not os.path.exists(jsonl_file):
        print(f"Missing file: {jsonl_file}")
        continue

    print(f"Processing {jsonl_file}")
    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            comment_id = obj.get("custom_id", "")
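            # The `or` fallbacks guard against a null response, body, message, or content
            response = obj.get("response") or {}
            choices = (response.get("body") or {}).get("choices") or [{}]
            raw_response = ((choices[0].get("message") or {}).get("content") or "").strip()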

            # Clean response
            event = raw_response.strip('`"').strip()
            if event not in valid_events:
                event = ""  # if not in valid set, default to empty string

            records.append({
                "custom_id": comment_id,
                "Event": event
            })

df = pd.DataFrame(records)
df.to_csv("final_event_responses.csv", index=False)

print("\nSaved to: final_event_responses.csv")
print(df["Event"].value_counts(dropna=False))


Processing file-92ujZ9WsuWPt51eniSUGX6.jsonl
Processing file-H65geZjE8zfizy5XjoYYnA.jsonl

Saved to: final_event_responses.csv
Event
FIFA World Cup 2022              37554
                                 17308
Saudi Pro League                  5230
Manchester City ownership         4194
Newcastle United ownership        3298
LIV Golf                          2295
Gulf multi-sport hosting           706
Formula 1                          476
Paris Saint-Germain ownership      380
Name: count, dtype: int64
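
The parsing step can also be written as a small function and tried on a few hand-made lines before running it on the full output. This is a sketch under assumptions: `extract_event` is a hypothetical helper, the sample lines are invented, and the second one assumes that a failed request may come back with a null `response`.

```python
import json

VALID_EVENTS = {
    "Manchester City ownership",
    "Paris Saint-Germain ownership",
    "Newcastle United ownership",
    "FIFA World Cup 2022",
    "Formula 1",
    "Saudi Pro League",
    "LIV Golf",
    "Gulf multi-sport hosting",
}


def extract_event(line, valid_events=VALID_EVENTS):
    """Return (custom_id, event) for one output line; event is "" if unusable."""
    obj = json.loads(line)
    response = obj.get("response") or {}
    choices = (response.get("body") or {}).get("choices") or [{}]
    content = (choices[0].get("message") or {}).get("content") or ""
    # Remove whitespace and any wrapping quotes or backticks
    event = content.strip().strip('`"\'').strip()
    return obj.get("custom_id", ""), event if event in valid_events else ""


def _line(custom_id, content):
    """Build a minimal output line with the given model reply (test data only)."""
    body = {"choices": [{"message": {"content": content}}]}
    return json.dumps({"custom_id": custom_id, "response": {"body": body}})


sample_lines = [
    _line("a", '"LIV Golf"'),                         # quoted valid event
    json.dumps({"custom_id": "b", "response": None}),  # assumed failure shape
    _line("c", "Tennis"),                             # not in the list
]
results = [extract_event(line) for line in sample_lines]
print(results)
```

Anything outside the predefined list, including failed requests, ends up as an empty string, which matches the fallback used in the cell above.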


### Merging the Event Classes with the Overall Results

In [5]:
df_final = pd.read_csv("Final_Thesis_Merged.csv")

# Loading the final event responses
df_events = pd.read_csv("final_event_responses.csv")

# Merging based on Comment_ID
df_final = df_final.merge(
    df_events,
    how="left",
    left_on="Comment_ID",
    right_on="custom_id"
)

# Dropping the extra custom_id column
df_final = df_final.drop(columns=["custom_id"])

# Saving the result by overwriting the original file
df_final.to_csv("Final_Thesis_Merged.csv", index=False)
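
Because `custom_id` is read back from a CSV, its dtype may differ from `Comment_ID`, and a left merge silently leaves `Event` empty for keys that do not match. A minimal sketch of a more defensive merge, using the `validate` and `indicator` parameters of `DataFrame.merge`; the two small frames are invented stand-ins for `Final_Thesis_Merged.csv` and `final_event_responses.csv`.

```python
import pandas as pd

# Toy data (hypothetical values) mirroring the column names used in this notebook
comments = pd.DataFrame(
    {"Comment_ID": ["c1", "c2", "c3"], "Video_ID": ["v1", "v1", "v2"]}
)
events = pd.DataFrame(
    {"custom_id": ["c1", "c2"], "Event": ["LIV Golf", "Formula 1"]}
)

# Align key dtypes so that, for example, 123 and "123" cannot fail to match
comments["Comment_ID"] = comments["Comment_ID"].astype(str)
events["custom_id"] = events["custom_id"].astype(str)

merged = comments.merge(
    events,
    how="left",
    left_on="Comment_ID",
    right_on="custom_id",
    validate="one_to_one",  # raises if either key column contains duplicates
    indicator=True,         # adds a _merge column: "both" or "left_only"
)
unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns=["custom_id", "_merge"])
print(f"Comments without an event row: {unmatched}")
```

Since the cell above overwrites its own input file, running it a second time would merge onto a file that already has an `Event` column, and pandas would then produce `Event_x` and `Event_y`; writing to a new filename avoids that.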



### Filling Missing Event Cells with the Associated Transcript Event

In [10]:
df_final = pd.read_csv("Final_Thesis_Merged.csv")

# Loading the transcript dataset with Transcript_Event
df_transcripts = pd.read_csv("Processed_YouTube_Transcripts.csv")

# Filling empty Event cells using matching Transcript_Event from Video_ID
df_final["Event"] = df_final.apply(
    lambda row: df_transcripts[df_transcripts["Video_ID"] == row["Video_ID"]]["Transcript_Event"].values[0]
    if pd.isna(row["Event"]) and row["Video_ID"] in df_transcripts["Video_ID"].values else row["Event"],
    axis=1
)

# Overwriting the original file
df_final.to_csv("Final_Thesis_Merged.csv", index=False)
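
The row-wise `apply` above filters `df_transcripts` once per comment, which can be slow on roughly 71,000 rows. A vectorized alternative is sketched below with `Series.map` and `fillna`. It keeps the first transcript row per `Video_ID`, which is intended to mirror `.values[0]`; the toy frames are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook
final = pd.DataFrame(
    {"Video_ID": ["v1", "v2", "v3"], "Event": ["LIV Golf", None, None]}
)
transcripts = pd.DataFrame(
    {
        "Video_ID": ["v1", "v2", "v2"],
        "Transcript_Event": ["Formula 1", "Saudi Pro League", "FIFA World Cup 2022"],
    }
)

# One event per video: keep the first row for each Video_ID
event_by_video = (
    transcripts.drop_duplicates("Video_ID")
    .set_index("Video_ID")["Transcript_Event"]
)

# Fill only the missing Event cells; existing values are left untouched
final["Event"] = final["Event"].fillna(final["Video_ID"].map(event_by_video))
filled = final["Event"].tolist()
print(filled)
```

Comments whose video has no transcript row (here `v3`) keep a missing `Event`, as in the original cell.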

