## Thesis-Relevant Comments: Step 1

After collecting all YouTube comments, the next step is to identify which ones are actually relevant to the topic of sportswashing.

Here, a comment counts as “thesis-relevant” if it mentions at least one of the following:
- A Middle Eastern country or Gulf nation
- Topics like human rights, corruption, ethics, finance, image laundering, or political influence
- Any opinions or reflections related to ownership, investments, or motives behind the event

This step filters out unrelated comments (e.g. general football talk or player praise), so that only comments useful for analysing narratives around sportswashing remain.


### Importing Libraries

In [None]:
import json
import pandas as pd
import os
import openai
import ast

### Creating JSONL Files

In [1]:
# Loading the CSV File
df_cleaned = pd.read_csv("All_YouTube_Comments_Cleaned.csv")

# Splitting into 3 roughly even batches
num_batches = 3
batch_size = len(df_cleaned) // num_batches  # Rows per batch; the last batch also takes any remainder

# Generating 3 Large Batches
for batch_num in range(num_batches):
    start_idx = batch_num * batch_size
    end_idx = start_idx + batch_size if batch_num < num_batches - 1 else len(df_cleaned)  
    
    batch_df = df_cleaned.iloc[start_idx:end_idx].copy()  

    jsonl_filename = f"batch_{batch_num + 1}.jsonl"

    # Write each batch to a .jsonl file (one line per API task)
    with open(jsonl_filename, "w", encoding="utf-8") as f:
        for _, row in batch_df.iterrows():
            # Build a single task for OpenAI Batch API
            task = {
                "custom_id": str(row["Comment_ID"]),  # Use unique comment ID to track response
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",  # Using smaller model for cost-effective classification
                    "messages": [
                        {"role": "system", "content": "You are analyzing YouTube comments for sportswashing discussions."},
                        {"role": "user", "content": f"""
                        **Definition of Sportswashing:**  
                        - When sports are used to improve a country/entity’s reputation while hiding **human rights abuses, corruption, or political issues**.  
                        - Example: **Gulf states** (Saudi Arabia, Qatar, UAE) investing in global sports, hosting events (FIFA, F1), or owning clubs (Man City, PSG, Newcastle).  

                        **Classification Rules:**  
                        - `YES`: If the comment **mentions** sportswashing, Gulf investments, political influence, corruption in sports, or makes a positive or negative comment about the Middle East or Gulf countries.  
                        - `NO`: If the comment is **only about sports**, lacks relevance, or is off-topic.  

                        **YouTube Comment:** \"{row['Rewritten Comment']}\"  

                        **Respond ONLY with `YES` or `NO`, nothing else.**
                        """}
                    ],
                    "temperature": 0.0,  
                    "max_tokens": 5      # Response should be only YES or NO
                }
            }
            f.write(json.dumps(task) + "\n")  # Writing each task to JSONL line

    print(f" Created batch: {jsonl_filename}")  


Created batch: batch_1.jsonl
Created batch: batch_2.jsonl
Created batch: batch_3.jsonl
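
Before uploading, it can be worth checking that each generated file is well-formed. The sketch below is not part of the original notebook: `validate_batch_file` is a hypothetical helper that assumes the task layout built above (a `custom_id` plus a `body` containing `messages`) and reports the number of tasks and any duplicated IDs.

```python
import json
from collections import Counter


def validate_batch_file(path):
    """Return (task_count, duplicate_ids) for a Batch API input file.

    Hypothetical helper: assumes one JSON task per line, each with the
    "custom_id" and "body" keys used when the batch files were written.
    """
    ids = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Tolerate blank lines
            task = json.loads(line)  # Raises ValueError on malformed JSON
            if "custom_id" not in task or "messages" not in task.get("body", {}):
                raise ValueError(f"{path}, line {line_no}: missing custom_id or messages")
            ids.append(task["custom_id"])

    # custom_id should be unique so each response maps back to exactly one comment
    duplicates = sorted(cid for cid, n in Counter(ids).items() if n > 1)
    return len(ids), duplicates
```

Calling `validate_batch_file("batch_1.jsonl")` on each file and comparing the summed task counts with `len(df_cleaned)` would confirm that no rows were lost in the split.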


### Submitting Batches to OpenAI

In [9]:
# Setting OpenAI API Key
openai.api_key = "**********" 
client = openai.OpenAI(api_key=openai.api_key)

# List of batch files to upload and submit
batch_files = ["batch_1.jsonl", "batch_2.jsonl", "batch_3.jsonl"]

for batch_file in batch_files:
    # Check if the batch file actually exists before continuing
    if not os.path.exists(batch_file):
        print(f"{batch_file} not found")
        continue

    try:
        # Uploading the JSONL batch file to OpenAI for processing
        with open(batch_file, "rb") as f:
            batch_file_upload = client.files.create(file=f, purpose="batch")

        print(f" {batch_file} uploaded successfully. File ID: {batch_file_upload.id}")

        # Submitting the uploaded file as a batch job
        batch_job = client.batches.create(
            input_file_id=batch_file_upload.id,
            endpoint="/v1/chat/completions",  
            completion_window="24h"           # Run within 24h window
        )

        print(f" {batch_file} submitted successfully! Job ID: {batch_job.id}")

    except Exception as e:
        # Print any errors if something goes wrong
        print(f"Error processing {batch_file}: {e}")


batch_1.jsonl uploaded successfully. File ID: file-ABC123ExampleID1
Batch 1 submitted successfully! Job ID: batch_1234567890abcdef12345678
batch_2.jsonl uploaded successfully. File ID: file-DEF456ExampleID2
Batch 2 submitted successfully! Job ID: batch_abcdef1234567890abcdef12
batch_3.jsonl uploaded successfully. File ID: file-R2Kc7pDozfopNQsa4myZTU
Batch 3 submitted successfully! Job ID: batch_67d5723ce4f881900996a1cf95bf82466
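
The notebook does not show how the output file IDs used in the next step were obtained. One way, sketched here under assumptions, is to poll each job with `client.batches.retrieve` until it reaches a terminal status and then read its `output_file_id`. The helper name `wait_for_batch`, the polling interval, and the poll limit are all hypothetical choices.

```python
import time

# Terminal states of a batch job; any other status means it is still running
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60, max_polls=1440):
    """Poll a batch job until it finishes and return its output file ID.

    Hypothetical helper: `client` is an `openai.OpenAI` instance as created
    above. Returns None if the job did not complete successfully or did not
    finish within `max_polls` checks.
    """
    for _ in range(max_polls):
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch.output_file_id if batch.status == "completed" else None
        time.sleep(poll_seconds)
    return None
```

The returned IDs could then populate the `output_file_ids` list instead of being copied by hand from the OpenAI dashboard.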


### Downloading JSONL Output from OpenAI

In [15]:
openai.api_key = "*********" 

# List of output file IDs from batches that have finished processing
output_file_ids = [
    "file-3NBaftwiCcsW49yC2tZZBw", 
    "file-L15bWH8WQiybjLkbKazp7Z",
    "file-51PqRxZLj1ZVpXXJ1yTmZZ"
]

# Looping through each output file and downloading it
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)  # Fetch the file content

    # Saving the file locally with .jsonl extension
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:  
        for chunk in file_response.iter_bytes(): 
            f.write(chunk)
    
    print(f"File downloaded: {output_filename}")

File downloaded: file-3NBaftwiCcsW49yC2tZZBw.jsonl
File downloaded: file-L15bWH8WQiybjLkbKazp7Z.jsonl
File downloaded: file-51PqRxZLj1ZVpXXJ1yTmZZ.jsonl


### Converting JSONL Files to a CSV

In [18]:
# Loading JSONL files
input_files = [
    "file-3NBaftwiCcsW49yC2tZZBw.jsonl",
    "file-L15bWH8WQiybjLkbKazp7Z.jsonl",
    "file-51PqRxZLj1ZVpXXJ1yTmZZ.jsonl"
]
output_csv = "batch_results.csv"

# Reading JSONL and extracting the required fields from each line
data = []
for input_file in input_files:
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            json_obj = json.loads(line)

            # Extract: custom_id, the response object (serialised as JSON), and any error message
            data.append({
                "custom_id": json_obj.get("custom_id"),  # Unique comment ID set when the task was built
                "response": json.dumps(json_obj.get("response") or {}),  # Kept whole; YES/NO is extracted at merge time
                "error": json_obj.get("error") or ""  # If the call failed, capture the error
            })

# Convert list of dicts to DataFrame and save as CSV
df = pd.DataFrame(data)
df.to_csv(output_csv, index=False)

print(f"Converted JSONL to CSV: {output_csv}")

Converted JSONL to CSV: batch_results.csv
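
Each line of a Batch API output file carries the `custom_id` of the original task at the top level and wraps the chat completion inside `response.body`. The sketch below shows one way to pull the model's answer out of a single line. `parse_batch_line` is a hypothetical helper, and the sample line is made up for illustration rather than taken from the real results.

```python
import json


def parse_batch_line(line):
    """Return (custom_id, content, error) from one Batch API output line."""
    obj = json.loads(line)
    # "response" is null when the request failed, so fall back to empty dicts
    body = (obj.get("response") or {}).get("body") or {}
    choices = body.get("choices") or [{}]
    content = (choices[0].get("message") or {}).get("content") or ""
    return obj.get("custom_id"), content.strip(), obj.get("error")


# Made-up output line for illustration only
sample = json.dumps({
    "id": "batch_req_example",
    "custom_id": "comment_001",
    "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "YES"}}]}},
    "error": None,
})
print(parse_batch_line(sample))  # ('comment_001', 'YES', None)
```

Note that the top-level `id` is the batch request ID, not the comment ID; only `custom_id` links a result back to its comment.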


### Merge LLM YES/NO Results with YouTube Comments and Save Final CSV


In [5]:
# Loading CSVs
df_results = pd.read_csv("batch_results.csv")
df_comments = pd.read_csv("All_YouTube_Comments.csv")

# Extract YES/NO responses robustly
def extract_yes_no(response):
    # Expects the serialised response object: JSON first, then a Python dict literal as a fallback
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(response)
            return parsed["body"]["choices"][0]["message"]["content"].strip()
        except Exception:
            continue
    return "Parse Error"

# Applying the extraction and merging with the comment text
df_results["Response (YES/NO)"] = df_results["response"].apply(extract_yes_no)
df_final = df_results.merge(
    df_comments[["Comment_ID", "Rewritten Comment"]],
    left_on="custom_id",
    right_on="Comment_ID",
    how="left"
)[["custom_id", "Comment_ID", "Rewritten Comment", "Response (YES/NO)"]]

# Exporting to CSV
df_final.to_csv("final_processed_comments_fixed.csv", index=False)
print(" CSV file created: final_processed_comments_fixed.csv")


CSV file created: final_processed_comments_fixed.csv
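
Because the merge is a left join, any `custom_id` without a match in the comments file is kept with empty comment columns rather than raising an error. A quick check along the following lines can surface such rows, together with responses that failed to parse. `merge_report` is a hypothetical helper that assumes the column names produced by the merge above, and the toy data is for illustration only.

```python
import pandas as pd


def merge_report(df_final):
    """Summarise how well the classification results joined to the comments."""
    return {
        "rows": len(df_final),
        "unmatched_ids": int(df_final["Comment_ID"].isna().sum()),
        "parse_errors": int((df_final["Response (YES/NO)"] == "Parse Error").sum()),
    }


# Toy data for illustration only
toy = pd.DataFrame({
    "custom_id": ["a", "b", "c"],
    "Comment_ID": ["a", None, "c"],
    "Rewritten Comment": ["text", None, "text"],
    "Response (YES/NO)": ["YES", "NO", "Parse Error"],
})
print(merge_report(toy))  # {'rows': 3, 'unmatched_ids': 1, 'parse_errors': 1}
```

Non-zero values for either count would suggest re-checking the ID column or the response parsing before relying on the final CSV.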


### Count of YES/NO Responses

In [28]:
# List of JSONL files
input_files = [
    "file-3NBaftwiCcsW49yC2tZZBw.jsonl",
    "file-L15bWH8WQiybjLkbKazp7Z.jsonl",
    "file-51PqRxZLj1ZVpXXJ1yTmZZ.jsonl"
]

# Reading and combining all files into one DataFrame
data = []
for file_name in input_files:
    with open(file_name, "r", encoding="utf-8") as f:
        data.extend([
            {
                "custom_id": obj.get("custom_id", ""),
                "response": obj.get("response", {}).get("body", {}).get("choices", [{}])[0].get("message", {}).get("content", "").strip()
            }
            for obj in map(json.loads, f)
        ])

df = pd.DataFrame(data)

# Count YES vs NO
print(f"Number of 'YES' responses: {(df['response'] == 'YES').sum()}")
print(f"Number of 'NO' responses: {(df['response'] == 'NO').sum()}")

Number of 'YES' responses: 24157
Number of 'NO' responses: 84259
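
The two counts above only cover exact `YES` and `NO` matches, so any empty or malformed response would go unnoticed. A minimal sketch of a tally that also reports the remainder is given below. `tally_responses` is a hypothetical helper, and trimming surrounding whitespace before comparing is an assumption.

```python
from collections import Counter


def tally_responses(responses):
    """Count YES, NO, and everything else (empty, malformed, or failed)."""
    counts = Counter({"YES": 0, "NO": 0, "OTHER": 0})
    for r in responses:
        label = str(r).strip()
        counts[label if label in ("YES", "NO") else "OTHER"] += 1
    return dict(counts)
```

For example, `tally_responses(df["response"])` would return a dictionary with the three counts, and a non-zero `OTHER` value would point to responses worth inspecting by hand.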


In [35]:
# Loading the merged YouTube comments file with responses
input_csv = "final_processed_comments_fixed.csv"
df = pd.read_csv(input_csv)

# Keeping only rows whose classification is a valid YES or NO
response_col = "Response (YES/NO)"  # Column name written by the merge step
df = df[df[response_col].isin(["YES", "NO"])]

# Separate into two DataFrames
df_yes = df[df[response_col] == "YES"]
df_no = df[df[response_col] == "NO"]

# Saving each DataFrame into separate CSV files
yes_csv = "YouTube_Comments_Yes.csv"
no_csv = "YouTube_Comments_No.csv"

df_yes.to_csv(yes_csv, index=False)
df_no.to_csv(no_csv, index=False)

print(f"Saved comments to {yes_csv}")
print(f"Saved comments to {no_csv}")


Saved comments to YouTube_Comments_Yes.csv
Saved comments to YouTube_Comments_No.csv
