## Thesis Relevant Comments Step 3

In this final filtering stage, all comments previously marked as **“DEFINITELY NOT”** were re-evaluated once more, this time by comparing them directly with the corresponding **video transcript**. The goal was to catch any remaining comments that might have been wrongly excluded because earlier steps lacked context or the comment itself was ambiguous.

Each comment was re-checked to see if it **aligned with, reacted to, or indirectly referenced the transcript content**, especially in ways that could indicate relevance to sportswashing narratives.

This extra step was added because a large number of comments remained in the **“DEFINITELY NOT”** category after Step 2. Bringing the transcript into the comparison allows borderline or subtle references that were previously missed to be reconsidered. This reduces the risk of losing meaningful comments, particularly those tied to themes that only become clear in the context of what was said in the video.


### Importing Libraries

In [None]:
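import contextlib  # used by the JSONL splitting cell (ExitStack)
import json
import os
import re
import time

import openai  # used by the batch submission and download cells
import pandas as pd
import tiktoken
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm
from transformers import AutoTokenizer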

### Loading comments & transcripts, initializing SBERT, and validating required columns

In [22]:
# Loading the CSV Files
comments_csv = "comments_definitely_not.csv" 
transcript_csv = "Processed_YouTube_Transcripts.csv"

df_comments = pd.read_csv(comments_csv)
df_transcripts = pd.read_csv(transcript_csv)

# Loading the SBERT Model
model = SentenceTransformer('all-MiniLM-L6-v2')
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Ensuring Required Columns Exist (report only the columns that are actually missing)
required_columns = ["custom_id", "Rewritten Comment", "Video_ID"]
missing_columns = [col for col in required_columns if col not in df_comments.columns]
if missing_columns:
    raise KeyError(f"Missing required columns in comments CSV: {missing_columns}")

if "Video_ID" not in df_transcripts.columns or "Transcript" not in df_transcripts.columns:
    raise KeyError("Required columns ('Video_ID', 'Transcript') missing from transcript CSV.")


Loaded comments file: comments_definitely_not.csv (54177 rows)
Loaded transcripts file: Processed_YouTube_Transcripts.csv (162 rows)
Loading SBERT model...
SBERT model loaded successfully.
Column validation completed successfully.


### Compute SBERT embeddings & define top-5 transcript similarity helper

In [24]:
# Precomputing SBERT Embeddings for All Comments
print(" Encoding all comments in batch...")
comment_embeddings = model.encode(df_comments["Rewritten Comment"].tolist(), convert_to_tensor=True)
df_comments["comment_embedding"] = list(comment_embeddings)

# Precomputing Transcript Embeddings
transcript_embeddings = {}
print(" Precomputing transcript embeddings...")
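for _, row in tqdm(df_transcripts.iterrows(), total=len(df_transcripts), desc=" Processing Transcripts"):
    video_id = row["Video_ID"]
    transcript_text = row["Transcript"]
    # Missing transcripts are read as NaN (a float), so guard before splitting
    if not isinstance(transcript_text, str):
        transcript_text = ""
    # Split on sentence-ending punctuation and drop empty fragments
    # (re.split never returns an empty list, so the fragments must be filtered)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', transcript_text) if s.strip()]

    if not sentences:
        # No usable text: skip, so get_top_similar_sentences returns "No transcript found"
        continue
    transcript_embeddings[video_id] = (sentences, model.encode(sentences, convert_to_tensor=True))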

# Function to Get Top 5 Similar Sentences
def get_top_similar_sentences(comment_embedding, video_id):
    if video_id not in transcript_embeddings:
        return ["No transcript found"] * 5, 0.0

    transcript_sentences, transcript_embedding = transcript_embeddings[video_id]
    similarities = util.pytorch_cos_sim(comment_embedding, transcript_embedding)[0]
    max_similarity = torch.max(similarities).item()
    
    top_n = min(5, len(transcript_sentences))
    top_indices = torch.topk(similarities, k=top_n).indices
    top_sentences = [transcript_sentences[i] for i in top_indices]

    while len(top_sentences) < 5:
        top_sentences.append("N/A")

    return top_sentences, max_similarity


Encoding all comments in batch...
Precomputing transcript embeddings...
Processing Transcripts: 100%|█████████| 162/162 [00:47<00:00, 3.40it/s]
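
The retrieval logic in `get_top_similar_sentences` can be illustrated without SBERT. The following is a minimal sketch, assuming toy 2-D vectors in place of real embeddings; the names `cosine` and `top_k_sentences` and all vector values are hypothetical and only mirror the ranking and padding behaviour of the helper above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_sentences(query_vec, sentences, sentence_vecs, k=5, pad="N/A"):
    """Rank sentences by similarity to the query, pad the result to exactly k entries."""
    scored = sorted(
        zip(sentences, (cosine(query_vec, v) for v in sentence_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    top = [sentence for sentence, _ in scored[:k]]
    best = scored[0][1] if scored else 0.0
    return top + [pad] * (k - len(top)), best

# Toy "embeddings" (hypothetical values, for illustration only)
sentences = [
    "The club was bought by a state fund.",
    "He scored twice.",
    "Tickets sold out.",
]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

top, best = top_k_sentences([1.0, 0.1], sentences, vectors, k=5)
print(top)   # three ranked sentences followed by two "N/A" placeholders
print(round(best, 3))
```

Padding to a fixed length keeps the prompt layout identical for every comment, even when a transcript has fewer than five sentences.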


### Generate JSONL batch for transcript comparison classification

In [28]:
# Processing Comments
output_jsonl = "batch_definitely_not_transcript_comparison.jsonl"
print(" Processing comments...")

with open(output_jsonl, "w", encoding="utf-8") as f:
    for _, row in tqdm(df_comments.iterrows(), total=len(df_comments), desc=" Processing comments"):
        comment_id = str(row["custom_id"])
        rewritten_comment = row["Rewritten Comment"]
        video_id = row["Video_ID"]
        comment_embedding = row["comment_embedding"]

        top_sentences, _ = get_top_similar_sentences(comment_embedding, video_id)

        task = {
            "custom_id": comment_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": "You are analyzing YouTube comments by comparing them to their video transcripts. Your task is to determine whether the transcript provides meaningful context for the comment."},
                    {"role": "user", "content": f"""
**Comment:** "{rewritten_comment}"

**Top 5 Transcript Sentences:**
""" + "\n".join([f"- {sentence}" for sentence in top_sentences]) + """

### Task:
Determine whether this comment is relevant to discussions about sportswashing, human rights, financial ethics, corruption, or geopolitical motives.
Additionally, if the comment makes a **positive or negative statement, opinion, or fact** about the **Middle East**, then it is considered relevant.

### Instructions:
- Respond **"YES"** if the comment relates to **any** of the above topics or expresses a **positive/negative statement about the Middle East**.
- Respond **"NO"** if the comment is **only about match performance, players, goals, or unrelated topics**.

### **Final Response Format:**  
Respond **ONLY** with "YES" or "NO", nothing else.
"""}
                ],
                "temperature": 0.0,
                "max_tokens": 5
            }
        }
        f.write(json.dumps(task) + "\n")

print(f" JSONL file created: {output_jsonl}")


Processing comments...
Processing comments: 100%|█████████| 54177/54177 [00:27<00:00, 1955.10it/s]
JSONL file created: batch_definitely_not_transcript_comparison.jsonl
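
Before uploading a file of this size, it may be worth checking that every line is well-formed. The sketch below is not part of the original pipeline: it assumes each task needs the four top-level keys written above (`custom_id`, `method`, `url`, `body`) and that `custom_id` values should be unique within a file. The function name `validate_batch_lines` is hypothetical.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Return (n_valid, problems) where problems is a list of (line_number, message)."""
    seen_ids = set()
    problems = []
    n_valid = 0
    for lineno, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            problems.append((lineno, f"missing keys: {sorted(missing)}"))
            continue
        if task["custom_id"] in seen_ids:
            problems.append((lineno, f"duplicate custom_id: {task['custom_id']}"))
            continue
        seen_ids.add(task["custom_id"])
        n_valid += 1
    return n_valid, problems

# Small in-memory sample (hypothetical IDs); a real check would iterate over the file
sample = [
    json.dumps({"custom_id": "c1", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "c1", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "c2", "method": "POST"}),
    "{not json",
]
n_valid, problems = validate_batch_lines(sample)
print(n_valid, problems)
```

For the real file, the same function could be called as `validate_batch_lines(open(output_jsonl, encoding="utf-8"))`.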


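### The script below splits a large JSONL file into up to four smaller batches based on token count, using the GPT-4o Mini tokenizer to keep each of the first three batches within a 30M token limit. The fourth batch collects whatever remains and is not checked against the limit.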


In [10]:
# Loading GPT-4o Mini Tokenizer
enc = tiktoken.encoding_for_model("gpt-4o-mini")

# Input and outputs
input_jsonl = "batch_definitely_not_transcript_comparison.jsonl"
outputs = [
    "batch_definitely_not_part1.jsonl",
    "batch_definitely_not_part2.jsonl",
    "batch_definitely_not_part3.jsonl",
    "batch_definitely_not_part4.jsonl",
]

batch_threshold = 30_000_000  # Max tokens per batch

with contextlib.ExitStack() as stack:
    files = [stack.enter_context(open(p, "w", encoding="utf-8")) for p in outputs]

    current_tokens = 0
    batch_idx = 0  # 0..3 (last file collects the remainder)

    with open(input_jsonl, "r", encoding="utf-8") as f:
        for line in tqdm(f, desc=" Splitting JSONL File"):
            obj = json.loads(line)
            msgs = obj["body"]["messages"]
            text = " ".join(m.get("content", "") for m in msgs)
            num_tokens = len(enc.encode(text))

            # Move to next batch if threshold exceeded (for first three parts)
            if batch_idx < 3 and current_tokens + num_tokens > batch_threshold:
                batch_idx += 1
                current_tokens = 0

            files[batch_idx].write(json.dumps(obj) + "\n")
            if batch_idx < 3:
                current_tokens += num_tokens
            # (For part 4, we don't track tokens—everything else goes there)
            
# Printing Results
print(f" Split Complete! JSONL files saved as:")
print(f"  - {outputs[0]} (≈30M tokens)")
print(f"  - {outputs[1]} (≈30M tokens)")
print(f"  - {outputs[2]} (≈30M tokens)")
print(f"  - {outputs[3]} (remaining tokens)")


Splitting JSONL File: 54177it [05:15, 171.93it/s]
Split Complete! JSONL files saved as:
  - batch_definitely_not_part1.jsonl (≈30M tokens)
  - batch_definitely_not_part2.jsonl (≈30M tokens)
  - batch_definitely_not_part3.jsonl (≈30M tokens)
  - batch_definitely_not_part4.jsonl (remaining tokens)
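
The splitting rule used above is a greedy fill: keep adding requests to the current part until the next one would exceed the budget, then open a new part, with the last part taking everything left over. The sketch below isolates that rule. It is an illustration, not the notebook's code: `split_by_budget` is a hypothetical name, and a whitespace word count stands in for the `tiktoken` token count.

```python
def split_by_budget(items, cost, budget, max_parts):
    """Greedily pack items into at most max_parts lists.

    A new part is opened when adding the next item would exceed the budget.
    The final part is unbounded and receives all remaining items.
    Unlike the cell above, an empty part is never closed, so an item that
    alone exceeds the budget does not leave an empty part behind.
    """
    parts = [[]]
    used = 0
    for item in items:
        item_cost = cost(item)
        is_last_part = len(parts) == max_parts
        if not is_last_part and parts[-1] and used + item_cost > budget:
            parts.append([])
            used = 0
        parts[-1].append(item)
        used += item_cost
    return parts

# Word count as a stand-in for token count (assumption for this example)
texts = ["a b c", "d e", "f g h i", "j"]
parts = split_by_budget(texts, cost=lambda t: len(t.split()), budget=4, max_parts=2)
print(parts)
```

Because the last part is unbounded, its size should be checked separately if the limit also applies to it.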


### This script uploads and submits each JSONL batch file to OpenAI one by one, with a 100-minute gap between each to avoid hitting limits. It also prints out the job ID for each successful submission.


In [12]:
# Set OpenAI API Key
openai.api_key = "*************" 
client = openai.OpenAI(api_key=openai.api_key)

# List of batch files
batch_files = [
    "batch_definitely_not_part2.jsonl",
    "batch_definitely_not_part3.jsonl",
    "batch_definitely_not_part4.jsonl"
]

# Dictionary to store batch job IDs
batch_jobs = {}

# Uploading & Submitting Each Batch
for index, batch_file in enumerate(batch_files):
    if not os.path.exists(batch_file):
        print(f" {batch_file} not found. Skipping...")
        continue

    try:
        print(f"\n Submitting {batch_file} ({index+1}/{len(batch_files)})...")

        # Uploading the batch file
        with open(batch_file, "rb") as f:
            batch_file_upload = client.files.create(file=f, purpose="batch")

        print(f" {batch_file} uploaded successfully. File ID: {batch_file_upload.id}")

        # Submitting Batch Job to OpenAI
        batch_job = client.batches.create(
            input_file_id=batch_file_upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        )

        batch_jobs[batch_file] = batch_job.id
        print(f" Batch {index + 1} submitted successfully! Job ID: {batch_job.id}")

    except Exception as e:
        print(f" Error processing {batch_file}: {e}")

    # Waiting 100 minutes before submitting the next batch (except the last one)
    if index < len(batch_files) - 1:
        delay_minutes = 100
        print(f" Waiting {delay_minutes} minutes before submitting the next batch...")
        time.sleep(delay_minutes * 60)

# Print Submitted Batch Job IDs
print("\n All batch jobs submitted successfully!")
for batch_file, job_id in batch_jobs.items():
    print(f"- {batch_file} → Job ID: {job_id}")


Submitting batch_definitely_not_part2.jsonl (1/3)...
batch_definitely_not_part2.jsonl uploaded successfully. File ID: file-RVvhzzF3hiaKojvthQmbe3
Batch 1 submitted successfully! Job ID: batch_67db5ead2574819088d468edd49c9468
Waiting 100 minutes before submitting the next batch...

Submitting batch_definitely_not_part3.jsonl (2/3)...
batch_definitely_not_part3.jsonl uploaded successfully. File ID: file-37EFCGF3D6XLPLiMijRjGS
Batch 2 submitted successfully! Job ID: batch_67db764c8d248190b73433d402649f08
Waiting 100 minutes before submitting the next batch...

Submitting batch_definitely_not_part4.jsonl (3/3)...
batch_definitely_not_part4.jsonl uploaded successfully. File ID: file-Ujgz9QvxrBw4uS7tRuLMGH
Batch 3 submitted successfully! Job ID: batch_67db8dec5a708190834743c0e1ca3d39

All batch jobs submitted successfully!
- batch_definitely_not_part2.jsonl → Job ID: batch_67db5ead2574819088d468edd49c9468
- batch_definitely_not_part3.jsonl → Job ID: batch_67db764c8d248190b73433d402649f08
- b
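
The notebook does not show how the output file IDs used in the next cell were obtained from the job IDs above. One possible approach is to poll each job until it reaches a terminal status and then read its output file ID. The sketch below assumes a `retrieve` callable that returns an object with `status` and `output_file_id` attributes; with the `openai` client from the previous cell this would presumably be `client.batches.retrieve`. The function name `wait_for_batch`, the stub, and all IDs are hypothetical.

```python
import time
from types import SimpleNamespace

# Statuses after which a batch job will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(retrieve, batch_id, poll_seconds=60.0, max_polls=100, sleep=time.sleep):
    """Poll retrieve(batch_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"{batch_id} did not finish after {max_polls} polls")

# Stub standing in for the API so the sketch runs offline (hypothetical values)
_states = iter(["validating", "in_progress", "completed"])

def fake_retrieve(batch_id):
    status = next(_states)
    file_id = "file-hypothetical" if status == "completed" else None
    return SimpleNamespace(id=batch_id, status=status, output_file_id=file_id)

done = wait_for_batch(fake_retrieve, "batch_hypothetical", poll_seconds=0, sleep=lambda s: None)
print(done.status, done.output_file_id)
```

Passing `retrieve` and `sleep` as arguments keeps the polling loop testable without network access.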

### This script downloads the output files from each completed batch job using their file IDs and saves them locally as JSONL files.


In [14]:
# Set OpenAI API Key
openai.api_key = "************" 

# List of output file IDs from completed batches
output_file_ids = [
    "file-5rNapKE6rPeGWmmxW69e6U",  
    "file-FRkTq1iabGCB1LG3WFjzVR",
    "file-3scCRp5igqJnNtLXteqHdF",
    "file-2X8oVYoqTHpedfhZKwqsbd"
]

# Download each output file properly
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)

    # Save the file locally in binary mode
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:
        for chunk in file_response.iter_bytes():
            f.write(chunk)
    
    print(f" File downloaded: {output_filename}")

File downloaded: file-5rNapKE6rPeGWmmxW69e6U.jsonl
File downloaded: file-FRkTq1iabGCB1LG3WFjzVR.jsonl
File downloaded: file-3scCRp5igqJnNtLXteqHdF.jsonl
File downloaded: file-2X8oVYoqTHpedfhZKwqsbd.jsonl


### This script goes through the downloaded JSONL files, extracts each comment's custom ID and model response, and saves everything into a CSV. It also prints how many responses were labelled as YES or NO.


In [16]:
# List of JSONL files
jsonl_files = [
    "file-5rNapKE6rPeGWmmxW69e6U.jsonl",  
    "file-FRkTq1iabGCB1LG3WFjzVR.jsonl",
    "file-3scCRp5igqJnNtLXteqHdF.jsonl",
    "file-2X8oVYoqTHpedfhZKwqsbd.jsonl"
]

# Initializing a list to store extracted data
extracted_data = []

# Processing each JSONL file
for jsonl_file in jsonl_files:
    if not os.path.exists(jsonl_file):
        print(f" {jsonl_file} not found. Skipping...")
        continue

    print(f" Processing {jsonl_file}...")

    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            json_obj = json.loads(line)

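            # Extract custom_id and response (a failed request has "response": null,
            # so every level falls back to an empty container)
            custom_id = json_obj.get("custom_id", "")
            body = (json_obj.get("response") or {}).get("body") or {}
            choices = body.get("choices") or [{}]
            response_content = ((choices[0].get("message") or {}).get("content") or "").strip()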

            # Store in list
            extracted_data.append({"custom_id": custom_id, "response": response_content})

# Converting to DataFrame
df_extracted = pd.DataFrame(extracted_data)

# Saving to CSV
output_csv = "extracted_yes_no_responses.csv"
df_extracted.to_csv(output_csv, index=False)

# Printing summary
yes_count = (df_extracted["response"] == "YES").sum()
no_count = (df_extracted["response"] == "NO").sum()

print(f"\n Extracted responses saved to {output_csv}")
print(f" Total 'YES' responses: {yes_count}")
print(f" Total 'NO' responses: {no_count}")


Processing file-5rNapKE6rPeGWmmxW69e6U.jsonl...
Processing file-FRkTq1iabGCB1LG3WFjzVR.jsonl...
Processing file-3scCRp5igqJnNtLXteqHdF.jsonl...
Processing file-2X8oVYoqTHpedfhZKwqsbd.jsonl...

Extracted responses saved to extracted_yes_no_responses.csv
Total 'YES' responses: 20409
Total 'NO' responses: 33768
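
The counts above rely on an exact match with "YES" or "NO". Here every response matched (20409 + 33768 = 54177), but a model can occasionally return variants such as `Yes.` or a quoted answer, and a failed request has no response body at all. The following sketch shows one way to handle both cases. The function name `extract_answer`, the normalisation rule, and the sample lines are assumptions, not part of the original pipeline.

```python
import json

def extract_answer(line):
    """Return (custom_id, answer) for one batch output line.

    The answer is upper-cased and stripped of surrounding whitespace,
    quotes and full stops; it is "" when the request produced no content.
    """
    obj = json.loads(line)
    body = (obj.get("response") or {}).get("body") or {}
    choices = body.get("choices") or [{}]
    content = (choices[0].get("message") or {}).get("content") or ""
    return obj.get("custom_id", ""), content.strip().strip('."\'').upper()

# Hypothetical output lines: one successful request, one failed request
ok_line = json.dumps({
    "custom_id": "c1",
    "response": {"status_code": 200, "body": {"choices": [{"message": {"content": ' "Yes." '}}]}},
    "error": None,
})
failed_line = json.dumps({"custom_id": "c2", "response": None, "error": {"code": "hypothetical_error"}})

print(extract_answer(ok_line))
print(extract_answer(failed_line))
```

Rows with an empty answer can then be counted separately instead of silently falling outside both the YES and NO totals.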


### This script merges the extracted YES/NO responses back with the original comment metadata using `custom_id` and saves the full result to a new CSV.


In [18]:
# Loading Extracted Responses CSV
responses_csv = "extracted_yes_no_responses.csv"  
df_responses = pd.read_csv(responses_csv)

# Loading Original Comments Metadata CSV
comments_csv = "comments_definitely_not.csv"  
df_comments = pd.read_csv(comments_csv)

# Merging on custom_id
df_merged = df_responses.merge(df_comments, on="custom_id", how="left")

# Saving Merged Data to New CSV
output_csv = "merged_definitely_not_with_responses.csv"
df_merged.to_csv(output_csv, index=False)

# Printing Summary
print(f" Merged file saved as {output_csv}")
print(f" Total merged records: {len(df_merged)}")


Merged file saved as merged_definitely_not_with_responses.csv
Total merged records: 54177
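
The next cell refers to `response_x` and `response_y`, which are not columns in either input file. This suggests that the comments CSV already contained a `response` column from an earlier step: when both frames share a non-key column name, `DataFrame.merge` appends the default suffixes `_x` (left frame) and `_y` (right frame). The sketch below reproduces this with toy data. All values are hypothetical, and the presence of a `response` column in the comments file is an assumption inferred from the column names.

```python
import pandas as pd

# Left frame: Step 3 answers (toy values)
responses = pd.DataFrame({"custom_id": ["c1", "c2"], "response": ["YES", "NO"]})

# Right frame: comment metadata that already has a 'response' column (assumption)
comments = pd.DataFrame({
    "custom_id": ["c1", "c2"],
    "Rewritten Comment": ["first comment", "second comment"],
    "response": ["DEFINITELY NOT", "DEFINITELY NOT"],
})

# Default suffixes: the left column becomes response_x, the right one response_y
merged = responses.merge(comments, on="custom_id", how="left")
print(merged.columns.tolist())

# Explicit suffixes make the origin of each column easier to read
named = responses.merge(comments, on="custom_id", how="left", suffixes=("_step3", "_step2"))
print(named.columns.tolist())
```

With explicit suffixes the later filtering step would not need to rename `response_x` back to `response`.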


### This script filters out only the YES responses from the merged data, cleans up extra columns, reorders the layout, and saves the final result to a new CSV.


In [20]:
# Loading Merged CSV File
input_csv = "merged_definitely_not_with_responses.csv"  
df = pd.read_csv(input_csv)

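# Filtering Only YES Responses from response_x (the Step 3 answer from the left side of the merge;
# pandas adds the _x/_y suffixes when both inputs share a column name)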
df_yes = df[df["response_x"] == "YES"].copy()

# Dropping response_y Column
df_yes.drop(columns=["response_y"], inplace=True, errors="ignore")

# Renaming response_x to response
df_yes.rename(columns={"response_x": "response"}, inplace=True)

# Moving custom_id and response to the End
cols = [col for col in df_yes.columns if col not in ["custom_id", "response"]]  # Get all columns except these two
df_yes = df_yes[cols + ["custom_id", "response"]]  # Reorder columns

# Saving Updated CSV
output_csv = "filtered_yes_responses.csv"
df_yes.to_csv(output_csv, index=False)

# Printing Summary
print(f" Extracted YES responses saved as {output_csv}")
print(f" Total 'YES' responses: {len(df_yes)}")


Extracted YES responses saved as filtered_yes_responses.csv
Total 'YES' responses: 20409


### This script combines two YES response CSVs into one, checks for any duplicate `custom_id`s, and saves the merged result to a new file. If duplicates are found, it shows a sample.


In [22]:
# Loading CSV Files
file1 = "filtered_yes_responses.csv"
file2 = "final_yes_comments.csv"

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Merging
df_merged = pd.concat([df1, df2], ignore_index=True)

# Checking for Duplicates in the custom_id
duplicate_custom_ids = df_merged[df_merged.duplicated(subset=["custom_id"], keep=False)]

# Saving Merged CSV
output_csv = "merged_yes_comments.csv"
df_merged.to_csv(output_csv, index=False)

# Printing Summary
print(f" Merged CSV saved as: {output_csv}")
print(f" Total Rows: {len(df_merged)}")
print(f" Duplicate custom_id Count: {len(duplicate_custom_ids)}")

# Showing duplicate custom_id values if they exist
if not duplicate_custom_ids.empty:
    print("\n Duplicate custom_id Sample:")
    print(duplicate_custom_ids.head(10))  

Merged CSV saved as: merged_yes_comments.csv
Total Rows: 71441
Duplicate custom_id Count: 0


### This script removes extra columns we no longer need from the merged YES comments file and saves a cleaner version for further analysis.


In [24]:
# Loading the Merged CSV
input_csv = "merged_yes_comments.csv"
df = pd.read_csv(input_csv)

# Dropping Unwanted Columns
columns_to_remove = ["response", "custom_id", "Source_File", "Data_Source", "Comment"]
df_cleaned = df.drop(columns=columns_to_remove, errors="ignore") 

# Saving the Cleaned CSV
output_csv = "cleaned_merged_yes_comments.csv"
df_cleaned.to_csv(output_csv, index=False)

# Printing Confirmation
print(f" Cleaned CSV saved as: {output_csv}")
print(f" Final Columns: {df_cleaned.columns.tolist()}")
print(f" Total Rows: {len(df_cleaned)}")


Cleaned CSV saved as: cleaned_merged_yes_comments.csv
Final Columns: ['Video_ID', 'Video_Title', 'Video_Category_Type', 'Channel_Name', 'Comment_ID', 'Author', 'Date', 'Likes', 'Replies_Count', 'Rewritten Comment']
Total Rows: 71441
