# Agreed with Transcript

- This notebook analyses comments that were influenced by the video's transcript.
- I start by filtering the dataset down to those influenced comments.
- Then I load the corresponding video transcripts and split each one into individual sentences.
- I use SBERT (`all-MiniLM-L6-v2`) to embed every transcript sentence once, so each comment can later be compared against precomputed embeddings.
- These embeddings are used to retrieve, for each influenced comment, the five most similar sentences from its video's transcript.
- Each comment and its retrieved sentences are sent through OpenAI's batch processing to determine whether the comment agrees with (`1`), disagrees with (`-1`), or is neutral toward (`0`) the transcript content.
- The results are stored in a new `Agreed_with_Transcript` column for each comment.


### Importing Libraries

In [14]:
import json
import os
import re
import time

import openai
import pandas as pd
import tiktoken
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

In [15]:
# Load merged dataset with influence labels
df = pd.read_csv("Thesis_Relevant_With_Transcript_Influence.csv")

# Keep only rows where comment was influenced by transcript
df_influenced = df[df["Influenced_by_Transcript"] == 1]

# Save filtered file
df_influenced.to_csv("Influenced_Comments.csv", index=False)
print("Saved influenced comments to Influenced_Comments.csv")


Saved influenced comments to Influenced_Comments.csv


### Generate Sentence Embeddings for Video Transcripts

In [18]:
# Load transcript dataset
df_transcripts = pd.read_csv("Processed_YouTube_Transcripts.csv")

# Load SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Precompute transcript sentence embeddings
transcript_embeddings = {}
for _, row in tqdm(df_transcripts.iterrows(), total=len(df_transcripts), desc="Encoding transcripts"):
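    video_id = row["Video_ID"]
    transcript_text = row["Transcript"]

    # Skip videos whose transcript is missing (NaN) or not text
    if not isinstance(transcript_text, str):
        continue

    # Split after sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', transcript_text.strip()) if s]

    if not sentences:
        continue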

    transcript_embeddings[video_id] = (
        sentences,
        model.encode(sentences, convert_to_tensor=True)
    )


Encoding transcripts: 100%|█████████████████████| 162/162 [00:45<00:00,  3.57it/s]
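
To make the splitting rule concrete, here is a small standalone sketch of the same regular expression. The sample text is invented for illustration and `split_sentences` is a hypothetical helper. Because the pattern splits on any whitespace that follows `.`, `!`, or `?`, abbreviations such as "Dr." are also treated as sentence ends, which may matter for auto-generated transcripts.

```python
import re

# Same pattern as in the cell above: split on whitespace preceded by ., ! or ?
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Split text into sentences and drop empty fragments."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

# Hypothetical sample text, not taken from the dataset
sample = "The model was trained on text. Does it work? Dr. Smith says yes!"
for sentence in split_sentences(sample):
    print(repr(sentence))
```

Note that an empty or whitespace-only transcript yields an empty list here, so the `if not sentences` guard has something to catch.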


### Retrieve Top 5 Most Similar Transcript Sentences for a Comment


In [20]:
def get_top_similar_sentences(comment, video_id):
    if video_id not in transcript_embeddings:
        return ["No transcript found"] * 5

    sentences, embeddings = transcript_embeddings[video_id]
    comment_embedding = model.encode(comment, convert_to_tensor=True)
    similarities = util.pytorch_cos_sim(comment_embedding, embeddings)[0]
    top_indices = torch.topk(similarities, k=min(5, len(sentences))).indices
    return [sentences[i] for i in top_indices]
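
The retrieval step boils down to cosine similarity followed by a top-k selection. The sketch below reproduces that logic in plain Python on toy 3-dimensional vectors, so it runs without SBERT or PyTorch. The function names and vector values are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_indices(query, candidates, k=5):
    """Indices of the k candidates most similar to the query, best first."""
    scores = [cosine_similarity(query, c) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:min(k, len(candidates))]

# Toy "embeddings" (hypothetical values)
comment_vec = [1.0, 0.0, 0.0]
sentence_vecs = [
    [0.9, 0.1, 0.0],  # nearly parallel to the comment
    [0.0, 1.0, 0.0],  # orthogonal to the comment
    [0.5, 0.5, 0.0],  # in between
]
print(top_k_indices(comment_vec, sentence_vecs, k=2))
```

As in `get_top_similar_sentences`, `k` is capped at the number of available sentences, so short transcripts return fewer than five matches.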


### Building Token-Limited Prompts for Checking Agreement Between Comments and Transcripts


In [22]:
# Load GPT-4o Mini tokenizer
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

def get_token_count(text):
    return len(encoding.encode(text))

def build_prompt_within_token_limit(comment, video_id, max_total_tokens=4000, max_sentence_chars=200):
    top_sentences = get_top_similar_sentences(comment, video_id)
    top_sentences = [s[:max_sentence_chars] for s in top_sentences]

    system_prompt = (
        "You are helping a researcher check if a YouTube comment agrees with the video transcript. "
        "Some comments support, oppose, or are unrelated to the transcript. Judge the comment in context."
    )

    for n in range(5, 0, -1):
        selected = top_sentences[:n]
        transcript_block = "\n".join(f"- {s}" for s in selected)

        user_prompt = f"""Comment: "{comment}"\nTranscript:\n{transcript_block}\n\nDoes this comment agree with the transcript?\n\nReply ONLY with:\n1 = agrees\n-1 = disagrees\n0 = neutral"""

        total_tokens = get_token_count(system_prompt) + get_token_count(user_prompt)
        if total_tokens <= max_total_tokens:
            return system_prompt, user_prompt

    # Fallback if nothing fits
    fallback = "Transcript:\n- N/A"
    user_prompt = f"""Comment: "{comment}"\n{fallback}\n\nDoes this comment agree with the transcript?\n\nReply ONLY with:\n1 = agrees\n-1 = disagrees\n0 = neutral"""
    return system_prompt, user_prompt
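
The loop above drops the least similar sentences until the prompt fits the token budget. The sketch below isolates that pattern with a crude whitespace-based counter standing in for `tiktoken`. Both `count_tokens` and `fit_sentences` are hypothetical helpers written only for this illustration.

```python
def count_tokens(text):
    """Stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def fit_sentences(comment, ranked_sentences, max_tokens):
    """Keep as many top-ranked sentences as fit within max_tokens.

    ranked_sentences must be ordered from most to least similar, so the
    least relevant context is dropped first.
    """
    for n in range(len(ranked_sentences), 0, -1):
        selected = ranked_sentences[:n]
        prompt = comment + "\n" + "\n".join(f"- {s}" for s in selected)
        if count_tokens(prompt) <= max_tokens:
            return selected
    return []  # not even one sentence fits

ranked = ["one two three", "four five six", "seven eight nine"]
print(fit_sentences("a short comment", ranked, max_tokens=12))
```

With a budget of 12 pseudo-tokens, all three sentences are too long together, so the third (least similar) one is dropped and the first two are kept.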


### Creating the JSONL Batch File

In [52]:
df = pd.read_csv("Influenced_Comments.csv")
output_jsonl = "batch_step_agree_with_transcript.jsonl"

with open(output_jsonl, "w", encoding="utf-8") as f:
    for _, row in tqdm(df.iterrows(), total=len(df), desc="Creating GPT-4o Mini batch"):
        comment_id = str(row["Comment_ID"])
        rewritten_comment = row["Rewritten Comment"]
        video_id = row["Video_ID"]

        # Build token-safe prompt
        system_prompt, user_prompt = build_prompt_within_token_limit(rewritten_comment, video_id)

        task = {
            "custom_id": comment_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                "temperature": 0.0,
                "max_tokens": 5
            }
        }

        f.write(json.dumps(task) + "\n")

print(f"Saved GPT-4o Mini batch: {output_jsonl}")


Creating GPT-4o Mini batch: 100%|████████████| 56390/56390 [27:03<00:00, 34.74it/s]
Saved GPT-4o Mini batch: batch_step_agree_with_transcript.jsonl
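
Before uploading, it can help to sanity-check the file, since a malformed line or a repeated `custom_id` would otherwise only surface after submission. This is a minimal sketch: `validate_batch_lines` is a hypothetical helper, and the required keys are simply the ones written by the cell above.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Return a list of problems found in an iterable of JSONL lines."""
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(task, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            problems.append(f"line {line_no}: missing keys {sorted(missing)}")
            continue
        if task["custom_id"] in seen_ids:
            problems.append(f"line {line_no}: duplicate custom_id {task['custom_id']!r}")
        seen_ids.add(task["custom_id"])
    return problems

# Hypothetical lines: one valid task repeated, plus one broken line
good = json.dumps({"custom_id": "c1", "method": "POST",
                   "url": "/v1/chat/completions", "body": {}})
for problem in validate_batch_lines([good, good, "{not json"]):
    print(problem)
```

For the real file, the same function can be fed an open file handle, for example `with open(output_jsonl, encoding="utf-8") as f: validate_batch_lines(f)`.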


### Splitting the Batch due to Token Limit

In [54]:
# Input file path
input_file = "batch_step_agree_with_transcript.jsonl"

# Output file names
output_file_1 = "batch_step2_agree_with_transcript_part1.jsonl"
output_file_2 = "batch_step2_agree_with_transcript_part2.jsonl"

# Read all lines
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Split in half
mid = len(lines) // 2
part1 = lines[:mid]
part2 = lines[mid:]

# Write to two files
with open(output_file_1, "w", encoding="utf-8") as f1:
    f1.writelines(part1)

with open(output_file_2, "w", encoding="utf-8") as f2:
    f2.writelines(part2)

# Print summary
print(f"Original: {input_file} → {len(lines)} tasks")
print(f"Part 1  : {output_file_1} → {len(part1)} tasks")
print(f"Part 2  : {output_file_2} → {len(part2)} tasks")


Original: batch_step_agree_with_transcript.jsonl → 56390 tasks
Part 1  : batch_step_agree_with_transcript_part1.jsonl → 28195 tasks
Part 2  : batch_step_agree_with_transcript_part2.jsonl → 28195 tasks
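
Splitting in half works for this file, but a more general approach is to cap the number of requests per file. The sketch below assumes a per-batch cap of 50,000 requests, which should be checked against the current OpenAI batch limits; if the binding constraint is a token quota instead, the same helper can be used with a smaller cap. `chunk_lines` is a hypothetical helper.

```python
def chunk_lines(lines, max_requests=50_000):
    """Split a list of JSONL lines into consecutive chunks of at most max_requests."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    return [lines[i:i + max_requests] for i in range(0, len(lines), max_requests)]

# 56,390 placeholder tasks, matching the size of the batch above
tasks = [f"task-{i}\n" for i in range(56_390)]
parts = chunk_lines(tasks)
print([len(p) for p in parts])
```

Unlike the half split, the chunks are uneven, but the order of tasks is preserved and no task is lost or duplicated.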


### Submitting Batches to OpenAI

In [56]:
# Set your OpenAI API key
openai.api_key = "***************" 
client = openai.OpenAI(api_key=openai.api_key)

# List of batch files to submit
batch_files = [
    "batch_step2_agree_with_transcript_part1.jsonl",
    "batch_step2_agree_with_transcript_part2.jsonl"
]

# Submit each batch with 30-minute delay
for i, file_path in enumerate(batch_files):
    try:
        print(f"Submitting batch {i+1}/{len(batch_files)}: {file_path}")

        # Upload file
        with open(file_path, "rb") as f:
            upload = client.files.create(file=f, purpose="batch")
        print(f"Uploaded: File ID = {upload.id}")

        # Submit batch job
        batch = client.batches.create(
            input_file_id=upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        )
        print(f"Batch submitted: Batch ID = {batch.id}")

        # Wait 30 minutes before next submission (if not the last one)
        if i < len(batch_files) - 1:
            print("Waiting 30 minutes before next submission...")
            time.sleep(30 * 60)  # 30 minutes, in seconds

    except Exception as e:
        print(f"Failed to submit {file_path}: {e}")


Submitting batch 1/2: batch_step_agree_with_transcript_part1.jsonl
Uploaded: File ID = file-JNygyC5BhbXvncj25cEToQ
Batch submitted: Batch ID = batch_67e49d84d5008190b5218c58954d381a
Waiting 30 minutes before next submission...
Submitting batch 2/2: batch_step_agree_with_transcript_part2.jsonl
Uploaded: File ID = file-KcPpmUEue2ksS5fQEZ5UZ8
Batch submitted: Batch ID = batch_67e4aba177548190bea792e16ed35d64
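
The output file IDs used in the next section only become available once each batch has finished. One way to wait for that is to poll the batch status until it reaches a terminal state. In the sketch below, `wait_for_batch` and its parameters are hypothetical. With the OpenAI client, `fetch` could be `lambda: client.batches.retrieve(batch.id)`, whose result exposes `status` and `output_file_id`.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(fetch, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Call fetch() until the returned object's status is terminal.

    fetch: zero-argument callable returning an object with a `status` attribute.
    sleep: injectable so the function can be exercised without real waiting.
    Returns the last fetched object, or raises TimeoutError after max_polls.
    """
    for _ in range(max_polls):
        batch = fetch()
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError("batch did not reach a terminal status in time")
```

The default of 1,440 polls at 60-second intervals covers the 24-hour completion window requested above. A batch that ends as `failed`, `expired`, or `cancelled` is returned as well, so the caller should check `status` before reading `output_file_id`.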


### Downloading JSONL Output from OpenAI

In [2]:
# Set OpenAI API Key
openai.api_key = "****************"

# List of output file IDs from completed batches
output_file_ids = [
    "file-UjWFGyWcWBXisNh9zzYxBt",
    "file-TsV6ecRbhBdTukPYpB15Cf"
]

# Download each output file properly
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)

    # Save the file locally in binary mode
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:
        for chunk in file_response.iter_bytes():
            f.write(chunk)
    
    print(f"File downloaded: {output_filename}")

File downloaded: file-UjWFGyWcWBXisNh9zzYxBt.jsonl
File downloaded: file-TsV6ecRbhBdTukPYpB15Cf.jsonl


### Converting JSONL Output to a CSV File

In [5]:
# List of your JSONL files
jsonl_files = [
    "file-UjWFGyWcWBXisNh9zzYxBt.jsonl",
    "file-TsV6ecRbhBdTukPYpB15Cf.jsonl"
]

# Store extracted data
extracted_data = []

# Loop through files
for jsonl_file in jsonl_files:
    if not os.path.exists(jsonl_file):
        print(f"{jsonl_file} not found")
        continue

    print(f"Processing {jsonl_file}")

    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            json_obj = json.loads(line)

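            custom_id = json_obj.get("custom_id", "")
            # A failed request can carry "response": null, so fall back to empty containers
            body = (json_obj.get("response") or {}).get("body") or {}
            choices = body.get("choices") or [{}]
            response = ((choices[0].get("message") or {}).get("content") or "").strip()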

            extracted_data.append({
                "custom_id": custom_id,
                "Agreed_with_Transcript": response
            })

# Create and save CSV
df = pd.DataFrame(extracted_data)
output_csv = "agreed_with_transcript_responses.csv"
df.to_csv(output_csv, index=False)

# Summary counts
agree = (df["Agreed_with_Transcript"] == "1").sum()
disagree = (df["Agreed_with_Transcript"] == "-1").sum()
neutral = (df["Agreed_with_Transcript"] == "0").sum()

print(f"\nExtracted responses saved to: {output_csv}")
print(f"Agrees    : {agree}")
print(f"Disagrees : {disagree}")
print(f"Neutral   : {neutral}")


Processing file-UjWFGyWcWBXisNh9zzYxBt.jsonl
Processing file-TsV6ecRbhBdTukPYpB15Cf.jsonl

Extracted responses saved to: agreed_with_transcript_responses.csv
Agrees    : 10974
Disagrees : 19832
Neutral   : 25583
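
The three counts above sum to 56,389, one short of the 56,390 submitted tasks, so at least one task has no reply that exactly matches `1`, `-1`, or `0`. The sketch below shows one way to parse a single output line defensively and normalise the reply to an integer label. `parse_label` is a hypothetical helper, and the line layout (`custom_id`, `response.body.choices[0].message.content`) is the one read by the cell above.

```python
import json

VALID_LABELS = {"1": 1, "-1": -1, "0": 0}

def parse_label(line):
    """Return (custom_id, label); label is 1, -1, 0, or None if unusable."""
    obj = json.loads(line)
    body = (obj.get("response") or {}).get("body") or {}
    choices = body.get("choices") or [{}]
    content = (choices[0].get("message") or {}).get("content") or ""
    return obj.get("custom_id", ""), VALID_LABELS.get(content.strip())

# Hypothetical output lines: one successful reply, one failed request
ok_line = json.dumps({
    "custom_id": "c1",
    "response": {"body": {"choices": [{"message": {"content": " -1 "}}]}},
})
failed_line = json.dumps({"custom_id": "c2", "response": None})

print(parse_label(ok_line))
print(parse_label(failed_line))
```

Returning `None` for anything outside the three labels keeps unusable replies visible, so they can be counted or re-submitted rather than silently mixed into the data.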


### Merging GPT Agreement Labels into Main Dataset


In [8]:
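df_main = pd.read_csv("Thesis_Relevant_With_Transcript_Influence.csv")
df_agreed = pd.read_csv("agreed_with_transcript_responses.csv").rename(columns={"custom_id": "Comment_ID"})

# Look up each comment's label by Comment_ID; dropping duplicate IDs first keeps
# the lookup one-to-one, so the result always lines up with df_main's rows.
labels = df_agreed.drop_duplicates("Comment_ID").set_index("Comment_ID")["Agreed_with_Transcript"]

# Comments that were never sent to the model (not influenced by the transcript)
# have no label and are filled with the neutral value 0.
df_main["Agreed_with_Transcript"] = df_main["Comment_ID"].map(labels).fillna(0)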

df_main.to_csv("Thesis_Relevant_With_Transcript_Influence_and_Agreement.csv", index=False)
print("New CSV saved to: Thesis_Relevant_With_Transcript_Influence_and_Agreement.csv")


New CSV saved to: Thesis_Relevant_With_Transcript_Influence_and_Agreement.csv


### Count Claim Detection vs. Transcript Agreement Combinations


In [19]:
# Load the data
df = pd.read_csv("Final_Thesis_Merged.csv")

# Remove rows with non-numeric values in 'Agreed_with_Transcript';
# .copy() makes the column assignment below act on an independent DataFrame
df = df[pd.to_numeric(df["Agreed_with_Transcript"], errors="coerce").notna()].copy()

# Convert both columns to integers
df[["Claim_Detection", "Agreed_with_Transcript"]] = df[["Claim_Detection", "Agreed_with_Transcript"]].astype(int)

# Group and count combinations
summary = df.groupby(["Claim_Detection", "Agreed_with_Transcript"]).size().reset_index(name="Count")

# Sort for clarity
summary = summary.sort_values(by=["Claim_Detection", "Agreed_with_Transcript"])

# Show result
print(summary)

   Claim_Detection  Agreed_with_Transcript  Count
0                0                      -1  11544
1                0                       0  26687
2                0                       1   6673
3                1                      -1   8288
4                1                       0  13947
5                1                       1   4301
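
Raw counts are hard to compare because the two claim groups differ in size. As a worked example, the snippet below converts the counts printed above into within-group shares. It only re-uses the six numbers from the table; the variable and function names are illustrative.

```python
# Counts copied from the summary table above:
# {Claim_Detection: {Agreed_with_Transcript: count}}
counts = {
    0: {-1: 11544, 0: 26687, 1: 6673},
    1: {-1: 8288, 0: 13947, 1: 4301},
}

def group_shares(group_counts):
    """Convert raw counts into fractions of the group total."""
    total = sum(group_counts.values())
    return {label: n / total for label, n in group_counts.items()}

for claim, group in counts.items():
    shares = group_shares(group)
    formatted = ", ".join(f"{label:>2}: {share:.1%}" for label, share in shares.items())
    print(f"Claim_Detection={claim} (n={sum(group.values())}): {formatted}")
```

On these numbers, disagreement accounts for about 31% of comments flagged as claims and about 26% of the rest. The neutral share includes the rows that were filled with `0` during the merge, so it is not purely a model judgement.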


