# Influenced by Transcript

- In this notebook, I prepared a batch to identify whether each thesis-relevant comment was influenced by the corresponding video transcript.  
- I first used Sentence-BERT to generate sentence-level embeddings for all video transcripts.  
- Each transcript was split into individual sentences, and embeddings were precomputed and stored by `Video_ID`.  
- For each comment, I encoded it using SBERT and calculated cosine similarity between the comment and all transcript sentences from the same video.  
- I selected the top 5 most similar transcript sentences for each comment and included them in the batch JSONL file.  
- GPT-4o was then asked to return `1` if the comment appeared to be influenced by the transcript content, or `0` if it was not.  
- The batch was split in two due to token limits and submitted via the OpenAI Batch API.  
- Once the outputs were ready, I extracted the `custom_id` and influence label, and saved the results to a CSV for further analysis.


### Importing Libraries

In [34]:
import pandas as pd
import json
import torch
import re
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm
import os
import openai
import time

### Generate Sentence Embeddings for Video Transcripts

In [1]:
# Load data
df = pd.read_csv("Thesis Relevant Comments.csv")
df_transcripts = pd.read_csv("Processed_YouTube_Transcripts.csv")

# Load SBERT
model = SentenceTransformer('all-MiniLM-L6-v2')

# Precompute transcript embeddings
transcript_embeddings = {}
for _, row in tqdm(df_transcripts.iterrows(), total=len(df_transcripts), desc="🔄 Encoding transcripts"):
    video_id = row["Video_ID"]
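    transcript_text = row["Transcript"]
    # Skip rows with a missing transcript (pandas reads empty cells as NaN, a float)
    if not isinstance(transcript_text, str):
        continue
    # Split on whitespace that follows ., ! or ?, and drop empty fragments
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', transcript_text) if s.strip()]
    if not sentences:
        continue
    transcript_embeddings[video_id] = (sentences, model.encode(sentences, convert_to_tensor=True))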

Encoding transcripts: 100%|████████████████████| 162/162 [00:47<00:00, 3.44it/s]
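
The sentence splitter above relies on a single regular expression, so it helps to see what it does on small inputs. The following is a minimal sketch using only the standard library; the helper name `split_sentences` and the sample strings are invented for illustration and are not part of the dataset.

```python
import re

# Same pattern as in the encoding cell: split on whitespace that follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Split text into sentences and drop empty fragments (hypothetical helper)."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

# Invented sample text
sample = "Welcome back! Today we test the new phone. Is it worth it?"
print(split_sentences(sample))
# ['Welcome back!', 'Today we test the new phone.', 'Is it worth it?']

# Unpunctuated text stays in one piece, so it becomes a single long "sentence"
print(split_sentences("so today we are going to look at the camera"))

# Abbreviations are split too, because the pattern only looks at punctuation
print(split_sentences("Dr. Smith agrees."))
# ['Dr.', 'Smith agrees.']
```

If the transcripts are auto-generated captions without punctuation, each one may end up as a single embedding, which would make the top-5 selection less informative. Checking the number of sentences per transcript is a cheap way to find out.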


### Identify Top 5 Transcript Sentences Closest to a Comment


In [5]:
# Get top 5 most similar transcript sentences for a given comment
def get_top_similar_sentences(comment, video_id):
    if video_id not in transcript_embeddings:
        return ["No transcript found"] * 5
    transcript_sentences, transcript_embedding = transcript_embeddings[video_id]
    comment_embedding = model.encode(comment, convert_to_tensor=True)
    similarities = util.pytorch_cos_sim(comment_embedding, transcript_embedding)[0]
    top_indices = torch.topk(similarities, k=min(5, len(transcript_sentences))).indices
    top_sentences = [transcript_sentences[i] for i in top_indices]
    while len(top_sentences) < 5:
        top_sentences.append("N/A")
    return top_sentences
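
To make the ranking step concrete, here is a dependency-free sketch of what `util.pytorch_cos_sim` followed by `torch.topk` computes: a cosine similarity per transcript sentence, then the indices of the highest scores. The function names and the toy 2-D vectors are assumptions made for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def top_k_indices(query, candidates, k=5):
    """Indices of the k candidates most similar to the query, best first."""
    scores = [cosine_similarity(query, c) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy "embeddings" with invented values
comment_vec = [1.0, 0.0]
sentence_vecs = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]
print(top_k_indices(comment_vec, sentence_vecs, k=2))  # [1, 2]
```

Note that top-k always returns k sentences, even when none of them is actually close to the comment. Passing the similarity scores along with the sentences, or applying a minimum threshold, would be one way to let the downstream prompt distinguish strong matches from weak ones.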



### Creating the JSONL File

In [45]:
output_jsonl = "batch_step_influenced_by_transcript.jsonl"
with open(output_jsonl, "w", encoding="utf-8") as f:
    for _, row in tqdm(df.iterrows(), total=len(df), desc="Creating batch"):
        comment_id = str(row["Comment_ID"])
        rewritten_comment = row["Rewritten Comment"]
        video_id = row["Video_ID"]
        top_sentences = get_top_similar_sentences(rewritten_comment, video_id)

        task = {
            "custom_id": comment_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are helping a researcher determine whether a YouTube comment was influenced by a video transcript."
                    },
                    {
                        "role": "user",
                        "content": f"""Comment: "{rewritten_comment}"
Transcript:
- {top_sentences[0]}
- {top_sentences[1]}
- {top_sentences[2]}
- {top_sentences[3]}
- {top_sentences[4]}

Was this comment influenced by the transcript? 
Reply ONLY with:
1 = Yes, influenced
0 = No, not influenced"""
                    }
                ],
                "temperature": 0.0,
                "max_tokens": 5
            }
        }
        f.write(json.dumps(task) + "\n")

print(f"Saved: {output_jsonl}")


Creating batch: 100%|███████████| 71441/71441 [36:50<00:00, 32.32it/s]
Saved: batch_step_influenced_by_transcript.jsonl
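
Before uploading, it can be worth checking the file once, since a malformed line or a repeated `custom_id` is easier to catch locally than after submission. The sketch below is an assumption about what such a check might look like, not part of the original pipeline; `validate_batch_lines` is a hypothetical helper, and the required keys are simply the four that the cell above writes.

```python
import json

def validate_batch_lines(lines):
    """Return a list of problems found in an iterable of JSONL request lines."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(task, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        # The four top-level keys written by the batch-creation cell
        for key in ("custom_id", "method", "url", "body"):
            if key not in task:
                problems.append(f"line {number}: missing '{key}'")
        custom_id = task.get("custom_id")
        if custom_id in seen_ids:
            problems.append(f"line {number}: duplicate custom_id '{custom_id}'")
        seen_ids.add(custom_id)
    return problems

# Invented demo lines: the second one repeats the first custom_id
demo = [
    json.dumps({"custom_id": "c1", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "c1", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
]
print(validate_batch_lines(demo))  # ["line 2: duplicate custom_id 'c1'"]
```

For the real file, the same function can be called on the open file handle, for example `validate_batch_lines(open(output_jsonl, encoding="utf-8"))`.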


### Splitting the Batch due to Token Limit

In [58]:
# === Config ===
input_file = "batch_step_influenced_by_transcript.jsonl"
output_file_1 = input_file.replace(".jsonl", "_part1.jsonl")
output_file_2 = input_file.replace(".jsonl", "_part2.jsonl")

# === Read the file
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# === Split
mid = len(lines) // 2
part1 = lines[:mid]
part2 = lines[mid:]

# === Write to two new files
with open(output_file_1, "w", encoding="utf-8") as f:
    f.writelines(part1)

with open(output_file_2, "w", encoding="utf-8") as f:
    f.writelines(part2)

# === Summary
print(f"Saved {len(part1)} → {output_file_1}")
print(f"Saved {len(part2)} → {output_file_2}")


Saved 35720 → batch_step_influenced_by_transcript_part1.jsonl
Saved 35721 → batch_step_influenced_by_transcript_part2.jsonl
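
Splitting in half was enough here, but the same idea generalizes to any number of parts. The following sketch chunks a list of request lines by a maximum count per file. Both helper names are hypothetical, and `max_lines` is a placeholder: the right value depends on whichever request or token limits apply to the account, which this notebook does not record.

```python
def split_into_chunks(lines, max_lines):
    """Yield consecutive slices of at most max_lines items each."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    for start in range(0, len(lines), max_lines):
        yield lines[start:start + max_lines]

def chunk_filename(input_file, index):
    """Follow the notebook's naming scheme: name.jsonl -> name_part<index>.jsonl."""
    return input_file.replace(".jsonl", f"_part{index}.jsonl")

# Invented demo: 7 requests, at most 3 per file
requests = [f"request-{i}\n" for i in range(7)]
for index, chunk in enumerate(split_into_chunks(requests, max_lines=3), start=1):
    print(chunk_filename("batch.jsonl", index), len(chunk))
# batch_part1.jsonl 3
# batch_part2.jsonl 3
# batch_part3.jsonl 1
```

Chunking by line count is only a proxy for a token limit. If prompts vary a lot in length, counting tokens per request and closing a chunk when a token budget is reached would be the more precise variant.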


### Submitting Batches to OpenAI

In [51]:
# Set your OpenAI API key
openai.api_key = "*************"
client = openai.OpenAI(api_key=openai.api_key)

# List of your batch files
batch_files = [
    "batch_step_influenced_by_transcript_part1.jsonl",
    "batch_step_influenced_by_transcript_part2.jsonl"
]

# Submit each batch with a 30-minute delay between them
for i, file_path in enumerate(batch_files):
    try:
        with open(file_path, "rb") as f:
            upload = client.files.create(file=f, purpose="batch")

        print(f"Uploaded: {file_path} -> File ID: {upload.id}")

        batch = client.batches.create(
            input_file_id=upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        )

        print(f"Submitted Batch {i+1} -> Batch ID: {batch.id}")

        # Wait 30 minutes before submitting the next batch (unless it's the last one)
        if i < len(batch_files) - 1:
            print("Waiting 30 minutes before next submission...")
            time.sleep(30 * 60)

    except Exception as e:
        print(f"Failed to process {file_path}: {e}")


Uploaded: batch_step_influenced_by_transcript_part1.jsonl -> File ID: file-RerJuxTQN1qAHHKgTUfsqD
Submitted Batch 1 -> Batch ID: batch_67e083d901c48190a71da912bf7c840f
Waiting 100 minutes before next submission...
Uploaded: batch_step_influenced_by_transcript_part2.jsonl -> File ID: file-Mh8tajDU1pXwGN1zuRRueb
Submitted Batch 2 -> Batch ID: batch_67e08af3bbf0819086e8b01ad3bbc097
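
The notebook does not show how completion was detected before the output files were downloaded. One possible approach, sketched here under that assumption, is to poll `client.batches.retrieve` until the batch reaches a terminal status. The helper name `wait_for_batch`, the polling interval, and the injectable `sleep` argument are illustrative choices, not part of the original workflow.

```python
import time

# Statuses after which a batch no longer changes
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll the Batch API until the batch reaches a terminal status.

    `client` is expected to behave like `openai.OpenAI()`, i.e. expose
    `client.batches.retrieve(batch_id)` returning an object with `.status`.
    """
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)

# Usage with the client and a batch object from the submission cell:
# finished = wait_for_batch(client, batch.id)
# if finished.status == "completed":
#     print(finished.output_file_id)  # ID to pass to the download step
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. For long-running batches, a polling interval of several minutes is probably sufficient.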


### Downloading JSONL Output from OpenAI

In [53]:
# Set OpenAI API Key
openai.api_key = "*****************"  

# List of output file IDs from completed batches
output_file_ids = [
    "file-1CvpT2fRV1nTnGFCxEBThF",
    "file-Y3EN2ZdZCNbALCJiV6vhDN",
]

# Download each output file properly
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)

    # Save the file locally in binary mode
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:
        for chunk in file_response.iter_bytes():
            f.write(chunk)
    
    print(f"File downloaded: {output_filename}")

File downloaded: file-1CvpT2fRV1nTnGFCxEBThF.jsonl
File downloaded: file-Y3EN2ZdZCNbALCJiV6vhDN.jsonl


### Converting JSONL Output to a CSV File

In [60]:
jsonl_files = [
    "file-1CvpT2fRV1nTnGFCxEBThF.jsonl",
    "file-Y3EN2ZdZCNbALCJiV6vhDN.jsonl"    
]

# Store extracted data
extracted_data = []

# Loop through files
for jsonl_file in jsonl_files:
    if not os.path.exists(jsonl_file):
        print(f" {jsonl_file} not found")
        continue

    print(f" Processing {jsonl_file}")

    with open(jsonl_file, "r", encoding="utf-8") as f:
        for line in f:
            json_obj = json.loads(line)

            custom_id = json_obj.get("custom_id", "")
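            # `or` fallbacks guard against fields that are present but null (e.g. failed requests)
            body = (json_obj.get("response") or {}).get("body") or {}
            choices = body.get("choices") or [{}]
            response = ((choices[0].get("message") or {}).get("content") or "").strip()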

            extracted_data.append({
                "custom_id": custom_id,
                "Influenced_by_Transcript": response
            })

# Create and save CSV (a separate name avoids overwriting the comments DataFrame `df`)
df_responses = pd.DataFrame(extracted_data)
output_csv = "influenced_by_transcript_responses.csv"
df_responses.to_csv(output_csv, index=False)

# Summary
influenced = (df_responses["Influenced_by_Transcript"] == "1").sum()
not_influenced = (df_responses["Influenced_by_Transcript"] == "0").sum()

print(f"\n Extracted responses saved to: {output_csv}")
print(f" Influenced: {influenced}")
print(f" Not Influenced: {not_influenced}")


Processing file-1CvpT2fRV1nTnGFCxEBThF.jsonl
Processing file-Y3EN2ZdZCNbALCJiV6vhDN.jsonl

Extracted responses saved to: influenced_by_transcript_responses.csv
Influenced     : 47918
Not Influenced : 23523
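
In this run the two counts add up to 71,441, the number of requests, so every reply was a clean `1` or `0`. For future runs, where a reply might come back empty or with extra words, a stricter parsing step can flag anything unexpected. The sketch below is an illustration under that assumption; `parse_label` and `extract_content` are hypothetical helpers and the sample records are invented.

```python
def parse_label(text):
    """Map a raw model reply to 1, 0, or None when it is not a clean label."""
    cleaned = text.strip()
    if cleaned in ("0", "1"):
        return int(cleaned)
    return None

def extract_content(record):
    """Get the reply text from one parsed batch-output line.

    Tolerates fields that are missing or null, which may occur for
    requests that did not complete.
    """
    body = (record.get("response") or {}).get("body") or {}
    choices = body.get("choices") or [{}]
    return (choices[0].get("message") or {}).get("content") or ""

# Invented records in the shape of batch output lines
records = [
    {"custom_id": "c1", "response": {"body": {"choices": [{"message": {"content": "1"}}]}}},
    {"custom_id": "c2", "response": {"body": {"choices": [{"message": {"content": " 0\n"}}]}}},
    {"custom_id": "c3", "response": None},
]
labels = {r["custom_id"]: parse_label(extract_content(r)) for r in records}
print(labels)  # {'c1': 1, 'c2': 0, 'c3': None}
```

Any `custom_id` that maps to `None` can then be collected and resubmitted or reviewed by hand, rather than silently dropping out of both summary counts.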


### Merging Transcript Influence Labels into Main Dataset


In [46]:
df_thesis = pd.read_csv("Thesis Relevant Comments.csv")
df_influenced = pd.read_csv("influenced_by_transcript_responses.csv")

# The thesis file uses "Comment_ID" and the responses file uses "custom_id", so rename to match
df_influenced = df_influenced.rename(columns={"custom_id": "Comment_ID"})

# Merge on Comment_ID (left join keeps every thesis-relevant comment)
df_merged = pd.merge(df_thesis, df_influenced, on="Comment_ID", how="left")

# Save merged result
df_merged.to_csv("Thesis_Relevant_With_Transcript_Influence.csv", index=False)

print(" Merged CSV saved as Thesis_Relevant_With_Transcript_Influence.csv")


 Merged CSV saved as Thesis_Relevant_With_Transcript_Influence.csv
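
A left join succeeds silently even when keys do not line up: comments without a matching label get a missing value, and a label ID that appears twice duplicates the comment row. The following standard-library sketch shows one way to check coverage before merging. The helper name `merge_coverage` and the sample IDs are assumptions for illustration.

```python
from collections import Counter

def merge_coverage(main_ids, label_ids):
    """Compare the key sets of two tables before a left join."""
    main_set = set(main_ids)
    label_counts = Counter(label_ids)
    label_set = set(label_counts)
    return {
        # Comments that would end up with a missing label after the join
        "missing_labels": sorted(main_set - label_set),
        # Labels whose ID matches no comment (dropped by a left join)
        "unexpected_labels": sorted(label_set - main_set),
        # Label IDs that occur more than once (would duplicate comment rows)
        "duplicate_labels": sorted(i for i, n in label_counts.items() if n > 1),
    }

# Invented IDs
report = merge_coverage(["a", "b", "c"], ["a", "a", "d"])
print(report)
# {'missing_labels': ['b', 'c'], 'unexpected_labels': ['d'], 'duplicate_labels': ['a']}
```

With the DataFrames from the cell above, the call might look like `merge_coverage(df_thesis["Comment_ID"].astype(str), df_influenced["Comment_ID"].astype(str))`. Casting both columns to strings also sidesteps a type mismatch if one file stores the IDs as numbers.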
