# Transcript Claims Extraction

- In this notebook, I prepared a batch to extract check-worthy factual claims directly from full YouTube video transcripts.  
- I used the complete transcript text from each video and built a batch JSONL file where each entry was tied to a specific `Video_ID`.  
- The prompt instructed GPT-4o to extract any claims related to sportswashing, human rights, financial corruption, or Gulf investments in global sports.  
- If no relevant claims were present, the model was asked to reply with "None".  
- The batch was submitted to the OpenAI Batch API, and the output was downloaded and converted to CSV for further filtering and analysis.


### Importing Libraries

In [None]:
import pandas as pd
import json
from tqdm import tqdm
import textwrap
import openai
import random

In [6]:
# Loading transcript data
df = pd.read_csv("Processed_YouTube_Transcripts.csv")
df = df.dropna(subset=["Transcript"]).drop_duplicates(subset=["Video_ID"]).reset_index(drop=True)
print(f"Loaded {len(df)} transcripts.")

Loaded 161 transcripts.


### Creating the JSONL File

In [14]:
output_jsonl = "batch_transcript_claim_extraction_full_4o.jsonl"

# Opening the file and writing each transcript as a separate task
with open(output_jsonl, "w", encoding="utf-8") as f:
    for _, row in tqdm(df.iterrows(), total=len(df), desc="Building full transcript batch"):
        video_id = str(row["Video_ID"])
        transcript = row["Transcript"]

        # Building the prompt using the transcript
        user_prompt = f"""Extract any check-worthy factual claims from the following YouTube transcript.

The focus is on sportswashing, human rights, financial corruption, and Gulf investments in global sports.

Return the claims as a bullet-point list. If there are no relevant claims, reply with:

None

Transcript:
\"\"\"{transcript}\"\"\"
"""

        # Creating the task for GPT-4o
        task = {
            "custom_id": video_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are helping a researcher extract check-worthy claims from YouTube video transcripts."
                    },
                    {
                        "role": "user",
                        "content": user_prompt
                    }
                ],
                "temperature": 0,
                "max_tokens": 750
            }
        }

        f.write(json.dumps(task) + "\n")

print(f"\nSaved JSONL file: {output_jsonl}")
print(f"Total prompts: {len(df)}")


Building full transcript batch: 100% |████████| 161/161 [00:00<00:00, 324.32it/s]

Saved JSONL file: batch_transcript_claim_extraction_full_4o.jsonl
Total prompts: 161
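
Before uploading, the file can be sanity-checked locally. The sketch below is a minimal example, not part of the original run: `validate_batch_lines` is a hypothetical helper, and it assumes the Batch API expects one JSON object per line with the keys used above and a `custom_id` that is unique within the file.

```python
import json
from collections import Counter


def validate_batch_lines(lines):
    """Return a list of problems found in an iterable of JSONL task lines."""
    required = {"custom_id", "method", "url", "body"}
    id_counts = Counter()
    problems = []
    for n, line in enumerate(lines, 1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(task, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        missing = required - task.keys()
        if missing:
            problems.append(f"line {n}: missing keys {sorted(missing)}")
        id_counts[task.get("custom_id")] += 1
    # Assumption: each custom_id should appear only once per batch file
    problems += [
        f"duplicate custom_id: {cid}" for cid, count in id_counts.items() if count > 1
    ]
    return problems


# Usage (file name taken from the cell above):
# with open("batch_transcript_claim_extraction_full_4o.jsonl", encoding="utf-8") as f:
#     print(validate_batch_lines(f) or "No problems found")
```

An empty list means every line parsed and no `custom_id` was repeated.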


### Submitting Batches to OpenAI

In [23]:
openai.api_key = "************" 
client = openai.OpenAI(api_key=openai.api_key)

# JSONL batch files 
batch_files = [
    "batch_transcript_claim_extraction_full_4o.jsonl"
]

# Submitting batches
for i, file_path in enumerate(batch_files):
    try:
        print(f"\nSubmitting batch {i+1}/{len(batch_files)}: {file_path}")

        # Uploading file
        with open(file_path, "rb") as f:
            upload = client.files.create(file=f, purpose="batch")
        print(f"Uploaded: File ID = {upload.id}")

        # Creating batch job
        batch = client.batches.create(
            input_file_id=upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h"
        )
        print(f"Batch submitted: Batch ID = {batch.id}")

    except Exception as e:
        print(f"Failed to submit {file_path}: {e}")



Submitting batch 1/1: batch_transcript_claim_extraction_full_4o.jsonl
Uploaded: File ID = file-PNmRDxXbfAdedATd5a9WFm
Batch submitted: Batch ID = batch_67f1b9d8a5248190b9a24b806aab5a46
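
The notebook does not show how the batch was monitored between submission and download. One possible approach is to poll the batch until it reaches a terminal state and then read its `output_file_id`. The sketch below is an assumption about how that could look: `wait_for_batch` is a hypothetical helper, `client` is expected to be the `openai.OpenAI` client created above, and the set of terminal status names reflects my understanding of the Batch API rather than anything recorded in this notebook.

```python
import time

# Assumed terminal statuses for a batch job
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60, max_polls=None):
    """Poll a batch until it reaches a terminal status and return the batch object.

    If max_polls is given, stop after that many non-terminal polls and
    return the most recent batch object instead of waiting forever.
    """
    polls = 0
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Batch {batch_id}: {batch.status}")
        if batch.status in TERMINAL_STATES:
            return batch
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return batch
        time.sleep(poll_seconds)


# Usage (batch ID taken from the output above):
# batch = wait_for_batch(client, "batch_67f1b9d8a5248190b9a24b806aab5a46")
# print(batch.output_file_id)
```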


### Downloading JSONL Output from OpenAI

In [26]:
openai.api_key = "****************" 

# List of output file IDs from completed batches
output_file_ids = [
    "file-9xJPc6ofA1aG2fjbtvtYW4"
]

# Downloading each output file
for file_id in output_file_ids:
    file_response = openai.files.content(file_id)

    # Saving the file locally in binary mode
    output_filename = f"{file_id}.jsonl"
    with open(output_filename, "wb") as f:
        for chunk in file_response.iter_bytes():
            f.write(chunk)
    
    print(f"File downloaded: {output_filename}")

File downloaded: file-9xJPc6ofA1aG2fjbtvtYW4.jsonl
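
Before converting the output, it can be useful to check how many requests actually succeeded, since a failed or truncated response is otherwise hard to tell apart from a genuine "None". The sketch below is a hedged example: `summarize_batch_output` is a hypothetical helper, and the field names (`error`, `response.status_code`, `finish_reason`) are assumptions about the layout of each output line.

```python
import json
from collections import Counter


def summarize_batch_output(lines):
    """Count result lines that are ok, truncated by max_tokens, or failed."""
    summary = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        response = entry.get("response") or {}
        body = response.get("body") or {}
        choices = body.get("choices") or []
        if entry.get("error") or response.get("status_code") != 200 or not choices:
            summary["failed"] += 1
        elif choices[0].get("finish_reason") == "length":
            # The reply hit the max_tokens limit, so the claim list may be cut off
            summary["truncated"] += 1
        else:
            summary["ok"] += 1
    return summary


# Usage (file name taken from the output above):
# with open("file-9xJPc6ofA1aG2fjbtvtYW4.jsonl", encoding="utf-8") as f:
#     print(summarize_batch_output(f))
```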


### Converting JSONL Output to a CSV File

In [32]:
jsonl_path = "file-9xJPc6ofA1aG2fjbtvtYW4.jsonl"

# Parse each line and extract claims
video_claims = {}

with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Extracting claims from batch result"):
        entry = json.loads(line)
        video_id = entry["custom_id"]

        # Try to extract model's output
        try:
            response_text = entry["response"]["body"]["choices"][0]["message"]["content"].strip()
        except (KeyError, IndexError, TypeError):
            response_text = "None"  # fallback if something went wrong

        # Convert response to list of claims
        if response_text.lower() == "none":
            claims = []
        else:
            # Keep every non-empty line, stripping bullet hyphens and surrounding spaces
            claims = [
                bullet.strip("- ").strip()
                for bullet in response_text.splitlines()
                if bullet.strip()
            ]

        video_claims[video_id] = claims

# Convert to DataFrame
claims_df = pd.DataFrame([
    {
        "Video_ID": vid,
        "Extracted_Claims": claims,
        "Claim_Present": int(len(claims) > 0)
    }
    for vid, claims in video_claims.items()
])

claims_df.to_csv("Extracted_Transcript_Claims.csv", index=False)

Extracting claims from batch result: 161it [00:00, 8922.64it/s]
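
Writing a column of Python lists to CSV stores each list as its string representation, which then has to be parsed back by hand. An alternative, shown here only as a sketch and not as what this notebook did, is to encode each list as JSON before saving and decode it on load. The helper names are hypothetical, and the example uses the standard `csv` module so it runs on its own; with pandas, applying `json.dumps` to the `Extracted_Claims` column before `to_csv` should give a similar result.

```python
import csv
import io
import json


def claims_to_csv(video_claims):
    """Serialize {video_id: [claims]} to CSV text, storing each list as JSON."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Video_ID", "Extracted_Claims", "Claim_Present"])
    for video_id, claims in video_claims.items():
        writer.writerow(
            [video_id, json.dumps(claims, ensure_ascii=False), int(len(claims) > 0)]
        )
    return buffer.getvalue()


def claims_from_csv(text):
    """Read the CSV text back into {video_id: [claims]}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Video_ID"]: json.loads(row["Extracted_Claims"]) for row in reader}
```

Because the quoting is handled by `json`, claims that contain apostrophes, commas, or quotation marks survive the round trip unchanged.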


### Cleaning and Validating Extracted Transcript Claims

In [58]:
df = pd.read_csv("Extracted_Transcript_Claims.csv")

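# Define the claim cleaning function
def parse_and_clean_claims(claim_string):
    """Split a stringified list of claims and keep claims of at least five words."""
    if pd.isna(claim_string) or claim_string.strip() in ["[]", "None", ""]:
        return []

    try:
        raw = claim_string.strip("[]").strip()
        # Note: this split only matches boundaries between single-quoted items;
        # a claim containing an apostrophe is written with double quotes and
        # stays fused to its neighbours
        parts = raw.split("', '")
        cleaned = []
        for part in parts:
            # Drop leftover quotes and surrounding periods and spaces
            claim = part.strip().strip("'\"").strip(". ")
            if len(claim.split()) >= 5:
                cleaned.append(claim)
        return cleaned
    except Exception:
        return []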

# Clean the claims
df["Extracted_Claims"] = df["Extracted_Claims"].apply(parse_and_clean_claims)
df["Num_Claims"] = df["Extracted_Claims"].apply(len)

# Keep only the final columns
df = df[["Video_ID", "Extracted_Claims", "Num_Claims"]]

# Save the final cleaned file
df.to_csv("Extracted_Transcript_Claims_FIXED.csv", index=False)
print("Cleaned and saved: 'Extracted_Transcript_Claims_FIXED.csv'")


Cleaned and saved: 'Extracted_Transcript_Claims_FIXED.csv'


### Sampling and Viewing Random Extracted Claims from Transcript Data


In [75]:
import ast

# Loading the cleaned claims file
df = pd.read_csv("Extracted_Transcript_Claims_FIXED.csv")

# Filtering to rows that actually have claims
df_with_claims = df[df["Num_Claims"] > 0].reset_index(drop=True)

# Picking a random row
random_row = df_with_claims.sample(1).iloc[0]

# Extracting data
video_id = random_row["Video_ID"]
# Convert the stringified list to a Python list without executing arbitrary code
claims = ast.literal_eval(random_row["Extracted_Claims"])
num_claims = len(claims)

# Print results
print(f"Video ID: {video_id}")
print(f"Number of Claims: {num_claims}")
print("\n Extracted Claims:")
for i, claim in enumerate(claims, 1):
    print(f"{i}. {claim}")


Video ID: hPmf0oSGZtM
Number of Claims: 3

 Extracted Claims:
1. Manchester City is accused of failing to provide accurate financial information across nine seasons.', "Manchester City is accused of breaking UEFA's Financial Fair Play rules across five seasons.", "Manchester City is accused of breaking the Premier League's profitability and sustainability rules across three seasons.", 'The Premier League investigation into Manchester City has been ongoing for almost six years
2. The case against Manchester City started due to a young Portuguese computer expert hacking into emails of clubs and agents across Europe, with the information ending up in the hands of a German newspaper
3. The punishment for Manchester City, if found guilty, could be unlimited, including potential relegation, expulsion from the Premier League, or fines.', "The hearing regarding Manchester City's case is being held in secret, with only lawyers allowed to attend.", 'The hearing is expected to take about 10 weeks
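
In the sample above, items 1 and 3 each contain several claims fused together. This appears to happen because Python writes strings that contain an apostrophe (such as "UEFA's") in double quotes, so splitting on `', '` misses those boundaries. A possible fix, sketched here under the assumption that the CSV cell holds a valid Python list literal, is to parse the cell with `ast.literal_eval` instead of splitting by hand. `parse_claim_list` is a hypothetical replacement for `parse_and_clean_claims`, not the function used to produce the output above.

```python
import ast


def parse_claim_list(cell, min_words=5):
    """Parse a stringified Python list of claims and drop very short entries."""
    if not isinstance(cell, str) or cell.strip() in {"", "[]", "None"}:
        return []
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # the cell is not a valid Python literal
    if not isinstance(items, list):
        return []
    # Trim surrounding whitespace and periods, as in the original cleaning step
    cleaned = [str(item).strip().strip(". ") for item in items]
    return [claim for claim in cleaned if len(claim.split()) >= min_words]
```

Applied to the `Extracted_Claims` column of `Extracted_Transcript_Claims.csv`, this should yield one list entry per claim, which would also change the `Num_Claims` counts.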