# 01 - Fetch YouTube Comments as Training/Fine-Tuning Data

This notebook:
* reads a YouTube API key from `.env`
* fetches **top-level comments only** (no replies) for two videos
* adds a **video-level weak label** (`video_label`) and simple target distribution columns (for later weak supervision)
* saves to `ml/data/yt_comments_raw.csv`

---

## A) Setup & configuration

### A1. Imports & paths

In [1]:
import os, re
from pathlib import Path
from typing import Dict, List, Optional

import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
from googleapiclient.discovery import build

# Notebook & data paths
NB_DIR = Path.cwd()
DATA_DIR = (NB_DIR.parent / "data").resolve()
DATA_DIR.mkdir(parents=True, exist_ok=True)
OUT_CSV = DATA_DIR / "yt_comments_raw.csv"

print("Data directory:", DATA_DIR)
print("Output file   :", OUT_CSV)

Data directory: C:\My Github Profile\yt-comment-analyzer\ml\data
Output file   : C:\My Github Profile\yt-comment-analyzer\ml\data\yt_comments_raw.csv


### A2. Load API key

In [2]:
# Load .env if present and read key from env
load_dotenv()
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY", "").strip()
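# Raise explicitly instead of asserting: assertions are skipped under `python -O`
if not YOUTUBE_API_KEY:
    raise RuntimeError("❌ Set YOUTUBE_API_KEY in your environment or .env file")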
print("API key loaded ✓")

API key loaded ✓


---

## B) Inputs

### B1. Helper: extract video ID

In [3]:
# Helper: extract video id from url
def extract_video_id(url: str) -> str:
    m = re.search(r"(?:v=|youtu\.be/|shorts/)([A-Za-z0-9_-]{6,})", url)
    if not m:
        raise ValueError(f"Could not parse a video ID from URL: {url}")
    return m.group(1)
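
The regex above accepts any run of six or more ID characters after `v=`, `youtu.be/`, or `shorts/`, and it does not cover `/embed/` links. As an illustration only, a stricter variant could parse the URL with the standard library. `parse_video_id`, `YOUTUBE_HOSTS`, and `ID_PATTERN` are hypothetical names that are not part of this notebook, and the 11-character length is an assumption based on how current video IDs look rather than something the API guarantees.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical constants for this sketch
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")  # assumption: IDs are 11 characters


def parse_video_id(url: str) -> str:
    """Hypothetical stricter alternative to extract_video_id."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]

    candidate = ""
    if host == "youtu.be" and segments:
        # Short links carry the ID as the first path segment
        candidate = segments[0]
    elif host in YOUTUBE_HOSTS:
        if parts.path == "/watch":
            # Watch pages carry the ID in the `v` query parameter
            candidate = parse_qs(parts.query).get("v", [""])[0]
        elif len(segments) >= 2 and segments[0] in {"shorts", "embed"}:
            candidate = segments[1]

    if not ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"Could not parse a video ID from URL: {url}")
    return candidate


print(parse_video_id("https://www.youtube.com/watch?v=iV46TJKL8cU&t=2s"))
print(parse_video_id("https://youtu.be/n0OFH4xpPr4"))
```

Checking the host first means a `v=` parameter on an unrelated site is rejected instead of being silently accepted.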

### B2. Video list + coarse labels

In [4]:
# Videos to fetch, each with a coarse video-level label for later use
VIDEO_URLS = [
    {"url": "https://www.youtube.com/watch?v=iV46TJKL8cU&t=2s", "video_label": "negative"},
    {"url": "https://www.youtube.com/watch?v=n0OFH4xpPr4",      "video_label": "positive"},
]
for v in VIDEO_URLS:
    v["video_id"] = extract_video_id(v["url"])

VIDEO_URLS

[{'url': 'https://www.youtube.com/watch?v=iV46TJKL8cU&t=2s',
  'video_label': 'negative',
  'video_id': 'iV46TJKL8cU'},
 {'url': 'https://www.youtube.com/watch?v=n0OFH4xpPr4',
  'video_label': 'positive',
  'video_id': 'n0OFH4xpPr4'}]

---

## C) Fetcher

### C1. API client

In [5]:
def yt_client(api_key: str):
    return build("youtube", "v3", developerKey=api_key)

### C2. Fetch top-level comments only

In [6]:
def fetch_top_level_comments(
    api_key: str,
    video_id: str,
    max_total: int = 3000,
    order: str = "relevance",
) -> pd.DataFrame:
    """
    Fetch ONLY top-level comments (no replies) for a video.

    Stops after `max_total` comments or when the API returns no further page.
    Returns columns:
      platform, video_id, comment_id, parent_id (always None), author, text,
      like_count, published_at, updated_at, source_url
    """
    youtube = yt_client(api_key)
    out = []
    page = None
    fetched = 0

    with tqdm(desc=f"Fetching top-level comments for {video_id}", unit="item") as pbar:
        while True:
            req = youtube.commentThreads().list(
                part="snippet",              # no 'replies' part -> excludes replies entirely
                videoId=video_id,
                maxResults=100,
                pageToken=page,
                textFormat="plainText",
                order=order,                 # "relevance" or "time"
            )
            res = req.execute()
            items = res.get("items", [])

            for it in items:
                top = it["snippet"]["topLevelComment"]
                sn  = top["snippet"]
                out.append({
                    "platform": "youtube",
                    "video_id": video_id,
                    "comment_id": top["id"],
                    "parent_id": None,
                    "author": sn.get("authorDisplayName",""),
                    "text": str(sn.get("textOriginal") or sn.get("textDisplay","")),
                    "like_count": int(sn.get("likeCount",0) or 0),
                    "published_at": sn.get("publishedAt",""),
                    "updated_at": sn.get("updatedAt",""),
                    "source_url": f"https://www.youtube.com/watch?v={video_id}",
                })
                fetched += 1
                pbar.update(1)
                if fetched >= max_total:
                    break

            if fetched >= max_total:
                break
            page = res.get("nextPageToken")
            if not page:
                break

    df = pd.DataFrame(out)
    if not df.empty:
        df["text"] = df["text"].astype(str).str.replace(r"\s+"," ", regex=True).str.strip()
        df = df.drop_duplicates(subset=["comment_id"]).reset_index(drop=True)
    return df
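
The loop above follows `nextPageToken` until the API stops returning one or `max_total` is reached. That control flow can be exercised offline with a stand-in for the API call. `fake_list_page` and `collect` below are hypothetical helpers written for this illustration; they do not call the YouTube API, and the three-page, 100-items-per-page shape is made up.

```python
from typing import Callable, Dict, List, Optional


def fake_list_page(page_token: Optional[str]) -> Dict:
    """Stand-in for commentThreads().list(...).execute(): 3 pages of 100 fake IDs."""
    page_no = int(page_token or 0)
    items = [f"c{page_no * 100 + i}" for i in range(100)]
    next_token = str(page_no + 1) if page_no < 2 else None
    return {"items": items, "nextPageToken": next_token}


def collect(list_page: Callable[[Optional[str]], Dict], max_total: int) -> List[str]:
    """Follow nextPageToken until the pages run out or max_total items are collected."""
    out: List[str] = []
    token: Optional[str] = None
    while True:
        res = list_page(token)
        for item in res.get("items", []):
            out.append(item)
            if len(out) >= max_total:
                return out  # cap reached, possibly in the middle of a page
        token = res.get("nextPageToken")
        if not token:
            return out  # no further page


print(len(collect(fake_list_page, max_total=250)))     # 250
print(len(collect(fake_list_page, max_total=10_000)))  # 300
```

The second call shows why a large `max_total` does not guarantee that many rows: the loop ends as soon as the response carries no further page token.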

---

## D) Run fetch & save

### D1. Fetch both videos, attach weak labels & targets

In [7]:
frames = []
for v in VIDEO_URLS:
    df_v = fetch_top_level_comments(YOUTUBE_API_KEY, v["video_id"], max_total=10000, order="relevance")
    df_v["video_label"] = v["video_label"]         # coarse weak label (per video)

    # Optional target distribution per video (for weak/bag training later)
    if v["video_label"] == "negative":
        df_v[["target_neg","target_neu","target_pos"]] = [0.80, 0.15, 0.05]
    elif v["video_label"] == "positive":
        df_v[["target_neg","target_neu","target_pos"]] = [0.05, 0.15, 0.80]
    else:
        df_v[["target_neg","target_neu","target_pos"]] = [0.20, 0.60, 0.20]

    frames.append(df_v)

df_all = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

Fetching top-level comments for iV46TJKL8cU: 1240item [00:04, 282.40item/s]
Fetching top-level comments for n0OFH4xpPr4: 1166item [00:04, 266.57item/s]
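
The if/elif chain in the cell above amounts to a lookup from `video_label` to a three-way distribution. A minimal sketch of that lookup as a table follows, with a check that every row sums to 1. `TARGETS`, `DEFAULT_TARGET`, and `targets_for` are hypothetical names; the values are copied from the cell above.

```python
import math
from typing import Dict, Tuple

Dist = Tuple[float, float, float]  # (target_neg, target_neu, target_pos)

# Hypothetical lookup table mirroring the branches in the fetch cell
TARGETS: Dict[str, Dist] = {
    "negative": (0.80, 0.15, 0.05),
    "positive": (0.05, 0.15, 0.80),
}
DEFAULT_TARGET: Dist = (0.20, 0.60, 0.20)  # fallback for any other label


def targets_for(video_label: str) -> Dist:
    """Return the target distribution for a video-level label."""
    return TARGETS.get(video_label, DEFAULT_TARGET)


# Each distribution should be a valid probability vector
for dist in [*TARGETS.values(), DEFAULT_TARGET]:
    assert math.isclose(sum(dist), 1.0), dist

print(targets_for("negative"))
print(targets_for("mixed"))  # unknown label falls back to DEFAULT_TARGET
```

Keeping the values in one table makes it harder for a new label to be added without a matching distribution.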


### D2. Reorder columns, quick sanity checks, and save

In [8]:
# Reorder columns for consistency
cols = [
    "platform","video_id","comment_id","parent_id","author","text","like_count",
    "published_at","updated_at","source_url","video_label","target_neg","target_neu","target_pos"
]
df_all = df_all.reindex(columns=cols)

print("Counts per video_id:")
display(df_all.groupby("video_id")["comment_id"].count())

print("Sample rows:")
display(df_all.head(5))

# Save to OUT_CSV as UTF-8, without the DataFrame index
df_all.to_csv(OUT_CSV, index=False, encoding="utf-8")
OUT_CSV, len(df_all)

Counts per video_id:


video_id
iV46TJKL8cU    1240
n0OFH4xpPr4    1166
Name: comment_id, dtype: int64

Sample rows:


| | platform | video_id | comment_id | parent_id | author | text | like_count | published_at | updated_at | source_url | video_label | target_neg | target_neu | target_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | youtube | iV46TJKL8cU | UgwagL2KD03y4uI2XtB4AaABAg | | @DrunkMonkey-gx8oy | Disney - “Coming to a theatre near you” Me - I... | 97627 | 2024-12-04T23:15:53Z | 2024-12-04T23:15:53Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |
| 1 | youtube | iV46TJKL8cU | UgyoITtGRyj1j8jx4pt4AaABAg | | @MrH1pster | This movie is magical. When I closed the tab, ... | 12355 | 2025-01-25T11:40:17Z | 2025-01-27T08:07:31Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |
| 2 | youtube | iV46TJKL8cU | UgxlVrcG-NT8qt1Z5Bl4AaABAg | | @redbearddan2000 | I'll give Disney some credit. They are brave e... | 187153 | 2024-12-03T17:52:05Z | 2024-12-03T17:52:05Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |
| 3 | youtube | iV46TJKL8cU | UgxRt_KQt6VI1iUZ70N4AaABAg | | @Xerø2846 | I finally found it. The one video I will never... | 577 | 2025-07-17T13:15:09Z | 2025-07-17T13:15:09Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |
| 4 | youtube | iV46TJKL8cU | Ugx-hHWXISiSKqmpLbR4AaABAg | | @stephenkrzanowski | If i saw this movie on a plane. I would still ... | 161137 | 2024-12-03T18:22:29Z | 2024-12-03T18:22:29Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |


(WindowsPath('C:/My Github Profile/yt-comment-analyzer/ml/data/yt_comments_raw.csv'),
 2406)
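
Before the file is handed to the next notebook, a quick read-back check could confirm the schema and the invariants the fetch is meant to guarantee. The sketch below uses only the standard library and a two-row in-memory sample with made-up IDs. `validate_rows` and `EXPECTED_COLS` are hypothetical names, and running the same check against the real CSV is an assumption about how it would be used.

```python
import csv
import io
import math

EXPECTED_COLS = [
    "platform", "video_id", "comment_id", "parent_id", "author", "text", "like_count",
    "published_at", "updated_at", "source_url", "video_label",
    "target_neg", "target_neu", "target_pos",
]


def validate_rows(rows, fieldnames):
    """Return a list of problems found; an empty list means the checks passed."""
    if list(fieldnames) != EXPECTED_COLS:
        return ["unexpected columns"]
    problems = []
    ids = [r["comment_id"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate comment_id")
    for r in rows:
        total = sum(float(r[c]) for c in ("target_neg", "target_neu", "target_pos"))
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            problems.append(f"targets do not sum to 1 for {r['comment_id']}")
    return problems


# Build a tiny made-up CSV in memory; missing fields are written as empty strings
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=EXPECTED_COLS)
writer.writeheader()
for cid, label, (neg, neu, pos) in [
    ("id_a", "negative", (0.80, 0.15, 0.05)),
    ("id_b", "positive", (0.05, 0.15, 0.80)),
]:
    writer.writerow({
        "platform": "youtube", "video_id": "vid_x", "comment_id": cid,
        "video_label": label,
        "target_neg": neg, "target_neu": neu, "target_pos": pos,
    })

sample.seek(0)
reader = csv.DictReader(sample)
rows = list(reader)
problems = validate_rows(rows, reader.fieldnames)
print(problems)
```

For the real file, the same function could be fed from `csv.DictReader(open(OUT_CSV, encoding="utf-8", newline=""))`.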

---

# ✅ Wrap-up: What we just did

**TL;DR:** We grabbed **top-level** comments (no replies) from two YouTube videos, tagged each video with a rough “vibe” (negative/positive), and saved everything to a clean CSV for the next steps.

---

## What we set up
- Pulled the YouTube API key from `.env` ✅  
- Decided where to save things: `ml/data/yt_comments_raw.csv`

## What we fetched
- Two videos:
  - **Negative:** https://www.youtube.com/watch?v=iV46TJKL8cU&t=2s  
  - **Positive:** https://www.youtube.com/watch?v=n0OFH4xpPr4
- We extracted each `video_id` from the URLs and used the YouTube Data API (commentThreads) to fetch **top-level comments only** (no replies).

## How we cleaned it (light touch)
- Kept key fields: `author`, `text`, `like_count`, the `published_at` and `updated_at` timestamps, `source_url`, etc.
- Collapsed extra whitespace and dropped duplicate `comment_id`s.
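
The two cleaning steps can be sketched without pandas. `clean_comments` is a hypothetical helper that mirrors the whitespace regex and the keep-first de-duplication used in the fetcher; the sample rows are made up.

```python
import re
from typing import Dict, List


def clean_comments(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collapse whitespace runs in `text` and keep the first row per comment_id."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["comment_id"] in seen:
            continue  # a later duplicate is dropped, the first occurrence wins
        seen.add(row["comment_id"])
        text = re.sub(r"\s+", " ", row["text"]).strip()
        cleaned.append({**row, "text": text})
    return cleaned


rows = [
    {"comment_id": "a", "text": "  great\n\nvideo  "},
    {"comment_id": "a", "text": "great video"},
    {"comment_id": "b", "text": "not\tfor me"},
]
print(clean_comments(rows))
# [{'comment_id': 'a', 'text': 'great video'}, {'comment_id': 'b', 'text': 'not for me'}]
```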

## Extra labels for later (weak supervision)
- Added a **video-level label**: `video_label ∈ {negative, positive}`
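- Added simple **target distributions** to guide training later:
  - Negative video → `target_neg=0.80, target_neu=0.15, target_pos=0.05`
  - Positive video → `target_neg=0.05, target_neu=0.15, target_pos=0.80`
  - Any other label → `target_neg=0.20, target_neu=0.60, target_pos=0.20` (fallback in the code, not used for these two videos)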

## What we saved
- **File:** `ml/data/yt_comments_raw.csv`  
- **Columns:**
```
platform, video_id, comment_id, parent_id, author, text, like_count,
published_at, updated_at, source_url, video_label, target_neg, target_neu, target_pos
```