# 02 — Clean comments and make train/val/test splits

We’ll:

1. load the raw CSV from step 01,
2. lightly clean text + keep **English-only**,
3. carry forward the **video-level weak labels** and target distributions,
4. split into **train / val / test**,
5. save the cleaned and split CSVs for modeling.

---

## A) Setup & load raw data

**What this cell does:** imports, points to files, sets deterministic language detection, and loads `yt_comments_raw.csv`.

In [1]:
from pathlib import Path
import re
import pandas as pd
import numpy as np
from langdetect import detect, DetectorFactory, LangDetectException
from sklearn.model_selection import train_test_split

# deterministic langdetect
DetectorFactory.seed = 42

DATA_DIR = Path("../data")
RAW_CSV  = DATA_DIR / "yt_comments_raw.csv"
CLEAN_CSV = DATA_DIR / "yt_comments_clean.csv"

SENT_TRAIN = DATA_DIR / "sentiment_train.csv"
SENT_VAL   = DATA_DIR / "sentiment_val.csv"
SENT_TEST  = DATA_DIR / "sentiment_test.csv"

df = pd.read_csv(RAW_CSV)
print(df.shape)
df.head(3)

(2406, 14)


| | platform | video_id | comment_id | parent_id | author | text | like_count | published_at | updated_at | source_url | video_label | target_neg | target_neu | target_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | youtube | iV46TJKL8cU | UgwagL2KD03y4uI2XtB4AaABAg | | @DrunkMonkey-gx8oy | Disney - “Coming to a theatre near you” Me - I... | 97627 | 2024-12-04T23:15:53Z | 2024-12-04T23:15:53Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |
| 1 | youtube | iV46TJKL8cU | UgyoITtGRyj1j8jx4pt4AaABAg | | @MrH1pster | This movie is magical. When I closed the tab, ... | 12355 | 2025-01-25T11:40:17Z | 2025-01-27T08:07:31Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |
| 2 | youtube | iV46TJKL8cU | UgxlVrcG-NT8qt1Z5Bl4AaABAg | | @redbearddan2000 | I'll give Disney some credit. They are brave e... | 187153 | 2024-12-03T17:52:05Z | 2024-12-03T17:52:05Z | https://www.youtube.com/watch?v=iV46TJKL8cU | negative | 0.8 | 0.15 | 0.05 |


---

## B) Minimal text cleaning + language check helpers

**What this cell does:** removes URLs/extra spaces; defines an `is_english` guard that’s robust to short/odd strings.

In [2]:
URL_RE   = re.compile(r"https?://\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")

def clean_text(s: str) -> str:
    if not isinstance(s, str):
        s = str(s)
    s = s.replace("\u200b", " ")
    s = URL_RE.sub(" ", s)
    s = SPACE_RE.sub(" ", s)
    return s.strip()

def is_english(s: str) -> bool:
    try:
        # quick guard for very short strings
        if len(s) < 3:
            return False
        lang = detect(s)
        return lang == "en"
    except LangDetectException:
        return False
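
A quick sanity check shows what `clean_text` keeps and what it drops. The sketch below repeats the helper so it runs on its own; the sample strings are made up for illustration and are not taken from the dataset.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")

def clean_text(s) -> str:
    # Same logic as the helper above, repeated so this block runs standalone.
    if not isinstance(s, str):
        s = str(s)
    s = s.replace("\u200b", " ")   # zero-width space -> regular space
    s = URL_RE.sub(" ", s)         # drop URLs
    s = SPACE_RE.sub(" ", s)       # collapse runs of whitespace
    return s.strip()

# Hypothetical sample inputs (assumptions for illustration only).
samples = [
    "Check this out: https://youtu.be/abc123   so\u200bgood!\n",
    "see www.example.com, thanks",
    float("nan"),
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two behaviours are worth knowing. The URL pattern runs to the next whitespace, so punctuation glued to a link (the comma in the second sample) is removed with it. Non-string input is stringified, so a missing value passed straight to `clean_text` comes back as the literal text `nan`, which is 3 characters long and would pass a `len(s) >= 3` check.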

---

## C) Keep top-level comments, clean, and filter to English

**What this cell does:**

* keeps top-level comments only (already true after step 01, but we keep it explicit),
* applies `clean_text`,
* deduplicates by `comment_id`,
* drops comments shorter than 3 characters after cleaning,
* filters to English comments.

In [3]:
df = df.copy()

# keep only top-level
df = df[df["parent_id"].isna()].copy()

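# clean text (missing values become empty strings rather than the literal "nan")
df["text"] = df["text"].fillna("").astype(str).map(clean_text)
# drop duplicate comment_ids and near-empty text
df = df.drop_duplicates(subset=["comment_id"]).reset_index(drop=True)
df = df[df["text"].str.len() >= 3].copy()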

# English filter
mask_en = df["text"].map(is_english)
df_en = df[mask_en].reset_index(drop=True)

print("Rows after cleaning/dedup:", len(df))
print("English rows:", len(df_en))
df_en["video_id"].value_counts()

Rows after cleaning/dedup: 2401
English rows: 2089


video_id
iV46TJKL8cU    1119
n0OFH4xpPr4     970
Name: count, dtype: int64

---

## D) Preserve weak labels & targets; prepare training columns

**What this cell does:**

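* checks that `video_label` is present and raises an error if it is not,
* carries forward `target_neg/neu/pos`; if they are missing, fills defaults from `video_label` (0.80/0.15/0.05 for negative, 0.05/0.15/0.80 for positive, 0.20/0.60/0.20 otherwise),
* creates the final columns used for training (`sentiment` is a copy of `video_label`).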

In [4]:
# Ensure expected columns exist (from step 01)
if "video_label" not in df_en.columns:
    raise ValueError("video_label column missing. Re-run 01_fetch_youtube_comments to include video-level labels.")

# Keep only necessary columns for sentiment training + helpful metadata
keep_cols = [
    "text","video_id","video_label","like_count","source_url",
    "target_neg","target_neu","target_pos"
]
# target_* may be missing if you didn't add them in step 01; keep whichever columns exist
df_en = df_en.reindex(columns=[c for c in keep_cols if c in df_en.columns])

target_cols = ["target_neg","target_neu","target_pos"]
if not set(target_cols).issubset(df_en.columns):
    # default targets by video_label (anything that is not negative/positive is treated as neutral)
    defaults = {"negative": [0.80,0.15,0.05], "positive": [0.05,0.15,0.80]}
    neutral = [0.20,0.60,0.20]
    # Build the frame directly: Series.map with a Series-returning function
    # yields a column of Series objects, not three numeric columns.
    targets = pd.DataFrame(
        [defaults.get(v, neutral) for v in df_en["video_label"]],
        columns=target_cols, index=df_en.index,
    )
    df_en = pd.concat([df_en.drop(columns=target_cols, errors="ignore"), targets], axis=1)

# Map weak label to the 'sentiment' column expected by the training notebook
df_en["sentiment"] = df_en["video_label"].astype(str)
df_en = df_en[["text","sentiment","video_id","target_neg","target_neu","target_pos","like_count","source_url"]]

df_en.head(5)

| | text | sentiment | video_id | target_neg | target_neu | target_pos | like_count | source_url |
|---|---|---|---|---|---|---|---|---|
| 0 | Disney - “Coming to a theatre near you” Me - I... | negative | iV46TJKL8cU | 0.8 | 0.15 | 0.05 | 97627 | https://www.youtube.com/watch?v=iV46TJKL8cU |
| 1 | This movie is magical. When I closed the tab, ... | negative | iV46TJKL8cU | 0.8 | 0.15 | 0.05 | 12355 | https://www.youtube.com/watch?v=iV46TJKL8cU |
| 2 | I'll give Disney some credit. They are brave e... | negative | iV46TJKL8cU | 0.8 | 0.15 | 0.05 | 187153 | https://www.youtube.com/watch?v=iV46TJKL8cU |
| 3 | I finally found it. The one video I will never... | negative | iV46TJKL8cU | 0.8 | 0.15 | 0.05 | 577 | https://www.youtube.com/watch?v=iV46TJKL8cU |
| 4 | If i saw this movie on a plane. I would still ... | negative | iV46TJKL8cU | 0.8 | 0.15 | 0.05 | 161137 | https://www.youtube.com/watch?v=iV46TJKL8cU |
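
Since the three `target_*` columns are meant to be a probability distribution per comment, it can be worth confirming that each row sums to 1 before training. The sketch below is one possible check, not part of the original pipeline; `bad_target_rows` and the example rows are hypothetical names and values introduced for illustration.

```python
import math

TARGET_COLS = ("target_neg", "target_neu", "target_pos")

def bad_target_rows(rows, tol=1e-6):
    """Return indices of rows whose targets are not a valid 3-way distribution."""
    bad = []
    for i, row in enumerate(rows):
        values = [float(v) for v in row]
        in_range = all(0.0 <= v <= 1.0 for v in values)   # NaN fails this test
        sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
        if len(values) != len(TARGET_COLS) or not (in_range and sums_to_one):
            bad.append(i)
    return bad

# Illustrative rows: the three defaults used in this notebook plus one broken row.
example_rows = [
    (0.80, 0.15, 0.05),   # negative video
    (0.05, 0.15, 0.80),   # positive video
    (0.20, 0.60, 0.20),   # anything else
    (0.80, 0.15, 0.15),   # deliberately invalid: sums to 1.10
]
print(bad_target_rows(example_rows))
```

On the real data, the same function could be called as `bad_target_rows(df_en[list(TARGET_COLS)].itertuples(index=False, name=None))`; an empty list means every row is a valid distribution.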


---

## E) Split into train / val / test (video-aware when possible)

**What this cell does:**

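* If we have **≥3 videos**, we split **by video_id**, so no video contributes comments to more than one split (prevents leakage).
* If fewer (your current case, with 2 videos), we fall back to **comment-level stratified** splits so you can keep moving. Comments from the same video then land in all three splits, and because the labels are video-level, val/test scores will likely look better than they would on unseen videos.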

In [5]:
rng = 42
vids = df_en["video_id"].unique().tolist()

def split_by_video(frame: pd.DataFrame, train_p=0.7, val_p=0.15, test_p=0.15):
    # split on unique video ids
    vids = frame["video_id"].unique()
    n = len(vids)
    if n >= 3:
        # ~70/15/15 by video. Integer sizes guarantee at least one video in val and test;
        # with fractional sizes the second split fails when n == 3 (one video left to divide).
        n_val = max(1, round(n * val_p))
        n_test = max(1, round(n * test_p))
        vid_train, vid_tmp = train_test_split(vids, test_size=n_val + n_test, random_state=rng)
        vid_val, vid_test = train_test_split(vid_tmp, test_size=n_test, random_state=rng)
        return (
            frame[frame.video_id.isin(vid_train)].reset_index(drop=True),
            frame[frame.video_id.isin(vid_val)].reset_index(drop=True),
            frame[frame.video_id.isin(vid_test)].reset_index(drop=True),
        )
    else:
        # Fallback: comment-level 70/15/15 splits (keeps your project moving with 2 videos)
        tr, tmp = train_test_split(frame, test_size=0.3, random_state=rng, stratify=frame["sentiment"])
        va, te = train_test_split(tmp, test_size=0.5, random_state=rng, stratify=tmp["sentiment"])
        return tr.reset_index(drop=True), va.reset_index(drop=True), te.reset_index(drop=True)

tr_df, va_df, te_df = split_by_video(df_en)

for name, part in [("train",tr_df),("val",va_df),("test",te_df)]:
    print(name, part.shape, "label counts:\n", part["sentiment"].value_counts())

train (1462, 8) label counts:
 sentiment
negative    783
positive    679
Name: count, dtype: int64
val (313, 8) label counts:
 sentiment
negative    168
positive    145
Name: count, dtype: int64
test (314, 8) label counts:
 sentiment
negative    168
positive    146
Name: count, dtype: int64
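
To see whether a given split leaks videos across partitions, one option is to compare the sets of `video_id` values in each split. The helper below is a minimal sketch using plain Python sets; `shared_video_ids` and the example ids are hypothetical and only illustrate the idea.

```python
def shared_video_ids(splits):
    """Map each pair of split names to the set of video ids they have in common."""
    names = list(splits)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = set(splits[a]) & set(splits[b])
            if common:
                overlaps[(a, b)] = common
    return overlaps

# Made-up ids: a video-level split shares nothing, a comment-level split does.
video_level = {"train": ["v1", "v1", "v2"], "val": ["v3"], "test": ["v4", "v4"]}
comment_level = {"train": ["v1", "v2"], "val": ["v1"], "test": ["v2"]}

print(shared_video_ids(video_level))
print(shared_video_ids(comment_level))
```

With the frames from this notebook, the call would be `shared_video_ids({"train": tr_df["video_id"], "val": va_df["video_id"], "test": te_df["video_id"]})`. An empty dict means the split is video-disjoint. With the two-video fallback, overlaps are expected, which is a reminder that the test score then measures performance on new comments from already seen videos.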


---

## F) Save cleaned data + splits

**What this cell does:** writes the cleaned English-only data and the three splits to disk for the fine-tuning notebook.

In [6]:
# Save cleaned all-English comments (for inspection)
df_en.to_csv(CLEAN_CSV, index=False, encoding="utf-8")

# Save splits for sentiment training notebook (expects 'text' + 'sentiment')
tr_df.to_csv(SENT_TRAIN, index=False, encoding="utf-8")
va_df.to_csv(SENT_VAL,   index=False, encoding="utf-8")
te_df.to_csv(SENT_TEST,  index=False, encoding="utf-8")

CLEAN_CSV, SENT_TRAIN, SENT_VAL, SENT_TEST

(WindowsPath('../data/yt_comments_clean.csv'),
 WindowsPath('../data/sentiment_train.csv'),
 WindowsPath('../data/sentiment_val.csv'),
 WindowsPath('../data/sentiment_test.csv'))
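
Because the training notebook expects `text` and `sentiment` columns, a read-back check on the saved files can catch a bad write early. The sketch below uses only the standard library and runs against a throwaway file; `check_split_csv` is a hypothetical helper, not something defined elsewhere in this project.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED = {"text", "sentiment"}

def check_split_csv(path, required=REQUIRED):
    """Return (row_count, missing_columns) for a saved split file."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        missing = set(required) - set(reader.fieldnames or [])
        n_rows = sum(1 for _ in reader)   # counts data rows, header excluded
    return n_rows, missing

# Demo on a temporary file so nothing in ../data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_split.csv"
    demo.write_text(
        'text,sentiment\n"great, loved it",positive\nawful,negative\n',
        encoding="utf-8",
    )
    result = check_split_csv(demo)

print(result)
```

Running `check_split_csv(SENT_TRAIN)` (and likewise for the val and test paths) should return the same row counts printed in section E and an empty set of missing columns.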

---

## ✅ Wrap-up: What we just did (02_clean_and_split)

**TL;DR:** We took the raw comments, cleaned them up, filtered to **English**, carried forward your **video-level weak labels** and targets, split into train/val/test, and saved everything for model training.

* **Cleaned file:** `ml/data/yt_comments_clean.csv`
* **Splits for training:**
  * `ml/data/sentiment_train.csv`
  * `ml/data/sentiment_val.csv`
  * `ml/data/sentiment_test.csv`