# 02 – Preprocessing

This notebook demonstrates the **NLP preprocessing pipeline**: text cleaning, tokenization, and stopword removal, and shows how the cleaned text feeds into TF-IDF in the next step.

**Goals:**
- Run the same cleaning as `src/preprocess.py` (lowercasing, URLs/HTML removed, non-alpha removed).
- Show tokenization and stopword removal (English).
- Compare raw vs cleaned text on a few samples.

In [None]:
import os
import re
import pandas as pd
from IPython.display import display

def find_project_root(start_dir):
    cur = os.path.abspath(start_dir)
    while True:
        if os.path.isdir(os.path.join(cur, "data")) and os.path.isdir(os.path.join(cur, "src")):
            return cur
        parent = os.path.dirname(cur)
        if parent == cur:
            raise FileNotFoundError("Run Jupyter from inside misinformation-detection-engine.")
        cur = parent

PROJECT_ROOT = find_project_root(os.getcwd())
PROCESSED_PATH = os.path.join(PROJECT_ROOT, "data", "processed", "processed_fake_news.csv")
RAW_PATH = os.path.join(PROJECT_ROOT, "data", "raw", "fake_news.csv")

if os.path.exists(PROCESSED_PATH):
    df = pd.read_csv(PROCESSED_PATH)
    print("Loaded processed dataset. Columns:", list(df.columns))
else:
    raise FileNotFoundError(f"Processed data not found at {PROCESSED_PATH}. Run: python src/preprocess.py")

## Text cleaning (same logic as `src/preprocess.py`)

- Lowercasing
- Remove URLs and HTML tags
- Remove non-alphabetic characters
- Collapse multiple spaces

In [None]:
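def clean_text(text):
    """Lowercase, strip URLs/HTML tags, keep letters only, collapse whitespace."""
    if not isinstance(text, str):
        return ""  # NaN or other non-string values become empty text
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<.*?>", " ", text)             # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)          # digits, punctuation, non-ASCII letters
    text = re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace
    return text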

# Show before/after on a few rows
text_col = "clean_text" if "clean_text" in df.columns else "text"
sample = df.head(3)
for _, row in sample.iterrows():
    # Fall back to the cleaned column if the raw "text" column is missing.
    raw = row["text"] if "text" in row.index else row[text_col]
    # Reuse the stored cleaned text when available; otherwise clean on the fly.
    cleaned = row[text_col] if text_col == "clean_text" else clean_text(raw)
    print("RAW (first 200 chars):", str(raw)[:200])
    print("CLEAN:", str(cleaned)[:200])
    print("---")
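
As a quick check on the steps above, here is a minimal sketch that runs the same four substitutions on a made-up headline. The input string is hypothetical, not a row from the dataset, and the function is a copy of the notebook's `clean_text` so the snippet runs on its own.

```python
import re

def clean_text(text):
    # Copy of the notebook's cleaning function, repeated so this cell is self-contained.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<.*?>", " ", text)             # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)          # anything that is not a-z or whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical input, chosen to exercise every rule once.
example = "Breaking: <b>Senator</b> wins 52% of votes! See http://example.com/story now"
cleaned_example = clean_text(example)
print(cleaned_example)
# breaking senator wins of votes see now
```

Note that `52%` disappears entirely: the letters-only rule discards every digit, so numeric claims in an article never reach the model.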

## Tokenization & stopwords

TF-IDF in `src/train.py` uses `TfidfVectorizer(..., stop_words="english")`. The cell below approximates that step for illustration, using a whitespace split and scikit-learn's built-in English stopword list.

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

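tcol = "clean_text" if "clean_text" in df.columns else "text"
sample_text = df[tcol].iloc[0]
if tcol == "text":
    # Raw text: lowercase and strip punctuation so stopword matching works.
    sample_text = clean_text(sample_text)
elif not isinstance(sample_text, str):
    # An empty cleaned row is read back from CSV as NaN.
    sample_text = ""
tokens = sample_text.split()
tokens_no_stop = [t for t in tokens if t not in ENGLISH_STOP_WORDS]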
print("Sample text (first 150 chars):", sample_text[:150])
print("\nToken count (all):", len(tokens))
print("Token count (no stopwords):", len(tokens_no_stop))
print("First 20 tokens (no stop):", tokens_no_stop[:20])
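
The whitespace split above is only an approximation of what the vectorizer does. As a rough sketch, assuming `src/train.py` leaves `token_pattern` at scikit-learn's documented default, the difference shows up on single-letter tokens, which the default pattern drops. The sample string is made up.

```python
import re

# scikit-learn's documented default token_pattern for TfidfVectorizer.
# Whether src/train.py overrides it is an assumption to verify.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical cleaned text, e.g. "U.S. economy grew in Q3" after clean_text.
cleaned = "u s economy grew in q"

whitespace_tokens = cleaned.split()
pattern_tokens = re.findall(DEFAULT_TOKEN_PATTERN, cleaned)

print(whitespace_tokens)  # ['u', 's', 'economy', 'grew', 'in', 'q']
print(pattern_tokens)     # ['economy', 'grew', 'in']
```

Stopword filtering runs after tokenization, so `in` would still be removed in a later step. The token counts printed by the cell above can therefore be slightly higher than what TF-IDF actually sees.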

## Summary

- Cleaning: lowercase, no URLs/HTML, letters only, collapsed spaces.
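- Tokenization: whitespace split here, for illustration only. `TfidfVectorizer` tokenizes internally with its own regex, which by default keeps only tokens of two or more characters.
- Stopwords: English stopwords are removed inside TF-IDF via `stop_words="english"`.
- Output: processed text is saved in `data/processed/` for training.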