# Stage 1 - Preprocessing
This notebook reads the raw data from **data/raw/**, applies minimal cleaning, adds **VADER** scores and agreement/disagreement heuristics, aggregates metrics per post, and saves **CSV + JSONL** to **data/enriched/**.

## Libraries

In [16]:
# Install the following dependencies from a shell before running the notebook:
# ```bash
# pip install pandas tqdm vaderSentiment regex
# ```

In [17]:
from pathlib import Path
import json, re
import pandas as pd
from tqdm import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Parameters

In [18]:
DATA_DIR = Path("data")          # base data folder
RAW_DIR = DATA_DIR / "raw"       # folder with the raw scraper run
ENRICHED_DIR = DATA_DIR / "enriched"  # folder for enriched data
ENRICHED_DIR.mkdir(parents=True, exist_ok=True)

# Agreement/disagreement heuristics (EN/ES). The lists can be extended.
AGREE_TERMS = [
    r"\bi agree\b", r"\bagree\b", r"\bsupported\b", r"\bsupport this\b", r"\bi support\b", r"\bvalid point\b",
    r"\bde acuerdo\b", r"\bapoyo\b", r"\btiene raz[oó]n\b", r"\bcierto\b", r"\btotalmente de acuerdo\b", r"\bestoy de acuerdo\b",
]
DISAGREE_TERMS = [
    r"\bi disagree\b", r"\bdisagree\b", r"\bnot support\b", r"\boppose\b", r"\bagainst this\b", r"\bbad take\b",
    r"\bno apoyo\b", r"\ben desacuerdo\b", r"\bno estoy de acuerdo\b", r"\bmala idea\b", r"\bme opongo\b",
]

# Precompiled regular expressions
RE_URL = re.compile(r"https?://\S+")
RE_WS = re.compile(r"\s+")

## Utilities

In [19]:
def latest_raw_dir(raw_base: Path) -> Path:
    """Return the last subfolder of raw_base in name order (the most recent run if folder names sort chronologically)."""
    subs = [p for p in raw_base.iterdir() if p.is_dir()]
    if not subs:
        raise FileNotFoundError(f"No subfolders in {raw_base}. Make sure the scraper has been run.")
    return sorted(subs)[-1]


def read_csv_or_jsonl(path_csv: Path, path_jsonl: Path) -> pd.DataFrame:
    """Read a DataFrame from CSV if it exists, otherwise from JSONL."""
    if path_csv.exists():
        return pd.read_csv(path_csv)
    if path_jsonl.exists():
        return pd.read_json(path_jsonl, lines=True)
    raise FileNotFoundError(f"Found neither {path_csv} nor {path_jsonl}")


def minimal_clean_for_vader(text: str) -> str:
    """Minimal cleaning: replaces URLs with the token 'URL' and normalizes whitespace.
    Preserves capitalization, punctuation, and emojis, which VADER uses as signals.
    """
    if not isinstance(text, str):
        return ""
    t = RE_URL.sub("URL", text)
    t = RE_WS.sub(" ", t).strip()
    return t


def count_hits_regex(text: str, patterns: list[str]) -> int:
    """Count how many patterns in 'patterns' match 'text' (case-insensitive)."""
    if not text:
        return 0
    t = text.lower()
    return sum(1 for pat in patterns if re.search(pat, t))
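
To make the behavior of the two text helpers concrete, the sketch below re-declares them in compact form and runs them on made-up strings. The sample texts and the shortened pattern lists (`AGREE_SUBSET`, `DISAGREE_SUBSET`) are illustrative assumptions, not data from the scrape. It also surfaces two caveats of the heuristic: overlapping patterns each count as a hit, and a negated phrase such as "no estoy de acuerdo" still matches agreement patterns.

```python
import re

RE_URL = re.compile(r"https?://\S+")
RE_WS = re.compile(r"\s+")

# Shortened, illustrative subsets of the pattern lists from Parameters.
AGREE_SUBSET = [r"\bi agree\b", r"\bagree\b", r"\bde acuerdo\b", r"\bestoy de acuerdo\b"]
DISAGREE_SUBSET = [r"\bi disagree\b", r"\bdisagree\b", r"\bno estoy de acuerdo\b"]


def minimal_clean_for_vader(text):
    """Replace URLs with 'URL' and collapse whitespace; non-strings become ''."""
    if not isinstance(text, str):
        return ""
    return RE_WS.sub(" ", RE_URL.sub("URL", text)).strip()


def count_hits_regex(text, patterns):
    """Count how many patterns match the lowercased text."""
    if not text:
        return 0
    t = text.lower()
    return sum(1 for pat in patterns if re.search(pat, t))


sample = "I AGREE!!   See https://example.com/x \n great point :)"
cleaned = minimal_clean_for_vader(sample)
print(cleaned)  # I AGREE!! See URL great point :)

# "i agree" and "agree" both match, so a single phrase yields two hits.
print(count_hits_regex(cleaned, AGREE_SUBSET))  # 2

# A negated phrase hits both lists: 2 agreement hits and 1 disagreement hit.
negated = "No estoy de acuerdo"
print(count_hits_regex(negated, AGREE_SUBSET), count_hits_regex(negated, DISAGREE_SUBSET))  # 2 1
```

If these overlaps matter for the analysis, one option would be to check the disagreement patterns first and skip the agreement patterns when any of them match.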

## Loading raw data

In [20]:
posts_csv = RAW_DIR / "posts.csv"
comments_csv = RAW_DIR / "comments.csv"
posts_jsonl = RAW_DIR / "posts.jsonl"
comments_jsonl = RAW_DIR / "comments.jsonl"

posts = read_csv_or_jsonl(posts_csv, posts_jsonl)
comments = read_csv_or_jsonl(comments_csv, comments_jsonl)
print("Shapes RAW:", posts.shape, comments.shape)

# Ensure the basic columns exist
req_posts = ["post_id","title","selftext","author","created","score","num_comments","permalink","is_self","image_urls","subreddit"]
for col in req_posts:
    if col not in posts.columns:
        posts[col] = None

req_comments = ["post_id","comment_id","author","created","score","body"]
for col in req_comments:
    if col not in comments.columns:
        comments[col] = None

Shapes RAW: (200, 11) (15131, 6)
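
The cell above reads directly from `RAW_DIR`. If the scraper instead wrote each run into its own subfolder, the `latest_raw_dir` helper from Utilities could select which one to load. The sketch below assumes hypothetical timestamp-style folder names (such as `20240315_093000`), so that name order matches chronological order, and builds them in a temporary directory rather than touching real data.

```python
import tempfile
from pathlib import Path


def latest_raw_dir(raw_base: Path) -> Path:
    """Return the last subfolder of raw_base in name order."""
    subs = [p for p in raw_base.iterdir() if p.is_dir()]
    if not subs:
        raise FileNotFoundError(f"No subfolders in {raw_base}.")
    return sorted(subs)[-1]


with tempfile.TemporaryDirectory() as tmp:
    raw_base = Path(tmp)
    # Hypothetical run folders, created out of order on purpose.
    for name in ["20240101_120000", "20240315_093000", "20240210_180000"]:
        (raw_base / name).mkdir()

    run_dir = latest_raw_dir(raw_base)
    latest_name = run_dir.name
    posts_csv = run_dir / "posts.csv"  # path that would be passed to read_csv_or_jsonl
    print(latest_name)  # 20240315_093000
```

The helper sorts by name, not by modification time, so it only returns the most recent run when the folder naming scheme sorts chronologically.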


## Minimal cleaning for VADER

In [21]:
# Minimal cleaning and features on comments
comments["body"] = comments["body"].fillna("")
comments["body_vader"] = comments["body"].map(minimal_clean_for_vader)
comments["has_url"] = comments["body"].str.contains(r"https?://", na=False)
comments["text_len"] = comments["body"].str.len().fillna(0)
comments["author_deleted"] = comments["author"].isna() | comments["author"].isin(["u/[deleted]"])
comments["is_bot"] = comments["author"].fillna("").str.contains("automoderator", case=False)

# Drop duplicates and reset to a clean index
comments = comments.drop_duplicates(subset=["post_id","comment_id"]).reset_index(drop=True)
posts = posts.drop_duplicates(subset=["post_id"]).reset_index(drop=True)
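
As a small illustration of what this cell produces, the sketch below applies the same feature flags and de-duplication to a toy DataFrame. All rows are invented for the example. It uses `isna()` to catch missing authors, and it assumes the `u/` author prefix implied by the notebook's `"u/[deleted]"` check.

```python
import pandas as pd

# Toy comments: one exact duplicate, one missing author/body, one bot.
toy = pd.DataFrame({
    "post_id": ["p1", "p1", "p1", "p2"],
    "comment_id": ["c1", "c1", "c2", "c3"],
    "author": ["u/alice", "u/alice", None, "u/AutoModerator"],
    "body": ["see https://example.com", "see https://example.com", None, "Rule reminder"],
})

toy["body"] = toy["body"].fillna("")
toy["has_url"] = toy["body"].str.contains(r"https?://", na=False)
toy["text_len"] = toy["body"].str.len()
toy["author_deleted"] = toy["author"].isna() | toy["author"].isin(["u/[deleted]"])
toy["is_bot"] = toy["author"].fillna("").str.contains("automoderator", case=False)

# Keep the first row for each (post_id, comment_id) pair.
toy = toy.drop_duplicates(subset=["post_id", "comment_id"]).reset_index(drop=True)
print(toy[["comment_id", "has_url", "text_len", "author_deleted", "is_bot"]])
```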


## Applying VADER and heuristics

In [22]:
# Apply VADER
analyzer = SentimentIntensityAnalyzer()

# Initialize lists for the scores
v_neg, v_neu, v_pos, v_comp = [], [], [], []

# Iterate over the comments and compute scores
for txt in tqdm(comments["body_vader"].tolist(), desc="VADER"):
    scores = analyzer.polarity_scores(txt or "")
    v_neg.append(scores.get("neg", 0.0))
    v_neu.append(scores.get("neu", 0.0))
    v_pos.append(scores.get("pos", 0.0))
    v_comp.append(scores.get("compound", 0.0))

# Assign the scores to the DataFrame
comments["vader_neg"] = v_neg
comments["vader_neu"] = v_neu
comments["vader_pos"] = v_pos
comments["vader_compound"] = v_comp

# Label from compound using ±0.5 cutoffs (stricter than the ±0.05 commonly used with VADER)
comments["sentiment_label"] = pd.cut(
    comments["vader_compound"],
    bins=[-1.0, -0.5, 0.5, 1.0],
    labels=["neg","neu","pos"],
    include_lowest=True
)

# Agreement/disagreement heuristics
comments["agrees"] = comments["body_vader"].map(lambda t: count_hits_regex(t, AGREE_TERMS))
comments["disagrees"] = comments["body_vader"].map(lambda t: count_hits_regex(t, DISAGREE_TERMS))

# Aggregation per post
agg = (
    comments.groupby("post_id", group_keys=False)
    .apply(lambda g: pd.Series({
        "comments_total": g["comment_id"].count(),
        "comments_pos": (g["sentiment_label"] == "pos").sum(),
        "comments_neg": (g["sentiment_label"] == "neg").sum(),
        "comments_neu": (g["sentiment_label"] == "neu").sum(),
        "agree_hits": g["agrees"].sum(),
        "disagree_hits": g["disagrees"].sum(),
    }), include_groups=False)
    .reset_index()
)


# Compute support_index, guarding against a zero denominator
num = agg["comments_pos"] + agg["agree_hits"]
denom = agg["comments_pos"] + agg["comments_neg"] + agg["agree_hits"] + agg["disagree_hits"]
agg["support_index"] = (num / denom.where(denom != 0)).fillna(0.0).round(3)

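# Merge into posts; posts without comments get NaN from the left join, so fill with 0
posts_proc = posts.merge(agg, on="post_id", how="left")
metric_cols = ["comments_total","comments_pos","comments_neg","comments_neu","agree_hits","disagree_hits","support_index"]
for col in metric_cols:
    if col not in posts_proc.columns:
        posts_proc[col] = 0 if col != "support_index" else 0.0
posts_proc[metric_cols] = posts_proc[metric_cols].fillna(0)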

VADER: 100%|██████████| 15131/15131 [00:01<00:00, 10443.52it/s]
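
The labeling and the support index can be checked by hand on a toy example. In the sketch below, the compound scores and hit counts are made up, and the per-post aggregation uses pandas named aggregation on boolean helper columns (`is_pos`, `is_neg`, names introduced only for this example) as a compact stand-in for the `groupby(...).apply(...)` call above. The index is computed as (positive comments + agreement hits) / (positive + negative comments + agreement + disagreement hits), with 0.0 when the denominator is zero.

```python
import pandas as pd

# Toy comments; scores and hit counts are invented for illustration.
toy = pd.DataFrame({
    "post_id": ["a", "a", "a", "b", "b", "c"],
    "vader_compound": [0.80, -0.70, 0.10, 0.60, 0.55, 0.00],
    "agrees": [1, 0, 0, 0, 2, 0],
    "disagrees": [0, 1, 0, 0, 0, 0],
})

# Same bins as the notebook: [-1, -0.5] neg, (-0.5, 0.5] neu, (0.5, 1] pos.
toy["sentiment_label"] = pd.cut(
    toy["vader_compound"],
    bins=[-1.0, -0.5, 0.5, 1.0],
    labels=["neg", "neu", "pos"],
    include_lowest=True,
)

# Boolean helper columns let a plain "sum" count the labels per post.
toy["is_pos"] = toy["sentiment_label"] == "pos"
toy["is_neg"] = toy["sentiment_label"] == "neg"
agg = toy.groupby("post_id").agg(
    comments_pos=("is_pos", "sum"),
    comments_neg=("is_neg", "sum"),
    agree_hits=("agrees", "sum"),
    disagree_hits=("disagrees", "sum"),
).reset_index()

num = agg["comments_pos"] + agg["agree_hits"]
denom = num + agg["comments_neg"] + agg["disagree_hits"]
# where() turns zero denominators into NaN, which fillna() then maps to 0.0.
agg["support_index"] = (num / denom.where(denom != 0)).fillna(0.0).round(3)
print(agg)
```

Post `a` has one positive and one negative comment plus one hit of each kind, giving 2/4 = 0.5. Post `b` has only supportive signals (4/4 = 1.0). Post `c` has no signal at all and falls back to 0.0.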


## Saving (CSV + JSONL)

In [23]:
comments_out_csv = ENRICHED_DIR / "comments_with_vader.csv"
comments_out_jsonl = ENRICHED_DIR / "comments_with_vader.jsonl"
posts_out_csv = ENRICHED_DIR / "posts_with_support.csv"
posts_out_jsonl = ENRICHED_DIR / "posts_with_support.jsonl"

comments.to_csv(comments_out_csv, index=False, encoding="utf-8-sig")
posts_proc.to_csv(posts_out_csv, index=False, encoding="utf-8-sig")

# Write JSON Lines (one record per line); missing values are serialized as null
comments.to_json(comments_out_jsonl, orient="records", lines=True, force_ascii=False)
posts_proc.to_json(posts_out_jsonl, orient="records", lines=True, force_ascii=False)

print("Done.")
print("comments_with_vader:", comments.shape, "→", comments_out_csv)
print("posts_with_support:", posts_proc.shape, "→", posts_out_csv)

Done.
comments_with_vader: (15131, 18) → data\enriched\comments_with_vader.csv
posts_with_support: (200, 18) → data\enriched\posts_with_support.csv
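
One quick check on the saved files is a round trip: write a small frame as JSON Lines and read it back. The sketch below does this on toy data in a temporary directory, so the file name and values are illustrative only. It uses pandas' `to_json(orient="records", lines=True)`, which writes missing values as `null`, and `force_ascii=False` to keep accented characters readable.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy posts with a non-ASCII title and a missing metric.
toy = pd.DataFrame({
    "post_id": ["a", "b"],
    "title": ["¿Tiene razón?", "Plain title"],
    "support_index": [0.5, None],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "toy.jsonl"
    toy.to_json(out, orient="records", lines=True, force_ascii=False)

    raw_lines = out.read_text(encoding="utf-8").splitlines()
    back = pd.read_json(out, lines=True)

print(raw_lines[0])
print(back.shape)  # (2, 3)
```

The same `pd.read_json(path, lines=True)` call is what `read_csv_or_jsonl` uses, so the enriched JSONL files can be loaded by a later stage in the same way.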
