# Exercise 2: Reddit API (PRAW) and Sentiment Analysis (without .env)

**Objectives:**
- Connect to the Reddit API (PRAW) using credentials set directly in environment variables (no `.env` file).
- Collect posts from political subreddits and extract key fields.
- Select the most relevant posts and extract their comments.
- Save the results to CSV, linking each comment to its parent post.
- (Optional) Analyze comment sentiment with VADER (NLTK) and extract frequent tokens from titles.

**Subreddits:** `r/politics`, `r/PoliticalDiscussion`, `r/worldnews`  
**Main criteria:** 20 posts per subreddit (hot or top) and 5 comments for each relevant post.  
**Optional filter:** Last 6 months.


## Part 1: Reddit API setup and data collection

## 1) Requirements

Run once to install the dependencies in your environment:

```bash
pip install praw pandas nltk
```
> Note: We do **not** use `python-dotenv` here because the credentials are set directly in environment variables within the notebook itself.


In [41]:
# --- Credentials set directly in environment variables (no .env) ---
import os

os.environ["REDDIT_CLIENT_ID"] = "6J0zagocSCNuuh3rDnKvWA"
os.environ["REDDIT_CLIENT_SECRET"] = "SyGg3smtxOdrFKGlX9tiHQG2_y8HEw"
os.environ["REDDIT_USER_AGENT"] = "Python:PoliticalSentimentAnalyzer:v1.0 (by /u/Reddit-MD-170425)"

# Optional (script mode with login; not required for read-only access):
os.environ["REDDIT_USERNAME"] = "Reddit-MD-170425"
os.environ["REDDIT_PASSWORD"] = "MD-17-123"

print("Credentials loaded into environment variables (os.environ).")

Credentials loaded into environment variables (os.environ).


In [42]:
# %% Imports and loading credentials from os.environ
import os
import time
from dataclasses import dataclass, asdict
from typing import List, Iterable

import pandas as pd

import praw
from praw.models import Comment
from prawcore.exceptions import RequestException, ResponseException, OAuthException, Forbidden

# Optional: NLTK VADER for sentiment
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Read environment variables (set in the previous cell)
CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
USERNAME = os.getenv("REDDIT_USERNAME")
PASSWORD = os.getenv("REDDIT_PASSWORD")
USER_AGENT = os.getenv("REDDIT_USER_AGENT", "Python:PoliticalSentimentAnalyzer:v1.0 (by /u/unknown)")

missing = [k for k, v in {
    "REDDIT_CLIENT_ID": CLIENT_ID,
    "REDDIT_CLIENT_SECRET": CLIENT_SECRET,
    "REDDIT_USER_AGENT": USER_AGENT,
}.items() if not v]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")
else:
    print("Environment variables detected correctly.")

Environment variables detected correctly.


## Part 2: Data collection and storage

In [43]:
# %% Exercise parameters
TARGET_SUBS = ["politics", "PoliticalDiscussion", "worldnews"]  # canonical subreddit names
POSTS_PER_SUB = 20             # 20 posts per subreddit
SORT = "hot"                   # "hot" or "top"
ONLY_LAST_6_MONTHS = False     # True to keep only the last 6 months
N_RELEVANT_POSTS = 10          # subset of most relevant posts (by score)
COMMENTS_PER_POST = 5          # 5 comments per post

OUTPUT_POSTS_CSV = "reddit_posts.csv"
OUTPUT_COMMENTS_CSV = "reddit_comments.csv"

In [44]:
# %% PRAW connection and utilities
def init_reddit() -> praw.Reddit:
    try:
        reddit = praw.Reddit(
            client_id=CLIENT_ID,
            client_secret=CLIENT_SECRET,
            username=USERNAME,
            password=PASSWORD,
            user_agent=USER_AGENT,
            ratelimit_seconds=5,
        )
        # Reading this flag makes no network request, so it does not validate
        # the credentials; authentication errors surface on the first API call.
        _ = reddit.read_only
        return reddit
    except (OAuthException, ResponseException) as e:
        raise RuntimeError(f"Error authenticating with Reddit: {e}") from e

def six_months_ago_ts() -> float:
    return time.time() - 180 * 24 * 3600  # approx. 6 months (180 days)

def is_within_last_6_months(created_utc: float) -> bool:
    return created_utc >= six_months_ago_ts()
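
The 180-day cutoff used by the helpers above can be checked against fixed dates. This is a minimal sketch and not part of the notebook's pipeline: `cutoff_ts` and `is_recent` are hypothetical names, and the injectable `now` argument is an assumption added only so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff_ts(days: int = 180, now: Optional[datetime] = None) -> float:
    """Return the Unix timestamp `days` days before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=days)).timestamp()


def is_recent(created_utc: float, days: int = 180, now: Optional[datetime] = None) -> bool:
    """True when a Unix timestamp (such as a post's created_utc) falls inside the window."""
    return created_utc >= cutoff_ts(days, now)


# Fixed reference date so the result does not depend on when this runs.
reference = datetime(2025, 1, 1, tzinfo=timezone.utc)
one_month_old = datetime(2024, 12, 1, tzinfo=timezone.utc).timestamp()
one_year_old = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()

print(is_recent(one_month_old, now=reference))
print(is_recent(one_year_old, now=reference))
```

A post from one month before the reference date is kept, while a post from a year earlier is filtered out.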

In [45]:
# %% Data structures
@dataclass
class PostRecord:
    subreddit: str
    post_id: str
    title: str
    score: int
    num_comments: int
    url: str
    permalink: str
    created_utc: float

@dataclass
class CommentRecord:
    post_id: str
    comment_id: str
    body: str
    score: int
    created_utc: float
    subreddit: str
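
To illustrate how dataclass records like these become CSV rows, here is a small standard-library sketch. It is an illustration only: the notebook itself goes through `pandas`, and `CommentRow` and `rows_to_csv` are hypothetical names (a trimmed stand-in for `CommentRecord`), not part of the exercise code.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class CommentRow:
    """Hypothetical, trimmed stand-in for CommentRecord."""
    post_id: str
    comment_id: str
    body: str
    score: int


def rows_to_csv(rows: List[CommentRow]) -> str:
    """Serialize records to CSV text, using the dataclass field names as the header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(CommentRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))  # asdict() turns the dataclass into a plain dict
    return buf.getvalue()


sample = [CommentRow(post_id="abc", comment_id="c1", body="Hello, world", score=3)]
print(rows_to_csv(sample))
```

The `post_id` column is what later ties each comment row back to its parent post.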

In [46]:
# %% Post download
def fetch_posts_for_subreddit(
    reddit: praw.Reddit,
    sub_name: str,
    limit: int = 20,
    sort: str = "hot",
    only_last_6_months: bool = False,
) -> List[PostRecord]:
    sub = reddit.subreddit(sub_name)
    gen = sub.top(time_filter="year") if sort == "top" else sub.hot()

    results: List[PostRecord] = []
    for submission in gen:
        if only_last_6_months and not is_within_last_6_months(submission.created_utc):
            continue
        pr = PostRecord(
            subreddit=sub_name,
            post_id=submission.id,
            title=submission.title or "",
            score=int(submission.score or 0),
            num_comments=int(submission.num_comments or 0),
            url=submission.url or "",
            permalink=f"https://www.reddit.com{submission.permalink}",
            created_utc=float(submission.created_utc or 0.0),
        )
        results.append(pr)
        if len(results) >= limit:
            break
    return results

def fetch_posts(
    reddit: praw.Reddit,
    sub_names: Iterable[str],
    posts_per_sub: int,
    sort: str = "hot",
    only_last_6_months: bool = False,
) -> List[PostRecord]:
    all_posts: List[PostRecord] = []
    for name in sub_names:
        try:
            posts = fetch_posts_for_subreddit(
                reddit, name, posts_per_sub, sort, only_last_6_months
            )
            all_posts.extend(posts)
        except Forbidden:
            print(f"[WARNING] Access forbidden to r/{name}. Skipping...")
        except RequestException as e:
            print(f"[WARNING] Network error in r/{name}: {e}")
        except Exception as e:
            print(f"[WARNING] Unexpected error in r/{name}: {e}")
    return all_posts

def select_most_relevant_posts(posts: List[PostRecord], n: int = 10) -> List[PostRecord]:
    posts_sorted = sorted(posts, key=lambda p: p.score, reverse=True)
    return posts_sorted[:n]
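
Selecting the `n` highest-scoring posts can also be done without sorting the whole list. The sketch below uses `heapq.nlargest` as an alternative to the sort-and-slice approach above; `Post` and `top_n_by_score` are hypothetical names used only for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    """Hypothetical, trimmed stand-in for PostRecord."""
    post_id: str
    score: int


def top_n_by_score(posts: List[Post], n: int) -> List[Post]:
    """Return the n posts with the highest score, in descending score order."""
    return heapq.nlargest(n, posts, key=lambda p: p.score)


posts = [Post("a", 10), Post("b", 50), Post("c", 30)]
print([p.post_id for p in top_n_by_score(posts, 2)])
```

For the 60 posts collected in this exercise the difference is negligible; the heap-based version mainly pays off when `n` is small relative to a long list.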

In [47]:
# %% Comment download
def fetch_top_comments_for_post(
    reddit: praw.Reddit,
    post_id: str,
    subreddit: str,
    max_comments: int = 5,
) -> List[CommentRecord]:
    submission = reddit.submission(id=post_id)
    submission.comment_sort = "top"
    submission.comments.replace_more(limit=0)

    top_level_comments = [c for c in submission.comments if isinstance(c, Comment)]
    top_level_comments.sort(key=lambda c: getattr(c, "score", 0), reverse=True)

    comments: List[CommentRecord] = []
    for c in top_level_comments[:max_comments]:
        body = getattr(c, "body", "")
        score = int(getattr(c, "score", 0) or 0)
        created = float(getattr(c, "created_utc", 0.0) or 0.0)
        cr = CommentRecord(
            post_id=post_id,
            comment_id=c.id,
            body=body,
            score=score,
            created_utc=created,
            subreddit=subreddit,
        )
        comments.append(cr)
    return comments

def fetch_comments_for_posts(
    reddit: praw.Reddit,
    posts: List[PostRecord],
    comments_per_post: int = 5,
) -> List[CommentRecord]:
    all_comments: List[CommentRecord] = []
    for p in posts:
        try:
            comms = fetch_top_comments_for_post(reddit, p.post_id, p.subreddit, comments_per_post)
            all_comments.extend(comms)
        except Forbidden:
            print(f"[WARNING] Comments restricted on post {p.post_id}.")
        except RequestException as e:
            print(f"[WARNING] Network error on post {p.post_id}: {e}")
        except Exception as e:
            print(f"[WARNING] Unexpected error on post {p.post_id}: {e}")
    return all_comments
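
The functions above log network errors and move on. If retrying a failed request is preferred, one possible approach is a small backoff helper like the sketch below. This is an assumption on top of the notebook, not something it implements: `with_retries` and its parameters are hypothetical, and the `sleep` argument is injectable only to make the helper easy to test.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a listed exception wait base_delay * 2**k and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Demo with a function that fails twice before succeeding (no real waiting).
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(with_retries(flaky, retry_on=(ConnectionError,), sleep=lambda s: None))
```

In this notebook the call could be wrapped as `with_retries(lambda: fetch_top_comments_for_post(...), retry_on=(RequestException,))`, leaving the existing `except` clauses as the final fallback.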

In [48]:
# %% VADER (optional) and title tokens
def ensure_vader():
    # Download the VADER lexicon only if it is not already available locally.
    try:
        nltk.data.find("sentiment/vader_lexicon.zip")
    except LookupError:
        nltk.download("vader_lexicon")

def add_sentiment_scores(df_comments: pd.DataFrame) -> pd.DataFrame:
    if df_comments.empty:
        return df_comments
    ensure_vader()
    sia = SentimentIntensityAnalyzer()
    df_comments = df_comments.copy()
    # "compound" is VADER's normalized overall score in [-1, 1].
    df_comments["sentiment_compound"] = df_comments["body"].fillna("").map(
        lambda t: sia.polarity_scores(t)["compound"]
    )
    return df_comments

def top_title_tokens(posts_df: pd.DataFrame, k: int = 20) -> pd.DataFrame:
    if posts_df.empty:
        # Return the same schema as the non-empty case.
        return pd.DataFrame(columns=["token", "freq"])
    import re
    from collections import Counter
    stop_basic = {
        # English
        "the","a","an","to","of","in","on","for","and","or","is","are","was","were",
        "with","at","by","from","it","this","that","as","be","has","have","had",
        # Spanish
        "el","la","los","las","de","del","un","una","unos","unas","y","o","en",
        "para","por","con","es","son","fue","fueron","ha","han","haber","se","lo",
        "al","como","más","mas","menos","sin","sobre","ya","no","si"
    }
    tokens = []
    for t in posts_df["title"].dropna().astype(str).tolist():
        t = t.lower()
        t = re.sub(r"http\S+|www\.\S+", " ", t)  # drop URLs
        # \w is Unicode-aware for str patterns, so accented letters are kept.
        t = re.sub(r"\W+", " ", t)
        parts = [w for w in t.split() if len(w) > 2 and w not in stop_basic]
        tokens.extend(parts)
    cnt = Counter(tokens)
    common = cnt.most_common(k)
    return pd.DataFrame(common, columns=["token", "freq"])
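
The `sentiment_compound` column is a number between -1 and 1. To turn it into a category, a commonly used convention is to treat scores of 0.05 or above as positive and -0.05 or below as negative. The sketch below applies that convention; the `label_sentiment` name and the `threshold` parameter are assumptions for illustration and are not part of the notebook.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for score in (0.6, -0.3, 0.0):
    print(score, label_sentiment(score))
```

With the DataFrame produced by `add_sentiment_scores`, this could be applied as `comments_df["sentiment_compound"].map(label_sentiment)`.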

In [None]:
# %% Main execution
def main():
    reddit = init_reddit()

    print(f"Downloading {POSTS_PER_SUB} posts per subreddit (SORT={SORT}, last_6m={ONLY_LAST_6_MONTHS})...")
    posts = fetch_posts(
        reddit,
        TARGET_SUBS,
        posts_per_sub=POSTS_PER_SUB,
        sort=SORT,
        only_last_6_months=ONLY_LAST_6_MONTHS,
    )
    posts_df = pd.DataFrame([asdict(p) for p in posts])
    print(f"Total posts: {len(posts_df)}")

    relevant = select_most_relevant_posts(posts, N_RELEVANT_POSTS)
    print(f"Relevant posts for comments: {len(relevant)}")

    comments = fetch_comments_for_posts(reddit, relevant, comments_per_post=COMMENTS_PER_POST)
    comments_df = pd.DataFrame([asdict(c) for c in comments])
    print(f"Total comments: {len(comments_df)}")

    # (Optional) Sentiment
    comments_df = add_sentiment_scores(comments_df)

    # Save
    posts_df.to_csv(OUTPUT_POSTS_CSV, index=False, encoding="utf-8")
    comments_df.to_csv(OUTPUT_COMMENTS_CSV, index=False, encoding="utf-8")
    print(f"Saved: {OUTPUT_POSTS_CSV} | {OUTPUT_COMMENTS_CSV}")

    # Quick look (display is available in Jupyter/IPython)
    display(posts_df.head(10))
    display(comments_df.head(10))

if __name__ == "__main__":
    main()

Downloading 20 posts per subreddit (SORT=hot, last_6m=False)...
Total posts: 60
Relevant posts for comments: 10


## Final notes
- This notebook does **not** use `.env`. Credentials are loaded into environment variables by a cell at the top.
- If a subreddit is restricted, a warning is shown and the process continues with the others.
- To use **only the last 6 months**, set `ONLY_LAST_6_MONTHS = True` and, preferably, `SORT = "top"`.
- The CSV files are saved in the same folder as the notebook and are linked by `post_id`.
- Keep your credentials private if you share this notebook.
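
Because the two CSV files share the `post_id` column, each comment can be matched back to its parent post. The sketch below shows the idea with plain dictionaries; `attach_post_titles` and the sample rows are hypothetical and used only for illustration. With `pandas`, a left merge on `post_id` achieves the same result.

```python
from typing import Dict, List, Optional


def attach_post_titles(
    posts: List[Dict[str, str]],
    comments: List[Dict[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Add a 'post_title' key to each comment by looking up its post_id."""
    titles = {p["post_id"]: p["title"] for p in posts}
    # Comments whose parent post is missing get post_title=None (left-join behavior).
    return [{**c, "post_title": titles.get(c["post_id"])} for c in comments]


sample_posts = [{"post_id": "p1", "title": "Example title"}]
sample_comments = [
    {"post_id": "p1", "comment_id": "c1"},
    {"post_id": "p9", "comment_id": "c2"},  # parent post not in sample_posts
]

for row in attach_post_titles(sample_posts, sample_comments):
    print(row)
```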
