# 📒 Mini notebook: Reddit (official API) + **support / non-support** analysis of comments

This notebook uses the **official Reddit API with OAuth (via PRAW)** to download **posts** and **comments** from a subreddit, then estimates **support vs. non-support** from comment **sentiment** (VADER) and a few keyword **heuristics** for agreement/disagreement.

**What you get per post:**  
- **Title / selftext**  
- **Images** (when the post's main URL points to an image)  
- **Score (net upvotes)** and **number of comments**  
- **Comments** (text, author, score, date)  
- **Support metrics**: share of positive/negative comments (VADER), keyword counts of "agreement"/"disagreement", and a **SupportIndex** (0–1).
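
As a sketch of how the SupportIndex in the last bullet is computed, the function below mirrors the formula used later in the download loop: positives plus agreement hits, divided by all non-neutral signals. The function name `support_index` is illustrative; neutral comments do not enter the ratio, and a post with no signals scores 0.0.

```python
def support_index(pos: int, neg: int, agree_hits: int, disagree_hits: int) -> float:
    """Share of supportive signals among all non-neutral signals, in [0, 1]."""
    supportive = pos + agree_hits
    # `or 1` avoids division by zero when a post has no non-neutral signals
    denom = (pos + neg + agree_hits + disagree_hits) or 1
    return round(supportive / denom, 3)


# Counts taken from rows of the sample output shown in this notebook
print(support_index(26, 14, 0, 0))  # 26 / 40
print(support_index(6, 25, 0, 0))   # 6 / 31
print(support_index(0, 0, 0, 0))    # no signals at all
```

Because neutral comments are ignored, a post with 3 positive and 9 neutral comments reaches 1.0, so the index is best read together with `comments_total`.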

> **Sources**: PRAW (Reddit API wrapper) and the OAuth docs; VADER for sentiment; optional Hugging Face *pipeline*.  
> Use responsibly and respect the rate limits and ToS. Check the `X-Ratelimit-*` headers to stay within your quota.
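
To make the rate-limit advice concrete, here is a minimal sketch of a helper that decides how long to pause given the remaining request budget. It assumes a dictionary shaped like the one PRAW exposes as `reddit.auth.limits` (keys `remaining`, `reset_timestamp`, `used`, any of which may be `None` before the first request); the function name `seconds_to_wait` and the `min_remaining` threshold are hypothetical choices, not part of PRAW.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(limits: Mapping[str, Optional[float]],
                    now: Optional[float] = None,
                    min_remaining: int = 5) -> float:
    """Return how many seconds to sleep before the next request.

    `limits` is assumed to carry `remaining` (requests left in the window)
    and `reset_timestamp` (Unix time at which the window resets).
    """
    now = time.time() if now is None else now
    remaining = limits.get("remaining")
    reset_ts = limits.get("reset_timestamp")
    if remaining is None or reset_ts is None:
        return 0.0  # no rate-limit information available yet
    if remaining > min_remaining:
        return 0.0  # enough budget left in this window
    return max(0.0, reset_ts - now)


# Made-up values: 2 requests left, window resets 30 seconds from "now"
print(seconds_to_wait({"remaining": 2, "reset_timestamp": 1030.0}, now=1000.0))
```

In a download loop, one could call `time.sleep(seconds_to_wait(reddit.auth.limits))` instead of relying only on a fixed pause.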


## 📦 Quick install

In [2]:
# Run this cell once in your local environment
%pip -q install praw pandas tqdm vaderSentiment nltk transformers --upgrade
# (Optional) download NLTK resources in case you want to use other tools
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')


## ⚙️ Configuration

In [3]:
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple
from pathlib import Path
import time, re, json
import pandas as pd
from tqdm import tqdm

# === Edit these values ===
SUBREDDIT         = "movies"   # <-- the subreddit
SORT              = "top"      # "hot" | "new" | "top" | "rising"
TIME_FILTER       = "week"     # when SORT="top", use: "day"|"week"|"month"|"year"|"all"
MIN_POSTS         = 20         # desired minimum (not enforced by the download loop)
MAX_POSTS         = 200        # maximum number of posts to fetch
MAX_COMMENTS_PER_POST = 80     # cap on comments per post
DOWNLOAD_IMAGES   = False      # download images locally when the post URL is an image
REQUEST_SLEEP_S   = 0.7        # pause between posts, to respect the rate limit
OUT_DIR           = Path("./reddit_api_output")

OUT_DIR.mkdir(parents=True, exist_ok=True)
(OUT_DIR / "images").mkdir(parents=True, exist_ok=True)

@dataclass
class PostRow:
    post_id: str
    title: Optional[str]
    author: Optional[str]
    score: Optional[int]
    num_comments: Optional[int]
    created: Optional[str]
    permalink: str
    is_self: bool
    image_urls: List[str]
    selftext: Optional[str]
    subreddit: Optional[str]
    # post-level support metrics (summary of its comments)
    comments_total: int = 0
    comments_pos: int = 0
    comments_neg: int = 0
    comments_neu: int = 0
    agree_hits: int = 0
    disagree_hits: int = 0
    support_index: float = 0.0

@dataclass
class CommentRow:
    post_id: str
    comment_id: str
    author: Optional[str]
    created: Optional[str]
    score: Optional[int]
    body: str
    vader_compound: Optional[float]
    sentiment_label: Optional[str]
    agrees: int
    disagrees: int
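
The configuration defines `DOWNLOAD_IMAGES` and creates an `images` folder, but no cell in this notebook performs the download. A minimal sketch of what that step could look like follows; `local_image_path` and `download_image` are hypothetical helpers, and the standard library's `urllib` is used here only to keep the example dependency-free (`requests` would work just as well).

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def local_image_path(post_id: str, url: str, out_dir: Path) -> Path:
    """Build a deterministic local filename: <post_id> + extension of the URL path."""
    suffix = Path(urlparse(url).path).suffix.lower() or ".img"
    return out_dir / f"{post_id}{suffix}"


def download_image(post_id: str, url: str, out_dir: Path,
                   timeout: float = 10.0) -> Optional[Path]:
    """Download one image; return the saved path, or None on a network/IO failure."""
    target = local_image_path(post_id, url, out_dir)
    req = Request(url, headers={"User-Agent": "reddit-notebook-example"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            target.write_bytes(resp.read())
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print("[WARN] image download failed:", exc)
        return None
    return target


# Only the path logic runs here; no network request is made
print(local_image_path("1nv66sx", "https://i.redd.it/4jv2rfh6phsf1.png", Path("images")))
```

Inside the post loop, this could be guarded by `if DOWNLOAD_IMAGES:` and called once per entry in `image_urls`, with `OUT_DIR / "images"` as `out_dir`.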


## 🔐 Authentication and utilities

In [4]:
import praw, requests
from datetime import datetime, timezone
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

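import os

def make_reddit():
    # Read the credentials from the environment instead of hardcoding them in the notebook.
    # The variable names below are this notebook's own convention, not something PRAW requires.
    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ.get("REDDIT_USER_AGENT", "Fickle_Finish_9750"),
    )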

def is_image_url(u: str) -> bool:
    return bool(re.search(r"\.(jpg|jpeg|png|gif)(?:\?.*)?$", (u or ""), flags=re.I))

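def to_iso(ts_utc: Optional[float]) -> Optional[str]:
    # Convert a Unix timestamp (UTC) to ISO 8601; return None if it is missing or invalid
    try:
        return datetime.fromtimestamp(ts_utc, tz=timezone.utc).isoformat()
    except (TypeError, ValueError, OverflowError, OSError):
        return None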

def sleep():
    time.sleep(REQUEST_SLEEP_S)

# Simple agreement/disagreement heuristic (EN/ES; extend the lexicon as needed)
AGREE_TERMS = [
    "i agree", "agree", "supported", "support this", "i support", "valid point",
    "de acuerdo", "apoyo", "tiene razón", "cierto", "totalmente de acuerdo", "estoy de acuerdo",
]
DISAGREE_TERMS = [
    "i disagree", "disagree", "not support", "oppose", "against this", "bad take",
    "no apoyo", "en desacuerdo", "no estoy de acuerdo", "mala idea", "me opongo"
]

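def count_hits(text: str, terms: List[str]) -> int:
    # Counts how many terms occur as substrings of the lowercased text.
    # Caveat: "agree" also matches inside "disagree", and overlapping terms each count once.
    t = (text or "").lower()
    return sum(1 for term in terms if term in t)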

analyzer = SentimentIntensityAnalyzer()

def label_from_compound(c: float) -> str:
    # Cutoffs of ±0.5 are stricter than the ±0.05 that VADER's README suggests,
    # so more comments are labeled "neu" here than with the conventional cutoffs.
    if c >= 0.5:
        return "pos"
    elif c <= -0.5:
        return "neg"
    else:
        return "neu"
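
Because `count_hits` uses plain substring matching, "agree" also matches inside "disagree", and overlapping terms such as "i agree" and "agree" each add a hit. One possible refinement, sketched below under the assumption that whole-word matching with longest-term-first precedence suits your subreddit, compiles each lexicon into a single regular expression. `compile_terms` and `count_hits_strict` are illustrative names, and the short term lists are a subset of the lexicons above.

```python
import re
from typing import List


def compile_terms(terms: List[str]) -> re.Pattern:
    """One alternation, longest term first, anchored so terms match as whole words."""
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"(?<!\w)(?:{body})(?!\w)", flags=re.IGNORECASE)


def count_hits_strict(text: str, pattern: re.Pattern) -> int:
    """Count non-overlapping whole-word matches in the text."""
    return len(pattern.findall(text or ""))


AGREE_RE = compile_terms(["i agree", "agree", "de acuerdo", "estoy de acuerdo"])
DISAGREE_RE = compile_terms(["i disagree", "disagree", "no estoy de acuerdo"])

print(count_hits_strict("I disagree with this", AGREE_RE))     # "agree" inside "disagree" is ignored
print(count_hits_strict("I disagree with this", DISAGREE_RE))
print(count_hits_strict("I agree, totally agree", AGREE_RE))
```

Negated phrases are still not handled: "no estoy de acuerdo" contains the agree term "estoy de acuerdo", so a fuller version would mask disagree matches before counting agreements.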


## 🚀 Downloading posts and comments (official API)

In [None]:
reddit = make_reddit()

# Pick the listing according to SORT
sub = reddit.subreddit(SUBREDDIT)
if SORT == "hot":
    it = sub.hot(limit=MAX_POSTS)
elif SORT == "new":
    it = sub.new(limit=MAX_POSTS)
elif SORT == "rising":
    it = sub.rising(limit=MAX_POSTS)
elif SORT == "top":
    it = sub.top(time_filter=TIME_FILTER, limit=MAX_POSTS)
else:
    it = sub.hot(limit=MAX_POSTS)

posts_rows, comments_rows = [], []

for p in tqdm(it, total=MAX_POSTS, desc="Posts"):
    sleep()
    # build the post record
    img_urls = [p.url] if is_image_url(getattr(p, "url", "")) else []
    pr = PostRow(
        post_id=p.id,
        title=p.title,
        author=f"u/{p.author}" if p.author else None,
        score=int(p.score) if p.score is not None else None,
        num_comments=int(p.num_comments) if p.num_comments is not None else None,
        created=to_iso(getattr(p, "created_utc", None)),
        permalink=f"https://www.reddit.com{p.permalink}",
        is_self=bool(p.is_self),
        image_urls=img_urls,
        selftext=(p.selftext or None),
        subreddit=str(p.subreddit) if p.subreddit else SUBREDDIT,
    )

    # comments (up to the cap); replace_more(limit=0) drops the "load more comments" stubs
    try:
        p.comments.replace_more(limit=0)
        com_list = p.comments.list()[:MAX_COMMENTS_PER_POST]
    except Exception as e:
        print("[WARN] Comments", e, "in", pr.permalink)
        com_list = []

    pos = neg = neu = 0
    agree_hits = disagree_hits = 0

    for c in com_list:
        text = getattr(c, "body", "") or ""
        comp = analyzer.polarity_scores(text).get("compound")
        lab = label_from_compound(comp) if comp is not None else None
        if lab == "pos": pos += 1
        elif lab == "neg": neg += 1
        else: neu += 1

        ah = count_hits(text, AGREE_TERMS)
        dh = count_hits(text, DISAGREE_TERMS)
        agree_hits += ah
        disagree_hits += dh

        comments_rows.append(CommentRow(
            post_id=pr.post_id,
            comment_id=getattr(c, "id", ""),
            author=f"u/{c.author}" if c.author else None,
            created=to_iso(getattr(c, "created_utc", None)),
            score=int(getattr(c, "score", 0)) if getattr(c, "score", None) is not None else None,
            body=text,
            vader_compound=comp,
            sentiment_label=lab,
            agrees=ah,
            disagrees=dh
        ))

    total = pos + neg + neu
    pr.comments_total = total
    pr.comments_pos = pos
    pr.comments_neg = neg
    pr.comments_neu = neu
    pr.agree_hits = agree_hits
    pr.disagree_hits = disagree_hits

    # simple support index: (positives + agreements) / (positives + negatives + agreements + disagreements);
    # a zero denominator falls back to 1, so posts with no signals score 0.0
    denom = (pos + neg + agree_hits + disagree_hits) or 1
    pr.support_index = round((pos + agree_hits) / denom, 3)

    posts_rows.append(pr)

print(f"Posts: {len(posts_rows)} | Comments: {len(comments_rows)}")


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

Posts: 60 | Comments: 2842





## 💾 Saving results

In [None]:
posts_df = pd.DataFrame([asdict(p) for p in posts_rows])
comments_df = pd.DataFrame([asdict(c) for c in comments_rows])

posts_csv = OUT_DIR / "posts.csv"
comments_csv = OUT_DIR / "comments.csv"
posts_jsonl = OUT_DIR / "posts.jsonl"
comments_jsonl = OUT_DIR / "comments.jsonl"

posts_df.to_csv(posts_csv, index=False, encoding="utf-8-sig")
comments_df.to_csv(comments_csv, index=False, encoding="utf-8-sig")

with open(posts_jsonl, "w", encoding="utf-8") as f:
    for _, row in posts_df.iterrows():
        # one JSON object per line
        f.write(json.dumps(row.to_dict(), ensure_ascii=False) + "\n")

with open(comments_jsonl, "w", encoding="utf-8") as f:
    for _, row in comments_df.iterrows():
        f.write(json.dumps(row.to_dict(), ensure_ascii=False) + "\n")

posts_csv, comments_csv, posts_jsonl, comments_jsonl


(PosixPath('reddit_api_output/posts.csv'),
 PosixPath('reddit_api_output/comments.csv'),
 PosixPath('reddit_api_output/posts.jsonl'),
 PosixPath('reddit_api_output/comments.jsonl'))
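
JSON Lines expects one JSON object per line, separated by a real newline character. As a small sketch of that convention, the helpers below write a list of dictionaries and read them back with the standard library; `write_jsonl` and `read_jsonl` are illustrative names, and the sample records are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(path: Path, records: Iterable[Dict]) -> None:
    """Write one JSON object per line; ensure_ascii=False keeps accents and emoji readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict]:
    """Parse each non-empty line as an independent JSON document."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    rows = [{"post_id": "abc123", "title": "¿Qué película?", "support_index": 0.65},
            {"post_id": "def456", "title": "Second post", "support_index": 0.0}]
    write_jsonl(sample_path, rows)
    round_trip = read_jsonl(sample_path)
    print(len(round_trip), round_trip == rows)
```

With pandas, `DataFrame.to_json(path, orient="records", lines=True, force_ascii=False)` produces the same layout in a single call.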

## 👀 Quick look and support ranking

In [None]:
# Top 20 by support index, among posts with at least 10 analyzed comments
rank_df = (posts_df[posts_df["comments_total"] >= 10]
           .sort_values(["support_index", "comments_total"], ascending=[False, False])
           .head(20))

display(posts_df.head(50))
display(comments_df.head(100))
display(rank_df)

index,post_id,title,author,score,num_comments,created,permalink,is_self,image_urls,selftext,subreddit,comments_total,comments_pos,comments_neg,comments_neu,agree_hits,disagree_hits,support_index
0,1nv66sx,"Hi reddit! I'm Mercedes Bryce Morgan, director...",u/BoneLakeAMA,572,252,2025-10-01T11:58:05+00:00,https://www.reddit.com/r/movies/comments/1nv66...,False,[https://i.redd.it/4jv2rfh6phsf1.png],"Hi reddit! I'm Mercedes Bryce Morgan, director...",movies,80,26,14,40,0,0,0.65
1,1nwmxl4,Official Discussion Megathread (The Smashing M...,u/LiteraryBoner,19,0,2025-10-03T02:09:50+00:00,https://www.reddit.com/r/movies/comments/1nwmx...,True,[],**New In Theaters**:\n\n* [The Smashing Machin...,movies,0,0,0,0,0,0,0.0
2,1nx1s19,'KPop Demon Hunters' Breaks Another Netflix Re...,u/Sudden_Pop_2279,7811,701,2025-10-03T15:10:07+00:00,https://www.reddit.com/r/movies/comments/1nx1s...,False,[],,movies,80,27,9,44,0,0,0.75
3,1nxa42j,What was the first movie to traumatise you as ...,u/InspectorOk6313,760,2789,2025-10-03T20:23:06+00:00,https://www.reddit.com/r/movies/comments/1nxa4...,True,[],For me it was either Jaws or An American Werew...,movies,80,6,25,49,0,0,0.194
4,1nwvq25,Keanu Reeves Gives Update On 'Constantine 2': ...,u/ChiefLeef22,1890,157,2025-10-03T10:47:14+00:00,https://www.reddit.com/r/movies/comments/1nwvq...,False,[],,movies,80,23,7,50,1,0,0.774
5,1nxc6p6,"Zach Cherry, Kali Reis, & Johnno Wilson Join Z...",u/MarvelsGrantMan136,247,48,2025-10-03T21:46:50+00:00,https://www.reddit.com/r/movies/comments/1nxc6...,False,[],,movies,49,7,6,36,0,0,0.538
6,1nwypxr,"Rango (2011, dir. Gore Verbinski) Wagon Chase ...",u/LiteraryBoner,1092,100,2025-10-03T13:10:14+00:00,https://www.reddit.com/r/movies/comments/1nwyp...,False,[],,movies,80,33,5,42,1,1,0.85
7,1nxbsxp,A behind-the-scenes video from the filming of ...,u/Amaruq93,237,21,2025-10-03T21:30:54+00:00,https://www.reddit.com/r/movies/comments/1nxbs...,False,[],,movies,21,8,1,12,0,0,0.889
8,1nx4rny,Michael B. Jordan to Receive Outstanding Perfo...,u/MarvelsGrantMan136,521,13,2025-10-03T17:00:36+00:00,https://www.reddit.com/r/movies/comments/1nx4r...,False,[],,movies,12,3,0,9,0,0,1.0
9,1nx4tru,Cillian Murphy’s Criterion Closet Picks,u/MarvelsGrantMan136,525,45,2025-10-03T17:02:38+00:00,https://www.reddit.com/r/movies/comments/1nx4t...,False,[],,movies,45,18,3,24,0,0,0.857


index,post_id,comment_id,author,created,score,body,vader_compound,sentiment_label,agrees,disagrees
0,1nv66sx,nh66arg,u/BunyipPouch,2025-10-01T11:58:52+00:00,1,This AMA has been verified and approved by the...,0.8860,pos,0,0
1,1nv66sx,nh67dk8,u/ITeachYourKidz,2025-10-01T12:06:04+00:00,181,Are the bones their money? What about the worms?,0.0000,neu,0,0
2,1nv66sx,nh66r00,u/MuptonBossman,2025-10-01T12:01:56+00:00,230,"Be honest, was Bone Lake supposed to be a porn...",-0.1027,neu,0,0
3,1nv66sx,nh68azn,u/Creative_Pie_8191,2025-10-01T12:12:10+00:00,53,"Hi Mercedes, What’s the scariest movie you’ve ...",-0.5719,neg,0,0
4,1nv66sx,nh6a7zs,u/Stone_Field,2025-10-01T12:24:20+00:00,66,Uhhh you got any Boneless Lake?,0.0000,neu,0,0
...,...,...,...,...,...,...,...,...,...,...
95,1nx1s19,nhku60r,u/TheThalmorEmbassy,2025-10-03T17:14:24+00:00,9,"Yeah, but what's the competition? Netflix blows.",0.1531,neu,0,0
96,1nx1s19,nhlu323,u/Gold-Bard-Hue,2025-10-03T20:13:51+00:00,6,I went to my 8yo daughter's softball game last...,0.9254,pos,0,0
97,1nx1s19,nhla95q,u/Videowulff,2025-10-03T18:33:08+00:00,3,Was at the store the other day and legit 2 lit...,0.6597,pos,0,0
98,1nx1s19,nhm160g,u/UrethraFranklin04,2025-10-03T20:49:29+00:00,3,I just drove 8 hours to a family reunion and I...,0.0772,neu,0,0


index,post_id,title,author,score,num_comments,created,permalink,is_self,image_urls,selftext,subreddit,comments_total,comments_pos,comments_neg,comments_neu,agree_hits,disagree_hits,support_index
52,1nwutbu,"AMA (Ask Me Anything): Tori Brazier, Senior Fi...",u/Metro-UK,18,34,2025-10-03T09:53:50+00:00,https://www.reddit.com/r/movies/comments/1nwut...,False,[https://i.redd.it/xxh3g2goavsf1.jpeg],**This AMA has now ended ... here's a note fro...,movies,34,28,0,6,1,0,1.0
8,1nx4rny,Michael B. Jordan to Receive Outstanding Perfo...,u/MarvelsGrantMan136,521,13,2025-10-03T17:00:36+00:00,https://www.reddit.com/r/movies/comments/1nx4r...,False,[],,movies,12,3,0,9,0,0,1.0
36,1nwmw7l,Official Discussion - The Smashing Machine [SP...,u/LiteraryBoner,246,345,2025-10-03T02:07:52+00:00,https://www.reddit.com/r/movies/comments/1nwmw...,True,[],"#Poll\n\n**If you've seen the film, please rat...",movies,80,40,4,36,4,0,0.917
19,1nx5aqv,The Blues Brothers (1980) must have been wild ...,u/CeruleanBlew,70,57,2025-10-03T17:20:05+00:00,https://www.reddit.com/r/movies/comments/1nx5a...,True,[],"Finally watched this masterpiece, haha what a ...",movies,57,21,2,34,0,0,0.913
7,1nxbsxp,A behind-the-scenes video from the filming of ...,u/Amaruq93,237,21,2025-10-03T21:30:54+00:00,https://www.reddit.com/r/movies/comments/1nxbs...,False,[],,movies,21,8,1,12,0,0,0.889
22,1nx8ran,New Line In Final Talks With Michiel Blanchart...,u/NoCulture3505,33,11,2025-10-03T19:30:40+00:00,https://www.reddit.com/r/movies/comments/1nx8r...,False,[],,movies,11,7,1,3,0,0,0.875
25,1nwe6qk,The Finest Comic Actor Working Today Is Leonar...,u/ChiefLeef22,2177,312,2025-10-02T19:57:43+00:00,https://www.reddit.com/r/movies/comments/1nwe6...,False,[],\- By Bilge Ebiri,movies,80,34,4,42,3,2,0.86
9,1nx4tru,Cillian Murphy’s Criterion Closet Picks,u/MarvelsGrantMan136,525,45,2025-10-03T17:02:38+00:00,https://www.reddit.com/r/movies/comments/1nx4t...,False,[],,movies,45,18,3,24,0,0,0.857
16,1nx9out,Running a One-Screen theater in Missouri: How ...,u/Plane_Face_8064,74,22,2025-10-03T20:06:36+00:00,https://www.reddit.com/r/movies/comments/1nx9o...,True,[],My husband and I run a family-owned single-scr...,movies,22,6,1,15,0,0,0.857
6,1nwypxr,"Rango (2011, dir. Gore Verbinski) Wagon Chase ...",u/LiteraryBoner,1092,100,2025-10-03T13:10:14+00:00,https://www.reddit.com/r/movies/comments/1nwyp...,False,[],,movies,80,33,5,42,1,1,0.85


## 🔁 (Optional) Hugging Face `pipeline('sentiment-analysis')`

In [None]:
# If you want to try a transformers model to compare against VADER:
# - This is heavier: it downloads a model on first use and may benefit from a GPU.
from transformers import pipeline

try:
    hf_sa = pipeline("sentiment-analysis")
    sample = comments_df["body"].dropna().head(5).tolist()
    if sample:
        print(hf_sa(sample))
except Exception as e:
    print("Transformers pipeline not available:", e)
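
To compare the two approaches beyond eyeballing the printed output, one option is to map the pipeline's results onto the notebook's `pos`/`neg` labels and measure how often they agree with VADER. The sketch below assumes each result is a dictionary with `label` and `score` keys and that labels read `POSITIVE`/`NEGATIVE` (as with the default English SST-2 model at the time of writing); `hf_to_label`, `agreement_rate`, and the `min_score` cutoff are hypothetical.

```python
from typing import Dict, List, Optional


def hf_to_label(result: Dict, min_score: float = 0.0) -> Optional[str]:
    """Map one pipeline result to "pos"/"neg"; None if the label is unknown or low-confidence."""
    mapping = {"POSITIVE": "pos", "NEGATIVE": "neg"}
    label = mapping.get(str(result.get("label", "")).upper())
    if label is None or float(result.get("score", 0.0)) < min_score:
        return None
    return label


def agreement_rate(vader_labels: List[str], hf_results: List[Dict]) -> float:
    """Share of comments where both methods give the same non-neutral label.

    Pairs where VADER says "neu" or the HF label is unusable are skipped,
    since a binary model has no neutral class to compare against.
    """
    compared = matches = 0
    for v, r in zip(vader_labels, hf_results):
        h = hf_to_label(r)
        if v == "neu" or h is None:
            continue
        compared += 1
        matches += int(v == h)
    return matches / compared if compared else 0.0


# Made-up results in the pipeline's output shape
fake_results = [{"label": "POSITIVE", "score": 0.99},
                {"label": "NEGATIVE", "score": 0.95},
                {"label": "POSITIVE", "score": 0.60}]
print(agreement_rate(["pos", "pos", "neu"], fake_results))
```

Skipping neutral comments is a design choice: with the strict ±0.5 cutoffs most comments are neutral, so the rate describes only the minority on which VADER takes a side.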


## 📚 Notes and references

- **PRAW (Reddit API + OAuth):** authentication and basic usage.  
- **Official rate limits:** 100 QPM (queries per minute) per OAuth client id (check the `X-Ratelimit-*` headers).  
- **VADER (Hutto & Gilbert, 2014):** rule-based sentiment scoring. Its README suggests `compound ≥ 0.05` as positive and `≤ -0.05` as negative; this notebook applies stricter cutoffs of `≥ 0.5` and `≤ -0.5`.  
- **Hugging Face Transformers:** `pipeline('sentiment-analysis')` for comparing approaches.
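
Since the cutoff decides how many comments end up neutral, it can help to make it a parameter. The sketch below generalizes `label_from_compound` with a `threshold` argument (a name introduced here): `0.05` is the value VADER's README suggests, and `0.5` reproduces this notebook's stricter labels.

```python
from collections import Counter
from typing import Iterable


def label_with_threshold(compound: float, threshold: float = 0.05) -> str:
    """Label a VADER compound score as pos/neg/neu using a symmetric cutoff."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


def label_counts(scores: Iterable[float], threshold: float) -> Counter:
    """Count labels for a batch of compound scores at a given cutoff."""
    return Counter(label_with_threshold(s, threshold) for s in scores)


# Compound scores copied from the sample comment table in this notebook
scores = [0.8860, 0.0, -0.1027, -0.5719, 0.1531, 0.0772]
print(label_counts(scores, threshold=0.5))
print(label_counts(scores, threshold=0.05))
```

On these six scores the neutral share drops from four comments to one when moving from ±0.5 to ±0.05, which in turn changes the SupportIndex denominator.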

> Reminder: this “support / non-support” analysis is **heuristic** (sentiment + agreement keywords). You can extend it with *stance detection* models or tune the lexicon to the topic or subreddit.
