# 📒 Notebook: Reddit (official API) + **support / non-support** analysis of comments

This notebook uses the **official Reddit API with OAuth (via PRAW)** to download **posts** and **comments** from a subreddit, then estimates **support vs. non-support** from comment **sentiment** (VADER) plus a few simple agreement/disagreement **heuristics**.

**What you get per post:**  
- **Title / selftext**  
- **Images** (if the post's main URL points to an image)  
- **Score (net upvotes)** and **number of comments**  
- **Comments** (text, author, score, date)  
- **Support metrics**: counts of positive/negative/neutral comments (VADER), keyword-based “agree”/“disagree” hit counts, and a **SupportIndex** (0–1).

> **Sources**: PRAW (wrapper for the Reddit API) and the OAuth docs; VADER for sentiment; optional Hugging Face *pipeline*.  
> Use responsibly and respect the rate limits and ToS. Check the `X-Ratelimit-*` response headers to avoid exceeding them.
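
As a rough illustration of pacing requests from the rate-limit information, the sketch below computes how long to pause before the next call. It assumes a dict shaped like PRAW's `reddit.auth.limits` (`remaining`, `reset_timestamp`, `used`); the function name `seconds_to_wait` and the `min_remaining` cutoff are hypothetical and not part of this notebook.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(limits: Mapping[str, Optional[float]],
                    now: Optional[float] = None,
                    min_remaining: int = 5) -> float:
    """Return how many seconds to pause before the next request.

    `limits` is assumed to look like PRAW's `reddit.auth.limits`:
    {"remaining": ..., "reset_timestamp": ..., "used": ...}.
    Values may be None before any request has been made.
    """
    remaining = limits.get("remaining")
    reset_ts = limits.get("reset_timestamp")
    if remaining is None or reset_ts is None:
        return 0.0  # nothing known yet, so do not block
    if remaining > min_remaining:
        return 0.0  # enough budget left in the current window
    now = time.time() if now is None else now
    # Budget nearly exhausted: wait until the window resets (never negative).
    return max(0.0, reset_ts - now)
```

Inside a download loop this could replace a fixed pause, for example `time.sleep(seconds_to_wait(reddit.auth.limits))`, so the loop only slows down when the budget is almost used up.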


## 📦 Quick install

In [1]:
# Run this cell once in your local environment
%pip -q install praw pandas tqdm vaderSentiment nltk transformers --upgrade
# (Optional) download NLTK resources in case you want to use other tools
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')





Note: you may need to restart the kernel to use updated packages.


## ⚙️ Configuration

In [2]:
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple
from pathlib import Path
import time, re, json
import pandas as pd
from tqdm import tqdm

# === Edit these values ===
SUBREDDIT         = "movies"   # <-- the subreddit
SORT              = "top"      # "hot" | "new" | "top" | "rising"
TIME_FILTER       = "week"     # when SORT="top", use: "day"|"week"|"month"|"year"|"all"
MIN_POSTS         = 20         # minimum number of posts to fetch (not referenced by the cells in this notebook)
MAX_POSTS         = 200        # maximum number of posts to fetch
MAX_COMMENTS_PER_POST = 80     # cap on comments per post
DOWNLOAD_IMAGES   = False      # download images locally when the post URL is an image (not referenced by the cells in this notebook)
REQUEST_SLEEP_S   = 0.7        # pause between posts, to respect the rate limit
OUT_DIR           = Path("./reddit_api_output")

OUT_DIR.mkdir(parents=True, exist_ok=True)
(OUT_DIR / "images").mkdir(parents=True, exist_ok=True)

@dataclass
class PostRow:
    post_id: str
    title: Optional[str]
    author: Optional[str]
    score: Optional[int]
    num_comments: Optional[int]
    created: Optional[str]
    permalink: str
    is_self: bool
    image_urls: List[str]
    selftext: Optional[str]
    subreddit: Optional[str]
    # post-level support metrics (summary of the post's comments)
    comments_total: int = 0
    comments_pos: int = 0
    comments_neg: int = 0
    comments_neu: int = 0
    agree_hits: int = 0
    disagree_hits: int = 0
    support_index: float = 0.0

@dataclass
class CommentRow:
    post_id: str
    comment_id: str
    author: Optional[str]
    created: Optional[str]
    score: Optional[int]
    body: str
    vader_compound: Optional[float]
    sentiment_label: Optional[str]
    agrees: int
    disagrees: int
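
A small sanity check on the settings can catch typos before any request is sent. The sketch below is only an illustration: the allowed values are copied from the comments in the configuration cell, and `validate_config` is a hypothetical helper that the rest of the notebook does not call.

```python
VALID_SORTS = {"hot", "new", "top", "rising"}
VALID_TIME_FILTERS = {"day", "week", "month", "year", "all"}


def validate_config(sort: str, time_filter: str,
                    min_posts: int, max_posts: int,
                    max_comments_per_post: int) -> None:
    """Raise ValueError on the first setting that looks inconsistent."""
    if sort not in VALID_SORTS:
        raise ValueError(f"SORT must be one of {sorted(VALID_SORTS)}, got {sort!r}")
    # TIME_FILTER only matters for the "top" listing.
    if sort == "top" and time_filter not in VALID_TIME_FILTERS:
        raise ValueError(
            f"TIME_FILTER must be one of {sorted(VALID_TIME_FILTERS)}, got {time_filter!r}"
        )
    if not 0 < min_posts <= max_posts:
        raise ValueError("expected 0 < MIN_POSTS <= MAX_POSTS")
    if max_comments_per_post <= 0:
        raise ValueError("MAX_COMMENTS_PER_POST must be positive")
```

With the values above, `validate_config(SORT, TIME_FILTER, MIN_POSTS, MAX_POSTS, MAX_COMMENTS_PER_POST)` returns without raising.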


## 🔐 Authentication and utilities

In [3]:
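import os
import praw, requests
from datetime import datetime, timezone
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def make_reddit():
    # Credentials come from environment variables so they are never stored
    # in the notebook. The variable names are this notebook's own convention;
    # set them before running this cell.
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ.get("REDDIT_USER_AGENT", "Fickle_Finish_9750"),
    )
    return reddit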

def is_image_url(u: str) -> bool:
    return bool(re.search(r"\.(jpg|jpeg|png|gif)(?:\?.*)?$", (u or ""), flags=re.I))

def to_iso(ts_utc: Optional[float]) -> Optional[str]:
    # Returns None when the timestamp is missing or cannot be converted
    try:
        return datetime.fromtimestamp(ts_utc, tz=timezone.utc).isoformat()
    except Exception:
        return None

def sleep():
    time.sleep(REQUEST_SLEEP_S)

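# Simple agreement/disagreement heuristic (EN/ES; extend the lexicon as needed)
# Matching is by substring, so overlapping terms count more than once and
# "agree" also matches inside "disagree".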
AGREE_TERMS = [
    "i agree", "agree", "supported", "support this", "i support", "valid point",
    "de acuerdo", "apoyo", "tiene razón", "cierto", "totalmente de acuerdo", "estoy de acuerdo",
]
DISAGREE_TERMS = [
    "i disagree", "disagree", "not support", "oppose", "against this", "bad take",
    "no apoyo", "en desacuerdo", "no estoy de acuerdo", "mala idea", "me opongo"
]

def count_hits(text: str, terms: List[str]) -> int:
    t = (text or "").lower()
    return sum(1 for term in terms if term in t)

analyzer = SentimentIntensityAnalyzer()

def label_from_compound(c: float) -> str:
    # Stricter than the ±0.05 cutoffs suggested in the VADER README:
    # with ±0.5 only strongly polarized comments are labeled pos/neg
    if c >= 0.5:
        return "pos"
    elif c <= -0.5:
        return "neg"
    else:
        return "neu"
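
Because `count_hits` matches plain substrings, a term such as "agree" is also found inside "disagree". One possible refinement, sketched below, is to compile each term with word-boundary lookarounds. The names `compile_terms` and `count_hits_wb` are hypothetical and do not appear elsewhere in this notebook.

```python
import re
from typing import Iterable, List, Optional, Pattern


def compile_terms(terms: Iterable[str]) -> List[Pattern[str]]:
    # (?<!\w) and (?!\w) require that the term is not glued to other word
    # characters; unlike \b they behave the same for multi-word phrases.
    return [
        re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", flags=re.IGNORECASE)
        for term in terms
    ]


def count_hits_wb(text: Optional[str], patterns: List[Pattern[str]]) -> int:
    """Count how many patterns occur at least once in `text`."""
    t = text or ""
    return sum(1 for p in patterns if p.search(t))


agree_patterns = compile_terms(["i agree", "agree", "valid point"])
print(count_hits_wb("I disagree with this", agree_patterns))   # 0
print(count_hits_wb("I agree, valid point", agree_patterns))   # 3
```

Overlapping terms ("i agree" and "agree") still count separately, as in the original heuristic, and negated phrases such as "no estoy de acuerdo" still match "de acuerdo"; handling negation would need a separate rule.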


## 🚀 Downloading posts and comments (official API)

In [4]:
reddit = make_reddit()

# Pick the listing according to SORT (unknown values fall back to "hot")
sub = reddit.subreddit(SUBREDDIT)
if SORT == "hot":
    it = sub.hot(limit=MAX_POSTS)
elif SORT == "new":
    it = sub.new(limit=MAX_POSTS)
elif SORT == "rising":
    it = sub.rising(limit=MAX_POSTS)
elif SORT == "top":
    it = sub.top(time_filter=TIME_FILTER, limit=MAX_POSTS)
else:
    it = sub.hot(limit=MAX_POSTS)

posts_rows, comments_rows = [], []

for p in tqdm(it, total=MAX_POSTS, desc="Posts"):
    sleep()
    # build the post record
    img_urls = [p.url] if is_image_url(getattr(p, "url", "")) else []
    pr = PostRow(
        post_id=p.id,
        title=p.title,
        author=f"u/{p.author}" if p.author else None,
        score=int(p.score) if p.score is not None else None,
        num_comments=int(p.num_comments) if p.num_comments is not None else None,
        created=to_iso(getattr(p, "created_utc", None)),
        permalink=f"https://www.reddit.com{p.permalink}",
        is_self=bool(p.is_self),
        image_urls=img_urls,
        selftext=(p.selftext or None),
        subreddit=str(p.subreddit) if p.subreddit else SUBREDDIT,
    )

    # comments (up to the cap); replace_more(limit=0) drops the "load more" placeholders
    try:
        p.comments.replace_more(limit=0)
        com_list = p.comments.list()[:MAX_COMMENTS_PER_POST]
    except Exception as e:
        print("[WARN] Comments", e, "in", pr.permalink)
        com_list = []

    pos = neg = neu = 0
    agree_hits = disagree_hits = 0

    for c in com_list:
        text = getattr(c, "body", "") or ""
        comp = analyzer.polarity_scores(text).get("compound")
        lab = label_from_compound(comp) if comp is not None else None
        if lab == "pos": pos += 1
        elif lab == "neg": neg += 1
        else: neu += 1

        ah = count_hits(text, AGREE_TERMS)
        dh = count_hits(text, DISAGREE_TERMS)
        agree_hits += ah
        disagree_hits += dh

        comments_rows.append(CommentRow(
            post_id=pr.post_id,
            comment_id=getattr(c, "id", ""),
            author=f"u/{c.author}" if c.author else None,
            created=to_iso(getattr(c, "created_utc", None)),
            score=int(getattr(c, "score", 0)) if getattr(c, "score", None) is not None else None,
            body=text,
            vader_compound=comp,
            sentiment_label=lab,
            agrees=ah,
            disagrees=dh
        ))

    total = pos + neg + neu
    pr.comments_total = total
    pr.comments_pos = pos
    pr.comments_neg = neg
    pr.comments_neu = neu
    pr.agree_hits = agree_hits
    pr.disagree_hits = disagree_hits

    # simple support index: (positives + agreements) / (positives + negatives + agreements + disagreements);
    # the denominator falls back to 1 when every count is zero, which yields an index of 0.0
    denom = (pos + neg + agree_hits + disagree_hits) or 1
    pr.support_index = round((pos + agree_hits) / denom, 3)

    posts_rows.append(pr)

print(f"Posts: {len(posts_rows)} | Comments: {len(comments_rows)}")


Posts: 100%|██████████| 200/200 [05:55<00:00,  1.78s/it]

Posts: 200 | Comments: 11113
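
To make the index easier to reason about, here is the same formula from the loop above as a standalone function (the name `support_index` mirrors the dataclass field; the function itself is only an illustration). The inputs are the positive/negative/agree/disagree counts of three posts from the results table shown later in this notebook.

```python
def support_index(pos: int, neg: int, agree_hits: int, disagree_hits: int) -> float:
    """(pos + agree) / (pos + neg + agree + disagree), rounded to 3 decimals."""
    # `or 1` avoids a division by zero when every count is 0.
    denom = (pos + neg + agree_hits + disagree_hits) or 1
    return round((pos + agree_hits) / denom, 3)


print(support_index(34, 6, 0, 0))    # 0.85
print(support_index(25, 15, 3, 1))   # 0.636
print(support_index(9, 15, 0, 2))    # 0.346
print(support_index(0, 0, 0, 0))     # 0.0
```

Neutral comments do not enter the formula, so a post with 5 positive, 0 negative and 12 neutral comments scores 1.0. That is worth keeping in mind when reading the ranking.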





## 💾 Saving the results

In [5]:
posts_df = pd.DataFrame([asdict(p) for p in posts_rows])
comments_df = pd.DataFrame([asdict(c) for c in comments_rows])

posts_csv = OUT_DIR / "posts.csv"
comments_csv = OUT_DIR / "comments.csv"
posts_jsonl = OUT_DIR / "posts.jsonl"
comments_jsonl = OUT_DIR / "comments.jsonl"

posts_df.to_csv(posts_csv, index=False, encoding="utf-8-sig")
comments_df.to_csv(comments_csv, index=False, encoding="utf-8-sig")

# JSONL: one JSON object per line, each terminated by a real newline
with open(posts_jsonl, "w", encoding="utf-8") as f:
    for _, row in posts_df.iterrows():
        f.write(json.dumps(row.to_dict(), ensure_ascii=False) + "\n")

with open(comments_jsonl, "w", encoding="utf-8") as f:
    for _, row in comments_df.iterrows():
        f.write(json.dumps(row.to_dict(), ensure_ascii=False) + "\n")

posts_csv, comments_csv, posts_jsonl, comments_jsonl


(WindowsPath('reddit_api_output/posts.csv'),
 WindowsPath('reddit_api_output/comments.csv'),
 WindowsPath('reddit_api_output/posts.jsonl'),
 WindowsPath('reddit_api_output/comments.jsonl'))
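
A JSONL file holds one JSON object per line, so a quick way to check the output is to read it back line by line. The sketch below uses only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helper names, not functions defined elsewhere in this notebook.

```python
import json
from typing import Any, Dict, Iterable, List


def write_jsonl(path: str, records: Iterable[Dict[str, Any]]) -> None:
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse every non-empty line of the file as a JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `len(read_jsonl(posts_jsonl))` should equal `len(posts_df)` once the files have been written.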

## 👀 Quick look and support ranking

In [6]:
# Top 20 by support index among posts with at least 10 comments
rank_df = (posts_df[posts_df["comments_total"] >= 10]
           .sort_values(["support_index", "comments_total"], ascending=[False, False])
           .head(20))

display(posts_df.head(50))
display(comments_df.head(100))
display(rank_df)

Unnamed: 0,post_id,title,author,score,num_comments,created,permalink,is_self,image_urls,selftext,subreddit,comments_total,comments_pos,comments_neg,comments_neu,agree_hits,disagree_hits,support_index
0,1nvfumo,"Jane Goodall, Iconic Chimpanzee Expert Who Was...",u/MarvelsGrantMan136,53320,721,2025-10-01T18:12:58+00:00,https://www.reddit.com/r/movies/comments/1nvfu...,False,[],,movies,80,34,6,40,0,0,0.85
1,1ntp6it,Official Poster for the New 'The Simpsons' Movie,u/MarvelsGrantMan136,38473,2237,2025-09-29T18:01:36+00:00,https://www.reddit.com/r/movies/comments/1ntp6...,False,[https://i.redd.it/fdrkwcoa85sf1.jpeg],,movies,80,9,10,61,0,0,0.474
2,1ns3xk0,"‘One Battle After Another,’ With Its Thriller ...",u/tylerthe-theatre,8707,987,2025-09-27T19:25:22+00:00,https://www.reddit.com/r/movies/comments/1ns3x...,False,[],,movies,80,25,15,40,3,1,0.636
3,1nunm5x,‘Highlander’: Jeremy Irons to Play Villain Opp...,u/MarvelsGrantMan136,8244,673,2025-09-30T20:03:43+00:00,https://www.reddit.com/r/movies/comments/1nunm...,False,[],,movies,80,31,13,36,0,0,0.705
4,1nx1s19,'KPop Demon Hunters' Breaks Another Netflix Re...,u/Sudden_Pop_2279,7983,718,2025-10-03T15:10:07+00:00,https://www.reddit.com/r/movies/comments/1nx1s...,False,[],,movies,80,26,8,46,0,0,0.765
5,1nudzdh,New Poster for Guillermo del Toro’s ‘Frankenst...,u/MarvelsGrantMan136,7277,324,2025-09-30T14:00:17+00:00,https://www.reddit.com/r/movies/comments/1nudz...,False,[https://i.redd.it/asr477t56bsf1.jpeg],,movies,80,19,7,54,0,0,0.731
6,1nw5xbt,‘Kill Bill: The Whole Bloody Affair’ to Receiv...,u/indiewire,7093,643,2025-10-02T14:51:23+00:00,https://www.reddit.com/r/movies/comments/1nw5x...,False,[],,movies,80,25,5,50,0,0,0.833
7,1nvm212,Quentin Tarantino’s ‘Kill Bill: The Whole Bloo...,u/MarvelsGrantMan136,6887,577,2025-10-01T22:05:23+00:00,https://www.reddit.com/r/movies/comments/1nvm2...,False,[],,movies,80,21,8,51,0,0,0.724
8,1nub8xg,AI 'Actress' Tilly Norwood Condemned by SAG-AFTRA,u/MarvelsGrantMan136,6517,972,2025-09-30T12:00:39+00:00,https://www.reddit.com/r/movies/comments/1nub8...,False,[],,movies,80,9,15,56,0,2,0.346
9,1nve5k3,'The Social Network' (2010) | Eduardo Saverin ...,u/ChiefLeef22,6401,593,2025-10-01T17:12:05+00:00,https://www.reddit.com/r/movies/comments/1nve5...,False,[],,movies,80,18,9,53,0,0,0.667


Unnamed: 0,post_id,comment_id,author,created,score,body,vader_compound,sentiment_label,agrees,disagrees
0,1nvfumo,nh892fi,u/MuptonBossman,2025-10-01T18:22:39+00:00,2836,I remember watching Jane Goodall documentaries...,0.8748,pos,0,0
1,1nvfumo,nh87tvb,u/clownus,2025-10-01T18:16:40+00:00,3594,Highly suggest people check out podcast or tal...,0.6335,pos,0,0
2,1nvfumo,nh89xsl,u/Renegadeforever2024,2025-10-01T18:26:49+00:00,432,One of one,0.0000,neu,0,0
3,1nvfumo,nh87mpu,u/Trip_on_the_street,2025-10-01T18:15:42+00:00,413,RIP. She spoke at my graduation!,0.0000,neu,0,0
4,1nvfumo,nh87ri3,u/alwaysfatigued8787,2025-10-01T18:16:20+00:00,1582,They better let the chimps attend her funeral.,0.1027,neu,0,0
...,...,...,...,...,...,...,...,...,...,...
95,1ntp6it,ngv9dur,u/AMA_requester,2025-09-29T18:06:29+00:00,177,They always manage to be crazy behind lol. Lik...,0.7750,pos,0,0
96,1ntp6it,ngv9vis,u/bwaredapenguin,2025-09-29T18:08:50+00:00,37,I had no idea this was even a possibility.,-0.2960,neu,0,0
97,1ntp6it,ngv9w02,u/burger__n__fries,2025-09-29T18:08:55+00:00,29,2027 is packed as the schedule currently stand...,0.4576,neu,0,0
98,1ntp6it,ngveef1,u/ralo229,2025-09-29T18:30:24+00:00,18,Never in a million years did I think this was ...,0.0000,neu,0,0


Unnamed: 0,post_id,title,author,score,num_comments,created,permalink,is_self,image_urls,selftext,subreddit,comments_total,comments_pos,comments_neg,comments_neu,agree_hits,disagree_hits,support_index
112,1nukr9z,"First review: Good Luck, Have Fun, Don’t Die",u/aaron___morgan,261,17,2025-09-30T18:16:11+00:00,https://www.reddit.com/r/movies/comments/1nukr...,False,[],“I didn’t realize just how much I missed Gore ...,movies,17,5,0,12,0,0,1.0
79,1nx4rny,Michael B. Jordan to Receive Outstanding Perfo...,u/MarvelsGrantMan136,551,16,2025-10-03T17:00:36+00:00,https://www.reddit.com/r/movies/comments/1nx4r...,False,[],,movies,15,4,0,11,0,0,1.0
193,1nvpo34,I tried making a map of every movie theater in...,u/mikewhoneedsabike,24,14,2025-10-02T00:43:42+00:00,https://www.reddit.com/r/movies/comments/1nvpo...,True,[],https://www.locatecinemas.com/\n\nI made a web...,movies,14,4,0,10,0,0,1.0
53,1nsuyur,"CatVideoFest 2025 tops $1,000,000 in domestic ...",u/CatVideoFest,930,50,2025-09-28T17:59:55+00:00,https://www.reddit.com/r/movies/comments/1nsuy...,False,[],,movies,51,17,1,33,2,0,0.95
169,1nt695l,Sorcerer (1977),u/M_Le_Petomane,49,30,2025-09-29T02:10:37+00:00,https://www.reddit.com/r/movies/comments/1nt69...,True,[],"I recall seeing this excellent remake of ""Wage...",movies,30,12,1,17,1,0,0.929
155,1nx5aqv,The Blues Brothers (1980) must have been wild ...,u/CeruleanBlew,75,63,2025-10-03T17:20:05+00:00,https://www.reddit.com/r/movies/comments/1nx5a...,True,[],"Finally watched this masterpiece, haha what a ...",movies,63,24,2,37,0,0,0.923
95,1ntae43,"Gore Verbinski's 'Good Luck, Have Fun, Don't D...",u/ChiefLeef22,377,67,2025-09-29T06:01:52+00:00,https://www.reddit.com/r/movies/comments/1ntae...,True,[],*A man claiming to be from the future takes th...,movies,67,23,2,42,0,0,0.92
117,1nwmw7l,Official Discussion - The Smashing Machine [SP...,u/LiteraryBoner,250,349,2025-10-03T02:07:52+00:00,https://www.reddit.com/r/movies/comments/1nwmw...,True,[],"#Poll\n\n**If you've seen the film, please rat...",movies,80,41,4,35,4,0,0.918
78,1nsdjcr,My brother made a movie and just dropped the t...,u/LastCryForHelp,553,76,2025-09-28T02:57:06+00:00,https://www.reddit.com/r/movies/comments/1nsdj...,False,[],My older brother shot and filmed a movie in ou...,movies,76,38,4,34,4,0,0.913
118,1nxbsxp,A behind-the-scenes video from the filming of ...,u/Amaruq93,252,23,2025-10-03T21:30:54+00:00,https://www.reddit.com/r/movies/comments/1nxbs...,False,[],,movies,23,9,1,13,0,0,0.9


## 🔁 (Optional) Hugging Face `pipeline('sentiment-analysis')`

In [7]:
# To try a transformers model and compare it with VADER:
# - This is heavier and may need a GPU and/or a local model cache.
from transformers import pipeline

try:
    hf_sa = pipeline("sentiment-analysis")
    sample = comments_df["body"].dropna().head(5).tolist()
    if sample:
        print(hf_sa(sample))
except Exception as e:
    print("Transformers pipeline not available:", e)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9996495246887207}, {'label': 'POSITIVE', 'score': 0.9997192025184631}, {'label': 'POSITIVE', 'score': 0.9786983132362366}, {'label': 'NEGATIVE', 'score': 0.9751015901565552}, {'label': 'NEGATIVE', 'score': 0.9962820410728455}]
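
One rough way to compare the two approaches is to measure how often they agree on the comments that VADER labels as polarized. The sketch below assumes the result format printed above (a list of dicts with `label` and `score`); `agreement_rate` and `HF_TO_VADER` are hypothetical names. The default model is binary, so comments labeled `neu` by VADER have no counterpart and are skipped.

```python
from typing import Dict, List, Optional

# Map the binary SST-2 labels onto this notebook's label names.
HF_TO_VADER = {"POSITIVE": "pos", "NEGATIVE": "neg"}


def agreement_rate(vader_labels: List[str], hf_results: List[Dict]) -> Optional[float]:
    """Share of non-neutral VADER labels that match the transformer label."""
    pairs = [
        (v, HF_TO_VADER.get(r["label"]))
        for v, r in zip(vader_labels, hf_results)
        if v != "neu"
    ]
    if not pairs:
        return None  # nothing to compare
    return sum(1 for v, h in pairs if v == h) / len(pairs)


# VADER labels of the first five comments and the pipeline output shown above
vader = ["pos", "pos", "neu", "neu", "neu"]
hf = [{"label": "POSITIVE", "score": 0.9996}, {"label": "POSITIVE", "score": 0.9997},
      {"label": "POSITIVE", "score": 0.9787}, {"label": "NEGATIVE", "score": 0.9751},
      {"label": "NEGATIVE", "score": 0.9963}]
print(agreement_rate(vader, hf))  # 1.0
```

On this five-comment sample both methods agree on the two polarized comments, while the three that VADER leaves neutral are forced into POSITIVE or NEGATIVE by the binary model.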


## 📚 Notes and references

- **PRAW (Reddit API + OAuth):** authentication and basic usage.  
- **Official rate limits:** 100 QPM per client id with OAuth (check `X-Ratelimit-*`).  
- **VADER (Hutto & Gilbert, 2014):** rules + thresholds. The VADER README suggests `compound ≥ 0.05` as positive and `≤ -0.05` as negative; this notebook uses the stricter `≥ 0.5` / `≤ -0.5`, so only strongly polarized comments are labeled.  
- **Hugging Face Transformers:** `pipeline('sentiment-analysis')` to compare approaches.

> Reminder: this "support / non-support" analysis is **heuristic** (sentiment + agreement keywords). You can extend it with *stance detection* models or tune the lexicon to the topic/subreddit.
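
The choice of cutoff changes the labels noticeably. The sketch below parameterizes the threshold so the ±0.05 cutoffs suggested in the VADER README can be compared with the ±0.5 used by `label_from_compound`; the function name `label_with_threshold` is hypothetical, and the compound values are taken from the comments table above.

```python
def label_with_threshold(compound: float, threshold: float = 0.05) -> str:
    """Label a VADER compound score using a symmetric cutoff."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


for c in (0.8748, 0.4576, 0.1027, 0.0, -0.2960):
    print(c, label_with_threshold(c, 0.05), label_with_threshold(c, 0.5))
# 0.8748 pos pos
# 0.4576 pos neu
# 0.1027 pos neu
# 0.0 neu neu
# -0.296 neg neu
```

With the stricter cutoff, mildly positive or negative comments fall into `neu`, which shrinks the denominator of the support index.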
