# Exercise 1: Reddit API Data Collection

- Objective: Collect post and comment data from political subreddits using the Reddit API (PRAW), then identify the highest-scoring posts and their top comments.


## 1. Reddit API Credentials: Developer Account

- client_ID: YOUR_CLIENT_ID

- client_secret: YOUR_CLIENT_SECRET

## 2. Environment Setup
- Install necessary libraries

In [8]:
#!pip install praw
#!pip install -U charset-normalizer
#!pip install pandas



## 3. API Connection (PRAW)
- Task: Use your client_id, client_secret, and a user agent to connect to the Reddit API via PRAW. You’ll also need your Reddit username and password (only for script apps).


In [18]:
import praw
import pandas as pd
import os

In [5]:
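# Connect to the Reddit API.
# Replace the placeholders with the credentials of your own script app.
# No password is passed, so PRAW runs in read-only mode,
# which is enough to read public posts and comments.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="Python:webscrapping_freddy:v1.0 (by /u/Ancient-Arrival8504)",
    username="Ancient-Arrival8504",       # my Reddit username
)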
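
Credentials written into a notebook are easy to leak when the file is shared. A minimal sketch of one alternative, assuming the values are exported as environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USERNAME`, and the helper `load_reddit_credentials`, are hypothetical and not part of PRAW:

```python
import os


def load_reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
    }
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    creds = {key: env[name] for key, name in required.items()}
    # The username is optional here; it only appears in the user agent string.
    username = env.get("REDDIT_USERNAME", "unknown")
    creds["user_agent"] = f"Python:webscrapping_freddy:v1.0 (by /u/{username})"
    return creds


# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing variables is a design choice for this sketch; it avoids a less obvious authentication error later on.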

## 4. Collect Posts from Subreddits

In [13]:
# Define parameters
subreddits = ["politics", "PoliticalDiscussion", "worldnews"]
mode = "hot"     # "hot" or "top"
limit = 20

In [14]:
# Collect the data

rows = []
for sr in subreddits:
    s = reddit.subreddit(sr)
    posts_iter = getattr(s, mode)(limit=limit)  # s.hot(...) or s.top(...)
    for p in posts_iter:
        rows.append({
            "subreddit": sr,
            "title": p.title,
            "score": p.score,
            "num_comments": p.num_comments,
            "id": p.id,
            "url": p.url,
        })
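
The `getattr(s, mode)` call looks the listing method up by name, so a typo in `mode` only surfaces as an `AttributeError`, or quietly calls some other attribute. A minimal sketch of a guarded version; `get_listing` and `ALLOWED_MODES` are hypothetical names, and the stub object stands in for a PRAW subreddit so the snippet runs offline:

```python
from types import SimpleNamespace

ALLOWED_MODES = ("hot", "top")  # the two listings used in this exercise


def get_listing(subreddit, mode, limit):
    """Call subreddit.hot(limit=...) or subreddit.top(limit=...) after checking mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    return getattr(subreddit, mode)(limit=limit)


# Stand-in for reddit.subreddit("politics"); not a real PRAW object.
fake_subreddit = SimpleNamespace(
    hot=lambda limit: [f"hot-{i}" for i in range(limit)],
    top=lambda limit: [f"top-{i}" for i in range(limit)],
)

hot_posts = get_listing(fake_subreddit, "hot", limit=3)
print(hot_posts)
```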

In [17]:
# Storage

output_path = "/Users/freddyzutachavez/Documents/GitHub/web_scrapping/output/posts_reddit.csv"

df = pd.DataFrame(rows, columns=["subreddit", "title", "score", "num_comments", "id", "url"])


In [27]:
df.head(5)

|   | subreddit | title | score | num_comments | id | url |
|---|---|---|---|---|---|---|
| 0 | politics | Dr. Oz Splits from Trump on Tylenol After Auti... | 9838 | 546 | 1nouvbe | https://www.thedailybeast.com/dr-oz-splits-fro... |
| 1 | politics | Charlie Kirk was a divisive far-right podcaste... | 38345 | 3633 | 1noqbhr | https://www.theguardian.com/commentisfree/2025... |
| 2 | politics | 76 percent of Americans say Trump does not des... | 5631 | 357 | 1novrvc | https://www.washingtonpost.com/politics/2025/0... |
| 3 | politics | Unlike other Dems, AOC refuses to join in whit... | 15675 | 510 | 1nonj9g | https://www.peoplesworld.org/article/unlike-ot... |
| 4 | politics | Trump's big UN speech received with awkward la... | 26152 | 1641 | 1nok4av | https://inews.co.uk/news/world/trumps-big-un-s... |


In [None]:
# Export to CSV
df.to_csv(output_path, index=False, encoding="utf-8")

print(f"Saved {len(df)} rows to {output_path}")

Saved 60 rows to /Users/freddyzutachavez/Documents/GitHub/web_scrapping/output/posts_reddit.csv
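
The export assumes the output folder already exists. A minimal sketch of creating it first, using a temporary directory and toy rows so it runs anywhere; the names `out_dir`, `out_file`, and `toy` are illustrative only:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder (and any missing parents) before writing.
    out_dir = Path(tmp) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "posts_reddit.csv"

    # Toy rows standing in for the collected posts.
    toy = pd.DataFrame({"id": ["a1", "b1"], "score": [10, 20]})
    toy.to_csv(out_file, index=False, encoding="utf-8")

    # Read the file back to confirm the round trip.
    reloaded = pd.read_csv(out_file)

print(reloaded)
```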

## 5. Collect Comments

In [21]:
# Define the output directory

output_dir = "/Users/freddyzutachavez/Documents/GitHub/web_scrapping/output"
posts_csv = os.path.join(output_dir, "posts_reddit.csv")
comments_csv = os.path.join(output_dir, "comments_reddit.csv")


In [22]:
# Load the posts collected earlier
df_posts = pd.read_csv(posts_csv)

In [25]:
df_posts.head(5)

|   | subreddit | title | score | num_comments | id | url |
|---|---|---|---|---|---|---|
| 0 | politics | Dr. Oz Splits from Trump on Tylenol After Auti... | 9838 | 546 | 1nouvbe | https://www.thedailybeast.com/dr-oz-splits-fro... |
| 1 | politics | Charlie Kirk was a divisive far-right podcaste... | 38345 | 3633 | 1noqbhr | https://www.theguardian.com/commentisfree/2025... |
| 2 | politics | 76 percent of Americans say Trump does not des... | 5631 | 357 | 1novrvc | https://www.washingtonpost.com/politics/2025/0... |
| 3 | politics | Unlike other Dems, AOC refuses to join in whit... | 15675 | 510 | 1nonj9g | https://www.peoplesworld.org/article/unlike-ot... |
| 4 | politics | Trump's big UN speech received with awkward la... | 26152 | 1641 | 1nok4av | https://inews.co.uk/news/world/trumps-big-un-s... |


In [28]:
# Define the subset of "most relevant" posts
#    Here we rank by 'score'
#    and keep the 10 best per subreddit for variety

subset = (
    df_posts.sort_values("score", ascending=False)
            .groupby("subreddit")
            .head(10)
            .reset_index(drop=True)
)
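
To see what the sort, group, and `head` chain returns, here is a small sketch on made-up data (the `toy_posts` frame and its values are invented for illustration). Note that the rows come back in overall score order, not grouped by subreddit:

```python
import pandas as pd

# Invented posts: three in subreddit "a", two in subreddit "b".
toy_posts = pd.DataFrame({
    "subreddit": ["a", "a", "a", "b", "b"],
    "id": ["a1", "a2", "a3", "b1", "b2"],
    "score": [10, 50, 30, 5, 7],
})

# Same chain as above, keeping the 2 best posts per subreddit.
top2 = (
    toy_posts.sort_values("score", ascending=False)
             .groupby("subreddit")
             .head(2)
             .reset_index(drop=True)
)
print(top2)
```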

In [31]:
# Collect 5 comments per post (sorted by "best")

rows_comments = []
for _, row in subset.iterrows():
    post_id = row["id"]
    submission = reddit.submission(id=post_id)
    submission.comment_sort = "best"             # must be set before the comments are fetched
    submission.comments.replace_more(limit=0)    # drop the "load more comments" placeholders

    # Take the first 5 top-level comments
    for c in submission.comments[:5]:
        rows_comments.append({
            "post_id": post_id,
            "body": c.body,
            "score": c.score
        })
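
Deleted or removed comments show up with the body `[deleted]` or `[removed]`, so taking the first five blindly can store placeholders. A minimal sketch of a helper that skips them and also keeps the comment ID; `comment_rows` and the extra `comment_id` field are hypothetical additions, and the stub objects stand in for PRAW comments so the snippet runs offline:

```python
from types import SimpleNamespace

PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def comment_rows(post_id, comments, n=5):
    """Build up to n row dicts from comment-like objects, skipping placeholder bodies."""
    rows = []
    for c in comments:
        if c.body in PLACEHOLDER_BODIES:
            continue
        rows.append({
            "post_id": post_id,
            "comment_id": c.id,
            "body": c.body,
            "score": c.score,
        })
        if len(rows) == n:
            break
    return rows


# Stand-ins for the items of submission.comments; not real PRAW objects.
fake_comments = [
    SimpleNamespace(id="c1", body="[removed]", score=0),
    SimpleNamespace(id="c2", body="First real comment", score=12),
    SimpleNamespace(id="c3", body="Second real comment", score=7),
]

sample_rows = comment_rows("p1", fake_comments, n=5)
print(sample_rows)
```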


In [32]:
# Store the comments
df_comments = pd.DataFrame(rows_comments, columns=["post_id", "body", "score"])
df_comments.head(5)

|   | post_id | body | score |
|---|---|---|---|
| 0 | 1nocisz | Why the bloody hell does anyone want an AI pope? | 10090 |
| 1 | 1nocisz | I heard AI is going for white collar workers’ ... | 1800 |
| 2 | 1nocisz | He made the right call, especially considering... | 1189 |
| 3 | 1nocisz | Understandable because who the fuck wants an A... | 996 |
| 4 | 1nocisz | So... there's no holy ghost in the shell... | 54 |


In [33]:
# Save to CSV
df_comments.to_csv(comments_csv, index=False, encoding="utf-8")
print(f"Saved {len(df_comments)} comments to: {comments_csv}")

Saved 150 comments to: /Users/freddyzutachavez/Documents/GitHub/web_scrapping/output/comments_reddit.csv


## 6. Storage

In [35]:
# Directory and files
comments_linked_csv = os.path.join(output_dir, "comments_with_posts.csv")

In [37]:
# Load the posts and prepare the merge key
df_posts = pd.read_csv(posts_csv).rename(columns={"id": "post_id"})
df_posts.head(5)

|   | subreddit | title | score | num_comments | post_id | url |
|---|---|---|---|---|---|---|
| 0 | politics | Dr. Oz Splits from Trump on Tylenol After Auti... | 9838 | 546 | 1nouvbe | https://www.thedailybeast.com/dr-oz-splits-fro... |
| 1 | politics | Charlie Kirk was a divisive far-right podcaste... | 38345 | 3633 | 1noqbhr | https://www.theguardian.com/commentisfree/2025... |
| 2 | politics | 76 percent of Americans say Trump does not des... | 5631 | 357 | 1novrvc | https://www.washingtonpost.com/politics/2025/0... |
| 3 | politics | Unlike other Dems, AOC refuses to join in whit... | 15675 | 510 | 1nonj9g | https://www.peoplesworld.org/article/unlike-ot... |
| 4 | politics | Trump's big UN speech received with awkward la... | 26152 | 1641 | 1nok4av | https://inews.co.uk/news/world/trumps-big-un-s... |


In [38]:
# Select the useful post metadata
post_cols = ["post_id", "subreddit", "title", "url", "score", "num_comments"]

In [40]:
# Merge: attach each comment to its post (by post_id)
df_comments_linked = df_comments.merge(df_posts[post_cols], on="post_id", how="left")

In [41]:
df_comments_linked.head(5)

|   | post_id | body | score_x | subreddit | title | url | score_y | num_comments |
|---|---|---|---|---|---|---|---|---|
| 0 | 1nocisz | Why the bloody hell does anyone want an AI pope? | 10090 | worldnews | Pope Leo refuses to authorise an AI Pope and d... | https://www.pcgamer.com/software/ai/pope-leo-r... | 61282 | 1389 |
| 1 | 1nocisz | I heard AI is going for white collar workers’ ... | 1800 | worldnews | Pope Leo refuses to authorise an AI Pope and d... | https://www.pcgamer.com/software/ai/pope-leo-r... | 61282 | 1389 |
| 2 | 1nocisz | He made the right call, especially considering... | 1189 | worldnews | Pope Leo refuses to authorise an AI Pope and d... | https://www.pcgamer.com/software/ai/pope-leo-r... | 61282 | 1389 |
| 3 | 1nocisz | Understandable because who the fuck wants an A... | 996 | worldnews | Pope Leo refuses to authorise an AI Pope and d... | https://www.pcgamer.com/software/ai/pope-leo-r... | 61282 | 1389 |
| 4 | 1nocisz | So... there's no holy ghost in the shell... | 54 | worldnews | Pope Leo refuses to authorise an AI Pope and d... | https://www.pcgamer.com/software/ai/pope-leo-r... | 61282 | 1389 |
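
Both tables have a `score` column, so the merge renames them to `score_x` (comment) and `score_y` (post), which is easy to misread later. A minimal sketch on made-up data showing explicit suffixes plus a uniqueness check on the post side; the frames and the suffix names `_comment` and `_post` are illustrative choices, not part of the original notebook:

```python
import pandas as pd

# Invented comments and posts sharing the key "post_id" and a "score" column.
comments = pd.DataFrame({
    "post_id": ["p1", "p1", "p2"],
    "body": ["x", "y", "z"],
    "score": [10, 3, 8],
})
posts = pd.DataFrame({
    "post_id": ["p1", "p2"],
    "title": ["First", "Second"],
    "score": [100, 200],
})

linked = comments.merge(
    posts,
    on="post_id",
    how="left",
    suffixes=("_comment", "_post"),  # replaces the default _x / _y
    validate="many_to_one",          # raises if a post_id repeats in posts
)
print(linked.columns.tolist())
```

With `validate="many_to_one"`, a duplicated post in the posts table raises an error instead of silently duplicating comment rows.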


In [45]:
# Save the comments together with the post metadata
df_comments_linked.to_csv(comments_linked_csv, index=False, encoding="utf-8")


### End of Part 1