## Part 2: Data Collection and Storage

In [1]:
!pip install praw
!pip install python-dotenv
import os
from dotenv import load_dotenv
import praw

Collecting praw
  Obtaining dependency information for praw from https://files.pythonhosted.org/packages/73/ca/60ec131c3b43bff58261167045778b2509b83922ce8f935ac89d871bd3ea/praw-7.8.1-py3-none-any.whl.metadata
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Obtaining dependency information for prawcore<3,>=2.4 from https://files.pythonhosted.org/packages/96/5c/8af904314e42d5401afcfaff69940dc448e974f80f7aa39b241a4fbf0cf1/prawcore-2.4.0-py3-none-any.whl.metadata
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Obtaining dependency information for update_checker>=0.18 from https://files.pythonhosted.org/packages/0c/ba/8dd7fa5f0b1c6a8ac62f8f57f7e794160c1f86f31c6d0fb00f582372a3e4/update_checker-0.18.0-py3-none-any.whl.metadata
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)

In [2]:
# Load the secret credentials from the "ID.env" file
load_dotenv("ID.env")

reddit = praw.Reddit(
    client_id=os.getenv("CLIENT_ID"),
    client_secret=os.getenv("CLIENT_SECRET"),
    user_agent=os.getenv("USER_AGENT")
)
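
`os.getenv` returns `None` for any key that is absent from `ID.env`. A minimal sketch of an explicit check that names the missing keys up front; `require_env` is a hypothetical helper for illustration, not part of praw or python-dotenv.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise if any is missing or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    # Treat unset and empty-string values the same way: both count as missing.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}
```

Called right after `load_dotenv("ID.env")`, `require_env(["CLIENT_ID", "CLIENT_SECRET", "USER_AGENT"])` would stop with the names of any keys that did not load.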

In [3]:
# Test the connection against r/python
for submission in reddit.subreddit("python").hot(limit=5):
    print(f"Title: {submission.title}")
    print(f"Score: {submission.score}")
    print(f"URL: {submission.url}\n")

Title: Sunday Daily Thread: What's everyone working on this week?
Score: 6
URL: https://www.reddit.com/r/Python/comments/1nmdhrp/sunday_daily_thread_whats_everyone_working_on/

Title: Thursday Daily Thread: Python Careers, Courses, and Furthering Education!
Score: 2
URL: https://www.reddit.com/r/Python/comments/1nps3nn/thursday_daily_thread_python_careers_courses_and/

Title: Pyrefly & Instagram - A Case Study on the Pain of Slow Code Navigation
Score: 89
URL: https://www.reddit.com/r/Python/comments/1np9d42/pyrefly_instagram_a_case_study_on_the_pain_of/

Title: Teaching my wife python!
Score: 16
URL: https://www.reddit.com/r/Python/comments/1nplhop/teaching_my_wife_python/

Title: Fast API better option than Django?
Score: 35
URL: https://www.reddit.com/r/Python/comments/1npercr/fast_api_better_option_than_django/



### Part 2

Target subreddits:

- r/politics
- r/PoliticalDiscussion
- r/worldnews

Task: For each of the three subreddits, collect 20 “hot” or “top” posts.

Extraction (for each post, extract):

- title
- score (upvotes)
- num_comments
- id (unique identifier)
- url

Storage: Store this post data.
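
The task allows either "hot" or "top" posts. A minimal sketch of a helper that picks the listing by name and extracts the five fields listed above; `collect_posts` is a hypothetical name, and the only assumption about `subreddit` is that it exposes `.hot(limit=...)` and `.top(limit=...)`, as praw's subreddit objects do.

```python
def collect_posts(subreddit, listing="hot", limit=20):
    """Collect the required post fields from a PRAW-style subreddit object.

    `collect_posts` is a hypothetical helper for illustration only.
    """
    if listing not in ("hot", "top"):
        raise ValueError(f"Unsupported listing: {listing!r}")
    fetch = getattr(subreddit, listing)  # resolves to subreddit.hot or subreddit.top
    return [
        {
            "title": s.title,
            "score": s.score,
            "num_comments": s.num_comments,
            "id": s.id,
            "url": s.url,
        }
        for s in fetch(limit=limit)
    ]
```

With the `reddit` client created earlier, `collect_posts(reddit.subreddit("politics"), listing="top")` would return up to 20 dictionaries, one per post.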

In [4]:
import pandas as pd
# Subreddits to collect from
subreddits = ["politics", "PoliticalDiscussion", "worldnews"]
# Initialize the list of posts
all_posts = []

# Iterate over each subreddit
for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    print(f"Collecting posts from: r/{sub} ...")

    for submission in subreddit.hot(limit=20):  # you can swap hot for top
        all_posts.append({
            "subreddit": sub,
            "title": submission.title,
            "score": submission.score,
            "num_comments": submission.num_comments,
            "id": submission.id,
            "url": submission.url
        })

print(f"Extraction complete. Total posts collected: {len(all_posts)}")

Collecting posts from: r/politics ...
Collecting posts from: r/PoliticalDiscussion ...
Collecting posts from: r/worldnews ...
Extraction complete. Total posts collected: 60


In [5]:
# Convert to a DataFrame
df = pd.DataFrame(all_posts)

# Quick check
print("Total posts per subreddit:")
print(df["subreddit"].value_counts())

# Save to CSV
df.to_csv("reddit_posts_politics_world.csv", index=False)

print("CSV file created successfully.")

Total posts per subreddit:
subreddit
politics               20
PoliticalDiscussion    20
worldnews              20
Name: count, dtype: int64
CSV file created successfully.
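
One way to confirm that the file on disk matches what was collected is to read it back and recount. A minimal sketch using only the standard-library `csv` module, so it runs independently of pandas; `count_by_column` is a hypothetical helper for illustration.

```python
import csv
from collections import Counter


def count_by_column(path, column):
    """Count rows per distinct value of `column` in a CSV file with a header row."""
    # newline="" lets the csv module handle quoted fields that contain line breaks.
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row[column] for row in csv.DictReader(f))
```

If the write succeeded, `count_by_column("reddit_posts_politics_world.csv", "subreddit")` should report 20 rows for each of the three subreddits.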


Next, for a subset of relevant posts (the 10 with the most comments), we collect the 5 highest-scoring comments per post.

In [6]:
import time

In [7]:
top_posts = df.sort_values("num_comments", ascending=False).head(10)
all_comments = []

In [9]:
for _, row in top_posts.iterrows():
    post_id = row["id"]
    try:
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=0)  # remove MoreComments placeholders without expanding them

        # Sort by score and keep the 5 top-scoring comments
        comments_sorted = sorted(
            submission.comments.list(),
            key=lambda c: getattr(c, "score", 0),
            reverse=True
        )[:5]

        for c in comments_sorted:
            all_comments.append({
                "post_id": post_id,
                "comment_id": c.id,
                "body": c.body,
                "score": c.score,
                "created_utc": c.created_utc
            })

        time.sleep(0.5)  # short pause to avoid rate limiting

    except Exception as e:
        print(f"Error on post {post_id}: {e}")
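
Sorting the full comment list only to keep five items does more work than necessary on large threads. A minimal sketch of the same selection with `heapq.nlargest` from the standard library; `top_n_by_score` is a hypothetical helper, and the `getattr` fallback mirrors the one in the cell above.

```python
import heapq


def top_n_by_score(comments, n=5):
    """Return the n highest-scoring comments, best first.

    Objects without a `score` attribute are treated as having score 0.
    """
    return heapq.nlargest(n, comments, key=lambda c: getattr(c, "score", 0))
```

Inside the loop, `top_n_by_score(submission.comments.list())` could stand in for the `sorted(...)[:5]` expression.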

In [10]:
# Convert to a DataFrame
df_comments = pd.DataFrame(all_comments)

# Save to CSV
df_comments.to_csv("reddit_comments_politics_world.csv", index=False)

print(f"Comments collected: {len(df_comments)}")
print("File 'reddit_comments_politics_world.csv' created successfully.")

Comments collected: 50
File 'reddit_comments_politics_world.csv' created successfully.
