## Exercise 1: Reddit API Data Collection

### Part 1: Reddit API Setup & Data Collection

2. Environment Setup

In [1]:
!pip install praw python-dotenv pandas



3. API Connection (PRAW)

In [3]:
import praw
import os
from dotenv import load_dotenv
import pandas as pd

# The credentials are loaded from the .env file.
load_dotenv()

reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    username=os.getenv("REDDIT_USERNAME"),
    password=os.getenv("REDDIT_PASSWORD"),
    user_agent=os.getenv("REDDIT_USER_AGENT"),
)

print("Reddit API client loaded successfully.")

Reddit API client loaded successfully.
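
`os.getenv` returns `None` for a variable that is not set, so a typo in the `.env` file may only surface later as an authentication error. A minimal sketch of a pre-flight check, assuming the five variable names used in the cell above; `missing_credentials` is a hypothetical helper, not part of PRAW or python-dotenv:

```python
import os

# Names match the os.getenv(...) calls in the connection cell.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
    "REDDIT_USER_AGENT",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to the real process environment,
    but any mapping can be passed in for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
# (the values here are made up for illustration).
example_env = {"REDDIT_CLIENT_ID": "abc", "REDDIT_USER_AGENT": "exercise-1 script"}
print(missing_credentials(example_env))
# ['REDDIT_CLIENT_SECRET', 'REDDIT_USERNAME', 'REDDIT_PASSWORD']
```

Calling `missing_credentials()` right after `load_dotenv()` and stopping when the list is non-empty would be one way to fail early with a readable message.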


### Part 2: Data Collection and Storage

4. Collect Posts from Subreddits 

Task: Collect 20 “hot” or “top” posts from each of the three subreddits (politics, PoliticalDiscussion, worldnews).

In [4]:
subreddits = ["politics", "PoliticalDiscussion", "worldnews"]

posts_data = []

for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    for post in subreddit.top(limit=20):
        posts_data.append({
            "subreddit": sub,
            "id": post.id,
            "title": post.title,
            "score": post.score,
            "num_comments": post.num_comments,
            "url": post.url
        })

df_posts = pd.DataFrame(posts_data)
df_posts.to_csv("../salida/posts.csv", index=False, encoding="utf-8-sig")

df_posts.head()

| | subreddit | id | title | score | num_comments | url |
|---|---|---|---|---|---|---|
| 0 | politics | jptq5n | Megathread: Joe Biden Projected to Defeat Pres... | 214315 | 80931 | https://www.reddit.com/r/politics/comments/jpt... |
| 1 | politics | krntg6 | Mitch McConnell Will Lose Control Of The Senat... | 156753 | 10099 | https://www.buzzfeednews.com/article/paulmcleo... |
| 2 | politics | ecm1zg | Megathread: House Votes to Impeach President D... | 147744 | 50628 | https://www.reddit.com/r/politics/comments/ecm... |
| 3 | politics | jcm5dz | Trump Threatens to ‘Leave the Country’ if He L... | 135306 | 16093 | https://www.thedailybeast.com/trump-threatens-... |
| 4 | politics | i19sjg | Demands for Kushner to Resign Over 'Staggering... | 129744 | 6747 | https://www.commondreams.org/news/2020/07/31/d... |
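
The task allows either “hot” or “top” posts, while the cell above hardcodes `subreddit.top(limit=20)`. A minimal sketch of how the choice could be made a parameter; `fetch_listing` is a hypothetical helper, and the stand-in object below only mimics the `hot(limit=...)` and `top(limit=...)` calls so the example runs without credentials or network access:

```python
from types import SimpleNamespace

ALLOWED_LISTINGS = ("hot", "top")


def fetch_listing(subreddit, listing="top", limit=20):
    """Return up to `limit` posts from subreddit.hot() or subreddit.top().

    Hypothetical helper: `subreddit` is any object exposing those two
    methods, such as the one returned by reddit.subreddit(name).
    """
    if listing not in ALLOWED_LISTINGS:
        raise ValueError(f"listing must be one of {ALLOWED_LISTINGS}, got {listing!r}")
    return list(getattr(subreddit, listing)(limit=limit))


# Stand-in for a real subreddit object (assumption, for illustration only).
fake_subreddit = SimpleNamespace(
    hot=lambda limit: [f"hot-{i}" for i in range(limit)],
    top=lambda limit: [f"top-{i}" for i in range(limit)],
)
print(fetch_listing(fake_subreddit, "hot", limit=3))
# ['hot-0', 'hot-1', 'hot-2']
```

With the real client, the inner loop could then iterate over `fetch_listing(reddit.subreddit(sub), "hot")` instead of calling `.top()` directly.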


5. Collect Comments

Task: For the subset of the most relevant posts, collect 5 comments per post.

In [7]:
comments_data = []

for post_id in df_posts["id"]:
    submission = reddit.submission(id=post_id)
    submission.comments.replace_more(limit=0)  # drop the "load more comments" placeholders
    
    for comment in submission.comments[:5]:
        comments_data.append({
            "post_id": post_id,
            "body": comment.body,
            "score": comment.score
        })

df_comments = pd.DataFrame(comments_data)
df_comments.to_csv("../salida/comments.csv", index=False, encoding="utf-8-sig")

df_comments.head()

| | post_id | body | score |
|---|---|---|---|
| 0 | jptq5n | EVERYONE WHO TURNED OUT IN 2020 NEEDS TO COME ... | 2909 |
| 1 | jptq5n | Never won the popular vote, got impeached, los... | 2684 |
| 2 | jptq5n | As German I understand how a country can be sc... | 2440 |
| 3 | jptq5n | Let us not soon forget that this victory was n... | 10370 |
| 4 | jptq5n | Trump is living the 2020 experience - got Covi... | 6067 |
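
The task statement mentions a subset of the most relevant posts, whereas the loop above requests comments for every collected post ID. A minimal sketch of one way to pick such a subset, assuming that "most relevant" means the highest `score` within each subreddit (an assumption, since the exercise does not define relevance); `top_posts_per_subreddit` is a hypothetical helper that works on a list of dicts shaped like `posts_data`:

```python
from collections import defaultdict


def top_posts_per_subreddit(posts, n=5):
    """Return the n highest-scoring posts of each subreddit.

    Hypothetical helper: `posts` is a list of dicts with at least the
    keys "subreddit" and "score", like the rows appended to posts_data.
    """
    by_sub = defaultdict(list)
    for post in posts:
        by_sub[post["subreddit"]].append(post)

    selected = []
    for sub_posts in by_sub.values():
        ranked = sorted(sub_posts, key=lambda p: p["score"], reverse=True)
        selected.extend(ranked[:n])
    return selected


# Made-up rows with the same keys as posts_data (illustration only).
sample_posts = [
    {"subreddit": "politics", "id": "a1", "score": 10},
    {"subreddit": "politics", "id": "a2", "score": 30},
    {"subreddit": "worldnews", "id": "b1", "score": 5},
    {"subreddit": "worldnews", "id": "b2", "score": 7},
]
print([p["id"] for p in top_posts_per_subreddit(sample_posts, n=1)])
# ['a2', 'b2']
```

Looping over the `id` values of the selected posts, rather than over `df_posts["id"]`, would reduce the number of API requests accordingly.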


6. Storage
   
- `posts.csv`: contains the post data.
- `comments.csv`: contains the comments associated with those posts.

Both files are saved in the `salida` folder.
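
`DataFrame.to_csv` does not create missing parent directories, so both export cells fail if the `salida` folder does not exist yet. A minimal sketch of a guard that could run before the exports; `ensure_output_dir` is a hypothetical helper, and the temporary directory is used here only so the example leaves no files behind:

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path):
    """Create the output folder (and any missing parents) and return it as a Path.

    Hypothetical helper: does nothing if the folder already exists.
    """
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Demonstration inside a throwaway directory instead of the real project tree.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(Path(tmp) / "salida")
    created = out_dir.is_dir()
    # Calling it again is safe because of exist_ok=True.
    ensure_output_dir(out_dir)

print(created)
# True
```

In the notebook this could look like `out_dir = ensure_output_dir("../salida")` followed by `df_posts.to_csv(out_dir / "posts.csv", index=False, encoding="utf-8-sig")`.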