<a href="https://colab.research.google.com/github/DtotheS/AI-in-the-wild/blob/main/src/reddit_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reddit Data Collection with Python PRAW

<!DOCTYPE html>
<html lang="en">
<body>
    <h2>Sign-In for this Workshop</h2>
    <img src="https://drive.google.com/uc?export=view&id=1NjmZejmnOhLlv0OT7_-tCJU1LOhLT51V"
         width="250">
</body>
</html>

## Instructor: [Dr. Sian Lee](https://libraries.olemiss.edu/team/dr-sian-lee/)

<table>
  <tr>
    <!-- Picture on the Left -->
    <td>
      <img src="https://libraries.olemiss.edu/wp-content/uploads/2024/08/leeavatar-150x150.png" alt="Sian" style="width: 150px; border-radius: 50%;">
    </td>
    <!-- Education Information on the Right -->
    <td style="padding-left: 20px;">
      <h3>Assistant Professor of Scholar Support and Data Services (SSDS)</h3>
      <h3>Education</h3>
      <ul>
        <li><strong>Ph.D. in Informatics, minor in Statistics</strong>, <br>College of Information Sciences and Technology (IST), <br>Penn State University</li>
        <li><strong>M.A. & B.A. in Economics</strong></li>
      </ul>
    </td>
  </tr>
</table>

* Duration: 2 sessions, 1 hour/session
* Tools: Python, PRAW (Python Reddit API Wrapper), Google Colab
* Objective: Learn how to authenticate with the Reddit API and extract subreddit data—such as posts, user information, comments, URLs, and engagement metrics—using PRAW.
* Link to today's notes: http://tiny.cc/reddit-data-1

# What You’ll Learn in This Workshop Series

## Week 1
- Overview of API & PRAW
- API key setup (client_id, client_secret, user_agent)
- Access subreddit (e.g., r/news)
- Extract post metadata (title, author, score, comments, etc.)
- Save collected posts to CSV (reddit_posts.csv)
- Search posts using keywords (e.g., search("inflation"))
- Collect recent comments from subreddit

## Week 2
- Collect user metadata (karma, cake day, mod status, etc.)
- Collect each user’s posts
- Collect each user’s comments
- Collect comments under each post


# What is Application Programming Interface (API)?



- An API is a set of *rules* and *specifications* that allows different software applications to communicate and exchange data.
- **Reddit** provides [a web API](https://www.reddit.com/dev/api/) (an API served over the web) so developers can programmatically access public Reddit data (e.g., posts, comments).
- Because it is a web API, *any programming language* (Python, R, Java, JavaScript, etc.) that can make HTTP requests (GET, POST, etc.) can use it.
- APIs are not just for research or data collection. Developers also use them to build apps, connect services, automate workflows, and create real-time user experiences across many industries.
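
To make "an HTTP request" concrete, here is a minimal sketch using only Python's standard library. It builds (but does not send) a GET request for the newest posts in r/news. The `oauth.reddit.com` host, the `Bearer` token placeholder, and the helper name `build_listing_request` are assumptions for illustration; PRAW takes care of these details for you.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_listing_request(subreddit, sort="new", limit=10, token="YOUR_ACCESS_TOKEN"):
    """Build (without sending) a GET request for a subreddit listing."""
    query = urlencode({"limit": limit})
    # Assumed endpoint shape: /r/<subreddit>/<sort> on Reddit's OAuth host
    url = f"https://oauth.reddit.com/r/{subreddit}/{sort}?{query}"
    headers = {
        "User-Agent": "WorkshopPraw by /u/yourusername",  # placeholder user agent
        "Authorization": f"Bearer {token}",               # placeholder token
    }
    return Request(url, headers=headers, method="GET")


req = build_listing_request("news")
print(req.get_method(), req.full_url)
```

Nothing is sent over the network here; the point is only that an API call is a URL, a method, and a few headers.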


<!DOCTYPE html>
<html lang="en">
<body>
    <img src="https://media.geeksforgeeks.org/wp-content/uploads/20230216170349/What-is-an-API.png"
         width="800">
</body>
</html>

# What is PRAW and Why Use It?

- PRAW stands for **Python Reddit API Wrapper**.
- It is a Python library that makes it easy to interact with Reddit’s API.
- It abstracts away the complexity of making raw HTTP requests.
    - Object-Oriented Interface
    - PRAW provides Pythonic objects like Submission, Comment, Subreddit, Redditor.
    - You get structured attributes like .title, .score, .author, etc.
- With just a few lines of code, you can access posts, comments, and user info.
- **EASY TO USE**



[PRAW documentation Link](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html#)

# Set Up Reddit API Credentials

To access the Reddit API using PRAW, you need to create a Reddit app:

1. Go to [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps)
2. Sign up (if you are new to Reddit)
    - You will need an email verification code
3. Click **“are you a developer? create an app...”** or **“Create Another App”**
4. Fill in the form:
   - **Name**: custom name. e.g., `WorkshopPraw`
   - **App type**: select `script`
   - **Redirect URI**: placeholder `http://localhost:8080`
   - **Description**: *(optional)*
5. Click **Create App**
6. After creating, save the following credentials:
   - `client_id`
   - `client_secret`
   - `user_agent` *(e.g., "YOURAPPNAME by /u/yourusername", "WorkshopPraw by /u/username")*


<!DOCTYPE html>
<html lang="en">
<body>
    <h3>Create App</h3>
    <img src="https://drive.google.com/uc?export=view&id=1ocry4-TbRR4BTZZoe5Hok0ZlW5YYZING"
         width="800">
</body>
</html>

<html lang="en">
<body>
    <h3>id, secret, username</h3>
    <img src="https://drive.google.com/uc?export=view&id=12RuMFH8DdbIhNETKcz3-MvUH99ctS7WP"
         width="800">
</body>
</html>

# Install and Import PRAW

- install praw

In [1]:
!pip install praw

Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 189.3/189.3 kB 3.8 MB/s eta 0:00:00
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update_checker, prawcore, praw
Successfully installed praw-7.8.1 prawcore-2.4.0 update_checker-0.18.0


* Import necessary libraries

In [3]:
import praw
import pandas as pd
from datetime import datetime

# Authenticate with Reddit API

- https://www.reddit.com/prefs/apps
- Define credentials
- Initialize reddit instance

In [4]:
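# Define credentials first (replace the placeholders with the values from your Reddit app;
# never publish real credentials in a shared notebook)
client_id="YOUR_CLIENT_ID"
client_secret="YOUR_CLIENT_SECRET"
user_agent="WorkshopPraw by /u/yourusername"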


# Initialize Reddit instance
reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)
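
If you prefer not to type credentials into the notebook at all, one option is to read them from environment variables. The sketch below is only an illustration: the variable names (`REDDIT_CLIENT_ID`, etc.) and the helper `load_reddit_credentials` are hypothetical, not part of PRAW.

```python
import os

# Hypothetical variable names; choose any names you like.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit read from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Usage (requires praw and real credentials set in the environment):
# reddit = praw.Reddit(**load_reddit_credentials())
```

This keeps the secret out of the saved notebook, so sharing the file does not share the key.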

# Access a Subreddit r/news

[Reddit Communities](https://www.reddit.com/best/communities/1/)


- get 10 submissions/posts from r/news

In [6]:
reddit.subreddit("news").new(limit=10)

<praw.models.listing.generator.ListingGenerator at 0x7b5b2bcb7b10>

- extract data from each submission

In [9]:
for submission in reddit.subreddit("news").new(limit=10):
    print("title: ", submission.title)
    print("id: ", submission.id)
    print("--"*50)

title:  US FDA suspends milk quality tests amid workforce cuts
id:  1k59llu
----------------------------------------------------------------------------------------------------
title:  25 tourists dead as Islamic outfit opens fire at Pahalgam, Jammu and Kashmir.
id:  1k58ah4
----------------------------------------------------------------------------------------------------
title:  Films made with AI can win Oscars, Academy says
id:  1k584cy
----------------------------------------------------------------------------------------------------
title:  Info Pete Hegseth shared with wife, brother came from top general's secure messages
id:  1k571mo
----------------------------------------------------------------------------------------------------
title:  Francis changed church policy on the death penalty and nuclear weapons but upheld it on abortion
id:  1k56y53
----------------------------------------------------------------------------------------------------
title:  Mexican Cartels Are 

| Sort Option | Description |
|-------------|-------------|
| **hot**     | Most "active" posts right now — based on score, time decay, and engagement |
| **new**     | Posts sorted by most recent — no upvote influence |
| **rising**  | Posts that are quickly gaining upvotes and engagement, but not yet on "hot" |
| **top**     | Posts with the highest scores over a time period (all-time, week, day, etc.) |
| **best**    | Personalized best posts (⚠️ available only on the front page for logged-in users; `subreddit.best()` does not exist) |
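
If you want to switch sort options without rewriting the loop, a small helper can pick the listing method by name. This is a sketch: `fetch_listing` and `FakeSubreddit` are hypothetical names, and the fake class only exists so the example runs offline. With PRAW you would pass `reddit.subreddit("news")` instead.

```python
VALID_SORTS = ("hot", "new", "rising", "top")


def fetch_listing(subreddit, sort="new", limit=10):
    """Call subreddit.<sort>(limit=limit) after validating the sort name."""
    if sort not in VALID_SORTS:
        raise ValueError(f"sort must be one of {VALID_SORTS}, got {sort!r}")
    return getattr(subreddit, sort)(limit=limit)


class FakeSubreddit:
    """Offline stand-in that mimics the four listing methods used above."""

    def _listing(self, name, limit):
        return [f"{name}-{i}" for i in range(limit)]

    def hot(self, limit=10):
        return self._listing("hot", limit)

    def new(self, limit=10):
        return self._listing("new", limit)

    def rising(self, limit=10):
        return self._listing("rising", limit)

    def top(self, limit=10):
        return self._listing("top", limit)


print(fetch_listing(FakeSubreddit(), sort="top", limit=2))
```

Asking for `sort="best"` raises a `ValueError`, which matches the note in the table.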

-  Hide PRAW async environment warnings

In [8]:
import logging

# Suppress PRAW async environment warnings
logger = logging.getLogger("praw")
logger.setLevel(logging.ERROR)

# Store and Save Data

- attributes: id, title, author, body text, score, upvote ratio, number of comments, date, url, post link

In [10]:
posts = []
for submission in reddit.subreddit("news").new(limit=None): #hot, new, rising, top,
    posts.append({
        "id": submission.id, # unique post id
        "title": submission.title,
        "author": str(submission.author), # user name
        "selftext": submission.selftext, # body text
        "score": submission.score, # Upvote - Downvote
        "upvote_ratio": submission.upvote_ratio,
        "num_comments": submission.num_comments,
        "created_utc": submission.created_utc, #UTC: Coordinated Universal Time
        "created_date": datetime.utcfromtimestamp(submission.created_utc).strftime("%Y-%m-%d %H:%M:%S"), #string format time
        "url": submission.url,
        "permalink": "https://reddit.com"+submission.permalink
    })

print(f"Collected {len(posts)} new posts")

Collected 292 new posts
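
The dictionary above can also be built by a small reusable function. The sketch below makes two assumptions beyond the original cell: it formats dates with a timezone-aware call (since `datetime.utcfromtimestamp` is deprecated as of Python 3.12), and it stores `None` rather than the string `"None"` when a post has no author (for example, a deleted account). `submission_to_row` is a hypothetical helper name, and the `SimpleNamespace` object stands in for a PRAW submission so the example runs offline.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def utc_string(timestamp):
    """Format a Unix timestamp as a UTC date string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def submission_to_row(submission):
    """Convert a submission-like object into a flat dictionary."""
    author = submission.author
    return {
        "id": submission.id,
        "title": submission.title,
        "author": str(author) if author is not None else None,
        "score": submission.score,
        "created_utc": submission.created_utc,
        "created_date": utc_string(submission.created_utc),
        "permalink": "https://reddit.com" + submission.permalink,
    }


# Offline stand-in using values from the first post shown in this notebook
fake_post = SimpleNamespace(
    id="1k59llu",
    title="US FDA suspends milk quality tests amid workforce cuts",
    author=None,  # pretend the account was deleted
    score=38,
    created_utc=1745336787.0,
    permalink="/r/news/comments/1k59llu/us_fda_suspends_milk_quality_tests_amid_workforce/",
)
print(submission_to_row(fake_post))
```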


In [11]:
posts[:10]

[{'id': '1k59llu',
  'title': 'US FDA suspends milk quality tests amid workforce cuts',
  'author': 'wei-long',
  'selftext': '',
  'score': 38,
  'upvote_ratio': 0.93,
  'num_comments': 9,
  'created_utc': 1745336787.0,
  'created_date': '2025-04-22 15:46:27',
  'url': 'https://www.reuters.com/business/healthcare-pharmaceuticals/us-fda-suspends-milk-quality-tests-amid-workforce-cuts-2025-04-21/',
  'permalink': 'https://reddit.com/r/news/comments/1k59llu/us_fda_suspends_milk_quality_tests_amid_workforce/'},
 {'id': '1k58ah4',
  'title': '25 tourists dead as Islamic outfit opens fire at Pahalgam, Jammu and Kashmir.',
  'author': 'sacredsome',
  'selftext': '',
  'score': 148,
  'upvote_ratio': 0.92,
  'num_comments': 8,
  'created_utc': 1745333638.0,
  'created_date': '2025-04-22 14:53:58',
  'url': 'https://www.thehindu.com/news/national/pahalgam-jammu-kashmir-terror-attack-tourists-dead-injured-april-22-2025/article69478557.ece',
  'permalink': 'https://reddit.com/r/news/comments/1k5

In [12]:
df = pd.DataFrame(posts)
df

|     | id | title | author | selftext | score | upvote_ratio | num_comments | created_utc | created_date | url | permalink |
|-----|----|-------|--------|----------|-------|--------------|--------------|-------------|--------------|-----|-----------|
| 0 | 1k59llu | US FDA suspends milk quality tests amid workfo... | wei-long | | 38 | 0.93 | 9 | 1.745337e+09 | 2025-04-22 15:46:27 | https://www.reuters.com/business/healthcare-ph... | https://reddit.com/r/news/comments/1k59llu/us_... |
| 1 | 1k58ah4 | 25 tourists dead as Islamic outfit opens fire ... | sacredsome | | 148 | 0.92 | 8 | 1.745334e+09 | 2025-04-22 14:53:58 | https://www.thehindu.com/news/national/pahalga... | https://reddit.com/r/news/comments/1k58ah4/25_... |
| 2 | 1k584cy | Films made with AI can win Oscars, Academy says | aiiimee | | 174 | 0.74 | 75 | 1.745333e+09 | 2025-04-22 14:46:49 | https://www.bbc.co.uk/news/articles/cqx4y1lrz2vo | https://reddit.com/r/news/comments/1k584cy/fil... |
| 3 | 1k571mo | Info Pete Hegseth shared with wife, brother ca... | rapidcreek409 | | 7249 | 0.98 | 253 | 1.745330e+09 | 2025-04-22 14:01:09 | https://www.nbcnews.com/politics/national-secu... | https://reddit.com/r/news/comments/1k571mo/inf... |
| 4 | 1k56y53 | Francis changed church policy on the death pen... | Strict_League7833 | | 607 | 0.93 | 133 | 1.745330e+09 | 2025-04-22 13:57:00 | https://apnews.com/article/pope-francis-career... | https://reddit.com/r/news/comments/1k56y53/fra... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 287 | 1jtmpj0 | US sidelines DOJ lawyer involved in deportatio... | swap_019 | | 8826 | 0.97 | 203 | 1.744036e+09 | 2025-04-07 14:33:55 | https://www.reuters.com/world/us/us-sidelines-... | https://reddit.com/r/news/comments/1jtmpj0/us_... |
| 288 | 1jtlin6 | Elon Musk's X to clamp down on parody accounts | Aggravating_Money992 | | 19650 | 0.92 | 1758 | 1.744033e+09 | 2025-04-07 13:41:39 | https://www.bbc.com/news/articles/c4g37elkrxdo | https://reddit.com/r/news/comments/1jtlin6/elo... |
| 289 | 1jtixn7 | Visa records of CMU international students ter... | apple_kicks | | 2330 | 0.95 | 288 | 1.744025e+09 | 2025-04-07 11:26:39 | https://www.ourmidland.com/news/article/cmu-in... | https://reddit.com/r/news/comments/1jtixn7/vis... |
| 290 | 1jtioi2 | M23 rebels and Congolese government hold first... | Rogue_Eccentric | | 189 | 0.95 | 2 | 1.744024e+09 | 2025-04-07 11:10:55 | https://www.zimsphere.co.zw/2025/04/m23-rebels... | https://reddit.com/r/news/comments/1jtioi2/m23... |


- Explore df

In [None]:
df.columns
df.shape
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            288 non-null    object 
 1   title         288 non-null    object 
 2   author        288 non-null    object 
 3   selftext      288 non-null    object 
 4   score         288 non-null    int64  
 5   upvote_ratio  288 non-null    float64
 6   num_comments  288 non-null    int64  
 7   created_utc   288 non-null    float64
 8   created_date  288 non-null    object 
 9   url           288 non-null    object 
 10  permalink     288 non-null    object 
dtypes: float64(2), int64(2), object(7)
memory usage: 24.9+ KB


|     | id | title | author | selftext | score | upvote_ratio | num_comments | created_utc | created_date | url | permalink |
|-----|----|-------|--------|----------|-------|--------------|--------------|-------------|--------------|-----|-----------|
| 0 | 1k4u8n4 | RFK Jr.'s autism study to amass medical record... | tsagdiyev | | 20 | 0.92 | 7 | 1745285000.0 | 2025-04-22 01:20:47 | https://www.cbsnews.com/news/rfk-jr-autism-stu... | https://reddit.com/r/news/comments/1k4u8n4/rfk... |
| 1 | 1k4t93l | Person found on ‘elevated surface’ inside Trum... | HellaHaram | | 1169 | 0.95 | 108 | 1745282000.0 | 2025-04-22 00:31:44 | https://www.ctvnews.ca/world/article/person-fo... | https://reddit.com/r/news/comments/1k4t93l/per... |
| 2 | 1k4rjt5 | China retreats from US private equity investme... | p_pio | | 1794 | 0.98 | 107 | 1745277000.0 | 2025-04-21 23:11:19 | https://www.reuters.com/world/china/china-retr... | https://reddit.com/r/news/comments/1k4rjt5/chi... |
| 3 | 1k4p3ey | Student loans in default to be referred to deb... | stubborn_facts | | 8042 | 0.98 | 892 | 1745271000.0 | 2025-04-21 21:24:03 | https://apnews.com/article/student-loan-debt-d... | https://reddit.com/r/news/comments/1k4p3ey/stu... |
| 4 | 1k4norc | Nadine Menendez found guilty in bribery trial | StupendousMan1995 | | 743 | 0.98 | 17 | 1745267000.0 | 2025-04-21 20:26:12 | https://www.nbcnewyork.com/news/local/crime-an... | https://reddit.com/r/news/comments/1k4norc/nad... |


-  Save to a CSV file named 'reddit_posts.csv'

In [13]:
df.to_csv('reddit_posts.csv', index=False)  # Save to a CSV file named 'reddit_posts.csv'

# Search Reddit Posts by Keyword

- `search()`: searches both post titles and body text
- `sort`: "relevance", "hot", "top", "new", or "comments" (default: "relevance")
- `time_filter`: "all", "day", "hour", "month", "week", or "year" (default: "all")
- Returns at most the first ~250 matches

| Operator          | Meaning                       |
|-------------------|-------------------------------|
| `OR`              | At least one term must appear |
| `AND`             | Both terms must appear        |
| `"phrase"`        | Exact phrase                  |
| `-word`           | Exclude that word             |
| `title:"term"`    | Search only title             |
| `selftext:"term"` | Search only body              |
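
The operators in the table are plain text inside the query string, so you can assemble queries with ordinary string formatting. The helpers below (`build_query`, `field_query`) are hypothetical names for this sketch; they only build the string that you would then pass to `search(query=...)`.

```python
def build_query(terms, operator="AND", exclude=()):
    """Join terms with AND/OR and append -word exclusions."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    query = f" {operator} ".join(terms)
    excluded = " ".join(f"-{word}" for word in exclude)
    return f"{query} {excluded}".strip()


def field_query(field, term):
    """Restrict a term to the title or the body (selftext)."""
    if field not in ("title", "selftext"):
        raise ValueError("field must be 'title' or 'selftext'")
    return f'{field}:"{term}"'


print(build_query(["inflation", "sport"], operator="OR"))  # inflation OR sport
print(build_query(["inflation"], exclude=["US", "cut"]))   # inflation -US -cut
print(field_query("selftext", "inflation"))                # selftext:"inflation"
```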

In [None]:
#queries: 'inflation', 'inflation OR sport', 'inflation AND us', '"inflation fall"', 'inflation -US', 'inflation -US -cut', 'selftext:"inflation"',
i = 0
for submission in reddit.subreddit("news").search(query='inflation', sort="relevance", time_filter="all", limit=10):
    print(submission.id)
    print(submission.title)
    print(submission.selftext)
    print('reddit.com'+submission.permalink)
    print("-" * 80)
    i += 1

print('len:',i)

1jpz95n
Trump announces sweeping new tariffs to promote US manufacturing, risking inflation and trade wars

reddit.com/r/news/comments/1jpz95n/trump_announces_sweeping_new_tariffs_to_promote/
--------------------------------------------------------------------------------
1insgf2
US inflation heats up to 3% for first time since June

reddit.com/r/news/comments/1insgf2/us_inflation_heats_up_to_3_for_first_time_since/
--------------------------------------------------------------------------------
1cwo8go
Target to lower prices on about 5,000 basic goods as inflation cuts into budgets

reddit.com/r/news/comments/1cwo8go/target_to_lower_prices_on_about_5000_basic_goods/
--------------------------------------------------------------------------------
1fjzgkq
Federal Reserve cuts key rate by sizable half-point, signaling end to its inflation fight

reddit.com/r/news/comments/1fjzgkq/federal_reserve_cuts_key_rate_by_sizable/
-------------------------------------------------------------------

# Search Most Recent Comments

- `comments()`
- For comment analysis
- attributes: id, author, body, score, link, post title, post id, etc.

In [None]:
i = 0
for comment in reddit.subreddit("news").comments(limit=10):
    print("Comment ID:", comment.id)
    print("Comment Author:", comment.author)
    print("Comment Body:", comment.body)
    print("Comment Score:", comment.score)
    print("Comment Permalink:", f"https://www.reddit.com{comment.permalink}")
    print("Post title: ", comment.submission.title)
    print("Post id: ", comment.submission.id)
    print("-" * 80)
    i+=1

print('len:',i)

Comment ID: mocsewu
Comment Author: BarelyContainedChaos
Comment Body: At least he took the high ground
Comment Score: 1
Comment Permalink: https://www.reddit.com/r/news/comments/1k4t93l/person_found_on_elevated_surface_inside_trump/mocsewu/
Post title:  Person found on ‘elevated surface’ inside Trump Tower in New York is arrested, police say
Post id:  1k4t93l
--------------------------------------------------------------------------------
Comment ID: mocsck6
Comment Author: Mobile-Bar7732
Comment Body: Just a tattoo.
Comment Score: 1
Comment Permalink: https://www.reddit.com/r/news/comments/1k4t93l/person_found_on_elevated_surface_inside_trump/mocsck6/
Post title:  Person found on ‘elevated surface’ inside Trump Tower in New York is arrested, police say
Post id:  1k4t93l
--------------------------------------------------------------------------------
Comment ID: mocsawl
Comment Author: blogoman
Comment Body: The back of a really sexy looking couch.
Comment Score: 1
Comment Permalink: 

# Read CSV file as df

In [31]:
import pandas as pd

# Load the CSV file
df_posts = pd.read_csv('reddit_posts.csv')

# Preview the first few rows
df_posts.head()
df_posts.info() # selftext is entirely null: empty strings are read back from the CSV as NaN

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            292 non-null    object 
 1   title         292 non-null    object 
 2   author        292 non-null    object 
 3   selftext      0 non-null      float64
 4   score         292 non-null    int64  
 5   upvote_ratio  292 non-null    float64
 6   num_comments  292 non-null    int64  
 7   created_utc   292 non-null    float64
 8   created_date  292 non-null    object 
 9   url           292 non-null    object 
 10  permalink     292 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 25.2+ KB
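
Note that `selftext` comes back with 0 non-null values and dtype `float64`: link posts have an empty body, and by default pandas reads empty CSV fields as `NaN`. If you would rather keep them as empty strings, one option is `keep_default_na=False`. The sketch below uses a tiny made-up DataFrame (the IDs are placeholders) and an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Tiny made-up frame with empty body text, similar to link posts
df_small = pd.DataFrame({"id": ["a1", "b2"], "selftext": ["", ""]})

buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: empty fields become NaN
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default["selftext"].isna().all())

# keep_default_na=False: empty fields stay as empty strings
df_kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_kept["selftext"].tolist())
```

Be aware that `keep_default_na=False` applies to every column, so strings such as "NA" or "null" are also kept as text.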


# Collect User Metadata

- Unique users

In [32]:
authors = df_posts['author'].unique().tolist()
authors

['wei-long',
 'sacredsome',
 'aiiimee',
 'rapidcreek409',
 'Strict_League7833',
 'Aggravating_Money992',
 'p_pio',
 'Mein_Bergkamp',
 'steffxoxoxoo',
 'tsagdiyev',
 'HellaHaram',
 'stubborn_facts',
 'StupendousMan1995',
 'NeilPoonHandler',
 'KarateKid917',
 'AudibleNod',
 'Warcraft_Fan',
 'MinnieMaas',
 'lastdarknight',
 'randy88moss',
 'johnboy43214321',
 'thongs_are_footwear',
 'A-CommonMan',
 'ComfortableAcadia252',
 'yourdonefor_wt',
 'KanYeWestGreatest',
 'JunkReallyMatters',
 'SnowFallIcy',
 'justalazygamer',
 'Giagiaaaa',
 'Smile_you_got_owned',
 'postonrddt',
 'Seek_Adventure',
 'CorleoneBaloney',
 'brokebacknomountain',
 'Infamous-Sky-1874',
 'Grip_Socks',
 'PM_ME_YOUR_AIRCRAFT',
 'jaded-navy-nuke',
 'CupidStunt13',
 'berrekah',
 'observertruman',
 'Reiketsu_Nariseba',
 'thejerseyguy',
 'Surly_Cynic',
 'Clownineat',
 'BimBapBoom00',
 'fd6270',
 'Superbuddhapunk',
 'Durian881',
 'LadyMadonna_x6',
 'Throwaway921845',
 'reallylatetotheparty',
 'TheBlessedWant',
 'Pretend_Ad4847',

- collect user metadata
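- `getattr(obj, name, default)`: A Python built-in function that:
    - Tries to get the attribute (here, `public_description`) from the object (here, `redditor.subreddit`)
    - If the attribute doesn’t exist (an `AttributeError`), it returns the fallback value (`None` here). Other kinds of errors are not caught by `getattr` and are handled by the surrounding `try`/`except`.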
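
A quick offline illustration of this behavior, using a made-up `Profile` class rather than a real PRAW object:

```python
class Profile:
    """Hypothetical stand-in for an object with optional attributes."""

    title = "u_example"

    @property
    def broken(self):
        # Simulates an attribute whose lookup fails for another reason
        raise RuntimeError("network failure")


profile = Profile()

print(getattr(profile, "title", None))               # attribute exists
print(getattr(profile, "public_description", None))  # missing, so the fallback is used

try:
    getattr(profile, "broken", None)
except RuntimeError as error:
    # getattr's fallback only covers AttributeError
    print("not caught by getattr:", error)
```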

In [29]:
user_data = []

for username in authors:
    try:
        redditor = reddit.redditor(username)
        user_data.append({
            "username": username,
            "comment_karma": redditor.comment_karma,
            "post_karma": redditor.link_karma,
            "cake_day": datetime.utcfromtimestamp(redditor.created_utc).strftime("%Y-%m-%d"),
            "is_gold": redditor.is_gold, # Reddit Premium member (formerly Reddit Gold)
            "is_mod": redditor.is_mod, # A moderator on Reddit is a user who helps manage and oversee a subreddit.
            "has_verified_email": redditor.has_verified_email,
            "description": getattr(redditor.subreddit, 'public_description', None)
        })
    except Exception as e:
        print(f"Failed to fetch data for {username}: {e}")

df_users = pd.DataFrame(user_data)
df_users.to_csv("reddit_users.csv", index=False)

Failed to fetch data for Impossible_Piano_29: 'Redditor' object has no attribute 'comment_karma'


- An exception may occur if the user account is suspended, deleted, or does not exist.
    - You can check an account's status at reddit.com/user/*username*

# Posts by Each User

In [33]:
user_posts = []

for username in authors[:10]:
    try:
        redditor = reddit.redditor(username)
        for post in redditor.submissions.new(limit=10):  # Adjust limit as needed
            user_posts.append({
                "username": username,
                "title": post.title,
                "subreddit": str(post.subreddit),
                "score": post.score,
                "num_comments": post.num_comments,
                "created_utc": post.created_utc,
                "created_date": datetime.utcfromtimestamp(post.created_utc).strftime("%Y-%m-%d %H:%M:%S"),
                "url": post.url,
                "permalink": f"https://reddit.com{post.permalink}"
            })
    except Exception as e:
        print(f"Failed to fetch posts for {username}: {e}")

df_user_posts = pd.DataFrame(user_posts)
df_user_posts.to_csv("user_posts.csv", index=False)

# Comments by Each User

In [34]:
user_comments = []

for username in authors[:10]:
    try:
        redditor = reddit.redditor(username)
        for comment in redditor.comments.new(limit=10):
            user_comments.append({
                "username": username,
                "body": comment.body,
                "score": comment.score,
                "subreddit": str(comment.subreddit),
                "created_utc": comment.created_utc,
                "created_date": datetime.utcfromtimestamp(comment.created_utc).strftime("%Y-%m-%d %H:%M:%S"),
                "link": f"https://reddit.com{comment.permalink}"
            })
    except Exception as e:
        print(f"Failed to fetch comments for {username}: {e}")

df_user_comments = pd.DataFrame(user_comments)
df_user_comments.to_csv("user_comments.csv", index=False)

# Comments under Each Post

In [38]:
post_ids = df_posts["id"].unique().tolist()
len(post_ids)

292

In [64]:
submission = reddit.submission(id=post_ids[3])
submission.comment_sort = "top"
submission.comments.replace_more(limit=None)  # In-place
all_comments = submission.comments.list()     # Flattened, with MoreComments gone
all_comments[:20]

[Comment(id='mofnvxy'),
 Comment(id='mofnb6b'),
 Comment(id='mofojfv'),
 Comment(id='mofp29y'),
 Comment(id='mofs7dl'),
 Comment(id='mofpunh'),
 Comment(id='mofqa0c'),
 Comment(id='mofs8es'),
 Comment(id='mofrwxq'),
 Comment(id='moftjet'),
 Comment(id='mofn9xe'),
 Comment(id='mofwee5'),
 Comment(id='mofsgag'),
 Comment(id='mofvvt5'),
 Comment(id='mofstw7'),
 Comment(id='mofnikz'),
 Comment(id='mofu5zt'),
 Comment(id='mofwels'),
 Comment(id='mofv2pw'),
 Comment(id='mofwna3')]
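
Comments on Reddit form a tree (replies nested under replies), and `.list()` returns them as one flat list. The sketch below shows the general idea with a breadth-first walk over made-up comment objects. It is an illustration of flattening, not PRAW's actual implementation; `flatten` and `make_comment` are hypothetical helpers.

```python
from collections import deque
from types import SimpleNamespace


def make_comment(comment_id, replies=()):
    """Create a fake comment with an id and a list of replies."""
    return SimpleNamespace(id=comment_id, replies=list(replies))


def flatten(top_level_comments):
    """Return all comments in breadth-first order (top-level comments first)."""
    flat = []
    queue = deque(top_level_comments)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)
    return flat


tree = [
    make_comment("a", [make_comment("a1", [make_comment("a1x")]), make_comment("a2")]),
    make_comment("b", [make_comment("b1")]),
]
print([comment.id for comment in flatten(tree)])
```

Because the walk is breadth-first, slicing the first 20 items of a flattened list tends to favor top-level comments over deep replies.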

In [65]:
post_comments = []

for post_id in post_ids[:10]:
    try:
        submission = reddit.submission(id=post_id)
        submission.comment_sort = "top"  # Change filter: 'confidence', 'top', 'new', 'controversial', 'old', 'random'
        submission.comments.replace_more(limit=0) # limit=0 removes all "MoreComments" placeholders without fetching them; use limit=None to load every comment (slower)
        all_comments = submission.comments.list() # Flattened, with MoreComments gone
        for comment in all_comments[:20]:
            post_comments.append({
                "post_id": post_id,
                "author": str(comment.author),
                "body": comment.body,
                "score": comment.score,
                "created_utc": comment.created_utc,
                "created_date": datetime.utcfromtimestamp(comment.created_utc).strftime("%Y-%m-%d %H:%M:%S"),
                "link": f"https://reddit.com{comment.permalink}"
            })
    except Exception as e:
        print(f"Failed to fetch comments for post {post_id}: {e}")

df_post_comments = pd.DataFrame(post_comments)
df_post_comments.to_csv("post_comments.csv", index=False)


# Limitations and Possible Solutions

### Is there any Rate Limit? Yes

- Reddit has officially enforced rate limits on API access since mid-2023; see the official documentation linked at the end of this section for current request quotas. Separately from request rates, the following caps on returned results are generally observed, although they are not formally documented:
  - A **maximum of about 1,000 items per listing query** (e.g., `.top()`)
  - A **hard cap of approximately 250 results** for `.search()` queries
  - These caps may vary by endpoint (e.g., `.top()` vs. `.new()`) and are subject to change over time
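
When a long collection run hits a temporary failure (for example, too many requests), a common pattern is to wait and retry with increasing delays. The sketch below is generic Python, not a PRAW feature; `with_backoff` is a hypothetical helper, and the retry count and delays are arbitrary choices. The demo passes a fake `sleep` so it finishes instantly.

```python
import time


def with_backoff(action, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action(); on failure wait base_delay, 2*base_delay, ... and retry."""
    for attempt in range(retries):
        try:
            return action()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake action that fails twice, then succeeds
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

In real use you would wrap a single fetch, such as one user's metadata lookup, and catch a narrower exception type than `Exception`.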

### Can I access Full Historical Data? No

- Full historical access is **not available via Reddit's official API**
- In the past, services like **Pushshift** allowed this, but open access to it is no longer available
- I’ll host another workshop focused on collecting historical Reddit data next semester.

### Can I build historical data from now on? Yes

- You can build/run a **daily or hourly crawler** using `.new()` to:
  - Capture recent posts before they disappear from listings
  - Store post IDs to avoid duplication
  - Accumulate a local dataset over time
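
The steps above can be sketched with the standard library's `csv` module. The function below appends only rows whose ID is not already in the file. `append_new_rows` is a hypothetical helper, and the file name and columns in the usage comment are placeholders; scheduling the script daily or hourly is left to your environment.

```python
import csv
import os


def append_new_rows(path, rows, fieldnames, key="id"):
    """Append rows whose key is not already in the CSV; return how many were added."""
    file_exists = os.path.exists(path)
    seen = set()
    if file_exists:
        # Load the IDs already stored so duplicates can be skipped
        with open(path, newline="", encoding="utf-8") as handle:
            seen = {row[key] for row in csv.DictReader(handle)}

    fresh = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            fresh.append(row)

    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()  # write the header only for a new file
        writer.writerows(fresh)
    return len(fresh)


# Usage on each crawler run (posts is a list of dictionaries as built earlier):
# added = append_new_rows("reddit_posts_archive.csv", posts, fieldnames=list(posts[0]))
```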

You can check [official documentation](https://support.reddithelp.com/hc/en-us/articles/14945211791892-Developer-Platform-Accessing-Reddit-Data) for more details.


