In this tutorial, we will scrape Reddit submissions (posts), comments, and user data using PRAW (the Python Reddit API Wrapper), which simplifies communicating with the Reddit API from Python.

- PRAW documentation: https://praw.readthedocs.io/en/stable/index.html
- Reddit API documentation: https://www.reddit.com/dev/api/

Before running this code, you need credentials for the Reddit API. If you don't have a Reddit account yet, sign up at https://www.reddit.com/. Then follow the instructions in the link below.

- Getting Reddit API credentials: https://josephlai241.github.io/URS/credentials.html

Save your API ID and secret key in a secure place.

## Install Dependencies and Import Modules

In [None]:
!pip install praw

Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 189.3/189.3 kB 2.4 MB/s eta 0:00:00
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update_checker, prawcore, praw
Successfully installed praw-7.8.1 prawcore-2.4.0 update_checker-0.18.0


In [None]:
# optional: mount google drive if you want to save files there

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import PRAW and authenticate with the Reddit API.

In [None]:
import praw

client_id = "YOUR API ID"
client_secret = "YOUR SECRET KEY"

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="Scraper",
)
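Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the credentials were stored beforehand in environment variables named `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET`. The helper `load_credentials` is hypothetical and not part of PRAW.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    env = os.environ if env is None else env
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]

# Demo with a plain dict standing in for os.environ (placeholder values only).
demo_env = {"REDDIT_CLIENT_ID": "my-id", "REDDIT_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_credentials(demo_env)
print(client_id)
```

The returned pair can then be passed to `praw.Reddit(...)` in place of the string literals.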

Suppress unimportant warnings.

In [None]:
import logging

# Suppress warnings from PRAW
logging.getLogger('praw').setLevel(logging.ERROR)

# Explore objects

# Explore what we can get

## 1. We can scrape submissions within a subreddit.
- You must specify a listing category (Hot, New, Controversial, Top, or Rising) or run a Search.
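Since each listing category corresponds to a method of the same name on the subreddit object, the category can be chosen at run time. This is a rough sketch under that assumption. `get_listing` and `FakeSubreddit` are hypothetical names, and the fake class only stands in for a real PRAW subreddit so the example runs without credentials. Search is left out because it also needs a query string.

```python
LISTING_CATEGORIES = ("hot", "new", "controversial", "top", "rising")

def get_listing(subreddit, category, limit=5):
    """Call the listing method named by `category` on a subreddit-like object."""
    if category not in LISTING_CATEGORIES:
        raise ValueError(
            f"Unknown category {category!r}; expected one of {LISTING_CATEGORIES}"
        )
    return getattr(subreddit, category)(limit=limit)

class FakeSubreddit:
    """Stand-in for a PRAW subreddit; yields labels instead of submissions."""
    def top(self, limit=None):
        return (f"top-{i}" for i in range(limit))

    def new(self, limit=None):
        return (f"new-{i}" for i in range(limit))

for item in get_listing(FakeSubreddit(), "top", limit=3):
    print(item)
```

With a real client, `get_listing(reddit.subreddit("datascience"), "new", limit=5)` would be the equivalent call.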

In [None]:
subreddit = reddit.subreddit("datascience")

In [None]:
?subreddit

In [None]:
subreddit.top(limit=5)

<praw.models.listing.generator.ListingGenerator at 0x7ad48ece0110>

In [None]:
submission_example = next(subreddit.top(limit=5))
submission_example

Submission(id='k8nyf8')

In [None]:
?submission_example

In [None]:
# Get the 5 top posts
for submission in subreddit.top(limit=5):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: k8nyf8
Title: data siens
Author: None
Text: 
URL: https://dslntlv9vhjr4.cloudfront.net/posts_images/EcY6g2neQEaIi.png
----------
ID: oeg6nl
Title: The pain and excitement
Author: Kent-Clark-
Text: 
URL: https://i.redd.it/yqnunwryjg971.jpg
----------
ID: hohvgq
Title: Shout Out to All the Mediocre Data Scientists Out There
Author: MrBurritoQuest
Text: I've been lurking on this sub for a while now and all too often I see posts from people claiming they feel inadequate and then they go on to describe their stupid impressive background and experience. That's great and all but I'd like to move the spotlight to the rest of us for just a minute. Cheers to my fellow mediocre data scientists who don't work at FAANG companies, aren't pursing a PhD, don't publish papers, haven't won Kaggle competitions, and don't spend every waking hour improving their portfolio.  Even though we're nothing special, we still deserve some appreciation every once in a while.

/rant I'll hand it back over to the 

In [None]:
?subreddit.search

In [None]:
# The other categories work the same way as 'top'; only 'search' differs.
# To see how the 'search' method works, let's search r/datascience submissions from the past day for the keyword 'python'.

for submission in subreddit.search("python", time_filter="day"):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

## 2. You can scrape a specific user's data as well.

In [None]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [None]:
?user

In [None]:
print(f"Username: {user.name}")
print(f"Karma: {user.link_karma + user.comment_karma}")
print(f"The account is created at: {user.created_utc}") # What is this strange number?

Username: MrBurritoQuest
Karma: 11304
The account is created at: 1508214093.0


In [None]:
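from datetime import datetime, timezone
# created_utc is a Unix timestamp (seconds since 1970-01-01 UTC);
# datetime.utcfromtimestamp is deprecated since Python 3.12, so pass tz explicitly
print(datetime.fromtimestamp(user.created_utc, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S'))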

2017-10-17 04:21:33
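The "strange number" is a Unix timestamp: seconds elapsed since 1970-01-01 00:00:00 UTC. The same conversion can be applied to a whole list of scraped records. In this small sketch, `add_created_at` is a hypothetical helper and the row mirrors the value printed above.

```python
from datetime import datetime, timezone

def add_created_at(rows):
    """Return copies of `rows` with a readable UTC `created_at` string added."""
    out = []
    for row in rows:
        created = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
        out.append({**row, "created_at": created.strftime("%Y-%m-%d %H:%M:%S")})
    return out

rows = [{"name": "MrBurritoQuest", "created_utc": 1508214093.0}]
print(add_created_at(rows)[0]["created_at"])  # 2017-10-17 04:21:33
```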


## 3. You can scrape every comment under a specific submission.

In [None]:
# Use the submission ID or URL
submission_id = '15y7j15'
submission = reddit.submission(id=submission_id)

In [None]:
# This does exactly the same
submission_url = 'https://www.reddit.com/r/datascience/comments/15y7j15/microsoft_is_bringing_python_to_excel/'
submission = reddit.submission(url=submission_url)

In [None]:
submission.comments.replace_more(limit=None)
len(submission.comments.list())

114

In [None]:
comment = submission.comments.list()[0]

In [None]:
print(f"Author: {comment.author}")
print(f"Body: {comment.body}")

Author: Exact-Bird-4203
Body: Feel like this has been hyped forever. Excited to actually use it


## 4. We can also scrape a specific user's Reddit activities (submissions and comments) in general.

In [None]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [None]:
?user

In [None]:
for submission in user.submissions.new(limit=5):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: 17b4lgf
Title: Bots replying to Bots replying to Bots… (swipe)
Author: MrBurritoQuest
Text: 
URL: https://www.reddit.com/gallery/17b4lgf
----------
ID: 17b4g36
Title: Bots replying to bots replying to bots…(swipe)
Author: MrBurritoQuest
Text: 
URL: https://www.reddit.com/gallery/17b4g36
----------
ID: 11r7fdi
Title: Bought a T-Bill and feel scammed
Author: MrBurritoQuest
Text: Perhaps “scammed” is too harsh, maybe stupid is more appropriate as I clearly don’t understand how these work. My wife and I needed to stash some cash for 6 months and decided to go with a T Bill instead instead of a HYSA (first time buying one). 

We purchased $15k of the March 16th 6 Month T Bill through Fidelity. On our statement it shows a Yield to Maturity of ~4.89% so I was expecting our purchase amount to be ~$14,266 but instead our purchase amount came out to be $14,643. If my math is correct, this is a measly 2.38% which is lower than our current HYSA. 

Where did I go wrong? What are the tax implica

In [None]:
for comment in user.comments.new(limit=5):
    print(f"ID: {comment.id}")
    print(f"Body: {comment.body}")
    print('----------')

ID: ljm86sq
Body: But that’s not really training perfect pitch, that’s more so basic training on the instrument of choice. Yes, of course people with perfect pitch would have to learn where the corresponding 12 notes (and their octaves) are on the piano, but that would take no more than an hour or two.  Not saying it isn’t cool, I’m just saying it’s more in the realm of ‘neat party tricks’ instead of ‘pure talent resulting from thousands of hours of dedication and practice’.
----------
ID: ljm1jw7
Body: Perfect pitch doesn’t take much practice from what I understand, it’s just something those people can do without much thought or effort. I think it’d be analogous to me telling you what color your shirt is.

Edit: idk why y’all are so butthurt about this. You can find videos of [toddlers doing this](https://youtube.com/shorts/1y8arkuosZA?si=0Tljz4JUd_K4DuCt). I’m not saying it isn’t cool, I’m just saying for people with perfect pitch, this kind of thing is closer to a neat party trick

# Define functions
- submission_to_dict(submission) returns most of the available attributes of a submission object as a dictionary.
- user_to_dict(user) returns most of the available attributes of a user object as a dictionary.
- comment_to_dict(comment) returns most of the available attributes of a comment object as a dictionary.

In [None]:
def submission_to_dict(submission): # submission = reddit.submission(id=...) or an item from a listing such as subreddit.top()
    data = {
        'author': str(submission.author),
        'author_flair_text': submission.author_flair_text,
        'clicked': submission.clicked,
        'comments': len(submission.comments),
        'created_utc': submission.created_utc,
        'distinguished': submission.distinguished,
        'edited': submission.edited,
        'id': submission.id,
        'is_original_content': submission.is_original_content,
        'is_self': submission.is_self,
        'locked': submission.locked,
        'name': submission.name,
        'num_comments': submission.num_comments,
        'over_18': submission.over_18,
        'permalink': submission.permalink,
        'saved': submission.saved,
        'score': submission.score,
        'selftext': submission.selftext,
        'spoiler': submission.spoiler,
        'stickied': submission.stickied,
        'subreddit': str(submission.subreddit),
        'title': submission.title,
        'upvote_ratio': submission.upvote_ratio,
        'url': submission.url
    }

    return data

def user_to_dict(user): # user = reddit.redditor(username)
    data = {
        'comment_karma': user.comment_karma,
        'created_utc': user.created_utc,
        'has_verified_email': user.has_verified_email,
        'icon_img': user.icon_img,
        'id': user.id,
        'is_employee': user.is_employee,
        'is_friend': user.is_friend,
        'is_mod': user.is_mod,
        'is_gold': user.is_gold,
        'link_karma': user.link_karma,
        'name': user.name,
        'subreddit_banner_img': user.subreddit.banner_img if user.subreddit else 'NA',
        'subreddit_name': user.subreddit.name if user.subreddit else 'NA',
        'subreddit_over_18': user.subreddit.over_18 if user.subreddit else 'NA',
        'subreddit_public_description': user.subreddit.public_description if user.subreddit else 'NA',
        'subreddit_subscribers': user.subreddit.subscribers if user.subreddit else 'NA',
        'subreddit_title': user.subreddit.title if user.subreddit else 'NA',
    }

    return data

def comment_to_dict(comment): # comment = submission.comments.list()[index]
    data = {
        'author': str(comment.author),
        'body': comment.body,
        'body_html': comment.body_html,
        'created_utc': comment.created_utc,
        'distinguished': comment.distinguished,
        'edited': comment.edited,
        'id': comment.id,
        'is_submitter': comment.is_submitter,
        'link_id': comment.link_id,
        'parent_id': comment.parent_id,
        'permalink': comment.permalink,
        'replies': len(comment.replies),
        'saved': comment.saved,
        'score': comment.score,
        'stickied': comment.stickied,
        'submission': str(comment.submission),
        'subreddit': str(comment.subreddit),
        'subreddit_id': comment.subreddit_id
    }

    return data

# Scrape and return dataframes
- Because the functions above return dictionaries, we can loop over many objects, collect the resulting dictionaries in a list, and convert that list into a pandas DataFrame.

## Submissions

In [None]:
import pandas as pd

subreddit = reddit.subreddit("datascience")
submissions_data = []

# Get the top 5 posts
for submission in subreddit.top(limit=5):
    submissions_data.append(submission_to_dict(submission))

# Convert the list of dictionaries to a DataFrame
submissions_df = pd.DataFrame(submissions_data)

submissions_df

Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,,,False,26,1607371000.0,,False,k8nyf8,False,False,...,/r/datascience/comments/k8nyf8/data_siens/,False,4105,,False,False,datascience,data siens,0.97,https://dslntlv9vhjr4.cloudfront.net/posts_ima...
1,Kent-Clark-,,False,41,1625519000.0,,False,oeg6nl,False,False,...,/r/datascience/comments/oeg6nl/the_pain_and_ex...,False,3920,,False,False,datascience,The pain and excitement,0.97,https://i.redd.it/yqnunwryjg971.jpg
2,MrBurritoQuest,,False,97,1594353000.0,,False,hohvgq,False,True,...,/r/datascience/comments/hohvgq/shout_out_to_al...,False,3633,I've been lurking on this sub for a while now ...,False,False,datascience,Shout Out to All the Mediocre Data Scientists ...,0.99,https://www.reddit.com/r/datascience/comments/...
3,CompetitivePlastic67,,False,28,1663139000.0,,False,xdv6nz,False,False,...,/r/datascience/comments/xdv6nz/lets_keep_this_on/,False,3595,,False,False,datascience,Let's keep this on...,0.97,https://i.redd.it/k102dyo0yrn91.jpg
4,,,False,172,1647837000.0,,False,tj3kek,False,False,...,/r/datascience/comments/tj3kek/guys_weve_been_...,False,3469,,False,False,datascience,"Guys, we’ve been doing it wrong this whole time",0.96,https://i.imgur.com/TAex5zG.jpg


## User information

In [None]:
username_list = ['MrBurritoQuest'] # Add usernames as needed
user_data = []

for username in username_list:
    user = reddit.redditor(username)
    user_data.append(user_to_dict(user))

# Convert the list of dictionaries to a DataFrame
user_df = pd.DataFrame(user_data)
user_df

Unnamed: 0,comment_karma,created_utc,has_verified_email,icon_img,id,is_employee,is_friend,is_mod,is_gold,link_karma,name,subreddit_banner_img,subreddit_name,subreddit_over_18,subreddit_public_description,subreddit_subscribers,subreddit_title
0,5270,1508214000.0,True,https://www.redditstatic.com/avatars/defaults/...,hl45shh,False,False,False,False,6034,MrBurritoQuest,,t5_5wl1i,False,,0,


## Comments under a submission

In [None]:
# Use the submission ID or URL
submission_id = '15y7j15'
submission = reddit.submission(id=submission_id)

# Replace "More Comments" objects with the comments they represent
submission.comments.replace_more(limit=None)

submission_comments_data = []

# Iterate through the comments
for comment in submission.comments.list():
    submission_comments_data.append(comment_to_dict(comment))

# Convert the list of dictionaries to a DataFrame
submission_comments_df = pd.DataFrame(submission_comments_data)
submission_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,Exact-Bird-4203,Feel like this has been hyped forever. Excited...,"<div class=""md""><p>Feel like this has been hyp...",1.692716e+09,,False,jxa2h1w,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,2,False,255,False,15y7j15,datascience,t5_2sptq
1,,[deleted],"<div class=""md""><p>[deleted]</p>\n</div>",1.692725e+09,,False,jxapx3y,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,15,False,440,False,15y7j15,datascience,t5_2sptq
2,TrollandDie,A million IT Security engineers suddenly and c...,"<div class=""md""><p>A million IT Security engin...",1.692722e+09,,False,jxahp83,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,4,False,277,False,15y7j15,datascience,t5_2sptq
3,FishFar4370,It looks cool as hell in how it works. \n\nI ...,"<div class=""md""><p>It looks cool as hell in ho...",1.692727e+09,,1692727530.0,jxaus3y,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,5,False,42,False,15y7j15,datascience,t5_2sptq
4,SearchAtlantis,Except you're still screwed because of Excel's...,"<div class=""md""><p>Except you&#39;re still scr...",1.692725e+09,,False,jxapdyf,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,3,False,49,False,15y7j15,datascience,t5_2sptq
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,dbitterlich,Very true if you already have an existing exce...,"<div class=""md""><p>Very true if you already ha...",1.692808e+09,,False,jxfiomo,False,t3_15y7j15,t1_jxeknhw,/r/datascience/comments/15y7j15/microsoft_is_b...,0,False,1,False,15y7j15,datascience,t5_2sptq
110,tothepointe,You'll be fine as long as you know how to fix ...,"<div class=""md""><p>You&#39;ll be fine as long ...",1.693190e+09,,False,jy1h6h9,False,t3_15y7j15,t1_jy0nil0,/r/datascience/comments/15y7j15/microsoft_is_b...,0,False,1,False,15y7j15,datascience,t5_2sptq
111,quintios,I answered your specific question:\n\n> I mean...,"<div class=""md""><p>I answered your specific qu...",1.692905e+09,,False,jxlfkqt,False,t3_15y7j15,t1_jxjxm81,/r/datascience/comments/15y7j15/microsoft_is_b...,1,False,1,False,15y7j15,datascience,t5_2sptq
112,SemolinaPilchard1,??? I'm just replying the way you did.\n\nYou'...,"<div class=""md""><p>??? I&#39;m just replying t...",1.692921e+09,,False,jxmjr68,False,t3_15y7j15,t1_jxlfkqt,/r/datascience/comments/15y7j15/microsoft_is_b...,1,False,2,False,15y7j15,datascience,t5_2sptq
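In the table above, `parent_id` starts with `t3_` when a comment replies directly to the submission and with `t1_` when it replies to another comment. That is enough to recover how deeply each comment is nested. The sketch below uses made-up IDs, and `comment_depths` is a hypothetical helper. It treats a comment whose parent is missing from the data (for example, one that was skipped) as top-level, which is an assumption.

```python
def comment_depths(rows):
    """Map each comment id to its nesting depth (0 = direct reply to the post)."""
    parent_of = {row["id"]: row["parent_id"] for row in rows}
    depths = {}

    def depth(comment_id):
        if comment_id in depths:
            return depths[comment_id]
        kind, _, parent_id = parent_of[comment_id].partition("_")
        # "t3" parents are submissions; "t1" parents are other comments.
        if kind == "t3" or parent_id not in parent_of:
            depths[comment_id] = 0
        else:
            depths[comment_id] = depth(parent_id) + 1
        return depths[comment_id]

    for row in rows:
        depth(row["id"])
    return depths

rows = [
    {"id": "a", "parent_id": "t3_post"},  # reply to the submission
    {"id": "b", "parent_id": "t1_a"},     # reply to comment a
    {"id": "c", "parent_id": "t1_b"},     # reply to comment b
]
print(comment_depths(rows))  # {'a': 0, 'b': 1, 'c': 2}
```

With the real data, `submission_comments_df.to_dict("records")` would supply the rows.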


## User activity (submissions, comments)

In [None]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [None]:
user_submissions_data = []

# Get user submissions
for submission in user.submissions.new(limit=5):
    user_submissions_data.append(submission_to_dict(submission))

# Convert the list of dictionaries to a DataFrame
user_submissions_df = pd.DataFrame(user_submissions_data)

user_submissions_df


Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,MrBurritoQuest,,False,5,1697672000.0,,False,17b4lgf,False,False,...,/r/ABoringDystopia/comments/17b4lgf/bots_reply...,False,82,,False,False,ABoringDystopia,Bots replying to Bots replying to Bots… (swipe),0.91,https://www.reddit.com/gallery/17b4lgf
1,MrBurritoQuest,,False,12,1697671000.0,,False,17b4g36,False,False,...,/r/mildlyinfuriating/comments/17b4g36/bots_rep...,False,45,,False,False,mildlyinfuriating,Bots replying to bots replying to bots…(swipe),0.96,https://www.reddit.com/gallery/17b4g36
2,MrBurritoQuest,​,False,7,1678804000.0,,1678805012.0,11r7fdi,False,True,...,/r/personalfinance/comments/11r7fdi/bought_a_t...,False,0,"Perhaps “scammed” is too harsh, maybe stupid i...",False,False,personalfinance,Bought a T-Bill and feel scammed,0.37,https://www.reddit.com/r/personalfinance/comme...
3,MrBurritoQuest,,False,5,1678577000.0,,False,11oyc5u,False,True,...,/r/tipofmytongue/comments/11oyc5u/tomtsongaltr...,False,1,Alt-rock song (with a bit of funk influence). ...,False,False,tipofmytongue,[TOMT][SONG][ALT-ROCK][FUNK][LYRICS],1.0,https://www.reddit.com/r/tipofmytongue/comment...
4,MrBurritoQuest,,False,63,1677446000.0,,False,11cszj2,False,False,...,/r/steak/comments/11cszj2/what_is_this_hole_in...,False,192,,False,False,steak,What is this hole in my steak? Is it safe to e...,0.9,https://i.redd.it/hgylt3ww4nka1.jpg


In [None]:
user_comments_data = []

# Get user comments
for comment in user.comments.new(limit=5):
    user_comments_data.append(comment_to_dict(comment))

# Convert the list of dictionaries to a DataFrame
user_comments_df = pd.DataFrame(user_comments_data)
user_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,MrBurritoQuest,"But that’s not really training perfect pitch, ...","<div class=""md""><p>But that’s not really train...",1724451000.0,,False,ljm86sq,False,t3_1eziqkg,t1_ljm44xy,/r/nextfuckinglevel/comments/1eziqkg/his_perfe...,0,False,3,False,1eziqkg,nextfuckinglevel,t5_m0bnr
1,MrBurritoQuest,Perfect pitch doesn’t take much practice from ...,"<div class=""md""><p>Perfect pitch doesn’t take ...",1724449000.0,,1724455274.0,ljm1jw7,False,t3_1eziqkg,t1_ljlulvh,/r/nextfuckinglevel/comments/1eziqkg/his_perfe...,0,False,-16,False,1eziqkg,nextfuckinglevel,t5_m0bnr
2,MrBurritoQuest,It absolutely would be unethical *if* there wa...,"<div class=""md""><p>It absolutely would be unet...",1723765000.0,,False,libmjx1,False,t3_1epatmr,t1_lhmnedz,/r/overemployed/comments/1epatmr/shes_a_legend...,0,False,1,False,1epatmr,overemployed,t5_4forqa
3,MrBurritoQuest,This guy is definitely an absolute moron and p...,"<div class=""md""><p>This guy is definitely an a...",1723136000.0,,False,lh4ut7y,False,t3_1en5lky,t1_lh3rzm3,/r/facepalm/comments/1en5lky/just_straightup_r...,0,False,5,False,1en5lky,facepalm,t5_2r5rp
4,MrBurritoQuest,"Apologies I’m dumb, is this a metaphor for som...","<div class=""md""><p>Apologies I’m dumb, is this...",1721419000.0,,False,ldza9tj,False,t3_1e782op,t1_ldyccpg,/r/RedHotChiliPeppers/comments/1e782op/whats_t...,0,False,7,False,1e782op,RedHotChiliPeppers,t5_2s504


# Automate scraping a subreddit activity

### Create new classes, SubredditDataFetcher and UserDataFetcher.
We will use these classes in an asynchronous setting (https://www.mend.io/blog/asynchronous-programming-in-python-understanding-the-essentials/) to save time.
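To illustrate why asynchronous code saves time, here is a self-contained sketch in which `asyncio.sleep` stands in for a network request. The names `fake_fetch` and `fetch_all` are hypothetical. The semaphore caps how many requests are in flight at once, the same idea as `max_concurrency` in `UserDataFetcher`.

```python
import asyncio

async def fake_fetch(name, sem, delay=0.05):
    """Pretend to request one user's data; the sleep simulates network latency."""
    async with sem:
        await asyncio.sleep(delay)
        return {"name": name}

async def fetch_all(names, max_concurrency=3):
    """Fetch all names concurrently, at most `max_concurrency` at a time."""
    sem = asyncio.Semaphore(max_concurrency)
    # gather returns results in the same order as the input names
    return await asyncio.gather(*(fake_fetch(n, sem) for n in names))

results = asyncio.run(fetch_all(["user_a", "user_b", "user_c"]))
print(results)
```

Inside a notebook, where an event loop is already running, `await fetch_all([...])` replaces the `asyncio.run(...)` call.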

In [None]:
!pip install asyncpraw

Collecting asyncpraw
  Downloading asyncpraw-7.8.1-py3-none-any.whl.metadata (9.0 kB)
Collecting aiosqlite<=0.17.0 (from asyncpraw)
  Downloading aiosqlite-0.17.0-py3-none-any.whl.metadata (4.1 kB)
Collecting asyncprawcore<3,>=2.4 (from asyncpraw)
  Downloading asyncprawcore-2.4.0-py3-none-any.whl.metadata (5.5 kB)
Downloading asyncpraw-7.8.1-py3-none-any.whl (196 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.4/196.4 kB 2.5 MB/s eta 0:00:00
Downloading aiosqlite-0.17.0-py3-none-any.whl (15 kB)
Downloading asyncprawcore-2.4.0-py3-none-any.whl (19 kB)
Installing collected packages: aiosqlite, asyncprawcore, asyncpraw
Successfully installed aiosqlite-0.17.0 asyncpraw-7.8.1 asyncprawcore-2.4.0


In [None]:
import os

os.environ["REDDIT_CLIENT_ID"] = "YOUR API ID"
os.environ["REDDIT_CLIENT_SECRET"] = "YOUR SECRET KEY"
os.environ["OUT_DIR"] = "/content/drive/MyDrive/Tutorial_2" # or any directory

subreddit_name = "datascience"
limit = 10
user_info_limit = 5

### Let's retrieve:
(1) New submission and commenting activities from r/datascience  
(2) For the authors of submissions and comments in r/datascience: user data, user's recent submissions and comments.

In [None]:
import asyncio, logging, math, os, random, time
import nest_asyncio

from typing import AsyncIterator, Iterable
import pandas as pd
import asyncpraw
from asyncprawcore.exceptions import TooManyRequests, ResponseException

logging.getLogger('praw').setLevel(logging.ERROR)

RETRY_BASE = 2.0
RETRY_MAX = 64.0

MAX_ATTEMPTS = 6

async def _backoff_sleep(attempt: int, retry_after: float | None = None):
    # asyncio.sleep (not time.sleep) so waiting does not block the event loop
    if retry_after is not None and retry_after > 0:
        await asyncio.sleep(retry_after + random.random())
        return
    delay = min(RETRY_MAX, (RETRY_BASE ** attempt)) + random.random()
    await asyncio.sleep(delay)

async def safe_iter(async_iter, where: str):
    attempt = 0
    while True:
        try:
            async for item in async_iter:
                yield item
            return
        except TooManyRequests as e:
            attempt += 1
            ra = getattr(e, "retry_after", None)
            print(f"[429] {where}: attempt {attempt}, retry_after={ra}")
            # The value may arrive as a string, so convert it before comparing
            await _backoff_sleep(attempt, float(ra) if ra is not None else None)
        except ResponseException as e:
            attempt += 1
            print(f"[ResponseException] {where}: attempt {attempt} ({e})")
            # Retrying cannot fix e.g. invalid credentials (401), so give up eventually
            if attempt >= MAX_ATTEMPTS:
                raise
            await _backoff_sleep(attempt)
        except Exception as e:
            attempt += 1
            print(f"[Error] {where}: attempt {attempt} ({e})")
            if attempt >= MAX_ATTEMPTS:
                raise
            await _backoff_sleep(attempt)

def append_csv(path: str, rows: list[dict], cols: list[str], header_written: set):
    if not rows:
        return
    os.makedirs(os.path.dirname(path), exist_ok=True)
    norm = [{k: r.get(k, pd.NA) for k in cols} for r in rows]
    df = pd.DataFrame(norm, columns=cols)
    write_header = path not in header_written
    df.to_csv(path, mode="a", header=write_header, index=False)
    header_written.add(path)


class SubredditDataFetcher:
    def __init__(self, reddit, name: str, limit: int = 10):
        self.reddit = reddit
        self.name = name
        self.limit = limit

    async def submissions(self) -> AsyncIterator[dict]:
        sub = await self.reddit.subreddit(self.name)
        async for s in safe_iter(sub.new(limit=self.limit), f"{self.name}.new"):
            if not s.author:
                continue
            yield {
                "id": s.id,
                "author": str(s.author),
                "created_utc": s.created_utc,
                "num_comments": s.num_comments,
                "permalink": s.permalink,
                "subreddit": str(s.subreddit),
                "title": s.title,
                "selftext": s.selftext,
                "score": s.score,
                "upvote_ratio": s.upvote_ratio,
                "url": s.url,
                "over_18": s.over_18,
                "is_self": s.is_self,
                "is_original_content": s.is_original_content,
            }

    async def comments_for_submission(self, sid: str) -> AsyncIterator[dict]:
        s = await self.reddit.submission(id=sid)
        await s.comments.replace_more(limit=None)
        for c in s.comments.list():
            if not c.author:
                continue
            yield {
                "id": c.id,
                'parent_id': c.parent_id,
                "submission_id": sid,
                "author": str(c.author),
                "created_utc": c.created_utc,
                "body": c.body,
                "score": c.score,
                "permalink": c.permalink,
                "is_submitter": c.is_submitter,
                'subreddit': str(c.subreddit),
            }


class UserDataFetcher:
    def __init__(self, reddit, max_concurrency: int = 3):
        self.reddit = reddit
        self.sem = asyncio.Semaphore(max_concurrency)

    async def user_core(self, name: str) -> dict | None:
        async with self.sem:
            try:
                u = await self.reddit.redditor(name)
                await u.load()
                sub = getattr(u, "subreddit", None)
                return {
                    "name": getattr(u, "name", "NA"),
                    "comment_karma": getattr(u, "comment_karma", math.nan),
                    "link_karma": getattr(u, "link_karma", math.nan),
                    "created_utc": getattr(u, "created_utc", math.nan),
                    "is_mod": getattr(u, "is_mod", False),
                    "is_gold": getattr(u, "is_gold", False),
                    "has_verified_email": getattr(u, "has_verified_email", None),
                    "profile_subreddit": getattr(sub, "name", "NA") if sub else "NA",
                    "subscribers": getattr(sub, "subscribers", math.nan) if sub else math.nan,
                }
            except Exception as e:
                print(f"[user_core] {name}: {e}")
                return None

    async def user_submissions(self, name: str, limit: int = 5) -> AsyncIterator[dict]:
        async with self.sem:
            u = await self.reddit.redditor(name)  # must be awaited in Async PRAW
            await u.load()
            async for s in safe_iter(u.submissions.new(limit=limit), f"{name}.subs"):
                yield {
                    "user": name,
                    "id": s.id,
                    "subreddit": str(s.subreddit),
                    "created_utc": s.created_utc,
                    "title": s.title,
                    "score": s.score,
                    "num_comments": s.num_comments,
                    "permalink": s.permalink,
                }

    async def user_comments(self, name: str, limit: int = 5) -> AsyncIterator[dict]:
        async with self.sem:
            u = await self.reddit.redditor(name)  # must be awaited in Async PRAW
            await u.load()
            async for c in safe_iter(u.comments.new(limit=limit), f"{name}.comms"):
                yield {
                    "user": name,
                    "id": c.id,
                    "subreddit": str(c.subreddit),
                    "created_utc": c.created_utc,
                    "body": c.body,
                    "score": c.score,
                    "permalink": c.permalink,
                    "link_id": c.link_id,
                }
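
The retry helper in the cell above waits `min(RETRY_MAX, RETRY_BASE ** attempt)` seconds plus up to one second of random jitter. The sketch below prints that schedule without the jitter so the exponential growth and the cap are easy to see. `backoff_delay` is a hypothetical name used only for this illustration.

```python
RETRY_BASE = 2.0
RETRY_MAX = 64.0

def backoff_delay(attempt, jitter=0.0):
    """Seconds to wait before retry number `attempt`, capped at RETRY_MAX."""
    return min(RETRY_MAX, RETRY_BASE ** attempt) + jitter

schedule = [backoff_delay(attempt) for attempt in range(1, 8)]
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 64.0]
```

The jitter spreads out retries so that several tasks hitting the rate limit at once do not all retry at the same moment.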

In [None]:
async def main():
    reddit = asyncpraw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="Scraper"
    )

    outdir = os.environ["OUT_DIR"]
    header_written = set()

    sub_name = subreddit_name
    subfetch = SubredditDataFetcher(reddit, sub_name, limit=limit)

    SUBMISSIONS_COLS = [
        "id","author","created_utc","num_comments","permalink","subreddit",
        "title","selftext","score","upvote_ratio","url","over_18","is_self","is_original_content"
    ]

    SUBMISSION_COMMENTS_COLS = [
        "id","parent_id", "submission_id","author","created_utc","body","score","permalink","is_submitter","subreddit"
    ]

    USER_CORE_COLS = [
        "name","comment_karma","link_karma","created_utc","is_mod","is_gold",
        "has_verified_email","profile_subreddit","subscribers"
    ]

    USER_SUBMISSIONS_COLS = [
        "user","id","subreddit","created_utc","title","score","num_comments","permalink"
    ]

    USER_COMMENTS_COLS = [
        "user","id","subreddit","created_utc","body","score","permalink","link_id"
    ]

    # Stream submissions -> CSV, collect IDs and authors
    subs_ids, authors = [], set()
    buffer = []
    async for row in subfetch.submissions():
        buffer.append(row)
        subs_ids.append(row["id"])
        authors.add(row["author"])
    append_csv(f"{outdir}/{sub_name}-new-submission.csv", buffer, SUBMISSIONS_COLS, header_written)

    # Stream all comments of those submissions
    buffer = []
    for sid in subs_ids:
        async for c in subfetch.comments_for_submission(sid):
            buffer.append(c)
            authors.add(c["author"])
            if len(buffer) >= 1000:
                append_csv(f"{outdir}/{sub_name}-new-comment.csv", buffer, SUBMISSION_COMMENTS_COLS, header_written)
                buffer.clear()
    append_csv(f"{outdir}/{sub_name}-new-comment.csv", buffer, SUBMISSION_COMMENTS_COLS, header_written)

    # # Users (dedup & drop NA/None)
    # authors = {a for a in authors if a and a != "None" and a != "NA"}

    # ufetch = UserDataFetcher(reddit, max_concurrency=3)

    # # Core user data (bounded concurrency)
    # user_rows = []
    # for name in authors:
    #     r = await ufetch.user_core(name)
    #     if r:
    #         user_rows.append(r)
    #     if len(user_rows) >= 500:
    #         append_csv(f"{outdir}/{sub_name}-user-data.csv", user_rows, USER_CORE_COLS, header_written)
    #         user_rows.clear()
    # append_csv(f"{outdir}/{sub_name}-user-data.csv", user_rows, USER_CORE_COLS, header_written)

    # # Recent user submissions/comments (streamed)
    # sub_rows = []
    # for name in authors:
    #     async for r in ufetch.user_submissions(name, limit=user_info_limit):
    #         sub_rows.append(r)
    #         if len(sub_rows) >= 1000:
    #             append_csv(f"{outdir}/{sub_name}-user-submission.csv", sub_rows, USER_SUBMISSIONS_COLS, header_written)
    #             sub_rows.clear()
    # append_csv(f"{outdir}/{sub_name}-user-submission.csv", sub_rows, USER_SUBMISSIONS_COLS, header_written)

    # comm_rows = []
    # for name in authors:
    #     async for r in ufetch.user_comments(name, limit=user_info_limit):
    #         comm_rows.append(r)
    #         if len(comm_rows) >= 1000:
    #             append_csv(f"{outdir}/{sub_name}-user-comment.csv", comm_rows, USER_COMMENTS_COLS, header_written)
    #             comm_rows.clear()
    # append_csv(f"{outdir}/{sub_name}-user-comment.csv", comm_rows, USER_COMMENTS_COLS, header_written)

    await reddit.close()


nest_asyncio.apply()
await main()

ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7a97213d5b80>
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7a97227f1b80>
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7a9722802690>
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7a9722800b90>


[ResponseException] datascience.new: attempt 1 (received 401 HTTP response)


CancelledError: 

### Let's also retrieve:
(3) submissions under r/datascience, r/MachineLearning, r/datasets, r/dataisbeautiful, r/learnpython

In [None]:
subreddit_name_list = ["datascience", "MachineLearning", "datasets", "dataisbeautiful", "learnpython"]
limit = 1000

In [None]:
import os
import pandas as pd
import asyncpraw
import nest_asyncio, asyncio

SUBMISSIONS_COLS = [
    "id","author","created_utc","num_comments","permalink","subreddit",
    "title","selftext","score","upvote_ratio","url","over_18","is_self","is_original_content"
]

async def main():
    reddit = asyncpraw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="Scraper"
    )
    outdir = os.environ["OUT_DIR"]
    os.makedirs(outdir, exist_ok=True)


    for subreddit_name in subreddit_name_list:
        fetcher = SubredditDataFetcher(reddit, subreddit_name, limit=limit)
        rows = [row async for row in fetcher.submissions()]
        df = pd.DataFrame(rows).reindex(columns=SUBMISSIONS_COLS)
        path = f"{outdir}/{subreddit_name}-new-submission-for-subreddit-network.csv"
        df.to_csv(path, index=False)
        print(f"Wrote {len(df)} rows -> {path}")

    await reddit.close()

nest_asyncio.apply()
await main()

Wrote 858 rows -> /content/drive/MyDrive/Tutorial_2/datascience-new-submission-for-subreddit-network.csv
Wrote 807 rows -> /content/drive/MyDrive/Tutorial_2/MachineLearning-new-submission-for-subreddit-network.csv
Wrote 896 rows -> /content/drive/MyDrive/Tutorial_2/datasets-new-submission-for-subreddit-network.csv
Wrote 900 rows -> /content/drive/MyDrive/Tutorial_2/dataisbeautiful-new-submission-for-subreddit-network.csv
Wrote 918 rows -> /content/drive/MyDrive/Tutorial_2/learnpython-new-submission-for-subreddit-network.csv
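
The file names suggest these CSVs feed a subreddit network. The tutorial does not say how that network is built, so the following is only one possible sketch: it assumes two subreddits are linked when the same author posted in both. `shared_author_edges` and the demo usernames are hypothetical.

```python
from itertools import combinations

def shared_author_edges(authors_by_subreddit):
    """Return {(sub_a, sub_b): number of authors who posted in both}."""
    edges = {}
    for a, b in combinations(sorted(authors_by_subreddit), 2):
        shared = set(authors_by_subreddit[a]) & set(authors_by_subreddit[b])
        if shared:
            edges[(a, b)] = len(shared)
    return edges

demo = {
    "datascience": ["u1", "u2", "u3"],
    "learnpython": ["u2", "u3", "u4"],
    "datasets": ["u5"],
}
print(shared_author_edges(demo))  # {('datascience', 'learnpython'): 2}
```

With the real files, the `author` column of each CSV would supply the author lists.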


# On your own
1.1. Go to https://www.reddit.com/, and select a subreddit of your group's interest (the more active, the better-looking your network visualization will be, although the code will have a longer runtime). Get 10 new submissions, and all comments below them.  
1.2. Get the user data, user submissions (5 for each user), and user comments (5 for each user) as well.  
2.1. Select 5 subreddits that are somewhat related to one another. Get 100 new submissions from each.