In this tutorial, we will scrape Reddit submissions (posts), comments, and user data using PRAW (the Python Reddit API Wrapper), which simplifies communicating with the Reddit API from Python.

- PRAW documentation: https://praw.readthedocs.io/en/stable/index.html
- Reddit API documentation: https://www.reddit.com/dev/api/

Before running this code, you need credentials for the Reddit API. If you do not have an account yet, sign up for Reddit (https://www.reddit.com/), then follow the instructions in the link below.

- Getting Reddit API credentials: https://josephlai241.github.io/URS/credentials.html

Save your API ID (client ID) and password (client secret) in a secure place.
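
One way to keep the credentials out of the notebook itself is to read them from environment variables. The sketch below is a minimal example under assumptions: the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are hypothetical choices for illustration, not part of PRAW.

```python
import os


def load_reddit_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names below are hypothetical; use whatever names you
    chose when storing your own credentials.
    """
    env = os.environ if env is None else env
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Fail early with a clear message instead of sending empty credentials.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
```

The returned pair can then be assigned to `client_id` and `client_secret` before creating the `praw.Reddit` object.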

## Install Dependencies and Import Modules

In [1]:
!pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl.metadata (9.8 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.7.1-py3-none-any.whl (191 kB)
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.4.0 update-checker-0.18.0


In [2]:
# optional: mount google drive if you want to save files there

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import PRAW and authenticate with the Reddit API.

In [3]:
import praw

# client_id = (your api id)
# client_secret = (your api password)

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="Scraper",
)

Ignore unimportant warnings

In [5]:
import logging

# Suppress warnings from PRAW
logging.getLogger('praw').setLevel(logging.ERROR)

# Explore objects

# Explore what we can get

## 1. We can scrape submissions within a subreddit.
- You must specify a listing category: Hot, New, Controversial, Top, Rising, or Search.

In [6]:
subreddit = reddit.subreddit("datascience")

In [7]:
?subreddit

In [8]:
subreddit.top(limit=5)

<praw.models.listing.generator.ListingGenerator at 0x7a87404738b0>

In [9]:
submission_example = next(subreddit.top(limit=5))
submission_example

Submission(id='k8nyf8')

In [10]:
?submission_example

In [11]:
# Get the 5 top posts
for submission in subreddit.top(limit=5):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: k8nyf8
Title: data siens
Author: None
Text: 
URL: https://dslntlv9vhjr4.cloudfront.net/posts_images/EcY6g2neQEaIi.png
----------
ID: oeg6nl
Title: The pain and excitement
Author: Kent-Clark-
Text: 
URL: https://i.redd.it/yqnunwryjg971.jpg
----------
ID: hohvgq
Title: Shout Out to All the Mediocre Data Scientists Out There
Author: MrBurritoQuest
Text: I've been lurking on this sub for a while now and all too often I see posts from people claiming they feel inadequate and then they go on to describe their stupid impressive background and experience. That's great and all but I'd like to move the spotlight to the rest of us for just a minute. Cheers to my fellow mediocre data scientists who don't work at FAANG companies, aren't pursing a PhD, don't publish papers, haven't won Kaggle competitions, and don't spend every waking hour improving their portfolio.  Even though we're nothing special, we still deserve some appreciation every once in a while.

/rant I'll hand it back over to the 

In [12]:
?subreddit.search

In [13]:
# The other categories work the same way as 'top'; only 'search' differs.
# To see how the 'search' method works, let's search r/datascience submissions from the past day for the keyword 'python'.

for submission in subreddit.search("python", time_filter="day"):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: 1fer01g
Title: Wrist pain using Jupiter notebook. How is your workflow ?
Author: Waste_Necessary654
Text:  Hi am data scientist. I use notebook with vscode and vim motions. 

The problems is that when I am doing some experiment. It's hard to see the output and I need to move my hand from my keyboard to my mouse. 

And I have tendinitis now. how do you do it to avoid using mouse ? Or how your workflow ?
URL: https://www.reddit.com/r/datascience/comments/1fer01g/wrist_pain_using_jupiter_notebook_how_is_your/
----------
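
Because the non-search listings are called the same way, the category can be chosen by name at run time. The following is a minimal sketch, not part of PRAW: `get_listing`, `LISTING_NAMES`, and `FakeSubreddit` are hypothetical names, and the fake class only stands in for a real subreddit object so the example runs without credentials.

```python
# Listing names used in this tutorial, excluding 'search' (which needs a query).
LISTING_NAMES = ("hot", "new", "controversial", "top", "rising")


def get_listing(subreddit, category, limit=5):
    """Call the listing method named `category` on a subreddit-like object."""
    name = category.lower()
    if name not in LISTING_NAMES:
        raise ValueError(f"Unknown category {category!r}; expected one of {LISTING_NAMES}")
    return getattr(subreddit, name)(limit=limit)


class FakeSubreddit:
    """Stand-in for a PRAW subreddit so the sketch runs offline."""

    def top(self, limit=None):
        return [f"top-{i}" for i in range(limit)]

    def new(self, limit=None):
        return [f"new-{i}" for i in range(limit)]


print(get_listing(FakeSubreddit(), "Top", limit=2))
```

With a real `subreddit` object, `get_listing(subreddit, "top", limit=5)` would be expected to behave like `subreddit.top(limit=5)`.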


## 2. You can scrape a specific user's data as well.

In [14]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [15]:
?user

In [16]:
print(f"Username: {user.name}")
print(f"Karma: {user.link_karma + user.comment_karma}")
print(f"The account was created at: {user.created_utc}") # What is this strange number?

Username: MrBurritoQuest
Karma: 11304
The account was created at: 1508214093.0


In [17]:
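from datetime import datetime, timezone
# created_utc is a Unix timestamp (seconds since 1970-01-01 UTC); convert it to a readable date
print(datetime.fromtimestamp(user.created_utc, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S'))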

2017-10-17 04:21:33


## 3. You can scrape every comment under a specific submission.

In [18]:
# Use the submission ID or URL
submission_id = '15y7j15'
submission = reddit.submission(id=submission_id)

In [19]:
# This does exactly the same thing
submission_url = 'https://www.reddit.com/r/datascience/comments/15y7j15/microsoft_is_bringing_python_to_excel/'
submission = reddit.submission(url=submission_url)

In [20]:
submission.comments.replace_more(limit=None)
len(submission.comments.list())

114

In [21]:
comment = submission.comments.list()[0]

In [22]:
print(f"Author: {comment.author}")
print(f"Body: {comment.body}")

Author: Exact-Bird-4203
Body: Feel like this has been hyped forever. Excited to actually use it


## 4. We can also scrape a specific user's Reddit activities (submissions and comments) in general.

In [23]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [24]:
?user

In [25]:
for submission in user.submissions.new(limit=5):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: 17b4lgf
Title: Bots replying to Bots replying to Bots… (swipe)
Author: MrBurritoQuest
Text: 
URL: https://www.reddit.com/gallery/17b4lgf
----------
ID: 17b4g36
Title: Bots replying to bots replying to bots…(swipe)
Author: MrBurritoQuest
Text: 
URL: https://www.reddit.com/gallery/17b4g36
----------
ID: 11r7fdi
Title: Bought a T-Bill and feel scammed
Author: MrBurritoQuest
Text: Perhaps “scammed” is too harsh, maybe stupid is more appropriate as I clearly don’t understand how these work. My wife and I needed to stash some cash for 6 months and decided to go with a T Bill instead instead of a HYSA (first time buying one). 

We purchased $15k of the March 16th 6 Month T Bill through Fidelity. On our statement it shows a Yield to Maturity of ~4.89% so I was expecting our purchase amount to be ~$14,266 but instead our purchase amount came out to be $14,643. If my math is correct, this is a measly 2.38% which is lower than our current HYSA. 

Where did I go wrong? What are the tax implica

In [26]:
for comment in user.comments.new(limit=5):
    print(f"ID: {comment.id}")
    print(f"Body: {comment.body}")
    print('----------')

ID: ljm86sq
Body: But that’s not really training perfect pitch, that’s more so basic training on the instrument of choice. Yes, of course people with perfect pitch would have to learn where the corresponding 12 notes (and their octaves) are on the piano, but that would take no more than an hour or two.  Not saying it isn’t cool, I’m just saying it’s more in the realm of ‘neat party tricks’ instead of ‘pure talent resulting from thousands of hours of dedication and practice’.
----------
ID: ljm1jw7
Body: Perfect pitch doesn’t take much practice from what I understand, it’s just something those people can do without much thought or effort. I think it’d be analogous to me telling you what color your shirt is.

Edit: idk why y’all are so butthurt about this. You can find videos of [toddlers doing this](https://youtube.com/shorts/1y8arkuosZA?si=0Tljz4JUd_K4DuCt). I’m not saying it isn’t cool, I’m just saying for people with perfect pitch, this kind of thing is closer to a neat party trick

# Define functions
- submission_to_dict(submission) returns most of the available attributes of a submission object as a dictionary.
- user_to_dict(user) returns most of the available attributes of a user object as a dictionary.
- comment_to_dict(comment) returns most of the available attributes of a comment object as a dictionary.

In [27]:
def submission_to_dict(submission): # submission = reddit.submission(id=...) or an item from a subreddit listing
    data = {
        'author': str(submission.author),
        'author_flair_text': submission.author_flair_text,
        'clicked': submission.clicked,
        'comments': len(submission.comments),
        'created_utc': submission.created_utc,
        'distinguished': submission.distinguished,
        'edited': submission.edited,
        'id': submission.id,
        'is_original_content': submission.is_original_content,
        'is_self': submission.is_self,
        'locked': submission.locked,
        'name': submission.name,
        'num_comments': submission.num_comments,
        'over_18': submission.over_18,
        'permalink': submission.permalink,
        'saved': submission.saved,
        'score': submission.score,
        'selftext': submission.selftext,
        'spoiler': submission.spoiler,
        'stickied': submission.stickied,
        'subreddit': str(submission.subreddit),
        'title': submission.title,
        'upvote_ratio': submission.upvote_ratio,
        'url': submission.url
    }

    return data

def user_to_dict(user): # user = reddit.redditor(username)
    data = {
        'comment_karma': user.comment_karma,
        'created_utc': user.created_utc,
        'has_verified_email': user.has_verified_email,
        'icon_img': user.icon_img,
        'id': user.id,
        'is_employee': user.is_employee,
        'is_friend': user.is_friend,
        'is_mod': user.is_mod,
        'is_gold': user.is_gold,
        'link_karma': user.link_karma,
        'name': user.name,
        'subreddit_banner_img': user.subreddit.banner_img if user.subreddit else 'NA',
        'subreddit_name': user.subreddit.name if user.subreddit else 'NA',
        'subreddit_over_18': user.subreddit.over_18 if user.subreddit else 'NA',
        'subreddit_public_description': user.subreddit.public_description if user.subreddit else 'NA',
        'subreddit_subscribers': user.subreddit.subscribers if user.subreddit else 'NA',
        'subreddit_title': user.subreddit.title if user.subreddit else 'NA',
    }

    return data

def comment_to_dict(comment): # comment = submission.comments.list()[index]
    data = {
        'author': str(comment.author),
        'body': comment.body,
        'body_html': comment.body_html,
        'created_utc': comment.created_utc,
        'distinguished': comment.distinguished,
        'edited': comment.edited,
        'id': comment.id,
        'is_submitter': comment.is_submitter,
        'link_id': comment.link_id,
        'parent_id': comment.parent_id,
        'permalink': comment.permalink,
        'replies': len(comment.replies),
        'saved': comment.saved,
        'score': comment.score,
        'stickied': comment.stickied,
        'submission': str(comment.submission),
        'subreddit': str(comment.subreddit),
        'subreddit_id': comment.subreddit_id
    }

    return data

# Scrape and return dataframes
- Because the functions above return dictionaries, we can loop over many objects, collect one dictionary per object, and store the results in a pandas DataFrame.

## Submissions

In [28]:
import pandas as pd

subreddit = reddit.subreddit("datascience")
submissions_data = []

# Get the top 5 posts
for submission in subreddit.top(limit=5):
    submissions_data.append(submission_to_dict(submission))

# Convert the list of dictionaries to a DataFrame
submissions_df = pd.DataFrame(submissions_data)

submissions_df

Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,,,False,26,1607371000.0,,False,k8nyf8,False,False,...,/r/datascience/comments/k8nyf8/data_siens/,False,4111,,False,False,datascience,data siens,0.97,https://dslntlv9vhjr4.cloudfront.net/posts_ima...
1,Kent-Clark-,,False,41,1625519000.0,,False,oeg6nl,False,False,...,/r/datascience/comments/oeg6nl/the_pain_and_ex...,False,3920,,False,False,datascience,The pain and excitement,0.97,https://i.redd.it/yqnunwryjg971.jpg
2,MrBurritoQuest,,False,97,1594353000.0,,False,hohvgq,False,True,...,/r/datascience/comments/hohvgq/shout_out_to_al...,False,3631,I've been lurking on this sub for a while now ...,False,False,datascience,Shout Out to All the Mediocre Data Scientists ...,0.99,https://www.reddit.com/r/datascience/comments/...
3,CompetitivePlastic67,,False,28,1663139000.0,,False,xdv6nz,False,False,...,/r/datascience/comments/xdv6nz/lets_keep_this_on/,False,3600,,False,False,datascience,Let's keep this on...,0.97,https://i.redd.it/k102dyo0yrn91.jpg
4,,,False,172,1647837000.0,,False,tj3kek,False,False,...,/r/datascience/comments/tj3kek/guys_weve_been_...,False,3464,,False,False,datascience,"Guys, we’ve been doing it wrong this whole time",0.96,https://i.imgur.com/TAex5zG.jpg
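
The `created_utc` column above holds Unix timestamps, which are hard to read at a glance. A minimal sketch of adding a readable UTC string to each dictionary before it goes into a DataFrame follows; the helper `add_created_iso` and the column name `created_iso` are hypothetical, and the sample row reuses the account-creation timestamp printed earlier in this tutorial.

```python
from datetime import datetime, timezone


def add_created_iso(rows):
    """Return copies of `rows` with a readable UTC timestamp added.

    `created_iso` is a hypothetical column name chosen for this sketch.
    """
    out = []
    for row in rows:
        stamp = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
        out.append({**row, "created_iso": stamp.strftime("%Y-%m-%d %H:%M:%S")})
    return out


# Sample row using the created_utc value printed for this user earlier.
rows = [{"name": "MrBurritoQuest", "created_utc": 1508214093.0}]
print(add_created_iso(rows))
```

The same helper could be applied to `submissions_data` or any other list of dictionaries that carries a `created_utc` key.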


## User information

In [29]:
username_list = ['MrBurritoQuest'] # Add usernames as needed
user_data = []

for username in username_list:
    user = reddit.redditor(username)
    user_data.append(user_to_dict(user))

# Convert the list of dictionaries to a DataFrame
user_df = pd.DataFrame(user_data)
user_df

Unnamed: 0,comment_karma,created_utc,has_verified_email,icon_img,id,is_employee,is_friend,is_mod,is_gold,link_karma,name,subreddit_banner_img,subreddit_name,subreddit_over_18,subreddit_public_description,subreddit_subscribers,subreddit_title
0,5270,1508214000.0,True,https://www.redditstatic.com/avatars/defaults/...,hl45shh,False,False,False,False,6034,MrBurritoQuest,,t5_5wl1i,False,,0,


## Comments under a submission

In [30]:
# Use the submission ID or URL
submission_id = '15y7j15'
submission = reddit.submission(id=submission_id)

# Replace "More Comments" objects with the comments they represent
submission.comments.replace_more(limit=None)

submission_comments_data = []

# Iterate through the comments
for comment in submission.comments.list():
    submission_comments_data.append(comment_to_dict(comment))

# Convert the list of dictionaries to a DataFrame
submission_comments_df = pd.DataFrame(submission_comments_data)
submission_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,Exact-Bird-4203,Feel like this has been hyped forever. Excited...,"<div class=""md""><p>Feel like this has been hyp...",1.692716e+09,,False,jxa2h1w,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,2,False,254,False,15y7j15,datascience,t5_2sptq
1,,[deleted],"<div class=""md""><p>[deleted]</p>\n</div>",1.692725e+09,,False,jxapx3y,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,15,False,442,False,15y7j15,datascience,t5_2sptq
2,TrollandDie,A million IT Security engineers suddenly and c...,"<div class=""md""><p>A million IT Security engin...",1.692722e+09,,False,jxahp83,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,4,False,279,False,15y7j15,datascience,t5_2sptq
3,FishFar4370,It looks cool as hell in how it works. \n\nI ...,"<div class=""md""><p>It looks cool as hell in ho...",1.692727e+09,,1692727530.0,jxaus3y,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,5,False,44,False,15y7j15,datascience,t5_2sptq
4,SearchAtlantis,Except you're still screwed because of Excel's...,"<div class=""md""><p>Except you&#39;re still scr...",1.692725e+09,,False,jxapdyf,False,t3_15y7j15,t3_15y7j15,/r/datascience/comments/15y7j15/microsoft_is_b...,3,False,46,False,15y7j15,datascience,t5_2sptq
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,dbitterlich,Very true if you already have an existing exce...,"<div class=""md""><p>Very true if you already ha...",1.692808e+09,,False,jxfiomo,False,t3_15y7j15,t1_jxeknhw,/r/datascience/comments/15y7j15/microsoft_is_b...,0,False,1,False,15y7j15,datascience,t5_2sptq
110,tothepointe,You'll be fine as long as you know how to fix ...,"<div class=""md""><p>You&#39;ll be fine as long ...",1.693190e+09,,False,jy1h6h9,False,t3_15y7j15,t1_jy0nil0,/r/datascience/comments/15y7j15/microsoft_is_b...,0,False,1,False,15y7j15,datascience,t5_2sptq
111,quintios,I answered your specific question:\n\n> I mean...,"<div class=""md""><p>I answered your specific qu...",1.692905e+09,,False,jxlfkqt,False,t3_15y7j15,t1_jxjxm81,/r/datascience/comments/15y7j15/microsoft_is_b...,1,False,1,False,15y7j15,datascience,t5_2sptq
112,SemolinaPilchard1,??? I'm just replying the way you did.\n\nYou'...,"<div class=""md""><p>??? I&#39;m just replying t...",1.692921e+09,,False,jxmjr68,False,t3_15y7j15,t1_jxlfkqt,/r/datascience/comments/15y7j15/microsoft_is_b...,1,False,2,False,15y7j15,datascience,t5_2sptq
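
In the table above, `parent_id` starts with `t3_` when a comment replies directly to the submission and with `t1_` when it replies to another comment; these are Reddit's type prefixes for submissions and comments. A minimal sketch of using the prefix to separate top-level comments from replies, with a hypothetical helper `split_by_level` and a few `id`/`parent_id` pairs copied from the output above:

```python
def split_by_level(rows):
    """Split comment dictionaries into top-level comments and replies."""
    # 't3_' marks a submission as the parent, 't1_' marks another comment.
    top_level = [r for r in rows if r["parent_id"].startswith("t3_")]
    replies = [r for r in rows if r["parent_id"].startswith("t1_")]
    return top_level, replies


# Values taken from the comment table shown above.
rows = [
    {"id": "jxa2h1w", "parent_id": "t3_15y7j15"},
    {"id": "jxahp83", "parent_id": "t3_15y7j15"},
    {"id": "jxlfkqt", "parent_id": "t1_jxjxm81"},
    {"id": "jxmjr68", "parent_id": "t1_jxlfkqt"},
]

top_level, replies = split_by_level(rows)
print(len(top_level), "top-level comments,", len(replies), "replies")
```

The same function should work on `submission_comments_data`, since each dictionary there has a `parent_id` key.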


## User activity (submissions, comments)

In [31]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [32]:
user_submissions_data = []

# Get user submissions
for submission in user.submissions.new(limit=5):
    user_submissions_data.append(submission_to_dict(submission))

# Convert the list of dictionaries to a DataFrame
user_submissions_df = pd.DataFrame(user_submissions_data)

user_submissions_df


Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,MrBurritoQuest,,False,5,1697672000.0,,False,17b4lgf,False,False,...,/r/ABoringDystopia/comments/17b4lgf/bots_reply...,False,83,,False,False,ABoringDystopia,Bots replying to Bots replying to Bots… (swipe),0.91,https://www.reddit.com/gallery/17b4lgf
1,MrBurritoQuest,,False,12,1697671000.0,,False,17b4g36,False,False,...,/r/mildlyinfuriating/comments/17b4g36/bots_rep...,False,46,,False,False,mildlyinfuriating,Bots replying to bots replying to bots…(swipe),0.95,https://www.reddit.com/gallery/17b4g36
2,MrBurritoQuest,​,False,7,1678804000.0,,1678805012.0,11r7fdi,False,True,...,/r/personalfinance/comments/11r7fdi/bought_a_t...,False,0,"Perhaps “scammed” is too harsh, maybe stupid i...",False,False,personalfinance,Bought a T-Bill and feel scammed,0.36,https://www.reddit.com/r/personalfinance/comme...
3,MrBurritoQuest,,False,5,1678577000.0,,False,11oyc5u,False,True,...,/r/tipofmytongue/comments/11oyc5u/tomtsongaltr...,False,1,Alt-rock song (with a bit of funk influence). ...,False,False,tipofmytongue,[TOMT][SONG][ALT-ROCK][FUNK][LYRICS],1.0,https://www.reddit.com/r/tipofmytongue/comment...
4,MrBurritoQuest,,False,63,1677446000.0,,False,11cszj2,False,False,...,/r/steak/comments/11cszj2/what_is_this_hole_in...,False,188,,False,False,steak,What is this hole in my steak? Is it safe to e...,0.9,https://i.redd.it/hgylt3ww4nka1.jpg


In [33]:
user_comments_data = []

# Get user comments
for comment in user.comments.new(limit=5):
    user_comments_data.append(comment_to_dict(comment))

# Convert the list of dictionaries to a DataFrame
user_comments_df = pd.DataFrame(user_comments_data)
user_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,MrBurritoQuest,"But that’s not really training perfect pitch, ...","<div class=""md""><p>But that’s not really train...",1724451000.0,,False,ljm86sq,False,t3_1eziqkg,t1_ljm44xy,/r/nextfuckinglevel/comments/1eziqkg/his_perfe...,0,False,3,False,1eziqkg,nextfuckinglevel,t5_m0bnr
1,MrBurritoQuest,Perfect pitch doesn’t take much practice from ...,"<div class=""md""><p>Perfect pitch doesn’t take ...",1724449000.0,,1724455274.0,ljm1jw7,False,t3_1eziqkg,t1_ljlulvh,/r/nextfuckinglevel/comments/1eziqkg/his_perfe...,0,False,-15,False,1eziqkg,nextfuckinglevel,t5_m0bnr
2,MrBurritoQuest,It absolutely would be unethical *if* there wa...,"<div class=""md""><p>It absolutely would be unet...",1723765000.0,,False,libmjx1,False,t3_1epatmr,t1_lhmnedz,/r/overemployed/comments/1epatmr/shes_a_legend...,0,False,1,False,1epatmr,overemployed,t5_4forqa
3,MrBurritoQuest,This guy is definitely an absolute moron and p...,"<div class=""md""><p>This guy is definitely an a...",1723136000.0,,False,lh4ut7y,False,t3_1en5lky,t1_lh3rzm3,/r/facepalm/comments/1en5lky/just_straightup_r...,0,False,3,False,1en5lky,facepalm,t5_2r5rp
4,MrBurritoQuest,"Apologies I’m dumb, is this a metaphor for som...","<div class=""md""><p>Apologies I’m dumb, is this...",1721419000.0,,False,ldza9tj,False,t3_1e782op,t1_ldyccpg,/r/RedHotChiliPeppers/comments/1e782op/whats_t...,0,False,5,False,1e782op,RedHotChiliPeppers,t5_2s504


# Automate scraping a subreddit activity

### Create new classes, SubredditDataFetcher and UserDataFetcher.
We will use these classes in an asynchronous setting (https://www.mend.io/blog/asynchronous-programming-in-python-understanding-the-essentials/) to save time.
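
To illustrate the general idea behind asynchronous code, here is a minimal sketch that uses only the standard library. It does not call Reddit: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, and `fetch_all` is a hypothetical name for the coroutine that runs the stand-ins concurrently.

```python
import asyncio


async def fake_fetch(name, delay):
    """Stand-in for a network request; sleeps instead of calling Reddit."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def fetch_all():
    # gather() schedules all coroutines and waits for them together, so the
    # total wait is roughly the longest delay rather than the sum of delays.
    return await asyncio.gather(
        fake_fetch("submissions", 0.2),
        fake_fetch("comments", 0.2),
        fake_fetch("users", 0.2),
    )


if __name__ == "__main__":
    # In a plain script; inside a notebook cell use `await fetch_all()` instead.
    print(asyncio.run(fetch_all()))
```

While one request is waiting, the event loop can make progress on the others, which is where the time saving comes from.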

In [34]:
!pip install asyncpraw

from google.colab import drive
drive.mount('/content/drive')

# Create the output folder after mounting; -p avoids an error if it already exists
!mkdir -p /content/drive/MyDrive/Tutorial_2/

Collecting asyncpraw
  Downloading asyncpraw-7.7.1-py3-none-any.whl.metadata (9.7 kB)
Collecting aiofiles<1 (from asyncpraw)
  Downloading aiofiles-0.8.0-py3-none-any.whl.metadata (7.0 kB)
Collecting aiosqlite<=0.17.0 (from asyncpraw)
  Downloading aiosqlite-0.17.0-py3-none-any.whl.metadata (4.1 kB)
Collecting asyncprawcore<3,>=2.1 (from asyncpraw)
  Downloading asyncprawcore-2.4.0-py3-none-any.whl.metadata (5.5 kB)
Downloading asyncpraw-7.7.1-py3-none-any.whl (196 kB)
Downloading aiofiles-0.8.0-py3-none-any.whl (13 kB)
Downloading aiosqlite-0.17.0-py3-none-any.whl (15 kB)
Downloading asyncprawcore-2.4.0-py3-none-any.whl (19 kB)
Installing collected packages: aiosqlite, aiofiles, asyncprawcore, asyncpraw
Successfully installed aiofiles-0.8.0 aiosqlite-0.17.0 asyncpraw-7.7.1 asyncprawcore-2.4.0
Drive already mounted at /content/drive; to attempt to forcibly remou

In [35]:
import pandas as pd
import time
from tqdm import tqdm
import asyncio
import asyncpraw
from asyncprawcore.exceptions import TooManyRequests

import logging

# Suppress warnings from PRAW and Async PRAW
logging.getLogger('praw').setLevel(logging.ERROR)
logging.getLogger('asyncpraw').setLevel(logging.ERROR)

# client_id = 'your_client_id'
# client_secret = 'your_client_secret'

reddit = asyncpraw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="Scraper"
)

In [36]:
class SubredditDataFetcher:
    def __init__(self, reddit, subreddit_name, filter_type="new", limit=10):
        self.reddit = reddit
        self.subreddit_name = subreddit_name
        self.filter_type = filter_type
        self.limit = limit
        self.submission_usernames = set()
        self.comment_usernames = set()
        self.submission_ids = set()

    async def submission_to_dict(self, submission):
        data = {
            'author': str(submission.author) if submission.author else 'NA',
            'author_flair_text': submission.author_flair_text,
            'clicked': submission.clicked,
            'comments': len(submission.comments),
            'created_utc': submission.created_utc,
            'distinguished': submission.distinguished,
            'edited': submission.edited,
            'id': submission.id,
            'is_original_content': submission.is_original_content,
            'is_self': submission.is_self,
            'locked': submission.locked,
            'name': submission.name,
            'num_comments': submission.num_comments,
            'over_18': submission.over_18,
            'permalink': submission.permalink,
            'saved': submission.saved,
            'score': submission.score,
            'selftext': submission.selftext,
            'spoiler': submission.spoiler,
            'stickied': submission.stickied,
            'subreddit': str(submission.subreddit),
            'title': submission.title,
            'upvote_ratio': submission.upvote_ratio,
            'url': submission.url
        }
        return data

    async def fetch_submissions(self):
        subreddit = await self.reddit.subreddit(self.subreddit_name)
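        # Use the listing named by filter_type (e.g. "new", "hot", "top") instead of always using "new"
        submissions = getattr(subreddit, self.filter_type)(limit=self.limit)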
        async for submission in submissions:
            self.submission_ids.add(submission.id)
            self.submission_usernames.add(str(submission.author))
            yield await self.submission_to_dict(submission)

    async def fetch_submission_comments(self, submission_id):
        submission = await self.reddit.submission(id=submission_id)
        await submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            self.comment_usernames.add(str(comment.author))
            yield await self.comment_to_dict(comment)

    async def fetch_all_submission_comments(self):
        for submission_id in self.submission_ids:
            async for comment in self.fetch_submission_comments(submission_id):
                yield comment

    async def comment_to_dict(self, comment):
        data = {
            'author': str(comment.author) if comment.author else 'NA',
            'body': comment.body,
            'body_html': comment.body_html,
            'created_utc': comment.created_utc,
            'distinguished': comment.distinguished,
            'edited': comment.edited,
            'id': comment.id,
            'is_submitter': comment.is_submitter,
            'link_id': comment.link_id,
            'parent_id': comment.parent_id,
            'permalink': comment.permalink,
            'replies': len(comment.replies),
            'saved': comment.saved,
            'score': comment.score,
            'stickied': comment.stickied,
            'submission': str(comment.submission),
            'subreddit': str(comment.subreddit),
            'subreddit_id': comment.subreddit_id
        }
        return data


In [37]:
user = await reddit.redditor('MAPepperCanadaWet')
getattr(user, 'created_utc', 'NA')

'NA'
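
The cell above returns `'NA'` because the `created_utc` attribute is not present on the user object at that point, and `getattr` with a third argument returns that default instead of raising `AttributeError`. A minimal illustration using a stand-in object (not a real Redditor; the name `example_user` is hypothetical):

```python
from types import SimpleNamespace

# Stand-in object that only has a name, like a user whose details are not loaded.
user_stub = SimpleNamespace(name="example_user")

fields = ("name", "created_utc", "link_karma")

# getattr(obj, attr, default) falls back to the default for missing attributes.
record = {field: getattr(user_stub, field, "NA") for field in fields}
print(record)
```

This is the same pattern the `user_to_dict` method below relies on for each field.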

In [38]:
class UserDataFetcher:
    def __init__(self, reddit):
        self.reddit = reddit

    async def user_to_dict(self, user):
        data = {
            'name': getattr(user, 'name', 'NA'),
            'comment_karma': getattr(user, 'comment_karma', 'NA'),
            'created_utc': getattr(user, 'created_utc', 'NA'),
            'has_verified_email': getattr(user, 'has_verified_email', 'NA'),
            'icon_img': getattr(user, 'icon_img', 'NA'),
            'id': getattr(user, 'id', 'NA'),
            'is_employee': getattr(user, 'is_employee', 'NA'),
            'is_friend': getattr(user, 'is_friend', 'NA'),
            'is_mod': getattr(user, 'is_mod', 'NA'),
            'is_gold': getattr(user, 'is_gold', 'NA'),
            'link_karma': getattr(user, 'link_karma', 'NA'),
            'subreddit_banner_img': user.subreddit.banner_img if user.subreddit else 'NA',
            'subreddit_name': user.subreddit.name if user.subreddit else 'NA',
            'subreddit_over_18': user.subreddit.over_18 if user.subreddit else 'NA',
            'subreddit_public_description': user.subreddit.public_description if user.subreddit else 'NA',
            'subreddit_subscribers': user.subreddit.subscribers if user.subreddit else 'NA',
            'subreddit_title': user.subreddit.title if user.subreddit else 'NA',
        }
        return data

    async def fetch_user_data(self, usernames):
        for username in tqdm(usernames, desc="Fetching user data"):
            while True:
                try:
                    user = await self.reddit.redditor(username)
                    await user.load()
                    yield await self.user_to_dict(user)
                    break
                except TooManyRequests as e:
                    print(e)
                    await asyncio.sleep(60)
                except Exception as e:
                    print(e)
                    break

    async def fetch_user_submissions(self, username, filter_type="new", limit=10):
        user = await self.reddit.redditor(username)
        # Use the listing named by filter_type instead of always using "new"
        submissions = getattr(user.submissions, filter_type)(limit=limit)
        async for submission in submissions:
            while True:
                try:
                    yield {
                        'author': str(submission.author) if submission.author else 'NA',
                        'author_flair_text': submission.author_flair_text,
                        'clicked': submission.clicked,
                        'comments': len(submission.comments),
                        'created_utc': submission.created_utc,
                        'distinguished': submission.distinguished,
                        'edited': submission.edited,
                        'id': submission.id,
                        'is_original_content': submission.is_original_content,
                        'is_self': submission.is_self,
                        'locked': submission.locked,
                        'name': submission.name,
                        'num_comments': submission.num_comments,
                        'over_18': submission.over_18,
                        'permalink': submission.permalink,
                        'saved': submission.saved,
                        'score': submission.score,
                        'selftext': submission.selftext,
                        'spoiler': submission.spoiler,
                        'stickied': submission.stickied,
                        'subreddit': str(submission.subreddit),
                        'title': submission.title,
                        'upvote_ratio': submission.upvote_ratio,
                        'url': submission.url
                    }
                    break
                except TooManyRequests as e:
                    print(e)
                    await asyncio.sleep(60)
                except Exception as e:
                    print(e)
                    break

    async def fetch_user_comments(self, username, filter_type="new", limit=10):
        user = await self.reddit.redditor(username)
        # Use the listing named by filter_type instead of always using "new"
        comments = getattr(user.comments, filter_type)(limit=limit)
        async for comment in comments:
            while True:
                try:
                    yield {
                        'author': str(comment.author) if comment.author else 'NA',
                        'body': comment.body,
                        'body_html': comment.body_html,
                        'created_utc': comment.created_utc,
                        'distinguished': comment.distinguished,
                        'edited': comment.edited,
                        'id': comment.id,
                        'is_submitter': comment.is_submitter,
                        'link_id': comment.link_id,
                        'parent_id': comment.parent_id,
                        'permalink': comment.permalink,
                        'replies': len(comment.replies),
                        'saved': comment.saved,
                        'score': comment.score,
                        'stickied': comment.stickied,
                        'submission': str(comment.submission),
                        'subreddit': str(comment.subreddit),
                        'subreddit_id': comment.subreddit_id
                    }
                    break
                except TooManyRequests as e:
                    print(e)
                    await asyncio.sleep(60)
                except Exception as e:
                    print(e)
                    break

    async def fetch_all_user_submissions(self, usernames, filter_type="new", limit=10):
        for username in tqdm(usernames, desc="Fetching all user submissions"):
            while True:
                try:
                    async for submission in self.fetch_user_submissions(username, filter_type, limit):
                        yield submission
                    break
                except TooManyRequests as e:
                    print(e)
                    await asyncio.sleep(60)
                except Exception as e:
                    print(e)
                    break

    async def fetch_all_user_comments(self, usernames, filter_type="new", limit=10):
        for username in tqdm(usernames, desc="Fetching all user comments"):
            while True:
                try:
                    async for comment in self.fetch_user_comments(username, filter_type, limit):
                        yield comment
                    break
                except TooManyRequests as e:
                    print(e)
                    await asyncio.sleep(60)
                except Exception as e:
                    print(e)
                    break
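
The fetchers above sleep for a fixed 60 seconds after a `TooManyRequests` error and retry without limit. As a hedged alternative, here is a minimal sketch of a bounded retry with capped exponential backoff. The names `RateLimited` and `retry_with_backoff` are hypothetical and not part of PRAW; `RateLimited` only stands in for the real exception so the sketch runs on its own. It wraps a single awaitable call rather than an async generator, because retrying a generator restarts it and may yield the same items twice.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for the TooManyRequests exception caught above."""


async def retry_with_backoff(make_call, retry_on=(RateLimited,), max_attempts=5,
                             base_delay=1.0, max_delay=60.0):
    """Await make_call(), retrying on rate-limit errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            # Delay doubles on each attempt: base, 2*base, 4*base, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps parallel tasks from retrying in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

In the notebook, this could be called as `await retry_with_backoff(some_coroutine_function, retry_on=(TooManyRequests,))`, where `some_coroutine_function` is any zero-argument coroutine function that performs one request.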

### Let's first retrieve:
(1) New submissions and their comments from r/datascience  
(2) For the authors of those submissions and comments: user data, plus each user's recent submissions and comments.

In [39]:
async def main():
    subreddit_name = "datascience"
    subreddit_data_fetcher = SubredditDataFetcher(reddit, subreddit_name, filter_type="new", limit=10)

    submission_data = [submission async for submission in subreddit_data_fetcher.fetch_submissions()]
    comment_data = [comment async for comment in subreddit_data_fetcher.fetch_all_submission_comments()]

    ds_submission_df = pd.DataFrame(submission_data)
    ds_submission_comments_df = pd.DataFrame(comment_data)

    print("Submissions DataFrame:")
    print(ds_submission_df)
    print("\nComments DataFrame:")
    print(ds_submission_comments_df)

    # Union the two author sets so users who both posted and commented are fetched only once
    usernames = list(set(ds_submission_df['author'].astype(str)) | set(ds_submission_comments_df['author'].astype(str)))

    user_data_fetcher = UserDataFetcher(reddit)

    # Fetch all user data, recent submissions, and comments for each user
    user_data = [user async for user in user_data_fetcher.fetch_user_data(usernames)]
    all_user_submissions = [submission async for submission in user_data_fetcher.fetch_all_user_submissions(usernames, filter_type="new", limit=5)]
    all_user_comments = [comment async for comment in user_data_fetcher.fetch_all_user_comments(usernames, filter_type="new", limit=5)]

    # Convert the data to DataFrames for easier analysis
    ds_user_data_df = pd.DataFrame(user_data)
    ds_user_submissions_df = pd.DataFrame(all_user_submissions)
    ds_user_comments_df = pd.DataFrame(all_user_comments)

    print("User Data DataFrame:")
    print(ds_user_data_df)
    print("\nUser Submissions DataFrame:")
    print(ds_user_submissions_df)
    print("\nUser Comments DataFrame:")
    print(ds_user_comments_df)

    # Save the DataFrames to csv
    ds_submission_df.to_csv(f'/content/drive/MyDrive/Tutorial_2/{subreddit_name}-new-submission.csv')
    ds_submission_comments_df.to_csv(f'/content/drive/MyDrive/Tutorial_2/{subreddit_name}-new-comment.csv')

    ds_user_data_df.to_csv(f'/content/drive/MyDrive/Tutorial_2/{subreddit_name}-user-data.csv')
    ds_user_submissions_df.to_csv(f'/content/drive/MyDrive/Tutorial_2/{subreddit_name}-user-submission.csv')
    ds_user_comments_df.to_csv(f'/content/drive/MyDrive/Tutorial_2/{subreddit_name}-user-comment.csv')

await main()

Submissions DataFrame:
                author author_flair_text  clicked  comments   created_utc  \
0          santiviquez              None    False         0  1.726167e+09   
1             jmhimara              None    False         0  1.726166e+09   
2     nobody_undefined              None    False         0  1.726136e+09   
3             gomezalp              None    False         0  1.726113e+09   
4   Waste_Necessary654              None    False         0  1.726105e+09   
5         starktonny11              None    False         0  1.726084e+09   
6            venkarafa              None    False         0  1.726080e+09   
7     ArticleLegal5612              None    False         0  1.726068e+09   
8             gomezalp              None    False         0  1.726038e+09   
9  Gold-Artichoke-9288              None    False         0  1.726014e+09   

  distinguished        edited       id  is_original_content  is_self  ...  \
0          None         False  1ffa09p              

Fetching user data: 100%|██████████| 186/186 [00:59<00:00,  3.13it/s]
Fetching all user submissions: 100%|██████████| 186/186 [01:08<00:00,  2.73it/s]
Fetching all user comments: 100%|██████████| 186/186 [01:51<00:00,  1.67it/s]

User Data DataFrame:
                    name  comment_karma   created_utc  has_verified_email  \
0            santiviquez            232  1.587593e+09                True   
1           starktonny11             63  1.639769e+09               False   
2    Gold-Artichoke-9288            155  1.643156e+09                True   
3              venkarafa            555  1.532518e+09                True   
4       nobody_undefined             70  1.686853e+09                True   
..                   ...            ...           ...                 ...   
181       OmnipresentCPU          40433  1.555335e+09                True   
182     NotMyRealName778          38009  1.591869e+09                True   
183           fishnet222            453  1.714879e+09               False   
184  Novel_Frosting_1977          19738  1.600451e+09                True   
185          SaraSavvy24             13  1.718813e+09                True   

                                              icon_img
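
The `created_utc` columns above are printed as raw floats such as `1.726167e+09`, which are seconds since the Unix epoch. A minimal sketch of converting them to readable timestamps with pandas follows. The toy DataFrame and the `created_at` column name are assumptions for illustration; in the notebook the same call would apply to `ds_submission_df` or any of the other DataFrames.

```python
import pandas as pd

# Toy rows shaped like the submissions DataFrame above (values are illustrative only)
df = pd.DataFrame({
    "author": ["user_a", "user_b"],
    "created_utc": [1726167000.0, 1726166000.0],
})

# created_utc holds seconds since the Unix epoch; convert to timezone-aware UTC timestamps
df["created_at"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)

print(df[["author", "created_at"]])
```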




### Let's also retrieve:
(3) 100 new submissions from each of r/datascience, r/MachineLearning, r/datasets, r/dataisbeautiful, r/learnpython

In [40]:
async def main():
    subreddit_name_list = ["datascience", "MachineLearning", "datasets", "dataisbeautiful", "learnpython"]

    for subreddit_name in subreddit_name_list:
        subreddit_data_fetcher = SubredditDataFetcher(reddit, subreddit_name, filter_type="new", limit=100)
        # Collect the async generator's results with "async for" inside a list comprehension
        submission_df = pd.DataFrame([submission async for submission in subreddit_data_fetcher.fetch_submissions()])
        submission_df.to_csv(f'/content/drive/MyDrive/Tutorial_2/{subreddit_name}-new-submission-for-subreddit-network.csv')

        print(submission_df)

await main()

                 author                           author_flair_text  clicked  \
0           santiviquez                                        None    False   
1              jmhimara                                        None    False   
2      nobody_undefined                                        None    False   
3              gomezalp                                        None    False   
4    Waste_Necessary654                                        None    False   
..                  ...                                         ...      ...   
95  Cheap_Scientist6984                                        None    False   
96         therockhound                                        None    False   
97          secret_fyre                                        None    False   
98          cruelbankai  MS Math | Data Scientist II | Supply Chain    False   
99          santiviquez                                        None    False   

    comments   created_utc distinguishe
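
The CSV file names above end in `for-subreddit-network`. One possible way to turn per-subreddit submissions into network edges is to link two subreddits whenever the same author posted in both. The sketch below is an assumption about how such edges could be built, not necessarily the approach this tutorial uses; the toy DataFrames and the column names `source`, `target`, and `shared_authors` are illustrative.

```python
from itertools import combinations

import pandas as pd

# Toy stand-ins for the per-subreddit submission DataFrames (illustrative values only)
frames = {
    "datascience": pd.DataFrame({"author": ["user_a", "user_b", "user_c"]}),
    "MachineLearning": pd.DataFrame({"author": ["user_b", "user_c", "user_d"]}),
    "learnpython": pd.DataFrame({"author": ["user_c", "user_e"]}),
}

# Unique authors per subreddit
authors = {name: set(df["author"]) for name, df in frames.items()}

# One candidate edge per subreddit pair, weighted by the number of shared authors
edges = pd.DataFrame(
    [
        {"source": a, "target": b, "shared_authors": len(authors[a] & authors[b])}
        for a, b in combinations(authors, 2)
    ]
)

# Keep only pairs that actually share at least one author
edges = edges[edges["shared_authors"] > 0].reset_index(drop=True)

print(edges)
```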

# On your own
1.1. Go to https://www.reddit.com/ and select a subreddit that interests your group. (The more active the subreddit, the better your network visualization will look, although the code will take longer to run.) Get 10 new submissions and all the comments under them.  
1.2. Also get the user data, user submissions (5 per user), and user comments (5 per user) for the authors of those submissions and comments.  
2.1. Select 5 subreddits that are somewhat related to one another. Get 100 new submissions from each.