In this tutorial, we will scrape Reddit submissions (posts), comments, and user data using PRAW (the Python Reddit API Wrapper), which simplifies communicating with the Reddit API from Python.

- PRAW documentation: https://praw.readthedocs.io/en/stable/index.html
- Reddit API documentation: https://www.reddit.com/dev/api/

Before running this code, you need credentials for the Reddit API. If you don't have an account yet, sign up for Reddit (https://www.reddit.com/), then follow the instructions at the link below.

- Getting Reddit API credentials: https://josephlai241.github.io/URS/credentials.html

Save your API ID (client ID) and password (client secret) in a secure place.

## Install Dependencies and Import Modules

In [3]:
!pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl (191 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: update-checker, prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.3.0 update-checker-0.18.0


In [2]:
# Optional: mount Google Drive if you want to save files there

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import PRAW and create a Reddit instance to get access to the Reddit API.

In [5]:
import praw

client_id = "YOUR_CLIENT_ID"          # replace with your API ID
client_secret = "YOUR_CLIENT_SECRET"  # replace with your API password (secret)

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="Scraper",
)
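To avoid typing credentials directly into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are hypothetical, not something PRAW defines.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Return keyword arguments for praw.Reddit read from environment variables.

    The variable names below are hypothetical; use whatever names you set.
    """
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
    }
    # Collect every missing variable so the error message lists them all at once
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}


# Possible usage once the variables are set:
# reddit = praw.Reddit(**load_reddit_credentials(), user_agent="Scraper")
```

This keeps the secret out of the saved notebook, which matters if the notebook is shared or committed to a repository.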

Ignore unimportant warnings

In [6]:
import logging

# Suppress warnings from PRAW
logging.getLogger('praw').setLevel(logging.ERROR)

# Explore objects

# Explore what we can get

## 1. We can scrape submissions within a subreddit.
- You must specify a listing category: Hot, New, Controversial, Top, Rising, or Search.

In [7]:
subreddit = reddit.subreddit("datascience")

In [10]:
?subreddit

In [29]:
subreddit.top(limit=5)

<praw.models.listing.generator.ListingGenerator at 0x78332a357fa0>

In [31]:
submission_example = next(subreddit.top(limit=5))
submission_example

Submission(id='k8nyf8')

In [32]:
?submission_example

In [15]:
# Get the 5 top posts
for submission in subreddit.top(limit=5):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: k8nyf8
Title: data siens
Author: None
Text: 
URL: https://dslntlv9vhjr4.cloudfront.net/posts_images/EcY6g2neQEaIi.png
----------
ID: oeg6nl
Title: The pain and excitement
Author: Kent-Clark-
Text: 
URL: https://i.redd.it/yqnunwryjg971.jpg
----------
ID: hohvgq
Title: Shout Out to All the Mediocre Data Scientists Out There
Author: MrBurritoQuest
Text: I've been lurking on this sub for a while now and all too often I see posts from people claiming they feel inadequate and then they go on to describe their stupid impressive background and experience. That's great and all but I'd like to move the spotlight to the rest of us for just a minute. Cheers to my fellow mediocre data scientists who don't work at FAANG companies, aren't pursing a PhD, don't publish papers, haven't won Kaggle competitions, and don't spend every waking hour improving their portfolio.  Even though we're nothing special, we still deserve some appreciation every once in a while.

/rant I'll hand it back over to the 
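Since each listing category corresponds to a method of the same name on the subreddit object (`subreddit.top`, `subreddit.new`, and so on), the category can be chosen by name at run time. The helper below is a minimal sketch; `get_listing` and `LISTING_CATEGORIES` are hypothetical names introduced here for illustration.

```python
LISTING_CATEGORIES = ("hot", "new", "controversial", "top", "rising")


def get_listing(subreddit, category, limit=5, **kwargs):
    """Call the listing method named `category` on a subreddit-like object."""
    if category not in LISTING_CATEGORIES:
        raise ValueError(
            f"Unknown category {category!r}; expected one of {LISTING_CATEGORIES}"
        )
    # getattr looks up e.g. subreddit.top, which is then called like any method
    return getattr(subreddit, category)(limit=limit, **kwargs)


# Possible usage with the `subreddit` object defined above:
# for submission in get_listing(subreddit, "top", limit=5, time_filter="week"):
#     print(submission.title)
```

`search` is left out on purpose because it takes a query string rather than only a limit.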

In [22]:
?subreddit.search

In [21]:
# The other categories work the same way as 'top'; only 'search' differs.
# To see how the 'search' method works, let's search r/datascience submissions from the past day for the keyword 'python'.

for submission in subreddit.search("python", time_filter="day"):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: 16bnpho
Title: Hiring - Data Scientist (Part Time). Freshers or students are welcome to apply
Author: Several-Donut4969
Text: I am looking for a remote data scientist to work on a part time basis. Freshers or students are welcome, provided they possess a solid foundation in Python, data visualization, and modeling. If you're interested, please send your resume along with your salary expectations to endlabava@gmail.com
URL: https://www.reddit.com/r/datascience/comments/16bnpho/hiring_data_scientist_part_time_freshers_or/
----------
ID: 16bti4k
Title: Translation of Code to Plain English?
Author: MikeA112
Text: Hi, 

Are there any tools out there - other than OpenAI's Codex - that converts code in any programming language (e.x. Python) into plain English? 

Thanks, 
URL: https://www.reddit.com/r/datascience/comments/16bti4k/translation_of_code_to_plain_english/
----------
ID: 16bw9ii
Title: How how how! I couldn't answer this question of my friend?
Author: midomiii
Text:  How do plat

## 2. You can scrape a specific user's data as well.

In [46]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [24]:
?user

In [47]:
print(f"Username: {user.name}")
print(f"Karma: {user.link_karma + user.comment_karma}")
print(f"The account is created at: {user.created_utc}") # What is this strange number?

Username: MrBurritoQuest
Karma: 10551
The account is created at: 1508214093.0


In [48]:
from datetime import datetime
print(datetime.utcfromtimestamp(user.created_utc).strftime('%Y-%m-%d %H:%M:%S'))

2017-10-17 04:21:33
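The "strange number" is a Unix timestamp: seconds elapsed since 1970-01-01 00:00:00 UTC. Note that `datetime.utcfromtimestamp` is deprecated as of Python 3.12. A timezone-aware alternative could look like the sketch below, where `utc_datetime` and `account_age_days` are hypothetical helper names.

```python
from datetime import datetime, timezone


def utc_datetime(created_utc):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


def account_age_days(created_utc, now=None):
    """Whole days between the timestamp and `now` (defaults to the current time)."""
    now = now or datetime.now(timezone.utc)
    return (now - utc_datetime(created_utc)).days


print(utc_datetime(1508214093.0).strftime('%Y-%m-%d %H:%M:%S'))  # 2017-10-17 04:21:33
```

Keeping the timezone attached avoids accidentally mixing UTC values with local times later on.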


## 3. You can scrape every comment under a specific submission.

In [33]:
# Use the submission ID or URL
submission_id = '15y7j15'
submission = reddit.submission(id=submission_id)

In [40]:
# This does exactly the same
submission_url = 'https://www.reddit.com/r/datascience/comments/15y7j15/microsoft_is_bringing_python_to_excel/'
submission = reddit.submission(url=submission_url)

In [41]:
submission.comments.replace_more(limit=None)
len(submission.comments.list())

117

In [42]:
comment = submission.comments.list()[0]

In [43]:
print(f"Author: {comment.author}")
print(f"Body: {comment.body}")

Author: Exact-Bird-4203
Body: Feel like this has been hyped forever. Excited to actually use it


## 4. We can also scrape a specific user's Reddit activities (submissions and comments) in general.

In [65]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [66]:
?user

In [67]:
for submission in user.submissions.new(limit=5):
    print(f"ID: {submission.id}")
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Text: {submission.selftext}") # For self-posts
    print(f"URL: {submission.url}") # For link-posts
    print('----------')

ID: 11r7fdi
Title: Bought a T-Bill and feel scammed
Author: MrBurritoQuest
Text: Perhaps “scammed” is too harsh, maybe stupid is more appropriate as I clearly don’t understand how these work. My wife and I needed to stash some cash for 6 months and decided to go with a T Bill instead instead of a HYSA (first time buying one). 

We purchased $15k of the March 16th 6 Month T Bill through Fidelity. On our statement it shows a Yield to Maturity of ~4.89% so I was expecting our purchase amount to be ~$14,266 but instead our purchase amount came out to be $14,643. If my math is correct, this is a measly 2.38% which is lower than our current HYSA. 

Where did I go wrong? What are the tax implications of selling this T Bill and putting the money into a CD or HYSA?

Edit: thank you everyone, I now see that 6 months = half a year thus half the APY. If you need me, I’ll be in the corner wearing my dunce cone.
URL: https://www.reddit.com/r/personalfinance/comments/11r7fdi/bought_a_tbill_and_feel_s

In [72]:
for comment in user.comments.new(limit=5):
    print(f"ID: {comment.id}")
    print(f"Body: {comment.body}")
    print('----------')

ID: jyk2t51
Body: Hey at least you acknowledged you were mistaken and were open to learning! That’s more than I can say for 90% of my interactions on Reddit lol
----------
ID: jyjtd0y
Body: Crimes rates are almost always measured per capita which by definition accounts for populations differences. The US *does have a [significantly higher murder rate](https://worldpopulationreview.com/country-rankings/murder-rate-by-country)* compared to other industrialized countries (but lower than most developing countries)
----------
ID: jyat8rt
Body: > can’t just sell stock like that

[He literally can](https://www.forbes.com/sites/kerryadolan/2021/11/04/jeff-bezos-just-sold-2-billion-worth-of-amazon-stock/amp/). And he has done so every year. You’re vastly underestimating the amount of retail investors, mutual funds, investment banks, etc who would gladly buy Amazon stock at a discount when Bezos sells, which in turn drives the price back up. This isn’t even theoretical, Bezos, and many other 

# Define functions
- submission_to_dict(submission) returns most of the available attributes of a submission object as a dictionary.
- user_to_dict(user) returns most of the available attributes of a user object as a dictionary.
- comment_to_dict(comment) returns most of the available attributes of a comment object as a dictionary.

In [73]:
def submission_to_dict(submission): # submission = reddit.submission(id=submission_id)
    data = {
        'author': str(submission.author),
        'author_flair_text': submission.author_flair_text,
        'clicked': submission.clicked,
        'comments': len(submission.comments),
        'created_utc': submission.created_utc,
        'distinguished': submission.distinguished,
        'edited': submission.edited,
        'id': submission.id,
        'is_original_content': submission.is_original_content,
        'is_self': submission.is_self,
        'locked': submission.locked,
        'name': submission.name,
        'num_comments': submission.num_comments,
        'over_18': submission.over_18,
        'permalink': submission.permalink,
        'saved': submission.saved,
        'score': submission.score,
        'selftext': submission.selftext,
        'spoiler': submission.spoiler,
        'stickied': submission.stickied,
        'subreddit': str(submission.subreddit),
        'title': submission.title,
        'upvote_ratio': submission.upvote_ratio,
        'url': submission.url
    }

    return data

def user_to_dict(user): # user = reddit.redditor(username)
    data = {
        'comment_karma': user.comment_karma,
        'created_utc': user.created_utc,
        'has_verified_email': user.has_verified_email,
        'icon_img': user.icon_img,
        'id': user.id,
        'is_employee': user.is_employee,
        'is_friend': user.is_friend,
        'is_mod': user.is_mod,
        'is_gold': user.is_gold,
        'link_karma': user.link_karma,
        'name': user.name,
        'subreddit_banner_img': user.subreddit.banner_img if user.subreddit else 'NA',
        'subreddit_name': user.subreddit.name if user.subreddit else 'NA',
        'subreddit_over_18': user.subreddit.over_18 if user.subreddit else 'NA',
        'subreddit_public_description': user.subreddit.public_description if user.subreddit else 'NA',
        'subreddit_subscribers': user.subreddit.subscribers if user.subreddit else 'NA',
        'subreddit_title': user.subreddit.title if user.subreddit else 'NA',
    }

    return data

def comment_to_dict(comment): # comment = submission.comments.list()[index]
    data = {
        'author': str(comment.author),
        'body': comment.body,
        'body_html': comment.body_html,
        'created_utc': comment.created_utc,
        'distinguished': comment.distinguished,
        'edited': comment.edited,
        'id': comment.id,
        'is_submitter': comment.is_submitter,
        'link_id': comment.link_id,
        'parent_id': comment.parent_id,
        'permalink': comment.permalink,
        'replies': len(comment.replies),
        'saved': comment.saved,
        'score': comment.score,
        'stickied': comment.stickied,
        'submission': str(comment.submission),
        'subreddit': str(comment.subreddit),
        'subreddit_id': comment.subreddit_id
    }

    return data
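The `id` and `parent_id` fields kept by comment_to_dict are enough to recover how deeply each comment is nested. A `parent_id` starting with `t1_` points to another comment, while `t3_` points to the submission itself. The function below is a minimal sketch under that assumption; `comment_depths` is a hypothetical helper, and comments whose parent is not among the given rows are treated as depth 0.

```python
def comment_depths(comment_rows):
    """Map each comment id to its nesting depth (0 = direct reply to the submission)."""
    parents = {row["id"]: row["parent_id"] for row in comment_rows}
    depths = {}

    def depth(comment_id):
        if comment_id in depths:
            return depths[comment_id]
        parent = parents[comment_id]
        # "t1_<id>" means the parent is a comment; strip the prefix to get its id
        if parent.startswith("t1_") and parent[3:] in parents:
            result = depth(parent[3:]) + 1
        else:
            result = 0  # parent is the submission ("t3_...") or not in the rows
        depths[comment_id] = result
        return result

    for comment_id in parents:
        depth(comment_id)
    return depths


rows = [
    {"id": "a", "parent_id": "t3_post"},
    {"id": "b", "parent_id": "t1_a"},
    {"id": "c", "parent_id": "t1_b"},
]
print(comment_depths(rows))  # {'a': 0, 'b': 1, 'c': 2}
```

The same rows can be passed straight from a list of comment_to_dict results.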

# Scrape and return dataframes
- Because the functions defined above return dictionaries, we can loop over multiple objects, collect one dictionary per object, and store the results in a pandas DataFrame.

## Submissions

In [75]:
import pandas as pd

subreddit = reddit.subreddit("datascience")
submissions_data = []

# Get the top 5 posts
for submission in subreddit.top(limit=5):
    submissions_data.append(submission_to_dict(submission))

# Convert the list of dictionaries to a DataFrame
submissions_df = pd.DataFrame(submissions_data)

submissions_df

Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,,,False,26,1607371000.0,,False,k8nyf8,False,False,...,/r/datascience/comments/k8nyf8/data_siens/,False,4088,,False,False,datascience,data siens,0.97,https://dslntlv9vhjr4.cloudfront.net/posts_ima...
1,Kent-Clark-,,False,41,1625519000.0,,False,oeg6nl,False,False,...,/r/datascience/comments/oeg6nl/the_pain_and_ex...,False,3904,,False,False,datascience,The pain and excitement,0.97,https://i.redd.it/yqnunwryjg971.jpg
2,MrBurritoQuest,,False,97,1594353000.0,,False,hohvgq,False,True,...,/r/datascience/comments/hohvgq/shout_out_to_al...,False,3611,I've been lurking on this sub for a while now ...,False,False,datascience,Shout Out to All the Mediocre Data Scientists ...,0.99,https://www.reddit.com/r/datascience/comments/...
3,CompetitivePlastic67,,False,28,1663139000.0,,False,xdv6nz,False,False,...,/r/datascience/comments/xdv6nz/lets_keep_this_on/,False,3587,,False,False,datascience,Let's keep this on...,0.97,https://i.redd.it/k102dyo0yrn91.jpg
4,,,False,172,1647837000.0,,False,tj3kek,False,False,...,/r/datascience/comments/tj3kek/guys_weve_been_...,False,3461,,False,False,datascience,"Guys, we’ve been doing it wrong this whole time",0.96,https://i.imgur.com/TAex5zG.jpg


## User information

In [76]:
username_list = ['MrBurritoQuest'] # Add usernames as needed
user_data = []

for username in username_list:
    user = reddit.redditor(username)
    user_data.append(user_to_dict(user))

# Convert the list of dictionaries to a DataFrame
user_df = pd.DataFrame(user_data)
user_df

Unnamed: 0,comment_karma,created_utc,has_verified_email,icon_img,id,is_employee,is_friend,is_mod,is_gold,link_karma,name,subreddit_banner_img,subreddit_name,subreddit_over_18,subreddit_public_description,subreddit_subscribers,subreddit_title
0,4571,1508214000.0,True,https://www.redditstatic.com/avatars/defaults/...,hl45shh,False,False,False,False,5980,MrBurritoQuest,,t5_5wl1i,False,,0,


## Comments under a submission

In [None]:
# Use the submission ID or URL
submission_id = '15y7j15'
submission = reddit.submission(id=submission_id)

# Replace "More Comments" objects with the comments they represent
submission.comments.replace_more(limit=None)

submission_comments_data = []

# Iterate through the comments
for comment in submission.comments.list():
    submission_comments_data.append(comment_to_dict(comment))

# Convert the list of dictionaries to a DataFrame
submission_comments_df = pd.DataFrame(submission_comments_data)

print(submission_comments_df)

                  author                                               body  \
0        Exact-Bird-4203  Feel like this has been hyped forever. Excited...   
1              Lockonon3  Thank god, I would drink sulfuric acid before ...   
2            TrollandDie  A million IT Security engineers suddenly and c...   
3            FishFar4370  It looks cool as hell in how it works.  \n\nI ...   
4         SearchAtlantis  Except you're still screwed because of Excel's...   
5          bgighjigftuik  •\t⁠Microsoft acquires a major stake in Python...   
6            iamdeviance             Interesting. Happy debugging everyone!   
7           IbizaMykonos  Can excel handle big data thru some kinda dist...   
8               Xelonima  Finally, something that would make excel actua...   
9               rasmusdf                                  Fucking finally..   
10          wenchaodonut                                          God bless   
11             rickkkkky  So how long til I can run 

## User activity (submissions, comments)

In [77]:
username = 'MrBurritoQuest'
user = reddit.redditor(username)

In [79]:
user_submissions_data = []

# Get user submissions
for submission in user.submissions.new(limit=5):
    user_submissions_data.append(submission_to_dict(submission))

# Convert the list of dictionaries to a DataFrame
user_submissions_df = pd.DataFrame(user_submissions_data)

user_submissions_df


Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,MrBurritoQuest,​,False,7,1678804000.0,,1678805012.0,11r7fdi,False,True,...,/r/personalfinance/comments/11r7fdi/bought_a_t...,False,0,"Perhaps “scammed” is too harsh, maybe stupid i...",False,False,personalfinance,Bought a T-Bill and feel scammed,0.39,https://www.reddit.com/r/personalfinance/comme...
1,MrBurritoQuest,,False,5,1678577000.0,,False,11oyc5u,False,True,...,/r/tipofmytongue/comments/11oyc5u/tomtsongaltr...,False,1,Alt-rock song (with a bit of funk influence). ...,False,False,tipofmytongue,[TOMT][SONG][ALT-ROCK][FUNK][LYRICS],1.0,https://www.reddit.com/r/tipofmytongue/comment...
2,MrBurritoQuest,,False,63,1677446000.0,,False,11cszj2,False,False,...,/r/steak/comments/11cszj2/what_is_this_hole_in...,False,191,,False,False,steak,What is this hole in my steak? Is it safe to e...,0.9,https://i.redd.it/hgylt3ww4nka1.jpg
3,MrBurritoQuest,,False,0,1675042000.0,,False,10op0yk,False,False,...,/r/DunderMifflin/comments/10op0yk/title/,False,2,,False,False,DunderMifflin,Title,0.75,https://i.redd.it/6858o12uwzea1.jpg
4,MrBurritoQuest,,False,4,1674107000.0,,False,10ftryu,False,False,...,/r/LooneyTunesLogic/comments/10ftryu/qatar_is_...,False,17,,False,False,LooneyTunesLogic,Qatar is planning to open one of the world lar...,0.96,https://www.reddit.com/gallery/10f2tpm


In [80]:
user_comments_data = []

# Get user comments
for comment in user.comments.new(limit=5):
    user_comments_data.append(comment_to_dict(comment))

# Convert the list of dictionaries to a DataFrame
user_comments_df = pd.DataFrame(user_comments_data)
user_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,MrBurritoQuest,Hey at least you acknowledged you were mistake...,"<div class=""md""><p>Hey at least you acknowledg...",1693508000.0,,False,jyk2t51,False,t3_1665g1s,t1_jyjtkbq,/r/TrueUnpopularOpinion/comments/1665g1s/canad...,0,False,19,False,1665g1s,TrueUnpopularOpinion,t5_lrajx
1,MrBurritoQuest,Crimes rates are almost always measured per ca...,"<div class=""md""><p>Crimes rates are almost alw...",1693504000.0,,False,jyjtd0y,False,t3_1665g1s,t1_jyjpzx5,/r/TrueUnpopularOpinion/comments/1665g1s/canad...,0,False,28,False,1665g1s,TrueUnpopularOpinion,t5_lrajx
2,MrBurritoQuest,> can’t just sell stock like that\n\n[He liter...,"<div class=""md""><blockquote>\n<p>can’t just se...",1693350000.0,,False,jyat8rt,False,t3_164mnqh,t1_jyamn9c,/r/meirl/comments/164mnqh/meirl/jyat8rt/,0,False,0,False,164mnqh,meirl,t5_2s5ti
3,MrBurritoQuest,> but not nearly as much as people meme about\...,"<div class=""md""><blockquote>\n<p>but not nearl...",1693347000.0,,False,jyak7xy,False,t3_164mnqh,t1_jyahfvc,/r/meirl/comments/164mnqh/meirl/jyak7xy/,0,False,2,False,164mnqh,meirl,t5_2s5ti
4,MrBurritoQuest,You know Bezos has been liquidating over $1B o...,"<div class=""md""><p>You know Bezos has been liq...",1693345000.0,,False,jyagwbe,False,t3_164mnqh,t1_jya8gfi,/r/meirl/comments/164mnqh/meirl/jyagwbe/,0,False,4,False,164mnqh,meirl,t5_2s5ti


# Automate scraping a subreddit activity

## What this code does


1. Returns N1 submissions from the subreddit, selected by the designated filter type.
2. Returns up to N2 submissions per user (posted to any subreddit) for each author collected in steps 1 and 4, so at most U*N2 submissions in total, where U is the number of distinct users.
3. Returns up to N3 comments per user (posted to any subreddit), so at most U*N3 comments in total.
4. Returns every comment under every submission scraped in step 1.
5. Returns user data for every author of the submissions and comments scraped in steps 1 and 4.

In [81]:
import pandas as pd
from tqdm import tqdm
import time
from prawcore.exceptions import TooManyRequests

class SubredditDataFetcher:
    def __init__(self, reddit, subreddit_name, filter_type="new", limit=10):
        self.reddit = reddit
        self.subreddit = self.reddit.subreddit(subreddit_name)
        self.filter_type = filter_type
        self.limit = limit
        self.submissions = list(getattr(self.subreddit, filter_type)(limit=limit))
        self.submission_usernames = set()
        self.comment_usernames = set()
        self.submission_ids = set()

    def user_to_dict(self, user):
        data = {
            'comment_karma': getattr(user, 'comment_karma', 'NA'),
            'created_utc': getattr(user, 'created_utc', 'NA'),
            'has_verified_email': getattr(user, 'has_verified_email', 'NA'),
            'icon_img': getattr(user, 'icon_img', 'NA'),
            'id': getattr(user, 'id', 'NA'),
            'is_employee': getattr(user, 'is_employee', 'NA'),
            'is_friend': getattr(user, 'is_friend', 'NA'),
            'is_mod': getattr(user, 'is_mod', 'NA'),
            'is_gold': getattr(user, 'is_gold', 'NA'),
            'link_karma': getattr(user, 'link_karma', 'NA'),
            'name': getattr(user, 'name', 'NA'),
            'subreddit_banner_img': user.subreddit.banner_img if user.subreddit else 'NA',
            'subreddit_name': user.subreddit.name if user.subreddit else 'NA',
            'subreddit_over_18': user.subreddit.over_18 if user.subreddit else 'NA',
            'subreddit_public_description': user.subreddit.public_description if user.subreddit else 'NA',
            'subreddit_subscribers': user.subreddit.subscribers if user.subreddit else 'NA',
            'subreddit_title': user.subreddit.title if user.subreddit else 'NA',
        }
        return data

    def submission_to_dict(self, submission):
        data = {
            'author': str(getattr(submission, 'author', 'NA')),
            'author_flair_text': getattr(submission, 'author_flair_text', 'NA'),
            'clicked': getattr(submission, 'clicked', 'NA'),
            'comments': len(getattr(submission, 'comments', [])),
            'created_utc': getattr(submission, 'created_utc', 'NA'),
            'distinguished': getattr(submission, 'distinguished', 'NA'),
            'edited': getattr(submission, 'edited', 'NA'),
            'id': getattr(submission, 'id', 'NA'),
            'is_original_content': getattr(submission, 'is_original_content', 'NA'),
            'is_self': getattr(submission, 'is_self', 'NA'),
            'locked': getattr(submission, 'locked', 'NA'),
            'name': getattr(submission, 'name', 'NA'),
            'num_comments': getattr(submission, 'num_comments', 'NA'),
            'over_18': getattr(submission, 'over_18', 'NA'),
            'permalink': getattr(submission, 'permalink', 'NA'),
            'saved': getattr(submission, 'saved', 'NA'),
            'score': getattr(submission, 'score', 'NA'),
            'selftext': getattr(submission, 'selftext', 'NA'),
            'spoiler': getattr(submission, 'spoiler', 'NA'),
            'stickied': getattr(submission, 'stickied', 'NA'),
            'subreddit': str(getattr(submission, 'subreddit', 'NA')),
            'title': getattr(submission, 'title', 'NA'),
            'upvote_ratio': getattr(submission, 'upvote_ratio', 'NA'),
            'url': getattr(submission, 'url', 'NA')
        }
        return data

    def comment_to_dict(self, comment):
        data = {
            'author': str(getattr(comment, 'author', 'NA')),
            'body': getattr(comment, 'body', 'NA'),
            'body_html': getattr(comment, 'body_html', 'NA'),
            'created_utc': getattr(comment, 'created_utc', 'NA'),
            'distinguished': getattr(comment, 'distinguished', 'NA'),
            'edited': getattr(comment, 'edited', 'NA'),
            'id': getattr(comment, 'id', 'NA'),
            'is_submitter': getattr(comment, 'is_submitter', 'NA'),
            'link_id': getattr(comment, 'link_id', 'NA'),
            'parent_id': getattr(comment, 'parent_id', 'NA'),
            'permalink': getattr(comment, 'permalink', 'NA'),
            'replies': len(getattr(comment, 'replies', [])),
            'saved': getattr(comment, 'saved', 'NA'),
            'score': getattr(comment, 'score', 'NA'),
            'stickied': getattr(comment, 'stickied', 'NA'),
            'submission': str(getattr(comment, 'submission', 'NA')),
            'subreddit': str(getattr(comment, 'subreddit', 'NA')),
            'subreddit_id': getattr(comment, 'subreddit_id', 'NA')
        }
        return data

    def fetch_submissions(self):
        for submission in tqdm(self.submissions, desc="Fetching submissions"):

            while True:

                try:
                    self.submission_ids.add(submission.id)
                    if submission.author is not None:  # skip deleted accounts
                        self.submission_usernames.add(str(submission.author))
                    yield self.submission_to_dict(submission)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(submission.id)
                    break

    def fetch_user_data(self):
        all_usernames = self.submission_usernames.union(self.comment_usernames)
        for username in tqdm(all_usernames, desc="Fetching user data"):

            while True:

                try:
                    user = self.reddit.redditor(username)
                    yield self.user_to_dict(user)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(username)
                    break

    def fetch_user_submissions(self, username, filter_type="new", limit=10):
        user = self.reddit.redditor(username)
        for submission in getattr(user.submissions, filter_type)(limit=limit):

            while True:

                try:
                    yield self.submission_to_dict(submission)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(username)
                    break

    def fetch_user_comments(self, username, filter_type="new", limit=10):
        user = self.reddit.redditor(username)
        for comment in getattr(user.comments, filter_type)(limit=limit):

            while True:

                try:
                    yield self.comment_to_dict(comment)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(username)
                    break

    def fetch_submission_comments(self, submission_id):
        submission = self.reddit.submission(id=submission_id)
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():

            while True:

                try:
                    if comment.author is not None:  # skip deleted accounts
                        self.comment_usernames.add(str(comment.author))
                    yield self.comment_to_dict(comment)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(submission_id)
                    break

    def fetch_all_user_submissions(self, filter_type="new", limit=10):
        all_usernames = self.submission_usernames.union(self.comment_usernames)
        for username in tqdm(all_usernames, desc="Fetching all user submissions"):

            while True:

                try:
                    yield from self.fetch_user_submissions(username, filter_type, limit)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(username)
                    break

    def fetch_all_user_comments(self, filter_type="new", limit=10):
        all_usernames = self.submission_usernames.union(self.comment_usernames)
        for username in tqdm(all_usernames, desc="Fetching all user comments"):

            while True:

                try:
                    yield from self.fetch_user_comments(username, filter_type, limit)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(username)
                    break

    def fetch_all_submission_comments(self):
        for submission_id in tqdm(self.submission_ids, desc="Fetching all submission comments"):

            while True:

                try:
                    yield from self.fetch_submission_comments(submission_id)
                    break

                except TooManyRequests as e:
                    print(e)
                    time.sleep(60)

                except Exception as e:
                    print(e)
                    print(submission_id)
                    break
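The class above repeats the same "try, wait 60 seconds on a rate limit, try again" loop in every method. One way to factor that pattern out is a small helper like the sketch below. `call_with_retry` is a hypothetical name, and unlike the loops above it gives up after `max_attempts` tries instead of retrying forever.

```python
import time


def call_with_retry(func, retry_on, wait_seconds=60, max_attempts=5, sleep=time.sleep):
    """Call func(); when it raises `retry_on`, wait and try again.

    Re-raises the exception once max_attempts calls have failed.
    `sleep` is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {wait_seconds}s")
            sleep(wait_seconds)


# Possible usage inside the class, assuming TooManyRequests is imported as above:
# row = call_with_retry(lambda: self.submission_to_dict(submission), TooManyRequests)
```

Bounding the number of attempts is a design choice: it prevents a long-running scrape from hanging indefinitely if the rate limit never clears.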

In [83]:
fetcher = SubredditDataFetcher(reddit, "datascience", filter_type="new", limit=10)

submission_df = pd.DataFrame(fetcher.fetch_submissions())
submission_comments_df = pd.DataFrame(fetcher.fetch_all_submission_comments())
user_data_df = pd.DataFrame(fetcher.fetch_user_data())
user_submissions_df = pd.DataFrame(fetcher.fetch_all_user_submissions(filter_type="new", limit=10))
user_comments_df = pd.DataFrame(fetcher.fetch_all_user_comments(filter_type="new", limit=10))

Fetching submissions: 100%|██████████| 10/10 [00:03<00:00,  3.10it/s]
Fetching all submission comments: 100%|██████████| 10/10 [00:07<00:00,  1.39it/s]
Fetching user data: 100%|██████████| 13/13 [00:12<00:00,  1.02it/s]
Fetching all user submissions: 100%|██████████| 13/13 [01:55<00:00,  8.92s/it]
Fetching all user comments: 100%|██████████| 13/13 [00:13<00:00,  1.00s/it]
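If a rate-limit retry restarts one of the per-user generators partway through (which may happen with larger limits), rows that were already yielded can, in principle, be yielded a second time. A cautious step is to drop repeated rows by their `id` before analysis. The sketch below uses only the standard library; `dedupe_by_key` is a hypothetical helper.

```python
def dedupe_by_key(rows, key="id"):
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # this id was already kept, so skip the repeat
        seen.add(row[key])
        unique.append(row)
    return unique


rows = [{"id": "a", "score": 1}, {"id": "b", "score": 2}, {"id": "a", "score": 1}]
print(dedupe_by_key(rows))  # [{'id': 'a', 'score': 1}, {'id': 'b', 'score': 2}]
```

On an existing DataFrame, `user_comments_df.drop_duplicates(subset="id")` achieves the same result.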


## Check and save what we've got

In [84]:
submission_df

Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,_unbanned_datum,,False,0,1694087000.0,,False,16cdnyy,False,True,...,/r/datascience/comments/16cdnyy/what_do_you_us...,False,1,I just looked at r/ApacheHive and it’s dead. I...,False,False,datascience,What do you use instead of Hive?,1.0,https://www.reddit.com/r/datascience/comments/...
1,Initial_Employee9917,,False,0,1694086000.0,,False,16cdcsm,False,True,...,/r/datascience/comments/16cdcsm/junior_data_sc...,False,1,"Hello all, \n\nI have been reading here for a ...",False,False,datascience,Junior Data Scientist - Advise for career Prog...,1.0,https://www.reddit.com/r/datascience/comments/...
2,big_booty_booth,,False,0,1694084000.0,,False,16ccq7u,False,True,...,/r/datascience/comments/16ccq7u/laptop_recs/,False,1,Anyone got any good recommendations? Have to b...,False,False,datascience,Laptop Recs,1.0,https://www.reddit.com/r/datascience/comments/...
3,tyr1699,,False,0,1694084000.0,,False,16ccp1p,False,True,...,/r/datascience/comments/16ccp1p/transitioning_...,False,1,Hey fellow data scientists!\n\nI hope this pos...,False,False,datascience,Transitioning to DS from an undergraduate degr...,1.0,https://www.reddit.com/r/datascience/comments/...
4,Lumpy_Ad_6868,,False,1,1694084000.0,,False,16cclu4,False,True,...,/r/datascience/comments/16cclu4/as_data_scient...,False,1,,False,False,datascience,as data scientist/analyst what was you major a...,1.0,https://www.reddit.com/r/datascience/comments/...
5,Sollimann,,False,1,1694083000.0,,False,16cchvy,False,False,...,/r/datascience/comments/16cchvy/chatty_llama_a...,False,0,,False,False,datascience,Chatty LLama: A fullstack Rust + react chat ap...,0.4,https://i.redd.it/ua2kfxi5btmb1.png
6,RightProfile0,,False,0,1694083000.0,,False,16ccadi,False,True,...,/r/datascience/comments/16ccadi/international_...,False,1,"Hello, I'm an international student nearing gr...",False,False,datascience,international student trying to dip toes into ...,1.0,https://www.reddit.com/r/datascience/comments/...
7,Educational_One_2337,,False,1,1694082000.0,,False,16cc4gy,False,True,...,/r/datascience/comments/16cc4gy/couldnt_write_...,False,1,Hey there I have 3 huge graphs. I plotted them...,False,False,datascience,Couldn't write my function on R,1.0,https://www.reddit.com/r/datascience/comments/...
8,External_Oven_6379,,False,1,1694081000.0,,False,16cbtjd,False,True,...,/r/datascience/comments/16cbtjd/not_sure_if_th...,False,2,I've been grappling with this particular chall...,False,False,datascience,Not sure if this problem can be solved with ML...,1.0,https://www.reddit.com/r/datascience/comments/...
9,colinberan,,False,0,1694080000.0,,False,16cbnv8,False,True,...,/r/datascience/comments/16cbnv8/how_to_make_my...,False,1,"So I'm a ""new"" DS/DA (I got my MS almost four ...",False,False,datascience,How to make my next move,0.67,https://www.reddit.com/r/datascience/comments/...
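
The `created_utc` column holds Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read as raw floats. Here is a minimal sketch of converting them with `pd.to_datetime`, using a toy frame whose IDs and timestamps are copied from the displayed output above. The `created_at` column name is an assumption chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for submission_df; values copied from the rows above
toy_df = pd.DataFrame(
    {
        "id": ["16cdnyy", "16cbnv8"],
        "created_utc": [1694087000.0, 1694080000.0],
    }
)

# Interpret the floats as seconds since the Unix epoch, in UTC
toy_df["created_at"] = pd.to_datetime(toy_df["created_utc"], unit="s", utc=True)

print(toy_df[["id", "created_at"]])
```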


In [85]:
user_data_df

Unnamed: 0,comment_karma,created_utc,has_verified_email,icon_img,id,is_employee,is_friend,is_mod,is_gold,link_karma,name,subreddit_banner_img,subreddit_name,subreddit_over_18,subreddit_public_description,subreddit_subscribers,subreddit_title
0,80,1658436000.0,True,https://styles.redditmedia.com/t5_6qs5e9/style...,qb9an4mo,False,False,False,False,137,big_booty_booth,,t5_6qs5e9,False,,0,
1,225,1611663000.0,True,https://styles.redditmedia.com/t5_3scfib/style...,4x2r1nsu,False,False,False,False,1595,Sollimann,,t5_3scfib,False,,0,
2,45,1607911000.0,True,https://styles.redditmedia.com/t5_3k21mp/style...,6psemurb,False,False,False,False,80,Educational_One_2337,,t5_3k21mp,True,,0,nah want my name to be secret
3,751,1566734000.0,True,https://styles.redditmedia.com/t5_23pq4t/style...,4g2pxonu,False,False,False,False,917,MarcelDeSutter,https://styles.redditmedia.com/t5_23pq4t/style...,t5_23pq4t,False,,0,
4,25180,1547395000.0,False,https://styles.redditmedia.com/t5_ulxly/styles...,2zkwx57k,False,False,False,False,395,Slothvibes,,t5_ulxly,True,I talk about overemployment a lot. Stack that ...,0,RevinaWarsula
5,9339,1680063000.0,True,https://styles.redditmedia.com/t5_84rt62/style...,83g1niecs,False,False,False,True,147,_unbanned_datum,,t5_84rt62,True,"I’m probably full of shit, and that’s why I ge...",0,
6,1076,1596367000.0,True,https://www.redditstatic.com/avatars/defaults/...,7b2badgv,False,False,False,False,75,tyr1699,,t5_2xu48i,False,,0,no2ways
7,80,1395253000.0,True,https://www.redditstatic.com/avatars/defaults/...,frktt,False,False,False,False,606,colinberan,,t5_ercg9,False,,0,
8,136,1616054000.0,True,https://styles.redditmedia.com/t5_44f12z/style...,azejemi0,False,False,False,False,41,External_Oven_6379,,t5_44f12z,False,,0,
9,665,1589417000.0,True,https://www.redditstatic.com/avatars/defaults/...,6gxwhbsk,False,False,False,False,1433,RightProfile0,,t5_2nt1uu,False,,0,


In [86]:
user_submissions_df

Unnamed: 0,author,author_flair_text,clicked,comments,created_utc,distinguished,edited,id,is_original_content,is_self,...,permalink,saved,score,selftext,spoiler,stickied,subreddit,title,upvote_ratio,url
0,big_booty_booth,,False,0,1.694084e+09,,False,16ccq7u,False,True,...,/r/datascience/comments/16ccq7u/laptop_recs/,False,1,Anyone got any good recommendations? Have to b...,False,False,datascience,Laptop Recs,1.00,https://www.reddit.com/r/datascience/comments/...
1,big_booty_booth,,False,77,1.692181e+09,,1692267895.0,15sltls,False,True,...,/r/knitting/comments/15sltls/knitting_gifts/,False,82,Have you ever knitted something for someone an...,False,False,knitting,Knitting Gifts,0.87,https://www.reddit.com/r/knitting/comments/15s...
2,big_booty_booth,,False,2,1.688405e+09,,False,14ppakq,False,True,...,/r/hiking/comments/14ppakq/camino_santiago/,False,4,Has anyone done Camino Santiago? If so which r...,False,False,hiking,Camino Santiago,0.75,https://www.reddit.com/r/hiking/comments/14ppa...
3,big_booty_booth,,False,2,1.685538e+09,,False,13wlq4m,False,True,...,/r/datascience/comments/13wlq4m/get_a_doctorat...,False,1,[removed],False,False,datascience,Get a Doctorate or Nah?,1.00,https://www.reddit.com/r/datascience/comments/...
4,big_booty_booth,,False,1,1.684755e+09,,False,13oo7p6,False,True,...,/r/digitalnomad/comments/13oo7p6/dning_as_a_co...,False,2,Do you guys feel like making friends as a coup...,False,False,digitalnomad,DNing as a Couple & Making Friends,0.75,https://www.reddit.com/r/digitalnomad/comments...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,Lumpy_Ad_6868,,False,1,1.693324e+09,,False,164lw8v,False,True,...,/r/hacking/comments/164lw8v/is_there_any_possi...,False,1,[removed],False,False,hacking,is there any possibility that a hacker can hac...,1.00,https://www.reddit.com/r/hacking/comments/164l...
99,Lumpy_Ad_6868,,False,11,1.693313e+09,,False,164hfd8,False,True,...,/r/saudiarabia/comments/164hfd8/الحين_صدق_الدك...,False,1,عندي كلاس انقليزي بشكل يومي وكل مره ابي انادي ...,False,False,saudiarabia,الحين صدق الدكاتره يزعلون من كلمه تيتشر؟ او اس...,1.00,https://www.reddit.com/r/saudiarabia/comments/...
100,Lumpy_Ad_6868,,False,4,1.693307e+09,,False,164ffn2,False,False,...,/r/saudiarabia/comments/164ffn2/راجعه_من_الجامعه/,False,0,,False,False,saudiarabia,راجعه من الجامعه,0.50,https://i.redd.it/lzza5sor71lb1.jpg
101,Lumpy_Ad_6868,,False,4,1.693010e+09,,False,161g6dj,False,True,...,/r/saudiarabia/comments/161g6dj/عندي_سؤال_مالق...,False,5,ليه ببعض صور المجرمين يبين كل وجهه ويغطون عيون...,False,False,saudiarabia,عندي سؤال مالقيت له اجابه,0.86,https://www.reddit.com/r/saudiarabia/comments/...


In [87]:
user_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,big_booty_booth,I love this actually,"<div class=""md""><p>I love this actually</p>\n<...",1.694014e+09,,False,jze36pt,False,t3_16b9cr4,t3_16b9cr4,/r/knitting/comments/16b9cr4/are_you_actually_...,0,False,2,False,16b9cr4,knitting,t5_2qiu0
1,big_booty_booth,"Stop in Denver, as someone who used to live in...","<div class=""md""><p>Stop in Denver, as someone ...",1.692961e+09,,False,jxodypk,False,t3_160k3tj,t3_160k3tj,/r/roadtrip/comments/160k3tj/how_bad_is_drivin...,0,False,1,False,160k3tj,roadtrip,t5_2r413
2,big_booty_booth,Dead tourists are bad for business,"<div class=""md""><p>Dead tourists are bad for b...",1.692903e+09,,False,jxl851j,False,t3_16092x7,t3_16092x7,/r/TravelHacks/comments/16092x7/should_i_cance...,0,False,3,False,16092x7,TravelHacks,t5_30zyl
3,big_booty_booth,Route 66 through NM is definitely all sorts of...,"<div class=""md""><p>Route 66 through NM is defi...",1.692381e+09,,False,jwr1mdy,False,t3_15up2tq,t1_jwqyr0i,/r/roadtrip/comments/15up2tq/need_help_with_pl...,0,False,5,False,15up2tq,roadtrip,t5_2r413
4,big_booty_booth,That “was” your girlfriend. Not anymore,"<div class=""md""><p>That “was” your girlfriend....",1.688849e+09,,False,jr733e4,False,t3_14u3vx7,t3_14u3vx7,/r/relationship_advice/comments/14u3vx7/i_20m_...,0,False,1,False,14u3vx7,relationship_advice,t5_2r0cn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,Lumpy_Ad_6868,طيب لو اخذته بكالوريوس واخذت الداتا ساينس ماجس...,"<div class=""md""><p>طيب لو اخذته بكالوريوس واخذ...",1.694016e+09,,False,jze8gp9,True,t3_16bkq13,t1_jze177g,/r/saudiarabia/comments/16bkq13/وش_رأيكم_بمستق...,0,False,1,False,16bkq13,saudiarabia,t5_2roj4
116,Lumpy_Ad_6868,والله انا مو ناقصه تسحسحس\nابي اجابه واضحه من ...,"<div class=""md""><p>والله انا مو ناقصه تسحسحس\n...",1.694016e+09,,False,jze8biw,True,t3_16bkq13,t1_jzdtnf5,/r/saudiarabia/comments/16bkq13/وش_رأيكم_بمستق...,0,False,1,False,16bkq13,saudiarabia,t5_2roj4
117,Lumpy_Ad_6868,وش يعني نثريات,"<div class=""md""><p>وش يعني نثريات</p>\n</div>",1.693987e+09,,False,jzco160,False,t3_16bbvig,t1_jzch1tc,/r/saudiarabia/comments/16bbvig/تكلفة_الحياة_ب...,0,False,2,False,16bbvig,saudiarabia,t5_2roj4
118,Lumpy_Ad_6868,I think these studies includes all the jobs on...,"<div class=""md""><p>I think these studies inclu...",1.693981e+09,,False,jzcftwz,False,t3_16ayf4l,t3_16ayf4l,/r/TooAfraidToAsk/comments/16ayf4l/if_gender_p...,0,False,1,False,16ayf4l,TooAfraidToAsk,t5_2ssp7


In [88]:
submission_comments_df

Unnamed: 0,author,body,body_html,created_utc,distinguished,edited,id,is_submitter,link_id,parent_id,permalink,replies,saved,score,stickied,submission,subreddit,subreddit_id
0,Sollimann,Link to repo: https://github.com/Sollimann/cha...,"<div class=""md""><p>Link to repo: <a href=""http...",1694083000.0,,False,jzictns,True,t3_16cchvy,t3_16cchvy,/r/datascience/comments/16cchvy/chatty_llama_a...,0,False,1,False,16cchvy,datascience,t5_2sptq
1,Rafiki-Rhymes,One thing to consider is the distribution of y...,"<div class=""md""><p>One thing to consider is th...",1694086000.0,,False,jziifd2,False,t3_16cbtjd,t3_16cbtjd,/r/datascience/comments/16cbtjd/not_sure_if_th...,0,False,1,False,16cbtjd,datascience,t5_2sptq
2,MarcelDeSutter,Undergrad in business and psychology. Always e...,"<div class=""md""><p>Undergrad in business and p...",1694084000.0,,False,jzien3p,False,t3_16cclu4,t3_16cclu4,/r/datascience/comments/16cclu4/as_data_scient...,0,False,1,False,16cclu4,datascience,t5_2sptq
3,Slothvibes,This is a perfect use case for chatgpt,"<div class=""md""><p>This is a perfect use case ...",1694083000.0,,False,jzic16p,False,t3_16cc4gy,t3_16cc4gy,/r/datascience/comments/16cc4gy/couldnt_write_...,0,False,1,False,16cc4gy,datascience,t5_2sptq
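
In the comment tables, `link_id` and `parent_id` are Reddit "fullnames": a type prefix (`t1_` for a comment, `t3_` for a submission, `t5_` for a subreddit) followed by the base-36 ID. A comment whose `parent_id` starts with `t3_` replies directly to the submission, while `t1_` marks a reply to another comment. Below is a small sketch of splitting these values; the helper names `split_fullname` and `is_top_level` are hypothetical.

```python
# Reddit type prefixes that appear in the tables above
KIND_BY_PREFIX = {"t1": "comment", "t3": "submission", "t5": "subreddit"}


def split_fullname(fullname):
    """Split a fullname such as 't3_16cchvy' into (kind, base-36 ID)."""
    prefix, _, item_id = fullname.partition("_")
    return KIND_BY_PREFIX.get(prefix, "unknown"), item_id


def is_top_level(parent_id):
    """Return True when the comment replies directly to the submission."""
    return parent_id.startswith("t3_")


print(split_fullname("t3_16cchvy"))
print(is_top_level("t1_jwqyr0i"))
```

Applied to a column, something like `submission_comments_df["parent_id"].map(is_top_level)` would flag the top-level comments.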


In [None]:
# If you mounted Google Drive and want to save the files there

submission_df.to_csv("/content/drive/MyDrive/rdatascience_submission_df.csv")
user_data_df.to_csv("/content/drive/MyDrive/rdatascience_user_data_df.csv")
user_submissions_df.to_csv("/content/drive/MyDrive/rdatascience_user_submissions_df.csv")
user_comments_df.to_csv("/content/drive/MyDrive/rdatascience_user_comments_df.csv")
submission_comments_df.to_csv("/content/drive/MyDrive/rdatascience_submission_comments_df.csv")

In [None]:
# If you do not want to use Google Drive

submission_df.to_csv("/content/rdatascience_submission_df.csv")
user_data_df.to_csv("/content/rdatascience_user_data_df.csv")
user_submissions_df.to_csv("/content/rdatascience_user_submissions_df.csv")
user_comments_df.to_csv("/content/rdatascience_user_comments_df.csv")
submission_comments_df.to_csv("/content/rdatascience_submission_comments_df.csv")

# Download these files manually before the Colab runtime resets and they are deleted
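
By default, `to_csv` also writes the row index, and reading such a file back with `pd.read_csv` produces an extra `Unnamed: 0` column. The following minimal sketch shows how `index=False` avoids that. It uses a toy frame and a temporary directory rather than the Colab paths above.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for any of the DataFrames saved above
toy_df = pd.DataFrame({"id": ["16cdnyy", "16cdcsm"], "score": [1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    toy_df.to_csv(with_index)                  # default: the index is written
    toy_df.to_csv(without_index, index=False)  # the index is left out

    columns_with_index = list(pd.read_csv(with_index).columns)
    columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```

Whether to keep the index is a judgment call; for these tables it is only a running row number, so dropping it loses nothing.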