# SCRAPING REDDIT

Referenced Link: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

---
---

# SETUP STEPS:

## Load API keys from the environment

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()

True

## Create PRAW instance

In [2]:
import praw

reddit = praw.Reddit(client_id=os.getenv("my_client_id"), 
                     client_secret=os.getenv("my_client_secret"), 
                     user_agent=os.getenv("my_user_agent"))
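`os.getenv` returns `None` when a variable is not set, so a typo in the `.env` file is easy to miss. A minimal sketch of checking the credentials up front, assuming the same three variable names used in the cell above; `read_reddit_credentials` is a hypothetical helper, not part of PRAW or python-dotenv:

```python
import os

# Variable names taken from the cell above; adjust them if your .env differs.
REQUIRED_VARS = ("my_client_id", "my_client_secret", "my_user_agent")


def read_reddit_credentials(env=os.environ):
    """Return the three credentials as keyword arguments, or raise if any is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["my_client_id"],
        "client_secret": env["my_client_secret"],
        "user_agent": env["my_user_agent"],
    }
```

After `load_dotenv()`, the result can be unpacked straight into the constructor: `reddit = praw.Reddit(**read_reddit_credentials())`.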


---
---

# SCRAPING CAPABILITIES:

## Scrape Posts

In [3]:
# get 5 hot posts from the MachineLearning subreddit

hot_posts = reddit.subreddit('MachineLearning').hot(limit=5)

for post in hot_posts:
    
    print(post.title)

[D] Simple Questions Thread
[D] Machine Learning - WAYR (What Are You Reading) - Week 140
[D] Fan-made NeurIPS 2022 Movie Trailer
[P] What we learned by benchmarking TorchDynamo (PyTorch team), ONNX Runtime and TensorRT on transformers model (inference)
[R] LocoProp: Enhancing BackProp via Local Loss Optimization (Google Brain, 2022)


## Reddit Posts into Pandas DataFrame

In [4]:
import pandas as pd

posts = []

ml_subreddit = reddit.subreddit('MachineLearning')

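for post in ml_subreddit.hot(limit=5):
    posts.append([post.title,
                  post.score,
                  post.id,
                  post.subreddit.display_name,  # store the name, not the Subreddit object
                  post.url,
                  post.num_comments,
                  post.selftext,                # empty string for link posts
                  post.created])                # Unix timestamp in seconds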
    
posts = pd.DataFrame(posts,
                     columns=['title', 
                              'score', 
                              'id', 
                              'subreddit', 
                              'url', 
                              'num_comments', 
                              'body', 
                              'created'])

posts.head()

| | title | score | id | subreddit | url | num_comments | body | created |
|---|---|---|---|---|---|---|---|---|
| 0 | [D] Simple Questions Thread | 6 | wcqp3a | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 26 | Please post your questions here instead of cre... | 1659280000.0 |
| 1 | [D] Machine Learning - WAYR (What Are You Read... | 95 | vg5kjd | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 17 | This is a place to share machine learning rese... | 1655675000.0 |
| 2 | [D] Fan-made NeurIPS 2022 Movie Trailer | 114 | wey49o | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 12 | [https://twitter.com/postrat\_dril/status/1554... | 1659505000.0 |
| 3 | [P] What we learned by benchmarking TorchDynam... | 45 | weyup0 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 3 | **TL;DR**: `TorchDynamo` (prototype from PyTor... | 1659507000.0 |
| 4 | [R] LocoProp: Enhancing BackProp via Local Los... | 105 | weoh6w | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 8 | Paper: [https://arxiv.org/abs/2106.06199](http... | 1659478000.0 |
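
The `created` column holds Unix timestamps in seconds. A small sketch of turning them into readable UTC datetimes with the standard library; the sample values are copied from the output above (and may be rounded for display), and `to_utc` is a name chosen for illustration:

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Values as displayed in the `created` column above.
created = [1659280000.0, 1655675000.0]
for ts in created:
    print(to_utc(ts).isoformat())
```

On the DataFrame itself, `pd.to_datetime(posts['created'], unit='s', utc=True)` converts the whole column at once. PRAW submissions also expose `created_utc`, which is the safer attribute when the timezone matters.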


## Checking the body of a post

In [5]:
posts.body[0]

'Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!\n\nThread will stay alive until next one so keep posting after the date in the title.\n\nThanks to everyone for answering questions in the previous thread!'

## Get the top-level comments of a post

In [6]:
from praw.models import MoreComments

submission = reddit.submission(id="jh8kou")

for top_level_comment in submission.comments:
    # skip "load more comments" placeholders, which have no body
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)

Poor kid just wanted to sleep in.
the little baby run to his momma is fucken adorable
“Omg come quick he’s ded! oh nvm he’s just a lazy shit thanks fellas”
I had no idea that I needed some passed out baby elephant in my life.
That mom trusts the keepers with her baby. That's some respect
I think every new parent can relate to looking at their baby and wanting to give them a little nudge to make sure they're still alive.
Wait, what happened to that sprinkling water on the face, thing? I thought that would be a natural instinct for elephants.

Maybe child rearing practices have changed since I was a kid.
I have narcolepsy, the way my dad used to get me up for school was let my dog into my room. She would lick my face until I woke up then lie down next to me and sloooowly stretch out her legs pushing me out of bed. It was very effective
Aww that cute little scramble for mama after waking up is sooo cute
I love that there’s a full-fledge relationship between the elephant and the keepers, e
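
The `isinstance` check skips the `MoreComments` placeholders one by one. PRAW's `CommentForest.replace_more(limit=0)` removes those placeholders in place instead, and `limit=None` asks PRAW to resolve all of them at the cost of extra API requests. A minimal sketch; `top_level_bodies` is a hypothetical helper, and the stand-in classes exist only so the snippet runs without network access or credentials:

```python
def top_level_bodies(submission, limit=0):
    """Return the bodies of a submission's top-level comments.

    limit=0 strips MoreComments placeholders without fetching anything extra;
    limit=None asks PRAW to resolve every placeholder (one request each).
    """
    submission.comments.replace_more(limit=limit)
    return [comment.body for comment in submission.comments]


# --- Stand-ins for illustration only; with PRAW, pass reddit.submission(id=...) ---
class FakeComment:
    def __init__(self, body):
        self.body = body


class FakePlaceholder:
    """Plays the role of praw.models.MoreComments (it has no .body)."""


class FakeForest:
    def __init__(self, items):
        self._items = items

    def replace_more(self, limit=32):
        # Mimics only the limit=0 behaviour: drop placeholders, fetch nothing.
        self._items = [c for c in self._items if not isinstance(c, FakePlaceholder)]

    def __iter__(self):
        return iter(self._items)


class FakeSubmission:
    def __init__(self, items):
        self.comments = FakeForest(items)


demo = FakeSubmission([FakeComment("first"), FakePlaceholder(), FakeComment("second")])
print(top_level_bodies(demo))
```

With a real submission the call would be `top_level_bodies(reddit.submission(id="jh8kou"))`.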

## Turn the top-level comments on a post into a DataFrame

In [7]:
import pandas as pd
from praw.models import MoreComments

comments = []

submission = reddit.submission(id="jh8kou")

for comment in submission.comments:
    # skip "load more comments" placeholders, which have no body
    if isinstance(comment, MoreComments):
        continue
    comments.append([comment.body])
    
comments = pd.DataFrame(comments,
                     columns=['body'])

comments.head()

| | body |
|---|---|
| 0 | Poor kid just wanted to sleep in. |
| 1 | the little baby run to his momma is fucken ado... |
| 2 | “Omg come quick he’s ded! oh nvm he’s just a l... |
| 3 | I had no idea that I needed some passed out ba... |
| 4 | That mom trusts the keepers with her baby. Tha... |


---
---
---


# CREATE TWO DATA FRAMES

- Hot posts dataframe
- Top-level comments from hot posts dataframe

### Create hot posts dataframe

In [19]:
import pandas as pd
from praw.models import MoreComments


# create empty list to gather raw post data
posts = []

# create instance of PRAW for subreddit
ml_subreddit = reddit.subreddit('MachineLearning')

# loop through PRAW instance and record in post list
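# (use hot(limit=None) instead to fetch as many posts as Reddit's listing allows)
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title,
                  post.score,
                  post.id,
                  post.subreddit.display_name,  # store the name, not the Subreddit object
                  post.url,
                  post.num_comments,
                  post.selftext,                # empty string for link posts
                  post.created])                # Unix timestamp in seconds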

# create dataframe for posts
posts = pd.DataFrame(posts,
                     columns=['title', 
                              'score', 
                              'id', 
                              'subreddit', 
                              'url', 
                              'num_comments', 
                              'body', 
                              'created'])

print(len(posts), "posts scraped.")
posts.head()

10 posts scraped.


| | title | score | id | subreddit | url | num_comments | body | created |
|---|---|---|---|---|---|---|---|---|
| 0 | [D] Simple Questions Thread | 5 | wcqp3a | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 26 | Please post your questions here instead of cre... | 1659280000.0 |
| 1 | [D] Machine Learning - WAYR (What Are You Read... | 94 | vg5kjd | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 17 | This is a place to share machine learning rese... | 1655675000.0 |
| 2 | [D] Fan-made NeurIPS 2022 Movie Trailer | 115 | wey49o | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 13 | [https://twitter.com/postrat\_dril/status/1554... | 1659505000.0 |
| 3 | [P] What we learned by benchmarking TorchDynam... | 47 | weyup0 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 3 | **TL;DR**: `TorchDynamo` (prototype from PyTor... | 1659507000.0 |
| 4 | [R] LocoProp: Enhancing BackProp via Local Los... | 106 | weoh6w | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 8 | Paper: [https://arxiv.org/abs/2106.06199](http... | 1659478000.0 |


### Create comments dataframe

In [23]:
# create dataframe for comments
comments_df = pd.DataFrame(columns=['post_id', 'body'])


# loop through the post ids and gather each post's top-level comments
for post_id in posts.id:
    
    comments = []

    submission = reddit.submission(id=post_id)

    for comment in submission.comments:

        if isinstance(comment, MoreComments):

            continue

        comments.append([post_id, comment.body])

    comments = pd.DataFrame(comments,
                         columns=['post_id', 'body'])
    
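    # append this post's comments; the index restarts at 0 for each post
    # because ignore_index=True is not passed
    comments_df = pd.concat([comments_df, comments], sort=False)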

    
print(comments_df.shape[0], "top comments scraped.\n")

print(comments_df.info())

comments_df.head()

49 top comments scraped.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 0
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   post_id  49 non-null     object
 1   body     49 non-null     object
dtypes: object(2)
memory usage: 1.1+ KB
None


| | post_id | body |
|---|---|---|
| 0 | wcqp3a | Hey all! Question. Are outstanding reviewer aw... |
| 1 | wcqp3a | We use fasstext (https://fasttext.cc/docs/en/s... |
| 2 | wcqp3a | What do you think how our models of the future... |
| 3 | wcqp3a | If I want to train a generic image classifier ... |
| 4 | wcqp3a | What is the most profitable machine or deep le... |
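
Because each per-post frame is numbered from 0, the concatenated index repeats, which is why `info()` above reports `49 entries, 0 to 0`. A sketch of one way to avoid that: collect every row in a single list and build the DataFrame once. The rows below are made-up placeholders, not scraped data, and `build_comments_frame` is a hypothetical helper:

```python
import pandas as pd


def build_comments_frame(rows):
    """Build a comments DataFrame from (post_id, body) pairs in one step."""
    return pd.DataFrame(rows, columns=["post_id", "body"])


# Placeholder rows standing in for (post_id, comment.body) pairs.
rows = [
    ("post_a", "first comment"),
    ("post_a", "second comment"),
    ("post_b", "only comment"),
]
comments_df = build_comments_frame(rows)
print(comments_df.index.tolist())
```

In the scraping loop, that means appending `[post_id, comment.body]` to one shared list and calling `pd.DataFrame` after the loop finishes. Passing `ignore_index=True` to `pd.concat` is an alternative that keeps the per-post frames.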
