# Reddit Web Scraper 
### <font color='red'>Read the text below before using the code</font>
1. Use the PRAW API to extract these attributes of subreddit posts: title, upvotes, id, url, subreddit, num_comments, selftext, and subscribers.
2. Convert UTC dates to human-readable dates.
3. Create a folder called Datasets in your project directory. The data file is saved to ./Datasets/name.csv.
4. Extract top-level comments from posts and save them as txt files in Datasets. (Second-level comments, which are replies to top-level comments, are not extracted. I will extract them if needed.)

## Import Packages
This notebook requires praw and pandas. You can install them by running **pip install praw pandas** in the terminal.

In [6]:
import praw
import pandas as pd
from datetime import datetime
from praw.models import MoreComments

## Enter API Key
Go to https://www.reddit.com/prefs/apps and create a new app. The client_id is the string shown right below "personal use script". The client secret is the value labeled "secret". The user agent is the name of the app.

In [2]:
# Enter your own client id, secret and user agent (never share real credentials in a notebook)
reddit = praw.Reddit(client_id = 'YOUR_CLIENT_ID', 
                     client_secret = 'YOUR_CLIENT_SECRET', 
                     user_agent = 'Test')
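
To keep the client secret out of a shared notebook, the credentials can instead be read from environment variables. The sketch below is one way to do that. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are assumptions made for illustration, not something PRAW requires.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
ENV_NAMES = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(env=None):
    """Return the three praw.Reddit keyword arguments read from `env`."""
    env = os.environ if env is None else env
    # Collect every variable that is unset or empty so the error names them all
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

With the variables set, the connection could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.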

## Enter the group names (the id after r/)

In [3]:
# Reddit groups that Drew wants us to look at
Reddit_groups = ['depression', 'anxiety', 'OCD', 'socialanxiety', 'panicdisorder']

## Extract subreddit post attributes
Attributes are **title, score, id, subreddit, url, number of comments, body, created date, number of subscribers.**

In [4]:
# Loop over the groups and collect the 10 hottest posts from each, with their key info, into one dataframe
posts = []
for group in Reddit_groups:
    print(group)
    mental_subreddit = reddit.subreddit(group)
    for post in mental_subreddit.hot(limit=10): # change limit to fetch more or fewer posts
        # created_utc is the creation time as a Unix timestamp in UTC
        posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created_utc, mental_subreddit.subscribers])

posts = pd.DataFrame(posts, columns = ['title', 'score', 'id','subreddit','url','num_comments','body','created','num_subscribers'])
posts.head()

depression
anxiety
OCD
socialanxiety
panicdisorder


| | title | score | id | subreddit | url | num_comments | body | created | num_subscribers |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Our most-broken and least-understood rules is ... | 1593 | doqwow | depression | https://www.reddit.com/r/depression/comments/d... | 109 | We understand that most people who reply immed... | 1572390000.0 | 603300 |
| 1 | Regular Check-In Post | 151 | exo6f1 | depression | https://www.reddit.com/r/depression/comments/e... | 862 | Welcome to /r/depression's check-in post - a p... | 1580678000.0 | 603300 |
| 2 | Does anyone else just wanna start new | 595 | f4re4h | depression | https://www.reddit.com/r/depression/comments/f... | 109 | I just want to move to a town where no one kno... | 1581892000.0 | 603300 |
| 3 | I wanna get sick for a few weeks to catch a break | 230 | f4u8i2 | depression | https://www.reddit.com/r/depression/comments/f... | 30 | This is prob really messed up but i kind of ju... | 1581905000.0 | 603300 |
| 4 | I don't want you to ask me if I am feeling bet... | 39 | f4uix8 | depression | https://www.reddit.com/r/depression/comments/f... | 2 | The moment you ask me that, I automatically fe... | 1581906000.0 | 603300 |


## Convert UTC date to human-readable dates

In [7]:
# Convert the Unix timestamps (seconds since the epoch) to UTC datetimes
posts['created_date'] = pd.to_datetime(posts['created'], unit='s')
posts = posts.drop(['created'],axis=1)

| | title | score | id | subreddit | url | num_comments | body | num_subscribers | created_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Our most-broken and least-understood rules is ... | 1593 | doqwow | depression | https://www.reddit.com/r/depression/comments/d... | 109 | We understand that most people who reply immed... | 603300 | 2019-10-29 22:52:02 |
| 1 | Regular Check-In Post | 151 | exo6f1 | depression | https://www.reddit.com/r/depression/comments/e... | 862 | Welcome to /r/depression's check-in post - a p... | 603300 | 2020-02-02 21:08:26 |
| 2 | Does anyone else just wanna start new | 595 | f4re4h | depression | https://www.reddit.com/r/depression/comments/f... | 109 | I just want to move to a town where no one kno... | 603300 | 2020-02-16 22:33:48 |
| 3 | I wanna get sick for a few weeks to catch a break | 230 | f4u8i2 | depression | https://www.reddit.com/r/depression/comments/f... | 30 | This is prob really messed up but i kind of ju... | 603300 | 2020-02-17 01:55:58 |
| 4 | I don't want you to ask me if I am feeling bet... | 39 | f4uix8 | depression | https://www.reddit.com/r/depression/comments/f... | 2 | The moment you ask me that, I automatically fe... | 603300 | 2020-02-17 02:13:54 |


In [9]:
# Keep only the date; drop the hours, minutes and seconds
posts['created_date'] = pd.to_datetime(posts['created_date']).dt.date
posts.head()

| | title | score | id | subreddit | url | num_comments | body | num_subscribers | created_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Our most-broken and least-understood rules is ... | 1593 | doqwow | depression | https://www.reddit.com/r/depression/comments/d... | 109 | We understand that most people who reply immed... | 603300 | 2019-10-29 |
| 1 | Regular Check-In Post | 151 | exo6f1 | depression | https://www.reddit.com/r/depression/comments/e... | 862 | Welcome to /r/depression's check-in post - a p... | 603300 | 2020-02-02 |
| 2 | Does anyone else just wanna start new | 595 | f4re4h | depression | https://www.reddit.com/r/depression/comments/f... | 109 | I just want to move to a town where no one kno... | 603300 | 2020-02-16 |
| 3 | I wanna get sick for a few weeks to catch a break | 230 | f4u8i2 | depression | https://www.reddit.com/r/depression/comments/f... | 30 | This is prob really messed up but i kind of ju... | 603300 | 2020-02-17 |
| 4 | I don't want you to ask me if I am feeling bet... | 39 | f4uix8 | depression | https://www.reddit.com/r/depression/comments/f... | 2 | The moment you ask me that, I automatically fe... | 603300 | 2020-02-17 |
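
The two conversion steps (timestamp to datetime, then datetime to date) can also be sketched with the standard library alone, using a timezone-aware datetime so the UTC interpretation is explicit. The helper name `utc_timestamp_to_date` is hypothetical, and the example timestamp is an illustrative value chosen to correspond to the first row shown above, not a value read from the API.

```python
from datetime import datetime, timezone


def utc_timestamp_to_date(timestamp):
    """Convert a Unix timestamp (seconds) to a calendar date in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.date()


# Illustrative timestamp (assumption): matches the first row of the table
example = 1572389522.0
print(datetime.fromtimestamp(example, tz=timezone.utc))  # 2019-10-29 22:52:02+00:00
print(utc_timestamp_to_date(example))                    # 2019-10-29
```

Such a function could be applied to a column of timestamps with `Series.apply`, as an alternative to the pandas-based conversion.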


In [10]:
# save post dataframe to csv
posts.to_csv('./Datasets/group_posts.csv')
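
Saving fails if the Datasets folder does not exist yet. As a small safeguard, the folder can be created from code before writing. This is a minimal sketch; the helper name `dataset_path` and its `base_dir` parameter are assumptions for illustration.

```python
from pathlib import Path


def dataset_path(filename, base_dir="."):
    """Create <base_dir>/Datasets if needed and return the path for `filename`."""
    folder = Path(base_dir) / "Datasets"
    # exist_ok=True makes the call safe to repeat when the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename
```

The dataframe could then be saved with `posts.to_csv(dataset_path('group_posts.csv'))`.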

## Extract top-level comments from each post
Some comment threads contain **load more comments** placeholders (`MoreComments` objects). Calling `replace_more(limit=0)` removes these placeholders, so the comments hidden behind them are skipped.

In [None]:
# Extract top-level comments from each group and save them as a text file
for group in Reddit_groups:
    print(group)
    post_ids = posts[posts['subreddit'] == group]['id'].tolist()
    # utf-8 so that emoji and other non-ASCII characters can be written on any platform
    with open(f"./Datasets/{group}.txt", "w", encoding="utf-8") as f:
        for post_id in post_ids:
            submission = reddit.submission(id=post_id)
            # Remove all MoreComments placeholders so only real comments remain
            submission.comments.replace_more(limit=0)
            for top_level_comment in submission.comments:
                # Write a newline after each comment so comments do not run together
                f.write(top_level_comment.body + "\n")
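
Comment bodies can themselves contain line breaks, so when a text file is read back later it can be hard to tell where one comment ends and the next begins. One possible approach, sketched below, is to collapse the whitespace inside each comment and write exactly one comment per line. The helper `write_comments` is hypothetical and takes plain strings, so it can be tried without a Reddit connection.

```python
def write_comments(path, bodies):
    """Write each comment on its own line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for body in bodies:
            # split() with no arguments breaks on any whitespace, including newlines
            single_line = " ".join(body.split())
            if single_line:  # skip comments that are empty after cleaning
                f.write(single_line + "\n")
                count += 1
    return count
```

Inside the extraction loop, the bodies could be passed as `(c.body for c in submission.comments)`, though that would need the file handling restructured to one call per group.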