<b> This file shows you how to scrape Reddit posts and comments step by step. </b>

# Preparation

## 1. Install the praw package if you haven't already

In [1]:
#!pip install praw

## 2. Creating a Reddit app
This is a very easy step; a detailed walkthrough can be found at: https://www.geeksforgeeks.org/scraping-reddit-using-python/.

The purpose of registering a Reddit app is to obtain the client_id, client_secret, and user_agent values, which are needed to connect to Reddit from Python.
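One way to keep these three values out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the helper name `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are assumptions, not something praw or Reddit requires.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect the three values a praw instance needs from environment variables."""
    # Hypothetical variable names; any names work as long as they match your setup.
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a readable message if a value is missing or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary uses the same keyword names as `praw.Reddit`, so it could be passed as `praw.Reddit(**load_reddit_credentials())`.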

## 3. Creating a PRAW instance
To connect to Reddit, we need to create a praw instance. There are two types of praw instances:

- Read-only instance: with a read-only instance, we can only scrape publicly available information on Reddit, for example the top 5 posts from a particular subreddit.
- Authorized instance: with an authorized instance, you can do everything you can do with your Reddit account, such as upvoting, posting, and commenting.

In this group project, we will only use the read-only instance. Once the instance is created, we can use Reddit's API to extract data.

In [2]:
# Read-only instance
import praw
import pandas as pd

reddit_read_only = praw.Reddit(client_id="your client id",         # your client id
                               client_secret="your client secret",      # your client secret
                               user_agent="your user agent")        # your user agent

# Scraping Reddit Posts:
To extract data from a Reddit post, we need the post's URL. Once we have the URL, we create a submission object from it.
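A post URL carries the post's ID right after the `/comments/` segment, which is handy for spotting duplicate links in a hand-collected list. The sketch below is a minimal illustration; the helper name `extract_post_id` and the example URL are hypothetical, and the pattern assumes the usual `.../comments/<id>/...` URL shape.

```python
import re

# Assumes the common post URL shape: .../r/<subreddit>/comments/<id>/<slug>/
_POST_ID = re.compile(r"/comments/([a-z0-9]+)", re.IGNORECASE)


def extract_post_id(url):
    """Return the post ID embedded in a Reddit post URL, or None if there is none."""
    match = _POST_ID.search(url)
    return match.group(1) if match else None


# Hypothetical URL, for illustration only
print(extract_post_id("https://www.reddit.com/r/example/comments/abc123/some_title/"))
```

praw can also build a submission from the ID alone, via `reddit_read_only.submission(id=...)`, instead of the full URL.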

## Scraping from a list of post URLs manually selected from the Economics and other subreddits

In [3]:
from praw.models import MoreComments

posturls = pd.read_csv('Reddit_Posts_and_Links - General.csv')

posturls

Unnamed: 0,Subreddits,PostMonth,PostTitle,URL
0,r/antiwork,2022-1,A question about US job market,https://www.reddit.com/r/antiwork/comments/rp1...
1,r/cscareerquestions,2022-1,"Hot take: this ""hot"" software job market means...",https://www.reddit.com/r/cscareerquestions/com...
2,r/cscareerquestions,2022-1,Data science job market is shrinking,https://www.reddit.com/r/cscareerquestions/com...
3,r/Economics,2022-1,The US added more jobs in 2021 than any year i...,https://www.reddit.com/r/Economics/comments/ry...
4,r/Economics,2022-1,"AP: US employers add 199,000 jobs as unemploym...",https://www.reddit.com/r/Economics/comments/ry...
...,...,...,...,...
107,r/Economics,2022-9,Factory Jobs Are Booming Like It’s the 1970s,https://www.reddit.com/r/Economics/comments/xo...
108,r/Economics,2022-9,"50% of employers expect job cuts, survey finds...",https://www.reddit.com/r/Economics/comments/xj...
109,r/Economics,2022-9,The US Has Reversed Pandemic Job Losses. Most ...,https://www.reddit.com/r/Economics/comments/xk...
110,r/jobs,2022-9,This job market is a joke,https://www.reddit.com/r/jobs/comments/xskdgx/...


In [4]:
# Columns from the list of posts: subreddit, month, title, and URL
Subreddits = posturls['Subreddits']
PostMonths = pd.to_datetime(posturls['PostMonth'])
PostTitles = posturls['PostTitle']
Posturls = posturls['URL']

CommentSubreddits = []
CommentonPostTitles = []
Post_Months = []


post_comments = []

# Get the top-level comments from every URL in the file
for i in range(len(Posturls)):
    submission = reddit_read_only.submission(url=Posturls[i])

    for comment in submission.comments:
        # Skip "load more comments" placeholders, which have no body
        if isinstance(comment, MoreComments):
            continue

        post_comments.append(comment.body)
        CommentSubreddits.append(Subreddits[i])
        CommentonPostTitles.append(PostTitles[i])
        Post_Months.append(PostMonths[i])
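
The loop above reads only the top-level comments of each post and skips the `MoreComments` placeholders. If replies are wanted as well, praw's comment forest can be flattened. The sketch below is an alternative under that assumption, not what this notebook does; the helper name `collect_all_comment_bodies` is hypothetical.

```python
def collect_all_comment_bodies(submission):
    """Return the body of every loaded comment on a submission, replies included."""
    # limit=0 drops the MoreComments placeholders without fetching what they stand for
    submission.comments.replace_more(limit=0)
    # list() flattens the comment forest, so nested replies are included
    return [comment.body for comment in submission.comments.list()]
```

It would be called as `collect_all_comment_bodies(reddit_read_only.submission(url=Posturls[i]))`. Expect more rows per post than with the top-level-only loop.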
        

In [5]:
# creating a dataframe
post_comments_df = pd.DataFrame({'Subreddit': CommentSubreddits, 'PostTitle': CommentonPostTitles, 'PostMonth': Post_Months, 'Comment': post_comments})
post_comments_df = post_comments_df[post_comments_df['Comment'] != '[removed]']
post_comments_df

Unnamed: 0,Subreddit,PostTitle,PostMonth,Comment
0,r/antiwork,A question about US job market,2022-01-01,I’d say there’s a level of willful ignorance w...
1,r/antiwork,A question about US job market,2022-01-01,"> my main question is for everyone, how do you..."
2,r/antiwork,A question about US job market,2022-01-01,I don’t listen to people that reiterate faux n...
3,r/antiwork,A question about US job market,2022-01-01,I’m from / in LA. All the jobs that used to go...
4,r/antiwork,A question about US job market,2022-01-01,"About 200,000 working aged adults have died of..."
...,...,...,...,...
2153,r/jobs,This job market is a joke,2022-09-01,Jobs are like “we’re hiring but only minimum w...
2154,r/Wallstreetsilver,Biden is telling us the US job market is the s...,2022-09-01,Everything out of their mouth is lies
2155,r/Wallstreetsilver,Biden is telling us the US job market is the s...,2022-09-01,https://twitter.com/goldsilver_pros/status/157...
2156,r/Wallstreetsilver,Biden is telling us the US job market is the s...,2022-09-01,Part of this can be the baby boomers retiring ...
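
The filter in this cell drops comments whose text is exactly `[removed]`. Reddit also shows `[deleted]` for comments deleted by their author, so a slightly broader check may be useful. The sketch below is an assumption about what counts as a placeholder, and the names `PLACEHOLDERS` and `is_placeholder` are hypothetical.

```python
# Placeholder texts Reddit shows in place of a comment body
PLACEHOLDERS = {"[removed]", "[deleted]"}


def is_placeholder(body):
    """Return True if a comment body is only a removal or deletion placeholder."""
    return body.strip() in PLACEHOLDERS
```

With pandas, this could replace the single comparison: `post_comments_df[~post_comments_df['Comment'].map(is_placeholder)]`.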


In [6]:
# Renumber the rows from 0 after the '[removed]' filter; the index stays named 'index'
post_comments_df = post_comments_df.reset_index(drop=True).rename_axis('index')
post_comments_df

index,Subreddit,PostTitle,PostMonth,Comment
0,r/antiwork,A question about US job market,2022-01-01,I’d say there’s a level of willful ignorance w...
1,r/antiwork,A question about US job market,2022-01-01,"> my main question is for everyone, how do you..."
2,r/antiwork,A question about US job market,2022-01-01,I don’t listen to people that reiterate faux n...
3,r/antiwork,A question about US job market,2022-01-01,I’m from / in LA. All the jobs that used to go...
4,r/antiwork,A question about US job market,2022-01-01,"About 200,000 working aged adults have died of..."
...,...,...,...,...
1971,r/jobs,This job market is a joke,2022-09-01,Jobs are like “we’re hiring but only minimum w...
1972,r/Wallstreetsilver,Biden is telling us the US job market is the s...,2022-09-01,Everything out of their mouth is lies
1973,r/Wallstreetsilver,Biden is telling us the US job market is the s...,2022-09-01,https://twitter.com/goldsilver_pros/status/157...
1974,r/Wallstreetsilver,Biden is telling us the US job market is the s...,2022-09-01,Part of this can be the baby boomers retiring ...


In [7]:
post_comments_df.to_csv("post_comments_df.csv", index=False)