# Scraper
Scrape some data from Reddit using the .json API as well as PRAW.

## Using json

In [1]:
import requests
import pandas as pd
import numpy as np
import time
import random

Normally I would want to use PRAW or something more user-friendly, but for the purposes of this assignment I left it as the raw .json endpoint (it's not very hard to parse anyway).

In any case, the scraper records 3 things:
- Name (title) of the post, for example "The Toronto Raptors Win The..."
- The unique identifier of the post. For example, for https://www.reddit.com/r/movies/comments/62sjuh/the_senate_upvote_this_so_that_people_see_it_when/ it would be 62sjuh (the `name` field stores it with a type prefix, as `t3_62sjuh`). I am keeping this because I want to avoid repeats.
- The timestamp of the post, because posts occur in chronological order and I don't want data leakage.
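
To illustrate the identifier, here is a small sketch that pulls the ID out of a permalink with a regular expression. `extract_post_id` is a hypothetical helper written for this example and is not part of the scraper, which simply stores the `name` field.

```python
import re

# Hypothetical helper for illustration; the scraper itself stores the `name` field.
POST_ID_PATTERN = re.compile(r"/comments/([a-z0-9]+)")

def extract_post_id(permalink):
    """Return the post ID from a Reddit permalink, or None if there isn't one."""
    match = POST_ID_PATTERN.search(permalink)
    return match.group(1) if match else None

url = "https://www.reddit.com/r/movies/comments/62sjuh/the_senate_upvote_this_so_that_people_see_it_when/"
post_id = extract_post_id(url)

# Reddit prefixes post IDs with "t3_" to form the value found in `name`
fullname = "t3_" + post_id
print(post_id, fullname)
```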

As the aim of the project is to identify the subreddit based only on the title, I didn't scrape anything else. The non-text fields (ID and timestamp) will be removed before running the model; they are only there to make sure the data was collected properly.
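
As a rough sketch of how the ID and timestamp could be used before they are dropped: the rows below are made up, the 80/20 cutoff is an assumption for illustration, and the tuples follow the (title, ID, timestamp) layout the scraper stores.

```python
# Made-up rows in the (title, ID, timestamp) layout the scraper stores
rows = [
    ("a", "t3_1", 50),
    ("b", "t3_2", 10),
    ("b", "t3_2", 10),  # repeat of the previous post
    ("c", "t3_3", 40),
    ("d", "t3_4", 20),
    ("e", "t3_5", 30),
]

# Drop repeats by ID, keeping the first occurrence
seen = set()
unique_rows = []
for row in rows:
    if row[1] not in seen:
        seen.add(row[1])
        unique_rows.append(row)

# Sort oldest to newest so the split respects chronological order
unique_rows.sort(key=lambda row: row[2])

# Train on older posts, test on newer ones (80/20 is an arbitrary choice here)
cutoff = int(len(unique_rows) * 0.8)
train_rows, test_rows = unique_rows[:cutoff], unique_rows[cutoff:]

# Only the titles are passed on to the model
train_titles = [row[0] for row in train_rows]
test_titles = [row[0] for row in test_rows]
```

Splitting by time rather than at random keeps every test post newer than every training post, which is the leakage concern mentioned above.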

(It also skips any posts that are NSFW because this is an assignment. If I actually wanted to develop a model I would leave them in, since more data is better, although there aren't that many NSFW posts anyway.)

I didn't record the body text (selftext) because a lot of the posts don't have any, and I wanted to keep things consistent.

In [2]:
## Define a function that scrapes reddit using .json API

def reddit_scraper(subreddit, save_loc, num_posts = 50):
    
    address = "https://reddit.com/r/" + subreddit + ".json"
    
    # the guidelines for reddit api say to use this format for user agent
    user_agent = "Python:RedditScraping:1.0 (Test Scraper)"
    
    # see how many pages based on # of posts
    num_pages = num_posts//25
    print('Ok, looping through {} pages'.format(num_pages))
    
    # set some things for the loop below
    after = None
    all_scraped_data = []
    
    # loop through for range 
    for i in range(num_pages):
        
        # get the correct address
        if after is None:
            full_address = address
        else:
            full_address = address + '?after=' + after
            
        # if i%10 == 0:
        print("Page {} / {}".format(i, num_pages))
            
        
        # make time for sleeping
        # I used poisson because it seems a bit more realistic
        sleep_duration = np.random.poisson(30)
        
        # make sure it isn't TOO long or short
        while sleep_duration > 60 or sleep_duration < 5:
            sleep_duration = np.random.poisson(30)
            
        time.sleep(sleep_duration)

        # send the req
        req = requests.get(full_address, headers={"User-agent" : user_agent})
        
        
        # see if the request is ok
        if req.status_code != 200:
            print("WARNING: STATUS IS NOT 200 ON {} ITERATION".format(i))
            print("Current 'after' is {}".format(after))
            continue
            
        # if request is ok, extract the json data:
        current_json = req.json()
        

        # extract the things we want from the data
        for s in current_json["data"]["children"]:
            data = s.get("data")
            
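            # check if it is NSFW, don't put it in if it is
            # (over_18 is the post's NSFW flag; the thumbnail is not always "nsfw")
            if data.get("over_18"):
                continue
            
            # created_utc matches the timestamp field used by the PRAW scraper below
            data_tuple = (data.get("title"), data.get("name"), data.get("created_utc"))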
            
            # put it into the list
            all_scraped_data.append(data_tuple)
        
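        # get 'after' for the next loop
        after = current_json["data"]["after"]
        
        # Save data up to current loop as a .csv so the program is ok even if it crashes
        pd.DataFrame(all_scraped_data).to_csv(save_loc)
        
        # no 'after' means there are no more pages; stop so page 1 isn't scraped again
        if after is None:
            break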
        
        
    
    print("yay it's done")
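
The page count above uses floor division, so asking for fewer than 25 posts gives zero pages. Below is a small sketch of a ceiling-based alternative, assuming the 25 posts per page that the scraper relies on. `pages_needed` is a hypothetical helper, not part of the function above.

```python
import math

POSTS_PER_PAGE = 25  # page size the scraper assumes

def pages_needed(num_posts, per_page=POSTS_PER_PAGE):
    """Number of pages to request so that at least num_posts are covered."""
    return math.ceil(num_posts / per_page)

# Floor division and ceiling agree when num_posts is a multiple of 25 ...
print(1000 // POSTS_PER_PAGE, pages_needed(1000))

# ... but differ otherwise: floor division drops the partial page
print(10 // POSTS_PER_PAGE, pages_needed(10))
```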

### The scraper calls below are commented out because it is very slow. Uncomment them to run it.

In [4]:
# Scrape Subreddit 1
subreddit = "shittysuperpowers"
save_loc = "../data/scraped_subreddit_1/hot_from_json.csv"
# reddit_scraper(subreddit, save_loc, num_posts = 1000)

print("-" * 20)

# Scrape Subreddit 2
subreddit = "godtiersuperpowers"
save_loc = "../data/scraped_subreddit_2/hot_from_json.csv"
# reddit_scraper(subreddit, save_loc, num_posts = 1000)

print("Done with both")

Ok, looping through 40 pages
Page 0 / 40
Page 1 / 40
Page 2 / 40
Page 3 / 40
Page 4 / 40
Page 5 / 40
Page 6 / 40
Page 7 / 40
Page 8 / 40
Page 9 / 40
Page 10 / 40
Page 11 / 40
Page 12 / 40
Page 13 / 40
Page 14 / 40
Page 15 / 40
Page 16 / 40
Page 17 / 40
Page 18 / 40
Page 19 / 40
Page 20 / 40
Page 21 / 40
Page 22 / 40
Page 23 / 40
Page 24 / 40
Page 25 / 40
Page 26 / 40
Page 27 / 40
Page 28 / 40
Page 29 / 40
Page 30 / 40
Page 31 / 40
Page 32 / 40
Page 33 / 40
Page 34 / 40
Page 35 / 40
Page 36 / 40
Page 37 / 40
Page 38 / 40
Page 39 / 40
yay it's done
--------------------
Ok, looping through 40 pages
Page 0 / 40
Page 1 / 40
Page 2 / 40
Page 3 / 40
Page 4 / 40
Page 5 / 40
Page 6 / 40
Page 7 / 40
Page 8 / 40
Page 9 / 40
Page 10 / 40
Page 11 / 40
Page 12 / 40
Page 13 / 40
Page 14 / 40
Page 15 / 40
Page 16 / 40
Page 17 / 40
Page 18 / 40
Page 19 / 40
Page 20 / 40
Page 21 / 40
Page 22 / 40
Page 23 / 40
Page 24 / 40
Page 25 / 40
Page 26 / 40
Page 27 / 40
Page 28 / 40
Page 29 / 40
Page 30 / 40
Page

## Using PRAW

The above works but is very slow; PRAW (Python Reddit API Wrapper) is a lot faster. It also comes with other advantages:

- Easier to test ideas
- Seems to work better
- Easier to use once set up correctly
- It's more 'correct', in that this is the way Reddit intends people to access its data

However, it is still limited to 1000 posts. I decided to use the all-time top posts for this one since:
- Bigger spread of posts over time (from the previous scraper, I found that 1000 posts filtered by hot cover about 10 days)
- Avoids scraping posts that may not match subreddit rules (for example, a post that breaks the rules but hasn't been removed yet)
- I don't want to scrape the same data again, obviously.

One drawback is that data from this may not be as good for predicting 'new' posts.

I decided to use this (all-time top posts) as the data for the modelling process. The json data above will only be used as the final test data, to see how well the model works.
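
Because the 'hot' and 'top' listings can overlap, the test set could contain posts the model has already seen. One way to guard against this is to drop test rows whose ID appears in the training data. The sketch below uses made-up rows and assumed column names (`title`, `name`); the saved CSVs use pandas' default integer column labels instead.

```python
import pandas as pd

# Made-up stand-ins for the PRAW (training) and json (test) data
train = pd.DataFrame({"title": ["p", "q", "r"], "name": ["t3_a", "t3_b", "t3_c"]})
test = pd.DataFrame({"title": ["r", "s"], "name": ["t3_c", "t3_d"]})

# Keep only test posts whose ID does not appear in the training data
test_clean = test[~test["name"].isin(train["name"])].reset_index(drop=True)
print(test_clean)
```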

### IMPORTANT NOTE:
Before running this, make sure you have "reddit_secrets.env" in the main folder (i.e. "Projects/project3/reddit_secrets.env"). I should have sent this to you, but if not please let me know. It contains the Reddit username, client ID and client secret, and will be loaded in by the code below.
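
A minimal sketch of a check for the three variables the cells below read (`REDDIT_USERNAME`, `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`). The helper name `missing_keys` and the placeholder values are hypothetical and only for illustration.

```python
import os

REQUIRED_KEYS = ("REDDIT_USERNAME", "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")

def missing_keys(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Placeholder values; a real .env file would hold the actual credentials
example_env = {"REDDIT_USERNAME": "some_user", "REDDIT_CLIENT_ID": "placeholder_id"}
print(missing_keys(example_env))
```

Calling `missing_keys()` with no argument checks the real environment, so it can be run right after `load_dotenv`.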

In [24]:
# import
import praw
from dotenv import load_dotenv
import os

In [33]:
# set the secret client id from environment variable
load_dotenv("../reddit_secrets.env")

# get the username, client id and client secret
reddit_username = os.environ.get("REDDIT_USERNAME")
reddit_userid = os.environ.get("REDDIT_CLIENT_ID")
reddit_usersecret = os.environ.get("REDDIT_CLIENT_SECRET")

# stop here if anything is missing, otherwise building the user agent below fails
if (reddit_username is None) or (reddit_userid is None) or (reddit_usersecret is None):
    raise RuntimeError("You probably forgot to get the .env file")

# make user agent nicely formatted
user_agent = "jupyter:RedditScraper:v1.1 (by " + reddit_username + ")"

reddit = praw.Reddit(
    client_id = reddit_userid,
    client_secret = reddit_usersecret,
    user_agent = user_agent)

None


Next, actually scrape the data.

In [49]:
# function that will scrape using PRAW
def praw_scraper(subreddit, save_loc, num_posts = 50):
    all_data = []
    
    #for submission in reddit.subreddit(subreddit).hot(limit=num_posts):
    for submission in reddit.subreddit(subreddit).top("all", limit=num_posts):
        
        # ignore if NSFW
        if submission.over_18:
            #pass
            continue
        
        # get the data we want
        one_post = (submission.title, submission.name, submission.created_utc)
        
        all_data.append(one_post)
        
    
    # save it to the file
    pd.DataFrame(all_data).to_csv(save_loc)
    print("done")

Once again, uncomment the lines to scrape.

In [51]:
subreddit = "shittysuperpowers"
save_loc = "../data/scraped_subreddit_1/top_from_praw_1.csv"
# praw_scraper(subreddit, save_loc, num_posts = 1000)

print("-" * 20)

subreddit = "godtiersuperpowers"
save_loc = "../data/scraped_subreddit_2/top_from_praw_2.csv"
# praw_scraper(subreddit, save_loc, num_posts = 1000)

done
--------------------
done
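
Both scrapers save a DataFrame built from plain tuples, so the CSV columns carry the default labels `0`, `1`, `2` next to an index column. The sketch below shows one way to read such a file back with readable names and a proper datetime column. The column names and the two rows are made up for illustration, and an in-memory string stands in for the saved file.

```python
import io
import pandas as pd

# Stand-in for a saved file: an index column plus columns 0, 1, 2
csv_text = (
    ",0,1,2\n"
    "0,First title,t3_aaa,1560000000.0\n"
    "1,Second title,t3_bbb,1560086400.0\n"
)

posts = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Assumed names for the (title, ID, timestamp) fields
posts.columns = ["title", "name", "created_utc"]

# Convert epoch seconds to timezone-aware datetimes
posts["created_at"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
print(posts)
```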


## At this point please head back to '0_Main' notebook