# Reddit NLP Data Scraping

## Overview

This notebook contains the data scraping portion of the Reddit NLP project, which builds a model to distinguish posts from two subreddits. More detailed information can be found in the Data Cleaning and EDA notebook.

The following two subreddits were scraped to obtain up to 1500 usable posts:
- r/personalfinance
- r/wallstreetbets

To this end, the two subreddits were scraped the following number of times:
- r/personalfinance: 31 times
- r/wallstreetbets: 61 times

Due to the nature of r/wallstreetbets, many posts are memes or pictures, or have been removed under the subreddit's rules. As a result, a little under twice as many posts were scraped from this subreddit, in the knowledge that many of them would be too short to use for training the model.
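
As a rough illustration of why the extra scrapes were needed, the sketch below counts how many bodies in a small sample would be usable. The helper name `is_usable` and the 20-character threshold are assumptions made for this example only; the actual filtering rules live in the Data Cleaning and EDA notebook.

```python
import pandas as pd


def is_usable(selftext, min_chars=20):
    """Return True when the body is real text of at least min_chars characters.

    min_chars=20 is an arbitrary, assumed threshold for illustration.
    """
    if not isinstance(selftext, str):
        return False  # NaN or a missing body
    text = selftext.strip()
    # '[removed]' and '[deleted]' are Reddit placeholders, not post content
    return text not in ('[removed]', '[deleted]') and len(text) >= min_chars


# Small made-up sample in the shape of the scraped data
sample = pd.DataFrame({
    'title': ['$ATRA', 'Waiting on Next Meme Stock YOLO', 'How I use Data Science to Trade Options'],
    'selftext': ['[removed]', '', 'Recently I started a trading strategy around earnings'],
})

usable = sample['selftext'].apply(is_usable)
print(f'{usable.sum()} of {len(sample)} posts are usable')
```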

## Content
- [Imports](#Imports)
- [Scraping](#Scraping)
- [r/personalfinance Scrape](#r/personalfinance-Scrape)
- [r/wallstreetbets Scrape](#r/wallstreetbets-Scrape)
- [Saving the Data](#Saving-the-Data)

## Imports

In [1]:
import numpy as np
import pandas as pd
import time

import requests

## Scraping

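Due to a quirk of the scraper, roughly every tenth scrape returned only 99 posts instead of the requested 100 (scrapes 10 and 20 in the r/personalfinance output below). As a result, scrapes were either limited to 99 posts at a time, or the timestamp for the next request was taken from the 99th post of the previous batch.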
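
One way to make the pagination independent of batch size is to read the cursor from the last row of each batch rather than from a fixed position. The sketch below is not the code used in this notebook: `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that imitates the occasional 99-post batch without calling the API.

```python
import pandas as pd


def paginate(fetch_page, start_before, n_pages, size=100):
    """Collect up to n_pages batches, moving the cursor to the last row of each."""
    frames = []
    before = start_before
    for _ in range(n_pages):
        batch = pd.DataFrame(fetch_page(before, size))
        if batch.empty:
            break  # nothing older left to fetch
        frames.append(batch)
        # .iloc[-1] is the final row whether the batch holds 99 or 100 posts
        before = int(batch['retrieved_on'].iloc[-1])
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


def fake_fetch(before, size):
    """Hypothetical stand-in for the API: posts one second apart, newest first.

    Returns one post fewer than requested whenever the cursor is a multiple of 3.
    """
    count = size - 1 if before % 3 == 0 else size
    return [
        {'retrieved_on': before - i - 1, 'title': f'post {before - i - 1}'}
        for i in range(count)
    ]


posts = paginate(fake_fetch, start_before=1000, n_pages=3, size=100)
print(len(posts), posts['retrieved_on'].is_unique)
```

With a real request function in place of `fake_fetch`, the same loop would not skip or repeat the post at the boundary when a short batch comes back.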

Only the following columns were retained from each scrape, as the scraper returns a large number of columns (68 in the output below):

|Column Retained|Reasoning|
|--|--|
|subreddit|Required to identify the subreddit, and will form our dependent variable|
|title|Title of the post. Contains a summary of the post, and will be used to help train the model|
|selftext|Generally contains the body of the post, which holds the most content and will be used to help train the model|
|retrieved_on|Timestamp of the post, used to set the `before` parameter of the next request while scraping|
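
A minimal sketch of the column selection follows, assuming a raw batch shaped like the API response. The helper `keep_columns` and the sample rows are illustrative and not part of the notebook; `reindex` is used here so that a batch missing one of the columns yields `NaN` instead of raising a `KeyError`.

```python
import pandas as pd

KEEP = ['subreddit', 'title', 'selftext']


def keep_columns(raw, columns=KEEP):
    # reindex adds any missing column filled with NaN instead of raising KeyError
    return raw.reindex(columns=columns)


# Made-up rows; the second one has no 'selftext' key at all
raw = pd.DataFrame([
    {'subreddit': 'personalfinance', 'title': 'Advice On Car Loan',
     'selftext': 'Hey everyone', 'author': 'user_a'},
    {'subreddit': 'wallstreetbets', 'title': '$ATRA', 'author': 'user_b'},
])

trimmed = keep_columns(raw)
print(trimmed.columns.tolist())
```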

In [2]:
# Pushshift API to scrape Reddit
url = 'https://api.pushshift.io/reddit/search/submission'

## r/personalfinance Scrape

In [3]:
params_p_f = {
    'subreddit' : 'personalfinance',
    'size' : 100,
    'before': 1632293155
}
p_f = requests.get(url, params_p_f)

In [4]:
p_f.status_code

200
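
The status check above can be folded into a small helper that retries when the response is not 200. This is a sketch under assumptions: `get_json` and `FakeResponse` are hypothetical names, and the request function is passed in as an argument so the example runs without network access. In the notebook, `requests.get` would be passed as `get`.

```python
import time


def get_json(get, url, params, retries=3, wait=1.0):
    """Call get(url, params) until it returns HTTP 200, then return the parsed JSON."""
    for attempt in range(1, retries + 1):
        response = get(url, params)
        if response.status_code == 200:
            return response.json()
        time.sleep(wait * attempt)  # wait a little longer after each failure
    raise RuntimeError(f'no 200 response after {retries} attempts')


class FakeResponse:
    """Hypothetical stand-in exposing the two members the helper relies on."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


# First call fails with 429, second succeeds
responses = iter([FakeResponse(429, {}), FakeResponse(200, {'data': []})])
result = get_json(
    lambda url, params: next(responses),
    'https://api.pushshift.io/reddit/search/submission',
    {'subreddit': 'personalfinance', 'size': 100},
    wait=0,
)
print(result)
```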

In [5]:
data = p_f.json()

In [6]:
p_f = pd.DataFrame(data['data'])

In [7]:
p_f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 68 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  100 non-null    object 
 1   allow_live_comments            100 non-null    bool   
 2   author                         100 non-null    object 
 3   author_flair_background_color  83 non-null     object 
 4   author_flair_css_class         1 non-null      object 
 5   author_flair_richtext          100 non-null    object 
 6   author_flair_text              83 non-null     object 
 7   author_flair_text_color        83 non-null     object 
 8   author_flair_type              100 non-null    object 
 9   author_fullname                100 non-null    object 
 10  author_is_blocked              100 non-null    bool   
 11  author_patreon_flair           100 non-null    bool   
 12  author_premium                 100 non-null    bool

In [8]:
p_f['retrieved_on'][0]

1632293155

In [9]:
p_f['retrieved_on'][97]

1632270871
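
The `retrieved_on` values are Unix timestamps in seconds. As a quick sanity check, the two values shown above can be converted to readable UTC datetimes to see how much time a single batch covers. The variable names are illustrative.

```python
import pandas as pd

# Values taken from the two cells above (rows 0 and 97 of the first batch)
newest = pd.to_datetime(1632293155, unit='s', utc=True)
oldest = pd.to_datetime(1632270871, unit='s', utc=True)

span = newest - oldest
print(newest, oldest, span)
```

Both timestamps fall on 22 September 2021, a little over six hours apart.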

In [10]:
p_f_df = p_f[['subreddit', 'title', 'selftext', 'retrieved_on']]
p_f_df

| |subreddit|title|selftext|retrieved_on|
|--|--|--|--|--|
|0|personalfinance|Raise or PTO opinions|Inspired to post after lurking for some time, ...|1632293155|
|1|personalfinance|Advice On Car Loan|Hey everyone,\n\nIn 2019 I was in a pretty bad...|1632293141|
|2|personalfinance|Debit card I never use was skimmed and Wells F...|Hi all,\n\nI have a debit card that I use just...|1632293092|
|3|personalfinance|Personal Loans for 400 Credit score?|are there any personal/emergency loans for 400...|1632292929|
|4|personalfinance|Free Airdrops For you. Claim this now.|[removed]|1632292235|
|...|...|...|...|...|
|95|personalfinance|Wells Fargo dispute UPDATE|So last time I was advised due to having GPS a...|1632271260|
|96|personalfinance|How should I use my new credit card?|So im 29 and just got my first credit card so ...|1632270939|
|97|personalfinance|Investing in crypto|[removed]|1632270871|
|98|personalfinance|I’m making 6k net a month living with my paren...|[removed]|1632270669|


In [12]:
# Continue from the timestamp of the last post in the first batch
new_date = p_f['retrieved_on'][99]

for n in range(1, 31):
    params_p_f = {
        'subreddit': 'personalfinance',
        'size': 100,
        'before': new_date
    }
    p_f_current = requests.get(url, params_p_f).json()
    p_f_1 = pd.DataFrame(p_f_current['data'])
    p_f_1_cleaned = p_f_1[['subreddit', 'retrieved_on', 'title', 'selftext']]
    # Take the timestamp of the 99th post (index 98), since some batches hold only 99 posts
    new_date = p_f_1['retrieved_on'][98]
    p_f_df = pd.concat(objs=[p_f_df, p_f_1_cleaned])

    # Pause between requests
    time.sleep(8)

    print(f'Scrape no: {n} \nScrapped {len(p_f_1_cleaned)} posts. \n--------')

Scrape no: 1 
Scrapped 100 posts. 
--------
Scrape no: 2 
Scrapped 100 posts. 
--------
Scrape no: 3 
Scrapped 100 posts. 
--------
Scrape no: 4 
Scrapped 100 posts. 
--------
Scrape no: 5 
Scrapped 100 posts. 
--------
Scrape no: 6 
Scrapped 100 posts. 
--------
Scrape no: 7 
Scrapped 100 posts. 
--------
Scrape no: 8 
Scrapped 100 posts. 
--------
Scrape no: 9 
Scrapped 100 posts. 
--------
Scrape no: 10 
Scrapped 99 posts. 
--------
Scrape no: 11 
Scrapped 100 posts. 
--------
Scrape no: 12 
Scrapped 100 posts. 
--------
Scrape no: 13 
Scrapped 100 posts. 
--------
Scrape no: 14 
Scrapped 100 posts. 
--------
Scrape no: 15 
Scrapped 100 posts. 
--------
Scrape no: 16 
Scrapped 100 posts. 
--------
Scrape no: 17 
Scrapped 100 posts. 
--------
Scrape no: 18 
Scrapped 100 posts. 
--------
Scrape no: 19 
Scrapped 100 posts. 
--------
Scrape no: 20 
Scrapped 99 posts. 
--------
Scrape no: 21 
Scrapped 100 posts. 
--------
Scrape no: 22 
Scrapped 100 posts. 
--------
Scrape no: 23 
Scrapp

In [13]:
p_f_df_clean = p_f_df.reset_index(drop = True)

In [14]:
p_f_df_clean.duplicated(subset = 'title').sum()

1479

In [15]:
p_f_df_clean.drop_duplicates(subset = 'title', keep='first', inplace=True)
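
To make the de-duplication step concrete, here is a small sketch on made-up rows. `duplicated(subset='title')` flags every repeat of a title after its first occurrence, and `drop_duplicates(..., keep='first')` keeps the earliest row for each title. The third timestamp is invented for the example.

```python
import pandas as pd

batch = pd.DataFrame({
    'title': ['Advice On Car Loan', 'Investing in crypto', 'Advice On Car Loan'],
    'retrieved_on': [1632293141, 1632270871, 1632270000],  # last value is made up
})

# Number of rows whose title already appeared earlier in the frame
n_dupes = batch.duplicated(subset='title').sum()

# Keep the first row per title and renumber the index
deduped = batch.drop_duplicates(subset='title', keep='first').reset_index(drop=True)
print(n_dupes, len(deduped))
```

Note that matching on `title` alone also drops distinct posts that happen to share a title, which may or may not be desirable depending on the cleaning goals.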

## r/wallstreetbets Scrape

In [16]:
params_wsb = {
    'subreddit' : 'wallstreetbets',
    'size' : 100,
    'before': 1632293401
}
wsb = requests.get(url, params_wsb)

In [17]:
wsb = pd.DataFrame(wsb.json()['data'])

In [18]:
wsb['retrieved_on'][0]

1632293401

In [19]:
wsb['retrieved_on'][99]

1632279727

In [20]:
wsb_df = wsb[['subreddit', 'retrieved_on', 'title', 'selftext']]

In [21]:
wsb_df

| |subreddit|retrieved_on|title|selftext|
|--|--|--|--|--|
|0|wallstreetbets|1632293401|The market won’t crash|Market prices are based on demand and offer. \...|
|1|wallstreetbets|1632293287|China Injects $18.6 Billion Into Banking Syste...| |
|2|wallstreetbets|1632293253|$ATRA|[removed]|
|3|wallstreetbets|1632293192|$YANG Gang Update - We’re Fuk (Maybe) - Part 3|Yeahh.... sooo Xinnie the Pooh just [took over...|
|4|wallstreetbets|1632292854|StockJesus Interesting Trades|Here are the trades that you need to be watchi...|
|...|...|...|...|...|
|95|wallstreetbets|1632279893|$wish take all my money or give me 2 million I...| |
|96|wallstreetbets|1632279864|Advice for someone who knows nothing about the...|Im in my mid 30s, decent job with 401k, but st...|
|97|wallstreetbets|1632279824|How I use Data Science to Trade Options Around...|Recently I started a trading strategy around e...|
|98|wallstreetbets|1632279778|Waiting on Next Meme Stock YOLO| |


In [22]:
# Continue from the timestamp of the last post in the first batch
new_date = wsb['retrieved_on'][99]

for n in range(1, 61):
    params_wsb = {
        'subreddit': 'wallstreetbets',
        'size': 99,
        'before': new_date
    }
    wsb_current = requests.get(url, params_wsb).json()
    wsb_1 = pd.DataFrame(wsb_current['data'])
    wsb_1_cleaned = wsb_1[['subreddit', 'retrieved_on', 'title', 'selftext']]
    # Each batch holds 99 posts, so index 98 is the last (oldest) one
    new_date = wsb_1['retrieved_on'][98]
    wsb_df = pd.concat(objs=[wsb_df, wsb_1_cleaned])

    # Pause between requests
    time.sleep(8)

    print(f'Scrape no: {n} \nScrapped {len(wsb_1_cleaned)} posts. \n--------')

Scrape no: 1 
Scrapped 99 posts. 
--------
Scrape no: 2 
Scrapped 99 posts. 
--------
Scrape no: 3 
Scrapped 99 posts. 
--------
Scrape no: 4 
Scrapped 99 posts. 
--------
Scrape no: 5 
Scrapped 99 posts. 
--------
Scrape no: 6 
Scrapped 99 posts. 
--------
Scrape no: 7 
Scrapped 99 posts. 
--------
Scrape no: 8 
Scrapped 99 posts. 
--------
Scrape no: 9 
Scrapped 99 posts. 
--------
Scrape no: 10 
Scrapped 99 posts. 
--------
Scrape no: 11 
Scrapped 99 posts. 
--------
Scrape no: 12 
Scrapped 99 posts. 
--------
Scrape no: 13 
Scrapped 99 posts. 
--------
Scrape no: 14 
Scrapped 99 posts. 
--------
Scrape no: 15 
Scrapped 99 posts. 
--------
Scrape no: 16 
Scrapped 99 posts. 
--------
Scrape no: 17 
Scrapped 99 posts. 
--------
Scrape no: 18 
Scrapped 99 posts. 
--------
Scrape no: 19 
Scrapped 99 posts. 
--------
Scrape no: 20 
Scrapped 99 posts. 
--------
Scrape no: 21 
Scrapped 99 posts. 
--------
Scrape no: 22 
Scrapped 99 posts. 
--------
Scrape no: 23 
Scrapped 99 posts. 
------

In [23]:
wsb_df.reset_index(drop = True, inplace = True)

In [24]:
wsb_df.duplicated(subset = 'title').sum()

477

In [25]:
wsb_df.drop_duplicates(subset = 'title', keep='first', inplace=True)

In [26]:
wsb_df

| |subreddit|retrieved_on|title|selftext|
|--|--|--|--|--|
|0|wallstreetbets|1632293401|The market won’t crash|Market prices are based on demand and offer. \...|
|1|wallstreetbets|1632293287|China Injects $18.6 Billion Into Banking Syste...| |
|2|wallstreetbets|1632293253|$ATRA|[removed]|
|3|wallstreetbets|1632293192|$YANG Gang Update - We’re Fuk (Maybe) - Part 3|Yeahh.... sooo Xinnie the Pooh just [took over...|
|4|wallstreetbets|1632292854|StockJesus Interesting Trades|Here are the trades that you need to be watchi...|
|...|...|...|...|...|
|6035|wallstreetbets|1631739325|An old meme for CCJ I made back when it was &l...| |
|6036|wallstreetbets|1631739315|Is SDC new WSB rocket？| |
|6037|wallstreetbets|1631739313|Who wants to join my time machine startup.| |
|6038|wallstreetbets|1631739311|Who is selling TLRY and the other Canadian can...|[removed]|


In [27]:
reddit = pd.concat(objs = [
        p_f_df_clean, wsb_df
]).reset_index(drop = True)

## Saving the Data

In [28]:
reddit.to_csv('datasets/reddit.csv')
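
A note on the saved file: called without `index=False`, `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. `to_csv` does not create missing directories either, so the `datasets` folder has to exist beforehand. The sketch below demonstrates the index behaviour on a tiny made-up frame in a temporary directory; the file names are illustrative.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    'subreddit': ['personalfinance', 'wallstreetbets'],
    'title': ['Advice On Car Loan', '$ATRA'],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    frame.to_csv(with_index)                  # default: row index is written too
    frame.to_csv(without_index, index=False)  # row index is skipped

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Whether to keep the index is a design choice. Dropping it keeps the file limited to the scraped columns, while keeping it preserves the row numbering for any downstream notebook that expects it.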