# Imports

___

In [13]:
import pandas as pd
import numpy as np 
import requests
from nltk.stem import PorterStemmer, WordNetLemmatizer

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer

In [14]:
import time

We are going to train a model to determine which of the following two subreddits a post came from. To do this we will need a large number of samples from both. The API we are using is an open-source project, and to avoid overtaxing the server we have built a sleep time into our API request function. Those trying to replicate this project should take care to include some measure of courtesy when using the API. The host has been known to block abusive IP addresses.

Facts - https://www.reddit.com/r/facts/
FakeFacts - https://www.reddit.com/r/FakeFacts/

___

# URL and pull parameters

We are going to be using the Pushshift API - https://github.com/pushshift/api

In [15]:
# api setting
url = 'https://api.pushshift.io/reddit/search/submission'



In [16]:
# parse the JSON body of a request result
# the posts are stored under the 'data' key
# return the posts as a dataframe
def res_to_df(res):
    data = res.json()
    posts = data['data']
    return pd.DataFrame(posts)


Below is a sleep-controlled function for pulling subreddit information from the API. It takes a number of iterations, a subreddit name, and a UTC time that sets the most recent post in the returned dataframe. The function pulls 50 posts at a time and sleeps for a period of time between pulls. Each request result is concatenated into master_df, which is returned.

In [17]:
def get_posts(strokes, subreddit, target_time):
    master_df = pd.DataFrame()

    for i in range(strokes):
        try:
            params = {
                'subreddit': subreddit,
                'size': 50,
                'before': target_time}

            res = requests.get(url, params=params)
            data = res.json()
            posts = data['data']
            df = pd.DataFrame(posts)

            frames = [df, master_df]
            master_df = pd.concat(frames, axis=0, ignore_index=True)

            # the oldest post in this batch becomes the cutoff for the next request
            target_time = df['created_utc'].min()
        except KeyError:
            # for when 'data' or created_utc does not exist
            pass
        finally:
            # sleep after every request, including failed ones
            time.sleep(10)

    return master_df
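
The backward-paging pattern used by `get_posts` can be checked without touching the network. The sketch below is only an illustration of that pattern, not the notebook's code: `page_backwards`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, and the fake API serves posts from an in-memory list and treats `before` as a strict cutoff, which is an assumption about the real service.

```python
import time

def page_backwards(fetch, strokes, target_time, pause=0.0):
    """Collect posts older than target_time, one batch per stroke."""
    collected = []
    for _ in range(strokes):
        batch = fetch(before=target_time)
        if not batch:
            # nothing older is left, so further requests would be wasted
            break
        collected.extend(batch)
        # the oldest post in this batch becomes the cutoff for the next request
        target_time = min(post['created_utc'] for post in batch)
        time.sleep(pause)
    return collected


# Hypothetical stand-in for the API: ten posts with timestamps 100, 110, ..., 190
FAKE_POSTS = [{'id': i, 'created_utc': 100 + 10 * i} for i in range(10)]

def fake_fetch(before, size=4):
    """Return up to `size` of the newest posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


posts = page_backwards(fake_fetch, strokes=5, target_time=1000)
print(len(posts))  # 10
```

Stopping on an empty batch is a design choice of this sketch: once a subreddit has no older posts, the remaining strokes would only add load on the server.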



Now we run our API request function for both of our subreddit pages and save the resulting dataframes to a single results_df. We need to give it a UTC time for the most recent post that we want returned.
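
The UTC time is an integer count of seconds since 1970-01-01 00:00:00 UTC. As a minimal sketch, a calendar date can be turned into such a value with the standard library; `to_utc_epoch` is a hypothetical helper name, not part of the notebook.

```python
from datetime import datetime, timezone

def to_utc_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return whole seconds since 1970-01-01 00:00:00 UTC for a UTC wall-clock time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())

cutoff = to_utc_epoch(2020, 7, 14, 17, 10, 14)
print(cutoff)  # 1594746614
```

Passing `tzinfo=timezone.utc` explicitly matters: a naive `datetime` would be interpreted in the local time zone of the machine running the notebook.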

___

# Calling API request function

This function needs to be run several times. First run this section, then run the merge section so that our results df is updated with the new API results. The UTC time code will need to be updated before each new request.

In [32]:
fact_df = get_posts(20, 'facts', 1594746614)

In [33]:
fiction_df = get_posts(20, 'FakeFacts', 1540795916)

___

# Merge Results

results_df is our full output from both subreddit API requests. We will continue to concatenate new API requests to this df until we have a large enough sample size.

On the first run, and only the first run, you will need to uncomment the line of code immediately below so that results_df exists before the first merge.

In [36]:
# uncomment this for the first time running
# results_df = pd.DataFrame()

# update fiction 
results_df = pd.concat([results_df,fiction_df], axis = 0, ignore_index = True)

# update fact
results_df = pd.concat([results_df,fact_df], axis = 0, ignore_index = True)

___

## Inspect Results

___

Since we want to maintain an even sample size from both sources, we will continue to pull from both until one returns fewer samples than the other.

In [34]:
fact_df.shape

(1000, 74)

In [35]:
fiction_df.shape

(524, 81)

Use the UTC codes printed below to pull the next batch.

In [37]:
last_fact = fact_df['created_utc'].min()
last_fic = fiction_df['created_utc'].min()

print(f'Fact_utc: {last_fact}')
print(f'Fiction_utc: {last_fic}')

Fact_utc: 1587720304
Fiction_utc: 1332353296
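
Raw epoch values are hard to read at a glance. As a small sketch, the two codes printed above can be converted to UTC dates with the standard library; `describe_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def describe_utc(epoch_seconds):
    """Format an epoch timestamp (seconds) as a UTC date and time string."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

print(describe_utc(1587720304))  # 2020-04-24 09:25:04
print(describe_utc(1332353296))  # 2012-03-21 18:08:16
```

The `int(...)` call lets the helper accept the NumPy integers that `df['created_utc'].min()` returns as well as plain Python integers.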


In [38]:
results_df.shape

(5524, 93)

Our FakeFacts subreddit ran out of posts to scrape after 2,524 observations. We will have slightly uneven classes, but this is OK for now. We can assess this and rebalance during EDA if we find we need to.
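
To put a number on the imbalance, here is a quick arithmetic sketch. The facts count is inferred as the 5,524 total rows minus the 2,524 FakeFacts rows; it is not printed anywhere in the notebook. The share of the larger class is also the accuracy a model would get by always guessing that class.

```python
# Class counts: FakeFacts is stated above, facts is inferred from the total row count
counts = {'facts': 5524 - 2524, 'FakeFacts': 2524}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of always predicting the larger class
baseline = max(shares.values())
print(round(baseline, 3))  # 0.543
```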

___

# Clip & Clean Dataframe

This dataframe narrows our features down to just the post text and its length in characters.

In [39]:
clip_df = results_df[['subreddit','selftext','title',]].copy()

## Subtext

1) Remove '[removed]' from selftext column

2) Create Has_selftext feature

3) Combine selftext and title fields

4) Create feature for character length

Most of our subreddits' selftexts are hyperlinks, so their presence or absence will likely be a predictive feature when trying to determine whether a post came from the facts or the FakeFacts subreddit.
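
The notebook does not show how the Has_selftext feature from step 2 is built, so the following is only one plausible sketch in plain Python. The name `has_selftext` and the sample values are assumptions. It would need to run on the selftext column before the title is merged into it, since after the merge every row contains text.

```python
import math

def has_selftext(value):
    """True when a raw selftext value carries real content."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        # pandas represents a missing selftext as NaN
        return False
    text = str(value).strip()
    return text not in ('', '[removed]')

# Hypothetical raw values: a link, a removed post, an empty post, and two missing values
raw_selftext = ['https://example.com/source', '[removed]', '', float('nan'), None]
flags = [has_selftext(value) for value in raw_selftext]
print(flags)  # [True, False, False, False, False]
```

With pandas, the same predicate could be applied as `clip_df['selftext'].map(has_selftext)`.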

In [40]:
clip_df.selftext = clip_df.selftext.replace(['[removed]'],' ')

In [41]:
# add title and self text
clip_df['selftext'] = clip_df['title'] + ' ' + clip_df['selftext']

# fill leftover rows that came out as NaN with the title alone
clip_df['selftext'] = clip_df['selftext'].fillna(clip_df['title'])

# drop title 
clip_df.drop(columns = 'title', inplace = True)

# make column for length of selftext
clip_df['selftext_length'] = [len(post) for post in clip_df.selftext]

## Results

We should have a df with columns for our subreddit, the text, and the length of the text in characters.

In [42]:
display(clip_df.groupby(['subreddit']).mean(numeric_only=True), clip_df.head())

| subreddit | selftext_length |
|---|---|
| FakeFacts | 176.398177 |
| facts | 187.203667 |

|  | subreddit | selftext | selftext_length |
|---|---|---|---|
| 0 | FakeFacts | The origin of the word "yeet" is in old englis... | 50 |
| 1 | FakeFacts | Did you know LEGO was named after the phrase “... | 117 |
| 2 | FakeFacts | Taking the crust off pizza is considered disre... | 65 |
| 3 | FakeFacts | The Catholic church has a secret bible that al... | 99 |
| 4 | FakeFacts | The earth is round | 20 |


In [43]:
clip_df.shape

(5524, 3)

___

# Export 

Now we export the raw results as a CSV and pickle our cleaned dataframe.

In [44]:
results_df.to_csv('./data/raw_reddit.csv',  index = False)

In [45]:
import pickle 

In [46]:
with open('raw.pkl','wb') as raw_pickle:
    pickle.dump(clip_df, raw_pickle)
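
A later notebook can read the pickle back with `pickle.load`. The round trip is sketched below with a plain list standing in for `clip_df` and a temporary directory standing in for the working directory; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for clip_df: any picklable object round-trips the same way
sample = [{'subreddit': 'FakeFacts', 'selftext': 'example text', 'selftext_length': 12}]

path = os.path.join(tempfile.mkdtemp(), 'raw.pkl')

# write in binary mode, as in the export cell above
with open(path, 'wb') as raw_pickle:
    pickle.dump(sample, raw_pickle)

# read in binary mode to restore the object
with open(path, 'rb') as raw_pickle:
    restored = pickle.load(raw_pickle)

print(restored == sample)  # True
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.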