# 1. Scraping Data Using the Pushshift API

##### <u> Contents </u>
- [Pulling data from both subreddits](#Pulling-data-from-both-subreddits)
- [Light Dataset Checks](#Light-Dataset-Checks)
- [Concatenating and Saving Dataset for Language Preprocessing](#Concatenating-and-Saving-Dataset-for-Language-Preprocessing)

In [1]:
import pandas as pd
import datetime as dt
import time
import requests

The function below was provided by Tom Ludlow in a demo on using the Pushshift API to scrape posts from specified subreddits.

In [2]:
def query_pushshift(subreddit, kind='submission', skip=30, times=5,
                    subfield=['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments',
                              'score', 'is_self'],
                    comfields=['body', 'score', 'created_utc']):
    """Pull `times` batches of up to 500 posts from `subreddit`, moving the `after` cutoff back `skip` days per request.

    Note: `comfields` is currently unused; only submissions are filtered by column.
    """
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    mylist = []
    for x in range(1, times + 1):
        # Each request asks for posts newer than (skip * x) days ago
        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        response = requests.get(URL)
        assert response.status_code == 200
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        mylist.append(df)
        time.sleep(2)  # throttle to slow down API pulls
    full = pd.concat(mylist, sort=False)
    if kind == "submission":
        full = full[subfield]                          # keep only the columns of interest
        full = full.drop_duplicates(subset='title')    # drop posts with repeated titles
        full = full.loc[full['is_self'] == True].copy()  # keep self (text) posts only

    def get_date(created):
        # Converts epoch seconds to a date in the machine's local timezone
        return dt.date.fromtimestamp(created)

    full['timestamp'] = full["created_utc"].apply(get_date)
    print(full.shape)
    return full
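
The `get_date` helper inside `query_pushshift` uses `dt.date.fromtimestamp`, which interprets the epoch value in the local timezone of the machine running the notebook. As a minimal sketch of a timezone-independent alternative, assuming `created_utc` holds epoch seconds, the conversion could be done with pandas directly. The helper name `add_utc_date` is hypothetical and is not part of the original function.

```python
import pandas as pd


def add_utc_date(df, column="created_utc"):
    """Return a copy of df with a 'timestamp' column holding the UTC calendar date."""
    out = df.copy()
    # Interpret the integers as seconds since the Unix epoch, in UTC,
    # then keep only the calendar date.
    out["timestamp"] = pd.to_datetime(out[column], unit="s", utc=True).dt.date
    return out


# Two created_utc values taken from the StarWars output in this notebook
sample = pd.DataFrame({"created_utc": [1577751046, 1577752344]})
dated = add_utc_date(sample)
print(dated)
```

With this approach the first sample value falls on 2019-12-31 in UTC, whereas the notebook output shows 2019-12-30, which is what a local timezone behind UTC would produce.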

### Pulling data from both subreddits

In [3]:
starwars = query_pushshift('StarWars')

https://api.pushshift.io/reddit/search/submission/?subreddit=StarWars&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=StarWars&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=StarWars&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=StarWars&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=StarWars&size=500&after=150d
(1098, 9)


In [4]:
startrek = query_pushshift('startrek')

https://api.pushshift.io/reddit/search/submission/?subreddit=startrek&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=startrek&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=startrek&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=startrek&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=startrek&size=500&after=150d
(1816, 9)
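
The URLs printed above follow directly from the `stem` string and the `skip * x` arithmetic in `query_pushshift`. The sketch below reproduces them without making any network requests, which can be handy for checking the parameters before pulling data. `build_urls` is a hypothetical helper introduced only for illustration.

```python
def build_urls(subreddit, kind="submission", skip=30, times=5, size=500):
    """Build the list of Pushshift search URLs that query_pushshift would request."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(
        kind, subreddit, size
    )
    # One URL per batch; the 'after' cutoff moves back `skip` days each time.
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times + 1)]


urls = build_urls("startrek")
for url in urls:
    print(url)
```

Because every request uses only an `after` cutoff, the windows may overlap, which is one reason the function drops duplicate titles after concatenating the batches.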


### Light Dataset Checks

In [5]:
starwars.head()

|   | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Did anyone else notice this pretty cool practi... | I've only seen the movie once, so I can't exac... | StarWars | 1577751046 | DoubleOhSeven68 | 10 | 1 | True | 2019-12-30 |
| 2 | Anyone else fans of the Legacy comic series? | Read them a long time ago (in a library ten mi... | StarWars | 1577751339 | PrettyDumbHonestly | 4 | 1 | True | 2019-12-30 |
| 4 | A non Star Wars fan's thoughts on the series' ... | My wife and I went and saw RoS the other day a... | StarWars | 1577751468 | Mojo884ever | 3 | 1 | True | 2019-12-30 |
| 6 | Has anyone seen Red Tails? It’s Lucasfilm, 2012 | A world war 2 movie for those who don’t know. ... | StarWars | 1577751800 | DarthCasanova | 4 | 1 | True | 2019-12-30 |
| 8 | Kathleen Kennedy taking credit for the return ... | https://www.ign.com/articles/2019/12/30/star-w... | StarWars | 1577752344 | killiandw | 10 | 1 | True | 2019-12-30 |


In [6]:
startrek.head()

|   | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What's the music genre called that Nic Fontain... | This Frank Sinatra kind of music. | startrek | 1577751121 | AdligerAdler | 15 | 1 | True | 2019-12-30 |
| 3 | Which episode has the worst/dumbest moral? (Be... | My money's on Cogenitor from ENT. Trip discove... | startrek | 1577751484 | Z80-A | 526 | 1 | True | 2019-12-30 |
| 4 | Ds9 mirror mic Fontaine and Picard? | Is there an in universe explanation of why mir... | startrek | 1577753150 | richterman2369 | 4 | 1 | True | 2019-12-30 |
| 6 | [Spoilers-Picard] Are these guys suppose to be... | https://imgur.com/CzznXX9 | startrek | 1577756248 | InadequateUsername | 4 | 1 | True | 2019-12-30 |
| 7 | The uniforms in Picard are annoying me. | They went backward. To put it in perspective f... | startrek | 1577757269 | YoviQ | 26 | 1 | True | 2019-12-30 |


The corpus for modeling will be built from the `title` column of both datasets, so I will run some rudimentary checks on that column: null values, plus placeholder tags such as `[removed]` and `[deleted]`, which indicate missing data or posts that the subreddit moderators have taken down.
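
The checks described above can be bundled into one small helper. This is a sketch only: the function name `count_missing_like` and the toy series are made up for illustration, and the placeholder list is assumed to be limited to the two tags named in the text.

```python
import pandas as pd

# Placeholder strings treated as missing (assumed to be just these two)
PLACEHOLDERS = ["[removed]", "[deleted]"]


def count_missing_like(series, placeholders=PLACEHOLDERS):
    """Count true nulls and placeholder tags in a text column."""
    nulls = int(series.isnull().sum())
    # isin() matches a value equal to ANY of the placeholders
    tagged = int(series.isin(placeholders).sum())
    return {"nulls": nulls, "placeholders": tagged}


toy = pd.Series(["A real title", "[removed]", None, "[deleted]"])
report = count_missing_like(toy)
print(report)
```

Applied to the real data, the call would look like `count_missing_like(startrek['title'])`.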

In [7]:
startrek['title'].isnull().sum()

0

In [8]:
starwars['title'].isnull().sum()

0

In [9]:
startrek['title'].isin(['[removed]', '[deleted]']).sum()  # count titles equal to either placeholder tag

0

In [10]:
starwars['title'].isin(['[removed]', '[deleted]']).sum()  # count titles equal to either placeholder tag

0

There are no nulls, and no placeholder values such as `[removed]` or `[deleted]` that should be treated as nulls, so the datasets can be concatenated and saved for future analysis.

### Concatenating and Saving Dataset for Language Preprocessing

In [11]:
scifi = pd.concat([starwars,startrek], axis = 0)

In [12]:
scifi.subreddit.value_counts() #shows that both datasets have been concatenated.

startrek    1816
StarWars    1098
Name: subreddit, dtype: int64
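
One detail worth knowing about `pd.concat` along `axis = 0` is that each frame keeps its original row labels unless `ignore_index=True` is passed. A small sketch with made-up toy frames (not the scraped data) shows the difference:

```python
import pandas as pd

# Toy frames with non-contiguous indexes, similar to the filtered pulls
a = pd.DataFrame({"title": ["x", "y"], "subreddit": ["StarWars"] * 2}, index=[1, 2])
b = pd.DataFrame({"title": ["z"], "subreddit": ["startrek"]}, index=[0])

kept = pd.concat([a, b], axis=0)                       # original row labels are preserved
reset = pd.concat([a, b], axis=0, ignore_index=True)   # rows are relabelled 0..n-1

print(list(kept.index))
print(list(reset.index))
print(reset["subreddit"].value_counts())
```

In this notebook the same relabelling happens indirectly: the frame is written with `index = False` and read back with `pd.read_csv`, so the reloaded `scifi` starts at row 0.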

In [13]:
# Save the concatenated dataset to a CSV file (without the index column)
scifi.to_csv('./scifi.csv',index = False)

In [14]:
scifi = pd.read_csv('./scifi.csv')

In [15]:
scifi.head()

|   | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Did anyone else notice this pretty cool practi... | I've only seen the movie once, so I can't exac... | StarWars | 1577751046 | DoubleOhSeven68 | 10 | 1 | True | 2019-12-30 |
| 1 | Anyone else fans of the Legacy comic series? | Read them a long time ago (in a library ten mi... | StarWars | 1577751339 | PrettyDumbHonestly | 4 | 1 | True | 2019-12-30 |
| 2 | A non Star Wars fan's thoughts on the series' ... | My wife and I went and saw RoS the other day a... | StarWars | 1577751468 | Mojo884ever | 3 | 1 | True | 2019-12-30 |
| 3 | Has anyone seen Red Tails? It’s Lucasfilm, 2012 | A world war 2 movie for those who don’t know. ... | StarWars | 1577751800 | DarthCasanova | 4 | 1 | True | 2019-12-30 |
| 4 | Kathleen Kennedy taking credit for the return ... | https://www.ign.com/articles/2019/12/30/star-w... | StarWars | 1577752344 | killiandw | 10 | 1 | True | 2019-12-30 |


In [18]:
scifi['subreddit'].value_counts(normalize = True)

startrek    0.623198
StarWars    0.376802
Name: subreddit, dtype: float64

# Please continue to the Language Preprocessing notebook.