# Data Collection


## Import libraries

In [1]:
# Import libraries
import requests
import pandas as pd
import datetime as dt
import time
# Show every column and row when displaying dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Set parameters

In [2]:
# Set parameters for data collecting

# Number of posts to collect for each subreddit
NUMBER_OF_POSTS_TO_COLLECT = 5000

# Maximum number of results to request per API call
MAX_NUM_OF_RESULTS = 500

## Define function to automate the data collecting process

In [3]:
# Adapted from Mahdi's local lesson
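def query_pushshift(subreddit_name, day_window, max_requests=10):
    """Collect self posts from `subreddit_name` via the Pushshift API.

    Request `i` asks for posts created after `day_window * i` days ago; the
    loop stops after `max_requests` requests or once
    NUMBER_OF_POSTS_TO_COLLECT posts have been gathered.
    """
    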
    subfields = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self'] 
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    
    # Define a function that will return a dictionary of parameters.
    def params(d_window):
        return { 'subreddit' : subreddit_name,
                 'size' : MAX_NUM_OF_RESULTS,
                 'after' : f'{d_window}d' }
    
    # Accumulate the posts from every request in one dataframe
    df = pd.DataFrame()
    
    # Each iteration requests up to MAX_NUM_OF_RESULTS posts and moves the
    # start of the window back by another `day_window` days
    for i in range(1, max_requests + 1):
        response = requests.get(url, params=params(day_window * i))
        print(f'queried from {response.url}')
        # Raise an HTTPError on any 4xx/5xx response
        response.raise_for_status()
        posts = response.json()['data']
        
        # Keep only self (text) posts and the columns of interest
        df_new = pd.DataFrame(posts)
        df_new = df_new.loc[df_new['is_self'] == True]
        df_new = df_new[subfields]
        # `DataFrame.append` was removed in pandas 2.0; `pd.concat` replaces it
        df = pd.concat([df, df_new], sort=False)
        df.drop_duplicates(inplace=True)
        
        # Break the loop once the desired number of posts has been collected
        if df.shape[0] >= NUMBER_OF_POSTS_TO_COLLECT:
            break
        time.sleep(2)

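    # Create `timestamp` column holding the calendar date of each post
    # (`date.fromtimestamp` converts using the machine's local timezone)
    df['timestamp'] = df['created_utc'].map(dt.date.fromtimestamp)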
    df.reset_index(drop=True, inplace=True)
    
    # Print Query Completion Time and Date
    now = dt.datetime.now()
    print('Query Completion Time and Date: ' + now.strftime('%Y-%m-%d %H:%M:%S'))
    return df
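
The paging logic in `query_pushshift` can be tried without touching the network. The sketch below is an illustration only, not the notebook's own code: `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the HTTP request by returning small, overlapping pages. It shows how the `after` value grows with each request, how duplicates from overlapping windows are dropped, and how the loop stops once enough unique rows exist.

```python
import pandas as pd


def collect_pages(fetch_page, day_window, max_requests, target_rows):
    """Accumulate pages until `target_rows` unique rows have been collected.

    `fetch_page` is a hypothetical callable that takes an `after` string
    such as '30d' and returns a list of post dictionaries.
    """
    frames = []
    combined = pd.DataFrame()
    for i in range(1, max_requests + 1):
        # Same window arithmetic as the function above: 30d, 60d, 90d, ...
        after = f'{day_window * i}d'
        frames.append(pd.DataFrame(fetch_page(after)))
        # Concatenate everything fetched so far and drop repeated posts
        combined = pd.concat(frames, ignore_index=True).drop_duplicates()
        if len(combined) >= target_rows:
            break
    return combined.reset_index(drop=True)


def fake_fetch(after):
    # Stand-in for the HTTP request: every window returns three posts,
    # and the last post of one window is repeated in the next window.
    days = int(after.rstrip('d'))
    return [{'title': f'post {n}', 'created_utc': n} for n in range(days, days + 3)]


demo = collect_pages(fake_fetch, day_window=2, max_requests=10, target_rows=7)
print(demo)
```

Collecting the pages in a list and concatenating them is one possible design; it avoids relying on `DataFrame.append`, which no longer exists in pandas 2.0 and later.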

## Collect posts from subreddit: Star Trek

In [4]:
df_star_trek = query_pushshift('startrek', day_window=30, max_requests=15)

queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=30d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=60d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=90d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=120d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=150d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=180d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=210d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=240d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&size=500&after=270d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=startrek&siz

In [5]:
# Check the shape of the dataframe
df_star_trek.shape

(5074, 9)

In [6]:
# Check the head of the dataframe
df_star_trek.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | One thing I would change in the Star Trek univ... | Transporter and Replicator technology. \n\nIt... | startrek | 1577410001 | aaraujo1973 | 35 | 1 | True | 2019-12-26 |
| 1 | [ENT] What, IRL, was the idea with "Chef" neve... | surely the writers couldn't have had that unfo... | startrek | 1577410949 | strangemotives | 43 | 1 | True | 2019-12-26 |
| 2 | I enjoyed Nemesis | I finally watched it. I know it gets a lot of ... | startrek | 1577411556 | GreatScott0389 | 9 | 1 | True | 2019-12-26 |
| 3 | Data’s day | I am a first time viewer of any version of Sta... | startrek | 1577412676 | shadowdra126 | 5 | 1 | True | 2019-12-26 |
| 4 | Picard Universe, the “Marvel” Time and Univers... | So, I noticed Picard takes place only 4 years ... | startrek | 1577416233 | b-zod | 17 | 1 | True | 2019-12-26 |
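
The `timestamp` column is derived from `created_utc` with `dt.date.fromtimestamp`, which converts using the local timezone of the machine that ran the notebook. If a timezone-independent date is preferred, one option is to convert explicitly in UTC. The sketch below uses two `created_utc` values copied from the output above; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Epoch values copied from the `created_utc` column shown above
created_utc = pd.Series([1577410001, 1577410949])

# Interpret the integers as seconds since the Unix epoch, in UTC
utc_times = pd.to_datetime(created_utc, unit='s', utc=True)

# Keep only the calendar date, as the `timestamp` column does
utc_dates = utc_times.dt.date
print(utc_dates)
```

For these two posts the UTC date is 2019-12-27, one day later than the local date shown in the `timestamp` column.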


In [7]:
# Keep only the first 5000 rows so that the two classes stay balanced
df_star_trek = df_star_trek.iloc[:NUMBER_OF_POSTS_TO_COLLECT]

In [8]:
df_star_trek.shape

(5000, 9)

In [9]:
# Export dataframe to csv
# Query Completion Time and Date: 2020-01-25 20:15:13
df_star_trek.to_csv('../data/star_trek.csv')

## Collect posts from subreddit: Star Wars

In [10]:
df_star_wars = query_pushshift('StarWars', day_window=30, max_requests=30)

queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=30d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=60d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=90d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=120d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=150d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=180d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=210d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=240d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&size=500&after=270d
queried from https://api.pushshift.io/reddit/search/submission?subreddit=StarWars&siz

In [11]:
df_star_wars.shape

(5062, 9)

In [12]:
# Check the head of the dataframe
df_star_wars.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Qui-Gon jinn movie please? | Hear me out, baby qui-gon jinn!!!! and seeing ... | StarWars | 1577409358 | devn_bokr | 11 | 1 | True | 2019-12-26 |
| 1 | October 2019 - Canon: Emperor Palpatine is an ... | [removed] | StarWars | 1577409600 | i-THiNK-iM-BiG-MEECH | 0 | 1 | True | 2019-12-26 |
| 2 | About Cal Kestis | For real what is his age I have searched onlin... | StarWars | 1577409715 | XxWhatIsLifexX | 6 | 1 | True | 2019-12-26 |
| 3 | Revan TV Show [Spoilers] | I think it the KOTOR should be a show broken i... | StarWars | 1577409997 | ltite | 3 | 1 | True | 2019-12-26 |
| 4 | The lack of film on Disney's cutting-room floo... | Good art is often the product of a skilled and... | StarWars | 1577410354 | msilcommand | 7 | 1 | True | 2019-12-26 |


In [13]:
# Keep only the first 5000 rows so that the two classes stay balanced
df_star_wars = df_star_wars.iloc[:NUMBER_OF_POSTS_TO_COLLECT]

In [14]:
# Export dataframe to csv
# Query Completion Time and Date: 2020-01-25 20:16:17
df_star_wars.to_csv('../data/star_wars.csv')

## Combine the two dataframes and shuffle dataframe rows

In [15]:
# Combine the two dataframes
df_combined = pd.concat([df_star_trek, df_star_wars], ignore_index=True)

In [16]:
df_combined.shape

(10000, 9)
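
The shape alone does not show whether the two subreddits contribute equally. A quick check with `value_counts` can confirm the balance; the sketch below runs on a tiny stand-in dataframe (`demo` is illustrative, not the notebook's data), and the same two lines could be applied to `df_combined`.

```python
import pandas as pd

# Tiny stand-in for the combined dataframe
demo = pd.DataFrame({'subreddit': ['startrek'] * 3 + ['StarWars'] * 3})

# Count the rows per class
counts = demo['subreddit'].value_counts()

# The classes are balanced when every class has the same count
balanced = counts.nunique() == 1
print(counts)
print('balanced:', balanced)
```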

In [17]:
df_combined.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | One thing I would change in the Star Trek univ... | Transporter and Replicator technology. \n\nIt... | startrek | 1577410001 | aaraujo1973 | 35 | 1 | True | 2019-12-26 |
| 1 | [ENT] What, IRL, was the idea with "Chef" neve... | surely the writers couldn't have had that unfo... | startrek | 1577410949 | strangemotives | 43 | 1 | True | 2019-12-26 |
| 2 | I enjoyed Nemesis | I finally watched it. I know it gets a lot of ... | startrek | 1577411556 | GreatScott0389 | 9 | 1 | True | 2019-12-26 |
| 3 | Data’s day | I am a first time viewer of any version of Sta... | startrek | 1577412676 | shadowdra126 | 5 | 1 | True | 2019-12-26 |
| 4 | Picard Universe, the “Marvel” Time and Univers... | So, I noticed Picard takes place only 4 years ... | startrek | 1577416233 | b-zod | 17 | 1 | True | 2019-12-26 |


In [18]:
# Shuffle Dataframe Rows
df_shuffled = df_combined.sample(frac=1).reset_index(drop=True)
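
Because `sample(frac=1)` is called without a seed, each run of the notebook produces a different row order. If a repeatable shuffle is wanted, `DataFrame.sample` accepts a `random_state` argument. The sketch below uses a small stand-in dataframe, and the seed value 42 is an arbitrary assumption, not something taken from the notebook.

```python
import pandas as pd

# Small stand-in for the combined dataframe
demo = pd.DataFrame({
    'subreddit': ['startrek', 'startrek', 'StarWars', 'StarWars'],
    'score': [1, 2, 3, 4],
})

# The same seed gives the same permutation on every call
first = demo.sample(frac=1, random_state=42).reset_index(drop=True)
second = demo.sample(frac=1, random_state=42).reset_index(drop=True)

print(first.equals(second))
```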

In [19]:
df_shuffled.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The more I think on it, Rian’s treatment of Sn... | Obviously The Last Jedi has its fans and it’s ... | StarWars | 1523027283 | NickMoore30 | 155 | 168 | True | 2018-04-06 |
| 1 | Tea, Earl Grey, Hot. | Did anyone ever think why Captain Picard alway... | startrek | 1567337333 | OX_Bigly | 26 | 1 | True | 2019-09-01 |
| 2 | The most improbable thing in Star trek imo | Just watched an episode of DS9 with Rom and Le... | startrek | 1575721353 | castiel65 | 6 | 1 | True | 2019-12-07 |
| 3 | Why did the Voyager crew ever go in space again? | Call me a gutless mah'tog if you want. But if ... | startrek | 1544332908 | zombiecmh | 6 | 1 | True | 2018-12-09 |
| 4 | REVOLUTION | Brother, sisters!! We have laser weapons and m... | startrek | 1551593655 | svetambara | 5 | 0 | True | 2019-03-03 |


In [20]:
df_shuffled.tail()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 9995 | I recently started Discovery. | And I hated the first season. I literally fast... | startrek | 1556995926 | MyNicknameIsDice | 24 | 0 | True | 2019-05-04 |
| 9996 | Han Solo | Does any one have or can find a 9 panel pictur... | StarWars | 1569852133 | Jesse-Jones | 0 | 0 | True | 2019-09-30 |
| 9997 | Triple Force Friday Australia | Does anyone know if any stores in Australia ar... | StarWars | 1569732434 | bdave3385 | 2 | 3 | True | 2019-09-29 |
| 9998 | Anyone else had to laugh at “what the frick” | Context: https://youtu.be/2pJlpCfueV0 | startrek | 1551523022 | tresslessone | 5 | 0 | True | 2019-03-02 |
| 9999 | Rewatching ENT, what a worthless crew! | Okay, so I'm currently rewatching ENT. The ser... | startrek | 1564941087 | rattlesnakejake90 | 15 | 0 | True | 2019-08-04 |


In [21]:
# Export the final dataframe
df_shuffled.to_csv('../data/final.csv')
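
With its default settings, `to_csv` writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read with a plain `pd.read_csv`. The sketch below shows one way to restore the index on reading. It uses an in-memory buffer and a one-row stand-in dataframe instead of the real file, so every name in it is illustrative.

```python
import io

import pandas as pd

# One-row stand-in for the exported dataframe
demo = pd.DataFrame({'title': ['I enjoyed Nemesis'], 'subreddit': ['startrek']})

# Write with the default settings, as in the export cell above
buffer = io.StringIO()
demo.to_csv(buffer)

# A plain read turns the saved index into a column named 'Unnamed: 0'
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing `index_col=0` uses the first column as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

An alternative is to pass `index=False` to `to_csv`, which leaves the index out of the file; whether that is appropriate depends on what later notebooks expect to find in it.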