In [1]:
import requests
import pandas as pd

## Getting the data

We want 1,000 unique posts from each subreddit so that the classes are balanced. The Pushshift API returns at most 500 posts per request, so I made two requests per subreddit: one in descending order (no `sort` parameter) and one in ascending order (`sort=asc`). I then dropped the duplicates, waited a day, and gathered more posts until I had 1,000 observations from each subreddit. A better approach would probably have been to specify the timeframe for each request, so I wouldn't have had to wait a day for new data. Note that at the end I save the final data frame of both subreddits, and the exploratory analysis uses only that file. If we run this notebook again, the data will change.
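The timeframe idea mentioned above could be sketched roughly as follows. This is only an illustration, not what the notebook actually ran: `collect_unique_posts` and `fetch_page` are hypothetical names, and the sketch assumes Pushshift accepts a `before` timestamp alongside the `after` parameter used later in this notebook. The request itself is passed in as a function, so the paging logic can be tried without calling the API.

```python
from typing import Any, Callable, Dict, List, Optional

Post = Dict[str, Any]


def collect_unique_posts(
    fetch_page: Callable[[Optional[int]], List[Post]],
    target: int,
) -> List[Post]:
    """Walk backwards in time until `target` posts with unique titles are found."""
    seen_titles = set()
    posts: List[Post] = []
    before: Optional[int] = None  # None asks for the newest posts first

    while len(posts) < target:
        page = fetch_page(before)
        if not page:
            break  # nothing older is available

        for post in page:
            if post['title'] not in seen_titles:
                seen_titles.add(post['title'])
                posts.append(post)

        # Next request: only posts older than the oldest one on this page
        oldest = min(post['created_utc'] for post in page)
        if before is not None and oldest >= before:
            break  # the cursor did not move; stop rather than loop forever
        before = oldest

    return posts[:target]
```

In this notebook, `fetch_page` would presumably wrap `requests.get` on the Pushshift search URL with the given `before` value and return the `'data'` list of the JSON response.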

To get our dataset, we only need the last cell of this notebook. When we explore the data we will just read in the dataset that we made in this notebook.

### Shower Thoughts

In [None]:
# We want 1,000 posts from each subreddit (equal class sizes); each request returns at most 500
show_url1 = 'https://api.pushshift.io/reddit/search/submission?subreddit=showerthoughts&size=500'
show_url2 = 'https://api.pushshift.io/reddit/search/submission?subreddit=showerthoughts&size=500&sort=asc'


In [None]:
show_req1 = requests.get(show_url1)
show_req2 = requests.get(show_url2)


In [None]:
print(show_req1.status_code)

print(show_req2.status_code)


In [None]:
shower1 = show_req1.json()
shower1 = shower1['data']
shower1 = pd.DataFrame(shower1)

In [None]:
shower2 = show_req2.json()
shower2 = shower2['data']
shower2 = pd.DataFrame(shower2)

In [None]:
shower1 = shower1[['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']]
shower2 = shower2[['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']]

In [None]:
temp = pd.concat([shower1, shower2], axis=0)

In [None]:
len(temp['title'].unique()) 

In [None]:
temp.drop_duplicates(subset = 'title', inplace = True)

In [None]:
len(temp['title']) 

This gave us 995 unique observations. We need 5 more, but we have to wait a while before calling the API again; otherwise it will just return the same posts. Our final shower dataframe will combine this temporary one with 5 new observations.
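The cells above repeat the same steps for each response: parse the JSON, keep six columns, concatenate, and drop duplicate titles. A minimal sketch of those steps as one helper is shown below. `build_unique_frame` is a hypothetical name, and the sketch assumes each payload has the same `{'data': [...]}` shape as the responses used in this notebook.

```python
import pandas as pd

COLUMNS = ['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']


def build_unique_frame(payloads, columns=COLUMNS):
    """Combine Pushshift-style JSON payloads into one frame with unique titles."""
    # reindex keeps the column order fixed and fills any missing column with NaN
    frames = [pd.DataFrame(payload['data']).reindex(columns=columns)
              for payload in payloads]
    combined = pd.concat(frames, axis=0, ignore_index=True)
    # Keep the first occurrence of each title, as in the cells above
    return combined.drop_duplicates(subset='title').reset_index(drop=True)
```

With the requests above, the call would look like `build_unique_frame([show_req1.json(), show_req2.json()])`, and `len(...)` of the result gives the number of unique titles.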

In [None]:
# Save to the Data folder, which is where the later cells read this file from
temp.to_csv('../Data/temp.csv', index = False)

### Stoner Philosophy

In [None]:
stone_url1 = 'https://api.pushshift.io/reddit/search/submission?subreddit=stonerphilosophy&size=500'
stone_url2 = 'https://api.pushshift.io/reddit/search/submission?subreddit=stonerphilosophy&size=500&sort=asc'

stone_req1 = requests.get(stone_url1)
stone_req2 = requests.get(stone_url2)

In [None]:
print(stone_req1.status_code)
print(stone_req2.status_code)

In [None]:
stoner1 = stone_req1.json()
stoner1 = stoner1['data']
stoner1 = pd.DataFrame(stoner1)

stoner2 = stone_req2.json()
stoner2 = stoner2['data']
stoner2 = pd.DataFrame(stoner2)

In [None]:
stoner1 = stoner1[['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']]
stoner2 = stoner2[['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']]

In [None]:
temp2 = pd.concat([stoner1, stoner2], axis=0)

In [None]:
len(temp2['title'].unique())

In [None]:
temp2.drop_duplicates(subset = 'title', inplace = True)

In [None]:
len(temp2['title'])

Interestingly, this also gave us 995 unique observations. We need 5 more, but we have to wait a while before calling the API again; otherwise it will just return the same posts. Our final stoner dataframe will combine this temporary one with 5 new observations.

In [None]:
# Save to the Data folder, which is where the later cells read this file from
temp2.to_csv('../Data/temp2.csv', index = False)

## Waiting for 5 more unique observations

### Getting 'extra' showerthoughts observations

In [None]:
temp = pd.read_csv('../Data/temp.csv')
url1 = 'https://api.pushshift.io/reddit/search/submission?subreddit=showerthoughts&size=10'
req1 = requests.get(url1)
temp1 = req1.json()
temp1 = temp1['data']
temp1 = pd.DataFrame(temp1)
temp1 = temp1[['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']]

In [None]:
df = pd.concat([temp, temp1], axis = 0)

In [None]:
len(df['title'].unique())

Our first dataframe had 995 unique titles, and it seems all 10 of the newly fetched posts are new. We will drop 5 rows so that exactly 1,000 remain, then save this dataframe to a new CSV.
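Slicing off the last rows only yields 1,000 unique posts if none of the new rows repeat an existing title. A slightly more defensive version of this step might look like the sketch below. `top_up` is a hypothetical helper, not something the notebook ran; it drops duplicate titles first and then keeps the first `target` rows.

```python
import pandas as pd


def top_up(base, extra, target=1000):
    """Append `extra` to `base`, drop duplicate titles, and keep `target` rows."""
    combined = pd.concat([base, extra], axis=0, ignore_index=True)
    unique = combined.drop_duplicates(subset='title')
    if len(unique) < target:
        # Not enough new posts yet, so more data has to be fetched first
        raise ValueError(f'only {len(unique)} unique titles, need {target}')
    return unique.head(target).reset_index(drop=True)
```

Here the call would be `top_up(temp, temp1)`, and the error branch covers the case where too few of the fetched posts are new.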

In [None]:
#delete the last 5 so we have 1000 observations
df = df[:-5]

In [None]:
len(df)

In [None]:
# Save to the Data folder, which is where the final cell reads this file from
df.to_csv('../Data/shower_final.csv', index = False)

### Getting 'extra' stonerphilosophy observations

There weren't enough new posts, so we have to get new data a different way. We will use the 'after' parameter in the Pushshift API URL. I looked up the 'created_utc' of the first observation and requested posts created after an arbitrary timestamp of the same magnitude.
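As a small illustration of how such a URL is put together, the sketch below builds the query string from a dictionary instead of typing it by hand. `build_search_url` is a hypothetical helper; the base URL and the parameter names (`subreddit`, `size`, `sort`, `after`) are the ones that appear in this notebook's URLs.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission'


def build_search_url(subreddit, size, after=None, sort=None):
    """Build a Pushshift submission-search URL from keyword arguments."""
    params = {'subreddit': subreddit, 'size': size}
    if sort is not None:
        params['sort'] = sort
    if after is not None:
        params['after'] = after  # Unix timestamp, same units as 'created_utc'
    return f'{BASE_URL}?{urlencode(params)}'


url = build_search_url('stonerphilosophy', size=20, after=1480139035)
```

The resulting `url` is the same string as the hand-written one in the next cell, and changing the `after` value is enough to request a different timeframe.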

In [29]:
temp = pd.read_csv('../Data/temp2.csv')
url1 = 'https://api.pushshift.io/reddit/search/submission?subreddit=stonerphilosophy&size=20&after=1480139035'
req1 = requests.get(url1)
temp1 = req1.json()
temp1 = temp1['data']
temp1 = pd.DataFrame(temp1)
temp1 = temp1[['created_utc', 'title', 'selftext', 'subreddit', 'permalink', 'author']]

In [30]:
df2 = pd.concat([temp, temp1], axis = 0)

In [31]:
len(df2['title'].unique())

1015

In [32]:
# drop the last 15 rows so we have 1000 observations
df2 = df2[:-15]

In [33]:
len(df2)

1000

In [34]:
# Save to the Data folder, which is where the final cell reads this file from
df2.to_csv('../Data/stoner_final.csv', index = False)

## Combined DataFrame

We can now combine the final shower and stoner dataframes into the single data frame that we will use for modeling.

In [36]:
df = pd.read_csv('../Data/shower_final.csv')
df2 = pd.read_csv('../Data/stoner_final.csv')

final_df = pd.concat([df, df2], axis=0)
final_df.to_csv('../Data/final.csv', index = False)