# PROJECT 3 - PART B

## COLLECTION OF KETO AND WINE THROUGH PUSHSHIFT IO

**OVERVIEW** 

After requests method yielded a high number of duplicates from the subreddit, followed instructor recommendation to utilize the following code to capture archive of records off of pushshift io.

In [1]:
import pandas as pd
import time
import requests
import datetime as dt


def get_date(created):
    # get the date of post
    return dt.date.fromtimestamp(created)



def query_pushshift(subreddit, kind='submission', skip=5, times=50, 
                    subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments',
                                'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):
    
    
    # get the base url that contains information I want to scrape where 'kind' are all submitted posts
    # and 'subreddit' is the specified subreddit. Get 500 posts.
    stem = f"https://api.pushshift.io/reddit/search/{kind}/?subreddit={subreddit}&size=500"
    
    # instantiate list to contain 
    mylist = []
   
    # scrape posts from the subreddit 'times' times
    for x in range(1, times + 1):
        # Get posts 'skip' * 'x' days ago
        URL = f"{stem}&after={skip * x}d"
        print(URL)
       
        # Scrape URL
        response = requests.get(URL)
       
        # Give me an AssertionError if status code not 200
        assert response.status_code == 200
       
        # Of the HTML scraped, take the values of 'data'
        the_json=response.json()
        no_blanks=[c for c in the_json['data'] if ('selftext' in c.keys()) and len(c['selftext'])>10]
        
        # turn the data into a dataframe
        df = pd.DataFrame.from_dict(no_blanks)
        
        # append the dataframe to mylist
        mylist.append(df)
        
        # wait to not overrun Reddit's resources
        time.sleep(3)
   
    # concatenate the dataframes together as one large dataframe, full
    full = pd.concat(mylist, sort=False)
    if kind == "submission":
       
        # take all speficied data
        full = full[subfield]
        
        # drop duplicate rows
        full = full.drop_duplicates()
        full = full.loc[full['is_self'] == True]
   
    # date the the post was... posted
    _timestamp = full["created_utc"].apply(get_date)
    full['timestamp'] = _timestamp
    print(full.shape)
    return full



#### ACCESS SUBREDDIT KETO FOR DATA

In [10]:
df=query_pushshift(subreddit = 'keto', skip = 1, times = 100)

https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=1d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=2d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=3d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=4d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=5d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=6d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=7d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=8d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=9d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=10d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=11d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500

https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=98d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=99d
https://api.pushshift.io/reddit/search/submission/?subreddit=keto&size=500&after=100d
(7900, 9)


In [12]:
df_push_keto = df

In [13]:
df_push_keto.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,It's been a hot minute,I keep getting off track when it comes to keto...,keto,1580160840,bbwsoontobebw,5,1,True,2020-01-27
1,Type 1 Diabetes on Keto,"Hello, I’m beginning my second month of keto. ...",keto,1580160905,iWantNotToWant,10,1,True,2020-01-27
2,"Week 3,5 and intense cravings for bread and rice","Week 3,5 and I am having intense cravings for ...",keto,1580161118,littleboo2theboo,10,1,True,2020-01-27
3,12-Week Challenge at the gym and looking for t...,"Hi guys,\n\nI am looking for some advice. I di...",keto,1580161503,chow_shepard,4,1,True,2020-01-27
4,"Does never going ""Full Keto"" hurt my health?",Hello!\n\nI'm a active highschooler who avoids...,keto,1580161639,OrganizingChaosBrb,3,1,True,2020-01-27


In [14]:
df_push_keto.to_csv('keto_push_output.csv')

#### ACCESS SUBREDDIT WINE FOR DATA

In [17]:
df_push_wine=query_pushshift(subreddit = 'wine', skip = 1, times = 300)

https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=1d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=2d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=3d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=4d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=5d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=6d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=7d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=8d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=9d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=10d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=11d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500

https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=98d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=99d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=100d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=101d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=102d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=103d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=104d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=105d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=106d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=107d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=108d
https://api.pushshift.io/reddit/search/submission/?subre

https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=194d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=195d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=196d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=197d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=198d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=199d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=200d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=201d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=202d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=203d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=204d
https://api.pushshift.io/reddit/search/submission/?sub

https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=290d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=291d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=292d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=293d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=294d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=295d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=296d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=297d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=298d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=299d
https://api.pushshift.io/reddit/search/submission/?subreddit=wine&size=500&after=300d
(2917, 9)


In [18]:
df_push_wine.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Looking for a summer job,"Hi,\n\nI'm a 23 year-old man from the Netherla...",wine,1580165102,Loirettoux,3,1,True,2020-01-27
1,California whites,"Whenever I try a Californian wine, it's always...",wine,1580170191,RaphGiroux,8,1,True,2020-01-27
2,Wine suggestions while in France (Provence),Will be traveling to France and spending most ...,wine,1580177480,irishmuse,4,1,True,2020-01-27
3,Show me the Munny Hunny | Wine Industry Report...,"If you are in the wine industry, congratulatio...",wine,1580177690,cudaeducation,0,1,True,2020-01-27
4,Going to Burgundy in May,"Hi, I'll be going to Burgundy during May and w...",wine,1580182920,GaanZi,5,1,True,2020-01-27


In [19]:
df_push_wine.to_csv('wine_push_output.csv')