# Data Collection

### In this notebook, we will collect data from the top 1000 posts of Reddit's r/wine subreddit using Reddit's API.

#### First, we will load in our login info for authorization.

In [1]:
# Load in login info
import pickle

def load_object(filename):
    try:
        with open(filename, "rb") as f:
            return pickle.load(f)
    except Exception as ex:
        print("Error during unpickling object:", ex)
        # re-raise so later cells do not run with login_info set to None
        raise

login_info = load_object("login_info.pickle")
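
The cell above assumes `login_info.pickle` already exists. A minimal sketch of how such a file might be created, assuming it holds a plain dictionary with the four keys used in the next cell (`client_id`, `secret`, `username`, `password`). The `save_object` helper and the placeholder values are hypothetical and not part of the original notebook.

```python
import os
import pickle
import tempfile

def save_object(obj, filename):
    # hypothetical counterpart to load_object: pickle obj to filename
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

# placeholder values only; real values come from your Reddit app settings
example_login_info = {
    "client_id": "YOUR_CLIENT_ID",
    "secret": "YOUR_CLIENT_SECRET",
    "username": "YOUR_REDDIT_USERNAME",
    "password": "YOUR_REDDIT_PASSWORD",
}

# write to a temporary directory so this sketch cannot overwrite a real file
example_path = os.path.join(tempfile.mkdtemp(), "login_info.pickle")
save_object(example_login_info, example_path)

# read it back the same way load_object does
with open(example_path, "rb") as f:
    restored = pickle.load(f)
```

Because the file contains a password in recoverable form, it is worth keeping it out of version control.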

#### Next, we request an OAuth token so that we can make authorized requests.

In [89]:
# Request OAuth token from Reddit

import requests

# Script Authentication
auth = requests.auth.HTTPBasicAuth(login_info['client_id'], login_info['secret'])

# pass in login method
data = {'grant_type' : 'password',
        'username' : login_info['username'],
        'password' : login_info['password']}

# setup header info
headers = {'User-Agent' : 'desktop:myDataAnalysisPortfolioProject:v0.0.1 (by /u/bwZel)'}

# send request for OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# fail early on an HTTP error, then convert response to JSON and pull access_token value
res.raise_for_status()
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization' : f"bearer {TOKEN}"}}

# verify the token works; every later request must also pass headers=headers
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>
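
If the credentials are wrong, the token response may not contain an `access_token` key at all, and the lookup fails with a bare `KeyError`. A minimal sketch of a more explicit check, run here against hand-written payloads rather than a live response. The `build_auth_headers` helper and both sample payloads are hypothetical, and the exact shape of Reddit's error body is an assumption.

```python
def build_auth_headers(token_payload, base_headers):
    # token_payload is the parsed JSON body of the access_token request
    if "access_token" not in token_payload:
        raise RuntimeError(f"No access token in response: {token_payload}")
    # return a new dict so the caller's base headers are left unchanged
    return {**base_headers,
            "Authorization": f"bearer {token_payload['access_token']}"}

# invented payloads for illustration only
sample_ok = {"access_token": "abc123", "token_type": "bearer"}
sample_bad = {"error": "invalid_grant"}  # assumed error shape
base = {"User-Agent": "example-agent"}

auth_headers = build_auth_headers(sample_ok, base)

try:
    build_auth_headers(sample_bad, base)
    failed = False
except RuntimeError:
    failed = True
```

In the notebook, the equivalent call would be `build_auth_headers(res.json(), headers)`.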

#### Now that we are authorized, we can begin requesting data from Reddit's API. In this project, we gather data from the r/wine subreddit and look for insights in it.

#### This function will be used for extracting the relevant information from our response and creating a Pandas DataFrame from that information.

In [2]:
from datetime import datetime, timezone

import pandas as pd

# Use this function to convert responses to dataframes
def df_from_response(res):
    # collect one dict per post, then build the dataframe in a single step
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    # loop through each post pulled from res and keep the necessary info
    for post in res.json()['data']['children']:
        rows.append({
            'subreddit' : post['data']['subreddit_name_prefixed'],
            'title' : post['data']['title'],
            'selftext' : post['data']['selftext'],
            'upvote_ratio' : post['data']['upvote_ratio'],
            'ups' : post['data']['ups'],
            'downs' : post['data']['downs'],
            'score' : post['data']['score'],
            # created_utc is a Unix timestamp in UTC, so convert it as UTC
            # rather than as local time
            'created_utc' : datetime.fromtimestamp(post['data']['created_utc'], tz=timezone.utc),
            'id' : post['data']['id'],
            'kind' : post['kind']
        })

    return pd.DataFrame(rows)
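
To make the expected response shape concrete, here is a minimal sketch that runs the same extraction on a hand-written payload. The `FakeResponse` class, the `rows_from_response` helper, and all values in the payload are hypothetical. Only a few of the fields are included, and the result is a plain list of dictionaries instead of a DataFrame so that the sketch runs without pandas.

```python
from datetime import datetime, timezone

class FakeResponse:
    # hypothetical stand-in for requests.Response; only .json() is needed
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

# invented listing with two posts; "t3" is the kind prefix used for posts
sample = FakeResponse({"data": {"children": [
    {"kind": "t3", "data": {"id": "abc", "title": "First post", "created_utc": 0.0}},
    {"kind": "t3", "data": {"id": "def", "title": "Second post", "created_utc": 86400.0}},
]}})

def rows_from_response(res):
    rows = []
    for post in res.json()["data"]["children"]:
        rows.append({
            "id": post["data"]["id"],
            "title": post["data"]["title"],
            # interpret the Unix timestamp as UTC
            "created_utc": datetime.fromtimestamp(post["data"]["created_utc"], tz=timezone.utc),
            # fullname = kind + "_" + id, the form used for the `after` parameter
            "fullname": post["kind"] + "_" + post["data"]["id"],
        })
    return rows

rows = rows_from_response(sample)
```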

#### Here, we make 10 requests of 100 posts each to get the top 1000 posts from Reddit's r/wine subreddit.

In [4]:
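# Gather the data
import pandas as pd

batches = []
params = {'limit' : 100, 't' : 'all'}

for i in range(10):
    # Make request (HTTPS is required for the OAuth endpoint)
    res = requests.get("https://oauth.reddit.com/r/wine/top",
                       headers=headers,
                       params=params)
    res.raise_for_status()

    # Get dataframe from response
    new_df = df_from_response(res)

    # Stop early if the listing has no more posts
    if new_df.empty:
        break

    # Get final row (last post in this batch; the listing is sorted by score, not date)
    row = new_df.iloc[-1]

    # Create fullname
    fullname = row['kind'] + '_' + row['id']

    # Add/update after params so the next request continues from this post
    params['after'] = fullname

    # Keep this batch for the final dataFrame
    batches.append(new_df)

# Combine all batches (DataFrame.append was removed in pandas 2.0)
subreddit_data = pd.concat(batches, ignore_index=True) if batches else pd.DataFrame()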
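
Reddit listing responses also carry an `after` field under `data`, which holds the fullname of the last item on the page, or `null` when there are no further pages. A minimal sketch of cursor-based pagination that reads that field instead of rebuilding the fullname. Here `fetch_page` is a hypothetical stand-in that serves fake pages from memory; in the notebook it would wrap `requests.get(...).json()`. The 250-post listing is invented.

```python
FAKE_IDS = [f"id{n}" for n in range(250)]  # pretend the listing has 250 posts

def fetch_page(params):
    # hypothetical stand-in for requests.get(url, headers=..., params=params).json()
    limit = params.get("limit", 25)
    after = params.get("after")
    start = 0
    if after is not None:
        # strip the "t3_" kind prefix and continue after that post
        start = FAKE_IDS.index(after.split("_", 1)[1]) + 1
    chunk = FAKE_IDS[start:start + limit]
    more = start + limit < len(FAKE_IDS)
    return {"data": {
        "children": [{"kind": "t3", "data": {"id": i}} for i in chunk],
        "after": f"t3_{chunk[-1]}" if more else None,
    }}

def collect_posts(max_requests=10, limit=100):
    params = {"limit": limit}
    posts = []
    for _ in range(max_requests):
        page = fetch_page(params)["data"]
        posts.extend(page["children"])
        # a missing cursor means the listing is exhausted
        if page["after"] is None:
            break
        params["after"] = page["after"]
    return posts

all_posts = collect_posts()
```

With this design the loop stops by itself when the listing runs out, so a subreddit with fewer posts than expected does not cause repeated or failing requests.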
    


#### Lastly, we save our data as a CSV file for analysis.

In [49]:
# Now we save this data as a .csv file
subreddit_data.to_csv("RedditWineData.csv", index=False)