## Table of Contents:
* [Import Packages](#first-bullet)
* [Function to Pull Data from Reddit API](#second-bullet)
* [Import Canoe Subreddit](#third-bullet)
* [Import Table Tennis Subreddit](#fourth-bullet)

## Import Packages <a class="anchor" id="first-bullet"></a>

In [6]:
# Import libraries
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

## Function to Pull Data from Reddit API <a class="anchor" id="second-bullet"></a>

The function below pulls posts from the Reddit API. Each request returns up to 25 posts, and the function pauses for one second
between requests.

In [7]:
# Function to pull subreddit posts
def get_subreddit(url, n_pulls, headers):    

    # Create empty templates
    posts = []
    after = None

    # Loop over the requested number of pulls; each request returns up to 25 posts
    for pull_num in range(n_pulls):
        print("Pulling data attempted", pull_num+1,"time(s)")

        if after is None:
            new_url = url                 # base case
        else:
            new_url = url+"?after="+after # subsequent iterations

        res = requests.get(new_url, headers=headers)

        if res.status_code == 200:
            subreddit_json = res.json()                      # Pull JSON
            posts.extend(subreddit_json['data']['children']) # Get subreddit posts
            after = subreddit_json['data']['after']          # 'after' = ID of the last post in this iteration
        else:
            print("We've run into an error. The status code is:", res.status_code)
            break

        time.sleep(1)
        
    return posts
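
The pagination logic can be tried out without touching the network. The sketch below is a simplified illustration, not the function above: `paginate`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the in-memory pages only imitate the `data.children` / `data.after` shape used above. It also adds one assumption-based safeguard: it stops when `after` comes back as `None`, since the loop would otherwise fall back to the base URL and could collect the first page again.

```python
def paginate(fetch_page, n_pulls):
    """Collect posts page by page, following the 'after' token."""
    posts = []
    after = None
    for _ in range(n_pulls):
        page = fetch_page(after)                 # one "request" per pull
        posts.extend(page["data"]["children"])   # posts on this page
        after = page["data"]["after"]            # token for the next page
        if after is None:                        # assumed end-of-listing signal
            break
    return posts


# Hypothetical in-memory listing with two pages (3 posts in total)
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"title": "a"}},
                                 {"data": {"title": "b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"title": "c"}}],
                      "after": None}},
}


def fake_fetch(after):
    """Stand-in for requests.get(...).json(); looks up a canned page."""
    return FAKE_PAGES[after]


demo_posts = paginate(fake_fetch, n_pulls=5)
print(len(demo_posts))  # 3
```

Passing the fetch step in as a function keeps the loop testable; in real use it would wrap the `requests.get` call shown above.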

## Import Canoe Subreddit <a class="anchor" id="third-bullet"></a>

Using the function defined above, we now pull the canoeing subreddit data. At up to 25 posts per request, 43 requests return
at most 1,075 posts, enough to collect roughly 1,000.
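
As a quick check of that arithmetic, the snippet below works out how many pulls a target of 1,000 posts needs and the upper bound for 43 pulls. The constant and variable names are illustrative, and it assumes every request returns the full 25 posts.

```python
import math

POSTS_PER_PULL = 25   # posts returned per request, as described above
target_posts = 1000   # rough number of posts we want
n_pulls = 43          # value used in this notebook

min_pulls = math.ceil(target_posts / POSTS_PER_PULL)  # fewest pulls that reach the target
max_posts = n_pulls * POSTS_PER_PULL                  # upper bound for 43 pulls

print(min_pulls, max_posts)  # 40 1075
```

The three extra pulls leave some slack in case a few requests return fewer than 25 posts.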

In [8]:
# Define URL and request headers
url_name_canoe = "https://www.reddit.com/r/canoeing.json"
headers = {"User-agent": 'phillipdibert'}      # header to prevent 429 error

In [9]:
# Run function (from above) to pull data from canoe subreddit
canoeing_pull = get_subreddit(url_name_canoe, n_pulls = 43, headers = headers)

Pulling data attempted 1 time(s)
Pulling data attempted 2 time(s)
Pulling data attempted 3 time(s)
Pulling data attempted 4 time(s)
Pulling data attempted 5 time(s)
Pulling data attempted 6 time(s)
Pulling data attempted 7 time(s)
Pulling data attempted 8 time(s)
Pulling data attempted 9 time(s)
Pulling data attempted 10 time(s)
Pulling data attempted 11 time(s)
Pulling data attempted 12 time(s)
Pulling data attempted 13 time(s)
Pulling data attempted 14 time(s)
Pulling data attempted 15 time(s)
Pulling data attempted 16 time(s)
Pulling data attempted 17 time(s)
Pulling data attempted 18 time(s)
Pulling data attempted 19 time(s)
Pulling data attempted 20 time(s)
Pulling data attempted 21 time(s)
Pulling data attempted 22 time(s)
Pulling data attempted 23 time(s)
Pulling data attempted 24 time(s)
Pulling data attempted 25 time(s)
Pulling data attempted 26 time(s)
Pulling data attempted 27 time(s)
Pulling data attempted 28 time(s)
Pulling data attempted 29 time(s)
Pulling data attempted 

The function returns a list containing one dictionary per post. The for loop below extracts each post's title and subreddit
name, and the results are stored in a DataFrame.

In [25]:
canoeing_title = []
for i in canoeing_pull:
    can = {}
    can['post_title'] = i['data']['title']
    can['post_sub'] = i['data']['subreddit']
    canoeing_title.append(can) 
    
canoe_df = pd.DataFrame(canoeing_title)
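
The loop above can also be written as a list comprehension. This is only a sketch: `extract_rows` and `sample_posts` are hypothetical, and the sample data merely mimics the nested `data` structure of the API response.

```python
def extract_rows(posts):
    """Return one {'post_title', 'post_sub'} dict per post."""
    return [
        {"post_title": post["data"]["title"],
         "post_sub": post["data"]["subreddit"]}
        for post in posts
    ]


# Hypothetical sample shaped like the items returned by get_subreddit
sample_posts = [
    {"data": {"title": "Ottertail paddle", "subreddit": "canoeing"}},
    {"data": {"title": "My very favorite possession.", "subreddit": "canoeing"}},
]

rows = extract_rows(sample_posts)
print(rows[0]["post_title"])  # Ottertail paddle
```

The resulting list of dictionaries can be passed to `pd.DataFrame` exactly as in the cell above, and the same helper would work for any subreddit.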

In [26]:
# inspect canoe dataframe
canoe_df.head()

| | post_sub | post_title |
|---|---|---|
| 0 | canoeing | BWCA has been opened to Mining - Please Speak Out |
| 1 | canoeing | Surfing in Alberta #canoeing #surfing #esquif |
| 2 | canoeing | Ottertail paddle |
| 3 | canoeing | My very favorite possession. |
| 4 | canoeing | I thought you guys might like this. Solo open ... |


In [27]:
len(canoe_df)

1070

Now that we have a DataFrame with the data needed for modeling, we save it to a CSV file for use in the EDA and
modeling notebooks.

In [28]:
canoe_df.to_csv('./canoe_data.csv')
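
One detail to keep in mind when the file is read back: by default `to_csv` also writes the row index, which `pd.read_csv` then loads as a column named `Unnamed: 0`. The sketch below shows the difference on a small hypothetical DataFrame (`demo_df` is not part of the notebook) and writes to an in-memory string rather than to disk. Passing `index=False` is one way to avoid the extra column, if that is preferred.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same columns as canoe_df
demo_df = pd.DataFrame({
    "post_sub": ["canoeing", "canoeing"],
    "post_title": ["Ottertail paddle", "My very favorite possession."],
})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(demo_df.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(demo_df.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'post_sub', 'post_title']
print(list(without_index.columns))  # ['post_sub', 'post_title']
```

Alternatively, the downstream notebooks can keep the file as written and read it with `index_col=0`.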

## Import Table Tennis Subreddit <a class="anchor" id="fourth-bullet"></a>

Using the same process as for the canoeing subreddit, we pull the table tennis data from the Reddit API,
extract each post's title and subreddit name, and export the resulting DataFrame to a CSV file.

In [13]:
# Define URL and request headers
url_name_tennis = "https://www.reddit.com/r/tabletennis.json"
headers = {"User-agent": 'phillipdibert'} 

In [20]:
# pull tennis data using function
tennis_pull = get_subreddit(url_name_tennis, n_pulls = 43, headers = headers)

Pulling data attempted 1 time(s)
Pulling data attempted 2 time(s)
Pulling data attempted 3 time(s)
Pulling data attempted 4 time(s)
Pulling data attempted 5 time(s)
Pulling data attempted 6 time(s)
Pulling data attempted 7 time(s)
Pulling data attempted 8 time(s)
Pulling data attempted 9 time(s)
Pulling data attempted 10 time(s)
Pulling data attempted 11 time(s)
Pulling data attempted 12 time(s)
Pulling data attempted 13 time(s)
Pulling data attempted 14 time(s)
Pulling data attempted 15 time(s)
Pulling data attempted 16 time(s)
Pulling data attempted 17 time(s)
Pulling data attempted 18 time(s)
Pulling data attempted 19 time(s)
Pulling data attempted 20 time(s)
Pulling data attempted 21 time(s)
Pulling data attempted 22 time(s)
Pulling data attempted 23 time(s)
Pulling data attempted 24 time(s)
Pulling data attempted 25 time(s)
Pulling data attempted 26 time(s)
Pulling data attempted 27 time(s)
Pulling data attempted 28 time(s)
Pulling data attempted 29 time(s)
Pulling data attempted 

In [21]:
tennis_title = []
for i in tennis_pull:
    ten = {}
    ten['post_title'] = i['data']['title']
    ten['post_sub'] = i['data']['subreddit']
    tennis_title.append(ten) 
    
table_tennis_df = pd.DataFrame(tennis_title)

In [22]:
# inspect dataframe
table_tennis_df.head()

| | post_sub | post_title |
|---|---|---|
| 0 | tabletennis | Need a new paddle? Check here first! |
| 1 | tabletennis | Weekly Table Tennis Advice - March 24, 2019 |
| 2 | tabletennis | Need opinions on current setup and maybe recom... |
| 3 | tabletennis | How to stop getting tendonitis and tennis elbo... |
| 4 | tabletennis | Should I master the pendulum serve first or ca... |


In [23]:
# check length of df
len(table_tennis_df)

1063

In [29]:
# export to csv, to use in later work
table_tennis_df.to_csv('./table_tennis_data.csv')