# General Assembly - Project 3: Web APIs & Classification
## Notebook 2: Web Scraping
### DSI19 / Jordan David Nalpon

### Notebook 2 Index
---
* [Import Libraries](#lib)
* [Scraping through Reddit's API](#api)
---
<a name="index"></a>

### Notebook 2: Web Scraping

This notebook scrapes posts from r/tifu and r/confessions and exports the data as CSV files for the next notebook. Note that the file names in the CSV export code contain '_lock' to prevent overwriting the original scraped data.

<a name="lib"></a>
### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import time
import requests
import ast

<a name="api"></a>
### Scraping through Reddit's API

In [4]:
# identify the client with a custom User-agent header
headers_dict = {'User-agent': 'AgentJordan'}

In [5]:
# base URLs for the two subreddits to loop through
url = 'https://reddit.com/'
sub01_url = url + 'r/tifu/new'      # sub01: r/tifu, 'new' listing
sub02_url = url + 'r/confessions'   # sub02: r/confessions, default listing (no '/new')

limit_num = 100     # API 'limit' parameter

sub01_after = None  # API 'after' cursors; None until the first page has been pulled
sub02_after = None

sub01_pages = []    # instantiate empty lists to save API results
sub02_pages = []

for i in range(10): # 10 iterations, one request per subreddit each (20 requests in total)
    
    # add 'after' parameters if an id has been saved - starts as None
    if sub01_after and sub02_after:
        # create full API url for sub01
        sub01_after_url = sub01_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub01_after
        print(sub01_after_url)
        
        # create full API url for sub02
        sub02_after_url = sub02_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub02_after
        print(sub02_after_url)
    
    # if one after is logged and the other is not
    elif bool(sub01_after) != bool(sub02_after):
        print('After reference out of sync.')
        break
    
    else:
        # create first run url
        sub01_after_url = sub01_url + '.json?limit=' + str(limit_num)
        sub02_after_url = sub02_url + '.json?limit=' + str(limit_num)
    
    # pull json from sub01
    sub01_res = requests.get(sub01_after_url, headers=headers_dict)
    print(i, sub01_res.status_code)
    
    # if sub01 connection is established
    if sub01_res.status_code == 200:
        # parse the response once and add the page to the list
        sub01_data = sub01_res.json()['data']
        sub01_pages.append(sub01_data)
        print('sub01_pages length: ', len(sub01_pages))
        
        # set 'after' parameter for next run
        sub01_after = sub01_data['after']
        print('sub01_after: ', sub01_after)
        
    else:
        print('Connection failed.\n')
    
    # sleep one second
    time.sleep(1)
    
    # pull json from sub02
    sub02_res = requests.get(sub02_after_url, headers=headers_dict)
    print(i, sub02_res.status_code)
    
    # if sub02 connection is established
    if sub02_res.status_code == 200:
        # parse the response once and add the page to the list
        sub02_data = sub02_res.json()['data']
        sub02_pages.append(sub02_data)
        print('sub02_pages length: ', len(sub02_pages))
        
        # set 'after' parameter for next run
        sub02_after = sub02_data['after']
        print('sub02_after: ', sub02_after)
    else:
        print('Connection failed.\n')
        
    # sleep one second    
    time.sleep(1)


0 200
sub01_pages length:  1
sub01_after:  t3_kvkjnc
0 200
sub02_pages length:  1
sub02_after:  t3_kvmstp
https://reddit.com/r/tifu/new.json?limit=100&after=t3_kvkjnc
https://reddit.com/r/confessions.json?limit=100&after=t3_kvmstp
1 200
sub01_pages length:  2
sub01_after:  t3_kuefo8
1 200
sub02_pages length:  2
sub02_after:  t3_kv9avn
https://reddit.com/r/tifu/new.json?limit=100&after=t3_kuefo8
https://reddit.com/r/confessions.json?limit=100&after=t3_kv9avn
2 200
sub01_pages length:  3
sub01_after:  t3_ktdzt7
2 200
sub02_pages length:  3
sub02_after:  t3_kucuo5
https://reddit.com/r/tifu/new.json?limit=100&after=t3_ktdzt7
https://reddit.com/r/confessions.json?limit=100&after=t3_kucuo5
3 200
sub01_pages length:  4
sub01_after:  t3_ks5kcx
3 200
sub02_pages length:  4
sub02_after:  t3_ktgszs
https://reddit.com/r/tifu/new.json?limit=100&after=t3_ks5kcx
https://reddit.com/r/confessions.json?limit=100&after=t3_ktgszs
4 200
sub01_pages length:  5
sub01_after:  t3_krb9ut
4 200
sub02_pages lengt
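
The loop above pages through each listing by passing the previous response's `after` value back to the API. As a minimal sketch of the same idea for a single subreddit, the pagination can be separated from the HTTP call. The names `build_listing_url`, `collect_pages` and the `fetch` callable are hypothetical and not part of the original notebook; the sketch also assumes that a listing reports `after` as `None` once there are no further pages.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, limit=100, after=None):
    """Return the .json listing URL, adding 'after' only when a cursor exists."""
    params = {'limit': limit}
    if after:
        params['after'] = after
    return base_url + '.json?' + urlencode(params)


def collect_pages(base_url, fetch, n_pages=10, limit=100):
    """Follow the 'after' cursor for up to n_pages requests.

    `fetch` is a hypothetical callable: it takes a URL and returns the
    listing's 'data' dict, or None when the request did not succeed.
    """
    pages = []
    after = None
    for _ in range(n_pages):
        data = fetch(build_listing_url(base_url, limit, after))
        if data is None:        # request failed: stop instead of looping on
            break
        pages.append(data)
        after = data.get('after')
        if after is None:       # assumed to mean the listing is exhausted
            break
    return pages
```

In the notebook, `fetch` could wrap `requests.get(url, headers=headers_dict)`, return `res.json()['data']` on a 200 status, and call `time.sleep(1)` between requests. Keeping the HTTP call outside `collect_pages` lets the pagination logic be checked without touching the network.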

In [6]:
# create DataFrames from the page lists (one row per API page of up to 100 posts)
df_tifu = pd.DataFrame(sub01_pages)
df_confessions = pd.DataFrame(sub02_pages)

In [7]:
df_tifu.head()

| | modhash | dist | children | after | before |
|---|---|---|---|---|---|
| 0 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_kvkjnc | |
| 1 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_kuefo8 | |
| 2 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_ktdzt7 | |
| 3 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_ks5kcx | |
| 4 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_krb9ut | |


In [8]:
df_confessions.head()

| | modhash | dist | children | after | before |
|---|---|---|---|---|---|
| 0 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_kvmstp | |
| 1 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_kv9avn | |
| 2 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_kucuo5 | |
| 3 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_ktgszs | |
| 4 | | 100 | [{'kind': 't3', 'data': {'approved_at_utc': No... | t3_ksboej | |
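
Each row of these DataFrames is one API page, with the individual posts nested in the `children` column as a list of `{'kind': ..., 'data': {...}}` dicts. As a rough sketch of how that nesting could be flattened into one record per post, the hypothetical helper below (not part of the original notebook) walks the saved pages. The selection of fields is an assumption made for illustration; any keys present in a post's `data` dict could be used instead.

```python
def pages_to_posts(pages, fields=('name', 'subreddit', 'title', 'selftext')):
    """Flatten a list of listing pages into one dict per post.

    `pages` has the same shape as sub01_pages / sub02_pages: each page is a
    dict whose 'children' list holds {'kind': ..., 'data': {...}} entries.
    `fields` is an illustrative choice of post keys, not the author's.
    """
    rows = []
    for page in pages:
        for child in page['children']:
            post = child['data']
            # .get() leaves a missing key as None instead of raising
            rows.append({field: post.get(field) for field in fields})
    return rows
```

Passing the result to `pd.DataFrame`, for example `pd.DataFrame(pages_to_posts(sub01_pages))`, would give one row per post rather than one row per page.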


In [9]:
# export the DataFrames as CSV files
# the '_lock' in the file names prevents overwriting the dataset used in the other notebooks
df_tifu.to_csv(r'../01_data/notebook2_df_tifu_lock.csv', index=False)
df_confessions.to_csv(r'../01_data/notebook2_df_confessions_lock.csv', index=False)
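
The '_lock' file names guard the data by naming convention only. As one possible complement, the sketch below writes a file only if it does not exist yet, using Python's exclusive-create mode (`'x'`). The helper name `write_if_absent` is hypothetical and not part of the original notebook.

```python
def write_if_absent(path, text):
    """Write text to path only if no file exists there yet.

    Returns True if the file was written and False if it was left untouched.
    Mode 'x' makes open() raise FileExistsError instead of overwriting.
    """
    try:
        # newline='' writes the text as-is, without translating line endings
        with open(path, 'x', encoding='utf-8', newline='') as fh:
            fh.write(text)
    except FileExistsError:
        return False
    return True
```

Since `DataFrame.to_csv` returns the CSV as a string when no path is given, a call such as `write_if_absent('../01_data/notebook2_df_tifu_lock.csv', df_tifu.to_csv(index=False))` would leave an existing export unchanged when the notebook is re-run.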