# Project 3: Web APIs & Classification

## Web Scraping notebook
The two subreddits chosen for web scraping and classification are:
- [Ask Historians](https://www.reddit.com/r/AskHistorians/)
- [Ask Science](https://www.reddit.com/r/askscience/)


This notebook details the web scraping process: fetching posts from each subreddit and saving them as CSV files for the classification step.

In [1]:
import requests
import pandas as pd
import time
import random

In [2]:
url_askhistorians = 'https://www.reddit.com/r/AskHistorians/.json'
url_askscience = 'https://www.reddit.com/r/askscience/.json'

In [3]:
res_askhistorians = requests.get(url_askhistorians, headers={'User-agent': 'Pony Inc 1.0'})
res_askscience = requests.get(url_askscience, headers={'User-agent': 'Pony Inc 1.0'})

In [4]:
print(res_askhistorians.status_code)
print(res_askscience.status_code)

200
200


In [5]:
reddit_dict_askhistorians = res_askhistorians.json()
reddit_dict_askscience = res_askscience.json()

In [6]:
reddit_dict_askhistorians

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'AskHistorians',
     'selftext': '[Previous](/r/AskHistorians/search?q=title%3A"Sunday+Digest"+OR+title%3A"Day+of+Reflection"&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n\nToday: \n\nWelcome to this week\'s instalment of /r/AskHistorians\' Sunday Digest (formerly the Day of Reflection). Nobody can read all the questions and answers that are posted here, so in this thread we invite you to share anything you\'d like to highlight from the last week - an interesting discussion, an informative answer, an insightful question that was overlooked, or anything else.',
     'author_fullname': 't2_6l4z3',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Sunday Digest | Interesting &amp; Overlooked Posts | May 04, 2020–May 10, 2020',
     'link_flair_richtext': [{'e': 'text', 't': 'Digest'}],
   

In [7]:
reddit_dict_askhistorians.keys()

dict_keys(['kind', 'data'])

In [8]:
reddit_dict_askhistorians['kind']

'Listing'

In [9]:
reddit_dict_askhistorians['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])
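The keys above show how a listing is laid out: the posts sit under `data` and then `children`, and `data['after']` holds the token used to request the next page. A minimal sketch of pulling both out, using a small hand-made dictionary in place of the real response (`sample_listing` and the helper name `split_listing` are illustrative, not part of the original notebook):

```python
def split_listing(listing):
    """Return (posts, after_token) from a Reddit-style listing dictionary."""
    data = listing['data']
    # Each child wraps the actual post fields under its own 'data' key.
    posts = [child['data'] for child in data['children']]
    return posts, data.get('after')


# Hand-made stand-in for res.json(); the values are illustrative only.
sample_listing = {
    'kind': 'Listing',
    'data': {
        'modhash': '',
        'dist': 2,
        'children': [
            {'kind': 't3', 'data': {'subreddit': 'AskHistorians', 'title': 'First post'}},
            {'kind': 't3', 'data': {'subreddit': 'AskHistorians', 'title': 'Second post'}},
        ],
        'after': 't3_example',
        'before': None,
    },
}

posts, after = split_listing(sample_listing)
print(len(posts), after)
```

The same two values (`posts` and `after`) are what the scraping loop further down collects on every request.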

In [10]:
reddit_dict_askscience

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'askscience',
     'selftext': "Hello everyone! We thought it was time for a meta post to connect with our community. We have two topics we'd like to cover today. Please grab a mug of tea and pull up a comfy chair so we can have a chat.\n\n---\n\n**COVID-19**\n\nFirst, we wanted to talk about COVID-19. The mod team and all of our expert panelists have been working overtime to address as many of your questions as we possibly can. People are understandably scared, and we are grateful that you view us as a trusted source of information right now. We are doing everything we can to offer information that is timely and accurate. \n\nWith that said, there are some limits to what we can do. There are a lot of unknowns surrounding this virus and the disease it causes. Our policy has always been to rely on peer-reviewed science wherever possible, and an

In [11]:
#Using the loop function from DSI-14 week 4 api class

posts_askhistorians = []
after = None

#20 requests of up to 50 posts each to get close to 1000
for a in range(20):
    if after is None:
        current_url = url_askhistorians
    else:
        current_url = url_askhistorians + '?after=' + after +'&limit=50' #changing the limit to 50 posts
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    print('No. of posts ' +str(len(current_posts)))
    
    # Extending posts_askhistorians list
    posts_askhistorians.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,30)
    print(sleep_duration)
    time.sleep(sleep_duration) #program sleeps for the sleep_duration

https://www.reddit.com/r/AskHistorians/.json
No. of posts 27
13
https://www.reddit.com/r/AskHistorians/.json?after=t3_gh01vp&limit=50
No. of posts 50
7
https://www.reddit.com/r/AskHistorians/.json?after=t3_ggwz45&limit=50
No. of posts 50
18
https://www.reddit.com/r/AskHistorians/.json?after=t3_gglcng&limit=50
No. of posts 50
14
https://www.reddit.com/r/AskHistorians/.json?after=t3_ggdxst&limit=50
No. of posts 50
28
https://www.reddit.com/r/AskHistorians/.json?after=t3_ggjkzq&limit=50
No. of posts 50
18
https://www.reddit.com/r/AskHistorians/.json?after=t3_gg4s4u&limit=50
No. of posts 50
28
https://www.reddit.com/r/AskHistorians/.json?after=t3_gg0ovg&limit=50
No. of posts 50
7
https://www.reddit.com/r/AskHistorians/.json?after=t3_gfqy4n&limit=50
No. of posts 50
26
https://www.reddit.com/r/AskHistorians/.json?after=t3_gfvr3e&limit=50
No. of posts 50
18
https://www.reddit.com/r/AskHistorians/.json?after=t3_gfd7q9&limit=50
No. of posts 50
16
https://www.reddit.com/r/AskHistorians/.json?aft

In [12]:
#Checking the length of the posts retrieved
len(posts_askhistorians)

977
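The first request in the loop above carries no `limit` parameter, which is why it returns 27 posts instead of 50. One possible way to tidy this up is to build every URL from the same parameters and wrap the paging in a function. The sketch below is an assumption-laden rewrite, not the class code: `build_page_url`, `collect_posts` and `fake_fetch` are hypothetical names, and `fetch` is any callable that turns a URL into a parsed listing dictionary, so the logic can be tried without touching the network.

```python
import time
from urllib.parse import urlencode


def build_page_url(base_url, after=None, limit=50):
    """Build a listing URL that always carries `limit`, plus `after` when known."""
    params = {'limit': limit}
    if after is not None:
        params['after'] = after
    return base_url + '?' + urlencode(params)


def collect_posts(base_url, fetch, pages=20, limit=50, pause=0.0):
    """Follow the `after` token for up to `pages` requests and gather the posts."""
    posts = []
    after = None
    for _ in range(pages):
        listing = fetch(build_page_url(base_url, after, limit))
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # No further page is available, so stop early.
            break
        time.sleep(pause)
    return posts


def fake_fetch(url):
    """Stand-in for a real request: serves two small pages, then stops."""
    if 'after=' not in url:
        return {'data': {'children': [{'data': {'id': 'a'}}, {'data': {'id': 'b'}}],
                         'after': 't3_b'}}
    return {'data': {'children': [{'data': {'id': 'c'}}], 'after': None}}


demo_posts = collect_posts('https://www.reddit.com/r/AskHistorians/.json', fake_fetch, pages=5)
print([p['id'] for p in demo_posts])
```

For a real run, `fetch` could be something like `lambda url: requests.get(url, headers={'User-agent': 'Pony Inc 1.0'}).json()`, with `pause` set to a few seconds; a status-code check like the one in the loop above would still be needed.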

In [15]:
#Saving the posts to a csv file
pd.DataFrame(posts_askhistorians).to_csv('../datasets/askhistorians.csv')

In [16]:
#Using the loop function from DSI-14 week 4 api class

posts_askscience = []
after = None

#20 requests of up to 50 posts each to get close to 1000
for a in range(20):
    if after is None:
        current_url = url_askscience
    else:
        current_url = url_askscience + '?after=' + after +'&limit=50' #changing the limit to 50 posts
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    print('No. of posts ' +str(len(current_posts)))
    
    # Extending posts_askscience list
    posts_askscience.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,30)
    print(sleep_duration)
    time.sleep(sleep_duration) #program sleeps for the sleep_duration

https://www.reddit.com/r/askscience/.json
No. of posts 27
15
https://www.reddit.com/r/askscience/.json?after=t3_gfaczm&limit=50
No. of posts 50
27
https://www.reddit.com/r/askscience/.json?after=t3_gffe0u&limit=50
No. of posts 50
15
https://www.reddit.com/r/askscience/.json?after=t3_geffy3&limit=50
No. of posts 50
20
https://www.reddit.com/r/askscience/.json?after=t3_gcrlhn&limit=50
No. of posts 50
10
https://www.reddit.com/r/askscience/.json?after=t3_gbdjhn&limit=50
No. of posts 50
29
https://www.reddit.com/r/askscience/.json?after=t3_gabinr&limit=50
No. of posts 50
22
https://www.reddit.com/r/askscience/.json?after=t3_g9b95q&limit=50
No. of posts 50
8
https://www.reddit.com/r/askscience/.json?after=t3_g85y1q&limit=50
No. of posts 50
24
https://www.reddit.com/r/askscience/.json?after=t3_g6kmsy&limit=50
No. of posts 50
5
https://www.reddit.com/r/askscience/.json?after=t3_g5yl9j&limit=50
No. of posts 50
28
https://www.reddit.com/r/askscience/.json?after=t3_g45x1t&limit=50
No. of posts 5

In [17]:
#Checking the length of the posts retrieved
len(posts_askscience)

977
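A count of 977 says how many rows were collected, not how many are distinct: when paging through a live listing, the same post may show up on more than one page if the ranking shifts between requests. A minimal sketch of removing repeats before saving, assuming each post dictionary carries Reddit's `name` field (the `t3_...` identifier seen in the `after` tokens above); `drop_duplicate_posts` and the sample data are hypothetical:

```python
def drop_duplicate_posts(posts, key='name'):
    """Keep the first occurrence of each post, judged by its `key` field."""
    seen = set()
    unique = []
    for post in posts:
        identifier = post.get(key)
        # Posts without the key cannot be compared, so they are always kept.
        if identifier is not None and identifier in seen:
            continue
        if identifier is not None:
            seen.add(identifier)
        unique.append(post)
    return unique


# Illustrative data only; real posts have many more fields.
sample_posts = [
    {'name': 't3_aaa', 'title': 'Question one'},
    {'name': 't3_bbb', 'title': 'Question two'},
    {'name': 't3_aaa', 'title': 'Question one'},
]

unique_posts = drop_duplicate_posts(sample_posts)
print(len(sample_posts), len(unique_posts))
```

With pandas, `DataFrame.drop_duplicates(subset='name')` would do the same job once the posts are in a DataFrame.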

In [19]:
#Saving the posts to a csv file
pd.DataFrame(posts_askscience).to_csv('../datasets/askscience.csv')

## Creating the holdout dataset as the final test for the best model

125 posts from each subreddit will be scraped from the `new` listing to form the final holdout test set.

In [3]:
import requests
import pandas as pd
import time
import random

In [17]:
#Using the loop function from DSI-14 week 4 api class

holdout = []
after = None

#Askscience
for a in range(3):
    if after is None:
        current_url = 'https://www.reddit.com/r/askscience/new/.json'
    else:
        current_url = 'https://www.reddit.com/r/askscience/new/.json' + '?after=' + after +'&limit=50' #changing the limit to 50 posts
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    print('No. of posts ' +str(len(current_posts)))
    
    # Extending the holdout list with the askscience posts
    holdout.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,30)
    print(sleep_duration)
    time.sleep(sleep_duration) #program sleeps for the sleep_duration
    


https://www.reddit.com/r/askscience/new/.json
No. of posts 25
15
https://www.reddit.com/r/askscience/new/.json?after=t3_gk1fr5&limit=50
No. of posts 50
8
https://www.reddit.com/r/askscience/new/.json?after=t3_gjfq57&limit=50
No. of posts 50
18


In [18]:
#Using the loop function from DSI-14 week 4 api class
after = None

#Askhistorians
for a in range(3):
    if after is None:
        current_url = 'https://www.reddit.com/r/askhistorians/new/.json'
    else:
        current_url = 'https://www.reddit.com/r/askhistorians/new/.json' + '?after=' + after +'&limit=50' #changing the limit to 50 posts
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    print('No. of posts ' +str(len(current_posts)))
    
    # Extending the holdout list with the askhistorians posts
    holdout.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,30)
    print(sleep_duration)
    time.sleep(sleep_duration) #program sleeps for the sleep_duration

https://www.reddit.com/r/askhistorians/new/.json
No. of posts 25
16
https://www.reddit.com/r/askhistorians/new/.json?after=t3_gksjwm&limit=50
No. of posts 50
24
https://www.reddit.com/r/askhistorians/new/.json?after=t3_gkkwg8&limit=50
No. of posts 50
14


In [19]:
len(holdout)

250
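Because both subreddits are appended to the same `holdout` list, the total of 250 alone does not confirm the intended 125/125 split. A small sketch of a per-class check, using the `subreddit` field that appears in the post data shown earlier; `count_by_subreddit` and the sample list are illustrative assumptions:

```python
from collections import Counter


def count_by_subreddit(posts):
    """Count posts per subreddit, ignoring differences in capitalisation."""
    return Counter(post['subreddit'].lower() for post in posts)


# Illustrative stand-in for the real holdout list.
sample_holdout = [
    {'subreddit': 'askscience', 'title': 'Q1'},
    {'subreddit': 'askscience', 'title': 'Q2'},
    {'subreddit': 'AskHistorians', 'title': 'Q3'},
]

counts = count_by_subreddit(sample_holdout)
print(counts)
```

Running `count_by_subreddit(holdout)` on the real list would show whether each subreddit contributed the expected number of posts.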

In [20]:
#Saving the holdout to a csv file
pd.DataFrame(holdout).to_csv('../datasets/holdout.csv')