# Reddit Web-scraping

## Table of contents:

<ul>
<li><a href="#intro">i) Introduction</a></li>
<li><a href="#import">ii) Import Dependencies</a></li>    
<li><a href="#scraping">iii) Web Scraping (Data Collection)</a></li>
</ul>



<a id='intro'></a>
## i) Introduction

The subreddits I have chosen reflect a personal interest in how people try to understand and answer life's toughest questions. They can be found at the links below:

* [r/AskScience](https://www.reddit.com/r/askscience/)  
* [r/AskPhilosophy](https://www.reddit.com/r/askphilosophy/)

I have tokenized the text and then modeled the threads accordingly. The aim of this project is to see which words are used most often in the 'Science' and 'Philosophy' contexts, and to find the names that may have influenced each field.
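
As a rough illustration of the word-frequency idea, the sketch below counts lowercase word tokens with the standard library only. The helper name `top_words` and the sample titles are hypothetical and are not taken from the scraped data; the actual tokenization and modeling may well use different tools.

```python
import re
from collections import Counter

def top_words(texts, n=3):
    """Count lowercase word tokens across texts and return the n most common."""
    counts = Counter()
    for text in texts:
        # Very simple tokenizer: runs of letters and apostrophes
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Hypothetical titles, for illustration only
sample_titles = [
    "Why is the sky blue?",
    "Why is water wet?",
]
print(top_words(sample_titles, n=2))
```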

<a id='import'></a>
## ii) Import Dependencies

In [43]:
import requests
import json
import time
import random
import pandas as pd
from bs4 import BeautifulSoup

<a id='scraping'></a>
## iii) Web Scraping (Data Collection)

In [33]:
# Select urls that will be scraped
url_1 = "https://www.reddit.com/r/askscience/.json"
url_2 = "https://www.reddit.com/r/askphilosophy/.json"

In [58]:
# Headers to mimic a browser visit
headers = {'User-Agent': 'Mozilla/5.0'}

# Request data from server
res_1 = requests.get(url_1, headers=headers)
res_2 = requests.get(url_2, headers=headers)

In [59]:
# Check the status codes: 200 means the request succeeded
res_1.status_code, res_2.status_code

(200, 200)
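
A request does not always come back with 200, for example when the server is throttling the client. The sketch below retries with a growing delay. It assumes a caller-supplied `fetch` callable (hypothetical; in this notebook it could be `lambda u: requests.get(u, headers=headers)`), and the name `fetch_with_retry` and its parameters are illustrative rather than part of any library.

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the retries run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute. The last response is returned even if it was not a 200.
    """
    response = None
    for attempt in range(retries):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries - 1:
            # Wait longer after each failure: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    return response
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting.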

In [60]:
# Parse the JSON body of each response into a Python dict
sci_dict = res_1.json()
philo_dict = res_2.json()

##### Loop through r/askscience

In [76]:
# Loop through 25 posts at a time for r/askscience
posts = []
after = None

for a in range(45):
    if after is None:
        current_url = url_1
    else:
        current_url = url_1 + '?after=' + after
    print(current_url)
    res_1 = requests.get(current_url, headers=headers)
    
    if res_1.status_code != 200:
        print('Status error', res_1.status_code)
        break
    
    current_dict = res_1.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
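    after = current_dict['data']['after']
    
    # Save progress after every page
    pd.DataFrame(posts).to_csv('askscience.csv', index = False)

    # Stop once Reddit reports no further page; otherwise the loop would restart at page one
    if after is None:
        break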

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,5)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/askscience/.json
4
https://www.reddit.com/r/askscience/.json?after=t3_dij9wn
5
https://www.reddit.com/r/askscience/.json?after=t3_dhqkoa
3
https://www.reddit.com/r/askscience/.json?after=t3_dhbjik
5
https://www.reddit.com/r/askscience/.json?after=t3_dgx8h9
2
https://www.reddit.com/r/askscience/.json?after=t3_dfwv71
2
https://www.reddit.com/r/askscience/.json?after=t3_dga34c
4
https://www.reddit.com/r/askscience/.json?after=t3_dfrnxp
5
https://www.reddit.com/r/askscience/.json?after=t3_dfbqz9
2
https://www.reddit.com/r/askscience/.json?after=t3_deuka9
5
https://www.reddit.com/r/askscience/.json?after=t3_de30xh
5
https://www.reddit.com/r/askscience/.json?after=t3_ddr4vf
2
https://www.reddit.com/r/askscience/.json?after=t3_ddeghx
5
https://www.reddit.com/r/askscience/.json?after=t3_dcjej4
3
https://www.reddit.com/r/askscience/.json?after=t3_dcfb8m
5
https://www.reddit.com/r/askscience/.json?after=t3_dbwfww
2
https://www.reddit.com/r/askscience/.json?after=t3_db3fj
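
The paging logic can also be wrapped in a function so that both subreddits share one implementation. This is a minimal sketch, not the code that produced the output above: `scrape_listing` and the injected `fetch_json` callable are hypothetical names, and `fetch_json` is assumed to return the parsed listing dict, or `None` when the request fails.

```python
from urllib.parse import urlencode

def scrape_listing(base_url, fetch_json, max_pages=45):
    """Follow the `after` token page by page and collect the post dicts."""
    posts, after = [], None
    for _ in range(max_pages):
        if after is None:
            url = base_url
        else:
            url = base_url + "?" + urlencode({"after": after})
        page = fetch_json(url)
        if page is None:  # fetch_json signals a failed request with None
            break
        posts.extend(child["data"] for child in page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # no further page to request
            break
    return posts
```

In the notebook, `fetch_json` could wrap `requests.get(...)`, the status check, and the random sleep, which keeps the network details out of the paging logic.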

##### Loop through r/askphilosophy

In [77]:
# Loop through 25 posts at a time for r/askphilosophy
posts = []
after = None

for a in range(45):
    if after is None:
        current_url = url_2
    else:
        current_url = url_2 + '?after=' + after
    print(current_url)
    res_2 = requests.get(current_url, headers=headers)
    
    if res_2.status_code != 200:
        print('Status error', res_2.status_code)
        break
    
    current_dict = res_2.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
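    after = current_dict['data']['after']
    
    # Save progress after every page
    pd.DataFrame(posts).to_csv('askphilosophy.csv', index = False)

    # Stop once Reddit reports no further page; otherwise the loop would restart at page one
    if after is None:
        break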

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/askphilosophy/.json
3
https://www.reddit.com/r/askphilosophy/.json?after=t3_dixdly
4
https://www.reddit.com/r/askphilosophy/.json?after=t3_dio9le
2
https://www.reddit.com/r/askphilosophy/.json?after=t3_dif23g
6
https://www.reddit.com/r/askphilosophy/.json?after=t3_di53wa
2
https://www.reddit.com/r/askphilosophy/.json?after=t3_dhu39b
5
https://www.reddit.com/r/askphilosophy/.json?after=t3_dhhgsk
4
https://www.reddit.com/r/askphilosophy/.json?after=t3_dgt3c1
6
https://www.reddit.com/r/askphilosophy/.json?after=t3_dggn75
6
https://www.reddit.com/r/askphilosophy/.json?after=t3_dg8hov
5
https://www.reddit.com/r/askphilosophy/.json?after=t3_dg0jgf
5
https://www.reddit.com/r/askphilosophy/.json?after=t3_dfic19
6
https://www.reddit.com/r/askphilosophy/.json?after=t3_df28ow
5
https://www.reddit.com/r/askphilosophy/.json?after=t3_devhzw
5
https://www.reddit.com/r/askphilosophy/.json?after=t3_deeloj
5
https://www.reddit.com/r/askphilosophy/.json?after=t3_dea0qk
2
https://
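
Because the pages are fetched several seconds apart, the same post may turn up on more than one page, for instance if rankings shift between requests. The sketch below drops repeats by the post's `name` field (such as `t3_dixdly`). The helper `dedupe_posts` is a hypothetical name, and whether duplicates actually occur in this dataset is an assumption worth checking.

```python
def dedupe_posts(posts, key="name"):
    """Keep the first occurrence of each post, identified by its `name`."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique

# Illustrative input: the second entry repeats the first
example = [{"name": "t3_aaa"}, {"name": "t3_aaa"}, {"name": "t3_bbb"}]
print(len(dedupe_posts(example)))
```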

In [80]:
philo_df = pd.read_csv('askphilosophy.csv')
philo_df.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,awarders,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,steward_reports,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,True,BernardJOrtcutt,,,,[],,,,text,t2_nmqzc,False,[],,,False,False,,False,,False,1541457000.0,1541429000.0,,moderator,self.askphilosophy,0,False,0,{},False,False,9udzvt,False,False,False,False,True,True,False,,,mod,[],61188432-c80a-11e7-bc74-0ef1e0d09910,Modpost,dark,text,False,,{},False,,,,[],t3_9udzvt,False,95,0,,False,all_ads,/r/askphilosophy/comments/9udzvt/announcement_...,False,6,False,,,False,95,,{},Today we are going live with a new set of rule...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],True,askphilosophy,t5_2sc5r,r/askphilosophy,114809,public,,,"Announcement: New Rules, Guidelines and Flair ...",0,95,https://www.reddit.com/r/askphilosophy/comment...,[],,False,all_ads,6
1,[],False,,,False,AutoModerator,,,,[],,,,text,t2_6l4z3,False,[],,,False,False,,False,,False,1571108000.0,1571079000.0,,moderator,self.askphilosophy,0,False,0,{},False,False,dhv21d,False,False,False,False,True,True,False,,,mod,[],,Open Thread,dark,text,False,,{},False,,,,[],t3_dhv21d,True,79,0,,False,all_ads,/r/askphilosophy/comments/dhv21d/raskphilosoph...,False,6,False,,,False,5,,{},Welcome to this week's Open Discussion Thread....,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",False,False,[],True,askphilosophy,t5_2sc5r,r/askphilosophy,114809,public,new,,/r/askphilosophy Open Discussion Thread | Octo...,0,5,https://www.reddit.com/r/askphilosophy/comment...,[],,False,all_ads,6
2,[],False,,,False,dhaddu_Dhadaura1212,,,,[],,,,text,t2_4d9fjvhe,False,[],,,False,False,,False,,False,1571318000.0,1571289000.0,,,self.askphilosophy,0,1571299353.0,0,{},False,False,dj1nfo,False,False,False,False,True,True,False,,,,[],,,dark,text,False,,{},False,,,,[],t3_dj1nfo,False,16,0,,False,all_ads,/r/askphilosophy/comments/dj1nfo/can_a_post_sc...,False,6,False,,,False,19,,{},For example if everyone has all their needs an...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],False,askphilosophy,t5_2sc5r,r/askphilosophy,114809,public,,,Can a post scarcity world be unsatisfying for ...,0,19,https://www.reddit.com/r/askphilosophy/comment...,[],,False,all_ads,6
3,[],False,,,False,NYCWallCrawlr,,,,[],,,,text,t2_2kqxjqdm,False,[],,,False,False,,False,,False,1571271000.0,1571242000.0,,,self.askphilosophy,0,1571242875.0,0,{},False,False,dirjll,False,False,False,False,True,True,False,,,,[],,,dark,text,True,,{},False,,,,[],t3_dirjll,False,87,1,,False,all_ads,/r/askphilosophy/comments/dirjll/what_are_the_...,False,6,False,,,False,136,,{},From comments removed by mods simply for menti...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],False,askphilosophy,t5_2sc5r,r/askphilosophy,114809,public,,,What are the philosophical mistakes that Jorda...,0,136,https://www.reddit.com/r/askphilosophy/comment...,[],,False,all_ads,6
4,[],False,,,False,BioFeld,,,,[],,,,text,t2_8kc2c,False,[],,,False,False,,False,,False,1571318000.0,1571289000.0,,,self.askphilosophy,0,False,0,{},False,False,dj1ope,False,False,False,False,True,True,False,,,,[],,,dark,text,False,,{},False,,,,[],t3_dj1ope,True,1,0,,False,all_ads,/r/askphilosophy/comments/dj1ope/what_translat...,False,6,False,,,False,12,,{},"I got a copy of Albert Camus' book ""A Happy De...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],False,askphilosophy,t5_2sc5r,r/askphilosophy,114809,public,,,"What translation of Albert Camus' ""A Happy Dea...",0,12,https://www.reddit.com/r/askphilosophy/comment...,[],,False,all_ads,6


##### Checks

In [81]:
sci_df = pd.read_csv('askscience.csv')
sci_df.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,awarders,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,steward_reports,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,"[{'count': 1, 'is_enabled': True, 'subreddit_i...",True,,,False,AskScienceModerator,,,,[],,Mod Bot,dark,text,t2_ec1ey,False,[],,,False,False,,False,,False,1563658000.0,1563629000.0,,moderator,self.askscience,0,False,1,{'gid_2': 1},False,False,cflsy3,False,False,False,False,True,True,False,,,,[],,,dark,text,False,,{},False,,,,[],t3_cflsy3,False,50,0,,False,all_ads,/r/askscience/comments/cflsy3/askscience_panel...,False,6,False,,,False,262,,{},**Please read this entire post carefully and f...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],True,askscience,t5_2qm4e,r/askscience,18179394,public,new,,AskScience Panel of Scientists XXI,1,262,https://www.reddit.com/r/askscience/comments/c...,[],,False,all_ads,6
1,[],False,,,False,AskScienceModerator,,,,[],,Mod Bot,dark,text,t2_ec1ey,False,[],,,False,False,,False,,False,1571252000.0,1571224000.0,,,self.askscience,0,False,0,{},False,False,dinnqo,False,False,False,False,True,True,False,,,med,[],,Medicine,dark,text,False,,{},False,,,,[],t3_dinnqo,False,37,0,,False,all_ads,/r/askscience/comments/dinnqo/askscience_ama_s...,False,6,False,,,False,42,,{},Today is International Restart-a-Heart Day (ht...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",False,False,[],True,askscience,t5_2qm4e,r/askscience,18179394,public,,,AskScience AMA Series: We're a team of researc...,0,42,https://www.reddit.com/r/askscience/comments/d...,[],,False,all_ads,6
2,[],True,,,False,AskScienceModerator,,,,[],,Mod Bot,dark,text,t2_ec1ey,False,[],,,False,False,,False,,False,1571252000.0,1571224000.0,,,self.askscience,0,False,0,{},False,False,dinnrs,False,False,False,False,True,True,False,,,med,[],,Medicine,dark,text,False,,{},False,,,,[],t3_dinnrs,False,486,1,,False,all_ads,/r/askscience/comments/dinnrs/askscience_ama_s...,False,6,False,,,False,6977,,{},Measles is one of the most contagious diseases...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",False,False,[],False,askscience,t5_2qm4e,r/askscience,18179394,public,,,AskScience AMA Series: Experts are warning tha...,0,6977,https://www.reddit.com/r/askscience/comments/d...,[],,False,all_ads,6
3,[],True,,,False,_Dantallica_,,,,[],,,,text,t2_135j50,False,[],,,False,False,,False,,False,1571210000.0,1571181000.0,,,self.askscience,0,False,0,{},False,False,dig96b,False,False,False,False,True,True,False,,#66ddff,chem,[],3ebe1a9c-8971-11e1-9c40-12313d2c1af1,Chemistry,dark,text,False,,{},False,,,,[],t3_dig96b,False,70,1,,False,all_ads,/r/askscience/comments/dig96b/why_is_the_heat_...,False,6,False,,,False,1611,,{},"I am taking AP Chemistry this year, and we're ...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],False,askscience,t5_2qm4e,r/askscience,18179394,public,,,Why is the heat capacity of liquid water so mu...,0,1611,https://www.reddit.com/r/askscience/comments/d...,[],,False,all_ads,6
4,[],False,,,False,Khenghis_Ghan,,,,[],,,,text,t2_932kv,False,[],,,False,False,,False,,False,1571317000.0,1571288000.0,,,self.askscience,0,False,0,{},False,False,dj1i9k,False,False,False,False,True,True,False,,#ff99cc,computing,[],612c7612-dfa7-11e3-bb11-12313d18e5cd,Computing,dark,text,False,,{},False,,,,[],t3_dj1i9k,False,10,0,,False,all_ads,/r/askscience/comments/dj1i9k/how_can_software...,False,6,False,,,False,4,,{},I was watching a lecture about assemblers/comp...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,[],False,askscience,t5_2qm4e,r/askscience,18179394,public,,,How can software perform tasks hardware cant’t?,0,4,https://www.reddit.com/r/askscience/comments/d...,[],,False,all_ads,6
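
One quick sanity check on the scraped rows is the `created_utc` column, which holds seconds since the Unix epoch. The sketch below converts such a value to a readable UTC timestamp; `to_utc` is a hypothetical helper, and the sample value is copied from the display above, where it may have been rounded.

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Value as displayed in the `created_utc` column above (possibly rounded)
print(to_utc(1571289000.0).isoformat())
```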


##### Continued in `02_reddit-nlp-and-text-modeling`...