# Webscraping using Subreddit APIs

In this notebook, I scrape the following two subreddits:
1. `r/Conservative`
2. `r/democrats`

I also inspect the data from the first scrape and perform some preliminary analysis to determine which fields will be useful later on.

Contents:
- [Imports](#Import-libraries)
- [Webscraping Function](#Webscraping-function)
- [Scraped data from `r/Conservative`](#Scrape-data-from-`r/Conservative`)
- [Data exploration on `r/Conservative` scraped data](#Data-exploration-on-`r/Conservative`-scraped-data)
- [Scraped data from `r/democrats`](#Scrape-data-from-`r/democrats`)
- [Data exploration on `r/democrats` scraped data](#Data-exploration-on-`r/democrats`-scraped-data)
- [Scraped comments from posts](#Scrape-comments-from-posts)
- [Save datasets to csv file](#Save-datasets-to-csv-file)

### Import libraries

In [36]:
# imports
import pandas as pd
import numpy as np

import requests
# !pip install praw
import praw

import time
import random

from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings('ignore')

# !pip install prettytable
from prettytable import PrettyTable

### Webscraping function

**JSON API**

Using Reddit's JSON API, I first create a function to scrape posts from a subreddit. Reddit listings return at most about 1,000 posts, so we can expect up to that many posts for each subreddit, or fewer if the subreddit contains duplicate posts.

In [5]:
def scrape_subreddit(subreddit, num_posts):
    # creates a dataframe of subreddit posts
    url = f'https://www.reddit.com/r/{subreddit}.json'
    headers = {'User-agent': 'fun-sized 1.0'}
    posts_full = []
    posts_unique = []
    after = None

    num_scrape = int(np.ceil(num_posts/25))
    for i in range(num_scrape):
        
        #print tracker
        if (i + 1) % 10 == 0:
            print(f'Scraped {i + 1} pages...')
        
        #update url for next reddit page after the first
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after

        #use requests to get the JSON; set a User-agent header to avoid a 429 status error.
        res = requests.get(current_url, headers = headers) 

        #print error if encountered
        if res.status_code != 200:
            print('Status error: ', res.status_code) 
            break

        current_dict = res.json() #get the dictionary of data from current 25 posts for subreddit
        current_posts = [p['data'] for p in current_dict['data']['children']] #extract just the posts for the subreddit
        posts_full.extend(current_posts) #append current posts to list of posts previously extracted
        after = current_dict['data']['after'] #set after to create new url for the next 25 posts 
        
        #generate random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        time.sleep(sleep_duration)
    
    #remove duplicates
    seen_titles = set()
    for p in posts_full:
        if p['title'] not in seen_titles: #use title to check for duplicates
            seen_titles.add(p['title'])
            posts_unique.append(p)
    
    return pd.DataFrame(posts_unique)
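
Reddit listings signal the last page by returning `after` as `null`. In the function above, the next request then goes to the base URL again, which may be one source of the duplicate posts. Below is a minimal sketch of a paginator that stops at that point instead. `paginate`, `fetch_page` and `fake_pages` are hypothetical names: `fetch_page` stands in for the `requests.get(...).json()` call so the logic can be run offline.

```python
def paginate(fetch_page, max_pages):
    """Collect posts page by page, stopping when the listing runs out.

    fetch_page is a hypothetical callable: it takes the `after` token
    (None for the first page) and returns a dict shaped like Reddit's
    listing JSON, i.e. {'data': {'children': [...], 'after': ...}}.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)['data']
        posts.extend(child['data'] for child in listing['children'])
        after = listing['after']
        if after is None:  # no further pages, so stop instead of restarting
            break
    return posts


# offline stand-in for the API: two pages, then no 'after' token
fake_pages = {
    None: {'data': {'children': [{'data': {'title': 'a'}},
                                 {'data': {'title': 'b'}}],
                    'after': 't3_x'}},
    't3_x': {'data': {'children': [{'data': {'title': 'c'}}],
                      'after': None}},
}
posts = paginate(lambda after: fake_pages[after], max_pages=10)
titles = [p['title'] for p in posts]
```

Even though `max_pages` is 10 here, only two pages are requested, because the second page reports no `after` token.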

### Scrape data from `r/Conservative`

In [6]:
#scrape 1,500 posts from r/Conservative
conservative_df = scrape_subreddit('Conservative', 1500)
conservative_df.head()

Scraped 10 pages...
Scraped 20 pages...
Scraped 30 pages...
Scraped 40 pages...
Scraped 50 pages...
Scraped 60 pages...


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,is_gallery,media_metadata,gallery_data
0,,Conservative,,t2_4hgotpz8,False,,0,False,Parler finds refuge with right-leaning webhost...,"[{'e': 'text', 't': 'Flaired Users Only'}]",...,https://www.washingtonexaminer.com/news/parler...,632488,1610414000.0,0,,False,,,,
1,,Conservative,,t2_4hgotpz8,False,,0,False,Elon Musk: A lot of people are going to be sup...,"[{'e': 'text', 't': 'Flaired Users Only'}]",...,https://twitchy.com/samj-3930/2021/01/11/hey-g...,632488,1610397000.0,1,,False,,,,
2,,Conservative,,t2_4u7bd,False,,0,False,Sorry Cleveland,"[{'e': 'text', 't': 'Flaired Users Only'}]",...,https://i.redd.it/ukmkh9mo8ta61.jpg,632488,1610418000.0,0,,False,3cbf370a-5fe4-11e6-9198-0e0b7c2c3ef3,,,
3,,Conservative,,t2_d0rgw,False,,0,False,Democrat Law Professor: Trump Never Actually C...,"[{'e': 'text', 't': 'Flaired Users Only'}]",...,https://townhall.com/tipsheet/katiepavlich/202...,632488,1610410000.0,2,,False,,,,
4,,Conservative,,t2_4hgotpz8,False,,0,False,Elon Musk Advises People to Ditch Facebook and...,"[{'e': 'text', 't': 'Flaired Users Only'}]",...,https://www.digitaltrends.com/news/elon-musk-f...,632488,1610390000.0,0,,False,,,,


In [7]:
conservative_df.shape

(491, 111)

This gives 491 unique titles out of the roughly 1,000 posts available from the subreddit, which is slightly less than half. More data would be better to work with, and this is considered in the further analysis below.
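
The title-based de-duplication done inside `scrape_subreddit` can also be expressed on the dataframe itself. Here is a small sketch on toy data (the rows are made up), assuming the first occurrence of each title is the one to keep:

```python
import pandas as pd

toy_posts = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post A', 'Post C'],
    'author': ['u1', 'u2', 'u3', 'u4'],
})

# keep the first row for each title, mirroring the loop in scrape_subreddit
unique_posts = (toy_posts
                .drop_duplicates(subset='title', keep='first')
                .reset_index(drop=True))

# share of rows that survived de-duplication
unique_share = len(unique_posts) / len(toy_posts)
```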

### Data exploration on `r/Conservative` scraped data
I do a quick review of the dataset using a pandas profiling report to identify which fields would be useful for analysis.

In [8]:
# generate html summary report about data - r/Conservative
report = ProfileReport(conservative_df)
report.to_file(output_file = 'report_conservative_df.html')
print("Report is ready!")

Report is ready!


From the conservative dataframe report, I note the following:
1. The 'selftext' column, which contains the body text of the posts, is empty in this subreddit. Looking through `r/Conservative`, most posts consist of a title and a news link and rarely include text. While post text would be the most useful field for modelling, there is no data. An alternative is to use comments data, which I will scrape later on.


2. 'title' column has no missing values. Apart from the content of the posts, the title is probably the field that most closely ties a post to a particular subreddit. This would be my first option for modelling.


3. 'link_flair_text' column has 9 missing values (2.2% of dataset). Flair on Reddit is used as a form of subcategorisation within the subreddit (Mathew, 2019). There are three types of flair for posts: Flaired Users Only, Satire - Flaired Users Only, and Misleading Title. These are not very useful for modelling, so we will not consider this column.


4. 'domain' shows the source of the link, e.g. foxnews, dailywire, washingtonexaminer. There are no missing values in this column, so it could be considered for modelling later on. Some outlets are likely to be over-represented here: foxnews, for instance, is widely regarded as conservative-leaning, so we are likely to see more links from it in this subreddit.


5. 'subreddit' column will be the label for our classification model later. 
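
The per-column checks above can also be reproduced without the profiling report. Below is a rough sketch on made-up rows, assuming that an empty string should count as missing alongside `None`/`NaN` (link posts typically carry an empty `selftext` rather than a null):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4'],
    'selftext': ['', '', 'some text', ''],
    'link_flair_text': ['Flaired Users Only', None, 'Satire', None],
    'domain': ['foxnews.com', 'i.redd.it', 'dailywire.com', 'foxnews.com'],
})

candidate_cols = ['title', 'selftext', 'link_flair_text', 'domain']
subset = toy_df[candidate_cols]

# a cell counts as missing if it is null or an empty string
missing_share = (subset.isna() | subset.eq('')).mean()
```

`missing_share` is a Series indexed by column name, so columns above a chosen threshold can be dropped from the modelling candidates in one step.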

### Scrape data from `r/democrats`

In [9]:
#scrape 1,500 posts from r/democrats
democrats_df = scrape_subreddit('democrats', 1500)
democrats_df.head()

Scraped 10 pages...
Scraped 20 pages...
Scraped 30 pages...
Scraped 40 pages...
Scraped 50 pages...
Scraped 60 pages...


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,is_gallery,media_metadata,gallery_data
0,,democrats,,t2_nkk56,False,,0,False,House Democrats launch second impeachment of T...,"[{'e': 'text', 't': '🔴 Megathread'}]",...,/r/JoeBiden/comments/kv4jen/house_democrats_la...,175113,1610378000.0,0,,False,,,,
1,,democrats,,t2_4f8e5viq,False,,0,False,Do I have to?,[],...,https://i.redd.it/sq8b597wpsa61.jpg,175113,1610411000.0,1,,False,,,,
2,,democrats,,t2_17c1os,False,,0,False,"""Camp Auschwitz"" guy identified!",[],...,https://i.redd.it/ewmxi3oh0qa61.jpg,175113,1610378000.0,2,,False,,,,
3,,democrats,,t2_yfff4,False,,0,False,No Crawling Back!!!,"[{'e': 'text', 't': '📄Effortpost'}]",...,https://i.redd.it/863qmtjcbsa61.jpg,175113,1610406000.0,0,,False,,,,
4,,democrats,,t2_2tm5mq8b,False,,0,False,Use the 14th Amendment to ban Trump,"[{'e': 'text', 't': '🗳️ Beat Trump'}]",...,https://i.redd.it/nil93j0cmsa61.jpg,175113,1610410000.0,0,,False,,,,


In [10]:
democrats_df.shape

(990, 114)

### Data exploration on `r/democrats` scraped data

In [11]:
# generate html summary report about data - r/democrats
report = ProfileReport(democrats_df)
report.to_file(output_file = 'report_democrats_df.html')
print("Report is ready!")

(using `df.profile_report(correlations={"cramers": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')

Report is ready!


The democrats dataframe report shares some characteristics with the conservative dataframe report. The details are below:
1. 94% of the 'selftext' column is empty. As in `r/Conservative`, most posts consist of a title and a news link and rarely include text. While post text would be the most useful field for modelling, there is very little data. An alternative is to use comments data, which I will scrape later on.


2. 'title' column has no missing values. As with `r/Conservative`, this would be the first option to use for modelling since it closely reflects the content of the post.


3. 'link_flair_text' column has 45.4% missing data, which is substantial. I would not consider this for modelling.


4. 'domain' shows the source of the link, e.g. self.democrats, twitter, youtube. There are no missing values in this column, so it could be considered for modelling as well.


5. 'subreddit' column will be the label for our classification model later. 
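
To see whether `domain` separates the two subreddits, the most common link sources can be tabulated side by side. This is a sketch on made-up rows; the two toy frames stand in for `conservative_df` and `democrats_df`:

```python
import pandas as pd

toy_cons = pd.DataFrame({
    'subreddit': 'Conservative',
    'domain': ['foxnews.com', 'dailywire.com', 'foxnews.com'],
})
toy_dems = pd.DataFrame({
    'subreddit': 'democrats',
    'domain': ['twitter.com', 'self.democrats', 'twitter.com', 'youtube.com'],
})

combined = pd.concat([toy_cons, toy_dems], ignore_index=True)

# rows: domain, columns: subreddit, values: number of posts
domain_counts = pd.crosstab(combined['domain'], combined['subreddit'])
```

Domains that appear almost exclusively in one column of `domain_counts` are the ones most likely to help a classifier.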

### Scrape comments from posts
From the analysis above, I decided to scrape comments as an alternative to post content, since post content is lacking in the selected subreddits.

In [13]:
def scrape_top_comments(subreddit):
    #function scrapes up to 100 top-level comments from each of the first 24 posts in the subreddit's top listing
    url = f'https://www.reddit.com/r/{subreddit}/top.json'
    headers = {'User-agent': 'fun-sized 2.0'}
    comments = [] #defined up front so an empty dataframe is returned if the request fails
    
    res = requests.get(url, headers = headers)
    
    if res.status_code == 200:
        raw_dict = res.json() # get dict of data from top 25 posts for subreddit

        for post in raw_dict['data']['children'][:24]:
            permalink = post['data']['permalink']
            comment_url = f'https://www.reddit.com{permalink}.json?sort=confidence' #get the url for comments of individual post
            res_comment = requests.get(comment_url, headers = headers)
            if res_comment.status_code == 200:
                dict_best_comments = res_comment.json()
                indiv_comments = dict_best_comments[1]['data']['children'] #second element holds the comment listing
                for c in indiv_comments[:100]:
                    #stop at the first entry without a body (e.g. a 'more comments' placeholder)
                    if 'body' not in c['data']:
                        break
                    comments.append(c['data']['body'])
    else:
        print('Status error: ', res.status_code)

    comments_df = pd.DataFrame(comments, columns = ['comments'])
    comments_df['subreddit'] = subreddit
    
    return comments_df
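
The comment endpoint returns a two-element list: the post listing first, then the comment listing. Below is a sketch of pulling top-level comment bodies out of that structure, run on a hand-built stand-in rather than a live response. `top_level_bodies` and `fake_post_json` are hypothetical names, and the sketch assumes entries with `kind == 't1'` are comments while a trailing `'more'` placeholder carries no `body`.

```python
def top_level_bodies(post_json, limit=100):
    """Return up to `limit` top-level comment bodies from a post's JSON."""
    children = post_json[1]['data']['children']
    bodies = [c['data']['body'] for c in children if c['kind'] == 't1']
    return bodies[:limit]


# hand-built stand-in for the JSON of a single post
fake_post_json = [
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't3', 'data': {'title': 'a post'}},
    ]}},
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't1', 'data': {'body': 'first comment'}},
        {'kind': 't1', 'data': {'body': 'second comment'}},
        {'kind': 'more', 'data': {'count': 5, 'children': ['abc', 'def']}},
    ]}},
]
bodies = top_level_bodies(fake_post_json)
```

Replies nested under each comment are not visited here, which is one reason this approach collects fewer comments than a full traversal of the thread.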

In [14]:
%%time
#scrape comments from r/Conservative
conservative_top_comments_df = scrape_top_comments('Conservative')
conservative_top_comments_df.head()

Wall time: 32.6 s


Unnamed: 0,comments,subreddit
0,Looking for debate? Head to the public section...,Conservative
1,Except their lawyers have dropped Parler as a ...,Conservative
2,Republicans had six years to do something abou...,Conservative
3,/r/news has been celebrating all day. They lov...,Conservative
4,I wish him luck but what can he do at this poi...,Conservative


In [15]:
#view number of comments
conservative_top_comments_df.shape

(729, 2)

From here we see that there are 729 comments scraped from `r/Conservative`, which is a decent amount of data.

In [16]:
%%time
democrats_top_comments_df = scrape_top_comments('democrats')
democrats_top_comments_df.head()

Wall time: 15.1 s


Unnamed: 0,comments,subreddit
0,My hero!! He made these idiots to go the other...,democrats
1,Isn’t this the Blue Lives Matter/Law &amp; Ord...,democrats
2,Ability to make critical decisions on the spot.,democrats
3,Goodman is good man!\n\nSorry I had too,democrats
4,He is the only real American in that damn room...,democrats


In [17]:
democrats_top_comments_df.shape

(262, 2)

From here we see that there are 262 comments scraped from `r/democrats`, which is comparatively little.

**Python Reddit API Wrapper (PRAW)**


Titles and comments can also be extracted with the Python Reddit API Wrapper (PRAW) (Boe, 2020). In the runs below, it retrieved more comments per second than the JSON approach.

In [19]:
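import os

# credentials are read from environment variables rather than hard-coded;
# the variable names below are placeholders
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'], client_secret=os.environ['REDDIT_CLIENT_SECRET'], user_agent='reddit-webscrape')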

In [20]:
# function that scrapes post titles from a subreddit's hot listing
def praw_scrape_titles(subreddit, num_post):
    post_title = []
    hot_posts = reddit.subreddit(subreddit).hot(limit=num_post)
    for post in hot_posts:
        post_title.append(post.title)
    
    praw_titles_df = pd.DataFrame(post_title, columns = ['titles'])
    praw_titles_df['subreddit'] = subreddit
    
    return praw_titles_df

In [21]:
#scrape titles off r/Conservative using praw
cons_praw_titles_df = praw_scrape_titles('Conservative', 1000)
cons_praw_titles_df.head()

Unnamed: 0,titles,subreddit
0,Twitter’s ban on Trump strips US of ‘moral hig...,Conservative
1,Parler finds refuge with right-leaning webhost...,Conservative
2,Sorry Cleveland,Conservative
3,Elon Musk: A lot of people are going to be sup...,Conservative
4,Democrat Law Professor: Trump Never Actually C...,Conservative


In [22]:
len(cons_praw_titles_df)

504

Successfully scraped 504 titles off `r/Conservative` using praw.

In [23]:
#scrape titles off r/democrats using praw
dems_praw_titles_df = praw_scrape_titles('democrats', 1000)
dems_praw_titles_df.head()

Unnamed: 0,titles,subreddit
0,House Democrats launch second impeachment of T...,democrats
1,Do I have to?,democrats
2,"""Camp Auschwitz"" guy identified!",democrats
3,No Crawling Back!!!,democrats
4,Use the 14th Amendment to ban Trump,democrats


In [24]:
len(dems_praw_titles_df)

997

Successfully scraped 997 titles off `r/democrats` using praw.

In [29]:
#function that scrapes all comments (including replies) from the first 24 posts in the subreddit's top listing
def praw_scrape_comments(subreddit):
    url = f'https://www.reddit.com/r/{subreddit}/top.json'
    headers = {'User-agent': 'fun-sized 3.0'}
    comments = [] #defined up front so an empty dataframe is returned if the request fails
    res = requests.get(url, headers = headers)
    if res.status_code == 200:
        raw_dict = res.json()
        for post in raw_dict['data']['children'][:24]:
            permalink = post['data']['permalink']
            comment_url = f'https://www.reddit.com{permalink}.json?sort=confidence'
            submission = reddit.submission(url = comment_url)
            submission.comments.replace_more(limit = None) #expand every 'more comments' placeholder
            for comment in submission.comments.list(): #flattened list of comments and replies
                comments.append(str(comment.body))
    else:
        print('Status error: ', res.status_code)
    
    comments_df = pd.DataFrame(comments, columns = ['comments'])
    comments_df['subreddit'] = subreddit
    
    return comments_df
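
Conceptually, `replace_more` followed by `comments.list()` yields every comment in the thread, replies included. Below is a plain-Python sketch of that kind of flattening on a nested toy structure. The dict layout and the name `flatten_comments` are invented for illustration and are not PRAW's internal representation.

```python
from collections import deque


def flatten_comments(top_level):
    """Breadth-first walk over nested comments, returning all bodies."""
    bodies = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        bodies.append(comment['body'])
        # replies join the back of the queue, so shallower comments come first
        queue.extend(comment.get('replies', []))
    return bodies


toy_thread = [
    {'body': 'top 1', 'replies': [
        {'body': 'reply 1a', 'replies': [{'body': 'reply 1a-i'}]},
    ]},
    {'body': 'top 2'},
]
all_bodies = flatten_comments(toy_thread)
```

Reading only the first level of this toy thread would return two bodies, while the flattened walk returns four.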

In [30]:
%%time
#scrape comments off r/Conservative using praw
cons_praw_comments_df = praw_scrape_comments('Conservative')
cons_praw_comments_df.head()

Wall time: 2min 42s


Unnamed: 0,comments,subreddit
0,Looking for debate? Head to the public section...,Conservative
1,Except their lawyers have dropped Parler as a ...,Conservative
2,Republicans had six years to do something abou...,Conservative
3,/r/news has been celebrating all day. They lov...,Conservative
4,I wish him luck but what can he do at this poi...,Conservative


In [31]:
len(cons_praw_comments_df)

4248

Successfully scraped 4248 comments off `r/Conservative` using praw.

In [32]:
%%time
#scrape comments off r/democrats using praw
dems_praw_comments_df = praw_scrape_comments('democrats')
dems_praw_comments_df.head()

Wall time: 16.1 s


Unnamed: 0,comments,subreddit
0,My hero!! He made these idiots to go the other...,democrats
1,Isn’t this the Blue Lives Matter/Law & Order c...,democrats
2,Ability to make critical decisions on the spot.,democrats
3,Goodman is good man!\n\nSorry I had too,democrats
4,He is the only real American in that damn room...,democrats


In [33]:
len(dems_praw_comments_df)

573

Successfully scraped 573 comments off `r/democrats` using praw.

In [37]:
# view number of data collected in table 
x = PrettyTable()
x.field_names = ["Mode and Subreddit", "Number of Titles Scraped", "Number of Comments Scraped"]

x.add_row(["JSON | r/Conservative", len(conservative_df), len(conservative_top_comments_df)])
x.add_row(["PRAW | r/Conservative", len(cons_praw_titles_df), len(cons_praw_comments_df)])
x.add_row(["JSON | r/democrats", len(democrats_df), len(democrats_top_comments_df)])
x.add_row(["PRAW | r/democrats", len(dems_praw_titles_df), len(dems_praw_comments_df)])

print(x)

+-----------------------+--------------------------+----------------------------+
|   Mode and Subreddit  | Number of Titles Scraped | Number of Comments Scraped |
+-----------------------+--------------------------+----------------------------+
| JSON | r/Conservative |           491            |            729             |
| PRAW | r/Conservative |           504            |            4248            |
|   JSON | r/democrats  |           990            |            262             |
|   PRAW | r/democrats  |           997            |            573             |
+-----------------------+--------------------------+----------------------------+


Overall, `r/democrats` yielded more titles, possibly because `r/Conservative` contains duplicate titles where one or several users share the same title and news link to draw more redditors' attention to it. Conversely, there are many more comments on `r/Conservative` than on `r/democrats`. PRAW was also able to scrape all sub-comments, resulting in much more comment data.

We will move forward with the PRAW-scraped datasets, as they contain more data overall.

### Save datasets to csv file

In [40]:
#concatenate title dataframes and save to csv
titles_df = pd.concat([cons_praw_titles_df, dems_praw_titles_df])
titles_df.to_csv('../data/titles.csv', index = False)

In [41]:
#concatenate comments dataframes and save to csv
comments_df = pd.concat([cons_praw_comments_df, dems_praw_comments_df])
comments_df.to_csv('../data/comments.csv', index = False)
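
A quick round-trip check can confirm that the saved files load back with the expected shape. The sketch below uses toy frames and a temporary directory instead of `../data/`. Passing `ignore_index=True` to `pd.concat` is an optional variation, not what the cells above do; it gives the combined frame a fresh index starting at 0.

```python
import os
import tempfile

import pandas as pd

toy_cons = pd.DataFrame({'titles': ['a', 'b'], 'subreddit': 'Conservative'})
toy_dems = pd.DataFrame({'titles': ['c'], 'subreddit': 'democrats'})

combined = pd.concat([toy_cons, toy_dems], ignore_index=True)

# write to a temporary file and read it straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'titles.csv')
    combined.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# class balance of the label column after reloading
label_counts = reloaded['subreddit'].value_counts().to_dict()
```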

I will be performing data cleaning and exploratory data analysis on these datasets in [this notebook](02-data-cleaning-and-preprocessing.ipynb).

### References

Boe (2020). "PRAW: The Python Reddit API Wrapper". https://praw.readthedocs.io/en/latest/