# DSI Project 3 - Natural Language Processing via Reddit Scraping

---

# Problem Statement



A psychology professor is interested in gauging maliciousness via language. A frequent Reddit user, he has come across two subreddits that offer a useful basis for comparison: r/pettyrevenge and r/prorevenge.

However, this psychology professor has no background in language processing via code, and has thus hired me, a freelance data analyst, to use scraped data from both these subreddits to build a model that would determine if a post was less (r/pettyrevenge) or more (r/prorevenge) malicious, as well as do a simple analysis of any significant words found.

The professor has stated that for the purpose of his research, the model needs to achieve an accuracy score of over 70%.

---

# Executive Summary

The subreddits r/pettyrevenge and r/prorevenge were scraped for about 800 posts each via the PushshiftAPI. The "selftext" and "title" data categories were then extracted and combined into a "text" category. The text was run through a function removing stopwords, punctuation and short words (len < 3), and then lemmatized using the WordNet Lemmatizer. 

The text was transformed using a Count Vectorizer, and the top scoring words for both subreddits were removed before running the data through three classifiers: Multinomial Naive Bayes, K-Nearest-Neighbors and Random Forest, each with minor optimisation via GridSearchCV. Random Forest scored the best in accuracy overall, and was then further optimised, resulting in a model with 72.7% accuracy.

After examination of the highest weighted features, it was theorized that there were several themes that significantly factored into causing someone to increase their maliciousness: Level of hostile intent from the offending party, level of desperation of the victim and forward planning for the revenge.

Further research was suggested using several other "revenge" subreddits, such as r/nuclearrevenge, so that one's maliciousness can be determined on a scale rather than as a binary function.

---

In [2]:
#imports

import requests
import pandas as pd
import time
import random
import json
from datetime import datetime

---

# Part 1 - Scraping Reddit

---

### Contents

Part 1 - Reddit Scraping

- [Subreddit Overviews](#Subreddit-Overviews)
    - [r/pettyrevenge](#r/pettyrevenge)
    - [r/prorevenge](#r/prorevenge)


- [The Reddit API](#The-Reddit-API)


- [Alternative to the Reddit API: The Pushshift API](#Alternative-to-the-Reddit-API:-The-Pushshift-API)


- [Scraping Reddit using the PushshiftAPI](#Scraping-Reddit-using-the-PushshiftAPI)
    - [Scraping Data](#Scraping-Data)
    
<a href = "part-2_eda_and_data_cleaning.ipynb">Part 2 - EDA and Data Cleaning</a><br>

<a href = "part-3_model_building.ipynb">Part 3 - Model Building</a><br>

---

## Subreddit Overviews
[top](#Contents)

To provide some context to the problem statement, a quick overview of both subreddits is given below. The two subreddits share several characteristics and seem to differ only in the amount of malice demonstrated by the submissions.

### r/pettyrevenge
[top](#Contents)

<img src="images/petty.png" width ="600" height = "400" style="border:1px solid black">

<a href="https://www.reddit.com/r/pettyrevenge">/r/pettyrevenge</a> is a subreddit where submissions consist of stories of how people have done really petty things to get back at other people. The sub currently has 975k members and has been in existence since 2012. A common characteristic of the posts is the tl;dr (too long; didn't read) section, which summarizes the entire post.

### r/prorevenge 
[top](#Contents)

<img src="images/pro.png" width ="600" height = "400" style="border:1px solid black">

<a href="https://www.reddit.com/r/ProRevenge">/r/prorevenge</a> is a subreddit where submissions consist of stories of how people have gone to extreme lengths to get back at other people. The sub currently has 1.1 million members. Like r/pettyrevenge, this subreddit has been in existence since 2012, and its posts also share the characteristic tl;dr section.

---

## The Reddit API
[top](#Contents)

<img src="images/redditapi2.png" width ="600" height = "400" style="border:1px solid black">

Reddit provides an easy-to-access API: appending ".json" to the URL of any subreddit returns the page as JSON, a language-independent data exchange format. In a Python-specific context, this means nested dictionaries and lists, which one can then navigate to extract the required information.
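As a rough illustration of that navigation, the sketch below parses a small hand-written payload shaped like a Reddit listing (posts nested under `data`, then `children`, then `data`). The sample values and the `extract_posts` helper are made up for this example and are not part of the scraped dataset.

```python
import json

# Hand-written sample mimicking the shape of a Reddit listing response.
# The values are invented for illustration; only the nesting matters.
sample_payload = """
{
  "kind": "Listing",
  "data": {
    "after": null,
    "children": [
      {"kind": "t3", "data": {"title": "Example title A", "selftext": "Example body A"}},
      {"kind": "t3", "data": {"title": "Example title B", "selftext": "[removed]"}}
    ]
  }
}
"""

def extract_posts(listing):
    """Return (title, selftext) pairs from a parsed listing dictionary."""
    children = listing["data"]["children"]
    return [(child["data"]["title"], child["data"]["selftext"]) for child in children]

listing = json.loads(sample_payload)
posts = extract_posts(listing)
for title, selftext in posts:
    print(title, "->", selftext)
```

Finding the right chain of keys is the tedious part when the raw output is a single unformatted block of text.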

The major disadvantage of the Reddit API is readability. As seen in the image above, the Reddit API's JSON output is a large chunk of text without any line breaks or formatting. While it would be possible to manually scan the output to find the exact combination of keys required to extract information, this would be time-consuming and labor-intensive.

There are better alternatives.

---

## Alternative to the Reddit API: The Pushshift API
[top](#Contents)

<img src="images/pushshiftapi.png" width ="600" height = "400" style="border:1px solid black">

Several alternatives to the reddit API exist. One such popular API is the <a href="https://pushshift.io/api-parameters/">PushshiftAPI</a>. The PushshiftAPI converts the Reddit json output into a more readable format (as shown in the image above), allowing for an easier time in scanning through the document. 

There is one disadvantage to using the PushshiftAPI. Unlike the Reddit API, the PushshiftAPI is not "real-time", as it functions by copying and re-formatting the RedditAPI. This means that recent posts and edits up to a month before "now" might not be reflected in the PushshiftAPI json file.

For the current use case, however, there is no significant effect on model building, and the PushshiftAPI will be utilized for our data extraction process.

---

## Scraping Reddit using the PushshiftAPI
[top](#Contents)

The code below makes use of the PushshiftAPI to scrape posts by date, using the "created_utc" value of the last post in each batch as the start date for the next batch. An arbitrary minimum of 800 posts was scraped from each subreddit.

An initial stage of data cleaning was also integrated into the scraping process: each post was checked and skipped if it was marked "[removed]" or contained no text. Because the post count is only checked between batches, the totals exceed the minimum: the resulting dataset has 880 posts from r/prorevenge and 813 posts from r/pettyrevenge, which were then saved to a csv file.
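The paging idea can be sketched without touching the network. In the example below, `fake_fetch`, `paginate`, the page size of 3 and the demo posts are all hypothetical stand-ins for illustration; treating an empty string as "no text" is also an assumption of this sketch rather than something the scraper below does.

```python
def fake_fetch(posts, after, before, page_size=3):
    """Hypothetical stand-in for an API call: return up to page_size posts
    with after < created_utc < before, oldest first."""
    matching = [p for p in posts if after < p["created_utc"] < before]
    return sorted(matching, key=lambda p: p["created_utc"])[:page_size]

def paginate(posts, start, stop, count_limit):
    """Collect usable posts page by page until count_limit is reached
    or no more posts are returned."""
    kept = []
    page = fake_fetch(posts, start, stop)
    while page and len(kept) < count_limit:
        for post in page:
            # Skip posts that were removed or have no body text
            if post.get("selftext") not in (None, "", "[removed]"):
                kept.append(post)
        # The next page starts after the timestamp of the last post seen
        start = page[-1]["created_utc"]
        page = fake_fetch(posts, start, stop)
    return kept

# Made-up posts, purely for illustration
demo_posts = [
    {"created_utc": 10, "selftext": "story one"},
    {"created_utc": 20, "selftext": "[removed]"},
    {"created_utc": 30, "selftext": "story two"},
    {"created_utc": 40, "selftext": ""},
    {"created_utc": 50, "selftext": "story three"},
    {"created_utc": 60, "selftext": "story four"},
]

kept = paginate(demo_posts, start=0, stop=100, count_limit=3)
print([p["created_utc"] for p in kept])
```

With a limit of 3 the sketch keeps 4 posts, because the limit is only checked once a whole page has been processed. The same effect explains why a minimum of 800 can yield slightly more than 800 posts.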

### Scraping Data
[top](#Contents)

In [3]:
# determining the start and stop dates and converting them into timestamps

start_date = datetime(2019, 1,1)
stop_date = datetime(2020, 1,1)

start_timestamp = int(datetime.timestamp(start_date))
stop_timestamp = int(datetime.timestamp(stop_date))
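One caveat: `datetime.timestamp()` on a naive `datetime` interprets it in the local time zone of the machine, so the cutoffs above shift depending on where the notebook is run (the `after=1546272000` value in the logged URLs corresponds to 2018-12-31 16:00 UTC). If UTC boundaries are wanted, a minimal sketch could look like the following; the helper name `to_utc_timestamp` is an assumption for illustration and is not used elsewhere in this notebook.

```python
from datetime import datetime, timezone

def to_utc_timestamp(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

utc_start = to_utc_timestamp(2019, 1, 1)
utc_stop = to_utc_timestamp(2020, 1, 1)
print(utc_start, utc_stop)  # 1546300800 1577836800
```

For this project the difference of a few hours at either end of a year-long window is unlikely to matter, but it does affect reproducibility across machines.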

In [12]:
# Creating a function to scrape reddit for a minimum number of posts

def pushscraper(start_timestamp, stop_timestamp, subreddit, count_limit):
    
    # Creating an empty aggregated posts list to append posts to later 
    
    aggregate = []
    
    subCount = 0
    
    # Internal function to read and extract the data from the json input
    
    def getPSData(start_timestamp, stop_timestamp, subreddit):
        
        # The url is built from the query parameters of the PushshiftAPI submission search endpoint
        
        url = 'https://api.pushshift.io/reddit/search/submission/?size=1000&after='+str(start_timestamp)+'&before='+str(stop_timestamp)+'&subreddit='+str(subreddit)
        print(url)
        
        # Using the requests and json modules to parse the json response and return the list of posts under the 'data' key
        
        r = requests.get(url)
        data = json.loads(r.text)
        return data['data']    
    
    # Calling getPSData to get the initial set of posts
    
    data = getPSData(start_timestamp, stop_timestamp, subreddit)
    
    # While loop to run the function continuously until the required number of posts is met (count_limit)

    while len(data) > 0 and subCount < count_limit:
        
        # Keep only posts that have a "selftext" field and were not removed;
        # posts without a "selftext" key raise a KeyError and are skipped
        
        for submission in data:
            try:
                if submission["selftext"] != "[removed]":
                    subCount+=1
                    aggregate.append(submission)
            except KeyError:
                pass

        # Printing number of posts scraped and datetime of last post
        
        print(f'total posts scraped: {subCount}')
        print(str(datetime.fromtimestamp(data[-1]['created_utc'])))               
        
        # Calling getPSData with the created date of the last submission
        
        start_timestamp = data[-1]['created_utc']        
        data = getPSData(start_timestamp, stop_timestamp, subreddit)
    
    result = pd.DataFrame(aggregate)
    
    return result

In [5]:
# Scraping r/prorevenge for 800 posts

pro_df = pushscraper(start_timestamp, stop_timestamp, "prorevenge", 800)

https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546272000&before=1577808000&subreddit=prorevenge
total posts scraped: 95
2019-01-15 14:06:51
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547532411&before=1577808000&subreddit=prorevenge
total posts scraped: 193
2019-01-30 04:54:12
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1548795252&before=1577808000&subreddit=prorevenge
total posts scraped: 292
2019-02-13 23:00:59
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1550070059&before=1577808000&subreddit=prorevenge
total posts scraped: 389
2019-02-17 16:54:21
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1550393661&before=1577808000&subreddit=prorevenge
total posts scraped: 487
2019-02-19 03:02:46
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1550516566&before=1577808000&subreddit=prorevenge
total posts scraped: 586
2019-02-20 06:53:42
https://api.pushshift.io/redd

In [6]:
# Scraping r/pettyrevenge for 800 posts

petty_df = pushscraper(start_timestamp, stop_timestamp, "pettyrevenge", 800)

https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546272000&before=1577808000&subreddit=pettyrevenge
total posts scraped: 29
2019-01-05 11:01:43
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546657303&before=1577808000&subreddit=pettyrevenge
total posts scraped: 58
2019-01-11 04:09:10
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547150950&before=1577808000&subreddit=pettyrevenge
total posts scraped: 89
2019-01-15 20:36:15
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547555775&before=1577808000&subreddit=pettyrevenge
total posts scraped: 121
2019-01-19 05:48:21
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547848101&before=1577808000&subreddit=pettyrevenge
total posts scraped: 145
2019-01-22 12:26:25
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1548131185&before=1577808000&subreddit=pettyrevenge
total posts scraped: 179
2019-01-25 04:10:17
https://api.pushshi

In [7]:
# Combining posts and saving as a csv file

final_posts_ps = pd.concat([petty_df, pro_df])

final_posts_ps.to_csv('data/raw_scrape_ps.csv', index = False)

#### Crosscheck

In [27]:
test = pd.read_csv("data/raw_scrape_ps.csv")

In [30]:
test.drop_duplicates(inplace = True)
print(test.shape)

(1693, 56)


In [31]:
test.head()

| Unnamed: 0 | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_patreon_flair | can_mod_post | contest_mode | created_utc | ... | title | url | whitelist_status | wls | author_cakeday | updated_utc | gilded | post_hint | preview | edited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mr_kitloin | | [] | | text | t2_wupes | False | False | False | 1546273636 | ... | No mike and ikes for me? How about hot tamales... | https://www.reddit.com/r/pettyrevenge/comments... | all_ads | 6 | | | | | | |
| 1 | CyberneticFennec | | [] | | text | t2_15fz43 | False | False | False | 1546280455 | ... | I'm not allowed to use the car? Fine. | https://www.reddit.com/r/pettyrevenge/comments... | all_ads | 6 | | | | | | |
| 2 | bblackwalker | | [] | | text | t2_24uomksc | False | False | False | 1546291051 | ... | Tinder date was an extreme feminist and wanted... | https://www.reddit.com/r/pettyrevenge/comments... | all_ads | 6 | | | | | | |
| 3 | ilovemysubaru | | [] | | text | t2_11xnqyar | False | False | False | 1546305143 | ... | Pretty Revenge Request | https://www.reddit.com/r/pettyrevenge/comments... | all_ads | 6 | | | | | | |
| 4 | shinn_ann | | [] | | text | t2_p5svqtv | False | False | False | 1546309401 | ... | refuse to quiet down? let your kids learn rap ... | https://www.reddit.com/r/pettyrevenge/comments... | all_ads | 6 | | | | | | |
