### 1. Introduction and problem statement

#### a. Introduction

Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members. Posts are organized into topic-specific communities called subreddits, where users react to each post with comments and with upvotes or downvotes.

#### b. Problem statement

For this project, we select two subreddits that look similar on the surface: `r/nottheonion` and `r/theonion`. `r/nottheonion` focuses on true stories that are so mind-blowingly ridiculous that you could have sworn they were from The Onion, while `r/theonion` focuses on American satirical articles on international, national, and local news.

Because the names of these two subreddits are so similar, they are easily confused, which creates problems for the moderators who maintain them. When performing maintenance, a moderator may accidentally delete multiple posts from r/nottheonion and r/theonion and then be able to recover only the titles of the lost posts, not the subreddit each one came from.


The goals of the project are to use natural language processing to build a classification model that:

* is trained on posts submitted before 01 Jan 2022 and classifies the recovered posts back to their respective subreddits, r/nottheonion and r/theonion, based solely on the post titles;
* serves as a proof of concept for an automated moderator that would automatically delete posts that do not belong to the subreddit they are posted to;
* frees up time for human moderators, who are volunteers, by having automated moderators police the subreddit for spam posts.


### 2. Data Scraping

We use the Pushshift API, through the PSAW Python wrapper, to extract up to a maximum of 10,000 posts from each of these subreddits for the year 2021. On Reddit, each subreddit has its own JSON that we can access. The most direct way to access a subreddit's data is through the site's Application Programming Interface (API), which returns JavaScript Object Notation (JSON). This JSON can then be read into Python as a dictionary.
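
To make the last step concrete, here is a minimal sketch of reading a JSON payload into a Python dictionary with the standard `json` module. The payload is a hypothetical, heavily trimmed stand-in for an API response: its nesting and the name `sample_json` are assumptions made for illustration, and the field values are borrowed from the first scraped row in this notebook.

```python
import json

# Hypothetical, trimmed-down payload standing in for an API response
sample_json = '''
{
  "data": [
    {
      "subreddit": "nottheonion",
      "author": "Taco_duck68",
      "domain": "wral.com",
      "num_comments": 0,
      "score": 1,
      "created_utc": 1640995192
    }
  ]
}
'''

# json.loads turns the JSON text into nested Python dicts and lists
payload = json.loads(sample_json)
first_post = payload["data"][0]

print(type(payload).__name__)   # dict
print(first_post["domain"])     # wral.com
```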

The following information is scraped:
* Author of post
* Domain (source site of the linked content)
* Title of post
* Number of comments
* Score
* Date of post
* Subscribers
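
One way to collect the fields listed above from a single scraped record is to look them up by name and fall back to `None` when a field is absent. The sketch below runs offline and is illustrative only: the `Submission` named tuple is a hypothetical stand-in for one API result (not the object PSAW actually returns), `to_row` and `COLUMNS` are names introduced here, and the values come from the first row of the r/nottheonion output in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for one scraped record; note it has no 'score' field
Submission = namedtuple(
    "Submission",
    ["author", "created_utc", "domain", "num_comments", "subreddit", "title"],
)

post = Submission(
    author="Taco_duck68",
    created_utc=1640995192,
    domain="wral.com",
    num_comments=0,
    subreddit="nottheonion",
    title="Man attempts to pay for car with rap, steals p...",
)

# Map output column name -> field name on the record
COLUMNS = {
    "subreddit": "subreddit",
    "author": "author",
    "domain": "domain",
    "title": "title",
    "num_comments": "num_comments",
    "score": "score",
    "timestamp": "created_utc",
}

def to_row(record):
    """Build one row dict, using None for any field the record lacks."""
    return {col: getattr(record, field, None) for col, field in COLUMNS.items()}

row = to_row(post)
print(row["title"])
print(row["score"])  # None, because the stand-in record has no score
```

Looking fields up by name keeps the row correct even when a response omits a field, whereas positional indexing would silently shift every later column.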

In [14]:
# API scrape 
from psaw import PushshiftAPI

# Import libraries
import numpy as np
import pandas as pd

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')


In [15]:
# Define the scraping function

def scrap_data(subreddit):
    """Scrape up to 10,000 posts made in 2021 from the given subreddit."""

    # Instantiate the Pushshift API wrapper
    api = PushshiftAPI()

    # Fields requested from the API for each post
    fields = ['title', 'subreddit', 'num_comments', 'author',
              'subreddit_subscribers', 'score', 'domain', 'created_utc']

    # Create list of scraped data
    scrape_list = list(api.search_submissions(subreddit=subreddit,
                                after=1609459200,   # 01 Jan 2021 00:00:00 UTC
                                before=1640995200,  # 01 Jan 2022 00:00:00 UTC
                                filter=fields,
                                limit=10000))

    # Keep only the fields of interest, looked up by name rather than by
    # position so that a missing field cannot shift the other columns
    clean_scrape_lst = []
    for post in scrape_list:
        scrape_dict = {}
        scrape_dict['subreddit'] = getattr(post, 'subreddit', None)
        scrape_dict['author'] = getattr(post, 'author', None)
        scrape_dict['domain'] = getattr(post, 'domain', None)
        scrape_dict['title'] = getattr(post, 'title', None)
        scrape_dict['num_comments'] = getattr(post, 'num_comments', None)
        scrape_dict['score'] = getattr(post, 'score', None)
        scrape_dict['timestamp'] = getattr(post, 'created_utc', None)
        clean_scrape_lst.append(scrape_dict)

    # Show number of subscribers, as recorded on the second scraped post
    if len(scrape_list) > 1:
        print(subreddit, 'subscribers:',
              getattr(scrape_list[1], 'subreddit_subscribers', None))

    # Return list of scraped data
    return clean_scrape_lst
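
The `after` and `before` arguments are Unix timestamps in seconds. As a sanity check on the scraping window, the following sketch uses only the standard library to confirm which UTC dates those two values correspond to. The helper name `to_epoch` is introduced here for illustration.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

after = to_epoch(2021, 1, 1)    # start of the window
before = to_epoch(2022, 1, 1)   # end of the window

print(after, before)  # 1609459200 1640995200

# Going the other way: the timestamp in the first row of the r/nottheonion output
first_row_time = datetime.fromtimestamp(1640995192, tz=timezone.utc)
print(first_row_time.isoformat())  # 2021-12-31T23:59:52+00:00
```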

In [16]:
# Create DataFrame for nottheonion using the function

df_not_onion = pd.DataFrame(scrap_data('nottheonion'))


nottheonion subscribers: 20438921


In [17]:
# Convert the data to csv

df_not_onion.to_csv('nottheonion.csv', index=False)

In [18]:
# Check the shape of DataFrame

df_not_onion.shape

(9997, 7)

In [19]:
# Show head

df_not_onion.head()

|   | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | nottheonion | Taco_duck68 | wral.com | Man attempts to pay for car with rap, steals p... | 0 | 1 | 1640995192 |
| 1 | nottheonion | BlackNingaa | bloodyelbow.com | Former UFC fighter reveals past as sex worker ... | 1 | 1 | 1640994707 |
| 2 | nottheonion | Lopsided_File_1642 | facebook.com | Log into Facebook | 1 | 1 | 1640991506 |
| 3 | nottheonion | SkinnyWhiteGirl19 | theartnewspaper.com | McDonald’s blocked from building drive-through... | 0 | 1 | 1640990429 |
| 4 | nottheonion | kids-cake-and-crazy | kjrh.com | Legendary actress Betty White dies at 99 on Ne... | 0 | 1 | 1640989181 |


In [20]:
# Create DataFrame for theonion using the function

df_onion = pd.DataFrame(scrap_data('theonion'))

theonion subscribers: 165298


In [21]:
# Convert the data to csv

df_onion.to_csv('theonion.csv', index=False)

In [22]:
# Check the shape of DataFrame

df_onion.shape

(1343, 7)

In [23]:
# Show head
df_onion.head()

|   | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | TheOnion | mothershipq | theonion.com | Surgeon Kind Of Pissed Patient Seeing Her Defo... | 0 | 1 | 1640973300 |
| 1 | TheOnion | -ImYourHuckleberry- | theartnewspaper.com | McDonald’s blocked from building drive-through... | 1 | 1 | 1640971771 |
| 2 | TheOnion | dwaxe | theonion.com | Gwyneth Paltrow Touts New Diamond-Encrusted Tr... | 0 | 1 | 1640955671 |
| 3 | TheOnion | dwaxe | theonion.com | Artist Crafting Music Box Hopes It Delights At... | 0 | 1 | 1640955669 |
| 4 | TheOnion | dwaxe | theonion.com | Homeowner Trying To Smoke Out Snakes Accidenta... | 0 | 1 | 1640955668 |
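
The two shapes reported above, 9,997 rows for r/nottheonion and 1,343 rows for r/theonion, mean the classes are imbalanced. The short calculation below is a sketch rather than a step taken in this notebook. It shows the accuracy a classifier would reach by always predicting the larger class, which may be a useful reference point for any model trained on these titles.

```python
# Row counts taken from the DataFrame shapes reported in this notebook
n_not_onion = 9997
n_onion = 1343

total = n_not_onion + n_onion
majority_share = max(n_not_onion, n_onion) / total

print(total)                     # 11340
print(round(majority_share, 3))  # 0.882
```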
