# Project 3: Subreddit Classification
## Project Overview
**Data science process demonstrated:**
- [Problem Statement](#Problem-Statement)
- [Executive Summary](#Executive-Summary)
- [Webscraping and Data Acquisition](#Webscraping-and-Data-Acquisition)
- [Data Inspect and Data Cleaning](#Data-Inspect-and-Data-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Preprocessing and Feature Engineering](#Preprocessing-and-Feature-engineering)
- [Modelling](#Modelling)
- [Outside Research](#Outside-Research)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Problem Statement

## Executive Summary

In [1]:
# Import libraries

# Maths
import pandas as pd
import numpy as np

# Web
import requests
import json
import time, warnings
import datetime as dt

# nlp
import regex as re
from sklearn.feature_extraction.text import CountVectorizer,HashingVectorizer,TfidfVectorizer
from nltk.corpus import stopwords
from wordcloud import WordCloud

# For visuals
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
%matplotlib inline
plt.style.use('seaborn')
%config InlineBackend.figure_format = 'retina'


# display data
pd.options.display.max_rows = 4000
pd.options.display.max_columns = 500

# Webscraping and Data Acquisition

In [103]:
# Modified from web-scraper given to the class

def webscrape_subreddit(subreddit):
    
    posts = []
    after = None

    for _ in range(50):  # each iteration requests one page of results
        url = f'https://www.reddit.com/r/{subreddit}/new.json' # download posts sorted by new
        
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after

        # send request to url
        res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

        # check for errors
        if res.status_code != 200:
            print('Status error', res.status_code)
            break
    
        # get posts and add to [posts]
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        
        # Output JSON file for each request. Use timestamp to separate each request file 
        timestamp = dt.datetime.now()
        timestamp = timestamp.strftime(" %Y_%m_%d %H_%M_%S")
        with open(f'../json_files/{subreddit}/posts_{subreddit}_{timestamp}.json', 'w') as outfile:
            json.dump(current_dict, outfile)

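        # get tag of last post on the page
        after = current_dict['data']['after']

        # no tag means there is no further page; without this check the
        # next request would fetch the first page again and add duplicates
        if after is None:
            break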

        # generate a random sleep duration to look more 'natural'
        sleep_duration = np.random.randint(2,6)
        time.sleep(sleep_duration)
    

    # create dataframe to extract useful columns
    df_post = pd.DataFrame(posts)
    df_post = df_post[['title','author','created', 'selftext', 'score', 'upvote_ratio', 'num_comments']]
    df_post['date_created'] = pd.to_datetime(df_post['created'], unit = 's') #create new date column
    # remove column that is used to create date
    df_post = df_post[['title','author','selftext', 'score', 'upvote_ratio', 'num_comments', 'date_created']]
    
    # check whether all posts are added to df_post
    assert df_post.shape[0] == len(posts), 'some posts are missing from df_post'

    # Investigate duplicates
    print("There are", df_post.duplicated(subset='title').sum(), "number of duplicate posts")
    df_post = df_post.drop_duplicates(subset='title') #drop the duplicates
    
    # print number of posts saved
    print(f'After removing the duplicates, a total of {len(df_post)} posts were downloaded.')

    return df_post
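
The paging logic inside `webscrape_subreddit` can be tried without sending any requests. The sketch below is only an illustration: `build_listing_url`, `extract_posts` and the `fake_page` dictionary are hypothetical names and data that are not part of the scraper above. They assume a listing response shaped like the one the scraper reads (`data` → `children` and `after`).

```python
from typing import Optional
from urllib.parse import urlencode


def build_listing_url(subreddit: str, after: Optional[str] = None) -> str:
    """Return the URL for one page of a subreddit's 'new' listing."""
    base = f"https://www.reddit.com/r/{subreddit}/new.json"
    if after is None:
        return base  # first page: no pagination tag yet
    return base + "?" + urlencode({"after": after})


def extract_posts(listing: dict) -> tuple:
    """Return (posts, after) from one listing response."""
    posts = [child["data"] for child in listing["data"]["children"]]
    return posts, listing["data"]["after"]


# Hypothetical, hand-written response (not real scraped data)
fake_page = {
    "data": {
        "after": "t3_abc123",
        "children": [{"data": {"title": "first"}}, {"data": {"title": "second"}}],
    }
}

posts, after = extract_posts(fake_page)
print(len(posts), after)
print(build_listing_url("pixar", after))
```

Keeping the URL building and the parsing in small functions like these makes it possible to check the pagination behaviour before running the full, slow scrape.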

In [89]:
%%time
df_pixar = webscrape_subreddit('pixar')
df_pixar.head()

There are 0 number of duplicate posts
After removing the duplicates, a total of 50 posts were downloaded.
Wall time: 9.99 s


| | title | author | selftext | score | upvote_ratio | num_comments | date_created |
|---|---|---|---|---|---|---|---|
| 0 | POLL: Favorite Inside Out Character? | TheReySkywalker | | 2 | 1.0 | 0 | 2020-09-24 14:14:07 |
| 1 | Do BnL Starliners have any means of defending ... | Newman1651 | The danger in question not present in the film... | 1 | 1.0 | 0 | 2020-09-24 12:41:12 |
| 2 | Here's a sneak peak for my 25 Years of Pixar A... | davidthefnaffan | | 85 | 0.94 | 7 | 2020-09-24 08:48:32 |
| 3 | Carlos Maserati | BasedJerrySmith | | 6 | 0.81 | 0 | 2020-09-23 23:45:02 |
| 4 | bought this necklace with multiple pixar charm... | penguinschlong | | 99 | 0.98 | 2 | 2020-09-23 16:26:23 |
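
The `date_created` column comes from the `created` field, which the scraper treats as seconds since the Unix epoch. A minimal sketch of that conversion on a toy dataframe follows; the epoch value is a made-up example chosen for illustration, not a value copied from the scraped data.

```python
import pandas as pd

# Hypothetical epoch timestamp in seconds, for illustration only
df = pd.DataFrame({"created": [1600956847]})

# unit='s' tells pandas the integers are seconds since 1970-01-01 (UTC)
df["date_created"] = pd.to_datetime(df["created"], unit="s")

# the raw column is no longer needed once the readable date exists
df = df.drop(columns="created")
print(df)
```

Using `drop(columns=...)` is one alternative to re-selecting every remaining column by name, as the scraper does.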


In [87]:
# save the pixar posts dataframe to csv
df_pixar.to_csv('../data/pixar_posts.csv', index = True)

In [102]:
# Opening raw JSON file (before clean); the with-block closes the file afterwards
with open('../json_files/pixar/posts_pixar_ 2020_09_25 00_10_29.json') as f:
    # returns JSON object as a dictionary
    json_pixar = json.load(f)
print("Sanity check shows a request containing", len(json_pixar['data']['children']), "posts")

Sanity check shows a request containing 25 posts
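
The same kind of sanity check can be rehearsed on a throwaway file. The sketch below writes a small, hand-made listing to a temporary directory and reads it back. The `listing` dictionary and the file name are hypothetical and only mimic the structure of the saved request files.

```python
import json
import os
import tempfile

# Hypothetical listing with three posts, shaped like one saved request
listing = {
    "data": {
        "after": None,
        "children": [{"data": {"title": f"post {i}"}} for i in range(3)],
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "posts_example.json")

    # write the raw response to disk, as the scraper does for each request
    with open(path, "w") as outfile:
        json.dump(listing, outfile)

    # read it back and count the posts it contains
    with open(path) as infile:
        reloaded = json.load(infile)

n_posts = len(reloaded["data"]["children"])
print("Sanity check shows a request containing", n_posts, "posts")
```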


In [10]:
%%time
df_disney = webscrape_subreddit('disney')
df_disney.head()

There are 250 number of duplicate posts
After removing the duplicates, a total of 1231 posts were downloaded.
Wall time: 3min 36s


| | title | author | selftext | score | upvote_ratio | num_comments | date_created |
|---|---|---|---|---|---|---|---|
| 0 | I made Baby Yoda (The Child) earrings for my f... | artisticallymusical5 | | 13 | 0.88 | 2 | 2020-09-24 12:40:19 |
| 1 | Where can I find this pillow? | Epehec | | 2 | 1.0 | 1 | 2020-09-24 13:04:25 |
| 2 | Princess Tiger Lily (Peter Pan) | lazypeach0 | Why isn’t she ever categorized with the other ... | 1 | 1.0 | 0 | 2020-09-24 11:02:42 |
| 3 | Spirit Halloween TNBC Merch! | waldesnachtbrahms | | 2 | 0.75 | 0 | 2020-09-24 10:48:49 |
| 4 | Sitka (Brother Bear) Thinking About Elsa (Froz... | Rosie-Love98 | | 0 | 0.25 | 0 | 2020-09-24 10:25:22 |
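
The duplicate count reported by the scraper comes from `duplicated(subset='title')`, which flags every repeat of a title after its first occurrence. A small sketch on invented data (the `toy` dataframe is hypothetical) shows how the count and the de-duplicated frame relate:

```python
import pandas as pd

# Hypothetical posts: titles "A" and "B" each appear twice
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C", "B"],
    "score": [1, 2, 3, 4, 5],
})

# True for every row whose title has already been seen earlier in the frame
n_dupes = toy.duplicated(subset="title").sum()

# keeps the first occurrence of each title and drops the later repeats
deduped = toy.drop_duplicates(subset="title")

print(n_dupes, len(deduped))
```

Note that this treats two different posts with the same title as duplicates, so the count can include genuine reposts as well as pages fetched twice.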


# Data Inspect and Data Cleaning

In [104]:
df_pixar.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         50 non-null     object        
 1   author        50 non-null     object        
 2   selftext      50 non-null     object        
 3   score         50 non-null     int64         
 4   upvote_ratio  50 non-null     float64       
 5   num_comments  50 non-null     int64         
 6   date_created  50 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 3.1+ KB


In [131]:
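# Alternative: mark empty or whitespace-only strings as missing values
#df_pixar.replace(r'^\s*$', np.nan, regex=True).head()

# Preview with a visible 'test' placeholder in place of empty strings
df_pixar.replace(r'^\s*$', 'test', regex=True).head()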

| | title | author | selftext | score | upvote_ratio | num_comments | date_created |
|---|---|---|---|---|---|---|---|
| 0 | POLL: Favorite Inside Out Character? | TheReySkywalker | test | 2 | 1.0 | 0 | 2020-09-24 14:14:07 |
| 1 | Do BnL Starliners have any means of defending ... | Newman1651 | The danger in question not present in the film... | 1 | 1.0 | 0 | 2020-09-24 12:41:12 |
| 2 | Here's a sneak peak for my 25 Years of Pixar A... | davidthefnaffan | test | 85 | 0.94 | 7 | 2020-09-24 08:48:32 |
| 3 | Carlos Maserati | BasedJerrySmith | test | 6 | 0.81 | 0 | 2020-09-23 23:45:02 |
| 4 | bought this necklace with multiple pixar charm... | penguinschlong | test | 99 | 0.98 | 2 | 2020-09-23 16:26:23 |


In [132]:

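# replace() returns a new DataFrame and its result was not assigned back,
# so df_pixar itself contains no 'test' placeholders
df_pixar.selftext.str.count('test').sum()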

0
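
A count of 0 is what one would expect if the replacement was never stored. The sketch below, on a hypothetical `toy` dataframe, shows that `replace` leaves the original untouched unless its result is assigned, and offers one possible way to count empty `selftext` entries by turning them into missing values first.

```python
import numpy as np
import pandas as pd

# Hypothetical posts: one empty, one with text, one whitespace-only
toy = pd.DataFrame({"selftext": ["", "some text", "   "]})

# replace() returns a new DataFrame; toy is left as it was
replaced = toy.replace(r"^\s*$", np.nan, regex=True)

n_blank = replaced["selftext"].isna().sum()   # blanks found in the copy
n_original = toy["selftext"].isna().sum()     # the original has no NaN

print(n_blank, n_original)
```

Assigning the result back (for example `df = df.replace(...)`) would be needed for the change to persist in later cells.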