# Project 3 - Reddit NLP

<img src="./images/reddit.png" alt="reddit" width="600"/>



## Problem Statement

Reddit, the user-generated news and content aggregator, has at its heart a system of specialized "subreddits" where users gather to discuss shared interests. Our goal is to build a supervised Natural Language Processing (NLP) classifier and see whether it can distinguish which of two selected subreddits a post came from.

## Executive Summary

Reddit is a community-based forum often dubbed "the front page of the internet." Its central front page displays the top posts from many subreddits. Subreddits are smaller sub-forums dedicated to specific topics chosen by Reddit's members. They can be created by individual members and are typically self-moderated unless they violate the Reddit terms of service. Users can post their own content as well as reply to others' posts with comments. Using this system of posts and comments, we believe supervised Natural Language Processing (NLP) can predict which of two chosen subreddits a given post came from. Training on these two subreddits may also reveal some of the intricacies and trends that underlie why people post to one subreddit rather than the other. If our model achieves this with greater accuracy than a baseline model, we can walk away with some useful insights.


   ### Contents:

   - [Imports](#Imports)
   - [Read-in Data Files](#Read-in-Data-Files)
   - [Preprocessing](#Preprocessing)
   - [Functions](#Functions)
   - [Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA))
   - [Modeling](#Modeling)
   - [Conclusions and Recommendations](#Conclusions-and-Recommendations)
   - [Sources](#Sources)

## Imports

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import time, requests, json
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import stop_words
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#Test Imports - May not be used.
import eli5
from eli5.sklearn import PermutationImportance

#538 Style graphs
plt.style.use('fivethirtyeight')

#Optimized graphs for retina displays
%config InlineBackend.figure_format = 'retina'

#Random State
np.random.seed(41)

## Reddit Pushshift API Query

In [2]:
#Pushshift Query provided by Brian Collins
def query_pushshift(subreddit, kind='comment', skip=30, times=6, 
                    subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 
                                'num_comments', 'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):

    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    mylist = []
    
    for x in range(1, times):  # makes times - 1 requests; request x uses after=(skip * x) days
        
        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        response = requests.get(URL)
        assert response.status_code == 200, "Link Dead"
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        mylist.append(df)
        time.sleep(2)
        
    full = pd.concat(mylist, sort=False)
    
    if kind == "submission":
        
        full = full[subfield]
        
        full = full.drop_duplicates()
        
        full = full.loc[full['is_self'] == True]
        
    def get_date(created):
        return dt.date.fromtimestamp(created)
    
    _timestamp = full["created_utc"].apply(get_date)
    
    full['timestamp'] = _timestamp

    print(full.shape)
    
    return full
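
To make the paging logic in `query_pushshift` easier to see, here is a minimal sketch that only builds the request URLs and never calls the API. The helper name `build_pushshift_urls` is hypothetical; the URL pattern and the `range(1, times)` loop are copied from the function above.

```python
def build_pushshift_urls(subreddit, kind="comment", skip=30, times=6):
    """Return the URLs query_pushshift would request (no network access)."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    # range(1, times) yields times - 1 values, so times=24 means 23 requests
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times)]


urls = build_pushshift_urls("legaladvice", kind="submission", times=24)
print(len(urls))   # number of requests that would be made
print(urls[0])     # the first request uses after=30d
```

With `times=24` the loop runs 23 times, and 23 requests of up to 500 rows each is consistent with the 11,500 comment rows reported later in the notebook.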

## Initial Scrape Only

In [3]:
# #Chosen Subreddits
# s1 = 'legaladvice'
# s2 = 'relationship_advice'

# # #API Calls for data
# s1_comment = query_pushshift(s1, kind='comment', times=24)
# s1_submission = query_pushshift(s1, kind='submission', times=24)
# s2_comment = query_pushshift(s2, kind='comment', times=24)
# s2_submission = query_pushshift(s2, kind='submission', times=24)

# #Save API data to files for quick access later
# s1_comment.to_csv('./data/s1_comment.csv', index=False)
# s1_submission.to_csv('./data/s1_submission.csv', index=False)
# s2_comment.to_csv('./data/s2_comment.csv', index=False)
# s2_submission.to_csv('./data/s2_submission.csv', index=False)

#Uncomment the lines above if rescraping.

For this study, we used the r/legaladvice and r/relationship_advice subreddits. In r/legaladvice, members discuss legal questions, whether theoretical or seeking actual advice about court cases they may be involved in. Redditors who are lawyers also typically chime in with advice. In r/relationship_advice, people bring relationship issues and ask other members to weigh in. We believed these two forums would initially be hard to distinguish because questions on r/relationship_advice very often involve divorce, getting a lawyer, and legal options after a bad breakup. Going forward, r/legaladvice is "s1" (subreddit 1) and r/relationship_advice is "s2" (subreddit 2) in variable names.

In [4]:
#Read in saved data from API calls
s1_comment = pd.read_csv('./data/s1_comment.csv')
s1_submission = pd.read_csv('./data/s1_submission.csv')
s2_comment = pd.read_csv('./data/s2_comment.csv')
s2_submission = pd.read_csv('./data/s2_submission.csv')

In [5]:
#Check dimensionality of queried data. Our initial goal was to get as close to 10,000 rows as possible for each subreddit.
s1_comment.shape, s2_comment.shape, s1_submission.shape, s2_submission.shape

((11500, 35), (11500, 33), (11498, 9), (11454, 9))

We have around 11,500 results from each query. If the EDA does not require dropping many rows, it is very unlikely we will need to deal with unbalanced classes. The Pushshift query also filtered out non-self posts, which are common on Reddit because many posts are simply pictures or links with no text content to analyze. The script omits those and keeps only self (text) posts.

## Exploratory Data Analysis (EDA)

### Submission Columns Cleaning

In [6]:
#Discover columns contained for each submission group
s1_submission.columns

Index(['title', 'selftext', 'subreddit', 'created_utc', 'author',
       'num_comments', 'score', 'is_self', 'timestamp'],
      dtype='object')

In [7]:
#Remove data columns we will not use
s1_submission.drop(columns=['created_utc', 'author', 'num_comments', 
                            'subreddit', 'is_self', 'timestamp'], inplace=True)
s2_submission.drop(columns=['created_utc', 'author', 'num_comments', 
                            'subreddit', 'is_self', 'timestamp'], inplace=True)

#Subreddit #1 is a 0 and subreddit #2 is a 1
s1_submission['subreddit'] = 0
s2_submission['subreddit'] = 1

In [8]:
s1_submission.shape, s2_submission.shape #Shape check

((11498, 4), (11454, 4))

Assign class labels to the subreddits: s1 is 0 and s2 is 1.

In [9]:
#Remove posts whose selftext is [deleted] or [removed], and keep only scores greater than 0 (1 or above).
#We believe that posts that score higher will typically be more aligned with the beliefs and
#culture of each subreddit
s1_submission = s1_submission[(s1_submission['selftext'] != '[deleted]') & 
                              (s1_submission['selftext'] != '[removed]') & 
                              (s1_submission['score'] > 0)]

s2_submission = s2_submission[(s2_submission['selftext'] != '[deleted]') & 
                              (s2_submission['selftext'] != '[removed]') & 
                              (s2_submission['score'] > 0)]

In [10]:
s1_submission.shape, s2_submission.shape #Shape check

((7613, 4), (8685, 4))

We only keep submissions with a score > 0. Submissions rated more highly by the community are a better indicator of a subreddit's tone and style, and moderators also remove posts that are unfavorable or score poorly. We used a score of 0 as our cutoff because scores can be negative, indicating negative sentiment from those who view the subreddit. We also assume that a post at the extremes, one sharing no vocabulary with its subreddit, would match just as poorly in any other subreddit, so classifying it would be a random choice at best. We do not want our model to fit these "random" submissions, which only add noise. We also remove posts whose text is "[removed]" or "[deleted]", since the original content is no longer available.
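
The filter described above can be sketched on a toy DataFrame. The rows here are made up for illustration; only the column names `selftext` and `score` and the three conditions come from the notebook.

```python
import pandas as pd

# Toy data (hypothetical rows, not from the scrape)
toy = pd.DataFrame({
    "selftext": ["[deleted]", "[removed]", "My landlord kept my deposit.",
                 "Short post", "Downvoted post"],
    "score": [5, 3, 2, 0, -1],
})

# Keep rows whose text is still available and whose score is at least 1
keep = ~toy["selftext"].isin(["[deleted]", "[removed]"]) & (toy["score"] > 0)
filtered = toy[keep]
print(filtered)
```

Using `isin` keeps the condition short if more placeholder strings ever need to be excluded.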

In [None]:
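#We no longer need scores after we have filtered, so let's get rid of them.
#Note: this cell has no execution count, and the outputs further down still include the score column.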
s1_submission.drop(columns=['score'], inplace=True)
s2_submission.drop(columns=['score'], inplace=True)

In [11]:
tot_sub = pd.concat([s1_submission, s2_submission]) #Assemble final dataframe for analysis

In [12]:
tot_sub.shape #Confirm shape of final analysis dataframe

(16298, 4)

In [13]:
tot_sub.head()

| | title | selftext | score | subreddit |
|---|---|---|---|---|
| 0 | Workmans comp question | Hi,\nI really do not know a lot about this. La... | 6 | 0 |
| 1 | Domestic violence and how to approach it? (Texas) | My father has always been physically and verba... | 6 | 0 |
| 4 | Normally at my work, we are not allowed any br... | I'm a minor from the state of New York. I work... | 498 | 0 |
| 5 | FL- Spouse passed with no will, what to do abo... | My husband recently passed away 2 months ago u... | 50 | 0 |
| 7 | evicting roommate | living in Georgia, i have a monthly rental agr... | 2 | 0 |


The final dataframe for submissions (posts) gives us both title and selftext columns, so we can model on either or both.

### Comment Column Cleaning

In [14]:
s1_comment.shape, s2_comment.shape #Seems s1 has some extra data, we drop all the columns except body and score

((11500, 35), (11500, 33))

In [15]:
s1_comment.columns, s2_comment.columns

(Index(['author', 'author_flair_background_color', 'author_flair_css_class',
        'author_flair_richtext', 'author_flair_template_id',
        'author_flair_text', 'author_flair_text_color', 'author_flair_type',
        'author_fullname', 'author_patreon_flair', 'body', 'created_utc',
        'distinguished', 'gildings', 'id', 'link_id', 'no_follow', 'parent_id',
        'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied',
        'subreddit', 'subreddit_id', 'author_cakeday', 'rte_mode', 'can_gild',
        'collapsed', 'collapsed_reason', 'controversiality', 'edited', 'gilded',
        'is_submitter', 'timestamp'],
       dtype='object'),
 Index(['author', 'author_cakeday', 'author_flair_background_color',
        'author_flair_css_class', 'author_flair_richtext',
        'author_flair_template_id', 'author_flair_text',
        'author_flair_text_color', 'author_flair_type', 'author_fullname',
        'author_patreon_flair', 'body', 'created_utc', 'gildings', 'id',
  

Interestingly, even though both scrapes ran the same way, s1 has two extra columns ('distinguished' and 'gilded'). These may be specific to that subreddit. Because they appear in only one dataset, we remove them since we have no basis for comparison.
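
A quick way to see exactly which columns differ is a set difference. This sketch assumes the s1 names are the ones printed above and the s2 names are the ones passed to `drop` in the next cell plus `body` and `score`, which both frames keep.

```python
# Column names copied from this notebook (printed Index for s1, drop list for s2)
s1_cols = {
    'author', 'author_flair_background_color', 'author_flair_css_class',
    'author_flair_richtext', 'author_flair_template_id', 'author_flair_text',
    'author_flair_text_color', 'author_flair_type', 'author_fullname',
    'author_patreon_flair', 'body', 'created_utc', 'distinguished', 'gildings',
    'id', 'link_id', 'no_follow', 'parent_id', 'permalink', 'retrieved_on',
    'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id',
    'author_cakeday', 'rte_mode', 'can_gild', 'collapsed', 'collapsed_reason',
    'controversiality', 'edited', 'gilded', 'is_submitter', 'timestamp',
}
s2_cols = {
    'author', 'author_cakeday', 'author_flair_background_color',
    'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id',
    'author_flair_text', 'author_flair_text_color', 'author_flair_type',
    'author_fullname', 'author_patreon_flair', 'body', 'created_utc', 'gildings',
    'id', 'link_id', 'no_follow', 'parent_id', 'permalink', 'retrieved_on',
    'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'rte_mode',
    'can_gild', 'collapsed', 'collapsed_reason', 'controversiality', 'edited',
    'is_submitter', 'timestamp',
}

only_in_s1 = sorted(s1_cols - s2_cols)
only_in_s2 = sorted(s2_cols - s1_cols)
print(only_in_s1)
print(only_in_s2)
```

On the real DataFrames the same check can be written as `s1_comment.columns.difference(s2_comment.columns)`.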

In [16]:
#Drop columns we will not use.
s1_comment.drop(columns=['author', 'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_patreon_flair', 'created_utc',
       'distinguished', 'gildings', 'id', 'link_id', 'no_follow', 'parent_id',
       'permalink', 'retrieved_on', 'send_replies', 'stickied',
       'subreddit', 'subreddit_id', 'author_cakeday', 'rte_mode', 'can_gild',
       'collapsed', 'collapsed_reason', 'controversiality', 'edited', 'gilded',
       'is_submitter', 'timestamp'], inplace=True)

s2_comment.drop(columns=['author', 'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'created_utc', 'gildings', 'id',
       'link_id', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 
        'send_replies', 'stickied', 'subreddit', 'subreddit_id',
       'rte_mode', 'can_gild', 'collapsed', 'collapsed_reason',
       'controversiality', 'edited', 'is_submitter', 'timestamp'], inplace=True)

In [17]:
s1_comment.shape, s2_comment.shape

((11500, 2), (11500, 2))

In [18]:

#Remove comments whose body is [deleted] or [removed], and keep only scores greater than 0 (1 or above).
#We believe that comments that score higher will typically be more aligned with the beliefs and
#culture of each subreddit
s1_comment = s1_comment[(s1_comment['body'] != '[deleted]') & 
                              (s1_comment['body'] != '[removed]') & 
                              (s1_comment['score'] > 0)]

s2_comment = s2_comment[(s2_comment['body'] != '[deleted]') & 
                              (s2_comment['body'] != '[removed]') & 
                              (s2_comment['score'] > 0)]

In [19]:
s1_comment.shape, s2_comment.shape

((10388, 2), (10550, 2))

We do the same thing we did for submissions, dropping rows with a score of 0 or below. These are comments that people browsing the subreddit voted on less favorably, suggesting they do not align with the community's views.

In [20]:
#We no longer need scores after we have filtered, so let's get rid of them.
s1_comment.drop(columns=['score'], inplace=True)
s2_comment.drop(columns=['score'], inplace=True)

#Add a column labeling which subreddit each comment came from: 0 for s1 and 1 for s2
s1_comment['subreddit'] = 0
s2_comment['subreddit'] = 1

Assign the same class labels as for submissions: 0 for s1 and 1 for s2.

In [21]:
tot_com = pd.concat([s1_comment, s2_comment]) #Assemble final dataframe for analysis

In [22]:
tot_com.shape

(20938, 2)

In [23]:
tot_com.head()

| | body | subreddit |
|---|---|---|
| 0 | Thank you. I suppose if anything were to happe... | 0 |
| 1 | The building previously allowed smoking inside... | 0 |
| 2 | That's true. People get slipped discs all the ... | 0 |
| 3 | Does pulling your product violate that contrac... | 0 |
| 4 | Who is she subleasing from? Are you and your o... | 0 |


In [24]:
tot_com.isnull().sum()

body         0
subreddit    0
dtype: int64

In [25]:
tot_sub.isnull().sum()

title          0
selftext     208
score          0
subreddit      0
dtype: int64

In [26]:
tot_sub.dropna(inplace=True) #Dropped some nulls in submissions. Comments did not have nulls.

In [27]:
tot_sub.subreddit.value_counts(normalize=True) #Proportions are still relatively close, no need to balance classes

1    0.534431
0    0.465569
Name: subreddit, dtype: float64
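
The Executive Summary measures success against a baseline model. A minimal sketch, assuming the baseline is a classifier that always predicts the majority class, so that its accuracy equals the larger class proportion printed above:

```python
# Class proportions copied from the value_counts output above
proportions = {1: 0.534431, 0: 0.465569}

# A majority-class baseline always predicts the most common label
majority_label = max(proportions, key=proportions.get)
baseline_accuracy = proportions[majority_label]

print(majority_label, round(baseline_accuracy, 3))
```

Under that assumption, a model for the submission data needs to beat roughly 0.534 accuracy to be useful.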

## Train-Test-Split

In [28]:
#Train-Test-Split
X = tot_com['body']
y = tot_com['subreddit']
X_train_com, X_test_com, y_train_com, y_test_com = train_test_split(X, y, stratify=y)

In [29]:
X_train_com.shape, X_test_com.shape, y_train_com.shape, y_test_com.shape

((15703,), (5235,), (15703,), (5235,))

In [30]:
#Train-Test-Split Submission Data
X = tot_sub['selftext']
y = tot_sub['subreddit']
X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(X, y, stratify=y)

In [31]:
X_train_sub.shape, X_test_sub.shape, y_train_sub.shape, y_test_sub.shape

((12067,), (4023,), (12067,), (4023,))
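
The shapes above follow from the default `test_size` of 0.25 in `train_test_split`. As a sketch of the arithmetic (assuming the test count is rounded up and the remainder goes to training, which matches the shapes printed in this notebook):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(20938))  # comment rows
print(split_sizes(16090))  # submission rows after dropping nulls (16298 - 208)
```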

# Modeling

## Logistic Regression Pipelines

Tokenizers need to be callable objects to be passed into a Vectorizer, so we define custom classes for this. Classes taken from StackOverflow - https://stackoverflow.com/questions/47423854/sklearn-adding-lemmatizer-to-countvectorizer

In [32]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

In [33]:
class StemTokenizer(object):
    def __init__(self):
        self.wnl = PorterStemmer()
    def __call__(self, articles):
        return [self.wnl.stem(t) for t in word_tokenize(articles)]
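
The only requirement for a custom tokenizer is that it is callable and maps one document string to a list of tokens. As a sketch that avoids the NLTK downloads, here is a hypothetical `AlphaTokenizer` (not part of the original notebook) that lowercases text and keeps alphabetic tokens only, which also drops punctuation tokens:

```python
import re

class AlphaTokenizer:
    """Callable tokenizer: lowercase the text and keep runs of letters only."""
    def __init__(self, min_length=2):
        self.min_length = min_length
        self.pattern = re.compile(r"[a-z]+")

    def __call__(self, document):
        tokens = self.pattern.findall(document.lower())
        # Drop very short fragments such as the "t" left over from "won't"
        return [t for t in tokens if len(t) >= self.min_length]


tokenizer = AlphaTokenizer()
tokens = tokenizer("My landlord won't return the deposit!")
print(tokens)
```

An instance can be passed the same way as the classes above, for example `CountVectorizer(tokenizer=AlphaTokenizer())`.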

In [34]:
pipe_lr = Pipeline([
    ('cv', CountVectorizer(tokenizer=StemTokenizer())), #interchange tokenizer with LemmaTokenizer
    ('lr', LogisticRegression(n_jobs=-1))
])

pipe_params = {
    'cv__stop_words' : ['english'],
    'cv__max_features': [3000],
    'cv__min_df': [2],
    'cv__max_df': [.9],
    'cv__ngram_range': [(1,2)] #Tried (1,3) - (1,2) performed better
#     'lr__solver' : ['newton-cg']
#     'lr__penalty' : ['l2'],
#     'lr__C': [.1, .5]
}

gs_lr = GridSearchCV(pipe_lr, param_grid=pipe_params, cv=3)
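
To show how the `step__parameter` keys in `pipe_params` map onto pipeline steps, here is a small self-contained sketch on a made-up corpus. The documents and labels are invented for illustration, and the default tokenizer is used so the example does not need NLTK.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up mini corpus: 0 = legal-style text, 1 = relationship-style text
docs = [
    "my landlord refuses to return the deposit",
    "can the police search my car without a warrant",
    "the court date for my case was moved",
    "my employer did not pay overtime is that legal",
    "my boyfriend forgot our anniversary again",
    "how do i tell my girlfriend i need space",
    "my husband and i keep arguing about chores",
    "my best friend is dating my ex",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Keys are "<step name>__<parameter name>"
params = {
    "cv__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.1, 1.0],
}

gs = GridSearchCV(pipe, param_grid=params, cv=2)
gs.fit(docs, labels)

prediction = gs.predict(["is it legal for my landlord to enter"])
print(gs.best_params_)
print(prediction)
```

The corpus is far too small for the scores to mean anything; the point is only the shape of the parameter grid and the fit/predict calls.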

In [35]:
gs_lr.fit(X_train_sub, y_train_sub)



GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_a...penalty='l2', random_state=None, solver='warn', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cv__stop_words': ['english'], 'cv__max_features': [3000], 'cv__min_df': [2], 'cv__max_df': [0.9], 'cv__ngram_range': [(1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
gs_lr.best_params_

#Submission Scores
print(gs_lr.score(X_train_sub, y_train_sub))
print(gs_lr.score(X_test_sub, y_test_sub))

#Comment Scores
print(gs_lr.score(X_train_com, y_train_com))
print(gs_lr.score(X_test_com, y_test_com))

0.9985911991381453
0.964951528709918
0.800611348150035
0.7910219675262655


## Naive Bayes Pipeline

In [37]:
pipe_nb = Pipeline([
    ('cv', CountVectorizer(tokenizer=StemTokenizer())),
    ('nb', MultinomialNB())
])

pipe_params = {
    'cv__stop_words' : ['english'],
    'cv__max_features': [3000],
    'cv__min_df': [2],
    'cv__max_df': [.9],
    'cv__ngram_range': [(1,2)] #Tried (1,3) - (1,2) performed better
#     'lr__solver' : ['newton-cg']
#     'lr__penalty' : ['l2'],
#     'lr__C': [.1, .5]
}

gs_nb = GridSearchCV(pipe_nb, param_grid=pipe_params, cv=3)

In [38]:
gs_nb.fit(X_train_sub, y_train_sub)



GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_a...2b0>,
        vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cv__stop_words': ['english'], 'cv__max_features': [3000], 'cv__min_df': [2], 'cv__max_df': [0.9], 'cv__ngram_range': [(1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [39]:
gs_nb.best_params_

{'cv__max_df': 0.9,
 'cv__max_features': 3000,
 'cv__min_df': 2,
 'cv__ngram_range': (1, 2),
 'cv__stop_words': 'english'}

In [40]:
#Comment Scores
print(gs_nb.score(X_train_com, y_train_com))
print(gs_nb.score(X_test_com, y_test_com))

#Submission Scores
print(gs_nb.score(X_train_sub, y_train_sub))
print(gs_nb.score(X_test_sub, y_test_sub))

0.842768897662867
0.8324737344794652
0.9636197895085771
0.960974397216008


In [41]:
gs_lr.best_estimator_

Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=3000, min_df=2,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        st...penalty='l2', random_state=None, solver='warn', tol=0.0001,
          verbose=0, warm_start=False))])

In [42]:
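#Map each feature to its coefficient (newer scikit-learn versions use get_feature_names_out() instead)
x = dict(zip(gs_lr.best_estimator_.named_steps['cv'].get_feature_names(), gs_lr.best_estimator_.named_steps['lr'].coef_[0] ))

#Sort ascending: the most negative coefficients point toward class 0 (r/legaladvice)
sorted_d = sorted(x.items(), key=lambda item: item[1])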



In [43]:
#Assemble Dictionary

In [44]:
sorted_d

[('legal', -2.5410729884401118),
 ('court', -1.663480042030627),
 ('troubl', -1.5423024078322232),
 ('landlord', -1.5314239702482597),
 ('lawyer', -1.4851860340607708),
 ('compani', -1.3453473636150088),
 ('charg', -1.321451231613138),
 ('illeg', -1.301273946388759),
 ('record', -1.2987724480409573),
 ('polic', -1.279128795893262),
 ('report', -1.2631030234430785),
 ('jail', -1.2413613387952447),
 ('california', -1.2251674434908113),
 ('case', -1.2218014891083153),
 ('process', -1.2083630960541671),
 ('insur', -1.1698043976548123),
 ('canada', -1.161317649956929),
 ('file', -1.1460168642305741),
 ('threaten', -1.145375194788915),
 ('texa', -1.144043530662481),
 ('. anyth', -1.1351777477266918),
 ('receiv', -1.0822341475860129),
 ('owner', -1.0792133366175631),
 ('cop', -1.046975706409726),
 ('state', -1.025691248737241),
 ('florida', -1.010546535680994),
 ('violat', -1.003321784431227),
 ('assum', -0.9899349848998493),
 (', thing', -0.9491264526488266),
 ('appli', -0.9318898756216875),
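
Because class 0 is r/legaladvice and class 1 is r/relationship_advice, a negative coefficient pushes a prediction toward r/legaladvice. One common way to read logistic regression coefficients is to exponentiate them into odds ratios. A sketch using three of the coefficients listed above:

```python
import math

# (stemmed token, coefficient) pairs copied from the sorted output above
coefs = [
    ("legal", -2.5410729884401118),
    ("court", -1.663480042030627),
    ("landlord", -1.5314239702482597),
]

odds_ratios = {}
for token, coef in coefs:
    # exp(coef) is the multiplicative change in the odds of class 1
    # for each additional occurrence of the token, other counts held fixed
    odds_ratios[token] = math.exp(coef)
    print(f"{token:10s} coef={coef:+.3f} odds ratio={odds_ratios[token]:.3f}")
```

An odds ratio below 1 means each extra occurrence of the token makes r/relationship_advice less likely under this model.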

In [46]:
gs_lr.score(X_train_com, y_train_com)

gs_lr.score(X_test_com, y_test_com)

0.7910219675262655

In [48]:
pipe_lr3 = Pipeline([
    ('cv', CountVectorizer(tokenizer=LemmaTokenizer())),
    ('lr', LogisticRegression(n_jobs=-1))
])

pipe_params = {
    'cv__stop_words' : ['english'],
    'cv__max_features': [3000],
    'cv__min_df': [2],
    'cv__max_df': [.9],
    'cv__ngram_range': [(1,2)] #Tried (1,3) - (1,2) performed better
#     'lr__solver' : ['newton-cg']
#     'lr__penalty' : ['l2'],
#     'lr__C': [.1, .5]
}

gs_lr3 = GridSearchCV(pipe_lr3, param_grid=pipe_params, cv=3)

In [49]:
gs_lr3.fit(X_train_com, y_train_com)



GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_a...penalty='l2', random_state=None, solver='warn', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cv__stop_words': ['english'], 'cv__max_features': [3000], 'cv__min_df': [2], 'cv__max_df': [0.9], 'cv__ngram_range': [(1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [50]:
gs_lr3.best_estimator_.named_steps['cv'].get_feature_names()

['!',
 '! !',
 "! ''",
 '! ’',
 '#',
 '# wiki_general_rules',
 '# x200b',
 '$',
 '$ $',
 '%',
 '% 2flegaladvice',
 '% 2fr',
 '&',
 '& amp',
 '& gt',
 "'",
 "' .",
 "''",
 "'' 's",
 "'' ,",
 "'' .",
 "'' ?",
 "'' ``",
 "'d",
 "'d like",
 "'d say",
 "'ll",
 "'m",
 "'m going",
 "'m just",
 "'m really",
 "'m saying",
 "'m sorry",
 "'m sure",
 "'m trying",
 "'re",
 "'re doing",
 "'re going",
 "'re just",
 "'re looking",
 "'re right",
 "'s",
 "'s ,",
 "'s .",
 "'s ``",
 "'s best",
 "'s doing",
 "'s going",
 "'s good",
 "'s hard",
 "'s just",
 "'s legal",
 "'s like",
 "'s likely",
 "'s okay",
 "'s possible",
 "'s probably",
 "'s problem",
 "'s really",
 "'s right",
 "'s thing",
 "'s time",
 "'s way",
 "'s worth",
 "'s wrong",
 "'ve",
 "'ve got",
 "'ve seen",
 '(',
 '( )',
 '( ,',
 '( /message/compose/',
 '( coming',
 '( http',
 '( like',
 "( n't",
 '( s',
 '( wa',
 '( ’',
 ')',
 ') &',
 ") 's",
 ') )',
 ') *',
 ') ,',
 ') .',
 ') :',
 ') ^|',
 ') question',
 ') section',
 ') |',
 '*',
 '* *do

In [51]:
print(gs_lr3.score(X_train_com, y_train_com))

print(gs_lr3.score(X_test_com, y_test_com))

0.9172132713494237
0.8464183381088826


In [52]:
## HashingVectorizer performance test

pipe_hv = Pipeline([
    ('hv', HashingVectorizer()),
    ('lr', LogisticRegression())
])

pipe_params_hv = {
    'hv__stop_words' : [None, 'english'],
    'hv__ngram_range' : [(1,2)],
#    'lr__solver' : ['newton-cg'],
    'lr__penalty' : ['l2'],
    'lr__C': [.5, 1]
}

gs_hv = GridSearchCV(pipe_hv, param_grid=pipe_params_hv, cv=3)

In [53]:
##TfidfVectorizer performance test
pipe_tfdf = Pipeline([
    ('tf', TfidfVectorizer(tokenizer=StemTokenizer())),
    ('lr', LogisticRegression())
])

pipe_params_tf = {
    'tf__stop_words' : ['english'],
    'tf__max_features': [3000],
    'tf__min_df': [2],
    'tf__max_df': [.9],
}

gs_tf = GridSearchCV(pipe_tfdf, param_grid=pipe_params_tf, cv=3)

In [54]:
gs_tf.fit(X_train_sub, y_train_sub)



GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tf__stop_words': ['english'], 'tf__max_features': [3000], 'tf__min_df': [2], 'tf__max_df': [0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [55]:
gs_tf.best_estimator_

Pipeline(memory=None,
     steps=[('tf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=3000, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

In [56]:
#Print Comment Results
print(gs_tf.score(X_train_com, y_train_com))
print(gs_tf.score(X_test_com, y_test_com))

#Print Submission Results
print(gs_tf.score(X_train_sub, y_train_sub))
print(gs_tf.score(X_test_sub, y_test_sub))

0.7919505826912056
0.7799426934097421
0.9807740117676307
0.9704200845140443



## Random Forest Classifier

In [57]:
pipe_rf = Pipeline([
    ('cv', CountVectorizer()),
    ('rf', RandomForestClassifier(n_jobs=-1))
])
    
rf_params = {
    'cv__ngram_range' : [(1,2)],
    'rf__n_estimators': [10, 20, 30],
    'rf__max_depth': [None, 1, 2, 3, 4, 5],
    'rf__min_samples_split': [2,3,4]
    
}

gs_rf = GridSearchCV(pipe_rf, param_grid=rf_params, cv=3)


In [58]:
gs_rf.fit(X_train_sub, y_train_sub)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_a..._jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cv__ngram_range': [(1, 2)], 'rf__n_estimators': [10, 20, 30], 'rf__max_depth': [None, 1, 2, 3, 4, 5], 'rf__min_samples_split': [2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [59]:
gs_rf.best_params_

{'cv__ngram_range': (1, 2),
 'rf__max_depth': None,
 'rf__min_samples_split': 2,
 'rf__n_estimators': 30}

In [60]:
#Print Comment Results
print(gs_rf.score(X_train_com, y_train_com))
print(gs_rf.score(X_test_com, y_test_com))

#Print Submission results
print(gs_rf.score(X_train_sub, y_train_sub))
print(gs_rf.score(X_test_sub, y_test_sub))

0.6931159651022097
0.6792741165234002
0.9999171293610674
0.9415858811831966


## XGB Pipeline

Even with already high scores in the other models, we just wanted to test out the capabilities and the tuning of the XGB classifier as it pertains to our classification problem.

In [61]:
pipe_boost = Pipeline([
    ('cv', CountVectorizer()),
    ('xg', XGBClassifier(objective='binary:logistic',seed=42))
])
    
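#Note: the recorded run below used the keys 'xg__n_estimators=' and 'xg__pos_weight'.
#Neither is an XGBClassifier parameter name; they are corrected here to n_estimators and
#scale_pos_weight, so rerunning this cell may give different scores than those recorded.
boost_params = {
    'xg__n_estimators': [1000],
    'xg__learning_rate': [.1],
    'xg__max_depth':[11],
    'xg__min_child_weight': [3],
    'xg__gamma': [0],
    'xg__subsample':[.8],
    'xg__colsample_bytree':[.8],
    'xg__scale_pos_weight':[1],
    'xg__reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}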
gs_boost = GridSearchCV(pipe_boost, param_grid=boost_params,cv=3)
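
Misspelled grid keys, such as the `'xg__n_estimators='` key that appears in the recorded output, are easy to miss. It can help to compare the keys against the names the pipeline actually exposes through `get_params()`. A sketch using a Random Forest pipeline so that it runs without XGBoost; the misspelled key is included on purpose:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("rf", RandomForestClassifier()),
])

candidate_grid = {
    "cv__ngram_range": [(1, 2)],
    "rf__n_estimators": [10, 20, 30],
    "rf__n_estimators=": [1000],  # deliberate typo: trailing "="
}

# get_params() lists every valid "<step>__<parameter>" name
valid_names = pipe.get_params()
unknown_keys = [key for key in candidate_grid if key not in valid_names]
print(unknown_keys)
```

Running a check like this before `GridSearchCV.fit` catches typos regardless of how a given estimator handles unknown parameters.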

In [62]:
gs_boost.fit(X_train_sub, y_train_sub)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_a...0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
       subsample=1))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'xg__n_estimators=': [1000], 'xg__learning_rate': [0.1], 'xg__max_depth': [11], 'xg__min_child_weight': [3], 'xg__gamma': [0], 'xg__subsample': [0.8], 'xg__colsample_bytree': [0.8], 'xg__pos_weight': [1], 'xg__reg_alpha': [1e-05, 0.01, 0.1, 1, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [63]:
gs_boost.best_params_

{'xg__colsample_bytree': 0.8,
 'xg__gamma': 0,
 'xg__learning_rate': 0.1,
 'xg__max_depth': 11,
 'xg__min_child_weight': 3,
 'xg__n_estimators=': 1000,
 'xg__pos_weight': 1,
 'xg__reg_alpha': 1e-05,
 'xg__subsample': 0.8}

In [64]:
print(gs_boost.score(X_train_com,y_train_com))

print(gs_boost.score(X_test_com,y_test_com))

0.7712539005285615
0.7619866284622732


In [65]:
print(gs_boost.score(X_train_sub,y_train_sub))

print(gs_boost.score(X_test_sub,y_test_sub))

0.9938675727189856
0.9684315187670892


## Model Tester

In [1]:
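test_post = ['I am a lawyer'] #A random string simulating a post
gs_boost.predict(test_post) #Use any of the grid-searched pipelines above to test
#efficacy, where 0 is s1 (r/legaladvice) and 1 is s2 (r/relationship_advice).
#The fitting cells must be run first; otherwise this raises the NameError shown below.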

NameError: name 'gs_boost' is not defined

## Conclusions and Recommendations

We gathered roughly 11,500 posts and 11,500 comments from each subreddit. After cleaning, roughly 8,000 posts and 10,000 comments per subreddit remained, so each subreddit was well represented in the final classification model. No balancing of classes was required.

From there, grid searches over hyperparameters led to the following insights. Of the models tested (Logistic Regression, Multinomial Naive Bayes, Random Forest, and XGBoost), Logistic Regression generally performed best, with the highest interpretability and the least tuning.

A simple grid search on the Logistic Regression parameters, while it took several iterations, was much simpler than the fine-tuning that XGBoost required. It was simply much easier to get a good result with Logistic Regression. Models trained on submission selftext scored between about 0.96 and 1.00 on the training set and roughly 0.94 to 0.97 on the test set. Using XGBoost as the last model was therefore largely extraneous: there was little room for improvement, and our results reflect that. Given its notable efficacy within the Machine Learning community, it is not hard to imagine that with even more hyperparameter tuning, XGBoost could beat all the other models mentioned. However, in this case our subreddits were farther apart in terms of language than we initially estimated, and NLP should be able to distinguish them close to 100% of the time. Most of the highest-weighted language we found was very indicative of r/legaladvice.
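
To compare the models side by side, the accuracy scores printed in the Modeling section can be collected in one place. A sketch using those reported numbers for the models trained on submission selftext (the short model labels are ours):

```python
# (train on submissions, test on submissions, test on comments), copied from the cell outputs
scores = {
    "CountVectorizer + LogReg": (0.9985911991381453, 0.964951528709918, 0.7910219675262655),
    "CountVectorizer + MultinomialNB": (0.9636197895085771, 0.960974397216008, 0.8324737344794652),
    "TF-IDF + LogReg": (0.9807740117676307, 0.9704200845140443, 0.7799426934097421),
    "CountVectorizer + RandomForest": (0.9999171293610674, 0.9415858811831966, 0.6792741165234002),
    "CountVectorizer + XGBoost": (0.9938675727189856, 0.9684315187670892, 0.7619866284622732),
}

# Print models from best to worst test accuracy on submissions
for name, (train, test, test_comments) in sorted(
        scores.items(), key=lambda kv: kv[1][1], reverse=True):
    gap = train - test  # a larger gap suggests more overfitting
    print(f"{name:32s} test={test:.3f} gap={gap:.3f} comments={test_comments:.3f}")

best_on_posts = max(scores, key=lambda k: scores[k][1])
largest_gap = max(scores, key=lambda k: scores[k][0] - scores[k][1])
```

The test scores sit within about three points of each other, while the Random Forest shows the largest gap between training and test accuracy.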

One extra insight: models trained on selftext scored well on held-out posts (roughly 0.94 to 0.97), but the same models scored only about 0.68 to 0.83 when asked to identify the subreddit of a comment. Even the Logistic Regression trained directly on comments reached only about 0.85 on held-out comments. This suggests that selftexts are MORE indicative of a subreddit than comments are. Intuitively, we think this is true because there are many more comments than posts (most users comment, but few post), and comments may vary much more than posts, which are more likely to be moderated. We may also theorize that posts are typically submitted by the most dedicated members of the subreddit and, as such, are more representative of the community.


## Sources

- www.reddit.com/r/legaladvice
- www.reddit.com/r/relationship_advice
