# This-or-That Consultants
## Executive Summary

### Problem Statement:
	
	Reddit has suffered an internal attack by a disgruntled ex-employee.  As a parting gift,
	the ex-employee left the company by replacing all subreddit fields with " ¯\_(ツ)_/¯ ".
	As dedicated Data Analysts, we will build a classification model that can sort certain subreddits
	into two hats.  Without the subreddit fields being assigned, the links do not work.  
		With a classification model, Reddit will be able to get started on re-assigning the
	fields of each post, at which point the subreddit links will populate and 
	be functional again.
		How successful our model is, of course, will determine whether or not it saves time.
	To measure the effectiveness of the different models we build, we will measure each model's
	accuracy (correct classifications made divided by all classifications made).
		Before building any models, though, we must: first, retrieve the data; second, clean
	and format the data; and third, vectorize the cleaned text columns (to allow them to be
	included as features in the models).  The subreddits we have selected are the Mario Party
	and the Super Smash Ultimate subreddits, two relatively young, active, and present (the
	two games were very recently released) subreddits.  We will be using two different methods
	of vectorization: CountVectorize and TF-IDF.
		After retrieving, cleaning, and vectorizing the data, we will remove certain columns,
	including the 'target' column, of course, from a 'features' list.  This list, as implied,
	will contain a list of the columns to be included in the model as a feature and
	therefore constitute our X's.  Our 'target' column, which contains a binary value of 0 
	or 1 (0 representing that the post came from the Super Smash Ultimate subreddit and 1 
	representing that the post came from the Mario Party subreddit), will be used as our y.
		Next, we will split the dataset into two: a training set, and a testing set.  By using
	sklearn's train_test_split, we can check the results of our model when run on 'unseen' data,
	giving ourselves a more realistic reading on the model's effectiveness.
		Finally, we will run our data through the modeling process.  We will use four different
	classification models: 1. Logistic Regression; 2. K Nearest Neighbors; 3. Random Forests;
	and 4. Extra Trees.  After running each model, we will also run the model through a gridsearch,
	testing different paramaters for each model to further evaluate how well a model could perform.
		Once all models have ran, we will evaluate their accuracy scorings, find the strongest
	performing model and declare it the model most worth pursuing further.
		While we continue to improve our model, though, our hope is that reddit would
	partner with us and use our model to help sort their subreddits.  After all, the reddit team
	knows their product best and would be able to help us most when trying to improve our model.
	
	
### Conclusions and Recommendations:

	What we found was that our best performing model was Logistic Regression.  We Gridsearched
	it and some of the parameters that resulted were: C = 1.0; penalty = Ridge; tolerance = 0.0001.
		We believe the subreddits may not have had as much overlap as we initially expected; i.e.,
	the subreddits were more different from each other and therefore were easier to classify.
	With this being the case, we decided to handicap the model, in a sense, by removing certain
	words from being vectorized (the words found in the subreddits' titles, like Mario, Party, Super,
	Smash, etc.).  We also removed the post authors from the features; we did not want our model to
	be even partially dependent on this post trait because these two subreddits are relatively young
	and therefore may have concentrated groups of authors, making them a giveaway in this case alone.
	Ideally, our model can work on subreddits that older/have more diverse pools of authors,
	so when building our model, we removed the column entirely from the features.
		After handicapping the model, though, the results were still strong, with our best model
	having a Train Set Accuracy Score of .999 and a Test Set Accuracy Score of .950.  We also had a
	strong performing Extra Trees model, and its number of trees turned out to be 10, implying that
	not many features were needed to build a strong model (at least with this particular dataset).
		Of course, we would love to see how our models work on different pairs of subreddits and
	compare results.  Does our model work well on the MARIOPARTY/SuperSmashUltimate subreddits,
	or does it also work well on other subreddit pairs?  We'd like to find out.
	
	We recommend that reddit use our model on the MARIOPARTY/SuperSmashUltimate subreddits!
	The model seems to work well, at least on this data set, and may work well on others.
		Although our model seems to be very strong, it cannot guarantee 100% accuracy, therefore
	we have a second suggestion: add a "Misplaced Post" button so that users may suggest a post
	as being in the wrong subreddit.  This way, even if a post is misclassified, it can be rectified.

---

# Import Libraries

In [202]:
import requests
import time
import nltk
import pandas as pd
import regex as re
import numpy as np

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

---

### Create function that scrapes reddit's API:

In [120]:
#The function 'reddit_to_csv' will take three arguments: 1. the subreddit being scraped; 2. the filename, or the name
# the csv file will be given; and 3. the number of requests the user would like to make of reddit's API. 

def reddit_to_csv(subreddit, filename, n_requests=1):
    
    #Create an empty list to be used later in function:
    posts = []
    
    #Create User-Agent to avoid 429 res.status_code:
    headers = {'User-Agent': 'Knock Knock 914'}
    
    #Establish that 'after' (a variable used later) is None type:
    after = None
    
    #for loop n_requests iterations (n_requests is established by user):
    for i in range(n_requests):
        #Print i to inform user how far along function is:
        print(i)
        
        #At first, 'after' will be None, as established above, making params, initially, an empty dictionary
        #without any parameters set.  After the first iteration, 'after' will be given a value containing an id
        #tag of the last post pulled in that iteration's request, allowing the function to continue looping
        #through the next set of posts instead of continuously pulling the same 25 posts, for example.
        
        if after == None:
            params = {}
        else:
            params = {'after': after}
            
        #Assign 'url' to reddit's base url, plus whatever subreddit the user provides, plus .json for clean results:
        url = 'https://www.reddit.com/' + str(subreddit) + '/.json'
        
        #Set my res variable equal to the results from requests.get, and the parameters set above like 'url' or 'params':
        res = requests.get(url, params = params, headers = headers)
        
        #Conditional statement to ensure access to the API is approved:
        if res.status_code == 200:
            
            the_json = res.json()
            
            for x in range(len(the_json['data']['children'])):
                
                #Create temporary dictionary to add results of each post to:
                temp_dict = {}
                
                #After looking through the json results, I've selected the below information about the posts
                #as those that can potentially add value to my model's results.
                temp_dict['subreddit'] = the_json['data']['children'][x]['data']['subreddit']
                temp_dict['title'] = the_json['data']['children'][x]['data']['title']
                temp_dict['post_paragraph'] = the_json['data']['children'][x]['data']['selftext']
                temp_dict['clicked'] = the_json['data']['children'][x]['data']['clicked']
                temp_dict['ups'] = the_json['data']['children'][x]['data']['ups']
                temp_dict['downs'] = the_json['data']['children'][x]['data']['downs']
                temp_dict['likes'] = the_json['data']['children'][x]['data']['likes']
                temp_dict['category'] = the_json['data']['children'][x]['data']['category']
                temp_dict['number_of_comments'] = the_json['data']['children'][x]['data']['num_comments']
                temp_dict['score'] = the_json['data']['children'][x]['data']['score']
                temp_dict['author_flair_css_class'] = the_json['data']['children'][x]['data']['author_flair_css_class']
                temp_dict['subreddit_type'] = the_json['data']['children'][x]['data']['subreddit_type']
                
                #Add the temporary dictionary to 'posts',the list of each post's dictionary of information:
                posts.append(temp_dict)
                
            after = the_json['data']['after']
            
        else:
            print(res.status_code)
            break
            
        #Enter a delay of one second in the requests to reddit's API for good internet citizenship:    
        time.sleep(1)
    
    #Turn the list of post dictionaries into a pandas DataFrame:
    posts_df = pd.DataFrame(posts)
    
    #Drop any duplicate rows that may have been pulled:
    posts_df.drop_duplicates(inplace = True)
    
    #Rearrange the columns into a more logical order:
    posts_df = posts_df[['subreddit', 'title', 'clicked', 'ups', 'downs', 'post_paragraph', 'likes', 'number_of_comments', 'category', 'score', 'author_flair_css_class', 'subreddit_type']]
    
    #Save the DataFrame as a .csv file:
    posts_df.to_csv(str(filename), index = False, sep = ",")


In [None]:
#reddit_to_csv(subreddit = 'r/MARIOPARTY',
              #n_requests = 150,
              #filename = 'mario_party_reddit_posts.csv')

In [None]:
#reddit_to_csv(subreddit = 'r/SmashBrosUltimate',
              #n_requests = 150,
              #filename = 'super_smash_reddit_posts.csv')

---

### Merge two dataframes

In [253]:
mario_party_df = pd.read_csv('./mario_party_reddit_posts.csv')

In [254]:
mario_party_df.head(3)

Unnamed: 0,subreddit,title,clicked,ups,downs,post_paragraph,likes,number_of_comments,category,score,author_flair_css_class,subreddit_type
0,MARIOPARTY,Super Mario Party: OUT NOW! Click here to fres...,False,63,0,,,4,,63,wario,public
1,MARIOPARTY,Looking for a group to play Super Mario Party'...,False,40,0,,,4,,40,wario,public
2,MARIOPARTY,Little did I know the weight the last minigame...,False,21,0,,,6,,21,noflair,public


In [255]:
mario_party_df.shape

(2052, 12)

In [256]:
super_smash_df = pd.read_csv('./super_smash_reddit_posts.csv')

In [257]:
super_smash_df.head(3)

Unnamed: 0,subreddit,title,clicked,ups,downs,post_paragraph,likes,number_of_comments,category,score,author_flair_css_class,subreddit_type
0,SmashBrosUltimate,Super Smash Bros Ultimate Release Day Thread -...,False,295,0,Welcome everyone! It's the day we've all been ...,,366,,295,21marth,public
1,SmashBrosUltimate,Mod Apps are now open! Apply if you're interes...,False,19,0,,,11,,19,36diddykong,public
2,SmashBrosUltimate,Saw the guy in front of me wearing this,False,802,0,,,10,,802,,public


In [258]:
super_smash_df.shape

(1698, 12)

In [259]:
df = mario_party_df.append(super_smash_df, ignore_index=True)

---

# CSV CHECKPOINT

---

## Confirm dataframes were stacked correctly:

In [260]:
df.head(3)

Unnamed: 0,subreddit,title,clicked,ups,downs,post_paragraph,likes,number_of_comments,category,score,author_flair_css_class,subreddit_type
0,MARIOPARTY,Super Mario Party: OUT NOW! Click here to fres...,False,63,0,,,4,,63,wario,public
1,MARIOPARTY,Looking for a group to play Super Mario Party'...,False,40,0,,,4,,40,wario,public
2,MARIOPARTY,Little did I know the weight the last minigame...,False,21,0,,,6,,21,noflair,public


In [261]:
df.shape

(3750, 12)

In [None]:
2052+1698

---

## Create a 'target' column (will equal 1 if the post's subreddit is Mario Party, and 0 if the post's subreddit is Smash Bros. Ultimate):

In [262]:
df['target'] = np.where(df['subreddit'] == 'MARIOPARTY', 1, 0)

---

## Look for columns that don't have any values and can be dropped:

In [None]:
#df['likes'].isnull().sum()

In [None]:
#df['category'].isnull().sum()

### I noticed that column 'clicked' is not empty, but the column values are purely False, therefore I will drop 'clicked' as well.  The same for columns 'downs' and 'subreddit_type' which are purely 0's and 'public', respectively.

In [None]:
#df['clicked'].value_counts()

In [None]:
#df['downs'].value_counts()

In [None]:
#df['subreddit_type'].value_counts()

In [263]:
df_drop_list = ['likes', 'category', 'clicked', 'downs', 'subreddit_type']

In [264]:
df.drop(df_drop_list, axis=1, inplace=True)

In [None]:
#df.shape

In [265]:
df.to_csv('master_df.csv', index=False, sep=",")

---

## Set my tokenizer:

In [266]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

# Lemmatize

In [267]:
lemmatizer = WordNetLemmatizer()

In [268]:
#df = pd.read_csv('./master_df.csv')

In [269]:
df.columns

Index(['subreddit', 'title', 'ups', 'post_paragraph', 'number_of_comments',
       'score', 'author_flair_css_class', 'target'],
      dtype='object')

## Create function that takes a column containing text and returns the lemmatized version in a new 'cleaned' column:

In [270]:
def column_cleaner(column, df=df):
    #For some reason, I was running into errors trying to run this code until I added the code
    #below (df[column+'_clean'] = ""), establishing from the beginning that the new column to be created
    #exists in the dataframe and contains nothing but empty strings.
    
    df[column+'_clean'] = ""
    
    #for loop through each row in the column:
    for i in range(len(df[column])):
        
        #Tokenize, or separate, each word in column's string into its own string (prep for lemmatization):
        col_tok = []
        col_tok.extend(tokenizer.tokenize(df[column][i].lower()))
        col_token = []
        [col_token.append(s) for s in col_tok if s not in col_token]
        
        #Lemmatize the words (cut the word to its base/root, for improved model results):
        col_lem = []
        for x in range(len(col_token)):
            col_lem.append(lemmatizer.lemmatize(col_token[x]))
        
        #Remove characters and numbers (for improved model results, hopefully):
        letters_only_col = []
        for c in range(len(col_lem)):
            letters_only_col.append(re.sub("[^a-zA-Z]", "", col_lem[c]))
        
        #Remove stopwords (for improved model results):
        col_words = [w for w in letters_only_col if not w in stopwords.words('english')]
        
        #Remove 'cheat' words (words that are in the subreddit's name and also in the column)
        cheat_words = ['mario', 'party', 'marioparty', 'smash', 'bros', 'ultimate', 'smashbrosultimate', 'super']
        col_words = [w for w in letters_only_col if not w in cheat_words]
        
        #Ensure that there are no 'None' objects in title_words:
        col_words = list(filter(None, col_words))

        #Join the lemmatized words - stopwords back to one long string (prep for
        #vectorization, done outside/after this function):
        col_words = " ".join(col_words)

        #Fill new column with 'cleaned' string from column:
        df[column+'_clean'][i] = col_words

In [271]:
column_cleaner(column='title', df=df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [272]:
df['post_paragraph'].head()

0                                                  NaN
1                                                  NaN
2                                                  NaN
3                                                  NaN
4    First, the objective is the same on all game b...
Name: post_paragraph, dtype: object

In [273]:
df['post_paragraph'] = df['post_paragraph'].replace(np.nan, "")

In [274]:
column_cleaner(column='post_paragraph', df=df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


---

# Save version of DataFrame:

In [275]:
df.to_csv('master_df_cleaned.csv', index=False, sep=",")

---

In [131]:
df.head()

Unnamed: 0,subreddit,title,ups,post_paragraph,number_of_comments,score,author_flair_css_class,target,title_clean,post_paragraph_clean
0,MARIOPARTY,Super Mario Party: OUT NOW! Click here to fres...,63,,4,63,wario,1,out now click here to freshen up on what this ...,
1,MARIOPARTY,Looking for a group to play Super Mario Party'...,40,,4,40,wario,1,looking for a group to play s online mode with...,
2,MARIOPARTY,Little did I know the weight the last minigame...,21,,6,21,noflair,1,little did i know the weight last minigame car...,
3,MARIOPARTY,Is...is Rosalina ok?,1,,0,1,noflair,1,is is rosalina ok,
4,MARIOPARTY,Super Mario Party: Initial Impressions,1,"First, the objective is the same on all game b...",1,1,noflair,1,initial impression,first the objective is same on all game board ...


---

# Create CountVectorize Function:

---

In [138]:
def count_vec_column(column, func_df=df):
    #Instantiate CountVectorizer:
    vect = CountVectorizer()
    
    #Create temporary variable X_text that takes on the fit/transformed results of the column:
    X_text = vect.fit_transform(func_df[column])
    
    #Turn X_text into an array (prep to easily make a DataFrame):
    X_text = X_text.toarray()
    
    #Create a temporary DataFrame with each word/word-pair/word-group as the columns:
    temp_df = pd.DataFrame(X_text,
                           columns=vect.get_feature_names())
    
    #Add the original column name to the beginning of the new columns' names to differentiate from which column
    # the vectorized words came from (this may impact the strength of the model):
    for i in range(len(temp_df.columns)):
        #print(i)
        temp_df.rename(columns={temp_df.columns[i]: column + '_' + temp_df.columns[i]}, inplace=True)
    
    #Combine the two DataFrames:
    func_df = pd.concat([func_df, temp_df], axis=1, join_axes=[func_df.index])
    return func_df

---

# Create TF-IDF Function:

---

In [53]:
def tfidf_column(column, func_df=df):
    #Instantiate TfidfVectorizer:
    tfidf_vect = TfidfVectorizer()
    
    #Create temporary variable X_text that takes on the fit/transformed results of the column:
    X_text = tfidf_vect.fit_transform(func_df[column])
    
    #Turn X_text into an array (prep to easily make a DataFrame):
    X_text = X_text.toarray()
    
    #Create a temporary DataFrame with each word/word-pair/word-group as the columns:
    temp_df = pd.DataFrame(X_text,
                           columns=tfidf_vect.get_feature_names())
    
    #Add the original column name to the beginning of the new columns' names to differentiate from which column
    # the tf-idf vectorized words came from (this may impact the strength of the model):
    for i in range(len(temp_df.columns)):
        #print(i)
        temp_df.rename(columns={temp_df.columns[i]: column + '_' + temp_df.columns[i]}, inplace=True)
    
    #Combine the two DataFrames:
    func_df = pd.concat([func_df, temp_df], axis=1, join_axes=[func_df.index])
    return func_df

---

# CountVectorize

---

In [4]:
#df = count_vec_column(func_df=df, column='title_clean')

In [67]:
#df['post_paragraph'] = df['post_paragraph'].replace(np.nan, "")

In [3]:
#df = count_vec_column(func_df=df, column='post_paragraph_clean')

---

# CHECKPOINT

In [143]:
len(df.columns)

9304

In [145]:
df.to_csv('master_df_cleaned_vected.csv', index=False, sep=",")

---

## Feature Engineering:

---

### Drop the 'author_flair_css_class' from the features list (I will 'get_dummies' on this column later on to see if it affects the strength of my models):

In [146]:
df.columns

Index(['subreddit', 'title', 'ups', 'post_paragraph', 'number_of_comments',
       'score', 'author_flair_css_class', 'target', 'title_clean',
       'post_paragraph_clean',
       ...
       'post_paragraph_clean_zelda', 'post_paragraph_clean_zero',
       'post_paragraph_clean_zio', 'post_paragraph_clean_ziodyne',
       'post_paragraph_clean_zionga', 'post_paragraph_clean_zombie',
       'post_paragraph_clean_zone', 'post_paragraph_clean_zoom',
       'post_paragraph_clean_zoomed', 'post_paragraph_clean_zr'],
      dtype='object', length=9304)

In [147]:
features = list(df.columns)

In [148]:
features[:10]

['subreddit',
 'title',
 'ups',
 'post_paragraph',
 'number_of_comments',
 'score',
 'author_flair_css_class',
 'target',
 'title_clean',
 'post_paragraph_clean']

In [149]:
del_list = ['subreddit', 'target', 'title', 'title_clean', 'post_paragraph', 'post_paragraph_clean', 'author_flair_css_class']

In [150]:
features = [i for i in features if i not in del_list]

---

# Set X and y:

In [152]:
X = df[features]
y = df['target']

---

# Train-test-split:

In [153]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

---

# Modeling

---

## Logistic Regression

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "CountVectorized"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.



---

In [155]:
logreg = LogisticRegression()

In [156]:
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [157]:
cross_val_score(logreg, X_train, y_train).mean()



0.8993661817456399

In [158]:
logreg.score(X_train, y_train)

0.9960881934566145

In [159]:
logreg.score(X_test, y_test)

0.94136460554371

In [None]:
logreg.coef_

---

## Gridsearch on Logistic Regression

---

In [178]:
logreg = LogisticRegression()

In [179]:
my_params = {
    'penalty': ['l1', 'l2'],
    'C': [0.5, 1.0, 25],
    
}

In [180]:
grid = GridSearchCV(logreg, param_grid=my_params, cv=5)

In [181]:
grid.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.5, 1.0, 25]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [182]:
grid.score(X_train, y_train)

0.9985775248933144

In [183]:
grid.score(X_test, y_test)

0.9498933901918977

---

# K Nearest Neighbors

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "CountVectorized"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.


---

In [162]:
knn = KNeighborsClassifier()

In [163]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [164]:
cross_val_score(knn, X_train, y_train).mean()



0.6642978126595259

In [165]:
knn.score(X_train, y_train)

0.8424608819345661

In [166]:
knn.score(X_test, y_test)

0.685501066098081

---

## Gridsearch on K Nearest Neighbors

---

In [184]:
knn = KNeighborsClassifier()

In [185]:
my_params = {
    'n_neighbors': [5, 10, 25],
    'weights': ['uniform', 'distance'],
    
}

In [186]:
grid = GridSearchCV(knn, param_grid=my_params, cv=5)

In [187]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': [5, 10, 25], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [189]:
grid.score(X_test, y_test)

0.7899786780383795

---

# Random Forests:

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "CountVectorized"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.


---

In [167]:
rf = RandomForestClassifier()

In [168]:
rf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [169]:
cross_val_score(rf, X_train, y_train).mean()



0.8862017098529309

In [170]:
rf.score(X_train, y_train)

0.9975106685633002

In [171]:
rf.score(X_test, y_test)

0.9381663113006397

---

## Gridsearch on Random Forests

---

In [190]:
rf = RandomForestClassifier()

In [191]:
my_params = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [18, 20, 25],
    'max_depth': [4, 10, 20],
    'max_features': ['auto', 1.0, 2, 3]
    
}

In [192]:
grid = GridSearchCV(rf, param_grid = my_params, cv = 5)

In [193]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'n_estimators': [18, 20, 25], 'max_depth': [4, 10, 20], 'max_features': ['auto', 1.0, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [194]:
grid.score(X_train, y_train)

0.9096728307254623

In [195]:
grid.score(X_test, y_test)

0.826226012793177


---

# Extra Trees

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "CountVectorized"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.


---

In [172]:
et = ExtraTreesClassifier()

In [173]:
et.fit(X_train, y_train)



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [174]:
cross_val_score(et, X_train, y_train).mean()



0.8897595419760475

In [176]:
et.score(X_test, y_test)

0.9477611940298507

---

## Gridsearch on Extra Trees

---

In [196]:
et = ExtraTreesClassifier()

In [197]:
my_params = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [18, 20, 25],
    'max_depth': [4, 10, 20],
    'max_features': ['auto', 1.0, 2, 3]
    
}

In [198]:
grid = GridSearchCV(et, param_grid = my_params, cv = 5)

In [199]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'n_estimators': [18, 20, 25], 'max_depth': [4, 10, 20], 'max_features': ['auto', 1.0, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [200]:
grid.score(X_train, y_train)

0.903271692745377

In [201]:
grid.score(X_test, y_test)

0.835820895522388

---

# CHECKPOINT

---

In [226]:
df = df[['subreddit', 'title', 'ups', 'post_paragraph', 'number_of_comments',
       'score', 'author_flair_css_class', 'target', 'title_clean',
       'post_paragraph_clean']]

---

# TF-IDF

---

In [1]:
#df = tfidf_column(func_df=df, column='title_clean')

In [278]:
#df['post_paragraph'] = df['post_paragraph'].replace(np.nan, "")

In [2]:
#df = tfidf_column(func_df=df, column='post_paragraph_clean')

---

# CHECKPOINT

In [280]:
len(df.columns)

9304

In [281]:
df.to_csv('master_df_cleaned_tfidf.csv', index=False, sep=",")

---

## Feature Engineering:

---

### Drop the 'author_flair_css_class' from the features list (I will 'get_dummies' on this column later on to see if it affects the strength of my models):

In [304]:
df.columns

Index(['subreddit', 'title', 'ups', 'post_paragraph', 'number_of_comments',
       'score', 'author_flair_css_class', 'target', 'title_clean',
       'post_paragraph_clean',
       ...
       'zelda.1', 'zero.1', 'zio', 'ziodyne', 'zionga', 'zombie', 'zone',
       'zoom', 'zoomed', 'zr'],
      dtype='object', length=9304)

In [306]:
features = list(df.columns)

In [307]:
features[:10]

['subreddit',
 'title',
 'ups',
 'post_paragraph',
 'number_of_comments',
 'score',
 'author_flair_css_class',
 'target',
 'title_clean',
 'post_paragraph_clean']

In [308]:
del_list = ['subreddit', 'target', 'title', 'title_clean', 'post_paragraph', 'post_paragraph_clean', 'author_flair_css_class']

In [309]:
features = [i for i in features if i not in del_list]

---

# Set X and y:

In [313]:
X = df[features]
y = df['target']

---

# Train-test-split:

In [316]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

---

# Modeling

---

## Logistic Regression

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "TF-IDF'ed"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.



---

In [317]:
logreg = LogisticRegression()

In [318]:
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [319]:
cross_val_score(logreg, X_train, y_train).mean()



0.8950923079373676

In [320]:
logreg.score(X_train, y_train)

0.9822190611664295

In [321]:
logreg.score(X_test, y_test)

0.9189765458422174

In [329]:
logreg.coef_

array([[-3.04268735e-03,  6.06753110e-05, -3.04268735e-03, ...,
        -1.96434998e-01,  0.00000000e+00,  9.75458709e-02]])

---

## Gridsearch on Logistic Regression

---

In [332]:
logreg = LogisticRegression()

In [333]:
my_params = {
    'penalty': ['l1', 'l2'],
    'C': [0.5, 1.0, 25],
    
}

In [334]:
grid = GridSearchCV(logreg, param_grid=my_params, cv=5)

In [335]:
grid.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.5, 1.0, 25]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [336]:
grid.score(X_train, y_train)

0.9985775248933144

In [343]:
grid.score(X_test, y_test)

0.9445628997867804

---

# K Nearest Neighbors

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "TF-IDF'ed"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.


---

In [344]:
knn = KNeighborsClassifier()

In [345]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [346]:
cross_val_score(knn, X_train, y_train).mean()



0.6582505220505189

In [347]:
knn.score(X_train, y_train)

0.8040540540540541

In [348]:
knn.score(X_test, y_test)

0.6673773987206824

---

## Gridsearch on K Nearest Neighbors

---

In [349]:
knn = KNeighborsClassifier()

In [350]:
my_params = {
    'n_neighbors': [5, 10, 25],
    'weights': ['uniform', 'distance'],
    
}

In [351]:
grid = GridSearchCV(knn, param_grid=my_params, cv=5)

In [352]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': [5, 10, 25], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [353]:
grid.score(X_test, y_test)

0.6823027718550106

---

# Random Forests:

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "TF-IDF'ed"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.


---

In [354]:
rf = RandomForestClassifier()

In [355]:
rf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [356]:
cross_val_score(rf, X_train, y_train).mean()



0.8758869928448928

In [357]:
rf.score(X_train, y_train)

0.9971550497866287

In [358]:
rf.score(X_test, y_test)

0.9275053304904051

---

## Gridsearch on Random Forests

---

In [359]:
rf = RandomForestClassifier()

In [360]:
my_params = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [18, 20, 25],
    'max_depth': [4, 10, 20],
    'max_features': ['auto', 1.0, 2, 3]
    
}

In [361]:
grid = GridSearchCV(rf, param_grid = my_params, cv = 5)

In [362]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'n_estimators': [18, 20, 25], 'max_depth': [4, 10, 20], 'max_features': ['auto', 1.0, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [363]:
grid.score(X_train, y_train)

0.9267425320056899

In [364]:
grid.score(X_test, y_test)

0.8272921108742004


---

# Extra Trees

#### To note:

* Authors of the posts are not included in the Features/Inputs
* The 'title' and 'post_paragraph' columns were cleaned and "TF-IDF'ed"
* Words contained in the subreddit title names (like Mario, Party, Super, etc.) have been removed to ensure that 
the model is not depending on these for accuracy; we do not want to assume that future titles will always be so descriptive.


---

In [365]:
et = ExtraTreesClassifier()

In [366]:
et.fit(X_train, y_train)



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [367]:
cross_val_score(et, X_train, y_train).mean()



0.8947384589478283

In [368]:
et.score(X_test, y_test)

0.9307036247334755

---

## Gridsearch on Extra Trees

---

In [369]:
et = ExtraTreesClassifier()

In [370]:
my_params = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [18, 20, 25],
    'max_depth': [4, 10, 20],
    'max_features': ['auto', 1.0, 2, 3]
    
}

In [371]:
grid = GridSearchCV(et, param_grid = my_params, cv = 5)

In [372]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'n_estimators': [18, 20, 25], 'max_depth': [4, 10, 20], 'max_features': ['auto', 1.0, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [373]:
grid.score(X_train, y_train)

0.9274537695590327

In [374]:
grid.score(X_test, y_test)

0.835820895522388

---