# Other Reddits

In this notebook, I look at which characteristics of Reddit posts most accurately predict the subreddit a post came from when the two subreddits (investing and stocks) cover the same topic.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import time

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, r2_score, accuracy_score, roc_auc_score, roc_curve, auc
from sklearn.pipeline import Pipeline

import regex as re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from bs4 import BeautifulSoup 

%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

## Load Data

In [3]:
df1 = pd.read_csv('./investing.csv', index_col=0)
df2 = pd.read_csv('./stocks.csv', index_col=0)

In [4]:
#check 
df1.head()

| | title | post | author | subreddit |
|---|---|---|---|---|
| 0 | It's moronic Monday, the Wednesday edition, yo... | We encourage all our visitors to ask those inv... | AutoModerator | investing |
| 1 | Musk Doubles Down On Cave Diver Attack, Calls ... | article: https://www.cnbc.com/2018/09/05/tesla... | LIFO_CAN_FIFO_ITSELF | investing |
| 2 | Mercedes Unveils First Tesla Rival in $12 Bill... | Full article: [https://www.bloomberg.com/news/... | Throwawayacct449393 | investing |
| 3 | Norway's $1 trillion sovereign wealth fund is ... | https://www.cnbc.com/2018/09/05/norways-1-tril... | NineteenEighty9 | investing |
| 4 | Fidelity's New Zero Fee Funds Experience $1 Bi... | News article: https://www.cnbc.com/2018/09/04/... | notafeg | investing |


In [5]:
df2.tail()

| | title | post | author | subreddit |
|---|---|---|---|---|
| 791 | LGORF - what do you guys think? | Was listening to a Bloomberg show, it was an h... | HariOfTrantor | stocks |
| 792 | the stars group | r/https://thestockboys.com/2018/08/13/time-to-... | matttttt123 | stocks |
| 793 | Opinions on FLWS | 1800 flowers was an early adopter for e-commer... | DeliciouslyUnaware | stocks |
| 794 | SUNRUN to RUN UP? | Since their earnings report where SUNRUN disap... | Scarecroll | stocks |
| 795 | Is SPOT overvalued? | I feel like it’s going to be very difficult fo... | 84935 | stocks |


In [6]:
#print to see how many rows
print(df1.shape)
print(df2.shape)
print(df1.shape[0] + df2.shape[0])

(968, 4)
(796, 4)
1764


In [7]:
#concat the two df
all_df = [df1, df2]
df = pd.concat(all_df).reset_index(drop=True)

In [8]:
#confirm final shape
df.shape

(1764, 4)

## Preprocessing

In [9]:
#change target column to binary so investing will be 1 and stocks will be 0
df['subreddit'] = [1 if i == 'investing' else 0 for i in df['subreddit']]

In [10]:
#check to make sure it works
# df.head()
# df.tail()

In [11]:
#Check what our titles look like
# df['title'][5]

In [12]:
#Check what our posts look like
# df['post'][0]

In [13]:
# Function to convert a raw post to a string of words
def raw_to_words(raw):
    
    #remove URL
    link = re.sub(r'http\S+', '', raw)
    
    # Remove HTML
    text = BeautifulSoup(link, 'html.parser').get_text()
    
    # Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    
    # Convert to lower case, split into individual words
    words = letters_only.lower().split()
    
    # convert the stop words to a set
    stops = set(stopwords.words('english'))
    
    # Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    
    # Join the words back into one string separated by space and return the result.
    return (" ".join(meaningful_words))

In [14]:
#check first post to see if it worked
raw_to_words(df['post'][0])

'encourage visitors ask investing related questions always afraid ask members r investing answer educate note question anything similar single answer question also need lot information give sort answer old employed making income much objectives money buy house retirement savings risk tolerance mind risking blackjack need know safe current holdings already exposure specific funds sectors assets house paid cars expensive girlfriend really asset time horizon need money next month next yrs big debts relevant financial information useful give proper answer aware answers opinions redditors used starting point research strongly consider seeing registered financial rep making financial decisions'
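
To make the individual cleaning stages easier to see, here is a minimal sketch that runs the same regex steps on a made-up post and keeps each intermediate result. It is an illustration only: `clean_steps`, `TOY_STOPS`, and the sample text are hypothetical, it uses the standard-library `re` rather than the `regex` package imported above, it replaces NLTK's English stop-word list with a tiny hand-written set, and it skips the BeautifulSoup HTML step.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
TOY_STOPS = {"the", "a", "is", "to", "at", "on", "out", "this"}

def clean_steps(raw, stops=TOY_STOPS):
    """Return the intermediate result of each cleaning stage."""
    # 1. drop anything that starts with "http" up to the next whitespace
    no_urls = re.sub(r"http\S+", "", raw)
    # 2. replace every non-letter character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_urls)
    # 3. lowercase and split on whitespace
    words = letters_only.lower().split()
    # 4. remove stop words and join back into one string
    kept = [w for w in words if w not in stops]
    return {
        "no_urls": no_urls,
        "letters_only": letters_only,
        "words": words,
        "cleaned": " ".join(kept),
    }

steps = clean_steps("Check this out: https://example.com/a?b=1 TSLA is up 5% today!")
print(steps["cleaned"])
```

One side effect worth keeping in mind: because step 2 removes digits and symbols, prices and percentages disappear from the text entirely.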

In [15]:
#drop null values
df.dropna(inplace=True)
df.shape

(1714, 4)

In [16]:
#apply data cleaning function to columns
df['title'] = df['title'].map(raw_to_words)
df['post'] = df['post'].map(raw_to_words)
df['author'] = df['author'].map(raw_to_words)

In [17]:
# look at baseline accuracy: the majority class is investing (1) at about 53.6%
# we would want a model to score at least higher than this.
df['subreddit'].value_counts(normalize=True)

1    0.536173
0    0.463827
Name: subreddit, dtype: float64
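
The baseline is simply the accuracy of always guessing the majority class. The sketch below reproduces that number in plain Python. The class counts (919 investing, 795 stocks) are an assumption: I back-calculated them from the proportions printed above and the 1,714 rows left after dropping nulls, so treat them as illustrative rather than read from the data.

```python
from collections import Counter

# Assumed class counts, back-calculated from the printed proportions (see note above).
labels = [1] * 919 + [0] * 795

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 6))
```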

## Build Models

### Title Column

I am going to start with the title column and see how well the title alone predicts the subreddit (investing vs. stocks).

#### Count Vectorizer

In [18]:
#set X and y to run train test split for title
X_title = df.title
y = df.subreddit

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X_title , y , test_size = .33, random_state = 42)

In [20]:
#n-gram = 1
cvec = CountVectorizer(stop_words='english', max_features = 5000, ngram_range=(1,1)).fit(X_train)
df_train = pd.DataFrame(cvec.transform(X_train).todense(), columns=cvec.get_feature_names())
df_test = pd.DataFrame(cvec.transform(X_test).todense(), columns=cvec.get_feature_names())

In [21]:
#n-gram = 2
cvec = CountVectorizer(stop_words='english', max_features = 5000, ngram_range=(1,2)).fit(X_train)
df_train_2 = pd.DataFrame(cvec.transform(X_train).todense(), columns=cvec.get_feature_names())
df_test_2 = pd.DataFrame(cvec.transform(X_test).todense(), columns=cvec.get_feature_names())
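
As a rough picture of what `ngram_range` changes, the sketch below builds word n-grams by hand. `word_ngrams` is a hypothetical helper, not part of scikit-learn, and it only splits on whitespace; `CountVectorizer` additionally lowercases, applies its own token pattern, and removes stop words, so its actual vocabulary can differ.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """List every word n-gram whose length falls inside ngram_range (inclusive)."""
    tokens = text.split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

unigrams = word_ngrams("index fund fees", (1, 1))
uni_and_bi = word_ngrams("index fund fees", (1, 2))
print(unigrams)
print(uni_and_bi)
```

With `(1, 2)` the unigrams are kept and the bigrams are added on top, which is why the feature space grows quickly and `max_features` starts to matter.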

#### TfidfVectorizer

In [22]:
#n-gram = 1
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1,1)).fit(X_train)
train_title = pd.DataFrame(tvec.transform(X_train).todense(), columns = tvec.get_feature_names())
test_title = pd.DataFrame(tvec.transform(X_test).todense(), columns = tvec.get_feature_names())

In [23]:
#n-gram = 2
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1,2)).fit(X_train)
train_title_2 = pd.DataFrame(tvec.transform(X_train).todense(), columns = tvec.get_feature_names())
test_title_2 = pd.DataFrame(tvec.transform(X_test).todense(), columns = tvec.get_feature_names())
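
For intuition about what the TF-IDF weights represent, here is a small pure-Python sketch of the weighting as I understand scikit-learn's defaults: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. `tfidf_rows` and the toy documents are hypothetical, and tokenisation is a plain whitespace split rather than the vectorizer's own.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # smoothed idf, assumed to match TfidfVectorizer's default settings
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(["index fund", "index stock", "stock pick stock"])
print(rows[0])
```

In the first toy document, "fund" appears in only one document while "index" appears in two, so "fund" ends up with the larger weight.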

#### Create a dataframe to store my accuracy scores along the way

In [24]:
#set columns and index for the df
columns = ['cvec_score_1', 'cvec_score_2', 'tvec_score_1', 'tvec_score_2']
index = ['lr_title', 'knn_title', 'randomforest_title', 'multinomial_title',
         'lr_post', 'knn_post', 'randomforest_post', 'multinomial_post',
         'lr_title_post', 'knn_title_post', 'randomforest_title_post', 'multinomial_title_post']
score_df = pd.DataFrame(index=index,columns=columns)

In [25]:
# score_df

#### Logistic Regression


In [97]:
#n-gram = 1, CVEC
lr = LogisticRegression()
lr.fit(df_train, y_train)

y_pred = lr.predict(df_test)
score_df['cvec_score_1']['lr_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(df_train, y_train))

accuracy score 0.6254416961130742
train score 0.9259581881533101


In [27]:
#n-gram = 2, CVEC
lr = LogisticRegression()
lr.fit(df_train_2, y_train)

y_pred = lr.predict(df_test_2)
score_df['cvec_score_2']['lr_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(df_train_2, y_train))

accuracy score 0.6183745583038869
train score 0.9729965156794426


In [28]:
#n-gram = 1, TVEC
lr = LogisticRegression()
lr.fit(train_title, y_train)

y_pred = lr.predict(test_title)
score_df['tvec_score_1']['lr_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(train_title, y_train))

accuracy score 0.6254416961130742
train score 0.877177700348432


In [29]:
#n-gram = 2, TVEC
lr = LogisticRegression()
lr.fit(train_title_2, y_train)

y_pred = lr.predict(test_title_2)
score_df['tvec_score_2']['lr_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(train_title_2, y_train))

accuracy score 0.627208480565371
train score 0.9468641114982579
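
The vectorize-then-fit pattern above repeats for every model and n-gram setting. One way to cut that repetition, sketched here on made-up data, is the `Pipeline` class that is already imported at the top of the notebook. The toy titles and labels are hypothetical and far too small to say anything about accuracy; the point is only the shape of the code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = investing, 0 = stocks
toy_train = ["index fund fees", "etf bond allocation",
             "tsla earnings call", "is aapl overvalued"]
toy_labels = [1, 1, 0, 0]
toy_test = ["bond fund", "aapl earnings"]

# The vectorizer is fit only on the training text inside the pipeline.
pipe = Pipeline([
    ("tvec", TfidfVectorizer(stop_words="english", ngram_range=(1, 1))),
    ("lr", LogisticRegression()),
])
pipe.fit(toy_train, toy_labels)
toy_preds = pipe.predict(toy_test)
print(toy_preds)
```

A pipeline like this can also be passed to `GridSearchCV`, so vectorizer settings such as `ngram_range` are tuned together with the model instead of being rebuilt by hand.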


#### KNN


In [30]:
#n-gram = 1, CVEC
params = {
    'n_neighbors': (1, 3, 5),
    'metric': ['minkowski', 'euclidean'],
    'weights': ['uniform', 'distance']    
}
gs = GridSearchCV(KNeighborsClassifier(), params)
gs.fit(df_train, y_train)

y_pred = gs.predict(df_test)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', gs.score(df_train, y_train))

accuracy score 0.5848056537102474
train score 0.9912891986062717
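
A note on the grid above: `KNeighborsClassifier` uses `p=2` by default, and the Minkowski distance with `p=2` is the Euclidean distance, so the two `metric` options describe the same model and the grid effectively has 6 distinct settings rather than 12. The sketch below checks the equivalence by hand on two made-up vectors; both helper functions are hypothetical and not scikit-learn code.

```python
import math

def minkowski(u, v, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def euclidean(u, v):
    """Ordinary Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [1.0, 0.0, 2.0], [0.0, 3.0, 2.0]
same = math.isclose(minkowski(u, v), euclidean(u, v))
print(same)
```

To make the metric search meaningful, one option would be to add `'p': [1, 2]` to the grid or to compare against `'manhattan'` or `'cosine'` instead.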


In [31]:
#store the test accuracy from the grid search above instead of hard-coding it
score_df['cvec_score_1']['knn_title'] = accuracy_score(y_test, y_pred)

In [32]:
#n-gram = 2, CVEC
params = {
    'n_neighbors': (1, 3, 5),
    'metric': ['minkowski', 'euclidean'],
    'weights': ['uniform', 'distance']    
}
gs = GridSearchCV(KNeighborsClassifier(), params)
gs.fit(df_train_2, y_train)

y_pred = gs.predict(df_test_2)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', gs.score(df_train_2, y_train))

accuracy score 0.5742049469964664
train score 0.9886759581881533


In [33]:
#store the test accuracy from the grid search above instead of hard-coding it
score_df['cvec_score_2']['knn_title'] = accuracy_score(y_test, y_pred)

In [34]:
#n-gram = 1, TVEC
params = {
    'n_neighbors': (1, 3, 5),
    'metric': ['minkowski', 'euclidean'],
    'weights': ['uniform', 'distance']    
}
gs = GridSearchCV(KNeighborsClassifier(), params)
gs.fit(train_title, y_train)

y_pred = gs.predict(test_title)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', gs.score(train_title, y_train))

accuracy score 0.598939929328622
train score 0.9912891986062717


In [35]:
#store the test accuracy from the grid search above instead of hard-coding it
score_df['tvec_score_1']['knn_title'] = accuracy_score(y_test, y_pred)

In [36]:
#n-gram = 2, TVEC
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_title_2, y_train)

y_pred = knn.predict(test_title_2)
score_df['tvec_score_2']['knn_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', knn.score(train_title_2, y_train))

accuracy score 0.588339222614841
train score 0.7560975609756098


#### Random Forest


In [113]:
#n-gram = 1, CVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(df_train, y_train)

score_df['cvec_score_1']['randomforest_title'] = rf.score(df_test, y_test)
print(rf.score(df_train, y_train))
print(rf.score(df_test, y_test))
rf.best_params_

0.7229965156794426
0.6095406360424028


{'class_weight': 'balanced', 'max_depth': 6, 'n_estimators': 50}

In [112]:
#n-gram = 2, CVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(df_train_2, y_train)

score_df['cvec_score_2']['randomforest_title'] = rf.score(df_test_2, y_test)
print(rf.score(df_train_2, y_train))
print(rf.score(df_test_2, y_test))
rf.best_params_

0.7151567944250871
0.6007067137809188


{'class_weight': 'balanced', 'max_depth': 6, 'n_estimators': 55}

In [111]:
#n-gram = 1, TVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(train_title, y_train)

score_df['tvec_score_1']['randomforest_title'] = rf.score(test_title, y_test)
print(rf.score(train_title, y_train))
print(rf.score(test_title, y_test))

0.7264808362369338
0.5848056537102474


In [114]:
#n-gram = 2, TVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(train_title_2, y_train)

score_df['tvec_score_2']['randomforest_title'] = rf.score(test_title_2, y_test)
print(rf.score(train_title_2, y_train))
print(rf.score(test_title_2, y_test))

0.730836236933798
0.6166077738515902


#### MultinomialNB


In [41]:
#n-gram = 1, CVEC
mnb = MultinomialNB()
mnb.fit(df_train, y_train)

y_pred = mnb.predict(df_test)
score_df['cvec_score_1']['multinomial_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(df_train, y_train))

accuracy score 0.6148409893992933
train score 0.89198606271777


In [42]:
#n-gram = 2, CVEC
mnb = MultinomialNB()
mnb.fit(df_train_2, y_train)

y_pred = mnb.predict(df_test_2)
score_df['cvec_score_2']['multinomial_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(df_train_2, y_train))

accuracy score 0.6130742049469965
train score 0.9425087108013938


In [43]:
#n-gram = 1, TVEC
mnb = MultinomialNB()
mnb.fit(train_title, y_train)

y_pred = mnb.predict(test_title)
score_df['tvec_score_1']['multinomial_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(train_title, y_train))

accuracy score 0.6219081272084805
train score 0.9102787456445993


In [44]:
#n-gram = 2, TVEC
mnb = MultinomialNB()
mnb.fit(train_title_2, y_train)

y_pred = mnb.predict(test_title_2)
score_df['tvec_score_2']['multinomial_title'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(train_title_2, y_train))

accuracy score 0.6219081272084805
train score 0.9695121951219512
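
Every title model scores far better on the training data than on the test data. As a quick way to compare how much each one overfits, the sketch below computes the train-minus-test gap for the unigram CountVectorizer runs, using the scores printed above rounded to four decimals. The dictionary and variable names are my own, chosen for illustration.

```python
# (train accuracy, test accuracy) for title + CountVectorizer, n-gram = 1,
# copied from the printed outputs above and rounded to four decimals
title_cvec_unigram = {
    "logistic regression": (0.9260, 0.6254),
    "knn (grid search)": (0.9913, 0.5848),
    "random forest": (0.7230, 0.6095),
    "multinomial nb": (0.8920, 0.6148),
}

gaps = {
    name: round(train - test, 4)
    for name, (train, test) in title_cvec_unigram.items()
}
smallest_gap = min(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: {gap}")
```

The depth-limited random forest has the smallest gap here, but its test accuracy is not the highest, so a small gap on its own does not make it the better model.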


### Post Column
Next, I am going to take a look at just the post texts.

#### Count Vectorizer

In [45]:
#set x and y for train test split
X_post = df.post
y = df.subreddit

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X_post, y, test_size = 0.33, random_state=42)

In [47]:
#n-gram = 1
cvec = CountVectorizer(stop_words='english', max_features = 5000, ngram_range=(1,1)).fit(X_train)
cvec_train_post = pd.DataFrame(cvec.transform(X_train).todense(), columns=cvec.get_feature_names())
cvec_test_post = pd.DataFrame(cvec.transform(X_test).todense(), columns=cvec.get_feature_names())

In [48]:
#n-gram = 2
cvec_2 = CountVectorizer(stop_words='english', max_features = 5000, ngram_range=(1,2)).fit(X_train)
cvec_train_post_2 = pd.DataFrame(cvec_2.transform(X_train).todense(), columns=cvec_2.get_feature_names())
cvec_test_post_2 = pd.DataFrame(cvec_2.transform(X_test).todense(), columns=cvec_2.get_feature_names())

#### TfidfVectorizer

In [49]:
#n-gram = 1
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1,1)).fit(X_train)
tvec_train_post = pd.DataFrame(tvec.transform(X_train).todense(), columns = tvec.get_feature_names())
tvec_test_post = pd.DataFrame(tvec.transform(X_test).todense(), columns = tvec.get_feature_names())

In [50]:
#n-gram = 2
tvec_2 = TfidfVectorizer(stop_words='english', ngram_range=(1,2)).fit(X_train)
tvec_train_post_2 = pd.DataFrame(tvec_2.transform(X_train).todense(), columns = tvec_2.get_feature_names())
tvec_test_post_2 = pd.DataFrame(tvec_2.transform(X_test).todense(), columns = tvec_2.get_feature_names())

#### Logistic Regression


In [51]:
#n-gram = 1, CVEC
lr = LogisticRegression()
lr.fit(cvec_train_post, y_train)

y_pred = lr.predict(cvec_test_post)
score_df['cvec_score_1']['lr_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(cvec_train_post, y_train))

accuracy score 0.6254416961130742
train score 0.9799651567944251


In [52]:
#n-gram = 2, CVEC
lr = LogisticRegression()
lr.fit(cvec_train_post_2, y_train)

y_pred = lr.predict(cvec_test_post_2)
score_df['cvec_score_2']['lr_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(cvec_train_post_2, y_train))

accuracy score 0.6201413427561837
train score 0.9860627177700348


In [53]:
#n-gram = 1, TVEC
lr = LogisticRegression()
lr.fit(tvec_train_post, y_train)

y_pred = lr.predict(tvec_test_post)
score_df['tvec_score_1']['lr_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(tvec_train_post, y_train))

accuracy score 0.6342756183745583
train score 0.8998257839721254


In [54]:
#n-gram = 2, TVEC
lr = LogisticRegression()
lr.fit(tvec_train_post_2, y_train)

y_pred = lr.predict(tvec_test_post_2)
score_df['tvec_score_2']['lr_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(tvec_train_post_2, y_train))

accuracy score 0.6484098939929329
train score 0.9817073170731707


#### KNN


In [55]:
#n-gram = 1, CVEC
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(cvec_train_post, y_train)

y_pred = knn.predict(cvec_test_post)
score_df['cvec_score_1']['knn_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', knn.score(cvec_train_post, y_train))

accuracy score 0.5335689045936396
train score 0.6332752613240418


In [56]:
#n-gram = 2, CVEC
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(cvec_train_post_2, y_train)

y_pred = knn.predict(cvec_test_post_2)
score_df['cvec_score_2']['knn_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', knn.score(cvec_train_post_2, y_train))

accuracy score 0.5300353356890459
train score 0.6332752613240418


In [57]:
#n-gram = 1, TVEC
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(tvec_train_post, y_train)

y_pred = knn.predict(tvec_test_post)
score_df['tvec_score_1']['knn_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', knn.score(tvec_train_post, y_train))

accuracy score 0.5547703180212014
train score 0.5574912891986062


#### Random Forest


In [116]:
#n-gram = 1, CVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(cvec_train_post, y_train)

score_df['cvec_score_1']['randomforest_post'] = rf.score(cvec_test_post, y_test)
print(rf.score(cvec_train_post, y_train))
print(rf.score(cvec_test_post, y_test))

0.7682926829268293
0.6201413427561837


In [117]:
#n-gram = 2, CVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(cvec_train_post_2, y_train)

score_df['cvec_score_2']['randomforest_post'] = rf.score(cvec_test_post_2, y_test)
print(rf.score(cvec_train_post_2, y_train))
print(rf.score(cvec_test_post_2, y_test))

0.789198606271777
0.6325088339222615


In [118]:
#n-gram = 1, TVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(tvec_train_post, y_train)

score_df['tvec_score_1']['randomforest_post'] = rf.score(tvec_test_post, y_test)
print(rf.score(tvec_train_post, y_train))
print(rf.score(tvec_test_post, y_test))

0.8275261324041812
0.6431095406360424


In [119]:
#n-gram = 2, TVEC
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(tvec_train_post_2, y_train)

score_df['tvec_score_2']['randomforest_post'] = rf.score(tvec_test_post_2, y_test)
print(rf.score(tvec_train_post_2, y_train))
print(rf.score(tvec_test_post_2, y_test))

0.7761324041811847
0.6113074204946997


#### Multinomial NB


In [62]:
#n-gram = 1, CVEC
mnb = MultinomialNB()
mnb.fit(cvec_train_post, y_train)

y_pred = mnb.predict(cvec_test_post)
score_df['cvec_score_1']['multinomial_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(cvec_train_post, y_train))

accuracy score 0.6607773851590106
train score 0.8623693379790941


In [63]:
#n-gram = 2, CVEC
mnb = MultinomialNB()
mnb.fit(cvec_train_post_2, y_train)

y_pred = mnb.predict(cvec_test_post_2)
score_df['cvec_score_2']['multinomial_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(cvec_train_post_2, y_train))

accuracy score 0.6431095406360424
train score 0.8527874564459931


In [64]:
#n-gram = 1, TVEC
mnb = MultinomialNB()
mnb.fit(tvec_train_post, y_train)

y_pred = mnb.predict(tvec_test_post)
score_df['tvec_score_1']['multinomial_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(tvec_train_post, y_train))

accuracy score 0.6607773851590106
train score 0.9250871080139372


In [65]:
#n-gram = 2, TVEC
mnb = MultinomialNB()
mnb.fit(tvec_train_post_2, y_train)

y_pred = mnb.predict(tvec_test_post_2)
score_df['tvec_score_2']['multinomial_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(tvec_train_post_2, y_train))

accuracy score 0.6554770318021201
train score 0.990418118466899


### Title and Post
Lastly, let's take a look at the title and post texts together.

In [66]:
#concat cvec of title and post for train and test
cvec_title_post_train = pd.concat([cvec_train_post, df_train], axis = 1)
cvec_title_post_test = pd.concat([cvec_test_post, df_test], axis = 1)

#concat tvec of title and post for train and test
tvec_title_post_train = pd.concat([tvec_train_post, train_title], axis = 1)
tvec_title_post_test = pd.concat([tvec_test_post, test_title], axis = 1)
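
One thing to watch with the concatenation above: a word that appears in both the title vocabulary and the post vocabulary produces two columns with the same name, which makes it hard to tell later whether a coefficient belongs to the title feature or the post feature. A possible workaround, sketched on hypothetical toy frames, is to prefix the column names before concatenating.

```python
import pandas as pd

# Hypothetical toy feature frames; "fund" is in both vocabularies
toy_title = pd.DataFrame({"fund": [1, 0], "tesla": [0, 1]})
toy_post = pd.DataFrame({"fund": [2, 0], "earnings": [0, 3]})

# Plain concatenation keeps both "fund" columns under the same name
naive = pd.concat([toy_post, toy_title], axis=1)
has_duplicates = naive.columns.duplicated().any()

# Prefixing keeps track of where each feature came from
prefixed = pd.concat(
    [toy_post.add_prefix("post_"), toy_title.add_prefix("title_")],
    axis=1,
)
print(has_duplicates)
print(list(prefixed.columns))
```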

In [67]:
# I didn't run n-gram = 2 for the combined title and post features
# because in the title-only and post-only runs, adding bigrams did not consistently improve the test score

#### Logistic Regression


In [68]:
#cvec
lr = LogisticRegression()
lr.fit(cvec_title_post_train, y_train)

y_pred = lr.predict(cvec_title_post_test)
score_df['cvec_score_1']['lr_title_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(cvec_title_post_train, y_train))

accuracy score 0.627208480565371
train score 0.9947735191637631


In [69]:
#tvec
lr = LogisticRegression()
lr.fit(tvec_title_post_train, y_train)

y_pred = lr.predict(tvec_title_post_test)
score_df['tvec_score_1']['lr_title_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', lr.score(tvec_title_post_train, y_train))

accuracy score 0.6554770318021201
train score 0.955574912891986


#### KNN


In [70]:
#cvec
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(cvec_title_post_train, y_train)

y_pred = knn.predict(cvec_title_post_test)
score_df['cvec_score_1']['knn_title_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', knn.score(cvec_title_post_train, y_train))

accuracy score 0.5282685512367491
train score 0.740418118466899


In [71]:
#tvec
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(tvec_title_post_train, y_train)

y_pred = knn.predict(tvec_title_post_test)
score_df['tvec_score_1']['knn_title_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', knn.score(tvec_title_post_train, y_train))

accuracy score 0.5865724381625441
train score 0.6045296167247387


#### Random Forest


In [121]:
#cvec
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(cvec_title_post_train, y_train)

score_df['cvec_score_1']['randomforest_title_post'] = rf.score(cvec_title_post_test, y_test)
print(rf.score(cvec_title_post_train, y_train))
print(rf.score(cvec_title_post_test, y_test))

0.7987804878048781
0.6325088339222615


In [120]:
#tvec
rf_param={
    'class_weight': ['balanced'],
    'n_estimators': (45, 50, 55),
    'max_depth': [5, 6]
}
rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param)
rf.fit(tvec_title_post_train, y_train)

score_df['tvec_score_1']['randomforest_title_post'] = rf.score(tvec_title_post_test, y_test)
print(rf.score(tvec_title_post_train, y_train))
print(rf.score(tvec_title_post_test, y_test))

0.8170731707317073
0.6749116607773852


#### Multinomial NB


In [74]:
#cvec
mnb = MultinomialNB()
mnb.fit(cvec_title_post_train, y_train)

y_pred = mnb.predict(cvec_title_post_test)
score_df['cvec_score_1']['multinomial_title_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(cvec_title_post_train, y_train))

accuracy score 0.6713780918727915
train score 0.9146341463414634


In [75]:
#tvec
mnb = MultinomialNB()
mnb.fit(tvec_title_post_train, y_train)

y_pred = mnb.predict(tvec_title_post_test)
score_df['tvec_score_1']['multinomial_title_post'] = accuracy_score(y_test, y_pred)
print('accuracy score', accuracy_score(y_test, y_pred))
print('train score', mnb.score(tvec_title_post_train, y_train))

accuracy score 0.6819787985865724
train score 0.9451219512195121


## Analyze

### Score Table
Let's look at them all together.

In [122]:
score_df

| | cvec_score_1 | cvec_score_2 | tvec_score_1 | tvec_score_2 |
|---|---|---|---|---|
| lr_title | 0.625442 | 0.618375 | 0.625442 | 0.627208 |
| knn_title | 0.584806 | 0.574205 | 0.59894 | 0.588339 |
| randomforest_title | 0.609541 | 0.600707 | 0.584806 | 0.616608 |
| multinomial_title | 0.614841 | 0.613074 | 0.621908 | 0.621908 |
| lr_post | 0.625442 | 0.620141 | 0.634276 | 0.64841 |
| knn_post | 0.533569 | 0.530035 | 0.55477 | NaN |
| randomforest_post | 0.620141 | 0.632509 | 0.64311 | 0.611307 |
| multinomial_post | 0.660777 | 0.64311 | 0.660777 | 0.655477 |
| lr_title_post | 0.627208 | NaN | 0.655477 | NaN |
| knn_title_post | 0.528269 | NaN | 0.586572 | NaN |

In [123]:
#going to take a look only at n-gram = 1, so drop the n-gram = 2 columns
drop = ['cvec_score_2', 'tvec_score_2']
score_df.drop(columns=drop)

| | cvec_score_1 | tvec_score_1 |
|---|---|---|
| lr_title | 0.625442 | 0.625442 |
| knn_title | 0.584806 | 0.59894 |
| randomforest_title | 0.609541 | 0.584806 |
| multinomial_title | 0.614841 | 0.621908 |
| lr_post | 0.625442 | 0.634276 |
| knn_post | 0.533569 | 0.55477 |
| randomforest_post | 0.620141 | 0.64311 |
| multinomial_post | 0.660777 | 0.660777 |
| lr_title_post | 0.627208 | 0.655477 |
| knn_title_post | 0.528269 | 0.586572 |


### Coefficients for best Logistic Regression

In [78]:
#refit the best logistic regression (TF-IDF, title + post) to inspect its coefficients
lr = LogisticRegression()
lr.fit(tvec_title_post_train, y_train)
y_pred = lr.predict(tvec_title_post_test)
lr.coef_

array([[-0.04794514, -0.01684372, -0.2113701 , ...,  0.08351664,
        -0.40051824, -0.30824188]])

In [79]:
#establish an array for my features so that i can match it with my coef_
features = np.array(list(tvec_title_post_train)).reshape(-1,1)

#to round my coef numbers and shape it to be same as my features
coeffs = np.reshape(np.round(lr.coef_,5),(-1,1))
# concat the two arrays together 
coeffs = np.concatenate((features,coeffs),axis=1)

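#put it back into a data frame to sort the values
#note: Coeff holds strings at this point (np.concatenate cast the floats to str),
#so this sort is lexicographic and a value such as '3e-05' lands above '1.55938'
coef_df = pd.DataFrame(coeffs,columns=['Features','Coeff'])
coef_df = coef_df.sort_values('Coeff', ascending=False)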

In [80]:
#check type
coef_df.dtypes

Features    object
Coeff       object
dtype: object

In [81]:
#convert coeff to float
coef_df['Coeff'] = coef_df['Coeff'].astype('float')

In [82]:
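#exponentiate each coefficient; exp(coef) is an odds ratio (the coefficient itself is the log-odds),
#so the column name 'log_odds' is a misnomer, kept here to match the output below
coef_df['log_odds'] = np.exp(coef_df['Coeff'])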

In [89]:
#check our coef df
coef_df.head(15)

| | Features | Coeff | log_odds |
|---|---|---|---|
| 3254 | joined | 3e-05 | 1.00003 |
| 7871 | investing | 1.55938 | 4.755872 |
| 2463 | fund | 1.17216 | 3.22896 |
| 7206 | china | 0.99917 | 2.716027 |
| 8955 | value | 0.9449 | 2.572556 |
| 7825 | index | 0.93907 | 2.557602 |
| 348 | article | 0.88087 | 2.412998 |
| 3882 | money | 0.87708 | 2.40387 |
| 8805 | tech | 0.81911 | 2.26848 |
| 7409 | dis | 0.81584 | 2.261074 |
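
To read the table: a logistic regression coefficient is a change in log-odds, and exponentiating it gives an odds ratio. The short sketch below redoes that arithmetic for the "investing" row using only the standard library; the variable names are mine.

```python
import math

# Coefficient for the feature "investing", copied from the table above
coef_investing = 1.55938

# exp(coefficient) = multiplicative change in the odds of class 1 (investing)
# for a one-unit increase in the feature, holding the other features fixed
odds_ratio = math.exp(coef_investing)
print(round(odds_ratio, 4))
```

Since TF-IDF values lie between 0 and 1, a full one-unit increase is a large change, so the ratio is best read as a relative ranking of features rather than a typical effect size.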


### Classification Metrics

In [84]:
#build a confusion matrix
cm = confusion_matrix(y_test, y_pred)

In [85]:
#set it into a dataframe for aesthetics
cm_df = pd.DataFrame(cm, columns=['predicted stocks', 'predicted investing'], 
                     index=['actual stocks', 'actual investing'])
cm_df

| | predicted stocks | predicted investing |
|---|---|---|
| actual stocks | 142 | 123 |
| actual investing | 72 | 229 |


In [86]:
# set classifications for each cell
tn, fp, fn, tp = cm.ravel()

In [87]:
accuracy = (tp + tn) / (tn + fp + fn + tp)
misclassification = 1 - accuracy
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print('Accuracy:', accuracy)
print('Misclassification:', misclassification)
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)

Accuracy: 0.6554770318021201
Misclassification: 0.34452296819787986
Sensitivity: 0.760797342192691
Specificity: 0.5358490566037736
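
The same four cells also give precision and F1, which the cell above does not print. This is a small sketch using the counts from the confusion matrix, with investing as the positive class; it is plain arithmetic and does not call scikit-learn.

```python
# Cells of the confusion matrix shown above (positive class = investing)
tn, fp, fn, tp = 142, 123, 72, 229

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)          # same as sensitivity
precision = tp / (tp + fp)       # share of "investing" predictions that are right
f1 = 2 * precision * recall / (precision + recall)

print('Precision:', precision)
print('Recall:', recall)
print('F1:', f1)
```

The low specificity shows up here as well: 123 stocks posts are predicted as investing, which pulls precision down relative to recall.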
