# This notebook has the different models I tried

In [8]:
import nltk
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk import word_tokenize          
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # public home of the built-in English stop-word list
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rosedennis/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
#Import the data
df = pd.read_csv('../Data/final.csv')

## Exploratory Analysis

Make sure we have 2,000 rows, 1,000 for each subreddit.

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df['subreddit'].value_counts()

As desired, we have balanced classes of our target variable.

In [None]:
df.isnull().sum()

This tells us that 'selftext' has 825 null values, a sizable share of the rows. However, we won't worry about that now because 'selftext' is the body text of a post, and the focus here is on titles only.

In [None]:
df['author'].unique()

This is surprising. It suggests that no author appears in both subreddits, and also that no author appears more than once within the same subreddit. I would have expected at least one author to post multiple times in the same subreddit.
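The author observation can be checked directly rather than read off the printed array. Below is a minimal sketch on a made-up frame: the rows are hypothetical, and only the 'author' and 'subreddit' column names come from the data above.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data lives in df.
toy = pd.DataFrame({
    'author': ['ann', 'bob', 'ann', 'cat', 'bob'],
    'subreddit': ['Showerthoughts', 'Showerthoughts', 'StonerPhilosophy',
                  'StonerPhilosophy', 'Showerthoughts'],
})

# Distinct authors versus total posts: equal numbers would mean no repeats at all.
n_authors = toy['author'].nunique()
n_posts = len(toy)

# Authors who posted more than once in the same subreddit.
same_sub_mask = toy.duplicated(subset=['author', 'subreddit'], keep=False)
repeat_posters = toy.loc[same_sub_mask, 'author'].unique()

# Authors who appear in more than one subreddit.
subs_per_author = toy.groupby('author')['subreddit'].nunique()
cross_posters = subs_per_author[subs_per_author > 1].index.tolist()
```

On the real data, `n_authors == n_posts` together with an empty `cross_posters` would confirm the reading above.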

In [None]:
df.info()

### I want to create a column that counts how many words there are in the title. 

My initial belief is that shower thought posts will be shorter than stoner philosophy posts. The following code will see if that's the case.

In [None]:
temp = df.iloc[4,1]
temp

In [None]:
len(temp.split())

In [None]:
df['word_count'] = [len(x.split()) for x in df['title']]

In [None]:
df.head()

In [None]:
df.groupby('subreddit')['word_count'].mean()

This is interesting because it's the opposite of my assumption. It seems that, on average, shower thoughts are longer than stoner thoughts.
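A mean can be pulled around by a handful of very long titles, so it may be worth looking at the median next to it. A minimal sketch on made-up titles follows; the rows are hypothetical, and only the column names match the data above.

```python
import pandas as pd

# Hypothetical titles for illustration only.
toy = pd.DataFrame({
    'title': ['Water is wet', 'Why do we dream at night',
              'Time is flat', 'Clouds are sky foam'],
    'subreddit': ['Showerthoughts', 'Showerthoughts',
                  'StonerPhilosophy', 'StonerPhilosophy'],
})

# Vectorised equivalent of the list comprehension used for word_count.
toy['word_count'] = toy['title'].str.split().str.len()

# Mean and median side by side for each subreddit.
summary = toy.groupby('subreddit')['word_count'].agg(['mean', 'median'])
```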

### I want to find the most common word for each subreddit

In [None]:
counter = Counter(temp.split())

In [None]:
counter.most_common()

In [None]:
#create a function that will convert every title into a list of words so we can call 'most common'
def pop_word(column):
    words = []
    for i in column:
        i = i.split()
        words.extend(i)
    return words

In [None]:
#one flat list holding every word from every title (pop_word uses extend, so nothing is nested)
word = pop_word(df['title'])

In [None]:
counter1 = Counter(word)

In [None]:
#These are the most common words in all of the titles, including both subreddits.
counter1.most_common()

My opinion is that the 'interesting' top words are: I, we, your, people. I think this says that we, as people, are mainly interested in things that relate to us versus different topics. This is to be expected. To complete my exploratory analysis, I'm going to examine the most popular words of each subreddit.

In [None]:
temp1 = pd.read_csv('../Data/shower_final.csv')

In [None]:
temp1_a = pop_word(temp1['title'])

In [None]:
counter2 = Counter(temp1_a)

In [None]:
counter2.most_common()

Common words in 'shower thoughts': you, I, What (the capital W suggests these posts often open with that word, probably as a question), we, If (note the capital I), your
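The idea that a capitalised 'What' or 'If' marks the start of a post can be tested by counting only the first word of each title. A small sketch with hypothetical titles (on the real data, the list would be temp1['title']):

```python
from collections import Counter

# Hypothetical titles for illustration only.
titles = [
    'What if clouds are just sky foam',
    'If you think about it, socks are foot gloves',
    'What do fish think about water',
    'I wonder what Mondays feel like to a cat',
]

# Count only the opening word of each title, skipping empty titles.
first_words = Counter(t.split()[0] for t in titles if t.split())
top_opener = first_words.most_common(1)
```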

In [None]:
temp2 = pd.read_csv('../Data/stoner_final.csv')

In [None]:
temp2_a = pop_word(temp2['title'])

In [None]:
counter3 = Counter(temp2_a)

In [None]:
counter3.most_common()

Common words in 'stoner thoughts': you, I, we, What, your, The. I might be reading too much into this, but I would guess that shower thoughts has more posts phrased as questions than stoner thoughts.
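The guess about questions can be measured instead of inferred from word counts. One rough proxy, sketched here on hypothetical rows, is the share of titles that end in a question mark; this is an assumption about what counts as a question and will miss questions written without one.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'title': ['Why is the sky blue?', 'Socks are foot gloves',
              'Is time real?', 'Music is audible math',
              'Trees are slow fireworks', 'Clouds are sky foam'],
    'subreddit': ['Showerthoughts', 'Showerthoughts',
                  'StonerPhilosophy', 'StonerPhilosophy',
                  'StonerPhilosophy', 'StonerPhilosophy'],
})

# True where the title, ignoring trailing whitespace, ends in '?'.
toy['is_question'] = toy['title'].str.strip().str.endswith('?')

# The mean of a boolean column is the fraction of True values.
question_share = toy.groupby('subreddit')['is_question'].mean()
```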

## Modeling

In [3]:
#make correct y variable
df['subreddit'] = df['subreddit'].map({'Showerthoughts': 0, 'StonerPhilosophy': 1})
y = df['subreddit']
y

0       0
1       0
2       0
3       0
4       0
       ..
1995    1
1996    1
1997    1
1998    1
1999    1
Name: subreddit, Length: 2000, dtype: int64

In [None]:
y.isnull().sum()

In [4]:
#make X (needs to be a series, not DataFrame)
X = df['title']

In [None]:
#Train/test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
#baseline accuracy
y.value_counts(normalize = True)

### Logistic Regression

In [None]:
cvec = CountVectorizer()
cvec.fit(X_train)
X_train = cvec.transform(X_train)

In [None]:
# Convert X_train into a DataFrame.

X_train_df = pd.DataFrame(X_train.toarray(),
                          columns=cvec.get_feature_names())
X_train_df

In [None]:
# Transform test
X_test = cvec.transform(X_test)
X_test_df = pd.DataFrame(X_test.toarray(),
                         columns=cvec.get_feature_names())

X_test_df.head()

In [None]:
l1 = LogisticRegression()
l1.fit(X_train_df, y_train)


In [None]:
l1.score(X_train_df, y_train)

In [None]:
l1.score(X_test_df, y_test)

In [None]:
l1.coef_
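The raw coefficient array is hard to read on its own. Pairing each coefficient with its token shows which words push a prediction toward each class. A minimal sketch on a made-up corpus follows; the texts are hypothetical, and the 0/1 labels only mirror the encoding used above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-corpus for illustration only.
texts = ['shower water soap', 'soap shower towel',
         'cosmos universe mind', 'mind universe stars']
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X_counts = vec.fit_transform(texts)
model = LogisticRegression().fit(X_counts, labels)

# vocabulary_ maps token -> column index, so sorting by index recovers the
# column order without relying on a version-specific feature-name method.
tokens = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Negative values lean toward class 0, positive values toward class 1.
coefs = pd.Series(model.coef_[0], index=tokens).sort_values()
```

On the fitted l1 model, the same pattern would use cvec.vocabulary_ and l1.coef_[0].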

#### With stop words removed

In [None]:
X2_train, X2_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
cvec2 = CountVectorizer(stop_words = 'english')
cvec2.fit(X2_train)
X2_train = cvec2.transform(X2_train)

In [None]:
X2_train_df = pd.DataFrame(X2_train.toarray(),
                          columns=cvec2.get_feature_names())

In [None]:
# Transform test
X2_test = cvec2.transform(X2_test)
X2_test_df = pd.DataFrame(X2_test.toarray(),
                         columns=cvec2.get_feature_names())


In [None]:
l2 = LogisticRegression()
l2.fit(X2_train_df, y_train)

In [None]:
print(l2.score(X2_train_df, y_train))
l2.score(X2_test_df, y_test)

This is worse than our previous model, which kept stop words.

#### With lemmatizing

In [None]:
X3_train, X3_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [5]:
class LemmaTokenizer(object): #https://stackoverflow.com/questions/47423854/sklearn-adding-lemmatizer-to-countvectorizer
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

In [None]:
cvec3 = CountVectorizer(tokenizer=LemmaTokenizer(),
                      stop_words = 'english',
                      lowercase = False)
cvec3.fit(X3_train)
X3_train = cvec3.transform(X3_train)

In [None]:
X3_train_df = pd.DataFrame(X3_train.toarray(),
                          columns=cvec3.get_feature_names())
X3_test = cvec3.transform(X3_test)
X3_test_df = pd.DataFrame(X3_test.toarray(),
                         columns=cvec3.get_feature_names())


In [None]:
l3 = LogisticRegression()
l3.fit(X3_train_df, y_train)

In [None]:
print(l3.score(X3_train_df, y_train))
l3.score(X3_test_df, y_test)

This is the best-performing model so far: a logistic regression with lemmatizing, English stop words removed, and original casing kept (lowercase = False).
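One side effect of lowercase = False is worth keeping in mind: the built-in English stop list is all lowercase and tokens are matched against it exactly, so capitalised forms such as 'The' survive as features. A small sketch with two made-up documents and the default tokenizer (not the LemmaTokenizer above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents that differ only in the casing of 'the'.
docs = ['The cat sat', 'the dog ran']

# Same stop list, with and without lowercasing before the stop-word filter.
cased = CountVectorizer(stop_words='english', lowercase=False).fit(docs)
lowered = CountVectorizer(stop_words='english', lowercase=True).fit(docs)

cased_vocab = set(cased.vocabulary_)      # keeps 'The', drops 'the'
lowered_vocab = set(lowered.vocabulary_)  # drops both
```

This may be part of why keeping case helps here: sentence-initial words like 'What' and 'If' stay in the model even though their lowercase forms are filtered out.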

#### With 'word_count' feature

In [None]:
#X4 = pd.DataFrame(df, columns = ['title', 'word_count'])

In [None]:
X4_train, X4_test, y_train, y_test = train_test_split(X,
                                                      y,
                                                     test_size=0.33,
                                                     stratify=y,
                                                     random_state=42)

In [None]:
cvec4 = CountVectorizer(tokenizer=LemmaTokenizer(),
                      stop_words = 'english',
                      lowercase = False)

# cvec.fit(X2_train)
# X2_train = cvec2.transform(X2_train)

In [None]:
# X2_train_df = pd.DataFrame(X2_train.toarray(),
#                           columns=cvec2.get_feature_names())
# X2_train_df
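This section was left unfinished. One possible way to feed 'word_count' into the model alongside the token counts is scikit-learn's ColumnTransformer. The sketch below is an assumption about how it could be wired up, not what was run in this notebook, and the rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'title': ['shower water soap', 'soap shower towel rack',
              'cosmos universe mind', 'mind universe'],
    'subreddit': [0, 0, 1, 1],
})
toy['word_count'] = toy['title'].str.split().str.len()

features = ColumnTransformer([
    # A bare string hands CountVectorizer a 1-D column of titles.
    ('title_counts', CountVectorizer(), 'title'),
    # A list keeps word_count as a 2-D numeric column, passed through unchanged.
    ('length', 'passthrough', ['word_count']),
])

model = Pipeline([('features', features), ('lr', LogisticRegression())])
X_toy = toy[['title', 'word_count']]
model.fit(X_toy, toy['subreddit'])

# Eight vocabulary columns plus the one word_count column.
n_columns = model.named_steps['features'].transform(X_toy).shape[1]
```

With this layout X has to be a DataFrame with both columns, which is what the commented-out X4 line above was heading toward.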

### CountVectorizer w/ Grid Search

In [6]:
X5_train, X5_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [6]:
#taken from the english stopword list and a few of my own
stop_words = ['stoner', 'weed', 'marijuana', 'high', 'baked', 'me', 'my', 'myself',
              'ourselves', "you're", "you've", "you'll", "you'd", 'your', 'yours', 
              'yourself', 'yourselves', 'he', 'him', 'himself', 'she', "she's",
              'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 
              'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 
              "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 
              'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a',
              'an', 'the', 'and', 'but', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
              'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 
              'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
              'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 
              'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
              's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
              'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
              'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't",
              'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn',
              "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't",
             'smoke', 'stone', 'shower']

In [7]:
# adapted from lecture
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

In [8]:
pipe_params = {
    'cvec__max_features': [2000, 3000, 4000, 5000], #max number of words to go into the model
    'cvec__min_df': [2, 3],      #minimum number of documents a token must appear in to be kept
    'cvec__max_df': [.9, .95],   #drop tokens that appear in more than this fraction of documents
    'cvec__ngram_range': [(1,1), (1,2)], #checking single words, and also two-word phrases
    'cvec__stop_words': [stop_words, None], #don't have features that are stop words
    'cvec__tokenizer': [LemmaTokenizer(), None], #lemmatize and not lemmatize, this argument caused problems
    'cvec__lowercase': [False, True]
}
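It helps to know how large this grid is before fitting it. The sketch below counts the combinations by hand, assuming five-fold cross-validation; the option counts are copied from pipe_params.

```python
from math import prod

# Number of candidate values per parameter, mirroring pipe_params.
options_per_param = {
    'cvec__max_features': 4,
    'cvec__min_df': 2,
    'cvec__max_df': 2,
    'cvec__ngram_range': 2,
    'cvec__stop_words': 2,
    'cvec__tokenizer': 2,
    'cvec__lowercase': 2,
}

# Every combination is tried once per fold.
n_candidates = prod(options_per_param.values())
n_folds = 5
n_fits = n_candidates * n_folds
```

That comes to 256 candidate settings and 1,280 model fits, half of which run the slower lemmatizing tokenizer, so a long run time is expected.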

In [9]:
gs = GridSearchCV(pipe, 
                  param_grid=pipe_params, 
                  cv=5) 

In [12]:
gs.fit(X5_train, y_train) #this takes a while and throws a lot of warnings

  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))


  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))


  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))


  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))
  'stop_words.' % sorted(inconsistent))


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [13]:
# Best mean cross-validated score found by the grid search
gs.best_score_

0.6880597014925374

In [14]:
gs_model = gs.best_estimator_

In [15]:
gs_model.score(X5_train, y_train)

0.9708955223880597

In [16]:
gs_model.score(X5_test, y_test)

0.6848484848484848

In [17]:
gs.best_params_

{'cvec__lowercase': False,
 'cvec__max_df': 0.9,
 'cvec__max_features': 3000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': None,
 'cvec__tokenizer': <__main__.LemmaTokenizer at 0x1a17fb3590>}

## Pickling

In [13]:
# Rebuild the pipeline with the best parameters found by the grid search
model = make_pipeline(
    CountVectorizer(lowercase=False, max_df=0.9, max_features=3000, min_df=2,
                    ngram_range=(1, 2), stop_words=None,
                    tokenizer=LemmaTokenizer()),
    MultinomialNB(alpha=1.0))
model.fit(X5_train, y_train);


In [14]:
print(f"train score {model.score(X5_train, y_train)}")

print(f"test score {model.score(X5_test, y_test)}")

train score 0.8835820895522388
test score 0.6878787878787879


In [15]:
import pickle

In [16]:
!ls -s

total 1872
  24 Getting_the_Data.ipynb       1248 Project_3_Report.ipynb
 584 Models.ipynb                   16 Tokenizing&Lemmatizing.ipynb


In [19]:
model_fn = 'rosedennis.pickle'

with open(model_fn, 'wb') as f:
    pickle.dump(model, f)

In [20]:
!ls -s

total 4000
  24 Getting_the_Data.ipynb         16 Tokenizing&Lemmatizing.ipynb
 584 Models.ipynb                 1064 rosedennis
1248 Project_3_Report.ipynb       1064 rosedennis.pickle
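
Saving the model is only half of the job, so here is a minimal sketch of the full round trip (dump, load, compare). The dictionary `stand_in_model` and the file name `example_model.pickle` are hypothetical placeholders so the sketch runs on its own; in this notebook the object would be the fitted `model`.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline: any picklable object round-trips the same way.
# (Hypothetical placeholder; the notebook pickles the real `model` instead.)
stand_in_model = {
    "vectorizer": {"max_features": 3000, "ngram_range": (1, 2)},
    "classifier": {"alpha": 1.0},
}

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "example_model.pickle")  # hypothetical name

    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

    file_size_bytes = os.path.getsize(model_path)

# The restored object should compare equal to the original.
round_trip_ok = restored_model == stand_in_model
print(round_trip_ok, file_size_bytes)
```

One caveat for the real pipeline: pickle stores custom classes by module and name, and `LemmaTokenizer` lives in this notebook's `__main__`. Whoever loads `rosedennis.pickle` will need the same class defined or imported before calling `pickle.load`.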


We see that our training score (0.97) is much higher than our test score (0.68), so the model is overfitting. We also see the different parameters that produced the best model. The test score was the highest so far, but the grid search took a very long time to run.

We will now explore this best model a bit.
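
The long run time is easier to understand once the number of fits is counted. The sketch below multiplies out a grid and the 5 folds shown in the `GridSearchCV` output. The grid values are hypothetical and chosen only for illustration; they are not the ones searched in this notebook.

```python
from itertools import product

# Hypothetical grid: the values are illustrative, not the ones used in this notebook.
param_grid = {
    "cvec__max_features": [2000, 3000, 4000],
    "cvec__min_df": [2, 3],
    "cvec__max_df": [0.9, 0.95],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__lowercase": [True, False],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV output

# Every combination of values is one candidate model.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is fit once per fold.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Each of those fits re-tokenizes and re-lemmatizes every training title, which is probably where most of the time goes. Adding one more option to any parameter multiplies the total again.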

### Logistic Regression model with the best Count Vectorizer parameters

In [None]:
X6_train, X6_test, y_train, y_test = train_test_split(X,
                                                      y,
                                                     test_size=0.33,
                                                     stratify=y,
                                                     random_state=42)

In [None]:
cvec6 = CountVectorizer(tokenizer=LemmaTokenizer(),
                      stop_words = None,
                      lowercase = False,
                        max_df = 0.9,
                        max_features = 3000,
                        min_df = 2,
                        ngram_range = (1,2)
                       )
cvec6.fit(X6_train)
X6_train = cvec6.transform(X6_train)

In [None]:
X6_train_df = pd.DataFrame(X6_train.toarray(),
                          columns=cvec6.get_feature_names())
X6_test = cvec6.transform(X6_test)
X6_test_df = pd.DataFrame(X6_test.toarray(),
                         columns=cvec6.get_feature_names())


In [None]:
l6 = LogisticRegression()
l6.fit(X6_train_df, y_train)

In [None]:
print(l6.score(X6_train_df, y_train))
l6.score(X6_test_df, y_test)

### Multinomial Naive Bayes model with Count Vectorizer

In [18]:
X7_train, X7_test, y_train, y_test = train_test_split(X,
                                                      y,
                                                     test_size=0.33,
                                                     stratify=y,
                                                     random_state=42)

In [19]:
cvec7 = CountVectorizer(tokenizer=LemmaTokenizer(),
                      stop_words = None,
                      lowercase = False,
                        max_df = 0.9,
                        max_features = 3000,
                        min_df = 2,
                        ngram_range = (1,2)
                       )
cvec7.fit(X7_train)
X7_train = cvec7.transform(X7_train)
X7_test = cvec7.transform(X7_test)


In [20]:
X7_train_df = pd.DataFrame(X7_train.toarray(),
                          columns=cvec7.get_feature_names())
X7_train_df

Unnamed: 0,!,! !,#,%,% of,&,& amp,& gt,',' doe,...,‘,’,’ ll,’ m,’ re,’ s,’ t,’ ve,“,”
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1335,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1336,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1337,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1338,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
mnb = MultinomialNB()

In [22]:
mnb.fit(X7_train, y_train)
mnb.score(X7_train, y_train)

0.8835820895522388

In [23]:
mnb.score(X7_test, y_test)

0.6878787878787879

This performs slightly better on the test set (0.688) than our grid search model with logistic regression (0.685).

In [24]:
mnb.coef_

array([[-6.40417606, -8.80207134, -8.39660623, ..., -7.8857806 ,
        -7.19263342, -7.29799394]])

In [25]:
coef_df = pd.DataFrame(list(mnb.coef_), columns =cvec7.get_feature_names()).T

In [26]:
coef_df[0].sort_values(ascending=False)

.                      -3.867597
the                    -3.911722
a                      -4.092541
,                      -4.201914
is                     -4.253472
                          ...   
pencil                 -9.495219
will become            -9.495219
people in              -9.495219
actually watertrucks   -9.495219
heated debate          -9.495219
Name: 0, Length: 3000, dtype: float64
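
The top of this ranking is just punctuation and common words. In this scikit-learn version, `mnb.coef_` for a two-class problem appears to hold the smoothed log probability of each token given class 1, so frequent tokens always rank first, whichever subreddit they favour. A sketch of an alternative is to compare the log probabilities of the two classes. The token counts below are made up for illustration, and the helper `smoothed_log_probs` is hypothetical.

```python
import math

# Toy token counts per class; the numbers are made up for illustration.
vocabulary = ["the", ".", "shower", "high"]
counts_class0 = [50, 60, 12, 1]   # subreddit coded 0
counts_class1 = [55, 58, 1, 15]   # subreddit coded 1
alpha = 1.0  # additive smoothing, the MultinomialNB default


def smoothed_log_probs(counts, alpha):
    """Return log P(token | class) with additive smoothing."""
    total = sum(counts) + alpha * len(counts)
    return [math.log((c + alpha) / total) for c in counts]


log_p0 = smoothed_log_probs(counts_class0, alpha)
log_p1 = smoothed_log_probs(counts_class1, alpha)

# Positive values favour class 1, negative values favour class 0.
log_ratio = {tok: lp1 - lp0 for tok, lp0, lp1 in zip(vocabulary, log_p0, log_p1)}

# Ranking by class-1 log probability alone rewards frequent tokens.
by_class1_log_prob = sorted(
    vocabulary, key=lambda t: log_p1[vocabulary.index(t)], reverse=True
)
# Ranking by the ratio rewards tokens that separate the classes.
by_log_ratio = sorted(vocabulary, key=log_ratio.get, reverse=True)

print("ranked by log P(token | class 1):", by_class1_log_prob)
print("ranked by log ratio:", by_log_ratio)
```

With the fitted model, the same comparison would use the two rows of `mnb.feature_log_prob_`, assuming that attribute is available in the installed version.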

In [27]:
mnb.predict_proba(X7_test)

array([[0.73213412, 0.26786588],
       [0.00633525, 0.99366475],
       [0.99434116, 0.00565884],
       ...,
       [0.87768034, 0.12231966],
       [0.84819184, 0.15180816],
       [0.08147163, 0.91852837]])

In [28]:
y_pred7 = mnb.predict(X7_test)

In [29]:
results7 = pd.DataFrame(y_pred7, columns=['predicted'])
results7['actual'] = y_test.to_list()
results7

Unnamed: 0,predicted,actual
0,0,0
1,1,1
2,0,0
3,0,1
4,0,1
...,...,...
655,1,1
656,0,1
657,0,0
658,0,0
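
Scanning the table by eye only goes so far, so here is a small sketch that tallies the outcomes. It uses only the nine rows visible in the truncated display above, so the counts describe that sample and not the full 660-row test set.

```python
from collections import Counter

# The nine (predicted, actual) pairs visible in the truncated table.
displayed_rows = [(0, 0), (1, 1), (0, 0), (0, 1), (0, 1),
                  (1, 1), (0, 1), (0, 0), (0, 0)]

pair_counts = Counter(displayed_rows)
true_neg = pair_counts[(0, 0)]
true_pos = pair_counts[(1, 1)]
false_neg = pair_counts[(0, 1)]  # predicted 0, actually 1
false_pos = pair_counts[(1, 0)]  # predicted 1, actually 0

accuracy = (true_pos + true_neg) / len(displayed_rows)

# Positions of the misclassified rows within this small sample
# (not the row labels of the full table).
wrong_positions = [
    i for i, (pred, actual) in enumerate(displayed_rows) if pred != actual
]

print(true_neg, false_pos, false_neg, true_pos, round(accuracy, 3))
print(wrong_positions)
```

In this sample every error is a class 1 post predicted as class 0. Running the same tally over all of `results7` would show whether that pattern holds for the whole test set.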


In [30]:
y_test.index[3]

1984

In [31]:
# y_test.index holds row labels, so look the row up by label
df.loc[y_test.index[3], :]

created_utc                                           1329651853
title                                         Fucked up ironies.
selftext       Does no one else think it was weird that Owen ...
subreddit                                                      1
permalink      /r/StonerPhilosophy/comments/pwd1n/fucked_up_i...
author                                          IFUCKINGLOVEMETH
Name: 1984, dtype: object

### Gaussian Naive Bayes model with TFIDF Vectorizer

I'm going to use the same parameters as our CountVectorizer model so we can compare more easily. I'll instantiate the TFIDF vectorizer and then fit a Gaussian Naive Bayes model on its output.
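
To make clear what changes when counts are swapped for TF-IDF weights, here is a minimal pure-Python sketch of the weighting. It assumes the `TfidfVectorizer` defaults (smoothed idf, L2 normalization). The three tokenized titles are made up for illustration, and `tfidf_row` is a hypothetical helper.

```python
import math

# Three toy, already-tokenized titles (made up for illustration).
docs = [
    ["time", "is", "a", "circle"],
    ["a", "shower", "is", "a", "rain", "room"],
    ["time", "flies"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of titles containing each token.
doc_freq = {tok: sum(tok in doc for doc in docs) for tok in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}


def tfidf_row(doc):
    """Raw term count times idf, then L2-normalized."""
    weights = {tok: doc.count(tok) * idf[tok] for tok in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}


row = tfidf_row(docs[0])
print(sorted(row.items(), key=lambda kv: kv[1], reverse=True))
```

In the first title, "circle" gets the largest weight because it appears in only one title, while "time", "is", and "a" each appear in two. That is the main difference from raw counts, where all four tokens would score 1.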

In [None]:
X8_train, X8_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
tfidf8 = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                      stop_words = None,
                      lowercase = False,
                        max_df = 0.9,
                        max_features = 3000,
                        min_df = 2,
                        ngram_range = (1,2)
                       )
tfidf8.fit(X8_train)
X8_train = tfidf8.transform(X8_train)
X8_test = tfidf8.transform(X8_test)

In [None]:
gaus = GaussianNB()

In [None]:
gaus.fit(X8_train.toarray(), y_train)

In [None]:
gaus.score(X8_train.toarray(), y_train)

In [None]:
gaus.score(X8_test.toarray(), y_test)

The TFIDF Vectorizer with a Gaussian Naive Bayes model performs worse than our Count Vectorizer with a multinomial Naive Bayes model.

### Support Vector Machine models

In [32]:
X9_train, X9_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

In [33]:
tfidf9 = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                      stop_words = None,
                      lowercase = False,
                        max_df = 0.9,
                        max_features = 3000,
                        min_df = 2,
                        ngram_range = (1,2)
                       )
tfidf9.fit(X9_train)
X9_train = tfidf9.transform(X9_train)
X9_test = tfidf9.transform(X9_test)

In [39]:
X9_train.shape

(1340, 3000)

In [40]:
X9_test.shape

(660, 3000)

In [34]:
svc = SVC(gamma="scale")

In [35]:
svc.fit(X9_train, y_train)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [36]:
y_pred9 = svc.predict(X9_test)

accuracy_score(y_test, y_pred9)


0.6984848484848485

In [37]:
results = pd.DataFrame(y_pred9, columns=['predicted'])
results['actual'] = y_test.to_list()
results

Unnamed: 0,predicted,actual
0,0,1
1,1,0
2,0,1
3,0,0
4,1,1
...,...,...
655,0,0
656,1,1
657,1,1
658,0,0
