<h2>Challenge: Iterate and evaluate your classifier</h2>

It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

Do any of your classifiers seem to overfit?
Which seem to perform the best? Why?
Which features seemed to be most impactful to performance?

Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.

In [932]:
#importing modules and potential modules
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random
import nltk
import warnings
from nltk.corpus import stopwords
from sklearn import metrics
from sklearn import naive_bayes
# CountVectorizer is used below to build the document-term matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFdr, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
warnings.filterwarnings("ignore", category=FutureWarning)

In [933]:
#importing the Amazon review dataset
df = pd.read_csv('https://raw.githubusercontent.com/GenTaylor/unit2/master/amazon_cells_labelled.txt', delimiter= '\t', header = None)


In [934]:
#giving the columns names
df.columns = ['review', 'number']

In [935]:
#checking the data
df.head()

Unnamed: 0,review,number
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


In [936]:
df.tail()

Unnamed: 0,review,number
995,The screen does get smudged easily because it ...,0
996,What a piece of junk.. I lose more calls on th...,0
997,Item Does Not Match Picture.,0
998,The only thing that disappoint me is the infra...,0
999,"You can not answer calls with the unit, never ...",0


In [937]:
df.shape

(1000, 2)

In [938]:
# examine the class distribution
#0=negative 1=positive
df.number.value_counts()

1    500
0    500
Name: number, dtype: int64

In [939]:
# define X (review text) and y (labels) for use with CountVectorizer

X = df.review
y = df.number
print (X.shape)
print (y.shape)

(1000,)
(1000,)


In [940]:
# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(750,)
(250,)
(750,)
(250,)


In [941]:
# instantiate the vectorizer
vect = CountVectorizer()

In [942]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [943]:
# examine the document-term matrix
X_train_dtm

<750x1541 sparse matrix of type '<class 'numpy.int64'>'
	with 6874 stored elements in Compressed Sparse Row format>

In [944]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<250x1541 sparse matrix of type '<class 'numpy.int64'>'
	with 1914 stored elements in Compressed Sparse Row format>
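
As a small illustration of what fit and transform do here, the sketch below runs CountVectorizer on a few made-up reviews (the toy sentences and the `toy_` variable names are assumptions, not rows from the dataset). Words that appear only in the test text are not in the fitted vocabulary, so they are dropped, which is why the test matrix above has the same 1541 columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, used only to illustrate fit vs. transform.
toy_train = ["great phone", "bad battery"]
toy_test = ["great screen"]  # "screen" never appears in the training text

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)  # learn vocabulary, then count
toy_test_dtm = toy_vect.transform(toy_test)        # reuse the fitted vocabulary

print(sorted(toy_vect.vocabulary_))       # learned terms
print(toy_train_dtm.shape, toy_test_dtm.shape)
print(toy_test_dtm.toarray())             # only "great" is counted
```
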

<h2>First we used MultinomialNB</h2>

In [945]:
#instantiate a Multinomial Naive Bayes model
mnb = MultinomialNB()

In [946]:
# train the model using X_train_dtm 
mnb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [947]:
# make class predictions for X_test_dtm
y_pred_class = mnb.predict(X_test_dtm)

In [948]:
# calculate accuracy of class predictions
metrics.accuracy_score(y_test, y_pred_class)

0.792
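
Test accuracy on its own does not show whether a model overfits. One common check is to compare it with accuracy on the training data: a large gap suggests the model has memorized the training set. The sketch below shows the idea on a hypothetical toy corpus (the sentences and variable names are assumptions); in this notebook the equivalent comparison would be between `mnb.score(X_train_dtm, y_train)` and `mnb.score(X_test_dtm, y_test)`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews (not from the Amazon dataset).
toy_train_docs = ["great phone", "excellent value", "works great",
                  "poor battery", "bad reception", "waste of money"]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_docs = ["great value", "bad battery", "excellent reception"]
toy_test_labels = [1, 0, 1]

toy_vect = CountVectorizer()
toy_X_train = toy_vect.fit_transform(toy_train_docs)
toy_X_test = toy_vect.transform(toy_test_docs)

toy_model = MultinomialNB().fit(toy_X_train, toy_train_labels)

train_acc = toy_model.score(toy_X_train, toy_train_labels)
test_acc = toy_model.score(toy_X_test, toy_test_labels)
gap = train_acc - test_acc  # a large positive gap is a warning sign

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print("gap:", gap)
```
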

In [949]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[ 96,  36],
       [ 16, 102]], dtype=int64)
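
The confusion matrix can be unpacked into the usual per-class rates. The sketch below hard-codes the four counts printed above, reading the matrix in scikit-learn's layout of rows as true labels and columns as predicted labels; the variable names are my own.

```python
# Counts copied from the confusion matrix above:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = 96, 36, 16, 102
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all reviews classified correctly
sensitivity = tp / (tp + fn)      # share of positive reviews caught (recall)
specificity = tn / (tn + fp)      # share of negative reviews caught
precision = tp / (tp + fp)        # share of predicted positives that are correct

print("accuracy:    {:.3f}".format(accuracy))
print("sensitivity: {:.3f}".format(sensitivity))
print("specificity: {:.3f}".format(specificity))
print("precision:   {:.3f}".format(precision))
```

The accuracy matches the 0.792 reported above. Sensitivity is higher than specificity, so this model makes more false positive errors (36) than false negative errors (16).
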

In [950]:
# print review text for negatives incorrectly classified as positive (false positives)
X_test[y_test < y_pred_class]

142                      I was not happy with this item.
108    The camera, although rated at an impressive 1....
90     For a product that costs as much as this one d...
453    I even fully charged it before I went to bed a...
639    Disappointing accessory from a good manufacturer.
885    When it opens, the battery connection is broke...
307    As many people complained, I found this headse...
734    We have tried 2 units and they both failed wit...
528                                 No real improvement.
992                     Lasted one day and then blew up.
559                 None of it works, just don't buy it.
154    I've bought $5 wired headphones that sound bet...
868                   The item received was Counterfeit.
794    The internet access was fine, it the rare inst...
311                   The instruction manual is lacking.
799    I tried talking real loud but shouting on the ...
625                      Very Dissapointing Performance.
248                     Interne

In [951]:
# print review text for positives incorrectly classified as negative (false negatives)
X_test[y_test > y_pred_class]

600    Their Research and Development division obviou...
721    My phone doesn't slide around my car now and t...
537    Small, sleek, impressive looking, practical se...
34     Car charger as well as AC charger are included...
956    Just reading on the specs alone makes you say ...
911    So I bought about 10 of these and saved alot o...
512    The sound is clear and the people I talk to on...
159                                W810i is just SUPERB.
797    A good quality bargain.. I bought this after I...
443           Restored my phone to like new performance.
578    It does everything the description said it would.
699    Comfortable fit - you need your headset to be ...
429    My Sanyo has survived dozens of drops on black...
945    It is easy to turn on and off when you are in ...
301              Now I know that I made a wise decision.
474                            The delivery was on time.
Name: review, dtype: object

In [952]:
# example false negative
X_test[474]

'The delivery was on time.'

In [953]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = mnb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([1.01249421e-01, 1.59691980e-01, 9.99590149e-01, 9.38768598e-01,
       1.10011416e-02, 2.38536973e-01, 9.99332483e-01, 1.46480394e-01,
       1.21092819e-02, 2.43815666e-01, 8.56946008e-02, 7.01233381e-01,
       1.36199605e-03, 9.88800911e-01, 9.74630257e-01, 7.78014357e-01,
       3.74553508e-01, 3.73808381e-02, 9.69520797e-02, 9.83575217e-01,
       6.07787946e-01, 1.18083722e-03, 9.82459893e-01, 1.13440926e-01,
       9.84675604e-01, 8.11570485e-01, 7.84398051e-01, 9.74985926e-01,
       9.92767240e-03, 8.68460079e-02, 9.51402836e-01, 9.66162789e-01,
       7.01838147e-03, 8.80009625e-02, 3.61780421e-03, 2.90891608e-01,
       9.99555431e-01, 9.30543429e-02, 1.03872153e-03, 3.77459020e-02,
       7.05493240e-01, 8.28459721e-01, 9.87477996e-01, 1.22254776e-03,
       2.34752697e-03, 2.35537072e-01, 4.45437474e-01, 1.80334884e-01,
       5.45106235e-01, 4.79577998e-02, 8.68086072e-01, 1.33103427e-01,
       6.78055325e-01, 7.91211096e-03, 2.90573208e-01, 1.89659532e-01,
      

In [954]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.8744863893168977
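
AUC can be read as the probability that a randomly chosen positive review gets a higher predicted probability than a randomly chosen negative one. The sketch below computes it that way on a four-row toy example; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function, and it is far slower than `metrics.roc_auc_score` on real data.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (made up for illustration).
toy_labels = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_probs)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```
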

In [955]:
# examine class distribution
print(y_test.value_counts())

0    132
1    118
Name: number, dtype: int64


In [956]:
# calculate null accuracy: the accuracy of always predicting the most frequent class
# value_counts() sorts by frequency, so .head(1) keeps the count of the majority class (132 here)
null_accuracy = y_test.value_counts().head(1) / len(y_test)
print('Null accuracy:', null_accuracy)

Null accuracy: 0    0.528
Name: number, dtype: float64


In [957]:
# Manual calculation of null accuracy by always predicting the majority class
print('Manual null accuracy:',(132 / (132 + 118)))

Manual null accuracy: 0.528
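
The same baseline can be written as a small reusable function. This is a minimal sketch in plain Python; `null_accuracy` is a hypothetical helper name, and the label list is rebuilt from the class counts printed above rather than taken from `y_test` itself.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Rebuild the test labels from the class counts above: 132 negative, 118 positive.
toy_y_test = [0] * 132 + [1] * 118

baseline = null_accuracy(toy_y_test)
lift = 0.792 - baseline  # how far the MultinomialNB accuracy is above the baseline

print("null accuracy: {:.3f}".format(baseline))
print("improvement over baseline: {:.3f}".format(lift))
```
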


<h2>Feature selection SelectPercentile</h2>

In [958]:
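#feature selection SelectPercentile @ 50%
#no score function is passed, so SelectPercentile uses its default, f_classif (the selectors further down use chi2)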

select = SelectPercentile(percentile=50)
select.fit(X_train_dtm, y_train)
X_train_selected = select.transform(X_train_dtm)

print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_selected is : {}'.format(X_train_selected.shape))

X_train_dtm.shape is : (750, 1541)
X_train_selected is : (750, 770)


In [959]:
X_test_selected = select.transform(X_test_dtm)
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logreg.score(X_test_dtm, y_test)))

logreg.fit(X_train_selected, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logreg.score(X_test_selected, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.796


In [960]:
#feature selection SelectPercentile @ 20%

select = SelectPercentile(percentile=20)
select.fit(X_train_dtm, y_train)
X_train_selected = select.transform(X_train_dtm)

print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_selected is : {}'.format(X_train_selected.shape))

X_train_dtm.shape is : (750, 1541)
X_train_selected is : (750, 308)


In [961]:
X_test_selected = select.transform(X_test_dtm)
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logreg.score(X_test_dtm, y_test)))

logreg.fit(X_train_selected, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logreg.score(X_test_selected, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.800


In [962]:
#feature selection SelectPercentile @ 90%

select = SelectPercentile(percentile=90)
select.fit(X_train_dtm, y_train)
X_train_selected = select.transform(X_train_dtm)

print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_selected is : {}'.format(X_train_selected.shape))

X_train_dtm.shape is : (750, 1541)
X_train_selected is : (750, 1386)


In [963]:
X_test_selected = select.transform(X_test_dtm)
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logreg.score(X_test_dtm, y_test)))

logreg.fit(X_train_selected, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logreg.score(X_test_selected, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.816
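
The three SelectPercentile runs above repeat the same steps by hand and score on a single train/test split. One alternative, sketched below, is to put the vectorizer, selector, and classifier in a Pipeline and cross-validate it, so the selector is refit on each training fold. Everything here is an assumption for illustration: the toy corpus stands in for `df.review` and `df.number`, sentences are repeated only so that each fold has enough rows, and chi2 is passed explicitly where the cells above relied on the default scorer. The printed scores are not meaningful; only the mechanics are.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df.review / df.number.
positive = ["great phone", "excellent value", "works great",
            "love this case", "good sound quality"]
negative = ["poor battery", "bad reception", "waste of money",
            "terrible quality", "does not work"]
docs = (positive + negative) * 2
labels = ([1] * len(positive) + [0] * len(negative)) * 2

cv_scores = {}
for pct in (20, 50, 90):
    pipe = Pipeline([
        ("vect", CountVectorizer()),                         # text -> counts
        ("select", SelectPercentile(chi2, percentile=pct)),  # keep top pct% of terms
        ("clf", LogisticRegression()),
    ])
    # 5-fold cross-validation; every step is refit on each training fold
    fold_scores = cross_val_score(pipe, docs, labels, cv=5)
    cv_scores[pct] = fold_scores.mean()
    print("percentile={}: mean accuracy {:.3f}".format(pct, cv_scores[pct]))
```
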


<h2>Feature selection SelectKBest</h2>

In [964]:
#feature selection selectkbest k=20
selectk = SelectKBest(chi2, k=20)
selectk.fit(X_train_dtm, y_train)
X_train_selectedk = selectk.transform(X_train_dtm)

print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_selectedk is : {}'.format(X_train_selectedk.shape))

X_train_dtm.shape is : (750, 1541)
X_train_selectedk is : (750, 20)


In [965]:
X_test_selectedk = selectk.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_selectedk, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_selectedk, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.696


In [982]:
#feature selection selectkbest k=10
selectk = SelectKBest(chi2, k=10)
selectk.fit(X_train_dtm, y_train)
X_train_selectedk = selectk.transform(X_train_dtm)

print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_selectedk is : {}'.format(X_train_selectedk.shape))

X_train_dtm.shape is : (750, 1541)
X_train_selectedk is : (750, 10)


In [983]:
X_test_selectedk = selectk.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_selectedk, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_selectedk, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.672


In [968]:
#feature selection selectkbest k=40
selectk = SelectKBest(chi2, k=40)
selectk.fit(X_train_dtm, y_train)
X_train_selectedk = selectk.transform(X_train_dtm)

print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_selectedk is : {}'.format(X_train_selectedk.shape))

X_train_dtm.shape is : (750, 1541)
X_train_selectedk is : (750, 40)


In [969]:
X_test_selectedk = selectk.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_selectedk, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_selectedk, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.712


<h2>Feature selection SelectFdr</h2>


In [970]:
#Feature SelectFdr alpha @ 0.01
sfdr = SelectFdr(chi2, alpha=0.01)
sfdr.fit(X_train_dtm, y_train)
X_train_sfdr = sfdr.transform(X_train_dtm)
print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_sfdr is : {}'.format(X_train_sfdr.shape))

X_train_dtm.shape is : (750, 1541)
X_train_sfdr is : (750, 6)


In [971]:
X_test_sfdr = sfdr.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_sfdr, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_sfdr, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.628


In [972]:
#Feature SelectFdr alpha @ 0.5
sfdr = SelectFdr(chi2, alpha=0.5)
sfdr.fit(X_train_dtm, y_train)
X_train_sfdr = sfdr.transform(X_train_dtm)
print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_sfdr is : {}'.format(X_train_sfdr.shape))

X_train_dtm.shape is : (750, 1541)
X_train_sfdr is : (750, 1293)


In [973]:
X_test_sfdr = sfdr.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_sfdr, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_sfdr, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.816


In [974]:
#Feature SelectFdr alpha @ 0.2
sfdr = SelectFdr(chi2, alpha=0.2)
sfdr.fit(X_train_dtm, y_train)
X_train_sfdr = sfdr.transform(X_train_dtm)
print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_sfdr is : {}'.format(X_train_sfdr.shape))

X_train_dtm.shape is : (750, 1541)
X_train_sfdr is : (750, 21)


In [975]:
X_test_sfdr = sfdr.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_sfdr, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_sfdr, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.692


<h2>Feature selection Variance Threshold</h2>

In [976]:
#Variance Threshold w 0 variance
vt = VarianceThreshold(threshold=0.0)
vt.fit(X_train_dtm, y_train)
X_train_vt = vt.transform(X_train_dtm)
print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_vt is : {}'.format(X_train_vt.shape))

X_train_dtm.shape is : (750, 1541)
X_train_vt is : (750, 1541)


In [977]:
#Variance Threshold w 0 variance
X_test_vt = vt.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_vt, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_vt, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.796


In [978]:
#Variance Threshold w 0.1 variance
vt = VarianceThreshold(threshold=0.1)
vt.fit(X_train_dtm, y_train)
X_train_vt = vt.transform(X_train_dtm)
print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_vt is : {}'.format(X_train_vt.shape))

X_train_dtm.shape is : (750, 1541)
X_train_vt is : (750, 13)
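
A rough way to see why so few columns survive a threshold of 0.1: if a word appears at most once per review (an assumption; repeated words give larger counts), its column holds only 0s and 1s, and its variance is p(1 - p), where p is the fraction of reviews containing the word. The sketch below finds how many of the 750 training reviews a word must appear in before that variance exceeds 0.1. The helper name is my own.

```python
def binary_variance(doc_count, n_docs):
    """Variance of a 0/1 column that is 1 in doc_count of n_docs rows."""
    p = doc_count / n_docs
    return p * (1 - p)

n_docs = 750      # rows in X_train_dtm
threshold = 0.1

# Smallest number of reviews a word must appear in to pass the threshold.
min_docs = next(k for k in range(1, n_docs + 1)
                if binary_variance(k, n_docs) > threshold)

# The largest variance a 0/1 column can have (word in exactly half the reviews).
max_binary_variance = binary_variance(n_docs // 2, n_docs)

print("reviews needed:", min_docs)
print("share of reviews: {:.3f}".format(min_docs / n_docs))
print("max variance of a 0/1 column:", max_binary_variance)
```

Under this assumption, only words found in roughly 11% or more of the reviews pass a threshold of 0.1, which is likely to favor very common words. A 0/1 column can never exceed a variance of 0.25, so a feature that passes a threshold of 0.5 has to be a word that occurs more than once in some reviews.
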


In [979]:
#Variance Threshold w 0.1 variance
X_test_vt = vt.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_vt, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_vt, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.628


In [980]:
#Variance Threshold w 0.5 variance
vt = VarianceThreshold(threshold=0.5)
vt.fit(X_train_dtm, y_train)
X_train_vt = vt.transform(X_train_dtm)
print('X_train_dtm.shape is : {}'.format(X_train_dtm.shape))
print('X_train_vt is : {}'.format(X_train_vt.shape))

X_train_dtm.shape is : (750, 1541)
X_train_vt is : (750, 1)


In [981]:
#Variance Threshold w 0.5 variance
X_test_vt = vt.transform(X_test_dtm)
logregk = LogisticRegression()
logregk.fit(X_train_dtm, y_train)

print('The score of Logistic Regression on all features: {:.3f}'.format(logregk.score(X_test_dtm, y_test)))

logregk.fit(X_train_vt, y_train)
print('The score of Logistic Regression on the selected features: {:.3f}'.format(logregk.score(X_test_vt, y_test)))

The score of Logistic Regression on all features: 0.796
The score of Logistic Regression on the selected features: 0.536


<h2>Summary</h2>

<b>Below are the accuracy scores for each model and feature selection method used:</b>

<b>MultinomialNB:</b> 
-  metrics.accuracy_score = 0.792
-  metrics.roc_auc_score = 0.8745
-  null_accuracy = 0.528


<b>With all features selected, logistic regression scored 0.796. The scores below are for logistic regression after each feature selection method, at the settings shown.</b>

<b>SelectPercentile:</b>
-  Score @ 50% Selected: 0.796
-  Score @ 20% Selected: 0.800
-  Score @ 90% Selected: 0.816


<b>SelectKBest:</b>
-  Score w/ k=20: 0.696
-  Score w/ k=10: 0.672
-  Score w/ k=40: 0.712


<b>SelectFdr:</b>
-  Score w/ alpha=0.01: 0.628
-  Score w/ alpha=0.5: 0.816
-  Score w/ alpha=0.2: 0.692


<b>VarianceThreshold:</b>
-  Score w/ threshold=0: 0.796
-  Score w/ threshold=0.1: 0.628
-  Score w/ threshold=0.5: 0.536


Looking at these feature selection methods, I noticed a pattern for all of them except SelectPercentile. For each of the others, the score moved steadily in one direction as the parameter changed. With SelectPercentile, the score went up and down as the percentile changed (0.800 at 20%, 0.796 at 50%, 0.816 at 90%), which leads me to believe that it could cause overfitting.

SelectKBest increased as k increased.
SelectFdr increased as alpha increased.
VarianceThreshold decreased as the threshold increased.

SelectPercentile was the only method whose score did not follow a consistent trend as its setting changed. It seemed to make things more unpredictable than the other methods.
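
One rough way to gauge how much weight the SelectPercentile differences can bear is to express them in numbers of test reviews and compare them with the sampling error of an accuracy measured on 250 reviews. The sketch below uses the normal approximation to the binomial standard error, which assumes the test reviews are independent. It is a back-of-the-envelope check, not a formal significance test, and the variable names are my own.

```python
import math

n_test = 250            # size of the test set
baseline_acc = 0.796    # logistic regression on all features

# Approximate standard error of an accuracy estimated from n_test reviews.
std_err = math.sqrt(baseline_acc * (1 - baseline_acc) / n_test)
print("standard error: {:.4f}".format(std_err))

# SelectPercentile scores reported in the summary above.
percentile_scores = {20: 0.800, 50: 0.796, 90: 0.816}
for pct, score in percentile_scores.items():
    diff = score - baseline_acc
    reviews = round(diff * n_test)   # difference expressed as whole reviews
    print("percentile={}: {:+.3f} ({} reviews, {:.2f} standard errors)".format(
        pct, diff, reviews, diff / std_err))
```

The largest gap, at the 90th percentile, corresponds to 5 of the 250 test reviews and is less than one standard error, so the ups and downs may reflect noise from the single train/test split as much as any property of SelectPercentile. Repeating the comparison with cross-validation would give a steadier picture.
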

