# Challenge: Iterate and evaluate your classifier

It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

- Do any of your classifiers seem to overfit?
- Which seem to perform the best? Why?
- Which features seemed to be most impactful to performance?

Write up your iterations and answers to the above questions in a few pages.
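One possible way to compare versions side by side is to score each feature set with the same cross-validation folds and look at the gap between training accuracy and cross-validated accuracy. The sketch below is only an illustration: it uses synthetic 0/1 features in place of the review keywords, and the helper `compare_versions` and the version names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB


def compare_versions(versions, y, cv=5):
    """Return {name: (train_accuracy, mean_cv_accuracy)} for each feature matrix."""
    results = {}
    for name, X in versions.items():
        model = BernoulliNB()
        # Accuracy on held-out folds; a large gap to train accuracy hints at overfitting.
        cv_scores = cross_val_score(model, X, y, cv=cv)
        train_acc = model.fit(X, y).score(X, y)
        results[name] = (train_acc, cv_scores.mean())
    return results


# Synthetic stand-in data (an assumption, not the review data): 200 samples.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.2
informative = np.where(flip, 1 - y, y)        # agrees with the label about 80% of the time
noise = rng.integers(0, 2, size=(200, 5))     # unrelated to the label
X_all = np.column_stack([informative, noise])

versions = {
    "noise_only": X_all[:, 1:],
    "all_features": X_all,
}
results = compare_versions(versions, y)
for name, (train_acc, cv_acc) in results.items():
    print(f"{name}: train={train_acc:.3f}  cv={cv_acc:.3f}")
```

A version whose training accuracy sits well above its cross-validated accuracy is a candidate for overfitting; a version that improves both is a candidate for "best".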

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy
import sklearn
from sklearn.metrics import confusion_matrix
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

%matplotlib inline

In [11]:
amazon_cells_raw = pd.read_csv('amazon_cells_labelled.txt', delimiter='\t', header=None)
amazon_cells_raw.columns = ['review', 'pos']

In [12]:
# Count how often each word of 3 to 15 letters appears in the review file.
import re

frequency = {}
# Use a context manager so the file is closed after reading.
with open('amazon_cells_labelled.txt', 'r') as document_text:
    text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Print every word with its count, in order of first appearance.
for word, count in frequency.items():
    print(word, count)

there 17
way 7
for 121
plug 11
here 4
the 519
unless 3
converter 1
good 77
case 29
excellent 27
value 5
great 99
jawbone 3
tied 1
charger 19
conversations 3
lasting 2
more 19
than 28
minutes 6
major 1
problems 12
mic 4
have 73
jiggle 1
get 22
line 5
right 12
decent 4
volume 12
you 71
several 7
dozen 1
hundred 1
contacts 3
then 17
imagine 1
fun 1
sending 4
each 1
them 13
one 41
are 42
razr 5
owner 1
must 4
this 208
needless 1
say 7
wasted 2
money 19
what 17
waste 14
and 311
time 27
sound 43
quality 49
was 90
very 104
impressed 9
when 22
going 6
from 33
original 5
battery 46
extended 2
two 14
were 4
seperated 1
mere 1
started 5
notice 2
excessive 1
static 3
garbled 1
headset 48
though 3
design 11
odd 1
ear 35
clip 4
not 117
comfortable 17
all 41
highly 9
recommend 26
any 20
who 3
has 34
blue 4
tooth 2
phone 168
advise 2
everyone 3
fooled 1
far 13
works 47
clicks 1
into 10
place 4
that 82
makes 11
wonder 1
how 9
long 13
mechanism 1
would 34
last 8
went 7
motorola 13
website 2
followed 1
...
humming 1
equipment 1
certain 1
places 1
girl 1
complain 1
wake 1
styling 1
restocking 1
fee 1
darn 1
lousy 1
seen 1
sweetest 1
securely 1
hook 1
directed 1
canal 1
unsatisfactory 1
videos 1
negatively 1
adapter 1
provide 1
hype 1
assumed 1
lense 1
covered 1
falls 1
text 1
messaging 1
tricky 1
painful 1
lasted 1
blew 1
flops 1
smudged 1
touches 1
disappoint 1
infra 1
port 1
irda 1
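The listing above is in order of first appearance, which makes the most frequent words hard to spot. A minimal sketch of the same count with `collections.Counter`, sorted by frequency, might look like this. The `sample_text` string is a hypothetical stand-in for the contents of the review file.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the contents of the review file.
sample_text = """Good case, Excellent value.\t1
Great for the jawbone.\t1
The mic is great.\t1
"""

# Same pattern as above: lowercase words of 3 to 15 letters.
words = re.findall(r"\b[a-z]{3,15}\b", sample_text.lower())
frequency = Counter(words)

# most_common() returns (word, count) pairs sorted by descending count.
for word, count in frequency.most_common(3):
    print(word, count)
```

Sorting by count makes it easier to pick candidate keywords, although very common words such as "the" still need to be filtered out by hand or with a stop-word list.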


In [13]:
amazon_cells_raw.head()

                                              review  pos
0  So there is no way for me to plug it in here i...    0
1                        Good case, Excellent value.    1
2                             Great for the jawbone.    1
3  Tied to charger for conversations lasting more...    0
4                                  The mic is great.    1


In [15]:
# Top positive keywords (lowercase).
keywords = ['work','very','great','good','use','product','quality','one','well',
           'real','price','excellent','recommend','look','call','did','buy','fit','nice','best',
           'service','love','new','item','purchase','better','ever','bought','comfortable','easy','frist']

# Add one boolean column per keyword: True if the review contains the keyword
# with a space on both sides. The match is case-sensitive.
for key in keywords:
    amazon_cells_raw[str(key)] = amazon_cells_raw.review.str.contains(
        ' ' + str(key) + ' ',
        case=True
    )
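Because the pattern above requires a space on both sides and is case-sensitive, it misses a keyword at the start or end of a review, next to punctuation, or capitalized. One alternative, sketched here on a hypothetical mini-sample rather than `amazon_cells_raw`, is a case-insensitive regular expression with word boundaries.

```python
import re
import pandas as pd

# Hypothetical mini-sample; the notebook itself works on amazon_cells_raw.
reviews = pd.DataFrame({
    "review": [
        "Great for the jawbone.",
        "The mic is great.",
        "It works very well",
        "Not a greatly useful product",
    ]
})

keywords = ["great", "well", "product"]

for key in keywords:
    # \b anchors the match to word boundaries, so sentence-initial words and
    # words followed by punctuation are matched, but "greatly" is not.
    pattern = r"\b" + re.escape(key) + r"\b"
    reviews[key] = reviews["review"].str.contains(pattern, case=False, regex=True)

print(reviews[keywords])
```

With this approach a single keyword list covers both the lowercase and capitalized forms, so the two keyword variants would not need to be tested separately.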

In [18]:
# Flag reviews longer than 20 characters and convert the label to boolean.
amazon_cells_raw['long'] = amazon_cells_raw.review.str.len() > 20
amazon_cells_raw.pos = amazon_cells_raw.pos.astype(bool)

data = amazon_cells_raw[keywords + ['long']]
target = amazon_cells_raw['pos']

# Our features are boolean, so we use a Bernoulli naive Bayes classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 350


In [19]:
A = confusion_matrix(target, y_pred)
print (A)
print ("Sensitivity of A: ", A[1,1] / (A[1,0]+A[1,1]))
print ("Specificity of A: ", A[0,0] / (A[0,1]+A[0,0]))

[[443  57]
 [293 207]]
Sensitivity of A:  0.414
Specificity of A:  0.886
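The same two rates can be cross-checked with scikit-learn's `recall_score`, since sensitivity is recall on the positive class and specificity is recall on the negative class. The sketch below rebuilds label vectors that reproduce the confusion matrix printed above; the vectors are reconstructed for illustration and are not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Rebuild label vectors that reproduce the confusion matrix reported above:
# [[443, 57], [293, 207]]  (rows = actual class, columns = predicted class).
y_true = np.array([0] * 500 + [1] * 500)
y_pred = np.array([0] * 443 + [1] * 57 + [0] * 293 + [1] * 207)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = recall_score(y_true, y_pred)               # tp / (tp + fn)
specificity = recall_score(y_true, y_pred, pos_label=0)  # tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(sensitivity, specificity, accuracy)
```

The low sensitivity shows that most of the 350 errors are positive reviews predicted as negative, which suggests the keyword features fire on too few positive reviews.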


In [20]:
# Top positive keywords, capitalized this time.
keywords = ['Work','Very','Great','Good','Use','Product','Quality','One','Well',
           'Real','Price','Excellent','Recommend','Look','Call','Did','Buy','Fit','Nice','Best',
           'Service','Love','New','Item','Purchase','Better','Ever','Bought','Comfortable','Easy','Frist']

# Add one boolean column per keyword: True if the review contains the keyword
# with a space on both sides. The match is case-sensitive.
for key in keywords:
    amazon_cells_raw[str(key)] = amazon_cells_raw.review.str.contains(
        ' ' + str(key) + ' ',
        case=True
    )
    
amazon_cells_raw['long'] = amazon_cells_raw.review.str.len() > 20
amazon_cells_raw.pos = amazon_cells_raw.pos.astype(bool)

data = amazon_cells_raw[keywords + ['long']]
target = amazon_cells_raw['pos']

# Our features are boolean, so we use a Bernoulli naive Bayes classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))   

Number of mislabeled points out of a total 1000 points : 472


In [21]:
A = confusion_matrix(target, y_pred)
print (A)
print ("Sensitivity of A: ", A[1,1] / (A[1,0]+A[1,1]))
print ("Specificity of A: ", A[0,0] / (A[0,1]+A[0,0]))

[[430  70]
 [402  98]]
Sensitivity of A:  0.196
Specificity of A:  0.86


In [22]:
X_train, X_test, y_train, y_test = train_test_split(
     data, target, test_size=0.2)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(800, 32) (800,)
(200, 32) (200,)


In [23]:
# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data
bnb.fit(X_train, y_train)

# Classify the training data, storing the result in a new variable.
y_pred = bnb.predict(X_train)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    X_train.shape[0],
    (y_train != y_pred).sum()
))

Number of mislabeled points out of a total 800 points : 377
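The count above is measured on the training rows only. To judge overfitting, the fitted model also needs to be scored on the held-out rows and the two accuracies compared. A minimal sketch follows, using synthetic 0/1 features as a stand-in for the keyword columns; the fixed `random_state` and `stratify` arguments are additions for reproducibility and are not part of the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the boolean keyword features (illustrative only).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
flip = rng.random(1000) < 0.25
X = np.column_stack([
    np.where(flip, 1 - y, y),            # agrees with the label about 75% of the time
    rng.integers(0, 2, size=(1000, 4)),  # unrelated to the label
])

# Fixing random_state makes the split, and so the scores, repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

bnb = BernoulliNB().fit(X_train, y_train)
train_acc = bnb.score(X_train, y_train)
test_acc = bnb.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

If the test accuracy is close to the training accuracy, the model is probably not overfitting; a much lower test accuracy would point the other way.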
