# Assignment

It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

- Do any of your classifiers seem to overfit?
- Which seem to perform the best? Why?
- Which features seemed to be most impactful to performance?

Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.

### Import Statements

In [194]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

### The Amazon Dataframe

In [195]:
# read_excel has no delimiter argument; an .xlsx workbook is not a delimited text file.
amazon_df = pd.read_excel('Amazon Dataframe.xlsx', header=None)
amazon_df.columns = ['Feedback', 'Sentiment']

# Make all the feedback lowercase.
amazon_df['Feedback'] = amazon_df['Feedback'].str.lower()

# Strip out punctuation (digits are kept, because \w matches them).
amazon_df['Feedback'] = amazon_df['Feedback'].str.replace(r'[^\w\s]', '', regex=True)

# Add spaces before and after each entry in 'Feedback'.
amazon_df['Feedback'] = ' ' + amazon_df['Feedback'] + ' '
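As a quick check of what the punctuation pattern does, the sketch below applies the same character class with Python's built-in `re` module. The sample review and the `strip_digits_too` variant are illustrative assumptions, not part of the dataset or the notebook. It suggests that `[^\w\s]` removes punctuation but leaves digits in place, because `\w` matches digits as well as letters and underscores.

```python
import re

sample = "Works great!! 2 stars off for the battery, though."  # made-up review

# Same character class as the notebook: drop anything that is neither a
# word character nor whitespace.
punctuation_only = re.sub(r"[^\w\s]", "", sample.lower())

# Hypothetical variant that also removes digits.
strip_digits_too = re.sub(r"[^\w\s]|\d", "", sample.lower())

print(punctuation_only)
print(strip_digits_too)
```

This would be consistent with the token `2` showing up among the most frequent words further down.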

In [196]:
pd.Series(' '.join(amazon_df['Feedback']).lower().split()).value_counts()[:50]

the        513
i          316
and        310
it         281
is         243
a          218
this       206
to         196
phone      162
my         143
for        121
of         119
not        117
with       112
very       103
great       97
was         90
on          89
in          88
that        80
good        75
have        73
you         68
product     55
quality     49
had         48
headset     47
works       47
but         46
battery     45
as          45
its         43
so          42
are         42
all         41
use         41
sound       41
one         40
well        38
ear         35
work        34
has         34
would       34
from        33
your        32
dont        31
like        30
case        29
if          29
than        28
dtype: int64

_Feature Engineering on the Amazon Dataframe_

In [197]:
amazon_keywords = ['good', 'excellent', 'great', 'love']

for key in amazon_keywords:
    amazon_df[str(key)] = amazon_df.Feedback.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
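The space-padded match above only works because the feedback was stripped of punctuation and padded with spaces beforehand. A possible alternative, sketched here with the standard `re` module, is to match on word boundaries (`\b`) instead. The helper name `keyword_flags` and the sample sentences are hypothetical and only serve the illustration.

```python
import re

def keyword_flags(text, keywords):
    """Return {keyword: bool} using whole-word, case-insensitive matching."""
    return {
        key: re.search(r"\b" + re.escape(key) + r"\b", text, flags=re.IGNORECASE) is not None
        for key in keywords
    }

# Punctuation next to a keyword does not prevent a match.
flags = keyword_flags("Great phone, love it!", ["good", "excellent", "great", "love"])
print(flags)

# A keyword embedded in a longer word is not counted.
print(keyword_flags("Goodness, what a mess", ["good"]))
```

Since `str.contains` in pandas treats its pattern as a regular expression by default, the same `\b...\b` pattern could presumably be passed to it in place of the padded string.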

_Building the Training Data on the Amazon Dataframe_

In [198]:
amazon_data = amazon_df[amazon_keywords]
amazon_target = amazon_df['Sentiment']

### Running the Model on the Amazon Dataframe

In [199]:
mnb = MultinomialNB()

mnb.fit(amazon_data, amazon_target)

y_pred = mnb.predict(amazon_data)

print('Number of mislabeled points in the Amazon dataframe out of a total {} points : {}'.format(
    amazon_data.shape[0],
    (amazon_target != y_pred).sum()
))

Number of mislabeled points in the Amazon dataframe out of a total 1000 points : 376


### Model Evaluations

I'm going to make five versions of this model and each one will be analyzed using: 

1. Success rate;
2. Confusion matrix; and 
3. Cross validation.

### Evaluating the First Model

_Success Rate Evaluation_

In [200]:
accuracy_score(amazon_target, y_pred)
print('The accuracy of the model is', accuracy_score(amazon_target, y_pred) * 100, '%')

The accuracy of the model is 62.4 %


_Confusion Matrix_

In [201]:
confusion_matrix(amazon_target, y_pred)

array([[494,   6],
       [370, 130]], dtype=int64)

- False Positive Count: six.
- False Negative Count: 370.
- Sensitivity: the model correctly identified 26% of the positives.
- Specificity: 98.8% of negatives (494 of 500) were correctly identified by the model.
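The four figures above can be derived directly from the confusion matrix. The sketch below assumes the layout that `confusion_matrix` returns for labels 0 and 1 (rows are actual classes, columns are predicted classes) and that 1 marks positive sentiment. The helper name `rates` is hypothetical.

```python
def rates(matrix):
    """Sensitivity and specificity from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (rows = actual class, columns = predicted class)."""
    (tn, fp), (fn, tp) = matrix
    sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
    specificity = tn / (tn + fp)  # share of actual negatives predicted negative
    return sensitivity, specificity

# First model's confusion matrix, copied from the output above.
sensitivity, specificity = rates([[494, 6], [370, 130]])
print(f"Sensitivity: {sensitivity:.1%}")
print(f"Specificity: {specificity:.1%}")
```

The same helper can be reused for the other four models by passing in their matrices.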

_Cross Validation_

In [202]:
cross_val_score(mnb, amazon_data, amazon_target, cv=10)

array([0.69, 0.66, 0.66, 0.64, 0.63, 0.61, 0.61, 0.59, 0.59, 0.57])

### The Second Model

_Setting up the Dataframe_

In [203]:
second_amazon_df = amazon_df.copy()

_Looking at the Top 75 Most Used Words in the 'Feedback' Column_

In [204]:
top_75_words = pd.Series(' '.join(second_amazon_df['Feedback']).lower().split()).value_counts()[:75]
top_75_words.to_dict()

{'the': 513,
 'i': 316,
 'and': 310,
 'it': 281,
 'is': 243,
 'a': 218,
 'this': 206,
 'to': 196,
 'phone': 162,
 'my': 143,
 'for': 121,
 'of': 119,
 'not': 117,
 'with': 112,
 'very': 103,
 'great': 97,
 'was': 90,
 'on': 89,
 'in': 88,
 'that': 80,
 'good': 75,
 'have': 73,
 'you': 68,
 'product': 55,
 'quality': 49,
 'had': 48,
 'headset': 47,
 'works': 47,
 'but': 46,
 'battery': 45,
 'as': 45,
 'its': 43,
 'so': 42,
 'are': 42,
 'all': 41,
 'use': 41,
 'sound': 41,
 'one': 40,
 'well': 38,
 'ear': 35,
 'work': 34,
 'has': 34,
 'would': 34,
 'from': 33,
 'your': 32,
 'dont': 31,
 'like': 30,
 'case': 29,
 'if': 29,
 'than': 28,
 'me': 28,
 'ive': 28,
 'price': 27,
 'be': 27,
 'after': 27,
 'excellent': 27,
 'time': 27,
 'no': 26,
 'up': 26,
 'recommend': 26,
 'does': 26,
 'really': 26,
 'im': 24,
 'at': 24,
 'service': 23,
 'or': 23,
 'best': 23,
 'when': 22,
 'only': 22,
 'nice': 22,
 'out': 22,
 'get': 22,
 'also': 22,
 'too': 21,
 '2': 21}

_Feature Engineering on the Second Amazon Dataframe_

In [205]:
second_amazon_keywords_list = ['good', 'excellent', 'great', 'love', 'nice', 'best', 'well', 'works', 'like', 'really']

for key in second_amazon_keywords_list:
    second_amazon_df[str(key)] = second_amazon_df.Feedback.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

_Building the Training Data on the Second Amazon Dataframe_

In [206]:
second_amazon_data = second_amazon_df[second_amazon_keywords_list]
second_amazon_target = second_amazon_df['Sentiment']

_Running the Model on the Second Amazon Dataframe_

In [207]:
mnb = MultinomialNB()

mnb.fit(second_amazon_data, second_amazon_target)

y_pred = mnb.predict(second_amazon_data)

print('Number of mislabeled points in the Amazon dataframe out of a total {} points : {}'.format(
    second_amazon_data.shape[0],
    (second_amazon_target != y_pred).sum()
))

Number of mislabeled points in the Amazon dataframe out of a total 1000 points : 319


### Evaluating the Second Model

_Success Rate Evaluation_

In [208]:
accuracy_score(second_amazon_target, y_pred)
print('The accuracy of the second model is', accuracy_score(second_amazon_target, y_pred) * 100, '%')

The accuracy of the second model is 68.10000000000001 %


_Confusion Matrix_

In [209]:
confusion_matrix(second_amazon_target, y_pred)

array([[491,   9],
       [310, 190]], dtype=int64)

- False Positive Count: nine.
- False Negative Count: 310.
- Sensitivity: the model correctly identified 38% of the positives.
- Specificity: 98.2% of negatives were correctly identified by the model.

_Cross Validation_

In [210]:
cross_val_score(mnb, second_amazon_data, second_amazon_target, cv=10)

array([0.74, 0.71, 0.71, 0.7 , 0.7 , 0.61, 0.65, 0.64, 0.67, 0.61])

### The Third Model

_Setting up the Dataframe_

In [211]:
third_amazon_df = amazon_df.copy()

In [212]:
no_negative_reviews_df = third_amazon_df[third_amazon_df['Sentiment'] != 0]

In [213]:
top_75_words = pd.Series(' '.join(no_negative_reviews_df['Feedback']).lower().split()).value_counts()[:75]
top_75_words.to_dict()

{'the': 237,
 'and': 188,
 'i': 154,
 'is': 141,
 'it': 128,
 'a': 105,
 'this': 105,
 'great': 92,
 'to': 86,
 'phone': 86,
 'my': 72,
 'very': 69,
 'for': 66,
 'with': 65,
 'good': 62,
 'of': 49,
 'works': 46,
 'on': 44,
 'have': 38,
 'was': 36,
 'in': 34,
 'product': 33,
 'that': 32,
 'well': 31,
 'headset': 31,
 'quality': 31,
 'sound': 27,
 'so': 26,
 'excellent': 26,
 'price': 25,
 'its': 24,
 'has': 24,
 'one': 23,
 'are': 22,
 'battery': 22,
 'nice': 22,
 'best': 21,
 'use': 21,
 'had': 21,
 'but': 21,
 'you': 21,
 'love': 20,
 'as': 20,
 'recommend': 20,
 'all': 20,
 'than': 19,
 'ive': 19,
 'like': 18,
 'case': 18,
 'would': 17,
 'from': 16,
 'ear': 16,
 'really': 15,
 'not': 15,
 'any': 15,
 'easy': 14,
 'comfortable': 14,
 'your': 14,
 'happy': 13,
 'these': 13,
 'better': 12,
 'am': 12,
 'im': 12,
 'been': 12,
 'just': 12,
 'no': 12,
 'up': 12,
 'bluetooth': 12,
 'fine': 12,
 'new': 12,
 'also': 11,
 'be': 11,
 'even': 11,
 'time': 11,
 'car': 11}

_Feature Engineering on the Third Amazon Dataframe_

In [214]:
third_amazon_keywords_list = ['great', 'very', 'good', 'works', 'well', 'quality', 'excellent',
                        'nice', 'best', 'love', 'recommend', 'like', 'really', 'easy',
                        'comfortable', 'happy', 'better', 'fine', 'new']

for key in third_amazon_keywords_list:
    third_amazon_df[str(key)] = third_amazon_df.Feedback.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
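One rough way to judge how strongly a candidate keyword leans positive is to divide its count in the positive-only listing by its count in the overall listing. The sketch below does that for the keywords that appear in both listings above; the counts are copied from those outputs, and the variable names are hypothetical. Because the dataset holds 500 positive and 500 negative reviews, a share near 0.5 would suggest a word that is close to neutral.

```python
# (count in all reviews, count in positive-only reviews), from the listings above.
counts = {
    "great": (97, 92),
    "good": (75, 62),
    "very": (103, 69),
    "works": (47, 46),
    "well": (38, 31),
    "quality": (49, 31),
    "excellent": (27, 26),
    "nice": (22, 22),
    "best": (23, 21),
    "recommend": (26, 20),
    "like": (30, 18),
    "really": (26, 15),
}

# Share of each word's occurrences that come from positive reviews.
positive_share = {word: pos / total for word, (total, pos) in counts.items()}

for word, share in sorted(positive_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:<10} {share:.2f}")
```

These are word occurrences rather than review counts, so the shares are only an approximation of how often a keyword flags a positive review.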

_Building the Training Data on the Third Amazon Dataframe_

In [215]:
third_amazon_data = third_amazon_df[third_amazon_keywords_list]
third_amazon_target = third_amazon_df['Sentiment']

_Running the Model on the Third Amazon Dataframe_

In [216]:
mnb = MultinomialNB()

mnb.fit(third_amazon_data, third_amazon_target)

y_pred = mnb.predict(third_amazon_data)

print('Number of mislabeled points in the Amazon dataframe out of a total {} points : {}'.format(
    third_amazon_data.shape[0],
    (third_amazon_target != y_pred).sum()
))

Number of mislabeled points in the Amazon dataframe out of a total 1000 points : 289


### Evaluating the Third Model

_Success Rate Evaluation_

In [217]:
accuracy_score(third_amazon_target, y_pred)
print('The accuracy of the third model is', accuracy_score(third_amazon_target, y_pred) * 100, '%')

The accuracy of the third model is 71.1 %


_Confusion Matrix_

In [218]:
confusion_matrix(third_amazon_target, y_pred)

array([[473,  27],
       [262, 238]], dtype=int64)

- False Positive Count: 27.
- False Negative Count: 262.
- Sensitivity: the model correctly identified 47.6% of the positives.
- Specificity: 94.6% of negatives were correctly identified by the model.

_Cross Validation_

In [219]:
cross_val_score(mnb, third_amazon_data, third_amazon_target, cv=10)

array([0.77, 0.7 , 0.73, 0.71, 0.74, 0.63, 0.68, 0.67, 0.71, 0.64])

### The Fourth Model

_Setting Up the Dataframe_

In [220]:
fourth_amazon_df = amazon_df.copy()

_Feature Engineering on the Fourth Amazon Dataframe_

I'm going to remove neutral words (such as 'very') from the keywords list and see what happens.

In [221]:
# Dropped words: very, works, well, quality, recommend, like, really, easy, comfortable, better and new.

fourth_amazon_keywords_list = ['great', 'good', 'excellent', 'nice', 'best', 'love', 'happy', 'fine']

for key in fourth_amazon_keywords_list:
    fourth_amazon_df[str(key)] = fourth_amazon_df.Feedback.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

_Building the Training Data on the Fourth Amazon Dataframe_

In [222]:
fourth_amazon_data = fourth_amazon_df[fourth_amazon_keywords_list]
fourth_amazon_target = fourth_amazon_df['Sentiment']

_Running the Model on the Fourth Amazon Dataframe_

In [223]:
mnb = MultinomialNB()

mnb.fit(fourth_amazon_data, fourth_amazon_target)

y_pred = mnb.predict(fourth_amazon_data)

print('Number of mislabeled points in the Amazon dataframe out of a total {} points : {}'.format(
    fourth_amazon_data.shape[0],
    (fourth_amazon_target != y_pred).sum()
))

Number of mislabeled points in the Amazon dataframe out of a total 1000 points : 357


### Evaluating the Fourth Model

_Success Rate Evaluation_

In [224]:
accuracy_score(fourth_amazon_target, y_pred)
print('The accuracy of the fourth model is', accuracy_score(fourth_amazon_target, y_pred) * 100, '%')

The accuracy of the fourth model is 64.3 %


_Confusion Matrix_

In [225]:
confusion_matrix(fourth_amazon_target, y_pred)

array([[494,   6],
       [351, 149]], dtype=int64)

- False Positive Count: six.
- False Negative Count: 351.
- Sensitivity: the model correctly identified 29.8% of the positives.
- Specificity: 98.8% of negatives were correctly identified by the model.

_Cross Validation_

In [234]:
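cross_val_score(mnb, fourth_amazon_data, fourth_amazon_target, cv=10)

array([0.69, 0.66, 0.66, 0.64, 0.63, 0.61, 0.61, 0.59, 0.59, 0.57])

Note: the scores shown here were recorded from a run that passed `amazon_data` (the first model's four keywords) instead of `fourth_amazon_data`, which is why they match the first model's scores exactly. Rerun the cell to get the fourth model's own cross-validation scores.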

### The Fifth Model

_Setting Up the Dataframe_

In [227]:
fifth_amazon_df = amazon_df.copy()

_Feature Engineering on the Fifth Dataframe_

For the sake of experimenting and learning, I'm going to drop all the keywords but one and see what happens with the data.

In [235]:
fifth_amazon_keywords_list = ['great']

for key in fifth_amazon_keywords_list:
    fifth_amazon_df[str(key)] = fifth_amazon_df.Feedback.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

_Building the Training Data on the Fifth Amazon Dataframe_

In [236]:
fifth_amazon_data = fifth_amazon_df[fifth_amazon_keywords_list]
fifth_amazon_target = fifth_amazon_df['Sentiment']

_Running the Model on the Fifth Amazon Dataframe_

In [237]:
mnb = MultinomialNB()

mnb.fit(fifth_amazon_data, fifth_amazon_target)

y_pred = mnb.predict(fifth_amazon_data)

print('Number of mislabeled points in the Amazon dataframe out of a total {} points : {}'.format(
    fifth_amazon_data.shape[0],
    (fifth_amazon_target != y_pred).sum()
))

Number of mislabeled points in the Amazon dataframe out of a total 1000 points : 500


### Evaluating the Fifth Model

_Success Rate Evaluation_

In [238]:
accuracy_score(fifth_amazon_target, y_pred)
print('The accuracy of the fifth model is', accuracy_score(fifth_amazon_target, y_pred) * 100, '%')

The accuracy of the fifth model is 50.0 %


_Confusion Matrix_

In [239]:
confusion_matrix(fifth_amazon_target, y_pred)

array([[500,   0],
       [500,   0]], dtype=int64)

- False Positive Count: 0.
- False Negative Count: 500.
- Sensitivity: the model correctly identified 0% of the positives.
- Specificity: 100% of negatives were correctly identified by the model.
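One possible explanation for the all-negative predictions is the way multinomial naive Bayes estimates feature likelihoods. The sketch below reproduces by hand the smoothed estimate that scikit-learn documents for `MultinomialNB`, (count_i + alpha) / (total count + alpha * n_features). The function name and the review counts are hypothetical; only the formula is taken from the library's documentation. With a single feature the numerator and denominator are always equal, so the likelihood is 1 for every class and the prediction falls back on the class priors, which are tied at 500 reviews each.

```python
import math

def feature_log_probs(feature_counts, alpha=1.0):
    """Smoothed log P(feature | class) for one class, multinomial NB style:
    (count_i + alpha) / (sum of counts + alpha * n_features)."""
    n_features = len(feature_counts)
    total = sum(feature_counts) + alpha * n_features
    return [math.log((count + alpha) / total) for count in feature_counts]

# Hypothetical numbers of reviews containing 'great' in each class.
negative = feature_log_probs([5])
positive = feature_log_probs([90])
print(negative, positive)  # both log-likelihoods are 0.0

# With equal priors the class scores tie whether or not 'great' is present.
log_prior = math.log(0.5)
for has_great in (0, 1):
    score_negative = log_prior + has_great * negative[0]
    score_positive = log_prior + has_great * positive[0]
    print(has_great, score_negative == score_positive)
```

If this is what happened, the tie would be resolved in favor of the first class (0, negative) for every review. A model that also accounts for the absence of a keyword, such as the `BernoulliNB` class imported at the top of the notebook, might be a better fit for a single boolean feature.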

_Cross Validation_

In [240]:
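cross_val_score(mnb, fifth_amazon_data, fifth_amazon_target, cv=10)

array([0.69, 0.66, 0.66, 0.64, 0.63, 0.61, 0.61, 0.59, 0.59, 0.57])

Note: the scores shown here were recorded from a run that passed `amazon_data` (the first model's four keywords) instead of `fifth_amazon_data`, which is why they match the first model's scores exactly. Rerun the cell to get the fifth model's own cross-validation scores.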

### Conclusions

To start, here's a summary of each model's evaluation:

In [245]:
model_df = pd.read_excel('18.10 Challenge Model Data.xlsx')
model_df.columns = ['Metric', 'One', 'Two', 'Three', 'Four', 'Five']

model_df

|   | Metric | One | Two | Three | Four | Five |
|---|---|---|---|---|---|---|
| 0 | Accuracy | 0.624 | 0.681 | 0.711 | 0.643 | 0.5 |
| 1 | False Positives | 6.0 | 9.0 | 27.0 | 6.0 | 0.0 |
| 2 | False Negatives | 370.0 | 310.0 | 262.0 | 351.0 | 500.0 |
| 3 | Sensitivity | 0.26 | 0.38 | 0.476 | 0.298 | 0.0 |
| 4 | Specificity | 0.989 | 0.982 | 0.946 | 0.988 | 1.0 |
| 5 | Cross Validation | 0.63 | 0.67 | 0.7 | 0.63 | 0.63 |


Model three performed the best: it had the highest accuracy (71.1%) and the highest cross-validation scores, and it had the fewest false negatives (262).

It also had the most false positives (27) and therefore the lowest specificity (94.6%), but its sensitivity (47.6%) was the highest of the five models.
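On the question of overfitting, one simple check is to compare each model's accuracy on the training data with the mean of its 10-fold cross-validation scores; a large positive gap would hint at overfitting. The sketch below uses the figures recorded above for models one through three only, because the recorded cross-validation runs for models four and five were computed on the first model's feature matrix (`amazon_data`). The variable names are hypothetical.

```python
from statistics import mean

# (training accuracy, 10-fold scores), copied from the outputs above.
models = {
    "One": (0.624, [0.69, 0.66, 0.66, 0.64, 0.63, 0.61, 0.61, 0.59, 0.59, 0.57]),
    "Two": (0.681, [0.74, 0.71, 0.71, 0.70, 0.70, 0.61, 0.65, 0.64, 0.67, 0.61]),
    "Three": (0.711, [0.77, 0.70, 0.73, 0.71, 0.74, 0.63, 0.68, 0.67, 0.71, 0.64]),
}

gaps = {}
for name, (train_accuracy, fold_scores) in models.items():
    cv_mean = mean(fold_scores)
    gaps[name] = train_accuracy - cv_mean
    print(f"Model {name}: train={train_accuracy:.3f}  cv mean={cv_mean:.3f}  gap={gaps[name]:+.3f}")
```

For these three models the gap stays under two percentage points, which would suggest little overfitting. The spread across folds (for example 0.63 to 0.77 for model three) is worth keeping in mind when comparing models whose means are close.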