# Challenge: evaluate your sentiment classifier

It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

    * Do any of your classifiers seem to overfit?
    * Which seem to perform the best? Why?
    * Which features seemed to be most impactful to performance?

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from wordcloud import STOPWORDS

The next cell loads the dataset, which has two columns: "review" and "sentiment." I want to predict sentiment from the review text.

After loading the data, I define a list of keywords and loop over it, adding one column to the data frame per keyword. Each value in that column is a boolean indicating whether the keyword appears in that observation's review.

After that, I declare the data (the keyword columns) and the target (sentiment).

Next, I fit the model to the data and target and store its predictions in a new variable.

Finally, I print overall model performance.
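The keyword-feature step described above can be sketched on a small toy frame. This is only an illustration: the reviews and the names `toy`, `toy_keywords`, and `features` are made up, and the sketch matches keywords with regex word boundaries (`\b`), which is an assumption on my part and not what the cells below do (they match the keyword padded with a space on each side, so a keyword next to punctuation or at the start or end of a review is missed).

```python
import re
import pandas as pd

# Toy reviews standing in for the IMDB file (hypothetical data).
toy = pd.DataFrame({
    'review': ['Great film.', 'A bad, bad script',
               'Not great but not terrible', 'Badly lit scenes'],
    'sentiment': [1, 0, 0, 0],
})

toy_keywords = ['great', 'bad', 'terrible']

for key in toy_keywords:
    # \b marks a word boundary, so "bad," matches but "Badly" does not.
    pattern = r'\b' + re.escape(key) + r'\b'
    toy[key] = toy['review'].str.contains(pattern, case=False, regex=True)

# One boolean column per keyword, ready to pass to BernoulliNB.
features = toy[toy_keywords]
print(features)
```

Whether word boundaries help on the real data is an open question; the point is only that the matching rule is itself a feature-engineering choice.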

### Model 1 - Original

In [2]:
PATH = r'C:\Users\latee\Downloads\sentiment labelled sentences\sentiment labelled sentences\imdb_labelled.txt'

IMDB = pd.read_csv(PATH, delimiter= '\t', header=None)
IMDB.columns = ['review', 'sentiment']

keywords = ['bad', 'funny', 'best', 'great', 'terrible', 'love', 'good', 'plot', 'never', 'real', 'really', 'script',
           'one', 'actor', 'see', 'little', 'make', 'way', 'recommend', 'line', 'movie', 'film', 'acting', 'even',
           'scene', 'watching', 'excellent', 'seen', 'piece', 'say', 'show', 'dialogue', 'perfect', 'cheap', 'thing']

for key in keywords:
    IMDB[str(key)] = IMDB.review.str.contains(' ' + str(key) + ' ', case=False)

data = IMDB[keywords]
target = IMDB['sentiment']

bnb = BernoulliNB()
bnb.fit(data, target)
y_pred = bnb.predict(data)

print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))
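# Derive both rates from the confusion matrix instead of hard-coding the counts.
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()
print("Our sensitivity is: {}".format(round(tp / (tp + fn), 4)))
print("Our specificity is: {}".format(round(tn / (tn + fp), 4)))
print("\n Confusion Matrix")
confusion_matrix(target, y_pred)

Number of mislabeled points out of a total 748 points : 281
Our sensitivity is: 0.9016
Our specificity is: 0.3287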

 Confusion Matrix


array([[119, 243],
       [ 38, 348]], dtype=int64)

#### Check for overfitting and accuracy using cross-validation

In [3]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(bnb, data, target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.61 (+/- 0.04)
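One way to make the overfitting check more explicit is to compare training accuracy with held-out accuracy on the same folds. The sketch below is a minimal example on synthetic boolean features, not the IMDB data; `X_demo`, `y_demo`, and the labeling rule are invented for illustration. It uses scikit-learn's `cross_validate` with `return_train_score=True`.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

# Synthetic boolean features and labels (hypothetical, not the IMDB reviews).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10)).astype(bool)
y_demo = (X_demo[:, 0] | X_demo[:, 1]).astype(int)

# return_train_score=True also reports accuracy on each fold's training split.
cv_results = cross_validate(BernoulliNB(), X_demo, y_demo,
                            cv=5, return_train_score=True)

train_mean = cv_results['train_score'].mean()
test_mean = cv_results['test_score'].mean()
print("Train accuracy:    %0.2f" % train_mean)
print("Held-out accuracy: %0.2f" % test_mean)
print("Gap:               %0.2f" % (train_mean - test_mean))
```

A training score well above the held-out score would suggest overfitting; a small gap suggests the model generalizes about as well as it fits.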


### Model 2 - Negative Words

In [4]:
keywords = ['bad', 'terrible', 'never', 'cheap']

for key in keywords:
    IMDB[str(key)] = IMDB.review.str.contains(' ' + str(key) + ' ', case=False)

data = IMDB[keywords]
target = IMDB['sentiment']

bnb = BernoulliNB()
bnb.fit(data, target)
y_pred = bnb.predict(data)

print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))
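# Derive both rates from the confusion matrix instead of hard-coding the counts.
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()
print("Our sensitivity is: {}".format(round(tp / (tp + fn), 4)))
print("Our specificity is: {}".format(round(tn / (tn + fp), 4)))
print("\n Confusion Matrix")
confusion_matrix(target, y_pred)

Number of mislabeled points out of a total 748 points : 328
Our sensitivity is: 0.9741
Our specificity is: 0.1215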

 Confusion Matrix


array([[ 44, 318],
       [ 10, 376]], dtype=int64)

In [5]:
scores = cross_val_score(bnb, data, target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.56 (+/- 0.03)


### Model 3 - Positive Words

In [6]:
keywords = ['funny', 'best', 'great', 'love', 'good', 'recommend', 'excellent','perfect']

for key in keywords:
    IMDB[str(key)] = IMDB.review.str.contains(' ' + str(key) + ' ', case=False)

data = IMDB[keywords]
target = IMDB['sentiment']

bnb = BernoulliNB()
bnb.fit(data, target)
y_pred = bnb.predict(data)

print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))
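# Derive both rates from the confusion matrix instead of hard-coding the counts.
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()
print("Our sensitivity is: {}".format(round(tp / (tp + fn), 4)))
print("Our specificity is: {}".format(round(tn / (tn + fp), 4)))
print("\n Confusion Matrix")
confusion_matrix(target, y_pred)

Number of mislabeled points out of a total 748 points : 323
Our sensitivity is: 0.2228
Our specificity is: 0.9365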

 Confusion Matrix


array([[339,  23],
       [300,  86]], dtype=int64)

In [7]:
scores = cross_val_score(bnb, data, target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.57 (+/- 0.06)


### Model 4 - Positive + Negative Words

In [8]:
keywords = ['funny', 'best', 'great', 'love', 'good', 'recommend', 'excellent','perfect',
           'bad', 'terrible', 'never', 'cheap']

for key in keywords:
    IMDB[str(key)] = IMDB.review.str.contains(' ' + str(key) + ' ', case=False)

data = IMDB[keywords]
target = IMDB['sentiment']

bnb = BernoulliNB()
bnb.fit(data, target)
y_pred = bnb.predict(data)

print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))
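# Derive both rates from the confusion matrix instead of hard-coding the counts.
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()
print("Our sensitivity is: {}".format(round(tp / (tp + fn), 4)))
print("Our specificity is: {}".format(round(tn / (tn + fp), 4)))
print("\n Confusion Matrix")
confusion_matrix(target, y_pred)

Number of mislabeled points out of a total 748 points : 325
Our sensitivity is: 0.987
Our specificity is: 0.116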

 Confusion Matrix


array([[ 42, 320],
       [  5, 381]], dtype=int64)

In [9]:
scores = cross_val_score(bnb, data, target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.56 (+/- 0.02)


Of the four models above, the original performs best.

Even though the original model is much better at detecting positive sentiment (sensitivity 0.90) than negative sentiment (specificity 0.33), its cross-validation accuracy (0.61) is higher than that of the other three (0.56 to 0.57). One of the biggest differences between the original and the three models that follow it is the number of keywords used as features.

For the fifth model, I will add many more keywords to see whether accuracy improves and what that costs. Since none of the models above seems to overfit, this seems like a worthwhile tactic to try.
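The overfitting claim can be checked against the numbers already printed above. The sketch below copies the mislabeled counts and cross-validation means from the outputs of Models 1 to 4 and compares training accuracy with cross-validated accuracy; the names `summary` and `TOTAL` are mine, and the cross-validation means are only known to two decimals, so the gap is approximate.

```python
import pandas as pd

TOTAL = 748  # number of reviews, as reported in the outputs above

# Values copied from the printed results of Models 1-4.
summary = pd.DataFrame({
    'model': ['Original', 'Negative words',
              'Positive words', 'Positive + negative'],
    'mislabeled': [281, 328, 323, 325],
    'cv_accuracy': [0.61, 0.56, 0.57, 0.56],
})

# Training accuracy is the share of points labeled correctly on the full data.
summary['train_accuracy'] = (1 - summary['mislabeled'] / TOTAL).round(2)
summary['gap'] = (summary['train_accuracy'] - summary['cv_accuracy']).round(2)
print(summary)
```

In every row the training accuracy sits within about one percentage point of the cross-validated accuracy, which is consistent with the models not overfitting.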

### Model 5 - Word Blast

In [10]:
keywords = ['bad', 'funny', 'best', 'great',
            'terrible', 'love', 'good', 'plot',
            'never', 'real', 'really', 'script',
           'one', 'actor', 'see', 'little', 'make',
            'way', 'recommend', 'line', 'movie', 'film','acting', 'even',
           'scene', 'watching', 'excellent', 'seen', 'piece',
            'say', 'show', 'dialogue', 'perfect', 'cheap', 'thing',
           'give', 'work', 'go', 'horror', 'life', 'found', 
           'kid', 'mess', 'fan', 'budget', 'song', 'lack', 'face',
           'effect', 'time', 'much', 'truly', 'black', 'tv', 'human', 
           'ending', 'garbage', 'flick', 'casting']

for key in keywords:
    IMDB[str(key)] = IMDB.review.str.contains(' ' + str(key) + ' ', case=False)

data = IMDB[keywords]
target = IMDB['sentiment']

bnb = BernoulliNB()
bnb.fit(data, target)
y_pred = bnb.predict(data)

print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))
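# Derive both rates from the confusion matrix instead of hard-coding the counts.
tn, fp, fn, tp = confusion_matrix(target, y_pred).ravel()
print("Our sensitivity is: {}".format(round(tp / (tp + fn), 4)))
print("Our specificity is: {}".format(round(tn / (tn + fp), 4)))
print("\n Confusion Matrix")
confusion_matrix(target, y_pred)

Number of mislabeled points out of a total 748 points : 270
Our sensitivity is: 0.8782
Our specificity is: 0.384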

 Confusion Matrix


array([[139, 223],
       [ 47, 339]], dtype=int64)

In [11]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(bnb, data, target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.61 (+/- 0.05)


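This model's performance is lackluster as well: cross-validated accuracy stays at 0.61, no better than the original model, despite the 24 additional keywords.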

    * Which features seemed to be most impactful to performance?
    
While working through this notebook, I found that the choice of features has a large impact on performance. Even though none of these models is accurate enough to be considered reliable, the keywords used strongly influenced the types of errors each model produced. The negative words and positive words models illustrate this best.

The negative words model is more susceptible to type I errors (false positives). In other words, it is more likely to label negative reviews as positive. It identifies nearly all of the positive reviews (376 of 386) but only 44 of the 362 negative reviews.

The positive words model is more susceptible to type II errors (false negatives). It correctly classifies most of the negative reviews (339 of 362) but finds only 86 of the 386 positive reviews.

More concisely put, the negative model has high sensitivity and low specificity. The inverse is true for the positive model. 
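The error counts behind this comparison can be read directly off the confusion matrices printed for Models 2 and 3. The sketch below copies those two matrices and derives the type I and type II error counts along with sensitivity and specificity. It assumes label 1 means a positive review, which matches how the sensitivity was computed in the cells above; the names `reported` and `rates` are mine.

```python
import numpy as np

# Confusion matrices copied from the Model 2 and Model 3 outputs.
# scikit-learn layout: rows are true labels, columns are predictions,
# so each matrix reads [[TN, FP], [FN, TP]].
reported = {
    'Negative words': np.array([[44, 318], [10, 376]]),
    'Positive words': np.array([[339, 23], [300, 86]]),
}

rates = {}
for name, cm in reported.items():
    tn, fp, fn, tp = cm.ravel()
    rates[name] = {
        'type_1_errors': int(fp),  # negative reviews predicted positive
        'type_2_errors': int(fn),  # positive reviews predicted negative
        'sensitivity': round(float(tp / (tp + fn)), 4),
        'specificity': round(float(tn / (tn + fp)), 4),
    }
    print(name, rates[name])
```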

And although the accuracy of all the models was not "acceptable," the spread in cross-validated accuracy is tight (roughly +/- 0.02 to +/- 0.06 across folds), so at least none of the models appears to overfit.
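As a possible follow-up to the question of which features matter most, a fitted `BernoulliNB` exposes `feature_log_prob_`, the smoothed log probability of each feature given each class. The sketch below ranks keywords by the difference between the two classes on a toy data set. The data, the names `demo`, `labels`, and `ranking`, and the choice of this log-ratio as an "impact" measure are all assumptions for illustration; this analysis was not run on the IMDB reviews.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Toy boolean keyword features (hypothetical data, not the IMDB reviews).
demo = pd.DataFrame({
    'great':    [1, 1, 1, 0, 0, 0, 0, 1],
    'terrible': [0, 0, 0, 1, 1, 1, 0, 0],
    'movie':    [1, 0, 1, 0, 1, 0, 1, 0],
}).astype(bool)
labels = np.array([1, 1, 1, 0, 0, 0, 0, 1])

model = BernoulliNB().fit(demo, labels)

# feature_log_prob_ has one row per class, ordered as in model.classes_.
# A large positive difference means the keyword favors class 1,
# a large negative one means it favors class 0.
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]
ranking = pd.Series(log_ratio, index=demo.columns).sort_values(
    key=np.abs, ascending=False)
print(ranking)
```

Keywords with a log-ratio near zero, like `movie` in this toy set, appear about equally often in both classes and are candidates for removal.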