Thinkful Bootcamp Course

Author: Ian Heaton

Email: iheaton@gmail.com

Mentor: Nemanja Radojkovic

Date: 2017/03/30


In [126]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sb

sb.set_style('darkgrid')

# Using a Naive Bayes Classifier to look at Yelp Feedback 

## Sentiment Analysis

In [127]:
file = "/media/ianh/space/ThinkfulData/SentimentSentencesData/yelp_labelled.xlsx"
messages = pd.read_excel(open(file,'rb'), sheetname='yelp_labelled')

In [128]:
print("The number of missing data points for message : %d" % (sum(messages.message.isnull())))
print("The number of missing data points for result  : %d" % (sum(messages.result.isnull())))
print(messages.result.value_counts() / messages.shape[0])

The number of missing data points for message : 0
The number of missing data points for result  : 0
1    0.5
0    0.5
Name: result, dtype: float64


In [129]:
X_train, X_test, y_train, y_test = train_test_split(messages.message, messages.result, random_state=42)
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

#Take a look at the resulting dataframe for training data
pd.DataFrame(train_dtm.toarray(), columns=vect.get_feature_names()).head()

Unnamed: 0,10,100,12,17,1979,20,2007,30,30s,35,...,years,yellow,yet,you,your,yourself,yucky,yum,yummy,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Model Building With Naive Bayes

In [130]:
from sklearn.naive_bayes import BernoulliNB

#Create model
bnb = BernoulliNB()
# fit it to our training set
bnb.fit(train_dtm, y_train)
# make predictions on test data using test_dtm
predictions = bnb.predict(test_dtm)

## Accuracy of Initial Model 

In [131]:
# compare predictions to true results
print("Accuracy of the model is:  {:.2f}%\n".format(100 * accuracy_score(y_test, predictions)))
print("Confusion Matrix : \n", confusion_matrix(y_test, predictions))
print(bnb.classes_)
print("Where score is either 1 (for positive) or 0 (for negative) sentiment.")

Accuracy of the model is:  71.60%

Confusion Matrix : 
 [[80 48]
 [23 99]]
[0 1]
Where score is either 1 (for positive) or 0 (for negative) sentiment.


The misclassification rate is 28% with a sensitivity of 81% and specificity of 62%. At this time this model has a number of false positives and false negatives, ideally we would like to have these values at zero. Our model is pretty good at identifying positive sentiment and negative sentiment. 

Now using cross validation to increase the accuracy of our model.  Our dataset has balanced classes and there is no reason to believe that our model is overfitting.

In [132]:
# Get word counts for all messages in original data set.
X = vect.fit_transform(messages.message.values)
y = messages.result.values
#pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
feature_names = vect.get_feature_names()
class_lables = ['negative sentiment', 'positive sentiment']

# KFold needs row count not a sparse matrix
kf = cross_validation.KFold(X.shape[0], n_folds=5, shuffle=False, random_state=None)
# list will house the accuracy score, confusion matrix and report
results = []
for train, test in kf:
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    prediction = bnb.fit(X_train, y_train).predict(X_test)
    results.append( (accuracy_score(y_test, prediction), confusion_matrix(y_test, prediction),
                    classification_report(prediction, y_test, target_names=class_lables))  )
    
print("Mean accuracy of iteration : %.2f" % (100 * np.array([score for score ,_ ,_ in results]).mean()))

# simple function to calculate misclassification from confusion matrix
def matrix_metrics(matrix):
    total       = matrix[0,0] + matrix[0,1] + matrix[1,0] + matrix[1,1]
    msclssfctn  = (matrix[0,1] + matrix[1,0])/total
    return msclssfctn * 100


# Retrieves the most informative features for a binary classifier
# with a little help from stack overflow
def most_informative_features(vectorizer, classifier, n=5):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]

    for coef, feature in topn_class1:
        print("%d %.4f %s" % (class_labels[0], coef, feature))

    print("\n")

    for coef, feature in reversed(topn_class2):
        print("%d %.4f %s" % (class_labels[1], coef, feature))


Mean accuracy of iteration : 76.00


### First Model - 1st step

In [133]:
print("Accuracy of the model is:  {:.2f}%\n".format(100 * results[0][0]))
print("Confusion Matrix : \n", results[0][1])
misclassification = matrix_metrics(results[0][1])
print("\nMisclassification : %.2f\n"% (misclassification))
print('Classification report:\n', results[0][2])

Accuracy of the model is:  78.50%

Confusion Matrix : 
 [[68 20]
 [23 89]]

Misclassification : 21.50

Classification report:
                     precision    recall  f1-score   support

negative sentiment       0.77      0.75      0.76        91
positive sentiment       0.79      0.82      0.81       109

       avg / total       0.78      0.79      0.78       200



### Second Model - 2nd Step

In [134]:
print("Accuracy of the model is:  {:.2f}%\n".format(100 * results[1][0]))
print("Confusion Matrix : \n", results[1][1])
misclassification = matrix_metrics(results[1][1])
print("\nMisclassification : %.2f\n"% (misclassification))
print('Classification report:\n', results[1][2])

Accuracy of the model is:  80.00%

Confusion Matrix : 
 [[70 19]
 [21 90]]

Misclassification : 20.00

Classification report:
                     precision    recall  f1-score   support

negative sentiment       0.79      0.77      0.78        91
positive sentiment       0.81      0.83      0.82       109

       avg / total       0.80      0.80      0.80       200



### Third Model - 3rd Step

In [135]:
print("Accuracy of the model is:  {:.2f}%\n".format(100 * results[2][0]))
print("Confusion Matrix : \n", results[2][1])
misclassification  = matrix_metrics(results[2][1])
print("\nMisclassification : %.2f\n"% (misclassification))
print('Classification report:\n', results[2][2])

Accuracy of the model is:  74.50%

Confusion Matrix : 
 [[62 26]
 [25 87]]

Misclassification : 25.50

Classification report:
                     precision    recall  f1-score   support

negative sentiment       0.70      0.71      0.71        87
positive sentiment       0.78      0.77      0.77       113

       avg / total       0.75      0.74      0.75       200



### Fourth Model - 4th Step

In [136]:
print("Accuracy of the model is:  {:.2f}%\n".format(100 * results[3][0]))
print("Confusion Matrix : \n", results[3][1])
misclassification = matrix_metrics(results[3][1])
print("\nMisclassification : %.2f\n"% (misclassification))
print('Classification report:\n', results[3][2])
print("\n Most informative Features\n")
most_informative_features(vect, bnb, 40)

Accuracy of the model is:  82.50%

Confusion Matrix : 
 [[70 13]
 [22 95]]

Misclassification : 17.50

Classification report:
                     precision    recall  f1-score   support

negative sentiment       0.84      0.76      0.80        92
positive sentiment       0.81      0.88      0.84       108

       avg / total       0.83      0.82      0.82       200


 Most informative Features

0 -6.1181 00
0 -6.1181 10
0 -6.1181 11
0 -6.1181 12
0 -6.1181 15
0 -6.1181 17
0 -6.1181 1979
0 -6.1181 30
0 -6.1181 30s
0 -6.1181 35
0 -6.1181 40min
0 -6.1181 45
0 -6.1181 4ths
0 -6.1181 5lb
0 -6.1181 85
0 -6.1181 90
0 -6.1181 99
0 -6.1181 accountant
0 -6.1181 ache
0 -6.1181 acknowledged
0 -6.1181 actual
0 -6.1181 ahead
0 -6.1181 airline
0 -6.1181 ala
0 -6.1181 albondigas
0 -6.1181 allergy
0 -6.1181 alone
0 -6.1181 although
0 -6.1181 angry
0 -6.1181 annoying
0 -6.1181 anticipated
0 -6.1181 anymore
0 -6.1181 anytime
0 -6.1181 anyways
0 -6.1181 apart
0 -6.1181 apologize
0 -6.1181 apology
0 -6.118

### Fifth Model - 5th Step

In [137]:
print("Accuracy of the model is:  {:.2f}%\n".format(100 * results[4][0]))
print("Confusion Matrix : \n", results[4][1])
misclassification = matrix_metrics(results[4][1])
print("\nMisclassification : %.2f\n"% (misclassification))
print('Classification report:\n', results[4][2])

Accuracy of the model is:  64.50%

Confusion Matrix : 
 [[84 68]
 [ 3 45]]

Misclassification : 35.50

Classification report:
                     precision    recall  f1-score   support

negative sentiment       0.55      0.97      0.70        87
positive sentiment       0.94      0.40      0.56       113

       avg / total       0.77      0.65      0.62       200



## Conclusions

All models did a good job at identifying positive sentiment. The list of most informative features for the positive class made the most sense when compared to both classes. The real problem with the models lie with the ability to identify negative sentiment. The misclassification percentage ranged from  18 to 36 %; too high. The most informative features for negative sentiment appear to be nothing more than noise.

From the list of informative features it appears that the models are seeing noise words as important which is a symptom of overfitting.  Perhaps their are better models at discerning sentiment than Bayes Classification.