Thinkful Bootcamp Course

Author: Ian Heaton

Email: iheaton@gmail.com

Mentor: Nemanja Radojkovic

Date: 2017/03/30


In [7]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sb

sb.set_style('darkgrid')

# Using a Naive Bayes Classifier to look at Yelp Feedback 

## Sentiment Analysis

In [8]:
file = "/media/ianh/space/ThinkfulData/SentimentSentencesData/yelp_labelled.xlsx"
messages = pd.read_excel(file, sheet_name='yelp_labelled')

In [9]:
print("The number of missing data points for message : %d" % (sum(messages.message.isnull())))
print("The number of missing data points for result  : %d" % (sum(messages.result.isnull())))
print(messages.result.value_counts() / messages.shape[0])

The number of missing data points for message : 0
The number of missing data points for result  : 0
1    0.5
0    0.5
Name: result, dtype: float64


In [26]:
X_train, X_test, y_train, y_test = train_test_split(messages.message, messages.result, random_state=42)
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

#Take a look at the resulting dataframe for training data
pd.DataFrame(train_dtm.toarray(), columns=vect.get_feature_names()).head()

Unnamed: 0,10,100,12,17,1979,20,2007,30,30s,35,...,years,yellow,yet,you,your,yourself,yucky,yum,yummy,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Model Building With Naive Bayes

In [15]:
from sklearn.naive_bayes import BernoulliNB

#Create model
bnb = BernoulliNB()
# fit it to our training set
bnb.fit(train_dtm, y_train)
# make predictions on test data using test_dtm
predictions = bnb.predict(test_dtm)

## Accuracy of Initial Model 

In [18]:
# compare predictions to true results
print("Accuracy of the model is:  {:.2f}%\n".format(100 * metrics.accuracy_score(y_test, predictions)))
print("Confusion Matrix : \n", metrics.confusion_matrix(y_test, predictions))
print(bnb.classes_)
print("Where score is either 1 (for positive) or 0 (for negative) sentiment.")

Accuracy of the model is:  71.60%

Confusion Matrix : 
 [[80 48]
 [23 99]]
[0 1]
Where score is either 1 (for positive) or 0 (for negative) sentiment.


The misclassification rate is 28%, with a sensitivity of 81% and a specificity of 62%. The model still produces 48 false positives and 23 false negatives; ideally both counts would be zero. It identifies positive sentiment more reliably than negative sentiment.
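The rates quoted for the initial model can be rechecked by hand from the confusion matrix printed above. The sketch below uses plain Python and assumes the scikit-learn layout, where rows are true labels and columns are predicted labels in the order `[0, 1]`; the variable names are illustrative only.

```python
# Confusion matrix printed above for the initial model.
# Rows are true labels, columns are predicted labels, ordered [0, 1].
confusion = [[80, 48],
             [23, 99]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

misclassification = (fp + fn) / total  # share of wrong predictions
sensitivity = tp / (tp + fn)           # true positive rate
specificity = tn / (tn + fp)           # true negative rate

print("Misclassification: {:.1f}%".format(100 * misclassification))  # 28.4%
print("Sensitivity: {:.1f}%".format(100 * sensitivity))              # 81.1%
print("Specificity: {:.1f}%".format(100 * specificity))              # 62.5%
```

The gap between sensitivity and specificity comes from the 48 negative reviews that were predicted as positive.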

Next, we use cross-validation to obtain a more reliable estimate of the model's accuracy. The dataset has balanced classes, and there is no reason to believe that the model is overfitting.
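As a compact alternative to a hand-written fold loop, the same idea can be sketched with `cross_val_score` and a pipeline. This is only an illustration under stated assumptions: `toy_messages` and `toy_labels` are a made-up corpus, not the Yelp data, and the stratified, shuffled folds are a choice made for this sketch rather than the setup used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the Yelp sentences (not the real data).
toy_messages = [
    "great food and friendly service",
    "loved the tasty burgers",
    "amazing place, will come back",
    "the staff was wonderful",
    "delicious pizza and great prices",
    "terrible food and rude service",
    "the burgers were cold and bland",
    "awful place, never again",
    "the staff was slow and rude",
    "overpriced pizza, very disappointing",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Inside a pipeline, the vocabulary is refit on each training fold only,
# so words from the held-out fold never influence the features.
model = make_pipeline(CountVectorizer(), BernoulliNB())
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, toy_messages, toy_labels,
                         cv=folds, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy: {:.2f}%".format(100 * scores.mean()))
```

With the real data, `messages.message` and `messages.result` would take the place of the toy lists.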

In [61]:
# Get word counts for all messages in original data set.
X = vect.fit_transform(messages.message.values)
y = messages.result.values
#pd.DataFrame(X.toarray(), columns=vect.get_feature_names())

from sklearn.model_selection import KFold

# KFold yields train/test row indices when split() is called on the data
kf = KFold(n_splits=5, shuffle=False)
# list will house the accuracy score and the confusion matrix
results = []
for train, test in kf.split(X):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    prediction = bnb.fit(X_train, y_train).predict(X_test)
    results.append( (metrics.accuracy_score(y_test, prediction),metrics.confusion_matrix(y_test, prediction))  )
    
print("Mean accuracy of iteration : %.2f" % (100 * np.array([score for score,_ in results]).mean()))

# simple function to calculate misclassification, sensitivity
# and specificity from confusion matrix
def matrix_metrics(matrix):
    total       = matrix[0,0] + matrix[0,1] + matrix[1,0] + matrix[1,1]
    msclssfctn  = (matrix[0,1] + matrix[1,0])/total
    sensitivity = matrix[1,1]/(matrix[1,0] + matrix[1,1])
    specificity = matrix[0,0]/(matrix[0,0] + matrix[0,1]) 
    return msclssfctn * 100, sensitivity * 100, specificity * 100


Mean accuracy of iteration : 76.00


### First Model - 1st Fold

In [62]:
print("Accuracy of the model is:  {:.2f}%\n".format(100 * results[0][0]))
print("Confusion Matrix : \n", results[0][1])
misclassification, sensitivity, specificity = matrix_metrics(results[0][1])
print("\nMisclassification : %.2f\n"% (misclassification))
print("Sensitivity : %.2f\n" % (sensitivity))
print("Specificity : %.2f\n" % (specificity))

Accuracy of the model is:  78.50%

Confusion Matrix : 
 [[68 20]
 [23 89]]

Misclassification : 21.50

Sensitivity : 79.46

Specificity : 77.27



