## Challenge: sentiment analysis with Naive Bayes

Now we'll perform a sentiment analysis, classifying feedback left on a website as either positive or negative.

Again, the UCI Machine Learning Repository has a nice labeled dataset, Sentiment Labelled Sentences, for us to use. This dataset was created for the paper "From Group to Individual Labels using Deep Features", Kotzias et al., KDD 2015.

Pick one of the company data files and build your own classifier. When you're satisfied with its performance (at this point just using the accuracy measure shown in the example), test it on one of the other datasets to see how well these kinds of classifiers translate from one context to another.

Include your model and a brief writeup of your feature engineering and selection process to submit and review with your mentor.

In [23]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns


In [24]:
!pwd

/Users/sjadallah/Desktop/Thinkful_data/Supervised Learning1


In [25]:
data_path = "/Users/sjadallah/Desktop/Thinkful_data/Supervised Learning1/sentiment labelled sentences/yelp_labelled.txt"

# The file is tab-separated and has no header row.
sentiment = pd.read_csv(data_path, delimiter='\t', header=None)

In [26]:
sentiment.columns=['Comment', 'Good or Bad']

In [27]:
# Use regex to remove punctuation and underscores (whitespace is kept)
import re

def remove_punct(text):
    new_words = []
    for word in text:
        w = re.sub(r'[^\w\s]', '', word)  # drop everything except word characters and whitespace
        w = re.sub(r'_', '', w)           # \w includes the underscore, so remove it separately
        new_words.append(w)
    return new_words

remove_punct(sentiment['Comment'])

['Wow Loved this place',
 'Crust is not good',
 'Not tasty and the texture was just nasty',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it',
 'The selection on the menu was great and so were the prices',
 'Now I am getting angry and I want my damn pho',
 'Honeslty it didnt taste THAT fresh',
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer',
 'The fries were great too',
 'A great touch',
 'Service was very prompt',
 'Would not go back',
 'The cashier had no care what so ever on what I had to say it still ended up being wayyy overpriced',
 'I tried the Cape Cod ravoli chickenwith cranberrymmmm',
 'I was disgusted because I was pretty sure that was human hair',
 'I was shocked because no signs indicate cash only',
 'Highly recommended',
 'Waitress was a little slow in service',
 'This place is not worth your time let alone Vegas',
 'did not like at all',
 'The Burrittos Blah',
 'The 

In [28]:
#Run function on dataframe and insert results into df
sentiment['Comment']=remove_punct(sentiment['Comment'])

#Make everything lower case
sentiment['Comment']=sentiment['Comment'].str.lower()
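
The same cleaning can also be written with pandas string methods instead of a Python loop. This is a minimal sketch, not the notebook's own method: the helper name `clean_text` is hypothetical, and the two comments are made-up stand-ins for `sentiment['Comment']`.

```python
import pandas as pd

def clean_text(comments: pd.Series) -> pd.Series:
    """Strip punctuation and underscores, then lower-case each comment."""
    # [^\w\s] matches anything that is not a word character or whitespace.
    # The underscore counts as a word character, so it is listed separately.
    return comments.str.replace(r'[^\w\s]|_', '', regex=True).str.lower()

# Made-up examples, not rows from the dataset.
comments = pd.Series(["Wow... Loved this place!", "Not_good, at all."])
cleaned = clean_text(comments)
print(cleaned.tolist())
```

Returning a Series keeps the index aligned with the original dataframe, so the result can be assigned straight back to a column.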

In [29]:
sentiment.head(15)

| | Comment | Good or Bad |
|---|---|---|
| 0 | wow loved this place | 1 |
| 1 | crust is not good | 0 |
| 2 | not tasty and the texture was just nasty | 0 |
| 3 | stopped by during the late may bank holiday of... | 1 |
| 4 | the selection on the menu was great and so wer... | 1 |
| 5 | now i am getting angry and i want my damn pho | 0 |
| 6 | honeslty it didnt taste that fresh | 0 |
| 7 | the potatoes were like rubber and you could te... | 0 |
| 8 | the fries were great too | 1 |
| 9 | a great touch | 1 |


In [30]:
# Enumerate our good keywords.
keywords = ['wow', 'great', 'excellent', 'amazing', 'cute', 'awesome', 'like', 'good']

# Flag each comment that contains a keyword (plain substring match).
for key in keywords:
    sentiment[key] = sentiment['Comment'].str.contains(key, case=True)
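
Because `str.contains` matches substrings, a keyword such as `like` also fires on `dislike`. One possible refinement is to wrap each keyword in regex word boundaries. The sketch below uses made-up comments rather than the dataset, and the helper name `keyword_flags` is hypothetical.

```python
import re
import pandas as pd

def keyword_flags(comments: pd.Series, keywords: list) -> pd.DataFrame:
    """Return one boolean column per keyword, matching whole words only."""
    flags = pd.DataFrame(index=comments.index)
    for key in keywords:
        # \b anchors the match at word boundaries; re.escape guards against
        # keywords that happen to contain regex metacharacters.
        pattern = r'\b' + re.escape(key) + r'\b'
        flags[key] = comments.str.contains(pattern, regex=True)
    return flags

# Made-up examples, not rows from the dataset.
comments = pd.Series(["i like it", "i dislike it", "good food"])
flags = keyword_flags(comments, ["like", "good"])
print(flags)
```

Whether whole-word matching helps is an empirical question: it removes false hits like `dislike`, but it also stops `good` from matching a misspelling such as `goood`.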

In [31]:
sentiment.head()

| | Comment | Good or Bad | wow | great | excellent | amazing | cute | awesome | like | good |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wow loved this place | 1 | True | False | False | False | False | False | False | False |
| 1 | crust is not good | 0 | False | False | False | False | False | False | False | True |
| 2 | not tasty and the texture was just nasty | 0 | False | False | False | False | False | False | False | False |
| 3 | stopped by during the late may bank holiday of... | 1 | False | False | False | False | False | False | False | False |
| 4 | the selection on the menu was great and so wer... | 1 | False | True | False | False | False | False | False | False |


In [None]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# The keyword flags are the features; the 0/1 label is the target.
data = sentiment[keywords]
target = sentiment['Good or Bad']

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))

In [32]:
from sklearn.naive_bayes import BernoulliNB

#Define data and target
data = sentiment[keywords]
target = sentiment['Good or Bad']

#Data is binary so use BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)


print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0],(target != y_pred).sum()))

Number of mislabeled points out of a total 1000 points : 351
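
That count is measured on the same rows the model was trained on. The challenge also asks how the classifier carries over to another company's file. Here is a rough sketch of that check, using two tiny made-up sets of comments in place of the real files. The helper `keyword_flags`, the shortened keyword list, and every example sentence are assumptions for illustration only.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

keywords = ['great', 'good', 'awesome']  # shortened list, for illustration

def keyword_flags(comments, keywords):
    """Build 0/1 keyword features with the same substring rule for any dataset."""
    return pd.DataFrame(
        {key: comments.str.contains(key, regex=False).astype(int) for key in keywords}
    )

# Made-up stand-in for the file the model is trained on.
train = pd.DataFrame({
    'Comment': ['great food', 'good service', 'awesome staff',
                'slow and cold', 'would not go back', 'bland and overpriced'],
    'Good or Bad': [1, 1, 1, 0, 0, 0],
})

# Made-up stand-in for a second file from a different context.
other = pd.DataFrame({
    'Comment': ['great phone', 'battery died fast', 'good sound', 'not good at all'],
    'Good or Bad': [1, 0, 1, 0],
})

bnb = BernoulliNB()
bnb.fit(keyword_flags(train['Comment'], keywords), train['Good or Bad'])

# Score the second dataset with the model fitted on the first.
other_pred = bnb.predict(keyword_flags(other['Comment'], keywords))
accuracy = (other_pred == other['Good or Bad']).mean()
print("Accuracy on the other dataset:", accuracy)
```

The key design point is that the feature-building step lives in one function, so both datasets get identical columns in the same order. The last made-up comment is misclassified because it contains `good`, which shows how negation defeats a plain keyword list.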


In [33]:
from sklearn.metrics import confusion_matrix

# Calculate the accuracy of your model here.
print("Accuracy of model: ", 1- (target != y_pred).sum()/(target != y_pred).count())

#Calculate confusion matrix
confusion_matrix(target, y_pred)

Accuracy of model:  0.649


array([[476,  24],
       [327, 173]])
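
To read the matrix: scikit-learn puts the true labels on the rows and the predicted labels on the columns, ordered 0 then 1. The short sketch below unpacks the array printed above into sensitivity and specificity; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[476, 24],
               [327, 173]])

# Row 0 is true negatives / false positives; row 1 is false negatives / true positives.
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)     # share of positive comments the model catches
specificity = tn / (tn + fp)     # share of negative comments correctly identified
accuracy = (tp + tn) / cm.sum()  # should agree with the accuracy printed above

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, accuracy={accuracy:.3f}")
```

The model rarely mislabels a negative comment but misses most of the positive ones, which suggests the keyword list covers only a small part of how positive feedback is phrased.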