# CHALLENGE: Sentiment Analysis and Naive Bayes (Part 2)

## By Jean-Philippe Pitteloud

### Requirements

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

### Data Gathering

The working dataset was selected from the University of California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). Several files were downloaded manually, and the "Yelp" dataset (yelp_labelled.txt) was chosen for the model out of personal interest. The downloaded file was read into a pandas DataFrame

In [2]:
yelp_raw = pd.read_csv('yelp_labelled.txt', delimiter= '\t', header=None)
yelp_raw.columns = ['sentence', 'score']

In [3]:
yelp_raw.sample(10)

index,sentence,score
103,I LOVED their mussels cooked in this wine redu...,1
716,They have a really nice atmosphere.,1
931,If you want to wait for mediocre food and down...,0
445,"Not to mention the combination of pears, almon...",1
941,Probably not in a hurry to go back.,0
403,"Great place to eat, reminds me of the little m...",1
35,The only redeeming quality of the restaurant w...,1
257,I as well would've given godfathers zero stars...,0
608,We made the drive all the way from North Scott...,1
21,"The food, amazing.",1


Given our interest in building a model that evaluates the sentiment of a text and scores it as "positive" or "negative", the first step was to simplify the text by converting every letter to lowercase and replacing commas, periods, and exclamation marks with spaces

In [4]:
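# regex=True is passed explicitly: the default for this argument differs between pandas versions.
yelp_raw['sentence'] = yelp_raw['sentence'].str.lower().str.replace(r'[,.!]+', ' ', regex=True)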
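
The cleaning step relies on `str.replace` treating its pattern as a regular expression. A minimal sketch on a few made-up sentences shows the intended effect. The `sample` Series is hypothetical and not taken from the dataset, and passing `regex=True` explicitly is an assumption about what the cell intends:

```python
import pandas as pd

# Hypothetical sentences, used only to illustrate the cleaning step.
sample = pd.Series(["Wow... Loved this place.", "Crust is not good!", "The food, amazing."])

# Lowercase, then replace each run of commas, periods, or exclamation marks with one space.
cleaned = sample.str.lower().str.replace(r'[,.!]+', ' ', regex=True)
print(cleaned.tolist())
```

Because each run of punctuation becomes a space, the cleaned text can contain consecutive or trailing spaces, which matters when the sentences are later split into words.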

To get a sense of which words are most common in "positive" and "negative" comments/reviews, the comments in each group were split into their component words, and the word counts were sorted from most to least common

In [5]:
pos_words = pd.Series(' '.join(yelp_raw[yelp_raw['score'] == 1]['sentence']).lower().split(' '))

In [6]:
neg_words = pd.Series(' '.join(yelp_raw[yelp_raw['score'] == 0]['sentence']).lower().split(' '))

In [7]:
print('Most common words in Positive comments:\n')
pos_words.value_counts()[:50]

Most common words in Positive comments:



             698
the          310
and          222
was          138
i            117
a            112
is           104
to            87
this          77
good          73
great         70
food          60
in            59
place         57
of            53
it            51
very          47
service       45
for           43
with          42
had           37
are           36
so            35
you           34
we            34
were          34
have          33
my            33
on            32
they          32
here          29
all           25
friendly      24
that          24
back          23
delicious     23
be            23
our           22
time          22
nice          22
really        22
best          22
amazing       21
but           20
their         19
just          18
as            18
also          18
not           18
an            17
dtype: int64

In [8]:
print('Most common words in Negative comments:\n')
neg_words.value_counts()[:50]

Most common words in Negative comments:



           674
the        274
i          187
and        169
was        157
to         131
a          125
not         98
it          82
of          74
is          67
for         67
this        66
food        65
place       49
in          48
we          45
be          44
that        43
but         42
at          40
my          39
back        38
service     37
had         33
so          31
with        30
were        29
have        29
like        29
very        29
here        28
there       28
are         26
go          26
you         25
they        24
on          23
no          23
good        22
never       22
will        22
don't       22
would       21
time        20
if          20
our         19
ever        19
minutes     19
as          18
dtype: int64
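
The blank token at the top of both lists comes from splitting on a single space, which yields empty strings wherever two spaces are adjacent. A minimal sketch of an alternative count follows, assuming whitespace-based splitting and a small hand-picked stop-word set. `STOP_WORDS`, `top_words`, and the toy sentences are hypothetical and not part of the original analysis:

```python
import pandas as pd

# Hypothetical stop-word set; a real analysis would likely use a longer list.
STOP_WORDS = {"the", "and", "was", "i", "a", "is", "to", "this"}

def top_words(sentences, n=5):
    """Count whitespace-separated tokens, skipping stop words."""
    # str.split() with no argument splits on any run of whitespace,
    # so it never produces empty-string tokens.
    tokens = sentences.str.split().explode().dropna()
    tokens = tokens[~tokens.isin(STOP_WORDS)]
    return tokens.value_counts().head(n)

toy = pd.Series(["the food  was good ", "good service and good food"])
counts = top_words(toy)
print(counts)
```

Filtering out very common function words makes the sentiment-bearing words ("good", "great", "not") easier to spot near the top of the list.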

After going through the word lists displayed above, a list of keywords associated with positive and negative comments was created, and new features were added to the working dataset indicating the presence or absence of each keyword in a given comment

In [9]:
yelp_1 = yelp_raw.copy()

In [10]:
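# Keywords used as features: positive terms first, then negative terms (despite the variable name).
# Note: 'won"t' contains a double quote, so it is unlikely to match "won't" in the text, and
# 'dissapointed'/'dissapointing' are misspelled; they are kept as-is so the results below stay reproducible.
pos_keywords = ['good', 'great', 'friendly', 'delicious', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'tasty', 'recommend', 'fresh', 'not', 'bad', 'terrible', 'worst', 'disgusting', 'never', 'won"t', 'dissapointed', 'dissapointing']

# One boolean column per keyword: True if the keyword appears anywhere in the sentence.
for word in pos_keywords:
    yelp_1[str(word)] = yelp_1['sentence'].str.contains(str(word), case=False)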
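
`str.contains` matches substrings, so a keyword such as 'not' also flags sentences containing 'nothing' or 'another'. A hedged sketch of whole-word matching follows. The helper `contains_word` and the toy sentences are illustrative assumptions, and switching to whole-word matching would change the feature values and therefore the results reported in this notebook:

```python
import re
import pandas as pd

def contains_word(sentences, word):
    """Return True where `word` occurs as a whole word (case-insensitive)."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return sentences.str.contains(pattern, case=False, regex=True)

toy = pd.Series(["nothing special", "not good", "another great visit"])

substring_hits = toy.str.contains("not", case=False)   # matches inside longer words too
whole_word_hits = contains_word(toy, "not")            # matches only the word itself
print(substring_hits.tolist())
print(whole_word_hits.tolist())
```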

The working dataframe was split into two new objects. The 'data' dataset contains all the new features created for the selected keywords, while the 'target' dataset contains only the score of each original comment, i.e. its "positive" or "negative" sentiment

In [11]:
data = yelp_1[pos_keywords]
target = yelp_1['score']

Last, the Bernoulli Naive Bayes classifier was imported and fitted using the two new datasets created above ('data' and 'target'). Once fitted, the model was used to predict the scores, and its predictions were compared with the scores assigned in the original dataset

In [12]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
model_bnb = BernoulliNB()

# Fit our model to the data.
model_bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = model_bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))

Number of mislabeled points out of a total 1000 points : 242
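
To make the classifier's decision rule concrete, here is a minimal NumPy sketch of Bernoulli Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the default for scikit-learn's `BernoulliNB`). The function names and the toy matrix are assumptions for illustration; this is not the library's implementation:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class log-priors and smoothed per-feature probabilities."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # P(feature present | class), smoothed so no probability is exactly 0 or 1.
    prob = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])
    return classes, log_prior, prob

def predict_bernoulli_nb(X, classes, log_prior, prob):
    """Pick the class with the highest joint log-likelihood."""
    X = np.asarray(X, dtype=float)
    # Present features contribute log(p); absent features contribute log(1 - p).
    log_like = X @ np.log(prob).T + (1 - X) @ np.log(1 - prob).T
    return classes[np.argmax(log_prior + log_like, axis=1)]

# Toy data: the two columns stand for the presence of "good" and "bad".
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]])
y_toy = np.array([1, 1, 0, 0, 0])

params = fit_bernoulli_nb(X_toy, y_toy)
toy_pred = predict_bernoulli_nb([[1, 0], [0, 1]], *params)
print(toy_pred)
```

Unlike a multinomial model, the Bernoulli variant also uses the absence of a keyword as evidence, which is why a sentence containing none of the keywords still receives a prediction.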


To get a better idea of the performance of our first classifier, a confusion matrix was created

In [13]:
from sklearn.metrics import confusion_matrix
mod1_conf_mat = confusion_matrix(target, y_pred)
mod1_conf_mat

array([[463,  37],
       [205, 295]])

From the information provided by the confusion matrix, the accuracy of the classifier was calculated

In [14]:
accuracy = (mod1_conf_mat[0,0] + mod1_conf_mat[1,1]) * 100 / data.shape[0]

print('Accuracy:', accuracy)

Accuracy: 75.8


To expand the scope of the analysis, scikit-learn's built-in 'classification_report' function was used, with the results displayed below. The report summarizes metrics commonly used to evaluate model performance (precision, recall, and F1 score)

In [15]:
from sklearn.metrics import classification_report
print(classification_report(target, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.93      0.79       500
           1       0.89      0.59      0.71       500

   micro avg       0.76      0.76      0.76      1000
   macro avg       0.79      0.76      0.75      1000
weighted avg       0.79      0.76      0.75      1000
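
The figures in the report can be reproduced by hand from the confusion matrix shown earlier, where rows are true classes and columns are predicted classes (the scikit-learn convention). The sketch below re-enters the matrix values from this notebook; the variable names are illustrative:

```python
import numpy as np

# Confusion matrix of the first model: rows = true class, columns = predicted class.
conf_mat = np.array([[463, 37],
                     [205, 295]])

true_pos = np.diag(conf_mat)                 # correct predictions per class
precision = true_pos / conf_mat.sum(axis=0)  # divide by predicted counts (columns)
recall = true_pos / conf_mat.sum(axis=1)     # divide by true counts (rows)
f1 = 2 * precision * recall / (precision + recall)
overall_accuracy = true_pos.sum() / conf_mat.sum()

for label in (0, 1):
    print(label, round(precision[label], 2), round(recall[label], 2), round(f1[label], 2))
print('accuracy', round(overall_accuracy, 3))
```

Reading the matrix this way makes the asymmetry explicit: 205 positive comments were predicted as negative, which lowers both the recall for class 1 and the precision for class 0.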



Last, cross-validation was performed to evaluate the tendency of the model to overfit, given the data, conditions, and parameters used during training

In [16]:
from sklearn.model_selection import cross_val_score
cross_val_score(model_bnb, data, target, cv = 20)

array([0.64, 0.74, 0.72, 0.78, 0.8 , 0.72, 0.78, 0.82, 0.76, 0.78, 0.72,
       0.66, 0.86, 0.76, 0.7 , 0.8 , 0.74, 0.76, 0.78, 0.72])
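
One way to summarize the fold scores is to look at their mean and spread, and to compare the mean with the accuracy measured on the training data (75.8% above). The sketch below simply re-enters the 20 scores printed above; the variable names are illustrative:

```python
import numpy as np

# The 20 fold scores reported for the first model.
cv_scores = np.array([0.64, 0.74, 0.72, 0.78, 0.8, 0.72, 0.78, 0.82, 0.76, 0.78, 0.72,
                      0.66, 0.86, 0.76, 0.7, 0.8, 0.74, 0.76, 0.78, 0.72])

print('mean:  {:.3f}'.format(cv_scores.mean()))
print('std:   {:.3f}'.format(cv_scores.std()))
print('range: {:.2f} to {:.2f}'.format(cv_scores.min(), cv_scores.max()))
```

With 1000 sentences and 20 folds, each fold holds 50 sentences, so a single misclassified sentence moves a fold's score by 0.02.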

#### Summary

In summary, this first version of the sentiment classifier (model #1) appears to show some degree of overfitting, judging by the variation in the cross-validation scores, which range from 0.64 to 0.86 across the 20 folds. Because the model was evaluated on the same data it was trained on, overfitting is very likely. The large number of features included in training may also limit the generality of the model. In terms of performance, predictions of 'positive' sentiment show good precision (89%), while precision for 'negative' sentiment is lower (69%). In terms of recall, a large share of negative comments was correctly identified (93%), whereas only 59% of positive comments were. The F1 scores (0.79 for negative vs 0.71 for positive) confirm that the model performs better when predicting negative sentiment
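
Because the models in this notebook are scored on the same rows they were trained on, a held-out split gives a more direct check on generalization. The following is a minimal sketch, assuming a 70/30 stratified split and a synthetic feature matrix standing in for 'data' and 'target'. None of these choices come from the original analysis:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in for the keyword features: 200 rows, 5 binary columns.
# Features are more often 1 for class 1 (p=0.6) than for class 0 (p=0.2).
y_demo = rng.integers(0, 2, size=200)
p_on = np.where(y_demo[:, None] == 1, 0.6, 0.2)
X_demo = (rng.random((200, 5)) < p_on).astype(int)

# Hold out 30% of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

model_demo = BernoulliNB().fit(X_train, y_train)
train_acc = model_demo.score(X_train, y_train)
test_acc = model_demo.score(X_test, y_test)
print('train accuracy:', train_acc)
print('test accuracy: ', test_acc)
```

A test accuracy well below the training accuracy would be a clearer sign of overfitting than fold-to-fold variation alone.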

### Model 2

In the next sections, variations of the model were evaluated using the same metrics as above, and some conclusions were drawn from the process
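
Since each variation repeats the same steps (build keyword features, fit, predict, score), the workflow can be wrapped in a helper. This is a sketch only: `evaluate_keywords` is a hypothetical function that does not appear in the original notebook, and the toy frame merely stands in for `yelp_raw`. Like the cells in this notebook, it scores the model on the data it was trained on:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

def evaluate_keywords(df, keywords):
    """Fit BernoulliNB on keyword-presence features; return (accuracy %, confusion matrix)."""
    # One boolean column per keyword, matched as a plain substring.
    features = pd.DataFrame(
        {word: df['sentence'].str.contains(word, case=False, regex=False) for word in keywords})
    model = BernoulliNB().fit(features, df['score'])
    conf_mat = confusion_matrix(df['score'], model.predict(features))
    accuracy = conf_mat.trace() * 100 / len(df)
    return accuracy, conf_mat

# Toy stand-in for the real dataframe.
toy = pd.DataFrame({
    'sentence': ['good food', 'bad service', 'great place', 'not good', 'really bad'],
    'score': [1, 0, 1, 0, 0],
})
acc, mat = evaluate_keywords(toy, ['good', 'great', 'bad', 'not'])
print(acc)
print(mat)
```

With such a helper, each model variant reduces to a different keyword list passed to the same function, which makes the comparisons easier to audit.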

In [17]:
yelp_2 = yelp_raw.copy()

In this particular model, words with negative meaning were removed from the list of words/features present in the original model and the performance evaluated

In [18]:
# words related to negative experiences removed
pos_keywords_2 = ['good', 'great', 'friendly', 'delicious', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'tasty', 'recommend', 'fresh']

for word in pos_keywords_2:
    yelp_2[str(word)] = yelp_2['sentence'].str.contains(str(word), case=False)

In [19]:
data_2 = yelp_2[pos_keywords_2]
target_2 = yelp_2['score']

In [20]:
model_bnb_2 = BernoulliNB()

# Fit our model to the data.
model_bnb_2.fit(data_2, target_2)

# Classify, storing the result in a new variable.
y_pred_2 = model_bnb_2.predict(data_2)

In [21]:
mod_2_conf_mat = confusion_matrix(target_2, y_pred_2)
mod_2_conf_mat

array([[450,  50],
       [213, 287]])

In [22]:
accuracy_2 = (mod_2_conf_mat[0,0] + mod_2_conf_mat[1,1]) * 100 / data_2.shape[0]

print('Accuracy:', accuracy_2)

Accuracy: 73.7


As can be seen above, removing the negative-sentiment words from the features reduced the accuracy of the model from 75.8% to 73.7%. The metrics reported below were also reduced to some extent

In [23]:
print(classification_report(target_2, y_pred_2))

              precision    recall  f1-score   support

           0       0.68      0.90      0.77       500
           1       0.85      0.57      0.69       500

   micro avg       0.74      0.74      0.74      1000
   macro avg       0.77      0.74      0.73      1000
weighted avg       0.77      0.74      0.73      1000



In [24]:
cross_val_score(model_bnb_2, data_2, target_2, cv=20)

array([0.6 , 0.74, 0.76, 0.76, 0.8 , 0.72, 0.78, 0.82, 0.66, 0.76, 0.72,
       0.66, 0.8 , 0.68, 0.7 , 0.76, 0.7 , 0.72, 0.8 , 0.74])

In terms of overfitting, the cross-validation scores (0.60 to 0.82 across folds) again suggest some degree of overfitting

### Model 3

In [25]:
yelp_3 = yelp_raw.copy()

In this model, words/features related to food were not employed in order to gain generality in the model and reduce overfitting

In [26]:
# like the original but using fewer words related to food
pos_keywords_3 = ['good', 'great', 'friendly', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'recommend', 'not', 'bad', 'terrible', 'worst', 'never', 'won"t', 'dissapointed', 'dissapointing']

for word in pos_keywords_3:
    yelp_3[str(word)] = yelp_3['sentence'].str.contains(str(word), case=False)

In [27]:
data_3 = yelp_3[pos_keywords_3]
target_3 = yelp_3['score']

In [28]:
model_bnb_3 = BernoulliNB()

# Fit our model to the data.
model_bnb_3.fit(data_3, target_3)

# Classify, storing the result in a new variable.
y_pred_3 = model_bnb_3.predict(data_3)

In [29]:
mod_3_conf_mat = confusion_matrix(target_3, y_pred_3)
mod_3_conf_mat

array([[467,  33],
       [234, 266]])

In [30]:
accuracy_3 = (mod_3_conf_mat[0,0] + mod_3_conf_mat[1,1]) * 100 / data_3.shape[0]

print('Accuracy:', accuracy_3)

Accuracy: 73.3


As can be seen above, the changes introduced in this version lowered the accuracy relative to the original model (73.3% vs 75.8%)

In [31]:
print(classification_report(target_3, y_pred_3))

              precision    recall  f1-score   support

           0       0.67      0.93      0.78       500
           1       0.89      0.53      0.67       500

   micro avg       0.73      0.73      0.73      1000
   macro avg       0.78      0.73      0.72      1000
weighted avg       0.78      0.73      0.72      1000



The other key metrics also reflect a reduced ability to predict the sentiment of the comments in the dataset correctly, most visibly in the recall for positive comments (0.53 vs 0.59 for the original model)

In [32]:
cross_val_score(model_bnb_3, data_3, target_3, cv=20)

array([0.68, 0.68, 0.74, 0.76, 0.74, 0.68, 0.72, 0.74, 0.76, 0.72, 0.68,
       0.64, 0.8 , 0.76, 0.72, 0.8 , 0.72, 0.76, 0.74, 0.74])

In terms of overfitting, the cross-validation scores still fluctuate (0.64 to 0.80 across folds), which suggests overfitting

### Model 4

In [33]:
yelp_4 = yelp_raw.copy()

In this version of our classifier, the number of words/features was drastically reduced in an attempt to create a simpler and more general model less prone to overfitting

In [34]:
# like the original reducing the list of words
pos_keywords_4 = ['good', 'great', 'nice', 'like', 'excellent', 'not', 'bad']

for word in pos_keywords_4:
    yelp_4[str(word)] = yelp_4['sentence'].str.contains(str(word), case=False)

In [35]:
data_4 = yelp_4[pos_keywords_4]
target_4 = yelp_4['score']

In [36]:
model_bnb_4 = BernoulliNB()

# Fit our model to the data.
model_bnb_4.fit(data_4, target_4)

# Classify, storing the result in a new variable.
y_pred_4 = model_bnb_4.predict(data_4)

In [37]:
mod_4_conf_mat = confusion_matrix(target_4, y_pred_4)
mod_4_conf_mat

array([[489,  11],
       [342, 158]])

In [38]:
accuracy_4 = (mod_4_conf_mat[0,0] + mod_4_conf_mat[1,1]) * 100 / data_4.shape[0]

print('Accuracy:', accuracy_4)

Accuracy: 64.7


As can be seen above, the accuracy of this model in predicting the sentiment of the comments dropped sharply compared with the original model (64.7% vs 75.8%)

In [39]:
print(classification_report(target_4, y_pred_4))

              precision    recall  f1-score   support

           0       0.59      0.98      0.73       500
           1       0.93      0.32      0.47       500

   micro avg       0.65      0.65      0.65      1000
   macro avg       0.76      0.65      0.60      1000
weighted avg       0.76      0.65      0.60      1000



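Most other key metrics were also severely affected, making this model worse than the original overall: recall for positive comments fell to 0.32 and the F1 scores dropped for both classes, although precision for positive comments (0.93) and recall for negative comments (0.98) increased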

In [40]:
cross_val_score(model_bnb_4, data_4, target_4, cv=20)

array([0.62, 0.66, 0.66, 0.66, 0.62, 0.62, 0.68, 0.66, 0.64, 0.6 , 0.6 ,
       0.6 , 0.7 , 0.6 , 0.68, 0.66, 0.64, 0.7 , 0.68, 0.66])

The cross-validation results suggest that this model is less prone to overfitting: the fold scores vary less, ranging from 0.60 to 0.70. The use of fewer features and more general words may be responsible for this decrease in variation

### Model 5

In [41]:
yelp_5 = yelp_raw.copy()

For the last version of the model, all negative sentiment words were removed and new features introduced that use the negative version (not_) of each of the positive words included

In [42]:
# like the original list but negative words removed and new features about not_word created instead
pos_keywords_5 = ['good', 'great', 'friendly', 'delicious', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'tasty', 'recommend', 'fresh']

for word in pos_keywords_5:
    yelp_5[str(word)] = yelp_5['sentence'].str.contains(str(word), case=False)
    yelp_5['not_' + str(word)] = (yelp_5['sentence'].str.contains('not', case=False) & yelp_5['sentence'].str.contains(str(word), case=False))
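
The 'not_' features above are set whenever 'not' and the keyword both appear anywhere in the sentence. A stricter variant, sketched below under the assumption that only a directly preceding 'not' should count, uses a single regular expression per keyword. `negated` is a hypothetical helper, and adopting it would change the features and thus the reported results:

```python
import re
import pandas as pd

def negated(sentences, word):
    """True where 'not' directly precedes `word`, e.g. 'not good'."""
    pattern = r'\bnot\s+' + re.escape(word) + r'\b'
    return sentences.str.contains(pattern, case=False, regex=True)

toy = pd.Series(["not good at all", "good food but not cheap", "service was good"])

# Behaviour of the cell above: both terms present anywhere in the sentence.
anywhere = toy.str.contains('not', case=False) & toy.str.contains('good', case=False)
# Stricter variant: 'not' immediately before the keyword.
adjacent = negated(toy, 'good')
print(anywhere.tolist())
print(adjacent.tolist())
```

The second toy sentence shows the difference: it is positive about the food, yet the co-occurrence rule flags it as 'not_good'.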

In [43]:
data_5 = yelp_5.iloc[:, 2:]
target_5 = yelp_5['score']

In [44]:
model_bnb_5 = BernoulliNB()

# Fit our model to the data.
model_bnb_5.fit(data_5, target_5)

# Classify, storing the result in a new variable.
y_pred_5 = model_bnb_5.predict(data_5)

In [45]:
mod_5_conf_mat = confusion_matrix(target_5, y_pred_5)
mod_5_conf_mat

array([[467,  33],
       [216, 284]])

As can be seen below, this version performs similarly to the original model in terms of accuracy (75.1% vs 75.8%)

In [46]:
accuracy_5 = (mod_5_conf_mat[0,0] + mod_5_conf_mat[1,1]) * 100 / data_5.shape[0]

print('Accuracy:', accuracy_5)

Accuracy: 75.1


In terms of other key metrics, this version slightly outperforms the original model in some (precision for positive comments, 0.90 vs 0.89) and performs slightly worse in others (recall for positive comments, 0.57 vs 0.59)

In [47]:
print(classification_report(target_5, y_pred_5))

              precision    recall  f1-score   support

           0       0.68      0.93      0.79       500
           1       0.90      0.57      0.70       500

   micro avg       0.75      0.75      0.75      1000
   macro avg       0.79      0.75      0.74      1000
weighted avg       0.79      0.75      0.74      1000



In [48]:
cross_val_score(model_bnb_5, data_5, target_5, cv = 20)

array([0.66, 0.76, 0.74, 0.78, 0.82, 0.72, 0.76, 0.82, 0.66, 0.76, 0.74,
       0.66, 0.84, 0.74, 0.7 , 0.76, 0.72, 0.76, 0.8 , 0.74])

As far as overfitting goes, this version exhibits fluctuations similar to those observed for the original model (fold scores from 0.66 to 0.84)
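
To compare the five versions side by side, the accuracies and per-class recalls can be recomputed from the confusion matrices reported above. The sketch below only re-enters those matrices; the dictionary keys and column names are illustrative:

```python
import numpy as np
import pandas as pd

# Confusion matrices as printed above (rows = true class, columns = predicted class).
conf_mats = {
    'model 1': [[463, 37], [205, 295]],
    'model 2': [[450, 50], [213, 287]],
    'model 3': [[467, 33], [234, 266]],
    'model 4': [[489, 11], [342, 158]],
    'model 5': [[467, 33], [216, 284]],
}

rows = []
for name, mat in conf_mats.items():
    mat = np.array(mat)
    rows.append({
        'model': name,
        'accuracy_%': mat.trace() * 100 / mat.sum(),
        'recall_negative': mat[0, 0] / mat[0].sum(),
        'recall_positive': mat[1, 1] / mat[1].sum(),
    })

summary = pd.DataFrame(rows).set_index('model')
print(summary.round(3))
```

Across all five versions the recall for negative comments stays at or above 0.90, while the recall for positive comments ranges from about 0.32 to 0.59, so the models differ mainly in how many positive comments they recover.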