# CHALLENGE: Sentiment Analysis and Naive Bayes

## By Jean-Philippe Pitteloud

### Requirements

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

### Data Gathering

The working dataset was selected from the data available in the University of California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). The files were downloaded manually, and the "Yelp" dataset (yelp_labelled.txt) was chosen for building the model out of personal interest. The data from the downloaded file was read into a Pandas dataframe.

In [2]:
yelp_raw = pd.read_csv('yelp_labelled.txt', delimiter= '\t', header=None)
yelp_raw.columns = ['sentence', 'score']

In [3]:
yelp_raw.sample(10)

Unnamed: 0,sentence,score
729,"As for the service, I thought it was good.",1
306,"Will never, ever go back.",0
195,The best place to go for a tasty bowl of Pho!,1
576,I swung in to give them a try but was deeply d...,0
888,Seriously killer hot chai latte.,1
338,"OMG, the food was delicioso!",1
29,The worst was the salmon sashimi.,0
573,"He also came back to check on us regularly, ex...",1
509,Thoroughly disappointed!,0
49,My side Greek salad with the Greek dressing wa...,1


Since the goal was a model that evaluates the sentiment of a text and scores it as "positive" or "negative", the first step was to simplify the text by converting every letter to lowercase and replacing commas, periods, and exclamation marks with spaces.

In [4]:
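# regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default.
yelp_raw['sentence'] = yelp_raw['sentence'].str.lower().str.replace(r'[,.!]+', ' ', regex=True)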
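
The effect of this cleaning step is easiest to see on a couple of sentences. The following is a minimal sketch, not part of the original notebook; the helper name `normalize_sentences` and the two sample sentences are chosen here for illustration only.

```python
import pandas as pd

def normalize_sentences(sentences: pd.Series) -> pd.Series:
    """Lowercase the text and replace each run of , . ! with one space."""
    return sentences.str.lower().str.replace(r'[,.!]+', ' ', regex=True)

# Two sentences taken from the sample displayed earlier.
sample = pd.Series(['OMG, the food was delicioso!', 'Will never, ever go back.'])
cleaned = normalize_sentences(sample)
print(cleaned.tolist())
```

Because punctuation is replaced by a space rather than removed, the cleaned text can contain double spaces and trailing spaces, which matters for any later step that splits on a single space.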

To get a sense of which words are most common in "positive" and "negative" comments/reviews, the comments in each group were split into their component words, and the word counts were sorted from most common to least common.

In [5]:
pos_words = pd.Series(' '.join(yelp_raw[yelp_raw['score'] == 1]['sentence']).lower().split(' '))

In [6]:
neg_words = pd.Series(' '.join(yelp_raw[yelp_raw['score'] == 0]['sentence']).lower().split(' '))

In [7]:
print('Most common words in Positive comments:\n')
pos_words.value_counts()[:50]

Most common words in Positive comments:



              698
the           310
and           222
was           138
i             117
a             112
is            104
to             87
this           77
good           73
great          70
food           60
in             59
place          57
of             53
it             51
very           47
service        45
for            43
with           42
had            37
are            36
so             35
we             34
were           34
you            34
have           33
my             33
on             32
they           32
here           29
all            25
friendly       24
that           24
delicious      23
back           23
be             23
best           22
time           22
our            22
really         22
nice           22
amazing        21
but            20
their          19
not            18
just           18
as             18
also           18
restaurant     17
dtype: int64

In [8]:
print('Most common words in Negative comments:\n')
neg_words.value_counts()[:50]

Most common words in Negative comments:



           674
the        274
i          187
and        169
was        157
to         131
a          125
not         98
it          82
of          74
is          67
for         67
this        66
food        65
place       49
in          48
we          45
be          44
that        43
but         42
at          40
my          39
back        38
service     37
had         33
so          31
with        30
like        29
very        29
were        29
have        29
here        28
there       28
go          26
are         26
you         25
they        24
no          23
on          23
good        22
don't       22
will        22
never       22
would       21
time        20
if          20
minutes     19
our         19
ever        19
bad         18
dtype: int64
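
The blank entry at the top of both lists (698 and 674 occurrences) most likely comes from splitting on a single space: consecutive spaces left behind by the punctuation replacement produce empty strings. The sketch below, which uses only the standard library and two made-up sentences, shows that calling `split()` with no argument avoids those empty tokens. The helper name `count_words` is an assumption for this example.

```python
from collections import Counter

def count_words(sentences):
    """Count whitespace-separated tokens across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # split() with no argument collapses runs of whitespace,
        # so no empty-string tokens are produced.
        counts.update(sentence.lower().split())
    return counts

toy_sentences = ['the food was good ', 'good  service  and good food ']
word_counts = count_words(toy_sentences)
print(word_counts.most_common(3))

# For comparison, splitting on a single space keeps empty tokens.
naive_tokens = ' '.join(toy_sentences).split(' ')
print(naive_tokens.count(''))
```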

After going through the lists of words displayed above, a list of keywords associated with both positive and negative comments was created, and new features were added to the working dataset indicating the presence or absence of each keyword in a given comment.

In [9]:
pos_keywords = ['good', 'great', 'friendly', 'delicious', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'tasty', 'recommend', 'fresh', 'not', 'bad', 'terrible', 'worst', 'disgusting', 'never', 'won"t', 'dissapointed', 'dissapointing']

for word in pos_keywords:
    yelp_raw[str(word)] = yelp_raw['sentence'].str.contains(str(word), case=False)
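
Note that `str.contains` performs substring matching, so a keyword such as 'not' is also found inside 'nothing', and 'like' inside 'liked'. The sketch below is one possible alternative, not the approach used in this notebook: it wraps each keyword in regular-expression word boundaries. The helper name `keyword_flags` and the toy sentences are assumptions for illustration.

```python
import re
import pandas as pd

def keyword_flags(sentences: pd.Series, keywords) -> pd.DataFrame:
    """Return one boolean column per keyword, matching whole words only."""
    flags = {}
    for word in keywords:
        # \b marks a word boundary; re.escape protects any special characters.
        pattern = r'\b' + re.escape(word) + r'\b'
        flags[word] = sentences.str.contains(pattern, case=False, regex=True)
    return pd.DataFrame(flags)

toy = pd.Series(['nothing special', 'not good', 'i liked it'])
whole_word = keyword_flags(toy, ['not', 'good', 'like'])

# Substring matching, as in the cell above, for comparison.
substring_not = toy.str.contains('not', case=False)
print(whole_word)
print(substring_not.tolist())
```

Whether whole-word matching is preferable depends on the keyword: it stops 'not' from matching 'nothing', but it also stops 'love' from matching 'loved', which is why the keyword list would need to include each form explicitly.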

The working dataframe was split into two new objects. The 'data' dataset contained all the new features created for the selected keywords, while the 'target' dataset contained only the score assigned to the original comment, that is, its "positive" or "negative" sentiment.

In [10]:
data = yelp_raw[pos_keywords]
target = yelp_raw['score']

Last, the Bernoulli Naive Bayes classifier was imported and fitted using the two new datasets created above ('data' and 'target'). Once the model was built, it was used to predict scores, and its predictions were compared to the scores assigned in the original dataset.

In [11]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
model_bnb = BernoulliNB()

# Fit our model to the data.
model_bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = model_bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], (target != y_pred).sum()))

Number of mislabeled points out of a total 1000 points : 242
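
Beyond the error count, a fitted `BernoulliNB` can be inspected to see which keywords push a prediction towards each class. The sketch below is illustrative only: it uses a six-row toy dataset invented for this example, not the Yelp data, and compares the per-class log-probabilities stored in the `feature_log_prob_` attribute.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Toy keyword-presence features and labels (made up for this example).
toy_data = pd.DataFrame({
    'great': [1, 1, 0, 0, 1, 0],
    'bad':   [0, 0, 1, 1, 0, 0],
})
toy_target = pd.Series([1, 1, 0, 0, 1, 0])

toy_model = BernoulliNB().fit(toy_data, toy_target)

# feature_log_prob_[c, j] is log P(keyword j present | class c);
# classes_ is sorted, so row 0 is "negative" and row 1 is "positive".
log_odds = pd.Series(
    toy_model.feature_log_prob_[1] - toy_model.feature_log_prob_[0],
    index=toy_data.columns,
)
print(log_odds.sort_values())
```

A positive value means the keyword is more likely to appear in positive comments, and a negative value the opposite. Applied to the notebook's own model, the same calculation would be a quick way to check which keywords actually carry signal.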


As can be seen above, our classification model correctly classified 758 (76%) of the 1000 comments available in the dataset. Note that this figure was measured on the same comments used to fit the model.
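
Scoring a model on its own training data tends to give an optimistic estimate. A common alternative, sketched below under stated assumptions, is to hold out part of the data for evaluation. The features here are synthetic (one keyword that agrees with the label about 80% of the time, plus one noise keyword), so the numbers say nothing about the Yelp results; only the workflow is the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic labels and two binary features (assumptions for illustration).
y = rng.integers(0, 2, size=200)
informative = np.where(rng.random(200) < 0.8, y, 1 - y)
noise = (rng.random(200) < 0.5).astype(int)
X = np.column_stack([informative, noise])

# Keep 25% of the rows aside; stratify preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

holdout_model = BernoulliNB().fit(X_train, y_train)
train_accuracy = holdout_model.score(X_train, y_train)
test_accuracy = holdout_model.score(X_test, y_test)
print(f'train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}')
```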

### Application of the classification model to the IMDb dataset

In order to test the applicability of our model to sentiment-related comments/reviews from a different industry, the model was applied to a dataset from the same repository mentioned at the beginning of this work, this time containing movie and show reviews from the website IMDb (imdb_labelled.txt). First, the dataset was read into a Pandas dataframe and formatted in the same way as the Yelp data.

In [12]:
imdb_raw = pd.read_csv('imdb_labelled.txt', delimiter= '\t', header=None)
imdb_raw.columns = ['sentence', 'score']

In [13]:
imdb_raw.head()

Unnamed: 0,sentence,score
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
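
The repository describes each source as containing 500 positive and 500 negative sentences, so a dataframe with noticeably fewer than 1000 rows may mean that `read_csv` interpreted double quotes in the reviews as field quoting and merged several lines into one. The sketch below shows the effect on a made-up three-line sample and one possible remedy, passing `quoting=csv.QUOTE_NONE`. It is an assumption that this is what happens with imdb_labelled.txt; checking `imdb_raw.shape` would confirm it.

```python
import csv
import io
import pandas as pd

# Three tab-separated lines; a quote opens on line 1 and closes on line 3.
raw = (
    '"Unclosed quote at the start of a sentence\t0\n'
    'A second sentence\t1\n'
    'A third sentence that closes it"\t1\n'
)

# Default quote handling: text between the quotes is read as one field.
default_df = pd.read_csv(io.StringIO(raw), delimiter='\t', header=None,
                         names=['sentence', 'score'])

# QUOTE_NONE treats the quote character as ordinary text.
no_quote_df = pd.read_csv(io.StringIO(raw), delimiter='\t', header=None,
                          names=['sentence', 'score'], quoting=csv.QUOTE_NONE)

print(len(default_df), len(no_quote_df))
```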


In [14]:
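# regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default.
imdb_raw['sentence'] = imdb_raw['sentence'].str.lower().str.replace(r'[,.!]+', ' ', regex=True)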

Once the comments were formatted, each keyword in the list created for the previous model was searched for in the new dataset, and features were created to record the presence or absence of each keyword in a comment.

In [15]:
pos_keywords = ['good', 'great', 'friendly', 'delicious', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'tasty', 'recommend', 'fresh', 'not', 'bad', 'terrible', 'worst', 'disgusting', 'never', 'won"t', 'dissapointed', 'dissapointing']

for word in pos_keywords:
    imdb_raw[str(word)] = imdb_raw['sentence'].str.contains(str(word), case=False)

In [18]:
data_imdb = imdb_raw[pos_keywords]
target_imdb = imdb_raw['score']

The data was then fed into the model to generate predictions, and the predictions were compared to the scores present in the original dataset.

In [19]:
# Classify, storing the result in a new variable.
y_pred_imdb = model_bnb.predict(data_imdb)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(data_imdb.shape[0], (target_imdb != y_pred_imdb).sum()))

Number of mislabeled points out of a total 748 points : 291


As can be seen above, our model (fitted on food/restaurant reviews) correctly predicted the score for 457 (61%) of the 748 reviews available.
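
A single error count does not show whether the mistakes are concentrated in one class. A confusion table breaks the predictions down by actual and predicted label. The sketch below uses `pd.crosstab` on six made-up labels; the helper name `confusion_table` is an assumption, and the same call could be applied to `target_imdb` and `y_pred_imdb`.

```python
import pandas as pd

def confusion_table(actual, predicted) -> pd.DataFrame:
    """Cross-tabulate actual labels (rows) against predicted labels (columns)."""
    return pd.crosstab(pd.Series(actual, name='actual'),
                       pd.Series(predicted, name='predicted'))

# Made-up labels for illustration only.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

table = confusion_table(actual, predicted)
accuracy = (pd.Series(actual) == pd.Series(predicted)).mean()
print(table)
print(f'accuracy: {accuracy:.3f}')
```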

### Application of the classification model to the Amazon dataset 

Next, the applicability of our model was tested on a dataset of sentiment-related comments/reviews collected by Amazon. As described in the previous section, our classification model was built using food/restaurant reviews from the website Yelp. In this section, a Pandas dataframe was created from the amazon_cells_labelled.txt file, manually downloaded from the University of California Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#).

First, the dataframe was created and the columns formatted appropriately to run our working model

In [20]:
amazon_raw = pd.read_csv('amazon_cells_labelled.txt', delimiter= '\t', header=None)
amazon_raw.columns = ['sentence', 'score']

In [21]:
amazon_raw.head()

Unnamed: 0,sentence,score
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


In [22]:
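# regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default.
amazon_raw['sentence'] = amazon_raw['sentence'].str.lower().str.replace(r'[,.!]+', ' ', regex=True)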

In [23]:
pos_keywords = ['good', 'great', 'friendly', 'delicious', 'nice', 'best', 'amazing', 'like', 'love', 'fantastic', 'awesome', 'pretty', 'loved', 'excellent', 'tasty', 'recommend', 'fresh', 'not', 'bad', 'terrible', 'worst', 'disgusting', 'never', 'won"t', 'dissapointed', 'dissapointing']

for word in pos_keywords:
    amazon_raw[str(word)] = amazon_raw['sentence'].str.contains(str(word), case=False)

In [24]:
data_amazon = amazon_raw[pos_keywords]
target_amazon = amazon_raw['score']
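
The load, clean, and feature-building steps are repeated identically for each of the three datasets. One way to keep them consistent is to wrap them in two small functions, as in the sketch below. The function names `load_labelled` and `add_keyword_features` are hypothetical, and an in-memory string stands in for the downloaded file.

```python
import io
import pandas as pd

def load_labelled(source) -> pd.DataFrame:
    """Read a tab-separated sentence/score file and normalize the text."""
    df = pd.read_csv(source, delimiter='\t', header=None,
                     names=['sentence', 'score'])
    df['sentence'] = (df['sentence'].str.lower()
                      .str.replace(r'[,.!]+', ' ', regex=True))
    return df

def add_keyword_features(df: pd.DataFrame, keywords):
    """Return (features, target): one boolean column per keyword, plus scores."""
    features = pd.DataFrame({
        word: df['sentence'].str.contains(word, case=False)
        for word in keywords
    })
    return features, df['score']

# Stand-in for a file such as amazon_cells_labelled.txt (made-up content).
fake_file = io.StringIO('Great phone.\t1\nDoes not work!\t0\n')
reviews = load_labelled(fake_file)
features, scores = add_keyword_features(reviews, ['great', 'not'])
print(features)
```

With helpers like these, the keyword list would be defined once and passed to each dataset, rather than being redefined in every section.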

With the dataset appropriately formatted for the modeling stage, our working model was employed to predict the "sentiment" associated with a given review, and the performance of the model evaluated

In [25]:
# Classify, storing the result in a new variable.
y_pred_amazon = model_bnb.predict(data_amazon)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(data_amazon.shape[0], (target_amazon != y_pred_amazon).sum()))

Number of mislabeled points out of a total 1000 points : 266


From the results above, our working model correctly predicted the "sentiment" of 734 (73%) of the 1000 reviews available in the Amazon dataset.

It can be concluded that while the keywords selected to build our model correctly classified the majority of reviews in all three datasets, a more robust model may be required to accurately predict the "sentiment" of comments/reviews in a general context.
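
For reference, the accuracies quoted in each section follow directly from the mislabeled counts printed by the model. The short calculation below reproduces them from the totals reported in this notebook.

```python
# (total reviews, mislabeled reviews) as printed in the sections above.
results = {
    'Yelp': (1000, 242),
    'IMDb': (748, 291),
    'Amazon': (1000, 266),
}

accuracies = {}
for name, (total, mislabeled) in results.items():
    correct = total - mislabeled
    accuracies[name] = correct / total
    print(f'{name}: {correct} of {total} correct ({accuracies[name]:.1%})')
```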