# BT4222 Assignment: Yelp Reviews Data

### Due on: *10 Feb 2017 @ 23:59 (Week 5)*

### Submit this .ipynb file to:  *IVLE > Student Submission > Individual Assignment*

### In addition, please prepend your NUS userID to the filename, i.e., "`a0123456_bt4222_assignment.ipynb`"

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [None]:
# for Python 2: use print only as a function
# from __future__ import print_function

## Task 1 (1 point)

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [1]:
import pandas as pd

In [2]:
original = pd.read_csv("yelp.csv")

In [3]:
original.shape

(10000, 10)

In [4]:
original.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2 (2 pts)

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9) explains how to do this.

In [5]:
data = original[(original.stars == 5) | (original.stars == 1)]

In [6]:
data.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4


In [7]:
print(data.shape)
print(data[data.stars == 5].shape[0] + data[data.stars == 1].shape[0] == data.shape[0])

(4086, 10)
True


## Task 3 (2 pts)

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [8]:
X = data.text
y = data.stars

In [9]:
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))

(4086,)
(4086,)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(X_train.size + X_test.size == X.size)

(3064,)
(1022,)
True
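The split sizes above (3064 and 1022) are consistent with holding out a quarter of the 4086 reviews and rounding the test share up. The following is a minimal from-scratch sketch of a shuffled hold-out split, not scikit-learn's actual implementation; `shuffle_split`, `reviews` and the use of Python's `random` module are illustrative assumptions, so the specific rows selected will differ from those chosen by `train_test_split`.

```python
import math
import random

def shuffle_split(items, test_fraction=0.25, seed=1):
    """Shuffle the positions, then hold out a fraction of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)              # reproducible shuffle
    n_test = math.ceil(test_fraction * len(items))    # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# Hypothetical stand-in for the 4086 filtered review texts
reviews = [f"review {i}" for i in range(4086)]
train, test = shuffle_split(reviews)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state=1` above: the split is random but repeatable.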


## Task 4 (2 pts)

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [12]:
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [13]:
X_train_dtm = vect.transform(X_train)
X_train_dtm

<3064x16825 sparse matrix of type '<class 'numpy.int64'>'
	with 237720 stored elements in Compressed Sparse Row format>

In [14]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1022x16825 sparse matrix of type '<class 'numpy.int64'>'
	with 77006 stored elements in Compressed Sparse Row format>
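Both matrices have 16825 columns because the vocabulary is learned from the training text only, and tokens that appear only in the test text are dropped. The sketch below illustrates that idea in plain Python on toy data. It reuses the default `token_pattern` and lowercasing shown in the `CountVectorizer` output above, but the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy reviews are assumptions made for illustration, and the result is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Same default token pattern as in the CountVectorizer repr above:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Count known tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        row = [0] * len(vocab)
        for tok, n in counts.items():
            row[vocab[tok]] = n
        rows.append(row)
    return rows

toy_train = ["Great food, great service", "Terrible food"]
toy_test = ["Great pizza, terrible parking"]

vocab = fit_vocabulary(toy_train)
train_dtm = transform(toy_train, vocab)
test_dtm = transform(toy_test, vocab)
print(vocab)
print(train_dtm)
print(test_dtm)
```

In the toy test review, "pizza" and "parking" never occurred in training, so they contribute nothing to the test row, which still has one column per training token.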

## Task 5 (2 pts)

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [16]:
# Creating the model
nb = MultinomialNB()
%time nb.fit(X_train_dtm, y_train)

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 9.23 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
# Making class prediction
y_pred_class = nb.predict(X_test_dtm)

In [18]:
print(y_pred_class.shape)
y_pred_class

(1022,)


array([5, 5, 5, ..., 5, 1, 5])

In [19]:
# Calculating the accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

0.918786692759


In [20]:
# Printing the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[126  58]
 [ 25 813]]
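As a quick check on how to read this output, the sketch below unpacks the four cells and recomputes the accuracy by hand. It assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, both in sorted label order) and treats 5-star as the "positive" class; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true labels, columns are
# predicted labels, both in sorted order (1-star first, then 5-star).
confusion = [[126, 58],
             [25, 813]]

(tn, fp), (fn, tp) = confusion   # treating 5-star as the "positive" class

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_5_star = tp / (tp + fn)   # share of true 5-star reviews predicted as 5-star
recall_1_star = tn / (tn + fp)   # share of true 1-star reviews predicted as 1-star

print(total, accuracy, recall_5_star, recall_1_star)
```

The recomputed accuracy matches the value printed by `accuracy_score`, and the per-class figures show that the model recovers 5-star reviews far more reliably than 1-star reviews.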


## Task 6 (3 pts)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [21]:
# Takes the full response Series and the test response Series.
# Works for any number of classes, not just two.
def finding_null_frequency(y_original, y_test):
    unique_y_values = y_original.unique()

    # Build (class, count) pairs for every class in the full response.
    lst = map(lambda x: (x, y_original[y_original == x].size), unique_y_values)

    # Pick the class with the largest count.
    frequent_class = max(lst, key=lambda x: x[1])[0]

    # Null accuracy: share of test observations that belong to that class.
    return y_test[y_test == frequent_class].size / y_test.size

In [22]:
finding_null_frequency(y, y_test)

0.8199608610567515
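The same figure can be cross-checked from the confusion matrix alone, since each row total is the number of test reviews in that class. This is a minimal sketch using only the standard library; `y_test_labels` is a reconstructed stand-in for `y_test`, and unlike the function above it picks the majority class from the test labels themselves (here both approaches agree on 5-star).

```python
from collections import Counter

# Class sizes in the test set, read off the rows of the confusion matrix:
# 126 + 58 = 184 one-star reviews and 25 + 813 = 838 five-star reviews.
y_test_labels = [1] * 184 + [5] * 838

most_common_label, most_common_count = Counter(y_test_labels).most_common(1)[0]
null_accuracy = most_common_count / len(y_test_labels)
print(most_common_label, null_accuracy)
```

With a pandas Series, `y_test.value_counts(normalize=True).max()` should give the same number in one line.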

## Task 7 (4 pts)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [23]:
# Finding what is considered 'positive' and 'negative' by sklearn
print(y_test[(y_test == 5) & (y_pred_class == 5)].size)
print(y_test[(y_test == 1) & (y_pred_class == 1)].size)

813
126


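Comparing these counts with the confusion matrix shows that scikit-learn orders the classes by label value: 1-star reviews occupy the first row and column (index 0, the "negative" class) and 5-star reviews occupy the second (index 1, the "positive" class).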

In [24]:
# Definition:
#  - False Positive: Negative review classified as positive
#  - False Negative: Positive review classified as negative

# False Positive
X_test[y_pred_class > y_test]

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
8283    Don't know where I should start. Grand opening...
2765    Went last week, and ordered a dozen variety. I...
2839    Never Again,\nI brought my Mountain Bike in (w...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
2490    Lazy Q CLOSED in 2010.  New Owners cleaned up ...
9125    La Grande Orange Grocery has a problem. It can...
9185    For frozen yogurt quality, I give this place a...
436     this another place that i would give no stars ...
2051    Sadly with new owners comes changes on menu.  ...
1721    This is the closest to a New York hipster styl...
3447    If you want a school that cares more about you...
842     Boy is

In [25]:
# Some of the false positive reviews have positive words
print(X_test[4311])

Donuts are really good, if they have any when you get there!!!  Went in on a Tuesday morning at 1030, and they only had a total of 10 donuts. Drove out of my way to go there and still ended up at Dunkin Donuts. Very disappointed!!


In [26]:
# Some are very lengthy
print(X_test[5818])

Most horrible buffet I have ever been to.

My boyfriend and I are big chinese buffet fans. A ton of them have raised their prices recently (and, not surprisingly, have been closing down left and right).. so the prospect of a good buffet at 5-8 dollars was absolutely amazing. We read all the reviews on yelp and I had high hopes.

Dear lord. The food is awful here. Inedible. Tastes very distinctly like a bad frozen dinner.  The only thing I remember being remotely food-like was some kind of beef and onion dish.. which tasted a bit like a philly cheesesteak. not exactly asian fare..

I tried really hard to like this place. My boyfriend didn't like any of it either. I don't understand how they have such good ratings. The ONLY thing I like about this place is their dessert bar, because they have toppings for the frozen yogurt. It always annoyed me how so many buffets don't even have sprinkles.. so they've got that going for them.. though I think I'll just go to Mojo or Yogurtland. haha.

In

In [27]:
# False negative
X_test[y_pred_class < y_test]

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
3448    I was there last week with my sisters and whil...
6050    I went to sears today to check on a layaway th...
2504    I've passed by prestige nails in walmart 100s ...
2475    This place is so great! I am a nanny and had t...
241     I was sad to come back to lai lai's and they n...
3149    I was told to see Greg after a local shop diag...
423     These guys helped me out with my rear windshie...
763     Here's the deal. I said I was done with OT, bu...
8956    I took my computer to RedSeven recently when m...
750     This store has the most pleasant employees of ...
9765    You can't give anything less than 5 stars to a...
6334    I came here today for a manicure and pedicure....
1282    Loved 

In [28]:
# Some of the false negative reviews give little indication that they are positive (few or no positive adjectives)
print(X_test[750])
print()
print(X_test[1282])

This store has the most pleasant employees of any Forever 21 I have ever been to. The girls are always smiling and they take the time to esquire if you need help. The other day, I went in and an employee spent over 10 minutes helping me locate a particular skirt my sister wanted for Christmas.

Loved my haircut. Walked in and waited for just a minute. The stylist cut exactly like I wanted and I walked out paying 20 which included the tip.


One weakness of the Naive Bayes model is that it assumes the features are independent of each other, which is often not the case for text. As a result, it can miss the meaning of multi-word phrases such as "not happy", because the phrase still contains a positive word.

Moreover, as the samples above show, some of the false positives contain positive words without positive intent, which gives the model the wrong signal. Others are very long (a lot of noise) or contain no clearly positive or negative words (a weak signal). These cases can also cause the model to fail occasionally.
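The "not happy" problem can be reproduced on a toy example. The sketch below is a from-scratch multinomial Naive Bayes with add-one smoothing, written only to illustrate the independence assumption; it is not scikit-learn's `MultinomialNB`, and the function names and the five toy reviews are made up for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log-likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    token_counts = {label: Counter() for label in set(labels)}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    doc_counts = Counter(labels)
    log_prior = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c, counts in token_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {tok: math.log((counts[tok] + alpha) / denom) for tok in vocab}
    return log_prior, log_likelihood

def score(doc, log_prior, log_likelihood):
    """Sum independent per-token contributions; unseen tokens are skipped."""
    return {
        c: log_prior[c] + sum(log_likelihood[c][tok] for tok in doc.split()
                              if tok in log_likelihood[c])
        for c in log_prior
    }

# Hypothetical toy training set: three 5-star and two 1-star reviews
toy_docs = ["happy happy great", "great food happy", "not bad great",
            "bad food rude", "not great rude"]
toy_labels = [5, 5, 5, 1, 1]

log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
scores = score("not happy", log_prior, log_likelihood)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Because "not" and "happy" are scored separately, the strong 5-star association of "happy" outweighs the nearly neutral "not", and the negated phrase is classified as 5-star.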

## Task 8 (4 pts)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [29]:
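# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
X_train_tokens = vect.get_feature_names()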
X_train_tokens[0:10]

['00', '000', '00a', '00am', '00pm', '01', '02', '03', '03342', '04']

In [30]:
X_train_tokens[-10:]

['zucchini',
 'zuchinni',
 'zumba',
 'zupa',
 'zuzu',
 'zwiebel',
 'zzed',
 'éclairs',
 'école',
 'ém']

In [31]:
tokens = pd.DataFrame({'token': X_train_tokens, 'bad':nb.feature_count_[0, :], 'good':nb.feature_count_[1, :]}).set_index('token')

In [32]:
tokens.sample(5, random_state=1)

token,bad,good
chopstick,0.0,1.0
clipped,1.0,1.0
decisions,2.0,4.0
satori,0.0,1.0
performances,0.0,1.0


In [33]:
# Add value of 1 to avoid 0 division error
tokens['bad'] += 1
tokens['good'] += 1
tokens.sample(5, random_state=1)

token,bad,good
chopstick,1.0,2.0
clipped,2.0,2.0
decisions,3.0,5.0
satori,1.0,2.0
performances,1.0,2.0


In [34]:
# Calculating the frequency of each word
tokens['bad'] /= nb.class_count_[0]
tokens['good'] /= nb.class_count_[1]
tokens.sample(5, random_state=1)

token,bad,good
chopstick,0.00177,0.0008
clipped,0.00354,0.0008
decisions,0.00531,0.002001
satori,0.00177,0.0008
performances,0.00177,0.0008


In [35]:
# Calculating the bad-to-good ratio for each word
tokens['ratio'] = tokens['bad'] / tokens['good']
tokens.sample(5, random_state=1)

token,bad,good,ratio
chopstick,0.00177,0.0008,2.211504
clipped,0.00354,0.0008,4.423009
decisions,0.00531,0.002001,2.653805
satori,0.00177,0.0008,2.211504
performances,0.00177,0.0008,2.211504
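The arithmetic behind these ratios can be checked by hand. The sketch below recomputes two of them from the raw counts in the first sample table. The class sizes are not printed anywhere in the notebook: 565 and 2499 are inferred from the frequencies shown (1/565 ≈ 0.00177, 2/2499 ≈ 0.0008) and add up to the 3064 training reviews, so treat them as assumptions. The helper name `bad_to_good_ratio` is also illustrative.

```python
# Inferred class sizes (assumption): number of 1-star and 5-star training reviews
N_ONE_STAR, N_FIVE_STAR = 565, 2499

def bad_to_good_ratio(bad_count, good_count):
    """Add 1 to each raw count, normalise by class size, then divide."""
    bad_freq = (bad_count + 1) / N_ONE_STAR
    good_freq = (good_count + 1) / N_FIVE_STAR
    return bad_freq / good_freq

chopstick_ratio = bad_to_good_ratio(0, 1)   # raw counts: bad=0, good=1
clipped_ratio = bad_to_good_ratio(1, 1)     # raw counts: bad=1, good=1
print(round(chopstick_ratio, 6), round(clipped_ratio, 6))
```

Note that "clipped" appears once in each class yet gets a ratio above 4, because a single occurrence among 565 reviews is a much higher rate than a single occurrence among 2499. Ratios for rare tokens are therefore driven largely by class imbalance and smoothing.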


In [36]:
tokens.sort_values('ratio', ascending=True)[:10]

token,bad,good,ratio
fantastic,0.00354,0.077231,0.045834
perfect,0.00531,0.098039,0.054159
yum,0.00177,0.02481,0.071339
favorite,0.012389,0.138055,0.089742
outstanding,0.00177,0.019608,0.090265
brunch,0.00177,0.016807,0.10531
gem,0.00177,0.016006,0.110575
mozzarella,0.00177,0.015606,0.11341
pasty,0.00177,0.015606,0.11341
amazing,0.021239,0.185274,0.114635


Above are the 10 tokens most predictive of 5-star reviews. Interestingly, 'mozzarella' is on the list, which may suggest that dishes with mozzarella tend to appear in favourable reviews, although this is a correlation in a small sample rather than evidence that mozzarella boosts ratings.

In [37]:
tokens.sort_values('ratio', ascending=False)[:10]

token,bad,good,ratio
staffperson,0.030088,0.0004,75.19115
refused,0.024779,0.0004,61.922124
disgusting,0.042478,0.0008,53.076106
filthy,0.019469,0.0004,48.653097
unacceptable,0.015929,0.0004,39.80708
acknowledge,0.015929,0.0004,39.80708
unprofessional,0.015929,0.0004,39.80708
ugh,0.030088,0.0008,37.595575
yuck,0.028319,0.0008,35.384071
fuse,0.014159,0.0004,35.384071


Above are the 10 tokens most predictive of 1-star reviews. 'staffperson' tops the list, which suggests that complaints about staff are a prominent theme in the worst reviews.