# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [20]:
import pandas as pd
path = 'data/yelp.csv'
review = pd.read_csv(path, index_col = 0)

In [21]:
review.shape

(10000, 9)

In [25]:
review.head()

| business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|
| 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


In [29]:
review.stars.value_counts()

4    3526
5    3337
3    1461
2     927
1     749
Name: stars, dtype: int64

## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [45]:
review_extremes = review[(review.stars == 5) | (review.stars == 1)]

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [48]:
X = review_extremes.text
y = review_extremes.stars

In [49]:
print(X.shape)
print(y.shape)

(4086,)
(4086,)


In [50]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3064,)
(1022,)
(3064,)
(1022,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [51]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

vect.fit(X_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [52]:
X_train_dtm = vect.transform(X_train)
print(X_train_dtm.shape)

(3064, 16530)


In [53]:
X_test_dtm = vect.transform(X_test)
print(X_test_dtm.shape)

(1022, 16530)
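
Both matrices have 16530 columns because the vocabulary is learned from the training text only, and `transform` reuses it; tokens that appear only in the testing reviews are ignored. The sketch below mimics that behavior in plain Python on made-up sentences. It is a toy illustration, not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) are hypothetical, and only the token pattern is taken from the CountVectorizer output above.

```python
import re
from collections import Counter

# Same default token pattern shown in the CountVectorizer repr: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Learn a token -> column index mapping from the training documents only."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count vocabulary tokens in each document; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Made-up example sentences (not from yelp.csv)
train_docs = ["The food was great", "Great service, great food"]
test_docs = ["The pizza was great"]

vocabulary = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocabulary)
test_dtm = transform(test_docs, vocabulary)

print(list(vocabulary))  # ['food', 'great', 'service', 'the', 'was']
print(train_dtm)         # [[1, 1, 0, 1, 1], [1, 2, 1, 0, 0]]
print(test_dtm)          # [[0, 1, 0, 1, 1]]  ('pizza' is not in the vocabulary)
```

This is why the vectorizer is fit on `X_train` only: the testing matrix must use exactly the same columns as the training matrix.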


## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [54]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
nb.class_count_

array([  560.,  2504.])

In [57]:
y_pred_class = nb.predict(X_test_dtm)
print(y_pred_class.shape)

(1022,)


In [59]:
from sklearn import metrics
classification_accuracy = metrics.accuracy_score(y_test, y_pred_class)
print(classification_accuracy)

0.92759295499


In [61]:
C = metrics.confusion_matrix(y_test, y_pred_class)
print(C)

[[128  61]
 [ 13 820]]
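
As a cross-check, the accuracy can be recovered from the confusion matrix itself. In scikit-learn's convention, rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions. The following is a small sketch in plain Python 3 using the numbers printed above; the variable names are only illustrative.

```python
# Confusion matrix printed above: rows = actual (1-star, 5-star), columns = predicted
confusion = [[128, 61],
             [13, 820]]

# Correct predictions sit on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(round(accuracy, 6))  # 0.927593

# Fraction of each actual class that was predicted correctly
one_star_recall = confusion[0][0] / sum(confusion[0])
five_star_recall = confusion[1][1] / sum(confusion[1])
print(one_star_recall < five_star_recall)  # True
```

The model is much better at recognizing 5-star reviews than 1-star reviews, which is worth keeping in mind given how imbalanced the two classes are.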


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [110]:
print(y_train.value_counts())

5    2504
1     560
Name: stars, dtype: int64


In [72]:
null_accuracy = metrics.accuracy_score(y_test, [5]*len(y_test))
print(null_accuracy)

0.815068493151
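
The same number can be reached without scikit-learn by counting labels directly. The sketch below is one possible approach in plain Python 3; it rebuilds the testing labels from the row totals of the Task 5 confusion matrix (189 one-star and 833 five-star reviews), and the name `y_test_labels` is a stand-in for the real `y_test`.

```python
from collections import Counter

# Row totals of the Task 5 confusion matrix:
# 128 + 61 = 189 one-star reviews, 13 + 820 = 833 five-star reviews
y_test_labels = [1] * 189 + [5] * 833

# Always predicting the most frequent class gets exactly that class right
most_common_class, n_most_common = Counter(y_test_labels).most_common(1)[0]
null_accuracy = n_most_common / len(y_test_labels)

print(most_common_class, round(null_accuracy, 6))  # 5 0.815068
```

The model's accuracy of roughly 0.93 is therefore a real improvement over the 0.82 baseline, but a smaller one than the raw accuracy suggests.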


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [82]:
print("False positives: ")
# Use .iloc for positional access, since the index holds business_id strings
print(X_test[y_test < y_pred_class].iloc[1])

False positives: 
This place was messy and loud.  The food really wasn't great and the salsa bar looked like a three year old put it together.  Overall the only plus was the tortilla chips that were free before the food.  I will not be back unless one of my friends pays for me


In [83]:
print("False Negatives: ")
print(X_test[y_test > y_pred_class])

False Negatives: 
business_id
1F-pelV0fTduYV_vCrvjLA    When I met some friends for dinner at this res...
Y_TtMiH_nx33FH4C48XhsA    TJ was there for me when my water heater broke...
C_eWAEOvkHZ_IZYGwjtpmg       They have a mechanical bull.  Need I say more?
ywea9tHgyxdEymLls7wKPQ    When my youngest son graduated I took him to B...
PwtYeGu-19v9bU4nbP9UbA                       Unfortunately Out of Business.
8qL697NwICTc_ac0-26Ycw    I went to sears today to check on a layaway th...
EAMPV2fgs9cU21MXOgv3Ig    First, I'm sorry this review is lengthy, but i...
Nq7eB1wB2EArUICtiNePvQ    EXCELLENT CUSTOMER SERVICE! \n\nEven with Happ...
6FECmOLQSICW1ykyBbEHng    I was told to see Greg after a local shop diag...
FpnLEpRLtDvcJvmz2N1UdA    I came here today for a manicure and pedicure....
R3sbDS0YcJDedSmUjwE48Q    Tried going there for my 1st visit and they we...
tenKOmTRi2rjZAWwNCDv6w    This is the only auto repair place I've ever s...
tenKOmTRi2rjZAWwNCDv6w    There are certain people in your
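
The cells above select errors with `y_test < y_pred_class` and `y_test > y_pred_class`. That works because the only labels are 1 and 5 and the code treats 5 as the positive class: a false positive is an actual 1-star review predicted as 5-star, and a false negative is the reverse. The sketch below spells the same logic out on made-up labels (not taken from the dataset), in plain Python 3.

```python
# Made-up labels for illustration; 5 is treated as the positive class
actual = [5, 1, 1, 5, 5, 1]
predicted = [5, 5, 1, 1, 5, 5]

pairs = list(zip(actual, predicted))

# False positive: actually 1-star, predicted 5-star (actual < predicted)
false_positive_idx = [i for i, (a, p) in enumerate(pairs) if a == 1 and p == 5]

# False negative: actually 5-star, predicted 1-star (actual > predicted)
false_negative_idx = [i for i, (a, p) in enumerate(pairs) if a == 5 and p == 1]

print(false_positive_idx)  # [1, 5]
print(false_negative_idx)  # [3]
```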

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [91]:
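# In scikit-learn 1.0 and later, use vect.get_feature_names_out() (get_feature_names() was removed in 1.2)
X_train_tokens = vect.get_feature_names()
print(X_train_tokens[0:10])
print(len(X_train_tokens))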

[u'00', u'000', u'00am', u'00pm', u'01', u'02', u'03', u'03342', u'04', u'05']
16530


In [106]:
five_star_token_count = nb.feature_count_[1, :]
print(five_star_token_count)

[ 36.   7.   2. ...,   1.   1.   1.]


In [107]:
one_star_token_count = nb.feature_count_[0, :]
print(one_star_token_count)

[ 28.   4.   3. ...,   0.   0.   0.]


In [129]:
tokens = pd.DataFrame({'tokens':X_train_tokens, 'five_star':five_star_token_count, 'one_star':one_star_token_count}).set_index('tokens')

In [130]:
tokens.sample(5, random_state = 1)

| tokens | five_star | one_star |
|---|---|---|
| bourbon | 4.0 | 0.0 |
| staycation | 2.0 | 0.0 |
| charming | 22.0 | 2.0 |
| therein | 1.0 | 0.0 |
| visualize | 1.0 | 0.0 |


In [131]:
# class_count_ is ordered like the classes (1, 5): index 0 is the 1-star class (560 reviews)
# and index 1 is the 5-star class (2504 reviews)
tokens.five_star = tokens.five_star / nb.class_count_[1]
tokens.one_star = tokens.one_star / nb.class_count_[0]
tokens.sample(5, random_state = 1)

| tokens | five_star | one_star |
|---|---|---|
| bourbon | 0.001597 | 0.0 |
| staycation | 0.000799 | 0.0 |
| charming | 0.008786 | 0.003571 |
| therein | 0.000399 | 0.0 |
| visualize | 0.000399 | 0.0 |


In [132]:
tokens.sort_values('five_star', ascending = False)[0:10]

| tokens | five_star | one_star |
|---|---|---|
| the | 5.517971 | 7.278571 |
| and | 4.086661 | 4.601786 |
| to | 2.530351 | 4.091071 |
| of | 1.792332 | 2.2375 |
| is | 1.720048 | 1.3375 |
| it | 1.651757 | 2.275 |
| was | 1.390176 | 2.607143 |
| in | 1.386581 | 1.633929 |
| for | 1.301917 | 1.617857 |
| you | 1.146965 | 1.176786 |


In [133]:
tokens.sort_values('one_star', ascending = False)[0:10]

| tokens | five_star | one_star |
|---|---|---|
| the | 5.517971 | 7.278571 |
| and | 4.086661 | 4.601786 |
| to | 2.530351 | 4.091071 |
| was | 1.390176 | 2.607143 |
| it | 1.651757 | 2.275 |
| of | 1.792332 | 2.2375 |
| that | 0.984026 | 1.639286 |
| in | 1.386581 | 1.633929 |
| for | 1.301917 | 1.617857 |
| my | 1.071486 | 1.496429 |
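
Sorting by either column alone mostly surfaces common words such as "the" and "and", which are frequent in both classes. One way to get at tokens that are predictive of a class is to compare the two per-class frequencies, for example as a ratio. The sketch below is an assumption about how the task could be finished, not part of the original solution: it adds 1 to every count so that no ratio divides by zero, and it uses only the five raw counts from the sample shown earlier, with the class counts 2504 (5-star) and 560 (1-star) from `class_count_`. It is written in plain Python 3.

```python
# Number of training reviews per class, from nb.class_count_
five_star_docs = 2504
one_star_docs = 560

# Raw (five_star, one_star) token counts from the tokens.sample(...) output
raw_counts = {
    'bourbon': (4, 0),
    'staycation': (2, 0),
    'charming': (22, 2),
    'therein': (1, 0),
    'visualize': (1, 0),
}

ratios = {}
for token, (five, one) in raw_counts.items():
    # Add 1 to each count (assumed smoothing) before normalizing by class size
    five_freq = (five + 1) / five_star_docs
    one_freq = (one + 1) / one_star_docs
    ratios[token] = five_freq / one_freq

# Highest ratio = most indicative of a 5-star review among these tokens
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked[:2])  # ['charming', 'bourbon']
```

Applied to the full `tokens` DataFrame, the same idea would mean adding 1 to both count columns before dividing, then sorting by the ratio in descending order for 5-star tokens and ascending order for 1-star tokens. Note that rare tokens can end up with a ratio below 1 even when they never appear in 1-star reviews, because the smoothing term weighs more heavily on the smaller class.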


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [135]:
# Step 1: Define X and y
X = review.text
y = review.stars

In [136]:
print(X.shape)
print(y.shape)

(10000,)
(10000,)


In [144]:
# Step 2: Split X and y into training and testing sets
# (sklearn.cross_validation has been removed; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 3)

In [145]:
X_train.shape

(7500,)

In [146]:
X_test.shape

(2500,)

In [152]:
# Step 3: Create document-term matrices
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [153]:
X_train_dtm.shape

(7500, 25444)

In [154]:
y_train.value_counts()

4    2660
5    2509
3    1077
2     708
1     546
Name: stars, dtype: int64

In [155]:
# Step 4: Check accuracy of Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [156]:
y_pred_class = nb.predict(X_test_dtm)

In [158]:
from sklearn import metrics
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
print(testing_accuracy)

0.4848


In [160]:
# Null accuracy: always predict 4 stars, the most frequent class in y_train
null_accuracy = metrics.accuracy_score(y_test, [4]*len(y_test))
print(null_accuracy)

0.3464


In [165]:
C = metrics.confusion_matrix(y_test, y_pred_class)
C

array([[ 66,  34,  14,  65,  24],
       [ 12,  22,  37, 127,  21],
       [  7,  18,  38, 281,  40],
       [  7,   6,  26, 624, 203],
       [  5,   2,   6, 353, 462]])

In [167]:
from sklearn.metrics import classification_report

In [169]:
class_names = ['1 Star', '2 Star', '3 Star', '4 Star', '5 Star']
cr = classification_report(y_test, y_pred_class, target_names = class_names)
print(cr)

             precision    recall  f1-score   support

     1 Star       0.68      0.33      0.44       203
     2 Star       0.27      0.10      0.15       219
     3 Star       0.31      0.10      0.15       384
     4 Star       0.43      0.72      0.54       866
     5 Star       0.62      0.56      0.59       828

avg / total       0.48      0.48      0.45      2500
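
The per-class numbers in the report can be reproduced by hand from the confusion matrix above. For class k, precision is the diagonal entry divided by its column total (everything predicted as k), recall is the diagonal entry divided by its row total (everything that actually is k), and the F1 score is the harmonic mean of the two. The following is a minimal sketch in plain Python 3; the `report` dictionary is just an illustrative container.

```python
# Confusion matrix from above: rows = actual stars (1 to 5), columns = predicted stars
confusion = [
    [66, 34, 14, 65, 24],
    [12, 22, 37, 127, 21],
    [7, 18, 38, 281, 40],
    [7, 6, 26, 624, 203],
    [5, 2, 6, 353, 462],
]
class_names = ['1 Star', '2 Star', '3 Star', '4 Star', '5 Star']

n = len(confusion)
report = {}
for k, name in enumerate(class_names):
    true_pos = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(n))  # column total
    actual_k = sum(confusion[k])                          # row total (support)
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2), actual_k)

for name in class_names:
    print(name, report[name])

# Overall accuracy is the diagonal total over all test reviews
accuracy = sum(confusion[k][k] for k in range(n)) / sum(map(sum, confusion))
print(accuracy)  # 0.4848
```

The fourth column of the matrix explains most of the pattern: 1450 of the 2500 testing reviews are predicted as 4-star, which gives that class high recall (0.72) but low precision (0.43), while 2-star and 3-star reviews are rarely predicted at all.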

