# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [2]:
import pandas as pd
# Forward slashes work on every platform and avoid the invalid '\y' escape sequence
df = pd.read_csv('data/yelp.csv')
df.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [3]:
df2 = df[(df['stars']==5) | (df['stars']==1)]
df2.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4


In [4]:
print(df.shape)
print(df2.shape)

(10000, 10)
(4086, 10)
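
The same filter can also be written with `Series.isin`, which tends to read more cleanly once more than two values are kept. A minimal sketch on a hypothetical toy DataFrame (`toy` and its contents are made up for illustration and are not taken from `yelp.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for the Yelp reviews
toy = pd.DataFrame({
    'stars': [5, 3, 1, 4, 5],
    'text': ['great', 'okay', 'awful', 'good', 'loved it'],
})

# Boolean OR of two conditions, as in the cell above
either = toy[(toy['stars'] == 5) | (toy['stars'] == 1)]

# Equivalent filter using isin
with_isin = toy[toy['stars'].isin([1, 5])]

print(with_isin)
print(either.equals(with_isin))
```

Both expressions keep the original index labels, so later lookups by label (such as `X_test[4595]`) still refer to rows of the original DataFrame.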


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [6]:
X = df2.text
y = df2.stars
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3064,)
(1022,)
(3064,)
(1022,)
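
The hint about using a Series matters because CountVectorizer expects an iterable of documents, and iterating over a DataFrame yields its column labels rather than its rows. A small sketch of the difference, using a hypothetical two-row DataFrame named `toy`:

```python
import pandas as pd

# Hypothetical stand-in for the review data
toy = pd.DataFrame({'text': ['great food', 'awful service']})

as_series = toy['text']     # 1-D: iterating yields the review strings
as_frame = toy[['text']]    # 2-D: iterating yields the column labels

print(list(as_series))
print(list(as_frame))
print(as_series.shape, as_frame.shape)
```

Checking that `X.shape` has a single dimension, as in the output above, is a quick way to confirm that a Series was selected.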


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
X_train_dtm

<3064x16823 sparse matrix of type '<class 'numpy.int64'>'
	with 239078 stored elements in Compressed Sparse Row format>
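
To see why the vectorizer is fitted on the training set only, here is a small sketch with made-up documents (`train_docs` and `test_docs` are hypothetical). The vocabulary is learned from the training text, and any token that appears only in the testing text is silently dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, not taken from the Yelp data
train_docs = ['Good food', 'bad service']
test_docs = ['good service, slow']

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learn vocabulary, then count
test_dtm = toy_vect.transform(test_docs)        # count using the learned vocabulary only

# Map each learned token to its column index
print(sorted(toy_vect.vocabulary_.items(), key=lambda item: item[1]))
print(train_dtm.toarray())
print(test_dtm.toarray())
```

Because both matrices share the same columns, a model trained on `train_dtm` can score `test_dtm` directly. This is why `X_test_dtm` above has the same 16823 columns as `X_train_dtm`.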

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [8]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

0.909980430528
[[122  79]
 [ 13 808]]


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [9]:
print(y_train.value_counts().iloc[0] / len(y_train))

0.821148825065
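
The cell above measures class frequencies in `y_train`. For a like-for-like comparison with the testing accuracy, null accuracy can also be computed on the testing set. A minimal sketch, assuming the class counts implied by the rows of the Task 5 confusion matrix (201 one-star and 821 five-star reviews); `y_test_toy` is a stand-in built from those counts, not the real `y_test`:

```python
from collections import Counter

# Rebuild the testing-set labels from the confusion matrix row totals
# (row 1: 122 + 79 one-star reviews, row 2: 13 + 808 five-star reviews)
y_test_toy = [1] * (122 + 79) + [5] * (13 + 808)

counts = Counter(y_test_toy)
most_common_class, most_common_count = counts.most_common(1)[0]
null_accuracy = most_common_count / len(y_test_toy)

print(most_common_class, round(null_accuracy, 4))
# With the real pandas Series, y_test.value_counts(normalize=True).max()
# gives the same quantity.
```

On these counts the null accuracy is about 0.803, so the model's testing accuracy of roughly 0.910 is a clear improvement over always predicting 5 stars.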


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [11]:
FP = X_test[y_test < y_pred]
print(FP.head())

4595    I have been here once for lunch and sat outsid...
7112    The staff is friendly. But the physical hotel ...
715     What's with the cheese?  It isn't even Velveta...
1963                       Prices are often way too high!
6584    Jimmy Johns is cheaper and better ... The Capr...
Name: text, dtype: object


In [12]:
print(X_test[4595])

I have been here once for lunch and sat outside only to get eaten alive by flies, nothing worse than eating outside on a nice patio and being pestered the entire time. 

I decided to give it another chance for happy hour, well guess what? flies followed me inside too. 

I don't know if it was the time of year (early spring) or what, but they really need an exterminator!
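
The comparison `y_test < y_pred` works here because there are only two classes. A more explicit way to select both error types is to name the classes in the mask. The sketch below treats 5 stars as the positive class, which is an assumption that matches the row order of the confusion matrix in Task 5 (labels sorted ascending). All data in it is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical labels, predictions, and texts; the index mimics X_test's labels
idx = [10, 11, 12, 13, 14]
y_true_toy = pd.Series([1, 5, 1, 5, 5], index=idx)
y_pred_toy = np.array([5, 5, 1, 1, 5])  # predict() returns a NumPy array
text_toy = pd.Series(['a', 'b', 'c', 'd', 'e'], index=idx)

# False positive: actually 1 star, predicted 5 stars
false_positives = text_toy[(y_true_toy == 1) & (y_pred_toy == 5)]

# False negative: actually 5 stars, predicted 1 star
false_negatives = text_toy[(y_true_toy == 5) & (y_pred_toy == 1)]

print(false_positives)
print(false_negatives)
```

Applied to the real objects, the same masks on `X_test`, `y_test`, and `y_pred` should return the 79 false positives and 13 false negatives counted in the confusion matrix.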


## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [14]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
token = vect.get_feature_names_out()
print(len(token))
features = nb.feature_count_
print(nb.feature_count_.shape)

16823
(2, 16823)


In [15]:
tokens = pd.DataFrame({'token': token, 'one': features[0], 'five': features[1]}).set_index('token')
tokens.head()

token,five,one
00,36.0,30.0
000,8.0,3.0
00a,0.0,1.0
00am,2.0,2.0
00pm,6.0,0.0


In [16]:
tokens += 1
tokens.head()

token,five,one
00,37.0,31.0
000,9.0,4.0
00a,1.0,2.0
00am,3.0,3.0
00pm,7.0,1.0


In [17]:
tokens.one = tokens.one / nb.class_count_[0]
tokens.five = tokens.five / nb.class_count_[1]
tokens['five ratio'] = tokens.five / (tokens.five + tokens.one)
tokens.head()

token,five,one,five ratio
00,0.014706,0.056569,0.206325
000,0.003577,0.007299,0.328888
00a,0.000397,0.00365,0.098208
00am,0.001192,0.005474,0.178851
00pm,0.002782,0.001825,0.603904
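
As a check on the arithmetic, the row for `00pm` can be reproduced by hand. The class counts used below (548 one-star and 2516 five-star training reviews) are not printed in the notebook. They are inferred from the tables (for example, 31 / 0.056569 ≈ 548) and they sum to the 3064 training rows, so treat them as an assumption:

```python
# Raw counts for the token '00pm' from the first tokens.head() output
five_count, one_count = 6.0, 0.0

# Inferred values of nb.class_count_ (an assumption, see above)
n_one, n_five = 548, 2516

# Add 1 to every count so that no token has a zero frequency
five_smoothed = five_count + 1
one_smoothed = one_count + 1

# Normalize by the number of reviews in each class
five_freq = five_smoothed / n_five
one_freq = one_smoothed / n_one

# Share of the combined frequency that belongs to the 5-star class
five_ratio = five_freq / (five_freq + one_freq)

print(round(five_freq, 6), round(one_freq, 6), round(five_ratio, 6))
```

The three printed values should match the `00pm` row above. A ratio above 0.5 means the token is relatively more common in 5-star reviews, and a ratio below 0.5 means it is relatively more common in 1-star reviews.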


In [18]:
# Ten tokens with the highest share of their frequency in 5-star reviews
tokens.sort_values('five ratio', ascending=False).head(10)

token,five,one,five ratio
flavors,0.044913,0.001825,0.960956
fantastic,0.075517,0.003650,0.953899
perfect,0.096582,0.007299,0.929734
yum,0.023847,0.001825,0.928919
favorite,0.130763,0.010949,0.922738
perfection,0.018680,0.001825,0.911007
bianco,0.017488,0.001825,0.905513
gluten,0.017091,0.001825,0.903528
casual,0.016693,0.001825,0.901457
brunch,0.016693,0.001825,0.901457
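
The task also asks for the tokens most predictive of 1-star reviews, which are the ones with the lowest `five ratio`. A sketch of that lookup, using the five rows from the earlier `tokens.head()` output as stand-in data (`tokens_toy` is hypothetical; the real answer needs the full `tokens` DataFrame):

```python
import pandas as pd

# Stand-in for the full tokens DataFrame: the five rows shown earlier
tokens_toy = pd.DataFrame(
    {'five ratio': [0.206325, 0.328888, 0.098208, 0.178851, 0.603904]},
    index=pd.Index(['00', '000', '00a', '00am', '00pm'], name='token'),
)

# Lowest five ratio first: tokens relatively more common in 1-star reviews
one_star_tokens = tokens_toy.sort_values('five ratio', ascending=True).head(3)
print(one_star_tokens)
```

On the full DataFrame, replacing `head(3)` with `head(10)` would give the ten tokens asked for in the task.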


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [19]:
X = df.text
y = df.stars
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7500,)
(2500,)
(7500,)
(2500,)


In [20]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# Fit multinomial Naive Bayes on all five classes and predict the testing set
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

0.4844
[[ 61  22  17  65  18]
 [ 15  14  49 146  22]
 [  3  14  30 280  28]
 [ 11   1  19 678 202]
 [  5   1  10 361 428]]


In [21]:
print(y_train.value_counts().max() / len(y_train))

0.348666666667


In [22]:
print(metrics.classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          1       0.64      0.33      0.44       183
          2       0.27      0.06      0.09       246
          3       0.24      0.08      0.12       355
          4       0.44      0.74      0.56       911
          5       0.61      0.53      0.57       805

avg / total       0.47      0.48      0.44      2500
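
The per-class figures in the report can be recomputed by hand from the confusion matrix printed earlier, as the task suggests. In the sketch below, rows are true classes and columns are predicted classes. Precision divides each diagonal entry by its column total and recall divides it by its row total. The null accuracy here is taken from the testing set, which is a choice made for this sketch and differs from the `y_train`-based figure above:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([
    [61, 22, 17, 65, 18],
    [15, 14, 49, 146, 22],
    [3, 14, 30, 280, 28],
    [11, 1, 19, 678, 202],
    [5, 1, 10, 361, 428],
])

correct = np.diag(cm)                 # correctly classified reviews per class
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)     # divide by row totals (actual counts, i.e. support)
f1 = 2 * precision * recall / (precision + recall)

accuracy = correct.sum() / cm.sum()
null_accuracy = cm.sum(axis=1).max() / cm.sum()  # always predict the largest class

for star, p, r, f in zip(range(1, 6), precision, recall, f1):
    print(f'{star}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}')
print(f'accuracy={accuracy:.4f}  null accuracy={null_accuracy:.4f}')
```

The recomputed values agree with the report. They also show where the model struggles: most 2-star and 3-star reviews are predicted as 4 stars, which is why recall for those classes is below 0.1 even though overall accuracy (0.4844) beats the testing-set null accuracy (0.3644).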



In [23]:
# sort_values(inplace=True) returns None, so call it on its own line instead of printing it
df.sort_values('stars', inplace=True)
print(df.head(4))

                 business_id        date               review_id  stars  \
3456  0VLZj0f3llL9IkL5GOT3lg  2011-10-26  tuNdSzBSNmcwUXHLY0re0A      1   
1747  eGj1NnvbIUVWgDYQWEOwQg  2012-06-10  MEzqi22MaWQV1LMSSPmh6Q      1   
5543  kJFS_3WlP6TFdNUYt6V6FA  2011-09-26  mSB78PqDRD7jE5g4yZuaMQ      1   
4934  e8FMAuTswDueAlLsNyLhcA  2010-02-11  4vvdwQyS5uCSo74iw81irw      1   

                                                   text    type  \
3456  My husband and I went there for our first time...  review   
1747  This dog park gets one star.Why?Because the la...  review   
5543  This review is for the bar only.\nI had a date...  review   
4934  I'd have to disagree with the person who said ...  review   

                     user_id  cool  useful  funny  
3456  E8FSWBUXArSJU7cWpBXF7Q     0       0      0  
1747  qBAaWZxyuFnSZU0NzFlNDw     0       0      0  
5543  AYGHNy8gPxl2Q-etTT3hZw     2       6      6  
4934  W5Pd_GmMem2LdHZkoxQCtw     0       0      0  
