# Homework with Yelp reviews data

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.json`** is the original format of the file. **`yelp.csv`** contains the same data, in a more convenient format. Both of the files are in the course repo (in the **`data`** directory), so there is no need to download the data from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.
- The **cool** column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
- The **useful** and **funny** columns are similar to the **cool** column.

**Goal:** Predict the star rating of a review using **only** the review text. (We will not be using the other columns.)

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [1]:
# use print only as a function
from __future__ import print_function

In [2]:
import pandas as pd

In [4]:
data_df = pd.read_csv('../data/yelp.csv')
data_df.head()

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


In [5]:
yelp_pred = data_df[['stars','text']]

In [6]:
yelp_pred.head()

| | stars | text |
|---|---|---|
| 0 | 5 | My wife took me here on my birthday for breakf... |
| 1 | 5 | I have no idea why some people give bad review... |
| 2 | 4 | love the gyro plate. Rice is so good and I als... |
| 3 | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... |
| 4 | 5 | General Manager Scott Petello is a good egg!!!... |


## Task 1 (Alternative)

Ignore the **`yelp.csv`** file, and instead construct this DataFrame manually using **`yelp.json`**. This involves reading the file into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.

**Note:** This may be a challenging task, so I recommend skipping it unless you are fluent with Python and Pandas.
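
The steps described above can be sketched on a small in-memory sample. This is only a sketch under assumptions: it assumes that each line of `yelp.json` is one JSON object and that the vote counts are nested under a `votes` key. The two records below are made up for illustration, and `sample`, `records`, and `yelp_json_df` are hypothetical names.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of yelp.json (assumed layout:
# one JSON object per line, with the vote counts nested under a "votes" key).
sample = io.StringIO(
    '{"stars": 5, "text": "Great food", "votes": {"cool": 2, "useful": 5, "funny": 0}}\n'
    '{"stars": 1, "text": "Never again", "votes": {"cool": 0, "useful": 1, "funny": 3}}\n'
)

# decode each line into a dictionary, then build a DataFrame from the list
records = [json.loads(line) for line in sample]
yelp_json_df = pd.DataFrame(records)

# add one column per vote type, then drop the nested dictionary column
for vote_type in ['cool', 'useful', 'funny']:
    yelp_json_df[vote_type] = yelp_json_df.votes.apply(lambda v: v[vote_type])
yelp_json_df = yelp_json_df.drop('votes', axis=1)

print(yelp_json_df)
```

To try this on the real file, the `io.StringIO` object would be replaced by an open file handle, provided the file really follows the assumed layout.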

In [None]:
# import json
# read file as json, come back to later
#with open('../data/yelp.json', 'rb') as f:
    #data = f.readlines()

In [None]:
# data = map(lambda x: x.rstrip(), data)

## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** You will need to filter the DataFrame using an OR condition. [Working with DataFrames](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/) has an example of this.

In [12]:
yelp1_5 = yelp_pred[(yelp_pred.stars == 5) | (yelp_pred.stars == 1)]

In [13]:
type(yelp1_5)

pandas.core.frame.DataFrame

In [14]:
print(yelp1_5.head())

   stars                                               text
0      5  My wife took me here on my birthday for breakf...
1      5  I have no idea why some people give bad review...
3      5  Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4      5  General Manager Scott Petello is a good egg!!!...
6      5  Drop what you're doing and drive here. After I...


In [16]:
yelp1_5.describe()

| | stars |
|---|---|
| count | 4086.0 |
| mean | 4.266765 |
| std | 1.547868 |
| min | 1.0 |
| 25% | 5.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


In [17]:
yelp1_5.shape

(4086, 2)

In [18]:
# count the number of reviews in each class
yelp1_5.stars.value_counts()

5    3337
1     749
dtype: int64

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [19]:
# dot notation (yelp1_5.text) and bracket notation (yelp1_5['text']) both return a Series
X = yelp1_5.text
y = yelp1_5.stars
print(X.shape)
print(y.shape)

(4086,)
(4086,)


In [20]:
type(X)
type(y)

pandas.core.series.Series

In [21]:
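# the index skips numbers because filtering keeps the original row labels:
# the rows for 2-, 3- and 4-star reviews were dropped (reset_index(drop=True) would renumber them)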
X.head(10)

0     My wife took me here on my birthday for breakf...
1     I have no idea why some people give bad review...
3     Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4     General Manager Scott Petello is a good egg!!!...
6     Drop what you're doing and drive here. After I...
9     Nobuo shows his unique talents with everything...
10    The oldish man who owns the store is as sweet ...
11    Wonderful Vietnamese sandwich shoppe. Their ba...
12    They have a limited time thing going on right ...
17    okay this is the best place EVER! i grew up sh...
Name: text, dtype: object

In [24]:
y.head(15)

0     5
1     5
3     5
4     5
6     5
9     5
10    5
11    5
12    5
17    5
21    5
22    5
23    1
24    5
26    5
Name: stars, dtype: int64

In [25]:
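# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)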
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3064,)
(1022,)
(3064,)
(1022,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [26]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [27]:
# fit learns the vocabulary of the training data; transform then counts how often
# each vocabulary term appears in each document. The result is a "document-term
# matrix" (dtm), stored as a sparse matrix: documents are the rows, terms are the columns.
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

# X_test is only transformed, never fit, so the testing matrix counts ONLY the terms
# that were learned from X_train. Words that appear only in X_test are ignored.
X_train_dtm

<3064x16825 sparse matrix of type '<class 'numpy.int64'>'
	with 237720 stored elements in Compressed Sparse Row format>
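
A toy corpus makes the difference between `fit` and `transform` easier to see. The documents below are made up for illustration, and `toy_vect`, `train_docs`, and `test_docs` are hypothetical names that are not used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents, made up for illustration
train_docs = ['good food', 'bad service']
test_docs = ['good service and amazing food']

toy_vect = CountVectorizer()
toy_vect.fit(train_docs)                      # learns the vocabulary: bad, food, good, service
toy_test_dtm = toy_vect.transform(test_docs)  # counts only the learned terms

# the columns follow the alphabetical vocabulary learned during fit
print(sorted(toy_vect.vocabulary_))
print(toy_test_dtm.toarray())
```

The testing document gets one column per training term, so "and" and "amazing" have no column and are silently dropped.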

In [57]:
# alternative: combine fit and transform into a single step (preferred for the training data)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<3064x16825 sparse matrix of type '<class 'numpy.int64'>'
	with 237720 stored elements in Compressed Sparse Row format>

In [68]:
# list the learned vocabulary (in scikit-learn 1.0+, use get_feature_names_out() instead)
vect.get_feature_names()

['00',
 '000',
 '00a',
 '00am',
 '00pm',
 '01',
 '02',
 '03',
 '03342',
 '04',
 '05',
 '06',
 '07',
 '09',
 '0buxoc0crqjpvkezo3bqog',
 '0l',
 '10',
 '100',
 '1000',
 '1000x',
 '1001',
 '100th',
 '101',
 '102',
 '105',
 '1070',
 '108',
 '10am',
 '10ish',
 '10min',
 '10mins',
 '10minutes',
 '10pm',
 '10th',
 '10x',
 '11',
 '110',
 '1100',
 '111',
 '111th',
 '112',
 '115th',
 '118',
 '11a',
 '11am',
 '11p',
 '11pm',
 '12',
 '120',
 '128i',
 '129',
 '12am',
 '12oz',
 '12pm',
 '12th',
 '13',
 '14',
 '140',
 '147',
 '14lbs',
 '15',
 '150',
 '1500',
 '150mm',
 '15am',
 '15mins',
 '15pm',
 '15th',
 '16',
 '160',
 '165',
 '169',
 '16th',
 '17',
 '17p',
 '18',
 '180',
 '18th',
 '19',
 '1900',
 '1913',
 '1928',
 '1929',
 '1930s',
 '1940',
 '1952',
 '1955',
 '1956',
 '1960',
 '1961',
 '1969',
 '1970',
 '1980',
 '1980s',
 '1987',
 '1990s',
 '1992',
 '1995',
 '1996',
 '1998',
 '1999',
 '19th',
 '1cent',
 '1k',
 '1p',
 '1pm',
 '1st',
 '20',
 '200',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007'

In [59]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1022x16825 sparse matrix of type '<class 'numpy.int64'>'
	with 77006 stored elements in Compressed Sparse Row format>

In [37]:
# vect.fit(X_train) stored the learned vocabulary on the vect object itself,
# so get_feature_names() needs no arguments: it returns the tokens found during fit
# (in scikit-learn 1.0+, use get_feature_names_out() instead)
X_train_tokens = vect.get_feature_names()

In [38]:
# examine the first 50 tokens
print(X_train_tokens[0:50])

['00', '000', '00pm', '02', '04', '05', '06', '10', '100', '1000', '100s', '101', '1030', '105', '108', '109', '10am', '10pm', '10th', '10yo', '11', '110', '115', '116', '11am', '12', '120', '13', '1300', '13331', '13th', '14', '15', '150', '157', '16', '16th', '17', '175', '17th', '18', '1800', '1895', '19', '1968', '1978', '1980s', '1990', '1997', '19th']


In [43]:
type(X_train_tokens)

list

In [39]:
# examine the last 50 tokens
print(X_train_tokens[-50:])

['york', 'yorker', 'yorkers', 'you', 'youd', 'young', 'younger', 'youngest', 'youngggg', 'youngtown', 'your', 'yours', 'yourself', 'youth', 'yr', 'yuck', 'yum', 'yumm', 'yumminess', 'yummmayyyy', 'yummmmm', 'yummmmmmmm', 'yummo', 'yummy', 'yup', 'yur', 'yuri', 'yyyeeaahhhh', 'zen', 'zero', 'zesty', 'zichini', 'zillion', 'zin', 'zinburger', 'zinc', 'zip', 'ziploc', 'zipps', 'zoe', 'zone', 'zoners', 'zones', 'zoo', 'zoom', 'zucchini', 'zuccini', 'zupas', 'zuzu', 'zuzus']


In [40]:
X_train_dtm.shape

(3064, 16825)

In [60]:
import numpy as np
# count how many times EACH token appears across ALL reviews in X_train_dtm
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts

array([65,  9,  1, ...,  1,  1,  1], dtype=int64)

## Task 5

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [61]:
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [62]:
# train a Naive Bayes model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [63]:
# quick aside: Naive Bayes will count the features in each class for you!
nb.feature_count_

array([[ 26.,   4.,   1., ...,   0.,   0.,   0.],
       [ 39.,   5.,   0., ...,   1.,   1.,   1.]])

In [67]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [68]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.91878669275929548

In [69]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[126,  58],
       [ 25, 813]])
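
As a quick check, the accuracy can be recovered from the confusion matrix itself. In scikit-learn the rows are the actual classes and the columns are the predicted classes, both in sorted order (1-star first, then 5-star), so the correct predictions sit on the diagonal. The name `confusion` is a hypothetical one used only in this sketch.

```python
import numpy as np

# confusion matrix printed above: rows are actual classes, columns are predicted classes
confusion = np.array([[126, 58],
                      [25, 813]])

correct = np.trace(confusion)   # 126 + 813 correct predictions on the diagonal
total = confusion.sum()         # 1022 reviews in the testing set
print(correct / total)
```

This matches the value returned by `metrics.accuracy_score`: 939 correct predictions out of 1022.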

## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
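
One possible approach, shown as a sketch rather than the intended solution, is to take the share of the most frequent class in the testing set. Taking the mean of `y_test` would not work here because the classes are coded 1 and 5 rather than 0 and 1. The Series below is a stand-in rebuilt from the row totals of the Task 5 confusion matrix, and `y_test_sketch` is a hypothetical name; with the real data you would use `y_test` directly.

```python
import pandas as pd

# stand-in for y_test, rebuilt from the row totals of the Task 5 confusion matrix
# (126 + 58 = 184 one-star reviews, 25 + 813 = 838 five-star reviews)
y_test_sketch = pd.Series([1] * 184 + [5] * 838)

# share of the most frequent class = accuracy of always predicting that class
null_accuracy = y_test_sketch.value_counts(normalize=True).max()
print(null_accuracy)
```

That works out to 838 / 1022, or about 0.82, which is the baseline the Naive Bayes accuracy should be compared against.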

## Task 7 (Challenge)

Calculate which 5 tokens are the most predictive of **5-star reviews**, and which 5 tokens are the most predictive of **1-star reviews**.

- **Hint:** Use the `feature_count_` attribute from the Naive Bayes model object as a shortcut, so that you don't have to do any NumPy math.
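
Here is one way the hint could be followed, sketched on a toy corpus. The reviews are made up, all `toy_*` names are hypothetical, and the scoring rule (add 1 to each count, scale by the number of reviews in the class, then take the ratio) is just one reasonable choice, not necessarily the expected solution.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy reviews, made up for illustration (not taken from yelp.csv)
toy_text = ['great food great staff', 'great place', 'love the food',
            'horrible food', 'rude staff horrible place']
toy_stars = [5, 5, 5, 1, 1]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_text)
toy_nb = MultinomialNB().fit(toy_dtm, toy_stars)

# rows of feature_count_ follow toy_nb.classes_, which is sorted: [1, 5]
# the index lists the tokens in column order
tokens = pd.DataFrame({'one_star': toy_nb.feature_count_[0],
                       'five_star': toy_nb.feature_count_[1]},
                      index=sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))

# add 1 to avoid dividing by zero, then scale by the number of reviews per class
tokens['one_star'] = (tokens.one_star + 1) / toy_nb.class_count_[0]
tokens['five_star'] = (tokens.five_star + 1) / toy_nb.class_count_[1]
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star

# the highest ratios lean toward 5 stars, the lowest ratios lean toward 1 star
print(tokens.sort_values('five_star_ratio', ascending=False))
```

Scaling by the class size matters on the real data because there are far more 5-star reviews than 1-star reviews, so raw counts alone would favor the 5-star class.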

## Task 8 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
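
Boolean filtering is one way to pull out the misclassified reviews. The sketch below assumes the 5-star class is treated as "positive" (it is the second label in sorted order, which is where the positive class sits in a binary confusion matrix). The data and all `toy_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-ins for X_test, y_test and y_pred_class (made up for illustration)
toy_X_test = pd.Series(['loved it', 'great staff but awful food', 'not bad at all', 'terrible'])
toy_y_test = pd.Series([5, 1, 5, 1])
toy_y_pred = np.array([5, 5, 1, 1])

# false positives: actually 1-star, but predicted 5-star
false_positives = toy_X_test[(toy_y_test == 1) & (toy_y_pred == 5)]

# false negatives: actually 5-star, but predicted 1-star
false_negatives = toy_X_test[(toy_y_test == 5) & (toy_y_pred == 1)]

print(false_positives)
print(false_negatives)
```

With the real objects, the same filters would be applied to `X_test`, `y_test`, and `y_pred_class`.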

## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy.
- Print the confusion matrix.
- Comment on the results.
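
The steps above can be sketched end to end on synthetic data. Everything below is made up for illustration: the phrases, the `toy_df` DataFrame, and the other variable names are hypothetical, and the `stratify` argument is an addition that keeps every class present in the tiny testing set. With the real data, X and y would come from the original DataFrame instead.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# synthetic stand-in for the full DataFrame: 8 made-up reviews per star rating
phrases = {1: 'terrible rude awful', 2: 'poor slow bland', 3: 'okay average fine',
           4: 'good tasty friendly', 5: 'amazing perfect love'}
toy_df = pd.DataFrame([{'stars': stars, 'text': text}
                       for stars, text in phrases.items() for _ in range(8)])

# define X and y (y contains 5 different classes)
X_all = toy_df.text
y_all = toy_df.stars

# stratify so that every class shows up in the small testing set
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=1, stratify=y_all)

# create document-term matrices
vect5 = CountVectorizer()
X_tr_dtm = vect5.fit_transform(X_tr)
X_te_dtm = vect5.transform(X_te)

# train the model and predict the testing set
nb5 = MultinomialNB().fit(X_tr_dtm, y_tr)
y_pred5 = nb5.predict(X_te_dtm)

accuracy = metrics.accuracy_score(y_te, y_pred5)          # testing accuracy
null_acc = y_te.value_counts(normalize=True).max()        # null accuracy
cm = metrics.confusion_matrix(y_te, y_pred5, labels=[1, 2, 3, 4, 5])
print(accuracy)
print(null_acc)
print(cm)
```

The synthetic reviews are trivially separable, so the numbers printed here say nothing about how the model performs on the Yelp data.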