# Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering with existing SFrame and scikit-learn functions.

In this notebook you will use product review data from Amazon.com to predict whether the sentiment of a product review is positive or negative. You will:

* Use SFrames to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

Let's get started!
    
## Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create, or of the standalone `sframe` package that the cell below imports.

In [29]:
from __future__ import division
import sframe
import math
import string

## Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com.

In [18]:
products = sframe.SFrame('amazon_baby.gl/')

Now, let us see a preview of what the dataset looks like.

In [19]:
products

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


## Build the word count vector for each review

Let us explore a specific example of a baby product.


In [20]:
products[269]

{'name': 'The First Years Massaging Action Teether',
 'rating': 5.0,
 'review': 'A favorite in our house!'}

Now, we will perform two simple data transformations:

1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string functionality.
2. Transform the reviews into word-counts.

**Aside**. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach would preserve contractions such as "I'd", "would've", "hadn't" and so forth. See [this page](https://www.cis.upenn.edu/~treebank/tokenization.html) for an example of smart handling of punctuation.
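
As a rough sketch of the smarter handling mentioned in the aside, the snippet below strips punctuation but keeps an apostrophe that sits between two word characters, so contractions survive. The function name `remove_punctuation_keep_contractions` and the regular expression are illustrative assumptions; they are not part of this notebook or of the linked tokenizer.

```python
import re

# Hypothetical helper: drop punctuation, but keep an apostrophe when it
# sits between two word characters (as in "didn't" or "would've").
_PUNCT_EXCEPT_INNER_APOSTROPHE = re.compile(r"(?<!\w)'|'(?!\w)|[^\w\s']")

def remove_punctuation_keep_contractions(text):
    return _PUNCT_EXCEPT_INNER_APOSTROPHE.sub("", text)

cleaned = remove_punctuation_keep_contractions("We didn't like it, but they would've!")
print(cleaned)
```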

In [22]:
def remove_punctuation(text):
    # Python 2 str.translate: no translation table (None), delete every punctuation character.
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)
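
The two-argument `text.translate(None, string.punctuation)` form only works on Python 2 byte strings. A minimal sketch of an equivalent for Python 3, assuming plain `str` input, uses `str.maketrans`; the name `remove_punctuation_py3` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)

def remove_punctuation_py3(text):
    return text.translate(_DELETE_PUNCTUATION)

print(remove_punctuation_py3("it came early and was not disappointed. i love"))
```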


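Now, let us explore what the example above looks like after these 2 transformations. Here, each entry in the **word_count** column is a dictionary where the key is the word and the value is the number of times the word occurs. Note that the cells above only create **review_clean**, so the cell below assumes a **word_count** column has been added to `products` first.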

In [None]:
products[269]['word_count']
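
As a sketch of what such a word-count dictionary looks like, the following uses `collections.Counter` on a whitespace-split review. Splitting on whitespace and the helper name `word_count` are assumptions of this example; the notebook later relies on `CountVectorizer` for the actual features.

```python
from collections import Counter

def word_count(clean_text):
    # Map each whitespace-separated word to the number of times it occurs.
    return dict(Counter(clean_text.split()))

counts = word_count("A favorite in our house")
print(counts)
```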

## Extract sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [23]:
products = products[products['rating'] != 3]
len(products)

166752

Now, we will assign reviews with a rating of 4 or higher to be *positive* reviews, while the ones with rating of 2 or lower are *negative*. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label.

In [24]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
products

name,review,rating,review_clean,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,it came early and was not disappointed i love ...,1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,Very soft and comfortable and warmer than it ...,1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,This is a product well worth the purchase I ...,1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,All of my kids have cried nonstop when I tried to ...,1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,When the Binky Fairy came to our house we didnt ...,1
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,Lovely book its bound tightly so you may no ...,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,Perfect for new parents We were able to keep ...,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,A friend of mine pinned this product on Pinte ...,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,This has been an easy way for my nanny to record ...,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,I love this journal and our nanny uses it ...,1


Now, we can see that the dataset contains an extra column called **sentiment** which is either positive (+1) or negative (-1).
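
The filtering and labeling steps can be sketched on a plain Python list of ratings. The helper `label_sentiment` and the sample ratings are made up for illustration.

```python
def label_sentiment(ratings):
    # Drop neutral 3-star ratings, then map ratings above 3 to +1 and below 3 to -1.
    return [(r, +1 if r > 3 else -1) for r in ratings if r != 3]

labeled = label_sentiment([5.0, 3.0, 4.0, 1.0, 2.0])
print(labeled)
```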

## Split data into training and test sets

Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. We use `seed=1` so that everyone gets the same result.

In [25]:
train_data, test_data = products.random_split(.8, seed=1)
print len(train_data)
print len(test_data)

133416
33336
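
Note that 133416 is close to, but not exactly, 80% of 166752, which suggests each row is assigned to one side at random rather than the data being cut at a fixed index. A minimal sketch of that idea, assuming a per-row random draw (this is not SFrame's actual implementation, and `random_split_rows` is a hypothetical name):

```python
import random

def random_split_rows(rows, fraction, seed):
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        # Each row independently lands in the first part with probability `fraction`.
        (first if rng.random() < fraction else second).append(row)
    return first, second

train_rows, test_rows = random_split_rows(list(range(1000)), 0.8, seed=1)
print(len(train_rows), len(test_rows))
```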


In [27]:
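from sklearn.feature_extraction.text import CountVectorizer
# Keep single-character tokens too; the default token pattern drops them.
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# Learn the vocabulary from the training reviews and build a sparse word-count matrix.
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Reuse the training vocabulary for the test reviews.
test_matrix = vectorizer.transform(test_data['review_clean'])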
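
The custom `token_pattern` matters because scikit-learn's default pattern, `(?u)\b\w\w+\b`, only keeps tokens of two or more characters. The difference can be seen with the `re` module alone. The sample sentence is made up, and the call to `.lower()` imitates the lowercasing that `CountVectorizer` applies by default.

```python
import re

text = "I love it a lot".lower()

default_tokens = re.findall(r"(?u)\b\w\w+\b", text)  # scikit-learn's default pattern
custom_tokens = re.findall(r"\b\w+\b", text)         # pattern used in this notebook

print(default_tokens)  # ['love', 'it', 'lot']
print(custom_tokens)   # ['i', 'love', 'it', 'a', 'lot']
```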

In [31]:
import sklearn.linear_model  # import the submodules used below explicitly
import sklearn.metrics

In [33]:
sentiment_model = sklearn.linear_model.LogisticRegression()
sentiment_model.fit(X=train_matrix, y=train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [42]:
# Number of strictly positive coefficients
(sentiment_model.coef_ > 0).sum()

87243

In [43]:
sample_test_data = test_data[10:13]
print sample_test_data

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-----------+
|          review_clean         | sentiment |
+-------------------------------+-----------+
| Absolutely love it and all... |     1     |
| Would not purchase again o... |     -1    |
| Was so excited to get this... |     -1    |
+-------------------------------+-----------+
[3 rows x 5 columns]



In [45]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [46]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
# decision_function returns the raw score (weights . features + intercept) for each review
scores = sentiment_model.decision_function(sample_test_matrix)
print scores

[  5.60153832  -3.17045551 -10.42328017]


In [56]:
import numpy as np
# Note: 1 / (1 + exp(score)) is P(sentiment = -1); P(sentiment = +1) is 1 / (1 + exp(-score)).
prob_negative = 1 / (1 + np.exp(scores))
print prob_negative

[ 0.0036786   0.9597072   0.99997027]
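
For reference, the logistic (sigmoid) link maps a score `s` to `1 / (1 + exp(-s))`, the probability of the positive class, while `1 / (1 + exp(s))` is its complement, the probability of the negative class. A small sketch in plain Python, reusing the three scores printed above; the function name `sigmoid` is an assumption of this example.

```python
import math

def sigmoid(score):
    # Probability of the positive class under logistic regression.
    return 1.0 / (1.0 + math.exp(-score))

scores = [5.60153832, -3.17045551, -10.42328017]
prob_positive = [sigmoid(s) for s in scores]
prob_negative = [1.0 - p for p in prob_positive]

# The class prediction is +1 exactly when the score is positive (probability above 0.5).
predicted_class = [+1 if s > 0 else -1 for s in scores]
print(prob_positive, predicted_class)
```

The predicted classes (+1, -1, -1) agree with the **sentiment** column of the three sample reviews.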


In [108]:
# most positive ones: column 1 of predict_proba is P(sentiment = +1)
full_predictions = sentiment_model.predict_proba(X=test_matrix)[:,1]
sorted_predictions = full_predictions.argsort()[::-1][:20]
for i in sorted_predictions:
    print test_data[i]['name']

Graco FastAction Fold Jogger Click Connect Stroller, Grapeade
Freemie Hands-Free Concealable Breast Pump Collection System
Infantino Wrap and Tie Baby Carrier, Black Blueberries
Evenflo X Sport Plus Convenience Stroller - Christina
Fisher-Price Cradle 'N Swing,  My Little Snugabunny
Diono RadianRXT Convertible Car Seat, Plum
Evenflo 6 Pack Classic Glass Bottle, 4-Ounce
Simple Wishes Hands-Free Breastpump Bra, Pink, XS-L
Baby Jogger City Mini GT Single Stroller, Shadow/Orange
Baby Einstein Around The World Discovery Center
Graco Pack 'n Play Element Playard - Flint
Britax 2012 B-Agile Stroller, Red
Britax Decathlon Convertible Car Seat, Tiffany
Roan Rocco Classic Pram Stroller 2-in-1 with Bassinet and Seat Unit - Coffee
Mamas &amp; Papas 2014 Urbo2 Stroller - Black
Buttons Cloth Diaper Cover - One Size - 8 Color Options
P'Kolino Silly Soft Seating in Tias, Green
Summer Infant Wide View Digital Color Video Monitor
Ikea 36 Pcs Kalas Kids Plastic BPA Free Flatware, Bowl, Plate, Tumbler Set

In [109]:
# most negative ones: column 0 of predict_proba is P(sentiment = -1)
full_predictions = sentiment_model.predict_proba(X=test_matrix)[:,0]
sorted_predictions = full_predictions.argsort()[::-1][:20]
for i in sorted_predictions:
    print test_data[i]['name']

Fisher-Price Ocean Wonders Aquarium Bouncer
Levana Safe N'See Digital Video Baby Monitor with Talk-to-Baby Intercom and Lullaby Control (LV-TW501)
Safety 1st Exchangeable Tip 3 in 1 Thermometer
Adiri BPA Free Natural Nurser Ultimate Bottle Stage 1 White, Slow Flow (0-3 months)
VTech Communications Safe &amp; Sounds Full Color Video and Audio Monitor
The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit
Safety 1st High-Def Digital Monitor
Cloth Diaper Sprayer--styles may vary
Philips AVENT Newborn Starter Set
Motorola Digital Video Baby Monitor with Room Temperature Thermometer
Ellaroo Mei Tai Baby Carrier - Hershey
Cosco Alpha Omega Elite Convertible Car Seat
Chicco Cortina KeyFit 30 Travel System in Adventure
Belkin WeMo Wi-Fi Baby Monitor for Apple iPhone, iPad, and iPod Touch (Firmware Update)
Peg-Perego Tatamia High Chair, White Latte
NUK Cook-n-Blend Baby Food Maker
VTech Communications Safe &amp; Sound Digital Audio Monitor with two Parent Units
Safety 1st Delux
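
The `argsort()[::-1][:20]` idiom used in the two cells above sorts indices by ascending probability, reverses them, and keeps the first 20, which yields the indices of the 20 largest values. A toy sketch with made-up probabilities:

```python
import numpy as np

probabilities = np.array([0.10, 0.95, 0.40, 0.99, 0.70])  # made-up values

# Ascending sort order, reversed to descending, truncated to the top 3.
top3 = probabilities.argsort()[::-1][:3]
print(top3)  # [3 1 4]
print(probabilities[top3])
```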

In [134]:
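test_predictions = sentiment_model.predict(X=test_matrix)
num_correct = np.sum(test_predictions == test_data['sentiment'].to_numpy())
num_total = len(test_predictions)
print num_correct, num_total, round(num_correct/num_total, 4)
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric in the two.
print sklearn.metrics.accuracy_score(test_data['sentiment'].to_numpy(), test_predictions)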

31079 33336 0.9323
0.932295416367
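
Accuracy is the number of correctly classified reviews divided by the total number of reviews, which is what both printed lines above compute. A hand-written version, with the hypothetical name `accuracy`, might look like this:

```python
def accuracy(predictions, labels):
    # Fraction of positions where the predicted label equals the true label.
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))

print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))  # 0.75
print(round(31079 / float(33336), 4))            # 0.9323
```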


In [118]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [119]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
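
With a fixed `vocabulary`, each review becomes a vector with one count per listed word, in list order, and every other word is ignored. A plain-Python sketch of that mapping, assuming lowercasing and whitespace tokenization (a simplification of what `CountVectorizer` does); `count_vector` and the three-word vocabulary are illustrative.

```python
def count_vector(clean_text, vocabulary):
    tokens = clean_text.lower().split()
    # One entry per vocabulary word, in vocabulary order; other words are dropped.
    return [tokens.count(word) for word in vocabulary]

vocabulary = ['love', 'great', 'broke']
print(count_vector("Love it love the color but the strap broke", vocabulary))  # [2, 0, 1]
```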

In [120]:
simple_model = sklearn.linear_model.LogisticRegression()
simple_model.fit(X=train_matrix_word_subset, y=train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [129]:
simple_model_coef_table = sframe.SFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
simple_model_coef_table = simple_model_coef_table.sort(sort_columns='coefficient', ascending=False)
simple_model_coef_table.print_rows(num_rows=20)

+-----------------+--------------+
|   coefficient   |     word     |
+-----------------+--------------+
|  1.67307389259  |    loves     |
|  1.50981247669  |   perfect    |
|  1.36368975931  |     love     |
|  1.19253827349  |     easy     |
|  0.943999590572 |    great     |
|  0.520185762718 |    little    |
|  0.503760457767 |     well     |
|  0.190908572065 |     able     |
|  0.085512779463 |     old      |
| 0.0588546711521 |     car      |
| -0.209562864534 |     less     |
| -0.320556236735 |   product    |
| -0.362166742274 |    would     |
| -0.511379631799 |     even     |
| -0.621168773641 |     work     |
| -0.898030737715 |    money     |
|  -1.65157634496 |    broke     |
|  -2.03369861394 |    waste     |
|  -2.10933109032 |    return    |
|  -2.3482982195  | disappointed |
+-----------------+--------------+
[20 rows x 2 columns]
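
Each coefficient is the amount that one extra occurrence of the word adds to the score (the log-odds of a positive review), so `exp(coefficient)` is the factor by which the odds are multiplied. The sketch below uses a few rounded values from the table. The intercept is not shown in the notebook, so it is left out here and the numbers are word contributions only; `word_contribution` is a hypothetical helper.

```python
import math

# Rounded coefficients copied from the table above.
coefficients = {'loves': 1.6731, 'great': 0.9440, 'broke': -1.6516, 'disappointed': -2.3483}

def word_contribution(word_counts):
    # Sum of coefficient * count over the words present (intercept omitted).
    return sum(coefficients.get(word, 0.0) * n for word, n in word_counts.items())

print(word_contribution({'loves': 1, 'great': 1}))         # positive contribution
print(word_contribution({'broke': 1, 'disappointed': 1}))  # negative contribution
print(math.exp(coefficients['disappointed']))              # odds multiplier below 1
```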



In [140]:
# Simple (20-word) model, training set
simple_predictions = simple_model.predict(train_matrix_word_subset)
print sklearn.metrics.accuracy_score(train_data['sentiment'].to_numpy(), simple_predictions)
# Full sentiment model, training set
sentiment_train_predict = sentiment_model.predict(train_matrix)
print sklearn.metrics.accuracy_score(train_data['sentiment'].to_numpy(), sentiment_train_predict)
# Simple (20-word) model, test set
simple_test_prediction = simple_model.predict(test_matrix_word_subset)
print sklearn.metrics.accuracy_score(test_data['sentiment'].to_numpy(), simple_test_prediction)
# Full sentiment model, test set
sentiment_test_predict = sentiment_model.predict(X=test_matrix)
print sklearn.metrics.accuracy_score(test_data['sentiment'].to_numpy(), sentiment_test_predict)

0.866822570007
0.968489536487
0.869360451164
0.932295416367
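
Reading the four printed numbers in order (simple model on training data, full model on training data, simple model on test data, full model on test data), the gap between training and test accuracy can be tabulated. The dictionary layout below is just one way to organize the values.

```python
# Accuracies copied from the output above, rounded to four decimals.
results = {
    'simple_model':    {'train': 0.8668, 'test': 0.8694},
    'sentiment_model': {'train': 0.9685, 'test': 0.9323},
}

for name, acc in sorted(results.items()):
    gap = acc['train'] - acc['test']
    print("%-16s train=%.4f test=%.4f gap=%+.4f" % (name, acc['train'], acc['test'], gap))
```

The full model scores higher on both sets but loses about 3.6 points from training to test, while the 20-word model is nearly unchanged.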


In [146]:
# Majority class baseline: predict +1 for every review
print sklearn.metrics.accuracy_score(test_data['sentiment'].to_numpy(), np.ones(len(test_data)))

0.842782577394
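
Predicting +1 for every review is the majority class baseline: its accuracy equals the share of positive labels in the test set. A sketch that finds the majority class automatically, using `collections.Counter` and made-up labels (`majority_class_accuracy` is a hypothetical helper):

```python
from collections import Counter

def majority_class_accuracy(labels):
    # Accuracy of always predicting the most frequent label.
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))

print(majority_class_accuracy([1, 1, 1, -1, 1, -1, 1, 1]))  # (1, 0.75)
```

Any classifier worth keeping should beat this number; here the baseline is about 0.8428 on the test set.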
