# Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering.

# Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com.

In [4]:
import sframe
products = sframe.SFrame('amazon_baby.gl/')

[INFO] sframe.cython.cy_server: SFrame v2.1 started. Logging /tmp/sframe_server_1470060292.log


## Build the word count vector for each review

Let us explore a specific example of a baby product.

In [5]:
products[269]

{'name': 'The First Years Massaging Action Teether',
 'rating': 5.0,
 'review': 'A favorite in our house!'}

In [6]:
import string

def remove_punctuation(text):
    # Python 2 str.translate: no translation table, delete every punctuation character
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)
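
`text.translate(None, string.punctuation)` is the Python 2 call signature. Under Python 3 the same cleanup is usually written with `str.maketrans`. The sketch below assumes Python 3; the name `remove_punctuation_py3` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with ASCII punctuation removed (Python 3 sketch)."""
    return text.translate(_PUNCTUATION_TABLE)

print(remove_punctuation_py3('A favorite in our house!'))
```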

In [7]:
# products = products.fillna({'review':''})  # fill in N/A's in the review column

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment. 

In [8]:
products = products[products['rating'] != 3]

Now, we will assign reviews with a rating of 4 or higher to be positive reviews, 
while those with a rating of 2 or lower are negative. For the sentiment column, 
we use +1 for the positive class label and -1 for the negative class label. 
A convenient way to do this is to create an anonymous function that converts a rating into a class label 
and then apply that function to every element in the rating column. 

In [9]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

Now, we can see that the dataset contains an extra column called **sentiment** which is either positive (+1) or negative (-1).
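
The filtering and labelling steps can be checked on a handful of values without SFrame. This is a small sketch on made-up ratings; the helper name `rating_to_sentiment` is illustrative only.

```python
def rating_to_sentiment(rating):
    """Map a star rating to +1 (positive) or -1 (negative); 3-star reviews are dropped first."""
    return +1 if rating > 3 else -1

ratings = [5.0, 3.0, 2.0, 4.0, 1.0]             # made-up ratings for illustration
kept = [r for r in ratings if r != 3]           # drop neutral reviews
labels = [rating_to_sentiment(r) for r in kept]
print(labels)
```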

## Split data into training and test sets

Let's perform a train/test split with 80% of the data in the training set 
and 20% of the data in the test set. If you are using SFrame, make sure to use seed=1 
so that you get the same result as everyone else does. (This way, you will get the right numbers for the quiz.)

In [10]:
train_data, test_data = products.random_split(.8, seed=1)
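
A seeded random split can be sketched in plain Python as below. This illustrates the idea and is not SFrame's implementation: the helper `random_split` and the stand-in rows are hypothetical, the selected rows will differ from those SFrame picks with seed=1, and the split sizes are only approximately 80/20 because each row is assigned independently.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first split with probability `fraction`, else to the second."""
    rng = random.Random(seed)                   # fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))                        # stand-in for 1000 reviews
train_rows, test_rows = random_split(rows, 0.8, seed=1)
print(len(train_rows))
print(len(test_rows))
```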

We will now compute the word count for each word that appears in the reviews. 
A vector consisting of word counts is often referred to as bag-of-words features. 
Since most words occur in only a few reviews, word count vectors are sparse. 
For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. 
Refer to the documentation of your tool of choice for how to produce sparse word count vectors. 
The general steps for extracting word count vectors are as follows:

* Learn a vocabulary (set of all words) from the training data. 
Only the words that show up in the training data will be considered for feature extraction.

* Compute the occurrences of the words in each review and collect them into a row vector.

* Build a sparse matrix where each row is the word count vector for the corresponding review. 
Call this matrix train_matrix.

* Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

The following cell uses CountVectorizer in scikit-learn. Notice the token_pattern argument in the constructor.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
     
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])

# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
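
To make the four steps concrete, here is a plain-Python sketch on toy reviews. It is a simplified stand-in for CountVectorizer, not its implementation: the helper names are hypothetical, sparse rows are represented as `{column: count}` dicts instead of a SciPy matrix, and text is lowercased because CountVectorizer lowercases by default.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'\b\w+\b')          # same pattern as above: keeps single-letter words

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def learn_vocabulary(documents):
    """Assign a column index to every word seen in the training documents."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: column for column, word in enumerate(words)}

def to_sparse_rows(documents, vocabulary):
    """One {column: count} dict per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append({vocabulary[w]: c for w, c in counts.items() if w in vocabulary})
    return rows

train_docs = ['I love it', 'love love this a lot']    # toy reviews
test_docs = ['I love this stroller']                  # 'stroller' is not in the vocabulary

vocabulary = learn_vocabulary(train_docs)
train_rows = to_sparse_rows(train_docs, vocabulary)
test_rows = to_sparse_rows(test_docs, vocabulary)
print(vocabulary)
print(train_rows)
print(test_rows)
```

Because the vocabulary is learned from the training reviews only, the unseen word in the test review contributes no column.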

In [23]:
test_data

name,review,rating,review_clean,sentiment
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,This has been an easy way for my nanny to record ...,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,I love this journal and our nanny uses it ...,1
Nature's Lullabies First Year Sticker Calendar ...,"I love this little calender, you can keep ...",5.0,I love this little calender you can keep ...,1
Nature's Lullabies Second Year Sticker Calendar ...,"I had a hard time finding a second year calendar, ...",5.0,I had a hard time finding a second year calendar ...,1
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4.0,One of babys first and favorite books and it is ...,1
"Lamaze Peekaboo, I Love You ...",My son loved this book as an infant. It was ...,5.0,My son loved this book as an infant It was per ...,1
"Lamaze Peekaboo, I Love You ...",Our baby loves this book & has loved it for a ...,5.0,Our baby loves this book has loved it for a while ...,1
"SoftPlay Giggle Jiggle Funbook, Happy Bear ...",This bear is absolutely adorable and I would ...,2.0,This bear is absolutely adorable and I would ...,-1
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,I bought two for recent baby showers! The book ...,5.0,I bought two for recent baby showers The boo ...,1
Baby's First Year Undated Wall Calendar with ...,I searched high and low for a first year cale ...,5.0,I searched high and low for a first year cale ...,1

sentiment_score,sentiment_prediction
1.27028751489,1.0
14.1364991409,1.0
2.6485852739,1.0
10.7406875239,1.0
3.90634026182,1.0
9.95346825602,1.0
6.67431255257,1.0
1.43611644038,1.0
6.4769286775,1.0
5.91478049644,1.0


### Train a sentiment classifier with logistic regression

In [24]:
from sklearn.linear_model import LogisticRegression
sentiment_model = LogisticRegression().fit(X=train_matrix, y=train_data['sentiment'])
print sentiment_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [25]:
def count_positive_weights(model):
    '''Return how many of the model's weights are >= 0.'''
    counter = 0
    for i in range(0, model.coef_.size):
        # Use the model passed in, not the global sentiment_model
        if model.coef_[0, i] >= 0:
            counter += 1
    return counter
        
print 'positive weights:', count_positive_weights(sentiment_model)

positive weights: 85814


## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 3 examples in the test dataset.  We refer to this set of 3 examples as the **sample_test_data**.

In [26]:
sample_test_data = test_data[10:13]
print sample_test_data

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-----------+-----------------+----------------------+
|          review_clean         | sentiment | sentiment_score | sentiment_prediction |
+-------------------------------+-----------+-----------------+----------------------+
| Absolutely love it and all... |     1     |  5.60440049313  |         1.0          |
| Would not purchase again o... |     -1    |  -3.15509227588 |         -1.0         |
| Was so excited to get this... |     -1

In [27]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [28]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [29]:
import numpy as np
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print scores
print 'probability predictions:', 1/(1 + np.exp((-1)*scores))

[  5.60440049  -3.15509228 -10.42872935]
probability predictions: [  9.96331878e-01   4.08910966e-02   2.95697429e-05]
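
The last line of the cell applies the logistic (sigmoid) function $1/(1 + e^{-\text{score}})$ to each score. A minimal standard-library sketch of the same conversion, using the three scores printed above; the helper name `sigmoid` is illustrative, and the branch on the sign is only there to avoid overflow for very negative scores.

```python
import math

def sigmoid(score):
    """Logistic function 1 / (1 + exp(-score)), written to avoid overflow for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)                 # score < 0, so z lies in (0, 1)
    return z / (1.0 + z)

scores = [5.60440049, -3.15509228, -10.42872935]   # scores printed above
probabilities = [sigmoid(s) for s in scores]
print(probabilities)
```

A positive score gives a probability above 0.5 and a negative score one below 0.5, which is why thresholding the score at 0 is equivalent to thresholding the probability at 0.5.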


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [36]:
def check_sentiment_prediction(model, test_matrix, test_data):
    # Raw scores w^T h(x_i) for every test review
    scores = model.decision_function(test_matrix)
    test_data['sentiment_score'] = scores

    # Apply the rule above: +1 if the score is > 0, otherwise -1 (a score of exactly 0 maps to -1)
    predictions = np.where(scores > 0, 1.0, -1.0)
    test_data['sentiment_prediction'] = predictions
    return predictions

# Note: despite its name, complete_scores holds the +1/-1 class predictions
complete_scores = check_sentiment_prediction(sentiment_model, test_matrix, test_data)
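
The decision rule itself is a single comparison per score. A small sketch on the three sample scores plus a boundary case; the helper `predict_class` is hypothetical.

```python
def predict_class(score):
    """+1 when the score is strictly positive, -1 otherwise (a score of exactly 0 maps to -1)."""
    return +1 if score > 0 else -1

scores = [5.60440049, -3.15509228, -10.42872935, 0.0]   # three sample scores plus a boundary case
predictions = [predict_class(s) for s in scores]
print(predictions)
```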

In [37]:
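# complete_scores holds the +1/-1 class predictions returned above, not the raw scores,
# so every value below is sigmoid(+1) = 0.731... or sigmoid(-1) = 0.269...
# For per-review probabilities, apply the same formula to sentiment_model.decision_function(test_matrix).
complete_probabilities = 1/(1 + np.exp((-1)*complete_scores))
print 'complete probability predictions:', complete_probabilities
test_data['probabilities'] = complete_probabilities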

complete probability predictions: [ 0.73105858  0.73105858  0.73105858 ...,  0.73105858  0.73105858
  0.73105858]


In [38]:
test_data

name,review,rating,review_clean,sentiment
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,This has been an easy way for my nanny to record ...,1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,I love this journal and our nanny uses it ...,1
Nature's Lullabies First Year Sticker Calendar ...,"I love this little calender, you can keep ...",5.0,I love this little calender you can keep ...,1
Nature's Lullabies Second Year Sticker Calendar ...,"I had a hard time finding a second year calendar, ...",5.0,I had a hard time finding a second year calendar ...,1
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4.0,One of babys first and favorite books and it is ...,1
"Lamaze Peekaboo, I Love You ...",My son loved this book as an infant. It was ...,5.0,My son loved this book as an infant It was per ...,1
"Lamaze Peekaboo, I Love You ...",Our baby loves this book & has loved it for a ...,5.0,Our baby loves this book has loved it for a while ...,1
"SoftPlay Giggle Jiggle Funbook, Happy Bear ...",This bear is absolutely adorable and I would ...,2.0,This bear is absolutely adorable and I would ...,-1
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,I bought two for recent baby showers! The book ...,5.0,I bought two for recent baby showers The boo ...,1
Baby's First Year Undated Wall Calendar with ...,I searched high and low for a first year cale ...,5.0,I searched high and low for a first year cale ...,1

sentiment_score,sentiment_prediction,probabilities
1.27028751489,1.0,0.73105857863
14.1364991409,1.0,0.73105857863
2.6485852739,1.0,0.73105857863
10.7406875239,1.0,0.73105857863
3.90634026182,1.0,0.73105857863
9.95346825602,1.0,0.73105857863
6.67431255257,1.0,0.73105857863
1.43611644038,1.0,0.73105857863
6.4769286775,1.0,0.73105857863
5.91478049644,1.0,0.73105857863
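
Every entry in the `probabilities` column equals 0.73105857863, which is $1/(1 + e^{-1})$: the sigmoid was applied to `complete_scores`, which holds the ±1 class labels returned by `check_sentiment_prediction`, and not to the raw values in `sentiment_score`. A minimal sketch of the difference, using the first three `sentiment_score` values from the table above (the helper name `sigmoid` is illustrative):

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

label = 1.0                                        # a class label, not a score
raw_scores = [1.27028751489, 14.1364991409, 2.6485852739]

label_probability = sigmoid(label)                 # the constant seen in the table
score_probabilities = [sigmoid(s) for s in raw_scores]
print(label_probability)
print(score_probabilities)                         # per-review probabilities differ
```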


## Find the most positive (and negative) reviews

In [39]:
# Names of the 20 products whose reviews have the highest sentiment scores
test_data.sort('sentiment_score', ascending=False)['name'][0:20]

dtype: str
Rows: 20
['Infantino Wrap and Tie Baby Carrier, Black Blueberries', 'Baby Einstein Around The World Discovery Center', 'Britax 2012 B-Agile Stroller, Red', 'Diono RadianRXT Convertible Car Seat, Plum', "Graco Pack 'n Play Element Playard - Flint", "P'Kolino Silly Soft Seating in Tias, Green", 'Roan Rocco Classic Pram Stroller 2-in-1 with Bassinet and Seat Unit - Coffee', 'Mamas &amp; Papas 2014 Urbo2 Stroller - Black', 'Buttons Cloth Diaper Cover - One Size - 8 Color Options', 'Evenflo X Sport Plus Convenience Stroller - Christina', 'Simple Wishes Hands-Free Breastpump Bra, Pink, XS-L', 'Graco FastAction Fold Jogger Click Connect Stroller, Grapeade', 'Freemie Hands-Free Concealable Breast Pump Collection System', 'Baby Jogger City Mini GT Single Stroller, Shadow/Orange', "Fisher-Price Cradle 'N Swing,  My Little Snugabunny", 'Evenflo 6 Pack Classic Glass Bottle, 4-Ounce', 'Britax Decathlon Convertible Car Seat, Tiffany', 'Ikea 36 Pcs Kalas Kids Plastic BPA Free Flatware, Bow

In [40]:
# Names of the 20 products whose reviews have the lowest sentiment scores
test_data.sort('sentiment_score', ascending=True)['name'][0:20]

dtype: str
Rows: 20
['Fisher-Price Ocean Wonders Aquarium Bouncer', "Levana Safe N'See Digital Video Baby Monitor with Talk-to-Baby Intercom and Lullaby Control (LV-TW501)", 'Safety 1st Exchangeable Tip 3 in 1 Thermometer', 'Adiri BPA Free Natural Nurser Ultimate Bottle Stage 1 White, Slow Flow (0-3 months)', 'VTech Communications Safe &amp; Sounds Full Color Video and Audio Monitor', 'The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit', 'Safety 1st High-Def Digital Monitor', 'Cloth Diaper Sprayer--styles may vary', 'Motorola Digital Video Baby Monitor with Room Temperature Thermometer', 'Philips AVENT Newborn Starter Set', 'Cosco Alpha Omega Elite Convertible Car Seat', 'Ellaroo Mei Tai Baby Carrier - Hershey', 'Peg-Perego Tatamia High Chair, White Latte', 'Belkin WeMo Wi-Fi Baby Monitor for Apple iPhone, iPad, and iPod Touch (Firmware Update)', 'Chicco Cortina KeyFit 30 Travel System in Adventure', 'NUK Cook-n-Blend Baby Food Maker', 'VTech Communications Safe &a

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by

$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

In [None]:
def get_classification_accuracy(test_data):
    # Use the size of the data passed in instead of the global complete_scores
    num_examples = len(test_data)
    correct_examples = 0
    for i in range(0, num_examples):
        # Compare the true label with the predicted label, row by row
        if int(test_data['sentiment'][i]) == int(test_data['sentiment_prediction'][i]):
            correct_examples += 1
    print 'correct examples:', correct_examples
    accuracy = float(correct_examples)/num_examples
    print 'classification accuracy:', accuracy

get_classification_accuracy(test_data)
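
The accuracy formula can be checked independently of SFrame on a few hand-written labels. This is a sketch on toy data; the function name `classification_accuracy` is hypothetical.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return float(correct) / len(true_labels)

true_labels = [+1, -1, +1, +1]          # toy data
predicted_labels = [+1, -1, -1, +1]     # one mistake out of four
print(classification_accuracy(true_labels, predicted_labels))
```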

## Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of the words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [None]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [None]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
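
With a fixed `vocabulary`, each review becomes a vector with one entry per listed word, in the order given, and all other words are ignored. Since no `token_pattern` is passed here, CountVectorizer falls back to its default pattern, which only matches tokens of two or more characters. A rough plain-Python sketch of that behaviour (the helper `count_selected_words` is hypothetical and ignores scikit-learn's other preprocessing options):

```python
import re

def count_selected_words(text, selected_words):
    """Count occurrences of each selected word; every other word is ignored."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())   # tokens of two or more characters
    return [tokens.count(word) for word in selected_words]

selected = ['love', 'great', 'easy']                  # a short stand-in for significant_words
row = count_selected_words('Love love this, great car seat', selected)
print(row)
```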

## Train a logistic regression model on a subset of words

In [None]:
from sklearn.linear_model import LogisticRegression
simple_model = LogisticRegression().fit(X=train_matrix_word_subset, y=train_data['sentiment'])
print simple_model

In [None]:
simple_model_coef_table = sframe.SFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
simple_model_coef_table['coefficient']

In [None]:
counter = 0
for element in simple_model_coef_table['coefficient']:
    if element > 0:
        counter += 1
print 'number of positive coefficients', counter

In [None]:
check_sentiment_prediction(simple_model, test_matrix_word_subset, test_data)

In [None]:
get_classification_accuracy(test_data)

## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless.

In [None]:
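# Pick the majority class from the training labels, then score it on the test set
num_positive = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
majority_class = +1 if num_positive >= num_negative else -1
print 'majority classification accuracy', float((test_data['sentiment'] == majority_class).sum())/len(test_data)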
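
The baseline can also be sketched without SFrame: find the most common label in the training set, then measure how often that single label is correct on the test set. The labels below are toy data and the function name `majority_class_accuracy` is hypothetical.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy on test_labels of always predicting the most common label in train_labels."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, float(hits) / len(test_labels)

train_labels = [+1, +1, +1, -1]         # toy labels: +1 is the majority class
test_labels = [+1, -1, +1, +1, -1]
result = majority_class_accuracy(train_labels, test_labels)
print(result)
```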