### Logistic regression on Amazon review data  
The goal of this project is to predict whether a product review expresses positive or negative sentiment, using product review data from Amazon.

The review data is stored as an SFrame.


#### Skills learned:
- extract bag-of-words features with CountVectorizer in scikit-learn
- compare two ways of turning a score into a prediction: the sign of the score, or a probability from the logistic link
- compare the effect of vocabulary size on classifying Amazon review data

#### Loading and cleaning data

In [1]:
import sframe
# from graphlab import SFrame
products = sframe.SFrame('amazon_baby.gl/')

[INFO] sframe.cython.cy_server: SFrame v2.1 started. Logging /tmp/sframe_server_1475025929.log


In [2]:
# This is supposed to work, but it doesn't :(
# Let's stick with sframe for now...
# products_df = SFrame.to_dataframe(products)
# Apparently I need graphlab to convert an SFrame into a pandas DataFrame.

In [3]:
import string

# Let's strip off the punctuation first.
# Note: translate(None, chars) is the Python 2 str signature.
def remove_punctuation(text):
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)
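The `translate(None, ...)` call above uses the Python 2 `str` signature; on Python 3, `str.translate` takes a single table argument. Here is a minimal sketch of the same step for Python 3, assuming the reviews are `str` objects (the name `remove_punctuation_py3` is made up for this illustration):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with ASCII punctuation removed (Python 3 str only)."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_py3("daughter-in-law, thrilled!"))
```

Like the original, this deletes hyphens and apostrophes without inserting a space, so "Fisher-Price" becomes "FisherPrice".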

#### IMPORTANT
Make sure to fill n/a values in the review column with empty strings (if applicable). The n/a values indicate empty reviews. Note that this cell runs after `review_clean` was created, so it only fills the `review` column.
(In pandas, the syntax is: `products = products.fillna({'review': ''})`)

In [4]:
products = products.fillna('review','')  # fill in N/A's in the review column

In [5]:
# As required by the assignment, let's ignore all the neutral ratings.

products = products[products['rating'] != 3]

In [6]:
# Let's create a binary label for positive/negative sentiment
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
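For readers without SFrame, here is a rough pandas sketch of the same cleaning steps on a tiny made-up table. The rows and the variable name `toy` are assumptions for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Tiny made-up stand-in for the Amazon table (not real data).
toy = pd.DataFrame({
    'review': ['Love it', None, 'It is ok', 'Broke in a week'],
    'rating': [5.0, 4.0, 3.0, 1.0],
})

toy = toy.fillna({'review': ''})       # missing reviews become empty strings
toy = toy[toy['rating'] != 3].copy()   # drop the neutral ratings
# +1 for ratings above 3, -1 otherwise
toy['sentiment'] = toy['rating'].apply(lambda rating: +1 if rating > 3 else -1)

print(toy)
```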

In [7]:
# Split the training and testing sets
train_data, test_data = products.random_split(.8, seed=1)

In [8]:
products['sentiment']

dtype: int
Rows: 166752
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]

### Feature extraction 
#### Bag-of-words with scikit-learn
Let's use sparse matrices to store the word counts. 

General advice from the instructors:
- Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
- Compute the occurrences of the words in each review and collect them into a row vector.
- Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
- Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

# First, learn the vocabulary from the training data, assign columns to words,
# and convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [10]:
train_matrix

<133416x121712 sparse matrix of type '<type 'numpy.int64'>'
	with 7326618 stored elements in Compressed Sparse Row format>
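To make the four steps from the instructors concrete, here is a small sketch on a made-up corpus (the `toy_*` names and sentences are invented for this example). It uses the same token pattern as above and shows that a word unseen in training gets no column in the test matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ['I love it', 'a great great product']
toy_test = ['great value']   # 'value' never appears in toy_train

toy_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_matrix = toy_vectorizer.transform(toy_test)        # reuses the same columns

# Words ordered by their column index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)
print(toy_train_matrix.toarray())
print(toy_test_matrix.toarray())
```

The test row only has a count in the `great` column; `value` is silently dropped because it is not in the learned vocabulary.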

### Model training
7. Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j push the prediction toward positive sentiment, while negative weights push it toward negative sentiment. Calculate the number of positive (>= 0, which is actually nonnegative) coefficients.

In [11]:
from sklearn.linear_model import LogisticRegression

sentiment_model = LogisticRegression()


In [12]:
sentiment_model = sentiment_model.fit(train_matrix, train_data['sentiment'])

In [13]:
print train_matrix.shape
print train_data['sentiment'].shape

(133416, 121712)
(133416,)


In [14]:
# print model accuracy
sentiment_model.score(train_matrix, train_data['sentiment'])

0.96850452719314029

In [16]:
#Quiz question: How many weights are >= 0?
import numpy as np
print np.sum(sentiment_model.coef_>=0)

87059
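Since `>= 0` also counts coefficients that are exactly zero, it can help to split the count three ways. A small sketch on a made-up coefficient array (`toy_coef` is hypothetical; in the notebook the array would be `sentiment_model.coef_`):

```python
import numpy as np

# Hypothetical coefficients, shaped like sklearn's coef_ for a binary model: (1, n_features)
toy_coef = np.array([[1.2, 0.0, -0.4, 0.3, 0.0, -2.1]])

n_positive = int(np.sum(toy_coef > 0))
n_zero = int(np.sum(toy_coef == 0))
n_negative = int(np.sum(toy_coef < 0))

print(n_positive, n_zero, n_negative)
# The nonnegative count is the sum of the strictly positive and the zero counts.
print(n_positive + n_zero == int(np.sum(toy_coef >= 0)))
```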


#### Let's first look at properties of three specific entries.

In [17]:

sample_test_data = test_data[10:13]
print sample_test_data

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-----------+
|          review_clean         | sentiment |
+-------------------------------+-----------+
| Absolutely love it and all... |     1     |
| Would not purchase again o... |     -1    |
| Was so excited to get this... |     -1    |
+-------------------------------+-----------+
[3 rows x 5 columns]



In [18]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [19]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

### Now let's use the trained model to make some predictions

In [20]:

sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print scores

[  5.60150644  -3.17110494 -10.42378277]


If we predict with the sign of the score, the predictions are 1, -1, -1.
To quantify how confident we are in each prediction, let's use the logistic link

$$P(y_i = +1 \mid x_i, w) = \frac{1}{1 + \exp(-w^T h(x_i))}$$

and translate the scores into the probability of a review being positive.

In [23]:
[1/(1+np.exp(-s)) for s in scores]

[0.99632128559196442, 0.04026769205939245, 2.9716370056769632e-05]
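The list comprehension above can also be written as one vectorized NumPy expression. A minimal sketch, reusing the three scores printed earlier (the helper name `sigmoid` is my own):

```python
import numpy as np

def sigmoid(scores):
    """Map real-valued scores to probabilities P(y = +1), each in (0, 1)."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-scores))

sample_scores = np.array([5.60150644, -3.17110494, -10.42378277])
sample_probs = sigmoid(sample_scores)
print(sample_probs)

# The sign of the score and a 0.5 probability threshold give the same labels.
sample_labels = np.where(sample_probs > 0.5, 1, -1)
print(sample_labels)
```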

In [24]:
# let's compare our calculation with the prediction from the trained model
sentiment_model.predict(sample_test_matrix)

array([ 1, -1, -1])
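As a sanity check, the hand-computed probabilities should agree with scikit-learn's own `predict_proba`, and the sign of the score should agree with `predict`. Here is a sketch on a made-up one-feature dataset (`X_toy`, `y_toy` and `toy_model` are illustrative names, not part of the assignment), assuming a binary model with default settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1-D data: small values are labeled -1, large values +1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

toy_scores = toy_model.decision_function(X_toy)
manual_probs = 1.0 / (1.0 + np.exp(-toy_scores))
# Columns of predict_proba follow toy_model.classes_, here [-1, 1],
# so column 1 is the probability of the positive class.
model_probs = toy_model.predict_proba(X_toy)[:, 1]

probs_match = np.allclose(manual_probs, model_probs)
labels_match = np.array_equal(np.where(toy_scores > 0, 1, -1), toy_model.predict(X_toy))
print(probs_match, labels_match)
```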

### Questions:
Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

In [26]:
test_matrix = vectorizer.transform(test_data['review_clean'])
total_scores = sentiment_model.decision_function(test_matrix)

In [34]:
prob = [1/(1+np.exp(-s)) for s in total_scores]

In [35]:
test_data['prob'] = prob

In [36]:
# Now we need to sort the SFrame based on 'prob'

In [37]:
type(test_data)

sframe.data_structures.sframe.SFrame

In [39]:
test_data.sort('prob', ascending = False)[0:20]

name,review,rating,review_clean,sentiment,prob
Freemie Hands-Free Concealable Breast Pump ...,I absolutely love this product. I work as a ...,5.0,I absolutely love this product I work as a ...,1,1.0
Baby Einstein Around The World Discovery Center ...,I am so HAPPY I brought this item for my 7 mo ...,5.0,I am so HAPPY I brought this item for my 7 mo ...,1,1.0
"Fisher-Price Cradle 'N Swing, My Little ...",My husband and I cannot state enough how much we ...,5.0,My husband and I cannot state enough how much we ...,1,1.0
"P'Kolino Silly Soft Seating in Tias, Green ...",I've purchased both the P'Kolino Little Reader ...,4.0,Ive purchased both the PKolino Little Reader ...,1,1.0
Buttons Cloth Diaper Cover - One Size - 8 ...,"We are big Best Bottoms fans here, but I wanted ...",4.0,We are big Best Bottoms fans here but I wante ...,1,1.0
"Baby Jogger City Mini GT Single Stroller, ...","Amazing, Love, Love, Love it !!! All 5 STARS all ...",5.0,Amazing Love Love Love it All 5 STARS all the w ...,1,1.0
Mamas & Papas 2014 Urbo2 Stroller - Black ...,After much research I purchased an Urbo2. It's ...,4.0,After much research I purchased an Urbo2 Its ...,1,1.0
"Britax Decathlon Convertible Car Seat, ...",I researched a few different seats to pu ...,4.0,I researched a few different seats to pu ...,1,1.0
Roan Rocco Classic Pram Stroller 2-in-1 with ...,Great Pram Rocco!!!!!!I bought this pram from ...,5.0,Great Pram RoccoI bought this pram from Europe ...,1,1.0
"Simple Wishes Hands-Free Breastpump Bra, Pink, ...","I just tried this hands free breastpump bra, and ...",5.0,I just tried this hands free breastpump bra a ...,1,1.0


In [40]:
test_data.sort('prob', ascending = True)[0:20]

name,review,rating,review_clean,sentiment,prob
Fisher-Price Ocean Wonders Aquarium Bouncer ...,We have not had ANY luck with Fisher-Price ...,2.0,We have not had ANY luck with FisherPrice prod ...,-1,8.47442279902e-16
Levana Safe N'See Digital Video Baby Monitor with ...,This is the first review I have ever written out ...,1.0,This is the first review I have ever written out ...,-1,1.59485709412e-15
Safety 1st Exchangeable Tip 3 in 1 Thermometer ...,I thought it sounded great to have different ...,1.0,I thought it sounded great to have different ...,-1,8.141166559009999e-14
Adiri BPA Free Natural Nurser Ultimate Bottle ...,I will try to write an objective review of the ...,2.0,I will try to write an objective review of the ...,-1,9.830461809779999e-14
VTech Communications Safe & Sounds Full Color ...,"This is my second video monitoring system, the ...",1.0,This is my second video monitoring system the ...,-1,1.94179307181e-13
The First Years True Choice P400 Premium ...,Note: we never installed batteries in these un ...,1.0,Note we never installed batteries in these units ...,-1,3.32465459763e-13
Safety 1st High-Def Digital Monitor ...,We bought this baby monitor to replace a ...,1.0,We bought this baby monitor to replace a ...,-1,3.27225252468e-11
Cloth Diaper Sprayer-- styles may vary ...,I bought this sprayer out of desperation during a ...,1.0,I bought this sprayer out of desperation during a ...,-1,3.3295057511e-11
Philips AVENT Newborn Starter Set ...,"It's 3am in the morning and needless to say, ...",1.0,Its 3am in the morning and needless to say this ...,-1,9.49459741846e-11
Motorola Digital Video Baby Monitor with Room ...,DO NOT BUY THIS BABY MONITOR!I purchased this ...,1.0,DO NOT BUY THIS BABY MONITORI purchased this ...,-1,9.58560032751e-11
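Outside SFrame, the same "most positive" and "most negative" lookup can be done with `np.argsort`. A sketch on made-up probabilities, with `k = 2` instead of 20 (all names and values here are illustrative):

```python
import numpy as np

toy_reviews = np.array(['r0', 'r1', 'r2', 'r3', 'r4'])
toy_prob = np.array([0.91, 0.05, 0.99, 0.40, 0.70])
k = 2

order = np.argsort(toy_prob)                   # indices from lowest to highest probability
most_negative = toy_reviews[order[:k]]         # k lowest probabilities
most_positive = toy_reviews[order[::-1][:k]]   # k highest probabilities

print(most_positive)
print(most_negative)
```

One caveat visible in the output above: several of the top reviews have a probability that prints as exactly 1.0, so their relative order among themselves is not meaningful.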


#### Let's compute the accuracy of the classifier

In [60]:


# Accuracy on the training set:
train_predictions = sentiment_model.predict(train_matrix)
print 'training accuracy: ',(train_predictions == train_data['sentiment']).sum().astype(float)/train_data.shape[0]


training accuracy:  0.968504527193


In [59]:
# Accuracy on the testing set: 
test_predictions = sentiment_model.predict(test_matrix)
print 'testing accuracy:  ', (test_predictions == test_data['sentiment']).sum().astype(float)/test_data.shape[0]

testing accuracy:   0.932295416367
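The accuracy expression used here is just the fraction of predictions that match the labels. A short sketch with made-up labels, checked against `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: three of the five match.
y_true = np.array([1, 1, -1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

manual_accuracy = np.mean(y_pred == y_true)   # correct predictions / total
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy)
print(library_accuracy)
```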


#### The model works pretty well.
The model we used has 121,712 words as features, so it takes some time to train. How about we try to train the model with fewer words?

Here is a list of 20 words given for this part of the assignment:


In [43]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [44]:
# Count words again, but only for the given vocabulary
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words)
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
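With a fixed `vocabulary`, the columns follow the order of the list that was passed in, and every other word is ignored. A sketch with three made-up words and two made-up sentences (the `toy_*` names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_words = ['love', 'great', 'broke']
toy_subset_vectorizer = CountVectorizer(vocabulary=toy_words)

# Nothing has to be learned when the vocabulary is fixed,
# so transform() can be called without fitting first.
toy_counts = toy_subset_vectorizer.transform(
    ['Great stroller, we love love it', 'It broke'])

# Columns are in the order of toy_words: love, great, broke
print(toy_counts.toarray())
```

This is also why the coefficient table later on can pair `significant_words` directly with `simple_model.coef_`.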

In [46]:
# Now let's train a new model:
simple_model = LogisticRegression()

simple_model = simple_model.fit(train_matrix_word_subset, train_data['sentiment'])

In [48]:
# Let's have a look at the coefficients
simple_model_coef_table = sframe.SFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
print simple_model_coef_table

+-----------------+---------+
|   coefficient   |   word  |
+-----------------+---------+
|  1.36368975931  |   love  |
|  0.943999590571 |  great  |
|  1.19253827349  |   easy  |
|  0.085512779463 |   old   |
|  0.520185762718 |  little |
|  1.50981247669  | perfect |
|  1.67307389259  |  loves  |
|  0.503760457767 |   well  |
|  0.190908572065 |   able  |
| 0.0588546711524 |   car   |
+-----------------+---------+
[20 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [54]:
# Let's look at the words with the highest coefficients
simple_model_coef_table.sort('coefficient', ascending = False)


coefficient,word
1.67307389259,loves
1.50981247669,perfect
1.36368975931,love
1.19253827349,easy
0.943999590571,great
0.520185762718,little
0.503760457767,well
0.190908572065,able
0.085512779463,old
0.0588546711524,car


In [57]:
# Accuracy on the training set:
train_simple_predictions = simple_model.predict(train_matrix_word_subset)
print 'training accuracy: ',(train_simple_predictions == train_data['sentiment']).sum().astype(float)/train_data.shape[0]

# Accuracy on the testing set: 

test_simple_predictions = simple_model.predict(test_matrix_word_subset)
print 'testing accuracy:  ',(test_simple_predictions == test_data['sentiment']).sum().astype(float)/test_data.shape[0]

training accuracy:  0.866822570007
testing accuracy:   0.869360451164


#### Discussion
The model with only 20 words performs decently on both training and testing data (about 0.87 accuracy on each). Compared to the model with all the words, the gap between training and testing accuracy is much smaller in the simple model.

The first model characterises the data better because it has many more words, but it also fits some of the noise in the training data, so its testing accuracy (0.932) is lower than its training accuracy (0.969). Even so, it still beats the 20-word model on the testing set.
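To put numbers on the gap, here is the arithmetic with the accuracies reported in the cells above, rounded to four decimals (the variable names are my own):

```python
# Accuracies reported earlier in the notebook, rounded to four decimals
full_train, full_test = 0.9685, 0.9323
simple_train, simple_test = 0.8668, 0.8694

full_gap = full_train - full_test        # training minus testing accuracy
simple_gap = simple_train - simple_test

print('full model gap:   %.4f' % full_gap)
print('simple model gap: %.4f' % simple_gap)
```

The full model loses about 3.6 points from training to testing, while the 20-word model's gap is within a fraction of a point (and slightly negative).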

In [None]:
# Quiz Question: Are the positive words in the simple_model also positive words in the sentiment_model?
# To answer this, I need to find the matching coefficients in the previous model (sentiment_model)
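One possible way to find the matching coefficients is to go through the vectorizer's `vocabulary_` dictionary, which maps each word to its column index. Below is a sketch on a made-up mini corpus; the helper `coefficient_of` and all `toy_*` names are hypothetical, and this is not necessarily how the assignment expects it to be done.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up mini corpus standing in for the review data
toy_reviews = ['love it great', 'great product love', 'broke fast waste',
               'waste of money broke', 'love love great', 'money waste']
toy_labels = [1, 1, -1, -1, 1, -1]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression().fit(
    toy_vectorizer.fit_transform(toy_reviews), toy_labels)

def coefficient_of(word, model, vectorizer):
    """Look up a word's coefficient via the vectorizer's word-to-column mapping."""
    column = vectorizer.vocabulary_[word]   # raises KeyError for unknown words
    return model.coef_[0][column]

for word in ['love', 'great', 'waste']:
    print(word, coefficient_of(word, toy_model, toy_vectorizer))
```

In the notebook, the analogous call would be `coefficient_of(w, sentiment_model, vectorizer)` for each `w` in `significant_words`, assuming every one of those words is in the learned vocabulary.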