# Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering with existing scikit-learn and pandas functions.

In this notebook you will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.

* Use pandas data frames to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

Let's get started!
    
## Fire up Scikit-learn, Pandas and Numpy

In [159]:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model

## Load Amazon dataset

Load the dataset consisting of baby product reviews on Amazon.com. Store the data in a data frame products. 

In [160]:
products = pd.read_csv('amazon_baby.csv')

In [161]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Build the word count vector for each review

Let us explore a specific example of a baby product.

In [162]:
products['review'][269]

'A favorite in our house!'

2. We start by removing punctuation, so that the words "cake." and "cake!" are counted as the same word.

Write a function remove_punctuation that strips punctuation from a line of text.

Apply this function to every element in the review column of products, and save the result to a new column review_clean.

Refer to your tool's manual for string processing capabilities. Python lets us express the operation in a succinct way, as follows:

In [163]:
def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator) 

In [164]:
products = products.fillna({'review':''})  # turn nan to nothing

In [165]:
products['review_clean'] = products['review'].astype(str).apply(remove_punctuation) #astype(str) makes sure all reviews are strings

In [166]:
products.head()

Unnamed: 0,name,review,rating,review_clean
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,These flannel wipes are OK but in my opinion n...
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...


Aside. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach to punctuation would preserve phrases such as "I'd", "would've", "hadn't" and so forth. See this page for an example of smart handling of punctuation.
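As a rough illustration of the aside above, the sketch below strips every punctuation character except the apostrophe, so contractions such as "I'd" survive. The function name remove_punctuation_keep_apostrophes is hypothetical, and this is only one simple option, not necessarily the approach described on the referenced page.

```python
import string

# Every punctuation character except the apostrophe (an assumption for this sketch)
PUNCTUATION_EXCEPT_APOSTROPHE = string.punctuation.replace("'", "")

def remove_punctuation_keep_apostrophes(text):
    """Strip punctuation but keep apostrophes so contractions stay intact."""
    translator = str.maketrans('', '', PUNCTUATION_EXCEPT_APOSTROPHE)
    return text.translate(translator)

print(remove_punctuation_keep_apostrophes("I'd say it's great!!!"))
```

Note that this also keeps apostrophes used as quotation marks, which a more careful tokenizer would handle separately.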

IMPORTANT. Make sure to fill n/a values in the review column with empty strings (if applicable). The n/a values indicate empty reviews. For instance, the pandas fillna() method lets you replace all N/A values in the review column, as done above with products.fillna({'review':''}).

In [167]:
# Alternative implementation from another author, kept to cross-check the cleaning steps above
import string
products2 = pd.read_csv('amazon_baby.csv')
products2 = products2.fillna({'review': ''})
full = len(products2['review'])
print(full)

translator2 = str.maketrans({key: None for key in string.punctuation})
def translate(text):
    # strip every punctuation character using the table built above
    return text.translate(translator2)
products2['clean_review'] = products2['review'].apply(translate)
products2 = products2[products2['rating'] != 3].copy()
products2['sentiment'] = products2['rating'].apply(lambda rating: +1 if rating > 3 else -1)
train2 = pd.read_json('module-2-assignment-train-idx.json')
train_data2 = products2.iloc[train2[0]]
train_data2.shape

183531


(133416, 5)

## Extract sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [168]:
# .copy() avoids a pandas SettingWithCopyWarning when the sentiment column is added below
products = products[products['rating'] != 3].copy()

4. Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with a rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column. In pandas, you would use apply():

In [169]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

Now, we can see that the dataset contains an extra column called sentiment which is either positive (+1) or negative (-1).

In [170]:
products.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1
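For comparison, the same labeling can be written without a per-row lambda. The sketch below uses np.where on a made-up ratings column (the toy_ratings frame is an assumption for illustration, not part of the dataset) and should give the same +1/-1 labels as the apply() call above.

```python
import numpy as np
import pandas as pd

# Toy ratings, including a neutral 3 that gets dropped first
toy_ratings = pd.DataFrame({'rating': [5, 4, 3, 2, 1]})
toy_ratings = toy_ratings[toy_ratings['rating'] != 3].copy()

# Vectorized rule: rating > 3 -> +1, otherwise -1
toy_ratings['sentiment'] = np.where(toy_ratings['rating'] > 3, 1, -1)
print(toy_ratings)
```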


## Split data into training and test sets

Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. So that everyone gets the same result, we load the row positions of each set from the provided index files instead of drawing a random split.

In [171]:
#read the index documents
traindata = pd.read_json('module-2-assignment-train-idx.json')
testdata = pd.read_json('module-2-assignment-test-idx.json')

In [172]:
#pull train and test data from products data using the index docs
train_data = products.iloc[traindata.values.reshape((-1,))]
test_data = products.iloc[testdata.values.reshape((-1,))]
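The sketch below shows, on a made-up ten-row frame, how a split from fixed row positions works with iloc, and how a seeded random split could be drawn instead. The toy frame and the index lists are assumptions for illustration; train_test_split with random_state=1 is an alternative, not what the cells above do, and it would not reproduce the index files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame whose index labels (100..109) differ from row positions (0..9)
toy = pd.DataFrame({'review_clean': list('abcdefghij'),
                    'sentiment': [1, -1] * 5},
                   index=range(100, 110))

# (a) Split from fixed row positions, as with the JSON index files
toy_train_idx = [0, 1, 2, 3, 5, 6, 7, 9]
toy_test_idx = [4, 8]
toy_train = toy.iloc[toy_train_idx]
toy_test = toy.iloc[toy_test_idx]

# (b) Seeded random 80/20 split; the same seed gives the same split on every run
rand_train, rand_test = train_test_split(toy, test_size=0.2, random_state=1)

print(len(toy_train), len(toy_test), len(rand_train), len(rand_test))
```

Because iloc selects by position, the selected rows keep their original index labels (104 and 108 for the toy test set).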

In [173]:
train_data.review_clean[0:10]

1     it came early and was not disappointed i love ...
2     Very soft and comfortable and warmer than it l...
3     This is a product well worth the purchase  I h...
4     All of my kids have cried nonstop when I tried...
5     When the Binky Fairy came to our house we didn...
6     Lovely book its bound tightly so you may not b...
7     Perfect for new parents We were able to keep t...
8     A friend of mine pinned this product on Pinter...
11    This book is perfect  Im a first time new mom ...
12    I originally just gave the nanny a pad of pape...
Name: review_clean, dtype: object

In [174]:
train_data[train_data2.clean_review != train_data.review_clean]

Unnamed: 0,name,review,rating,review_clean,sentiment


In [175]:
print (len(train_data))
print (len(test_data))

133416
33336


## Build the word count vector for each review

6. We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-words features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

* Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
* Compute the occurrences of the words in each review and collect them into a row vector.
* Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
* Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

The following cell uses CountVectorizer in scikit-learn. Notice the token_pattern argument in the constructor.

In [176]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
#train_matrix2 = vectorizer.fit_transform(train_data2['clean_review'])
train_matrix3 = vectorizer.fit_transform(train_data2['clean_review'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
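As a small illustration of the steps above, the following sketch runs the same fit-then-transform pattern on a made-up corpus (the toy_train and toy_test reviews are assumptions for illustration, not rows from the dataset). Words that appear only in the test reviews have no column, so they are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ['great product love it', 'broke fast waste']
toy_test = ['love it but broke', 'unseen words here']

toy_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

# Learn the vocabulary from the training reviews and count words per review
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)

# Reuse the same word-to-column mapping for the test reviews
toy_test_matrix = toy_vectorizer.transform(toy_test)

print(toy_train_matrix.shape)                  # (reviews, vocabulary size)
print(toy_test_matrix.toarray().sum(axis=1))   # counted words per test review
```

The second test review contains no training words, so its row is all zeros.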

In [177]:
print(train_matrix.shape)
print(train_matrix3.shape)

(133416, 121712)
(133416, 121712)


In [178]:
#others' codes
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix2 = vectorizer.fit_transform(train_data2['clean_review'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
#test_matrix = vectorizer.transform(test_data['clean_review'])

In [179]:
print(train_matrix2.shape)

(133416, 121712)


In [180]:
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\b\\w+\\b', tokenizer=None,
        vocabulary=None)

In [181]:
train_matrix

<133416x121712 sparse matrix of type '<class 'numpy.int64'>'
	with 7326618 stored elements in Compressed Sparse Row format>

## Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data.

7. Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

In [245]:
#Step 1. Import the model I want to use
from sklearn.linear_model import LogisticRegression

#Step 2. Make an instance of the Model
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()
#logisticRegr2 = linear_model.LogisticRegression()

#Step 3. Training the model on the data, storing the information learned from the data
sentiment_model = logisticRegr.fit(train_matrix, train_data['sentiment'])
#sentiment_model2 = logisticRegr2.fit(train_matrix2, train_data2['sentiment'])

In [183]:
from sklearn import linear_model
sentiment=linear_model.LogisticRegression()
a = sentiment.fit(train_matrix2,train_data2['sentiment'])
weightsa= a.coef_
print(weightsa)

[[ -1.23643972e+00   2.05963522e-04   2.59016487e-02 ...,   1.14053962e-02
    3.20394181e-03  -7.15396558e-05]]


In [184]:
print(weightsa.shape)

(1, 121712)


8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to weights that cause positive sentiment, while negative weights correspond to negative sentiment. Calculate the number of positive (>= 0, which is actually nonnegative) coefficients.

Quiz question: How many weights are >= 0?

In [185]:
weights = sentiment_model.coef_
weights

array([[ -1.23643972e+00,   2.05963522e-04,   2.59016487e-02, ...,
          1.14053962e-02,   3.20394181e-03,  -7.15396558e-05]])

In [186]:
weights.shape

(1, 121712)

In [187]:
#Approach 1:
b=weights>=0
b

array([[False,  True,  True, ...,  True,  True, False]], dtype=bool)

In [188]:
weights[b]

array([  2.05963522e-04,   2.59016487e-02,   6.15274363e-03, ...,
         2.69073329e-05,   1.14053962e-02,   3.20394181e-03])

In [189]:
len(weights[b])

85811

In [190]:
print ('Answer: the amount of weights >=0 is', len(weights[b]))

Answer: the amount of weights >=0 is 85811


In [191]:
#Approach 2:
num_positive_weights = np.where(weights >= 0)[0].size #np.where return the coordinates of the weights>=0
num_positive_weights

85811

In [192]:
np.where(weights >= 0)

(array([0, 0, 0, ..., 0, 0, 0]),
 array([     1,      2,      3, ..., 121708, 121709, 121710]))

In [193]:
#how does np.where return the coordinates of the weights?
a = np.array([[0,1,2],[-1,2,-3]])
a

array([[ 0,  1,  2],
       [-1,  2, -3]])

In [194]:
np.where(a >= 0) # the result below means the coordinates of the entries >= 0 are (0,0),(0,1),(0,2),(1,1)

(array([0, 0, 0, 1]), array([0, 1, 2, 1]))

In [195]:
#Approach 3
np.sum(weights >=0)

85811

## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 3 examples in the test dataset.  We refer to this set of 3 examples as the **sample_test_data**.

In [196]:
sample_test_data = test_data[10:13]
print (sample_test_data['rating'])


59    5
71    2
91    1
Name: rating, dtype: int64


In [197]:
sample_test_data

Unnamed: 0,name,review,rating,review_clean,sentiment
59,Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in...,5,Absolutely love it and all of the Scripture in...,1
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The dec...,2,Would not purchase again or recommend The deca...,-1
91,New Style Trailing Cherry Blossom Tree Decal R...,Was so excited to get this product for my baby...,1,Was so excited to get this product for my baby...,-1


Let's dig deeper into the first row of the **sample_test_data**. Here's the full review:

In [198]:
sample_test_data['review_clean'].iloc[0]

'Absolutely love it and all of the Scripture in it  I purchased the Baby Boy version for my grandson when he was born and my daughterinlaw was thrilled to receive the same book again'

That review seems pretty positive.

Now, let's see what the next row of the sample_test_data looks like. As we could guess from the sentiment label (-1), the review is quite negative.

In [199]:
sample_test_data['review_clean'].iloc[1]

'Would not purchase again or recommend The decals were thick almost plastic like and were coming off the wall as I was applying them The would NOT stick Literally stayed stuck for about 5 minutes then started peeling off'

In [200]:
sample_test_data['review_clean'].iloc[2]

'Was so excited to get this product for my baby girls bedroom  When I got it the back is NOT STICKY at all  Every time I walked into the bedroom I was picking up pieces off of the floor  Very very frustrating  Ended up having to super glue it to the wallvery disappointing  I wouldnt waste the time or money on it'

We will now make a **class** prediction for the **sample_test_data**. The `sentiment_model` should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. Recall from the lecture that the **score** (sometimes called **margin**) for the logistic regression model  is defined as:

$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 

where $h(\mathbf{x}_i)$ represents the features for example $i$.  We will write some code to obtain the scores. For each row, the score (or margin) is a number in the range (-inf, inf). Use a pre-built function in your tool to calculate the score of each data point in sample_test_data. In scikit-learn, you can call the decision_function() method.

Hint: You'd probably need to convert sample_test_data into the sparse matrix format first.

In [201]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print (scores)

[  5.60351902  -3.13388392 -10.40529952]
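To make the connection between the score formula and decision_function() concrete, here is a minimal sketch on made-up data (the toy_X, toy_y, and toy_model names are assumptions for illustration). scikit-learn keeps the intercept in a separate attribute, intercept_, rather than as an extra feature, so the manual score adds it explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy word counts for two words and their sentiment labels
toy_X = np.array([[2, 0], [1, 0], [0, 1], [0, 3]], dtype=float)
toy_y = np.array([1, 1, -1, -1])

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Manual score: dot product of the weights with the features, plus the intercept
manual_scores = toy_X @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# Library score for the same rows
library_scores = toy_model.decision_function(toy_X)

print(np.allclose(manual_scores, library_scores))
```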


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

Checkpoint: Make sure your class predictions match with the ones obtained from sentiment_model. The logistic regression classifier in scikit-learn comes with the predict function for this purpose.

In [202]:
scores

array([  5.60351902,  -3.13388392, -10.40529952])

In [203]:
for i in range(len(scores)):
    if scores[i]>0:
        prediction_i = +1
    else:
        prediction_i = -1
    print(prediction_i)
        

1
-1
-1


In [204]:
sentiment_model.predict(sample_test_matrix)

array([ 1, -1, -1])
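The thresholding rule can also be written as a single vectorized expression. The sketch below applies np.where to the three scores printed earlier (copied here as literals, so the block runs on its own) and gives the same labels as the loop and as predict().

```python
import numpy as np

# Scores of the three sample reviews, copied from the output above
sample_scores = np.array([5.60351902, -3.13388392, -10.40529952])

# +1 where the score is strictly positive, -1 otherwise
class_predictions = np.where(sample_scores > 0, 1, -1)
print(class_predictions)
```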

### Probability predictions

Recall from the lectures that we can also calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the variable **scores** calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probability should be a number in the range **[0, 1]**.

Checkpoint: Make sure your probability predictions match the ones obtained from sentiment_model.

In [206]:
probability=[]
for i in range(len(scores)):
    probability_i = 1/(1+np.exp(-scores[i]))
    print(probability_i)
    probability.append(probability_i)
        

0.996328654703
0.0417310141735
3.02707157568e-05


In [207]:
probability

[0.99632865470250187, 0.041731014173507615, 3.0270715756816633e-05]

In [208]:
sentiment_model.predict_proba(sample_test_matrix)

array([[  3.67134530e-03,   9.96328655e-01],
       [  9.58268986e-01,   4.17310142e-02],
       [  9.99969729e-01,   3.02707158e-05]])
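The formula above can also be applied to all scores at once. This is a minimal sketch with a hypothetical helper named sigmoid, using the three scores from earlier as literals. Its results should agree with the second column of predict_proba, which holds the probability of the +1 class.

```python
import numpy as np

def sigmoid(scores):
    """Map real-valued scores to P(y = +1 | x, w) with the logistic function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))

# Scores of the three sample reviews, copied from the output above
sample_scores = np.array([5.60351902, -3.13388392, -10.40529952])
sample_probabilities = sigmoid(sample_scores)

print(sample_probabilities)
print(np.argmin(sample_probabilities))  # position of the least positive review
```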

Quiz question: Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

In [209]:
min(sentiment_model.predict_proba(sample_test_matrix)[:,1])

3.0270715756816633e-05

In [210]:
print(probability)
index_min = np.argmin(probability)
print('Answer: the review with the lowest probability of being classified as positive is number', index_min+1)

[0.99632865470250187, 0.041731014173507615, 3.0270715756816633e-05]
Answer: the review with the lowest probability of being classified as positive is number 3


We now turn to examining the full test dataset, **test_data**, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points for faster performance.

Using the `sentiment_model`, find the 20 reviews in the entire **test_data** with the **highest probability** of being classified as a **positive review**. We refer to these as the "most positive reviews."

To calculate these top-20 reviews, use the following steps:

1.  Make probability predictions on test_data using the sentiment_model.

2.  Sort the data according to those predictions and pick the top 20.
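The two steps above can be sketched on a toy example before running them on the full test set. The product names and probabilities below are made up for illustration. np.argsort returns row positions in ascending order of probability, and nlargest is a pandas shortcut for the same ranking.

```python
import numpy as np
import pandas as pd

toy_products = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e']})
toy_proba = np.array([0.20, 0.90, 0.50, 0.99, 0.10])
k = 2

# Positions of the k highest probabilities, most positive first
top_positions = np.argsort(toy_proba)[::-1][:k]

# Positions of the k lowest probabilities, most negative first
bottom_positions = np.argsort(toy_proba)[:k]

# Select the product names by position
top_names = toy_products['name'].iloc[top_positions].tolist()

# Same ranking with pandas alone
top_names_pandas = toy_products.assign(proba=toy_proba).nlargest(k, 'proba')['name'].tolist()

print(top_names, top_names_pandas)
```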

In [211]:
test_matrix = vectorizer.transform(test_data['review_clean'])

In [212]:
positive_proba = sentiment_model.predict_proba(test_matrix)[:,1]
positive_proba

array([ 0.78107798,  0.99999929,  0.93412345, ...,  0.99999448,
        0.99999741,  0.98089635])

In [213]:
'''index_min2 = np.argmin(positive_proba)
index_min2'''

'index_min2 = np.argmin(positive_proba)\nindex_min2'

In [214]:
df_proba = pd.DataFrame({'proba':positive_proba})
test_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,I love this journal and our nanny uses it ever...,1
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,I love this little calender you can keep track...,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,I had a hard time finding a second year calend...,1
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,One of babys first and favorite books and it i...,1


In [215]:
df_proba.head()

Unnamed: 0,proba
0,0.781078
1,0.999999
2,0.934123
3,0.999978
4,0.98021


In [216]:
df_proba.iloc[11923]

proba    1.0
Name: 11923, dtype: float64

In [217]:
min(df_proba['proba'])

9.07322809301731e-16

In [218]:
test_data.iloc[11923]

name                 Evenflo 6 Pack Classic Glass Bottle, 4-Ounce
review          It's always fun to write a review on those pro...
rating                                                          5
review_clean    Its always fun to write a review on those prod...
sentiment                                                       1
Name: 66059, dtype: object

In [219]:
index_largest = positive_proba.argsort()[-20:][::-1]
index_largest

array([21531, 25554, 26830, 17558, 11923, 24286, 18112,  9125, 30535,
       20743, 15732, 14482, 24899, 32782,  9555, 30634,  4140, 30076,
       33060, 26838])

In [220]:
top20 = test_data['name'].iloc[index_largest] # note that iloc selects rows by position (0, 1, 2, ...), not by the original index labels

In [221]:
top20.shape

(20,)

In [222]:
df = pd.DataFrame(top20)
df.to_csv("top20.csv")

In [223]:
pd.options.display.max_colwidth = 1000
print(top20)

119182            Roan Rocco Classic Pram Stroller 2-in-1 with Bassinet and Seat Unit - Coffee
140816                                              Diono RadianRXT Convertible Car Seat, Plum
147949                                 Baby Jogger City Mini GT Single Stroller, Shadow/Orange
97325                             Freemie Hands-Free Concealable Breast Pump Collection System
66059                                             Evenflo 6 Pack Classic Glass Bottle, 4-Ounce
133651                                                       Britax 2012 B-Agile Stroller, Red
100166                                  Infantino Wrap and Tie Baby Carrier, Black Blueberries
50315                                               P'Kolino Silly Soft Seating in Tias, Green
168081                                 Buttons Cloth Diaper Cover - One Size - 8 Color Options
114796                                     Fisher-Price Cradle 'N Swing,  My Little Snugabunny
87017                                          Bab

Quiz Question: Which of the following products are represented in the 20 most positive reviews?

Now, let us repeat this exercise to find the "most negative reviews." Use the prediction probabilities to find the 20 reviews in the test_data with the lowest probability of being classified as a positive review. Repeat the same steps above but make sure you sort in the opposite order.

In [224]:
index_smallest = np.argsort(positive_proba)[:20]
index_smallest

array([ 2931, 21700, 13939,  8818, 28184, 17069,  9655, 14711, 20594,
        1942,  1810, 10814, 31226, 13751,  7310, 27231, 28120,   205,
       15062,  5831])

In [225]:
smallest20 = test_data['name'].iloc[index_smallest]
print(smallest20)

16042                                                                Fisher-Price Ocean Wonders Aquarium Bouncer
120209    Levana Safe N'See Digital Video Baby Monitor with Talk-to-Baby Intercom and Lullaby Control (LV-TW501)
77072                                                             Safety 1st Exchangeable Tip 3 in 1 Thermometer
48694                        Adiri BPA Free Natural Nurser Ultimate Bottle Stage 1 White, Slow Flow (0-3 months)
155287                                 VTech Communications Safe &amp; Sounds Full Color Video and Audio Monitor
94560                                    The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit
53207                                                                        Safety 1st High-Def Digital Monitor
81332                                                                      Cloth Diaper Sprayer--styles may vary
113995                                     Motorola Digital Video Baby Monitor with Room Tempera

Quiz Question: Which of the following products are represented in the 20 most negative reviews?

## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

This can be computed as follows:

* **Step 1:** Use the trained model to compute class predictions (**Hint:** Use the `predict` method)
* **Step 2:** Count the number of data points when the predicted class labels match the ground truth labels (called `true_labels` below).
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.

Complete the function below to compute the classification accuracy:

In [226]:
def get_classification_accuracy(model, data, true_labels):
    # First get the predictions
    prediction = model.predict(data)
    print('prediction',prediction)
    
    # Compute the number of correctly classified examples
    diff = prediction - np.array(true_labels)
    print('diff is', diff)
    print('diff shape', diff.shape)
    num_correct = np.sum(diff==0)
    print('num_correct',num_correct)

    # Then compute accuracy by dividing num_correct by total number of examples
    accuracy = num_correct/len(true_labels)
    
    return accuracy
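As a cross-check on the three steps, accuracy can also be computed directly from two label arrays. The helper name accuracy_from_labels and the toy labels are assumptions for illustration. accuracy_score from sklearn.metrics should return the same value.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy_from_labels(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted_labels == true_labels))

toy_predicted = [1, -1, -1, 1]
toy_true = [1, -1, 1, 1]

print(accuracy_from_labels(toy_predicted, toy_true))
print(accuracy_score(toy_true, toy_predicted))
```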

Now, let's compute the classification accuracy of the **sentiment_model** on the **test_data**.

In [227]:
get_classification_accuracy(sentiment_model, sample_test_matrix, sample_test_data['sentiment'])

prediction [ 1 -1 -1]
diff is [0 0 0]
diff shape (3,)
num_correct 3


1.0

In [228]:
accuracy_test = get_classification_accuracy(sentiment_model, test_matrix, test_data['sentiment'])

prediction [1 1 1 ..., 1 1 1]
diff is [0 0 0 ..., 0 0 0]
diff shape (33336,)
num_correct 31078


In [229]:
accuracy_test

0.93226541876649871

**Quiz Question**: What is the accuracy of the **sentiment_model** on the **test_data**? Round your answer to 2 decimal places (e.g. 0.76).

**Quiz Question**: Does a higher accuracy value on the **training_data** always imply that the classifier is better?

In [230]:
print('Answer:',np.round(accuracy_test, decimals=2))

Answer: 0.93


In [231]:
print('Answer:no')

Answer:no


## Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [232]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [233]:
len(significant_words)

20

Compute a new set of word count vectors using only these words. The CountVectorizer class has a parameter that lets you limit the choice of words when building word count vectors:



In [234]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [235]:
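# Caution: .toarray() densifies the full sparse matrix, and vocabulary_.keys() is in first-seen order rather than
# column order, so these labels are not aligned with the columns; sorted(vocabulary_, key=vocabulary_.get) is aligned.
train_bag_of_words = pd.DataFrame(train_matrix.toarray(), columns=vectorizer.vocabulary_.keys())
train_bag_of_words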

Unnamed: 0,it,came,early,and,was,not,disappointed,i,love,planet,...,evaluationlove,fortun,removers,valueless,aspiratorsand,uprightright,squeezeable,squeasy,unllike,bimbi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [236]:
train_bag_of_words_list = list(train_bag_of_words.columns.values)
train_bag_of_words_list

['it',
 'came',
 'early',
 'and',
 'was',
 'not',
 'disappointed',
 'i',
 'love',
 'planet',
 'wise',
 'bags',
 'now',
 'my',
 'wipe',
 'holder',
 'keps',
 'osocozy',
 'wipes',
 'moist',
 'does',
 'leak',
 'highly',
 'recommend',
 'very',
 'soft',
 'comfortable',
 'warmer',
 'than',
 'looksfit',
 'the',
 'full',
 'size',
 'bed',
 'perfectlywould',
 'to',
 'anyone',
 'looking',
 'for',
 'this',
 'type',
 'of',
 'quilt',
 'is',
 'a',
 'product',
 'well',
 'worth',
 'purchase',
 'have',
 'found',
 'anything',
 'else',
 'like',
 'positive',
 'ingenious',
 'approach',
 'losing',
 'binky',
 'what',
 'most',
 'about',
 'how',
 'much',
 'ownership',
 'daughter',
 'has',
 'in',
 'getting',
 'rid',
 'she',
 'so',
 'proud',
 'herself',
 'loves',
 'her',
 'little',
 'fairy',
 'artwork',
 'chart',
 'back',
 'clever',
 'tool',
 'all',
 'kids',
 'cried',
 'nonstop',
 'when',
 'tried',
 'ween',
 'them',
 'off',
 'their',
 'pacifier',
 'until',
 'thumbuddy',
 'puppet',
 'an',
 'easy',
 'way',
 'work',


In [237]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

Compute word count vectors for the training and test data and obtain the sparse matrices train_matrix_word_subset and test_matrix_word_subset, respectively.
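To see what the vocabulary parameter does, here is a minimal sketch on two made-up reviews with a two-word vocabulary (both are assumptions for illustration). Columns follow the order of the supplied list, and every word outside the list is ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_words = ['love', 'broke']
toy_reviews = ['love love it', 'it broke']

# With a fixed vocabulary, no vocabulary is learned from the data
toy_subset_vectorizer = CountVectorizer(vocabulary=toy_words)
toy_subset_matrix = toy_subset_vectorizer.transform(toy_reviews)

print(toy_subset_matrix.toarray())  # column 0 counts 'love', column 1 counts 'broke'
```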

## Train a logistic regression model on a subset of data

We will now build a classifier with the word counts of the 20 selected words as the features and **sentiment** as the target.

17. Now build a logistic regression classifier with train_matrix_word_subset as features and sentiment as the target. Call this model simple_model.

In [238]:
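# Fit a fresh estimator: fit() returns the estimator itself, so reusing logisticRegr would overwrite sentiment_model
simple_model = LogisticRegression().fit(train_matrix_word_subset, train_data['sentiment'])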

18. Let us inspect the weights (coefficients) of the simple_model. First, build a table to store (word, coefficient) pairs. If you are using pandas with scikit-learn, you can combine the words and coefficients into a data frame.

Sort the data frame by the coefficient value in descending order.

In [239]:
simple_model.coef_.shape

(1, 20)

In [240]:
#combine a list and an array to be a dataframe
simple_model_coef_table = pd.DataFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
simple_model_coef_table.sort_values('coefficient', ascending=False)

Unnamed: 0,word,coefficient
6,loves,1.673074
5,perfect,1.509812
0,love,1.36369
2,easy,1.192538
1,great,0.944
4,little,0.520186
7,well,0.50376
8,able,0.190909
3,old,0.085513
9,car,0.058855


Quiz Question: Consider the coefficients of simple_model. How many of the 20 coefficients (corresponding to the 20 significant_words) are positive for the simple_model?



In [241]:
simple_model_coef_table.loc[simple_model_coef_table['coefficient']>0]

Unnamed: 0,word,coefficient
0,love,1.36369
1,great,0.944
2,easy,1.192538
3,old,0.085513
4,little,0.520186
5,perfect,1.509812
6,loves,1.673074
7,well,0.50376
8,able,0.190909
9,car,0.058855


In [242]:
print('Answer:',len(simple_model_coef_table.loc[simple_model_coef_table['coefficient']>0]))

Answer: 10


In [246]:
sentiment_model.coef_.flatten().shape

(121712,)

In [247]:
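# Pull all words from the fitted vocabulary.
# Caution: vocabulary_ maps each word to its column index, and keys() returns the words
# in insertion order, which need not match the column order of coef_.
words_vec=vectorizer.vocabulary_.keys()
words_vec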
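
One point to check before pairing words with coefficients: `vocabulary_` is a dictionary from word to column index, so the order of its keys is not guaranteed to match the order of the columns (and hence of `coef_`). The sketch below uses a small hypothetical dictionary, `toy_vocabulary`, as a stand-in for `vectorizer.vocabulary_`, and made-up weights in `toy_coef`, to show how sorting the words by index restores the alignment.

```python
# Hypothetical stand-in for vectorizer.vocabulary_: word -> column index.
# The keys are in order of first appearance; the indices are alphabetical.
toy_vocabulary = {'it': 3, 'came': 0, 'early': 1, 'great': 2}

# Hypothetical coefficients, one per column (index 0 is 'came', index 3 is 'it').
toy_coef = [0.1, 0.2, 0.9, -0.3]

# Pairing coefficients with the raw key order attaches weights to the wrong words.
misaligned = dict(zip(toy_vocabulary.keys(), toy_coef))

# Sorting the words by their column index lines each word up with its own coefficient.
words_by_column = sorted(toy_vocabulary, key=toy_vocabulary.get)
aligned = dict(zip(words_by_column, toy_coef))

print(misaligned)
print(aligned)
```

Recent scikit-learn versions also expose the column-ordered words directly through `get_feature_names_out()`, which avoids the manual sort.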



In [248]:
sentiment_model.coef_.tolist()

[[-1.2364397215775291,
  0.0002059635220541042,
  0.025901648720599513,
  0.006152743626914593,
  4.6111661387666984e-05,
  8.94886308248898e-07,
  0.0024337907431659705,
  0.26438620406046515,
  0.25213863218035165,
  -0.0019413511348899256,
  0.055545218472303676,
  -0.0001225908662198072,
  -0.22895031271602848,
  0.28186979044785776,
  -0.0006837213158683122,
  6.636193094319235e-06,
  1.669364368939638e-05,
  -0.004994633234490622,
  -0.011523728941159588,
  9.701678940646108e-05,
  0.08067394041664622,
  -0.08078003754082703,
  0.0033728260367063877,
  0.09770346101328771,
  0.010761432331785881,
  0.23539959973579627,
  0.08863943120217291,
  -0.02854222033641947,
  -0.041079914312711074,
  0.17729743640099266,
  1.8573996646859537e-05,
  1.8573996646859537e-05,
  -0.002863747027360251,
  0.08863943120217291,
  1.8573996646859537e-05,
  1.8573996646859537e-05,
  0.07875921083540027,
  0.17727886240434582,
  2.295616069403374e-06,
  -0.21539922537641212,
  1.2705482791853855e-05,

In [249]:
# Associate words with weights (this assumes words_vec is in the same order as the columns of coef_)
sentiment_weights=pd.Series(sentiment_model.coef_.tolist()[0],index=words_vec)
sentiment_weights

it                 -1.236440e+00
came                2.059635e-04
early               2.590165e-02
and                 6.152744e-03
was                 4.611166e-05
not                 8.948863e-07
disappointed        2.433791e-03
i                   2.643862e-01
love                2.521386e-01
planet             -1.941351e-03
wise                5.554522e-02
bags               -1.225909e-04
now                -2.289503e-01
my                  2.818698e-01
wipe               -6.837213e-04
holder              6.636193e-06
keps                1.669364e-05
osocozy            -4.994633e-03
wipes              -1.152373e-02
moist               9.701679e-05
does                8.067394e-02
leak               -8.078004e-02
highly              3.372826e-03
recommend           9.770346e-02
very                1.076143e-02
soft                2.353996e-01
comfortable         8.863943e-02
warmer             -2.854222e-02
than               -4.107991e-02
looksfit            1.772974e-01
          

In [250]:
len(train_bag_of_words_list)

121712

In [251]:
sentiment_model.coef_.flatten().shape

(121712,)

Quiz Question: Are the positive words in the simple_model also positive words in the sentiment_model?

In [253]:
#positive words in sentiment_model
sentiment_model_coef_table = pd.DataFrame({'word':train_bag_of_words_list,
                                         'coefficient':sentiment_model.coef_.flatten()})
sentiment_model_positive_coef_table = sentiment_model_coef_table.loc[sentiment_model_coef_table['coefficient']>0]

In [254]:
sentiment_model_coef_table

Unnamed: 0,word,coefficient
0,it,-1.236440e+00
1,came,2.059635e-04
2,early,2.590165e-02
3,and,6.152744e-03
4,was,4.611166e-05
5,not,8.948863e-07
6,disappointed,2.433791e-03
7,i,2.643862e-01
8,love,2.521386e-01
9,planet,-1.941351e-03


In [255]:
sentiment_weights=pd.Series(sentiment_model.coef_.tolist()[0],index=words_vec)
# Table of all words with their sentiment_model coefficients (not only the positive ones)
sentiment_model_coef_table2 = pd.DataFrame({'word':train_bag_of_words_list,
                                         'coefficient':sentiment_weights})
sentiment_model_coef_table2

Unnamed: 0,word,coefficient
it,it,-1.236440e+00
came,came,2.059635e-04
early,early,2.590165e-02
and,and,6.152744e-03
was,was,4.611166e-05
not,not,8.948863e-07
disappointed,disappointed,2.433791e-03
i,i,2.643862e-01
love,love,2.521386e-01
planet,planet,-1.941351e-03


In [256]:
pos=sentiment_weights[significant_words][sentiment_weights[significant_words]>0].index.tolist()
print(pos)
sentiment_weights[significant_words]

['love', 'great', 'old', 'loves', 'well', 'able', 'car', 'less', 'even', 'waste', 'disappointed', 'work', 'product', 'money', 'would', 'return']


love            0.252139
great           0.068700
easy           -0.004995
old             0.008204
little         -0.295291
perfect        -0.617489
loves           0.008759
well            0.000019
able            0.207451
car             0.050096
broke          -0.654365
less            0.041665
even            0.077787
waste           0.005453
disappointed    0.002434
work            0.000670
product         0.022522
money           0.000606
would           0.201160
return          0.200584
dtype: float64

In [257]:
#Approach 1: find matching words from sentiment_model_coef_table and look at the coefficients
sentiment_model_coef_table[sentiment_model_coef_table['word'].isin(significant_words)] 

Unnamed: 0,word,coefficient
6,disappointed,0.002434
8,love,0.252139
45,product,0.022522
46,well,1.9e-05
74,loves,0.008759
76,little,-0.295291
98,easy,-0.004995
100,work,0.00067
114,great,0.0687
150,would,0.20116


In [259]:
#Approach 2: find matching words from sentiment_model_positive_coef_table
sentiment_model_positive_coef_table[sentiment_model_positive_coef_table['word'].isin(significant_words)] 

Unnamed: 0,word,coefficient
6,disappointed,0.002434
8,love,0.252139
45,product,0.022522
46,well,1.9e-05
74,loves,0.008759
100,work,0.00067
114,great,0.0687
150,would,0.20116
164,able,0.207451
462,old,0.008204


In [None]:
print('Answer is no')
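
The comparison behind this answer can be sketched with plain dictionaries. The values below are copied from the outputs printed above and taken at face value; the names `simple_coef`, `sentiment_coef`, and `not_positive_in_both` are assumptions for illustration.

```python
# Coefficients of the ten positive simple_model words, as printed above.
simple_coef = {'love': 1.36369, 'great': 0.944, 'easy': 1.192538, 'old': 0.085513,
               'little': 0.520186, 'perfect': 1.509812, 'loves': 1.673074,
               'well': 0.50376, 'able': 0.190909, 'car': 0.058855}

# Weights reported for the same words in sentiment_model, as printed above.
sentiment_coef = {'love': 0.252139, 'great': 0.068700, 'easy': -0.004995, 'old': 0.008204,
                  'little': -0.295291, 'perfect': -0.617489, 'loves': 0.008759,
                  'well': 0.000019, 'able': 0.207451, 'car': 0.050096}

# Words that are positive in simple_model but not in sentiment_model.
not_positive_in_both = [w for w, c in simple_coef.items()
                        if c > 0 and sentiment_coef[w] <= 0]
print(not_positive_in_both)
```

A non-empty list means the answer is no. The result is only meaningful if each sentiment_model weight is paired with the correct word.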

# Comparing models

We will now compare the accuracy of the **sentiment_model** and the **simple_model** using the `get_classification_accuracy` method you implemented above.

First, compute the classification accuracy of the **sentiment_model** on the **train_data**:

In [260]:
get_classification_accuracy(sentiment_model, train_matrix, train_data['sentiment'])

prediction [1 1 1 ..., 1 1 1]
diff is [0 0 0 ..., 0 0 0]
diff shape (133416,)
num_correct 129108


0.96771001978773163

Now, compute the classification accuracy of the **simple_model** on the **train_data**:

In [261]:
get_classification_accuracy(simple_model, train_matrix_word_subset, train_data['sentiment'])

prediction [1 1 1 ..., 1 1 1]
diff is [0 0 0 ..., 0 0 0]
diff shape (133416,)
num_correct 115648


0.8668225700065959

**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TRAINING set?

In [262]:
print('Answer is sentiment_model')

Answer is sentiment_model


Now, we will repeat this exercise on the **test_data**. Start by computing the classification accuracy of the **sentiment_model** on the **test_data**:

In [263]:
get_classification_accuracy(sentiment_model, test_matrix, test_data['sentiment'])

prediction [1 1 1 ..., 1 1 1]
diff is [0 0 0 ..., 0 0 0]
diff shape (33336,)
num_correct 31078


0.93226541876649871

Next, we will compute the classification accuracy of the **simple_model** on the **test_data**:

In [264]:
get_classification_accuracy(simple_model, test_matrix_word_subset, test_data['sentiment'])

prediction [1 1 1 ..., 1 1 1]
diff is [0 0 0 ..., 0 0 0]
diff shape (33336,)
num_correct 28981


0.86936045116390692

**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TEST set?

In [265]:
print('Answer is sentiment_model')

Answer is sentiment_model
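
The four accuracies above can be cross-checked from the printed `num_correct` values and the lengths reported as `diff shape`. The short calculation below also shows the gap between training and test accuracy for each model; the dictionary `results` is just an illustrative container.

```python
# (num_correct, number of examples) as printed in the cell outputs above.
results = {
    ('sentiment_model', 'train'): (129108, 133416),
    ('simple_model', 'train'): (115648, 133416),
    ('sentiment_model', 'test'): (31078, 33336),
    ('simple_model', 'test'): (28981, 33336),
}

# Accuracy is the fraction of correctly classified examples.
accuracy = {key: correct / total for key, (correct, total) in results.items()}

for model in ('sentiment_model', 'simple_model'):
    train_acc = accuracy[(model, 'train')]
    test_acc = accuracy[(model, 'test')]
    # Print train accuracy, test accuracy, and the train-minus-test gap.
    print(model, round(train_acc, 4), round(test_acc, 4), round(train_acc - test_acc, 4))
```

The larger model loses about 3.5 percentage points from training to test, while the 20-word model scores about the same on both.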


## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless.
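
A minimal sketch of a majority class classifier in plain Python, assuming labels are +1 and -1 as in this notebook. The function names `fit_majority_class` and `majority_class_accuracy`, and the toy label lists, are hypothetical.

```python
from collections import Counter

def fit_majority_class(train_labels):
    """Return the most frequent label in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

def majority_class_accuracy(majority_label, labels):
    """Fraction of labels that equal the constant majority prediction."""
    return sum(1 for y in labels if y == majority_label) / len(labels)

# Toy labels for illustration only.
toy_train = [+1, +1, +1, -1, +1, -1]
toy_test = [+1, -1, +1, +1]

majority = fit_majority_class(toy_train)
print(majority, majority_class_accuracy(majority, toy_test))
```

The majority label is learned from the training labels and then scored on whichever set is being evaluated.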

What is the majority class in the **train_data**?

In [266]:
num_positive  = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print(num_positive)
print(num_negative)

112164
21252


Now compute the accuracy of the majority class classifier on **test_data**.

**Quiz Question**: Enter the accuracy of the majority class classifier model on the **test_data**. Round your answer to two decimal places (e.g. 0.76).

In [267]:
num_positive_test  = (test_data['sentiment'] == +1).sum()
num_negative_test = (test_data['sentiment'] == -1).sum()

In [268]:
accuracy_majority_class_classifier = num_positive_test/(num_negative_test+num_positive_test)
accuracy_majority_class_classifier

0.84278257739380846

**Quiz Question**: Is the **sentiment_model** definitely better than the majority class classifier (the baseline)?

In [269]:
print('Answer is yes')

Answer is yes
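
To put a number on "better", the test accuracies reported above can be compared with the baseline directly. The variable names below are illustrative; the values are the ones printed in this notebook.

```python
# Test-set accuracies as printed above.
baseline_accuracy = 0.84278257739380846
sentiment_accuracy = 0.93226541876649871
simple_accuracy = 0.86936045116390692

# Absolute margin over the baseline, in accuracy points.
sentiment_margin = sentiment_accuracy - baseline_accuracy
simple_margin = simple_accuracy - baseline_accuracy

# Share of the baseline's errors that each model still makes.
sentiment_error_ratio = (1 - sentiment_accuracy) / (1 - baseline_accuracy)
simple_error_ratio = (1 - simple_accuracy) / (1 - baseline_accuracy)

print(round(sentiment_margin, 4), round(sentiment_error_ratio, 2))
print(round(simple_margin, 4), round(simple_error_ratio, 2))
```

On these figures, sentiment_model makes fewer than half of the baseline's errors on the test set, while simple_model improves on the baseline only modestly.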
