---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*

# Case Study: Sentiment Analysis

### Data Prep

In [33]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Sample the data to speed up computation
# Comment out this line to match with lecture
df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0.1,Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
8841,111936,BLU Dash JR 4.0 K Smartphone - Unlocked - White,BLU,43.95,5,Excelente!!!,0.0
32237,31957,Apple iPhone 5c 32GB (Pink) - Verizon Wireless,Apple,159.99,5,It was sent to me promptly . I am very happy w...,3.0
28177,65658,Apple iPhone 6s 16 GB International Warranty U...,"Amazon.com, LLC *** KEEP PORules ACTIVE ***",540.0,5,Good condition as posted. Very pleased!,0.0
6928,277272,Nokia N9 16 GB Unlocked GSM Phone with MeeGo O...,Nokia,149.0,5,this phone is amazing! everything i ever wante...,4.0
10916,41081,Apple iPhone 5S 16GB Factory Unlocked GSM Cell...,Apple,265.0,5,Just what I wanted.,0.0


In [34]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids a possible SettingWithCopyWarning when the new column is added below)
df = df[df['Rating'] != 3].copy()

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0.1,Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
8841,111936,BLU Dash JR 4.0 K Smartphone - Unlocked - White,BLU,43.95,5,Excelente!!!,0.0,1
32237,31957,Apple iPhone 5c 32GB (Pink) - Verizon Wireless,Apple,159.99,5,It was sent to me promptly . I am very happy w...,3.0,1
28177,65658,Apple iPhone 6s 16 GB International Warranty U...,"Amazon.com, LLC *** KEEP PORules ACTIVE ***",540.0,5,Good condition as posted. Very pleased!,0.0,1
6928,277272,Nokia N9 16 GB Unlocked GSM Phone with MeeGo O...,Nokia,149.0,5,this phone is amazing! everything i ever wante...,4.0,1
10916,41081,Apple iPhone 5S 16GB Factory Unlocked GSM Cell...,Apple,265.0,5,Just what I wanted.,0.0,1
27493,140487,"BLU Studio 5.0 HD Unlocked Cellphone, White",BLU,119.99,5,"Excellent option, with a great relationship be...",0.0,1
11531,410339,Vivo 4.8 HD Quad Band Unlocked (Yellow),BLU,236.28,4,I bought this phone as I was searching for a w...,3.0,1
22229,200775,Huawei Mate 2 - Factory Unlocked (Black),Huawei,229.99,5,"Phone is great,I love it!",1.0,1
14564,220277,LG G3 D855 32GB LTE Unlocked GSM Android Smart...,LG,210.95,5,I'm in the US and use this on a GSM network (T...,0.0,1
5479,90003,Blackberry Bold Touch 9930 CDMA GSM Unlocked P...,BlackBerry,104.95,5,Good little blackberry. The touch screen is co...,0.0,1


In [35]:
# Most ratings are positive
df['Positively Rated'].mean()

0.7510548523206751

In [36]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [37]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 suddenly, PHONE died after 5 months. it is not starting or accepting charging.. no way to fix this issue until date and workaround available from blu or on the internet . This phone has built in battery therefore you will not be able to remove battery to reset cell...other freiend has same cell has another problem, the cell stuck in vibration mode while charging and no option to solve this one too.. I would not recommend this phone due to these problems.


X_train shape:  (2310,)


# CountVectorizer

In [38]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [39]:
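# Look at every 2000th feature name
# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead
vect.get_feature_names()[::2000]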

['00', 'exelente', 'pod', 'withs']

In [40]:
len(vect.get_feature_names())

6129

In [41]:
# Transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<2310x6129 sparse matrix of type '<class 'numpy.int64'>'
	with 60910 stored elements in Compressed Sparse Row format>

In [42]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression(max_iter=500)
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [43]:
from sklearn.metrics import roc_auc_score

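# Predict class labels for the transformed test documents
# Note: roc_auc_score receives hard 0/1 labels here, not probabilities or decision scores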
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.8181818181818183
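
The cell above passes the 0/1 output of `model.predict` to `roc_auc_score`. AUC is a ranking metric, so it is more commonly computed from continuous scores such as `model.decision_function(...)` or `model.predict_proba(...)[:, 1]`. The sketch below is only an illustration: `pairwise_auc` is a hypothetical helper (not part of scikit-learn) and the numbers are invented toy values, not the notebook's data. It shows how thresholding scores into labels can change the AUC.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, invented for illustration
y_true = [0, 0, 1, 1]
scores = [0.2, 0.3, 0.4, 0.45]                    # every positive is ranked above every negative
labels = [1 if s >= 0.5 else 0 for s in scores]   # hard labels at a 0.5 threshold

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

In this toy case the scores rank the classes perfectly, but every score falls below 0.5, so the hard labels are all 0 and carry no ranking information.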


In [44]:
# Get the feature names as a NumPy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['bad' 'not' 'sucks' 'disappointed' 'sprint' 'work' 'horrible' 'months'
 'poor' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'excelente' 'nice' 'excelent' 'good'
 'thanks' 'awesome']
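
The slicing used above can be checked on a small array. This is a minimal sketch with made-up coefficients and words (both hypothetical), showing that `argsort()` orders indices from smallest to largest value and that a negative-step slice walks back from the end.

```python
import numpy as np

# Hypothetical coefficients and vocabulary, for illustration only
coefs = np.array([0.5, -1.2, 2.0, -0.3, 0.9])
words = np.array(['ok', 'bad', 'love', 'slow', 'nice'])

order = coefs.argsort()               # indices from smallest to largest coefficient
smallest_two = words[order[:2]]       # two most negative coefficients
largest_two = words[order[:-3:-1]]    # two most positive, largest first

print(smallest_two)
print(largest_two)
```

With `[:-3:-1]` the slice starts at the last element and stops before the third from the end, so it returns two items; `[:-11:-1]` in the notebook returns ten in the same way.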


# Tfidf

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

1533
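
Setting `min_df=5` drops every token that appears in fewer than 5 training documents, which is why the vocabulary shrinks. The sketch below imitates that filter in plain Python on invented toy reviews with a smaller threshold. It uses the vectorizer's default token pattern but is not the library's implementation.

```python
import re
from collections import Counter

# Toy reviews, invented for illustration
docs = ["great phone", "great battery", "great screen", "exelente"]

# Lowercase and keep tokens of 2+ word characters (scikit-learn's default token pattern)
tokenized = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in docs]

# Document frequency: number of documents containing each token at least once
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

min_df = 2  # keep tokens that appear in at least 2 documents
vocabulary = sorted(t for t, n in doc_freq.items() if n >= min_df)
print(vocabulary)
```

Rare tokens such as one-off misspellings fall out of the vocabulary, which reduces noise but also discards any signal they carried.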

In [46]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.7637000254647313


In [47]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['prefer' 'pc' 'disabled' 'stylus' 'become' 'manually' 'manager' 'laptop'
 '40' 'media']

Largest tfidf: 
['good' 'junk' 'as' 'fantastic' 'satisfied' 'top' 'bien' 'wonderful'
 'awesome' 'exelente']
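
In the cell above, `X_train_vectorized.max(0)` takes each feature's largest tf-idf weight across all training documents. To make the weights themselves concrete, here is a minimal sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them (raw term counts, smoothed idf, L2 normalization per document). The documents are invented and the helper names `idf` and `tfidf` are hypothetical.

```python
import math

# Toy tokenized reviews, invented for illustration
docs = [["good", "phone"], ["good", "battery"], ["good", "screen", "junk"]]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    # Term count times idf, then scale the document vector to unit L2 length
    raw = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

Here 'good' occurs in every document and 'phone' in only one, so 'phone' receives the larger weight in the first document even though both words appear once.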


In [48]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'bad' 'disappointed' 'work' 'after' 'months' 'return' 'sucks'
 'didn' 'off']

Largest Coefs: 
['great' 'love' 'good' 'excellent' 'perfect' 'nice' 'price' 'excelente'
 'works' 'as']


In [49]:
# Both reviews get the same prediction: single-word features ignore word order
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
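
The two reviews receive the same prediction because they contain exactly the same words, so any single-word (unigram) representation maps them to the same feature vector. A small sketch in plain Python, using the vectorizer's default token pattern with a hypothetical helper `unigram_counts`:

```python
import re
from collections import Counter

def unigram_counts(text):
    # Lowercase and keep tokens of 2+ word characters (scikit-learn's default token pattern)
    return Counter(re.findall(r"(?u)\b\w\w+\b", text.lower()))

a = unigram_counts('not an issue, phone is working')
b = unigram_counts('an issue, phone is not working')

same_features = (a == b)
print(same_features)
```

Identical counts also yield identical tf-idf weights, so the classifier has no way to tell the two reviews apart.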


# n-grams

In [50]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

3732

In [51]:
model = LogisticRegression(max_iter=500)
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.8247262541380188


In [52]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['bad' 'not' 'sucks' 'not good' 'sprint' 'work' 'poor' 'disappointed'
 'didn' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'excelente' 'nice' 'good' 'excelent'
 'thanks' 'awesome']


In [53]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]
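
With 2-grams added, the two reviews no longer share an identical feature set. The sketch below lists the n-grams that differ, using a hypothetical helper `ngrams` written in plain Python. It ignores the `min_df=5` filter, so not every bigram shown is necessarily in the fitted vocabulary.

```python
import re

def ngrams(text, n_min=1, n_max=2):
    # Tokenize as the vectorizer does by default, then join adjacent tokens into n-grams
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return {" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)}

a = ngrams('not an issue, phone is working')
b = ngrams('an issue, phone is not working')

only_in_a = sorted(a - b)
only_in_b = sorted(b - a)
print(only_in_a)
print(only_in_b)
```

Bigrams such as 'not working' capture a negation together with the word it modifies, which gives the model a feature that distinguishes the second review.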
