---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*

# Case Study: Sentiment Analysis

In [1]:
#Amazon_Unlocked_Mobile = df.sample(frac=1, random_state=10)
#Amazon_Unlocked_Mobile.to_csv(r'Amazon_Unlocked_Mobile_reduced.csv', index = False)

In [2]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile_reduced.csv')

# Sample the data to speed up computation
# Comment out this line to match the lecture results
df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
8841,Apple iPhone 6 128GB Factory Unlocked GSM Smar...,Apple,449.99,5,Looks 100% like a new one,4.0
32237,Apple iPhone 5 16GB - Unlocked - Black (Certif...,Apple,124.0,5,It has been working great I would recommend this,0.0
28177,"Samsung T139 Unlocked Phone with Camera, Bluet...",Samsung,33.95,5,This was a replacement phone for the broken on...,1.0
6928,LG Optimus Factory Unlocked Gsm Android Phone ...,LG,208.76,3,the processor is too slow,0.0
10916,Samsung Galaxy S6 G920A 64GB Unlocked GSM 4G L...,Samsung,429.93,5,Everything is perfect I found no problem with ...,0.0


### Data Prep

In [3]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
df = df[df['Rating'] != 3]

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
8841,Apple iPhone 6 128GB Factory Unlocked GSM Smar...,Apple,449.99,5,Looks 100% like a new one,4.0,1
32237,Apple iPhone 5 16GB - Unlocked - Black (Certif...,Apple,124.0,5,It has been working great I would recommend this,0.0,1
28177,"Samsung T139 Unlocked Phone with Camera, Bluet...",Samsung,33.95,5,This was a replacement phone for the broken on...,1.0,1
10916,Samsung Galaxy S6 G920A 64GB Unlocked GSM 4G L...,Samsung,429.93,5,Everything is perfect I found no problem with ...,0.0,1
27493,Apple iPhone 5C 16GB White - Unlocked Cell Phones,Apple,135.0,5,my son loves his iPhone,0.0,1
4152,Sony Ericsson XPERIA X10 Mini E10i Unlocked Sm...,Sony Ericsson Mobile,143.99,4,It's a pretty decent phone. Everybody is in sh...,2.0,1
22229,"Apple iPhone 6 Plus Unlocked Cellphone, 16GB, ...",Apple,519.0,5,Awesom,0.0,1
40586,"ZTE Axon 7 unlocked smartphone,64GB Grey (US W...",ZTE,399.99,1,"Unfortunately, while the product seemed to be ...",13.0,0
16061,Samsung Galaxy S5 SM-G900T - 16GB - Shimmery W...,Samsung,189.99,5,Great..loving every minute of it.,0.0,1
9626,Apple iPhone 5s 64GB (Gold) -T-Mobile,Apple,265.0,5,Im loving it....So far it works very good no p...,1.0,1
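
The encoding step above can be checked on a handful of values. This is a minimal sketch: the `ratings` array is hypothetical and stands in for `df['Rating']`.

```python
import numpy as np

# Hypothetical star ratings, standing in for df['Rating']
ratings = np.array([5, 3, 1, 4, 2, 3, 5])

# Drop the neutral 3-star ratings first
kept = ratings[ratings != 3]

# Ratings above 3 become 1 (positive); the remaining 1s and 2s become 0
labels = np.where(kept > 3, 1, 0)

print(kept)    # [5 1 4 2 5]
print(labels)  # [1 0 1 0 1]
```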


In [4]:
# Most ratings are positive
df['Positively Rated'].mean()

0.7466882067851374

In [5]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [6]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Great value for money. Plus it is stylish.


X_train shape:  (2321,)
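
The training-set size follows from the default split. When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, rounded up, for testing. The sketch below only reproduces that arithmetic to infer how many rows were left after data prep; the total row count is an inference from the shape printed above, not a value shown in the notebook.

```python
import math

n_train = 2321        # X_train.shape[0] reported above
test_fraction = 0.25  # train_test_split default when no size is passed

# Find every total row count n for which n - ceil(0.25 * n) equals n_train
candidates = [n for n in range(n_train, 2 * n_train)
              if n - math.ceil(test_fraction * n) == n_train]

print(candidates)  # [3095]
```

With 3095 rows, the test set would hold the remaining 774 reviews.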


# CountVectorizer

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
list(vect.get_feature_names_out()[::2000])



['00', 'entry', 'optimus', 'unacceptable']

In [9]:
len(vect.get_feature_names_out())

6497

In [10]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<2321x6497 sparse matrix of type '<class 'numpy.int64'>'
	with 65016 stored elements in Compressed Sparse Row format>
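
To see what the document-term matrix contains, here is a rough pure-Python sketch of the bag-of-words idea. It lowercases the text and uses the same default token pattern as `CountVectorizer` (runs of two or more word characters), but it is not scikit-learn's implementation, and the second review in `docs` is made up for illustration.

```python
import re
from collections import Counter

# CountVectorizer's default token pattern: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

docs = ["Great value for money. Plus it is stylish.",  # first training review
        "It doesn't work, not great."]                 # hypothetical review

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary term, cells hold counts
matrix = [[Counter(tokenize(doc))[term] for term in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Note that "doesn't" becomes the token `doesn`: the apostrophe splits the word and the single-character `t` is dropped. This is why `doesn` and `didn` show up as features later in the notebook.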

In [11]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [12]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.8461517952364024
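
The cell above passes hard 0/1 labels from `model.predict` to `roc_auc_score`. AUC is normally computed from continuous scores (for example `model.predict_proba(...)[:, 1]` or `model.decision_function(...)`), because thresholding discards the ranking among examples on the same side of the cutoff. The sketch below illustrates the difference with a small hand-written pairwise AUC; the labels and probabilities are invented, and the numbers reported in this notebook remain those obtained from hard labels.

```python
def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1, 1]
probs = [0.2, 0.3, 0.4, 0.45, 0.9]

# Hard labels, as a 0.5 threshold would produce them
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_scores = auc(y_true, probs)  # every positive outranks every negative
auc_hard = auc(y_true, hard)     # ties introduced by thresholding lower the score

print(auc_scores, auc_hard)
```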


In [13]:
# Get the feature names as a NumPy array
feature_names = np.array(vect.get_feature_names_out())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'bad' 'slow' 'months' 'doesn' 'locked' 'broken' 'didn' 'poor'
 'disappointed']

Largest Coefs: 
['great' 'love' 'excellent' 'excelente' 'perfect' 'good' 'nice' 'awesome'
 'excelent' 'happy']
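
The `argsort` and slicing idiom used above is easier to follow on a toy example. The coefficients and feature names below are hypothetical.

```python
import numpy as np

# Hypothetical coefficients and their feature names
coef = np.array([0.5, -2.0, 1.5, 0.0, -0.3])
names = np.array(['good', 'bad', 'great', 'phone', 'slow'])

# argsort returns the indices that would sort coef in ascending order
order = coef.argsort()

# First k indices: the k most negative coefficients
smallest = names[order[:2]]

# [:-3:-1] walks backwards from the end and stops before the third-from-last
# element, giving the 2 largest coefficients in descending order
largest = names[order[:-3:-1]]

print(smallest)  # ['bad' 'slow']
print(largest)   # ['great' 'good']
```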




# Tfidf

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names_out())

1600
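
Setting `min_df=5` drops every token that appears in fewer than 5 training reviews, which is why the vocabulary shrinks from 6497 to 1600 features. A minimal sketch of that filter, using hypothetical tokenized reviews and a smaller threshold so the example stays short:

```python
from collections import Counter

# Hypothetical tokenized reviews
docs = [["great", "phone"],
        ["great", "battery"],
        ["great", "phone", "phone"],
        ["slow"]]
min_df = 2  # the notebook uses 5; 2 keeps the toy example small

# Document frequency counts a term once per review, however often it repeats
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep only terms found in at least min_df reviews
vocabulary = sorted(term for term, count in doc_freq.items() if count >= min_df)

print(vocabulary)  # ['great', 'phone']
```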

In [15]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.7843227870600782
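
For reference, here is a sketch of how a tf-idf weight is formed under `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The two-document corpus is hypothetical and the code is an illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus
docs = [["good", "phone"], ["bad", "phone"]]
n_docs = len(docs)

# Number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
           for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

"phone" appears in both documents, so it receives a lower weight than "good", which appears in only one.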


In [16]:
feature_names = np.array(vect.get_feature_names_out())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['300' 'supposedly' 'song' 'usable' 'decision' 'users' 'rounded' 'typing'
 'slower' 'hundreds']

Largest tfidf: 
['regular' 'none' 'disappointing' 'okay' 'ok' 'very' 'excelent'
 'excelente' 'excellent' 'amazing']
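
The ranking above takes, for each feature, its maximum tf-idf value over all training reviews. A dense NumPy sketch of the same idea follows; the matrix and feature names are hypothetical, and the notebook itself works on a sparse matrix, hence the extra `.toarray()[0]` there.

```python
import numpy as np

names = np.array(['ok', 'supposedly', 'phone', 'great'])

# Hypothetical L2-normalized tf-idf rows, one per review
X = np.array([[1.0, 0.00, 0.0, 0.00],   # a one-word review: "ok"
              [0.0, 0.28, 0.0, 0.96],
              [0.0, 0.00, 0.6, 0.80]])

# Highest weight each feature reaches in any review
max_per_feature = X.max(axis=0)
order = max_per_feature.argsort()

print(names[order[:2]])     # ['supposedly' 'phone']
print(names[order[:-3:-1]]) # ['ok' 'great']
```

A term that makes up an entire one-word review gets a weight of 1.0 after normalization, which is one plausible reason short reactions such as "ok" or "excelente" top the list.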




In [17]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'bad' 'locked' 'slow' 'months' 'doesn' 'work' 'broken' 'money'
 'return']

Largest Coefs: 
['great' 'love' 'good' 'excellent' 'perfect' 'works' 'excelente' 'nice'
 'happy' 'awesome']


In [18]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
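
Both reviews receive the same prediction because a unigram bag-of-words model ignores word order: the two sentences contain exactly the same tokens. A small sketch, using the default `CountVectorizer` token pattern in plain Python:

```python
import re
from collections import Counter

def tokenize(text):
    # Same pattern CountVectorizer uses by default
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

first = Counter(tokenize('not an issue, phone is working'))
second = Counter(tokenize('an issue, phone is not working'))

# Identical token counts mean identical feature vectors
print(first == second)  # True
```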


# n-grams

In [19]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names_out())

4020
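
With `ngram_range=(1,2)`, each review contributes its single tokens plus every pair of adjacent tokens. The sketch below builds those n-grams by hand for the two test sentences used in this notebook; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

def tokenize(text):
    # Same pattern CountVectorizer uses by default
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def word_ngrams(tokens, min_n=1, max_n=2):
    """Join every run of min_n to max_n adjacent tokens with a space."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

a = set(word_ngrams(tokenize('not an issue, phone is working')))
b = set(word_ngrams(tokenize('an issue, phone is not working')))

# Bigrams that appear in only one of the two sentences
print(sorted(a - b))  # ['is working', 'not an']
print(sorted(b - a))  # ['is not', 'not working']
```

The two sentences now differ in their features, but only bigrams that reach `min_df=5` in the training data stay in the fitted vocabulary, so whether the model can use them depends on the sample.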

In [20]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.844427657305368


In [21]:
feature_names = np.array(vect.get_feature_names_out())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'bad' 'doesn' 'slow' 'months' 'locked' 'broken' 'not good' 'didn'
 'wrong']

Largest Coefs: 
['great' 'excellent' 'love' 'excelente' 'perfect' 'good' 'nice' 'awesome'
 'excelent' 'happy']




In [22]:
# With bigrams the model can in principle tell these reviews apart;
# on this reduced sample, however, both are still predicted as 0
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
