# Nafisur Rahman
nafisur21@gmail.com
https://www.linkedin.com/in/nafisur-rahman

## Sentiment Analysis on Amazon Reviews: Unlocked Mobile Phones
PromptCloud extracted about 400,000 reviews of unlocked mobile phones sold on Amazon.com to explore how reviews, ratings, and prices relate to one another.

## Sentiment Analysis
The goal is to classify each Amazon review as positive or negative.

## A. Loading Libraries and Dataset

In [1]:
import nltk
import re
import numpy as np # linear algebra
import pandas as pd # data processing
import random
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import SnowballStemmer
stemmer=SnowballStemmer('english')

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.tokenize import word_tokenize

%matplotlib inline

In [2]:
raw_dataset=pd.read_csv('Amazon_Unlocked_Mobile.csv')
raw_dataset.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


### Basic inspection of the dataset

In [3]:
df=raw_dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413840 entries, 0 to 413839
Data columns (total 6 columns):
Product Name    413840 non-null object
Brand Name      348669 non-null object
Price           407907 non-null float64
Rating          413840 non-null int64
Reviews         413778 non-null object
Review Votes    401544 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 18.9+ MB


Selecting only two columns: Reviews and Rating

In [4]:
df=df[['Reviews','Rating']]


In [5]:
df.head()

Unnamed: 0,Reviews,Rating
0,I feel so LUCKY to have found this used (phone...,5
1,"nice phone, nice up grade from my pantach revu...",4
2,Very pleased,5
3,It works good but it goes slow sometimes but i...,4
4,Great phone to replace my lost phone. The only...,4


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413840 entries, 0 to 413839
Data columns (total 2 columns):
Reviews    413778 non-null object
Rating     413840 non-null int64
dtypes: int64(1), object(1)
memory usage: 6.3+ MB


Removing rows with missing values

In [7]:
df=df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 413778 entries, 0 to 413839
Data columns (total 2 columns):
Reviews    413778 non-null object
Rating     413778 non-null int64
dtypes: int64(1), object(1)
memory usage: 9.5+ MB


Removing rows with Rating = 3, which are treated as neutral sentiment

In [8]:
df=df[df['Rating']!=3]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 382015 entries, 0 to 413839
Data columns (total 2 columns):
Reviews    382015 non-null object
Rating     382015 non-null int64
dtypes: int64(1), object(1)
memory usage: 8.7+ MB


In [9]:
df=df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382015 entries, 0 to 382014
Data columns (total 2 columns):
Reviews    382015 non-null object
Rating     382015 non-null int64
dtypes: int64(1), object(1)
memory usage: 5.8+ MB


In [10]:
# label ratings of 4 or 5 as positive (1) and ratings of 1 or 2 as negative (0)
df['sentiment']=np.where(df['Rating'] > 3, 1, 0)
df.head()

Unnamed: 0,Reviews,Rating,sentiment
0,I feel so LUCKY to have found this used (phone...,5,1
1,"nice phone, nice up grade from my pantach revu...",4,1
2,Very pleased,5,1
3,It works good but it goes slow sometimes but i...,4,1
4,Great phone to replace my lost phone. The only...,4,1
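
The filtering and labelling steps above can be sketched on a small, made-up frame. The `toy` DataFrame and its five rows are hypothetical and only mirror the column names used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with the same two columns as df
toy = pd.DataFrame({
    'Reviews': ['Very pleased', None, 'It is okay', 'used hard', 'nice phone'],
    'Rating': [5, 4, 3, 1, 4],
})

# Drop rows without review text, then drop neutral (3-star) rows
toy = toy.dropna(subset=['Reviews'])
toy = toy[toy['Rating'] != 3].reset_index(drop=True)

# Ratings above 3 become positive (1), the rest negative (0)
toy['sentiment'] = np.where(toy['Rating'] > 3, 1, 0)
print(toy)
```

Only three rows survive: the row with missing text and the 3-star row are removed before the label is assigned.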


In [11]:
df.tail()

Unnamed: 0,Reviews,Rating,sentiment
382010,good rugged phone that has a long-lasting batt...,4,1
382011,used hard,1,0
382012,another great deal great price,5,1
382013,Passes every drop test onto porcelain tile!,5,1
382014,Only downside is that apparently Verizon no lo...,4,1


## B. Data Cleaning and Text Preprocessing

In [12]:
# Cstopwords is only needed by the commented-out stemming step below
Cstopwords=set(stopwords.words('english')+list(punctuation))
from nltk.stem import WordNetLemmatizer
lemma=WordNetLemmatizer()
def clean_review(review_column):
    review_corpus=[]
    # iterate over the values directly so the function does not depend on the index labels
    for review in review_column:
        #review=BeautifulSoup(review,'lxml').text
        # convert to str first so that non-string entries do not break re.sub
        review=re.sub('[^a-zA-Z]',' ',str(review))
        review=review.lower()
        review=word_tokenize(review)
        #review=[stemmer.stem(w) for w in review if w not in Cstopwords]
        # lemmatize with the default part of speech (noun); stop words are kept
        review=[lemma.lemmatize(w) for w in review ]
        review=' '.join(review)
        review_corpus.append(review)
    return review_corpus
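
To see what the regular-expression and lower-casing steps do in isolation, here is a minimal sketch that uses only the standard library. The helper `basic_clean` is hypothetical; it skips NLTK tokenization and lemmatization, so it is not a replacement for `clean_review`.

```python
import re

def basic_clean(text):
    """Keep letters only, lowercase, and collapse repeated whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', str(text))
    return ' '.join(letters_only.lower().split())

example = "Good sturdy phone for a pre-teen, 100% OK!"
print(basic_clean(example))
```

Digits and punctuation become spaces, which is why hyphenated words such as "pre-teen" end up as two separate tokens.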

In [13]:
review_column=df['Reviews']
review_corpus=clean_review(review_column)

In [14]:
df['clean_review']=review_corpus
df.tail(20)

Unnamed: 0,Reviews,Rating,sentiment,clean_review
381995,"This phone is simple, very good , and it works...",5,1,this phone is simple very good and it work exc...
381996,Good sturdy phone for a pre-teen to have avail...,4,1,good sturdy phone for a pre teen to have avail...
381997,This is the second junk Convoy I have owned. T...,1,0,this is the second junk convoy i have owned th...
381998,I BOUGHT THIS PHONE FOR MY HUSBAND AND HE LOVE...,5,1,i bought this phone for my husband and he love...
381999,They said phone was normal wear but it was a l...,1,0,they said phone wa normal wear but it wa a lie...
382000,"You could shoot this out of a potato gun, and ...",5,1,you could shoot this out of a potato gun and p...
382001,Bought this for my mother and she loves it. Gr...,5,1,bought this for my mother and she love it grea...
382002,"Excellent phone, as advertised. Love the push-...",5,1,excellent phone a advertised love the push to ...
382003,works great and picks up signal in place my ot...,4,1,work great and pick up signal in place my othe...
382004,"Great phone. Large keys, best flip phone I hav...",5,1,great phone large key best flip phone i have o...


## C. Creating Features

### 1. Bag of words model
* CountVectorizer

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
cv=CountVectorizer(max_features=20000,min_df=5,ngram_range=(1,2))

In [17]:
X1=cv.fit_transform(df['clean_review'])
X1.shape

(382015, 20000)
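
A small sketch of what `ngram_range=(1,2)` produces, using two made-up documents. The names `toy_docs` and `toy_cv` are illustrative only; the real vectorizer above additionally applies `max_features` and `min_df`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["not bad phone", "bad phone"]

# Unigrams and bigrams, no frequency cut-offs
toy_cv = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_cv.fit_transform(toy_docs)

# Features are stored in alphabetical order, so sorting the vocabulary keys
# lists them in column order
toy_features = sorted(toy_cv.vocabulary_)
print(toy_features)
print(toy_counts.toarray())
```

Bigrams such as "not bad" become their own columns, which lets a linear model treat them differently from "bad" alone.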

### 2. TF-IDF

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
tfidf=TfidfVectorizer(min_df=5, max_df=0.95, max_features = 20000, ngram_range = ( 1, 2 ),
                              sublinear_tf = True)

In [20]:
tfidf=tfidf.fit(df['clean_review'])

In [21]:
X2=tfidf.transform(df['clean_review'])
X2.shape

(382015, 20000)
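
The `sublinear_tf=True` option replaces each raw term count `tf` with `1 + log(tf)`. The sketch below isolates that effect on one made-up document by switching off IDF weighting and normalization; those two settings are simplifications for illustration and differ from the vectorizer configured above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good good good phone"]

# Raw term counts (no IDF, no normalization)
raw = TfidfVectorizer(use_idf=False, norm=None)
# Same, but with logarithmic damping of the counts
damped = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)

raw_weights = raw.fit_transform(toy_docs).toarray()[0]
damped_weights = damped.fit_transform(toy_docs).toarray()[0]

print(raw_weights)     # counts for 'good' and 'phone'
print(damped_weights)  # 1 + log(count) for each term
```

A word repeated three times is weighted about 2.1 instead of 3, so long repetitive reviews dominate less.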

In [22]:
y=df['sentiment'].values
y.shape

(382015,)

## D. Machine Learning

#### Splitting data into Training and Test set

In [30]:
X=X2 #X1 for bag of words model and X2 for Tfidf model

In [31]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(305612, 20000) (305612,)
(76403, 20000) (76403,)


In [32]:
# fraction of positive reviews in train and test
print('mean positive review in train : {0:.3f}'.format(np.mean(y_train)))
print('mean positive review in test : {0:.3f}'.format(np.mean(y_test)))

mean positive review in train : 0.746
mean positive review in test : 0.745
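
Here the random split happens to keep the class ratio almost identical in both parts. If an exact match is wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up labels with 75% positives; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 75 positive and 25 negative examples
y_toy = np.array([1] * 75 + [0] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print('positive share in train:', y_tr.mean())
print('positive share in test :', y_te.mean())
```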


### 1. Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression as lr

In [34]:
model_lr=lr(random_state=0)

%%time
from sklearn.model_selection import GridSearchCV
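# note: recent scikit-learn versions default to solver='lbfgs', which does not support 'l1';
# in that case create model_lr with solver='liblinear' (or 'saga')
parameters = {'C':[0.5,1.0, 10.0], 'penalty' : ['l1','l2']}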
grid_search = GridSearchCV(estimator = model_lr,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print('Best Accuracy :',best_accuracy)
print('Best parameters:\n',best_parameters)
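
A scaled-down sketch of the same grid search on synthetic data, assuming a scikit-learn version where the solver has to be set explicitly for the `'l1'` penalty. The data from `make_classification`, the 3-fold setting, and the names `X_toy`, `y_toy`, and `toy_search` are illustrative choices, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_search = GridSearchCV(
    # liblinear supports both 'l1' and 'l2' penalties
    estimator=LogisticRegression(solver='liblinear', random_state=0),
    param_grid={'C': [0.5, 1.0, 10.0], 'penalty': ['l1', 'l2']},
    scoring='accuracy',
    cv=3,
)
toy_search.fit(X_toy, y_toy)

print('Best accuracy  :', toy_search.best_score_)
print('Best parameters:', toy_search.best_params_)
```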

In [35]:
%%time
model_lr=lr(penalty='l2',C=1.0,random_state=0)
model_lr.fit(X_train,y_train)
y_pred_lr=model_lr.predict(X_test)
print('accuracy for Logistic Regression :',accuracy_score(y_test,y_pred_lr))
print('confusion matrix for Logistic Regression:\n',confusion_matrix(y_test,y_pred_lr))
print('F1 score for Logistic Regression :',f1_score(y_test,y_pred_lr))
print('Precision score for Logistic Regression :',precision_score(y_test,y_pred_lr))
print('recall score for Logistic Regression :',recall_score(y_test,y_pred_lr))
# note: AUC is computed here from hard 0/1 predictions, not from predicted probabilities
print('AUC: ', roc_auc_score(y_test, y_pred_lr))

accuracy for Logistic Regression : 0.96234441056
confusion matrix for Logistic Regression:
 [[17833  1632]
 [ 1245 55693]]
F1 score for Logistic Regression : 0.974821245723
Precision score for Logistic Regression : 0.971530745748
recall score for Logistic Regression : 0.978134110787
AUC:  0.947145658014
Wall time: 5.42 s
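
The scores above can be re-derived by hand from the printed confusion matrix, which is a useful sanity check. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes). It also shows that an AUC computed from hard 0/1 predictions equals the mean of recall and specificity.

```python
# Confusion matrix printed above for logistic regression
tn, fp = 17833, 1632   # true class 0
fn, tp = 1245, 55693   # true class 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # true positive rate
specificity = tn / (tn + fp)      # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard predictions the ROC curve has a single interior point,
# so the area under it is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

for name, value in [('accuracy', accuracy), ('precision', precision),
                    ('recall', recall), ('f1', f1), ('auc', auc_from_labels)]:
    print('{0:<10s}{1:.6f}'.format(name, value))
```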


In [36]:
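# get the feature names as numpy array
# X is X2 (TF-IDF) here, so the names are taken from tfidf (use cv if X=X1);
# on scikit-learn >= 1.0 call get_feature_names_out() instead
feature_names = np.array(tfidf.get_feature_names())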

# Sort the coefficients from the model
sorted_coef_index = model_lr.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'worst' 'disappointed' 'not happy' 'poor' 'terrible' 'doesn'
 'horrible' 'useless' 'return']

Largest Coefs: 
['great' 'love' 'excellent' 'perfect' 'no problem' 'amazing' 'awesome'
 'best' 'love this' 'not bad']
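
The slicing used to pick the extremes is easier to follow on a tiny example. The names and coefficient values below are made up for illustration.

```python
import numpy as np

toy_names = np.array(['great', 'worst', 'love', 'phone', 'poor'])
toy_coef = np.array([2.1, -3.0, 1.7, 0.1, -1.5])

# argsort returns the indices that sort the coefficients in ascending order
order = toy_coef.argsort()

# The first k entries are the most negative coefficients
most_negative = toy_names[order[:2]]
# [:-3:-1] walks backwards from the end and stops after 2 entries,
# giving the largest coefficients in descending order
most_positive = toy_names[order[:-3:-1]]

print(most_negative)
print(most_positive)
```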


### 2. Naive Bayes Classifier

In [37]:
from sklearn.naive_bayes import MultinomialNB
model_nb=MultinomialNB()
model_nb.fit(X_train,y_train)
y_pred_nb=model_nb.predict(X_test)
print('accuracy for Naive Bayes Classifier :',accuracy_score(y_test,y_pred_nb))
print('confusion matrix for Naive Bayes Classifier:\n',confusion_matrix(y_test,y_pred_nb))
print('F1 score for Naive Bayes Classifier :',f1_score(y_test,y_pred_nb))
print('Precision score for Naive Bayes Classifier :',precision_score(y_test,y_pred_nb))
print('recall score for Naive Bayes Classifier :',recall_score(y_test,y_pred_nb))
print('AUC: ', roc_auc_score(y_test, y_pred_nb))

accuracy for Naive Bayes Classifier : 0.936848029528
confusion matrix for Naive Bayes Classifier:
 [[16687  2778]
 [ 2047 54891]]
F1 score for Naive Bayes Classifier : 0.957899604736
Precision score for Naive Bayes Classifier : 0.95182853873
recall score for Naive Bayes Classifier : 0.964048614282
AUC:  0.910665457925
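
The AUC values in this notebook are computed from hard 0/1 predictions. A threshold-free AUC would instead use the predicted probability of the positive class. The sketch below contrasts the two on synthetic count data; the Poisson rates and the names `X_toy`, `y_toy`, and `toy_nb` are assumptions made for illustration, and the model is scored on its own training data only to keep the example short.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

rng = np.random.default_rng(0)

# Hypothetical count features: each class favours a different column
y_toy = np.array([0] * 100 + [1] * 100)
rates = np.where(y_toy[:, None] == 1, [3.0, 1.0, 1.0], [1.0, 3.0, 1.0])
X_toy = rng.poisson(rates)

toy_nb = MultinomialNB().fit(X_toy, y_toy)
hard_pred = toy_nb.predict(X_toy)
pos_proba = toy_nb.predict_proba(X_toy)[:, 1]  # probability of class 1

auc_hard = roc_auc_score(y_toy, hard_pred)     # same as balanced accuracy
auc_proba = roc_auc_score(y_toy, pos_proba)    # ranks all examples by score

print('AUC from labels       :', auc_hard)
print('AUC from probabilities:', auc_proba)
```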


In [38]:
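# get the feature names as numpy array
# X is X2 (TF-IDF) here, so the names are taken from tfidf (use cv if X=X1);
# on scikit-learn >= 1.0 call get_feature_names_out() instead
feature_names = np.array(tfidf.get_feature_names())

# Sort the per-feature log probabilities of the positive class
# (these are the values that coef_[0] exposed in older scikit-learn releases)
sorted_coef_index = model_nb.feature_log_prob_[1].argsort()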

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst purchase' 'never recommend' 'total waste' 'nothing work' 'is scam'
 'royalty' 'reported stolen' 'zero star' 'very dissapointed'
 'started freezing']

Largest Coefs: 
['good' 'great' 'phone' 'it' 'excellent' 'the' 'and' 'love' 'very' 'is']
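
The largest values here are mostly frequent words such as "the" and "and", because a log probability within one class mainly reflects how common a word is. One possible alternative, shown as a sketch on four made-up reviews, is to rank features by the difference between the two classes' log probabilities. The names `toy_docs`, `toy_vec`, and `toy_nb` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_docs = ["great phone", "love phone", "bad phone", "poor phone"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)
toy_nb = MultinomialNB().fit(toy_X, toy_labels)

# Column order is alphabetical, so the sorted vocabulary gives the names
toy_names = np.array(sorted(toy_vec.vocabulary_))

# Log probability of each word within the positive class only
pos_log_prob = toy_nb.feature_log_prob_[1]
# Log ratio: how much more likely a word is in positive than negative reviews
log_ratio = toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0]

top_by_prob = toy_names[pos_log_prob.argmax()]
top_by_ratio = toy_names[log_ratio.argmax()]
print('most probable in positive class:', top_by_prob)
print('most indicative of positive    :', top_by_ratio)
```

In this toy case "phone" is the most probable word in the positive class but carries no signal, since it is equally common in negative reviews.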


### 3. Random Forest

In [39]:
from sklearn.ensemble import RandomForestClassifier

In [40]:
%%time
model_rf=RandomForestClassifier()
model_rf.fit(X_train,y_train)
y_pred_rf=model_rf.predict(X_test)
print('accuracy for Random Forest Classifier :',accuracy_score(y_test,y_pred_rf))
print('confusion matrix for Random Forest Classifier:\n',confusion_matrix(y_test,y_pred_rf))

accuracy for Random Forest Classifier : 0.969320576417
confusion matrix for Random Forest Classifier:
 [[18233  1232]
 [ 1112 55826]]
Wall time: 2min 30s


In [41]:
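# get the feature names as numpy array
# X is X2 (TF-IDF) here, so the names are taken from tfidf (use cv if X=X1);
# on scikit-learn >= 1.0 call get_feature_names_out() instead
feature_names = np.array(tfidf.get_feature_names())

# Sort the feature importances from the model
# (importances are non-negative and carry no sign, unlike regression coefficients)
sorted_coef_index = model_rf.feature_importances_.argsort()

# Find the 10 least and 10 most important features
# The 10 largest importances are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest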
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['scanning' 'read lot' 'read about' 'black color' 'reaction' 're not'
 'blackberry and' 'rd party' 'raw' 'rapidly']

Largest Coefs: 
['not' 'great' 'good' 'after' 'disappointed' 'not work' 'bad' 'is not'
 'back' 'love it']
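
Random forest importances are non-negative and sum to one, so they rank features by how much they are used for splitting but do not say whether a word signals positive or negative sentiment. A minimal sketch on synthetic data follows; the dataset from `make_classification` and the settings `n_estimators=50` and `random_state=0` are illustrative assumptions, not the configuration used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem with 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Fixing random_state makes the fitted forest reproducible
toy_rf = RandomForestClassifier(n_estimators=50, random_state=0)
toy_rf.fit(X_toy, y_toy)

importances = toy_rf.feature_importances_
ranking = importances.argsort()[::-1]  # most important feature first

print('importances sum to:', importances.sum())
print('feature ranking   :', ranking)
```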
