### Amazon Reviews: Sentiment Analysis

Use one of the following datasets to perform sentiment analysis on Amazon reviews. Pick one of the "small" datasets that is a reasonable size for your computer. The goal is to build a model that predicts whether a review is positive or negative based only on its text. Then compare how reviews behave across categories: does a classification model trained on one category work for another?
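
One way to approach the cross-category question is to fit a model on one category and score it on another. The sketch below is only an illustration: the two tiny review lists, the name `cross_category_score`, and the choice of balanced accuracy as the metric are all assumptions, not part of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import balanced_accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for two product categories.
video_texts = [
    "loved this series, great acting",
    "boring show, fell asleep",
    "excellent writing and a great cast",
    "terrible plot, a waste of time",
]
video_labels = ["good", "bad", "good", "bad"]

book_texts = [
    "a great read with excellent characters",
    "boring story and terrible pacing",
]
book_labels = ["good", "bad"]


def cross_category_score(train_texts, train_labels, test_texts, test_labels):
    """Fit on one category and return balanced accuracy on another."""
    model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    return balanced_accuracy_score(test_labels, predictions)


score = cross_category_score(video_texts, video_labels, book_texts, book_labels)
print(f"video -> books balanced accuracy: {score:.2f}")
```

With real data, each list would come from a separate category file, and swapping the train and test arguments gives the score in the other direction.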

In [140]:

# Import modules used below, plus a few that may be useful later
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import random
import re

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

#stemmers
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

import json
import gzip

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [141]:
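import ast

def parse(path):
    # Each line of the gzipped file holds one review as a dict literal.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            # literal_eval accepts only Python literals; eval would run arbitrary code
            yield ast.literal_eval(line)

def getDF(path):
    # Key each review by its row number, then build the frame row-wise.
    df = dict(enumerate(parse(path)))
    return pd.DataFrame.from_dict(df, orient='index')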

df = getDF('C:/Users/genta/Desktop/Thinkful/SupervisedLearning/Datasets/reviews_Amazon_Instant_Video_5.json.gz')
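
If the review file is strict JSON with one object per line (an assumption that is not verified here), pandas can read it directly. The sketch below writes a small hypothetical sample file first so that it runs without the real dataset; `sample_rows` and `sample_path` are made-up names.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Build a tiny gzipped JSON-lines file so the sketch runs without the real data.
sample_rows = [
    {"reviewerID": "R1", "overall": 5.0, "reviewText": "great show"},
    {"reviewerID": "R2", "overall": 1.0, "reviewText": "boring"},
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    for row in sample_rows:
        fh.write(json.dumps(row) + "\n")

# lines=True reads one JSON object per line; gzip is inferred from the ".gz" suffix.
sample_df = pd.read_json(sample_path, lines=True)
print(sample_df.shape)
```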

In [142]:
df.head()

|   | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A11N155CW1UV02 | B000H00VBQ | AdrianaM | [0, 0] | I had big expectations because I love English ... | 2.0 | A little bit boring for me | 1399075200 | 05 3, 2014 |
| 1 | A3BC8O2KCL29V2 | B000H00VBQ | Carol T | [0, 0] | I highly recommend this series. It is a must f... | 5.0 | Excellent Grown Up TV | 1346630400 | 09 3, 2012 |
| 2 | A60D5HQFOTSOM | B000H00VBQ | Daniel Cooper "dancoopermedia" | [0, 1] | This one is a real snoozer. Don't believe anyt... | 1.0 | Way too boring for me | 1381881600 | 10 16, 2013 |
| 3 | A1RJPIGRSNX4PW | B000H00VBQ | J. Kaplan "JJ" | [0, 0] | Mysteries are interesting. The tension betwee... | 4.0 | Robson Green is mesmerizing | 1383091200 | 10 30, 2013 |
| 4 | A16XRPF40679KG | B000H00VBQ | Michael Dobey | [1, 1] | This show always is excellent, as far as briti... | 5.0 | Robson green and great writing | 1234310400 | 02 11, 2009 |


In [143]:
df.columns

Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime'],
      dtype='object')

In [144]:
# Convert reviewTime strings such as "05 3, 2014" to datetime64
df["reviewTime"] = pd.to_datetime(df["reviewTime"])

In [145]:
df.tail()

|   | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 37121 | A1ELO9LMSE1CQ7 | B00LPWPMCS | Mpr90 | [0, 0] | I love the books! The show is amazing so far. ... | 5.0 | Great Series! | 1405728000 | 2014-07-19 |
| 37122 | AGOEKVIJV9UX6 | B00LPWPMCS | Mr. Markster | [13, 15] | "The Strain" has potential to be an excellent ... | 5.0 | Forget the Vampire Diaries -- This is a REAL V... | 1405296000 | 2014-07-14 |
| 37123 | A3I291BE0RNZCU | B00LPWPMCS | Rating My Best Pick | [0, 2] | I'm not real sure on how, I should rate this s... | 3.0 | It's only the first episode so I'm not real su... | 1405296000 | 2014-07-14 |
| 37124 | A1MNITZRYU71IO | B00LPWPMCS | Sherry "trying in ohio" | [1, 1] | episode one so far makes me want to watch more... | 4.0 | and that is good. The accents are a bit much h... | 1405296000 | 2014-07-14 |
| 37125 | A1XMHK9HN5MW2H | B00LPWPMCS | Victoria J. Dennison | [3, 4] | I watched the pilot. I guess I've just seen t... | 3.0 | I may have paid towatch the pilot | 1405468800 | 2014-07-16 |


In [146]:
df.dtypes

reviewerID                object
asin                      object
reviewerName              object
helpful                   object
reviewText                object
overall                  float64
summary                   object
unixReviewTime             int64
reviewTime        datetime64[ns]
dtype: object

In [147]:
df.overall.unique()

array([2., 5., 1., 4., 3.])

In [148]:
df.overall.describe()

count    37126.00000
mean         4.20953
std          1.11855
min          1.00000
25%          4.00000
50%          5.00000
75%          5.00000
max          5.00000
Name: overall, dtype: float64

In [149]:
df.isnull().values.any()

True

In [150]:
df.isnull().sum()

reviewerID          0
asin                0
reviewerName      329
helpful             0
reviewText          0
overall             0
summary             0
unixReviewTime      0
reviewTime          0
dtype: int64

In [151]:
df['reviewText'][:5]

0    I had big expectations because I love English ...
1    I highly recommend this series. It is a must f...
2    This one is a real snoozer. Don't believe anyt...
3    Mysteries are interesting.  The tension betwee...
4    This show always is excellent, as far as briti...
Name: reviewText, dtype: object

In [152]:
df.groupby('overall').describe()

Summary statistics of unixReviewTime, grouped by overall:

| overall | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 1718.0 | 1377980000.0 | 26875750.0 | 1185322000.0 | 1366848000.0 | 1386936000.0 | 1394237000.0 | 1405987000.0 |
| 2.0 | 1885.0 | 1376357000.0 | 29910260.0 | 1180915000.0 | 1366675000.0 | 1384906000.0 | 1393718000.0 | 1406074000.0 |
| 3.0 | 4187.0 | 1375855000.0 | 31292610.0 | 1172016000.0 | 1367021000.0 | 1384646000.0 | 1394064000.0 | 1406074000.0 |
| 4.0 | 8446.0 | 1375656000.0 | 32278820.0 | 1098403000.0 | 1367280000.0 | 1384906000.0 | 1394064000.0 | 1406074000.0 |
| 5.0 | 20890.0 | 1377385000.0 | 29999480.0 | 975456000.0 | 1368576000.0 | 1384992000.0 | 1394323000.0 | 1406074000.0 |


In [153]:
# Binary target: ratings of 3 stars or more are labelled 'good', the rest 'bad'
result = np.where(df['overall'] >= 3.0, 'good', 'bad')
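
Before modelling, it helps to check how balanced the two labels are. The sketch below rebuilds the rating column from the per-star counts in the `groupby` output above and applies the same `>= 3.0` rule; `star_counts`, `ratings`, and `labels` are illustrative names, not part of the notebook.

```python
import numpy as np

# Review counts per star rating, taken from the groupby output above.
star_counts = {1.0: 1718, 2.0: 1885, 3.0: 4187, 4.0: 8446, 5.0: 20890}

# Rebuild the ratings column and apply the same >= 3.0 rule as the cell above.
ratings = np.repeat(list(star_counts.keys()), list(star_counts.values()))
labels = np.where(ratings >= 3.0, 'good', 'bad')

# Count each label and express it as a share of all reviews.
classes, counts = np.unique(labels, return_counts=True)
for name, count in zip(classes, counts):
    print(f"{name}: {count} reviews ({count / counts.sum():.1%})")
```

Roughly nine in ten reviews end up labelled 'good' under this rule, so a classifier can score about 0.90 accuracy without learning anything about the 'bad' class.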

In [154]:
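# TF-IDF vectorizer with NLTK's English stop words removed
# (the stop word corpus needs a one-time nltk.download('stopwords'))
stop_word_list = stopwords.words('english')  # a list; newer scikit-learn releases do not accept a set here
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stop_word_list)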

In [155]:
# Lowercase the text (redundant with lowercase=True in the vectorizer, but harmless)
df['reviewText'] = df['reviewText'].str.lower()

In [156]:
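# Convert the review text to TF-IDF features.
# Note: fitting on the full corpus lets the IDF weights see reviews that later land in the test split.
reviewed = vectorizer.fit_transform(df.reviewText)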

In [157]:
# Train/test split (70/30)
reviewed_train, reviewed_test, result_train, result_test = train_test_split(reviewed, result, test_size=0.3, random_state=50)
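
An alternative ordering, sketched here on a small hypothetical corpus, splits the raw text first and wraps the vectorizer and classifier in a scikit-learn pipeline, so the TF-IDF vocabulary and weights are learned from the training reviews only. The toy `texts` and `labels` and the `stratify` argument are assumptions added for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df.reviewText and the good/bad labels.
texts = [
    "great show, loved it", "excellent acting", "wonderful series",
    "really enjoyed this", "fantastic writing", "boring and slow",
    "terrible plot", "waste of time", "awful acting", "fell asleep watching",
]
labels = ["good"] * 5 + ["bad"] * 5

# Split the raw text first; stratify keeps the class ratio similar in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=50, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, strip_accents='ascii'),
    MultinomialNB(),
)
model.fit(text_train, y_train)
pred = model.predict(text_test)
print(len(text_train), len(text_test), list(pred))
```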

In [158]:
# Train a multinomial naive Bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(reviewed_train, result_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [159]:
pred=clf.predict(reviewed_test)

In [160]:
from sklearn.metrics import accuracy_score
print(f"ACCURACY : {accuracy_score(result_test, pred)}")

ACCURACY : 0.9023163943257317


In [161]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(result_test, pred))
print('\n')
print(classification_report(result_test, pred))

[[    0  1087]
 [    1 10050]]


              precision    recall  f1-score   support

         bad       0.00      0.00      0.00      1087
        good       0.90      1.00      0.95     10051

   micro avg       0.90      0.90      0.90     11138
   macro avg       0.45      0.50      0.47     11138
weighted avg       0.81      0.90      0.86     11138
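
The confusion matrix shows that the classifier labels all but one test review as 'good'. The short sketch below recomputes a few figures from that matrix to make the effect explicit; the variable names are illustrative, and the row and column order is assumed to be `['bad', 'good']`, matching the classification report.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions,
# both in the order ['bad', 'good'].
cm = np.array([[0, 1087],
               [1, 10050]])

accuracy = np.trace(cm) / cm.sum()                # share of correct predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)   # recall for 'bad', then 'good'
balanced_accuracy = recall_per_class.mean()       # average of the two recalls
majority_baseline = cm.sum(axis=1).max() / cm.sum()  # always predicting 'good'

print(f"accuracy:          {accuracy:.4f}")
print(f"recall (bad/good): {recall_per_class}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

The accuracy of about 0.9023 sits just below the 0.9024 obtained by always answering 'good', and recall on 'bad' is zero, so accuracy alone overstates how well this model separates the two classes.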

