# Amazon Review Analysis

#### Formate Data Access Class

In [23]:
import random

REVIEW_DICT = {
    "Negative": "NEGATIVE",
    "Neutral": "NEUTRAL",
    "Positive": "POSITIVE"
}

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score 
        self.sentiment = self.get_sentiment() 
        
    def get_sentiment(self):
        if self.score <= 2:
            return REVIEW_DICT["Negative"]
        elif self.score == 3:
            return REVIEW_DICT["Neutral"]
        else:
            return REVIEW_DICT["Positive"] 
        

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == REVIEW_DICT["Negative"], self.reviews))
        positive = list(filter(lambda x: x.sentiment == REVIEW_DICT["Positive"], self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
        

#### Load Data

In [26]:
import json 

#### set path for data  
file_name = "./Amazon_review_data/reviews_10000.json"


reviews  = []
with open(file_name) as file:
    for line in file:
        review = json.loads(line)
        reviews.append(Review(review["reviewText"], review["overall"]))
        
reviews[0].text

"I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with."

#### Prepare Data

In [27]:
from sklearn.model_selection import train_test_split 

In [28]:
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)

train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(REVIEW_DICT["Negative"]))
print(train_y.count(REVIEW_DICT["Positive"]))


436
436


#### Bag of words vectorization

In [29]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [30]:
vectorizer = TfidfVectorizer()

train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())

I agree with the other 2 and 1 star reviewers.  The story was rather implausible but even beyond that, something about the writing made this book incredibly boring.  I enjoyed the third book in the series as a light slightly memorable read.
[[0. 0. 0. ... 0. 0. 0.]]


## Classification

#### Linear SVM

from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x[0]

clf_svm.predict(test_x_vectors[0])

#### Decision Tree

In [37]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()

clf_dec.fit(train_x_vectors, train_y)
clf_dec.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Naive Bayes

In [38]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(train_x_vectors, train_y)

clf_gnb.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Logistic Regression

In [39]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Evaluation

In [40]:
#Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6730769230769231
0.6274038461538461
0.8052884615384616


In [44]:
# F1 Scores
from sklearn.metrics import f1_score
f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[REVIEW_DICT["Positive"], REVIEW_DICT["Negative"]])

array([0.80582524, 0.80952381])

In [45]:
test_set = ['very fun', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

#### Tuning our model (with Grid Search)

In [46]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})

In [47]:
print(clf.score(test_x_vectors, test_y))

0.8100961538461539


#### Saving Model

In [48]:
import pickle

with open('./Amazon_review_data/models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

#### Load model

In [51]:
with open('./Amazon_review_data/models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

print(test_x[0])
loaded_clf.predict(test_x_vectors[0])

This book is really cute and my daughter loves it.  The pictures are gorgeous and the text is super funny, even for adults.  Makes for very enjoyable repeat reading.  She also loves the smoochy last part, so sometimes we just reread the last quarter over and over so we can play a smooching game.


array(['POSITIVE'], dtype='<U8')