<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-Libraries" data-toc-modified-id="Importing-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing Libraries</a></span></li><li><span><a href="#Loading-Data" data-toc-modified-id="Loading-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Loading Data</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Bag-of-words-vectorization" data-toc-modified-id="Bag-of-words-vectorization-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Bag of words vectorization</a></span></li><li><span><a href="#Different-Models" data-toc-modified-id="Different-Models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Different Models</a></span><ul class="toc-item"><li><span><a href="#Classification" data-toc-modified-id="Classification-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Classification</a></span></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Evaluation</a></span></li><li><span><a href="#Improving-model" data-toc-modified-id="Improving-model-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Improving model</a></span></li><li><span><a href="#Changing-Vectorization-to-TfidfVectorizer" data-toc-modified-id="Changing-Vectorization-to-TfidfVectorizer-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Changing Vectorization to TfidfVectorizer</a></span></li><li><span><a href="#Tuning-model-with-Grid-Search" data-toc-modified-id="Tuning-model-with-Grid-Search-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Tuning model with Grid Search</a></span></li><li><span><a href="#saving-model" data-toc-modified-id="saving-model-5.6"><span class="toc-item-num">5.6&nbsp;&nbsp;</span>saving model</a></span></li><li><span><a href="#loading-model" data-toc-modified-id="loading-model-5.7"><span class="toc-item-num">5.7&nbsp;&nbsp;</span>loading model</a></span></li></ul></li></ul></div>

# Importing Libraries

In [24]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot, not the top-level package
import random

# Loading Data

In [2]:
# test code
file_name = 'Datafiles/booksreview.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append((review['reviewText'], review['overall']))
print(reviews[0][0])
print(reviews[0][1])

Da Silva takes the divine by storm with this unique new novel.  She develops a world unlike any others while keeping it firmly in the real world.  This is a very well written and entertaining novel.  I was quite impressed and intrigued by the way that this solid storyline was developed, bringing the readers right into the world of the story.  I was engaged throughout and definitely enjoyed my time spent reading it.I loved the character development in this novel.  Da Silva creates a cast of high school students who actually act like high school students.  I really appreciated the fact that none of them were thrown into situations far beyond their years, nor did they deal with events as if they had decades of life experience under their belts.  It was very refreshing and added to the realism and impact of the novel.  The friendships between the characters in this novel were also truly touching.Overall, this novel was fantastic.  I can&#8217;t wait to read more and to find out what happen

In [3]:
# main code
# creating class to load data
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE

In [4]:
# main code
file_name = 'Datafiles/booksreview.json'
Reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        Reviews.append(Review(review['reviewText'], review['overall']))
        
print(Reviews[5].text)
print(Reviews[5].score)
print(Reviews[5].sentiment)

Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!
4.0
POSITIVE


In [5]:
len(Reviews)

1000

# Data Preparation

In [6]:
from sklearn.model_selection import train_test_split
training, test = train_test_split(Reviews, test_size=0.33, random_state=42)

In [7]:
print(len(training))
print(len(test))
print(training[0].text)

670
330
Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.


In [8]:
# test code
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(train_x[0])
print(train_y[0])
print(test_x[0])
print(test_y[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
POSITIVE
Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them.
POSITIVE


# Bag of words vectorization

In [9]:
# Convert text to numerical vectors with a bag-of-words (BoW) model
# https://scikit-learn.org/stable/modules/feature_extraction.html
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # learn the vocabulary from the training text, then transform it
test_x_vectors = vectorizer.transform(test_x)        # transform only: the test set must not influence the vocabulary

print(train_x[0])
print(train_x_vectors[0].toarray())
print(train_x_vectors[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
[[0 0 0 ... 0 0 0]]
  (0, 7086)	1
  (0, 1148)	1
  (0, 350)	2
  (0, 1800)	1
  (0, 6595)	1
  (0, 562)	1
  (0, 3054)	1
  (0, 1558)	1
  (0, 6475)	1
  (0, 6593)	1
  (0, 2895)	1
  (0, 7353)	1
  (0, 539)	1
  (0, 1515)	1
  (0, 5197)	1
  (0, 3545)	1
  (0, 2007)	1


# Different Models

## Classification

In [10]:
# Linear SVM
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
# predicting on test set
print(test_x[0])
print(test_y[0])
clf_svm.predict(test_x_vectors[0])

Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them.
POSITIVE


array(['POSITIVE'], dtype='<U8')

In [11]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
# predicting on test set
print(test_y[0])
clf_dec.predict(test_x_vectors[0])

POSITIVE


array(['POSITIVE'], dtype='<U8')

In [12]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
# GaussianNB requires dense input, but train_x_vectors is a sparse matrix,
# so convert it to a dense NumPy array with .toarray()
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
# predicting on test set
print(test_y[0])
clf_gnb.predict(test_x_vectors[0].toarray())

POSITIVE


array(['POSITIVE'], dtype='<U8')

In [13]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression(solver='lbfgs', max_iter=200)
clf_log.fit(train_x_vectors, train_y)
# predicting on test set
print(test_y[0])
clf_log.predict(test_x_vectors[0])

POSITIVE


array(['POSITIVE'], dtype='<U8')

## Evaluation

In [14]:
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8242424242424242
0.7575757575757576
0.8121212121212121
0.8303030303030303


In [15]:
# F1 Scores
from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.91319444 0.21052632 0.22222222]
[0.86514886 0.1        0.        ]
[0.89678511 0.08510638 0.09090909]
[0.91370558 0.12244898 0.1       ]
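The gap between the high accuracy and the low NEUTRAL and NEGATIVE F1 scores is easier to see with a hand computation. The sketch below is a plain-Python version of the per-class F1 formula (harmonic mean of precision and recall); the helper name `per_class_f1` and the label lists are hypothetical and chosen only to illustrate the effect.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels, invented for illustration
y_true = ["POSITIVE"] * 8 + ["NEGATIVE"] * 2
y_pred = ["POSITIVE"] * 10  # a model that always answers POSITIVE

f1_positive = per_class_f1(y_true, y_pred, "POSITIVE")
f1_negative = per_class_f1(y_true, y_pred, "NEGATIVE")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(accuracy, 2), round(f1_positive, 3), round(f1_negative, 3))
```

In this toy case a model that never predicts NEGATIVE still reaches 0.8 accuracy and an F1 of about 0.889 for POSITIVE, while its NEGATIVE F1 is 0. That is the same pattern as the scores printed above, only more extreme.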


In [21]:
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEUTRAL))
print(train_y.count(Sentiment.NEGATIVE))

552
71
47
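These counts give the accuracy figures some context. As a rough check, the sketch below computes the share of the largest class in the training set from the three numbers printed above. Treating this share as a baseline for the test accuracy assumes the test split has a similar class mix, which the notebook does not verify.

```python
# Class counts printed by the notebook for the 670 training reviews
train_counts = {"POSITIVE": 552, "NEUTRAL": 71, "NEGATIVE": 47}

total = sum(train_counts.values())
majority_label = max(train_counts, key=train_counts.get)
majority_share = train_counts[majority_label] / total

print(majority_label, round(majority_share, 4))  # POSITIVE 0.8239
```

Under that assumption, a classifier that always answers POSITIVE would score roughly 0.82, which is about where the SVM and logistic regression accuracies land. The per-class F1 scores are therefore the more informative metric here.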


## Improving model 
(balancing positive and negative examples and loading more data)

In [25]:
# main code
# creating class to load data
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    # Keep all negative reviews and an equal number of positive reviews (neutral reviews are dropped)
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]  # keep only as many positive reviews as there are negative ones
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
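`evenly_distribute` keeps the first `len(negative)` positive reviews and shuffles with the global `random` state, so the result can differ between runs. One possible variation, sketched below, samples from both classes with a seeded local generator. The function `balance_two_classes`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import random

def balance_two_classes(items, get_label, label_a, label_b, seed=42):
    """Return a shuffled list with equally many items of label_a and label_b."""
    rng = random.Random(seed)  # local generator, so the result is repeatable
    group_a = [x for x in items if get_label(x) == label_a]
    group_b = [x for x in items if get_label(x) == label_b]
    n = min(len(group_a), len(group_b))
    balanced = rng.sample(group_a, n) + rng.sample(group_b, n)
    rng.shuffle(balanced)
    return balanced

# Hypothetical (text, label) pairs, invented for illustration
toy = [("t%d" % i, "POSITIVE") for i in range(10)]
toy += [("n0", "NEGATIVE"), ("n1", "NEGATIVE")]

balanced = balance_two_classes(toy, lambda pair: pair[1], "POSITIVE", "NEGATIVE")
labels = [label for _, label in balanced]
print(labels.count("POSITIVE"), labels.count("NEGATIVE"))  # 2 2
```

Sampling at random avoids any ordering effect in the source file, and the fixed seed keeps the train and test sets stable across reruns.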

In [26]:
# Load the larger 10,000-review file; all later cells work with this data
# main code
file_name = 'Datafiles/Books_small_10000.json'
Reviews2 = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        Reviews2.append(Review(review['reviewText'], review['overall']))
        
print(Reviews2[5].text)
print(Reviews2[5].score)
print(Reviews2[5].sentiment)

I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.
5.0
POSITIVE


In [27]:
print(len(Reviews2))

10000


In [28]:
# Data Preparation
training, test = train_test_split(Reviews2, test_size=0.33, random_state=42)
print(len(training))
print(len(test))

6700
3300


In [29]:
train_container = ReviewContainer(training)
test_container = ReviewContainer(test)
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


In [31]:
# Bag of words vectorization
vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # learn the vocabulary from the training text, then transform it
test_x_vectors = vectorizer.transform(test_x)        # transform only: the test set must not influence the vocabulary

print(train_x[0])
print(train_x_vectors[0].toarray())

I definitely enjoyed this story. The characters were all multi-dimensional, interesting, and relatable. Abby did a wonderful job conveying emotion and passion shared by the two lead characters. Although the book could use another trip through the editing process it was pretty well written.  Can't wait for book two!  I really hope Cash gets to hit Gavin at some point because that guy is a major DB.
[[0 0 0 ... 0 0 0]]


In [32]:
# Different Models
# Classification
# Linear SVM
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
# Decision Tree
clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
# Naive Bayes
# GaussianNB requires dense input, but train_x_vectors is a sparse matrix,
# so convert it to a dense NumPy array with .toarray()
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
# Logistic Regression
clf_log = LogisticRegression(solver='lbfgs', max_iter=200)
clf_log.fit(train_x_vectors, train_y)

In [33]:
# Evaluation
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.7980769230769231
0.6490384615384616
0.6346153846153846
0.8149038461538461


In [36]:
# F1 Scores
from sklearn.metrics import f1_score
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

[0.8028169  0.79310345]
[0.6507177 0.647343 ]
[0.59574468 0.66666667]
[0.82051282 0.808933  ]


In [37]:
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


In [39]:
print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))

208
208


In [42]:
# Testing the model on a few hand-written examples
# Alternative set: test_set = ["very fun", "bad book do not buy", "horrible waste of time"]
test_set = ["Not very good","best book to buy",'horrible waste of time']
new_test = vectorizer.transform(test_set)
clf_svm.predict(new_test)

array(['NEGATIVE', 'POSITIVE', 'NEGATIVE'], dtype='<U8')

## Changing Vectorization to TfidfVectorizer

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF vectorization (replaces the raw counts of CountVectorizer)
vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # learn the vocabulary and IDF weights from the training text, then transform it
test_x_vectors = vectorizer.transform(test_x)        # transform only: the test set must not influence the vocabulary or weights
# Different Models
# Classification
# Linear SVM
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
# Decision Tree
clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
# Naive Bayes
# GaussianNB requires dense input, but train_x_vectors is a sparse matrix,
# so convert it to a dense NumPy array with .toarray()
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
# Logistic Regression
clf_log = LogisticRegression(solver='lbfgs', max_iter=200)
clf_log.fit(train_x_vectors, train_y)

In [44]:
# Evaluation
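# Mean accuracy: slightly higher than with CountVectorizer for the SVM, decision tree,
# and Naive Bayes; logistic regression is slightly lower (0.805 vs 0.815)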
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6634615384615384
0.6610576923076923
0.8052884615384616


In [45]:
# F1 Scores
from sklearn.metrics import f1_score
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

[0.80582524 0.80952381]
[0.66019417 0.66666667]
[0.65693431 0.66508314]
[0.80291971 0.80760095]


## Tuning model with Grid Search

In [48]:
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)  # cv=5: score each parameter combination with 5-fold cross-validation
clf.fit(train_x_vectors, train_y)
# print best parameter after tuning 
print(clf.best_params_) 

{'C': 4, 'kernel': 'rbf'}


In [49]:
print(clf.score(test_x_vectors, test_y))

0.8197115384615384
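`best_params_` reports only the winning combination. To see how close the other candidates were, the cross-validated scores in `cv_results_` can be listed as well. The sketch below does this on synthetic data from `make_classification` with a reduced grid; both are stand-ins chosen for illustration, not the review vectors or the grid used above.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 samples, 10 features, 2 classes
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"kernel": ("linear", "rbf"), "C": (1, 4)}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)

# One entry per parameter combination, scored by mean cross-validated accuracy
results = list(zip(search.cv_results_["params"],
                   search.cv_results_["mean_test_score"]))
for params, mean_score in results:
    print(params, round(mean_score, 3))

print("best:", search.best_params_, round(search.best_score_, 3))
```

If the top few combinations score almost the same, the choice between them is probably within the noise of the cross-validation folds.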


## saving model

In [50]:
import pickle
# Note: this saves only the tuned classifier; the fitted vectorizer is not included
with open('Models/amazonbookreview.pkl', 'wb') as f:
    pickle.dump(clf, f)

## loading model

In [51]:
with open('Models/amazonbookreview.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [52]:
print(test_x[0])
loaded_clf.predict(test_x_vectors[0])

I have really liked all of Emily Giffin's books. Her character development, writing style and themes have always been enjoyable and her books hard to put down. However, The One and Only is just awful. First, I understand that football is a character in the book but I feel she relies on it too much in lieu of developing her characters on a deeper level. I got bogged down by the endless amount of football statistics and discussion. The characters just didn't grab me and I found them difficult to relate with. Second, I was really hoping Shea wasn't going to develop a romantic relationship with the coach but it went there. If it weren't for the fact that they were like &#34;family&#34; and he was a father figure to her and messages of that nature were heavily emphasized early on, it would not have been bothersome. But the whole thing just felt very inscestual and VERY, VERY creepy. The main character has deep psychological issues regarding her relationships with men. I felt myself analyzin

array(['NEGATIVE'], dtype='<U8')
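The pickled object here is only the classifier, so predicting on new raw text after loading still needs the fitted `vectorizer` from the notebook session. One possible way around this, sketched below, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and pickle that instead. The training texts, step names, and variable names are hypothetical, and the sketch serializes in memory rather than writing to `Models/`.

```python
import pickle

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training texts, invented for illustration
texts = ["great story loved it", "wonderful characters",
         "awful plot", "boring and bad"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", svm.SVC(kernel="linear")),
])
text_clf.fit(texts, labels)

# Serialize and restore in memory; pickle.dump/load with a file works the same way
blob = pickle.dumps(text_clf)
restored = pickle.loads(blob)

new_texts = ["loved the characters", "bad boring plot"]
original_pred = list(text_clf.predict(new_texts))
restored_pred = list(restored.predict(new_texts))
print(original_pred, restored_pred)
```

The restored pipeline accepts raw strings directly, so no separate vectorizer has to be saved or kept in sync with the model.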