### Importing Libraries

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Loading Data

In [2]:
# test code
file_name = 'Datafiles/booksreview.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append((review['reviewText'], review['overall']))
print(reviews[0][0])
print(reviews[0][1])

Da Silva takes the divine by storm with this unique new novel.  She develops a world unlike any others while keeping it firmly in the real world.  This is a very well written and entertaining novel.  I was quite impressed and intrigued by the way that this solid storyline was developed, bringing the readers right into the world of the story.  I was engaged throughout and definitely enjoyed my time spent reading it.I loved the character development in this novel.  Da Silva creates a cast of high school students who actually act like high school students.  I really appreciated the fact that none of them were thrown into situations far beyond their years, nor did they deal with events as if they had decades of life experience under their belts.  It was very refreshing and added to the realism and impact of the novel.  The friendships between the characters in this novel were also truly touching.Overall, this novel was fantastic.  I can&#8217;t wait to read more and to find out what happen
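
The loading loop above treats the file as JSON Lines: one JSON object per line, each parsed separately with `json.loads`. A minimal sketch of the same pattern on two made-up records (the `reviewText` and `overall` keys come from the notebook; the sample values and the blank-line guard are assumptions added for illustration):

```python
import io
import json

# Two made-up records in the JSON Lines layout (one JSON object per line).
sample = io.StringIO(
    '{"reviewText": "Great read.", "overall": 5.0}\n'
    '\n'
    '{"reviewText": "Not for me.", "overall": 2.0}\n'
)

pairs = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of letting json.loads raise
        continue
    record = json.loads(line)
    pairs.append((record["reviewText"], record["overall"]))

print(pairs)
```

Calling `json.load` on the whole file would likely fail on this layout, because the file as a whole is not a single JSON document.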

In [3]:
# main code
# classes that hold a review and derive its sentiment label from the score
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # score of 4 or 5
            return Sentiment.POSITIVE

In [4]:
# main code
file_name = 'Datafiles/booksreview.json'
Reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        Reviews.append(Review(review['reviewText'], review['overall']))
        
print(Reviews[5].text)
print(Reviews[5].score)
print(Reviews[5].sentiment)

Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!
4.0
POSITIVE


In [5]:
len(Reviews)

1000

### Data Preparation

In [8]:
from sklearn.model_selection import train_test_split
training, test = train_test_split(Reviews, test_size=0.33, random_state=42)

In [10]:
print(len(training))
print(len(test))
print(training[0].text)

670
330
Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
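
The 670/330 sizes follow from `test_size=0.33` applied to 1000 reviews. A rough sketch of the idea behind a seeded shuffle-and-slice split is shown below. The helper `simple_split` is hypothetical and does not reproduce the exact ordering that `train_test_split` produces; it assumes the test count is rounded up.

```python
import math
import random


def simple_split(items, test_size=0.33, seed=42):
    """Shuffle a copy of items with a fixed seed, then slice off the test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # assumption: test count rounds up
    return shuffled[n_test:], shuffled[:n_test]


train_part, test_part = simple_split(range(1000))
print(len(train_part), len(test_part))  # 670 330
```

Fixing the seed makes the split repeatable, which is the role `random_state=42` plays in the cell above.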


In [14]:
# test code
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(train_x[0])
print(train_y[0])
print(test_x[0])
print(test_y[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
POSITIVE
Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them.
POSITIVE


### Bag of words vectorization

In [16]:
# convert text to numerical vectors using a bag-of-words (BoW) representation
# https://scikit-learn.org/stable/modules/feature_extraction.html
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # learn the vocabulary from the training text, then vectorize it
test_x_vectors = vectorizer.transform(test_x)  # transform only: reuse the training vocabulary, never fit on the test data

print(train_x[0])
print(train_x_vectors[0].toarray())
print(train_x_vectors[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
[[0 0 0 ... 0 0 0]]
  (0, 7086)	1
  (0, 1148)	1
  (0, 350)	2
  (0, 1800)	1
  (0, 6595)	1
  (0, 562)	1
  (0, 3054)	1
  (0, 1558)	1
  (0, 6475)	1
  (0, 6593)	1
  (0, 2895)	1
  (0, 7353)	1
  (0, 539)	1
  (0, 1515)	1
  (0, 5197)	1
  (0, 3545)	1
  (0, 2007)	1


## Different Models

### Classification

In [20]:
# Linear SVM
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
# predict the sentiment of the first test review
print(test_x[0])
print(test_y[0])
clf_svm.predict(test_x_vectors[0])

Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them.
POSITIVE


array(['POSITIVE'], dtype='<U8')

In [21]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
# predict the sentiment of the first test review
print(test_y[0])
clf_dec.predict(test_x_vectors[0])

POSITIVE


array(['POSITIVE'], dtype='<U8')

In [28]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
# train_x_vectors is a sparse matrix, but GaussianNB requires dense input,
# so convert it to a dense NumPy array with .toarray()
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)
# predict the sentiment of the first test review
print(test_y[0])
clf_gnb.predict(test_x_vectors[0].toarray())

POSITIVE


array(['POSITIVE'], dtype='<U8')

In [35]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression(solver='lbfgs', max_iter=200)
clf_log.fit(train_x_vectors, train_y)
# predict the sentiment of the first test review
print(test_y[0])
clf_log.predict(test_x_vectors[0])

POSITIVE


array(['POSITIVE'], dtype='<U8')

### Evaluation

In [37]:
# Mean accuracy on the test set (SVM, decision tree, naive Bayes, logistic regression)
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8242424242424242
0.7666666666666667
0.8121212121212121
0.8303030303030303
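
Accuracy alone can hide weak performance on rare classes, so it helps to compare these numbers with a trivial baseline. The sketch below uses made-up labels (not the notebook's `test_y`) to show how a model that always predicts the most frequent label can still score well on accuracy.

```python
from collections import Counter

# Hypothetical test labels for illustration; the real ones live in test_y.
y_true = ["POSITIVE"] * 8 + ["NEUTRAL", "NEGATIVE"]

# Baseline: always predict the most frequent label.
majority_label, _ = Counter(y_true).most_common(1)[0]
y_pred = [majority_label] * len(y_true)

# Accuracy is the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(majority_label, accuracy)  # POSITIVE 0.8
```

Running the same comparison on the real `test_y` would show how much of each classifier's accuracy goes beyond this baseline.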


In [46]:
# Per-class F1 scores, in the order POSITIVE, NEUTRAL, NEGATIVE
from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, 
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

[0.91319444 0.21052632 0.22222222]
[0.87260035 0.1        0.        ]
[0.89678511 0.08510638 0.09090909]
[0.91370558 0.12244898 0.1       ]
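
With `average=None`, each row above holds one F1 score per label rather than a single averaged number. A minimal sketch of that per-class calculation in plain Python is shown below. The helper `f1_per_class` and the labels are made up for illustration; it is not scikit-learn's code, and it reports 0.0 when a class has no true positives, false positives, or false negatives.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return one F1 score per label, in the order given by labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correct hits
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
        fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Made-up labels and predictions, for illustration only.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEUTRAL", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "NEUTRAL", "POSITIVE", "NEGATIVE", "POSITIVE"]

scores = f1_per_class(y_true, y_pred, ["POSITIVE", "NEUTRAL", "NEGATIVE"])
print(scores)
```

Read this way, the `0.` for NEGATIVE in the decision tree row means that classifier did not correctly identify any NEGATIVE review in the test set.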
