## Setup

Data Class

In [3]:
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # score of 4 or 5
            return Sentiment.POSITIVE

Import Data

In [17]:
import json

file = './Books_small.json'
reviews = []
with open(file) as f:
    for line in f:
        review = json.loads(line)  # each line is one JSON object, parsed into a dict
        reviews.append(Review(review['reviewText'], review['overall']))

# To test whether the import was successful:
# print(reviews[0].text, reviews[0].sentiment)
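The loop above relies on the file holding one JSON object per line. A minimal sketch of that parsing step on its own, using two made-up records in place of `Books_small.json` (the sample text and scores are assumptions for illustration, not data from the file):

```python
import io
import json

# Two invented records standing in for lines of Books_small.json.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines, which json.loads would reject
        continue
    record = json.loads(line)  # one dictionary per line
    records.append((record["reviewText"], record["overall"]))

print(records)
```

Only the two keys the notebook reads, `reviewText` and `overall`, are used here.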

## Preparing Data

In [20]:
# library to split test data and training data
from sklearn.model_selection import train_test_split

# split the data 80/20 into training and testing sets; random_state makes the split reproducible
training, testing = train_test_split(reviews, test_size=0.2, random_state=40)
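To make the roles of `test_size` and `random_state` concrete, here is a rough pure-Python sketch of a seeded shuffle-and-cut split. The helper `seeded_split` is hypothetical and is not how scikit-learn implements `train_test_split`; for the same seed it will select different rows.

```python
import random

def seeded_split(items, test_size=0.2, seed=40):
    """Shuffle a copy of items with a fixed seed, then cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed, same order every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

items = list(range(10))
train, test = seeded_split(items)
train_again, test_again = seeded_split(items)

print(len(train), len(test))
print(train == train_again and test == test_again)
```

Fixing the seed is what lets the accuracy figures later in the notebook be reproduced on a rerun.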

In [27]:
# review texts (features) and sentiment labels (targets) for the training set
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]

#train_x[0]

In [28]:
# review texts (features) and sentiment labels (targets) for the testing set
test_x = [x.text for x in testing]
test_y = [x.sentiment for x in testing]

#test_x[0]

#### Bag of words vectorization

In [29]:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# learn the vocabulary from the training text and transform it into count vectors in one call
train_x_vectors = vectorizer.fit_transform(train_x)

# the vectorizer is already fitted on the training text, so the test text is only transformed, not refitted
test_x_vectors = vectorizer.transform(test_x)
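The fit-then-transform idea can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not `CountVectorizer`'s implementation: it splits on whitespace only, and the helpers `fit_vocabulary` and `transform` as well as the sample sentences are assumptions made for the example.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Collect every distinct lowercase token and give it a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Turn each text into per-word counts; words outside the vocabulary are dropped."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        # dict order matches the column indices assigned in fit_vocabulary
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_texts = ["great book great story", "boring book"]
test_texts = ["great ending"]

vocabulary = fit_vocabulary(train_texts)           # learned from training text only
train_vectors = transform(train_texts, vocabulary)
test_vectors = transform(test_texts, vocabulary)   # reuses the same columns

print(vocabulary)
print(train_vectors)
print(test_vectors)
```

Because the vocabulary comes from the training text alone, the word "ending" has no column and contributes nothing to the test vector. This is why the test data is transformed with the already-fitted vectorizer rather than refitted.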

## Classification

Linear SVM

In [34]:
from sklearn.svm import SVC

clf_svc = SVC(kernel = 'linear')
clf_svc.fit(train_x_vectors, train_y)

print(test_x[3])
print(clf_svc.predict(test_x_vectors[3]))

Very interesting and I was born in 1937 in Nebraska.  We lived in the back of a grocery store and I remember the heat and wet sheets; cannot stand heat to this day.  It was all true, but rather drawn out and tediously redundant.  A great lesson for the later generations to preserve our resources; both land AND water.
['POSITIVE']


Logistic Regression

In [44]:
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression()
clf_lr.fit(train_x_vectors, train_y)

print(test_x[3])
print(clf_lr.predict(test_x_vectors[3]))

Very interesting and I was born in 1937 in Nebraska.  We lived in the back of a grocery store and I remember the heat and wet sheets; cannot stand heat to this day.  It was all true, but rather drawn out and tediously redundant.  A great lesson for the later generations to preserve our resources; both land AND water.
['POSITIVE']


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Decision Tree

In [40]:
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier()
clf_dt.fit(train_x_vectors, train_y)

print(test_x[3])
print(clf_dt.predict(test_x_vectors[3]))

Very interesting and I was born in 1937 in Nebraska.  We lived in the back of a grocery store and I remember the heat and wet sheets; cannot stand heat to this day.  It was all true, but rather drawn out and tediously redundant.  A great lesson for the later generations to preserve our resources; both land AND water.
['POSITIVE']


K Nearest Neighbour

In [41]:
from sklearn.neighbors import KNeighborsClassifier

clf_knn = KNeighborsClassifier(n_neighbors=3)
clf_knn.fit(train_x_vectors, train_y)

print(test_x[3])
print(clf_knn.predict(test_x_vectors[3]))

Very interesting and I was born in 1937 in Nebraska.  We lived in the back of a grocery store and I remember the heat and wet sheets; cannot stand heat to this day.  It was all true, but rather drawn out and tediously redundant.  A great lesson for the later generations to preserve our resources; both land AND water.
['NEUTRAL']


## Evaluation

In [48]:
# Mean Accuracy
print("Accuracy of SVC: ", clf_svc.score(test_x_vectors, test_y))
print("Accuracy of Logistic Regression: ", clf_lr.score(test_x_vectors, test_y))
print("Accuracy of Decision Tree: ", clf_dt.score(test_x_vectors, test_y))
print("Accuracy of KNN: ", clf_knn.score(test_x_vectors, test_y))

Accuracy of SVC 0.815
Accuracy of Logistic Regression 0.83
Accuracy of Decision Tree 0.755
Accuracy of KNN 0.785


In [60]:
# F1 Scores
from sklearn.metrics import f1_score

svc_f1 = f1_score(test_y, clf_svc.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
print("F1 score of SVC: ", svc_f1)

lr_f1 = f1_score(test_y, clf_lr.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
print("F1 score of Logistic Reg: ", lr_f1)

dt_f1 = f1_score(test_y, clf_dt.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
print("F1 score of Decision Tree: ", dt_f1)

knn_f1 = f1_score(test_y, clf_knn.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
print("F1 score of KNN: ", knn_f1)



F1 score of SVC:  [0.90434783 0.25       0.26086957]
F1 score of Logistic Reg:  [0.90909091 0.20689655 0.31578947]
F1 score of Decision Tree:  [0.86646884 0.19512195 0.09090909]
F1 score of KNN:  [0.88       0.13333333 0.1       ]
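With `average = None`, one F1 value is reported per label, in the order given by `labels`. A hand-rolled sketch of that per-class calculation, using F1 = 2·TP / (2·TP + FP + FN), may help when reading the arrays above. The helper `f1_per_class` and the toy labels are assumptions for illustration; this is not scikit-learn's code, and it simply returns 0.0 where the score is undefined.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return one F1 score per label, treating that label as the positive class."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy case: the model predicts POSITIVE for everything.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEUTRAL", "NEGATIVE"]
y_pred = ["POSITIVE"] * 6

scores = f1_per_class(y_true, y_pred, ["POSITIVE", "NEUTRAL", "NEGATIVE"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(scores)
print(accuracy)
```

In this toy case accuracy stays at about 0.67 while the NEUTRAL and NEGATIVE scores are zero, the same kind of gap between accuracy and per-class F1 seen in the results above.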


In [61]:
train_y[0:5]

['POSITIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE']

In [62]:
train_y.count(Sentiment.POSITIVE)

668

In [64]:
len(train_y)

800
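The two counts above say that 668 of the 800 training labels are POSITIVE. A small sketch of the same check with `collections.Counter` is shown below. The label list is rebuilt from those two numbers only; the notebook does not show how the remaining 132 labels divide between NEUTRAL and NEGATIVE, so they are grouped under a placeholder name.

```python
from collections import Counter

# Rebuilt from the counts shown above; "NOT_POSITIVE" is a placeholder
# because the NEUTRAL/NEGATIVE breakdown is not given in the notebook.
labels = ["POSITIVE"] * 668 + ["NOT_POSITIVE"] * 132

counts = Counter(labels)
positive_share = counts["POSITIVE"] / len(labels)

print(counts.most_common())
print(f"POSITIVE share of training labels: {positive_share:.1%}")
```

On the real data, `Counter(train_y)` would give the full three-way breakdown. A share this high means a model that always predicts POSITIVE would already be right on roughly 83.5% of the training labels, which is one plausible reason the NEUTRAL and NEGATIVE F1 scores are so low.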