# Create a class to easily access text and score

In [1]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review: # holds one review: Review(text, score)
    def __init__(self, text, score): # text and score are supplied when the object is created
        self.text = text # accessible later as review.text
        self.score = score
        self.sentiment = self.get_sentiment() # derived once from the score
        
    def get_sentiment(self):
        if self.score <=2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #score 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer: # ReviewContainer(reviews)
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews] # we get the list of text from reviews list
        
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews] # we get the sentiment from reviews list
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)] # keep only as many positives as there are negatives, e.g. 450 negatives and 600 positives leaves 450 positives
        self.reviews = negative + positive_shrunk # balanced list, e.g. 450 negatives + 450 positives; neutral reviews are dropped
        random.shuffle(self.reviews) # shuffle in place so positives and negatives are interleaved
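
Note that `evenly_distribute` trims only the positives, and it keeps the first ones rather than a random sample, so it balances the data only when positives are at least as numerous as negatives. A minimal sketch of a more general version follows. The helper name `balance_by_label` and the toy `(text, sentiment)` pairs are assumptions for illustration, not part of the notebook.

```python
import random

def balance_by_label(items, label_of, seed=42):
    """Downsample every label group to the size of the smallest group."""
    groups = {}
    for item in items:
        groups.setdefault(label_of(item), []).append(item)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # local generator, so the result is reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))  # random subset, not the first n
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): more negatives than positives.
toy = [("bad", "NEGATIVE")] * 5 + [("good", "POSITIVE")] * 3
balanced = balance_by_label(toy, label_of=lambda pair: pair[1])
print(len(balanced))
```

Unlike `evenly_distribute`, this sketch keeps every label it is given, so neutral reviews would need to be filtered out beforehand if only two classes are wanted.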

# Import data and append it into a list

In [2]:
import json
import pandas as pd

file_name = 'C:/Users/Randy/Downloads/archive/Books_small_10000.json'

reviews = []

with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall'])) # build Review(text, score) from each JSON record

print(reviews[5].score) # the score attribute set in __init__
print(reviews[5].text)
print(reviews[5].get_sentiment()) # recomputes the sentiment from the score
reviews[5].sentiment # the stored attribute, same value as the call above

5.0
I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.
POSITIVE


'POSITIVE'

In [3]:
reviews[5].sentiment

'POSITIVE'

In [4]:
len(reviews)

10000
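
The loading loop above assumes every record has both a `reviewText` and an `overall` key. Whether this file contains incomplete records is not something we checked, so the following is only a defensive sketch. The function name `parse_reviews` and the in-memory sample are assumptions for illustration.

```python
import io
import json

def parse_reviews(lines):
    """Yield (text, score) pairs from JSON Lines input, skipping incomplete records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        text = record.get("reviewText")
        score = record.get("overall")
        if text is None or score is None:
            continue  # skip records missing either field
        yield text, float(score)

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '\n'
    '{"overall": 2.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)
pairs = list(parse_reviews(sample))
print(pairs)
```

Each pair could then be passed to `Review(text, score)` exactly as in the cell above.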

# Train Test Split

In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(train)
test_container = ReviewContainer(test)

In [6]:
len(train) # number of reviews in the training split

6700

In [7]:
len(test) # number of reviews in the test split

3300

In [8]:
print(train[0].sentiment)

POSITIVE


In [9]:
train_container.evenly_distribute() # balance the training reviews: equal numbers of POSITIVE and NEGATIVE, NEUTRAL dropped

train_x = train_container.get_text() # review texts (features)
train_y = train_container.get_sentiment() # sentiment labels (targets)

test_container.evenly_distribute()

test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print("Train data: {x}, {y}".format(x = len(train_x) , y = len(train_y)))
print("Test data: {0}, {1}".format(len(test_x) , len(test_y)))

Train data: 872, 872
Test data: 416, 416


# Bag of words

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer() # create the bag-of-words vectorizer
X_train_vectors = vectorizer.fit_transform(train_x) # fit learns the vocabulary from train_x; transform turns each text into a sparse vector of word counts

X_test_vectors = vectorizer.transform(test_x) # reuse the training vocabulary; we do not fit on the test data


print(train_x[0])
print(X_train_vectors[0])

This book was quite a disappointment for me: the blurb caught my attention but the execution was a far cry from what I was expecting.Darcie is a survivor coming from a totally messed up childhood, at the age of 14 Reggie rescues her becoming her older brother. She always refers to him as her &#34;knight in shining harmor&#34;.The problem is that Darcie is an annoying 17 yo bordering to flat-out stupid that passes her time in fight and suspensions from school. Suddenly she decides that she's always been in love with him, an goes on throwing tantrums nod being insufferable.Reggie is a far cry from a knight in shining harmor, more of a psychopath with anger issues and murdering tendencies.The story is quite unbelievable and most of the times it appears nothing more than a series of scenes without a real logical connection between them.Also the writing wasn't very good: too minimal without a real deepness to the characters and the situations that appear to be flat at best. Sometimes it ste
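
To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is a simplification: it splits on whitespace, whereas `CountVectorizer` uses its own regular-expression tokenizer and returns a sparse matrix. The function names and toy documents are assumptions for illustration.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct lowercase token to a column index (the 'fit' step)."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}

def to_count_vector(document, vocabulary):
    """Count the known tokens in one document (the 'transform' step)."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # tokens not seen during fit are dropped
            vector[vocabulary[token]] = count
    return vector

train_docs = ["great book great story", "boring story"]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
print(to_count_vector("great great ending", vocabulary))
```

The word "ending" never appeared during fitting, so it contributes nothing to the vector. This is why the test texts are only transformed with the vocabulary learned from the training texts.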

# Classification

## Linear SVM

In [11]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear') # create a support vector classifier with a linear kernel
clf_svm.fit(X_train_vectors, train_y) # train on the vectorized training texts and their labels

# Predict the sentiment of the first test review
print(test_x[0])
clf_svm.predict(X_test_vectors[0]) # apply the trained model to the first vectorized test text

# The output below shows the review text followed by the predicted label (POSITIVE or NEGATIVE)

As I was reading this book I kept wondering why we were getting so much endless detail about the inn and the businesses in the town of Boonsboro; seriously, this read almost like a brochure put out by the Boonsboro chamber of commerce, with special emphasis on the inn.  Then I googled Boonsboro and discovered that the inn in question exists in "real life" and is owned by the author, as is the bookshop.  This use of a novel to promote a business venture would be fine if the story were interesting.  It's not.  The romance is only a side story -- the real story is about all the details relating to the refurbishment of the Inn, and the plans to buy stuff to put in it.  The heroine is perfectly acceptable but seems untouched by her past, the Hero is unbelievably perfect, and there is absolutely no tension in the relationship at all.  No angst, no obstacles in the way, nothing.  At the end I felt as if I'd been manipulated into reading a long ad for the inn.  I won't be reading the others in

array(['NEGATIVE'], dtype='<U8')

## Decision Tree 

In [12]:
from sklearn.tree import DecisionTreeClassifier
clf_dec_tree = DecisionTreeClassifier()
clf_dec_tree.fit(X_train_vectors, train_y) # train on the vectorized training texts and their labels


print(test_x[0])
clf_dec_tree.predict(X_test_vectors[0])

As I was reading this book I kept wondering why we were getting so much endless detail about the inn and the businesses in the town of Boonsboro; seriously, this read almost like a brochure put out by the Boonsboro chamber of commerce, with special emphasis on the inn.  Then I googled Boonsboro and discovered that the inn in question exists in "real life" and is owned by the author, as is the bookshop.  This use of a novel to promote a business venture would be fine if the story were interesting.  It's not.  The romance is only a side story -- the real story is about all the details relating to the refurbishment of the Inn, and the plans to buy stuff to put in it.  The heroine is perfectly acceptable but seems untouched by her past, the Hero is unbelievably perfect, and there is absolutely no tension in the relationship at all.  No angst, no obstacles in the way, nothing.  At the end I felt as if I'd been manipulated into reading a long ad for the inn.  I won't be reading the others in

array(['POSITIVE'], dtype='<U8')

## Naive Bayes

In [13]:
class DenseTransformer: # converts the sparse matrix to a dense array with toarray(), which GaussianNB requires

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()

In [14]:
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline # a Pipeline chains the steps so they run in order

pipeline = Pipeline([('to_dense', DenseTransformer()), ('classifier', GaussianNB())])

pipeline.fit(X_train_vectors, train_y)

print(test_x[0])
pipeline.predict(X_test_vectors[0])

As I was reading this book I kept wondering why we were getting so much endless detail about the inn and the businesses in the town of Boonsboro; seriously, this read almost like a brochure put out by the Boonsboro chamber of commerce, with special emphasis on the inn.  Then I googled Boonsboro and discovered that the inn in question exists in "real life" and is owned by the author, as is the bookshop.  This use of a novel to promote a business venture would be fine if the story were interesting.  It's not.  The romance is only a side story -- the real story is about all the details relating to the refurbishment of the Inn, and the plans to buy stuff to put in it.  The heroine is perfectly acceptable but seems untouched by her past, the Hero is unbelievably perfect, and there is absolutely no tension in the relationship at all.  No angst, no obstacles in the way, nothing.  At the end I felt as if I'd been manipulated into reading a long ad for the inn.  I won't be reading the others in

array(['POSITIVE'], dtype='<U8')
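
As an aside, `MultinomialNB` is a different Naive Bayes variant that accepts sparse word counts directly, so it would not need the `DenseTransformer` step. This is a swapped-in technique, not the `GaussianNB` model used above, and we make no claim about how it would score on this dataset. The toy texts and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for train_x and train_y.
texts = ["loved this book", "great story", "boring and dull", "terrible writing"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(texts)  # sparse matrix of word counts

clf_mnb = MultinomialNB()
clf_mnb.fit(toy_vectors, labels)  # no conversion to a dense array is needed

prediction = clf_mnb.predict(toy_vectorizer.transform(["great story"]))
print(prediction)
```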

## Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(X_train_vectors, train_y)

print(test_x[0])
clf_log.predict(X_test_vectors[0])

As I was reading this book I kept wondering why we were getting so much endless detail about the inn and the businesses in the town of Boonsboro; seriously, this read almost like a brochure put out by the Boonsboro chamber of commerce, with special emphasis on the inn.  Then I googled Boonsboro and discovered that the inn in question exists in "real life" and is owned by the author, as is the bookshop.  This use of a novel to promote a business venture would be fine if the story were interesting.  It's not.  The romance is only a side story -- the real story is about all the details relating to the refurbishment of the Inn, and the plans to buy stuff to put in it.  The heroine is perfectly acceptable but seems untouched by her past, the Hero is unbelievably perfect, and there is absolutely no tension in the relationship at all.  No angst, no obstacles in the way, nothing.  At the end I felt as if I'd been manipulated into reading a long ad for the inn.  I won't be reading the others in

array(['NEGATIVE'], dtype='<U8')

# Evaluation Metrics

In [16]:
# Mean Accuracy

print(clf_dec_tree.score(X_test_vectors, test_y)) # decision tree
print(pipeline.score(X_test_vectors, test_y)) # Gaussian Naive Bayes pipeline
print(clf_svm.score(X_test_vectors, test_y)) # linear SVM
print(clf_log.score(X_test_vectors, test_y)) # logistic regression


0.6370192307692307
0.6346153846153846
0.7980769230769231
0.8149038461538461


In [17]:
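# F1 score per class (POSITIVE, NEGATIVE, NEUTRAL) for the SVM and logistic regression models.
# evenly_distribute() removed the NEUTRAL reviews, so the NEUTRAL F1 is 0 and scikit-learn emits a warning.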

from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(X_test_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE, Sentiment.NEUTRAL]))
print(f1_score(test_y, clf_log.predict(X_test_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE, Sentiment.NEUTRAL]))

[0.8028169  0.79310345 0.        ]
[0.82051282 0.808933   0.        ]


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
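
For reference, the per-class F1 score is the harmonic mean of precision and recall for that class. A minimal sketch computing it by hand follows. The function name `f1_for_label` and the toy labels are assumptions for illustration, and returning 0.0 when there are no true positives is a choice made here to match the zero reported for NEUTRAL above.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # also covers labels that never occur, such as NEUTRAL here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels, not taken from the notebook's data.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

for label in ["POSITIVE", "NEGATIVE", "NEUTRAL"]:
    print(label, f1_for_label(y_true, y_pred, label))
```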


In [18]:
test_y.count(Sentiment.NEGATIVE)

208

In [19]:
print(train_y.count(Sentiment.NEGATIVE))
train_y.count(Sentiment.POSITIVE) # after balancing we have 872 train labels: 436 negative and 436 positive

436


436

# Balance Positives and Negatives

In [20]:
# The first version of the model was heavily biased toward predicting POSITIVE, so we loaded more raw data.
# Adding raw data alone made no significant difference, so we also balanced the negative and positive counts in both the train and test data.
# Besides CountVectorizer, TF-IDF weighting is another option.
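
A minimal sketch of the TF-IDF option, assuming `TfidfVectorizer` is used as a drop-in replacement for `CountVectorizer`. The toy documents are assumptions for illustration, and we have not measured whether TF-IDF improves the scores on this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for train_x and test_x.
train_docs = ["great book great story", "boring story", "great characters"]
test_docs = ["boring book"]

tfidf = TfidfVectorizer()
train_vectors = tfidf.fit_transform(train_docs)  # learn vocabulary and IDF weights from the training texts
test_vectors = tfidf.transform(test_docs)  # reuse them for the test texts

print(train_vectors.shape, test_vectors.shape)
print(sorted(tfidf.vocabulary_))
```

Words that appear in many reviews, such as "story" here, receive a lower weight than words that appear in few, which can reduce the influence of common words on the classifier.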

# Testing out the model

In [21]:
prototype_data = [input("Enter a comment: ")]
new_test = vectorizer.transform(prototype_data)

clf_svm.predict(new_test)

Enter a comment: I did not enoy reading the book


array(['NEGATIVE'], dtype='<U8')

# Saving our Model

In [308]:
import pickle

with open('C:/Users/Randy/Downloads/IE things/ML Models/Sentiment_Classifier.pkl', 'wb') as f:
    pickle.dump(clf_svm, f)

# Load Model

In [309]:
with open('C:/Users/Randy/Downloads/IE things/ML Models/Sentiment_Classifier.pkl', 'rb') as f:
    loaded_clf_svm = pickle.load(f)
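
The pickle above stores only the classifier, so a fresh session would also need the fitted `vectorizer` before the loaded model could be used. One possible approach, sketched under the assumption that the vectorizer and classifier are wrapped in a single `Pipeline`, is to pickle them together. The toy data is hypothetical, and the round trip is done in memory to avoid assuming a file path.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for train_x and train_y.
texts = ["loved this book", "great story", "boring and dull", "terrible writing"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", SVC(kernel="linear")),
])
model.fit(texts, labels)  # fits the vectorizer and the classifier together

payload = pickle.dumps(model)  # serialize both steps as one object
restored = pickle.loads(payload)

samples = ["great book", "dull writing"]
print(restored.predict(samples))  # raw text in, labels out
```

With this layout the restored object accepts raw strings, so no separate `vectorizer.transform` call is needed at prediction time.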

In [312]:
# Use the reloaded model to classify a new comment

prototype_data = [input("Enter a comment: ")]
new_test = vectorizer.transform(prototype_data)

loaded_clf_svm.predict(new_test)

Enter a comment: It was mediocre


array(['NEGATIVE'], dtype='<U8')