### Book Reviews Sentiment Classification
#### By: Adebola Orogun

This project builds a simple machine learning model that classifies the sentiment of reviews written by people who bought a book. It explores Natural Language Processing techniques along with other data manipulation and machine learning procedures.

In [2]:
# Create classes for the sentiments, the reviews, and a review container to simplify the rest of the project.
# They also make it easier to access the fields of the JSON records in Python.

import random

# The Sentiment class defines constants for the sentiment categories.
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"
    
# The Review class stores a review's text and score and derives the sentiment category from the score.
class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <=2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

# The ReviewContainer class wraps a list of reviews and provides methods to retrieve their texts and sentiments.
# It can also balance the reviews across the negative and positive classes (neutral reviews are dropped).
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews)) # Keep only the negative reviews
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews)) # Keep only the positive reviews
        positive_shrunk = positive[:len(negative)] # Reduce the length of the positive list to match the negative list
        self.reviews = negative + positive_shrunk # Combine the negative and positive reviews
        random.shuffle(self.reviews) # Shuffle to remove the ordering introduced by the previous steps
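
A note on `evenly_distribute`: it only truncates the positive list, so it balances the classes only when positives outnumber negatives, and it keeps the first positives rather than a random sample. The sketch below shows one symmetric alternative that undersamples every class to the size of the smallest one. The helper name `balance_by_label`, its parameters, and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import random
from collections import defaultdict

def balance_by_label(items, get_label, labels, seed=42):
    """Undersample each class in `labels` to the size of the smallest class."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    groups = defaultdict(list)
    for item in items:
        label = get_label(item)
        if label in labels:  # items with other labels (e.g. NEUTRAL) are dropped
            groups[label].append(item)
    smallest = min(len(groups[label]) for label in labels)
    balanced = []
    for label in labels:
        # Random sample instead of taking the first `smallest` items
        balanced.extend(rng.sample(groups[label], smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: (text, sentiment) pairs with more negatives than positives
toy = [("bad", "NEGATIVE")] * 5 + [("good", "POSITIVE")] * 2 + [("meh", "NEUTRAL")] * 3
balanced = balance_by_label(toy, lambda r: r[1], ["NEGATIVE", "POSITIVE"])
counts = {label: sum(1 for _, l in balanced if l == label)
          for label in ["NEGATIVE", "POSITIVE"]}
print(counts)  # {'NEGATIVE': 2, 'POSITIVE': 2}
```

With `Review` objects, the label accessor would be something like `lambda r: r.sentiment`.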

In [3]:
#Loading the dataset into the jupyter notebook and storing the reviews as a list.
import json

review_file = "./data/book_reviews.json"

reviews = []
with open(review_file) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review["reviewText"], review["overall"]))

# Display the third review in the list of reviews
reviews[2].text
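
The loading loop above treats the file as JSON Lines: one JSON object per line, each with a `reviewText` and an `overall` field. A record missing either key would raise a `KeyError`. The following is a minimal sketch of a more tolerant loader, run against an in-memory sample; the function name `load_reviews` and the sample records are made up for illustration.

```python
import io
import json

def load_reviews(lines, text_key="reviewText", score_key="overall"):
    """Parse JSON Lines input, skipping blank lines and incomplete records."""
    records = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        row = json.loads(line)
        if text_key not in row or score_key not in row:
            skipped += 1  # count records that lack a required field
            continue
        records.append((row[text_key], row[score_key]))
    return records, skipped

# Hypothetical sample standing in for the real file
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '\n'
    '{"overall": 2.0}\n'
    '{"reviewText": "Too slow", "overall": 2.0}\n'
)
records, skipped = load_reviews(sample)
print(records, skipped)  # [('Loved it', 5.0), ('Too slow', 2.0)] 1
```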



In [6]:
# Splitting the reviews into train and test splits
from sklearn.model_selection import train_test_split

# Split the reviews using 30% as test set and 70% as training set.
train, test = train_test_split(reviews, test_size=0.3, random_state=42)

# Instantiating the train_container and test_container using the ReviewContainer class created above.
train_container = ReviewContainer(train)

test_container = ReviewContainer(test)

In [7]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

461
461


In [8]:
## Natural language processing on the reviews
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TfidfVectorizer converts each review into a numeric vector of TF-IDF weights, which the models need as input.
vectorizer = TfidfVectorizer() # Instantiating the TfidfVectorizer
train_x_vectors = vectorizer.fit_transform(train_x) # Learn the vocabulary from the training reviews and vectorize them.

test_x_vectors = vectorizer.transform(test_x) # Vectorize the test reviews with the vocabulary learned from the training data.

# Displaying how both the original review and the vectorized reviews look for better understanding.
print(train_x[0]) # Original text
print(train_x_vectors[0].toarray()) # Vectorized text

Winter's Past by Mary E. Hanks is a contemporary Christian novel that speaks of God's redemption and forgiveness in marriage.Winter Cowan is a Christian speaker and the head of Passion's Prayer, a group of like-minded individuals travelling across the country sharing the Good News of Jesus Christ. Unfortunately, they are going to  Coeur d'Alene, Idaho and there is one person there she does not want to see - her ex-husband.Ty Williams is a changed man and he is determined that Winter knows it. While he realizes he was the cause of their break-up, he is trusting God to make Winter see that he is a changed man - one who believes in God now. He is determined to win her back and re-marry &#34;the wife of his youth.&#34; Now all he has to do is get Winter to forgive him. But will Winter practice what she preaches? Can she ever really trust or forgive Ty for cheating on her?And that is the crux of this story right there. Winter preaches about God's forgiveness, so now she basically has to pra
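
To make the vectorization step more concrete, here is a small sketch on three made-up sentences. It shows the vocabulary learned by `fit_transform`, the shape of the resulting matrix (one row per document, one column per vocabulary term), and what happens when `transform` sees only words that were not in the training text. The sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["a great book", "a boring book", "great great story"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # learn the vocabulary and vectorize

# The default tokenizer lowercases text and drops one-character tokens such as "a"
vocab = sorted(vectorizer.vocabulary_)
print(vocab)         # ['book', 'boring', 'great', 'story']
print(matrix.shape)  # (3, 4)

# Words outside the learned vocabulary are ignored at transform time
unseen = vectorizer.transform(["an unknown word"])
print(unseen.nnz)    # 0 non-zero entries
```

This is why the test reviews are passed through `transform` rather than `fit_transform`: the test vectors must use the same columns the classifiers were trained on.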

In [9]:
# Importing machine learning classifier models to be trained.
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

In [11]:
# Support Vector Machine with a linear kernel.
# This cell trains the SVM on the training set and evaluates it on the test set.
# It then checks one review to see which sentiment the model predicts.
svm_classifier = svm.SVC(kernel="linear")

svm_classifier.fit(train_x_vectors, train_y)

print(svm_classifier.score(test_x_vectors, test_y))
print()

print(test_x[0])
svm_classifier.predict(test_x_vectors[0])

0.8387978142076503

I read the full account so felt this was a waste of my time and I could have been reading something more meaningful


array(['NEGATIVE'], dtype='<U8')
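
The `score` method reports overall accuracy only. A per-class view can show whether a model favors one sentiment over the other. The sketch below uses scikit-learn's `confusion_matrix` and `f1_score` on a small set of made-up labels; the label lists are illustrative and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score

labels = ["NEGATIVE", "POSITIVE"]

# Hypothetical true and predicted labels
y_true = ["NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
y_pred = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]

# Rows are true classes, columns are predicted classes, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm.tolist())  # [[2, 1], [1, 2]]

# One F1 score per class, in `labels` order
f1 = f1_score(y_true, y_pred, average=None, labels=labels)
print(f1)  # both classes score 2/3 here
```

In the notebook, `y_true` would be `test_y` and `y_pred` would be the output of a classifier's `predict(test_x_vectors)`.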

In [12]:
# Decision Tree Classifier is trained in this cell of code. 
dt_classifier = DecisionTreeClassifier()

dt_classifier.fit(train_x_vectors, train_y) #Training the model on the train split of the dataset.

print(dt_classifier.score(test_x_vectors, test_y)) # Evaluating the performance of the model on the test split.
print()

print(test_x[5]) # Prints the 6th review text
dt_classifier.predict(test_x_vectors[5]) # Prints the output sentiment from the model.

0.644808743169399

The writing was ok,but there were so many places where the foul language was so unnecessary that it detracted from the storyline.  The actual concept was great, but in so many ways it just became tedious.


array(['NEGATIVE'], dtype='<U8')

In [13]:
# Logistic Regression classifier is trained in this block of code.
lr_classifier = LogisticRegression() # Instantiating the Logistic regression model

lr_classifier.fit(train_x_vectors, train_y) # Training the model on the train split of the dataset

print(lr_classifier.score(test_x_vectors, test_y)) #Evaluating the model using the test dataset.
print()

print(test_x[3]) # Print the fourth review text (index 3) from the test set
lr_classifier.predict(test_x_vectors[3]) # Predict the sentiment for the review above.

0.8224043715846995

While the attempt to write something new and worthwhile in an abused subgenre is more than welcome, the end result, at least in this case, fails exactly in the literary moments that should  embody its novelty.First, beside the steady use of an atrocious 'baristO' in place of the correct 'baristA', there are several grammar mistakes, which is quite rich in a work that tries to use a new, lyrical language.Second, the French locutions are out of place.Last, but absolutely not least, the pervasive poetic quotations appear -to me at least- far too long and hardly related with the feeling they are supposed to convey.Plot does not make a lot of sense and the overlong scene with Eden at the cafe is hardly conducive of the lead's growth.


array(['NEGATIVE'], dtype='<U8')

In [14]:
# Testing the capabilities of the model by giving it reviews to classify.
# Creating a list of reviews
test_set = ["This is a great book", "I do not recommend this content", "Writer  must be very great"]
# Vectorizing the reviews using the vectorizer fitted above.
new_test = vectorizer.transform(test_set)

# Making predictions on the vectorized test reviews.
lr_classifier.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'POSITIVE'], dtype='<U8')

#### Saving the model.

In [15]:
import pickle

In [16]:
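# Saving the trained classifier as a pickle file.
# Note: the fitted vectorizer is not saved here, and it is needed to transform new text before prediction.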
with open("./sentiment_classifier.pkl", "wb") as f:
    pickle.dump(lr_classifier, f)

In [17]:
# Loading the model for reuse.
with open("./sentiment_classifier.pkl", "rb") as f:
    loaded_model = pickle.load(f)
    
print(test_x[10])
loaded_model.predict(test_x_vectors[10])

From the cover I thought this would be interesting. The SEX, SEX, SEX.... lots of SEX.... I read novels for good story lines, characters I care about, etc. Do not get this book even if free.


array(['NEGATIVE'], dtype='<U8')
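
The pickle file above holds only `lr_classifier`. The reload works in this session because `test_x_vectors` is still in memory, but a fresh session would also need the fitted vectorizer to turn raw text into the same feature space. One possible approach, sketched below, is to bundle the vectorizer and classifier in a scikit-learn pipeline and pickle that single object. The training sentences are made up, and the sketch serializes to memory with `pickle.dumps` instead of writing a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
texts = [
    "great book loved it",
    "wonderful story great characters",
    "terrible plot waste of time",
    "boring and terrible writing",
]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

# Serialize and restore the whole pipeline (vectorizer + classifier)
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw strings directly
new = ["a great story", "a terrible book"]
print(restored.predict(new))
```

As with any pickle file, a saved pipeline should only be loaded from a trusted source and with a compatible scikit-learn version.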