In [1]:
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"
    

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # Map the numeric score to a sentiment label
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE  # a score of 4 or 5
        

## Load Data

In [108]:
import json

file_path = './data/Books_reviews.json'

reviews = []
with open(file_path) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review["reviewText"], review["overall"]))


In [3]:
reviews[3].sentiment

'POSITIVE'

## Prep Data

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
training, test = train_test_split(reviews, test_size=0.3, random_state=42)


In [6]:
print(len(training))
print(len(test))

700
300


In [7]:
training[0].sentiment

'POSITIVE'

In [8]:
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]
test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

In [9]:
train_x[2]

'The story of Savannah is like reality....your expectations are all being fulfilled, then LIFE throws a curve ball you do not see coming. As with all of the trials you find, there is sometimes an unexpected surprise that changes your direction. Kayden & Savannah have both been hurt in a relationship, but he & Savannah take a chance on each other.Recommended reading...await the next part of this story.'

### Bag of words

Bag-of-words (BoW) represents each document as a vector of word counts over a fixed vocabulary, ignoring grammar and word order.
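To make the idea concrete, here is a minimal sketch of bag-of-words built by hand with the standard library. The two toy documents are invented for illustration and are not from the review dataset. `CountVectorizer` follows the same principle but differs in details: by default it lowercases the text and keeps only tokens of two or more alphanumeric characters.

```python
from collections import Counter

# Toy documents invented for illustration; not taken from the review data.
docs = ["great book great story", "boring story"]

# Vocabulary: every distinct token, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})


def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [to_vector(doc) for doc in docs]
print(vocab)    # ['book', 'boring', 'great', 'story']
print(vectors)  # [[1, 0, 2, 1], [0, 1, 0, 1]]
```

Each row has one column per vocabulary word, which is why the real vectors printed in this notebook are almost entirely zeros: any single review uses only a small part of the full vocabulary.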

In [158]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts only, then reuse it for the test texts
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)
print(train_x[1])
print(train_x_vectors[1].toarray())

I found this book very interesting, despite the way the story unfolds.  It is hard to believe that at this day and age this kind of prostitution is still going on, but I guess this is another way for the rich to get richer.Please read for yourself and let me know what you think.  My biggest no no with this book is the child exploitation, which unfortunately still happens and I believe It will happen for a long time to come.Sad story for the girls, but overall a great read.  Highly recommended.
[[0 0 0 ... 0 0 0]]


## Classification

We will try several algorithms and compare how they perform.

### Linear SVM

In [159]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

In [160]:
print(test_x[1])
print(test_x_vectors[1].toarray())

This book was a great version of Cinderella, it's better than the original version, I like how the "ugly step-sister's" were actually nice and caring, the people of the kingdom only judged a book by the cover, and the mother just wanted her daughters to have a happy life, and the coachman had been in love with Anna all this time! This is a great version of Cinderella, this should become a movie and a real novel not an novelette
[[0 0 0 ... 0 0 0]]


In [161]:
clf_svm.predict(test_x_vectors[1])

array(['POSITIVE'], dtype='<U8')

In [162]:
test_y[1]

'POSITIVE'

In [163]:
# Mean Accuracy
score_svm = clf_svm.score(test_x_vectors.toarray(), test_y)
score_svm

0.8233333333333334

### Decision Tree

In [164]:
from sklearn import tree 

clf_dec = tree.DecisionTreeClassifier()


clf_dec.fit(train_x_vectors, train_y)

In [165]:
print(test_x[2])
print(test_x_vectors[2].toarray())

Michael Cunningham mesmerizes with the thoughtful, elegant prose that is this book.  The reader becomes so close to its characters...the reader feels what these brothers feel.  Beautiful and tragic...a book that will stay with me for a long, long time.  Thank you again, Mr. Cunningham.  The Hours remains at the top of my list and The Snow Queen is another gift to your readers.
[[0 0 0 ... 0 0 0]]


In [166]:
clf_dec.predict(test_x_vectors[2])

array(['POSITIVE'], dtype='<U8')

In [167]:
score_dec = clf_dec.score(test_x_vectors, test_y)
score_dec

0.7633333333333333

### Naive Bayes

In [168]:
from sklearn.naive_bayes import GaussianNB
clf_bay = GaussianNB()

clf_bay.fit(train_x_vectors.toarray(), train_y)


In [169]:
clf_bay.predict(test_x_vectors[2].toarray())

array(['POSITIVE'], dtype='<U8')

In [170]:
score_bay = clf_bay.score(test_x_vectors.toarray(), test_y)
score_bay

0.8133333333333334

### Random Forest

In [171]:
from sklearn.ensemble import RandomForestClassifier
clf_rfc = RandomForestClassifier()
clf_rfc.fit(train_x_vectors.toarray(), train_y)

In [172]:
clf_rfc.predict(test_x_vectors[2].toarray())

array(['POSITIVE'], dtype='<U8')

In [173]:
score_rfc = clf_rfc.score(test_x_vectors.toarray(), test_y)
score_rfc

0.86

### QDA

In [174]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
clf_qda = QuadraticDiscriminantAnalysis()
clf_qda.fit(train_x_vectors.toarray(), train_y)



In [175]:
clf_qda.predict(test_x_vectors[2].toarray())

array(['NEUTRAL'], dtype='<U8')

In [176]:
score_qda = clf_qda.score(test_x_vectors.toarray(), test_y)
score_qda

0.27666666666666667

### F1 Score Evaluation

In [177]:
from sklearn.metrics import f1_score

# Per-class F1, in the order POSITIVE, NEUTRAL, NEGATIVE
labels = [Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]

# One output row per model: SVM, decision tree, naive Bayes, random forest, QDA
for clf in (clf_svm, clf_dec, clf_bay, clf_rfc, clf_qda):
    print(f1_score(test_y, clf.predict(test_x_vectors.toarray()), average=None, labels=labels))


[0.9059501  0.26923077 0.2962963 ]
[0.86821705 0.14035088 0.07407407]
[0.89757914 0.0952381  0.0952381 ]
[0.92473118 0.         0.        ]
[0.38690476 0.14529915 0.06666667]


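As the scores show, every model except QDA is predicting almost every review as POSITIVE: F1 is high for POSITIVE and low for NEUTRAL and NEGATIVE. QDA scores poorly on all three classes.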
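This pattern is what a classifier that almost always answers POSITIVE would produce. The sketch below computes per-class F1 by hand as 2·TP / (2·TP + FP + FN), which is the quantity `f1_score` reports per label when `average=None`. The labels are made up for illustration, and `f1_per_class` is a hypothetical helper, not part of scikit-learn.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return one F1 score per label: 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A label that is never predicted correctly gets an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Made-up labels: 8 positive, 1 neutral, 1 negative.
y_true = ["POSITIVE"] * 8 + ["NEUTRAL", "NEGATIVE"]
# A classifier that ignores its input and always answers POSITIVE.
y_pred = ["POSITIVE"] * 10

scores = f1_per_class(y_true, y_pred, ["POSITIVE", "NEUTRAL", "NEGATIVE"])
print([round(s, 3) for s in scores])  # [0.889, 0.0, 0.0]
```

The always-POSITIVE classifier still gets a high F1 on POSITIVE and exactly zero on the other two classes, which is the same shape as the random forest row above.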

In [178]:
train_y[0:10]

['POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL']

In [179]:
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEUTRAL))
print(train_y.count(Sentiment.NEGATIVE))

577
75
48


The models were trained mostly on POSITIVE reviews (577 of 700), which explains their weak predictions for the NEGATIVE and NEUTRAL classes.
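To put the imbalance in numbers, the sketch below uses the training counts printed above to work out each class's share and the accuracy of a trivial baseline that always predicts the majority class. The baseline figure only holds for data with the same class mix as the training set. That the test split looks similar is an assumption, not something checked here.

```python
# Class counts from the training set, as printed above.
counts = {"POSITIVE": 577, "NEUTRAL": 75, "NEGATIVE": 48}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Always predicting the most common class is right exactly that often.
majority = max(counts, key=counts.get)
baseline_accuracy = counts[majority] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(majority, round(baseline_accuracy, 3))  # POSITIVE 0.824
```

Under that assumption, the 0.81 to 0.86 accuracies reported for the SVM, naive Bayes, and random forest models are roughly what this do-nothing baseline would reach, so accuracy alone says little about how well the minority classes are handled.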

This is more of a data problem than a model problem, so there is not much we can do on the modeling side.
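On the data side, one option that is not part of this notebook is to rebalance the training set by undersampling, that is, keeping the same number of examples from every class. The sketch below is a minimal version under stated assumptions: `undersample` is a hypothetical helper, the toy texts and labels are invented, and the seed is arbitrary.

```python
import random


def undersample(texts, labels, seed=42):
    """Keep the same number of examples per class, limited by the rarest class."""
    rng = random.Random(seed)

    # Group the texts by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Draw as many examples from each class as the rarest class has.
    n = min(len(items) for items in by_label.values())
    balanced = [
        (text, label)
        for label, items in by_label.items()
        for text in rng.sample(items, n)
    ]
    rng.shuffle(balanced)

    balanced_texts = [text for text, _ in balanced]
    balanced_labels = [label for _, label in balanced]
    return balanced_texts, balanced_labels


# Toy data invented for illustration: 7 positive, 2 neutral, 1 negative.
texts = [f"review {i}" for i in range(10)]
labels = ["POSITIVE"] * 7 + ["NEUTRAL"] * 2 + ["NEGATIVE"]

bal_x, bal_y = undersample(texts, labels)
print(sorted(bal_y))  # ['NEGATIVE', 'NEUTRAL', 'POSITIVE']
```

The cost is data: with only 48 NEGATIVE reviews in the training set, undersampling would leave 144 training examples in total, so whether it actually helps here would need to be tested.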

### Testing the Model on External Data

In [184]:
test_data = ["not very recommended", "I love it"]
new_data = vectorizer.transform(test_data)


In [185]:
clf_qda.predict(new_data.toarray())

array(['NEUTRAL', 'POSITIVE'], dtype='<U8')