This is based on a tutorial video by Keith Galli:
https://www.youtube.com/watch?v=M9Itm95JzL0

Keith took the data from http://jmcauley.ucsd.edu/data/amazon/ and cleaned it to keep only book reviews from the year 2014. He then took 1,000 random samples from that subset.


Keith's data are provided in his GitHub repository: https://github.com/keithgalli/sklearn

# Load data

In [89]:
import json
import random

In [101]:
# create basic classes to keep things neat
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        # assume scores 1 and 2 are negative, 3 is neutral, 4 and 5 are positive (Amazon reviews are out of 5)
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE
    
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
    
    def evenly_distribute(self):
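        # keep only negative and positive reviews (neutral ones are dropped)
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        # trim the positives to match the negatives (assumes positives outnumber negatives)
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk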
        random.shuffle(self.reviews)
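
A variant of the balancing step that does not assume positives outnumber negatives could look like the sketch below. The helper name `balance_two_classes`, the `(text, sentiment)` tuples and the fixed seed are assumptions made for illustration; they are not part of Keith's code.

```python
import random

def balance_two_classes(items, seed=0):
    """Return a shuffled list with equally many NEGATIVE and POSITIVE items.

    `items` is a list of (text, sentiment) tuples; NEUTRAL items are dropped.
    """
    negative = [it for it in items if it[1] == "NEGATIVE"]
    positive = [it for it in items if it[1] == "POSITIVE"]
    # keep as many of each class as the smaller class provides
    n = min(len(negative), len(positive))
    balanced = negative[:n] + positive[:n]
    random.Random(seed).shuffle(balanced)
    return balanced

sample = [("a", "POSITIVE"), ("b", "POSITIVE"), ("c", "POSITIVE"),
          ("d", "NEGATIVE"), ("e", "NEUTRAL")]
balanced_labels = [s for _, s in balance_two_classes(sample)]
print(balanced_labels.count("POSITIVE"), balanced_labels.count("NEGATIVE"))  # 1 1
```

Taking the minimum of the two class sizes keeps the result 50-50 whichever class happens to be larger.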
        

In [72]:
# read the file
file_name = "C:/Users/Riyan Aditya/Desktop/ML_learning/project8_explore/Books_small_10000.json"

reviews = []

with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        
        reviews.append(Review(review['reviewText'],review['overall']))
        

In [73]:
reviews[5].text

'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\'s trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\'s.'

# Prep data

In [74]:
len(reviews)

10000

In [75]:
from sklearn.model_selection import train_test_split

In [102]:
training, test = train_test_split(reviews,test_size = 0.33, random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)


In [128]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))


436
436


## Bag of words vectorisation

In [159]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [160]:
# apply bag-of-words vectorisation with TF-IDF weighting (TfidfVectorizer instead of raw counts)
vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)

In [161]:
print(train_x[0])
print(train_x_vectors[0])

Right into the second chapter, I was extremely confused. Once I got to chapter four, I was slightly upset and even more confused than before. This last book, this supposed culmination of a trilogy, was basically a different book altogether. I honestly felt like Divergent and Insurgent were the true beginning and end to a story, and Allegiant was fan fiction the author couldn't help but throw in.We get a completely different plot that ends up overshadowing the main plot of the first two books, our Mary Sue character apparently becomes a bit too perfect for the author to handle, and a character who is clearly emotionally unstable and extremely codependent on the main character suddenly becomes the main character himself.I adore Toby. He is like that hurt puppy you want to cuddle all day long. But we just spent two books cheering on the extremely perfect Tris. I don't mean perfect as in she can do no wrong, but I do mean that in the end she always fights for the greater good and we are al
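
To see what the vectoriser does, here is a small sketch on a made-up three-document corpus (the documents and the `demo_` variable names are assumptions for illustration only). `CountVectorizer` stores raw word counts, while `TfidfVectorizer` reweights them and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

demo_docs = ["great book", "bad book", "great great story"]

# plain bag of words: one column per word, values are raw counts
demo_count_vec = CountVectorizer()
demo_counts = demo_count_vec.fit_transform(demo_docs)
demo_vocab = sorted(demo_count_vec.vocabulary_, key=demo_count_vec.vocabulary_.get)
print(demo_vocab)                # ['bad', 'book', 'great', 'story']
print(demo_counts.toarray()[2])  # [0 0 2 1]

# TF-IDF: same columns, but counts are reweighted and each row is L2-normalised
demo_tfidf_vec = TfidfVectorizer()
demo_tfidf = demo_tfidf_vec.fit_transform(demo_docs)
demo_row_norms = np.linalg.norm(demo_tfidf.toarray(), axis=1)
print(demo_tfidf.shape)          # (3, 4)
```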

# Classification

## Linear SVM

In [162]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [163]:
test_x[0]

'1st let me say I love this series!  Gray and Promise is a wonderful love story.  Things started out so terrible for Promise, but along came Grayson .  The story is different from any mc book I have read.  I enjoyed each character in the series.  Cannot wait for the next book!!!'

In [164]:
test_x_vectors[0]

<1x8906 sparse matrix of type '<class 'numpy.float64'>'
	with 38 stored elements in Compressed Sparse Row format>

In [165]:
clf_svm.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

## Decision Tree

In [166]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

DecisionTreeClassifier()

In [167]:
clf_dec.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

## Naive Bayes

In [168]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)

GaussianNB()

In [169]:
clf_gnb.predict(test_x_vectors[0].toarray())

array(['NEGATIVE'], dtype='<U8')
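
`GaussianNB` needs dense input, which is why `.toarray()` is called above. As a possible alternative, `MultinomialNB` accepts the sparse matrix directly and is commonly used for text features. The sketch below uses made-up toy data and is a different model from the one in this notebook, so its results on the review data would not match the numbers reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy data, made up for illustration
demo_texts = ["great book", "loved it", "bad book", "hated it"]
demo_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

demo_vec = TfidfVectorizer()
demo_X = demo_vec.fit_transform(demo_texts)  # stays sparse, no .toarray() needed

clf_mnb = MultinomialNB()
clf_mnb.fit(demo_X, demo_labels)

demo_pred = clf_mnb.predict(demo_vec.transform(["great"]))
print(demo_pred)
```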

## Logistic Regression

In [170]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors.toarray(), train_y)

LogisticRegression()

In [171]:
clf_log.predict(test_x_vectors[0].toarray())

array(['POSITIVE'], dtype='<U8')

# Evaluation

## Mean accuracy

In [172]:
clf_svm.score(test_x_vectors, test_y)

0.8076923076923077

In [173]:
clf_dec.score(test_x_vectors, test_y)

0.6538461538461539

In [174]:
clf_gnb.score(test_x_vectors.toarray(), test_y)

0.6610576923076923

In [175]:
clf_log.score(test_x_vectors, test_y)

0.8052884615384616
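
For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, using a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

demo_true = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
demo_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
print(accuracy(demo_true, demo_pred))  # 0.75
```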

## F1 score

In [176]:
from sklearn.metrics import f1_score

In [177]:
f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.80582524, 0.80952381])

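In an earlier run on the unbalanced data, the model was good for positive reviews but bad for neutral and negative ones. With the balanced sets used here, the F1 scores for the positive and negative classes are nearly equal (about 0.81 each).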

In [178]:
f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.65048544, 0.65714286])

In [179]:
f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.65693431, 0.66508314])

In [180]:
f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.80291971, 0.80760095])

Before balancing, all the models were pretty much only good for positive reviews. With the balanced data, each model scores about the same on both classes.
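
With `average=None`, `f1_score` reports one F1 value per label, where F1 is the harmonic mean of precision and recall for that label. The sketch below computes it by hand for made-up labels; the helper name `f1_for_label` is an assumption for illustration.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

demo_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
demo_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
print(round(f1_for_label(demo_true, demo_pred, "POSITIVE"), 3))  # 0.667
print(round(f1_for_label(demo_true, demo_pred, "NEGATIVE"), 3))  # 0.5
```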

## Let's investigate

In [126]:
test_y.count(Sentiment.POSITIVE)

2767

This suggests our models are biased toward the positive label, since positive reviews dominate the data.

In [127]:
test_y.count(Sentiment.NEGATIVE)

208
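
These counts show why accuracy alone is misleading on the unbalanced test set. As a rough sketch (ignoring the neutral reviews, which is an assumption made here for simplicity), a classifier that always answers POSITIVE would already look very accurate:

```python
# counts taken from the two cells above
positive_count = 2767
negative_count = 208

# accuracy of a classifier that predicts POSITIVE for every review
total = positive_count + negative_count
baseline_accuracy = positive_count / total
print(round(baseline_accuracy, 3))  # 0.93
```

Such a classifier would never find a single negative review, which is what the per-class F1 scores expose.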

Let's use a bigger data set: reload the data with the 10,000-sample file.

The models are slightly better after we use the bigger training set and balance it 50-50 between positive and negative.

Then we notice that the test set is not evenly distributed. Let's balance it the same way.

OK, that's better now that the test set is 50-50 too.

In [181]:
test_set = ['very brilliant', "bad book do not buy", "horrible waste of time"]
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

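All three examples are classified as expected in this run. In an earlier run, 'very brilliant' was apparently misclassified, probably because the model did not know the word 'brilliant'.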

## Use GridSearchCV

In [182]:
from sklearn.model_selection import GridSearchCV

In [183]:
parameters ={'kernel':('linear','rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv = 5)
clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})

In [184]:
clf.best_params_

{'C': 1, 'kernel': 'linear'}

The best parameters are the ones we started with anyway: a linear kernel with the default C=1.
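
The grid search above runs on vectors from a vectoriser that was fitted on the whole training set. One possible refinement, sketched below under the assumption that you want the vectoriser refitted inside every cross-validation fold, is to wrap both steps in a `Pipeline` and search over that. The toy texts and the `demo_` names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# toy data, made up for illustration: 10 positive and 10 negative texts
demo_positive = ["great book", "loved it", "wonderful story",
                 "brilliant writing", "excellent read"] * 2
demo_negative = ["bad book", "hated it", "terrible story",
                 "awful writing", "boring read"] * 2
demo_texts = demo_positive + demo_negative
demo_labels = ["POSITIVE"] * len(demo_positive) + ["NEGATIVE"] * len(demo_negative)

# step names ("tfidf", "svc") prefix the parameter names in the grid
demo_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])
demo_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [1, 4, 8, 16, 32]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_search.fit(demo_texts, demo_labels)  # raw text in, no manual transform

print(demo_search.best_params_)
print(demo_search.best_score_)  # mean cross-validated accuracy of the best setting
```

`best_score_` is also worth checking on the real data, because it shows how much the best setting actually gains over the others.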

# Saving the model

In [186]:
import pickle

with open('sentiment_classifier.pkl','wb') as f:
    pickle.dump(clf, f)

In [187]:
# to load

with open('sentiment_classifier.pkl','rb') as f:
    loaded_clf = pickle.load(f)

In [188]:
loaded_clf.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')
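
Note that the pickled classifier only accepts vectors, so classifying new raw text later also needs the same fitted vectoriser. A minimal sketch of saving both together, using made-up toy data and an in-memory pickle instead of a file (the dictionary keys and `demo_` names are assumptions for illustration):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# toy data, made up for illustration
demo_texts = ["great book", "loved it", "bad book", "hated it"]
demo_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

demo_vectorizer = TfidfVectorizer()
demo_clf = SVC(kernel="linear")
demo_clf.fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

# bundle the fitted vectoriser with the classifier and round-trip through pickle
blob = pickle.dumps({"vectorizer": demo_vectorizer, "classifier": demo_clf})
bundle = pickle.loads(blob)

new_text = ["loved this great book"]
before = demo_clf.predict(demo_vectorizer.transform(new_text))
after = bundle["classifier"].predict(bundle["vectorizer"].transform(new_text))
print(list(before) == list(after))  # True
```

Only unpickle files you trust, and load them with the same scikit-learn version that wrote them where possible.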