### Data Class

In [138]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: # score of 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
    def evenly_distribute(self):
        # Keep equal numbers of NEGATIVE and POSITIVE reviews; NEUTRAL reviews are dropped.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        # Truncate both lists to the smaller class so the result is balanced either way.
        size = min(len(negative), len(positive))
        self.reviews = negative[:size] + positive[:size]
        random.shuffle(self.reviews)

### Load Data

In [78]:
import json

file_name = 'Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review["reviewText"], review["overall"]))
        
reviews[7].sentiment

'POSITIVE'

### Prep Data

In [140]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)


In [157]:
train_container.evenly_distribute()
test_container.evenly_distribute()

train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))

208
208


### Bag of Words Vectorization

In [174]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# TF-IDF: term frequency-inverse document frequency

vectorizer = TfidfVectorizer()
# fit_transform below is equivalent to this two-step form:
# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x)
train_x_vectors = vectorizer.fit_transform(train_x)

test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())

I was very disappointed with this book, not up to snuff by Deaver. Too many filler words, too expensive. Not interesting.
[[0. 0. 0. ... 0. 0. 0.]]
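
The difference between `fit_transform` on the training text and `transform` on the test text is easier to see on a tiny corpus. The following is a minimal sketch; the `toy_*` names and sentences are hypothetical and not part of the review dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the review dataset.
toy_train = ["great book", "bad book", "great great story"]
toy_test = ["great unseen words"]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training text only.
toy_train_vectors = toy_vectorizer.fit_transform(toy_train)
# transform reuses that vocabulary; words never seen during fitting are ignored.
toy_test_vectors = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))  # learned vocabulary
print(toy_train_vectors.shape)             # (number of documents, vocabulary size)
print(toy_test_vectors.toarray())          # only the column for "great" is non-zero
```

Fitting the vectorizer on the training text alone keeps test-set vocabulary from leaking into the features.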


### Classification

#### Linear SVM (Support Vector Machine)

In [175]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

print(test_x[0])
clf_svm.predict(test_x_vectors[0])

This was a fascinating book that slowly drew me in as the author ( economist, columnist , academic) developed his theme.  Like the book on Debt by David Graeber,  the author has a particular economic point of view , yet my own readings and subsequent central bank actions after the 2008 collapse seem to support his thoughts .  You will trace money/currency from the earliest times ( the stories are fascinating)  right up to present day and its dependence on credit, trust and credibility. . He does make sense out of why economists missed the collapse of banking/credit etc.  You will learn how the credit support of national government was needed to prop things up , that as it stands the govt is still taking all the risk and none of the upside - leaving the financial industry able to make bets and only have an upside , no downside. He gives  ideas from history that show how we can improve the system in each nation state and internationally, with the goal of a stable prosperous economy with 

array(['POSITIVE'], dtype='<U8')

#### Decision Tree

In [176]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

print(test_x[0])
clf_dec.predict(test_x_vectors[0])

This was a fascinating book that slowly drew me in as the author ( economist, columnist , academic) developed his theme.  Like the book on Debt by David Graeber,  the author has a particular economic point of view , yet my own readings and subsequent central bank actions after the 2008 collapse seem to support his thoughts .  You will trace money/currency from the earliest times ( the stories are fascinating)  right up to present day and its dependence on credit, trust and credibility. . He does make sense out of why economists missed the collapse of banking/credit etc.  You will learn how the credit support of national government was needed to prop things up , that as it stands the govt is still taking all the risk and none of the upside - leaving the financial industry able to make bets and only have an upside , no downside. He gives  ideas from history that show how we can improve the system in each nation state and internationally, with the goal of a stable prosperous economy with 

array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [180]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
# GaussianNB requires dense input. Passing the sparse TF-IDF matrix raises
# "TypeError: A sparse matrix was passed, but dense data is required."
clf_gnb.fit(train_x_vectors.toarray(), train_y)

print(test_x[0])
clf_gnb.predict(test_x_vectors[0].toarray())
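
`GaussianNB` only accepts dense arrays, and densifying a large TF-IDF matrix can use a lot of memory. One possible alternative, which this notebook does not use, is `MultinomialNB`: it accepts scipy sparse matrices directly. The sketch below swaps it in on hypothetical toy data (`toy_*` names are illustrative only).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for train_x / train_y.
toy_text = ["loved it", "great read", "terrible plot", "awful writing"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_text)  # scipy sparse matrix

toy_mnb = MultinomialNB()
# No .toarray() needed: MultinomialNB works on the sparse matrix as-is.
toy_mnb.fit(toy_vectors, toy_labels)

toy_prediction = toy_mnb.predict(toy_vectorizer.transform(["great read"]))
print(toy_prediction)
```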

#### Logistic Regression

In [181]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

print(test_x[0])
clf_log.predict(test_x_vectors[0])

This was a fascinating book that slowly drew me in as the author ( economist, columnist , academic) developed his theme.  Like the book on Debt by David Graeber,  the author has a particular economic point of view , yet my own readings and subsequent central bank actions after the 2008 collapse seem to support his thoughts .  You will trace money/currency from the earliest times ( the stories are fascinating)  right up to present day and its dependence on credit, trust and credibility. . He does make sense out of why economists missed the collapse of banking/credit etc.  You will learn how the credit support of national government was needed to prop things up , that as it stands the govt is still taking all the risk and none of the upside - leaving the financial industry able to make bets and only have an upside , no downside. He gives  ideas from history that show how we can improve the system in each nation state and internationally, with the goal of a stable prosperous economy with 

array(['POSITIVE'], dtype='<U8')

### Evaluation

In [182]:
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6177884615384616
0.8052884615384616


In [183]:
# F1 Scores
from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
# print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

[0.80582524 0.80952381]
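
With `average=None`, `f1_score` returns one score per class, in the order given by `labels`. A small worked example makes the numbers traceable by hand; the labels and predictions below are hypothetical and chosen only for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions (not from the review dataset).
toy_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
toy_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

# average=None returns one F1 score per entry in labels, in that order.
toy_scores = f1_score(toy_true, toy_pred, average=None,
                      labels=["POSITIVE", "NEGATIVE"])

# POSITIVE: precision 1/1, recall 1/2 -> F1 = 2/3
# NEGATIVE: precision 2/3, recall 2/2 -> F1 = 0.8
print(toy_scores)
```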


In [184]:
test_set = ['very fun', 'bad book do not buy', 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

### Tuning our model (with Grid Search)

In [193]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)


GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})

In [195]:
clf.score(test_x_vectors, test_y)

0.8100961538461539
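
After fitting, `GridSearchCV` exposes which parameter combination won through `best_params_`, `best_score_`, and `best_estimator_`. By default it refits the best combination on the full training data, and `score` and `predict` use that refit model. The sketch below shows how these attributes might be inspected; the toy corpus, the smaller grid, and `cv=2` are assumptions made to keep the example tiny.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data; far too small for a meaningful search.
toy_text = ["great book", "loved this story", "wonderful read", "great characters",
            "bad book", "hated this story", "terrible read", "boring characters"]
toy_labels = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

toy_vectors = TfidfVectorizer().fit_transform(toy_text)

# Smaller grid and fewer folds than the notebook, purely for illustration.
toy_grid = {"kernel": ("linear", "rbf"), "C": (1, 4)}
toy_search = GridSearchCV(svm.SVC(), toy_grid, cv=2)
toy_search.fit(toy_vectors, toy_labels)

print(toy_search.best_params_)     # winning kernel and C
print(toy_search.best_score_)      # mean cross-validated accuracy of the winner
print(toy_search.best_estimator_)  # SVC refit on all of the toy training data
```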

### Saving Model

In [203]:
import pickle

with open('C:\\Users\\HP\\Desktop\\sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

### Loading Model

In [204]:
with open('C:\\Users\\HP\\Desktop\\sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [207]:
print(test_x[0])

loaded_clf.predict(test_x_vectors[0])

This was a fascinating book that slowly drew me in as the author ( economist, columnist , academic) developed his theme.  Like the book on Debt by David Graeber,  the author has a particular economic point of view , yet my own readings and subsequent central bank actions after the 2008 collapse seem to support his thoughts .  You will trace money/currency from the earliest times ( the stories are fascinating)  right up to present day and its dependence on credit, trust and credibility. . He does make sense out of why economists missed the collapse of banking/credit etc.  You will learn how the credit support of national government was needed to prop things up , that as it stands the govt is still taking all the risk and none of the upside - leaving the financial industry able to make bets and only have an upside , no downside. He gives  ideas from history that show how we can improve the system in each nation state and internationally, with the goal of a stable prosperous economy with 

array(['POSITIVE'], dtype='<U8')
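
The pickled classifier expects input that has already been transformed by the same fitted vectorizer, so that vectorizer also has to be available wherever the model is loaded. One way to keep the two together, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable, is to pickle the vectorizer and classifier as a single object. The `toy_*` names and data are hypothetical, and the example pickles to bytes rather than to a file.

```python
import pickle

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train_x / train_y.
toy_text = ["great book", "loved this story", "bad book", "hated this story"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it.
toy_pipeline = make_pipeline(TfidfVectorizer(), svm.SVC(kernel="linear"))
toy_pipeline.fit(toy_text, toy_labels)

# Round-trip through pickle; a file opened in 'wb' / 'rb' mode works the same way.
toy_bytes = pickle.dumps(toy_pipeline)
toy_restored = pickle.loads(toy_bytes)

# The restored pipeline accepts raw strings, with no separate vectorizer step.
toy_sample = ["great story", "bad story"]
print(toy_restored.predict(toy_sample))
```

Only unpickle files from trusted sources, since loading a pickle can execute arbitrary code.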