Data Class 

In [3]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

Load Data

In [7]:
url = 'https://raw.githubusercontent.com/KeithGalli/sklearn/master/data/sentiment/Books_small_10000.json'
if url.find('/'):
  print url.rsplit('/', 1)[1]

SyntaxError: invalid syntax (<ipython-input-7-42d59ce3810f>, line 3)

In [10]:
import requests

url = 'https://raw.githubusercontent.com/KeithGalli/sklearn/master/data/sentiment/Books_small_10000.json'
r = requests.get(url, allow_redirects=True)
open('githubusercontent.Books_small_10000.json', 'wb').write(r.content)

8063542

In [11]:
import json

file_name = 'githubusercontent.Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[5].text

'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\'s trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\'s.'

In [12]:

from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

In [13]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


Bag of Words Vectorization 

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# This book is great !
# This book was so bad

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())

Great Suspenseful Book. Well written with the plot masterfully taken thru the story. Kept me interested can't wait to read the second book.
[[0. 0. 0. ... 0. 0. 0.]]


Classification 

Linear SVM

In [17]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x[0]

clf_svm.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

Naive Bayes

In [19]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(train_x_vectors, train_y)

clf_gnb.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

Logistic Regression

In [20]:

from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])



array(['NEGATIVE'], dtype='<U8')

Evaluation 

In [21]:
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6586538461538461
0.625
0.8028846153846154


In [22]:
# F1 Scores
from sklearn.metrics import f1_score

f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])
#f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])

array([0.80582524, 0.80952381])

In [23]:
test_set = ['very fun', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

Tuning our model(grid search)

In [25]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [26]:
print(clf.score(test_x_vectors, test_y))


0.8076923076923077


# Saving Model 

save model 

In [31]:

import pickle

with open('sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

Load Model 

In [32]:

with open('sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [33]:
print(test_x[0])

loaded_clf.predict(test_x_vectors[0])

The Brian Herbert-Kevin J. Anderson Dune books are quite polarizing. After having read the final book in this series, I find some of the criticism valid. But it's also clear that the authors did try to flesh out the Dune universe in a way that would satisfy Frank Herbert. They're clearly not quite as good with the art of writing, but the books aren't trash.As I've noted in my reviews of the other Dune House books, there is a lot of repetition and lack of subtlety in the writing. I'm not sure if this is because the book had two authors, but the book repeats itself as if worrying readers will forget. In House Corrino, I found the biggest problem to be that the book had too many disparate and sometimes interwoven plots, but little to make me care about many of them. By the middle of the book, you have dozens of characters running around starting wars, having babies, scheming, etc. The chapters are quite short, meaning that you jump around these various subplots quite a bit because every s

NotFittedError: This Perceptron instance is not fitted yet

In [36]:
from sklearn.linear_model import Perceptron

clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(test_x, test_y)

ValueError: could not convert string to float: "The Brian Herbert-Kevin J. Anderson Dune books are quite polarizing. After having read the final book in this series, I find some of the criticism valid. But it's also clear that the authors did try to flesh out the Dune universe in a way that would satisfy Frank Herbert. They're clearly not quite as good with the art of writing, but the books aren't trash.As I've noted in my reviews of the other Dune House books, there is a lot of repetition and lack of subtlety in the writing. I'm not sure if this is because the book had two authors, but the book repeats itself as if worrying readers will forget. In House Corrino, I found the biggest problem to be that the book had too many disparate and sometimes interwoven plots, but little to make me care about many of them. By the middle of the book, you have dozens of characters running around starting wars, having babies, scheming, etc. The chapters are quite short, meaning that you jump around these various subplots quite a bit because every single subplot seems to climax at the same time. This happens a lot in Star Wars books and frustrates me to no end.And if you've read Dune, you know that none of it matters. Of course, this is an inherent risk with any Prequel as it's almost impossible to surprise the audience. But this is why authors who try to pen a prequel should focus on building character and depth than on plot twists. We know what has to happen, so why bother? Instead, a good prequel should tell the reader why they should care. Dune, as a single book, ironically feels like it had more depth than this book.With one exception. I did like seeing Leto Atreides becoming a leader. I thought making him an aggressive military commander was an interesting choice. The authors, as they did in the previous novels, did make him a bit too much of a goody two-shoes, but I do feel like the explanation for why he became so popular prior to the events of Dune makes sense.Ironically, despite the title, I didn't feel I'd gained much insight into House Corrino at all. Emperor Shaddam comes across as petulant and semi-retarded. This is not the same man who laid such devious plans in Dune and was so careful to hide his involvement in the downfall of House Atreides. Perhaps Shaddam grew wiser within the intervening years, but if so THAT is the story I would have wanted to read. Instead, in this book, we constantly hear Shaddam moaning about how he can make decisions without Fenring's advice (even when it's clear that he can't).Also, this book contradicts Dune in a major way. Paul is born on Kaitin, not Caladan, even though in Dune it's clear he'd never left his home world. Why? Why contradict Frank Herbert's masterpiece? Having Paul born on Kaitin added nothing to the story or his character.As this is the final book, I'm forced to ask myself if I should have read the series at all. I'd give the series as a whole 3 stars. There were some nice backstories, particularly for Gurney Halleck, Leto, and Count Fenring. But the books are just not well written. House Corrino especially feels like a triumph of breadth over depth - exactly the opposite of Frank Herbert's Dune books. I &#34;read&#34; these books as audiobooks while doing other chores, so overall I didn't exactly lose too much of my time with the books. For Dune fans, they'll perhaps give you a spice hit, but they won't inspire you to run out and get the next set of Dune books written by Herbert & Anderson.Overall: 2.5 stars."