## Scikit-Learn Tutorial with Keith Galli

### 0. Create data classes (for neater code)

In [3]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: # Score of 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
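    def evenly_distribute(self):
        # Keep only NEGATIVE and POSITIVE reviews (NEUTRAL ones are dropped) and
        # trim the positives to the number of negatives; this assumes positives
        # are at least as numerous as negatives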
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
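
The slicing in `evenly_distribute` only balances the data when positives outnumber negatives. As a minimal sketch of a symmetric alternative, the hypothetical helper below (`balance_classes` is not part of the notebook) downsamples every class to the size of the smallest one. Unlike the original it samples at random instead of taking the first items, and it balances whatever labels it is given, so NEUTRAL reviews would have to be filtered out beforehand.

```python
import random

def balance_classes(items, label_of, seed=None):
    """Downsample every class to the size of the smallest class."""
    groups = {}
    for item in items:
        groups.setdefault(label_of(item), []).append(item)
    if not groups:
        return []
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy (label, text) pairs with more negatives than positives,
# the case the slicing approach does not handle.
toy = [("NEGATIVE", "n1"), ("NEGATIVE", "n2"), ("NEGATIVE", "n3"), ("POSITIVE", "p1")]
balanced = balance_classes(toy, label_of=lambda pair: pair[0], seed=0)
labels = [label for label, _ in balanced]
print(labels.count("NEGATIVE"), labels.count("POSITIVE"))
```

With `Review` objects the call would look like `balance_classes(reviews, label_of=lambda r: r.sentiment)`.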

### 1. Importing and reading data

In [4]:
import json

file_name = './Datasets/sentiment/Books_small_10000.json'

# Build a list of Review objects from the file (one JSON object per line),
# keeping the reviewText and overall fields
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[1].sentiment

'NEUTRAL'
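
The loop above parses one JSON object per line, which suggests the file uses the JSON Lines layout, not a single JSON document. The sketch below shows the same parsing on two made-up records held in memory; the helper name `read_reviews` and the sample texts are assumptions for illustration, while the field names `reviewText` and `overall` come from the cell above.

```python
import io
import json

# Two made-up records standing in for lines of the dataset file.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

def read_reviews(lines):
    """Yield (text, score) pairs from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        yield record["reviewText"], record["overall"]

pairs = list(read_reviews(sample))
print(pairs)
```

Because each line is parsed on its own, the file never has to be loaded as one document, which keeps memory use low for larger review dumps.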

### 2. Data Preparation

In [5]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

# Make train and test containers
train_container = ReviewContainer(training)
test_container = ReviewContainer(test)

# Evenly distribute positive and negative reviews
train_container.evenly_distribute()
test_container.evenly_distribute()

print(len(train_container.reviews))
print(len(test_container.reviews))

872
416


In [6]:
# Split training and test data into x (text) and y (sentiment)

train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


#### Bag of Words Vectorization

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training data, then transform it
# (fit and transform can also be called as two separate steps)
train_x_vectors = vectorizer.fit_transform(train_x)

# Transform test data with the vocabulary learned from the training data (no refitting)
test_x_vectors = vectorizer.transform(test_x)
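
As a small sketch of the "two different steps" mentioned in the comment, the example below fits and transforms separately on two made-up sentences and checks that the result matches `fit_transform`. It also shows that a word never seen during fitting contributes nothing when new text is transformed. The toy texts are assumptions, not data from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good book", "bad book"]  # toy training texts (made up)

# One step: learn vocabulary and IDF weights, then transform.
one_step = TfidfVectorizer().fit_transform(train_docs)

# Two steps: fit first, transform afterwards.
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)
two_step = vectorizer.transform(train_docs)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)

# A word that was not in the training vocabulary is ignored by transform.
unseen = vectorizer.transform(["awesome"])
print(unseen.nnz)
print(sorted(vectorizer.vocabulary_))
```

This is why the test data is only passed to `transform`: refitting on it would change the vocabulary and the column order the classifiers were trained on.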

### 3. Classification

#### Linear SVM (Support Vector Machine)

In [8]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

# Fit the classifier on the training vectors
clf_svm.fit(train_x_vectors, train_y)

print(test_x[0])

# Predict the sentiment of the first test review
clf_svm.predict(test_x_vectors[0])

While I really enjoyed the premise of this book the ending left a lot to be desired. There were just so many threads left open in this book. I understand that this is a series and it's going to answer each question like where is Allie and is she going to come back, and where and what is going on with Greta (Josh's sister) in each of their own stories I just thought we could have got a bit of a wrap up. I would have liked more of a conclusion between Devon and her dad and also with Josh. It kind of just ended with her revelation and didn't show the future or anything which I really enjoy a the end of the books I read. This book just ended very abruptly for me and unfortunately that negatively affected my overall outlook on the book.I don't think I will read Allie's story because I already don't really like her. I mean okay she ran off on her wedding day and that's a dick move to someone you claim to care about but I was more upset with the fact that she dated her best friends ex-boyfrie

array(['POSITIVE'], dtype='<U8')

#### Decision Tree

In [9]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()

# Fit the classifier on the training vectors
clf_dec.fit(train_x_vectors, train_y)

print(test_x[1])

# Predict the sentiment of the second test review
clf_dec.predict(test_x_vectors[1])

She gave a lot of good strategies in her battle with fibromyalgia in day to day living.  if you are a sufferer of such, I recommend her book.


array(['POSITIVE'], dtype='<U8')

#### Naive Bayes

In [10]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()

# Fit the classifier; GaussianNB needs a dense array, hence toarray()
clf_gnb.fit(train_x_vectors.toarray(), train_y)

print(test_x[2])

# Predict the sentiment of the third test review
clf_gnb.predict(test_x_vectors[2].toarray())

I got this book for free and boy, let me just say that I'm glad I didn't pay for it. The writing wasn't bad, but the story itself was not worth the 5 star reviews it's received. No, I wouldn't recommend this story.


array(['NEGATIVE'], dtype='<U8')

#### Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()

# Fit the classifier on the training vectors
clf_log.fit(train_x_vectors, train_y)

print(test_x[3])

# Predict the sentiment of the fourth test review
clf_log.predict(test_x_vectors[3])

This was OK. I had problems finishing it and actually read the last chapter at about 65% and decided that it wasn't worth it to finish. That's very unusual for me, but it just wasn't my cup of tea. It started out pretty good and the concept was good, but I became uncomfortable reading it at about 1/4 of the way through.


array(['NEGATIVE'], dtype='<U8')

### 4. Evaluation

In [12]:
# Comparing classifiers using score method (Mean Accuracy)

print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6418269230769231
0.6610576923076923
0.8052884615384616


In [13]:
# Comparing F1 scores between classifiers

from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

[0.80582524 0.80952381]
[0.64268585 0.64096386]
[0.65693431 0.66508314]
[0.80291971 0.80760095]
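
To make the two metrics concrete, here is a plain-Python sketch of what mean accuracy and a per-label F1 score measure. The helpers `accuracy` and `f1_for_label` and the five toy labels are assumptions for illustration; the notebook itself relies on `score` and `f1_score` from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """Harmonic mean of precision and recall for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (made up): three of the five predictions are correct.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

acc = accuracy(y_true, y_pred)
f1_pos = f1_for_label(y_true, y_pred, "POSITIVE")
f1_neg = f1_for_label(y_true, y_pred, "NEGATIVE")
print(acc, f1_pos, f1_neg)
```

Reporting F1 per label, as `average=None` does, shows whether a classifier handles one class noticeably worse than the other, which a single accuracy figure can hide.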


### 5. Qualitative Test

In [14]:
test_set = ['i don\'t like this book', 'this book is bad', 'looking forward to new book']
new_test = vectorizer.transform(test_set)
gnb_test = new_test.toarray()

print(clf_svm.predict(new_test))
print(clf_dec.predict(new_test))
print(clf_gnb.predict(gnb_test))
print(clf_log.predict(new_test))

['NEGATIVE' 'NEGATIVE' 'POSITIVE']
['NEGATIVE' 'POSITIVE' 'POSITIVE']
['NEGATIVE' 'NEGATIVE' 'POSITIVE']
['NEGATIVE' 'NEGATIVE' 'POSITIVE']


### 6. Tuning our model (with Grid Search)

In [15]:
from sklearn.model_selection import GridSearchCV

# Grid search tries every combination of these parameters and keeps the one
# with the best cross-validated score
parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}

# Fit an SVC for each parameter combination using 5-fold cross-validation
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})
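
The grid above is small enough to enumerate by hand. The sketch below lists the candidate settings with `itertools.product`, purely to illustrate how much work the search does; it does not call scikit-learn, and the variable names other than `parameters` are assumptions.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
cv = 5  # number of cross-validation folds, as in the cell above

names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combinations))       # candidate settings
print(len(combinations) * cv)  # model fits during the search
print(combinations[0])
```

Two kernels times five values of `C` give 10 candidates, so the search fits 50 models, plus one final refit of the best candidate on the full training set under the default `refit=True`. After fitting, the chosen setting can be inspected through `clf.best_params_` and its cross-validated score through `clf.best_score_`.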

In [16]:
# Score the tuned classifier on the test set

print(clf.score(test_x_vectors, test_y))

0.8100961538461539


### 7. Saving and Loading Model

#### Save Model

In [17]:
import pickle

with open('./Models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

#### Load Model

In [18]:
with open('./Models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)
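
The saved classifier expects TF-IDF vectors, so classifying raw text in a fresh session also needs the fitted vectorizer. One option, sketched below under that assumption, is to pickle both objects together in a single dictionary. The dictionary values here are plain-Python stand-ins, not the real fitted objects; in the notebook they would be `vectorizer` and `clf`.

```python
import pickle

# Stand-ins for the fitted objects (assumption for illustration only).
bundle = {
    "vectorizer": {"good": 0, "bad": 1},  # placeholder for the fitted vectorizer
    "classifier": [1.0, -1.0],            # placeholder for the fitted classifier
}

# dumps/loads use the same format that pickle.dump/pickle.load write to a file.
blob = pickle.dumps(bundle)
restored = pickle.loads(blob)

print(restored == bundle)
```

After loading such a bundle, new text would first go through the restored vectorizer's `transform` and then through the restored classifier's `predict`.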

In [19]:
print(test_x[0])

print(loaded_clf.predict(test_x_vectors[0]))

While I really enjoyed the premise of this book the ending left a lot to be desired. There were just so many threads left open in this book. I understand that this is a series and it's going to answer each question like where is Allie and is she going to come back, and where and what is going on with Greta (Josh's sister) in each of their own stories I just thought we could have got a bit of a wrap up. I would have liked more of a conclusion between Devon and her dad and also with Josh. It kind of just ended with her revelation and didn't show the future or anything which I really enjoy a the end of the books I read. This book just ended very abruptly for me and unfortunately that negatively affected my overall outlook on the book.I don't think I will read Allie's story because I already don't really like her. I mean okay she ran off on her wedding day and that's a dick move to someone you claim to care about but I was more upset with the fact that she dated her best friends ex-boyfrie