### Data Class

In [15]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # score of 4 or 5
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
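
The balancing step in `evenly_distribute` can be illustrated on toy data. The sketch below is a minimal stand-in, not the notebook's class: `toy_reviews`, `balance`, and the `seed` parameter are hypothetical names introduced only for illustration, and plain `(text, sentiment)` tuples replace `Review` objects.

```python
import random

# Hypothetical toy data: (text, sentiment) pairs standing in for Review objects.
toy_reviews = [
    ("awful", "NEGATIVE"),
    ("boring", "NEGATIVE"),
    ("fine", "NEUTRAL"),
    ("good", "POSITIVE"),
    ("great", "POSITIVE"),
    ("loved it", "POSITIVE"),
    ("superb", "POSITIVE"),
]

def balance(reviews, seed=0):
    """Keep all negatives and an equal number of positives, then shuffle."""
    negative = [r for r in reviews if r[1] == "NEGATIVE"]
    positive = [r for r in reviews if r[1] == "POSITIVE"]
    # Slicing only trims the positive list, mirroring positive[:len(negative)].
    balanced = negative + positive[:len(negative)]
    random.Random(seed).shuffle(balanced)  # seeded so the demo is repeatable
    return balanced

balanced = balance(toy_reviews)
labels = [sentiment for _, sentiment in balanced]
print(labels.count("NEGATIVE"), labels.count("POSITIVE"), labels.count("NEUTRAL"))
```

Neutral reviews are dropped entirely, and because only the positive list is sliced, the result is balanced only when positives are at least as numerous as negatives.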
        
        

### Load Data

In [16]:
import json

file_name = 'data.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[5].text
        

'Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!'

### Preparing the Data

In [17]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.3, random_state=42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

In [18]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

48
48


#### Bag of words vectorization (TF-IDF weighted)

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())


So, I loved the early IAD books. Kresley Cole is capable of telling a great story filled with humor, action, and romance. That said...1. I had put off reading this book due to the book price coinciding with hardcover price. If I had paid HC price for this, it would have been a 1.2. I know the hero was tortured and all but he heaped abuse on the heroine. Paranormal elements aside, if she were a friend, I'd subtly be showing her how unhealthy the relationship is and trying to convince her to end it.3. This book doesn't the have the laugh-out-loud moments and one-liners that previous ones did.4. I'm afraid this series is in decline.  We're moving the grand plot along by pushing out meh stories. This happens I guess--I've actually thought the same about another paranormal series I follow. I wish the authors would quit while ahead and leave us with good memories of good books with good characters.I do hope KC keeps writing. Maybe she can recreate her special brand of magic in a new project.
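
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on a made-up corpus. The names `toy_docs`, `toy_vectorizer`, `toy_vectors`, and `unseen` are hypothetical and not part of the notebook; the vectorizer uses its default settings, as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_docs = ["great book", "bad book", "great story"]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and encodes the documents in one step.
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# The learned vocabulary maps each token to a column index.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_vectors.shape)  # (number of documents, vocabulary size)

# transform reuses the fitted vocabulary; words it has never seen are ignored.
unseen = toy_vectorizer.transform(["awful book"])
print(unseen.nnz)  # number of non-zero entries in the encoded row
```

This is why the test set is only passed through `transform`: the columns of the test matrix must line up with the vocabulary learned from the training text.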

## Linear SVM

In [20]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x[0]

clf_svm.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

## Evaluation

In [21]:
# Mean Accuracy
print(clf_svm.score(test_x_vectors, test_y))

0.7142857142857143
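
Mean accuracy is a single number, so it can hide a class that is predicted poorly. As a possible complement, the sketch below computes a per-class F1 score with `sklearn.metrics.f1_score`. The `y_true` and `y_pred` lists are made-up values for illustration; in the notebook one would presumably pass `test_y` and `clf_svm.predict(test_x_vectors)` instead.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels, not results from the notebook.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

labels = ["POSITIVE", "NEGATIVE"]

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# average=None returns one F1 score per entry in `labels`, in that order.
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=labels)

print(acc)
for label, score in zip(labels, per_class_f1):
    print(label, score)
```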


## Testing through User Input

In [27]:
test_set = ["shit book","great book"]
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)


array(['NEGATIVE', 'POSITIVE'], dtype='<U8')
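
New text always has to go through `vectorizer.transform` before `clf_svm.predict`. One way to avoid forgetting that step is to chain both in a scikit-learn pipeline. This is a sketch of an alternative arrangement, not what the notebook does; `toy_texts`, `toy_labels`, and `model` are hypothetical names, and the tiny training set exists only to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training data, far too small for a real model.
toy_texts = ["great book", "loved this story", "terrible book", "hated this story"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw strings and then feeds them to the linear SVM.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(toy_texts, toy_labels)

# predict accepts raw strings directly; vectorization happens inside the pipeline.
predictions = model.predict(["great story", "terrible story"])
print(predictions)
```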