# Scikit-Learn Reference

---

# Table of Contents

## [Data Class](#Data-Class)

## [Load Data](#Load-Data)

## [Prep Data](#Prep-Data)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Bag of Words Vectorization](#Bag-of-Words-Vectorization)

## [Classification](#Classification)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Linear SVM](#Linear-SVM)

---

# Data Class

In [1]:
class Sentiment:
    # String labels used as classification targets.
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score  # numeric rating
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # Map the numeric score to one of the Sentiment labels.
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: # score of 4 or 5
            return Sentiment.POSITIVE

---

# Load Data

In [2]:
import json

file_name = "./data/sentiment/books_small.json"
reviews = []
# The file holds one JSON object per line, so parse it line by line.
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review["reviewText"], review["overall"]))
reviews[342].sentiment

'POSITIVE'
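
The loader above calls `json.loads` on each line, so it assumes one JSON object per line with `reviewText` and `overall` keys. A minimal sketch of that layout, using two made-up records held in a string instead of the real file (the sample text and the `score_to_sentiment` helper are illustrative assumptions, not part of the dataset or the notebook):

```python
import json

# Two hypothetical records in the one-object-per-line layout the loader expects.
sample = (
    '{"reviewText": "Could not put it down.", "overall": 5}\n'
    '{"reviewText": "Not for me.", "overall": 2}\n'
)

def score_to_sentiment(score):
    # Same thresholds as Review.get_sentiment in the Data Class section.
    if score <= 2:
        return "NEGATIVE"
    elif score == 3:
        return "NEUTRAL"
    return "POSITIVE"

parsed = []
for line in sample.splitlines():
    record = json.loads(line)  # one JSON object per line
    parsed.append((record["reviewText"], score_to_sentiment(record["overall"])))

print(parsed)
```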

---

# Prep Data

In [11]:
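from sklearn.model_selection import train_test_split

# Hold out 33% of the reviews for testing; random_state fixes the shuffle so the split is reproducible.
training, test = train_test_split(reviews, test_size=0.33, random_state=42)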
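
As a small sketch of what the split does, the following runs `train_test_split` on ten placeholder integers (stand-ins for `Review` objects, not the real data). With `test_size=0.33` the test share is rounded up to a whole item, and a fixed `random_state` makes the shuffle repeatable:

```python
from sklearn.model_selection import train_test_split

items = list(range(10))  # placeholder data, not the real reviews

# Two calls with the same random_state produce the same split.
train_a, test_a = train_test_split(items, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(items, test_size=0.33, random_state=42)

print(len(train_a), len(test_a))  # sizes of the two partitions
print(test_a == test_b)           # whether the split is repeatable
```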

In [19]:
train_x = [x.text for x in training]
train_y = [x.sentiment for x in training]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

### Bag of Words Vectorization

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

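vectorizer = CountVectorizer()
# Learn the vocabulary from the training text and convert it to word-count vectors.
train_x_vectors = vectorizer.fit_transform(train_x)

# Reuse the training vocabulary for the test text; do not refit here.
test_x_vectors = vectorizer.transform(test_x)

# Sparse row: each entry prints as (row, column) followed by the count.
print(train_x_vectors[0])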

  (0, 7086)	1
  (0, 1148)	1
  (0, 350)	2
  (0, 1800)	1
  (0, 6595)	1
  (0, 562)	1
  (0, 3054)	1
  (0, 1558)	1
  (0, 6475)	1
  (0, 6593)	1
  (0, 2895)	1
  (0, 7353)	1
  (0, 539)	1
  (0, 1515)	1
  (0, 5197)	1
  (0, 3545)	1
  (0, 2007)	1
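
Each line of that output reads `(row, column) count`: the column is a word's index in the learned vocabulary, and the count is how often that word occurs in the document. A minimal sketch on two toy sentences (not taken from the dataset) maps the column indices back to words through the fitted `vocabulary_` attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this book is great", "this book was bad bad"]  # toy sentences

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index; invert it for lookup.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# Word counts for the second document, keyed by word instead of column index.
dense_row = X[1].toarray()[0]
counts = {index_to_word[j]: int(c) for j, c in enumerate(dense_row) if c > 0}

print(X.shape)
print(counts)
```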


---

# Classification

### Linear SVM

In [28]:
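from sklearn import svm

# Support vector classifier with a linear kernel.
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(train_x_vectors, train_y)

# Predict the sentiment of the first test review.
print(clf_svm.predict(test_x_vectors[0]))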

print(test_x[0], "\n", test_y[0])

['POSITIVE']
Every new Myke Cole book is better than the last, and this is no exception. If you haven't read the Shadow Ops series before start with Control Point, but go ahead and order Fortress Frontier and Breach Zone as well - you're going to want them. 
 POSITIVE
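
To classify text the model has not seen, the raw strings go through the same fitted vectorizer with `transform` (not `fit_transform`, which would rebuild the vocabulary) before `predict`. A self-contained sketch on a toy corpus, with made-up sentences and `toy_`-prefixed names standing in for the review data and the notebook's variables:

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training sentences, two per class.
toy_x = ["great wonderful story", "loved every page",
         "terrible boring plot", "awful waste money"]
toy_y = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = CountVectorizer()
toy_clf = svm.SVC(kernel="linear")
toy_clf.fit(toy_vectorizer.fit_transform(toy_x), toy_y)

# New text is only transformed, so it is encoded with the training vocabulary.
new_text = ["loved the wonderful story", "boring plot and awful"]
predictions = toy_clf.predict(toy_vectorizer.transform(new_text))

print(predictions)
```

Words that never appeared in the training text (here "the" and "and") have no column in the vocabulary, so `transform` ignores them.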
