## Data Classes

I wrote this section to get some practice using classes. I could have written all of this without classes, but I think they made the code a little easier to use and understand.

In [2]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or greater
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distribute(self):
        # Trim POSITIVE reviews to at most the NEGATIVE count; NEUTRAL reviews are dropped.
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

## Load Data

In [6]:
import json
import os

file_name = os.path.join('data', 'sentiment', 'Books_small_10000.json')

reviews = []

with open(file_name) as file:
    for line in file:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))
        
reviews[88].text

'Excellent book Rick was a great believable character very romantic I want to read about stem next I hope she writes about him'

## Prep Data

At this point I haven't used the scikit-learn library very much, so I'm not 100% sure whether there's a neater way to do this. This section takes all of the data and passes it into the *train_test_split* function from sklearn. This one step let me split the data into a training set and a test set. I decided to use 67% of the data for training and the remaining 33% for testing.

In [7]:
from sklearn.model_selection import train_test_split
# This will split our data into a training set and a testing set.
training, test = train_test_split(reviews, test_size=0.33, random_state = 42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

In [8]:
# This will split the training and test variables into the review text and its sentiment value

train_container.evenly_distribute()
training_x = train_container.get_text()
training_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

## Bag of Words Vectorization

Bag of words is a cool technique I learned that builds a dictionary from every word that occurs in the dataset and then counts the number of times each word is used in a piece of text. For example:
The phrase: John likes to watch movies. Mary likes movies too.
Would be converted to {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1}
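To make the counting step concrete, here is a minimal sketch that uses only the standard library. The `bag_of_words` helper is a made-up name for this example, not part of scikit-learn, and its tokenization (lowercase, then runs of letters) is a simplifying assumption. `CountVectorizer` also lowercases text by default, so the keys come out as "john" and "mary" rather than "John" and "Mary".

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs (hypothetical helper for illustration)."""
    # Assumption: lowercase everything and treat runs of letters as words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

phrase = "John likes to watch movies. Mary likes movies too."
counts = bag_of_words(phrase)
print(dict(counts))
```

The real vectorizer goes one step further and stores these counts as one row of a sparse matrix, with one column per word in the vocabulary.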

In [19]:
from sklearn.feature_extraction.text import CountVectorizer #Use Tfidf instead next time

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(training_x)
test_x_vectors = vectorizer.transform(test_x)

train_x_vectors

<872x8906 sparse matrix of type '<class 'numpy.int64'>'
	with 53647 stored elements in Compressed Sparse Row format>

## Classification

In this section I wanted to see how different classifiers would perform when predicting on the test set. They all seemed to get the job done, but I needed to know whether one was more accurate than the others.

#### Linear SVM

In [11]:
from sklearn import svm #support vector machine

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, training_y)

test_x[0]

clf_svm.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Decision Tree

In [12]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, training_y)

clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Logistic Regression

In [13]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, training_y)

clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

## Evaluation

After testing the accuracy I realized that SVM and Logistic Regression performed similarly (about 0.80 and 0.81 mean accuracy), with the Decision Tree further behind (about 0.66). For our purposes in this project I think SVM will do just fine. When I first tested the F1 score, I realized that the value was pretty good for POSITIVE but was less than 0.10 for NEGATIVE. This was because the data is heavily biased towards POSITIVE reviews, which made up the majority of the data. This led me to go back and remake my training and test sets, downsampling the positive reviews so that their number matched the number of negative reviews. Now the two classes perform evenly.
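To show why accuracy alone hid the problem, here is a small sketch that computes per-class F1 by hand. The labels are made up for illustration (they are not the notebook's data), and `f1_for_label` is a hypothetical helper rather than the sklearn function. The "model" here simply answers POSITIVE every time.

```python
def f1_for_label(y_true, y_pred, label):
    """Per-class F1: the harmonic mean of precision and recall for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up, imbalanced example: 18 POSITIVE reviews and only 2 NEGATIVE ones.
y_true = ["POSITIVE"] * 18 + ["NEGATIVE"] * 2
# A model that always answers POSITIVE misses both NEGATIVE reviews.
y_pred = ["POSITIVE"] * 20

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_pos = f1_for_label(y_true, y_pred, "POSITIVE")
f1_neg = f1_for_label(y_true, y_pred, "NEGATIVE")
print(accuracy, f1_pos, f1_neg)
```

In this toy case the accuracy and the POSITIVE F1 both look healthy while the NEGATIVE F1 is zero, which is the same pattern that pushed me to balance the classes.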

In [14]:
# Check how accurate these classifiers are at predicting the correct values

# Mean accuracy
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.7980769230769231
0.6634615384615384
0.8149038461538461


In [15]:
# F1 Scores
from sklearn.metrics import f1_score

f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.8028169 , 0.79310345])

## Testing With New Data

At this point the training is complete and the model can be tested with new data it has never seen before!

In [20]:
test_set = ['not great book', 'probably would not purchase this again', 'really disliked it']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['NEGATIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

## Ways to Improve

There are definitely a few things I could change to make this model better. For example, the dataset came from book reviews, so the model does very well when the string contains the word "book" or something similar. If I pass in "I really enjoyed the video", the model sometimes returns NEGATIVE because it has never seen the word "video" and does not understand its sentiment. This can be fixed in many ways, either by having a larger, more generalized dataset or by designing the model to deconstruct sentences and identify verbs, nouns, and participles to better understand language.
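The unseen-word problem can be sketched without any libraries. The two helpers below are hypothetical stand-ins for what a fitted count vectorizer does (a fitted `CountVectorizer` ignores words that were not in the text it was fitted on), and the training sentences are made up for the example.

```python
def build_vocabulary(documents):
    """Map each lowercase word seen in the training text to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(text, vocabulary):
    """Count known words; words outside the vocabulary are silently dropped."""
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Made-up training text, standing in for the book reviews.
training_docs = ["i really enjoyed the book", "the book was boring"]
vocabulary = build_vocabulary(training_docs)

with_book = to_count_vector("i really enjoyed the book", vocabulary)
with_video = to_count_vector("i really enjoyed the video", vocabulary)
print(sum(with_book), sum(with_video))
```

The word "video" adds nothing to the second vector, so the classifier has to decide from the remaining words alone.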

Another improvement that might be made is changing the vectorizer from Bag of Words to a term frequency–inverse document frequency (TF-IDF) model so that common words like "the, and, is, or" aren't weighted the same as more relevant words like "great, awful, brilliant, distasteful". I did try TF-IDF in this model, but with this dataset the difference in results was negligible. More research is required.
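As a rough sketch of the idea, the snippet below computes the textbook weight tf × log(N / df) for a few made-up sentences. The `tf_idf` helper is hypothetical, and scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalization by default, so its numbers would differ from these.

```python
import math

def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for tokens in tokenized:
        doc_weights = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            idf = math.log(n_docs / doc_freq[word])  # textbook idf, no smoothing
            doc_weights[word] = tf * idf
        weights.append(doc_weights)
    return weights

# Made-up sentences for illustration only.
docs = ["the book is great", "the book is awful", "the plot is brilliant"]
weights = tf_idf(docs)
print(weights[0])
```

Words that appear in every sentence ("the", "is") end up with a weight of zero, while a word unique to one sentence ("great") gets the largest weight in that sentence.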