# Sentiment Analyzer 
* Sentiment is how positive or negative a piece of text is
* It shows up in reviews on Yelp, Amazon, etc.
* We are going to be looking at a data set from: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
* This data has been labeled, and we can use that knowledge to our advantage 

# Data set
* These files are XML files, so we are going to need an XML parser to work with them 

---

# Outline of Sentiment Analyzer
### What are we trying to predict? 
* We could use the 5-star ratings as targets and do regression, but let's just do classification, since the reviews are already marked **positive** and **negative**

### What category are we looking at?
* The electronics category

### XML parser
* BeautifulSoup
* We are only going to look at the `review_text` elements

### Feature vector creation
* We are going to count the number of occurrences of each word in a review and divide it by the total number of words in that review
* For that to work we will have to do 2 passes through the data
* The first pass collects the distinct words, so that we know the size of our feature vector (and can possibly remove stop words like this, is, I, to, etc., to reduce the vocabulary size)
* The first pass also gives us the index of each token
* On the second pass we can then assign values to each data vector
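
The two passes can be sketched on a toy corpus as follows. This is only an illustration in plain Python: the `reviews` data and the `to_proportions` helper are made up for the example, and the notebook's real implementation appears further down.

```python
from collections import Counter

# Toy corpus standing in for the tokenized reviews (hypothetical data).
reviews = [
    ["great", "sound", "great", "price"],
    ["horrible", "sound"],
]

# Pass 1: assign every distinct token a column index.
word_index = {}
for tokens in reviews:
    for token in tokens:
        if token not in word_index:
            word_index[token] = len(word_index)

# Pass 2: turn each review into a vector of word proportions.
def to_proportions(tokens, word_index):
    vector = [0.0] * len(word_index)
    for token, count in Counter(tokens).items():
        vector[word_index[token]] = count / len(tokens)
    return vector

vectors = [to_proportions(tokens, word_index) for tokens in reviews]
print(word_index)
print(vectors)
```

The first pass has to finish before the second starts, because the length of every vector depends on the final vocabulary size.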

### Classifier Creation
* After that it is simply a matter of creating a classifier using scikit-learn
* By using a model like logistic regression we can look at the weights of the learned model to get a score for each individual input word
* For example, if a word like horrible ends up with a weight of -1, that tells us it is associated with negative reviews
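
To see why the sign of a weight carries that meaning, here is a tiny logistic regression trained with plain batch gradient descent. It is a sketch only: the two-word vocabulary, the data, the learning rate and the iteration count are all invented for the example, and this is not the solver scikit-learn uses.

```python
import math

# Hypothetical toy data: each row holds the proportions of
# ["great", "horrible"] in a review; label 1 = positive, 0 = negative.
vocab = ["great", "horrible"]
X = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.5]]
y = [1, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain batch gradient descent on the log loss (not scikit-learn's solver).
weights = [0.0] * len(vocab)
bias = 0.0
learning_rate = 0.5
for _ in range(500):
    grad_w = [0.0] * len(vocab)
    grad_b = 0.0
    for row, label in zip(X, y):
        error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
        for j, v in enumerate(row):
            grad_w[j] += error * v
        grad_b += error
    weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / len(X)

for word, weight in zip(vocab, weights):
    print(word, round(weight, 2))
```

The word that only appears in positive rows ends up with a positive weight and the word that only appears in negative rows with a negative one, which is the pattern we will look for in the real model.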

# Implementation Details
Two main things we want to do in order to pass our training data to the classifier:
* X = bag-of-words vectors of word proportions (each word's count divided by the total word count of the review)
* Y = 1/0 (positive/negative)

# Let's get to the code!
### Imports
* Start with our imports 

In [71]:
import nltk
import numpy as np

from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

### Lemmatization
Next, let's create an instance of the WordNetLemmatizer. To learn more about lemmatization, see: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [72]:
wordnet_lemmatizer = WordNetLemmatizer()

### Stop Words
Next, let's grab a list of stop words, found here: http://www.lextek.com/manuals/onix/stopwords1.html, and create a set from it.

More information on stop words can be found here: https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

In [73]:
with open('data/stopwords.txt') as f:  # close the file once it has been read
    stopwords = set(w.rstrip() for w in f)

### Load Reviews
Now let's load the reviews.

In [74]:
# name the parser explicitly so BeautifulSoup does not warn about guessing one
positive_reviews = BeautifulSoup(open('Data/electronics/positive.review').read(), 'lxml')
positive_reviews = positive_reviews.find_all('review_text')

# there are 1000 total
len(positive_reviews)

1000

In [75]:
# name the parser explicitly so BeautifulSoup does not warn about guessing one
negative_reviews = BeautifulSoup(open('Data/electronics/negative.review').read(), 'lxml')
negative_reviews = negative_reviews.find_all('review_text')

len(negative_reviews)

1000

### Imbalanced Classes
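If there were more positive reviews than negative ones, we would want to take a random sample of the positive reviews so that we have balanced classes. Here both classes already have 1000 reviews, so the step below changes nothing except the order.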

In [76]:
np.random.shuffle(positive_reviews)
positive_reviews = positive_reviews[:len(negative_reviews)]
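
The cell above only handles the case where the positive class is the larger one. A slightly more general sketch, which downsamples whichever class is larger, could look like this. The `balance_classes` helper, its `seed` parameter and the toy lists are hypothetical names introduced for the example.

```python
import random

def balance_classes(positive, negative, seed=0):
    """Downsample the larger list so both classes have the same size."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    n = min(len(positive), len(negative))
    return rng.sample(positive, n), rng.sample(negative, n)

# Hypothetical imbalanced data: 5 positive reviews, 3 negative.
pos, neg = balance_classes(["p1", "p2", "p3", "p4", "p5"], ["n1", "n2", "n3"])
print(len(pos), len(neg))
```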

### Tokenize using NLTK's Tokenizer
Let's try this first, and see what some of the disadvantages may be.

In [77]:
t = positive_reviews[0]
nltk.tokenize.word_tokenize(t.text)

['I',
 'recommend',
 'these',
 'speakers',
 '-',
 'have',
 'had',
 'them',
 'for',
 '3',
 'months',
 'now',
 'and',
 'they',
 'are',
 'great',
 '-',
 'sound',
 'quality',
 'is',
 'great',
 '.',
 'Power',
 'converter',
 'gets',
 'pretty',
 'hot',
 'though',
 'so',
 'i',
 'suggest',
 'putting',
 'it',
 'on',
 'something',
 'non',
 'flammable',
 '-',
 'do',
 "n't",
 'think',
 'it',
 'would',
 'catch',
 'anything',
 'on',
 'fire',
 'but',
 'just',
 'to',
 'be',
 'safe',
 '.',
 'Speakers',
 'are',
 'not',
 'super',
 'loud',
 'but',
 'are',
 'great',
 'if',
 'you',
 'just',
 'want',
 'quality',
 'sound',
 '.']

We can see that this does not lowercase the text, so 'I' and 'i' are treated as different tokens. Not only that, but do we really want to include words like 'a', 'and' or 'of' anyway? Those words most likely wouldn't be any more common in a positive review than in a negative one, and hence may only add noise to our model.
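
The effect of case on the vocabulary is easy to check with a handful of tokens taken from the output above. This is just an illustration using `collections.Counter`, not part of the pipeline.

```python
from collections import Counter

# A handful of tokens taken from the tokenizer output above.
tokens = ["I", "recommend", "these", "speakers", "so", "i", "suggest",
          "Speakers", "are", "great"]

raw_counts = Counter(tokens)
lowered_counts = Counter(t.lower() for t in tokens)

# "I"/"i" and "Speakers"/"speakers" are separate entries until we lowercase.
print(raw_counts["I"], raw_counts["i"], lowered_counts["i"])
print(len(raw_counts), len(lowered_counts))
```

Every duplicate of this kind would otherwise become an extra column in the feature vector.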

### Preprocessing Function
Instead of using the above approach directly, let's write a function that does our preprocessing for us.

In [78]:
def my_tokenizer(s):
    # lowercase all characters
    s = s.lower() 
    
    # split string into words (tokens)
    tokens = nltk.tokenize.word_tokenize(s) 
    
    # remove short words, they're probably not useful 
    tokens = [t for t in tokens if len(t) > 2] 
    
    # put words into base form
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
    
    # remove stop words
    tokens = [t for t in tokens if t not in stopwords]
    
    return tokens

### Word-to-Index map
Let's now create a word-to-index map so that we can create our word-frequency vectors later. We also save the tokenized versions so we don't have to tokenize again later.

In [79]:
word_index_map = {}
current_index = 0
positive_tokenized = []
negative_tokenized = []

In [80]:
for review in positive_reviews:
    tokens = my_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

In [81]:
for review in negative_reviews:
    tokens = my_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

In [82]:
word_index_map['epson']

791

### Create input Matrices
The input matrices are going to be made up of the words from our word index map. 

* The function `tokens_to_vector` takes in an array of tokens and a label
* It loops through the array and adds 1 to the column of each word that occurs
* Example: if the array contains the word "horrible", the vector being created has 1 added at the index that represents "horrible"
* Once all tokens are counted, the counts are divided by their sum to give proportions, and the label is stored in the last element

In [83]:
def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1) # last element is for the label
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    x = x / x.sum() # normalize it before setting label
    x[-1] = label
    return x
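
As a quick sanity check, here is the same function run on a made-up three-word vocabulary. The `word_index_map` below is a hypothetical stand-in for the real one built earlier.

```python
import numpy as np

# Hypothetical three-word vocabulary standing in for word_index_map.
word_index_map = {"great": 0, "sound": 1, "horrible": 2}

def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1)  # last slot holds the label
    for t in tokens:
        x[word_index_map[t]] += 1
    x = x / x.sum()                        # counts -> proportions
    x[-1] = label
    return x

xy = tokens_to_vector(["great", "sound", "great", "great"], 1)
print(xy)  # proportions for great, sound, horrible, then the label
```

Note that the normalization happens before the label is written, so the feature part of the vector sums to 1 regardless of the label. A review whose token list is empty would divide by zero here, so such reviews would need to be filtered out first.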

We are now going to loop through both the positive and negative token arrays. Each review becomes one row of the data matrix: its normalized word counts, followed by its label in the last column.

In [84]:
N = len(positive_tokenized) + len(negative_tokenized)

# N x (D+1) matrix: keeping features and labels together for now so we can shuffle more easily later
data = np.zeros((N, len(word_index_map) + 1))
i = 0
for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1)
    data[i,:] = xy
    i += 1

for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0)
    data[i,:] = xy
    i += 1

### Shuffle our data
Let's now shuffle our data and create train/test splits!

In [85]:
np.random.shuffle(data)

X = data[:, :-1]
Y = data[:, -1]

# the last 100 rows will be test data
Xtrain = X[:-100,]
Ytrain = Y[:-100,]
Xtest = X[-100:,]
Ytest = Y[-100:,]
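
Because the shuffle above is unseeded, the split (and therefore the classification rate) changes from run to run. If repeatable results are wanted, one option is a seeded NumPy generator, sketched here on a small made-up matrix. The seed value, the toy `data` and `n_test` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatable splits

# Hypothetical 10 x 3 matrix: two feature columns plus a label column.
data = np.column_stack([np.arange(20).reshape(10, 2), np.arange(10) % 2])

# Shuffles whole rows in place, so features and label stay aligned.
rng.shuffle(data)

X, Y = data[:, :-1], data[:, -1]
n_test = 3
Xtrain, Ytrain = X[:-n_test], Y[:-n_test]
Xtest, Ytest = X[-n_test:], Y[-n_test:]
print(Xtrain.shape, Xtest.shape)
```

Keeping the label as the last column of `data` is what makes a single shuffle call safe: each row moves as a unit.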

### Create Logistic Regression Model and train model

In [86]:
model = LogisticRegression()
model.fit(Xtrain, Ytrain)
print("Classification rate:", model.score(Xtest, Ytest))

Classification rate: 0.7


### Let's look at the weights for each word

In [91]:
threshold = 0.5
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print(word, weight)

recommend 0.721731694164
speaker 0.840179660187
month -0.763958577277
sound 1.18229711024
quality 1.39155588211
pretty 0.669390842765
n't -2.24120843936
you 0.922648398124
price 2.7031713505
excellent 1.39295020663
wa -1.66334493742
laptop 0.546463190134
perfect 1.06620527026
ha 0.698991284566
cable 0.686356713833
bit 0.569320252056
little 0.861271185304
fit 0.546196149522
buy -0.916600516047
try -0.658400265073
doe -1.18328921761
lot 0.63453293964
using 0.683173546191
easy 1.75884199088
bad -0.747977767389
company -0.570860437642
time -0.747340714474
then -1.03498636974
software -0.533526038642
video 0.547596533954
value 0.536045087478
money -1.09037737947
've 0.838069726274
poor -0.782166454075
home 0.567139496652
tried -0.78750377086
love 1.23219324718
unit -0.649458604082
highly 1.00506284325
fast 0.914970773823
expected 0.520602586276
week -0.669957616405
happy 0.646355913785
comfortable 0.638802425366
returned -0.784685888305
hour -0.582630032301
support -0.930850313177
item -1.0
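
The printed list follows the order of `word_index_map`, which makes the extremes hard to spot. Sorting by weight helps. The sketch below uses a few (word, weight) pairs copied from the output above rather than the live model, and the variable names are made up for the example.

```python
# A few (word, weight) pairs copied from the output above.
weights = {
    "price": 2.7031713505,
    "easy": 1.75884199088,
    "excellent": 1.39295020663,
    "n't": -2.24120843936,
    "wa": -1.66334493742,
    "money": -1.09037737947,
}

# Sort ascending by weight: most negative first, most positive last.
ranked = sorted(weights.items(), key=lambda pair: pair[1])
most_negative = [word for word, _ in ranked[:3]]
most_positive = [word for word, _ in ranked[-3:][::-1]]

print("most negative:", most_negative)
print("most positive:", most_positive)
```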

# Final Results
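We can see from the above results which words the model found most useful for predicting whether a review is positive or negative. Words like price, easy and excellent carry large positive weights, while n't, money and support carry large negative ones. Tokens such as wa, ha and doe are most likely lemmatizer artifacts of was, has and does. Pretty cool stuff!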