# Sentiment Analysis Overview
* Sentiment is a measure of how positive or negative something is.
* We are going to build a very simple sentiment analyzer.
* The data are Amazon reviews, which come with 5-star ratings; we are going to look at the electronics category.
* The reviews are stored as XML files, so we will need an XML parser.

# NLP Terminology
* **Corpus**: Collection of text
* **Tokens**: Words and punctuation that make up the corpus. 
* **Type**: A distinct token. Ex. "Run, Lola Run" has 4 tokens (the comma counts as one) and 3 types.
* **Vocabulary**: The set of all types. 
* The Google corpus has 1 trillion tokens, but only 13 million types. English has only about 1 million dictionary words, but the Google corpus also includes types such as "www.facebook.com". 
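
The token and type counts for "Run, Lola Run" can be checked with a few lines of Python. This is a minimal sketch: `simple_tokenize` is a hypothetical helper based on a regular expression, not the NLTK tokenizer used later in this notebook.

```python
import re

def simple_tokenize(text):
    # Each run of word characters is one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Run, Lola Run")
types = set(tokens)                  # the distinct tokens

print(tokens)                        # ['Run', ',', 'Lola', 'Run']
print(len(tokens), len(types))       # 4 3
```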

# Outline 
* We are just going to look at the electronics category.
* We could use the 5-star ratings as targets for regression, but let's just do classification, since the reviews are already marked "positive" and "negative".
* XML parser (we are going to use BeautifulSoup).
* Only look at **review_text**.
* To create our feature vector, we will count the number of occurrences of each word and divide it by the total number of words.
* For that to work we will need two passes through the data:
    1. One to collect the distinct words, so that we know the vocabulary size (and hence the size of our feature vector), and possibly to remove stop words like "this", "is", "I", "to", etc., to decrease the vocabulary size. The goal here is to know the index of each token.
    2. One to create the data vectors: now that we know which index corresponds to which word, we can assign values to each vector.
* Once we have that, it is simply a matter of creating a classifier like the one we built for our spam detector.
* We will use logistic regression so that we can interpret the weights!
* Ex: if a word like "horrible" has a weight of -1, it is associated with negative reviews.
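
To see why a negative weight points toward negative reviews, here is a small sketch of a one-feature logistic model. The weight of -1 comes from the example above; the zero bias and the word frequencies are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

weight_horrible = -1.0    # weight from the example above
bias = 0.0                # assumed for illustration

# freq is the share of a review's words that are "horrible"
for freq in (0.0, 0.1, 0.5):
    p_positive = sigmoid(weight_horrible * freq + bias)
    print(freq, round(p_positive, 3))
```

The more often the word appears, the further the predicted probability of a positive review falls below 0.5.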

---
# Sentiment Analysis in Python using Logistic Regression 

In [1]:
import nltk
import numpy as np

from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

wordnet_lemmatizer = WordNetLemmatizer()                                # this turns words into their base form 

stopwords = set(w.rstrip() for w in open('data/stopwords.txt'))         # grab stop words 

positive_reviews = BeautifulSoup(open('data/electronics/positive.review').read(), "lxml")   # get pos reviews
positive_reviews = positive_reviews.find_all('review_text')                                 # only want rev text

negative_reviews = BeautifulSoup(open('data/electronics/negative.review').read(), "lxml")   # get neg reviews
negative_reviews = negative_reviews.find_all('review_text')                                 # only want rev text

### Class imbalance 
There are more positive than negative reviews, so we are going to shuffle the positive reviews and cut off the extras so that both classes are the same size. 

In [2]:
np.random.shuffle(positive_reviews)
positive_reviews = positive_reviews[:len(negative_reviews)]
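
The same balancing idea can be sketched on plain Python lists. `balance_classes` and the toy review names are hypothetical; unlike the cell above, this version truncates whichever class is larger, and it uses a seeded generator only so the example is repeatable.

```python
import random

def balance_classes(positive, negative, seed=0):
    # Shuffle copies of both lists, then keep as many items as the smaller class has.
    rng = random.Random(seed)
    positive, negative = list(positive), list(negative)
    rng.shuffle(positive)
    rng.shuffle(negative)
    n = min(len(positive), len(negative))
    return positive[:n], negative[:n]

toy_pos = [f"pos_{i}" for i in range(7)]
toy_neg = [f"neg_{i}" for i in range(4)]

pos_bal, neg_bal = balance_classes(toy_pos, toy_neg)
print(len(pos_bal), len(neg_bal))    # 4 4
```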

### Tokenizer function
Let's now create a tokenizer function that can be used on our specific reviews.

In [3]:
def my_tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)                        # like str.split(), but also separates punctuation
    tokens = [t for t in tokens if len(t) > 2]                     # get rid of short words
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]     # get words to base form
    tokens = [t for t in tokens if t not in stopwords]
    return tokens
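
To make the filtering steps visible without the NLTK data files, here is a dependency-free stand-in. It is only a sketch: `toy_tokenizer` and `toy_stopwords` are hypothetical, the regular expression is a rough substitute for `nltk.tokenize.word_tokenize`, and no lemmatization is done.

```python
import re

toy_stopwords = {"this", "the", "and", "was"}    # hypothetical, not data/stopwords.txt

def toy_tokenizer(s, stopwords=toy_stopwords):
    s = s.lower()
    tokens = re.findall(r"[a-z']+", s)                    # crude stand-in for word_tokenize
    tokens = [t for t in tokens if len(t) > 2]            # get rid of short words
    tokens = [t for t in tokens if t not in stopwords]    # get rid of stop words
    return tokens

print(toy_tokenizer("This speaker was EASY to set up, and the sound is great"))
# ['speaker', 'easy', 'set', 'sound', 'great']
```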

### Index each word
We now need to give each word an index, so that each word has a position in the final data vector. To do that we need to know the size of the final data vector, which means we need to know how big the vocabulary is. Remember, the **vocabulary** is just the set of all types!

We are going to look at every individual review, tokenize it, and add its tokens one by one to the map if they are not in it yet.

In [9]:
word_index_map = {}                            # our vocabulary - dictionary that will map words to indices
current_index = 0                              # counter increases whenever we see a new word

positive_tokenized = []
negative_tokenized = []

# --------- loop through positive reviews ---------
for review in positive_reviews:              
    tokens = my_tokenizer(review.text)          # converts single review into array of tokens (split function)
    positive_tokenized.append(tokens)
    for token in tokens:                        # loops through array of tokens for specific review
        if token not in word_index_map:                        # if the token is not in the map, add it
            word_index_map[token] = current_index          
            current_index += 1                                 # increment current index
                
# --------- loop through negative reviews ---------
for review in negative_reviews:              
    tokens = my_tokenizer(review.text)          
    negative_tokenized.append(tokens)
    for token in tokens:                       
        if token not in word_index_map:                        
            word_index_map[token] = current_index          
            current_index += 1                             
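
The two loops above do the same work, so the indexing step can also be written once as a helper. This is a sketch on toy data: `build_word_index` and the toy token lists are hypothetical names, and `dict.setdefault` stands in for the explicit `if token not in ...` check.

```python
def build_word_index(tokenized_reviews):
    word_index = {}
    for tokens in tokenized_reviews:
        for token in tokens:
            # Only inserts when the token is new; the next free index
            # is the current size of the dictionary.
            word_index.setdefault(token, len(word_index))
    return word_index

toy_positive = [["great", "sound"], ["easy", "setup"]]
toy_negative = [["poor", "sound"], ["waste", "money"]]

toy_word_index = build_word_index(toy_positive + toy_negative)
print(toy_word_index)
# {'great': 0, 'sound': 1, 'easy': 2, 'setup': 3, 'poor': 4, 'waste': 5, 'money': 6}
```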

### Convert tokens into vector
Now that we have our tokens and vocabulary, we need to convert the tokens into a vector. Because we are going to shuffle before splitting into train and test sets, we put the features and the label into the same array for now, since that makes shuffling easier. 

Note that this function operates on **one** review. The `+ 1` makes room for the label, and the function is designed to take a review from its English (token) form to a numeric vector form.

In [5]:
def tokens_to_vector(tokens, label):
    xy_data = np.zeros(len(word_index_map) + 1)          # equal to the vocab size + 1 for the label 
    for t in tokens:                                     # loop through every token
        i = word_index_map[t]                            # get index from word index map
        xy_data[i] += 1                                  # increment data at that index 
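    total = xy_data.sum()                                # total token count (label slot is still 0 here)
    if total > 0:                                        # skip division for a review with no tokens left
        xy_data = xy_data / total                        # divide entire array by total, so they add to 1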
    xy_data[-1] = label                                  # set last element to label
    return xy_data
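
A worked example on a three-word vocabulary shows what one row looks like. This is a sketch: `toy_index` and `toy_tokens_to_vector` are hypothetical, and the function takes the index as an argument so the block is self-contained.

```python
import numpy as np

toy_index = {"great": 0, "sound": 1, "poor": 2}

def toy_tokens_to_vector(tokens, label, word_index):
    xy = np.zeros(len(word_index) + 1)    # vocab size + 1 for the label
    for t in tokens:
        xy[word_index[t]] += 1            # count each token
    total = xy.sum()
    if total > 0:
        xy = xy / total                   # counts become proportions
    xy[-1] = label                        # last slot holds the label
    return xy

vec = toy_tokens_to_vector(["great", "sound", "great", "great"], 1, toy_index)
print(vec.tolist())    # [0.75, 0.25, 0.0, 1.0]
```

"great" makes up 3 of the 4 tokens and "sound" 1 of 4, so the feature part is 0.75, 0.25, 0.0, followed by the label 1.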

Time to actually assign these tokens to vectors.

In [6]:
N = len(positive_tokenized) + len(negative_tokenized)               # total number of examples 
data = np.zeros((N, len(word_index_map) + 1))                       # N examples x vocab size + 1 for label
i = 0                                                               # counter to keep track of sample

for tokens in positive_tokenized:                                   # loop through positive tokenized reviews
    xy = tokens_to_vector(tokens, 1)                                # passing in 1 because these are pos reviews
    data[i,:] = xy                                                  # set data row to that of the input vector
    i += 1                                                          # increment 1
    
for tokens in negative_tokenized:                                   
    xy = tokens_to_vector(tokens, 0)                                
    data[i,:] = xy                                                 
    i += 1         

Our data is now 1000 rows of positively labeled reviews, followed by 1000 rows of negatively labeled reviews. Let's shuffle before getting our train and test set.

In [7]:
np.random.shuffle(data)
X = data[:, :-1]
Y = data[:, -1]

Xtrain = X[:-100]
Ytrain = Y[:-100]
Xtest = X[-100:]
Ytest = Y[-100:]


model = LogisticRegression()
model.fit(Xtrain, Ytrain)
print("Classification Rate: ", model.score(Xtest, Ytest))

Classification Rate:  0.75
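
Shuffling works here because the label travels in the same row as its features. The sketch below checks that on toy data: the array contents are made up, and the first feature is deliberately set equal to the label so the alignment can be verified after shuffling. The seeded generator is only there to make the example repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 rows, 2 feature columns, label in the last column.
labels = np.array([1, 1, 1, 0, 0, 0], dtype=float)
toy_data = np.column_stack([labels, np.arange(6), labels])

rng.shuffle(toy_data)                  # shuffles rows in place; each row stays intact

X_toy, Y_toy = toy_data[:, :-1], toy_data[:, -1]
n_test = 2                             # hold out the last 2 rows
X_train, Y_train = X_toy[:-n_test], Y_toy[:-n_test]
X_test, Y_test = X_toy[-n_test:], Y_toy[-n_test:]

print(X_train.shape, X_test.shape)         # (4, 2) (2, 2)
print(np.array_equal(X_toy[:, 0], Y_toy))  # True: labels still match their rows
```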


## Classification Rate
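In the run shown above we end up with a classification rate of 0.75, which is not so great, but it is better than random guessing. The exact value changes from run to run because the reviews are shuffled randomly. 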

## Sentiment Analysis
Something interesting that we can do is look at the weights of each word, to see if that word has positive or negative sentiment. 

In [8]:
threshold = 0.7 
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print(word, weight)

wa -1.6205776784041184
support -0.8956809464339932
you 0.9622591116426046
n't -2.072270899271859
ha 0.7972162517778588
've 0.7528020714446826
speaker 0.8392260153759414
easy 1.6136082792535267
sound 1.0177268252650973
price 2.7300861427730254
perfect 0.8897970815602825
doe -1.210339120885801
then -1.1249372508712399
love 1.2092346493947035
excellent 1.3555759654547663
buy -0.8339913049383023
waste -0.9732929671574023
money -1.014294971971789
item -0.9751132748321245
fast 0.8677509536212568
little 0.969135640601565
pretty 0.7484421871306464
returned -0.79917600822562
memory 0.9223474917596889
bad -0.7352617301232087
return -1.188112795061513
poor -0.7844645155015383
highly 0.9257962361017322
quality 1.5988673580358377
tried -0.7805135905526898
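
The printed list is easier to read when sorted by weight. The sketch below does that for a handful of the words above, with weights rounded to two decimals; the `toy_weights` dictionary is typed in by hand for illustration rather than read from `model.coef_`.

```python
# A few (word, weight) pairs copied from the output above, rounded to 2 decimals.
toy_weights = {
    "price": 2.73, "easy": 1.61, "quality": 1.60, "excellent": 1.36,
    "n't": -2.07, "return": -1.19, "money": -1.01, "waste": -0.97,
}

ranked = sorted(toy_weights.items(), key=lambda kv: kv[1])   # most negative first

most_negative = ranked[:3]
most_positive = ranked[-3:][::-1]

print(most_negative)   # [("n't", -2.07), ('return', -1.19), ('money', -1.01)]
print(most_positive)   # [('price', 2.73), ('easy', 1.61), ('quality', 1.6)]
```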
