# Week 12 Notebook: Sentiment Analysis (Part 2)
## Bag of Words

The first step in sentiment analysis is to transform language into numeric form. We need a way for computers to "read" sentiment, because computers cannot _reason_ abstractly and independently. Instead, computers work with numeric features that we create from a standard set of data. Bag of words is one approach. It is imprecise, and it throws out other linguistic features in order to focus on _measuring the emotional weight of words_.

Bag of Words: describes the occurrence of words within a document or a collection of documents; builds a vocabulary of the words and a measure of their presence (keeps track of their frequencies); creates a dictionary-like output (each word in a sentence maps to a count). It loses word order and grammar rules.

NOTE: Recall the Topic Modeling Game, when we took our Lego creations apart? We separated all the pieces and grouped them together to get a count of each type of piece. We lost any sense of how each piece related to the whole work.

In [2]:
text = "This is the best book ever, and I would recommend this book to everyone."
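
To make the "dictionary-like output" concrete, here is a minimal sketch that counts the words in the sentence above using only the standard library. The regular expression and the lowercasing step are simplifying assumptions for illustration; they are not exactly what `CountVectorizer` or `nltk` do.

```python
import re
from collections import Counter

text = "This is the best book ever, and I would recommend this book to everyone."

# Assumption: lowercase everything and treat runs of letters/apostrophes as words
tokens = re.findall(r"[a-z']+", text.lower())

# A Counter is a dictionary that maps each word to its frequency
bag_of_words = Counter(tokens)

print(bag_of_words.most_common(3))
```

In this sketch "this" and "book" each get a count of 2, and nothing in the result records where in the sentence they appeared.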


## Tokenizing
The first step in creating a Bag of Words approach is to "tokenize" the string of characters. Tokenizing uses regular expressions to determine how and where to chunk characters into _meaningful_ bits. (Remember, meaning here is a human construction.) We have done this many times already this semester, but we want to walk through it again because the context changes each time. 
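
As a rough illustration of regex-based tokenizing, the sketch below applies the pattern `(?u)\b\w\w+\b`, which is the documented default `token_pattern` of scikit-learn's `CountVectorizer`, by hand with Python's `re` module. Running the pattern directly like this only approximates what the vectorizer does internally; the variable names are made up for the example.

```python
import re

text = "This is the best book ever, and I would recommend this book to everyone."

# Default token_pattern of CountVectorizer: "words" of two or more characters
token_pattern = r"(?u)\b\w\w+\b"

# CountVectorizer lowercases by default, so we do the same here
regex_tokens = re.findall(token_pattern, text.lower())

print(regex_tokens)
```

Note that the one-letter word "I" and all punctuation are dropped by this pattern. That is a choice about what counts as _meaningful_, not a fact about the language.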

In [14]:
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
movies = 'IMDB_sample.csv'
movies_df = pd.read_csv(movies)

In [16]:
vect = CountVectorizer(max_features=100)
vect.fit(movies_df.review)
X = vect.transform(movies_df.review)

In [19]:
# X is a "sparse matrix" (https://en.wikipedia.org/wiki/Sparse_matrix): most of its
# entries are zero, so only the non-zero counts are stored.
# We learned in the earlier notebook that sentiment analysis does not work well with null values;
# here a word that is absent from a review is counted as 0, not null.
# To view the counts as a table, we transform the data into a numpy array using the
# toarray method, and then into a dataframe.

# transform to an array
my_array = X.toarray()
# get_feature_names_out() returns an array where every entry corresponds to one feature
# (scikit-learn >= 1.0; older versions used get_feature_names())
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())

# also could use
# X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
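
To see why a sparse representation is useful, the following sketch stores a tiny, made-up document-term matrix in both forms without any libraries. The `dense` values and the dictionary layout are illustrative assumptions; SciPy's actual sparse formats are more elaborate.

```python
# Hypothetical counts for 3 documents and 5 vocabulary words (mostly zeros)
dense = [
    [0, 2, 0, 0, 1],
    [0, 0, 0, 3, 0],
    [1, 0, 0, 0, 0],
]

# Sparse form: keep only the non-zero cells, keyed by (row, column)
sparse = {
    (row, col): value
    for row, counts in enumerate(dense)
    for col, value in enumerate(counts)
    if value != 0
}

# Rebuild the dense table from the sparse form (conceptually what toarray() does)
n_rows, n_cols = len(dense), len(dense[0])
rebuilt = [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]

print(len(sparse), "stored values instead of", n_rows * n_cols)
```

With thousands of reviews and a large vocabulary, almost every cell is zero, so storing only the non-zero cells saves a great deal of memory.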

In [20]:
X_df.head()

Unnamed: 0,about,after,all,also,an,and,any,are,as,at,...,well,were,what,when,which,who,will,with,would,you
0,0,0,0,0,0,1,0,0,2,0,...,0,0,0,0,0,0,0,1,1,0
1,0,0,3,1,1,11,0,3,3,4,...,0,0,1,1,2,0,2,7,2,3
2,0,1,0,0,1,7,0,1,2,1,...,0,0,0,0,0,0,0,2,0,0
3,0,0,0,0,2,1,0,1,2,2,...,1,0,0,0,0,1,0,0,0,1
4,0,0,3,0,0,8,0,3,1,0,...,2,1,0,1,1,0,0,2,0,0


In [61]:
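# Build the vectorizer, specify max features and fit
# ngram_range=(2, 2) keeps only bigrams; max_df=500 ignores terms found in more than 500 reviews
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(movies_df.review)

# Transform the reviews
X_review = vect.transform(movies_df.review)

# Create a DataFrame from the bow representation
# (get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names())
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())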

   able to  about it  about this  acting and  acting is  acting was  \
0        0         0           0           0          0           0   
1        0         0           0           0          0           0   
2        0         0           0           0          0           0   
3        0         0           0           0          0           0   
4        0         0           0           0          0           0   

   actors and  after all  after the  again and  ...  you ll  you might  \
0           0          0          0          0  ...       0          0   
1           0          0          0          0  ...       0          0   
2           0          0          0          0  ...       0          0   
3           0          0          0          0  ...       0          0   
4           0          0          0          0  ...       0          0   

   you see  you think  you to  you ve  you want  you will  you would  \
0        0          0       0       0         0         

## N-Grams

Consider two sentences:

* I thought it was wonderful, but also stupid.
* I thought it was stupid, but also wonderful.

Context is important. Both sentences contain exactly the same words, so a plain Bag of Words cannot tell them apart.

Capturing context with a Bag of Words approach:

* collect 2-word or 3-word groups
* word groups are n-grams (n = number of tokens per unit analyzed)

The terms:

* Unigrams = single tokens
* Bigrams = pairs of tokens
* Trigrams = triples of tokens
* n-gram = sequence of n tokens
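
The point about context can be checked with a short sketch. The `ngrams` helper below is a hypothetical function written for this example (it is not part of `nltk` or scikit-learn), and the sentences are tokenized by a simple whitespace split after removing punctuation by hand.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-token tuples found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: lowercase text, punctuation already removed, split on spaces
first = "i thought it was wonderful but also stupid".split()
second = "i thought it was stupid but also wonderful".split()

# Unigram counts are identical, so word order is invisible
same_unigrams = Counter(ngrams(first, 1)) == Counter(ngrams(second, 1))

# Bigram counts differ, so some context is preserved
same_bigrams = Counter(ngrams(first, 2)) == Counter(ngrams(second, 2))

print(same_unigrams, same_bigrams)
```

The two sentences are indistinguishable as unigram bags but not as bigram bags: only the first contains the pair ("was", "wonderful").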


In [33]:
movies = 'IMDB_sample.csv'
movies_df = pd.read_csv(movies)
vect = CountVectorizer(max_features=100, ngram_range=(1, 2))
vect.fit(movies_df.review)
X_grams = vect.transform(movies_df.review)
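# get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names()
X_gramset = pd.DataFrame(X_grams.toarray(), columns=vect.get_feature_names_out())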

In [36]:
X_gramset.tail()

Unnamed: 0,about,all,also,an,and,and the,are,as,at,bad,...,well,were,what,when,which,who,will,with,would,you
7496,0,1,0,0,4,0,0,0,3,0,...,0,0,0,0,0,1,0,1,1,3
7497,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,1,1,1,0,0
7498,0,0,0,0,4,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7499,0,1,0,1,8,2,2,0,0,0,...,1,0,0,0,3,0,1,2,0,0
7500,1,1,1,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


### Tokenizing a string

In [40]:
from nltk import word_tokenize
anna_k = "Happy families are all alike, every unhappy family is unhappy in its own way."

In [41]:
word_tokenize(anna_k)

['Happy',
 'families',
 'are',
 'all',
 'alike',
 ',',
 'every',
 'unhappy',
 'family',
 'is',
 'unhappy',
 'in',
 'its',
 'own',
 'way',
 '.']

In [42]:
# The result is a list where each item is a token from the string. Not just words are tokenized, but also punctuation. 
# Apply the same logic to the reviews column. 
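
If punctuation tokens are not wanted in the counts, one possible approach (an assumption for illustration, not a step the notebook itself takes) is to keep only alphabetic tokens before counting. The token list below is copied from the `word_tokenize` output above, so the sketch runs without `nltk`.

```python
from collections import Counter

# Token list copied from the word_tokenize output above
anna_tokens = ['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy',
               'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.']

# Assumption: keep only purely alphabetic tokens and lowercase them
words_only = [token.lower() for token in anna_tokens if token.isalpha()]

# Count the remaining word tokens
word_counts = Counter(words_only)
print(word_counts["unhappy"])
```

Be aware that `str.isalpha()` would also discard tokens such as `"n't"` or `"2-D"`, which may or may not be what you want for reviews.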

In [62]:
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column 
word_tokens = [word_tokenize(review) for review in movies_df.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])

['This', 'short', 'spoof', 'can', 'be', 'found', 'on', 'Elite', "'s", 'Millennium', 'Edition', 'DVD', 'of', '``', 'Night', 'of', 'the', 'Living', 'Dead', "''", '.', 'Good', 'thing', 'to', 'as', 'I', 'would', 'have', 'never', 'went', 'even', 'a', 'tad', 'out', 'of', 'my', 'way', 'to', 'see', 'it.Replacing', 'zombies', 'with', 'bread', 'sounds', 'just', 'like', 'silly', 'harmless', 'fun', 'on', 'paper', '.', 'In', 'execution', ',', 'it', "'s", 'a', 'different', 'matter', '.', 'This', 'short', 'did', "n't", 'even', 'elicit', 'a', 'chuckle', 'from', 'me', '.', 'I', 'really', 'never', 'thought', 'I', "'d", 'say', 'this', ',', 'but', '``', 'Night', 'of', 'the', 'Day', 'of', 'the', 'Dawn', 'of', 'the', 'Son', 'of', 'the', 'Bride', 'of', 'the', 'Return', 'of', 'the', 'Revenge', 'of', 'the', 'Terror', 'of', 'the', 'Attack', 'of', 'the', 'Evil', ',', 'Mutant', ',', 'Alien', ',', 'Flesh', 'Eating', ',', 'Hellbound', ',', 'Zombified', 'Living', 'Dead', 'Part', '2', ':', 'In', 'Shocking', '2-D', "'