# Kaggle tutorial: _Bag of Words Meets Bags of Popcorn_

See https://www.kaggle.com/c/word2vec-nlp-tutorial and http://fastml.com/classifying-text-with-bag-of-words-a-tutorial/ (which refers to the Kaggle tutorial)

## Prerequisites
This tutorial assumes that the following libraries are already installed on your system:
* BeautifulSoup4
* NLTK
* pandas
* NumPy
* scikit-learn

## Part 1: For Beginners - Bag of Words
The tutorial code for Part 1 lives [here](https://github.com/wendykan/DeepLearningMovies/blob/master/BagOfWords.py).

The [data](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) is stored in the `data/` folder, which the code below reads via a relative path.

In [None]:
import pandas as pd
# header=0: the first line of the file contains the column names.
# delimiter="\t": the fields are separated by tabs.
# quoting=3 (csv.QUOTE_NONE): treat double quotes as ordinary characters;
# otherwise the quotes inside the reviews can cause errors when reading the file.
train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
# train is a pandas.core.frame.DataFrame
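
The effect of `quoting=3` can be seen in isolation with Python's standard `csv` module, where the same value is available as `csv.QUOTE_NONE`. The following is a minimal sketch on a made-up two-line sample (the sample text is an assumption, not taken from the Kaggle file); it only illustrates that quote characters survive as ordinary text:

```python
import csv
import io

# Made-up sample in the same tab-separated layout; not taken from the Kaggle data.
sample = 'id\tsentiment\treview\n"1_7"\t1\t"He said "great" twice"\n'

# quoting=3 is the numeric value of csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3

# With QUOTE_NONE, double quotes are kept as ordinary characters
# instead of being interpreted as field delimiters.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
header, first = rows
print(header)
print(first)
```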

### Inspecting the data

In [None]:
# The three columns are called "id", "sentiment", and "review"
train.columns.values

In [None]:
# When inspecting the first review, you'll notice that it contains HTML tags
train["review"][0]

### Data Cleaning and Text Preprocessing

#### Removing HTML Markup: The BeautifulSoup Package

In [None]:
from bs4 import BeautifulSoup
# Initialize the BeautifulSoup object on a single movie review.
# Naming the parser explicitly avoids a warning about the parser choice.
example1 = BeautifulSoup(train["review"][0], "html.parser")

# Print the raw review and then the output of get_text(), for
# comparison
print("raw:" + train["review"][0] + "\n")
# Calling get_text() gives you the text of the review, without tags or markup.
print("without html tags:" + example1.get_text())

#### Dealing with Punctuation, Numbers and Stopwords: NLTK and regular expressions

In [None]:
import re
# Use a regular expression to do a find-and-replace: find anything that is
# NOT a lowercase letter (a-z) or an uppercase letter (A-Z) and replace it
# with a space.
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The replacement string
                      example1.get_text() )  # The text to search
print(letters_only)

In [None]:
# We'll also convert our reviews to lower case and split them into individual words (called "tokenization" in NLP lingo):
lower_case = letters_only.lower()
words = lower_case.split()
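
To see what these two steps do to punctuation and digits, here is a minimal sketch on a made-up sentence (the sentence and the name `tokens` are illustrative assumptions). Note that contractions are split at the apostrophe, which leaves single-letter fragments behind:

```python
import re

sample = "It's 2 good, isn't it?"

# Replace every non-letter with a space, then lowercase and split on whitespace.
letters_only_sample = re.sub("[^a-zA-Z]", " ", sample)
tokens = letters_only_sample.lower().split()
print(tokens)
```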

In [None]:
import nltk
nltk.download()

When you execute the command above, the following window will appear:

![NLTK Downloader](images/nltk-downloader.png)


In [None]:
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))

In [None]:
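# Remove stop words from "words". Building the set once avoids reloading
# the stop word list for every single word.
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]
print(words)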
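
The filtering step itself does not depend on NLTK. As a minimal sketch, assuming a tiny hand-written stop word set (`toy_stops` is hypothetical and much shorter than NLTK's English list), the same list comprehension behaves like this:

```python
# Hypothetical mini stop word set, standing in for stopwords.words("english").
toy_stops = {"the", "is", "a", "of", "and"}

toy_words = "the plot of the movie is a mess".split()

# Keep only the words that are not in the stop word set.
toy_meaningful = [w for w in toy_words if w not in toy_stops]
print(toy_meaningful)
```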

#### Putting it all together

In [None]:
def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)

Now let's loop through and clean all of the training set at once (this might take a few minutes depending on your computer):

In [None]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list 
for i in range( 0, num_reviews ):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append( review_to_words( train["review"][i] ) )

In [None]:
print(clean_train_reviews[0])

### Creating Features from a Bag of Words (using scikit-learn)

In [None]:
print("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()
print(train_data_features[0])
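
To make the idea behind `fit_transform()` concrete, here is a minimal pure-Python sketch of a bag of words on three made-up documents. It is a simplification, not scikit-learn's implementation: it splits on whitespace only, and all names (`toy_docs`, `toy_vocab`, `toy_features`) are illustrative assumptions:

```python
from collections import Counter

# Made-up, already cleaned documents.
toy_docs = ["great movie great cast", "boring movie cast", "great fun"]
max_features = 3  # the tutorial uses 5000

# "Fit": count every word across the corpus and keep the most frequent ones
# (ties are broken alphabetically to keep the result deterministic).
totals = Counter(word for doc in toy_docs for word in doc.split())
top_words = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
toy_vocab = sorted(top_words)  # fixed column order

# "Transform": one row per document, one column per vocabulary word.
toy_features = []
for doc in toy_docs:
    counts = Counter(doc.split())
    toy_features.append([counts[w] for w in toy_vocab])

print(toy_vocab)
print(toy_features)
```

Words outside the kept vocabulary ("boring" and "fun" here) simply do not get a column, which is what `max_features` does on the real data.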

In [None]:
print(train.shape)  # shape of the original *.tsv file's content
print(train_data_features.shape)  # CountVectorizer created one feature vector per training sample

Now that the Bag of Words model is trained, let's look at the vocabulary. Remember that we kept only the 5000 most frequent words and that stop words have already been removed:

In [None]:
# Take a look at the words in the vocabulary
# (scikit-learn >= 1.0; older versions use get_feature_names() instead)
vocab = vectorizer.get_feature_names_out()
print(len(vocab))
print(vocab)

You can also print the counts of each word in the vocabulary:

In [None]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)
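
Summing with `axis=0` adds up each column, that is, each vocabulary word across all reviews. As a minimal sketch of the same operation without NumPy, assuming a made-up 3x3 count matrix (`toy_vocab` and `toy_features` are hypothetical):

```python
# Made-up vocabulary and count matrix: one row per document, one column per word.
toy_vocab = ["cast", "great", "movie"]
toy_features = [[1, 2, 1], [1, 0, 1], [0, 1, 0]]

# zip(*rows) regroups the rows into columns; summing each column gives the
# total count per word, like np.sum(toy_features, axis=0) would.
toy_dist = [sum(column) for column in zip(*toy_features)]

for tag, count in zip(toy_vocab, toy_dist):
    print(count, tag)
```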