# Introduction to Natural Language Processing (NLP)
This Jupyter notebook was created for the 2018 AIS Natural Language Processing Workshop for the 2018 AI Conference at UT Dallas. It focuses on using NLP techniques to build a sentiment classifier from a dataset of 50,000 movie reviews.

## What is Natural Language Processing?
Natural language processing, often abbreviated as NLP, is a broad area of artificial intelligence concerned with allowing machines to process and extract meaning from large amounts of human language data. There are many problems in natural language processing, but in this workshop, we will be focusing primarily on feature extraction from text data and sentiment analysis. 

## Importing Libraries

In [1]:
import numpy as np # for linear algebra
import pandas as pd # for CSV file I/O and data manipulation
import sklearn # for machine learning
import nltk # for natural language processing utilities

## Reading in the Data - IMDb Movie Reviews Dataset
We will be using a dataset of 50,000 movie reviews that was used in a Stanford paper titled "Learning Word Vectors for Sentiment Analysis". If you are interested, you can find the paper here: http://ai.stanford.edu/~ang/papers/acl11-WordVectorsSentimentAnalysis.pdf
The original dataset, formatted as a CSV file, is included in the repository for this workshop, but you can also download the original dataset from http://ai.stanford.edu/~amaas/data/sentiment/ as a zip archive. Let's start by reading in the data as a dataframe object using the **read_csv** function from the **Pandas** library.

In [2]:
data = pd.read_csv('stanford_movie_data.csv') #pass in the name of the file
data.head(10) #looks at the first ten rows of the dataframe

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


As we can see above, our dataset contains a series of reviews, with the sentiment of each review provided in the **sentiment** column. If we want to get some information about our dataframe, we can use the **info()** function.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


Based on the output above, we can confirm that our data contains 50,000 entries, with two columns, and we can even get an estimate of how much space our dataframe consumes in memory.

## DataFrame Basics
Pandas dataframes are really useful tools for working with data because we can easily query the data and perform operations on the data.

### Accessing Specific Columns
We can use the following syntax to access specific columns of the dataframe.

In [4]:
data['review'] # Gets the review column of the data

0        In 1974, the teenager Martha Moxley (Maggie Gr...
1        OK... so... I really like Kris Kristofferson a...
2        ***SPOILER*** Do not read this, if you think a...
3        hi for all the people who have seen this wonde...
4        I recently bought the DVD, forgetting just how...
5        Leave it to Braik to put on a good show. Final...
6        Nathan Detroit (Frank Sinatra) is the manager ...
7        To understand "Crash Course" in the right cont...
8        I've been impressed with Chavez's stance again...
9        This movie is directed by Renny Harlin the fin...
10       I once lived in the u.p and let me tell you wh...
11       Hidden Frontier is notable for being the longe...
12       It's a while ago, that I have seen Sleuth (197...
13       What is it about the French? First, they (appa...
14       This very strange movie is unlike anything mad...
15       I saw this movie on the strength of the single...
16       There are some great philosophical questions. .

In [5]:
data['sentiment'] # Gets the sentiment column of the data

0        1
1        0
2        0
3        1
4        0
5        1
6        1
7        1
8        1
9        1
10       0
11       1
12       0
13       0
14       1
15       0
16       0
17       1
18       0
19       1
20       0
21       0
22       0
23       0
24       0
25       1
26       0
27       1
28       0
29       1
        ..
49970    1
49971    0
49972    1
49973    1
49974    0
49975    1
49976    0
49977    1
49978    0
49979    1
49980    0
49981    0
49982    0
49983    1
49984    0
49985    0
49986    0
49987    0
49988    0
49989    0
49990    0
49991    0
49992    0
49993    1
49994    1
49995    0
49996    0
49997    0
49998    0
49999    1
Name: sentiment, Length: 50000, dtype: int64

### Indexing/Slicing a Dataframe
If we want to get a specific row of a dataframe, we can use the following syntax.

In [6]:
data.iloc[0] # Grabs the first row of the dataframe

review       In 1974, the teenager Martha Moxley (Maggie Gr...
sentiment                                                    1
Name: 0, dtype: object

In [7]:
data.iloc[0:50] # Grabs rows 0 to 49 of the dataframe

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


### Selecting Data Based on Conditions
We can also select rows of our data based on certain conditions.

In [8]:
data[data['sentiment'] == 1] # Gets all positive reviews

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
3,hi for all the people who have seen this wonde...,1
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1
11,Hidden Frontier is notable for being the longe...,1
14,This very strange movie is unlike anything mad...,1
17,I was cast as the Surfer Dude in the beach sce...,1


## Preprocessing Text Data

Before we start working with NLP tools and training machine learning algorithms, we need to preprocess the text data and remove unwanted characters such as HTML markup and punctuation. We can use Python's regular expression library (**re**) to do this. To show the effect of preprocessing, we can take a look at an entry in the review column before and after preprocessing.

In [9]:
data['review'][4]

'I recently bought the DVD, forgetting just how much I hated the movie version of "A Chorus Line." Every change the director Attenborough made to the story failed.<br /><br />By making the Director-Cassie relationship so prominent, the entire ensemble-premise of the musical sails out the window.<br /><br />Some of the musical numbers are sped up and rushed. The show\'s hit song gets the entire meaning shattered when it is given to Cassie\'s character.<br /><br />The overall staging is very self-conscious.<br /><br />The only reason I give it a 2, is because a few of the great numbers are still able to be enjoyed despite the film\'s attempt to squeeze every bit of joy and spontaneity out of it.'

As we can see above, some of our reviews contain HTML markup (such as the `<br />` tags) and punctuation that we want to remove. Let's go ahead and define our preprocessing function and run it on the review data.

In [10]:
import re # regex library
def preprocessor(text):
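    text = re.sub(r'<[^>]*>', '', text) # Removes HTML markup tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # Finds emoticons such as :-) or ;D
    # Lowercases the text, replaces runs of non-word characters with a space, and appends the emoticons (minus any '-')
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')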
    return text

In [11]:
data['review'] = data['review'].apply(preprocessor) #Applies the preprocessor function to the data

Now let's take a look at this same entry in our preprocessed review data.

In [12]:
data['review'][4]

'i recently bought the dvd forgetting just how much i hated the movie version of a chorus line every change the director attenborough made to the story failed by making the director cassie relationship so prominent the entire ensemble premise of the musical sails out the window some of the musical numbers are sped up and rushed the show s hit song gets the entire meaning shattered when it is given to cassie s character the overall staging is very self conscious the only reason i give it a 2 is because a few of the great numbers are still able to be enjoyed despite the film s attempt to squeeze every bit of joy and spontaneity out of it '

As we can see, the same entry now has no HTML tags or punctuation. All of the characters have also been converted to lowercase for uniformity.
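The review above does not contain any emoticons, so it may help to see that part of the preprocessor in action. The snippet below is a small self-contained sketch that repeats the same function and runs it on a made-up review (the sample text is hypothetical and is not taken from the dataset).

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # collect emoticons
    # lowercase, collapse non-word characters into spaces, then append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

# Hypothetical sample review, for illustration only
sample = 'This movie is great :-) <br />Loved it!'
cleaned = preprocessor(sample)
print(cleaned)
```

Here the `<br />` tag and the exclamation mark are dropped, while the emoticon is moved to the end of the string with its "nose" removed, so it can still act as a sentiment signal.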

## Feature Extraction from Text Data
A key problem that appears in NLP applications that involve machine learning is extracting features from text data. In general, this problem involves converting text data into a set of quantitative values or features that summarize the data in a form that machine learning algorithms can actually work with.

### Quick Definitions
Here are some terms used frequently throughout this tutorial that we should define before we proceed:
- **document**: an ordered collection of characters or words that constitute a single instance of text data. Example: a single movie review can count as a document.
- **corpus**: a usually large collection of documents that can be used for feature extraction techniques or generalizations on text data.
- **n-gram**: a contiguous sequence of **n items** from a sample of text. These items can be words, characters, or even syllables depending on the application.
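
To make the n-gram definition concrete, here is a minimal pure-Python sketch. The helper names `word_ngrams` and `char_ngrams` are hypothetical and exist only for this illustration, and splitting on whitespace is a simplifying assumption rather than a proper tokenizer.

```python
def word_ngrams(text, n):
    """Return all contiguous sequences of n words from the text."""
    words = text.split()  # naive whitespace tokenization (an assumption)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return all contiguous sequences of n characters from the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = word_ngrams('the cats got the rats', 2)
trigrams = char_ngrams('cats', 3)
print(bigrams)
print(trigrams)
```

With `n = 1` the word n-grams are just the individual words, which is the case used for the rest of this section.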

### Bag of Words - Looking at the Frequency of Words
A popular and simple model for extracting features from text data is the **Bag of Words (BOW)** model. This model gets its name from the underlying assumption that each document can be treated as a **collection of words or n-grams** that does not take into account the order of the words but looks at the frequency of each word. 

This model is also called the **vector-space model** because it transforms each document in a corpus into a vector, where each component corresponds to the frequency of a particular word from the whole corpus in that specific document. This process is known as **count vectorization**. Let's go through an example to demonstrate how this process works.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer # import the CountVectorizer class

Now that we have imported the CountVectorizer class, let's go ahead and fit a count vectorizer on a small corpus.

In [38]:
doc1 = 'I like cats'
doc2 = 'I like dogs and cats'
doc3 = 'The cats got the rats'
corpus = [doc1, doc2, doc3] # the corpus is basically a list of strings (documents)

count_vectorizer = CountVectorizer() # Creates a CountVectorizer with default parameters
count_vectorizer.fit(corpus) # Fits the count vectorizer on the corpus we just created

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Now that we have a CountVectorizer object that has been fitted on a corpus, there are several operations that we can perform with it. First of all, let's take a look at our count vectorizer's vocabulary: all of the unique words that the vectorizer found in the corpus.

In [39]:
count_vectorizer.vocabulary_

{'and': 0, 'cats': 1, 'dogs': 2, 'got': 3, 'like': 4, 'rats': 5, 'the': 6}

As we can see above, our vectorizer has seven words in its vocabulary. Single-letter words such as "I" are not considered by default, because the default **token_pattern** shown in the output above only matches tokens of two or more word characters. The vocabulary is represented as a dictionary where the keys are the words or n-grams (a word is basically a 1-gram) and the values are the index of each word in the vector for each document. To demonstrate this concept, let's actually convert some sentences to vectors using the **transform** function. Let's transform the first document into a vector.

In [40]:
doc1

'I like cats'

In [41]:
count_vectorizer.transform([doc1])

<1x7 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

If we try to transform the first document like this, notice how we get a **sparse matrix**. This is because in general, these vectors can become quite large, with many zeros when using a corpus with many words. Our corpus is really small but the transform function still outputs the vector in this format. If we want to actually see the vector, we can do the following:

In [42]:
count_vectorizer.transform([doc1]).todense()

matrix([[0, 1, 0, 0, 1, 0, 0]])

This vector tells us that the words **"cats"** (index 1) and **"like"** (index 4) both appear once in doc1 and the other words in the vocabulary do not appear at all. We can actually take any sentence or document and transform it into a vector with our count vectorizer. **The vectorizer ignores words that are not in its vocabulary.** Here are some examples:

In [45]:
count_vectorizer.transform(['I like cats and dogs, but not rats.']).todense()

matrix([[1, 1, 1, 0, 1, 1, 0]])

In [49]:
count_vectorizer.transform(['Dogs are cool, but cats are scary! I do not like cats']).todense()

matrix([[0, 2, 1, 0, 1, 0, 0]])
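
To see what is happening under the hood, here is a rough pure-Python sketch of count vectorization. It is not scikit-learn's implementation; the helper names `tokenize`, `build_vocabulary`, and `count_vector` are hypothetical, and the regular expression simply mirrors the default `token_pattern` shown in the CountVectorizer output above.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'\b\w\w+\b')  # tokens of two or more word characters

def tokenize(doc):
    """Lowercase the document and split it into tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

def build_vocabulary(corpus):
    """Map each unique token in the corpus to an index, in alphabetical order."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocabulary)
    for word, idx in vocabulary.items():
        vector[idx] = counts[word]  # words outside the vocabulary are never looked up
    return vector

corpus = ['I like cats', 'I like dogs and cats', 'The cats got the rats']
vocabulary = build_vocabulary(corpus)
doc1_vector = count_vector('I like cats', vocabulary)
print(vocabulary)
print(doc1_vector)
```

On this toy corpus the sketch reproduces the vocabulary and the vectors shown above, which makes it easier to see why unknown words such as "cool" or "scary" simply disappear from the result.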

### TF-IDF: A More Sophisticated Bag of Words Model
So far we have only worked with a simple bag of words model where we just count the frequencies of each vocabulary word in each document. However, one problem with this simple approach is that for a large corpus, words that frequently appear in the English language will have larger components when in reality, they do not contribute much to the overall meaning of each document. 

**TF-IDF**, short for **term frequency-inverse document frequency**, is another bag of words approach that attempts to solve this problem. Rather than just computing the raw frequencies of each word in a document, the TF-IDF approach weights these frequencies by a statistic that shrinks as a word appears in **more documents of the corpus**. The TF-IDF statistic is the **product** of the **term frequency** and **inverse document frequency** statistics, as shown in the equation below:

\begin{equation*}
\mathrm{TFIDF}(t, d, D) = f_{t, d} \cdot \log\frac{N_D}{n_t}
\end{equation*}
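
As a hand-rolled sketch of the equation above, the snippet below computes the statistic for a few words in a toy corpus. It reads $f_{t, d}$ as the raw count of term $t$ in document $d$, $N_D$ as the number of documents in the corpus $D$, and $n_t$ as the number of documents containing $t$. These readings, the use of the natural logarithm, and the helper names `tokenize` and `tf_idf` are assumptions made for illustration. Library implementations may add smoothing or normalization, so their numbers can differ.

```python
import math
import re

def tokenize(doc):
    """Lowercase the document and keep tokens of two or more word characters."""
    return re.findall(r'\b\w\w+\b', doc.lower())

def tf_idf(term, doc, corpus):
    """Raw term count in doc times log(number of docs / docs containing the term)."""
    term_frequency = tokenize(doc).count(term)
    docs_with_term = sum(1 for d in corpus if term in tokenize(d))
    if docs_with_term == 0:
        return 0.0  # the term never occurs in the corpus, so there is nothing to weight
    return term_frequency * math.log(len(corpus) / docs_with_term)

doc1 = 'I like cats'
doc2 = 'I like dogs and cats'
doc3 = 'The cats got the rats'
corpus = [doc1, doc2, doc3]

cats_score = tf_idf('cats', doc1, corpus)  # "cats" is in every document
dogs_score = tf_idf('dogs', doc2, corpus)  # "dogs" is in only one document
the_score = tf_idf('the', doc3, corpus)    # "the" occurs twice in doc3 only
print(cats_score, dogs_score, the_score)
```

Because "cats" appears in all three documents, its inverse document frequency is log(3/3) = 0 and the word contributes nothing, while "dogs" appears in only one document and receives a weight of log(3).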