# 1. Introduction to Natural Language Processing
Natural Language Processing is certainly one of the most fascinating and exciting areas to be involved with at this point in time. It is a wonderful intersection of computer science, artificial intelligence, machine learning and linguistics. With the (somewhat) recent rise of Deep Learning, Natural Language Processing currently has a great deal of buzz surrounding it, and for good reason. The goal of this post is to do three things:

1. Inspire the reader with the beauty of the problem of NLP.
2. Explain how machine learning techniques (e.g. something as simple as Logistic Regression) can be applied to text data.
3. Prepare the reader for the next sections surrounding Deep Learning as it is applied to NLP.

Before we dive in, I would like to share the poem _Jabberwocky_ by Lewis Carroll, and an accompanying excerpt from the book "_Gödel, Escher, Bach_", by Douglas Hofstadter.

<img src="https://drive.google.com/uc?id=1ROLVf2p6xYyTqQ3fmeky0eSD6ZCdfJ3M" width="300">

And now, the corresponding excerpt, _**Translations of Jabberwocky**_. 

> ### Translations of Jabberwocky<br>
Douglas R. Hofstadter
Imagine native speakers of English, French, and German, all of whom have excellent command of their respective native languages, and all of whom enjoy wordplay in their own language. Would their symbol networks be similar on a local level, or on a global level? Or is it meaningful to ask such a question? The question becomes concrete when you look at the preceding translations of Lewis Carroll's famous "Jabberwocky".
<br>
<br>
[The "preceding translations" were "Jabberwocky" (English, original), by Lewis Carroll, "Le Jaseroque", (French), by Frank L. Warrin, and "Der Jammerwoch" (German), by Robert Scott. --kl]
<br>
<br>
I chose this example because it demonstrates, perhaps better than an example in ordinary prose, the problem of trying to find "the same node" in two different networks which are, on some level of analysis, extremely nonisomorphic. In ordinary language, the task of translation is more straightforward, since to each word or phrase in the original language, there can usually be found a corresponding word or phrase in the new language. By contrast, in a poem of this type, many "words" do not carry ordinary meaning, but act purely as exciters of nearby symbols. However, what is nearby in one language may be remote in another.
<br>
<br>
Thus, in the brain of a native speaker of English, "slithy" probably activates such symbols as "slimy", "slither", "slippery", "lithe", and "sly", to varying extents. Does "lubricilleux" do the corresponding thing in the brain of a Frenchman? What indeed would be "the corresponding thing"? Would it be to activate symbols which are the ordinary translations of those words? What if there is no word, real or fabricated, which will accomplish that? Or what if a word does exist, but it is very intellectual-sounding and Latinate ("lubricilleux"), rather than earthy and Anglo-Saxon ("slithy")? Perhaps "huilasse" would be better than "lubricilleux"? Or does the Latin origin of the word "lubricilleux" not make itself felt to a speaker of French in the way that it would if it were an English word ("lubricilious", perhaps)?
<br>
<br>
An interesting feature of the translation into French is the transposition into the present tense. To keep it in the past would make some unnatural turns of phrase necessary, and the present tense has a much fresher flavour in French than in the past. The translator sensed that this would be "more appropriate"--in some ill-defined yet compelling sense--and made the switch. Who can say whether remaining faithful to the English tense would have been better?
<br>
<br>
In the German version, the droll phrase "er an-zu-denken-fing" occurs; it does not correspond to any English original. It is a playful reversal of words, whose flavour vaguely resembles that of the English phrase "he out-to-ponder set", if I may hazard a reverse translation. Most likely this funny turnabout of words was inspired by the similar playful reversal in the English of one line earlier: "So rested he by the Tumtum tree". It corresponds, yet doesn't correspond.
<br>
<br>
Incidentally, why did the Tumtum tree get changed into an "arbre Té-té" in French? Figure it out for yourself.
<br>
<br>
The word "manxome" in the original, whose "x" imbues it with many rich overtones, is weakly rendered in German by "manchsam", which back-translates into English as "maniful". The French "manscant" also lacks the manifold overtones of "manxome". There is no end to the interest of this kind of translation task.
<br>
<br>
When confronted with such an example, one realizes that it is utterly impossible to make an exact translation. Yet even in this pathologically difficult case of translation, there seems to be some rough equivalence obtainable. Why is this so, if there really is no isomorphism between the brains of people who will read the different versions? The answer is that there is a kind of rough isomorphism, partly global, partly local, between the brains of all the readers of these three poems.


I share the passage above because, if you are reading these posts (and are anything like me), you probably spend a large chunk of your time studying mathematics, computer science, and machine learning, writing code, and so on. If you are new to NLP, an appreciation for the beauty and deeper meaning of language may not be at the forefront of your mind, and that is understandable! Hopefully the passage and commentary above ignited some interest in the wonderfully complex and worthwhile problem of Natural Language Processing and Understanding.

# 2. Spam Detection
Now, especially at first, I don't want to dive into phonemes, morphemes, syntactical structure, and the like. We will leave those linguistic concepts for later on. The goal here is to quickly allow someone with an understanding of basic machine learning algorithms and techniques to implement them in the domain of NLP. 

We will see that, at least at first, a lot of NLP deals with preprocessing data, which allows us to use algorithms that we already know. The question that naturally arises is: how do we take a collection of documents, which are essentially raw text, and feed them into machine learning algorithms whose input is usually a vector of numbers?

Well, before we even get to that, let's take a preprocessed data set from the [UCI archive](https://archive.ics.uci.edu/ml/datasets/Spambase) and perform a simple classification on it. The data has been processed in such a way that we can consider the first 48 columns to be the input, and the last column to be the label (1 = spam, 0 = not spam).

Each input column holds a **word frequency measure** for one particular word. This measure can be calculated via:

$$\text{Word Frequency Measure} = \frac{\text{Number of times word appears in document}}{\text{Number of words in document}} \times 100$$

This will result in a **Document Term matrix**, which is a matrix where _terms_ (words that appeared in the document) go along the columns, and _documents_ (emails in this case) go along the rows:

|       |word 1|word 2|word 3|word 4|word 5|word 6|word 7|word 8|
|-------|------|------|------|------|------|------|------|------|
|Email 1|||||||||
|Email 2|||||||||
|Email 3|||||||||
|Email 4|||||||||
|Email 5|||||||||
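
To make the word frequency measure and the document-term matrix concrete, here is a minimal sketch in plain Python. The two toy "emails" and the whitespace tokenization are assumptions made purely for illustration; the Spambase data was built from real emails and its own fixed list of words.

```python
from collections import Counter

# Toy "emails" (hypothetical, for illustration only)
emails = [
    "free money free offer",
    "meeting about the project offer",
]

# Tokenize by whitespace; a real pipeline would tokenize more carefully
tokenized = [email.split() for email in emails]

# Columns of the matrix: every distinct word, in sorted order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def word_frequencies(tokens, vocabulary):
    """Return 100 * (count of word) / (number of words in document) per vocabulary word."""
    counts = Counter(tokens)
    return [100 * counts[word] / len(tokens) for word in vocabulary]

# One row per email, one column per vocabulary word
document_term_matrix = [word_frequencies(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for row in document_term_matrix:
    print(row)
```

In the first toy email, "free" accounts for 2 of the 4 words, so its entry in that row is 50.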

### 2.1 Implementation in Code
We will now use `scikit-learn` to show that we can use _any_ model on NLP data, as long as it has been preprocessed correctly. First, let's use scikit-learn's Naive Bayes classifier, `MultinomialNB`:

In [16]:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np

In [18]:
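# Note: spambase.data has no header row, so read_csv as called here treats the first record as column names;
# passing header=None would keep that record as data.
data = pd.read_csv('../../data/nlp/spambase.data')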
data.head()

|Unnamed: 0|0|0.64|0.64.1|0.1|0.32|0.2|0.3|0.4|0.5|0.6|...|0.40|0.41|0.42|0.778|0.43|0.44|3.756|61|278|1|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.21|0.28|0.5|0.0|0.14|0.28|0.21|0.07|0.0|0.94|...|0.0|0.132|0.0|0.372|0.18|0.048|5.114|101|1028|1|
|1|0.06|0.0|0.71|0.0|1.23|0.19|0.19|0.12|0.64|0.25|...|0.01|0.143|0.0|0.276|0.184|0.01|9.821|485|2259|1|
|2|0.0|0.0|0.0|0.0|0.63|0.0|0.31|0.63|0.31|0.63|...|0.0|0.137|0.0|0.137|0.0|0.0|3.537|40|191|1|
|3|0.0|0.0|0.0|0.0|0.63|0.0|0.31|0.63|0.31|0.63|...|0.0|0.135|0.0|0.135|0.0|0.0|3.537|40|191|1|
|4|0.0|0.0|0.0|0.0|1.85|0.0|0.0|1.85|0.0|0.0|...|0.0|0.223|0.0|0.0|0.0|0.0|3.0|15|54|1|


In [21]:
data = data.values
np.random.shuffle(data)    # shuffle the rows so that the train/test split below is random

X = data[:, :48]
Y = data[:, -1]

Xtrain = X[:-100,]
Ytrain = Y[:-100,]
Xtest = X[-100:,]
Ytest = Y[-100:,]

model = MultinomialNB()
model.fit(Xtrain, Ytrain)
print("Classification Rate for NB: ", model.score(Xtest, Ytest))

Classification Rate for NB:  0.88


Excellent, a classification rate of 88%! Let's now use `AdaBoost`:

In [22]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier()
model.fit(Xtrain, Ytrain)
print("Classification Rate for AdaBoost: ", model.score(Xtest, Ytest))

Classification Rate for AdaBoost:  0.93


Great, a nice improvement! More importantly, we have shown that, with correct preprocessing, text data can be used with standard machine learning APIs. The next step is to dig into _how_ basic preprocessing is performed.

# 3. Sentiment Analysis
To go through the basic preprocessing steps that are frequently used when performing machine learning on text data (often referred to as an NLP pipeline), we are going to work on the problem of **sentiment analysis**. Sentiment is a measure of how positive or negative something is, and we are going to build a very simple sentiment analyzer to predict the sentiment of Amazon reviews. Because these are reviews, they come with 5 star ratings; we are going to look at the electronics category in particular. The reviews are stored as XML files, so we will need an XML parser.

### 3.1 NLP Terminology 
Before we begin, I would just like to quickly go over some basic NLP terminology that will come up frequently throughout this post.
* **Corpus**: Collection of text
* **Tokens**: Words and punctuation that make up the corpus. 
* **Type**: A distinct token. Ex. "Run, Lola Run" has 4 tokens (the comma counts as one) and 3 types.
* **Vocabulary**: The set of all types. 
* The Google corpus (collection of text) has 1 trillion tokens, but only 13 million types. English only has about 1 million dictionary words, but the Google corpus also includes types such as "www.facebook.com".
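
The token/type distinction is easy to check in code. Below is a small sketch for the "Run, Lola Run" example; the regular expression is an assumption chosen for illustration (it treats each punctuation mark as its own token), not the tokenizer used later in this post.

```python
import re

text = "Run, Lola Run"

# Split into word tokens and single punctuation tokens
tokens = re.findall(r"\w+|[^\w\s]", text)

# Types are the distinct tokens; the vocabulary is the set of all types
vocabulary = set(tokens)

print(tokens)           # ['Run', ',', 'Lola', 'Run']
print(len(tokens))      # 4
print(len(vocabulary))  # 3
```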

### 3.2 Problem Overview
Now, we are just going to be looking at the electronics category. We could use the 5 star targets to do regression, but instead we will just do classification since the reviews are already marked "positive" and "negative". As I mentioned, we are going to be working with XML data, so we will need an XML parser, for which we will use `BeautifulSoup`. We will only look at the `review_text` attribute. To create our feature vector, we will count up the number of occurrences of each word, and divide it by the total number of words. However, for that to work we will need two passes through the data:

1. One to collect the distinct words, so that we know the vocabulary size and therefore the size of our feature vector. At this stage we can also remove stop words like "this", "is", "I", "to", etc., to decrease the vocabulary size. The goal here is to know the index of each token.
2. One to create the data vectors: now that every word has an index, we can assign each word's value to the corresponding position in each review's vector.
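
The two passes can be sketched on a couple of toy reviews as follows. The token lists and the helper name `tokens_to_vector` are hypothetical and only meant to show the shape of the computation.

```python
# Toy tokenized reviews (hypothetical); in practice these come from a tokenizer
tokenized_reviews = [
    ["great", "battery", "great", "price"],
    ["battery", "died", "fast"],
]

# Pass 1: assign each distinct token the next free index
word_index_map = {}
for tokens in tokenized_reviews:
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = len(word_index_map)

def tokens_to_vector(tokens, word_index_map):
    """Count each token, then divide by the total number of tokens in the review."""
    vector = [0.0] * len(word_index_map)
    for token in tokens:
        vector[word_index_map[token]] += 1
    total = max(sum(vector), 1)  # avoid division by zero for an empty review
    return [count / total for count in vector]

# Pass 2: build one feature vector per review
vectors = [tokens_to_vector(tokens, word_index_map) for tokens in tokenized_reviews]

print(word_index_map)
print(vectors)
```

Every vector has the same length (the vocabulary size), which is exactly what a standard classifier expects as input.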

Once we have that, it is simply a matter of creating a classifier like the one we did for our spam detector! Here, we will use logistic regression, so that we can interpret the weights. For example, if a word like "horrible" has a weight of -1, it is associated with negative reviews. With that stated, let's begin!
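
Before moving to the real data, here is a quick sketch of why the sign of a weight is interpretable in logistic regression. The weights, bias, and feature values are made up for illustration; they are not learned from the review data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: a negative weight pushes the prediction toward "negative review"
weights = {"horrible": -1.0, "great": 1.0}
bias = 0.0

def predict_positive_probability(features):
    """features maps each word to its feature value (e.g. its proportion in the review)."""
    z = bias + sum(weights.get(word, 0.0) * value for word, value in features.items())
    return sigmoid(z)

p_neg = predict_positive_probability({"horrible": 0.5})
p_pos = predict_positive_probability({"great": 0.5})
print(p_neg, p_pos)
```

A review dominated by a negatively weighted word ends up with a predicted probability of being positive below 0.5, and a review dominated by a positively weighted word ends up above 0.5.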

### 3.3 Sentiment Analysis in Code

In [25]:
import nltk
import numpy as np

from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

wordnet_lemmatizer = WordNetLemmatizer()                                # this turns words into their base form 

stopwords = set(w.rstrip() for w in open('../../data/nlp/stopwords.txt'))         # grab stop words 

# parse the positive reviews and keep only the review_text elements
positive_reviews = BeautifulSoup(open('../../data/nlp/electronics/positive.review').read(), "lxml")
positive_reviews = positive_reviews.find_all('review_text')

# do the same for the negative reviews
negative_reviews = BeautifulSoup(open('../../data/nlp/electronics/negative.review').read(), "lxml")
negative_reviews = negative_reviews.find_all('review_text')

### Class Imbalance
There are more positive than negative reviews, so we are going to shuffle the positive reviews and then cut off any extra that we may have so that they are both the same size.

In [26]:
np.random.shuffle(positive_reviews)
positive_reviews = positive_reviews[:len(negative_reviews)]
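
The same idea can be written as a small general-purpose helper that downsamples whichever class is larger. This is only a sketch: `balance_classes` is a hypothetical name, and it uses the standard library's `random` module rather than NumPy.

```python
import random

def balance_classes(first, second, seed=0):
    """Randomly downsample the larger list so both lists have the same length."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = min(len(first), len(second))
    return rng.sample(first, n), rng.sample(second, n)

# Toy stand-ins for the two lists of reviews (hypothetical data)
positive = ["pos_%d" % i for i in range(10)]
negative = ["neg_%d" % i for i in range(6)]

positive_balanced, negative_balanced = balance_classes(positive, negative)
print(len(positive_balanced), len(negative_balanced))
```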

### Tokenizer function
Let's now create a tokenizer function that can be used on our specific reviews.

In [27]:
def my_tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)                        # split into word and punctuation tokens (smarter than str.split())
    tokens = [t for t in tokens if len(t) > 2]                     # get rid of short words
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]     # get words to base form
    tokens = [t for t in tokens if t not in stopwords]             # remove stop words
    return tokens
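
To see what each step does without installing NLTK, here is a dependency-free approximation of the tokenizer above. It is a simplified stand-in: the regular expression replaces `word_tokenize`, lemmatization is omitted entirely, and the tiny stop word set is hypothetical.

```python
import re

# Hypothetical, tiny stop word list (the post loads a much longer one from a file)
toy_stopwords = {"this", "the", "and", "are", "was"}

def simple_tokenizer(s, stopwords=toy_stopwords):
    s = s.lower()
    tokens = re.findall(r"[a-z0-9']+", s)                # crude stand-in for word_tokenize
    tokens = [t for t in tokens if len(t) > 2]           # drop short tokens such as "is"
    tokens = [t for t in tokens if t not in stopwords]   # drop stop words
    return tokens

print(simple_tokenizer("This battery was great, and the price is low!"))
# ['battery', 'great', 'price', 'low']
```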

### Index each word
We now need to create an index for each of the words, so that each word has an index in the final data vector. However, to be able to do that we need to know the size of the final data vector, and for that we need to know how big the vocabulary is. Remember, the **vocabulary** is just the set of all types!

We are essentially going to look at every individual review, tokenize it, and then add its tokens one by one to the map if they do not exist yet.

In [28]:
word_index_map = {}                            # our vocabulary: a dictionary that maps each word to its index
current_index = 0                              # counter increases whenever we see a new word

positive_tokenized = []
negative_tokenized = []

# --------- loop through positive reviews ---------
for review in positive_reviews:              
    tokens = my_tokenizer(review.text)          # converts single review into array of tokens (split function)
    positive_tokenized.append(tokens)
    for token in tokens:                        # loops through array of tokens for specific review
        if token not in word_index_map:                        # if the token is not in the map, add it
            word_index_map[token] = current_index          
            current_index += 1                                 # increment current index
                
# --------- loop through negative reviews ---------
for review in negative_reviews:              
    tokens = my_tokenizer(review.text)          
    negative_tokenized.append(tokens)
    for token in tokens:                       
        if token not in word_index_map:                        
            word_index_map[token] = current_index          
            current_index += 1   

In [29]:
word_index_map

{'you': 0,
 'lot': 1,
 'aaa': 2,
 'battery': 3,
 'this': 4,
 'deal': 5,
 'market': 6,
 'short': 7,
 'investing': 8,
 'recharging': 9,
 'unit': 10,
 'couple': 11,
 'rechargeable': 12,
 'value': 13,
 'easy': 14,
 'install': 15,
 'remote': 16,
 'doe': 17,
 'on/off': 18,
 'switch': 19,
 'replaced': 20,
 'tape': 21,
 'pleased': 22,
 'waiting': 23,
 'rewind': 24,
 'message': 25,
 'played': 26,
 'remove': 27,
 'accurate': 28,
 'time': 29,
 'love': 30,
 'advertized': 31,
 'wire': 32,
 'ipod': 33,
 'bag': 34,
 'headset': 35,
 'bit': 36,
 'control': 37,
 'power': 38,
 'receiver': 39,
 'instead': 40,
 'then': 41,
 'otherwise': 42,
 'sound': 43,
 'ha': 44,
 "n't": 45,
 'broken': 46,
 'keyboard': 47,
 'quality': 48,
 'product': 49,
 'price': 50,
 'recommend': 51,
 'looking': 52,
 'low': 53,
 'profile': 54,
 'w/numeric': 55,
 'entry': 56,
 'pad': 57,
 'mimimum': 58,
 'anount': 59,
 'desk': 60,
 'space': 61,
 'keypad': 62,
 'feel': 63,
 'notebook': 64,
 'computer': 65,
 'docking': 66,
 'station': 67,