# An Introduction to Natural Language Processing

The study of natural language processing is primarily concerned with teaching machines to process and "understand" text.  It is fairly straightforward for machines to understand structured data like numbers; understanding language is far more difficult.  Part of this has to do with the nature of numbers versus the nature of language.  Specifically, words can have multiple meanings in multiple contexts.  While the meaning behind a number can change based on what it refers to, the definition of the number itself is consistent across all mediums and contexts.

To make this concrete: if you have 5 chairs or 5 starfish, the type of thing being described changes, but the quantity being described stays constant.  With language, we cannot make such a claim.  For instance, let's look at the following sentence:

"I hope you do a great job!"

Taken out of context, the most likely meaning is that the speaker sincerely hopes the listener does a great job.  Let's see the same sentence in another context:

Person1: "I am going to have the best project"
Person2: "Yea, right, you know I'll do better"
Person1: "I hope you do a great job."
Person1: "Not"
Person2: "No need to be sarcastic"

Here "I hope you do a great job" is intended _sarcastically_, so the meaning of the statement is reversed.  Even though the phrasing of the sentence has not changed, the _meaning_ has changed completely.  How do we even begin to encode something like sarcasm?  Would a person even use sarcasm in a context we care about?  These are just some of the exciting questions of natural language processing!

Below we'll make our way through the following topics:

* Bag of words model
* Stop words
* Stemming
* Lemmatization
* Part of speech tagging
    * n-gram analysis
    * Hidden Markov Models
* Syntax trees
* tf-idf
* Corpus generation
* Named entity recognition
* word2vec
* Topic modeling with LDA
* Text classification with Naive Bayes
* Text prediction with logistic regression
* Building a simple recommendation engine
* Fuzzy matching for deduplication
* Finding human traffickers online

References:
* https://moj-analytical-services.github.io/NLP-guidance/FeatureSelection.html
* https://towardsdatascience.com/feature-selection-on-text-classification-1b86879f548e
* https://machinelearningmastery.com/feature-selection-machine-learning-python/
* https://nlp.stanford.edu/IR-book/html/htmledition/contents-1.html
* http://blog.datumbox.com/using-feature-selection-methods-in-text-classification/
* https://www.nltk.org/book/
* https://spacy.io/usage/spacy-101/
* https://course.spacy.io/
* https://pythonspot.com/nltk-stop-words/
* https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
* https://medium.freecodecamp.org/an-introduction-to-part-of-speech-tagging-and-the-hidden-markov-model-953d45338f24


## The Bag Of Words Model

To get to a point where we can effectively model language, we need to ground it in some task.  At a high level, our task will always be the same: how does a machine understand language?  The specific tasks that follow are consequences of this overarching theme.  First, how do we understand language?  Second, what aspects of language are useful for the particular kind of understanding we need?  Any model will therefore, in principle, be interested in decomposing natural language into its component parts.

Let's begin with our first model and a specific sub-task, which will inform what's important.  Here we consider the bag of words model and the task of simple information retrieval.  

For this system we will need two parts:

* a system for transforming the text into numbers
* a system for doing the information retrieval

Let's first define the text-to-numbers part of the system.

## Enter the Bag Of Words Model

The bag of words model is possibly the simplest model you could think of.  Let's see some code to implement it:

In [12]:
import string
import re

def get_bag_of_words(text):
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    text = regex.sub('', text)
    tokens = [elem.rstrip() for elem in text.split(" ")]
    return {token:tokens.count(token) for token in tokens}

sentence = "Hello there friends, how are you?"
get_bag_of_words(sentence)

{'Hello': 1, 'there': 1, 'friends': 1, 'how': 1, 'are': 1, 'you': 1}

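There are a couple of things to notice here:

1. We don't want to keep the punctuation, so it is stripped before counting.
2. We now have a set of counts that no longer carry the word order, and with it much of the semantic meaning, of the original sentence.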
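
As an aside, the counting step can also be written with `collections.Counter` from the standard library. The sketch below is one possible variant, not the version used in the rest of this notebook; the name `bag_of_words` is made up for this example. It counts every token in a single pass instead of calling `tokens.count` once per token.

```python
import re
import string
from collections import Counter

# Matches any single ASCII punctuation character.
PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))

def bag_of_words(text):
    """Return a mapping from each token to the number of times it occurs."""
    cleaned = PUNCTUATION.sub("", text)
    # split() with no argument collapses runs of whitespace,
    # so repeated spaces do not produce empty tokens.
    return Counter(cleaned.split())

counts = bag_of_words("the cat saw the other cat")
print(counts["cat"])  # 2
print(counts["dog"])  # 0, a Counter returns 0 for tokens it has not seen
```

A `Counter` behaves like a dictionary, so it can be used anywhere the plain dictionary version is used.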

Now let's go about defining our simplistic information retrieval system.

Let's assume that we have a web application that should query something different depending on what a user types in.  We give them a "search" bar to look up information.  Let's assume for simplicity, that they can only type in keywords.

A good example for this is job search based on keyword.  Here, someone enters a role, like "data scientist" and open data science roles are returned.  How could we return the results?  Well the simplest way is with a look up table.  Now we are in a position to set up the second part of the system.  Let's see how that might get coded:

In [22]:
import string
import re

def get_bag_of_words(text, space_of_words):
    text = text.lower()
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    text = regex.sub('', text)
    tokens = [elem.rstrip() for elem in text.split(" ")]
    return [(word,tokens.count(word)) for word in space_of_words]

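def jobs_to_return(job_phrase):
    # job_phrase is a list of (word, count) pairs in space_of_words order:
    # data, scientist, engineer, manager.
    if job_phrase[0][1] >= 1:
        if job_phrase[1][1] >= 1:
            return "Looking for a mid level data scientist"
        elif job_phrase[2][1] >= 1:
            return "Looking for a senior data engineer"
        elif job_phrase[3][1] >= 1:
            return "Looking for an experienced data manager"
    # Reached when "data" is missing or no role keyword matched.
    return "Sorry, we don't have any jobs"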

space_of_words = "data scientist engineer manager".split()
sentence = "Data Scientist"
job_phrase = get_bag_of_words(sentence, space_of_words)
print(job_phrase)
jobs_to_return(job_phrase)

[('data', 1), ('scientist', 1), ('engineer', 0), ('manager', 0)]


'Looking for a mid level data scientist'

As you can see, this system is extremely simplistic, but it shows a possible design that could be implemented in the real world.  If you have a web connection, you can check out a system I helped with called CALC that uses this very idea:

[https://calc.gsa.gov/](https://calc.gsa.gov/)

Check it out!
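
The positional `if`/`elif` checks in the code above are one way to express the lookup. A dictionary makes the "look up table" idea more literal. The following is only a sketch under that assumption: `JOB_POSTINGS` and `find_jobs` are hypothetical names introduced for this example, and the postings are the same three strings used above.

```python
# Hypothetical lookup table: each role keyword maps to one job posting.
JOB_POSTINGS = {
    "scientist": "Looking for a mid level data scientist",
    "engineer": "Looking for a senior data engineer",
    "manager": "Looking for an experienced data manager",
}

def find_jobs(query):
    """Return the postings whose keyword appears in a query that mentions 'data'."""
    tokens = query.lower().split()
    if "data" not in tokens:
        return []
    return [posting for keyword, posting in JOB_POSTINGS.items()
            if keyword in tokens]

print(find_jobs("Data Scientist"))
print(find_jobs("data analyst"))  # [] because no role keyword matched
```

Adding a new role then means adding one dictionary entry rather than another branch and another positional index.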

So far we've built an incredibly simplistic search engine.  One way we could make it more natural is by allowing more flexible queries, containing words that help humans express what they are after but that aren't important for our lookup.

## Enter Stop Words

First we'll see an example of how to remove stop words on their own, and then we'll add stop word removal to our system:

In [23]:
def remove_stop_words(sentence):
    stop_words = "hey i a i'm".split()
    # Compare in lower case so that "Hey" and "I'm" match the stop list.
    return " ".join([word 
                     for word in sentence.split() 
                     if word.lower() not in stop_words])

sentence = "Hey Fred, I'm looking for a new car, do you have any recommendations?"
remove_stop_words(sentence)

'Fred, looking for new car, do you have any recommendations?'

The basic idea here is that we remove words that occur a lot in everyday language but don't hold semantically relevant information.  Most stop word lists are standard and come from analysis of major bodies of text called corpora (corpus in the singular).  These corpora are massive bodies of text that are supposed to capture the frequency of words in a general setting.  Of course, domain-specific corpora exist as well.  For instance, words in the medical community are likely to have a different frequency and usage than in, say, the gaming community.  So we can, in theory, differentiate dialects and communities by the language they use, and more directly by the frequency and occurrence of different types of words.
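
To illustrate how a stop word list might fall out of corpus statistics, here is a minimal sketch. Everything in it is an assumption made for illustration: the three-sentence `corpus` is invented, and "appears in every document" is an arbitrary threshold chosen to keep the example small. Real lists are built from far larger corpora and are usually curated by hand as well.

```python
from collections import Counter

# Hypothetical three-document corpus, just large enough to show the idea.
corpus = [
    "the patient complained of pain in the knee",
    "the doctor examined the knee and the ankle of the patient",
    "the scan of the ankle showed the bone was intact",
]

def document_frequency(documents):
    """Count how many documents each word appears in."""
    counts = Counter()
    for document in documents:
        # set() so that a word is counted once per document.
        counts.update(set(document.split()))
    return counts

def candidate_stop_words(documents):
    """Return the words that occur in every document, sorted alphabetically."""
    frequency = document_frequency(documents)
    return sorted(word for word, n in frequency.items() if n == len(documents))

print(candidate_stop_words(corpus))  # ['of', 'the']
```

Notice that "knee" and "patient" are frequent too, but they do not appear in every document, so they are kept as potentially meaningful words.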

Let's look at a standard set of stop words, from a very popular natural language processing library - Natural Language Toolkit (NLTK):

In [25]:
from nltk.corpus import stopwords
for word in stopwords.words("english"):
    print(word)
print(len(stopwords.words("english")))

i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
needn't
shan
shan't
shouldn
shouldn't
wasn
wasn't
weren
weren't
won
won't
wouldn
wouldn't
179


As you can see there are 179 words that occur commonly and don't convey meaning we care about for our information retrieval problem, or for some NLP tasks more generally.  Let's make use of the NLTK stop words in our problem to see what we get out now:

In [26]:
def remove_stop_words(sentence):
    stop_words = stopwords.words("english")
    return " ".join([word 
                     for word in sentence.split() 
                     if word not in stop_words])

sentence = "Hey Fred, I'm looking for a new car, do you have any recommendations?"
remove_stop_words(sentence)

"Hey Fred, I'm looking new car, recommendations?"

As you can see, we get to the crux of what is being asked for and can now move onto more sophisticated processing.  Let's add the stop words component to our job query engine, so we can add more "natural" language querying.

In [38]:
import string
import re
from nltk.corpus import stopwords

def remove_stop_words(sentence):
    stop_words = stopwords.words("english")
    return " ".join([word 
                     for word in sentence.split() 
                     if word not in stop_words])

def get_bag_of_words(text, space_of_words):
    text = text.lower()
    text = remove_stop_words(text)
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    text = regex.sub('', text)
    tokens = [elem.rstrip() for elem in text.split(" ")]
    return [(word,tokens.count(word)) for word in space_of_words]
    
def jobs_to_return(job_phrase):
    results = []
    if job_phrase[0][1] >= 1:
        if job_phrase[1][1] >= 1:
            results.append("Looking for a mid level data scientist")
        if job_phrase[2][1] >= 1:
            results.append("Looking for a senior data engineer")
        if job_phrase[3][1] >= 1:
            results.append("Looking for an experienced data manager")
    # Covers both a missing "data" keyword and no matching role.
    if not results:
        results.append("Sorry, we don't have any jobs")
    return results

space_of_words = "data scientist engineer manager".split()
sentence = "I'm looking for a Data Scientist job or a Data Engineer job"
job_phrases = get_bag_of_words(sentence, space_of_words)
print(job_phrases)
for job in jobs_to_return(job_phrases):
    print(job)

[('data', 2), ('scientist', 1), ('engineer', 1), ('manager', 0)]
Looking for a mid level data scientist
Looking for a senior data engineer


As you can see, we now have a more flexible search engine because of the preprocessing we've done so far.  However, we can pretty easily break our engine if we change our query slightly:

In [39]:
space_of_words = "data scientist engineer manager".split()
sentence = "I'm looking for a Data Scientist job or a Data Engineering job"
job_phrases = get_bag_of_words(sentence, space_of_words)
print(job_phrases)
for job in jobs_to_return(job_phrases):
    print(job)

[('data', 2), ('scientist', 1), ('engineer', 0), ('manager', 0)]
Looking for a mid level data scientist


What happened?!  Well, it turns out that by changing "data engineer" to the more natural "data engineering", we lose our second search result.  In order to recover it, let's introduce our next concept: stemming.

## Enter Stemming

Stemming is the idea of taking the stem of a word.  So in this case, the stem of engineering is engineer.  Stemming will also handle the case of engineer versus engineers.  Basically, extra grammatical endings are removed.  This is sort of like a form of normalization for words, because we create a standard representation for related words that only have slight variance in meaning.

Let's look at how we might implement a stemmer:

In [43]:
def get_stem(word):
    if word.endswith("s"):
        return word[:-1]
    elif word.endswith("ing"):
        return word[:-3]
    else:
        return word

sentence = "I heard you have the hiccups, have you tried jumping up and down?"
" ".join([get_stem(word) for word in sentence.split(" ")])

'I heard you have the hiccups, have you tried jump up and down?'

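As you can see, we are able to get the stem of the word jumping, in this case jump.  (Notice that "hiccups," is left alone, because the trailing comma hides the final "s" from our suffix check.)  Let's look at another example where our simplistic stemmer fails: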

In [44]:
sentence = "Hey!  Are you going running  later?  I'd love to come with you."
" ".join([get_stem(word) for word in sentence.split(" ")])

"Hey!  Are you go runn  later?  I'd love to come with you."

As you can see, "runn" is not the stem of "running", so we need a much more sophisticated stemmer to handle all the cases, which essentially means writing down a lot of grammatical rules and edge cases.  Because that in and of itself would be a large enough project, we won't do that here.  Instead we will make use of an off-the-shelf stemmer, in this case again from NLTK:

In [48]:
from nltk.stem import SnowballStemmer

def get_stem(word):
    stemmer = SnowballStemmer("english")
    return stemmer.stem(word)

plurals = ['running', 'caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

singles = [get_stem(plural) for plural in plurals]
print(' '.join(singles))

run caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


As you can see the "running" case is now well handled.  Other cases aren't handled perfectly but it's still decent.  Here's an example of our off the shelf stemmer performing well:

In [49]:
print(SnowballStemmer("english").stem("generously"))

generous


Be advised, not all stemmers are created equal, and we can probably always do better by covering more edge cases.  Here is an example of a classic stemmer that doesn't do as well in this case, but is generally pretty good:

In [50]:
from nltk.stem import PorterStemmer

print(PorterStemmer().stem("generously"))

gener


Notice that the goal of a good stemmer is to map related forms of a word, such as singular and plural, to one common stem.  Sometimes this is easy; other times, not so much.  So there is always a trade off.  Of course, if we can standardize on some consistent stem that captures the root meaning of the word in question, then it doesn't matter whether we cover all the edge cases, of which there will always be an ever expanding list.
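
To give a feel for what "covering more edge cases" looks like, the naive suffix stripper from earlier can be extended with a single extra rule for doubled consonants. This is a toy sketch, not how Snowball or Porter are implemented; the function name `toy_stem` and the exact rules are assumptions made for this example.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a toy, not a real stemmer."""
    if word.endswith("ing") and len(word) > 4:
        stem = word[:-3]
        # Undo consonant doubling ("runn" -> "run"), but keep a doubled
        # l, s, or z so that "falling" becomes "fall" rather than "fal".
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
            stem = stem[:-1]
        return stem
    # Strip a plural "s", but leave words such as "glass" untouched.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for word in ["running", "jumping", "falling", "cats", "glass"]:
    print(word, "->", toy_stem(word))
```

Even with the extra rule, a word such as "bring" still comes out wrong (as "br"), which is exactly the ever expanding list of edge cases that makes an off-the-shelf stemmer attractive.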

Let's incorporate our Snowball stemmer into our engine to increase the size of our query surface:

In [62]:
import string
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

def get_stem(word):
    stemmer = SnowballStemmer("english")
    return stemmer.stem(word)

def remove_stop_words(sentence):
    stop_words = stopwords.words("english")
    return " ".join([word 
                     for word in sentence.split() 
                     if word not in stop_words])

def get_bag_of_words(text, space_of_words):
    text = text.lower()
    text = remove_stop_words(text)
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    text = regex.sub('', text)
    tokens = [elem.rstrip() for elem in text.split(" ")]
    tokens = [get_stem(token) for token in tokens]
    return [(word,tokens.count(word)) for word in space_of_words]
    
def jobs_to_return(job_phrase):
    results = []
    if job_phrase[0][1] >= 1:
        if job_phrase[1][1] >= 1:
            results.append("Looking for a mid level data scientist")
        if job_phrase[2][1] >= 1:
            results.append("Looking for a senior data engineer")
        if job_phrase[3][1] >= 1:
            results.append("Looking for an experienced data manager")
    # Covers both a missing "data" keyword and no matching role.
    if not results:
        results.append("Sorry, we don't have any jobs")
    return results

def process_query(query):
    space_of_words = "data scientist engin manager".split()
    job_phrases = get_bag_of_words(query, space_of_words)
    print("Job phrases", job_phrases)
    for job in jobs_to_return(job_phrases):
        print(job)
        

sentence_one = "I'm looking for a Data Scientist job or a Data Engineer job"
print("results for sentence one:")
process_query(sentence_one)
print()
print("results for sentence two:")
sentence_two = "I'm looking for a Data Scientist job or a Data Engineering job"
process_query(sentence_two)

results for sentence one:
Job phrases [('data', 2), ('scientist', 1), ('engin', 1), ('manager', 0)]
Looking for a mid level data scientist
Looking for a senior data engineer

results for sentence two:
Job phrases [('data', 2), ('scientist', 1), ('engin', 1), ('manager', 0)]
Looking for a mid level data scientist
Looking for a senior data engineer


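Notice that both of our cases work now!  Of course, we had to change our recognized word for "engineer" to "engin", which is not ideal, because the vocabulary now has to be stored in stemmed form.  (The same would apply to "manager": the query tokens are stemmed before matching, so the unstemmed vocabulary entry may never match.)  However, this does mean we can cover more cases.  As with all things, there is a trade off between specificity and flexibility.  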

In [63]:
print("results for sentence two:")
sentence_two = "I'm looking for a Data Scientist job or a Data Engine job"
process_query(sentence_two)

results for sentence two:
Job phrases [('data', 2), ('scientist', 1), ('engin', 1), ('manager', 0)]
Looking for a mid level data scientist
Looking for a senior data engineer


As you can see above, we possibly get a false positive.  Of course, I don't think "data engine job" is a thing, so the only way this comes up is a typo.  

What if we wanted even more flexibility in our query?  Well, we can take our processing even further with a technique called lemmatization.  Lemmatization is similar to stemming in that the central component of the word is preserved while all other pieces are discarded.  The major difference between the two is that stemming is based on a rules engine, while lemmatization makes use of a vocabulary and morphological analysis to find the root of the word.  The Stanford NLP group even goes so far as to describe lemmatization as doing things "properly", while stemming is described as "a crude heuristic process".  

## Enter Lemmatization

Because lemmatization is sophisticated, we won't attempt implementation here, but instead simply make use of NLTK's solution by way of example first:

In [64]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats", pos="n"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run


As you can see, we can even supply a part of speech for increased flexibility and to ensure a greater degree of correctness.  Let's go back to our example above and see how our lemmatizer does:

In [65]:
from nltk.stem import WordNetLemmatizer

def get_lemma(word):
    lemmatizer = WordNetLemmatizer()
    return lemmatizer.lemmatize(word)

plurals = ['running', 'caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

singles = [get_lemma(plural) for plural in plurals]
print(' '.join(singles))

running caress fly dy mule denied died agreed owned humbled sized meeting stating siezing itemization sensational traditional reference colonizer plotted


As you can see, on some of the cases lemmatization doesn't work that well at all, but in other cases, lemmatizing does significantly better.  For example, fly is handled much better with the lemmatizer than the stemmer.  

The reason the lemmatizer appears not to do as well is that it doesn't have the part of speech; without one, NLTK's lemmatizer treats every word as a noun.  Note that in the first example, we hinted that we'll need the part of speech in order to do a good job.  For this we will need to do some part of speech tagging automatically, or be forced to do it manually ourselves (the horror!!!).

Before we move onto explaining part of speech tagging in general, let's look at a motivating example as to why our lemmatizer may be shy about getting the lemmas for our plurals above:

`They refuse to permit us to obtain the refuse permit.`

In the above sentence, refuse is used twice: once meaning to deny and the second time meaning trash.  The difference in definition is made apparent by the part of speech in use!  So for some other words, the surface form may be the same, but without the added context we can't be sure of the underlying meaning.
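
The role the part of speech plays can be sketched with a tiny hand-written table. This is a toy illustration only, not how WordNet or NLTK work internally; `LEMMAS` and `toy_lemmatize` are hypothetical names, and the table contains just enough entries to show one surface form mapping to two different lemmas.

```python
# Hypothetical hand-written table keyed by (word, part of speech).
# "n" = noun, "v" = verb, "a" = adjective, matching the tags used above.
LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("leaves", "n"))  # leaf
print(toy_lemmatize("leaves", "v"))  # leave
print(toy_lemmatize("better"))       # better, because the default tag is "n"
```

The last call shows the same effect we saw with the word list above: with the wrong (default) part of speech, the lookup finds nothing and the word comes back unchanged.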

## Enter Part Of Speech Tagging

Part of speech tagging is a wide-ranging and incredibly powerful tool.  Its invention is one of the great watershed moments in natural language processing.  With it, we have a basic model of the syntax of natural language and can therefore get a sense of the meaning of a sentence via its syntax.  

Over the years, we've seen continual improvement in part of speech tagging because of its central role in many NLP tasks.

In [73]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer

def process_word(word):
    word = get_stem(word)
    return get_lemma(word)

def get_stem(word):
    stemmer = SnowballStemmer("porter")
    return stemmer.stem(word)

def get_lemma(word):
    lemmatizer = WordNetLemmatizer()
    return lemmatizer.lemmatize(word)

plurals = ['running', 'caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted', 'engineering']

singles_process = [process_word(plural) for plural in plurals]
singles_stem = [get_stem(plural) for plural in plurals]
print(singles_process == singles_stem)

True


In [68]:
get_stem("fly")

'fli'