In [1]:
import re
import nltk

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

## Natural Language Processing

### Introduction

As you may have noticed, this set of BLUs will revolve around the topic of Natural Language Processing (NLP). As the name implies, this field is all about the processing and handling of language in such a way that a computer may be able to do useful things with it. There are plenty of tasks and problems around it, namely:

- **Speech recognition**: the task of extracting the words being spoken, or even prosodic features, from a sample of audio.
- **Natural language generation**: the task of putting computational formulations into actual text, for example, automated generation of labels for images, summarisation of texts and data, creation of dialogue systems, etc.
- **Natural language understanding**: the task of getting some meaning out of the data, for instance recognizing entities or semantic roles in sentences, classifying sentences according to their sentiment, or transforming text into something machines can work on (numbers).

Some of the main tasks and areas of research of NLP are:

- **Part of Speech tagging**: Determine the role of each word in a given sentence, for instance, if it is an adjective, verb, noun, etc.

- **Word Segmentation**: Break continuous text into words.

- **Parsing**: Define a tree that represents the grammatical structure of a sentence.

- **Machine Translation**: Translate sentences from a source language to a target language automatically.

- **Named entity recognition**: Find parts of the text that correspond to certain entities, like names of places, people, companies, etc.

- **Question answering**: Given a question in human language, find the most appropriate answer.

- **Text to speech**: As the name implies, transform written text into audible, human-like sounds that correspond to the given input.

Many of these tasks are outside the scope of these learning units, but we think it is important to at least acknowledge that they exist in the realm of NLP. Also, some of these things may seem "easy", but when you think about the diversity of human languages you start to understand how daunting these tasks are. For instance, word segmentation may seem like a really easy task. After all, words are separated by spaces or maybe some punctuation. But if you take a look at Mandarin Chinese, for instance, that's not the case, so that "heuristic" is no longer universal. Many of the tasks also have plenty of corner cases, which makes this field one of the most challenging but also most rewarding to work in.

Throughout these learning units we hope to give you a basic understanding of how to transform text into something useful, show you some of the challenges in this field, solve some interesting problems, and hopefully make you want to learn more about the topic afterwards!

The first part of this BLU goes through some of the fundamental concepts that will be helpful for all the practical tasks that you will need during this month, but also in the future, if you ever need to work with text data. We will start by introducing **regular expressions**, followed by three important concepts in data pre-processing (**tokenization**, **stopwords**, and **stemming**). Finally, we will see what **n-grams** are and what an **n-gram model** is.

### Regular Expressions (aka Regex)

Regular expressions are sequences of characters that define search patterns. They follow a small set of rules and are one of the most fundamental tools in computer science for working with text data.

#### Cheatsheet [\[1\]](https://regexr.com/3lvai)

`.` - matches any character, except newline.

`\d`, `\s`, `\S` - match a digit, a whitespace character, a non-whitespace character.

`\b`, `\B` - match a word boundary, a position that is not a word boundary.

`[xyz]` - matches x, y or z.

`[^xyz]` - matches anything that is not x, y or z.

`[x-z]` - matches a character between x and z.

`^xyz$` - `^` is the start of the string, `$` is the end of the string.

`\.` - use escaping to match special characters.

`\t`, `\n` - matches tab and newline.

`x*` - matches 0 or more symbols x.

`x+` - matches 1 or more symbols x.

`x?` - matches 0 or 1 symbol x.

`??`, `*?`, `+?`, etc - non-greedy versions of `?`, `*`, `+`: they match as few characters as possible.

`x{5}` - matches exactly 5 symbols x.

`x{5,}` - matches 5 or more symbols x.

`x{5,8}` - matches between 5 and 8 symbols x (no space inside the braces).

`xy|yz` - matches `xy` or `yz`.
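
To make a couple of the cheatsheet entries more concrete, here is a small sketch of greedy versus non-greedy matching and of bounded repetition. The sample strings are made up for illustration.

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

# Greedy: `.*` grabs as much as possible, so the match runs to the last `>`.
greedy = re.findall(r"<.*>", sample)

# Non-greedy: `.*?` stops at the first `>` that lets the pattern succeed.
lazy = re.findall(r"<.*?>", sample)

# Bounded repetition: runs of 2 to 3 digits (no space inside the braces).
digits = re.findall(r"\d{2,3}", "7 42 123 98765")

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(digits)  # ['42', '123', '987', '65']
```

Notice how the five-digit number is split into a run of three and a run of two, because each match consumes at most three digits.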

We use Python's [re](https://docs.python.org/3/library/re.html) library. With `search()` we can take a pattern and look for it in a text. This function returns a `Match` object, from which we can obtain the portion of the text that was matched by our pattern.

In [2]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

print("Looking for \"Madrid\":")
match = re.search("Madrid", text)
print(match)

print("\nLooking for \"Rome\":")
match = re.search("Rome", text)
print(match)

print("\nLooking for \"Lisbon\":")
match = re.search("Lisbon", text)
print(match)

Looking for "Madrid":
<re.Match object; span=(7, 13), match='Madrid'>

Looking for "Rome":
None

Looking for "Lisbon":
<re.Match object; span=(0, 6), match='Lisbon'>


So, it is already possible to observe some things about `re.search()`:

- When there is no match, `search()` returns `None`.

- The `Match` object holds the indices of the beginning and end of the match, which can be read with `match.start()` and `match.end()`.

- If there is more than one instance of the word in the text, only the first will be retrieved.
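
As a quick sketch of these points, the snippet below reads the matched text and its indices from a `Match` object and checks for the no-match case before using the result. It reuses the same example string; the variable name `missing` is just for illustration.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

match = re.search("Madrid", text)
if match is not None:                       # guard against the no-match case
    print(match.group())                    # the matched text: Madrid
    print(match.start(), match.end())       # 7 13
    print(text[match.start():match.end()])  # slicing gives the same text

missing = re.search("Rome", text)
print(missing is None)                      # True
```

Checking for `None` first matters because calling `.group()` on a failed search raises an `AttributeError`.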

If we want to return all the matches to our pattern in a given text we might use the function `findall()`. In this case, the matched portions of the text will be returned, instead of the `Match` object.

In [3]:
pattern = "Lisbon"

for match in re.findall(pattern, text):
    print(match)

Lisbon
Lisbon
Lisbon


Notice that one of the words was written as _Lisbona_, but we still match the _Lisbon_ portion of that word. If we add the condition of having a whitespace character after the letter *n*, we will only get two matches.

In [4]:
pattern = r"Lisbon\s"  # raw string, so that \s reaches the regex engine unchanged

for match in re.findall(pattern, text):
    print(match)

Lisbon 
Lisbon 


If we want the `Match` objects rather than just the matched text, we should use `finditer()`.

In [5]:
pattern = "Lisbon"

for match in re.finditer(pattern, text):
    print(match)

<re.Match object; span=(0, 6), match='Lisbon'>
<re.Match object; span=(14, 20), match='Lisbon'>
<re.Match object; span=(34, 40), match='Lisbon'>


---

Now let's go back to some of the symbols from the cheatsheet and see, with some simple examples, how they may help us!

In [6]:
text = "x xy xyy"

Remembering what we've shown previously, `.` will match any character after x:

In [7]:
re.findall("x.", text)

['x ', 'xy', 'xy']

`*` will match 0 or more y symbols after x:

In [8]:
re.findall("xy*", text)

['x', 'xy', 'xyy']

`+` will match 1 or more y symbols after x:

In [9]:
re.findall("xy+", text)

['xy', 'xyy']

`?` will match 0 or 1 y symbols after x:

In [10]:
re.findall("xy?", text)

['x', 'xy', 'xy']

`{i}` will match exactly i y symbols after x:

In [11]:
re.findall("xy{2}", text)

['xyy']

---

In [12]:
text="lotterer Jani Senna conway Kobayashi Lopez buemi Nakajima alonso"

If we want to match only the names that start with capital letters:

In [13]:
re.findall("[A-Z][a-z]+", text) # find substrings starting with a capital letter
                                # followed by 1 or more lowercase letters

['Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

If we want to match all the names that don't start with the letters "B" or "L" (in either case):

In [14]:
re.findall(r"\b[^bBlL\s][A-Za-z]+", text) # find substrings after a word boundary that...
                                          # do not begin with B or L or whitespace

['Jani', 'Senna', 'conway', 'Kobayashi', 'Nakajima', 'alonso']

You may be wondering what that `r` is doing before the regex. It has nothing to do with regex itself: it marks a Python raw string, in which backslashes `\` are kept literally instead of starting an escape sequence (notice how our regex has `\b` and `\s`). Without it, Python would turn `\b` into a backspace character before the regex engine ever saw it. For instance:

In [15]:
print("With r:\n")
print(r"lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")
print("\n")
print("Without r:\n")
print("lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")

With r:

lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso


Without r:

lotterer 
 Jani 
 Senna conway Kobayashi Lopez buemi Nakajima alonso


In the first case, since we are using `r`, Python takes `\n` literally (a backslash followed by the letter n); in the second case, Python interprets it as the escape sequence for a newline.

---

Imagine now we have some extra information in front of the names, and that we receive a file with many lines. We still want only names starting with capital letters. So we run the previous regex and...

In [16]:
text="lotterer Rebellion\nJani Rebellion\nSenna Rebellion\nconway Toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima Toyota\nalonso Toyota"

In [17]:
re.findall("[A-Z][a-z]+", text)

['Rebellion',
 'Jani',
 'Rebellion',
 'Senna',
 'Rebellion',
 'Toyota',
 'Kobayashi',
 'Toyota',
 'Lopez',
 'Toyota',
 'Toyota',
 'Nakajima',
 'Toyota',
 'Toyota']

Well, we don't want those extra names in there. So let's try adding the symbol `^` to make sure the expression only captures the word at the beginning of each line.

In [18]:
re.findall("^[A-Z][a-z]+", text)

[]

Hmm... we got a handful of nothing. Why is this happening? By default, `^` matches only at the very start of the whole string, not at the start of each line, and the first name doesn't start with a capital letter. To confirm this, let's change `lotterer` to `Lotterer` (and lowercase a few of the team names, which will be useful later).

In [19]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text)

['Lotterer']

But we still only capture the first line. Luckily, we have [`re.MULTILINE`](https://docs.python.org/3/library/re.html#re.MULTILINE), which makes `^` and `$` match at the start and end of every line.

In [20]:
re.findall("^[A-Z][a-z]+", text, re.MULTILINE)

['Lotterer', 'Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

And now we were able to get all the information we wanted! And what if we wanted the second part of each line? Well, in this case, that is the last word of the line, so we may use `$`.

In [21]:
re.findall("[A-Z][a-z]+$", text, re.MULTILINE)

['Rebellion', 'Rebellion', 'Toyota', 'Toyota', 'Toyota', 'Toyota']

What if we want all full lines ending with `rebellion`?

In [22]:
re.findall(".*rebellion$", text, flags=(re.MULTILINE|re.IGNORECASE))

['Lotterer Rebellion', 'Jani rebellion', 'Senna Rebellion']

You may notice that here we are also taking advantage of the flag `re.IGNORECASE`. This is a convenient flag to add if you want case-insensitive matches. Multiple regex flags can be combined with the bitwise OR operator `|`.

Regular expressions can get hard to read really fast, but even knowing the basics will certainly be helpful at some point in the future. To better understand how they work, nothing is better than practicing, and sites like [this](https://regexr.com/3lvai) and [this](https://regex101.com/) are valuable visual tools to do so. The Python library that we used has many more powerful methods too, which might be useful in future tasks.
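
As a taste of those other methods, here is a minimal sketch using `re.compile()`, `sub()` and `re.split()` from the standard `re` module. The example text is a shortened, made-up variant of the one above.

```python
import re

text = "Lotterer Rebellion\nJani rebellion\nSenna Rebellion"

# re.compile stores a pattern (and its flags) so it can be reused.
team = re.compile(r"rebellion$", flags=re.MULTILINE | re.IGNORECASE)

# sub() replaces every match; here we normalise the casing of the team name.
normalised = team.sub("Rebellion", text)

# split() cuts the string wherever the pattern matches.
lines = re.split(r"\n", normalised)

print(lines)  # ['Lotterer Rebellion', 'Jani Rebellion', 'Senna Rebellion']
```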

---

### Tokenizer

One important step when dealing with text data is to _tokenize_ it. In practice, this means splitting the strings of a corpus into substrings (tokens). This matters because most natural language processing tools operate on such units rather than on raw strings. For instance, the text

_"The car went too fast on the second lap. This damaged the tires."_

is better handled as a list:

_["The", "car", "went", "too", "fast", "on", "the", "second", "lap", ".", "This", "damaged", "the", "tires", "."]_

We will be using [NLTK](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) implementations.

In [23]:
text = "The car went too fast on the second lap. This damaged the tires..."

In [24]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


Notice that the tokenizer is created by taking advantage of the regular expressions we learned earlier. This means that we can make different tokenizers according to what we want to split on. For instance, if we had used `[A-Z]\w+`, the tokenizer would only select the words that begin with capital letters.

In [25]:
tokenizer = RegexpTokenizer(r'[A-Z]\w+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'This']


NLTK also ships some pre-defined tokenizers built on top of `RegexpTokenizer`. These are:
- `BlanklineTokenizer` - Tokenize a string using blank lines as the delimiter.
- `WordPunctTokenizer` - Tokenize a string into sequences of alphabetic and non-alphabetic characters.
- `WhitespaceTokenizer` - Tokenize a string using spaces, tabs, and newlines as delimiters.

In [26]:
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import WhitespaceTokenizer

In [27]:
BlanklineTokenizer().tokenize(text)

['The car went too fast on the second lap. This damaged the tires...']

In [28]:
WordPunctTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap',
 '.',
 'This',
 'damaged',
 'the',
 'tires',
 '...']

In [29]:
WhitespaceTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap.',
 'This',
 'damaged',
 'the',
 'tires...']

Notice that `WordPunctTokenizer()` gives the same result on this text as the first tokenizer we defined. It is commonly used, and it is the tokenizer we will assume by default whenever tokenization comes up in these units.
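
To connect this back to regular expressions, here is a rough sketch that reproduces the two outputs above with the standard library only. The pattern `\w+|[^\w\s]+` is the one the NLTK documentation gives for `WordPunctTokenizer`; the variable names are just for illustration, and `str.split()` is used as a stand-in for whitespace tokenization.

```python
import re

text = "The car went too fast on the second lap. This damaged the tires..."

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace (i.e. punctuation).
word_punct_pattern = r"\w+|[^\w\s]+"
word_punct_tokens = re.findall(word_punct_pattern, text)

# Splitting on whitespace keeps punctuation attached to the words.
whitespace_tokens = text.split()

print(word_punct_tokens)
print(whitespace_tokens)
```

The first list separates `'lap'` from `'.'`, while the second keeps `'lap.'` and `'tires...'` as single tokens.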

---

### Stemming

Stemming allows us to get the "root" of words. This is important because in certain tasks we are more interested in a broader representation of a given word than in a specific variation of it, like its plural, for instance. Before using the stemmer, it is necessary to download some resources required by `nltk` for the language we want to use. We will be working with English, and we fetch the resource with the NLTK Downloader, which is called directly from the `nltk` module we imported.

In [30]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/christinemaroti/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

So, let's see what this step gets us for the same example we have been using. To do that, we will be using the NLTK implementation of the [snowball stemmer](https://www.nltk.org/api/nltk.stem.html#nltk.stem.snowball.SnowballStemmer). Notice that there are other stemmers, some of them specific to certain tasks.

In [31]:
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)
stemmer = SnowballStemmer("english", ignore_stopwords=True)
stems = [list(map(stemmer.stem, words))]
print(stems)

[['the', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'this', 'damag', 'the', 'tire', '...']]


We can see that _"damaged"_ and _"tires"_ are reduced to the simpler forms _"damag"_ and _"tire"_. Notice as well that all the words have been lowercased. Lowercasing the data is also a common step in text pre-processing.

One thing that you may have noticed was the concept of "stopwords" being used. **Stopwords** are words that are so common in a given corpus or language that they carry little useful information for most natural language processing applications.

For instance, imagine a search engine looking through a whole range of documents. Words such as "*the*", "*a*", "*at*", etc. will be present in so many documents that using them in the search will not reduce the number of possible files that could be relevant to our query. So filtering them out is beneficial to our goal.

In the specific case of the stemmer function that we are using, defining `ignore_stopwords` as `True` will prevent the stemming of stopwords.

In the next part of this BLU you will read about stopwords again, as they are important for the task you will be doing there.
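
As a minimal sketch of stopword filtering, the snippet below removes stopwords from a list of tokens. The stopword set here is a tiny hand-picked one, assumed only for illustration; in practice you would more likely use the list from the corpus downloaded above, via `stopwords.words("english")`.

```python
# Hypothetical, hand-picked stopword set for illustration only.
toy_stopwords = {"the", "a", "at", "on", "too", "this"}

tokens = ["The", "car", "went", "too", "fast", "on", "the", "second", "lap"]

# Compare in lowercase so that "The" and "the" are treated alike.
content_tokens = [t for t in tokens if t.lower() not in toy_stopwords]

print(content_tokens)  # ['car', 'went', 'fast', 'second', 'lap']
```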

Besides stemming there is also the process of **lemmatization**. Both processes share the goal of getting the root of the word, or more formally, reducing inflectional forms of a word to a common base form [\[7\]](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), but they act differently. Whereas stemming follows a heuristic approach that drops the suffix of words in order to get closer to the common base form, lemmatization uses a dictionary and morphological analysis of words to return their base form, known as the _lemma_.

Using the example in the cited reference, if shown the word _saw_, stemming might return just *s*, while lemmatization would take into account whether the word was used as a verb or as a noun and return, correspondingly, _see_ or _saw_ as the base form of the word.

As you may expect, lemmatization is much more expensive in computational terms and, for certain applications, stemming might be more than enough to obtain good results. We will be using only stemming throughout the NLP learning units.
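
To illustrate the difference in approach (not the real algorithms), here is a toy sketch: a suffix-stripping function standing in for a stemmer, and a hand-written lookup table standing in for a lemmatizer. The rules, the table, and the function names are all made up for this example.

```python
# Toy suffix-stripping "stemmer": the rules are made up for this example.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy "lemmatizer": a hand-written lookup keyed by (word, part of speech).
toy_lemmas = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("went", "verb"): "go",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when it is not in the dictionary.
    return toy_lemmas.get((word, pos), word)

print(toy_stem("tires"), toy_stem("damaged"), toy_stem("went"))
print(toy_lemmatize("saw", "verb"), toy_lemmatize("saw", "noun"))
print(toy_lemmatize("went", "verb"))
```

The heuristic rules cannot relate _went_ to _go_, while the dictionary can, at the cost of needing the dictionary and the part of speech.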

---

### N-Grams

_n-grams_ correspond to sequences of n consecutive elements from a given sentence. Commonly each element is a word, or "token," but we may define it as we wish for the task at hand. Usually, we refer to unigrams, bigrams, trigrams, four-grams, etc. according to the length of the sequence of elements.

For instance, for the sentence

`"The driver made a mistake"`,

we would have:

- unigrams: `The`, `driver`, `made`, `a`, `mistake`
- bigrams: `The driver`, `driver made`, `made a`, `a mistake`
- trigrams: `The driver made`, `driver made a`, `made a mistake`
- four-grams: `The driver made a`, `driver made a mistake`

We will create _n-grams_ by taking advantage of the [NLTK ngram](http://www.nltk.org/_modules/nltk/model/ngram.html) implementation, through the `ngrams` function imported at the top. We will be using the tokenized list `words` created previously.

In [32]:
print(words)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


In [33]:
print(list(ngrams(words, 1)))

[('The',), ('car',), ('went',), ('too',), ('fast',), ('on',), ('the',), ('second',), ('lap',), ('.',), ('This',), ('damaged',), ('the',), ('tires',), ('...',)]


In [34]:
print(list(ngrams(words, 2)))

[('The', 'car'), ('car', 'went'), ('went', 'too'), ('too', 'fast'), ('fast', 'on'), ('on', 'the'), ('the', 'second'), ('second', 'lap'), ('lap', '.'), ('.', 'This'), ('This', 'damaged'), ('damaged', 'the'), ('the', 'tires'), ('tires', '...')]


In [35]:
print(list(ngrams(words, 3)))

[('The', 'car', 'went'), ('car', 'went', 'too'), ('went', 'too', 'fast'), ('too', 'fast', 'on'), ('fast', 'on', 'the'), ('on', 'the', 'second'), ('the', 'second', 'lap'), ('second', 'lap', '.'), ('lap', '.', 'This'), ('.', 'This', 'damaged'), ('This', 'damaged', 'the'), ('damaged', 'the', 'tires'), ('the', 'tires', '...')]


And by looking at the output, it's possible to observe that we are getting what we expected.
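
To see what is going on under the hood, here is a minimal pure-Python sketch that builds the same kind of tuples for the earlier example sentence. The function name `simple_ngrams` is ours, not part of NLTK.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["The", "driver", "made", "a", "mistake"]

print(simple_ngrams(tokens, 2))
print(simple_ngrams(tokens, 3))
```

A sentence with 5 tokens yields 4 bigrams and 3 trigrams, matching the lists shown at the start of this section.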

N-grams may be used for several things, like extra features in natural language processing classification tasks. Imagine counts of "very good" vs. "very" and "good" individually when doing sentiment analysis, or the difference in the counts of n-grams present in a reference and our hypothesis as a way of calculating similarity between generated texts, and so on. Another interesting use case is **n-gram language models**. This kind of model predicts the next item (in this case a word) in a sequence based on the *n-1* items that precede it, using conditional probabilities.

More formally, following the notation in reference [\[6\]](https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf), the probability of a given sentence will be given by

$$ P(w_1 w_2 \ldots w_m) = \prod_{i=1}^{m} P(w_i \mid w_1 w_2 \ldots w_{i-1}) $$

where _m_ is the number of words in the sentence. But conditioning on the whole past context is impractical: long histories rarely repeat in a corpus, so their probabilities cannot be estimated reliably. Therefore we make use of the Markov property and assume that each factor on the right-hand side depends only on the _n-1_ most recent elements.

$$ P(w_1 w_2 \ldots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \ldots w_{i-1}) $$

Estimating the probabilities of the _n-grams_ is a matter of taking their counts in a given training corpus and dividing by the counts of their contexts.

$$ P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{\text{count}(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{\text{count}(w_{i-n+1}, \ldots, w_{i-1})} $$
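
As a rough sketch of this count-based estimate for the bigram case, the snippet below uses a tiny made-up corpus. The corpus and function names are assumptions for illustration; there is no smoothing or sentence-boundary padding, and the probability of the first word of a sentence is skipped for simplicity.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized and lowercased.
corpus = [
    ["the", "car", "went", "fast"],
    ["the", "car", "stopped"],
    ["the", "driver", "went", "fast"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_probability(previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]

def sentence_probability(sentence):
    """Multiply the bigram probabilities along the sentence."""
    probability = 1.0
    for previous, word in zip(sentence, sentence[1:]):
        probability *= bigram_probability(previous, word)
    return probability

print(bigram_probability("the", "car"))    # 2 of the 3 "the" are followed by "car"
print(bigram_probability("car", "went"))   # 1 of the 2 "car" is followed by "went"
print(sentence_probability(["the", "car", "went", "fast"]))
```

Any sentence containing a bigram that never occurred in the corpus gets probability zero here, which is one of the practical issues that smoothing techniques address.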

To give a simple example, forgetting n-grams of words for a while, imagine that you have three possible states for the weather: _rainy_, _sunny_, and _cloudy_. Having counts for each of the occurrences, for instance (rainy, rainy), (rainy, cloudy), (rainy, sunny), (rainy, rainy, sunny), etc., you may calculate the probabilities for each of the transitions. Therefore, you may be able to calculate the probability of a given sequence of weather occurrences, or even guess the most probable weather condition for the following day. If you're curious, use [this page](http://setosa.io/ev/markov-chains/) to get a more interactive approach to a similar example.
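
The weather example can be sketched in a few lines. The sequence of observations below is invented for illustration, and only transitions between consecutive days (pairs) are counted.

```python
from collections import Counter, defaultdict

# Made-up sequence of daily weather observations.
days = ["sunny", "sunny", "cloudy", "rainy", "rainy", "sunny", "cloudy", "rainy"]

# transition_counts[today][tomorrow] = how often tomorrow followed today.
transition_counts = defaultdict(Counter)
for today, tomorrow in zip(days, days[1:]):
    transition_counts[today][tomorrow] += 1

def transition_probability(today, tomorrow):
    """Estimate P(tomorrow | today) from the observed transitions."""
    total = sum(transition_counts[today].values())
    return transition_counts[today][tomorrow] / total if total else 0.0

def most_likely_next(today):
    """Return the state that most often followed `today` (must have been seen)."""
    return transition_counts[today].most_common(1)[0][0]

print(transition_probability("sunny", "cloudy"))  # 2 of the 3 transitions out of "sunny"
print(most_likely_next("sunny"))                  # cloudy
```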

Going back to n-grams of words, you just have to replace weather states by words and you can find the most probable sentences or even generate text by choosing the most probable word given its context. For the latter you would be producing a very simple language model (a computational model that can output text that should be somewhat fluent). If you're already familiar with some of these things and want to take it up a notch regarding n-gram language models go through [this set of slides](https://web.stanford.edu/class/cs124/lec/lm2021.pdf) that cover some practical considerations to take into account when building such a model.
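
To make the text generation idea concrete, here is a minimal sketch that repeatedly picks the most probable next word under a bigram model. The corpus is made up, the function name `generate` is ours, and `max_words` is an assumed safeguard so that generation cannot loop forever.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, already tokenized and lowercased.
corpus = [
    ["the", "car", "went", "fast"],
    ["the", "car", "went", "wide"],
    ["the", "car", "went", "fast", "again"],
]

# next_word_counts[previous][word] = how often word followed previous.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    for previous, word in zip(sentence, sentence[1:]):
        next_word_counts[previous][word] += 1

def generate(start, max_words=5):
    """Greedily extend `start` with the most frequent next word."""
    words = [start]
    # Stop when the length limit is hit or the last word was never a context.
    while len(words) < max_words and words[-1] in next_word_counts:
        most_probable = next_word_counts[words[-1]].most_common(1)[0][0]
        words.append(most_probable)
    return words

print(" ".join(generate("the")))  # the car went fast again
```

Always taking the single most probable word makes the output deterministic; sampling the next word according to its probability is a common alternative when more varied text is wanted.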

---

### Word of Advice

Even though we are using the NLTK library during this BLU, other libraries are commonly used as well and may be a better fit for some tasks. Here are some to consider in your future challenges in NLP:

- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/other-languages.html#python)

___

### References

\[1\] - [RegExr](https://regexr.com/3lvai)

\[2\] - [Python Module of the Week](https://pymotw.com/2/re/)

\[3\] - [NLTK Book](https://www.nltk.org/book/)

\[4\] - [N-grams](https://en.wikipedia.org/wiki/N-gram#n-gram_models)

\[5\] - [Language Model](https://en.wikipedia.org/wiki/Language_model)

\[6\] - [Stanford CS124 Language Modeling slides](https://web.stanford.edu/class/cs124/lec/lm2021.pdf)

\[7\] - [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Extra Regex: [RegexOne](https://regexone.com/)