In [1]:
import re
import nltk

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

# Natural Language Processing

### Introduction

This set of BLUs revolves around the topic of Natural Language Processing (NLP). As the name implies, this field is all about the processing and handling of natural language in a way that enables a computer to do useful things with it. There are plenty of interesting applications around it, namely:

- **Speech recognition**: the task of extracting the words being spoken in a sample of audio, or even extracting prosody features, for example.
- **Natural language generation**: the task of putting computational formulations into actual text, for example, automated generation of labels to images, summarisation of texts and data, creation of dialogue systems, etc.
- **Natural language understanding**: the task of getting some meaning out of the data. For instance, recognizing entities in sentences, identifying semantic roles, or even classifying sentences according to their sentiment.

More formally, some of the main tasks and areas of research of NLP are:

- **Part of Speech tagging**: Determine the role of each word in a given sentence, for instance, if it is an adjective, verb, noun, etc.

- **Word Segmentation**: Break continuous text into words.

- **Parsing**: Define a tree that represents the grammatical structure of a sentence.

- **Machine Translation**: Translate sentences from a source language to a target language automatically.

- **Named entity recognition**: Find parts of the text that correspond to certain entities, like names of places, people, companies, etc.

- **Question answering**: Given a question in human language, find the most appropriate answer.

- **Text to speech**: As the name implies, transform written text into audible, human-like sounds that correspond to the given input.

Many of these tasks are out of the scope of these learning units, but we still think they're a motivating entry point into exploring this exciting topic!

# Preprocessing text 

Most fields that use text data require, in one way or the other, transforming it into an easier to use format. In order to extract relevant information from text, it is extremely important to preprocess it. This task usually includes the following methods:

- Splitting words in a sentence
- Lowercasing
- Removing punctuation
- Removing common words (stop words)
- Extracting suffixes or prefixes
- Part of speech filtering

At first, these text processing tasks may seem _easy_. For instance, separating the words in a sentence looks as simple as splitting on the spaces or punctuation between words. But when you really think about how diverse languages are, you start to understand how daunting these tasks can be. Take Mandarin Chinese, for instance, which is written without spaces between words: our heuristic is suddenly not valid anymore. And for many of the tasks, there are plenty of edge cases.

<img src="./media/xkcd_language_nerd.png" width="300">

Bottom line is: language is hard! But that's what makes this field one of the most challenging but also most rewarding to work on.


Throughout these learning units we hope to give you some basic understanding of how to transform text into something useful for models, explain some challenges in this field, solve interesting problems and, hopefully, make you want to learn more about the topic afterwards! :)

The first part of this BLU goes through some fundamental concepts that will be helpful for all the practical tasks that come next in this specialization (and also in the future, if you ever need to work with text data!). We will start by introducing **regular expressions**, followed by three important concepts in data pre-processing (**tokenization**, **stopwords**, and **stemming**). Finally, we will learn about **n-grams** and **n-gram-models**.


## Finding patterns in text

To perform any of the text processing tasks mentioned above efficiently, we need to be able to parse and detect patterns in the text programmatically. For the sake of simplicity, we'll focus only on English in the next couple of examples.

Let's say you have the following text:

In [2]:
text_example = """
Bommer Canyon is an open space preserve in southern Irvine, California featuring hiking and 
biking trails as well as private event areas. The canyon is part of the Irvine Ranch, which 
itself is a National Natural Landmark, the first California Natural Landmark,[1][2] and part 
of the City of Irvine Open Space Preserve.[3][4] The preserve is adjacent to the affluent 
Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16,000 acres of 
preserved open space.[5] Approximately 15 of these acres are preserved as a "Cattle Camp" 
named for the area's previous cattle operations and are now rented for private events such 
as campouts, company picnics, and family reunions.[6] The trails in Bommer Canyon feature 
groves of oak and sycamore trees as well as rough rock outcrops and are popular with area 
residents who use them for nature walks, hiking and mountain biking.
"""

How would you answer the question:

> What are all the words in the following text that start with the letter `a`?

You could obviously go through them manually. But what if, instead, you had thousands or millions of lines? That becomes impossible!

A second option, given your recently acquired skills in Python, could be to write a function that does that for you:


In [3]:
def find_all_words_starting_with_a(text):
    
    # Assuming all words can be split by spaces - big assumption
    list_of_a_words = []
    words = text.split(" ")
    for w in words:
        if w.startswith("a"):
            list_of_a_words.append(w)
    
    return list_of_a_words

print("Words found that start with 'a':")
for w in find_all_words_starting_with_a(text_example):
    print("- " + w)

Words found that start with 'a':
- an
- and
- as
- as
- areas.
- a
- and
- adjacent
- affluent
- and
- and
- acres
- acres
- are
- as
- a
- area's
- and
- are
- and
- and
- as
- as
- and
- are
- area
- and


Now let's take a slightly more complex task: we want to find and remove all punctuation, replacing it with a space when needed.

In [4]:
def remove_punctuation(text):
    
    big_list_of_punctuation = [
        ".", ",", "?", "!", "-", "_", ":", ";",
        "\"", "'", "|", "(", ")", "/", "\\",
        "[", "]",
    ]
    
    processed_text = ""
    for idx, letter in enumerate(text):
        # Get previous and next letters (empty string at the text boundaries)
        previous_letter = text[idx - 1] if idx > 0 else ""
        next_letter = text[idx + 1] if idx < len(text) - 1 else ""
        
        # If the letter is punctuation, and the previous and next letters are not spaces, add a space
        # Otherwise we'll have too many spaces
        if letter in big_list_of_punctuation:
            if previous_letter != " " and next_letter != " ":
                processed_text += " "
        else:
            processed_text += letter
    
    return processed_text

print(remove_punctuation(text_example))


Bommer Canyon is an open space preserve in southern Irvine California featuring hiking and 
biking trails as well as private event areas The canyon is part of the Irvine Ranch which 
itself is a National Natural Landmark the first California Natural Landmark  1  2 and part 
of the City of Irvine Open Space Preserve  3  4 The preserve is adjacent to the affluent 
Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16 000 acres of 
preserved open space  5 Approximately 15 of these acres are preserved as a Cattle Camp 
named for the area s previous cattle operations and are now rented for private events such 
as campouts company picnics and family reunions  6 The trails in Bommer Canyon feature 
groves of oak and sycamore trees as well as rough rock outcrops and are popular with area 
residents who use them for nature walks hiking and mountain biking 



Slightly more complicated, but seems to have worked! Now let's see what would happen for a different text:

In [5]:
different_text_example = """
Adenanthos venosus is an openly-branched shrub that      typically grows to a height of 1–2 m 
(3 ft 3 in – 6 ft 7 in) and forms a lignotuber. Its leaves are mostly arranged in clusters 
at the ends of branches, egg-shaped, sometimes with the narrower end towards the base, 
mostly 15–20 mm (0.59–0.79 in) long, 10 mm (0.39 in) wide and sessile. The leaves are 
mostly glabrous and have a pointed tip. The flowers are ”dull crimson” to ”pinkish purple” 
with a cream-coloured band in the centre and many glandular hairs on the outside. 
"""

print(remove_punctuation(different_text_example))


Adenanthos venosus is an openly branched shrub that      typically grows to a height of 1–2 m 
 3 ft 3 in – 6 ft 7 in and forms a lignotuber Its leaves are mostly arranged in clusters 
at the ends of branches egg shaped sometimes with the narrower end towards the base 
mostly 15–20 mm 0 59–0 79 in long 10 mm 0 39 in wide and sessile The leaves are 
mostly glabrous and have a pointed tip The flowers are ”dull crimson” to ”pinkish purple” 
with a cream coloured band in the centre and many glandular hairs on the outside 



Seems like this time our cleaned text is not so well formatted... In particular, we missed some `”` characters and spaces.

Of course, we could go back and re-write our function... But what we want to point out is that coding all these tasks from scratch is not only inefficient, but also quite boring.

When cleaning text, we may encounter endless different variations and extra conditions that will make it very hard to keep track of everything. Some examples apart from dealing with punctuation are:

- Replacing all numbers with a placeholder
- Removing all decimals from numbers
- Counting all uppercase letters in a text
- Finding all words that have fewer than 3 letters (or any other number)
- ...

This is where **regular expressions** come in.

### Regular Expressions (aka Regex)

Regular expressions are sequences of characters that allow us to define search patterns in a standardized way. They are one of the most fundamental concepts in computer science when working with text data.

The idea is simple: by defining a text pattern using regex rules, you can easily locate and replace it within any piece of text!

Most of the tasks that we defined before can be performed using regex. Let's see our first example: finding all words starting with 'a' in the text.


In [6]:
def find_all_words_starting_with_a_regex(text):
    
    pattern = r"\ba\w+\b"
    return re.findall(pattern, text)

print("Words found that start with 'a':")
for w in find_all_words_starting_with_a_regex(text_example):
    print("- " + w)

Words found that start with 'a':
- an
- and
- as
- as
- areas
- and
- adjacent
- affluent
- and
- and
- acres
- acres
- are
- as
- area
- and
- are
- as
- and
- and
- as
- as
- and
- are
- area
- and


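Much more compact, right? Note that `\w+` requires at least one word character after the `a`, so the standalone word `a` is not matched, and `area's` is cut at the apostrophe. Using `\w*` instead would also match the single-letter word. Now let's do the same for punctuation: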

In [7]:
def remove_punctuation_regex(text):
    
    pattern_punkt = r"[\.,\?\!\-\_\:\;\"'\|\(\)/\\\[\]]"
    
    # Replaces all punctuation characters by spaces
    text_no_punkt = re.sub(pattern_punkt, " ", text)
    
    # Collapses multiple spaces
    text_no_punkt = re.sub(r"\s+", " ", text_no_punkt)

    return text_no_punkt
    

print(remove_punctuation_regex(text_example))

 Bommer Canyon is an open space preserve in southern Irvine California featuring hiking and biking trails as well as private event areas The canyon is part of the Irvine Ranch which itself is a National Natural Landmark the first California Natural Landmark 1 2 and part of the City of Irvine Open Space Preserve 3 4 The preserve is adjacent to the affluent Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16 000 acres of preserved open space 5 Approximately 15 of these acres are preserved as a Cattle Camp named for the area s previous cattle operations and are now rented for private events such as campouts company picnics and family reunions 6 The trails in Bommer Canyon feature groves of oak and sycamore trees as well as rough rock outcrops and are popular with area residents who use them for nature walks hiking and mountain biking 


Pretty cool, right?

Regex enables us to do a bunch of interesting things with text. But if you're still not convinced, try this [website](https://regex101.com/) out! You can play around with regex and validate that the patterns you wrote are performing as expected.
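One caveat: a pattern that lists punctuation characters explicitly has the same blind spot as the hand-written function, since characters such as `”` or `–` are not in the list. A minimal sketch of a broader alternative is shown below. The function name `remove_punctuation_broad` and the sample sentence are made up for illustration; the idea is to replace anything that is neither a word character nor whitespace.

```python
import re

def remove_punctuation_broad(text):
    # Replace every character that is neither a word character nor whitespace.
    # "_" counts as a word character in regex, so it is listed separately.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

sample = "The flowers are ”dull crimson” to ”pinkish purple”, 1–2 m tall."
print(remove_punctuation_broad(sample))
# The flowers are dull crimson to pinkish purple 1 2 m tall
```

The trade-off is that this also removes symbols you might want to keep (currency signs, for example), so the explicit list can still be the better choice for some tasks.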

See some extra examples below:

In [8]:
# Find sequences of one or more digits
pattern_digits = r"[0-9]+"

# Find words with up to 3 characters
pattern_words_until_3 = r"\b\w{1,3}\b"

# Find URLs in a text
pattern_urls = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"


At this point you may be looking a bit like this:

<img src="./media/regex-confusion.png" width="500">

But worry not! 

Most of us don't know regex by heart, and we need to take a look at cheatsheets from time to time, like the one shown below. Still, as you move forward in the NLP world, we hope you see how helpful regex can be.
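The cleaning tasks listed earlier (number placeholders, dropping decimals, counting uppercase letters, finding short words) each take a line or two with `re`. Here is a minimal sketch; the sample strings and the `<NUM>` placeholder are made up for illustration.

```python
import re

sample = "Room 12 costs 89.50 EUR; Room 7 costs 120.00 EUR."

# Remove the decimal part of numbers: "89.50" -> "89"
no_decimals = re.sub(r"(\d+)\.\d+", r"\1", sample)

# Replace every remaining number with a placeholder token
with_placeholder = re.sub(r"\d+", "<NUM>", no_decimals)

# Count uppercase ASCII letters
n_upper = len(re.findall(r"[A-Z]", sample))

# Find words with fewer than 3 letters
short_words = re.findall(r"\b[A-Za-z]{1,2}\b", "It is an open space in Irvine")

print(no_decimals)       # Room 12 costs 89 EUR; Room 7 costs 120 EUR.
print(with_placeholder)  # Room <NUM> costs <NUM> EUR; Room <NUM> costs <NUM> EUR.
print(n_upper)           # 8
print(short_words)       # ['It', 'is', 'an', 'in']
```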


#### Cheatsheet ([see more](https://cheatography.com/davechild/cheat-sheets/regular-expressions/))

`.` - matches any character, except newline.

`\d`, `\s`, `\S` - match a digit, a whitespace character, a non-whitespace character.

`\b`, `\B` - match a word boundary, not a word boundary.

`[xyz]` - matches x, y or z.

`[^xyz]` - matches anything that is not x, y or z.

`[x-z]` - matches a character between x and z.

`^xyz$` - `^` is the start of the string, `$` is the end of the string.

`\.` - use escaping to match special characters.

`\t`, `\n` - matches tab and newline.

`x*` - matches 0 or more symbols x.

`x+` - matches 1 or more symbols x.

`x?` - matches 0 or 1 symbol x.

`*?`, `+?`, `??`, etc - represent non-greedy search. 

`x{5}` - matches exactly 5 symbols x.

`x{5,}` - matches 5 or more symbols x.

`x{5,8}` - matches between 5 and 8 symbols x (no space after the comma).

`xy|yz` - matches `xy` or `yz`.
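Two of the entries above are easier to grasp with a quick experiment: greedy versus non-greedy quantifiers, and bounded repetition. The strings below are made up for illustration.

```python
import re

quoted = 'say "hi" and "bye"'

# Greedy: .+ grabs as much as possible, up to the last quote
greedy = re.findall(r'".+"', quoted)

# Non-greedy: .+? stops at the first closing quote it can find
lazy = re.findall(r'".+?"', quoted)

# Bounded repetition: whole words made of 2 or 3 x's
bounded = re.findall(r"\bx{2,3}\b", "x xx xxx xxxx")

print(greedy)   # ['"hi" and "bye"']
print(lazy)     # ['"hi"', '"bye"']
print(bounded)  # ['xx', 'xxx']
```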

#### Python `re` library

The Python library that we are using - [re](https://docs.python.org/3/library/re.html) - has many other powerful methods. We'll dive into some of the most useful ones in the following sections.

#### Finding one instance with `search()` 

Using `search()` we can take a certain pattern and look for it in a text. This function will return a `Match` object, from which we can obtain the first text portion that was matched by our pattern.

In [9]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

print("Looking for \"Madrid\":")
match = re.search("Madrid", text)
print(match)

print("\nLooking for \"Rome\":")
match = re.search("Rome", text)
print(match)

print("\nLooking for \"Lisbon\":")
match = re.search("Lisbon", text)
print(match)

Looking for "Madrid":
<re.Match object; span=(7, 13), match='Madrid'>

Looking for "Rome":
None

Looking for "Lisbon":
<re.Match object; span=(0, 6), match='Lisbon'>


So, it is already possible to observe some things about `re.search()`:

- When there is no match, `search()` returns `None`.

- The `Match` object holds the indices of the beginning and end of the match. These can be accessed via `match.start()` and `match.end()`.

- If there is more than one instance of the word in the text, only the first will be retrieved.
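Because `search()` may return `None`, it is safer to check the result before using it. The sketch below reuses the same example string and shows how `group()`, `start()` and `end()` relate to the original text.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

match = re.search("Madrid", text)
if match is not None:
    start, end = match.start(), match.end()
    # The span can be used to slice the original string
    print(match.group(), start, end, text[start:end])  # Madrid 7 13 Madrid

missing = re.search("Rome", text)
if missing is None:
    print("No match for 'Rome'")
```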

#### Finding all instances with `findall()`  or `finditer()`

If we want to return all the matches to our pattern in a given text we might use the function `findall()`. In this case, the matched portions of the text will be returned, instead of the `Match` object.

In [10]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

pattern = r"Lisbon"

for match in re.findall(pattern, text):
    print(match)

Lisbon
Lisbon
Lisbon


Notice that one of the words was written as _Lisbona_, but we still match the _Lisbon_ portion of that word. If we add the condition of having a whitespace character after the letter *n*, we will only get two matches.

In [11]:
pattern = r"Lisbon\s"

for match in re.findall(pattern, text):
    print(match)

Lisbon 
Lisbon 


If we really want the `Match` objects for some reason, `finditer()` should be used instead.

In [12]:
pattern = "Lisbon"

for match in re.finditer(pattern, text):
    print(match)

<re.Match object; span=(0, 6), match='Lisbon'>
<re.Match object; span=(14, 20), match='Lisbon'>
<re.Match object; span=(34, 40), match='Lisbon'>


#### Replacing all instances with `sub()`

If we want to replace the matches of our pattern in a given text with something else we need to use the function `sub()`. In this case, the matched portions of the text will be replaced, and the changed text will be returned.

For example if we wanted to remove the word `Lisbon` from a text we could do the following:

In [13]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# \b indicates a word boundary so using it around a word 
# will only replace the text when it shows as a proper word, 
# between spaces or punctuation 
pattern = r"\bLisbon\b"

print(re.sub(pattern, "", text))


 Madrid  Toulose Oslo Lisbona


If instead we wanted to replace it with another word, we could just specify it:

In [14]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# \b indicates a word boundary so using it around a word 
# will only replace the text when it shows as a proper word, 
# between spaces or punctuation 
pattern = r"\bLisbon\b"

print(re.sub(pattern, "Lisboa", text))


Lisboa Madrid Lisboa Toulose Oslo Lisbona
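Removing a word with `sub()` leaves the surrounding spaces behind, as the first output above shows. A minimal sketch of tidying that up, plus the optional `count` argument that limits how many replacements are made:

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# Remove the word, then collapse leftover whitespace and trim the ends
removed = re.sub(r"\bLisbon\b", "", text)
tidy = re.sub(r"\s+", " ", removed).strip()
print(tidy)  # Madrid Toulose Oslo Lisbona

# Replace only the first occurrence
first_only = re.sub(r"\bLisbon\b", "Lisboa", text, count=1)
print(first_only)  # Lisboa Madrid Lisbon Toulose Oslo Lisbona
```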



#### A primer on patterns

Now that you are familiar with the `re` functions, we'll use them to take a closer look at the different patterns that can be expressed with regex.

---


#### Example 1
Looking at some of the codes shown previously in the cheatsheet, let's see with some simple examples how they may help us!

In [15]:
text = "x xy xyy"

Remember what we've shown previously: `.` will match any character after x:

In [16]:
re.findall("x.", text)

['x ', 'xy', 'xy']

`*` will match 0 or more y symbols after x:

In [17]:
re.findall("xy*", text)

['x', 'xy', 'xyy']

`+` will match 1 or more y symbols after x:

In [18]:
re.findall("xy+", text)

['xy', 'xyy']

`?` will match 0 or 1 y symbols after x:

In [19]:
re.findall("xy?", text)

['x', 'xy', 'xy']

`{i}` will match exactly i y symbols after x:

In [20]:
re.findall("xy{2}", text)

['xyy']

---



#### Example 2

In [21]:
text="lotterer Jani Senna conway Kobayashi Lopez buemi Nakajima alonso"

If we want to match only the names that start with capital letters:

In [22]:
re.findall("[A-Z][a-z]+", text) # find substrings starting with a capital letter
                                # followed by 1 or more lowercase letters

['Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

If we want to match all the names that don't start with the letters "B" or "L":

In [23]:
re.findall(r"\b[^bBlL\s][A-Za-z]+", text) # find substrings after a word boundary that...
                                          # do not begin with B or L or whitespace

['Jani', 'Senna', 'conway', 'Kobayashi', 'Nakajima', 'alonso']

You may be wondering what that `r` is doing before the actual regex we are using. This has no connection with regex itself. It marks a Python raw string, in which backslashes `\` are kept literally instead of being treated as escape characters (notice how our regex has `\b` and `\s`). For instance:

In [24]:
print("With r:\n")
print(r"lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")
print("\n")
print("Without r:\n")
print("lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")

With r:

lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso


Without r:

lotterer 
 Jani 
 Senna conway Kobayashi Lopez buemi Nakajima alonso


In the first case, since we are using `r`, Python takes `\n` literally. In the second case, Python interprets it as the escape sequence for a newline.
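This matters for regex because `\b` is also a valid Python escape (the backspace character). Without the `r` prefix, the pattern that reaches the regex engine is not the one you typed. A small sketch with a made-up sentence:

```python
import re

# "\n" is one character (newline); r"\n" is two (backslash and "n")
print(len("\n"), len(r"\n"))  # 1 2

sentence = "a cat sat"

# Without r, "\b" becomes a backspace character, so nothing matches
without_raw = re.findall("\bcat\b", sentence)

# With r, the regex engine receives \b and treats it as a word boundary
with_raw = re.findall(r"\bcat\b", sentence)

print(without_raw)  # []
print(with_raw)     # ['cat']
```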

Another important thing to know is that, since regex interprets several characters in a special way, you need to escape them if you want to match them literally. For that purpose you also use the `\`. The character that comes right after it is treated as escaped.

In [25]:
text=r"Text \ with + special [characters]."

In [26]:
print("Matches:\n")

for m in re.findall(r".+[ ]\ ", text): # If we don't escape the characters we mean
    print(m)                           # find, we won't match anything and could 
                                       # even have a broken regex

Matches:



In [27]:
print("Matches:\n")

for m in re.findall(r"[\.\+\[\]\\]", text): # If we escape the characters we mean to 
    print(m)                                 # find, we'll match them literally

Matches:

\
+
[
]
.


You'll notice in particular the `\\`. This is just a corner case: since the backslash is used to escape any character, it is also used to escape itself.  
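When the text to match comes from a variable, escaping it by hand is error-prone. The standard library provides `re.escape()` for this. A minimal sketch, where the `literal` string is just an illustrative choice:

```python
import re

text = r"Text \ with + special [characters]."
literal = "[characters]."

# re.escape adds the backslashes needed to match the string literally
pattern = re.escape(literal)
escaped_matches = re.findall(pattern, text)
print(escaped_matches)  # ['[characters].']

# Unescaped, the same string is read as a character class followed by "."
unescaped_matches = re.findall(literal, text)
print(unescaped_matches)
```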

<img src="./media/backslashes.png" width="700">



---

Imagine now we have some extra information after the names, and we receive a file with many lines. We still want only names starting with capital letters. So we run the previous regex and...

In [28]:
text="lotterer Rebellion\nJani Rebellion\nSenna Rebellion\nconway Toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima Toyota\nalonso Toyota"

In [29]:
re.findall("[A-Z][a-z]+", text)

['Rebellion',
 'Jani',
 'Rebellion',
 'Senna',
 'Rebellion',
 'Toyota',
 'Kobayashi',
 'Toyota',
 'Lopez',
 'Toyota',
 'Toyota',
 'Nakajima',
 'Toyota',
 'Toyota']

Well, actually we just want the first name! So let's try to add the symbol `^` to make sure the expression only matches at the beginning of the text.

In [30]:
re.findall("^[A-Z][a-z]+", text)

[]

Hmm... we got a handful of nothing. Why is this happening? Well, by default the regex treats the whole text as a single line, so `^` only matches at the very start, and the first name doesn't start with a capital letter. To confirm this, let's change `lotterer` to `Lotterer`.

In [31]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text)

['Lotterer']

But we still only capture one line. Luckily, we have [`re.MULTILINE`](https://docs.python.org/3/library/re.html#re.MULTILINE), that allows us to process multiline strings easily.

In [32]:
re.findall("^[A-Z][a-z]+", text, re.MULTILINE)

['Lotterer', 'Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

And now we were able to get all the information we wanted! And what if we wanted the second name on each line? Well, in this case, that is the last word of the line, so we may use `$`.

In [33]:
re.findall("[A-Z][a-z]+$", text, re.MULTILINE)

['Rebellion', 'Rebellion', 'Toyota', 'Toyota', 'Toyota', 'Toyota']

What if we want all full lines ending with `rebellion`?

In [34]:
re.findall(".*rebellion$", text, flags=(re.MULTILINE|re.IGNORECASE))

['Lotterer Rebellion', 'Jani rebellion', 'Senna Rebellion']

You may have noticed that here we are also taking advantage of the flag `re.IGNORECASE`. This is a convenient flag to add if you want case-insensitive matches. Multiple regex flags can be strung together with pipes: `|`.
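When the same pattern and flags are reused several times, they can be bundled into a compiled pattern with `re.compile()`. The sketch below uses a shortened version of the text above; the variable names are illustrative.

```python
import re

text = "Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota"

# Compile once, reuse many times; flags are combined with the | operator
line_start_names = re.compile(r"^[A-Z][a-z]+", flags=re.MULTILINE)
rebellion_lines = re.compile(r"^.*rebellion$", flags=re.MULTILINE | re.IGNORECASE)

names = line_start_names.findall(text)
lines = rebellion_lines.findall(text)

print(names)  # ['Lotterer', 'Jani', 'Senna']
print(lines)  # ['Lotterer Rebellion', 'Jani rebellion', 'Senna Rebellion']
```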

Regular expressions can get hard to read really fast, but even knowing the basics will certainly be helpful in the future. To better understand how they work, nothing beats practice, and sites like [this](https://regex101.com/) are valuable visual tools for that. The Python `re` library has many more methods that might be useful in future tasks, so make sure to check them out if you're interested!

Here are some more reading suggestions about regex:
- https://towardsdatascience.com/regular-expressions-clearly-explained-with-examples-822d76b037b4
- https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

---

## Tokenization

Now that we've seen how to find patterns in text, let's explore how to partition it into smaller, meaningful tokens.

This task is called _tokenization_ and it's an essential step when dealing with text. In practice _tokenization_ is about splitting the strings of a corpus into substrings. This is important because we can transform a string into parts that are more suitable to be used by natural language processing tools. For instance, if we are working with the sentence:

_"The car went too fast on the second lap. This damaged the tires."_ ,

it will be easier to work with if we split it into substrings:

_["The", "car", "went", "too", "fast", "on", "the", "second", "lap", ".", "This", "damaged", "the", "tires", "."]_ .

<img src="./media/tokenizer.png" width="500">

Hopefully by now you've realized that this task involves more than just splitting on spaces and requires a bit more thought. To simplify things, we'll use some libraries and methods that help us implement tokenization.

We will be using [NLTK](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) implementations. 

In [35]:
text = "The car went too fast on the second lap. This damaged the tires..."

In [36]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


Notice that the tokenizer is created by taking advantage of the regular expressions we learned earlier. This means that we can make different tokenizers according to what we want to split on. For instance, if we had used `[A-Z]\w+`, the tokenizer would only select the words that begin with capital letters.

In [37]:
tokenizer = RegexpTokenizer(r'[A-Z]\w+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'This']
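Since these tokenizers are driven by a regex that describes the tokens themselves, roughly the same result can be reproduced with plain `re.findall()`. The sketch below is only meant to show what each branch of the first pattern is for; the price sentence is a made-up example.

```python
import re

pattern = r"\w+|\$[\d\.]+|\S+"

text = "The car went too fast on the second lap. This damaged the tires..."
tokens = re.findall(pattern, text)
print(tokens)
# ['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.',
#  'This', 'damaged', 'the', 'tires', '...']

# \w+        -> words
# \$[\d\.]+  -> dollar amounts, kept as a single token
# \S+        -> any other run of non-whitespace characters (punctuation)
price_tokens = re.findall(pattern, "It costs $3.50!")
print(price_tokens)  # ['It', 'costs', '$3.50', '!']
```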


Notice that there are already some pre-defined implementations we can use by taking advantage of `nltk.tokenize` module. These are:
- `BlanklineTokenizer` - Tokenize a string using blank lines as the delimiter.
- `WordPunctTokenizer` - Tokenize a string into alphabetic and non-alphabetic characters.
- `WhitespaceTokenizer` - Tokenize a string using spaces, tabs, and newlines as delimiters.

In [38]:
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import WhitespaceTokenizer

In [39]:
BlanklineTokenizer().tokenize(text)

['The car went too fast on the second lap. This damaged the tires...']

In [40]:
WordPunctTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap',
 '.',
 'This',
 'damaged',
 'the',
 'tires',
 '...']

In [41]:
WhitespaceTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap.',
 'This',
 'damaged',
 'the',
 'tires...']

`WordPunctTokenizer()` is similar to the first one we defined (`RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')`). This tokenizer is one of the most commonly used. So, when we talk about tokenization without specifying further details, this is by default the type of tokenization that we expect you to use (for example, in exercises).


**Final Note**: Even though we are using the NLTK library during this BLU, other libraries are commonly used as well and may be a better fit depending on the task. Here is a list of some to consider in your future challenges in NLP:

- [Spacy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/other-languages.html#python)

---

## Stemming

Stemming is another very important concept in natural language processing. This technique allows us to get the "root" of words by ignoring suffixes. This is important because, in certain tasks, we are more interested in the broader semantics of a word and not the specific variation of it, for instance, if it's a verb or an adjective.

So, let's see what this step gets us for the same example we have been using. To do that, we will be using the NLTK implementation of the [snowball stemmer](https://www.nltk.org/api/nltk.stem.snowball.html). Notice that there are other stemmers, some of them specific to certain tasks.

In [80]:
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)

stemmer = SnowballStemmer("english")
stems = [stemmer.stem(word) for word in words]
print(stems)

['the', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'this', 'damag', 'the', 'tire', '...']


In [83]:
stemmer.stem('better')

'better'

We can see that _"damaged"_ and _"tires"_ were reduced to the stems _"damag"_ and _"tire"_, that is, the original words with their suffixes stripped. Notice as well that all the words have been lowercased. Lowercasing the data is also a common step in text pre-processing.

Besides stemming there is also the process of **lemmatization**. Both processes share the goal of getting the root of the word, or more formally, reducing inflectional forms of a word to a common base form [\[7\]](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), but they act differently. While stemming follows a heuristic approach that drops the suffix of words in order to get closer to the common base form, lemmatization returns the base or dictionary form of a word, known as the _lemma_.

For example, given the word _consultations_, stemming would return only *consult*, while lemmatization would take into account whether the word is a verb or a noun, as well as whether it is in the plural or singular form, returning *consultation*.

As you may expect, lemmatization is much more expensive in computational terms and, for certain applications, stemming might be more than enough to obtain good results. We will be using only stemming throughout the NLP learning units.
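To get a feel for what "heuristic" means here, below is a deliberately naive suffix stripper. It is a toy written for this illustration (the name `toy_stem` and its suffix list are assumptions), not the Snowball algorithm, which applies a much more careful set of rules.

```python
def toy_stem(word, suffixes=("ations", "ation", "ing", "ed", "es", "s")):
    """Toy stemmer: strip the first matching suffix. NOT the Snowball algorithm."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when at least 3 characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("consultations"))  # consult
print(toy_stem("consulting"))     # consult
print(toy_stem("went"))           # went
```

Note that the irregular form *went* is left untouched: no suffix rule can relate it to *go*, which is the kind of case where a dictionary-based lemmatizer has the advantage.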

## Stopwords 

**Stopwords** are common words that don't carry much semantic meaning. Examples of stop words are pronouns or articles. Most of the time, it is better to remove these less meaningful words.

As an example of why removing stop words might be an important step, imagine a search engine going through a big range of documents. Words such as "*the*", "*a*", "*at*", etc. will be present in so many documents that using them in the search will not help at all in finding the best documents to answer our query. So filtering them out will speed up the search and remove noise, making it beneficial to our goal.

Let's start by downloading the stopwords corpus provided by NLTK.

In [44]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ines/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Now we can get a list of stop words for any of the languages included in the corpus:

In [73]:
from nltk.corpus import stopwords

stop_eng = set(stopwords.words('english'))

stop_eng

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [63]:
stop_pt = set(stopwords.words('portuguese'))

stop_pt

{'a',
 'ao',
 'aos',
 'aquela',
 'aquelas',
 'aquele',
 'aqueles',
 'aquilo',
 'as',
 'até',
 'com',
 'como',
 'da',
 'das',
 'de',
 'dela',
 'delas',
 'dele',
 'deles',
 'depois',
 'do',
 'dos',
 'e',
 'ela',
 'elas',
 'ele',
 'eles',
 'em',
 'entre',
 'era',
 'eram',
 'essa',
 'essas',
 'esse',
 'esses',
 'esta',
 'estamos',
 'estar',
 'estas',
 'estava',
 'estavam',
 'este',
 'esteja',
 'estejam',
 'estejamos',
 'estes',
 'esteve',
 'estive',
 'estivemos',
 'estiver',
 'estivera',
 'estiveram',
 'estiverem',
 'estivermos',
 'estivesse',
 'estivessem',
 'estivéramos',
 'estivéssemos',
 'estou',
 'está',
 'estávamos',
 'estão',
 'eu',
 'foi',
 'fomos',
 'for',
 'fora',
 'foram',
 'forem',
 'formos',
 'fosse',
 'fossem',
 'fui',
 'fôramos',
 'fôssemos',
 'haja',
 'hajam',
 'hajamos',
 'havemos',
 'haver',
 'hei',
 'houve',
 'houvemos',
 'houver',
 'houvera',
 'houveram',
 'houverei',
 'houverem',
 'houveremos',
 'houveria',
 'houveriam',
 'houvermos',
 'houverá',
 'houverão',
 'houverí

Let's remove all stop words from our tokens list. We first need to lowercase each word, since the stop words are all stored in lowercase.

In [75]:
[word for word in words if word.lower() not in stop_eng]

['car', 'went', 'fast', 'second', 'lap', '.', 'damaged', 'tires', '...']
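Removing stop words is not always harmless. The English list printed above contains `'no'`, `'nor'` and `'not'`, so a sentence can lose its negation. The sketch below uses a small hand-picked stop set (an illustrative assumption, not the full NLTK list) so that it runs without downloading the corpus; the names `stop_subset`, `negations` and `custom_stop` are hypothetical.

```python
# Hand-picked stop set for illustration; the real list is much longer.
stop_subset = {"the", "a", "is", "was", "no", "nor", "not"}

# Words we may want to keep for tasks such as sentiment analysis.
negations = {"no", "nor", "not"}
custom_stop = stop_subset - negations

tokens = ["The", "race", "was", "not", "good"]

# Same filtering pattern as above: lowercase before the membership test.
default_filtered = [t for t in tokens if t.lower() not in stop_subset]
custom_filtered = [t for t in tokens if t.lower() not in custom_stop]

print(default_filtered)
print(custom_filtered)
```

With the default set the sentence collapses to `race good`, while the custom set keeps `not`. Whether that matters depends on the task at hand.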

---

## N-Grams

Creating _N-grams_ is one of the longest-standing preprocessing strategies for NLP tasks. An N-gram is a sequence of N consecutive tokens in a given sentence. Each token is usually a word, but we may define it as we wish for the task at hand. Depending on the value chosen for N, we may have unigrams, bigrams, trigrams, four-grams, etc.

For instance, for the sentence

`"The driver made a mistake"`,

we would have:

- unigrams: `The`, `driver`, `made`, `a`, `mistake`
- bigrams: `The driver`, `driver made`, `made a`, `a mistake`
- trigrams: `The driver made`, `driver made a`, `made a mistake`
- four-grams: `The driver made a`, `driver made a mistake`
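The list above can be reproduced with a few lines of plain Python. This is only a sketch to show the sliding-window idea; `make_ngrams` is a hypothetical helper, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so we get len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "The driver made a mistake".split()

for n in range(1, 5):
    # Join each tuple back into a string to match the list above
    print(n, [" ".join(gram) for gram in make_ngrams(tokens, n)])
```

A sentence with `k` tokens yields `k - N + 1` N-grams, which is why the five-word example has four bigrams and only two four-grams.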

We will create _N-grams_ by taking advantage of the [NLTK ngram](http://www.nltk.org/_modules/nltk/model/ngram.html) implementation, through the `ngrams` function imported from `nltk.util` at the top of this notebook. We will be using the tokenized list `words` created previously.

In [84]:
print(words)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


In [85]:
print(list(ngrams(words, 1)))

[('The',), ('car',), ('went',), ('too',), ('fast',), ('on',), ('the',), ('second',), ('lap',), ('.',), ('This',), ('damaged',), ('the',), ('tires',), ('...',)]


In [86]:
print(list(ngrams(words, 2)))

[('The', 'car'), ('car', 'went'), ('went', 'too'), ('too', 'fast'), ('fast', 'on'), ('on', 'the'), ('the', 'second'), ('second', 'lap'), ('lap', '.'), ('.', 'This'), ('This', 'damaged'), ('damaged', 'the'), ('the', 'tires'), ('tires', '...')]


In [47]:
print(list(ngrams(words, 3)))

[('The', 'car', 'went'), ('car', 'went', 'too'), ('went', 'too', 'fast'), ('too', 'fast', 'on'), ('fast', 'on', 'the'), ('on', 'the', 'second'), ('the', 'second', 'lap'), ('second', 'lap', '.'), ('lap', '.', 'This'), ('.', 'This', 'damaged'), ('This', 'damaged', 'the'), ('damaged', 'the', 'tires'), ('the', 'tires', '...')]


We may use N-grams for several reasons. For instance, we can look at them as features of our corpus, where the feature space is defined by all N-grams of our vocabulary. These features may be used, for instance, as input of a classification model.
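As a rough sketch of what "N-grams as features" can look like, the code below builds one count vector per document over the set of all bigrams seen in a tiny corpus. The corpus and the names `feature_space` and `vectors` are made up for illustration; later notebooks may build these features differently.

```python
from collections import Counter

# Two already-tokenized toy documents (illustrative only).
corpus = [
    ["the", "car", "went", "fast"],
    ["the", "car", "was", "damaged"],
]

def bigrams(tokens):
    """Pairs of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))

# Feature space: every distinct bigram in the corpus, in a fixed order.
feature_space = sorted({gram for doc in corpus for gram in bigrams(doc)})

# One vector per document: how often each bigram of the feature space occurs.
vectors = []
for doc in corpus:
    counts = Counter(bigrams(doc))
    vectors.append([counts[gram] for gram in feature_space])

print(feature_space)
print(vectors)
```

Each position in a vector corresponds to one bigram, so both documents share a non-zero entry only at `('the', 'car')`.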

These concepts will be explored in depth in the following notebooks. For now, let's just enumerate some specific examples of why N-grams might be useful:

- By comparing the frequency of each N-gram in two different texts, we can calculate the similarity between them.
- By looking at the frequency of the 2-gram 'Very good', we may get an indicator of the sentiment associated with the text. On the other hand, looking only at the frequencies of the unigrams 'Very' and 'good' might not convey the same meaning.
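Both points can be sketched with bigram counts. Cosine similarity is used here as one possible way to compare frequencies (an assumption, since the text does not name a specific measure), and the three short reviews are invented for illustration.

```python
import math
from collections import Counter

def bigram_counts(text):
    """Lowercase, split on whitespace, and count consecutive word pairs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

counts_1 = bigram_counts("The race was very good")
counts_2 = bigram_counts("The car was very good")
counts_3 = bigram_counts("Good tires are very expensive")

print(cosine_similarity(counts_1, counts_2))
print(cosine_similarity(counts_1, counts_3))

# The third text contains 'very' and 'good', but never the bigram 'very good'.
print(counts_1[("very", "good")], counts_3[("very", "good")])
```

The first two reviews share half of their bigrams, while the third shares none with the first, even though it contains both unigrams 'very' and 'good'.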

### Wrapping it up

And you've reached the end of our first notebook, congratulations! You've learned the following concepts:

* regex
* tokenization
* stemming
* stop words
* n-grams


It may seem overwhelming for now, but you'll see that everything becomes more intuitive as you navigate through your journey into the NLP world! So keep up the motivation and see you in Part 2!

<img src="./media/info_everywhere.jpg" width="400">


___

### References

\[1\] - [RegExr](https://regexr.com/3lvai)

\[2\] - [Python Module of the Week](https://pymotw.com/2/re/)

\[3\] - [NLTK Book](https://www.nltk.org/book/)

\[4\] - [N-grams](https://en.wikipedia.org/wiki/N-gram#n-gram_models)

\[5\] - [Language Model](https://en.wikipedia.org/wiki/Language_model)

\[6\] - [Stanford CS124 Language Modeling slides](https://web.stanford.edu/class/cs124/lec/lm2021.pdf)

\[7\] - [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Extra Regex: [RegexOne](https://regexone.com/)