# BLU07 - Part 1 of 3 - Preprocessing for NLP

In [1]:
import re
import nltk

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

## 1. Natural language processing

This set of BLUs revolves around the topic of Natural Language Processing (NLP). As the name implies, this field is all about the processing and handling of natural language in a way that enables a computer to do useful things with it. There are plenty of interesting applications around it, namely:

- **Speech recognition**: the task of extracting words from a sample of audio. Additional features like prosody can be extracted.
- **Natural language generation**: the task of putting computational formulations into actual text. For example, automated generation of image labels, summarisation of text and data, creation of dialogue systems.
- **Natural language understanding**: the task of getting some meaning out of text data. For instance, recognizing entities in sentences, semantic roles, or classifying sentences according to their sentiment.

More formally, some of the main tasks and areas of NLP research are:

- **Part of speech tagging**: Determine the role of each word in a given sentence. Is it an adjective, verb, noun?
- **Word segmentation**: Break continuous text into words.
- **Parsing**: Define a tree that represents the grammatical structure of a sentence.
- **Machine translation**: Translate sentences from a source language to a target language automatically.
- **Named entity recognition**: Find parts of the text that correspond to certain entities, like the names of places, people, companies.
- **Question answering**: Given a question in human language, find the most appropriate answer.
- **Text to speech**: Transform written text into audible, human-like sounds that correspond to the given input.

Many of these tasks are out of the scope of these learning units, but we still think they're a motivating entry point into exploring this exciting topic. Nowadays, most of these tasks are accomplished with large language models or LLMs. In this course, we'll keep to the NLP basics, so that you have a solid foundation when delving into LLMs later on.

## 2. Preprocessing text 

Most fields that use text data need to transform it into an easier-to-use format. The ML models that we have used until now require a set of features in tabular form. How can we get there from a piece of text? Maybe you can think of counting words, analyzing the diversity of verbs or adjectives, or even looking at word combinations like adjectives plus nouns. All this and much more is done in basic text analysis. Before we get to feature extraction, though, we need to preprocess the text. This task usually includes the following steps:

- Splitting sentences into words
- Lowercasing
- Removing punctuation
- Removing most common words with little information value (stopwords)
- Removing suffixes or prefixes to get to the word base
- Filtering by part of speech categories

At first, these text processing tasks may seem _easy_. For instance, separating the words in a sentence is as simple as looking at the spaces or punctuation between words. But when you really think about the diversity of languages, you start to understand how daunting all these tasks are. Take a look at Mandarin Chinese, for instance. Our heuristic is suddenly not valid anymore. And for many of the tasks, there are plenty of edge cases.

<img src="./media/xkcd_language_nerd.png" width="300">

Bottom line is: language is hard! But that's what makes this field one of the most challenging but also most rewarding to work in.

The first part of this BLU explains fundamental concepts that will be helpful for all the practical tasks that come next in this specialization (and also in the future, if you ever need to work with text data!). We will start by introducing **regular expressions**, followed by three important concepts in data preprocessing (**tokenization**, **stopwords**, and **stemming**). Finally, we will learn about **n-grams**. It helps if you still remember some theory from your language classes. ;)

## 3. Finding patterns in text

To process text efficiently and automatically, we need to be able to parse and detect patterns in the text programmatically. For simplicity, we'll only focus on English in the examples.

Let's say you have the following text:

In [2]:
text_example = """
Bommer Canyon is an open space preserve in southern Irvine, California featuring hiking and 
biking trails as well as private event areas. The canyon is part of the Irvine Ranch, which 
itself is a National Natural Landmark, the first California Natural Landmark,[1][2] and part 
of the City of Irvine Open Space Preserve.[3][4] The preserve is adjacent to the affluent 
Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16,000 acres of 
preserved open space.[5] Approximately 15 of these acres are preserved as a "Cattle Camp" 
named for the area's previous cattle operations and are now rented for private events such 
as campouts, company picnics, and family reunions.[6] The trails in Bommer Canyon feature 
groves of oak and sycamore trees as well as rough rock outcrops and are popular with area 
residents who use them for nature walks, hiking and mountain biking.
"""

How would you answer the question:

> What are all the words in the following text that start with the letter `a`?

You could obviously count them manually. But what if you had thousands or millions of lines of text? That becomes impossible!

A second option, given your recently acquired Python skills, could be to write a function that does that for you:

In [3]:
def find_all_words_starting_with_a(text):
    
    # Assuming all words can be split by spaces - big assumption
    list_of_a_words = []
    words = text.split(" ")
    for w in words:
        if w.startswith("a"):
            list_of_a_words.append(w)
    
    return list_of_a_words

In [4]:
print("Words found that start with 'a':")
for w in find_all_words_starting_with_a(text_example):
    print("- " + w)

Words found that start with 'a':
- an
- and
- as
- as
- areas.
- a
- and
- adjacent
- affluent
- and
- and
- acres
- acres
- are
- as
- a
- area's
- and
- are
- and
- and
- as
- as
- and
- are
- area
- and


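That worked quite well, but not perfectly. We only found words starting with `a`, not `A`. Some results still carry punctuation (`areas.`, `area's`), and because we only split on spaces, a word that comes right after a line break (like the `as` in `as campouts`) is missed.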

Now let's take a slightly more complex task: we want to find and remove all punctuation, replacing it with a space when needed.

In [5]:
def remove_punctuation(text):
    
    big_list_of_punctuation = [
        ".", ",", "?", "!", "-", "_", ":", ";",
        "\"", "'", "|", "(", ")", ")", "/", "\\",
        "[", "]",
    ]
    processed_text = ""
    for idx, letter in enumerate(text):
        # Guard the first and last positions so we never index outside the text
        previous_letter = text[idx - 1] if idx > 0 else ""
        next_letter = text[idx + 1] if idx < len(text) - 1 else ""
        
        # If the letter is punctuation, and the previous and next letters are not spaces, add a space
        if letter in big_list_of_punctuation:
            if previous_letter != " " and next_letter != " ":
                processed_text += " "
        else:
            processed_text += letter
    
    return processed_text

In [6]:
print(remove_punctuation(text_example))


Bommer Canyon is an open space preserve in southern Irvine California featuring hiking and 
biking trails as well as private event areas The canyon is part of the Irvine Ranch which 
itself is a National Natural Landmark the first California Natural Landmark  1  2 and part 
of the City of Irvine Open Space Preserve  3  4 The preserve is adjacent to the affluent 
Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16 000 acres of 
preserved open space  5 Approximately 15 of these acres are preserved as a Cattle Camp 
named for the area s previous cattle operations and are now rented for private events such 
as campouts company picnics and family reunions  6 The trails in Bommer Canyon feature 
groves of oak and sycamore trees as well as rough rock outcrops and are popular with area 
residents who use them for nature walks hiking and mountain biking 



Slightly more complicated, but it seems to have worked! Now let's apply the function to a different text:

In [7]:
different_text_example = """
Adenanthos venosus is an openly-branched shrub that typically grows to a height of 1–2 m 
(3 ft 3 in – 6 ft 7 in) and forms a lignotuber. Its leaves are mostly arranged in clusters 
at the ends of branches, egg-shaped, sometimes with the narrower end towards the base, 
mostly 15–20 mm (0.59–0.79 in) long, 10 mm (0.39 in) wide and sessile. The leaves are 
mostly glabrous and have a pointed tip. The flowers are ”dull crimson” to ”pinkish purple” 
with a cream-coloured band in the centre and many glandular hairs on the outside. 
"""

print(remove_punctuation(different_text_example))


Adenanthos venosus is an openly branched shrub that typically grows to a height of 1–2 m 
 3 ft 3 in – 6 ft 7 in and forms a lignotuber Its leaves are mostly arranged in clusters 
at the ends of branches egg shaped sometimes with the narrower end towards the base 
mostly 15–20 mm 0 59–0 79 in long 10 mm 0 39 in wide and sessile The leaves are 
mostly glabrous and have a pointed tip The flowers are ”dull crimson” to ”pinkish purple” 
with a cream coloured band in the centre and many glandular hairs on the outside 



Seems like this time it did not go so well... In particular, we missed the `”` quotes and the `–` dashes, which are different characters from the `"` and `-` defined in our function.

Of course, we could go back and re-write our function... But there's a much better way: **regular expressions**.

Regular expressions help us to easily remove punctuation and execute many more typical text processing tasks like:
- Replace all numbers with a placeholder
- Remove all decimals from numbers
- Count all uppercase letters in a text
- Find all words that have fewer than 3 letters (or any other number)
- ...
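
As a preview, the tasks listed above could be sketched with Python's built-in `re` module roughly as follows. The sample sentence and the `<NUM>` placeholder are made up for illustration, and the patterns are only one of several reasonable ways to write each task.

```python
import re

# Hypothetical sample sentence, used only for this illustration
sample = "Room 12 costs 99.50 EUR, or 1200.75 for 12 nights at the ABC Inn."

# Replace every run of digits with a placeholder token
with_placeholder = re.sub(r"\d+", "<NUM>", sample)

# Remove the decimal part of numbers: a dot plus digits that follow a digit
no_decimals = re.sub(r"(?<=\d)\.\d+", "", sample)

# Count uppercase (ASCII) letters
n_upper = len(re.findall(r"[A-Z]", sample))

# Find words made of fewer than 3 letters
short_words = re.findall(r"\b[A-Za-z]{1,2}\b", sample)

print(with_placeholder)
print(no_decimals)
print(n_upper)
print(short_words)
```

Each task is a single line, which is the main appeal compared with the hand-written loops above.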

### 3.1 Regular expressions (aka Regex)

Regular expressions are sequences of characters that allow us to define search patterns in a standardized way. They are one of the most fundamental concepts in computer science when working with text data.

The idea is simple: by defining a text pattern using regex rules, you can easily locate and replace it in any piece of text!

Most of the tasks that we defined before can be performed using regex. Let's see our first example: finding all words starting with `a` in the text. We will use the powerful [re](https://docs.python.org/3/library/re.html) library.

In [8]:
def find_all_words_starting_with_a_regex(text):
    
    pattern = r"\ba\w+\b"
    return re.findall(pattern, text)

In [9]:
print("Words found that start with 'a':")
for w in find_all_words_starting_with_a_regex(text_example):
    print("- " + w)

Words found that start with 'a':
- an
- and
- as
- as
- areas
- and
- adjacent
- affluent
- and
- and
- acres
- acres
- are
- as
- area
- and
- are
- as
- and
- and
- as
- as
- and
- are
- area
- and


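Much more compact, right? The regex version also drops the trailing punctuation and finds words at the start of a line. One caveat: `\w+` requires at least one character after the `a`, so the one-letter word `a` is not matched (using `\w*` instead would include it). Now let's do the same for punctuation: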

In [10]:
def remove_punctuation_regex(text):
    
    pattern_punct = r"[\.,\?\!\-\_\:\;\"'\|\(\)/\\\[\]]"
    
    # Replaces all punctuation characters by spaces
    text_no_punct = re.sub(pattern_punct, " ", text)
    
    # Collapses multiple spaces
    text_no_punct = re.sub(r"\s+", " ", text_no_punct)

    return text_no_punct

In [11]:
print(remove_punctuation_regex(text_example))

 Bommer Canyon is an open space preserve in southern Irvine California featuring hiking and biking trails as well as private event areas The canyon is part of the Irvine Ranch which itself is a National Natural Landmark the first California Natural Landmark 1 2 and part of the City of Irvine Open Space Preserve 3 4 The preserve is adjacent to the affluent Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16 000 acres of preserved open space 5 Approximately 15 of these acres are preserved as a Cattle Camp named for the area s previous cattle operations and are now rented for private events such as campouts company picnics and family reunions 6 The trails in Bommer Canyon feature groves of oak and sycamore trees as well as rough rock outcrops and are popular with area residents who use them for nature walks hiking and mountain biking 


Pretty cool right? 

Regex enables us to do many interesting things with text. If you're still not convinced, try out this [website](https://regex101.com/)! You can play around with regex and validate that the patterns you wrote are performing as expected.

You might have noticed a bunch of backslashes in the regex pattern and an `r` at the beginning. The backslash `\` character is used in both Python strings and regex patterns to give special meaning to otherwise ordinary characters. For example, `\n` means 'newline' in Python strings and `\s` means 'whitespace' in regex patterns. A backslash can also be used to escape characters that would otherwise have a special meaning, like the backslash itself or the dot.

Python only recognizes a fixed set of escape sequences in strings. If we use a backslash to write a regex escape sequence that Python does not know, we will get a warning. That is the reason why we put an `r` in front of our string, to make it a so-called 'raw string'. You can put whatever you want in a raw string because Python will take it literally, as a plain sequence of characters, and will not try to interpret escape sequences in it. The backslashes only acquire special meaning when the string is interpreted by `re` as a regex pattern.
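
A small sketch of why the raw string matters, using `\b` as the example. In a regular Python string `\b` is the backspace character, so the regex engine never sees a word-boundary token. The sample sentence is made up.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a single backspace character
raw = r"\bcat\b"    # the backslashes reach the regex engine untouched

print(len(plain), len(raw))             # 5 7
print(re.findall(plain, "a cat sat"))   # []
print(re.findall(raw, "a cat sat"))     # ['cat']

# Doubling the backslashes in a regular string gives the same pattern
print("\\bcat\\b" == raw)               # True
```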

In [12]:
# Find runs of digits (`+` needs at least one digit; `*` would also match empty strings)
pattern_digits = r"[0-9]+"

# Find words with at most 3 characters
pattern_words_until_3 = r"\b\w{1,3}\b"

# Find URLs in a text
pattern_urls = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
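
To see what the quantifiers in these patterns do, here is a minimal sketch on a made-up sentence. It compares `*` (zero or more) with `+` (one or more) for digits, and applies the `{1,3}` word pattern.

```python
import re

# Hypothetical sample sentence
sample = "Flight 370 left at 9am"

# `*` allows zero digits, so it also "matches" the empty string between characters
star_matches = re.findall(r"[0-9]*", sample)

# `+` requires at least one digit, so only real numbers are returned
plus_matches = re.findall(r"[0-9]+", sample)

# Words with 1 to 3 word characters (letters, digits or underscore)
short_words = re.findall(r"\b\w{1,3}\b", sample)

print("" in star_matches)   # True
print(plus_matches)         # ['370', '9']
print(short_words)          # ['370', 'at', '9am']
```

Note that `\w` also covers digits, which is why `370` and `9am` count as short "words" here.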

At this point you may be looking a bit like this:

<img src="./media/regex-confusion.png" width="500">

But worry not! 

Most of us don't know regex patterns by heart, and we need to take a look at cheatsheets from time to time, like the one shown below.

#### 3.1.1 Regex cheatsheet ([see more](https://cheatography.com/davechild/cheat-sheets/regular-expressions/))
|   |   |   |
| :---    | --- | :---    | 
| `.`  matches any character, except newline. | | `x*` matches 0 or more x symbols. |
| `\d, \s, \S`  digit, whitespace, not whitespace. | | `x+` matches 1 or more x symbols. |
| `\b`, `\B`  word boundary, not a word boundary. | | `x?` matches 0 or 1 x symbol. |
| `[xyz]`  matches x, y or z. | | `??`, `*?`, `+?` non-greedy quantifiers, match as few characters as possible. |
| `[^xyz]`  matches anything that is not x, y or z. | | `x{5}` matches exactly 5 x symbols. |
| `[x-z]`  matches characters from x to z. | | `x{5,8}` matches between 5 and 8 x symbols (no space inside the braces). |
| `^xyz$`  `^` is the start of the string, `$` is the end of the string. | | `xy\|yz` - matches `xy` or `yz`.|
| `\.`  use escaping to match special characters. | | `\t`, `\n` - tab and newline. |
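
The difference between greedy and non-greedy quantifiers is easiest to see on an example. The sketch below extracts quoted phrases from a made-up sentence, first with `.+` and then with `.+?`.

```python
import re

# Hypothetical sample sentence with two quoted phrases
sentence = 'The flowers are "dull crimson" to "pinkish purple".'

# Greedy: `.+` grabs as much as possible, running up to the LAST quote
greedy = re.findall(r'".+"', sentence)

# Non-greedy: `.+?` stops at the first closing quote it can find
lazy = re.findall(r'".+?"', sentence)

print(greedy)   # ['"dull crimson" to "pinkish purple"']
print(lazy)     # ['"dull crimson"', '"pinkish purple"']
```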

#### 3.1.2 Finding one pattern match with `search()` 

Using `search()` we can take a pattern and look for it in a text. This function will return a `Match` object, from which we can obtain the first text portion that was matched by our pattern.

In [13]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

print("Looking for \"Madrid\":")
match = re.search("Madrid", text)
print(match)

print("\nLooking for \"Rome\":")
match = re.search("Rome", text)
print(match)

print("\nLooking for \"Lisbon\":")
match = re.search("Lisbon", text)
print(match)

Looking for "Madrid":
<re.Match object; span=(7, 13), match='Madrid'>

Looking for "Rome":
None

Looking for "Lisbon":
<re.Match object; span=(0, 6), match='Lisbon'>


So, it is already possible to observe some things about `re.search()`:

- When there is no match, `search()` returns `None`.

- The `Match` object stores the indices of the beginning and end of the match. They can be accessed via `match.start()` and `match.end()`.

- If there is more than one instance of the word in the text, only the first one will be retrieved.
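
These observations can be sketched in code as follows. Since `search()` may return `None`, it is a good habit to check the result before calling methods on it.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

match = re.search("Madrid", text)
if match is not None:
    print(match.group())                      # the matched text: Madrid
    print(match.start(), match.end())         # 7 13
    print(text[match.start():match.end()])    # slicing gives the same text

# No match: the result is None, so calling .group() on it would raise an error
missing = re.search("Rome", text)
print(missing is None)                        # True
```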

#### 3.1.3 Finding all pattern matches with `findall()`  or `finditer()`

If we want to return all the matches to our pattern in a given text we can use the function `findall()`. In this case, the matched portions of the text will be returned, instead of the `Match` object.

In [14]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

pattern = r"Lisbon"

for match in re.findall(pattern, text):
    print(match)

Lisbon
Lisbon
Lisbon


Notice that one of the words was written as _Lisbona_ , but we still match the _Lisbon_ portion of that word. If we add the condition of having a white space after the letter *n* we will only get two matches.

In [15]:
pattern = r"Lisbon\s"

for match in re.findall(pattern, text):
    print(match)

Lisbon 
Lisbon 


If instead we really want the `Match` objects for some reason, `finditer()` can help us on that. It returns an iterator with `Match` objects.

In [16]:
pattern = "Lisbon"

for match in re.finditer(pattern, text):
    print(match)

<re.Match object; span=(0, 6), match='Lisbon'>
<re.Match object; span=(14, 20), match='Lisbon'>
<re.Match object; span=(34, 40), match='Lisbon'>


#### 3.1.4 Replacing all pattern matches with `sub()`

If we want to replace the matches of our pattern in a given text with something else we need to use the function `sub()`. In this case, the matched portions of the text will be replaced, and the changed text will be returned.

For example, if we wanted to remove the word `Lisbon` from a text we could do the following:

In [17]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# \b indicates a word boundary so using it around a word 
# will only replace the text when it shows as a proper word, 
# between spaces or punctuation 
pattern = r"\bLisbon\b"

print(re.sub(pattern, "", text))

 Madrid  Toulose Oslo Lisbona


If instead we wanted to replace it by another text we can specify it like so:

In [18]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

pattern = r"\bLisbon\b"

print(re.sub(pattern, "Lisboa", text))

Lisboa Madrid Lisboa Toulose Oslo Lisbona
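
Removing a word with `sub()` leaves the surrounding spaces behind, as the first output in this section shows. A minimal sketch of one way to tidy that up, plus the optional `count` argument of `re.sub()` that limits how many matches are replaced:

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# Remove the word, then collapse repeated whitespace and trim the ends
removed = re.sub(r"\bLisbon\b", "", text)
cleaned = re.sub(r"\s+", " ", removed).strip()
print(cleaned)      # Madrid Toulose Oslo Lisbona

# Replace only the first occurrence
first_only = re.sub(r"\bLisbon\b", "Lisboa", text, count=1)
print(first_only)   # Lisboa Madrid Lisbon Toulose Oslo Lisbona
```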


#### 3.1.5 A primer on patterns

Now that you are familiar with the `re` functions, we'll use them to explore the different patterns that can be expressed with regex.

Let's start with simple patterns from the cheatsheet.

In [19]:
text = "x xy xyy"

`.` will match any character after x:

In [20]:
re.findall("x.", text)

['x ', 'xy', 'xy']

`*` will match 0 or more y symbols after x:

In [21]:
re.findall("xy*", text)

['x', 'xy', 'xyy']

`+` will match 1 or more y symbols after x:

In [22]:
re.findall("xy+", text)

['xy', 'xyy']

`?` will match 0 or 1 y symbols after x:

In [23]:
re.findall("xy?", text)

['x', 'xy', 'xy']

`{i}` will match exactly i y symbols after x:

In [24]:
re.findall("xy{2}", text)

['xyy']

Another example:

In [25]:
text="lotterer Jani Senna conway Kobayashi Lopez buemi Nakajima alonso"

If we want to match only the names that start with capital letters:

In [26]:
re.findall("[A-Z][a-z]+", text) # find substrings starting with a capital letter
                                # followed by 1 or more lowercase letters

['Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

If we want to match all the names that don't start with the letters "B" or "L":

In [27]:
re.findall(r"\b[^bBlL\s][A-Za-z]+", text) # find substrings after a word boundary that
                                          # do not begin with B or L or whitespace

['Jani', 'Senna', 'conway', 'Kobayashi', 'Nakajima', 'alonso']

We're using that hacky `r` for raw strings again to tell Python to interpret backslashes `\` literally (notice how our regex has `\b` and `\s`). For instance:

In [28]:
print("With r:\n")
print(r"lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")
print("\n")
print("Without r:\n")
print("lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")

With r:

lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso


Without r:

lotterer 
 Jani 
 Senna conway Kobayashi Lopez buemi Nakajima alonso


In the first case Python takes `\n` literally. In the second case, Python interprets it as the symbol for newline.

Another important thing to know is that, since regex interprets some characters in a special way, you need to escape them if you want to match them. For that purpose you also use the `\`. Whatever comes after this character is considered escaped.

In [29]:
text=r"Text \ with + special [characters]."

In [30]:
print("Matches:\n")

for m in re.findall(r".+[ ]\ ", text): # If we don't escape the characters we mean to
    print(m)                           # find, we won't match anything and could 
                                       # even have a broken regex

Matches:



In [31]:
print("Matches:\n")

for m in re.findall(r"[\.\+\[\]\\]", text): # If we escape the characters we mean to 
    print(m)                                 # find, we'll match them as intended

Matches:

\
+
[
]
.


You'll notice in particular the `\\`. As the backslash is used to escape any character, it's also used to escape itself.
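
When the text you want to match is a fixed string, escaping every special character by hand is error-prone. The standard library offers `re.escape()` for this. The sketch below assumes we want to find the literal `[characters].` in the same text.

```python
import re

text = r"Text \ with + special [characters]."
literal = "[characters]."

# Unescaped, the brackets define a character class and the dot means "any character"
unescaped_matches = re.findall(literal, text)

# re.escape() adds the backslashes needed to match the string literally
escaped_pattern = re.escape(literal)
escaped_matches = re.findall(escaped_pattern, text)

print(escaped_pattern)
print(escaped_matches)   # ['[characters].']
```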

<img src="./media/backslashes.png" width="700">

Imagine now that we have some extra information after the names, and we receive a file with many lines. We still want only names starting with capital letters. So we run the previous regex and...

In [32]:
text="lotterer Rebellion\nJani Rebellion\nSenna Rebellion\nconway Toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima Toyota\nalonso Toyota"

In [33]:
re.findall("[A-Z][a-z]+", text)

['Rebellion',
 'Jani',
 'Rebellion',
 'Senna',
 'Rebellion',
 'Toyota',
 'Kobayashi',
 'Toyota',
 'Lopez',
 'Toyota',
 'Toyota',
 'Nakajima',
 'Toyota',
 'Toyota']

Well, actually we just want the first name! So let's try to add the symbol `^` to make sure the expression only captures the beginning of the string.

In [34]:
re.findall("^[A-Z][a-z]+", text)

[]

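Hmm... we got a handful of nothing. Why is this happening? By default, the regex treats the whole text as a single string, so `^` only matches at its very beginning, and the first name doesn't start with a capital letter. To prove this is the case, let's change `lotterer` to `Lotterer` (we also lowercase a few team names, which will be useful in the next examples).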

In [35]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text)

['Lotterer']

But we still only capture one line. Luckily, we have [`re.MULTILINE`](https://docs.python.org/3/library/re.html#re.MULTILINE), which makes `^` and `$` match at the start and end of every line and lets us process multiline strings easily.

In [36]:
re.findall("^[A-Z][a-z]+", text, re.MULTILINE)

['Lotterer', 'Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

And now we were able to get all the information we wanted! And what if we wanted the second name on each line? Well, in this case, that is the last word of the line, so we can use `$`.

In [37]:
re.findall("[A-Z][a-z]+$", text, re.MULTILINE)

['Rebellion', 'Rebellion', 'Toyota', 'Toyota', 'Toyota', 'Toyota']

What if we want all full lines ending with `rebellion`?

In [38]:
re.findall(".*rebellion$", text, flags=(re.MULTILINE|re.IGNORECASE))

['Lotterer Rebellion', 'Jani rebellion', 'Senna Rebellion']

You may have noticed that here we are also taking advantage of the flag `re.IGNORECASE`. This is a convenient flag to add if you want case-insensitive matches. Multiple regex flags can be combined with the bitwise OR operator `|`.
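
If the same pattern and flags are reused many times, they can be bundled once with `re.compile()`. Flags can also be written inside the pattern itself, as `(?mi)` at its start. A minimal sketch, using a shortened version of the text above:

```python
import re

text = "Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota"

# Compile once, reuse the pattern object afterwards
rebellion_lines = re.compile(r".*rebellion$", flags=re.MULTILINE | re.IGNORECASE)
compiled_result = rebellion_lines.findall(text)

# Same flags, written inline: m = MULTILINE, i = IGNORECASE
inline_result = re.findall(r"(?mi).*rebellion$", text)

print(compiled_result)   # ['Lotterer Rebellion', 'Jani rebellion', 'Senna Rebellion']
print(compiled_result == inline_result)   # True
```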

Regular expressions can get hard to read really fast, but even knowing the basics will certainly be helpful at some point in the future. To better understand how they work, there's nothing like practicing, and sites like [this](https://regex101.com/) are valuable visual tools to do so. The Python library that we used has many more powerful functions, so make sure to check them out if you're interested!

Here are more reading suggestions about regex:
- https://towardsdatascience.com/regular-expressions-clearly-explained-with-examples-822d76b037b4
- https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

## 4. Tokenization

Now that we've seen how to find patterns in text, let's explore how to partition it into smaller, meaningful tokens.

This task is called **tokenization** and it's an essential step when dealing with text. In practice, tokenization is about splitting the strings of a corpus into substrings. This is important because it transforms a string into parts that are more suitable for natural language processing tools. For instance, if we are working with the sentence:

_"The car went too fast on the second lap. This damaged the tires."_ ,

it will get easier if we split it into substrings:

_["The", "car", "went", "too", "fast", "on", "the", "second", "lap", ".", "This", "damaged", "the", "tires", "."]_ .

<img src="./media/tokenizer.png" width="500">

Hopefully by now you've realized that this task is more than just splitting the text on spaces and requires a bit more thought.

We will now use a specialized NLP library [NLTK](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) which implements different tokenization methods. 

In [39]:
text = "The car went too fast on the second lap. This damaged the tires..."

In [40]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


The [RegexpTokenizer](https://www.nltk.org/api/nltk.tokenize.regexp.html#nltk.tokenize.regexp.RegexpTokenizer) is using regular expressions to define the token creation. We can choose the regex to match either the tokens or the gaps between them. In this case we want tokens that are either words, amounts in dollars or non-whitespace characters like punctuation.
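
The two modes, matching tokens versus matching gaps, can be sketched with plain `re` calls to show the idea. In NLTK the second mode is selected with the `gaps=True` argument of `RegexpTokenizer`. The code below is only an approximation of what the tokenizer does, not its actual implementation.

```python
import re

text = "The car went too fast on the second lap. This damaged the tires..."

# Mode 1: the pattern describes the tokens themselves
tokens_by_match = re.findall(r"\w+|\$[\d\.]+|\S+", text)

# Mode 2: the pattern describes the gaps (here: whitespace) between tokens
tokens_by_gap = [t for t in re.split(r"\s+", text) if t]

print(tokens_by_match)
print(tokens_by_gap)
```

In the first mode punctuation becomes its own token (`'lap'`, `'.'`). In the second it stays glued to the preceding word (`'lap.'`).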

In the following example, the tokenizer selects the words that begin with capital letters.

In [41]:
tokenizer = RegexpTokenizer(r'[A-Z]\w+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'This']


The [nltk.tokenize.regexp](https://www.nltk.org/api/nltk.tokenize.regexp.html) module has several tokenizers with predefined regular expressions. These are:
- `BlanklineTokenizer` - Tokenize on blank lines as delimiters.
- `WordPunctTokenizer` - Tokenize into sequences of alphabetic and non-alphabetic characters.
- `WhitespaceTokenizer` - Tokenize on spaces, tabs, and newlines as delimiters.

Let's see how they tokenize the same sentence as above.

In [42]:
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import WhitespaceTokenizer

In [43]:
BlanklineTokenizer().tokenize(text)

['The car went too fast on the second lap. This damaged the tires...']

In [44]:
WordPunctTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap',
 '.',
 'This',
 'damaged',
 'the',
 'tires',
 '...']

In [45]:
WhitespaceTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap.',
 'This',
 'damaged',
 'the',
 'tires...']

`WordPunctTokenizer()` is similar to the first one we defined (`RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')`). This tokenizer is one of the most commonly used. So, when we talk about tokenization without specifying further details, this is by default the type of tokenization that we expect you to use (for example, in exercises).

## 5. Stemming and lemmatization

Now that we have the tokens, we're going to stem them. **Stemming** allows us to get the root or base of the words. For example, 'fishing' and 'fished' can both be reduced to the same stem 'fish'. This is important because, in certain tasks, we are more interested in the broader semantics of a word than in its specific variation, such as whether it's a verb or an adjective.

We will be using the NLTK [snowball stemmer](https://www.nltk.org/api/nltk.stem.snowball.html). There are different stemming algorithms, some of them specialized for certain tasks.

In [46]:
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)

stemmer = SnowballStemmer("english")
stems = [stemmer.stem(word) for word in words]
print(stems)

['the', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'this', 'damag', 'the', 'tire', '...']


As you can see, the stem doesn't have to be a word or correspond to the morphological root. Notice as well that all the words have been lowercased. Lowercasing the text is typically one of the first steps in text preprocessing.

Besides stemming there is also the process of **lemmatization**. Both processes share the goal of getting the root of the word, or more formally, of reducing inflectional forms of a word to a common base form [\[7\]](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), but they act differently. Stemming follows a heuristic approach that removes the word endings in order to get closer to the common base form. Lemmatization reduces different inflectional forms of a word to its base or dictionary form, known as the _lemma_, and respects categories like nouns and verbs.

For example, given the word _consultations_, stemming would return only *consult*, while lemmatization would take into account that the word is a noun and return *consultation*.

As you may expect, lemmatization is much more expensive in computational terms and, for certain applications, stemming might be more than enough to obtain good results.
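
To make the contrast concrete, here is a deliberately tiny sketch of both ideas. It is not the Snowball algorithm or a real lemmatizer: the suffix list and the lemma dictionary are made up for illustration. Still, it shows why stemming is cheap (a few string checks) while lemmatization needs a vocabulary to look words up in.

```python
# Hypothetical toy rules and dictionary, for illustration only
SUFFIXES = ("ations", "ation", "ing", "ed", "s")
LEMMAS = {"consultations": "consultation", "went": "go", "damaged": "damage"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["consultations", "damaged", "went", "tires"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The toy stemmer produces `damag`, which is not a word, while the dictionary lookup returns `damage` and even maps the irregular form `went` to `go`.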

## 6. Stopwords 

**Stopwords** are common words that don't have much semantic meaning such as pronouns or articles. Most times, we don't want to include them in further analysis.

As an example of why removing stopwords might be an important step, imagine a search engine going through a large set of documents. Words such as "*the*", "*a*", "*at*", etc. will be present in so many documents that using them in the search will not help at all to find the best documents to answer the query. So filtering them out will speed up the search and remove noise.

Let's start by downloading the stopwords corpus provided by NLTK.

In [47]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/maria/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Now we can get a list of stopwords in any language.

In [48]:
from nltk.corpus import stopwords

stop_eng = set(stopwords.words('english'))

stop_eng

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [49]:
stop_pt = set(stopwords.words('portuguese'))

stop_pt

{'a',
 'ao',
 'aos',
 'aquela',
 'aquelas',
 'aquele',
 'aqueles',
 'aquilo',
 'as',
 'até',
 'com',
 'como',
 'da',
 'das',
 'de',
 'dela',
 'delas',
 'dele',
 'deles',
 'depois',
 'do',
 'dos',
 'e',
 'ela',
 'elas',
 'ele',
 'eles',
 'em',
 'entre',
 'era',
 'eram',
 'essa',
 'essas',
 'esse',
 'esses',
 'esta',
 'estamos',
 'estar',
 'estas',
 'estava',
 'estavam',
 'este',
 'esteja',
 'estejam',
 'estejamos',
 'estes',
 'esteve',
 'estive',
 'estivemos',
 'estiver',
 'estivera',
 'estiveram',
 'estiverem',
 'estivermos',
 'estivesse',
 'estivessem',
 'estivéramos',
 'estivéssemos',
 'estou',
 'está',
 'estávamos',
 'estão',
 'eu',
 'foi',
 'fomos',
 'for',
 'fora',
 'foram',
 'forem',
 'formos',
 'fosse',
 'fossem',
 'fui',
 'fôramos',
 'fôssemos',
 'haja',
 'hajam',
 'hajamos',
 'havemos',
 'haver',
 'hei',
 'houve',
 'houvemos',
 'houver',
 'houvera',
 'houveram',
 'houverei',
 'houverem',
 'houveremos',
 'houveria',
 'houveriam',
 'houvermos',
 'houverá',
 'houverão',
 'houverí

Let's remove all stopwords from our token list. We lowercase each word before the lookup, since the stopwords are all stored in lowercase.

In [50]:
[word for word in words if word.lower() not in stop_eng]

['car', 'went', 'fast', 'second', 'lap', '.', 'damaged', 'tires', '...']
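
The same filtering step can be wrapped in a small helper. This is only a sketch: `remove_stopwords` and `toy_stopwords` are hypothetical names, and the tiny hand-picked set stands in for `stop_eng` so that the block runs without the NLTK download.

```python
# Hypothetical, hand-picked stand-in for `stop_eng`, so that this sketch
# runs without the NLTK download. In the notebook, pass `stop_eng` instead.
toy_stopwords = {"the", "too", "on", "this"}


def remove_stopwords(tokens, stop_set):
    """Keep the tokens whose lowercased form is not in stop_set."""
    return [token for token in tokens if token.lower() not in stop_set]


tokens = ['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second',
          'lap', '.', 'This', 'damaged', 'the', 'tires', '...']

filtered = remove_stopwords(tokens, toy_stopwords)
print(filtered)

# Without lowercasing, the capitalised 'The' and 'This' are not matched
# against the lowercase stopwords and stay in the list.
case_sensitive = [token for token in tokens if token not in toy_stopwords]
print(case_sensitive)
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive. Punctuation tokens such as `.` and `...` are not stopwords, so they remain in the output unless they are filtered separately.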

## 7. N-Grams

Creating **N-grams** is one of the oldest NLP strategies. An N-gram is a sequence of N consecutive tokens. The elements can be words, but also punctuation, depending on how we define the tokens. Depending on the value chosen for N, we can have unigrams, bigrams, trigrams, four-grams, etc.

For instance, for the text

`"The driver made a mistake"`,

we have:

- unigrams: `The`, `driver`, `made`, `a`, `mistake`
- bigrams: `The driver`, `driver made`, `made a`, `a mistake`
- trigrams: `The driver made`, `driver made a`, `made a mistake`
- four-grams: `The driver made a`, `driver made a mistake`
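
The lists above come from sliding a window of N tokens over the text, one token at a time. Here is a minimal pure-Python sketch of that idea; `simple_ngrams` is a hypothetical helper written for illustration, and splitting on whitespace stands in for a proper tokenizer.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Each window starts at index i and spans tokens[i:i + n]; there are
    # len(tokens) - n + 1 such windows (none if the text is shorter than n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The driver made a mistake".split()

for n in range(1, 5):
    print(n, simple_ngrams(tokens, n))
```

A text with 5 tokens yields 5 unigrams, 4 bigrams, 3 trigrams and 2 four-grams, matching the lists above.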

We will create N-grams by taking advantage of the [NLTK N-gram](https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams) implementation. We will be using the tokenized list `words` created previously.

In [51]:
print(words)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


In [52]:
print(list(ngrams(words, 1)))

[('The',), ('car',), ('went',), ('too',), ('fast',), ('on',), ('the',), ('second',), ('lap',), ('.',), ('This',), ('damaged',), ('the',), ('tires',), ('...',)]


In [53]:
print(list(ngrams(words, 2)))

[('The', 'car'), ('car', 'went'), ('went', 'too'), ('too', 'fast'), ('fast', 'on'), ('on', 'the'), ('the', 'second'), ('second', 'lap'), ('lap', '.'), ('.', 'This'), ('This', 'damaged'), ('damaged', 'the'), ('the', 'tires'), ('tires', '...')]


In [54]:
print(list(ngrams(words, 3)))

[('The', 'car', 'went'), ('car', 'went', 'too'), ('went', 'too', 'fast'), ('too', 'fast', 'on'), ('fast', 'on', 'the'), ('on', 'the', 'second'), ('the', 'second', 'lap'), ('second', 'lap', '.'), ('lap', '.', 'This'), ('.', 'This', 'damaged'), ('This', 'damaged', 'the'), ('damaged', 'the', 'tires'), ('the', 'tires', '...')]


We can use N-grams in several ways, for instance as features for a text classification model as we'll see in the following notebooks.

For now let's just look at situations where N-grams can be useful:

- By comparing the frequency of each N-gram in two different texts, we can calculate the similarity between them.
- By looking at the frequency of the bigram 'Very good' we may have an indicator of the sentiment associated with the text. On the other hand, solely looking at the frequency of the unigrams 'Very' and 'good' might not convey the same meaning.
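
Both situations can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the example sentences are invented, `simple_ngrams` and `jaccard_similarity` are hypothetical helpers, and the Jaccard overlap of N-gram sets is just one simple similarity measure among many (it ignores how often each N-gram occurs).

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_similarity(tokens_a, tokens_b, n):
    """Share of distinct N-grams that two token lists have in common."""
    grams_a = set(simple_ngrams(tokens_a, n))
    grams_b = set(simple_ngrams(tokens_b, n))
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# 1. Similarity between texts (invented sentences).
text_a = "the car went too fast".split()
text_b = "the car went too slow".split()
text_c = "fast too went car the".split()  # same words as text_a, reversed

print(jaccard_similarity(text_a, text_b, 2))  # most bigrams are shared
print(jaccard_similarity(text_a, text_c, 1))  # identical unigram sets
print(jaccard_similarity(text_a, text_c, 2))  # no bigram in common

# 2. Bigram counts versus unigram counts (invented review).
review = ("very good acting but a very slow and very long plot "
          "nothing good about it").split()

unigram_counts = Counter(review)
bigram_counts = Counter(simple_ngrams(review, 2))

print(unigram_counts["very"], unigram_counts["good"])
print(bigram_counts[("very", "good")], bigram_counts[("very", "slow")])
```

In the first part, unigrams cannot tell `text_a` from its reversed copy, while bigrams can, because they keep some word order. In the second part, the unigram counts for "very" and "good" do not show which word "very" modifies; the bigram counts do.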

## 8. Wrap up and further reading

And you've reached the end of our first notebook, congratulations! You've learned the following concepts:

* regex
* tokenization
* stemming
* stopwords
* N-grams

It may seem overwhelming for now, but everything will become more intuitive as you continue your journey in the NLP world! Keep up the motivation, and see you in Part 2, where we'll use the preprocessed tokens in a model!

<img src="./media/info_everywhere.jpg" width="400">

Useful resources:

- [RegExr](https://regexr.com/3lvai)
- [RegexOne](https://regexone.com/)
- [Python Module of the Week - re](https://pymotw.com/2/re/)
- [NLTK Book](https://www.nltk.org/book/)
- [Language Model](https://en.wikipedia.org/wiki/Language_model)
- [Stanford CS124 Language Modeling slides](https://web.stanford.edu/class/cs124/lec/lm2021.pdf)
- [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Other commonly used NLP libraries:
- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/other-languages.html#python)