In [1]:
import re
import nltk

from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

# Natural Language Processing

### Introduction

As you may have noticed, this set of BLUs will revolve around the topic of Natural Language Processing (NLP). As the name implies, this field is all about the processing and handling of language in such a way that a computer may be able to do useful things with it. There are plenty of tasks and problems around it, namely:

- **Speech recognition**: the task of, given a sample of audio, extracting the words that are being spoken, or even prosody features.
- **Natural language generation**: the task of putting computational formulations into actual text, for example, automated generation of labels for images, summarisation of texts and data, creation of dialogue systems, etc.
- **Natural language understanding**: the task of getting some meaning out of the data, for instance, recognizing entities in sentences, identifying semantic roles, classifying sentences according to their sentiment, or transforming text into something machines can work on (numbers).

More formally, some of the main tasks and areas of research of NLP are:

- **Part of Speech tagging**: Determine the role of each word in a given sentence, for instance, if it is an adjective, verb, noun, etc.

- **Word Segmentation**: Break continuous text into words.

- **Parsing**: Define a tree that represents the grammatical structure of a sentence.

- **Machine Translation**: Translate sentences from a source language to a target language automatically.

- **Named entity recognition**: Find parts of the text that correspond to certain entities, like names of places, people, companies, etc.

- **Question answering**: Given a question in human language, find the most appropriate answer.

- **Text to speech**: As the name implies, transform written text into audible, human-like sounds that correspond to the given input.

Many of these tasks are out of the scope of these learning units, but we think that it is important to at least acknowledge that they exist in the realm of NLP. 


### Processing text 

Most areas that use text data require, in one way or another, that it be transformed into a more useful input. Some of these transformations may be, for example:

- Splitting words in a sentence
- Removing punctuation
- Removing common words
- Extracting suffixes or prefixes

When you first hear about these text processing tasks, for example separating the words in a sentence, they may seem _easy enough_. After all, words are separated by spaces or maybe some punctuation. But when you really think about the diversity that exists across languages, you start to understand how daunting these tasks are. Take Mandarin Chinese, for instance, where words are not separated by spaces: our heuristic is suddenly not valid anymore. And for many of the tasks, there are plenty of corner cases.

<img src="./media/xkcd_language_nerd.png" width="300">

Bottom line is, language is hard, but that's what makes this field one of the most challenging and also one of the most rewarding to work on.


Throughout these learning units we hope to give you a basic understanding of how to transform text into something useful, explain some of the challenges in this field, solve some interesting problems and hopefully make you want to learn more about the topic afterwards!

The first part of this BLU goes through some of the fundamental concepts that will be helpful for all the practical tasks that you will need during this month, but also in the future, if you ever need to work with text data. We will start by introducing **regular expressions**, followed by three important concepts in data pre-processing (**tokenization**, **stopwords**, and **stemming**). Finally, we will learn about **n-grams** and **n-gram-models**.


## Finding patterns in text

To be able to perform any of the text processing tasks mentioned in an efficient way, we need to be able to parse and detect patterns in the text programmatically. To keep things simple, we'll only focus on English for the next couple of examples.

Let's say you have the following text:

In [2]:
text_example = """
Bommer Canyon is an open space preserve in southern Irvine, California featuring hiking and 
biking trails as well as private event areas. The canyon is part of the Irvine Ranch, which 
itself is a National Natural Landmark, the first California Natural Landmark,[1][2] and part 
of the City of Irvine Open Space Preserve.[3][4] The preserve is adjacent to the affluent 
Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16,000 acres of 
preserved open space.[5] Approximately 15 of these acres are preserved as a "Cattle Camp" 
named for the area's previous cattle operations and are now rented for private events such 
as campouts, company picnics, and family reunions.[6] The trails in Bommer Canyon feature 
groves of oak and sycamore trees as well as rough rock outcrops and are popular with area 
residents who use them for nature walks, hiking and mountain biking.
"""

How would you answer the question:

> What are all the words in the following text that start with the letter `a`?

You could obviously count them manually, but if you imagine that you can have thousands or millions of lines, that becomes impossible. 

A second option, given your recently acquired skills in Python, could be to write a function that does this for you:


In [3]:
def find_all_words_starting_with_a(text):
    
    # Assuming all words can be split by spaces - big assumption
    list_of_a_words = []
    words = text.split(" ")
    for w in words:
        if w.startswith("a"):
            list_of_a_words.append(w)
    
    return list_of_a_words

print("Words found that start with 'a':")
for w in find_all_words_starting_with_a(text_example):
    print("- " + w)

Words found that start with 'a':
- an
- and
- as
- as
- areas.
- a
- and
- adjacent
- affluent
- and
- and
- acres
- acres
- are
- as
- a
- area's
- and
- are
- and
- and
- as
- as
- and
- are
- area
- and


Now let's take a slightly more complex task: we want to find and remove all punctuation, replacing it with a space when needed.

In [4]:
def remove_punctuation(text):
    
    big_list_of_punctuation = [
        ".", ",", "?", "!", "-", "_", ":", ";",
        "\"", "'", "|", "(", ")", "/", "\\",
        "[", "]",
    ]
    
    processed_text = ""
    for idx, letter in enumerate(text):
        # Neighbouring characters ("" at the edges of the text)
        previous_letter = text[idx - 1] if idx > 0 else ""
        next_letter = text[idx + 1] if idx < len(text) - 1 else ""
        if letter in big_list_of_punctuation:
            # Only add a space if there is none on either side already
            if previous_letter != " " and next_letter != " ":
                processed_text += " "
        else:
            processed_text += letter
    
    return processed_text

print(remove_punctuation(text_example))


Bommer Canyon is an open space preserve in southern Irvine California featuring hiking and 
biking trails as well as private event areas The canyon is part of the Irvine Ranch which 
itself is a National Natural Landmark the first California Natural Landmark  1  2 and part 
of the City of Irvine Open Space Preserve  3  4 The preserve is adjacent to the affluent 
Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16 000 acres of 
preserved open space  5 Approximately 15 of these acres are preserved as a Cattle Camp 
named for the area s previous cattle operations and are now rented for private events such 
as campouts company picnics and family reunions  6 The trails in Bommer Canyon feature 
groves of oak and sycamore trees as well as rough rock outcrops and are popular with area 
residents who use them for nature walks hiking and mountain biking 



Slightly more complicated, but seems to have worked! Now let's see what would happen for a different text:

In [5]:
different_text_example = """
Adenanthos venosus is an openly-branched shrub that typically grows to a height of 1–2 m 
(3 ft 3 in – 6 ft 7 in) and forms a lignotuber. Its leaves are mostly arranged in clusters 
at the ends of branches, egg-shaped, sometimes with the narrower end towards the base, 
mostly 15–20 mm (0.59–0.79 in) long, 10 mm (0.39 in) wide and sessile. The leaves are 
mostly glabrous and have a pointed tip. The flowers are ”dull crimson” to ”pinkish purple” 
with a cream-coloured band in the centre and many glandular hairs on the outside. 
"""

print(remove_punctuation(different_text_example))


Adenanthos venosus is an openly branched shrub that typically grows to a height of 1–2 m 
 3 ft 3 in – 6 ft 7 in and forms a lignotuber Its leaves are mostly arranged in clusters 
at the ends of branches egg shaped sometimes with the narrower end towards the base 
mostly 15–20 mm 0 59–0 79 in long 10 mm 0 39 in wide and sessile The leaves are 
mostly glabrous and have a pointed tip The flowers are ”dull crimson” to ”pinkish purple” 
with a cream coloured band in the centre and many glandular hairs on the outside 



Seems like this time we missed a bit, in particular the `”` characters. 

Of course we could go back and rewrite our function, but the point is that coding all these tasks from scratch is not only inefficient but also quite boring. For quotation marks alone, there are several characters that can be used, as can be seen [here](https://op.europa.eu/en/web/eu-vocabularies/formex/physical-specifications/character-encoding/quotation-marks).

And we may have endless variations and extra conditions that make everything hard to keep track of. Some examples, just off the top of our heads:

- Replacing all numbers with a placeholder
- Removing all decimals from numbers
- Counting all uppercase letters in a text
- Finding all words that have fewer than 3 letters (or any other number)
- ...

This is where **regular expressions** come in.

## Regular Expressions (aka Regex)

Regular expressions are sequences of characters that define search patterns in a standardized way. They follow a small set of rules and are one of the most fundamental tools in computer science for working with text data.

Most of the tasks that we defined before can be performed with them. Let's take our first example, finding all words starting with 'a' in the text:


In [6]:
def find_all_words_starting_with_a_regex(text):
    
    pattern = r"\ba\w+\b"
    return re.findall(pattern, text)

print("Words found that start with 'a':")
for w in find_all_words_starting_with_a_regex(text_example):
    print("- " + w)

Words found that start with 'a':
- an
- and
- as
- as
- areas
- and
- adjacent
- affluent
- and
- and
- acres
- acres
- are
- as
- area
- and
- are
- as
- and
- and
- as
- as
- and
- are
- area
- and


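Much more compact, right? Notice that the results are not exactly the same as before. Since `\b` stops at punctuation, we now get `areas` and `area` instead of `areas.` and `area's`, and we also catch an `as` at the start of a line that the space-based split missed. On the other hand, `\w+` requires at least one more character after the `a`, so the single-letter word `a` is no longer returned (using `\w*` instead would include it). Now let's do the same for punctuation: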

In [7]:
def remove_punctuation_regex(text):
    
    pattern_punkt = r"[\.,\?\!\-\_\:\;\"'\|\(\)/\\\[\]]"
    
    # Replaces all punctuation characters by spaces
    text_no_punkt = re.sub(pattern_punkt, " ", text)
    
    # Collapses multiple spaces
    text_no_punkt = re.sub(r"\s+", " ", text_no_punkt)

    return text_no_punkt
    

print(remove_punctuation_regex(text_example))

 Bommer Canyon is an open space preserve in southern Irvine California featuring hiking and biking trails as well as private event areas The canyon is part of the Irvine Ranch which itself is a National Natural Landmark the first California Natural Landmark 1 2 and part of the City of Irvine Open Space Preserve 3 4 The preserve is adjacent to the affluent Irvine villages of Shady Canyon and Turtle Ridge and features roughly 16 000 acres of preserved open space 5 Approximately 15 of these acres are preserved as a Cattle Camp named for the area s previous cattle operations and are now rented for private events such as campouts company picnics and family reunions 6 The trails in Bommer Canyon feature groves of oak and sycamore trees as well as rough rock outcrops and are popular with area residents who use them for nature walks hiking and mountain biking 


Pretty cool, right?

Regex enables us to do a bunch of interesting things with text, including much more complex ones. But don't just take our word for it: try these out yourself. You can use [this website](https://regexr.com/3lvai) to play around with regex and validate that the patterns you wrote perform as expected. See some extra examples below:

In [8]:
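# Find runs of digits (+ rather than * so that empty matches are not returned)
pattern_digits = r"[0-9]+"

# Find words with up to 3 characters
# (raw string: in a normal string "\b" would be a backspace character)
pattern_words_until_3 = r"\b\w{1,3}\b"

# Find URLs in a text
pattern_urls = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"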
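
To make a few of the tasks listed earlier concrete (replacing numbers with a placeholder, counting uppercase letters, finding short words), here is a minimal sketch using only the standard `re` module. The sample sentence and the `<NUM>` placeholder are made up for illustration.

```python
import re

# Hypothetical sample sentence, used only for this illustration
sample = "Order 66 shipped 3 of 12 boxes to an inn"

# One or more digits in a row
numbers = re.findall(r"[0-9]+", sample)

# Replace every number with a placeholder token
masked = re.sub(r"[0-9]+", "<NUM>", sample)

# Whole words made of 1 to 3 word characters (digits count as word characters)
short_words = re.findall(r"\b\w{1,3}\b", sample)

# Count uppercase letters
n_uppercase = len(re.findall(r"[A-Z]", sample))

print(numbers)
print(masked)
print(short_words)
print(n_uppercase)
```

Note that `\w` also matches digits, so `66`, `3` and `12` are returned as short "words" as well.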


Now at this point you may be looking a bit like this:

<img src="./media/regex-confusion.png" width="500">

But worry not! 

Most of us don't know regex by heart, and we need to take a look at cheatsheets from time to time, like the one shown below. Still, as you move forward in the NLP world, we hope you see how helpful regex can be.


#### Cheatsheet [\[1\]](https://regexr.com/3lvai)

`.` - matches any character, except newline.

`\d`, `\s`, `\S` - match a digit, a whitespace character, a non-whitespace character.

`\b`, `\B` - match a word boundary, a position that is not a word boundary.

`[xyz]` - matches x, y or z.

`[^xyz]` - matches anything that is not x, y or z.

`[x-z]` - matches a character between x and z.

`^xyz$` - `^` is the start of the string, `$` is the end of the string.

`\.` - use escaping to match special characters.

`\t`, `\n` - matches tab and newline.

`x*` - matches 0 or more symbols x.

`x+` - matches 1 or more symbols x.

`x?` - matches 0 or 1 symbol x.

`*?`, `+?`, `??`, etc - represent non-greedy (lazy) search.

`x{5}` - matches exactly 5 symbols x.

`x{5,}` - matches 5 or more symbols x.

`x{5,8}` - matches between 5 and 8 symbols x (written without a space after the comma).

`xy|yz` - matches `xy` or `yz`.
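
A few of these entries are easier to grasp when seen in action. The sketch below, with made-up sample strings, contrasts greedy and non-greedy matching and shows a counted quantifier and an alternation.

```python
import re

# Hypothetical sample strings, used only for this illustration
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so we get one long match
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first > it can, so we get each tag
lazy = re.findall(r"<.+?>", html)

# Counted quantifier: whole numbers with 2 to 4 digits
mid_numbers = re.findall(r"\b\d{2,4}\b", "7 42 2024 123456")

# Alternation: either of the two words
pets = re.findall(r"cat|dog", "cat dog bird")

print(greedy)
print(lazy)
print(mid_numbers)
print(pets)
```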

## Python `re` library

The Python library that we are using, [re](https://docs.python.org/3/library/re.html), has many powerful functions too. We'll dive into the most important ones in the following sections.

### Finding one instance with `search()` 

Using `search()` we can take a certain pattern and look for it in a text. This function will return a `Match` object, from which we can obtain the first text portion that was matched by our pattern.

In [9]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

print("Looking for \"Madrid\":")
match = re.search("Madrid", text)
print(match)

print("\nLooking for \"Rome\":")
match = re.search("Rome", text)
print(match)

print("\nLooking for \"Lisbon\":")
match = re.search("Lisbon", text)
print(match)

Looking for "Madrid":
<re.Match object; span=(7, 13), match='Madrid'>

Looking for "Rome":
None

Looking for "Lisbon":
<re.Match object; span=(0, 6), match='Lisbon'>


So, it is already possible to observe some things about `re.search()`:

- When there is no match, `search()` returns `None`.

- The `Match` object holds the start and end indices of the match, available via `match.start()` and `match.end()`.

- If there is more than one instance of the word in the text, only the first will be retrieved.
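
As a small sketch of these three points, the snippet below reads the position and the text out of a `Match` object, and checks for `None` before using it. The variable names are only illustrative.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

match = re.search("Madrid", text)
if match is not None:
    start, end = match.start(), match.end()  # indices of the match
    found = match.group()                    # the matched text itself
    sliced = text[start:end]                 # same text, via slicing
    print(start, end, found, sliced)

# No match: search() returns None, so always check before calling methods on it
missing = re.search("Rome", text)
print(missing)
```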

### Finding all instances with `findall()`  or `finditer()`

If we want to return all the matches to our pattern in a given text we might use the function `findall()`. In this case, the matched portions of the text will be returned, instead of the `Match` object.

In [10]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

pattern = "Lisbon"

for match in re.findall(pattern, text):
    print(match)

Lisbon
Lisbon
Lisbon


Notice that one of the words was written as _Lisbona_, but we still match the _Lisbon_ portion of that word. If we add the condition of having a whitespace character after the letter *n*, we will only get two matches.

In [11]:
pattern = r"Lisbon\s"

for match in re.findall(pattern, text):
    print(match)

Lisbon 
Lisbon 


If we really want the `Match` objects for some reason, `finditer()` should be used instead.

In [12]:
pattern = "Lisbon"

for match in re.finditer(pattern, text):
    print(match)

<re.Match object; span=(0, 6), match='Lisbon'>
<re.Match object; span=(14, 20), match='Lisbon'>
<re.Match object; span=(34, 40), match='Lisbon'>
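
One reason to prefer `finditer()` is that it gives us the position of every match. A minimal sketch, reusing the same example text:

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# (start, end) positions of every occurrence of the pattern
spans = [m.span() for m in re.finditer("Lisbon", text)]

# Same idea, but only where "Lisbon" appears as a whole word
whole_word_spans = [m.span() for m in re.finditer(r"\bLisbon\b", text)]

print(spans)
print(whole_word_spans)
```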


### Replacing all instances with `sub()`

If we want to replace the matches of our pattern in a given text with something else we need to use the function `sub()`. In this case, the matched portions of the text will be replaced, and the changed text will be returned.

For example if we wanted to remove the word `Lisbon` from a text we could do the following:

In [13]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# \b indicates a word boundary so using it around a word 
# will only replace the text when it shows as a proper word, 
# between spaces or punctuation 
pattern = r"\bLisbon\b"

print(re.sub(pattern, "", text))


 Madrid  Toulose Oslo Lisbona


If instead we wanted to replace by another word we could just specify it:

In [14]:
text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"

# \b indicates a word boundary so using it around a word 
# will only replace the text when it shows as a proper word, 
# between spaces or punctuation 
pattern = r"\bLisbon\b"

print(re.sub(pattern, "Lisboa", text))


Lisboa Madrid Lisboa Toulose Oslo Lisbona
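
Two small variations that are often handy are sketched below: limiting the number of replacements with the `count` argument, and using `re.subn()` to also get the number of replacements made. The clean-up of leftover spaces is just one possible approach.

```python
import re

text = "Lisbon Madrid Lisbon Toulose Oslo Lisbona"
pattern = r"\bLisbon\b"

# Replace only the first occurrence
first_only = re.sub(pattern, "Lisboa", text, count=1)

# Removing a word leaves extra spaces behind; collapse and trim them
removed = re.sub(pattern, "", text)
tidy = re.sub(r"\s+", " ", removed).strip()

# subn() returns the new text together with the number of replacements
new_text, n_replacements = re.subn(pattern, "Lisboa", text)

print(first_only)
print(tidy)
print(new_text, n_replacements)
```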



### A primer on patterns

Now that you are familiar with the `re` functions, we'll use them to take a closer look at the patterns that regex makes possible.

---


Going back to some of the entries in the cheatsheet, let's see with a few simple examples how they can help us!

In [15]:
text = "x xy xyy"

Remembering what we've shown previously, `.` will match any character after x:

In [16]:
re.findall("x.", text)

['x ', 'xy', 'xy']

`*` will match 0 or more y symbols after x:

In [17]:
re.findall("xy*", text)

['x', 'xy', 'xyy']

`+` will match 1 or more y symbols after x:

In [18]:
re.findall("xy+", text)

['xy', 'xyy']

`?` will match 0 or 1 y symbols after x:

In [19]:
re.findall("xy?", text)

['x', 'xy', 'xy']

`{i}` will match exactly i y symbols after x:

In [20]:
re.findall("xy{2}", text)

['xyy']

---

In [21]:
text="lotterer Jani Senna conway Kobayashi Lopez buemi Nakajima alonso"

If we want to match only the names that start with capital letters:

In [22]:
re.findall("[A-Z][a-z]+", text) # find substrings starting with a capital letter
                                # followed by 1 or more lowercase letters

['Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

If we want to match all the names that don't start with the letters "B" or "L":

In [23]:
re.findall(r"\b[^bBlL\s][A-Za-z]+", text) # find substrings after a word boundary that...
                                          # do not begin with B or L or whitespace

['Jani', 'Senna', 'conway', 'Kobayashi', 'Nakajima', 'alonso']

You may be wondering what that `r` is doing before the actual regex we are using. It is not part of the regex: it marks a Python raw string, which tells Python to keep backslashes `\` as they are instead of treating them as the start of an escape sequence (notice how our regex has `\b` and `\s`). For instance:

In [24]:
print("With r:\n")
print(r"lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")
print("\n")
print("Without r:\n")
print("lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso")

With r:

lotterer \n Jani \n Senna conway Kobayashi Lopez buemi Nakajima alonso


Without r:

lotterer 
 Jani 
 Senna conway Kobayashi Lopez buemi Nakajima alonso


In the first case, since we are using `r`, Python takes `\n` literally; in the second case, Python interprets it as the escape sequence for a newline. Another important thing to know is that because regex interprets several characters in a special way, if you want to match them literally you need to escape them. For that purpose you also use `\`: the character that comes right after it is considered escaped.

In [25]:
text = r"Text \ with + special [characters]."

In [26]:
print("Matches:\n")

for m in re.findall(r".+[ ]\ ", text): # If we don't escape the characters we mean to
    print(m)                           # find, we won't match anything and could 
                                       # even have a broken regex

Matches:



In [27]:
print("Matches:\n")

for m in re.findall(r"[\.\+\[\]\\]", text): # If we escape the characters we mean to 
    print(m)                                 # find, we'll match them literally

Matches:

\
+
[
]
.


You'll notice the `\\` in particular. Since the backslash is used to escape any special character, it is also used to escape itself.

<img src="./media/backslashes.png" width="700">
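
To tie raw strings and escaping together, here is a minimal sketch. It shows what happens to `\b` without the `r` prefix, how a literal backslash is matched, and how `re.escape()` from the standard library can escape a literal string for us. The example strings are only illustrative.

```python
import re

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees it, so the pattern no longer means
# "word boundary".
plain_pattern = "\bcat\b"
raw_pattern = r"\bcat\b"

plain_matches = re.findall(plain_pattern, "a cat sat")
raw_matches = re.findall(raw_pattern, "a cat sat")

text = r"Text \ with + special [characters]."

# A literal backslash must itself be escaped for the regex engine
backslashes = re.findall(r"\\", text)

# re.escape() escapes every special character in a literal string
literal = "[characters]."
escaped_matches = re.findall(re.escape(literal), text)

print(len(plain_pattern), len(raw_pattern))
print(plain_matches, raw_matches)
print(backslashes)
print(escaped_matches)
```

Using `re.escape()` is a convenient option when the text to search for comes from a variable and may contain characters such as `[`, `.` or `+`.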



---

Imagine now we have some extra information after the names, and that we receive a file with many lines. We still want only names starting with capital letters. So we run the previous regex and...

In [28]:
text="lotterer Rebellion\nJani Rebellion\nSenna Rebellion\nconway Toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima Toyota\nalonso Toyota"

In [29]:
re.findall("[A-Z][a-z]+", text)

['Rebellion',
 'Jani',
 'Rebellion',
 'Senna',
 'Rebellion',
 'Toyota',
 'Kobayashi',
 'Toyota',
 'Lopez',
 'Toyota',
 'Toyota',
 'Nakajima',
 'Toyota',
 'Toyota']

Well, we don't want those extra names in there. So let's try to add the symbol `^` to make sure the expression only matches at the beginning of the text.

In [30]:
re.findall("^[A-Z][a-z]+", text)

[]

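Hmm, we got a handful of nothing. Why is this happening? By default, `^` only matches at the very beginning of the string and not after each newline, so the regex treats all the text as a single line, and the first name doesn't start with a capital letter. To confirm that this is the case, let's change `lotterer` to `Lotterer` (the cell below also lowercases a few of the team names, which will be useful shortly).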

In [31]:
text="Lotterer Rebellion\nJani rebellion\nSenna Rebellion\nconway toyota\nKobayashi Toyota\nLopez Toyota\nbuemi Toyota\nNakajima toyota\nalonso Toyota"
re.findall("^[A-Z][a-z]+", text)

['Lotterer']

But we still only capture the first line. Luckily, we have [`re.MULTILINE`](https://docs.python.org/3/library/re.html#re.MULTILINE), which makes `^` and `$` match at the start and end of every line instead of only at the start and end of the whole string.

In [32]:
re.findall("^[A-Z][a-z]+", text, re.MULTILINE)

['Lotterer', 'Jani', 'Senna', 'Kobayashi', 'Lopez', 'Nakajima']

And now we were able to get all the information we wanted! And what if we wanted the second part of each line? Well, in this case, that is the last word of the line, so we may use `$`.

In [33]:
re.findall("[A-Z][a-z]+$", text, re.MULTILINE)

['Rebellion', 'Rebellion', 'Toyota', 'Toyota', 'Toyota', 'Toyota']

What if we want all full lines ending with `rebellion`?

In [34]:
re.findall(".*rebellion$", text, flags=(re.MULTILINE|re.IGNORECASE))

['Lotterer Rebellion', 'Jani rebellion', 'Senna Rebellion']

You may notice that here we are also taking advantage of the flag `re.IGNORECASE`. This is a convenient flag to add if you want case-insensitive matches. Multiple regex flags can be combined with the bitwise OR operator `|`.
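
As a small sketch of how flags change the result, the snippet below runs the same pattern with and without `re.MULTILINE`, using a shortened version of the example text. It also shows two equivalent ways of setting flags that exist in the `re` module: `re.compile()` with a `flags` argument, and inline flags such as `(?im)` at the start of the pattern.

```python
import re

# Shortened version of the example text, for illustration only
text = "Lotterer Rebellion\nJani rebellion\nconway toyota\nKobayashi Toyota"

# Without MULTILINE, ^ and $ only anchor to the whole string: no match
without_multiline = re.findall(r"^\w+ toyota$", text, flags=re.IGNORECASE)

# A compiled pattern keeps the flags together with the regex
line_pattern = re.compile(r"^\w+ toyota$", flags=re.MULTILINE | re.IGNORECASE)
toyota_lines = line_pattern.findall(text)

# Inline flags: (?im) means IGNORECASE + MULTILINE
inline_lines = re.findall(r"(?im)^\w+ toyota$", text)

print(without_multiline)
print(toyota_lines)
print(inline_lines)
```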

Regular expressions can get hard to read really fast, but even knowing the basics will certainly be helpful in the future. To better understand how they work, nothing beats practice, and sites like [this](https://regexr.com/3lvai) and [this](https://regex101.com/) are valuable visual tools for that. The Python `re` library has many more features than the ones we covered here, which might be useful for future tasks.

More suggestions for you to read about regex are:
- https://towardsdatascience.com/regular-expressions-clearly-explained-with-examples-822d76b037b4
- https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

---

## Tokenizer

One important step when dealing with text data is to _tokenize_ it. In practice, this means splitting the strings of a corpus into substrings, called tokens. This is important because it transforms a string into parts that are more suitable for the tools that exist in natural language processing. For instance, the sentence

_"The car went too fast on the second lap. This damaged the tires."_

would be better approached as a list,

_["The", "car", "went", "too", "fast", "on", "the", "second", "lap", ".", "This", "damaged", "the", "tires", "."]_ .

<img src="./media/tokenizer.png" width="500">

Hopefully by now you've realized that this task is slightly more than just splitting on spaces, and requires a bit more thought. To simplify things, there are already libraries and methods that help us implement tokenization in different ways.

We will be using the [NLTK](https://www.nltk.org/_modules/nltk/tokenize/regexp.html) implementations.

In [35]:
text = "The car went too fast on the second lap. This damaged the tires..."

In [36]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


Notice that the tokenizer is created by taking advantage of the regular expressions we learned earlier. This means that we can make different tokenizers according to what we want to split on. For instance, if we had used `[A-Z]\w+`, the tokenizer would only select the words that begin with capital letters.

In [37]:
tokenizer = RegexpTokenizer(r'[A-Z]\w+')
tokens = tokenizer.tokenize(text)
print(tokens)

['The', 'This']


Notice that NLTK already provides some pre-defined tokenizers built on top of `RegexpTokenizer`. These are:
- `BlanklineTokenizer` - Tokenize a string using blank lines as the delimiter.
- `WordPunctTokenizer` - Tokenize a string into alphabetic and non-alphabetic characters.
- `WhitespaceTokenizer` - Tokenize a string using spaces, tabs, and newlines as delimiters.

In [38]:
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import WhitespaceTokenizer

In [39]:
BlanklineTokenizer().tokenize(text)

['The car went too fast on the second lap. This damaged the tires...']

In [40]:
WordPunctTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap',
 '.',
 'This',
 'damaged',
 'the',
 'tires',
 '...']

In [41]:
WhitespaceTokenizer().tokenize(text)

['The',
 'car',
 'went',
 'too',
 'fast',
 'on',
 'the',
 'second',
 'lap.',
 'This',
 'damaged',
 'the',
 'tires...']

Notice that `WordPunctTokenizer()` is similar to the first one we defined (`RegexpTokenizer('\w+|\$[\d\.]+|\S+')`). This tokenizer is one of the most commonly used, so when we talk about tokenization without specifying further details, it is the type of tokenization that we expect you to use (for example, in exercises).
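
"Similar" does not mean identical, and the difference shows up on text with prices or apostrophes. The sketch below uses only `re.findall()`, under two assumptions: that a `RegexpTokenizer` with default settings returns the non-overlapping matches of its pattern, and that `WordPunctTokenizer` uses the pattern `\w+|[^\w\s]+` given in the NLTK documentation. The sample sentence is made up.

```python
import re

# Hypothetical sample sentence, used only for this illustration
sentence = "It costs $3.50, doesn't it?"

# Pattern of the first tokenizer defined in this notebook
custom_pattern = r"\w+|\$[\d\.]+|\S+"

# Pattern assumed for WordPunctTokenizer (taken from the NLTK docs)
wordpunct_pattern = r"\w+|[^\w\s]+"

custom_tokens = re.findall(custom_pattern, sentence)
wordpunct_tokens = re.findall(wordpunct_pattern, sentence)

print(custom_tokens)
print(wordpunct_tokens)
```

The first pattern keeps `$3.50` together as a single token, while the second one splits it into `$`, `3`, `.` and `50`. Which behaviour is preferable depends on the task.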


**Final Note**: Even though we are using the NLTK library during this BLU, other libraries are commonly used as well and may be a better fit depending on the task. Here is a list of some to consider in your future challenges in NLP:

- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/other-languages.html#python)

---

## Stemming

Stemming allows us to get the "root" of words. This is important because in certain tasks we are more interested in a broader representation of a given word than in a specific variation of it, like its plural, for instance. Before using the stemmer, it is necessary to download some resources required by `nltk` for the language we want to use. We will be working with the English language, and the cell below fetches the `stopwords` corpus with the NLTK Downloader, which the stemmer needs when it is created with `ignore_stopwords=True`.

In [42]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/catarinasilva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

So, let's see what this step gets us for the same example we have been using. To do that, we will be using the NLTK implementation of the [snowball stemmer](https://www.nltk.org/api/nltk.stem.html#nltk.stem.snowball.SnowballStemmer). Notice that there are other stemmers, some of them specific to certain tasks.

In [43]:
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)
stemmer = SnowballStemmer("english", ignore_stopwords=True)
stems = [list(map(stemmer.stem, words))]
print(stems)

[['the', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'this', 'damag', 'the', 'tire', '...']]


We can see that _"damaged"_ and _"tires"_ are transformed into simpler forms of the respective words. Notice as well that all the words have been lowercased. Lowercasing the data is also a common step in text pre-processing.

### Stopwords 

One thing that you may have noticed was the concept of "stopwords" being used. **Stopwords** are common words in a given corpus or language that, due to being so common, lose interest for most natural language processing applications. 

For instance, imagine a search engine looking through a whole range of documents. Words such as "*the*", "*a*", "*at*", etc. will be present in so many documents that using them in the search will not reduce the number of possible files that could be relevant to our query. So filtering them out is beneficial to our goal.

In the specific case of the stemmer function that we are using, defining `ignore_stopwords` as `True` will prevent the stemming of stopwords.
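
Filtering stopwords out of a list of tokens can be sketched in a couple of lines. The stopword set below is a tiny, hand-written one for illustration only, not the NLTK list; with NLTK you would typically start from `stopwords.words("english")` instead.

```python
# Tokens from the running example, already lowercased
tokens = ["the", "car", "went", "too", "fast", "on", "the", "second", "lap"]

# Tiny hypothetical stopword set, for illustration only
toy_stopwords = {"the", "too", "on", "a", "at"}

# Keep only the tokens that are not stopwords
content_tokens = [t for t in tokens if t.lower() not in toy_stopwords]

print(content_tokens)
```

Using a `set` for the stopwords keeps each membership check fast, which matters once the corpus gets large.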

In the next part of this BLU you will read about stopwords again, as they are important for the task you will be doing there.

Besides stemming there is also the process of **lemmatization**. Both processes share the goal of getting the root of the word, or more formally, reducing inflectional forms of a word to a common base form [\[7\]](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html), but they act differently. Whereas stemming follows a heuristic approach that drops the suffix of words in order to get closer to the common base form, lemmatization uses a dictionary and morphological analysis of words to return the base form of words, known as _lemma_.

Using the example in the cited reference, if shown the word _saw_, stemming would tend to return only *s*, while lemmatization would take into account if the word was the verb or the noun, and correspondingly, return _see_ or _saw_  as the base form of the word.

As you may expect, lemmatization is much more expensive in computational terms and, for certain applications, stemming might be more than enough to obtain good results. We will be using only stemming throughout the NLP learning units.
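
To make the contrast concrete, here is a deliberately crude sketch: a toy suffix-stripping function standing in for a stemmer, and a toy dictionary lookup standing in for a lemmatizer. Both are hypothetical simplifications and are not how the Snowball stemmer or a real lemmatizer work internally.

```python
def toy_stem(word):
    """Strip the first matching suffix (a crude stand-in for a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least 3 characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny hypothetical dictionary mapping word forms to their lemma
toy_lemmas = {"went": "go", "damaged": "damage", "tires": "tire"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the lowercased word."""
    return toy_lemmas.get(word.lower(), word.lower())


words_to_reduce = ["went", "damaged", "tires"]
toy_stems = [toy_stem(w) for w in words_to_reduce]
toy_lemma_list = [toy_lemmatize(w) for w in words_to_reduce]

print(toy_stems)
print(toy_lemma_list)
```

The heuristic cannot relate _went_ to _go_ and leaves _damag_ as a non-word, while the dictionary lookup handles both but only for the words it knows about.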

---

## N-Grams

_n-grams_ correspond to sequences of n consecutive elements from a given sentence. Commonly each element is a word, or "token," but we may define it as we wish for the task at hand. Usually, we refer to unigrams, bigrams, trigrams, four-grams, etc. according to the length of the sequence of elements.

For instance, for the sentence

`"The driver made a mistake"`,

we would have:

- unigrams: `The`, `driver`, `made`, `a`, `mistake`
- bigrams: `The driver`, `driver made`, `made a`, `a mistake`
- trigrams: `The driver made`, `driver made a`, `made a mistake`
- four-grams: `The driver made a`, `driver made a mistake`
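
The definition above fits in a single line of plain Python. This is a minimal sketch for intuition; the helper name `simple_ngrams` is made up for illustration.

```python
def simple_ngrams(tokens, n):
    """Return all sequences of n consecutive tokens as tuples."""
    # zip over the token list shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "The driver made a mistake".split()

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
four_grams = simple_ngrams(tokens, 4)

print(bigrams)
print(trigrams)
print(four_grams)
```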

We will create _n-grams_ by taking advantage of the [NLTK ngram](http://www.nltk.org/_modules/nltk/model/ngram.html) implementation. We will be using the tokenized list `words` created previously.

In [44]:
print(words)

['The', 'car', 'went', 'too', 'fast', 'on', 'the', 'second', 'lap', '.', 'This', 'damaged', 'the', 'tires', '...']


In [45]:
print(list(ngrams(words, 1)))

[('The',), ('car',), ('went',), ('too',), ('fast',), ('on',), ('the',), ('second',), ('lap',), ('.',), ('This',), ('damaged',), ('the',), ('tires',), ('...',)]


In [46]:
print(list(ngrams(words, 2)))

[('The', 'car'), ('car', 'went'), ('went', 'too'), ('too', 'fast'), ('fast', 'on'), ('on', 'the'), ('the', 'second'), ('second', 'lap'), ('lap', '.'), ('.', 'This'), ('This', 'damaged'), ('damaged', 'the'), ('the', 'tires'), ('tires', '...')]


In [47]:
print(list(ngrams(words, 3)))

[('The', 'car', 'went'), ('car', 'went', 'too'), ('went', 'too', 'fast'), ('too', 'fast', 'on'), ('fast', 'on', 'the'), ('on', 'the', 'second'), ('the', 'second', 'lap'), ('second', 'lap', '.'), ('lap', '.', 'This'), ('.', 'This', 'damaged'), ('This', 'damaged', 'the'), ('damaged', 'the', 'tires'), ('the', 'tires', '...')]


And by looking at the output, it's possible to observe that we are getting what we expected.

N-grams may be used for several things, like extra features in natural language processing classification tasks. Imagine counts of "very good" vs. "very" and "good" individually when doing sentiment analysis, or the difference in the counts of n-grams present in a reference and our hypothesis as a way of calculating similarity between generated texts, and so on. 
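
As a rough sketch of the "very good" idea, the snippet below counts unigrams and bigrams with `collections.Counter` from the standard library. The review text is made up for illustration.

```python
from collections import Counter

# Hypothetical review, already tokenized by splitting on spaces
review = "very good food and very good service but not very cheap".split()

# Counts of individual words
unigram_counts = Counter(review)

# Counts of pairs of consecutive words
bigram_counts = Counter(zip(review, review[1:]))

print(unigram_counts["very"], unigram_counts["good"])
print(bigram_counts[("very", "good")], bigram_counts[("very", "cheap")])
```

The unigram counts alone cannot tell that "very" modifies "good" twice and "cheap" once, whereas the bigram counts capture exactly that.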

### Wrapping it up

And you've reached the end of our first notebook, congratulations! Throughout it you've learned the following concepts:

* regex
* tokenization
* stemming
* n-grams


It may seem overwhelming, but this is the first step on your journey into the NLP world, so keep at it and see you in Part 2!

<img src="./media/info_everywhere.jpg" width="400">


___

### Reference

\[1\] - [RegExr](https://regexr.com/3lvai)

\[2\] - [Python Module of the Week](https://pymotw.com/2/re/)

\[3\] - [NLTK Book](https://www.nltk.org/book/)

\[4\] - [N-grams](https://en.wikipedia.org/wiki/N-gram#n-gram_models)

\[5\] - [Language Model](https://en.wikipedia.org/wiki/Language_model)

\[6\] - [Stanford CS124 Language Modeling slides](https://web.stanford.edu/class/cs124/lec/lm2021.pdf)

\[7\] - [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Extra Regex: [RegexOne](https://regexone.com/)