# Chapter 4: Text Preprocessing, Stemming and Lemmatization

Unlike tokenization, which splits a document into individual words, **stemming and lemmatization attempt to reduce these words further to their lexical roots**.

## Text preprocessing

### **Removing HTML**

Text scraped from the web often contains HTML tags that carry no linguistic content. In Python, the `BeautifulSoup` package (imported from `bs4`) lets us strip this markup in a few lines:

In [14]:
from bs4 import BeautifulSoup
import nltk

input_text = "<b> This text is in bold</b>, <i> This text is in italics </i>"
output_text = BeautifulSoup(input_text, "html.parser").get_text()
print('Input: ' + input_text)
print('Output: ' + output_text)

Input: <b> This text is in bold</b>, <i> This text is in italics </i>
Output:  This text is in bold,  This text is in italics 
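
If installing `BeautifulSoup` is not an option, a rough equivalent can be sketched with the standard library's `html.parser` module. The `TextExtractor` class and `strip_html` helper below are illustrative names, not part of any library. This sketch simply collects every text node, so it would also keep the contents of `<script>` or `<style>` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of markup (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


input_text = "<b> This text is in bold</b>, <i> This text is in italics </i>"
output_text = strip_html(input_text)

# Collapse the leftover runs of whitespace into single spaces
print(" ".join(output_text.split()))
```

The final line normalizes whitespace as a separate step, since removing tags tends to leave doubled or leading spaces behind.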


### **Converting text into lowercase**

Lowercasing ensures that *Cat*, *cat*, and *CAT* are all treated as the same token. This can be done very easily within Python using the built-in `str.lower()` method:

In [15]:
input_text = ['Cat','cat','CAT']
output_text = [x.lower() for x in input_text]
print('Input: ' + str(input_text))
print('Output: ' + str(output_text))

Input: ['Cat', 'cat', 'CAT']
Output: ['cat', 'cat', 'cat']


### **Removing punctuation**

Sometimes, depending on the type of model being constructed, we may wish to remove punctuation from our input text. This is particularly useful in models where we are aggregating word counts, such as in a bag-of-words representation. The presence of a full stop or a comma within the sentence doesn't add any useful information about the semantic content of the sentence. However, more complicated models that take into account the position of punctuation within the sentence may actually use the position of the punctuation to infer a different meaning. A classic example is as follows:

*The panda eats shoots and leaves*

*The panda eats, shoots, and leaves*

Here, the addition of a comma transforms the sentence describing a panda's eating habits into a sentence describing an armed robbery of a restaurant by a panda!

We can remove punctuation in Python using the `re` library: a regular expression matches any character that is neither a word character nor whitespace, and `re.sub()` replaces each match with an empty string. Note that `\w` also matches digits and underscores, so these are kept.

In [16]:
import re

input_text = "This ,sentence.'' contains-£ no:: punctuation?"
output_text = re.sub(r'[^\w\s]', '', input_text)
print('Input: ' + input_text)
print('Output: ' + output_text)

Input: This ,sentence.'' contains-£ no:: punctuation?
Output: This sentence contains no punctuation
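
To make the bag-of-words point concrete, the sketch below counts tokens in the two panda sentences with and without punctuation removal. The `bag_of_words` helper is an illustrative name and uses a plain whitespace split rather than a proper tokenizer.

```python
import re
from collections import Counter


def bag_of_words(sentence, strip_punctuation=True):
    """Count lowercase tokens, optionally removing punctuation first."""
    if strip_punctuation:
        sentence = re.sub(r'[^\w\s]', '', sentence)
    return Counter(sentence.lower().split())


plain = "The panda eats shoots and leaves"
commas = "The panda eats, shoots, and leaves"

# Without stripping, "eats," and "eats" are counted as different tokens
print(bag_of_words(commas, strip_punctuation=False))

# After stripping, both sentences yield identical counts, so the
# distinction carried by the commas is lost
print(bag_of_words(plain) == bag_of_words(commas))
```

This is the trade-off described above: the counts become cleaner, but a model built on them can no longer tell the two sentences apart.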


There may be instances where we do not wish to remove punctuation outright. A good example is the ampersand (&), which in almost every instance is used interchangeably with the word "and". Therefore, rather than removing the ampersand, we may instead opt to replace it with the word "and". We can easily implement this in Python using the `str.replace()` method:

In [17]:
input_text = "Cats & dogs"
output_text = input_text.replace("&", "and")
print('Input: ' + input_text)
print('Output: ' + output_text)

Input: Cats & dogs
Output: Cats and dogs


It is also worth considering specific circumstances where punctuation may be essential for the representation of a sentence. One crucial example is email addresses. Removing the @ from email addresses doesn't make the address any more readable:

name@gmail.com

Removing the punctuation returns this:

namegmailcom

So, in instances like this, it may be preferable to remove the whole item altogether, according to the requirements and purpose of your NLP model.
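
One way to remove such items is to match them with a regular expression before any punctuation is stripped. The pattern below is a deliberately simple approximation of an email address, not a full validator, and `replace_emails` is an illustrative name.

```python
import re

# Simplified pattern: a local part, an @, then a dotted domain.
# Real email addresses can take more forms than this matches.
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def replace_emails(text, placeholder=''):
    """Replace anything that looks like an email address with placeholder."""
    replaced = EMAIL_PATTERN.sub(placeholder, text)
    # Tidy up any doubled spaces left behind by the removal
    return ' '.join(replaced.split())


input_text = "Contact name@gmail.com for details."
print('Input: ' + input_text)
print('Removed: ' + replace_emails(input_text))
print('Masked: ' + replace_emails(input_text, '<EMAIL>'))
```

Passing a placeholder instead of an empty string keeps a marker that an address was present, which may be useful for some models. If punctuation is stripped afterwards, a placeholder without punctuation characters would be the safer choice.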

### **Replacing numbers**

Similarly, with numbers, we also want to standardize our outputs. Numbers can be written as digits *(9, 8, 7)* or as actual words *(nine, eight, seven)*. It may be worth transforming these all into a single, standardized representation so that *1* and *one* are not treated as separate entities. We can do this in Python using the `inflect` package, which converts digit strings into words:

In [19]:
import inflect

# Create the inflect engine once and reuse it for every call
inflect_engine = inflect.engine()

def to_digit(digit):
    """Spell out a digit string in words; leave any other token unchanged."""
    if digit.isdigit():
        return inflect_engine.number_to_words(digit)
    return digit

input_text = ["1","two","3"]
output_text = [to_digit(x) for x in input_text]
print('Input: ' + str(input_text))
print('Output: ' + str(output_text))

Input: ['1', 'two', '3']
Output: ['one', 'two', 'three']

This shows that we have successfully converted our digits into text.

However, in a similar fashion to processing email addresses, processing phone numbers may not require the same representation as regular numbers. This is illustrated in the following example:

In [20]:
input_text = ["0800118118"]
output_text = [to_digit(x) for x in input_text]

print('Input: ' + str(input_text))
print('Output: ' + str(output_text))

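Because `to_digit` treats the whole digit string as a single number, the phone number is spelled out as one very large quantity rather than as a sequence of digits, which is rarely a useful representation. As with email addresses, it may be preferable to handle phone numbers separately or to remove them altogether.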
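
One possible way to keep phone numbers away from the number-to-words step is to swap them for a placeholder token first. The sketch below assumes, purely for illustration, that any digit string of seven or more digits, or any multi-digit string with a leading zero, is a phone number. `looks_like_phone`, `mask_phone_numbers`, and the `PHONE` token are hypothetical names, and real phone formats vary far more widely.

```python
def looks_like_phone(token, min_digits=7):
    """Heuristic: long digit strings, or digit strings with a leading zero."""
    if not token.isdigit():
        return False
    return len(token) >= min_digits or (len(token) > 1 and token.startswith('0'))


def mask_phone_numbers(tokens, placeholder='PHONE'):
    """Replace phone-like tokens with a placeholder, leaving others as-is."""
    return [placeholder if looks_like_phone(t) else t for t in tokens]


input_text = ["1", "two", "3", "0800118118"]
output_text = mask_phone_numbers(input_text)
print('Input: ' + str(input_text))
print('Output: ' + str(output_text))
```

Since the placeholder is not a digit string, a function such as `to_digit` would leave it unchanged when applied to the masked list.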

## Stemming and Lemmatization

Stemming and lemmatization are the processes by which we arrive at these root words. **Stemming** is an algorithmic process in which the ends of words are cut off to arrive at a common root, whereas **lemmatization** uses a true vocabulary and structural analysis of the word itself to arrive at its true root, or **lemma**.

### **Stemming**

Stemming is the algorithmic process by which we trim the ends off words in order to arrive at their lexical roots, or **stems**. To do this, we can use different **stemmers** that each follow a particular algorithm in order to return the stem of a word. In English, one of the most common stemmers is the **Porter Stemmer**.

The **Porter Stemmer** is an algorithm with a large number of logical rules that can be used to return the stem of a word. We will first show how to implement a Porter Stemmer in Python using NLTK before moving on and discussing the algorithm in more detail:

In [None]:
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# 1 First we create an instance of the Porter Stemmer
porter = PorterStemmer()
# 2 We then simply call this instance of the stemmer
# on individual words and print the results
word_list = ["see","saw","cat", "cats", "stem", "stemming",
            "lemma","lemmatization","known","knowing","time",
            "timing","football", "footballers"]
for word in word_list:
    print(word + ' -> ' + porter.stem(word))

# 3 We can also apply stemming to an entire sentence,
# first by tokenizing the sentence and then by stemming each term
# individually
def sentence_stemmer(sentence):
    tokens = word_tokenize(sentence)
    stems = [porter.stem(word) for word in tokens]
    # Join with spaces so that the stemmed tokens stay separate
    return " ".join(stems)

sentence_stemmer('The cats and dogs are running')

see -> see
saw -> saw
cat -> cat
cats -> cat
stem -> stem
stemming -> stem
lemma -> lemma
lemmatization -> lemmat
known -> known
knowing -> know
time -> time
timing -> time
football -> footbal
footballers -> footbal


'the cat and dog are run'
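
To give a feel for the kind of rule a Porter Stemmer applies, the sketch below implements only the first plural-handling step (step 1a) of the original Porter algorithm in plain Python. The function name `porter_step_1a` is illustrative. The full algorithm has several further steps and conditions, and NLTK's implementation includes its own extensions, so this fragment will not match `porter.stem` in general.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm (plural suffixes).

    The rules are tried in order, longest suffix first:
    SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    """
    if word.endswith('sses'):
        return word[:-2]
    if word.endswith('ies'):
        return word[:-2]
    if word.endswith('ss'):
        return word
    if word.endswith('s'):
        return word[:-1]
    return word


for word in ["caresses", "ponies", "caress", "cats", "stem"]:
    print(word + ' -> ' + porter_step_1a(word))
```

Note that *ponies* becomes *poni* here, which illustrates why a stem is not always a real word.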

### **Lemmatization**

Lemmatization differs from stemming in that it reduces words to their **lemma** instead of their stem. While a stem is simply the string left over once the end of a word has been trimmed, and need not be a real word, **a word's lemma is its true lexical root**. So, while the stem of the word *ran* will just be *ran*, its lemma is *run*. We will look at using the WordNet Lemmatizer within NLTK.

In [None]:
# We will first create an instance of our lemmatizer and call it on a
# selection of words
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('horses'))
print(wordnet_lemmatizer.lemmatize('wolves'))
print(wordnet_lemmatizer.lemmatize('mice'))
print(wordnet_lemmatizer.lemmatize('cacti'))

horse
wolf
mouse
cactus


Here, we can already begin to see the advantages of using lemmatization over stemming. Since the WordNet Lemmatizer is built on WordNet, a large lexical database of English, it knows that *mice* is the plural of *mouse*. The drawback of relying on a vocabulary appears when a word is not in it:

In [None]:
print(wordnet_lemmatizer.lemmatize('madeupwords'))
print(porter.stem('madeupwords'))

madeupwords
madeupword


Here, the stemmer still strips the plural *s* from a word it has never seen, whereas the lemmatizer returns the word unchanged. In this instance, our stemmer is able to generalize better to previously unseen words. Therefore, using a lemmatizer may be a problem if we're lemmatizing sources whose language departs from standard English, such as social media sites where people frequently abbreviate words.

If we call our lemmatizer on two verbs, we will see that this doesn't reduce them to their expected common lemma:

In [None]:
print(wordnet_lemmatizer.lemmatize('run'))
print(wordnet_lemmatizer.lemmatize('ran'))

run
ran


This is because our lemmatizer needs to know each word's part of speech in order to return the correct lemma, and `lemmatize()` treats every word as a noun unless told otherwise. Recall from our POS analysis that we can easily determine whether a given word in a sentence is a noun, verb, or adjective. For now, let's manually specify that our words are verbs using the `pos` argument. We can see that this now correctly returns the lemma:

In [None]:
print(wordnet_lemmatizer.lemmatize('ran', pos='v'))
print(wordnet_lemmatizer.lemmatize('run', pos='v'))

run
run


This means that in order to return the correct lemmatization of any given sentence, we must first perform POS tagging to obtain the context of the words in the sentence, then pass this through the lemmatizer to obtain the lemmas of each of the words in the sentence. We first create a function that will return our POS tagging for each word in the sentence:

In [None]:
sentence = 'The cats and dogs are running'

def return_word_pos_tuples(sentence):
    return nltk.pos_tag(nltk.word_tokenize(sentence))

return_word_pos_tuples(sentence)

[('The', 'DT'),
 ('cats', 'NNS'),
 ('and', 'CC'),
 ('dogs', 'NNS'),
 ('are', 'VBP'),
 ('running', 'VBG')]
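
The Penn Treebank tags returned by `nltk.pos_tag` (for example `NNS` or `VBG`) are not the single-letter codes that `lemmatize()` accepts for its `pos` argument (`'n'`, `'v'`, `'a'`, `'r'`), so a mapping step is needed in between. The sketch below shows one common way to do this. `penn_to_wordnet` and `lemmatize_tagged` are illustrative names, and the `lemmatize` parameter is assumed to be any callable with the same `(word, pos=...)` signature as `wordnet_lemmatizer.lemmatize`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun, also used as the fallback for all other tags


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs using the supplied lemmatize callable."""
    return [lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


tagged = [('The', 'DT'), ('cats', 'NNS'), ('and', 'CC'),
          ('dogs', 'NNS'), ('are', 'VBP'), ('running', 'VBG')]

# Show which WordNet POS code each token would be lemmatized with
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK available, calling `lemmatize_tagged(return_word_pos_tuples(sentence), wordnet_lemmatizer.lemmatize)` would apply this to the example sentence, so that verbs such as *running* are lemmatized as verbs rather than as nouns.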

Now that we have seen both lemmatization and stemming in action, the question still remains as to which of these techniques we should use, and when. We saw that both techniques attempt to reduce each word to its root. In stemming, this may just be a reduced form of the target word, whereas lemmatization reduces the word to a true English language root.

Because lemmatization requires cross-referencing the target word within the WordNet corpus, as well as performing part-of-speech analysis to determine the form of the lemma, it may take a significant amount of processing time if a large number of words have to be lemmatized. This is in contrast to stemming, which uses a detailed but relatively fast algorithm to stem words. Ultimately, as with many problems in computing, choosing which of these methods to incorporate in our deep learning pipeline is a trade-off between speed and accuracy. If time is of the essence, then stemming may be the way to go. On the other hand, if you need your model to be as detailed and as accurate as possible, then lemmatization will likely result in the superior model.