# **Tutorial:** Text Preprocessing Techniques
---

Text data seldom comes clean. People communicate through text, and it often contains noise in various forms, such as emoticons and punctuation.

Text preprocessing is an important part of the NLP pipeline. In this tutorial, you will learn about various text preprocessing techniques, such as lowercasing, noise removal, tokenization, stemming, lemmatization, and text normalization.

At the end of this tutorial, you will be able to:
* explain various text preprocessing techniques in NLP
* clean simple texts using basic `python` and several text libraries such as `nltk`, `spacy` and `regex`

## Lowercasing
---

You will use the same text for all preprocessing shown below. The text is as follows:

In [None]:
text = '''Singapore Polytechnic has many courses to help you to upskill and
learn. Isn't it great ❤? #sp_nlp #ms9007 #practical-nlp'''

text

"Singapore Polytechnic has many courses to help you to upskill and\nlearn. Isn't it great ❤? #sp_nlp #ms9007 #practical-nlp"

To lowercase a text, use `text.lower()`.

In [None]:
text.lower()

"singapore polytechnic has many courses to help you to upskill and\nlearn. isn't it great ❤? #sp_nlp #ms9007 #practical-nlp"

## Removing punctuation
---

In [None]:
# import the necessary libraries
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
def remove_punctuation(text):
    no_punctuation = "".join([i for i in text if i not in string.punctuation])
    return no_punctuation

remove_punctuation(text)

'Singapore Polytechnic has many courses to help you to upskill and\nlearn Isnt it great ❤ spnlp ms9007 practicalnlp'
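
The same result can usually be obtained with `str.translate`, which avoids the character-by-character loop. This is a minimal sketch of an alternative, not the tutorial's own method; the function name `remove_punctuation_fast` and the sample string are made up for illustration. Note that `string.punctuation` only covers ASCII punctuation, which is why a symbol such as ❤ survives in both versions.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Delete ASCII punctuation using a precomputed translation table."""
    return text.translate(_PUNCT_TABLE)

sample = "Isn't it great? #sp_nlp #practical-nlp"
print(remove_punctuation_fast(sample))
```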

## Tokenization
---

Tokenization is the process of breaking up a text into smaller parts (called *tokens*). Tokens can be sentences, words, characters, or subwords. The most common types of tokenization are **sentence tokenization** (a.k.a. sentence segmentation) and **word tokenization**.
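
To make the different token granularities concrete, here is a small sketch that uses only built-in Python. The sample string is an assumption chosen for brevity. Sentence and subword tokenization normally need a library or a trained model, so they are not shown here.

```python
sample = "NLP is fun"

# Word-level tokens: split on whitespace
word_tokens = sample.split()

# Character-level tokens: every character (including spaces) is a token
char_tokens = list(sample)

print(word_tokens)
print(char_tokens[:5])
```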


### Sentence tokenization
---

In [None]:
# import the necessary libraries
import nltk
nltk.download('punkt')

from nltk import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
sent_tokenize(text)

['Singapore Polytechnic has many courses to help you to upskill and\nlearn.',
 "Isn't it great ❤?",
 '#sp_nlp #ms9007 #practical-nlp']
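
To see why a trained sentence tokenizer is useful, compare it with a naive rule-based splitter. The function `naive_sent_split` below is a hypothetical helper written for this illustration, and both sample strings are made up. The second call shows how a simple rule breaks on an abbreviation.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (a deliberately naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_split("Courses help you learn. Isn't it great? #sp_nlp"))
print(naive_sent_split("Dr. Tan teaches NLP. He is great."))
```

The rule treats the period in `Dr.` as a sentence boundary, which is the kind of case a trained model is designed to handle better.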

### Word tokenization
---

There are many word tokenizers available in Python. In this course, we will use the `word_tokenize(text)` function from the `nltk` package, unless otherwise mentioned.

In [None]:
# using nltk word_tokenize (recommended in this course)
from nltk import word_tokenize
word_tokenize(text)

['Singapore',
 'Polytechnic',
 'has',
 'many',
 'courses',
 'to',
 'help',
 'you',
 'to',
 'upskill',
 'and',
 'learn',
 '.',
 'Is',
 "n't",
 'it',
 'great',
 '❤',
 '?',
 '#',
 'sp_nlp',
 '#',
 'ms9007',
 '#',
 'practical-nlp']

Besides the standard `nltk.word_tokenize(text)` function, many other word tokenizers are available in Python libraries such as `nltk`, `textblob`, `spacy` and `gensim`.

Run the cells below and compare the outputs against the standard `nltk.word_tokenize(text)` output above.

In [None]:
# manual whitespace tokenization
text.split()

['Singapore',
 'Polytechnic',
 'has',
 'many',
 'courses',
 'to',
 'help',
 'you',
 'to',
 'upskill',
 'and',
 'learn.',
 "Isn't",
 'it',
 'great',
 '❤?',
 '#sp_nlp',
 '#ms9007',
 '#practical-nlp']

In [None]:
text

"Singapore Polytechnic has many courses to help you to upskill and\nlearn. Isn't it great ❤? #sp_nlp #ms9007 #practical-nlp"

In [None]:
# using nltk punctuation-based tokenizer
# Note that this tokenizer is a simpler, regular-expression based tokenizer which splits on white space and punctuation.
from nltk import wordpunct_tokenize
wordpunct_tokenize(text)

['Singapore',
 'Polytechnic',
 'has',
 'many',
 'courses',
 'to',
 'help',
 'you',
 'to',
 'upskill',
 'and',
 'learn',
 '.',
 'Isn',
 "'",
 't',
 'it',
 'great',
 '❤?',
 '#',
 'sp_nlp',
 '#',
 'ms9007',
 '#',
 'practical',
 '-',
 'nlp']
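
The behaviour of `wordpunct_tokenize` can be approximated with a single regular expression. The sketch below uses the pattern `\w+|[^\w\s]+`, which to the best of our knowledge is the pattern NLTK uses internally. Treat that as an assumption and check the NLTK source if it matters. The sample string is a shortened version of the tutorial text.

```python
import re

sample = "Isn't it great? #sp_nlp #practical-nlp"

# Runs of word characters, or runs of characters that are neither word characters nor whitespace
tokens = re.findall(r"\w+|[^\w\s]+", sample)
print(tokens)
```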

In [None]:
# using nltk treebank word tokenizer
from nltk import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

['Singapore',
 'Polytechnic',
 'has',
 'many',
 'courses',
 'to',
 'help',
 'you',
 'to',
 'upskill',
 'and',
 'learn.',
 'Is',
 "n't",
 'it',
 'great',
 '❤',
 '?',
 '#',
 'sp_nlp',
 '#',
 'ms9007',
 '#',
 'practical-nlp']

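<font color="blue">Notice that `TreebankWordTokenizer` splits the contraction `Isn't` into `Is` and `n't`, rather than at the apostrophe as `wordpunct_tokenize` does. It also leaves the period attached in `learn.`, because it expects its input to be a single sentence.</font>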

In [None]:
# using nltk tweet tokenizer
from nltk import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize(text)

['Singapore',
 'Polytechnic',
 'has',
 'many',
 'courses',
 'to',
 'help',
 'you',
 'to',
 'upskill',
 'and',
 'learn',
 '.',
 "Isn't",
 'it',
 'great',
 '❤',
 '?',
 '#sp_nlp',
 '#ms9007',
 '#practical-nlp']
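
If you only need the hashtags rather than a full tokenization, a regular expression is often enough. The pattern below is a simplified assumption about what a hashtag looks like; it is not the rule that `TweetTokenizer` applies.

```python
import re

sample = "Isn't it great? #sp_nlp #ms9007 #practical-nlp"

# A '#' followed by word characters or hyphens (a simplified, assumed hashtag pattern)
hashtags = re.findall(r"#[\w-]+", sample)
print(hashtags)
```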

---
**Textblob:** Documentation [here](https://textblob.readthedocs.io/en/dev/quickstart.html).

In [None]:
# using textblob
from textblob import TextBlob
blob_object = TextBlob(text)

# word tokenization
blob_object.words

WordList(['Singapore', 'Polytechnic', 'has', 'many', 'courses', 'to', 'help', 'you', 'to', 'upskill', 'and', 'learn', 'Is', "n't", 'it', 'great', '❤', 'sp_nlp', 'ms9007', 'practical-nlp'])

Notice that the TextBlob tokenizer removes punctuation. In addition, it has rules for English contractions, so `Isn't` is split into `Is` and `n't`.

---
**spaCy**: Documentation [here](https://spacy.io/usage/linguistic-features#tokenization).

In [None]:
# using spacy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# word tokenization
for token in doc:
    print(token, token.pos_)

Singapore PROPN
Polytechnic PROPN
has VERB
many ADJ
courses NOUN
to PART
help VERB
you PRON
to PART
upskill VERB
and CCONJ

 SPACE
learn VERB
. PUNCT
Is AUX
n't PART
it PRON
great ADJ
❤ PUNCT
? PUNCT
# SYM
sp_nlp NOUN
# SYM
ms9007 VERB
# DET
practical ADJ
- PUNCT
nlp NOUN


---
**Gensim**

In [None]:
# using gensim
from gensim.utils import tokenize
list(tokenize(text))

['Singapore',
 'Polytechnic',
 'has',
 'many',
 'courses',
 'to',
 'help',
 'you',
 'to',
 'upskill',
 'and',
 'learn',
 'Isn',
 't',
 'it',
 'great',
 'sp_nlp',
 'ms',
 'practical',
 'nlp']

Congratulations! You have now seen how various Python libraries perform word tokenization. Each library has its own advantages and disadvantages. None of them is perfect, so choose the one that best suits the application at hand.

## Removing stopwords
---

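First, you need to understand what *stopwords* are: very common words, such as "the", "is" and "to", that carry little meaning on their own and are often removed before further analysis.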

Execute the following cells to print out the common English stopwords in `nltk`.

In [None]:
# import the necessary libraries
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# show stopwords list in nltk
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Next, you will use the same text as above and remove both the punctuation and the stopwords.

In [None]:
text = '''Singapore Polytechnic has many courses to help you to upskill and \
learn. Isn't it great ❤? #sp_nlp #ms9007 #practical-nlp'''

def remove_punctuation(text):
    no_punctuation = "".join([i for i in text if i not in string.punctuation])
    return no_punctuation

text_no_punct = remove_punctuation(text)

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)
tokens_no_punct = word_tokenize(text_no_punct)
tokens_no_punct_stopwords = [w for w in tokens_no_punct
                             if w.lower() not in stop_words]

print(f'Original text: {tokens}')
print(f'Remove punctuation: {tokens_no_punct}')
print(f'Remove punctuation and stopwords: {tokens_no_punct_stopwords}')

Original text: ['Singapore', 'Polytechnic', 'has', 'many', 'courses', 'to', 'help', 'you', 'to', 'upskill', 'and', 'learn', '.', 'Is', "n't", 'it', 'great', '❤', '?', '#', 'sp_nlp', '#', 'ms9007', '#', 'practical-nlp']
Remove punctuation: ['Singapore', 'Polytechnic', 'has', 'many', 'courses', 'to', 'help', 'you', 'to', 'upskill', 'and', 'learn', 'Isnt', 'it', 'great', '❤', 'spnlp', 'ms9007', 'practicalnlp']
Remove punctuation and stopwords: ['Singapore', 'Polytechnic', 'many', 'courses', 'help', 'upskill', 'learn', 'Isnt', 'great', '❤', 'spnlp', 'ms9007', 'practicalnlp']
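
In the output above, `Isnt` survives because the apostrophe was stripped before the stopword check, so the token no longer matches any stopword. One possible remedy is to tokenize first and filter afterwards. The sketch below assumes a tiny hand-written stopword set and a simple regular-expression tokenizer, both chosen for illustration; neither is the `nltk` pipeline used in this tutorial, and the `preprocess` name is hypothetical.

```python
import re
import string

# Tiny illustrative stopword set (an assumption; use stopwords.words('english') in practice)
STOP_WORDS = {"has", "to", "you", "and", "it", "is", "isn't"}

def preprocess(text):
    """Lowercase, tokenize, then drop punctuation-only tokens and stopwords."""
    # Keep internal apostrophes so contractions stay in one piece
    tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text.lower())
    return [
        t for t in tokens
        if t not in STOP_WORDS and not all(ch in string.punctuation for ch in t)
    ]

print(preprocess("Singapore Polytechnic has many courses to help you. Isn't it great?"))
```

The order of the steps matters: filtering stopwords before removing punctuation lets contractions be matched in their original form.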


## Stemming / Lemmatization
---
**Stemming** is the process of reducing a word to its root form (its *stem*), even if the stem is not a dictionary word.

**Lemmatization** is the process of reducing a word to its **lemma**, or **dictionary form**. Unlike stemming, it takes into account the role of the word in the sentence, for example its part of speech.
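
To illustrate why a stem need not be a dictionary word, here is a toy suffix-stripping function. It is a deliberately crude sketch, not the Porter or Lancaster algorithm; the suffix list, the `min_stem` parameter and the `toy_stem` name are assumptions made for this example.

```python
# Suffixes are checked in order, so longer ones come first
SUFFIXES = ("ations", "ation", "ies", "ing", "ed", "ly", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least `min_stem` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["features", "variations", "easily", "is"]])
```

Stems such as `featur` and `easi` are not English words, yet they still group related word forms together, which is all a stemmer aims to do.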

In [None]:
# import necessary libraries
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

# instantiate the stemmers and the lemmatizer
ps = PorterStemmer()
ls = LancasterStemmer()
wnl = WordNetLemmatizer()

sentence = '''Such an analysis can reveal features that are not easily visible
from the variations in the individual genes and can lead to a picture of
expression that is more biologically transparent and accessible to
interpretation'''

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
tokenized_sentence = nltk.tokenize.word_tokenize(sentence)

stemmed_ps = [ps.stem(word) for word in tokenized_sentence]
stemmed_ls = [ls.stem(word) for word in tokenized_sentence]
lemmatized = [wnl.lemmatize(word) for word in tokenized_sentence]

print(f'Tokenized sentence: {tokenized_sentence}')
print(f'Stemmed sentence (porter): {stemmed_ps}')
print(f'Stemmed sentence (lancaster): {stemmed_ls}')
print(f'Lemmatized sentence: {lemmatized}')

Tokenized sentence: ['Such', 'an', 'analysis', 'can', 'reveal', 'features', 'that', 'are', 'not', 'easily', 'visible', 'from', 'the', 'variations', 'in', 'the', 'individual', 'genes', 'and', 'can', 'lead', 'to', 'a', 'picture', 'of', 'expression', 'that', 'is', 'more', 'biologically', 'transparent', 'and', 'accessible', 'to', 'interpretation']
Stemmed sentence (porter): ['such', 'an', 'analysi', 'can', 'reveal', 'featur', 'that', 'are', 'not', 'easili', 'visibl', 'from', 'the', 'variat', 'in', 'the', 'individu', 'gene', 'and', 'can', 'lead', 'to', 'a', 'pictur', 'of', 'express', 'that', 'is', 'more', 'biolog', 'transpar', 'and', 'access', 'to', 'interpret']
Stemmed sentence (lancaster): ['such', 'an', 'analys', 'can', 'rev', 'feat', 'that', 'ar', 'not', 'easy', 'vis', 'from', 'the', 'vary', 'in', 'the', 'individ', 'gen', 'and', 'can', 'lead', 'to', 'a', 'pict', 'of', 'express', 'that', 'is', 'mor', 'biolog', 'transp', 'and', 'access', 'to', 'interpret']
Lemmatized sentence: ['Such', 'a

In [None]:
# display results in the form of table, for comparison
import pandas as pd

pd.DataFrame(
    {'Tokenized': tokenized_sentence,
     'Porter': stemmed_ps,
     'Lancaster': stemmed_ls,
     'Lemmatized': lemmatized
     })

| | Tokenized | Porter | Lancaster | Lemmatized |
|---|---|---|---|---|
| 0 | Such | such | such | Such |
| 1 | an | an | an | an |
| 2 | analysis | analysi | analys | analysis |
| 3 | can | can | can | can |
| 4 | reveal | reveal | rev | reveal |
| 5 | features | featur | feat | feature |
| 6 | that | that | that | that |
| 7 | are | are | ar | are |
| 8 | not | not | not | not |
| 9 | easily | easili | easy | easily |


Notice that the lemmatized output above is imperfect. You would expect "is" and "are" to be lemmatized to "be", but they are left unchanged. This is because `WordNetLemmatizer.lemmatize()` needs the correct `pos` (part of speech) argument to produce an accurate result, and `pos` defaults to `"n"` (noun). Refer to the [WordNet documentation](https://www.nltk.org/_modules/nltk/stem/wordnet.html) for details.

In [None]:
# with the default pos="n", "are" is treated as a noun and is not reduced to "be"
wnl.lemmatize("are")

'are'

In [None]:
# notice now that the output is correct
# pos = part of speech: v = verb, a = adjective, r = adverb
wnl.lemmatize("are", pos="v")

'be'

In [None]:
wnl.lemmatize("better")

'better'

In [None]:
wnl.lemmatize("better", pos="a")

'good'

In [None]:
wnl.lemmatize("better", pos="r")

'well'

In [None]:
# by default, pos="n"
wnl.lemmatize("meeting")

'meeting'

In [None]:
wnl.lemmatize("meeting", pos="v")

'meet'
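
Supplying `pos` by hand does not scale to whole sentences. A common approach is to run a part-of-speech tagger first and translate its tags into the letters that `lemmatize` accepts. The helper below is a hypothetical sketch that assumes Penn Treebank style tags (such as those returned by `nltk.pos_tag`); the function name `penn_to_wordnet` is made up for this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet pos letter (defaults to noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, matching the lemmatizer's default

for tag in ["VBP", "JJR", "RBR", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```

You could then call `wnl.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair produced by the tagger.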

## <font color="blue">**Conclusion**</font>
---
Congratulations! You have learnt how to use `nltk` for various text preprocessing tasks that you are likely to encounter in your projects.

Next, you will build on these skills with a case study: how to preprocess social media responses in the form of tweets.

Please proceed to the next tutorial.