
**Text Processing**

In this notebook, we go over some simple techniques to clean and prepare text data for modeling with machine learning.

1. Simple text cleaning processes
2. Lexicon-based text processing
    - Stop words removal
    - Stemming
    - Lemmatization

    
**1. Simple text cleaning processes**


In this section, we will do some general-purpose text cleaning. The following cleaning methods can be extended depending on the application.

In [1]:
text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "
print(text)

   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


In [2]:
# Lowercase
print(text.lower())

   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


In [3]:
# remove leading/trailing whitespace 
print(text.strip())

This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .


In [4]:
# Remove HTML tags/markup
import re

text = re.compile("<.*?>").sub("", text)
print(text)

   This is a message to be cleaned. It may involve some things like: , ?, :, ''  adjacent spaces and tabs     .  
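The pattern above uses the non-greedy quantifier `.*?`. The following minimal sketch, using a made-up `sample` string that is not part of the original notebook, suggests why this matters: a greedy `.*` would match from the first `<` to the last `>` and delete the text between two tags.

```python
import re

# Hypothetical sample string with an opening and a closing tag
sample = "a <b>bold</b> word"

# Non-greedy: each tag is matched separately, so the text between tags is kept
lazy = re.sub(r"<.*?>", "", sample)

# Greedy: one match spans from the first '<' to the last '>'
greedy = re.sub(r"<.*>", "", sample)

print(repr(lazy))
print(repr(greedy))
```

Neither pattern is a full HTML parser; for real web pages a dedicated parser is usually a safer choice.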


In [5]:
# Replace punctuation with a space.
# Note: punctuation can be useful in some applications, so apply this step with care.
import re
import string

text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
print(text)

   This is a message to be cleaned  It may involve some things like              adjacent spaces and tabs        


In [6]:
print(re.escape(string.punctuation))

\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^_\`\{\|\}\~


In [7]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
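The individual steps in this section can be combined into one helper. The sketch below is one possible arrangement, not part of the original notebook: `clean_text` is a hypothetical name, and the whitespace-collapsing step is an added assumption, since the cells above never merge the adjacent spaces and tabs.

```python
import re
import string

# Compiled once so the patterns can be reused across many documents
TAG_RE = re.compile(r"<.*?>")  # HTML-like tags
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))  # ASCII punctuation
SPACE_RE = re.compile(r"\s+")  # runs of spaces, tabs, newlines


def clean_text(text):
    """Apply the cleaning steps from this section in order (hypothetical helper)."""
    text = text.lower()             # lowercase
    text = TAG_RE.sub("", text)     # remove HTML tags/markup
    text = PUNCT_RE.sub(" ", text)  # replace punctuation with a space
    text = SPACE_RE.sub(" ", text)  # assumption: collapse adjacent whitespace
    return text.strip()             # remove leading/trailing whitespace


raw = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "
cleaned = clean_text(raw)
print(cleaned)
```

The order matters here: tags are removed before punctuation, because `<` and `>` are themselves punctuation characters and would otherwise be replaced first, leaving `br` behind as a word.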


**2. Lexicon-based text processing**

We saw some general-purpose text pre-processing methods in the previous section. Lexicon-based methods are usually applied after those common text processing steps. They are used to normalize the sentences in our dataset. By normalization, we mean putting words into a similar format, which also enhances similarities (if any) between sentences.

We need to download some NLTK resources for this example. Run the following cell.

In [8]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [10]:
# remove stop words
import nltk
from nltk.tokenize import word_tokenize

filtered_sentence = []

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

# Tokenize the text into words
words = word_tokenize(text)
for w in words:
    # The comparison is case-sensitive, so "This" and "It" are kept
    if w not in stop_words:
        filtered_sentence.append(w)
    
text = " ".join(filtered_sentence)

In [11]:
print(text)

This message be cleaned It may involve some things like adjacent spaces tabs
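Notice that "This" and "It" survive in the output, because the stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. The helper name `remove_stop_words` is hypothetical, and plain `str.split` is used instead of `word_tokenize` only to keep the example dependency-free.

```python
# Same stop words as above, stored in a set for fast membership tests
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "it", "to", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase (hypothetical helper)."""
    tokens = text.split()  # assumption: whitespace tokenization is good enough here
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)


sentence = "This is a message to be cleaned It may involve some things"
result = remove_stop_words(sentence)
print(result)
```

Lowercasing only for the comparison keeps the original casing of the remaining words, which can be useful if casing carries meaning later in the pipeline.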


**Stemming**

Stemming is a rule-based process that converts words into their root form by removing suffixes. This helps us enhance similarities (if any) between sentences.

Example:

"jumping", "jumped" -> "jump"

"cars" -> "car"

In [15]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "

tokenized_words = word_tokenize(text)

snow = SnowballStemmer('english')
stem_words = [snow.stem(w) for w in tokenized_words]

for w, s in zip(tokenized_words, stem_words):
    print(f"{w:<10} -> {s}")

This       -> this
is         -> is
a          -> a
message    -> messag
to         -> to
be         -> be
cleaned    -> clean
.          -> .
It         -> it
may        -> may
involve    -> involv
some       -> some
things     -> thing
like       -> like
:          -> :
<          -> <
br         -> br
>          -> >
,          -> ,
?          -> ?
,          -> ,
:          -> :
,          -> ,
''         -> ''
adjacent   -> adjac
spaces     -> space
and        -> and
tabs       -> tab
.          -> .


You can see above that the stemming operation is NOT perfect. It produces stems such as "messag", "involv", and "adjac", which are not real words. Stemming is a rule-based method that sometimes mistakenly removes suffixes from words. Nevertheless, it runs fast.
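To make the rule-based behavior concrete, here is a toy suffix stripper. It is only an illustration and is NOT the Snowball algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for this sketch.

```python
# Ordered suffix rules: the first suffix that matches is removed (assumed rules)
SUFFIXES = ("ing", "ed", "s", "e")
MIN_STEM_LENGTH = 3  # do not strip if the remaining stem would be shorter than this


def toy_stem(word):
    """Strip one suffix using fixed rules (toy example, not Snowball)."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


examples = ["jumping", "jumped", "cars", "message", "this"]
stems = [toy_stem(w) for w in examples]
for w, s in zip(examples, stems):
    print(f"{w:<10} -> {s}")
```

The rules know nothing about meaning, so they turn "message" into "messag" and "this" into "thi" just as readily as they turn "cars" into "car". Real stemmers use many more rules and conditions, but the same kind of error remains possible.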

**Lemmatization**

If we are not satisfied with the result of stemming, we can use lemmatization instead. It usually requires more work, but gives better results. As mentioned in the class, lemmatization needs to know the correct part-of-speech (POS) tag of each word, such as "noun", "verb", or "adjective", and we will use another NLTK function to feed this information to the lemmatizer.

In [17]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Initialize the lemmatizer
wl = WordNetLemmatizer()

# Helper function that maps NLTK (Penn Treebank) POS tags to WordNet POS tags
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
def get_wordnet_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    else:
        # All other tags fall back to adverb
        return wordnet.ADV

text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "

lemmatized_sentence = []
words = word_tokenize(text)
word_pos_tags = nltk.pos_tag(words)
for word, tag in word_pos_tags:
    lemmatized_sentence.append(wl.lemmatize(word, get_wordnet_pos(tag)))

lemmatized_text = " ".join(lemmatized_sentence)
print(lemmatized_text)

This be a message to be clean . It may involve some thing like : < br > , ? , : , '' adjacent space and tab .
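In the output, "is" becomes "be" and "cleaned" becomes "clean" because both were tagged as verbs. The toy lookup below sketches why the POS tag matters. The miniature `LEMMAS` table and the `toy_lemmatize` helper are hypothetical and only stand in for a real lexicon such as WordNet.

```python
# Hypothetical miniature lexicon keyed by (lowercase word, part of speech)
LEMMAS = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("cleaned", "verb"): "clean",
    ("things", "noun"): "thing",
    ("is", "verb"): "be",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, pos) pair; return the word unchanged if unknown."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("saw", "verb"))
print(toy_lemmatize("saw", "noun"))
print(toy_lemmatize("tabs", "noun"))
```

The same surface form "saw" maps to different lemmas depending on its tag, which is why a wrong or missing POS tag can leave a word unchanged or lemmatize it incorrectly.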
