# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 1</a>

## Text Processing

In this notebook, we go over some simple techniques to clean and prepare text data for modeling with machine learning.

1. <a href="#1">Simple text cleaning processes</a>
2. <a href="#2">Lexicon-based text processing</a>
    * Stop word removal
    * Stemming   
    * Lemmatization

In [1]:
# Upgrade dependencies
!pip install -r ../../requirements.txt

Collecting torch==1.8.1
  Using cached torch-1.8.1-cp36-cp36m-manylinux1_x86_64.whl (804.1 MB)
Collecting torchtext==0.9.1
  Using cached torchtext-0.9.1-cp36-cp36m-manylinux1_x86_64.whl (7.1 MB)
Collecting scikit-learn==0.24.1
  Using cached scikit_learn-0.24.1-cp36-cp36m-manylinux2010_x86_64.whl (22.2 MB)
Collecting trax==1.3.7
  Using cached trax-1.3.7-py2.py3-none-any.whl (521 kB)
Collecting transformers==4.5.1
  Using cached transformers-4.5.1-py3-none-any.whl (2.1 MB)
Collecting jax
  Using cached jax-0.2.17-py3-none-any.whl
Collecting funcsigs
  Using cached funcsigs-1.0.2-py2.py3-none-any.whl (17 kB)
Collecting gym
  Using cached gym-0.21.0-py3-none-any.whl
Collecting t5
  Using cached t5-0.9.3-py3-none-any.whl (153 kB)
Collecting jaxlib
  Using cached jaxlib-0.1.69-cp36-none-manylinux2010_x86_64.whl (46.5 MB)
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/six-

In [2]:
import nltk
import re
import string

from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


## 1. <a name="1">Simple text cleaning processes</a>
(<a href="#0">Go to top</a>)

In this section, we will do some general-purpose text cleaning. The cleaning methods below can be extended depending on the application.

In [3]:
original_text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "
print(original_text)

   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


Let's first lowercase our text. 

In [4]:
text = original_text.lower()
print(text)

   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


We can get rid of leading/trailing whitespace with the following:

In [5]:
text = text.strip()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .


Remove HTML tags/markups:

In [6]:
text = re.compile("<.*?>").sub("", text)
print(text)

this is a message to be cleaned. it may involve some things like: , ?, :, ''  adjacent spaces and tabs     .
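
The `?` in the pattern `<.*?>` makes the match non-greedy, so each tag is removed on its own. The following is a minimal sketch of the difference; the `html` sample string is made up for illustration and is not part of the original example.

```python
import re

# Hypothetical sample with an opening and a closing tag
html = "a <b>bold</b> word"

# Greedy: ".*" runs from the first "<" to the last ">",
# so the text between the two tags is removed as well
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: ".*?" stops at the first ">", so only the tags are removed
lazy = re.sub(r"<.*?>", "", html)

print(repr(greedy))  # 'a  word'
print(repr(lazy))    # 'a bold word'
```

A regular expression like this is only a rough cleaner for simple markup. For full HTML documents, a dedicated HTML parser is usually the safer choice.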


Replace punctuation with spaces. Be careful with this step: depending on the application, punctuation can actually be useful. For example, it can help signal whether a sentence has a positive or negative meaning.

In [7]:
text = re.compile("[%s]" % re.escape(string.punctuation)).sub(" ", text)
print(text)

this is a message to be cleaned  it may involve some things like              adjacent spaces and tabs      
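
If some punctuation marks carry meaning for our task, we can exclude them from the pattern. The sketch below assumes we want to keep `!` and `?`; the `keep` set and the sample sentence are illustrative choices, not part of the original notebook.

```python
import re
import string

# Assumption: these marks are informative for our task and should survive
keep = "!?"

# Build a character class from every punctuation mark except the ones we keep
to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
pattern = re.compile("[%s]" % re.escape(to_remove))

result = pattern.sub(" ", "great movie, really!")
print(repr(result))  # 'great movie  really!'
```

The comma is replaced by a space while the exclamation mark stays. The doubled space can be collapsed afterwards with the whitespace step.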


Remove extra spaces and tabs:

In [8]:
# Use a raw string so that "\s" is not treated as a string escape sequence
text = re.sub(r"\s+", " ", text)
print(text)

this is a message to be cleaned it may involve some things like adjacent spaces and tabs 
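
The steps above can be combined into one helper function. This is a minimal sketch: the name `clean_text` is our own, and the order of the steps simply follows the cells above.

```python
import re
import string


def clean_text(text):
    """Apply the cleaning steps from this section in the same order."""
    # Lowercase and drop leading/trailing whitespace
    text = text.lower().strip()
    # Remove HTML tags/markup
    text = re.sub(r"<.*?>", "", text)
    # Replace punctuation with spaces
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    return text


sample = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

As in the step-by-step version, the result ends with a single trailing space, because the final period is replaced by a space after the text has been stripped. Calling `.strip()` once more at the end would remove it.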


## 2. <a name="2">Lexicon-based text processing</a>
(<a href="#0">Go to top</a>)

We saw some general-purpose text pre-processing methods in the previous section. Lexicon-based methods are usually applied after these common text processing methods. They are used to normalize the sentences in our dataset. By normalization, we mean putting words into a similar format, which also enhances similarities (if any) between sentences.

We need to download some packages for this example. Run the following cell.

In [9]:
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

#### Stop word removal
Some words occur very frequently in our sentences and don't contribute much to their overall meaning. We usually keep a list of these words and remove them from each of our sentences. In this example, we use: "a", "an", "the", "this", "that", "is", "it", "to", "and".

In [10]:
# We will use a tokenizer from the NLTK library
filtered_sentence = []

# Stop word lists can be adjusted for your problem
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
text = " ".join(filtered_sentence)

In [11]:
print(text)

message be cleaned may involve some things like adjacent spaces tabs
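
The same filtering can be written more compactly with a set and a generator expression. In this sketch, `remove_stop_words` is a hypothetical helper name, and `str.split` stands in for `word_tokenize`. That substitution is an assumption that only holds here because the text has already been stripped of punctuation.

```python
# A set gives constant-time membership checks, which helps with long lists
stop_words = {"a", "an", "the", "this", "that", "is", "it", "to", "and"}


def remove_stop_words(sentence, stop_words):
    """Drop every whitespace-separated token that is in stop_words."""
    return " ".join(w for w in sentence.split() if w not in stop_words)


cleaned = "this is a message to be cleaned it may involve some things like adjacent spaces and tabs "
filtered = remove_stop_words(cleaned, stop_words)
print(filtered)  # message be cleaned may involve some things like adjacent spaces tabs
```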


#### Stemming
Stemming is a rule-based method that converts words into their root form by removing suffixes. This helps us enhance similarities (if any) between sentences.

Example:

"jumping", "jumped" -> "jump"

"cars" -> "car"

In [12]:
# We will use a tokenizer and stemmer from the NLTK library
# Initialize the stemmer
snow = SnowballStemmer("english")

stemmed_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    # Stem the word/token
    stemmed_sentence.append(snow.stem(w))
stemmed_text = " ".join(stemmed_sentence)

In [13]:
print(stemmed_text)

messag be clean may involv some thing like adjac space tab


You can see above that the stemming operation is NOT perfect. We have mistakes such as "messag", "involv", "adjac". It is a rule-based method that sometimes mistakenly removes suffixes from words. Nevertheless, it runs fast.
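
To see why a rule-based approach makes such mistakes, consider a deliberately simplified suffix stripper. This is a toy sketch and NOT the Snowball algorithm: `toy_stem`, its three suffix rules, and the minimum stem length are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s"), min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


examples = ["jumping", "jumped", "cars", "news", "thing"]
stems = [toy_stem(w) for w in examples]
print(stems)  # ['jump', 'jump', 'car', 'new', 'thing']
```

The rules handle "jumping", "jumped" and "cars" as intended, but they also turn "news" into "new", because the rule only looks at the ending of the word and knows nothing about its meaning. Real stemmers use many more rules, but they share this limitation.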

#### Lemmatization
If we are not satisfied with the result of stemming, we can use lemmatization instead. It usually requires more work, but gives better results. As mentioned in the class, lemmatization needs to know the correct part-of-speech tag for each word, such as "noun", "verb", "adjective", etc. We will use another NLTK function to feed this information to the lemmatizer.

In [14]:
# Initialize the lemmatizer
wl = WordNetLemmatizer()

# This is a helper function to map NLTK part-of-speech tags to WordNet tags
# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
def get_wordnet_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN


lemmatized_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
# Get part-of-speech tags as (word, tag) pairs
word_pos_tags = nltk.pos_tag(words)
# Map each part-of-speech tag and lemmatize the word/token
for word, tag in word_pos_tags:
    lemmatized_sentence.append(wl.lemmatize(word, get_wordnet_pos(tag)))

lemmatized_text = " ".join(lemmatized_sentence)

In [15]:
print(lemmatized_text)

message be clean may involve some thing like adjacent space tabs
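
The tag mapping used above can also be written as a lookup table. The sketch below is a table-driven variant of `get_wordnet_pos` that runs without NLTK. It relies on the fact that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` are the single-letter strings `"a"`, `"v"`, `"n"` and `"r"`. The names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are our own.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech letter
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet tag, falling back to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


# Adjective, past-tense verb, plural noun, adverb, modal
tags = ["JJ", "VBD", "NNS", "RB", "MD"]
mapped = [penn_to_wordnet(t) for t in tags]
print(mapped)  # ['a', 'v', 'n', 'r', 'n']
```

Tags outside the four groups, such as the modal tag `MD`, fall back to noun, which matches the `else` branch of the helper above.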


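This looks better than the stemming result: "message", "involve" and "adjacent" are kept as valid words, although "tabs" is left in its plural form.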

Let's compare with the original text:

In [16]:
original_text

"   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "