# **Text Cleaning Basics**

When a text corpus is given to us, it may have the following issues:
*   HTML tags
*   Upper / lower case inconsistency
*   Punctuation
*   Stop words
*   Words not in their root form
*   And so on.

Before using the data for predictions, we need to clean it.

**1) HTML tag removal:**

Text scraped from a website may include HTML tags, so it is recommended that we remove them.



Example:

```
"The market extended gains for the seventh consecutive session, climbing 1 percent to end at <b> record </b> closing high on May 31. Reliance Industries <h2> continued to be a leader </h2> in the rally, followed by <br> private banks & financials and FMCG stocks."
```

To clean the above text, let us remove everything between the angle brackets ‘<’ and ‘>’, including the brackets themselves. We need to write a regex (regular expression) for it.



In [1]:
import re
text_data = '''The market extended gains for the seventh consecutive session, climbing 1 percent to end at <b> record </b> closing high on May 31. Reliance Industries <h2> continued to be a leader </h2> in the rally, followed by <br> private banks & financials and FMCG stocks.'''
html_pattern = re.compile('<.*?>')
text_data = re.sub(html_pattern, '', text_data)
text_data

'The market extended gains for the seventh consecutive session, climbing 1 percent to end at  record  closing high on May 31. Reliance Industries  continued to be a leader  in the rally, followed by  private banks & financials and FMCG stocks.'

Notice that the HTML tags have been replaced with empty strings. The spaces that surrounded each tag are still there, which is why the output contains double spaces.
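The regex above leaves doubled spaces where tags used to be, and scraped pages often also contain HTML entities such as `&amp;`. The following is a minimal sketch of a slightly more thorough cleaner that uses only the standard library. The helper name `strip_html` and the sample string are assumptions made for illustration.

```python
import html
import re

TAG_PATTERN = re.compile(r'<[^>]+>')   # anything between '<' and '>'
SPACE_PATTERN = re.compile(r'\s+')     # runs of whitespace

def strip_html(text):
    """Remove tags, decode HTML entities, and collapse repeated whitespace."""
    text = TAG_PATTERN.sub(' ', text)   # replace each tag with a single space
    text = html.unescape(text)          # e.g. '&amp;' becomes '&'
    return SPACE_PATTERN.sub(' ', text).strip()

print(strip_html('end at <b> record </b> closing high, private banks &amp; financials'))
# end at record closing high, private banks & financials
```

Replacing each tag with a space (rather than an empty string) avoids gluing together words that were separated only by a tag such as `<br>`.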

**2) Upper and lower case inconsistency:**

Let us remove this inconsistency and convert everything into lower case.



In [2]:
text_data = text_data.lower()
text_data

'the market extended gains for the seventh consecutive session, climbing 1 percent to end at  record  closing high on may 31. reliance industries  continued to be a leader  in the rally, followed by  private banks & financials and fmcg stocks.'

**3) Remove Punctuations:**

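Punctuation marks usually add little meaning for this kind of analysis, so we can remove them. The pattern `[^\w\s]` used below matches every character that is neither a word character (letter, digit, or underscore) nor whitespace.

Examples: % ^ & * , ) }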

In [3]:
text_data = re.sub(r'[^\w\s]', '', text_data)
text_data

'the market extended gains for the seventh consecutive session climbing 1 percent to end at  record  closing high on may 31 reliance industries  continued to be a leader  in the rally followed by  private banks  financials and fmcg stocks'
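An alternative to the regex is `str.translate` with `string.punctuation`. The sketch below assumes a hypothetical helper name, `remove_punctuation`. The two approaches differ slightly: `string.punctuation` covers only ASCII punctuation and includes the underscore, whereas `[^\w\s]` keeps underscores and also removes non-ASCII punctuation such as curly quotes.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    """Delete ASCII punctuation; letters, digits and whitespace are kept."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation('private banks & financials, up 1%!'))
# private banks  financials up 1
```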

**4) Remove words having length less than or equal to 2:**

Words that carry meaningful information are usually longer than two characters. Note that this step also drops short numbers such as '1' and '31'.

In [4]:
text_data = ' '.join(word for word in text_data.split() if len(word)>2)
text_data

'the market extended gains for the seventh consecutive session climbing percent end record closing high may reliance industries continued leader the rally followed private banks financials and fmcg stocks'
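The four steps above can be chained into one function. This is a minimal sketch that reuses the same regular expressions as the cells above. The function name `clean_text`, the `min_length` parameter, and the sample string are assumptions made for illustration.

```python
import re

def clean_text(text, min_length=3):
    """Apply the four cleaning steps in order."""
    text = re.sub(r'<.*?>', '', text)      # 1) remove HTML tags
    text = text.lower()                    # 2) convert to lower case
    text = re.sub(r'[^\w\s]', '', text)    # 3) remove punctuation
    # 4) keep only words with at least min_length characters
    words = [word for word in text.split() if len(word) >= min_length]
    return ' '.join(words)

print(clean_text('Closing <b>HIGH</b> on May 31, up 1%!'))
# closing high may
```

The order matters: tags must be removed before punctuation, otherwise the angle brackets are stripped first and the tag names stay behind as ordinary words.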

**Tokenization**

Tokenization is the process of splitting text (documents, sentences, phrases) into smaller units called tokens.

***Example:***

*   Splitting a text into sentences (each sentence is a token)
*   Splitting a sentence into words (each word is a token)

We can import different types of tokenizers from the nltk library accordingly.


**1) Sentence Tokenizer:**


The text is split into sentences.

In [6]:
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [7]:
from nltk.tokenize import sent_tokenize
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
sent_tokenize(text_data)

['The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31.',
 'Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.']
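To see why a dedicated sentence tokenizer is useful, compare it with a naive regex splitter. The sketch below is not how `sent_tokenize` works internally. It assumes a hypothetical helper, `naive_sentences`, that simply splits after `.`, `!` or `?` when whitespace follows.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sentences('Markets closed at a record high on May 31. Reliance led the rally.'))
# ['Markets closed at a record high on May 31.', 'Reliance led the rally.']

print(naive_sentences('Dr. Rao spoke today.'))
# ['Dr.', 'Rao spoke today.']
```

The second call shows the weakness of the naive rule: the abbreviation "Dr." is treated as the end of a sentence. Tokenizers such as the one behind `sent_tokenize` are designed to handle cases like this better.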

**2) Word Tokenizer:**


The text is split into words.

In [18]:
from nltk.tokenize import word_tokenize
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
tokenized_text= word_tokenize(text_data)
print(tokenized_text)

['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', ',', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31', '.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally', ',', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks', '.']


**3) WhitespaceTokenizer:**

This tokenizer splits the text on whitespace only. In the previous example, “,” and “.” were split off as separate tokens because they have their own usage and meaning. With the whitespace tokenizer, characters that occur together stay together, so punctuation remains attached to the adjacent word (for example 'session,' and '31.').

In [10]:
from nltk.tokenize import WhitespaceTokenizer
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
print(WhitespaceTokenizer().tokenize(text_data))

['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session,', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally,', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks.']
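The difference between the two word-level tokenizers can be approximated with the standard library alone. This is a rough sketch and not NLTK's actual implementation: `str.split` behaves like a whitespace tokenizer, while the hypothetical `regex_tokenize` helper separates punctuation marks into their own tokens.

```python
import re

def regex_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r'\w+|[^\w\s]', text)

sample = 'rally, followed by banks & stocks.'

print(sample.split())
# ['rally,', 'followed', 'by', 'banks', '&', 'stocks.']

print(regex_tokenize(sample))
# ['rally', ',', 'followed', 'by', 'banks', '&', 'stocks', '.']
```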


**Stop word Removal**

Some words in our sentences do not provide any relevant information, so they can be removed from the text.

Examples: and, of, at, it, the.

Several NLP libraries operate on text and provide functionality to remove stop words. Some of the popular ones:
*   NLTK
*   spaCy
*   Gensim

This tutorial uses NLTK. If you do not have it on your system, you can install it with the command below:

***pip install nltk***





In [14]:
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [16]:
print(len(stopwords))

179


In [22]:
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
text_data = re.sub(r'[^\w\s]', '', text_data) #punctuation removal

from nltk.tokenize import word_tokenize
tokenized_text = word_tokenize(text_data)
print(tokenized_text)



['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally', 'followed', 'by', 'private', 'banks', 'financials', 'and', 'FMCG', 'stocks']


In [23]:
removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]
print(removed_stop_words_list)

['The', 'market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', '1', 'percent', 'end', 'record', 'closing', 'high', 'May', '31', 'Reliance', 'Industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'FMCG', 'stocks']


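*The lowercase stop words 'the', 'for' and 'and' have been removed from the text. The capitalised 'The' is still present because the comparison is case-sensitive and the NLTK stop word list is all lowercase.*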


**Important Point:** Suppose some word carries no useful information in your domain and you want to remove it too. You can extend the stop word list by appending that word, and then apply the same removal step.

**Example:** ‘fmcg’ is a very common word in your domain, so you want to remove it.

In [24]:
stopwords.append('fmcg')
len(stopwords)


180

In [25]:
# Compare in lower case: the stop word list is lowercase, but the token is 'FMCG'
removed_stop_words_list = [word for word in tokenized_text if word.lower() not in stopwords]
print(removed_stop_words_list)


['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', '1', 'percent', 'end', 'record', 'closing', 'high', 'May', '31', 'Reliance', 'Industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'stocks']

*Because each token is lowercased before the comparison, both 'FMCG' and the capitalised 'The' are now removed.*
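Because NLTK's stop words are all lowercase, the result of the membership test depends on the case of each token. The following is a minimal sketch of case-insensitive removal that also accepts extra domain-specific words. The helper `remove_stop_words` and the tiny `BASE_STOP_WORDS` set are assumptions made for illustration and stand in for NLTK's much longer list.

```python
# Hypothetical mini stop list for illustration; NLTK's list is much longer
BASE_STOP_WORDS = {'the', 'for', 'to', 'at', 'on', 'and', 'a', 'in', 'by', 'be'}

def remove_stop_words(tokens, extra_stop_words=()):
    """Drop tokens whose lowercase form is in the stop word set."""
    stop_words = BASE_STOP_WORDS | {word.lower() for word in extra_stop_words}
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ['The', 'rally', 'followed', 'by', 'private', 'banks', 'and', 'FMCG', 'stocks']

print(remove_stop_words(tokens))
# ['rally', 'followed', 'private', 'banks', 'FMCG', 'stocks']

print(remove_stop_words(tokens, extra_stop_words=['fmcg']))
# ['rally', 'followed', 'private', 'banks', 'stocks']
```

A set gives constant-time membership checks, and building a new set per call leaves the base list unchanged, unlike `append`, which mutates the shared list.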


*At this point we have cleaned data. However, some words are still not in their root form, and this can affect the model’s accuracy. Hence it is recommended to convert words to their base forms. We will learn these techniques next.*

## **Stemming & Lemmatization**
When we work on a text document, removing punctuation and stop words is not enough; something else still needs our attention.

The words that we use in sentences can take many forms. A verb, for instance, changes depending on whether it is used in the present, past or future tense.

For example:

    - The word ‘Go’ is ‘Go’ / ‘Goes’ in present tense and ‘Went’ in past tense

    - The word ‘See’ is ‘See’ / ‘Sees’ in present tense, whereas it is ‘Saw’ in past tense



These inconsistencies in data can affect the model training and predictions, hence, we need to make sure that the words exist in their root forms.



To handle this, there are two methods:




**1) Stemming:**


Stemming is the process of reducing inflected words to their root form. In this method, suffixes are removed from the inflected word so that only the root remains.

For example, from the word “Going” the suffix “ing” is removed, and the inflected word “Going” becomes “Go”, which is the root form.

A few more examples:
*   Developing -> Develop
*   Developed -> Develop
*   Development -> Develop
*   Develops -> Develop

All these inflected words take their root form when their suffixes are removed. Internally, the stemming process applies a set of rules for trimming the suffix.
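To make the idea of rule-based suffix trimming concrete, here is a deliberately simplified sketch. It is not the Porter algorithm, which uses many more rules and conditions. The suffix list and the helper name `toy_stem` are assumptions made for illustration.

```python
# Toy suffix list, checked in order; real stemmers use many more rules
SUFFIXES = ('ment', 'ing', 'ed', 's')

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for word in ['developing', 'developed', 'development', 'develops', 'went']:
    print(word, '->', toy_stem(word))
# developing -> develop
# developed -> develop
# development -> develop
# develops -> develop
# went -> went
```

Like any purely rule-based approach, this sketch has no knowledge of irregular forms, so "went" passes through unchanged.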

We can implement stemming in Python using the popular “nltk” library.

If you don't have nltk installed on your machine, you can simply type: ***pip install nltk***

In [26]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("developing"))
print(porter.stem("develops"))
print(porter.stem("development"))
print(porter.stem("developed"))

develop
develop
develop
develop


However, some words are not handled properly by the stemming process.

**For example,** “went”, “flew” and “saw” cannot be converted to their base forms by stemming.

In [27]:
print(porter.stem("went"))
print(porter.stem("flew"))
print(porter.stem("saw"))

went
flew
saw


**There is no change in the output** because stemming only applies suffix-trimming rules.

It **just knows how to trim the suffix part**, but it does not know how to change the form of an irregular word.

To solve this issue, we need an algorithm that uses linguistic knowledge about the word, such as its part of speech and a dictionary of known forms, to convert each word to its base form.

Fortunately, we have **Lemmatization** for this work.

**Pros and Cons of Stemmer:**

***Pros:***

  Computationally fast, as it simply trims the suffix without considering the context of the word.


***Cons:***

  It is not suitable if you need valid words. A stemmer can return stems that are not real words, for example:

 **“Goes” -> “goe”**

**2) Lemmatization:**


Lemmatization converts words to their root forms by taking into account the context of the word in the sentence, such as its part of speech.

In Stemming, the root word which we get after conversion is called a **stem**.

Whereas, it is called a **lemma** in Lemmatization.

**Pros:**
  The root word we get after conversion is meaningful and is a valid dictionary word.

**Cons:**
  It is computationally more expensive than stemming.

*NLTK provides a class called **WordNetLemmatizer** for this purpose.*

In [28]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [30]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going"))
print(wordnet_lemmatizer.lemmatize("goes")) # Lemmatizer is able to convert it to "go"
print(wordnet_lemmatizer.lemmatize("went"))
print(porter.stem("goes")) # Stemming is unable to normalize the word "goes" properly

[nltk_data] Downloading package wordnet to /root/nltk_data...


going
go
went
goe


**You may have noticed that the lemmatizer did not normalize the words “going” and “went” into their root forms.**

This is because we have not passed any context to it.

The part of speech, “pos”, is the parameter we need to specify. By default it is “n” (noun).

To normalize a verb, we need to pass the value “v”.

In [31]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going", pos="v"))
print(wordnet_lemmatizer.lemmatize("goes", pos="v"))
print(wordnet_lemmatizer.lemmatize("went", pos="v"))
print(wordnet_lemmatizer.lemmatize("go", pos = "v"))
print(wordnet_lemmatizer.lemmatize("studies", pos = "v"))
print(wordnet_lemmatizer.lemmatize("studying", pos = "v"))
print(wordnet_lemmatizer.lemmatize("studied", pos = "v"))
print(wordnet_lemmatizer.lemmatize("dogs")) # by default, it is noun
print(wordnet_lemmatizer.lemmatize("dogs", pos="n"))

go
go
go
go
study
study
study
dog
dog
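In practice the part of speech is usually not known in advance. One common approach is to run a POS tagger first and translate its tags into the letters the lemmatizer accepts. The sketch below shows only the translation step. The helper name `penn_to_wordnet` is an assumption, and the mapping covers just the four broad classes.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet pos letter."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun, which is also the lemmatizer's default

print(penn_to_wordnet('VBD'))  # v
print(penn_to_wordnet('NNS'))  # n

# Possible usage with NLTK (requires the tagger data to be downloaded first):
# tagged = nltk.pos_tag(word_tokenize('The market extended gains'))
# lemmas = [wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
#           for word, tag in tagged]
```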


**So far we have looked into many text cleaning and normalization techniques**

**Thank you for your active participation !!!**

⭐⭐⭐⭐⭐