<a href="https://colab.research.google.com/github/Raj-dot-GitHub/NLP-Notes/blob/main/Text%20Normalization/NLP_Normalization_of_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Text Normalization**



### **What is Text Normalization?**
> **Text Normalization**, as used in this notebook, is the process of reducing a word to its root form.

### There are two common ways to normalize text:
1. Stemming.
2. Lemmatization.

### **What are Stemming and Lemmatization?**
> **Stemming** and **Lemmatization** are both ways of normalizing words, which means reducing a word to its root form.

### **What is Stemming?**

*   Stemming is a text normalization technique that cuts off the end or the beginning of a word, using a list of common prefixes and suffixes that the word might contain.
*   It is a rudimentary, rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
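
To make the rule-based idea concrete, here is a minimal sketch of a suffix stripper. It is a toy written for this notebook, not the Porter algorithm: the function name `toy_stem` and the `min_stem_len` guard are assumptions made for illustration, and only the four suffixes listed above are handled.

```python
# Toy suffix stripper: illustrates rule-based stemming, NOT the Porter algorithm.
SUFFIXES = ("ing", "ly", "es", "s")  # the suffixes named in the bullet above


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


for w in ["driving", "quickly", "boxes", "claims", "drive"]:
    print(w, "->", toy_stem(w))
```

Note that "driving" becomes "driv", which is not a real word. That is typical of stemming: the rules know nothing about the vocabulary.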



### **What is Lemmatization?**
> Lemmatization, on the other hand, is a systematic, step-by-step procedure for obtaining the root form of a word. It relies on a vocabulary (a dictionary of known words) and on morphological analysis (word structure and grammatical relations).
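
As a rough sketch of the dictionary idea, the snippet below looks a word up by its form and its part of speech. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical and hand-written for this example; a real lemmatizer uses a far larger vocabulary plus morphological rules.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("driving", "v"): "drive",
    ("went", "v"): "go",
    ("mice", "n"): "mouse",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("leaves", "n"))  # noun reading
print(toy_lemmatize("leaves", "v"))  # verb reading
print(toy_lemmatize("went", "v"))    # irregular form, no suffix rule would find this
print(toy_lemmatize("river", "n"))   # not in the table, returned as-is
```

The two readings of "leaves" show why the part of speech matters: the same surface form can have different lemmas.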

### **Benefits of Text Normalization:**
> Let’s consider the following two sentences:

* He was driving
* He went for a drive

We can easily see that both sentences convey the same meaning: the activity of driving, in the past. A machine, however, treats "driving" and "drive" as unrelated tokens. Stemming or lemmatization reduces both to a common form, so that the machine can relate the two sentences.

Another benefit of text normalization is that it reduces the number of unique words in the text data. This helps in bringing down the training time of the machine learning model (and don’t we all want that?).
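
The two benefits can be sketched on the example sentences themselves. The `lemma_map` below is a hand-written assumption that stands in for a real lemmatizer, so the numbers only illustrate the effect.

```python
sentences = ["He was driving", "He went for a drive"]

# Hand-written lemma map for this example only (an assumption, not library output).
lemma_map = {"was": "be", "driving": "drive", "went": "go"}


def normalize(sentence):
    """Lowercase, split on whitespace, and replace known forms with their lemma."""
    tokens = sentence.lower().split()
    return [lemma_map.get(tok, tok) for tok in tokens]


raw_vocab = {tok for s in sentences for tok in s.lower().split()}
norm_vocab = {tok for s in sentences for tok in normalize(s)}
shared = set(normalize(sentences[0])) & set(normalize(sentences[1]))

print(len(raw_vocab), len(norm_vocab))  # vocabulary size before and after
print(sorted(shared))                   # tokens the two sentences now have in common
```

Before normalization the sentences share only "he"; afterwards they also share "drive", and the vocabulary shrinks from 7 to 6 unique tokens.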

### **Stemming vs Lemmatization. Which one should I prefer?**
> A **stemming** algorithm works by cutting a suffix or prefix off the word. **Lemmatization** is a more powerful operation because it takes the morphological analysis of the word into account.

Lemmatization returns the **lemma**, which is the root word shared by all of its inflected forms.

We can say that stemming is a quick and dirty method of chopping words down to their root forms, while lemmatization is an intelligent operation that uses dictionaries built from in-depth linguistic knowledge. **Hence, Lemmatization helps in forming better features.**
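
The difference is easy to see side by side. The triples below are copied from the NLTK outputs that appear further down in this notebook (Porter stemmer and WordNet lemmatizer on the same tokens); nothing is recomputed here, the snippet only lines them up.

```python
# (word, Porter stem, WordNet lemma) copied from this notebook's NLTK outputs.
observed = [
    ("determined", "determin", "determine"),
    ("litigation", "litig", "litigation"),
    ("fishery", "fisheri", "fishery"),
    ("valuable", "valuabl", "valuable"),
    ("vaguest", "vaguest", "vague"),
    ("claims", "claim", "claim"),
]

print(f"{'word':<12}{'stem':<12}{'lemma':<12}")
for word, stem, lemma in observed:
    print(f"{word:<12}{stem:<12}{lemma:<12}")

# Words on which the two techniques disagree.
differ = [word for word, stem, lemma in observed if stem != lemma]
print(differ)
```

The stems are often not dictionary words ("litig", "valuabl"), whereas the lemmas are, which is the practical meaning of "better features".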

## **Text Normalization using NLTK**

### **Stemming**

For stemming, we use the `PorterStemmer` class from NLTK.

In [None]:
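import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# One-time downloads: tokenizer models and the stop-word lists.
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')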

In [2]:
stop_words = set(stopwords.words("english"))
print(len(stop_words))

179


In [3]:
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# Performing word tokenization.
word_tokens = word_tokenize(text)      

In [4]:
# Stop-word removal (the check is case-sensitive, so the capitalized 'He' is kept).
filtered_sentence = []

for word in word_tokens:
  if word not in stop_words:
    filtered_sentence.append(word)

print(filtered_sentence)

['He', 'determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'rights', 'become', 'much', 'less', 'valuable', ',', 'indeed', 'vaguest', 'idea', 'wood', 'river', 'question', '.']
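
The output above still contains 'He' and the punctuation tokens, because the membership test is case-sensitive and NLTK's English stop-word list is lowercase. A possible variant is sketched below; to keep it runnable on its own it uses a small stand-in `stop_words` set and a shortened token list, both of which are assumptions for illustration.

```python
import string

# Small stand-in for NLTK's English stop-word set so the sketch runs on its own;
# in the notebook you would reuse the existing `stop_words` instead.
stop_words = {"he", "to", "his", "with", "the", "and"}
punctuation = set(string.punctuation)

word_tokens = ["He", "determined", "to", "drop", "his", "litigation",
               "with", "the", "monastry", ",", "and", "relinguish"]

filtered = [
    tok for tok in word_tokens
    if tok.lower() not in stop_words   # case-insensitive stop-word check
    and tok not in punctuation         # drop single-character punctuation tokens
]
print(filtered)
```

Whether to drop punctuation depends on the downstream task, so treat that second condition as optional.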


In [6]:
# Applying stemming.
stem_words = []
ps = PorterStemmer()
for word in filtered_sentence:
  root_word = ps.stem(word)
  stem_words.append(root_word)

print(filtered_sentence)
print(stem_words)

['He', 'determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'rights', 'become', 'much', 'less', 'valuable', ',', 'indeed', 'vaguest', 'idea', 'wood', 'river', 'question', '.']
['He', 'determin', 'drop', 'litig', 'monastri', ',', 'relinguish', 'claim', 'wood-cut', 'fisheri', 'rihgt', '.', 'He', 'readi', 'becuas', 'right', 'becom', 'much', 'less', 'valuabl', ',', 'inde', 'vaguest', 'idea', 'wood', 'river', 'question', '.']


### **Lemmatization.**
>For lemmatization, we will use the `WordNetLemmatizer` class from NLTK.

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

In [8]:
text2 = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words("english"))
word_tokens = word_tokenize(text2)

In [9]:
# Stopwords removal.

filtered_sentence = []

for word in word_tokens:
  if word not in stop_words:
    filtered_sentence.append(word)

print(filtered_sentence)

['He', 'determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'rights', 'become', 'much', 'less', 'valuable', ',', 'indeed', 'vaguest', 'idea', 'wood', 'river', 'question', '.']


In [13]:
lemma_words = []

wordnet_lemmatizer = WordNetLemmatizer()
for word in filtered_sentence:
  # Chain three passes: treat the word as a noun, then a verb, then an adjective.
  word1 = wordnet_lemmatizer.lemmatize(word, pos="n")
  word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
  word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
  lemma_words.append(word3)

print(filtered_sentence)
print(lemma_words)


['He', 'determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'rights', 'become', 'much', 'less', 'valuable', ',', 'indeed', 'vaguest', 'idea', 'wood', 'river', 'question', '.']
['He', 'determine', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claim', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'right', 'become', 'much', 'le', 'valuable', ',', 'indeed', 'vague', 'idea', 'wood', 'river', 'question', '.']


### **Note:-** 
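The WordNet lemmatizer relies on a part-of-speech (POS) hint, passed through the `pos` parameter of `lemmatize`; it does not tag the words itself.

Here, **n** stands for **noun**, **v** stands for **verb** and **a** stands for **adjective**. The lemmatizer treats each word as the given part of speech and returns it unchanged when it finds no base form for that part of speech. Chaining all three passes is a shortcut that can misfire: in the output above, 'less' became 'le', most likely because the noun pass treated the trailing 's' characters as plural endings.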
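
An alternative to chaining the three passes is to tag each word first and lemmatize it once with its own part of speech. The helper below is a sketch of the mapping step only; `penn_to_wordnet` is a hypothetical name, and the tags in `tagged` are written by hand. In practice the tags would come from `nltk.pos_tag` (which needs an extra tagger model download) and the result would be passed as `pos=` to `lemmatize`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS codes used by WordNet."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # everything else is treated as a noun


# Hand-tagged tokens for illustration; nltk.pos_tag would normally supply the tags.
tagged = [("determined", "VBD"), ("claims", "NNS"), ("vaguest", "JJS"), ("less", "RBR")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With this mapping, a word such as 'less' is no longer pushed through the noun rules at all.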

## **Text Normalization using spaCy.**

Note:- spaCy has no module for stemming, so we can only perform lemmatization.

### **Lemmatization.**

In [14]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [15]:
doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were.""")


In [16]:
# Lemmatization: every spaCy token carries its lemma in the `lemma_` attribute.
lemma_words = []
for token in doc:
  lemma_words.append(token.lemma_)

print(lemma_words)

['-PRON-', 'determine', 'to', 'drop', '-PRON-', 'litigation', 'with', 'the', 'monastry', ',', 'and', 'relinguish', '-PRON-', 'claim', 'to', 'the', 'wood', '-', 'cut', 'and', '\n', 'fishery', 'rihgts', 'at', 'once', '.', '-PRON-', 'be', 'the', 'more', 'ready', 'to', 'do', 'this', 'becuase', 'the', 'right', 'have', 'become', 'much', 'less', 'valuable', ',', 'and', '-PRON-', 'have', '\n', 'indeed', 'the', 'vague', 'idea', 'where', 'the', 'wood', 'and', 'river', 'in', 'question', 'be', '.']


### **Note:-**
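Here `-PRON-` is the placeholder lemma that spaCy 2.x assigns to every pronoun; it can easily be removed afterwards, for example with a regular expression or a simple filter. Newer spaCy releases (3.x) no longer emit this placeholder. The benefit of spaCy is that we do not have to pass any `pos` parameter to perform lemmatization: the pipeline tags each token's part of speech itself before lemmatizing.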
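
Since `-PRON-` is an exact string, a plain comparison is enough to drop it. The sketch below works on a shortened copy of the output above and also discards the whitespace-only tokens such as '\n'; the variable name `cleaned` is just an illustration.

```python
# Shortened copy of the spaCy output shown above.
lemma_words = ["-PRON-", "determine", "to", "drop", "-PRON-", "litigation",
               "and", "\n", "fishery"]

# Keep only real lemmas: skip the pronoun placeholder and whitespace-only tokens.
cleaned = [lemma for lemma in lemma_words
           if lemma != "-PRON-" and lemma.strip()]
print(cleaned)
```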


## **Text Normalization using TextBlob.**

### Note:- 
Unlike spaCy, TextBlob's `Word` class does offer a `stem()` method (built on NLTK's Porter stemmer), but this section covers only lemmatization.

### **Lemmatization**

In [17]:
from textblob import Word

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

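lem = []
# Note: split() separates on whitespace only, so punctuation stays attached ("monastry,").
for i in text.split():
  word1 = Word(i).lemmatize("n")
  word2 = Word(word1).lemmatize("v")
  word3 = Word(word2).lemmatize("a")
  # The final call passes no part of speech, so the word is treated as a noun again.
  lem.append(Word(word3).lemmatize())
print(lem)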

['He', 'determine', 'to', 'drop', 'his', 'litigation', 'with', 'the', 'monastry,', 'and', 'relinguish', 'his', 'claim', 'to', 'the', 'wood-cuting', 'and', 'fishery', 'rihgts', 'at', 'once.', 'He', 'wa', 'the', 'more', 'ready', 'to', 'do', 'this', 'becuase', 'the', 'right', 'have', 'become', 'much', 'le', 'valuable,', 'and', 'he', 'have', 'indeed', 'the', 'vague', 'idea', 'where', 'the', 'wood', 'and', 'river', 'in', 'question', 'were.']
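
In the output above, tokens such as 'monastry,', 'once.' and 'were.' still carry their punctuation, which probably explains why 'were.' was left untouched. One possible fix, sketched here on a shortened sample sentence (an assumption for illustration), is to strip leading and trailing punctuation before handing each word to `Word(...)`.

```python
import string

# Shortened sample built from the text above.
text = "the monastry, and relinguish his claims at once. where the wood-cuting rights were."

# Strip leading/trailing punctuation from each whitespace-separated chunk.
words = [chunk.strip(string.punctuation) for chunk in text.split()]
words = [w for w in words if w]  # discard chunks that were punctuation only
print(words)
```

Hyphens inside a word ('wood-cuting') are kept, because `strip` only touches the two ends of the string.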


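> TextBlob also relies on a part-of-speech hint to perform lemmatization: `Word.lemmatize` treats the word as a noun unless told otherwise, which is why 'was' ends up as 'wa' and 'less' as 'le' in the output above.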

### That's It !!