## Text Normalization

"Text normalization is the process of transforming a text into a canonical (standard) form."

- Text normalization is important for noisy texts such as social media comments, text messages and blog post comments, where abbreviations, misspellings and out-of-vocabulary (OOV) words are prevalent.

- This paper (https://sentic.net/microtext-normalization.pdf) reports that applying a text normalization strategy to Tweets improved sentiment classification accuracy by ~4%.

#### Example:

    wolves -> wolf
    plays -> play
    goooood -> good

- The techniques covered below for `text normalization` are:
    1. Stemming
    2. Lemmatization
    3. Stop Words Removal
    4. N-gram generation
    5. Regular Expressions
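
As a small illustration of the `goooood -> good` case above, here is a minimal sketch that collapses characters repeated three or more times down to two. The helper name `collapse_repeats` is hypothetical, and the rule is only a heuristic: it happens to produce `good` here, but a word such as `sooooo` would become `soo`, so a dictionary check is usually still needed.

```python
import re

# Any character repeated three or more times in a row.
REPEATS = re.compile(r"(.)\1{2,}")

def collapse_repeats(word):
    """Collapse runs of 3+ identical characters down to exactly two."""
    return REPEATS.sub(r"\1\1", word)

print(collapse_repeats("goooood"))
print(collapse_repeats("good"))
```
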
________________________________________________________

### 1. Stemming

- Stemming removes or replaces suffixes to reduce a word to its root form, which is called the **stem**.
- It uses a crude heuristic that chops off the ends of words in the hope of reaching the root form, so the result is not always a real word.
- For example, the words “trouble”, “troubled” and “troubles” are converted to `troubl` instead of `trouble`, because the endings are simply chopped off.
- The Porter algorithm (`PorterStemmer` in NLTK) is one of the most popular stemming algorithms.
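
To make the "chop off the ends" idea concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is not the Porter algorithm: the suffix list, the minimum stem length and the function name `crude_stem` are assumptions chosen only for illustration.

```python
# Suffixes to strip, checked in order (longer ones first).
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["trouble", "troubled", "troubles", "executing", "programs"]:
    print(w, "->", crude_stem(w))
```

Even this toy version maps “trouble”, “troubled” and “troubles” to the same non-word stem, which is the behaviour described above.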

#### Example

```python
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "He had trouble while executing spark programs"
tokens = nltk.word_tokenize(sentence)
# Print each token next to its stem.
for token in tokens:
    print("Word : {} ------- StemWord : {}".format(token, stemmer.stem(token)))
```

```
Word : He ------- StemWord : He
Word : had ------- StemWord : had
Word : trouble ------- StemWord : troubl
Word : while ------- StemWord : while
Word : executing ------- StemWord : execut
Word : spark ------- StemWord : spark
Word : programs ------- StemWord : program
```


### 2. Lemmatization

- Lemmatization uses a vocabulary and morphological analysis to return the base, meaningful form of a word, which is known as the **lemma**.
- The goal is the same as stemming, but lemmatization does not just chop off endings: it maps each word to its actual root word. For example, the word “better” maps to “good”.
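
The contrast with stemming can be sketched with a toy lookup table. The tiny `LEMMAS` dictionary and the `toy_lemmatize` helper below are hypothetical stand-ins for a real vocabulary such as WordNet. They only illustrate that a lemmatizer looks words up (by word and part of speech) instead of cutting off endings.

```python
# Hypothetical mini-vocabulary: (word, part of speech) -> lemma.
LEMMAS = {
    ("better", "a"): "good",
    ("wolves", "n"): "wolf",
    ("plays", "v"): "play",
    ("executing", "v"): "execute",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma from the lookup table, or the word itself if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", pos="a"))  # found as an adjective
print(toy_lemmatize("better"))           # not found as a noun, returned unchanged
```
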

#### Example

```python
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
sentence = "He had trouble while executing spark programs"
tokens = nltk.word_tokenize(sentence)
# Print each token next to its lemma (default part of speech: noun).
for token in tokens:
    print("Word : {} ------- Lemma : {}".format(token, lemmatizer.lemmatize(token)))
```

```
Word : He ------- Lemma : He
Word : had ------- Lemma : had
Word : trouble ------- Lemma : trouble
Word : while ------- Lemma : while
Word : executing ------- Lemma : executing
Word : spark ------- Lemma : spark
Word : programs ------- Lemma : program
```


- Note that `lemmatize` takes a **part of speech** parameter, `pos`. If it is not supplied, the default is **noun**, so every word is looked up as a noun. This is why “executing” is left unchanged above. Keep this in mind when lemmatizing, and pass the correct part of speech (for example `pos="v"` for verbs or `pos="a"` for adjectives) when you know it.

```python
# Look the words up as adjectives instead of nouns.
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
```

```
good
best
```
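
Since the right `pos` matters, a common pattern is to run a part-of-speech tagger first and translate its tags into the single-letter codes that `lemmatize` accepts (`n`, `v`, `a`, `r`). The sketch below shows only that translation step for Penn Treebank style tags. The function name `to_wordnet_pos` is hypothetical, and falling back to noun for all other tags is an assumption that mirrors the lemmatizer's default.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet pos code."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

for tag in ["JJR", "VBG", "RB", "NNS", "DT"]:
    print(tag, "->", to_wordnet_pos(tag))
```

With a tagged token, the result could then be passed along as `lemmatizer.lemmatize(token, pos=to_wordnet_pos(tag))`.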


### Takeaway: Try both stemming and lemmatization and choose whichever works best for the task

### 3. Stop Words Removal

- Stop words are a set of commonly used words in a language.
- Examples of stop words in English are “a”, “the”, “is” and “are”.
- The intuition behind removing stop words is that, by dropping low-information words from the text, we can focus on the important words instead.
- Stop word lists can come from pre-established sets, or you can create a custom one for your domain.
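
One way to build a custom list for a domain, sketched here under simple assumptions, is to treat words that occur in most documents of a corpus as stop word candidates. The toy corpus, the 0.8 threshold and the helper name `frequent_words` are all made up for illustration, and a real list would still need manual review.

```python
from collections import Counter

def frequent_words(documents, min_doc_ratio=0.8):
    """Return words that appear in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(documents)
    return {word for word, count in doc_freq.items() if count >= threshold}

docs = [
    "the patient reported mild pain",
    "the patient was discharged",
    "the patient has a fever",
    "no pain reported by the patient",
]
print(sorted(frequent_words(docs)))
```

In this toy medical corpus, “patient” is flagged alongside “the” because it appears in every document and so carries little information there.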

#### Print list of stop words available in NLTK

```python
import nltk
from nltk.corpus import stopwords

# Show the English stop words shipped with NLTK.
set(stopwords.words('english'))
```

```
{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an',
 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been',
 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn',
 "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't",
 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from',
 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven',
 "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself',
 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's",
 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more',
 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor',
 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our',
 'ours', 'ourselves', 'out', 'over', 'own',
 ...
```

#### Example:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a difficult sentence for the demo purpose"
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# The comparison is case-sensitive, so the capitalized "This" is not removed.
filtered_sentence = ' '.join([w for w in word_tokens if w not in stop_words])
filtered_sentence
```

```
'This difficult sentence demo purpose'
```

**Why is removing stop words not always a good idea?**
- It can also remove meaningful words. The NLTK stop word list, for example, contains negations such as “weren't”, “won't” and “wouldn't”. Removing them can completely change the meaning of a sentence.
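
A simple mitigation, sketched below, is to take negation words out of the stop word set before filtering. The small `base_stop_words` set is a hand-written stand-in for a real list (an assumption made to keep the example self-contained), and `remove_stop_words` is a hypothetical helper.

```python
# Hand-written stand-in for a real stop word list.
base_stop_words = {"the", "was", "a", "is", "not", "wasn't", "won't", "this"}
negations = {"not", "wasn't", "won't"}

# Keep negations by removing them from the stop word set.
stop_words = base_stop_words - negations

def remove_stop_words(text, stop_words):
    """Drop stop words, comparing case-insensitively on whitespace tokens."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

sentence = "The movie wasn't good"
print(remove_stop_words(sentence, base_stop_words))  # negation is lost
print(remove_stop_words(sentence, stop_words))       # negation is kept
```
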
#### Reference Links:
- https://medium.com/@limavallantin/why-is-removing-stop-words-not-always-a-good-idea-c8d35bd77214
- https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52

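### 4. N-gram Generation

- An n-gram is a contiguous sequence of n tokens from a text, for example pairs of adjacent words when n is 2.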

```python
import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("2-gram: ", extract_ngrams(data, 2))
```

```
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object', 'object .']
```
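
The same idea can be sketched without NLTK, since an n-gram is just a sliding window of `n` consecutive tokens. The `simple_ngrams` helper below is a hypothetical, plain-Python version that splits on whitespace, so unlike `word_tokenize` it keeps the final period attached to `object.`.

```python
def simple_ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as strings."""
    tokens = text.split()
    # zip over n shifted copies of the token list to form sliding windows.
    windows = zip(*(tokens[i:] for i in range(n)))
    return [" ".join(window) for window in windows]

data = "A class is a blueprint for the object."
print(simple_ngrams(data, 3))
```
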


### 5. Regular Expressions (https://regex101.com/)

```python
import re

# Match runs of two or more whitespace characters.
pattern = r'\s\s+'
text = "Mastercard is   a very     goood company         "
re.sub(pattern, ' ', text)
```

```
'Mastercard is a very goood company '
```
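
Note that the result above still ends with a space. A minimal sketch of a slightly fuller cleanup, assuming we also want to trim the ends and lowercase the text, could look like this. The function name `clean_text` is hypothetical.

```python
import re

def clean_text(text):
    """Collapse whitespace runs to one space, trim the ends, and lowercase."""
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

text = "Mastercard is   a very     goood company         "
print(repr(clean_text(text)))
```
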

#### Reference Links:
- https://www.dataquest.io/blog/regex-cheatsheet/