## Text Pre-Processing:

1. Tokenizing words.
2. Normalizing word formats.
3. Segmenting sentences.

### Tokenizing words:

The problem with space-based tokenization is that we cannot blindly remove punctuation or rely on whitespace alone:
1. m.p.h, Ph.D, AT&T, cap'n
2. prices ($45.55)
3. Dates (01/02/06)
4. URL (http://www.google.com)
5. hashtags (#nlp)
6. email addresses (abc@gmail.com)
7. Clitic: a word that does not stand on its own.
    1. 'are' in we're.
    2. French 'je' in j'ai.
    3. 'le' in l'honneur.
8. When should multi word expressions (MWE) be words?
    1. New York
    2. Rock 'n' roll
9. Many languages, such as Chinese, Japanese, and Thai, don't use spaces to separate words.
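
The cases listed above can be seen in a small sketch. It compares a plain whitespace split, a split that strips all punctuation, and a regular-expression tokenizer. The pattern `TOKEN_PATTERN` and the function names are illustrative assumptions, not a standard tokenizer; the pattern only covers prices, dates, and simple clitics.

```python
import re

# Hypothetical pattern: prices/dates/numbers first, then words with an
# optional clitic ('re, 'n), then any single punctuation mark.
TOKEN_PATTERN = re.compile(r"\$?\d+(?:[./]\d+)*|\w+(?:'\w+)?|[^\w\s]")

def whitespace_tokenize(text):
    # Splits on whitespace only, so punctuation stays glued to words.
    return text.split()

def strip_punctuation_tokenize(text):
    # Keeps only runs of word characters, which breaks prices and dates.
    return re.findall(r"\w+", text)

def regex_tokenize(text):
    # Keeps prices, dates, and clitics whole; punctuation becomes a token.
    return TOKEN_PATTERN.findall(text)

text = "We're paying $45.55 on 01/02/06."
print(whitespace_tokenize(text))
print(strip_punctuation_tokenize(text))
print(regex_tokenize(text))
```

Hand-written patterns like this one are language specific and need a new rule for every special case, which is one motivation for learned subword tokenizers.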

How do we decide where the token boundaries should be?

## Method 1: Byte Pair Encoding (BPE)

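BPE is a simple data compression algorithm in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data. For tokenization, the same idea is applied to characters: the most frequent pair of adjacent symbols is repeatedly merged into a new symbol.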

BPE strikes a balance between character-level and word-level representations: frequent words stay whole while rare words are split into subword units, which lets it cope with large vocabularies.

### Steps:

1. Let the vocabulary be the set of all individual characters, e.g. $\{a, b, c, d\}$.
2. Repeat until k merges have been done:
    1. Choose the two symbols that are most frequently adjacent in the training corpus (say 'a', 'b').
    2. Add a new merged symbol 'ab' to the vocabulary.
    3. Replace every adjacent 'a' 'b' in the corpus with 'ab'.
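
The steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the function name `learn_bpe` is an assumption, merges never cross word boundaries, no end-of-word marker is used, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def learn_bpe(corpus, k):
    """Learn up to k BPE merges from a whitespace-separated corpus."""
    # Each word is a tuple of symbols, stored with its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    # Step 1: the initial vocabulary is the set of individual characters.
    vocab = {symbol for word in words for symbol in word}
    merges = []
    for _ in range(k):
        # Step 2.1: count every pair of adjacent symbols.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        # Step 2.2: add the merged symbol to the vocabulary.
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Step 2.3: replace the pair everywhere in the corpus.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return vocab, merges

corpus = "low low low low lowest lowest lowest newer newer new newer wider wide new lower"
vocab, merges = learn_bpe(corpus, k=3)
print(merges)
print(sorted(vocab))
```

On this corpus the first two merges build the symbol `low`, since `l o` and `o w` are the most frequent adjacent pairs.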

### Advantages:

1. The vocabulary includes the most frequently occurring words and subwords.
2. It tends to pick up morphemes. A morpheme is the smallest meaning-bearing unit of a language; for example, unlikeliest has 3 morphemes: un-, likely, and -est.
3. Rare words can be encoded with appropriate subword tokens, without introducing any “unknown” tokens.
4. It does not require any understanding of the language being tokenized.
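
To illustrate how rare or unseen words avoid “unknown” tokens, the sketch below applies a list of merges to new words. The merge list is a hypothetical example written by hand, and `apply_merges` is an illustrative name; a trained tokenizer would supply its own learned merges.

```python
def apply_merges(word, merges):
    """Segment a word by applying BPE merges in the order they were learned."""
    symbols = list(word)  # start from individual characters
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges, assumed to come from an earlier training run.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("slower", merges))
print(apply_merges("lowest", merges))
```

Any character sequence can be segmented this way, because in the worst case the word simply falls back to its individual characters.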

In [3]:
from transformers import BertTokenizer

# Note: BERT uses WordPiece, a subword algorithm closely related to BPE.
def bert_tokenizer(text):
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.tokenize(text))

corpus = "low low low low lowest lowest lowest newer newer new newer wider wide new lower"
bert_tokenizer(corpus)

['low', 'low', 'low', 'low', 'lowest', 'lowest', 'lowest', 'newer', 'newer', 'new', 'newer', 'wider', 'wide', 'new', 'lower']


In [2]:
corpus = "New GPU good deal, even more good"
bert_tokenizer(corpus)

['new', 'gp', '##u', 'good', 'deal', ',', 'even', 'more', 'good']


In [4]:
from transformers import XLNetTokenizer

# Note: XLNet uses SentencePiece (unigram language model) rather than plain BPE.
def xltokenizer(text):
    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    print(tokenizer.tokenize(text))

corpus = "New GPU good deal, even more good"
xltokenizer(corpus)

['▁New', '▁G', 'PU', '▁good', '▁deal', ',', '▁even', '▁more', '▁good']


In [5]:
corpus = "low low low low lowest lowest lowest newer newer new newer wider wide new lower"
xltokenizer(corpus)

['▁low', '▁low', '▁low', '▁low', '▁lowest', '▁lowest', '▁lowest', '▁newer', '▁newer', '▁new', '▁newer', '▁wider', '▁wide', '▁new', '▁lower']


## Lemmatization and Stemming - Word Normalization

Lemmatization determines that several words in the corpus share the same root (lemma), despite their surface differences.

The advantage is that it reduces a lot of task-dependent noise in the data. The disadvantage is that for specific NLP tasks such as machine translation, we need to preserve the original words rather than use lemmatized forms.

am, are, is --> be

car, cars, car's, cars' --> car

He is reading detective stories --> He be read detective story.

Disadvantages:
1. Context is lost by this normalization technique.
2. It is language dependent.
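
The mappings above can be mimicked with a tiny lookup table. This is only a sketch: the table `LEMMAS` is a hypothetical hand-written dictionary covering the examples in this section, whereas a real lemmatizer uses a full lexicon and part-of-speech information.

```python
# Hypothetical lookup table covering only the examples in this section.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "reading": "read", "stories": "story",
}

def lookup_lemmatize(sentence):
    # Replace each whitespace-separated word by its lemma when one is known.
    return " ".join(LEMMAS.get(word.lower(), word) for word in sentence.split())

print(lookup_lemmatize("He is reading detective stories"))
```

The table also shows why the approach is language dependent: every entry is specific to English.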

In [7]:
import spacy
import re
import pandas as pd
import numpy as np
from collections import Counter

In [8]:
# Lemmatization

def lemmatize(corpus):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(corpus)

    # Columns: token text, lemma, part of speech, whether it is a stop word
    print('TEXT\t\t|\tLEMMA\t|\tPOS\t|\tIS_STOP_WORD')
    for token in doc:
        print(f'{token.text}\t\t|\t{token.lemma_}\t|\t{token.pos_}\t|\t{token.is_stop}')

corpus = '''Reading books with people while eating lunch.'''
lemmatize(corpus)

TEXT		|	LEMMA	|	POS	|	IS_STOP_WORD
Reading		|	read	|	VERB	|	False
books		|	book	|	NOUN	|	False
with		|	with	|	ADP	|	True
people		|	people	|	NOUN	|	False
while		|	while	|	SCONJ	|	True
eating		|	eat	|	VERB	|	False
lunch		|	lunch	|	NOUN	|	False
.		|	.	|	PUNCT	|	False


In [10]:
# Reviews dataset
df = pd.read_csv('reviews.csv')
df.head()

| | rating | review |
|---|---|---|
| 0 | negative | terrible place to work for i just heard a stor... |
| 1 | negative | hours , minutes total time for an extremely s... |
| 2 | negative | my less than stellar review is for service . w... |
| 3 | negative | i m granting one star because there s no way t... |
| 4 | negative | the food here is mediocre at best . i went aft... |


In [11]:
reviews = np.asarray(df['review'])
reviews[1]

' hours , minutes total time for an extremely simple physical . stay away unless you have hours to waste ! ! ! '

In [12]:
corpus = reviews[1]
lemmatize(corpus)

TEXT		|	LEMMA	|	POS	|	IS_STOP_WORD
 		|	 	|	SPACE	|	False
hours		|	hour	|	NOUN	|	False
,		|	,	|	PUNCT	|	False
minutes		|	minute	|	NOUN	|	False
total		|	total	|	ADJ	|	False
time		|	time	|	NOUN	|	False
for		|	for	|	ADP	|	True
an		|	an	|	DET	|	True
extremely		|	extremely	|	ADV	|	False
simple		|	simple	|	ADJ	|	False
physical		|	physical	|	NOUN	|	False
.		|	.	|	PUNCT	|	False
stay		|	stay	|	VERB	|	False
away		|	away	|	ADV	|	False
unless		|	unless	|	SCONJ	|	True
you		|	-PRON-	|	PRON	|	True
have		|	have	|	AUX	|	True
hours		|	hour	|	NOUN	|	False
to		|	to	|	PART	|	True
waste		|	waste	|	VERB	|	False
!		|	!	|	PUNCT	|	False
!		|	!	|	PUNCT	|	False
!		|	!	|	PUNCT	|	False


### Stemming - Word Normalization

A stemming algorithm is a computational procedure which reduces all words with the same root… to a common form, usually by stripping each word of its derivational and inflectional suffixes. The stem obtained may or may not be linguistically correct/meaningful.

spaCy does not provide a stemmer; it relies on lemmatization instead. NLTK does provide stemmers, including:
1. Porter Stemmer
2. Snowball Stemmer

The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It also supports multiple languages, including English, Russian, Danish, French, Finnish, German, Italian, Hungarian, Portuguese, Norwegian, Swedish, and Spanish.

Examples
1. ATIONAL --> ATE (relational --> relate)
2. motoring --> motor
3. grasses --> grass
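
The examples above are instances of suffix-rewriting rules. A toy version can be sketched as follows; the rule list and the name `toy_stem` are assumptions for illustration, and the real Porter algorithm applies many more rules in several phases, with conditions on the remaining stem.

```python
# Hypothetical rule list: (suffix, replacement), tried in order.
RULES = [("ational", "ate"), ("sses", "ss"), ("ing", "")]

def toy_stem(word):
    # Apply the first rule whose suffix matches; otherwise return the word.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["relational", "motoring", "grasses", "sing"]:
    print(word, "-->", toy_stem(word))
```

Without conditions on the stem, the rule for `ing` also turns `sing` into `s`, which is the kind of over-stemming the full algorithm guards against.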

In [13]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmer_1 = PorterStemmer()
stemmer_2 = SnowballStemmer(language='english')

tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print('1\t' + token + ' --> ' + stemmer_1.stem(token))
    print('2\t' + token + ' --> ' + stemmer_2.stem(token))
    print('---------------------------')

1	compute --> comput
2	compute --> comput
---------------------------
1	computer --> comput
2	computer --> comput
---------------------------
1	computed --> comput
2	computed --> comput
---------------------------
1	computing --> comput
2	computing --> comput
---------------------------


### Extra Feature of Snowball Stemming
Stemming stopwords such as ‘being’ to ‘be’ is rarely useful: although the two words are grammatically related, stopwords carry little meaning of their own. Snowball provides a parameter called `ignore_stopwords`, which is set to `False` by default.

If it is set to `True`, Snowball will not stem stopwords.



In [14]:
tokens = ['is', 'being', 'be', 'was', 'founding', 'mice', 'runs', 'running', 'ran']
stemmer_1 = PorterStemmer()
stemmer_2 = SnowballStemmer(language='english')
# ignore_stopwords=True needs the NLTK stopwords corpus: nltk.download('stopwords')
stemmer_3 = SnowballStemmer(language='english', ignore_stopwords=True)

for token in tokens:
    print('Stemmer 1: ' + token + ' --> ' + stemmer_1.stem(token))
    print('Stemmer 2: ' + token + ' --> ' + stemmer_2.stem(token))
    print('Stemmer 3: ' + token + ' --> ' + stemmer_3.stem(token))
    print('---------------------------')

Stemmer 1: is --> is
Stemmer 2: is --> is
Stemmer 3: is --> is
---------------------------
Stemmer 1: being --> be
Stemmer 2: being --> be
Stemmer 3: being --> being
---------------------------
Stemmer 1: be --> be
Stemmer 2: be --> be
Stemmer 3: be --> be
---------------------------
Stemmer 1: was --> wa
Stemmer 2: was --> was
Stemmer 3: was --> was
---------------------------
Stemmer 1: founding --> found
Stemmer 2: founding --> found
Stemmer 3: founding --> found
---------------------------
Stemmer 1: mice --> mice
Stemmer 2: mice --> mice
Stemmer 3: mice --> mice
---------------------------
Stemmer 1: runs --> run
Stemmer 2: runs --> run
Stemmer 3: runs --> run
---------------------------
Stemmer 1: running --> run
Stemmer 2: running --> run
Stemmer 3: running --> run
---------------------------
Stemmer 1: ran --> ran
Stemmer 2: ran --> ran
Stemmer 3: ran --> ran
---------------------------


## Case Folding - Normalizing Step

Normalization puts tokens/words into a standard format, for example by choosing a single form for:

1. U.S or US
2. uhhuh or uh-huh
3. Fed or fed
4. am, is, be, are
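
A minimal sketch of such a normalization step is shown below. The function `normalize` and its rules (case folding, then dropping periods and hyphens) are assumptions chosen to cover the first three examples; mapping am/is/are to be is a lemmatization task and is not handled here.

```python
def normalize(token):
    # Case-fold, then remove periods and hyphens inside the token.
    return token.casefold().replace(".", "").replace("-", "")

for token in ["U.S", "US", "uh-huh", "uhhuh", "Fed", "fed"]:
    print(token, "-->", normalize(token))
```

Case folding is lossy: after normalization, the country `US` and the pronoun `us` become the same token, so whether to apply it depends on the task.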


### Text Pre-Processing Step - Spelling Correction 

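Minimum Edit Distance: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into another. The result depends on the cost assigned to each operation. For one example pair of strings:

* If each operation has a cost of 1, the distance between them is 5.
* If the substitution cost is 2 (Levenshtein), the distance between them is 8.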
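
The computation can be sketched with the standard dynamic-programming table. The function name `min_edit_distance` and the `sub_cost` parameter are illustrative. The text does not name the pair of strings it measures; the pair intention/execution is assumed here because it is the well-known textbook example that yields the same figures.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance with unit insert/delete cost and a
    configurable substitution cost."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                          # deletion
                dist[i][j - 1] + 1,                          # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution
            )
    return dist[n][m]

# Assumed example pair (not named in the text above).
print(min_edit_distance("intention", "execution"))              # unit costs
print(min_edit_distance("intention", "execution", sub_cost=2))  # substitution costs 2
```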

In [29]:
import editdistance
from english_words import english_words_set
corpus = ['banana', 'bahama']
editdistance.eval(corpus[0], corpus[1])

2

In [32]:
def spelling_correction(word, vocabulary):
    '''
    Return the vocabulary entry with the minimum edit distance (med) to `word`.
    If several entries tie, the first one in `vocabulary` is returned.
    '''
    return min(vocabulary, key=lambda candidate: editdistance.eval(word, candidate))

def correct_the_spell(word = 'banama'):
    # Default Vocabulary from spacy
    # nlp = spacy.load('en_core_web_sm')
    # vocabulary = list(nlp.vocab.strings)
    
    # English Vocabulary
    vocabulary = list(english_words_set)

    print(f'Original Word: {word}')
    print(f'The correct spelling: {spelling_correction(word, vocabulary)}')


In [33]:
correct_the_spell(word = 'banama')

Original Word: banama
The correct spelling: panama


In [34]:
correct_the_spell(word = 'bahana')

Original Word: bahana
The correct spelling: banana


In [36]:
correct_the_spell(word = 'ghosr')

Original Word: ghosr
The correct spelling: ghost


In [37]:
correct_the_spell(word = 'acceptible')

Original Word: acceptible
The correct spelling: accessible


In [43]:
correct_the_spell(word = 'hight')

Original Word: hight
The correct spelling: eight
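
Results such as `hight` being corrected to `eight` follow from ties: several words can sit at the same minimum distance, and the corrector returns whichever comes first. The sketch below lists all tied candidates. It uses a small hand-written vocabulary (an assumption for illustration) and its own unit-cost distance function, so it does not depend on the `editdistance` package.

```python
def levenshtein(a, b):
    """Unit-cost edit distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def closest_words(word, vocabulary):
    # Return every vocabulary entry at the minimum distance, sorted.
    distances = {v: levenshtein(word, v) for v in vocabulary}
    best = min(distances.values())
    return sorted(v for v, d in distances.items() if d == best)

# Hypothetical mini-vocabulary.
vocabulary = ["eight", "height", "high", "light", "night", "banana"]
print(closest_words("hight", vocabulary))
```

Breaking such ties sensibly needs extra information, for example word frequency or the surrounding context.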
