# Tokenization, Tagging, Chunking - Normalization

In [1]:
import nltk

Normalization is the act of cleaning text data to make it more uniform.

**Removing punctuation.**

In [2]:
md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")

In [3]:
md_8 = md[:8]
md_8

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']

In [4]:
# Keep only purely alphabetic tokens with .isalpha()
for word in md_8:
    if word.isalpha():
        print (word)

Moby
Dick
by
Herman
Melville


If a token consists only of letters, `.isalpha()` returns True and the token is printed. For punctuation such as brackets, or numbers such as dates, `.isalpha()` returns False and the token is skipped.

**Making everything lowercase**, so that Cat and cat do not show up as separate tokens in a frequency distribution.

In [5]:
for word in md_8:
    print (word.lower())

[
moby
dick
by
herman
melville
1851
]


In [6]:
# Both steps at once, in a single list comprehension
norm = [word.lower() for word in md_8 if word.isalpha()]

In [7]:
norm

['moby', 'dick', 'by', 'herman', 'melville']
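To see why these two steps matter for a frequency distribution, here is a minimal sketch that uses only the standard library. The `normalize` helper and the `sample` token list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def normalize(tokens):
    """Lowercase tokens and drop anything that is not purely alphabetic."""
    return [tok.lower() for tok in tokens if tok.isalpha()]

# Hypothetical sample tokens, not taken from the corpus above
sample = ["The", "cat", "saw", "the", "Cat", ",", "twice", "."]

raw_counts = Counter(sample)
norm_counts = Counter(normalize(sample))

# Before normalization "cat" and "Cat" are counted as separate tokens
print(raw_counts["cat"], raw_counts["Cat"])
# After normalization they are merged into a single entry
print(norm_counts["cat"])
```

The same `normalize` function could be applied to `md_8` or to the full `md` token list before counting.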

## Stemmers

**Stemmers normalize text further** by reducing inflected forms, such as plurals, to a common stem, **for example cat vs. cats or city vs. cities.**

There are many different kinds of stemmers, so you have to pick the one that works best for your use case.

In [8]:
# Type 1 stemmer. Porter stemmer
porter = nltk.PorterStemmer()

In [9]:
my_list = ["cat","cats","lie","lying","run","running","city","cities","month","monthly","woman","women"]

In [10]:
for word in my_list:
    print (porter.stem(word))

cat
cat
lie
lie
run
run
citi
citi
month
monthli
woman
women


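**The inflected forms of cat, lie and run were normalized properly, but there are problems with city, month and woman.** "city" and "cities" both become "citi", which is consistent but not a real word; "monthly" becomes "monthli" rather than "month"; and "women" is not mapped to "woman".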

In [11]:
# Type 2 stemmer. Lancaster stemmer
lancaster = nltk.LancasterStemmer()

In [12]:
for word in my_list:
    print (lancaster.stem(word))

cat
cat
lie
lying
run
run
city
city
mon
month
wom
wom


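**This time there are errors with lie, month and woman, but "cities" was normalized to "city".** "lying" is left unchanged, "month" and "monthly" end up with different stems ("mon" and "month"), and "woman" and "women" are both cut down to "wom", which is consistent but not a real word.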
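One way to compare stemmers more systematically is to group words by the stem they produce and check which words end up together. The sketch below is only an illustration: `group_by_stem` and the toy `strip_s` function are hypothetical helpers, and `strip_s` is a deliberately naive stand-in so the example runs without NLTK.

```python
from collections import defaultdict

def group_by_stem(words, stem):
    """Map each stem to the list of input words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)

def strip_s(word):
    """Toy stand-in stemmer (hypothetical): drops one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

groups = group_by_stem(["cat", "cats", "city", "cities"], strip_s)
for stem, members in groups.items():
    print(stem, members)

# With NLTK available, pass a real stemmer instead, e.g.:
# group_by_stem(my_list, nltk.PorterStemmer().stem)
# group_by_stem(my_list, nltk.LancasterStemmer().stem)
```

Words that share a group were normalized to the same token. Words you expected to be together but that land in different groups, such as "city" and "cities" under the toy stemmer, point to a weakness of that stemmer for your data.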

We can try to solve the normalization problem with **lemmatization**, which maps each word to its dictionary form (its lemma) by looking it up in a lexical resource. The lemmatizer provided by NLTK is based on the WordNet resource. This is computationally heavy.

In [13]:
wnlem = nltk.WordNetLemmatizer()

In [14]:
for word in my_list:
    print (wnlem.lemmatize(word))

cat
cat
lie
lying
run
running
city
city
month
monthly
woman
woman


**The lemmatizer normalized "women" to "woman" and "cities" to "city", but it left "lying" and "running" unchanged, which at least one of the stemmers had reduced.**
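A likely reason the verb forms came back unchanged is that `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to `"n"` (noun), so every word above was looked up as a noun. The sketch below shows one possible workaround: try several part-of-speech tags and keep the first lemma that differs from the input. The helper name `lemmatize_any_pos` and the tag order are assumptions made for illustration; a more careful approach would tag each word in context first.

```python
def lemmatize_any_pos(lemmatize, word, pos_tags=("n", "v", "a", "r")):
    """Try each part-of-speech tag in order.

    Returns the first lemma that differs from the input word,
    or the word itself if no tag changes it.
    """
    for pos in pos_tags:
        lemma = lemmatize(word, pos=pos)
        if lemma != word:
            return lemma
    return word

# Usage with NLTK (requires the WordNet corpus to be downloaded):
# wnlem = nltk.WordNetLemmatizer()
# for word in my_list:
#     print(lemmatize_any_pos(wnlem.lemmatize, word))
```

The function takes the `lemmatize` callable as a parameter, so it works with `wnlem.lemmatize` or with any other function that accepts a word and a `pos` keyword.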

Extra NOTES:

Tokenizing is the act of splitting a document of text into tokens. Tokens can be sentences (`nltk.sent_tokenize(text)`) or words and punctuation (`nltk.word_tokenize(text)`).

Splitting (for example with `text.split()`) only breaks the text at a chosen delimiter, which is whitespace by default. Splitting and tokenizing are not the same and give different results. When working with text, you should prefer tokenization over splitting.

Consider this example:

text = "Yesterday I bought an apple, a banana, and a lemon. Today I will buy carrots and potatoes."
 
`nltk.sent_tokenize(text)`
['Yesterday I bought an apple, a banana, and a lemon.', 'Today I will buy carrots and potatoes.']
 
`nltk.word_tokenize(text)`
['Yesterday', 'I', 'bought', 'an', 'apple', ',', 'a', 'banana', ',', 'and', 'a', 'lemon', '.', 'Today', 'I', 'will', 'buy', 'carrots', 'and', 'potatoes', '.']
 
`text.split()`
['Yesterday', 'I', 'bought', 'an', 'apple,', 'a', 'banana,', 'and', 'a', 'lemon.', 'Today', 'I', 'will', 'buy', 'carrots', 'and', 'potatoes.']

Each approach gives different results. With `split`, we get `'apple,'` with the comma still attached, which is usually not what we want.
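To make the difference concrete without any NLTK data, here is a rough regular-expression sketch that separates punctuation from words. It is only a stand-in for illustration: `simple_word_tokenize` is a hypothetical helper, and it is not the algorithm behind `nltk.word_tokenize`, which handles contractions, abbreviations and other cases that this pattern ignores.

```python
import re

def simple_word_tokenize(text):
    """Rough sketch: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Yesterday I bought an apple, a banana, and a lemon."

# Whitespace splitting keeps punctuation glued to the neighbouring word
print(text.split())
# The regex sketch emits punctuation as separate tokens
print(simple_word_tokenize(text))
```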