# 1) Basic Natural Language Processing.
![1](1.png)
![2](2.png)
![3](3.png)

# 2) Basic NLP tasks with NLTK.
![4](4.png)

## - Counting vocabulary of words
A corpus is a collection of texts; each of `text1` ... `text9` loaded below is one corpus.

In [1]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [8]:
# let's see the text 1
text1

<Text: Moby Dick by Herman Melville 1851>

In [9]:
# lets discover the catalog of sentences.
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [10]:
# look at sentence_1 of catalog of sentences.
sent1

['Call', 'me', 'Ishmael', '.']

In [23]:
# explore text 7
print(text7,'\n')
# sent7 is one sentence taken from text 7.
print(sent7,'\n')
# length of sent7 in tokens.
print(len(sent7),'\n')
# length of the entire text 7 in tokens.
print(len(text7),'\n')
# number of unique tokens (the vocabulary size).
print(len(set(text7)),'\n')
# ten of the unique tokens (a set is unordered, so which ten you get is arbitrary).
print(list(set(text7))[:10])

<Text: Wall Street Journal> 

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'] 

18 

100676 

12408 

['craze', 'Somerset', 'Spiegel', 'till', 'uncharted', 'PORTING', 'deterioration', 'put', 'Uncertainty', 'Hutchinson']
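The difference between `len(text7)` (tokens) and `len(set(text7))` (distinct types) can be sketched on a much smaller input. The snippet below is only an illustration using the standard library: the token list is copied from `sent7`, while the variable names and the "lexical diversity" ratio are assumptions added for this example.

```python
# Tokens copied from sent7 above; the variable names are illustrative.
tokens = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
          'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

num_tokens = len(tokens)   # every occurrence counts
vocab = set(tokens)        # each distinct token counts once
num_types = len(vocab)

# Share of tokens that are distinct (1.0 would mean no repetition at all).
lexical_diversity = num_types / num_tokens

# A set has no fixed order, so sort it to get a reproducible listing.
first_types = sorted(vocab)[:5]

print(num_tokens, num_types, round(lexical_diversity, 3))
print(first_types)
```

Sorting the set before slicing is a small design choice: it makes the "first few unique words" the same on every run, which `list(set(...))[:10]` does not guarantee.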


## - Frequency of words

In [24]:
# Now, if you want to find out the freq. of words.
dist = FreqDist(text7)
dist

FreqDist({',': 4885, 'the': 4045, '.': 3828, 'of': 2319, 'to': 2164, 'a': 1878, 'in': 1572, 'and': 1511, '*-1': 1123, '0': 1099, ...})

In [26]:
# the length of dist is equal to the num of unique words in text.
len(dist)

12408

In [38]:
# To get a list of words in dist (Vocabs):
vocab1 = dist.keys()
# put it into list because 'dict_keys' object is not subscriptable.
print(list(vocab1)[5:20])
# if i want to find how many times a particular word occurs.
print(dist['four'])

['old', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.', 'Mr.', 'is', 'chairman']
20


In [39]:
# To find the words that are genuinely frequent, filter on both
# the word length and the number of occurrences.
# The length restriction drops short tokens such as `the` or `,`,
# which are frequent but carry little content.

freqwords = [w for w in vocab1 if len(w)>5 and dist[w]>100] 
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
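`FreqDist` is built on Python's `collections.Counter`, so the same counting and filtering can be sketched without any corpus download. The mini-corpus below is made up purely so the example is self-contained, and the thresholds are scaled down to match it.

```python
from collections import Counter

# Hypothetical mini-corpus, invented only to make the sketch runnable.
tokens = ("the market fell and the company said the market "
          "would recover , the company added .").split()

counts = Counter(tokens)

# Look up individual words; a word that never occurs counts as 0.
print(counts['market'])
print(counts['missing'])

# Most frequent tokens first.
print(counts.most_common(3))

# Same idea as `freqwords` above: long enough AND frequent enough.
frequent_long = sorted(w for w in counts if len(w) > 5 and counts[w] >= 2)
print(frequent_long)
```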

## - Normalization and stemming.
Normalization means mapping different forms of the same word to a single form, so that they are treated as one word instead of several.

In [47]:
input1 = "List listed lists listing listings"
# First, lowercase all of these variants and split them into words.
words1 = input1.lower().split()
print(words1)
# Then use PorterStemmer() to do the stemming.
# Stemming reduces each word to its stem (base form) by stripping suffixes.
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'listed', 'lists', 'listing', 'listings']


['list', 'list', 'list', 'list', 'list']
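To make the idea of suffix stripping concrete, here is a deliberately tiny stemmer. It is a toy written for this note, not the Porter algorithm (which applies many more rules in several phases); the suffix list and the `min_stem_len` parameter are assumptions.

```python
# Toy suffix stripper; NOT the Porter algorithm.
SUFFIXES = ("ings", "ing", "ed", "s")   # checked in this order, longest first

def toy_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        # Only strip when enough of the word is left to be a plausible stem.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

words = "List listed lists listing listings".split()
stems = [toy_stem(w) for w in words]
print(stems)
print(toy_stem("rights"), toy_stem("bed"))
```

The minimum stem length is what keeps `bed` from being cut down to `b`; real stemmers use more careful conditions for the same purpose.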

## - Lemmatization.
Lemmatization is a slight variant of stemming in which every word that comes out is a valid, meaningful word.

In [54]:
# load the `Universal Declaration of Human Rights`
udhr = nltk.corpus.udhr.words('English-Latin1')
print(udhr[:20],'\n')
# apply stemming to the first 20 words.
print([porter.stem(t) for t in udhr[:20]])

['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'rights', 'of'] 

['univers', 'declar', 'of', 'human', 'right', 'preambl', 'wherea', 'recognit', 'of', 'the', 'inher', 'digniti', 'and', 'of', 'the', 'equal', 'and', 'inalien', 'right', 'of']


- As you can see above, stemming reduced some words to a sensible base form, such as `rights` to `right`. But you may also have noticed that `univers` and `declar` are not valid words.

**Lemmatization solves this issue: it works like stemming, but the resulting stems are all valid words.**

In [66]:
# set the word net Lemmatizer 
WNlemma = nltk.WordNetLemmatizer()
print([WNlemma.lemmatize(t) for t in udhr[:20]])

['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'right', 'of']


![5](5.png)
- You can see that all the resulting lemmas are valid words, but `Rights` was left unchanged because it is capitalized. So we need to lowercase the words first for the lemmatization to be accurate.

In [72]:
# lowercase, then lemmatize:
lowered_list = [w.lower() for w in list(udhr[:20])]
print([WNlemma.lemmatize(t) for t in lowered_list])

# Now `Rights` has been reduced to `right`.

['universal', 'declaration', 'of', 'human', 'right', 'preamble', 'whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'right', 'of']
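Why does case matter, and why does the lemmatizer never produce something like `wherea`? One way to picture it is that a candidate base form is accepted only if it appears in a dictionary of known words. The sketch below is a toy model of that idea, not how `WordNetLemmatizer` is actually implemented; `KNOWN_LEMMAS` and `toy_lemmatize` are hypothetical names.

```python
# Toy model of dictionary-validated lemmatization.
# The real WordNetLemmatizer relies on WordNet's own rules and word lists.
KNOWN_LEMMAS = {"right", "universal", "declaration", "dignity"}  # hypothetical mini-dictionary

def toy_lemmatize(word):
    """Strip a plural 's' only when the result is a known (lowercase) lemma."""
    if word.endswith("s") and word[:-1] in KNOWN_LEMMAS:
        return word[:-1]
    return word  # otherwise leave the word untouched

print(toy_lemmatize("rights"))          # candidate 'right' is in the dictionary
print(toy_lemmatize("Rights"))          # candidate 'Right' is not, so no change
print(toy_lemmatize("Rights".lower()))  # lowercasing first fixes that
print(toy_lemmatize("whereas"))         # 'wherea' is not a word, so no change
```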


## - Tokenization.
Recall that tokenization means splitting a sentence into words, or tokens.

In [73]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split()

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

**You can see that splitting on whitespace alone does not do a good job: it keeps the full stop attached to the word in `bed.`, which is not what we want.**

- **The solution here is to apply the NLTK tokenizer.**

In [75]:
print(nltk.word_tokenize(text11))

['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']


![6](6.png)
**You will notice that `shouldn't` became `should` and `n't`, where `n't` stands for "not". This matters in quite a few NLP tasks, because you often want to detect negation (the contradiction or denial of something).**

- It's important to know that some pieces, like `n't`, should be separated into tokens of their own. `But there is an even more fundamental question: what is a sentence, and how do you find sentence boundaries?`
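The behaviour described above (punctuation split off, `n't` separated from its verb) can be approximated with a single regular expression. This is only a rough sketch for intuition; NLTK's `word_tokenize` handles many more cases, and the pattern and the `simple_tokenize` name are assumptions made for this example.

```python
import re

# Alternatives are tried left to right at each position:
#   n't            -> the negation clitic as its own token
#   \w+(?=n't)     -> the part of a word that precedes n't (e.g. "should")
#   \w+            -> any other run of word characters
#   [^\w\s]        -> a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_PATTERN.findall(text)

text11 = "Children shouldn't drink a sugary drink before bed."
print(text11.split())           # whitespace only: 'bed.' keeps its full stop
print(simple_tokenize(text11))  # punctuation and n't become separate tokens
```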

In [76]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
print(text12)

This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!


So you already know that a sentence can end with a full stop, a question mark, an exclamation mark, and so on. But not every full stop ends a sentence:

- For example, `U.S.` (which stands for United States) is a single word with two full stops, and neither of them ends the sentence. The same goes for `$2.99`: that full stop is part of a number, not the end of a sentence.

- The solution: we can use NLTK's built-in sentence splitter here.

In [78]:
sentences = nltk.sent_tokenize(text12)
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
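To see why a dedicated sentence splitter is needed, compare a naive rule ("split after `.`, `!` or `?` followed by whitespace") with a version that knows about a few abbreviations. Both functions below are illustrative sketches written for this note; the `ABBREVIATIONS` set is a hypothetical hand-made list, whereas NLTK's splitter relies on a trained model rather than a fixed list.

```python
import re

ABBREVIATIONS = {"U.S.", "Mr.", "Nov."}  # hypothetical hand-made list

def naive_split(text):
    """Split after ., ! or ? whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_with_abbreviations(text):
    """Like naive_split, but glue a piece to the next one if it ends in a known abbreviation."""
    sentences, carry = [], ""
    for piece in naive_split(text):
        if not piece:
            continue
        candidate = f"{carry} {piece}" if carry else piece
        if candidate.split()[-1] in ABBREVIATIONS:
            carry = candidate          # not a real boundary: keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

text12 = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
          "Is this the third sentence? Yes, it is!")
print(len(naive_split(text12)))
print(split_with_abbreviations(text12))
```

The naive rule breaks the second sentence after `U.S.`, while `$2.99` survives only because no whitespace follows its full stop. The abbreviation list fixes this input, but it would wrongly merge sentences that genuinely end in an abbreviation, which is one reason a learned splitter is preferable.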

# 3) Advanced NLP tasks with NLTK.

![7](7.png)

## - Part-of-speech (POS) tagging.

![8](8.png)

In [99]:
import nltk
# let's take a look at some of the Penn Treebank `Tags` for English.
# upenn_tagset() prints the description itself and returns None,
# so call it directly instead of wrapping it in print().
nltk.help.upenn_tagset('MD')
print()
nltk.help.upenn_tagset('VB')
print()
nltk.help.upenn_tagset('DT')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would

VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...

DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those


In [89]:
# Now, let's see how useful it can be in action.
text11 = "Children shouldn't drink a sugary drink before bed."
print(text11,'\n')
# 1st: tokenize
tokens = nltk.word_tokenize(text11)
print(tokens)
# 2nd: run the POS tagger.
nltk.pos_tag(tokens)

Children shouldn't drink a sugary drink before bed. 

['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']


[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
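Once a sentence is a list of `(word, tag)` pairs, ordinary Python is enough to query it. The sketch below copies the tagger output shown above and pulls out the nouns, then lists the words that received more than one tag; the variable names are illustrative.

```python
from collections import defaultdict

# (word, tag) pairs copied from the tagger output above.
tagged = [('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'), ('drink', 'VB'),
          ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'),
          ('bed', 'NN'), ('.', '.')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)

# Collect every tag each word received in this sentence.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Words with more than one tag: the same string plays different roles.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

Here `drink` shows up once as a verb (`VB`) and once as a noun (`NN`), which is exactly the kind of distinction plain tokenization cannot make.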

In [90]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

![9](9.png)
**The ambiguity here is that you cannot tell whether `Visiting` is a verb (the act of visiting aunts is a nuisance) or a modifier of the noun `aunts` (aunts who visit are a nuisance).**

**`nltk.pos_tag()` does not return the alternative analyses; it gives a single tag per token, the one it considers most likely. Here that is `VBG`, which stands for verb, gerund or present participle.**

Note : nuisance =  inconvenience or annoyance

## - Parsing Sentence Structure. 
![10](10.png)

In [98]:
text15 = 'Alice loves Bob'
text15_tokens = nltk.word_tokenize(text15)
print(text15_tokens,'\n')
# define the grammar as a context-free grammar (CFG).
grammar = nltk.CFG.fromstring("""
                              S -> NP VP
                              VP -> V NP
                              NP -> 'Alice' | 'Bob'
                              V -> 'loves'
                              """)
parser = nltk.ChartParser(grammar)
# parse_all() returns a list with every parse tree the grammar allows.
trees = parser.parse_all(text15_tokens)
print(trees,'\n')

for tree in trees:
    print(tree)

['Alice', 'loves', 'Bob'] 

[Tree('S', [Tree('NP', ['Alice']), Tree('VP', [Tree('V', ['loves']), Tree('NP', ['Bob'])])])] 

(S (NP Alice) (VP (V loves) (NP Bob)))


![11](11.png)

In [111]:
text16 = "I saw the man with a telescope"
text16_tokens = nltk.word_tokenize(text16)

# The PP "with a telescope" can attach to the NP (NP -> Det N PP)
# or to the VP (VP -> VP PP), so this grammar allows two parses.
grammar1 = nltk.CFG.fromstring("""
                            S -> NP VP
                            PP -> P NP
                            NP -> Det N | Det N PP | 'I'
                            VP -> V NP | VP PP
                            Det -> 'a' | 'the'
                            N -> 'man' | 'telescope'
                            V -> 'saw'
                            P -> 'with'
                              """)
parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16_tokens)
print(trees,'\n')

for tree in trees:
    print(tree)

[Tree('S', [Tree('NP', ['I']), Tree('VP', [Tree('VP', [Tree('V', ['saw']), Tree('NP', [Tree('Det', ['the']), Tree('N', ['man'])])]), Tree('PP', [Tree('P', ['with']), Tree('NP', [Tree('Det', ['a']), Tree('N', ['telescope'])])])])]), Tree('S', [Tree('NP', ['I']), Tree('VP', [Tree('V', ['saw']), Tree('NP', [Tree('Det', ['the']), Tree('N', ['man']), Tree('PP', [Tree('P', ['with']), Tree('NP', [Tree('Det', ['a']), Tree('N', ['telescope'])])])])])])] 

(S
  (NP I)
  (VP
    (VP (V saw) (NP (Det the) (N man)))
    (PP (P with) (NP (Det a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (Det the) (N man) (PP (P with) (NP (Det a) (N telescope))))))
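The two trees come directly from the grammar: the prepositional phrase can attach either to the verb phrase or to the noun phrase. To make that mechanical, the sketch below counts the derivations of a token sequence under the same rules using plain Python. It is an illustration only, not how `ChartParser` works internally; the dictionary encoding of the grammar and the `count_parses` function are assumptions of this example.

```python
from functools import lru_cache

# Same rules as the grammar above, encoded as a plain dict (illustrative encoding).
# Any symbol that is not a key of the dict is treated as a terminal (a word).
GRAMMAR = {
    "S":   [("NP", "VP")],
    "PP":  [("P", "NP")],
    "NP":  [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP":  [("V", "NP"), ("VP", "PP")],
    "Det": [("a",), ("the",)],
    "N":   [("man",), ("telescope",)],
    "V":   [("saw",)],
    "P":   [("with",)],
}

def count_parses(tokens, start="S"):
    """Count the distinct parse trees of `tokens` rooted at `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways `symbol` can derive tokens[i:j].
        if symbol not in GRAMMAR:  # terminal: must match exactly one token
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can derive tokens[i:j].
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        first, rest = rhs[0], rhs[1:]
        # Try every split point; each symbol must cover at least one token.
        return sum(count(first, i, k) * count_seq(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return count(start, 0, len(tokens))

print(count_parses("I saw the man with a telescope".split()))
print(count_parses("I saw the man".split()))
```

Because every symbol on a right-hand side must cover at least one token, the left-recursive rule `VP -> VP PP` always recurses on a strictly shorter span, so the counting terminates.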


- We gave examples of simple grammars and wrote a context-free grammar by hand for each, but we cannot do that every time. In fact, inducing a grammar and its rules is itself a learning task, and it needs a lot of training data.
- A lot of manual effort has gone into creating what is known as a `treebank`: a big collection of parse trees.

In [115]:
from nltk.corpus import treebank
print(sent7,'\n')

text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'] 

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
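The bracketed notation printed above is simple enough to read back with a few lines of Python, which helps to see what a treebank entry actually contains. The sketch below is a minimal reader written for this note (NLTK has its own `Tree` class for this); the function names and the `(label, children)` tuple representation are assumptions. The input is the subject noun phrase copied from the tree above.

```python
import re

def parse_bracketed(text):
    """Turn '(LABEL child child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested subtree
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read()

def pos_pairs(tree):
    """Collect (word, tag) pairs from the nodes that sit directly above a word."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(pos_pairs(child))
    return pairs

subject = "(NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))"
tree = parse_bracketed(subject)
print(tree[0])
print(pos_pairs(tree))
```

This shows that a parsed treebank sentence carries the POS tags as well as the phrase structure, so the same resource can serve as training data for both tasks.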


## - POS tagging and parsing ambiguity.
![12](12.png)

In [124]:
text18 = "The old man the boat"  # here 'man' is a verb: to staff or operate (a boat)
print(text18,'\n')

text18_tokens = nltk.word_tokenize(text18)
print(text18_tokens,'\n')

# apply the POS tagger:
nltk.pos_tag(text18_tokens)

The old man the boat 

['The', 'old', 'man', 'the', 'boat'] 



[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]
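In the intended reading, `man` is the verb and `The old` is the subject, yet the tagger labels `man` as a noun. One intuition for this is that a tagger driven by what it saw most often will prefer the common use of a word. The sketch below is a bare-bones "most frequent tag" baseline, not NLTK's default tagger (which is a trained model that also looks at context); the training sentences are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences, invented for this sketch.
training = [
    [('the', 'DT'), ('man', 'NN'), ('saw', 'VBD'), ('the', 'DT'), ('boat', 'NN')],
    [('an', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('slept', 'VBD')],
    [('sailors', 'NNS'), ('man', 'VBP'), ('the', 'DT'), ('old', 'JJ'), ('boat', 'NN')],
]

# Count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sentence in training:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def unigram_tag(tokens, default='NN'):
    """Give each token its most frequent training tag, ignoring all context."""
    tagged = []
    for token in tokens:
        counts = tag_counts.get(token.lower())
        tag = counts.most_common(1)[0][0] if counts else default
        tagged.append((token, tag))
    return tagged

result = unigram_tag("The old man the boat".split())
print(result)
```

Since `man` is a noun in two of its three training occurrences, the baseline tags it `NN` here as well, even though this sentence needs the verb reading.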

In [128]:
text19 = 'Colorless green ideas sleep furiously'  
print(text19,'\n')

text19_tokens = nltk.word_tokenize(text19)
print(text19_tokens,'\n')

# apply the POS tagger:
nltk.pos_tag(text19_tokens)

Colorless green ideas sleep furiously 

['Colorless', 'green', 'ideas', 'sleep', 'furiously'] 



[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]

![13](13.png)
- **In the next module, we will go into more detail about how to train these models and how to build supervised methods for these tasks.**