# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



### Not for Grading

## Learning Objective

At the end of the experiment, you will be able to

*   Use the NLTK package for basic text pre-processing

## Background


### What is NLTK? 


The Natural Language Toolkit (NLTK) is a Python package that provides libraries for text processing techniques such as classification, tokenization, stemming, and tagging.

**NLTK corpus**


**Punkt:** This tokenizer divides a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences.



**WordNet:** A large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).



**averaged_perceptron_tagger:** A pre-trained tagger used to label words with their parts of speech (POS).

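**tagset:** `nltk.pos_tag` returns fine-grained Penn Treebank tags by default and coarse universal tags when called with `tagset='universal'`. The tags that appear in this experiment include the following.

Universal tags:

VERB - verbs (all tenses and modes)

NOUN - nouns (common and proper)

PRON - pronouns

ADJ - adjectives

ADV - adverbs

ADP - adpositions (prepositions and postpositions)

CONJ - conjunctions

DET - determiners

NUM - cardinal numbers

PRT - particles or other function words

Penn Treebank tags:

IN - preposition/subordinating conjunction

NN - noun, common, singular or mass

NNS - noun, plural ‘desks’

JJ - adjective ‘big’

VB - verb, base form

VBP - verb, present tense, not 3rd person singular ‘take’

DT - determiner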
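
The list above mixes two granularities: fine-grained Penn Treebank tags (such as NNS and VBP) and coarse universal tags (such as ADJ and DET). A minimal sketch of how the two relate, using a small hand-written mapping that covers only the Penn Treebank tags listed here. The names `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical and not part of NLTK:

```python
# Hand-written mapping for the Penn Treebank tags listed above (illustrative subset)
PENN_TO_UNIVERSAL = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "JJ": "ADJ",
    "VB": "VERB",
    "VBP": "VERB",
    "DT": "DET",
    "IN": "ADP",
}

def to_universal(tagged_words):
    """Convert (word, Penn tag) pairs to (word, universal tag) pairs.

    Tags missing from the mapping become 'X', the universal tag for 'other'.
    """
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_words]

sample = [("the", "DT"), ("big", "JJ"), ("desks", "NNS")]
universal_sample = to_universal(sample)
print(universal_sample)
```

In practice, `nltk.pos_tag(tokens, tagset='universal')` performs this conversion with a complete mapping, so a hand-written table like this is only useful for illustration.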



In [None]:
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/shakespeare.txt


### Read the file for pre-processing

Read the Shakespeare text file, which was extracted in the web scraping notebook.

In [None]:
# Open the file in read mode ('r'); the with-block closes it automatically
with open("shakespeare.txt", "r") as f:
    # Read the entire file into a single string
    text = f.read()

# Length of the text
len(text)

87674

In [None]:
print(text)




Comedy of Errors: Entire Play
 





The Comedy of Errors

Shakespeare homepage 
    | Comedy of Errors 
    | Entire play

ACT I
SCENE I. A hall in DUKE SOLINUS'S palace.

Enter DUKE SOLINUS, AEGEON, Gaoler, Officers, and other Attendants

AEGEON

Proceed, Solinus, to procure my fall
And by the doom of death end woes and all.

DUKE SOLINUS

Merchant of Syracuse, plead no more;
I am not partial to infringe our laws:
The enmity and discord which of late
Sprung from the rancorous outrage of your duke
To merchants, our well-dealing countrymen,
Who wanting guilders to redeem their lives
Have seal'd his rigorous statutes with their bloods,
Excludes all pity from our threatening looks.
For, since the mortal and intestine jars
'Twixt thy seditious countrymen and us,
It hath in solemn synods been decreed
Both by the Syracusians and ourselves,
To admit no traffic to our adverse towns Nay, more,
If any born at Ephesus be seen
At any Syracusian marts and fairs;
Again: if any Syracusian born
Co

### Normalize Text

1. Convert upper case letters to lower case
2. Remove newline characters '\n' from the given text



In [None]:
text = text.lower()     # Returns a string converted to lowercase
print(len(text))
# Removes new line characters
text = text.replace("\n", " ")
print(text)

87674
   comedy of errors: entire play        the comedy of errors  shakespeare homepage      | comedy of errors      | entire play  act i scene i. a hall in duke solinus's palace.  enter duke solinus, aegeon, gaoler, officers, and other attendants  aegeon  proceed, solinus, to procure my fall and by the doom of death end woes and all.  duke solinus  merchant of syracuse, plead no more; i am not partial to infringe our laws: the enmity and discord which of late sprung from the rancorous outrage of your duke to merchants, our well-dealing countrymen, who wanting guilders to redeem their lives have seal'd his rigorous statutes with their bloods, excludes all pity from our threatening looks. for, since the mortal and intestine jars 'twixt thy seditious countrymen and us, it hath in solemn synods been decreed both by the syracusians and ourselves, to admit no traffic to our adverse towns nay, more, if any born at ephesus be seen at any syracusian marts and fairs; again: if any syracusian b
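
Replacing each newline with a space leaves runs of consecutive spaces, as the output above shows. A minimal sketch of collapsing them with `str.split()` and `str.join()`, using a short sample string in place of the full play. The helper name `normalize` is hypothetical:

```python
def normalize(raw):
    """Lowercase the text and collapse all whitespace runs into single spaces."""
    lowered = raw.lower()
    # split() with no arguments splits on any whitespace (spaces, tabs, newlines)
    # and discards empty strings, so join() rebuilds the text with single spaces
    return " ".join(lowered.split())

sample = "\n\nACT I\nSCENE I. A hall in DUKE SOLINUS'S palace.\n\n"
clean = normalize(sample)
print(clean)
```

Collapsing whitespace does not change the tokens produced later, but it makes the printed text easier to read.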

### Tokenization

Tokenization splits text into tokens, which can be paragraphs, sentences, or individual words. Tokens are the units in which patterns are found, and tokenization is the base step for stemming and lemmatization.

NLTK comes with many corpora and pre-trained models. The cell below downloads the pre-trained Punkt tokenizer to perform tokenization and WordNet to perform lemmatization.

In [None]:
# Importing nltk package
import nltk

# Downloading punkt from NLTK to perform sentence tokenization
nltk.download("punkt")

# Downloading wordnet from NLTK to perform Lemmatization
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Sentence Level Tokenization

Sentence tokenization is the process of splitting text into individual sentences.

*For example :*

**Input :** `Sun rises in the east. Sun sets in the west.`

**Output:** `['Sun rises in the east.', 'Sun sets in the west.']`
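
To see why a trained tokenizer such as Punkt is useful, it helps to compare it with a naive rule. The sketch below is not how `sent_tokenize` works. It is a simplified stand-in that splits after every '.', '!' or '?' followed by whitespace, and the function name `naive_sent_split` is hypothetical:

```python
import re

def naive_sent_split(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())

simple = naive_sent_split("Sun rises in the east. Sun sets in the west.")
print(simple)

# The naive rule breaks on abbreviations, which end in '.' without ending a sentence
tricky = naive_sent_split("Dr. Smith arrived. He sat down.")
print(tricky)
```

The naive rule wrongly splits after the abbreviation in the second example. Punkt's model of abbreviations is designed to reduce this kind of error.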


In [None]:
from nltk.tokenize import sent_tokenize

# sent_tokenize() is to split a document or paragraph into sentences
sen_token = sent_tokenize(text) 

# Number of sentences
len(sen_token) 

1000

In [None]:
# Printing the sentence at index 10 (the 11th sentence)
print(sen_token[10])

there had she not been long, but she became a joyful mother of two goodly sons; and, which was strange, the one so like the other, as could not be distinguish'd but by names.


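Average sentence length is the mean number of words per sentence. The cells below count the words in each sentence and compute the average, rounded down to a whole number.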

In [None]:
"""
# This code is given as many of the participants requested 'for' loop instead list comprehension
word_count = []

for i, sent in enumerate(sen_token):
    word_count.append(len(sent.split()))
    
print(word_count)
"""

"\n# This code is given as many of the participants requested 'for' loop instead list comprehension\nword_count = []\n\nfor i, sent in enumerate(sen_token):\n    word_count.append(len(sent.split()))\n    \nprint(word_count)\n"

In [None]:
word_count = [len(sent.split()) for sent in sen_token]
print("Number of words in each sentence:", word_count)

Avg_Sent = sum(word_count)//len(word_count)
print("Average number of words used in each sentence:", Avg_Sent)

Number of words in each sentence: [28, 26, 58, 83, 21, 18, 24, 41, 23, 74, 34, 36, 19, 1, 5, 94, 103, 52, 7, 19, 17, 54, 4, 35, 50, 24, 28, 61, 37, 25, 16, 31, 30, 41, 6, 5, 13, 3, 2, 27, 36, 9, 21, 40, 3, 20, 26, 19, 21, 27, 20, 10, 20, 45, 12, 2, 7, 6, 80, 23, 18, 9, 20, 16, 38, 16, 20, 10, 6, 7, 20, 30, 38, 28, 15, 6, 5, 31, 14, 6, 7, 6, 10, 20, 47, 17, 11, 22, 6, 16, 33, 9, 9, 21, 9, 8, 78, 8, 8, 10, 9, 9, 9, 3, 14, 66, 10, 9, 13, 17, 7, 4, 20, 11, 25, 9, 10, 10, 4, 13, 24, 10, 6, 4, 2, 9, 8, 11, 9, 3, 16, 25, 10, 11, 7, 10, 20, 4, 4, 22, 22, 9, 17, 11, 9, 3, 21, 22, 8, 29, 3, 4, 8, 17, 24, 38, 16, 8, 3, 3, 39, 15, 4, 7, 5, 9, 4, 4, 9, 6, 11, 6, 6, 12, 23, 28, 17, 6, 13, 4, 6, 10, 13, 33, 18, 25, 7, 40, 8, 7, 10, 8, 15, 18, 25, 5, 8, 14, 14, 6, 13, 9, 4, 9, 14, 5, 14, 17, 13, 7, 19, 6, 18, 12, 19, 19, 27, 14, 17, 13, 18, 6, 9, 9, 6, 10, 6, 5, 26, 17, 16, 16, 21, 12, 4, 26, 50, 17, 19, 8, 37, 29, 42, 10, 39, 15, 9, 38, 3, 7, 9, 9, 5, 5, 28, 10, 9, 5, 7, 19, 12, 18, 23, 18, 63, 22, 1
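
The cell above uses floor division (`//`), so the average is rounded down to a whole number. A short sketch comparing it with the exact mean and the maximum, using only the first five counts from the output above as sample data:

```python
from statistics import mean

# First five sentence lengths from the output above (sample data only)
sample_counts = [28, 26, 58, 83, 21]

floor_avg = sum(sample_counts) // len(sample_counts)   # integer average, rounded down
exact_avg = mean(sample_counts)                        # exact arithmetic mean
longest = max(sample_counts)                           # longest sentence, in words

print("Floor average:", floor_avg)
print("Exact average:", exact_avg)
print("Longest sentence:", longest)
```

The average and the maximum answer different questions: the first describes a typical sentence, the second the longest one.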

### Word Level Tokenization

Word tokenization is the process of splitting text into words.
 
*For example*

**Input:** ```"Sun rises in the east. Sun sets in the west."```

**Output:** ```["Sun" ,"rises" ,"in", "the", "east", ".", "Sun", "sets", "in", "the", "west",  "."]```
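
A rough approximation of word tokenization can be written with a regular expression that treats each run of word characters as one token and each punctuation mark as its own token. This is a simplified stand-in, not the algorithm behind `word_tokenize`, and the function name `simple_word_tokenize` is hypothetical:

```python
import re

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    # \w+      matches a run of letters, digits or underscores
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_word_tokenize("Sun rises in the east. Sun sets in the west.")
print(tokens)

# Contractions are handled differently from word_tokenize
contraction = simple_word_tokenize("seal'd")
print(contraction)
```

The regular expression splits `seal'd` into three tokens, whereas the `word_tokenize` output in the next cell shows it as `'seal'` and `"'d"`.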





In [None]:
from nltk.tokenize import word_tokenize

# word_tokenize() method is to split a sentence into tokens or words
wtokens = word_tokenize(text)
print(wtokens)

['comedy', 'of', 'errors', ':', 'entire', 'play', 'the', 'comedy', 'of', 'errors', 'shakespeare', 'homepage', '|', 'comedy', 'of', 'errors', '|', 'entire', 'play', 'act', 'i', 'scene', 'i.', 'a', 'hall', 'in', 'duke', 'solinus', "'s", 'palace', '.', 'enter', 'duke', 'solinus', ',', 'aegeon', ',', 'gaoler', ',', 'officers', ',', 'and', 'other', 'attendants', 'aegeon', 'proceed', ',', 'solinus', ',', 'to', 'procure', 'my', 'fall', 'and', 'by', 'the', 'doom', 'of', 'death', 'end', 'woes', 'and', 'all', '.', 'duke', 'solinus', 'merchant', 'of', 'syracuse', ',', 'plead', 'no', 'more', ';', 'i', 'am', 'not', 'partial', 'to', 'infringe', 'our', 'laws', ':', 'the', 'enmity', 'and', 'discord', 'which', 'of', 'late', 'sprung', 'from', 'the', 'rancorous', 'outrage', 'of', 'your', 'duke', 'to', 'merchants', ',', 'our', 'well-dealing', 'countrymen', ',', 'who', 'wanting', 'guilders', 'to', 'redeem', 'their', 'lives', 'have', 'seal', "'d", 'his', 'rigorous', 'statutes', 'with', 'their', 'bloods', ',

### Removing Punctuations

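To remove punctuation, the cell below keeps only the tokens that consist entirely of alphabetic characters, using `str.isalpha()`. This also drops tokens that mix letters with other characters, such as `'well-dealing'` and `"'s"`.

Another option is a translation table: `str.maketrans()` takes up to three arguments and returns a table usable with `str.translate()`.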


*For example:*

**Input:** ```["Sun" ,"rises" ,"in", "the", "east", ".", "Sun", "sets", "in", "the", "west",  "."]```

**Output:** ```["Sun" ,"rises" ,"in", "the", "east", "Sun", "sets", "in", "the", "west"]```

Punctuations are removed in the above example.

In [None]:
words = []

for token in wtokens:
    if token.isalpha():     # True only if every character in the token is a letter
        words.append(token)

print(len(words))

15970
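
The translation-table approach mentioned above can be sketched as follows. It deletes punctuation characters from inside each token instead of discarding whole tokens, so the two approaches can give different results. The sample tokens and variable names are illustrative:

```python
import string

tokens = ["Sun", "rises", "in", "the", "east", ".", "well-dealing", ","]

# With three arguments, maketrans maps nothing ('' -> '') and deletes every
# character listed in the third argument
table = str.maketrans("", "", string.punctuation)

# Strip punctuation from each token, then drop tokens that became empty
stripped = [tok.translate(table) for tok in tokens]
translated = [tok for tok in stripped if tok]
print(translated)

# For comparison: keep only fully alphabetic tokens, as in the cell above
alpha_only = [tok for tok in tokens if tok.isalpha()]
print(alpha_only)
```

The translation table turns `well-dealing` into `welldealing`, while the `isalpha()` filter removes it entirely. Which behaviour is preferable depends on the task.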


In [None]:
words

['comedy',
 'of',
 'errors',
 'entire',
 'play',
 'the',
 'comedy',
 'of',
 'errors',
 'shakespeare',
 'homepage',
 'comedy',
 'of',
 'errors',
 'entire',
 'play',
 'act',
 'i',
 'scene',
 'a',
 'hall',
 'in',
 'duke',
 'solinus',
 'palace',
 'enter',
 'duke',
 'solinus',
 'aegeon',
 'gaoler',
 'officers',
 'and',
 'other',
 'attendants',
 'aegeon',
 'proceed',
 'solinus',
 'to',
 'procure',
 'my',
 'fall',
 'and',
 'by',
 'the',
 'doom',
 'of',
 'death',
 'end',
 'woes',
 'and',
 'all',
 'duke',
 'solinus',
 'merchant',
 'of',
 'syracuse',
 'plead',
 'no',
 'more',
 'i',
 'am',
 'not',
 'partial',
 'to',
 'infringe',
 'our',
 'laws',
 'the',
 'enmity',
 'and',
 'discord',
 'which',
 'of',
 'late',
 'sprung',
 'from',
 'the',
 'rancorous',
 'outrage',
 'of',
 'your',
 'duke',
 'to',
 'merchants',
 'our',
 'countrymen',
 'who',
 'wanting',
 'guilders',
 'to',
 'redeem',
 'their',
 'lives',
 'have',
 'seal',
 'his',
 'rigorous',
 'statutes',
 'with',
 'their',
 'bloods',
 'excludes',
 'a

### Stemming

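Stemming is the process of reducing a word to its root stem by stripping suffixes. The stem is not always a dictionary word: for example, the Porter stemmer used below reduces 'comedy' to 'comedi'.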

For example:

**Input:**  ```['Sun', 'rises', 'in', 'the', 'east', '.', 'Sun', 'sets', 'in', 'the', 'west', '.']```

**Output:** ```['sun', 'rise', 'in', 'the', 'east', '.', 'sun', 'set', 'in', 'the', 'west', '.']```

In [None]:
# Create an object for PorterStemmer
porter = nltk.PorterStemmer()    

stem = [porter.stem(i) for i in words]
print(stem)

['comedi', 'of', 'error', 'entir', 'play', 'the', 'comedi', 'of', 'error', 'shakespear', 'homepag', 'comedi', 'of', 'error', 'entir', 'play', 'act', 'i', 'scene', 'a', 'hall', 'in', 'duke', 'solinu', 'palac', 'enter', 'duke', 'solinu', 'aegeon', 'gaoler', 'offic', 'and', 'other', 'attend', 'aegeon', 'proceed', 'solinu', 'to', 'procur', 'my', 'fall', 'and', 'by', 'the', 'doom', 'of', 'death', 'end', 'woe', 'and', 'all', 'duke', 'solinu', 'merchant', 'of', 'syracus', 'plead', 'no', 'more', 'i', 'am', 'not', 'partial', 'to', 'infring', 'our', 'law', 'the', 'enmiti', 'and', 'discord', 'which', 'of', 'late', 'sprung', 'from', 'the', 'rancor', 'outrag', 'of', 'your', 'duke', 'to', 'merchant', 'our', 'countrymen', 'who', 'want', 'guilder', 'to', 'redeem', 'their', 'live', 'have', 'seal', 'hi', 'rigor', 'statut', 'with', 'their', 'blood', 'exclud', 'all', 'piti', 'from', 'our', 'threaten', 'look', 'for', 'sinc', 'the', 'mortal', 'and', 'intestin', 'jar', 'thi', 'sediti', 'countrymen', 'and', '
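
Because stemming maps inflected forms to a common stem, it shrinks the vocabulary. A small sketch using the same `PorterStemmer` on a hand-picked word list (the list is illustrative sample data, chosen to contain several inflected forms):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Hand-picked sample words containing several inflected forms
sample_words = ["merchant", "merchants", "law", "laws", "wanting", "wanted"]
sample_stems = [porter.stem(w) for w in sample_words]

print(sample_stems)
print("Unique words:", len(set(sample_words)))
print("Unique stems:", len(set(sample_stems)))
```

The Porter stemmer is rule-based and needs no downloaded data, which is why no `nltk.download()` call appears here.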

### Lemmatization

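Lemmatization is similar to stemming, but it uses a vocabulary (here WordNet) and the word's part of speech to return a dictionary form, the lemma. `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is given.

For example:

* rocks : rock
* corpora : corpus
* better : good (when lemmatized as an adjective, `pos='a'`)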

In [None]:
# Create an object for WordNetLemmatizer
lemma = nltk.WordNetLemmatizer()

# Lemmatize each word (default part of speech: noun)
lemmas = [lemma.lemmatize(i) for i in words]
print(lemmas[0:20])

['comedy', 'of', 'error', 'entire', 'play', 'the', 'comedy', 'of', 'error', 'shakespeare', 'homepage', 'comedy', 'of', 'error', 'entire', 'play', 'act', 'i', 'scene', 'a']


### Removing Stopwords

Download the stopword lists with `nltk.download('stopwords')`, then remove these words from the given list of words.

A few of the English stopwords in NLTK are "the", "a", "an", "in", "at", and "of".

**Example:**

**Input:** `"Sun rises in the east Sun sets in the west"`

**Output:** `["Sun", "rises", "east", "Sun", "sets", "west"]`
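
Stopword removal is a membership test against a set. The sketch below uses only the six stopwords named above as a stand-in for NLTK's full English list, and shows that the test is case-sensitive. The names `sample_stop_words` and `remove_stopwords` are hypothetical:

```python
# Stand-in for NLTK's English list: only the six stopwords named above
sample_stop_words = {"the", "a", "an", "in", "at", "of"}

def remove_stopwords(tokens, stop_words):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "Sun rises in the east Sun sets in the west".split()
filtered = remove_stopwords(tokens, sample_stop_words)
print(filtered)

# Without lowercasing, a capitalised stopword such as 'The' would slip through
case_sensitive = [tok for tok in ["The", "sun"] if tok not in sample_stop_words]
print(case_sensitive)
```

In this notebook the text was lowercased during normalization, so the case issue does not arise. Using a set instead of a list keeps each membership test fast.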

In [None]:
# Download all stopwords from NLTK
nltk.download("stopwords")

# Import stopwords
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english")) 
print(len(stop_words))
print(stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
179
{'each', 't', 'once', 'are', 'of', 'more', 'having', 'not', 'your', "isn't", 'himself', 'did', 'for', 'him', 'been', 'them', "that'll", 'because', 'hadn', 'can', "hasn't", 'such', 'shan', 're', 'who', 'just', 'my', 'hers', 'where', "haven't", "she's", 'over', 'yourselves', 'her', 'between', 'd', 'other', 'their', 'will', 'when', 'any', 'our', 'very', "shan't", 'be', 'll', 'am', 'wasn', 'yourself', 'some', 'no', 'i', 'here', 'wouldn', 'before', 'haven', 'until', 'themselves', "it's", 'during', 'these', 'as', 'all', 'ma', 'under', 'they', 'isn', 'a', 'was', 'the', 'nor', 'above', 'y', 'doing', 'which', 'an', 'from', 'out', 'ours', "couldn't", "wouldn't", "didn't", 'up', "should've", 'own', 'after', 'you', 'had', 'and', 'herself', "weren't", 'or', "don't", 'his', 'mustn', 'both', 'myself', 'itself', 'o', 'doesn', 'why', 'its', 'ain', 'mightn', 'too', 'he', 'about', 'n

In [None]:
# Iterate over all the words and keep those that are not stopwords

pre_processed = []
for word in words:
  if word not in stop_words:
    pre_processed.append(word)

print(pre_processed)
print("Number of words after removing stopwords:", len(pre_processed))

['comedy', 'errors', 'entire', 'play', 'comedy', 'errors', 'shakespeare', 'homepage', 'comedy', 'errors', 'entire', 'play', 'act', 'scene', 'hall', 'duke', 'solinus', 'palace', 'enter', 'duke', 'solinus', 'aegeon', 'gaoler', 'officers', 'attendants', 'aegeon', 'proceed', 'solinus', 'procure', 'fall', 'doom', 'death', 'end', 'woes', 'duke', 'solinus', 'merchant', 'syracuse', 'plead', 'partial', 'infringe', 'laws', 'enmity', 'discord', 'late', 'sprung', 'rancorous', 'outrage', 'duke', 'merchants', 'countrymen', 'wanting', 'guilders', 'redeem', 'lives', 'seal', 'rigorous', 'statutes', 'bloods', 'excludes', 'pity', 'threatening', 'looks', 'since', 'mortal', 'intestine', 'jars', 'thy', 'seditious', 'countrymen', 'us', 'hath', 'solemn', 'synods', 'decreed', 'syracusians', 'admit', 'traffic', 'adverse', 'towns', 'nay', 'born', 'ephesus', 'seen', 'syracusian', 'marts', 'fairs', 'syracusian', 'born', 'come', 'bay', 'ephesus', 'dies', 'goods', 'confiscate', 'duke', 'dispose', 'unless', 'thousand

### Parts of Speech:


Given any sentence, you can classify each word as a noun, verb, conjunction, or another part of speech. Doing this by hand for hundreds of thousands or even millions of sentences would be a large and tedious task, but it can be done computationally with a part-of-speech tagger.




In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

To find out what DT, JJ, or any other tag means, use the code below.


In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset('NN')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


Hint: Refer to the following link for [parts of speech](https://www.nltk.org/book/ch05.html).

In [None]:
# Find the singular common nouns (tag 'NN') among the preprocessed words

pos = nltk.pos_tag(pre_processed)
print(pos)
nouns = [i for i in pos if i[1] == 'NN']
print(len(nouns))


[('comedy', 'NN'), ('errors', 'NNS'), ('entire', 'JJ'), ('play', 'NN'), ('comedy', 'NN'), ('errors', 'NNS'), ('shakespeare', 'VBP'), ('homepage', 'JJ'), ('comedy', 'NN'), ('errors', 'NNS'), ('entire', 'JJ'), ('play', 'NN'), ('act', 'NN'), ('scene', 'NN'), ('hall', 'NN'), ('duke', 'VBZ'), ('solinus', 'JJ'), ('palace', 'NN'), ('enter', 'NN'), ('duke', 'JJ'), ('solinus', 'NN'), ('aegeon', 'NN'), ('gaoler', 'NN'), ('officers', 'NNS'), ('attendants', 'NNS'), ('aegeon', 'VBP'), ('proceed', 'VB'), ('solinus', 'JJ'), ('procure', 'NN'), ('fall', 'NN'), ('doom', 'NN'), ('death', 'NN'), ('end', 'NN'), ('woes', 'VBZ'), ('duke', 'JJ'), ('solinus', 'NN'), ('merchant', 'NN'), ('syracuse', 'NN'), ('plead', 'VBP'), ('partial', 'JJ'), ('infringe', 'NN'), ('laws', 'NNS'), ('enmity', 'NN'), ('discord', 'NN'), ('late', 'RB'), ('sprung', 'VBD'), ('rancorous', 'JJ'), ('outrage', 'NN'), ('duke', 'NN'), ('merchants', 'NNS'), ('countrymen', 'NNS'), ('wanting', 'VBG'), ('guilders', 'NNS'), ('redeem', 'VBP'), ('l
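
The cell above counts only the tag `'NN'`, which leaves out plural nouns (`'NNS'`). A short sketch of tallying every tag with `collections.Counter`, using the first eight tagged words from the output above as sample data:

```python
from collections import Counter

# First eight (word, tag) pairs from the output above (sample data only)
pos_sample = [
    ("comedy", "NN"), ("errors", "NNS"), ("entire", "JJ"), ("play", "NN"),
    ("comedy", "NN"), ("errors", "NNS"), ("shakespeare", "VBP"), ("homepage", "JJ"),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in pos_sample)
print(tag_counts.most_common())

# Count all noun tags: in the Penn Treebank tagset they start with 'NN'
all_nouns = [word for word, tag in pos_sample if tag.startswith("NN")]
print(len(all_nouns))
```

The sample also shows a tagging error: `'shakespeare'` is labelled as a verb (VBP). One likely factor is that the tagger is applied after lowercasing and stopword removal, which removes context it relies on.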

In [None]:
help(nltk.pos_tag)

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
    
        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
    
    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
    
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be u