# **Assignment 1 on Natural Language Processing**

### Date : 4th Sept, 2020

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

# NLTK Library

The [NLTK](https://www.nltk.org/) Python framework is generally used as an education and research tool. This tutorial discusses some of its features: tokenization, stemming, lemmatization, punctuation handling, character count, and word count.

**Installing NLTK** <br>
NLTK can be installed using the pip or conda package managers. For detailed installation instructions, follow this [link](https://www.nltk.org/install.html).

To ensure we are all on the same page, the coding environment will be **python3**. We suggest downloading Anaconda3 and creating a separate environment for this assignment. 
Installation instructions for Anaconda3 on Windows and Linux are available at https://docs.anaconda.com/anaconda/install/. 
NLTK and its data can be installed as follows: 
```bash
# Install the NLTK package
sudo pip3 install nltk 
# Open the NLTK data downloader (nltk must be imported before calling download)
python3 -c "import nltk; nltk.download()"
```

**Note for questions and answers:**

Write concise, to-the-point answers in the text boxes labelled **Answer here**.

# Tokenizing Words and Sentences using NLTK

**Tokenization** is the process of dividing a large quantity of text into smaller parts called tokens. <br>Understanding the patterns in a text is crucial for many NLP tasks, and tokens are the units in which such patterns are found.<br>

NLTK's `tokenize` module provides, among others, the following two functions:

1. word tokenize (`word_tokenize`)
2. sentence tokenize (`sent_tokenize`)
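
To make the two levels of tokenization concrete, here is a minimal sketch that uses only Python's standard `re` module. It is a simplified stand-in and not the algorithm behind NLTK's tokenizers; the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical.

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ?
    # (naive: this also breaks after abbreviations such as "Dr.").
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

text = "NLTK is a toolkit. It splits text into tokens!"
print(simple_sent_tokenize(text))
print(simple_word_tokenize(text))
```

Note that punctuation marks come out as separate tokens in this sketch, so they would be included in any token count.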

In [1]:
# Importing modules
import nltk
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to /home/debjoy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package inaugural to /home/debjoy/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


In [2]:
# Sample corpus.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
# print(corpus)

### **TASK**:

For the given corpus, 
1. Print the number of sentences and tokens. 
2. Print the average number of tokens per sentence.
3. Print the number of unique tokens.
4. Print the number of tokens after stopword removal using the stopwords from nltk.


In [3]:
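# TODO
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True) # Stopword lists are a separate NLTK data package

sentences = sent_tokenize(corpus)
tokens = word_tokenize(corpus)
sw = set(stopwords.words('english')) # A set gives fast membership tests
tokens_without_sws = [t for t in tokens if t not in sw]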

print("Number of sentences: {}".format(len(sentences)))
print("Number of tokens: {}".format(len(tokens)))
print("Average number of tokens per sentence: {:0.2f}".format(len(tokens)/len(sentences)))
print("Number of unique tokens: {}".format(len(set(tokens))))
print("Number of tokens after stopword removal: {}".format(len(tokens_without_sws)))
print("Number of unique tokens after stopword removal: {}".format(len(set(tokens_without_sws))))

Number of sentences: 23
Number of tokens: 1537
Average number of tokens per sentence: 66.83
Number of unique tokens: 626
Number of tokens after stopword removal: 800
Number of unique tokens after stopword removal: 543
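
The NLTK English stopword list is all lowercase, so the case-sensitive test `t not in sw` used above keeps capitalized tokens such as "I". A minimal sketch of a case-insensitive filter follows; `TOY_STOPWORDS` is a hypothetical, hand-picked stand-in for the NLTK list, and the sample tokens are only for illustration.

```python
# Hypothetical miniature stopword list (lowercase, like NLTK's English list).
TOY_STOPWORDS = {"i", "the", "of", "and", "by", "my", "was"}

tokens = ["I", "was", "summoned", "by", "my", "Country", ","]

# Case-sensitive comparison: "I" survives because only "i" is in the list.
case_sensitive = [t for t in tokens if t not in TOY_STOPWORDS]

# Lowercasing each token before the lookup removes "I" as well.
case_insensitive = [t for t in tokens if t.lower() not in TOY_STOPWORDS]

# Optionally drop tokens that are not purely alphabetic (e.g. punctuation).
words_only = [t for t in case_insensitive if t.isalpha()]

print(case_sensitive)
print(case_insensitive)
print(words_only)
```

Whether to lowercase tokens or drop punctuation depends on the task, so the counts reported above would change under either choice.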


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a kind of word normalization. Normalization maps the variants of a word that share the same core meaning but differ in form according to context (for example, by tense or number) onto a single form, so that they can be looked up and counted as one item.<br>
Hence stemming is a way to reduce any variant of a word to its root form, which need not itself be a valid word.
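
The idea of rule-based stemming can be sketched with a toy suffix stripper. This is not the Porter or Snowball algorithm; the rule list `SUFFIX_RULES` and the helper `toy_stem` are hypothetical, and real stemmers apply many more rules and conditions.

```python
# Ordered (suffix, replacement) rules; the first applicable rule wins.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)] + replacement
        # Apply a rule only if the suffix matches and the stem is not too short.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["grows", "cats", "running", "parties", "fairly"]:
    print(w, "->", toy_stem(w))
```

In this sketch "running" becomes "runn", which shows that a stem need not be a dictionary word.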

There are many stemmers provided by NLTK, such as **PorterStemmer**, **SnowballStemmer**, and **LancasterStemmer**.<br>

We will compare the outputs of PorterStemmer and SnowballStemmer.

In [4]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]

# TODO
# create an instance of both the stemmers and perform stemming on above words
ps = PorterStemmer()
ss = SnowballStemmer(language = "english")
ps_words = [ps.stem(w) for w in words]
ss_words = [ss.stem(w) for w in words]
print("PorterStemmed: {}".format(ps_words))
print("SnowballStemmed: {}".format(ss_words))

# TODO
# Complete the function which takes a sentence/corpus and gets its stemmed version.
from nltk.tokenize.treebank import TreebankWordDetokenizer
def stemSentence(sentence=None):
    # Generate tokens and stem them (Used PorterStemmer)
    tokens = word_tokenize(sentence)
    ps_words = [ps.stem(w) for w in tokens]
    
    # Detokenize the stemmed tokens
    stemmed_sentence = TreebankWordDetokenizer().detokenize(ps_words)
    return stemmed_sentence

print("\nStemmed Corpus(Partially Displayed) - ")
print(stemSentence(corpus)[:1000] + " ...")

PorterStemmed: ['grow', 'leav', 'fairli', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']
SnowballStemmed: ['grow', 'leav', 'fair', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']

Stemmed Corpus(Partially Displayed) - 
fellow-citizen of the senat and of the hous of repres: among the vicissitud incid to life no event could have fill me with greater anxieti than that of which the notif wa transmit by your order, and receiv on the 14th day of the present month . On the one hand, I wa summon by my countri, whose voic I can never hear but with vener and love, from a retreat which I had chosen with the fondest predilect, and, in my flatter hope, with an immut decis, as the asylum of my declin year--a retreat which wa render everi day more necessari as well as more dear to me by the addit of habit to inclin, and of frequent interrupt in my health to the gradual wast commit on it by time . On the other hand, the magnitud and difficulti

**What is Lemmatization?** <br>
Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It returns the base or dictionary form of a word, which is known as the lemma.<br>

*The NLTK lemmatization method is based on WordNet's built-in morphy function.*
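
Why the part of speech matters for lemmatization can be sketched with a tiny dictionary lookup. This is not how `WordNetLemmatizer` is implemented; the lexicon `TOY_LEXICON` and the helper `toy_lemmatize` are hypothetical and cover only a few words.

```python
# Maps (surface form, part of speech) to the dictionary form.
TOY_LEXICON = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("cats", "n"): "cat",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("leaves", "n"))
print(toy_lemmatize("leaves", "v"))
print(toy_lemmatize("was"))
```

The same surface form maps to different lemmas under different parts of speech, and a lookup with an unsuitable part of speech leaves the word unchanged.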

In [5]:
#imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # The lemmatizer is based on WordNet's built-in morphy function.

words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

#TODO
# Create an instance of the Lemmatizer and perform Lemmatization on above words
# You can also give Parts-of-speech(pos) to the Lemmatizer for example "v" (verb). Check the differences in the outputs.

lemmatizer = WordNetLemmatizer() 
lemmatized_words = [lemmatizer.lemmatize(w) for w in words]
n_lemmatized_words = [lemmatizer.lemmatize(w, pos="n") for w in words]
v_lemmatized_words = [lemmatizer.lemmatize(w, pos="v") for w in words]

print("Lemmatized Words: {}".format(lemmatized_words))
print("Lemmatized Words with POS=Noun: {}".format(n_lemmatized_words))
print("Lemmatized Words with POS=Verb: {}".format(v_lemmatized_words))

# The word leaves can be lemmatized to either leaf/leave depending upon the POS tag(Noun/Verb)
ln = lemmatizer.lemmatize("leaves", pos="n")
lv = lemmatizer.lemmatize("leaves", pos="v")
print("\nExample:\nDifference in lemmatization with POS for 'leaves': {}(with POS=Noun), {}(with POS=Verb)".format(ln, lv))

#TODO
# Complete the function which takes a sentence/corpus and gets its lemmatized version.
def lemmatizeSentence(sentence=None):
    # Generate tokens and lemmatize them
    tokens = word_tokenize(sentence)
    lm_words = [lemmatizer.lemmatize(w) for w in tokens]
    
    # Detokenize the lemmatized tokens
    lemmatized_sentence = TreebankWordDetokenizer().detokenize(lm_words)
    return lemmatized_sentence

print("\nLemmatized Corpus(Partially Displayed) - ")
print(lemmatizeSentence(corpus)[:1000] + " ...")

[nltk_data] Downloading package wordnet to /home/debjoy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatized Words: ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa', 'relational', 'ha']
Lemmatized Words with POS=Noun: ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa', 'relational', 'ha']
Lemmatized Words with POS=Verb: ['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships', 'easily', 'be', 'relational', 'have']

Example:
Difference in lemmatization with POS for 'leaves': leaf(with POS=Noun), leave(with POS=Verb)

Lemmatized Corpus(Partially Displayed) - 
Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitude incident to life no event could have filled me with greater anxiety than that of which the notification wa transmitted by your order, and received on the 14th day of the present month . On the one hand, I wa summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in

**Question:** Give an example of two words which have the same stem but different lemmas. Show the stem and lemma of both words in the code below.



**Answer here:** Two suitable words are "leaving" and "leaves".

In [6]:
#TODO
# Write code to print the stem and lemma of both your words
w1 = "leaving"
w2 = "leaves"

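# With the default POS (noun), "leaving" is kept as is
lmw1 = lemmatizer.lemmatize(w1)
# With pos="a" (adjective), "leaves" is kept as is
lmw2 = lemmatizer.lemmatize(w2, pos="a")
# Both words reduce to the same Porter stem
psw1 = ps.stem(w1)
psw2 = ps.stem(w2)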

print("Original words:   {}, {}".format(w1, w2))
print("Lemmatized words: {}, {}".format(lmw1, lmw2))
print("Stemmed words:    {}, {}".format(psw1, psw2))

Original words:   leaving, leaves
Lemmatized words: leaving, leaves
Stemmed words:    leav, leav


**Question:** Write a comparison between stemming and lemmatization.

**Answer here:** Stemming consists only of rule-based removal of word affixes (mostly suffixes). It does not take into account the morphology of the word or its role in the text, such as its tense, so the result may not be an actual word. Lemmatization uses morphological analysis and a vocabulary to reduce a word to its dictionary form, and tries to preserve the meaning of the word. For example, "leaving" and "leaves" were both stemmed to "leav", which is shorter but not an actual word, while lemmatization retained the original words to preserve semantic correctness.