# **Assignment 1 on Natural Language Processing**

### Date : 4th Sept, 2020

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

# NLTK Library

The [NLTK](https://www.nltk.org/) Python framework is widely used as an education and research tool. This tutorial covers some of its functionality: tokenization, stemming, lemmatization, punctuation handling, and character and word counts.

**Installing NLTK** <br>
NLTK can be installed using the pip or Conda package managers. For detailed installation instructions, follow this [link](https://www.nltk.org/install.html).

To ensure we are all on the same page, the coding environment will be **python3**. We suggest downloading Anaconda3 and creating a separate environment for this assignment. 
Installation instructions for Anaconda3 on Windows and Linux are available at https://docs.anaconda.com/anaconda/install/. 
The steps to install NLTK are: 
```bash
sudo pip3 install nltk 
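# Open the NLTK downloader from Python (nltk.download() is a Python call, not a shell command)
python3 -c "import nltk; nltk.download()"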
```

**Note for questions and answers:**

Write concise answers in the text boxes labelled **Answer here**.

# Tokenizing Words and Sentences using NLTK

**Tokenization** is the process of dividing a large quantity of text into smaller parts called tokens. <br>Understanding the patterns in a text is crucial for many NLP tasks, and tokens are the basic units used to find such patterns.<br>

NLTK's `tokenize` module provides two functions that we will use:

1. `word_tokenize`, which splits text into word and punctuation tokens
2. `sent_tokenize`, which splits text into sentences
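
To make the two levels of tokenization concrete, here is a minimal sketch that uses only Python's standard `re` module. It is a deliberately crude approximation and not the algorithm behind NLTK's `sent_tokenize` or `word_tokenize`; the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical and exist only for this illustration.

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    # Crude: abbreviations such as "Dr." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Tokens are useful. They expose patterns in text!"
sentences = simple_sent_tokenize(text)
print(sentences)
print([simple_word_tokenize(s) for s in sentences])
```

Note that punctuation marks become tokens of their own here, which is also why token counts are usually higher than plain word counts.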

In [None]:
# Importing modules
import nltk
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.


In [None]:
# Sample corpus.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
# print(corpus)

### **TASK**:

For the given corpus, 
1. Print the number of sentences and tokens. 
2. Print the average number of tokens per sentence.
3. Print the number of unique tokens.
4. Print the number of tokens after removing stopwords, using NLTK's stopword list.


In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# TODO

sentences = sent_tokenize(corpus)
words = word_tokenize(corpus)
print("Number of sentences: {}\nNumber of tokens: {}".format(len(sentences), len(words)))

# Average tokens per sentence: tokenize each sentence and sum the counts
token_count = sum(len(word_tokenize(sentence)) for sentence in sentences)
print("Average number of tokens per sentence: {:.4f}".format(token_count / len(sentences)))

# Unique tokens: a set keeps one copy of each distinct token (case-sensitive)
unique_tokens = set(words)
print("Number of unique tokens: {}".format(len(unique_tokens)))

stop_words = set(stopwords.words("english"))
print("Number of tokens after removing stopwords: {}".format(len([word for word in words if word not in stop_words])))

Number of sentences: 23
Number of tokens: 1537
Average number of tokens per sentence: 66.8261
Number of unique tokens: 626
Number of tokens after removing stopwords: 800
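
One detail worth checking in the last count: NLTK's English stopword list is lowercase, so a case-sensitive membership test keeps capitalized tokens such as `The`. The sketch below illustrates the difference on a hand-written token list. The tiny `stop_words` set is a hypothetical stand-in for `stopwords.words("english")`, and whether to lowercase before filtering is a design choice that the task statement leaves open.

```python
# Hypothetical stand-in for set(stopwords.words("english")); the real list is much longer.
stop_words = {"the", "of", "and", "is"}

tokens = ["The", "voice", "of", "the", "people", "is", "heard", "."]

# Case-sensitive filter, as in the cell above: "The" survives.
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive filter: compare the lowercased token instead.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(len(case_sensitive), case_sensitive)
print(len(case_insensitive), case_insensitive)
```

Punctuation tokens pass through both filters, so they still contribute to the count unless they are removed separately.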


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a kind of normalization for words. Normalization maps the different surface forms of a word to a single representative form, so that variants can be looked up and counted together. Words that share a meaning but vary in form according to context or sentence are thereby normalized.<br>
Hence, stemming is a way to reduce the variations of a word to a common root form, called the stem.
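
The idea of stripping affixes by rule can be illustrated with a toy suffix stripper. This is a minimal sketch for intuition only: the suffix list and the function name `toy_stem` are assumptions made for this example, and real stemmers such as Porter's apply many more rules and conditions.

```python
# Suffixes are tried in order, so longer ones come first.
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if enough of the word remains afterwards.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["grows", "cats", "running", "fairly", "leaves", "was"]:
    print(w, "->", toy_stem(w))
```

Some results, such as `runn` and `leav`, are not dictionary words, which is typical of stemmers in general.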

NLTK provides several stemmers, such as **PorterStemmer**, **SnowballStemmer**, and **LancasterStemmer**.<br>

We will compare the outputs of PorterStemmer and SnowballStemmer.

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]

# TODO
# create an instance of both the stemmers and perform stemming on above words

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("PorterStemmer stemmed words: {}".format(stemmed_words))

stemmer = SnowballStemmer(language = 'english')
stemmed_words = [stemmer.stem(word) for word in words]
print("SnowballStemmer stemmed words: {}".format(stemmed_words))

# TODO
# Complete the function which takes a sentence/corpus and gets its stemmed version.

def stemSentence(sentence=None):
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in word_tokenize(sentence)]
    stemmed_sent = ""
    for word in words:
        stemmed_sent += (word + " ")
    return stemmed_sent

print("\n" + stemSentence("The quick brown fox jumps over the lazy dog")) # Test sentences here

PorterStemmer stemmed words: ['grow', 'leav', 'fairli', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']
SnowballStemmer stemmed words: ['grow', 'leav', 'fair', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat']

the quick brown fox jump over the lazi dog 


**What is Lemmatization?** <br>
Lemmatization is the algorithmic process of finding the lemma of a word based on its meaning. It usually relies on morphological analysis of the word and aims to remove inflectional endings, returning the base or dictionary form of the word, which is known as the lemma.<br>

*The NLTK lemmatization method is based on WordNet's built-in morph function.*
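
Dictionary-based lemmatization can be pictured as a lookup keyed on both the word and its part of speech. The sketch below uses a tiny hand-written table; the names `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, and this is not how WordNet stores or searches its data.

```python
# (word, pos) -> lemma. A real lexicon has many thousands of entries
# plus morphological rules for regular forms.
LEMMA_TABLE = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("was", "v"): "be",
    ("has", "v"): "have",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry matches.
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("leaves", pos="n"))
print(toy_lemmatize("leaves", pos="v"))
print(toy_lemmatize("was"))
```

The same surface form maps to different lemmas depending on the part of speech, and a word with no matching entry is returned unchanged.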

In [None]:
#imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # The lemmatizer is based on WordNet's built-in morph function.

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

#TODO
# Create an instance of the Lemmatizer and perform Lemmatization on above words
# You can also pass a part of speech (pos) to the lemmatizer, for example "v" (verb). Check the differences in the outputs.

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos = 'n') for word in words]
print("Lemmatized words with pos = 'n': {}".format(lemmatized_words))

lemmatized_words = [lemmatizer.lemmatize(word, pos = 'v') for word in words]
print("Lemmatized words with pos = 'v': {}".format(lemmatized_words))

#TODO
# Complete the function which takes a sentence/corpus and gets its lemmatized version.

def lemmatizeSentence(sentence=None):
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in word_tokenize(sentence)]
    lemmatized_sent = ""
    for word in words:
        lemmatized_sent += (word + " ")
    return lemmatized_sent

print("\n" + lemmatizeSentence("The quick brown fox jumps over the lazy dog")) # Test sentences here

Lemmatized words with pos = 'n': ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa', 'relational', 'ha']
Lemmatized words with pos = 'v': ['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships', 'easily', 'be', 'relational', 'have']

The quick brown fox jump over the lazy dog 


In [None]:
print(stemSentence("loving"))
print(stemSentence("love"))
print(lemmatizeSentence("loving"))
print(lemmatizeSentence("love"))

love 
love 
loving 
love 


**Question:** Give an example of two words that have the same stem but different lemmas. Show the stem and lemma of both words in the code below.



**Answer here:** <br> <br>
Words -> university, universe <br> 
*Stems* -> univers, univers <br>
*Lemmas* -> university, universe <br>

In [None]:
#TODO
# Write code to print the stem and lemma of both your words
print("Words -> university , universe")
print("Stems -> {}, {}\nLemmas -> {}, {}".format(stemSentence("university"), stemSentence("universe"), lemmatizeSentence("university"), lemmatizeSentence("universe")))

Words -> university , universe
Stems -> univers , univers 
Lemmas -> university , universe 


**Question:** Write a comparison between stemming and lemmatization.

**Answer here:** <br> <br>
Stemming: <br>
Stemming uses rule-based approaches to reduce a word to its root, i.e. its stem. The input word is passed through a set of conditionals that determine which part of it should be dropped in the stemmed result. Stemmed words often suffer from understemming or overstemming, where too little or too much of the word is chopped off, so stemming is generally considered inferior to lemmatization for most tasks. Stems may or may not be actual words.

Lemmatization: <br>
Lemmatization is more nuanced than stemming and needs the part of speech (POS) of the input word to work well. If no POS is provided, NLTK's `WordNetLemmatizer` treats the word as a noun. The same word may be lemmatized differently when a different POS is given. Lemmatization generally produces a meaningful result, namely the dictionary form of the input word.