# IT550 - Information Retrieval
## Assignment 1

## Importing necessary libraries:
First, we import the libraries required for this assignment.

In [39]:
# Setup
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

import os
import re
from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Extracting contents of `<TEXT>` tag from all documents:
We use the BeautifulSoup library to parse the UTF-8 encoded files in the *mini_dataset* folder and extract the contents of each file's `<TEXT>` tag.

In [40]:
path = 'mini_dataset'
files = os.listdir(path)
textList = []

for file in files:
  with open(os.path.join(path, file), encoding='utf8') as data:
      soup = BeautifulSoup(data, features='html.parser')

      # Each file contains exactly one <TEXT>...</TEXT> pair;
      # html.parser lowercases tag names, hence find('text')
      textTag = soup.find('text')
      
      # Extracting the contents and appending it to the list
      textList.append(textTag.text)

text = ''.join(textList)
print(text)




The Telegraph - Calcutta : Business

 RBI breather for urban co-op banks

 OUR SPECIAL CORRESPONDENT 

 Mumbai, Sept. 4: The Reserve Bank has relaxed loan impairment norms for urban cooperative banks. The 90-day norm relating to asset classification and provisioning will now be applicable to gold loans and small loans up to Rs 1 lakh only from the financial year ending March 31, 2007.

 Till then these loans would be governed by the 180-day impairment norm.

 Earlier, the central bank had advised UCBs to apply the 90-day norm from the financial year ending March 31, 2005.

 The decision was based on the representations by UCBs, a Reserve Bank release said here today.

 RBI governor Y. V. Reddy had held meeting with federations of co-operative banks on September 2 where UCBs sought relaxation in these norms.

 In the meeting, specific suggestions were placed. These included a proposal to modify prudential regulations, including deferring the application of 90-day norm, a plan to excl
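
To illustrate what the loop above does for a single file, here is a minimal sketch that uses Python's standard-library `html.parser` directly instead of BeautifulSoup. The sample document string and the `TextTagExtractor` class are assumptions made up for this example; they are not part of the dataset or the assignment code. The sketch also shows why the lookup uses the lowercase name `'text'`: the parser reports tag names in lowercase.

```python
from html.parser import HTMLParser


class TextTagExtractor(HTMLParser):
    """Hypothetical helper: collects character data inside <TEXT>...</TEXT>."""

    def __init__(self):
        super().__init__()
        self.inside_text = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names converted to lowercase
        if tag == "text":
            self.inside_text = True

    def handle_endtag(self, tag):
        if tag == "text":
            self.inside_text = False

    def handle_data(self, data):
        # Keep only the data that appears between <TEXT> and </TEXT>
        if self.inside_text:
            self.chunks.append(data)


# Made-up sample document; the real files may be structured differently
sample = "<DOC><DOCNO>demo-1</DOCNO><TEXT>RBI breather for urban co-op banks</TEXT></DOC>"

extractor = TextTagExtractor()
extractor.feed(sample)
extractor.close()

extracted = "".join(extractor.chunks)
print(extracted)
```

If a file had no `<TEXT>` tag, `extracted` would simply be an empty string here, whereas `soup.find('text')` returns `None` and the subsequent `.text` access would raise an `AttributeError`.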

## Removing punctuation and numerical values from the text:
Here we remove punctuation and numerical values using the `re.sub()` function from the `re` library. The pattern keeps only ASCII letters and whitespace.

In [41]:
# Keep only ASCII letters and whitespace; this drops punctuation
# and digits in a single pass, so no separate digit filter is needed
cleanedText = re.sub(r'[^a-zA-Z\s]', '', text)

print(cleanedText)




The Telegraph  Calcutta  Business

 RBI breather for urban coop banks

 OUR SPECIAL CORRESPONDENT 

 Mumbai Sept  The Reserve Bank has relaxed loan impairment norms for urban cooperative banks The day norm relating to asset classification and provisioning will now be applicable to gold loans and small loans up to Rs  lakh only from the financial year ending March  

 Till then these loans would be governed by the day impairment norm

 Earlier the central bank had advised UCBs to apply the day norm from the financial year ending March  

 The decision was based on the representations by UCBs a Reserve Bank release said here today

 RBI governor Y V Reddy had held meeting with federations of cooperative banks on September  where UCBs sought relaxation in these norms

 In the meeting specific suggestions were placed These included a proposal to modify prudential regulations including deferring the application of day norm a plan to exclude gold loans and loans below Rs  lakh from these 
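
The effect of the pattern can be checked on a single sentence. The sketch below uses a made-up sample string (an assumption for illustration) and shows two things: filtering digits after the substitution changes nothing, and deleting characters outright merges hyphenated words such as `co-op` into `coop`. Replacing the removed characters with a space is one possible alternative, not something the assignment does.

```python
import re

# Made-up sample sentence in the style of the dataset
sample = "Mumbai, Sept. 4: The 90-day norm applies to co-op banks up to Rs 1 lakh."

# Same pattern as in the cell above: delete everything except letters and whitespace
letters_only = re.sub(r'[^a-zA-Z\s]', '', sample)

# A follow-up digit filter has nothing left to remove
digit_filtered = "".join(ch for ch in letters_only if not ch.isdigit())

# Alternative: substitute a space, so 'co-op' becomes 'co op' instead of 'coop'
spaced = re.sub(r'[^a-zA-Z\s]', ' ', sample)

print(letters_only)
print(digit_filtered == letters_only)
print(spaced.split())
```

Which behaviour is preferable depends on the retrieval task: merging keeps `coop` as one term, while spacing yields the two terms `co` and `op`.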

## Performing tokenization on the text:
Here we compare two tokenization methods.<br>
One is the built-in `str.split()` method.<br>
The other is the `nltk` library's `nltk.word_tokenize()` function.
<br>
<br>
As the output shows, both methods give the same result here.<br>
This is because punctuation and numerical values were already removed, leaving only letters and whitespace for either method to split on. Had they been kept, `nltk.word_tokenize()` would separate punctuation into its own tokens, whereas `str.split()` splits on whitespace only and would leave punctuation attached to words, which does not always yield correct word tokens.

In [42]:
tokenizeSplit = cleanedText.split()
tokenizeNLTK = nltk.word_tokenize(cleanedText)

# print(len(tokenizeSplit), len(tokenizeNLTK))
print("Tokenizing by two methods:")
print("By Split".ljust(15), "<----------->", "By NLTK".rjust(15), end="\n\n")

for tbs, tbn in zip(tokenizeSplit, tokenizeNLTK):
  print(tbs.ljust(15), "<----------->", tbn.rjust(15))

Tokenizing by two methods:
By Split        <----------->         By NLTK

The             <----------->             The
Telegraph       <----------->       Telegraph
Calcutta        <----------->        Calcutta
Business        <----------->        Business
RBI             <----------->             RBI
breather        <----------->        breather
for             <----------->             for
urban           <----------->           urban
coop            <----------->            coop
banks           <----------->           banks
OUR             <----------->             OUR
SPECIAL         <----------->         SPECIAL
CORRESPONDENT   <----------->   CORRESPONDENT
Mumbai          <----------->          Mumbai
Sept            <----------->            Sept
The             <----------->             The
Reserve         <----------->         Reserve
Bank            <----------->            Bank
has             <----------->             has
relaxed         <----------->         relaxed
loan  
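
To see the difference that appears when punctuation is kept, the following sketch tokenizes an uncleaned sentence in two ways. It deliberately uses a simple regular expression as a stand-in for a punctuation-aware tokenizer; this is an assumption for illustration and is not the algorithm behind `nltk.word_tokenize()`. The sample sentence is also made up.

```python
import re

# Made-up sentence that still contains punctuation and a number
raw = "RBI governor Y. V. Reddy met UCBs on September 2, where they sought relaxation."

# Whitespace-only splitting leaves punctuation glued to the words
by_split = raw.split()

# Stand-in tokenizer: runs of word characters, or single punctuation marks
by_regex = re.findall(r"\w+|[^\w\s]", raw)

print(by_split)
print(by_regex)
print(len(by_split), len(by_regex))
```

Because the two lists now differ in length, a side-by-side loop based on `zip()` would stop at the shorter list, so comparing the lengths first is a useful sanity check.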

## Converting text to lowercase and removing stopwords:
Here we convert all tokens to lowercase using the `lower()` method of the `str` class.<br>
We then remove stopwords by filtering against the `nltk` library's English *stopwords corpus*, obtained with `nltk.corpus.stopwords.words('english')`.

In [43]:
lowercaseTokens = [token.lower() for token in tokenizeNLTK]

# A set gives constant-time membership checks while filtering
stopwords = set(nltk.corpus.stopwords.words('english'))

filteredText = [w for w in lowercaseTokens if w not in stopwords]

print(lowercaseTokens)
print(filteredText)

['the', 'telegraph', 'calcutta', 'business', 'rbi', 'breather', 'for', 'urban', 'coop', 'banks', 'our', 'special', 'correspondent', 'mumbai', 'sept', 'the', 'reserve', 'bank', 'has', 'relaxed', 'loan', 'impairment', 'norms', 'for', 'urban', 'cooperative', 'banks', 'the', 'day', 'norm', 'relating', 'to', 'asset', 'classification', 'and', 'provisioning', 'will', 'now', 'be', 'applicable', 'to', 'gold', 'loans', 'and', 'small', 'loans', 'up', 'to', 'rs', 'lakh', 'only', 'from', 'the', 'financial', 'year', 'ending', 'march', 'till', 'then', 'these', 'loans', 'would', 'be', 'governed', 'by', 'the', 'day', 'impairment', 'norm', 'earlier', 'the', 'central', 'bank', 'had', 'advised', 'ucbs', 'to', 'apply', 'the', 'day', 'norm', 'from', 'the', 'financial', 'year', 'ending', 'march', 'the', 'decision', 'was', 'based', 'on', 'the', 'representations', 'by', 'ucbs', 'a', 'reserve', 'bank', 'release', 'said', 'here', 'today', 'rbi', 'governor', 'y', 'v', 'reddy', 'had', 'held', 'meeting', 'with', 'f
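
The filtering step can be sketched without the corpus download. The stopword set below is a tiny hand-picked subset chosen for this example (an assumption, not the NLTK list), and the tokens are taken from the lowercase output above.

```python
# Tokens copied from the lowercase output above
tokens = ['the', 'reserve', 'bank', 'has', 'relaxed', 'loan',
          'impairment', 'norms', 'for', 'urban', 'cooperative', 'banks']

# Tiny illustrative subset; the real NLTK English list is much longer
demo_stopwords = {'the', 'has', 'for'}

# Lowercasing first matters: 'The' would not match the lowercase entry 'the'
filtered = [t for t in tokens if t not in demo_stopwords]

print(filtered)
print('The' in demo_stopwords, 'The'.lower() in demo_stopwords)
```

This is why the lowercase conversion comes before the stopword filter: the corpus entries are lowercase, so capitalized tokens would otherwise slip through.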

## Performing Stemming using Porter Stemmer:
Here we stem the filtered word tokens.<br>
We use the Porter stemmer for this step.
<br>
<br>
The output shows the stem produced by the Porter stemmer for each word token. Stemming strips suffixes by rule, so it is not a reliable way to recover a word's root: many stems, such as `busi` for *business* and `reserv` for *reserve*, are not valid words.

In [44]:
# Creating a PorterStemmer object
ps = nltk.stem.PorterStemmer()
stemmedWords = []
print("Word Token".ljust(15), "---->", "Stem".rjust(15))
print("\n")
for token in filteredText:
  stem = ps.stem(token)
  print(token.ljust(15), "---->", stem.rjust(15))
  stemmedWords.append(stem)

Word Token      ---->            Stem


telegraph       ---->       telegraph
calcutta        ---->        calcutta
business        ---->            busi
rbi             ---->             rbi
breather        ---->        breather
urban           ---->           urban
coop            ---->            coop
banks           ---->            bank
special         ---->         special
correspondent   ---->      correspond
mumbai          ---->          mumbai
sept            ---->            sept
reserve         ---->          reserv
bank            ---->            bank
relaxed         ---->           relax
loan            ---->            loan
impairment      ---->          impair
norms           ---->            norm
urban           ---->           urban
cooperative     ---->          cooper
banks           ---->            bank
day             ---->             day
norm            ---->            norm
relating        ---->           relat
asset           ---->           asset
classifica
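
Even when a stem is not a valid word, stemming is still useful for retrieval because it maps related surface forms to the same term. The sketch below groups a few token and stem pairs copied from the output above; the grouping code itself is an illustration and not part of the assignment.

```python
from collections import defaultdict

# (token, stem) pairs copied from the Porter stemmer output above
observed = [
    ("banks", "bank"), ("bank", "bank"),
    ("norms", "norm"), ("norm", "norm"),
    ("business", "busi"), ("reserve", "reserv"), ("relating", "relat"),
]

# Collect every distinct surface form that shares a stem
groups = defaultdict(set)
for token, stem in observed:
    groups[stem].add(token)

# Stems that merge more than one surface form
conflated = {stem: sorted(forms) for stem, forms in groups.items() if len(forms) > 1}

print(conflated)
```

A query for `bank` would therefore also match documents containing `banks`, which is the practical benefit that offsets stems like `busi`.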

## Performing Lemmatization on the word tokens:
Here we use the `nltk` library's `WordNetLemmatizer` to reduce surface word tokens to their lemmas (dictionary base forms).<br>

`WordNetLemmatizer.lemmatize()` accepts a POS tag as its second argument, which lets it convert the word more accurately.<br>

To get the POS tag for each word token, we define a function that takes the word as an argument and returns the matching WordNet POS tag using `nltk.pos_tag()`.

In [45]:
# Creating WordNetLemmatizer object for lemmatization of words
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizedWords = []

# Defining a function to return an appropriate POS tag to pass
# as a parameter to lemmatizer.lemmatize()
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": nltk.corpus.wordnet.ADJ,
                "N": nltk.corpus.wordnet.NOUN,
                "V": nltk.corpus.wordnet.VERB,
                "R": nltk.corpus.wordnet.ADV}

    return tag_dict.get(tag, nltk.corpus.wordnet.NOUN)

print("Word Token".ljust(15), "---->", "Lemmatization".rjust(15))
print("\n")
for token in filteredText:
  lemmatized = lemmatizer.lemmatize(token, get_wordnet_pos(token))
  print(token.ljust(15), "---->", lemmatized.rjust(15))
  lemmatizedWords.append(lemmatized)

Word Token      ---->   Lemmatization


telegraph       ---->       telegraph
calcutta        ---->        calcutta
business        ---->        business
rbi             ---->             rbi
breather        ---->        breather
urban           ---->           urban
coop            ---->            coop
banks           ---->            bank
special         ---->         special
correspondent   ---->   correspondent
mumbai          ---->          mumbai
sept            ---->            sept
reserve         ---->         reserve
bank            ---->            bank
relaxed         ---->         relaxed
loan            ---->            loan
impairment      ---->      impairment
norms           ---->            norm
urban           ---->           urban
cooperative     ---->     cooperative
banks           ---->            bank
day             ---->             day
norm            ---->            norm
relating        ---->          relate
asset           ---->           asset
classifica
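
The core of `get_wordnet_pos()` is the mapping from the first letter of a Penn Treebank tag to a WordNet POS value. The sketch below reproduces only that mapping, without calling the tagger. The function name `penn_to_wordnet` and the sample tags are assumptions for this example; the single letters are the values of `nltk.corpus.wordnet.ADJ`, `NOUN`, `VERB`, and `ADV`.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"J": "a",   # adjective
               "N": "n",   # noun
               "V": "v",   # verb
               "R": "r"}   # adverb


def penn_to_wordnet(penn_tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter."""
    # Anything that is not an adjective, noun, verb, or adverb falls back to noun
    return WORDNET_POS.get(penn_tag[:1].upper(), "n")


for tag in ["NNS", "VBG", "JJ", "RB", "IN"]:
    print(tag.ljust(5), "---->", penn_to_wordnet(tag))
```

Note that the function in the cell above tags each word in isolation, so the tagger has no sentence context to work with; this may explain why a form such as `relaxed` is left unchanged in the output.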

## Showing differences between Stemming and Lemmatization on the text:
The output shows the result of **stemming** and **lemmatization** for each word token, side by side.<br>
For this assignment, **_lemmatization_** does better at reducing surface word tokens to their roots: it returns valid words such as *business* and *reserve*, where stemming returns `busi` and `reserv`.

In [46]:
print("Stemmed".ljust(18), "<---->", "Lemmatized".rjust(18))
print("\n")
for stemmed, lemmatized in zip(stemmedWords, lemmatizedWords):
  print(stemmed.ljust(18), "<---->", lemmatized.rjust(18))

Stemmed            <---->         Lemmatized


telegraph          <---->          telegraph
calcutta           <---->           calcutta
busi               <---->           business
rbi                <---->                rbi
breather           <---->           breather
urban              <---->              urban
coop               <---->               coop
bank               <---->               bank
special            <---->            special
correspond         <---->      correspondent
mumbai             <---->             mumbai
sept               <---->               sept
reserv             <---->            reserve
bank               <---->               bank
relax              <---->            relaxed
loan               <---->               loan
impair             <---->         impairment
norm               <---->               norm
urban              <---->              urban
cooper             <---->        cooperative
bank               <---->               bank
day     
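
The comparison can also be summarized by counting where the two methods disagree. The sketch below uses the complete rows visible in the output above; the counting code is an illustration and not part of the assignment.

```python
# (stemmed, lemmatized) pairs copied from the visible rows of the output above
pairs = [
    ("telegraph", "telegraph"), ("calcutta", "calcutta"), ("busi", "business"),
    ("rbi", "rbi"), ("breather", "breather"), ("urban", "urban"),
    ("coop", "coop"), ("bank", "bank"), ("special", "special"),
    ("correspond", "correspondent"), ("mumbai", "mumbai"), ("sept", "sept"),
    ("reserv", "reserve"), ("bank", "bank"), ("relax", "relaxed"),
    ("loan", "loan"), ("impair", "impairment"), ("norm", "norm"),
    ("urban", "urban"), ("cooper", "cooperative"), ("bank", "bank"),
]

# Rows where stemming and lemmatization produced different results
differing = [(s, l) for s, l in pairs if s != l]

print(len(differing), "of", len(pairs), "rows differ")
for stemmed, lemmatized in differing:
    print(stemmed.ljust(18), "<---->", lemmatized.rjust(18))
```

In every differing row of this sample the stem is a shortened prefix of the lemma, which matches the observation that the stemmer cuts suffixes while the lemmatizer returns dictionary forms.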