Assignment 1 submission 

In [21]:
#imports
import pandas as pd
import numpy as np
import os
from bs4 import BeautifulSoup
import nltk
from datascience import *
import warnings
warnings.filterwarnings('ignore')


In [22]:
#mounting the drive to access files
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [23]:

def import_files(path, mode='r'):
  """
  Returns a list of open file objects (io.TextIOWrapper in text mode), one for each file in the directory passed as path
  """
  files = []
  for file_name in os.listdir(path):
    # os.path.join works whether or not path ends with a separator
    file = open(os.path.join(path, file_name), mode)
    files.append(file)
  return files

First, I will import all the files as `io.TextIOWrapper` objects using the function **import_files**. The function takes a directory path and returns a list of open file objects, one for each file in that directory.
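
One thing to keep in mind is that import_files leaves every file open. As an alternative, here is a minimal sketch that reads the contents directly and closes each file right away. The name `read_texts` and the temporary demo directory are my own assumptions for illustration and are not part of the assignment code.

```python
import tempfile
from pathlib import Path


def read_texts(directory, encoding="utf-8"):
    """Return the contents of every regular file in `directory`, in name order."""
    texts = []
    for file_path in sorted(Path(directory).iterdir()):
        if file_path.is_file():
            # read_text opens, reads, and closes the file in one call
            texts.append(file_path.read_text(encoding=encoding))
    return texts


# Small demonstration on a throwaway directory (not the assignment data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    demo_texts = read_texts(tmp)

print(demo_texts)
```

With this approach the result is already a list of strings, so the later joining step would not need to call read() on each object.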

In [24]:

path = '/gdrive/MyDrive/Datasets/mini_dataset/mini_dataset/'
text_files = import_files(path)

Now that we have a list of file objects, we want to extract the text from each one, join the texts together, and remove the tags. The **join_texts** function does exactly that.

In [25]:
def join_texts(TextObjectList):
  """
  The function returns the joined text of all text objects with the tags removed
  """
  textList = [file.read() for file in TextObjectList]
  text = ' '.join(textList)
  soup = BeautifulSoup(text)
  justtext = soup.get_text()
  return justtext
  


Here we get the concatenated text using the function join_texts. 

In [26]:
Alltext = join_texts(text_files)
Alltext

'\n1040904_business_story_3714615.utf8\n\n\n\nThe Telegraph - Calcutta : Business\n\n New LIC pension plans \n\n A STAFF REPORTER\n\n Calcutta, Sept. 3: Life Insurance Corporation of India (LIC) plans to unveil three new products, including a unit-linked pension plan.\n\n We have sought approval from the Insurance Regulatory and Development Authority of India (IRDA) for the approval of these three policies and will launch them soon after we receive it, said LIC zonal manager D. K. Mehrotra.\n\n Apart from the unit-linked pension plan, the schemes would include a childrens plan and another similar to Jeevan Shree for high net-worth individuals.\n\n Mehrotra acknowledged that the response to LICs existing pension schemes was not very encouraging. This is because of lack of awareness among employees in rural and the unorganised sector, said Mehrotra.\n\n However, the company is planning to increase awareness among these sections of people through a number of programmes to be addressed by 

Now that we have the text as a single string, we can remove punctuation, split the text into tokens, and convert all tokens to lowercase. We could also have used the split method, but it only splits on whitespace and leaves punctuation and numbers in place. By using nltk's regexp_tokenize with an appropriate regular expression, I was able to keep only tokens made up entirely of letters. All punctuation, special characters, and numbers are removed in a single step.

In [27]:
def tokenize_words(text):
  """
  The function tokenizes the text, dropping special characters and numbers, and returns a list of lowercase tokens
  """

  tokens = nltk.regexp_tokenize(text, '[a-zA-Z]+', gaps=False)
  return [word.lower() for word in tokens]
  

Finally, we use the function tokenize_words to get the list of tokens, stored in the list words.

In [28]:
words = tokenize_words(Alltext)


In [29]:
#let's take a look at some of the tokens
words[10:20]

['plans',
 'a',
 'staff',
 'reporter',
 'calcutta',
 'sept',
 'life',
 'insurance',
 'corporation',
 'of']
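
To make the tokenizing step concrete, here is a small sketch using only Python's built-in `re` module. As far as I understand, with gaps=False the pattern describes the tokens themselves, so `re.findall` with the same pattern should give the same tokens. The sample sentence is taken from the text above, and the variable names are my own.

```python
import re

sample = ("Calcutta, Sept. 3: Life Insurance Corporation of India (LIC) "
          "plans to unveil three new products.")

# Keep only runs of ASCII letters, then lowercase each one.
tokens = [match.lower() for match in re.findall(r"[a-zA-Z]+", sample)]
print(tokens)
```

The comma, the colon, the parentheses, and the number 3 never match the pattern, so they drop out without a separate cleaning step.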

We could also have used Python's *split()* method to tokenize the text. We would then have to remove punctuation and numbers afterwards. Let's try that.

In [30]:
Alltext.split()[:20]

['1040904_business_story_3714615.utf8',
 'The',
 'Telegraph',
 '-',
 'Calcutta',
 ':',
 'Business',
 'New',
 'LIC',
 'pension',
 'plans',
 'A',
 'STAFF',
 'REPORTER',
 'Calcutta,',
 'Sept.',
 '3:',
 'Life',
 'Insurance',
 'Corporation']

As you can see, there are still special characters like '-' and ':' as well as numbers in the result. Using nltk's *regexp_tokenize*, we can remove them in a single step.
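
For comparison, this is roughly what the extra cleaning after *split()* could look like. It is only a sketch under my own assumptions (strip punctuation from both ends of each piece, then keep pieces made of letters only), and the sample string is built from the output above.

```python
import string

sample = ("The Telegraph - Calcutta : Business New LIC pension plans "
          "Calcutta, Sept. 3: Life")

cleaned = []
for raw in sample.split():
    # Remove punctuation from both ends, e.g. 'Calcutta,' -> 'calcutta'
    token = raw.strip(string.punctuation).lower()
    # Drop empty strings (from '-' and ':') and anything with digits
    if token.isalpha():
        cleaned.append(token)

print(cleaned)
```

This version is not equivalent to the regex approach: a hyphenated word such as unit-linked is dropped entirely here because it is not purely alphabetic, while the regex splits it into unit and linked.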

We now have a list of tokens. The next step is to remove stop words from the list. For that I created the function remove_stopwords, which takes a list of tokens and returns a list of tokens with no stop words in it. Here, I use nltk's English stop-word list and filter out every token that appears in it, along with single-letter tokens.

In [31]:
def remove_stopwords(words):
  """
  Removes the stop words and single-letter tokens from a list of words
  """
  nltk.download('stopwords')
  from nltk.corpus import stopwords
  # a set makes each membership test fast instead of scanning a list
  engstopwords = set(stopwords.words('english'))
  nostopwords = [word for word in words if word not in engstopwords and len(word) != 1]
  return nostopwords

In [32]:
nostopwords = remove_stopwords(words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
nostopwords[:10]

['business',
 'story',
 'utf',
 'telegraph',
 'calcutta',
 'business',
 'new',
 'lic',
 'pension',
 'plans']
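
The filtering logic itself can be shown without downloading anything. In the sketch below, `demo_stopwords` is a tiny hypothetical stop-word set that I made up for illustration. It is not nltk's list, which is much longer.

```python
# Hypothetical, tiny stop-word set for illustration only; the notebook
# itself uses nltk's full English list.
demo_stopwords = {"a", "of", "the", "to", "and"}

demo_tokens = ["plans", "a", "staff", "reporter", "calcutta", "sept",
               "life", "insurance", "corporation", "of", "s"]

# Keep a token only if it is not a stop word and is longer than one letter.
kept = [token for token in demo_tokens
        if token not in demo_stopwords and len(token) > 1]
print(kept)
```

The single letter s (which a possessive like LIC's would leave behind after tokenizing) is removed by the length check even though it is not in the demo stop-word set.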

The displaywords function takes a function that is used to either lemmatize or stem words. It displays the words and their corresponding transformed words side by side in a table, and it also returns the list of transformed words.

In [34]:
def displaywords(words, func, limit, label='New'):
  """
  Displays each word next to its transformed word in a table and returns the list of words after applying the function
  """
  newwords = [func(word) for word in words]
  
  
  tbl = Table()
  tbl = tbl.with_columns("Word", words, label + ' word', newwords)
  
  tbl.show(limit)
  return newwords
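
The same idea can be sketched without the datascience Table, using plain printing. The name `show_pairs` is my own, and `str.upper` below is only a stand-in for a real stemmer or lemmatizer function.

```python
def show_pairs(words, func, limit, label="New"):
    """Print up to `limit` (word, transformed word) rows and return all transformed words."""
    new_words = [func(word) for word in words]
    print(f"{'Word':<15}{label + ' word'}")
    for word, new_word in list(zip(words, new_words))[:limit]:
        print(f"{word:<15}{new_word}")
    return new_words


# str.upper is only a stand-in for a stemming or lemmatizing function.
result = show_pairs(["business", "story", "plans"], str.upper, limit=2, label="Upper")
```

Only the first `limit` rows are printed, but the returned list always covers every input word, which matches how displaywords behaves.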


Now we want a function that returns stemmed words. The function stemming uses displaywords to show the results and returns the unique stemmed words. It uses PorterStemmer() to stem the words.

In [35]:
def stemming(words, stemsToShow=20):
  """
  Returns the unique stemmed words from a list of words. Also prints the words and corresponding stemmed words
  """
  from nltk.stem import PorterStemmer
  ps = PorterStemmer()
  wordsstemmed = displaywords(words, ps.stem, stemsToShow, "Stemmed")
  
  return list(set(wordsstemmed))



In [36]:
stemmedwords = stemming(nostopwords, 200)
stemmedwords[:10]


Word,Stemmed word
business,busi
story,stori
utf,utf
telegraph,telegraph
calcutta,calcutta
business,busi
new,new
lic,lic
pension,pension
plans,plan


['dtl',
 'awar',
 'west',
 'secur',
 'npa',
 'pick',
 'absenc',
 'number',
 'provis',
 'rate']
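
The first ten stems above do not follow the order of the table because `list(set(...))` does not keep the original order of the words. If the order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each item. This is only a sketch of that alternative, using the first few stems from the table.

```python
stems = ["busi", "stori", "utf", "telegraph", "calcutta", "busi", "new"]

# dict keys are unique and remember insertion order, so duplicates
# are dropped while the first-seen order is kept.
unique_in_order = list(dict.fromkeys(stems))
print(unique_in_order)
```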

The lemmatize function displays each word next to its lemmatized form and returns the unique lemmatized words.

In [37]:
def lemmatize(words, limit=20):
  """
  Prints each word next to its lemmatized form and returns the list of unique lemmatized words
  """
  from nltk.stem import WordNetLemmatizer
  wnl = WordNetLemmatizer()
  nltk.download('wordnet')
  lemmatizedwords = displaywords(words, wnl.lemmatize, limit, "Lemmatized")
  return list(set(lemmatizedwords))

In [38]:
lemmatizedwords = lemmatize(nostopwords,200)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Word,Lemmatized word
business,business
story,story
utf,utf
telegraph,telegraph
calcutta,calcutta
business,business
new,new
lic,lic
pension,pension
plans,plan


In [39]:

lemmatizedwords[:10]

['dtl',
 'west',
 'trading',
 'equity',
 'nestled',
 'npa',
 'applied',
 'pick',
 'company',
 'number']

As you can see, stemming can produce strings that are not real words: 'business' was stemmed to 'busi', which is not a word. It should have been 'busy' instead. Lemmatization, on the other hand, did not recover the root for many words, such as namely, reporter, and linked. One likely reason is that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so a verb form like linked is left as it is. Stemming extracted the root correctly for several words that lemmatization left unchanged, for example namely -> name, reporter -> report, and linked -> link, but it got others wrong, such as business -> busi and insurance -> insur. Lemmatization did nothing for business and insurance, whose root words are busy and insure. So for these two words, neither method extracted the correct root.