# FIT5196 Assessment 1

## Task 2: Text Pre-Processing
#### Student Name: Suyash Sathe
#### Student ID: 29279208

Date: 02/09/2018

Environment: Python 3.6.4 and Jupyter notebook
Libraries used: 
* itertools (for flattening the list of lists)
* nltk - natural language toolkit (tokenizer, stemmer, stopwords, collocations and probabilities)
* re (for regular expressions, included in Anaconda Python 3.6)


## 1. Introduction
This analysis extracts data from approximately 250 text documents that contain employee resumes. The input files are read from the "Resumes" folder according to the given list of file numbers.

Each file is read and stored as one element of a list called "data". The text is then tokenized and passed through the following pre-processing steps:

* Removal of digits and other non-alphabetic tokens
* Removal of tokens whose length < 3
* Stop-word removal
* Identification of bigrams
* Removal of tokens that occur in more than 98% of the files
* Removal of tokens that occur in fewer than 2% of the files
* Stemming

Text pre-processing produces a vocabulary of tokens and the associated sparse count vector for each resume. The pre-processing includes tokenisation, stemming and removal of stop-words. The initial tokenised vocabulary of the corpus was 14236 words, which was reduced to 3802 words after pre-processing.

## 2.  Import libraries 

In [1]:
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer
from nltk.corpus import stopwords
import nltk.data
import nltk
import re
from nltk.collocations import *
from itertools import chain
from nltk.stem import PorterStemmer

## 3.  Extract the data

Extract the data from the given list of files and store it in the list *"data"*.

**Note:**
1. The input files are located in the *"Resumes"* folder. This folder should be in the same folder as this *.ipynb* file.

In [2]:
data = []
files = [219, 723, 475, 334, 717,  27, 590, 405, 848, 611, 555, 344, 760, 185, 127, 813, 724, 560,
         233, 547, 427, 156, 529, 801, 213, 332, 538, 291, 120, 278,  97, 736, 704,  91, 432, 339,
         275, 174, 808, 238, 466, 623, 663, 285, 582, 863, 336, 554, 232, 431, 415, 686, 270, 710,
         226, 382, 660, 627, 448, 604, 331, 549, 155, 661,  63,  91, 592, 370, 563,   9, 169, 576,
         772, 399, 574, 485, 612, 655, 683, 471, 218, 266, 534, 312, 520, 203, 360, 568,  13, 501,
         432,   3, 243, 533,  55, 498, 443, 715, 187, 828, 453, 133, 164, 515, 441, 781,  35, 577,
         379, 616, 186, 567, 168, 371, 273, 236, 305, 132, 329,  77, 772, 126, 509, 773, 472, 811,
         679, 424, 701, 480, 658, 120, 682, 469, 495, 569,  82, 842, 425,  75, 523, 336, 410, 790,
         539, 411, 574, 695, 558, 614, 850, 860, 373, 716, 287, 122, 502,  30,  83, 259, 764, 429,
         643, 390, 489, 121, 653, 647, 591,  35, 765, 266, 447, 145, 832, 598, 230, 609, 774, 589,
         477,  94, 410 ,658, 738, 454, 274, 414, 383, 338, 813, 464, 133, 302, 709, 713, 698, 628,
         853, 599, 168, 145, 444, 740, 177, 856, 655, 193, 261, 513, 704, 302, 486,  92,  54,
         452, 693, 537, 472, 135, 713, 746, 333, 391, 394, 698, 765,   4, 141, 703, 692, 414, 858,
         769,  62, 593, 634, 762,   3, 176, 632,  35, 465, 455,  69, 607, 258, 521, 783, 1]

# Read all the resumes into data[]; duplicate file numbers are dropped first
files = list(set(files))
for x in files:
    # The context manager closes each file after it has been read
    with open("Resumes/resume_("+str(x)+").txt", "r", encoding="utf-8") as resume:
        data.append(resume.read())
print("OK")

OK


## 4.  Text Pre-processing

Each element of *data* is segmented into sentences. The sentences are then tokenized and stored in a separate list. The following operations are performed on the elements of this list:

### 4.a  Sentence segmentation, Normalization and Tokenization

Each file is segmented into sentences, and each sentence is tokenized. The first token of each sentence is normalized to lower case. The list *'all_tokens'* contains the tokens from all the files.

In [3]:
sentences_tokens = []

# Load the sentence detector and build the tokenizer once, outside the loops
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer = RegexpTokenizer(r"\w+(?:[-.]\w+)?")

for i in range(len(files)):
    ############Sentence segmentation############
    sentences = sent_detector.tokenize(data[i].strip())

    ############Normalization and Tokenization of sentences############
    for sent in sentences:
        sent_tokens = tokenizer.tokenize(sent)
        # Skip sentences without any token to avoid an IndexError below
        if not sent_tokens:
            continue
        sent_tokens[0] = sent_tokens[0].lower()
        sentences_tokens.append(sent_tokens)

# List of all tokens
all_tokens=list(chain.from_iterable(sentences_tokens))

print("OK")
print("Length of all_tokens: "+str(len(all_tokens)))

OK
Length of all_tokens: 148624
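
To make the effect of the tokenizer pattern easier to see, here is a minimal sketch that applies the same regular expression with the standard `re` module. It assumes that `re.findall` behaves like `RegexpTokenizer.tokenize` for this pattern; the helper name `tokenize_sentence` and the sample sentence are illustrative and are not part of the notebook.

```python
import re

# Same pattern as the RegexpTokenizer above: a word, optionally followed
# by ONE continuation joined with a hyphen or a period.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.]\w+)?")

def tokenize_sentence(sentence):
    """Tokenize one sentence and lower-case its first token, if there is one."""
    tokens = TOKEN_PATTERN.findall(sentence)
    if tokens:  # a sentence made only of punctuation yields no tokens
        tokens[0] = tokens[0].lower()
    return tokens

example = tokenize_sentence("Built a state-of-the-art ETL tool in Python 3.6.")
print(example)
# ['built', 'a', 'state-of', 'the-art', 'ETL', 'tool', 'in', 'Python', '3.6']
```

Because the optional group matches at most once, a longer hyphenated word such as "state-of-the-art" is split into two tokens. Only the first token of the sentence is lower-cased, so "ETL" and "Python" keep their capitals.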


### 4.b  Remove the non-alphabetical tokens

The list *'all_tokens_alpha'* contains only the purely alphabetic tokens. Tokens that contain digits, hyphens or periods are removed.

In [4]:
# Remove the non-alphabetical tokens
all_tokens_alpha = []
for each in all_tokens:
    if each.isalpha():
        all_tokens_alpha.append(each)
print("OK")
print("Length of all_tokens_alpha: "+str(len(all_tokens_alpha)))

OK
Length of all_tokens_alpha: 138588
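
The filter relies on `str.isalpha`, which rejects any token containing a non-letter character. The following small sketch, using made-up sample tokens, shows which kinds of tokens survive and which are dropped.

```python
# Illustrative tokens only; they are not taken from the resume corpus.
tokens = ["built", "state-of", "the-art", "ETL", "3.6", "Python", "2018", "B2B"]

# Keep tokens made up entirely of letters
alpha_only = [t for t in tokens if t.isalpha()]

# Everything else is dropped: digits, mixed tokens, hyphenated tokens
dropped = [t for t in tokens if not t.isalpha()]

print(alpha_only)  # ['built', 'ETL', 'Python']
print(dropped)     # ['state-of', 'the-art', '3.6', '2018', 'B2B']
```

Note that hyphenated tokens produced by the tokenizer are removed at this step as well, not only numeric ones.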


### 4.c  Removal of tokens whose length < 3

Remove the tokens shorter than three characters from the *all_tokens_alpha* list.

In [5]:
all_tokens_len = []
for w in all_tokens_alpha:
    if len(w)>2:
        all_tokens_len.append(w)
print("OK")
print("Length of all_tokens_len: "+str(len(all_tokens_len)))

OK
Length of all_tokens_len: 119370


### 4.d  Stop-words removal

Read the stop-words from the file *"stopwords_en.txt"* into a list. Use this list to remove the stop-words from the *all_tokens_len* list.

**Note:**
1. The file *"stopwords_en.txt"* should be in the same folder as this *.ipynb* file.

In [6]:
############Stopwords############
# Read one stop-word per line; the context manager closes the file
with open("stopwords_en.txt","r") as stopwords_file:
    # A set gives constant-time membership tests in the filter below
    stopwords_set = {line.strip() for line in stopwords_file}

all_tokens_stop = []
for w in all_tokens_len:
    if w not in stopwords_set:
        all_tokens_stop.append(w)

print("OK")
print("Length of all_tokens_stop: "+str(len(all_tokens_stop)))

OK
Length of all_tokens_stop: 99203
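
The stop-word step can be sketched without the input file as follows. The helper `load_stopwords` and the tiny stop-word list are assumptions for illustration; the real list comes from *"stopwords_en.txt"*.

```python
def load_stopwords(lines):
    """Build a set of stop-words from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}

# Stand-in for the lines of stopwords_en.txt (hypothetical content)
stop = load_stopwords(["the\n", "and\n", "with\n", "\n"])

tokens = ["worked", "with", "the", "Oracle", "team", "The"]
kept = [t for t in tokens if t not in stop]

print(kept)  # ['worked', 'Oracle', 'team', 'The']
```

The comparison is case-sensitive, so a capitalized "The" in the middle of a sentence is not removed if the stop-word list holds only lower-case entries.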


### 4.e  Bigrams

Collect the top 200 bigrams, ranked by pointwise mutual information (PMI), and store them in the list *bigrams*. Only bigrams that occur at least 14 times are considered.

In [7]:
############ Bigrams ############
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_tokens_stop)

# Set frequency = 14 to get the top 200 bigrams
bigram_finder.apply_freq_filter(14)

# Top-200 bigrams
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) 

bigrams = []
for each in top_200_bigrams:
    s = each[0]+" "+each[1]
    bigrams.append(s)

print("OK")
print("Length of bigrams: "+str(len(bigrams)))

OK
Length of bigrams: 200
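
For readers who want to see what the ranking does, here is a simplified, standard-library sketch of frequency filtering followed by PMI scoring. It is meant to approximate the idea behind `apply_freq_filter` and `nbest(bigram_measures.pmi, n)`, not to reproduce NLTK's implementation exactly. The function name `top_bigrams_by_pmi` and the sample tokens are illustrative.

```python
import math
from collections import Counter

def top_bigrams_by_pmi(tokens, min_freq, n):
    """Return the n adjacent word pairs with the highest PMI,
    ignoring pairs that occur fewer than min_freq times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (w1, w2), freq in bigram_counts.items():
        if freq < min_freq:
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from counts
        pmi = math.log2(freq * total / (unigram_counts[w1] * unigram_counts[w2]))
        scored.append(((w1, w2), pmi))

    # Highest PMI first; ties are broken alphabetically for reproducibility
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [bigram for bigram, _ in scored[:n]]

sample = ["machine", "learning", "project", "machine", "learning", "team",
          "project", "team", "lead", "machine", "learning"]
result = top_bigrams_by_pmi(sample, min_freq=2, n=5)
print(result)  # [('machine', 'learning')]
```

PMI favours pairs whose words rarely appear apart, which is why the frequency filter matters: without it, pairs seen only once would tend to dominate the ranking.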


### 4.f  Removal of context-dependent and rare tokens

Create a dictionary *context_dict* that maps each token to the number of files in which it occurs. Tokens that occur in more than 98% of the files (context-dependent) or in fewer than 2% of the files (rare) are removed from the vocabulary.

In [8]:
# Set of unique tokens
all_tokens_stop_set = list(set(all_tokens_stop))

# Dictionary to store the count of tokens
context_dict = dict()

# Initialize the count of each token to 0
for each in all_tokens_stop_set:
    context_dict[each] = 0

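# Count the number of files in which each token occurs.
# Note: "token in data[i]" is a substring test on the raw text, so a token
# is also counted when it appears inside a longer word.
for token in all_tokens_stop_set:
    for i in range(len(files)):
        if token in data[i]:
            context_dict[token]+=1

# Remove the rare and context-dependent tokens.
# The thresholds use the actual number of files instead of a hard-coded count.
n_files = len(files)
for each in context_dict.keys():
    if context_dict[each]>(0.98*n_files) or context_dict[each]<(0.02*n_files):
        all_tokens_stop_set.remove(each)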

print("OK")
print("Length of all_tokens_stop_set: "+str(len(all_tokens_stop_set)))

OK
Length of all_tokens_stop_set: 4535
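
The cell above counts a token whenever it appears as a substring of the raw text. An alternative, shown here only as a sketch, is to count document frequency over per-document token sets so that a short token such as "art" is not counted inside "smart". It assumes each document is available as its own token list, which the notebook does not build; the function name, thresholds and sample data are illustrative.

```python
from collections import Counter

def filter_by_document_frequency(doc_tokens, low=0.02, high=0.98):
    """Keep tokens whose document frequency lies within [low, high] of all documents."""
    n_docs = len(doc_tokens)
    doc_freq = Counter()
    for tokens in doc_tokens:
        # set() makes sure each document contributes at most 1 per token
        doc_freq.update(set(tokens))
    return sorted(t for t, c in doc_freq.items()
                  if low * n_docs <= c <= high * n_docs)

# Four toy documents; thresholds are widened so the effect is visible
docs = [["python", "sql"],
        ["python", "java", "sql"],
        ["python", "sql", "art"],
        ["python", "java"]]
kept_tokens = filter_by_document_frequency(docs, low=0.3, high=0.9)
print(kept_tokens)  # ['java', 'sql']
```

Here "python" is removed because it occurs in every document, and "art" is removed because it occurs in only one of four.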


### 4.g  Stemming

The Porter stemmer is used for stemming the tokens.

**Assumption:**<br>
Only lower-case tokens are sent to the stemmer. Tokens that are fully capitalized, or whose initial letter is capitalized, are treated as whole words (for example words that occur within sentences as names, or headings) and are kept unchanged.

The stemmer converts its input to lower case, so a stemmed capitalized token might no longer match the original token in the text. Such tokens are therefore not sent to the stemmer.

In [9]:
ps = PorterStemmer()

# Contains set of stemmed words
all_tokens_stem = set()

for word in all_tokens_stop_set:
    # Stem only lower case tokens
    if word.islower():
        x = ps.stem(word)
        all_tokens_stem.add(x)
    else:
        all_tokens_stem.add(word)
all_tokens_stem= list(all_tokens_stem)

# Add the bigrams to the list of stemmed words
for each in bigrams:
    all_tokens_stem.append(each)

# Sort the tokens
all_tokens_stem=sorted(all_tokens_stem)

print("OK")
print("Length of all_tokens_stem after adding bigrams: "+str(len(all_tokens_stem)))

OK
Length of all_tokens_stem after adding bigrams: 3802
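
The stemming policy described in the assumption can be sketched independently of NLTK by passing the stemmer in as a function. `toy_stem` below is a hypothetical stand-in that strips two suffixes; it is not the Porter algorithm, and in the notebook the argument would be `PorterStemmer().stem`.

```python
def stem_lowercase_only(tokens, stem):
    """Stem lower-case tokens, keep capitalized tokens as they are,
    and return the sorted set of results."""
    return sorted({stem(t) if t.islower() else t for t in tokens})

def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer().stem (NOT the Porter algorithm)
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

vocab_sample = stem_lowercase_only(["testing", "tests", "Testing", "SQL", "test"], toy_stem)
print(vocab_sample)  # ['SQL', 'Testing', 'test']
```

The lower-case variants collapse to one stem, while "Testing" and "SQL" are left untouched. Python's default sort places upper-case tokens before lower-case ones, which also applies to the sorted vocabulary in the notebook.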


## 5.  Vocabulary

The vocabulary contains the unigram and bigram tokens in the format token_string:integer_index. The tokens are sorted in alphabetical order and are stored in the *"29279208_vocab.txt"* file.

In [10]:
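# Map each token to a 1-based index (the tokens are already sorted)
word_id = {}
for index, each in enumerate(all_tokens_stem, start=1):
    word_id[each] = index

# Write one "token_string:integer_index" entry per line;
# the context manager closes the file
with open("29279208_vocab.txt","w+",encoding="utf-8") as vocab:
    for each in word_id:
        vocab.write(each+":"+str(word_id[each])+"\n")
print("OK")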

OK
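
As a compact illustration of the token_string:integer_index format, the sketch below builds the index with `enumerate` and formats the lines in memory instead of writing a file. The helper names and sample tokens are illustrative.

```python
def build_vocab_index(tokens):
    """Assign 1-based indices to the unique tokens in sorted order."""
    return {token: index
            for index, token in enumerate(sorted(set(tokens)), start=1)}

def format_vocab(word_to_id):
    """Render each entry as token_string:integer_index."""
    return [f"{token}:{index}" for token, index in word_to_id.items()]

sample_index = build_vocab_index(["sql", "machine learning", "python", "sql"])
vocab_lines = format_vocab(sample_index)
print(vocab_lines)  # ['machine learning:1', 'python:2', 'sql:3']
```

Bigrams are stored with a space between the two words, so they sort alongside the unigrams.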


## 6.  Count Vector

The output file covers all the “selected” resumes in the data-set. Each line contains the sparse representation of one resume in the following format: file_name, token_index:count, token_index:count,...

The output is stored in the file *"29279208_countVec.txt"*

**Note:**
1. This cell takes 2-3 minutes to run.

In [11]:
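# Write one sparse count vector per resume; the context manager closes the file
with open("29279208_countVec.txt","w+",encoding="utf-8") as countVec:
    for i in range(len(files)):
        entries = []
        for each in word_id.keys():
            # re.escape keeps the token literal. findall counts every occurrence
            # of the token in the raw text, including inside longer words.
            count_each = re.findall(re.escape(each),data[i])
            if len(count_each) >0:
                entries.append(str(word_id[each])+":"+str(len(count_each)))
        # Format: file_name, token_index:count, token_index:count,...
        countVec.write(", ".join(["resume_("+str(files[i])+")"]+entries)+"\n")
print("OK")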

OK
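
A sparse count vector can also be built from a document's token list with `collections.Counter`, which avoids scanning the raw text once per vocabulary entry. This is a sketch under the assumption that the tokens have already gone through the same pre-processing as the vocabulary; the function name and sample data are illustrative.

```python
from collections import Counter

def sparse_count_line(file_name, tokens, word_to_id):
    """Return 'file_name, index:count, ...' for tokens present in the vocabulary."""
    counts = Counter(t for t in tokens if t in word_to_id)
    # Sort by vocabulary index so the output order is stable
    entries = sorted((word_to_id[t], c) for t, c in counts.items())
    return ", ".join([file_name] + [f"{i}:{c}" for i, c in entries])

sample_vocab = {"python": 1, "sql": 2, "team": 3}
line = sparse_count_line("resume_(1)", ["python", "sql", "python", "java"], sample_vocab)
print(line)  # resume_(1), 1:2, 2:1
```

Tokens outside the vocabulary ("java") and vocabulary entries with a zero count ("team") are both omitted, which is what makes the representation sparse.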


## 7.  References

1. Tutorial materials from Week 4 to Week 5.
2. https://stackoverflow.com/
3. https://youtube.com/
4. https://www.tutorialspoint.com/