Date: 02/09/2018

Version: 1.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:<br/>

* nltk.data (for sentence detector)
* re (for regular expressions)
* from nltk.tokenize ( RegexpTokenizer for tokenization )
* from nltk.stem  (PorterStemmer for Stemming)
* from nltk.util  (ngrams for bigrams)
* from nltk.probability (for frequency Distribution) 
* from itertools chain (to join multiple dictionary values into one) 
* from nltk.tokenize (MWETokenizer to join bigrams and retokenize)
* from sklearn.feature_extraction.text CountVectorizer (Vectorizing in the format)



## 1. Introduction
This assignment comprises the execution of different text processing and analysis tasks applied to Resumes . There are a total of 250 resumes and files are named `resume_(n).txt1`(where n is an integer). The required tasks are the following:

1. Tokenize each file.
2. Normalize the tokens.
3. List out top 200 bigrams.
4. Remove all the stop words and rare words with the threshold mentioned to 98% and 2%.
5. Stemming each token.
6. Vectorizing and writing to a file.

More details for each task will be given in the following sections.

## 2.  Import libraries 

In [22]:
import nltk.data
import re 
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from nltk.probability import *
from itertools import chain
from nltk.tokenize import MWETokenizer
from sklearn.feature_extraction.text import CountVectorizer

## 3. Loading data

As a first step, all resume names are loaded into list.

In [23]:
# Fetching individual Resumes
resume='100 171 244 716 496 336 293 326 200 510 41 616 478 473 859 159 586 293 275 424 3 611 68 577 757 30 594 9 783 822 135 820 272 505 487 24 313 188 139 220 137 74 389 710 15 704 726 703 193 499 317 638 95 464 425 428 701 657 724 786 681 344 189 716 298 25 509 39 26 269 223 738 593 714 449 242 81 687 155 487 661 464 757 430 185 781 421 796 618 842 649 254 568 562 492 99 782 705 105 603 692 272 861 691 287 809 114 254 301 783 418 263 107 341 833 804 361 403 114 685 313 256 236 543 446 12 798 505 28 640 536 169 255 576 257 769 499 443 596 734 445 440 689 580 299 412 270 124 555 360 396 361 566 555 409 686 162 829 228 176 310 617 59 55 203 238 52 563 289 345 70 215 576 388 198 32 295 341 318 534 8 622 200 793 132 827 704 431 280 596 107 551 32 164 771 479 166 542 451 399 185 382 515 347 85 218 210 658 449 459 299 258 132 557 764 280 489 46 388 24 655 237 849 535 435 650 48 798 544 743 483 614 493 569 727 454 10 34 100 782 798 143 778 240 194 237 224 11 538 540'
resume=resume.split(' ') # Splitting by space
resumeset=set(resume)
resume_names=[]
for each in resumeset:
    resume_names.append('resume_('+each+').txt')


In [24]:
print('The total Number of resumes used for analysis is', len(resume_names), 'and list provided is',len(resume))

The total Number of resumes used for analysis is 217 and list provided is 250


The Resume list provided has duplicate files in it and hence considered unique resumes for the analysis.

In [25]:
#Loading StopWords
stopwords_file=open('stopwords_en.txt',"r",encoding="utf8")
stopwords=stopwords_file.read().split('\n')
print(len(stopwords))

571


Loading stop words from the stopword list provided. It contains 571 lowercase stopwords.

## 4. Pre-processing and  Tokenization

#### Pre-processing

As part of pre-processing of the resumes three tasks are carried out. They are
1. Removing all the bullet points and junk characters using regex  <font size=3 color="green"> r'[^!"' +"#$%&'"+'()*+,./:;<=>?@[\]^_`{|}~\sA-Za-z0-9]'</font> is used.<br/>This regex pattern matches bullet points(inclusive of '-') and other junk characters and replaces with nulls. <br/>
2. Sentence Segmentation (Sentence Boundary Detection) <br/>It is the method of text processing where sentences are converted into tokens. The Assignment uses `nltk.data.load('tokenizers/punkt/english.pickle')` tokenizer to detect sentences ending with full stop or Question mark or Exclamation mark.<br/>

3. Case Normalization
It is the process of converting the uppercase character in to lowercase charaters. As given in the assignment specifications
the first word of all the sentences are normalized to lowercase charaters. <br/> 

<font size=3 color="blue"> why do we do case normalization?</font><br/>

Case Normalization is done to reduce the number of words in the final vocabulary list which would be helpfull at the later stages for analysis. For instance, ‘qualifications’ and ‘Qualifications’ mean the same when context is Resume. Hence Normalizing these kind of words will not result in loss of information.

#### Tokenization

Tokenization is breaking up of sentence or paragraph into meaningful pieces such as words, keywords or phrases etc. As per assignment we have been provided a regular expression to split the text into tokens.
<font size=3 color="blue"> Why Tokenization?</font><br/>
Tokenization is fundamental in all text processing algorithms. It is comparatively easy to do analysis when we split large chunks of data into smaller units.For instance analysis of frequencies of each word , sentiment analysis stylometric analysis etc would be easier if there are tokenised.


The above mentioned preprocessing and tokenization are defined in one function called `tokenizeRawData` in the program. It takes each resume replaces junk values with nulls , segmentises the sentences , coverts the  first word of every sentence into lower case and at the end returns tokens and resume_name where tokenization is based on the given regular expression `r"\w+(?:[-']\w+)?"`

In [26]:
def tokenizeRawData(resume_id):  
    fopen=open(resume_id,"r",encoding="utf8")
    text=fopen.read() 
    text_without_junk=''
    reg=('[\n][^!"' +"#$%&'"+'()*+,-./:;<=>?@[\]^_`{|}~\sA-Za-z0-9]')# regex to remove bullet points and junk characters
    text_without_junk+=re.sub(reg,'',text)
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') # detecting sentences 
    sentences = sent_detector.tokenize(text_without_junk.strip())
    lower_text='' # asssign sentence back to string after doing lowercase 
    for sent in sentences: 
        lower_funct = lambda word: word.group(1).lower() # lamda function to lower first word in a sentence.
        lower_text+=re.sub(r'(^\w+)', lower_funct, sent)
    tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?") # tokenizing 
    tokenised_file=tokenizer.tokenize(lower_text) 
    return (resume_id, tokenised_file) 


In [29]:
# Creating Dictionary of which includes resume name and tokens of each resume 
tokenized_resumes =  dict(tokenizeRawData(each_file) for each_file in resume_names)
tokenized_resumes['resume_(100).txt']

['Name',
 'CHIN',
 'Kwok',
 'Ho',
 'Mobile',
 '852-5347',
 '8575',
 'E-mail',
 'chinkhthomas',
 'gmail',
 'com',
 'Education',
 'The',
 'Hong',
 'Kong',
 'University',
 'of',
 'Science',
 'and',
 'Technology',
 'Sep',
 '2015',
 'BBA',
 'in',
 'Finance',
 'and',
 'Professional',
 'Accounting',
 'Second',
 'Class',
 'Honors',
 'Division',
 'I',
 'Work',
 'Experience',
 'BOCI-Prudential',
 'Trustee',
 'Limited',
 'Finance',
 'Department',
 'Senior',
 'Fund',
 'Accountant',
 'Assistant',
 'Sep',
 '2015',
 'Valuated',
 'monthly',
 'Cayman',
 'fund',
 'SFC',
 'funds',
 'RQFII',
 'and',
 'QDII',
 'funds',
 'holding',
 'different',
 'types',
 'of',
 'financial',
 'Sep',
 '2017',
 'instruments',
 'including',
 'but',
 'not',
 'limited',
 'to',
 'stocks',
 'options',
 'futures',
 'warrants',
 'Cooperated',
 'with',
 'other',
 'teammates',
 'involving',
 'trade',
 'settlements',
 'corporate',
 'actions',
 'and',
 'price',
 'movements',
 'by',
 'reconciling',
 'with',
 'Bloomberg',
 'and',
 'other

In [30]:
words = list(chain.from_iterable(tokenized_resumes.values())) # fetching list of all tokens from all resumes
vocab = set(words) #Vocabulary 
lexical_diversity = len(words)/len(vocab) # Calculating Lexical Diversity which means on an Average how many times a word is repeated.
print('After pre-processing and tokenization')
print('Length of word list: ',len(words))
print('Length of vocabulary list: ',len(vocab))
print('lexical diversity: ',lexical_diversity)

After pre-processing and tokenization
Length of word list:  142975
Length of vocabulary list:  16594
lexical diversity:  8.616066047969145


## 5.StopWords Removal 

Stopwords in English are more frequently used words. For instance, articles, prepositions etc;<br/>

<font size=3 color="blue">Why are we removing these words? and why at this stage?</font>

As definition says they are frequent words which implies they will be present in almost all the resumes and hence keeping these words will not help in analysis.<br/>

Stopwords removal has to be done in the earlier stages. It is best to do before stemming because some stopwords can be stemmed by stemmer and cannot to filtered later.This may end up in false analysis. For instance "was" stems into "wa", "alone" becomes "alon" by porter stemmer.

On the other hand we can stem stopwards too and filer it later. By this approach we land up in stemmed bigrams which will not be meaningfull. 



In [31]:
for k,v in tokenized_resumes.items(): # removing stopwords
    tokenized_resumes[k] = [word for word in v if word not in stopwords]
tokenized_resumes['resume_(100).txt']

['Name',
 'CHIN',
 'Kwok',
 'Ho',
 'Mobile',
 '852-5347',
 '8575',
 'E-mail',
 'chinkhthomas',
 'gmail',
 'Education',
 'The',
 'Hong',
 'Kong',
 'University',
 'Science',
 'Technology',
 'Sep',
 '2015',
 'BBA',
 'Finance',
 'Professional',
 'Accounting',
 'Second',
 'Class',
 'Honors',
 'Division',
 'I',
 'Work',
 'Experience',
 'BOCI-Prudential',
 'Trustee',
 'Limited',
 'Finance',
 'Department',
 'Senior',
 'Fund',
 'Accountant',
 'Assistant',
 'Sep',
 '2015',
 'Valuated',
 'monthly',
 'Cayman',
 'fund',
 'SFC',
 'funds',
 'RQFII',
 'QDII',
 'funds',
 'holding',
 'types',
 'financial',
 'Sep',
 '2017',
 'instruments',
 'including',
 'limited',
 'stocks',
 'options',
 'futures',
 'warrants',
 'Cooperated',
 'teammates',
 'involving',
 'trade',
 'settlements',
 'corporate',
 'actions',
 'price',
 'movements',
 'reconciling',
 'Bloomberg',
 'credible',
 'sources',
 'Coordinated',
 'fund',
 'managers',
 'custodians',
 'bankers',
 'resolve',
 'valuation',
 'fund',
 'setup',
 'issues',
 '

In [32]:
#Printing statistics  
words = list(chain.from_iterable(tokenized_resumes.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print('After Removal of Stopwords \n')
print('Length of word list: ',len(words))
print('Length of vocabulary list: ',len(vocab))
print('Lexical diversity: ',lexical_diversity)

After Removal of Stopwords 

Length of word list:  106595
Length of vocabulary list:  16245
Lexical diversity:  6.561711295783318


## 5. Fetching top 200 Bigrams

Bigrams are pair of consecutive words. We are here to find the top 200 most frequent bigrams. 

The first step is to concatenate all the tokenized words using the chain.frome_iterable function. The returned list 
by the function contains a list of all the words seprated by while space.

<font size=3 color="blue">Why am I finding bigrams after removal of stopwords and  not after stemming?</font><br/>

The reason for doing this is if bigrams were found before stopwords removal, “of-the”,"along-with" etc; would appear as most frequent bigram. Even though its meaning full word it doesn’t convey anything for the analysis as it would be present in most of the text.<br/>

If we do it after stemming some of the words may get shortened and may become meaningless.



In [33]:
bigrams = ngrams(list(chain.from_iterable(tokenized_resumes.values())),n = 2) #Where n=2 represents number of words which occur together
freq_dist_bg = FreqDist(bigrams) #FreqDist gives count of each word
bg=freq_dist_bg.most_common(250)

bi_list=[]
for each in bg:
    bi_list.append(each[0]) # picking only words from freq_dist dataframe
    
bigrams_list=[] # Considering only alphabetical bigram list
for each in bi_list:
    if each[0].isalpha()==True and each[1].isalpha()==True:
        bigrams_list.append(each)
bigrams_list        


[('Hong', 'Kong'),
 ('financial', 'statements'),
 ('due', 'diligence'),
 ('real', 'estate'),
 ('Pte', 'Ltd'),
 ('private', 'equity'),
 ('Asset', 'Management'),
 ('Asia', 'Pacific'),
 ('Financial', 'Services'),
 ('Microsoft', 'Office'),
 ('English', 'Mandarin'),
 ('asset', 'management'),
 ('Business', 'Administration'),
 ('cash', 'flow'),
 ('Fluent', 'English'),
 ('Real', 'Estate'),
 ('Private', 'Equity'),
 ('hedge', 'funds'),
 ('Bachelor', 'Business'),
 ('Business', 'School'),
 ('WORK', 'EXPERIENCE'),
 ('University', 'Hong'),
 ('Financial', 'Reporting'),
 ('internal', 'control'),
 ('team', 'members'),
 ('M', 'A'),
 ('Fund', 'Accountant'),
 ('Senior', 'Associate'),
 ('fund', 'managers'),
 ('Certified', 'Public'),
 ('Excel', 'PowerPoint'),
 ('financial', 'reporting'),
 ('University', 'Singapore'),
 ('listed', 'companies'),
 ('New', 'York'),
 ('Nanyang', 'Technological'),
 ('Fund', 'Services'),
 ('Assistant', 'Manager'),
 ('Vice', 'President'),
 ('English', 'Chinese'),
 ('Technological', 

## 6. Re-tokenize the words 

In previous Task we calculated Bigrams. Here we need to add those bigrams into our tokenize list and retokenize the whole list. To ensure the bigrams wont split we use MWE tokenizer to add '_' between them and we retokenize them.

In [34]:
#passing it to MWE Tokenizer to re-tokenize and add '_' between words
mwetokenizer = MWETokenizer(bigrams_list)
tokenized_resumes =  dict((k, mwetokenizer.tokenize(v)) for k,v in tokenized_resumes.items())
tokenized_resumes['resume_(100).txt']


['Name',
 'CHIN',
 'Kwok',
 'Ho',
 'Mobile',
 '852-5347',
 '8575',
 'E-mail',
 'chinkhthomas',
 'gmail',
 'Education',
 'The',
 'Hong_Kong',
 'University',
 'Science',
 'Technology',
 'Sep',
 '2015',
 'BBA',
 'Finance',
 'Professional',
 'Accounting',
 'Second_Class',
 'Honors',
 'Division',
 'I',
 'Work_Experience',
 'BOCI-Prudential',
 'Trustee',
 'Limited',
 'Finance',
 'Department',
 'Senior',
 'Fund_Accountant',
 'Assistant',
 'Sep',
 '2015',
 'Valuated',
 'monthly',
 'Cayman',
 'fund',
 'SFC',
 'funds',
 'RQFII',
 'QDII',
 'funds',
 'holding',
 'types',
 'financial',
 'Sep',
 '2017',
 'instruments',
 'including',
 'limited',
 'stocks',
 'options',
 'futures',
 'warrants',
 'Cooperated',
 'teammates',
 'involving',
 'trade',
 'settlements',
 'corporate_actions',
 'price',
 'movements',
 'reconciling',
 'Bloomberg',
 'credible',
 'sources',
 'Coordinated',
 'fund_managers',
 'custodians',
 'bankers',
 'resolve',
 'valuation',
 'fund',
 'setup',
 'issues',
 'Monitored',
 'investment

In [35]:
# Getting Statistics of the words
words = list(chain.from_iterable(tokenized_resumes.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print('After Adding Bi-Grams back to tokenlist \n')
print('Size of word list: ',len(words))
print('Size of vocabulary list: ',len(vocab))
print('lexical diversity: ',lexical_diversity)

After Adding Bi-Grams back to tokenlist 

Size of word list:  101727
Size of vocabulary list:  16456
lexical diversity:  6.181757413709286


## 7. Stemming

The idea of stemming is a kind of normalization . Words have different variations but carry the same meaning. 

<font size=3 color="blue">Why are we doing stemming here ? Why only on lowercase characters for stemming?</font>

The reason why we stem is to shorten the lookup, and normalize sentences. When all he words are normalised it is easier to analyse and conclude.

When you pass tokens to Porter stemmer, it not only results in stemmed words but also converts everything into lower case. This something which is not expected. We may loose information. 	Hence, we passing only lower tokens would give us the necessary results.

In [36]:
# using Porter Stemmer for stemming
stemmer = PorterStemmer()
for k,v in tokenized_resumes.items():
    tokenized_resumes[k] = [stemmer.stem(token) if re.search('([a-z]+[_][a-z]+)',token)==False & token.islower() else token for token in v]
tokenized_resumes['resume_(100).txt']


['Name',
 'CHIN',
 'Kwok',
 'Ho',
 'Mobile',
 '852-5347',
 '8575',
 'E-mail',
 'chinkhthomas',
 'gmail',
 'Education',
 'The',
 'Hong_Kong',
 'University',
 'Science',
 'Technology',
 'Sep',
 '2015',
 'BBA',
 'Finance',
 'Professional',
 'Accounting',
 'Second_Class',
 'Honors',
 'Division',
 'I',
 'Work_Experience',
 'BOCI-Prudential',
 'Trustee',
 'Limited',
 'Finance',
 'Department',
 'Senior',
 'Fund_Accountant',
 'Assistant',
 'Sep',
 '2015',
 'Valuated',
 'monthly',
 'Cayman',
 'fund',
 'SFC',
 'funds',
 'RQFII',
 'QDII',
 'funds',
 'holding',
 'types',
 'financial',
 'Sep',
 '2017',
 'instruments',
 'including',
 'limited',
 'stocks',
 'options',
 'futures',
 'warrants',
 'Cooperated',
 'teammates',
 'involving',
 'trade',
 'settlements',
 'corporate_actions',
 'price',
 'movements',
 'reconciling',
 'Bloomberg',
 'credible',
 'sources',
 'Coordinated',
 'fund_managers',
 'custodians',
 'bankers',
 'resolve',
 'valuation',
 'fund',
 'setup',
 'issues',
 'Monitored',
 'investment

In [37]:
#Displaying Statistics
words = list(chain.from_iterable(tokenized_resumes.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print('After Stemming \n')
print('Size of word list: ',len(words))
print('Size of vocabulary list: ',len(vocab))
print('lexical diversity: ',lexical_diversity)

After Stemming 

Size of word list:  101727
Size of vocabulary list:  16456
lexical diversity:  6.181757413709286


## 8.Removal of less and more context dependant words 

This part invloves removal of words which are present in more than 98% of the document and less than 2% of the documents. 

<font size=3 color="blue">Why are we doing this? and why at this stage</font>

The words which appear in more than 98% documents and which are appearing in less than 2% are of no use because the probablity of determining the match is high and other case its low. So for the analysis it wont be helpfull.

If the words were removed earlier we would have lost bigrams. 


Considering resumes as context, removing less frequent tokens at the end would be better. For instance, Consider two sentences, I am a critical thinker and  I am good at critical thinking, if thinking occurs in most places than thinker , rejecting thinker would make you lose one potential candidate for the interview. 
If this is done after stemming thinker and thinking would be stemmed to same word and hence we still consider him for the interview.


In [38]:
#Calculating the frequency of each token using FreqDist  
less_freq_tokens=[]
for k,v in tokenized_resumes.items():
    less_freq_tokens+=list(set(v))
word_freq = FreqDist(less_freq_tokens)    

#Less than 2%
remove_list_2=[]   
for k,v in word_freq.items():
    if v/len(resume_names)<0.02:
        remove_list_2.append(k)
#greater than 98%
remove_list_98=[]   
for k,v in word_freq.items():
    if v/len(resume_names)>0.98:
        remove_list_98.append(k)

remove_list=remove_list_2+remove_list_98

# removing both the 98% and 2% lists
for k,v in tokenized_resumes.items():
    tokenized_resumes[k] = [word for word in v if word not in remove_list]

tokenized_resumes['resume_(100).txt']


['Name',
 'Ho',
 'Mobile',
 'E-mail',
 'gmail',
 'Education',
 'The',
 'Hong_Kong',
 'University',
 'Science',
 'Technology',
 'Sep',
 '2015',
 'BBA',
 'Finance',
 'Professional',
 'Accounting',
 'Second_Class',
 'Honors',
 'Division',
 'I',
 'Work_Experience',
 'Trustee',
 'Limited',
 'Finance',
 'Department',
 'Senior',
 'Fund_Accountant',
 'Assistant',
 'Sep',
 '2015',
 'monthly',
 'Cayman',
 'fund',
 'SFC',
 'funds',
 'funds',
 'holding',
 'types',
 'financial',
 'Sep',
 '2017',
 'instruments',
 'including',
 'limited',
 'stocks',
 'options',
 'futures',
 'teammates',
 'involving',
 'trade',
 'settlements',
 'corporate_actions',
 'price',
 'movements',
 'Bloomberg',
 'sources',
 'Coordinated',
 'fund_managers',
 'custodians',
 'bankers',
 'resolve',
 'valuation',
 'fund',
 'issues',
 'Monitored',
 'investment',
 'position',
 'margin',
 'requirement',
 'cash_flow',
 'investment',
 'compliance',
 'Prepared',
 'management_accounts',
 'audit',
 'queries',
 'junior',
 'colleagues',
 'jo

In [39]:
#Displaying the statistics 
words = list(chain.from_iterable(tokenized_resumes.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print('After removing context dependant words \n')
print('Size of word list: ',len(words))
print('Size of vocabulary list: ',len(vocab))
print('lexical diversity: ',lexical_diversity)

After removing context dependant words 

Size of word list:  78155
Size of vocabulary list:  3119
lexical diversity:  25.05771080474511


## 9.Removal of words which have length less than 3

This step can be done anywhere before stemming or after stemming. Just to have final each vocabulary word length greater than 3 considering it here.

In [41]:
# removing words which have less than 3
for k,v in tokenized_resumes.items():
    tokenized_resumes[k] = [word for word in v if len(word)>=3]  
tokenized_resumes['resume_(100).txt']

['Name',
 'Mobile',
 'E-mail',
 'gmail',
 'Education',
 'The',
 'Hong_Kong',
 'University',
 'Science',
 'Technology',
 'Sep',
 '2015',
 'BBA',
 'Finance',
 'Professional',
 'Accounting',
 'Second_Class',
 'Honors',
 'Division',
 'Work_Experience',
 'Trustee',
 'Limited',
 'Finance',
 'Department',
 'Senior',
 'Fund_Accountant',
 'Assistant',
 'Sep',
 '2015',
 'monthly',
 'Cayman',
 'fund',
 'SFC',
 'funds',
 'funds',
 'holding',
 'types',
 'financial',
 'Sep',
 '2017',
 'instruments',
 'including',
 'limited',
 'stocks',
 'options',
 'futures',
 'teammates',
 'involving',
 'trade',
 'settlements',
 'corporate_actions',
 'price',
 'movements',
 'Bloomberg',
 'sources',
 'Coordinated',
 'fund_managers',
 'custodians',
 'bankers',
 'resolve',
 'valuation',
 'fund',
 'issues',
 'Monitored',
 'investment',
 'position',
 'margin',
 'requirement',
 'cash_flow',
 'investment',
 'compliance',
 'Prepared',
 'management_accounts',
 'audit',
 'queries',
 'junior',
 'colleagues',
 'job',
 'duties'

In [42]:
words = list(chain.from_iterable(tokenized_resumes.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print('After removing words which have length less than 3 \n')
print('Size of word list: ',len(words))
print('Size of vocabulary list: ',len(vocab))
print('lexical diversity: ',lexical_diversity)

After removing words which have length less than 3 

Size of word list:  72739
Size of vocabulary list:  2991
lexical diversity:  24.319291206954198


## 10. Vectorizing, Indexing and Writing to file

In [46]:
out_file_dict = open("29416000_vocab.txt", 'w') 
vocab = list(vocab)

# indexing every token
vocab_dict = {}
i = 0
for w in vocab:
    vocab_dict[w] = i
    i = i + 1
# Writing it to file   
for k, v in sorted(vocab_dict.items()):
    out_file_dict.write("{}:{} ".format(k,v))
    out_file_dict.write('\n')

out_file_dict.close() 

In [47]:
#Building a Sparse Matrix
out_file_vector = open("29416000_countVec.txt", 'w')
#Vectorizing
for resume,token in tokenized_resumes.items():
    out_file_vector.write(resume[:-4]+' : ')# slicing is used to consider only resume_(n) not resume_(n).txt
    d_idx = [vocab_dict[w] for w in token]    
    for k, v in FreqDist(d_idx).items():       
        out_file_vector.write("{}:{} ".format(k,v))
    out_file_vector.write('\n\n')
    
out_file_vector.close()

## 11. Summary
This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:

- **Sentence Segmentation**. By using the built-in `nltk.data English pickle` module, it was possible split the sentences of the each resume.
- **Case Normlization**. By using the `re` package, designed regex which captures only the first word of the sentence and changes to lower case.  
- **Tokenization**. By using the `nltk` and `re` package, regular expressions had to be used to tokenize text and obtain letter-only words. Roughly 200 bigrams were also generated to further tokenize the initial corpus. The Frequency Distribution was to detect pairs of words with more frequently appearing together. In addition, bigram filters were also used to refine the bigrams even more by considering only alphabetical words.Stemming process was carried out to shoten the lookup for prediction.
- **Vocabulary and sparse vector generation**. A vocabulary covering words from different abstracts was obtained by  removing stop words, words> 98% (most frequent ones), and words that appeared in <2% documents. Finally, a sparse vector was calculated for every abstract by counting the frequency of vocabulary word occurrences.


######################### Text Statistics Before Wrangling ##################################

Vocabulary size:  16539<br/>
Total number of tokens:  143031<br/>
Lexical diversity:  8.648104480319246<br/>

######################### Final Text Statistics After Wranling ##################################

Vocabulary size:  2990 <br/>
Total number of tokens:  72764<br/>
Lexical diversity:  24.335785953177258

## 12. References

* https://docs.python.org/3/library/re.html
* https://www.nltk.org/
* NLTK Project. (2017). NLTK 3.0 documentation: nltk.tokenize.regexp module. Retrieved from http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer
* NLTK Project. (2015). Collocations. Retrieved from http://www.nltk.org/howto/collocations.html
* https://www.nltk.org/_modules/nltk/stem/porter.html