# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Momitha Yepuri
#### Student ID: S3856512

Date: 3/10

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* nltk
* scikit-learn (`sklearn.datasets.load_files`)
* itertools

## Introduction

In this assessment task, we are given a large collection of job advertisement documents (~ 50k jobs). The job advertisements span 8 different industries, i.e. `Accounting_Finance`, `Engineering`, `Healthcare_Nursing`, `Hospitality_Catering`, `IT`, `PR_Advertising_Marketing`, `Sales` and `Teaching`.
The goal of this task is to perform basic text pre-processing on the job ads dataset, using steps such as tokenization, removing the most and least frequent words and stop words, and extracting bigrams. The pre-processing focuses on the description field only.

The task produces 3 text files:
- `vocab.txt`: contains the unigram vocabulary, one word per line, in the following format: `word_string:word_integer_index`.
- `bigram.txt`: contains the bigrams found in the whole document collection together with their term frequency, separated by a comma.
- `job_ads.txt`: contains the job advertisement information and the pre-processed description text for all the job advertisement documents.


## Importing libraries 

In [1]:
# Libraries used in this notebook
import re
from itertools import chain

import numpy as np
import pandas as pd
import nltk
import nltk.data
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize
from nltk.util import ngrams
from sklearn.datasets import load_files

## 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, and explain the findings, e.g., the number of folders and the format of the txt files.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


The loaded `job_data` is a dictionary-like object (a scikit-learn `Bunch`), with the following attributes:
* `data` - a list of job descriptions
* `target` - the corresponding label of the job descriptions (integer index)
* `target_names` - the names of job categories.
* `filenames` - the filenames holding the dataset.

In [2]:
ls data

Accounting_Finance/       Hospitality_Catering/     Sales/
Engineering/              IT/                       Teaching/
Healthcare_Nursing/       PR_Advertising_Marketing/


In [3]:
ls data/IT

Job_00001.txt  Job_02872.txt  Job_05743.txt  Job_08614.txt  Job_11485.txt
Job_00002.txt  Job_02873.txt  Job_05744.txt  Job_08615.txt  Job_11486.txt
Job_00003.txt  Job_02874.txt  Job_05745.txt  Job_08616.txt  Job_11487.txt
Job_00004.txt  Job_02875.txt  Job_05746.txt  Job_08617.txt  Job_11488.txt
Job_00005.txt  Job_02876.txt  Job_05747.txt  Job_08618.txt  Job_11489.txt
Job_00006.txt  Job_02877.txt  Job_05748.txt  Job_08619.txt  Job_11490.txt
Job_00007.txt  Job_02878.txt  Job_05749.txt  Job_08620.txt  Job_11491.txt
Job_00008.txt  Job_02879.txt  Job_05750.txt  Job_08621.txt  Job_11492.txt
Job_00009.txt  Job_02880.txt  Job_05751.txt  Job_08622.txt  Job_11493.txt
Job_00010.txt  Job_02881.txt  Job_05752.txt  Job_08623.txt  Job_11494.txt
Job_00011.txt  Job_02882.txt  Job_05753.txt  Job_08624.txt  Job_11495.txt
Job_00012.txt  Job_02883.txt  Job_05754.txt  Job_08625.txt  Job_11496.txt
Job_00013.txt  Job_02884.txt  Job_05755.txt  Job_08626.txt  Job_11497.txt
Job_00014.txt  Job_02885.

In [4]:
# Load the data from the data folder
job_data = load_files(r"./data")  

In [5]:
job_data['filenames']

array(['./data/Engineering/Job_14624.txt',
       './data/Healthcare_Nursing/Job_31567.txt',
       './data/Hospitality_Catering/Job_50131.txt', ...,
       './data/IT/Job_13401.txt',
       './data/PR_Advertising_Marketing/Job_52696.txt',
       './data/Accounting_Finance/Job_25296.txt'], dtype='<U45')

In [6]:
job_data['target']

array([1, 2, 3, ..., 4, 5, 0])

In the data folder, there are 8 different subfolders where each folder is a job category.

In [7]:
job_data['target_names']

['Accounting_Finance',
 'Engineering',
 'Healthcare_Nursing',
 'Hospitality_Catering',
 'IT',
 'PR_Advertising_Marketing',
 'Sales',
 'Teaching']

In [8]:
# test whether it matches, just in case
emp = 2
job_data['filenames'][emp], job_data['target'][emp] # from the file path we know that it's the correct class too

('./data/Hospitality_Catering/Job_50131.txt', 3)

In [9]:
#Assigning variables
descriptions, adverts = job_data.data, job_data.target  

In [10]:
# description of job advertisement
descriptions[emp]

b'Title: CHEF DE RANG FOR MICHELIN STARRED RESTAURANT\nWebindex: 69182387\nCompany: Club Gascon\nDescription: French restaurant Club Gascon , (1 Michelin) well established in the heart of London (easy access by bus, train or tube) is looking for an experienced Chef de Rang / Waiter to complete its team for 2013. Split and straight shifts depending on rota, 5 days / week, closed every sunday and bank holiday. Average of ****h / week Real career progression possible as we are part of a small group of quality restaurant all based in London. Excellent wages (basic share of gratuities), staff discount in all our restaurants etc. Please check our site: www.clubgascon.com and if interested in joining us, send a detailed CV with the position applied for to: infoclubgascon.com to arrange a meeting in order to discuss the position on offer.'
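The raw advert above follows a `Field: value` layout with one field per line. As a minimal sketch of how those fields could be pulled apart, the hypothetical helper `parse_ad` below uses the standard `re` module on a shortened copy of this advert. The helper name, the abbreviated sample text and the assumption that every field sits on its own line are illustrative, not part of the assignment code.

```python
import re

def parse_ad(raw: bytes) -> dict:
    """Split one raw advert into its labelled fields (hypothetical helper)."""
    text = raw.decode("utf-8")  # the loader returns bytes, so decode first
    fields = {}
    for name in ("Title", "Webindex", "Company", "Description"):
        # with re.MULTILINE, ^ and $ match at the start and end of each line
        match = re.search(rf"^{name}:\s*(.*)$", text, flags=re.MULTILINE)
        # a missing field is stored as None instead of raising an error
        fields[name.lower()] = match.group(1).strip() if match else None
    return fields

# shortened version of the advert shown above (description abbreviated)
sample = (b"Title: CHEF DE RANG FOR MICHELIN STARRED RESTAURANT\n"
          b"Webindex: 69182387\n"
          b"Company: Club Gascon\n"
          b"Description: French restaurant Club Gascon is looking for a waiter.")

ad = parse_ad(sample)
print(ad["webindex"], "|", ad["title"])
```

Returning `None` for a missing field is a design choice for this sketch; it keeps the loop running if an advert lacks, say, a company line.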

In [11]:
adverts[emp]

3

In [12]:
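# getting the ID for each text file from each category
job_id = []
for path in job_data['filenames']:
    # keep the digits between "Job_" and the ".txt" extension;
    # str.strip(".txt") strips characters rather than the suffix, so slice instead
    job_id.append(path.split("Job_")[1][:-len(".txt")])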

In [13]:
# creating a list for categories 
job_category = []
for cat1 in adverts:
    job_category.append(job_data['target_names'][cat1])

## 1.2 Pre-processing data

### 1.2.1 Tokenization

In this sub-task, each job ad description is tokenized. The description is first converted to lowercase for consistency, then segmented into sentences, and each sentence is split into word tokens.
Finally, each tokenized description is stored as a list of tokens.
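The word-level step can be illustrated without NLTK. The sketch below applies the same regular expression used in this notebook through `re.findall`, which is assumed here to behave like `RegexpTokenizer` in its default matching mode; the `tokenize` helper and the sample sentence are illustrative only.

```python
import re

# same pattern as in the notebook: letters, optionally joined once by - or '
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

def tokenize(text: str) -> list:
    """Lowercase the text and return every match of the token pattern."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Chef de Rang / Waiter, it's full-time for 2013.")
print(tokens)
```

Digits and punctuation never match the pattern, so `2013` and `/` are dropped, while `it's` and `full-time` survive as single tokens.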

In [14]:
#converting to lowercase
descriptions = [content.lower() for content in descriptions]

In [15]:
# descriptions[emp]

In [16]:
job_title = []
job_web_index = []
def tokenizeDescription(content):
   
    description = content.decode('utf-8') # convert the bytes-like object to python string, need this before we apply any pattern search on it
    #converting to lowercase
    to_lower = description.lower()
    
    #Searching for description using regex
    description = re.search(r'description:\s*(.*)$', str(to_lower)).group(1)
    
    #Searching for title using regex
    title = re.search(r'title:(.*)',str(to_lower)).group(1)
    title = title.strip() # strip whitespaces
    
    #Searching for webIndex using regex
    web_index = re.search(r'webindex:(.*\d+)',str(to_lower)).group(1)
    web_index = web_index.strip() # strip whitespaces
   
    # Store the title and webIndex found by the regex searches in the module-level lists.
    job_title.append(title)
    job_web_index.append(web_index)
    #segmenting into sentences
    sentences = sent_tokenize(str(description))
    
    # tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern) 
    token_lists = [tokenizer.tokenize(sen) for sen in sentences]
    
    # merge them into a list of tokens
    tokenised_description = list(chain.from_iterable(token_lists))
    return tokenised_description



In [17]:
# test variable used throughout to test
test_ind = 2

In [18]:
print("Number of Job ID's:", len(job_id))

Number of Job ID's: 55449


In [19]:
test_ind = 2 # pick an element to check that the job ID and the txt file correspond to each other
print("Job ID:", job_id[test_ind])


Job ID: 50131


#### Statistics Before Any Further Pre-processing

* The total number of tokens across the corpus
* The total number of types across the corpus, i.e. the size of vocabulary 
* Lexical diversity, i.e. the ratio of unique words (types) to the total number of words (tokens)
* The average, minimum and maximum number of tokens per description (i.e. document length) in the dataset

These statistics are printed by a function, since the same printing module is used later to compare the values before and after pre-processing.
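To make these quantities concrete, here is a small sketch that computes them on a toy corpus using only built-in functions. The `corpus_stats` helper and the toy documents are hypothetical and only mirror what the statistics function reports.

```python
def corpus_stats(tokenised_docs):
    """Return corpus-level counts for a list of token lists."""
    tokens = [tok for doc in tokenised_docs for tok in doc]  # all tokens
    types = set(tokens)                                      # unique tokens
    lengths = [len(doc) for doc in tokenised_docs]           # document lengths
    return {
        "types": len(types),
        "tokens": len(tokens),
        "lexical_diversity": len(types) / len(tokens),
        "avg_len": sum(lengths) / len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
    }

toy = [["chef", "de", "partie", "chef"], ["sql", "server"]]
stats = corpus_stats(toy)
print(stats)
```

The toy corpus has 6 tokens and 5 types, so its lexical diversity is 5/6; a large corpus such as this one gives a much smaller ratio because common words repeat many times.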

In [20]:
def stats_print(tk_descriptions):
    words = list(chain.from_iterable(tk_descriptions)) # we put all the tokens in the corpus in a single list
    vocab = set(words) # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of descriptions:", len(tk_descriptions))
    lens = [len(desc) for desc in tk_descriptions]
    print("Average description length:", np.mean(lens))
    print("Maximun description length:", np.max(lens))
    print("Minimun description length:", np.min(lens))
    print("Standard deviation of description length:", np.std(lens))



In [21]:
tk_descriptions = [tokenizeDescription(d) for d in descriptions]  # list comprehension, generate a list of tokenized descriptions


In [22]:
print("Raw description:\n",descriptions[emp],'\n')
print("Tokenized description:\n",tk_descriptions[emp])

Raw description:
 b'title: chef de rang for michelin starred restaurant\nwebindex: 69182387\ncompany: club gascon\ndescription: french restaurant club gascon , (1 michelin) well established in the heart of london (easy access by bus, train or tube) is looking for an experienced chef de rang / waiter to complete its team for 2013. split and straight shifts depending on rota, 5 days / week, closed every sunday and bank holiday. average of ****h / week real career progression possible as we are part of a small group of quality restaurant all based in london. excellent wages (basic share of gratuities), staff discount in all our restaurants etc. please check our site: www.clubgascon.com and if interested in joining us, send a detailed cv with the position applied for to: infoclubgascon.com to arrange a meeting in order to discuss the position on offer.' 

Tokenized description:
 ['french', 'restaurant', 'club', 'gascon', 'michelin', 'well', 'established', 'in', 'the', 'heart', 'of', 'londo

#### The Statistics

After performing the tokenisation process, let's have a look at the statistics:

In [23]:
stats_print(tk_descriptions)

Vocabulary size:  89591
Total number of tokens:  13799127
Lexical diversity:  0.006492512171240978
Total number of descriptions: 55449
Average description length: 248.861602553698
Maximun description length: 2001
Minimun description length: 10
Standard deviation of description length: 125.26507304982165


### Task 1.2.2 Removing Single Character Token

Removing any single-character tokens (tokens of length less than 2) from the job descriptions,
then double-checking that the removal has been done properly.

In [24]:
# create a list of single character tokens for each description
singleChar_list = [[w for w in description if len(w) < 2] \
                      for description in tk_descriptions] 
list(chain.from_iterable(singleChar_list)) # merge them together in one list

['a',
 'a',
 'm',
 'a',
 'a',
 'a',
 's',
 'a',
 's',
 'h',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'p',
 'a',
 'a',
 'a',
 'k',
 'k',
 'a',
 'a',
 'a',
 'x',
 'x',
 'a',
 'k',
 'k',
 'a',
 'k',
 'k',
 'k',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'c',
 'c',
 'a',
 'a',
 'a',
 'k',
 'k',
 'k',
 'k',
 'a',
 'b',
 'b',
 'a',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 'a',
 'e',
 'g',
 'a',
 'c',
 'k',
 'k',
 'i',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 's',
 'u',
 's',
 'i',
 'g',
 'o',
 's',
 'a',
 'a',
 'a',
 'k',
 'c',
 'a',
 'c',
 'a',
 'a',
 'a',
 'c',
 'k',
 'a',
 's',
 'a',
 'm',
 'm',
 'a',
 's',
 'a',
 'a',
 'a',
 'e',
 'm',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'c',
 'a',
 'c',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 'j'

In [25]:
# Before removal of single character tokens
print("Tokenized description:\n",tk_descriptions[emp])

Tokenized description:
 ['french', 'restaurant', 'club', 'gascon', 'michelin', 'well', 'established', 'in', 'the', 'heart', 'of', 'london', 'easy', 'access', 'by', 'bus', 'train', 'or', 'tube', 'is', 'looking', 'for', 'an', 'experienced', 'chef', 'de', 'rang', 'waiter', 'to', 'complete', 'its', 'team', 'for', 'split', 'and', 'straight', 'shifts', 'depending', 'on', 'rota', 'days', 'week', 'closed', 'every', 'sunday', 'and', 'bank', 'holiday', 'average', 'of', 'h', 'week', 'real', 'career', 'progression', 'possible', 'as', 'we', 'are', 'part', 'of', 'a', 'small', 'group', 'of', 'quality', 'restaurant', 'all', 'based', 'in', 'london', 'excellent', 'wages', 'basic', 'share', 'of', 'gratuities', 'staff', 'discount', 'in', 'all', 'our', 'restaurants', 'etc', 'please', 'check', 'our', 'site', 'www', 'clubgascon', 'com', 'and', 'if', 'interested', 'in', 'joining', 'us', 'send', 'a', 'detailed', 'cv', 'with', 'the', 'position', 'applied', 'for', 'to', 'infoclubgascon', 'com', 'to', 'arrange', 

In [26]:
# filter out single character tokens
tk_descriptions = [[w for w in description if len(w) >=2] \
                      for description in tk_descriptions]

In [27]:
# After removal
print("Tokenized description:\n",tk_descriptions[emp])

Tokenized description:
 ['french', 'restaurant', 'club', 'gascon', 'michelin', 'well', 'established', 'in', 'the', 'heart', 'of', 'london', 'easy', 'access', 'by', 'bus', 'train', 'or', 'tube', 'is', 'looking', 'for', 'an', 'experienced', 'chef', 'de', 'rang', 'waiter', 'to', 'complete', 'its', 'team', 'for', 'split', 'and', 'straight', 'shifts', 'depending', 'on', 'rota', 'days', 'week', 'closed', 'every', 'sunday', 'and', 'bank', 'holiday', 'average', 'of', 'week', 'real', 'career', 'progression', 'possible', 'as', 'we', 'are', 'part', 'of', 'small', 'group', 'of', 'quality', 'restaurant', 'all', 'based', 'in', 'london', 'excellent', 'wages', 'basic', 'share', 'of', 'gratuities', 'staff', 'discount', 'in', 'all', 'our', 'restaurants', 'etc', 'please', 'check', 'our', 'site', 'www', 'clubgascon', 'com', 'and', 'if', 'interested', 'in', 'joining', 'us', 'send', 'detailed', 'cv', 'with', 'the', 'position', 'applied', 'for', 'to', 'infoclubgascon', 'com', 'to', 'arrange', 'meeting', 'in'

In [28]:
stats_print(tk_descriptions)

Vocabulary size:  89565
Total number of tokens:  13342925
Lexical diversity:  0.006712546162104636
Total number of descriptions: 55449
Average description length: 240.63418636945661
Maximun description length: 1919
Minimun description length: 10
Standard deviation of description length: 121.91270721028921


### Task 1.2.3 Removing Stop words

Removing the stop words from the given `stopwords_en.txt`.
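As a small illustration of this step, the sketch below reads a stop word list with one word per line and filters a toy description against it. An in-memory `io.StringIO` stands in for `stopwords_en.txt`, and the four stop words and the toy description are assumptions made for the example.

```python
import io

# stand-in for stopwords_en.txt: one stop word per line (illustrative subset)
stopword_file = io.StringIO("a\nand\nof\nthe\n")

# a set gives constant-time membership tests on average,
# whereas a list has to be scanned for every token
stopword_set = set(stopword_file.read().splitlines())

doc = ["heart", "of", "london", "and", "the", "team"]
filtered = [w for w in doc if w not in stopword_set]
print(filtered)
```

With millions of tokens in the corpus, using a set rather than a list for the lookups can make a noticeable difference in running time.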

In [29]:
# we put all the tokens in the corpus in a single list 
words = list(chain.from_iterable(tk_descriptions)) 
vocab = set(words) # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words

In [30]:
term_fd = FreqDist(words) # compute term frequency for each unique word/type

In [31]:
stopwords_list = []
with open('stopwords_en.txt') as f:
    stopwords_list = f.read().splitlines()

In [32]:
len(stopwords_list)

571

In [33]:
stopwords_set = set(stopwords_list) # set membership tests are much faster than list lookups
tk_descriptions = [[w for w in description if w not in stopwords_set] \
                      for description in tk_descriptions]

In [34]:
words = list(chain.from_iterable([set(description) for description in tk_descriptions]))
doc_fd = FreqDist(words)
doc_fd.most_common(25)

[('experience', 43644),
 ('role', 34680),
 ('work', 33684),
 ('team', 32585),
 ('working', 30714),
 ('skills', 30412),
 ('client', 26899),
 ('job', 25552),
 ('business', 24739),
 ('uk', 24133),
 ('excellent', 22982),
 ('opportunity', 22678),
 ('company', 22263),
 ('management', 20620),
 ('required', 20555),
 ('development', 20223),
 ('apply', 20133),
 ('based', 19333),
 ('successful', 19118),
 ('join', 18682),
 ('www', 18421),
 ('salary', 18402),
 ('cv', 18383),
 ('support', 18286),
 ('knowledge', 17844)]

In [35]:
rm_words = list(vocab - set(doc_fd.keys()))
print("Remove",len(rm_words), "number of stop words.")
rm_words

Remove 513 number of stop words.


['unlikely',
 'somebody',
 'took',
 "you've",
 'vs',
 'former',
 'everybody',
 'tell',
 'mainly',
 'kept',
 'about',
 'were',
 'someone',
 'entirely',
 'further',
 'other',
 'third',
 'who',
 'both',
 'those',
 'tries',
 "weren't",
 'seem',
 'one',
 'everywhere',
 'corresponding',
 'qv',
 'four',
 'help',
 'gets',
 'especially',
 'knows',
 'specifying',
 'let',
 'above',
 'where',
 'anyway',
 'downwards',
 'little',
 'sub',
 'seriously',
 'gone',
 'trying',
 'moreover',
 'nobody',
 'over',
 'anywhere',
 "they're",
 'brief',
 'per',
 "i'd",
 'myself',
 'comes',
 'hardly',
 'wish',
 'towards',
 'plus',
 'somewhat',
 'specified',
 'currently',
 'welcome',
 'taken',
 'same',
 'unless',
 'none',
 'hence',
 'he',
 'reasonably',
 'immediate',
 'please',
 'name',
 'best',
 'cause',
 'looking',
 'value',
 'such',
 'between',
 'meanwhile',
 'not',
 'becomes',
 'them',
 'necessary',
 'under',
 'throughout',
 'any',
 'beside',
 "aren't",
 'sorry',
 'but',
 'at',
 "hasn't",
 'whole',
 'seven',
 'or

In [36]:
print("Tokenized description:\n",tk_descriptions[emp])

Tokenized description:
 ['french', 'restaurant', 'club', 'gascon', 'michelin', 'established', 'heart', 'london', 'easy', 'access', 'bus', 'train', 'tube', 'experienced', 'chef', 'de', 'rang', 'waiter', 'complete', 'team', 'split', 'straight', 'shifts', 'depending', 'rota', 'days', 'week', 'closed', 'sunday', 'bank', 'holiday', 'average', 'week', 'real', 'career', 'progression', 'part', 'small', 'group', 'quality', 'restaurant', 'based', 'london', 'excellent', 'wages', 'basic', 'share', 'gratuities', 'staff', 'discount', 'restaurants', 'check', 'site', 'www', 'clubgascon', 'interested', 'joining', 'send', 'detailed', 'cv', 'position', 'applied', 'infoclubgascon', 'arrange', 'meeting', 'order', 'discuss', 'position', 'offer']


In [37]:
print("Descriptions before stopword removal:",len(descriptions))
print("Descriptions after stopword removal:",len(tk_descriptions))

Descriptions before stopword removal: 55449
Descriptions after stopword removal: 55449


#### The Updated Statistics

A few pre-processing steps have now been applied, so let's have a look at the statistics again.
The vocabulary size has been reduced from `89565` to `89052`, a difference of `513`.

In [38]:
stats_print(tk_descriptions)

Vocabulary size:  89052
Total number of tokens:  7863307
Lexical diversity:  0.011325006132915833
Total number of descriptions: 55449
Average description length: 141.8115204963119
Maximun description length: 1132
Minimun description length: 7
Standard deviation of description length: 73.78995293014496


### Task 1.2.4 Removing Less Frequent Words, i.e. words that appear only once

Removing the less frequent words from each tokenized description text by term frequency.
- find out the list of words that appear only once in the entire corpus of descriptions
- remove these less frequent words from each tokenized description text

In [39]:
words = list(chain.from_iterable(tk_descriptions)) # we put all the tokens in the corpus in a single list

Finding the set of less frequent words by calling the `hapaxes` function on the **term frequency** distribution. Hapaxes are words that occur only once within a context.
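The idea can be sketched on a toy corpus with `collections.Counter` in place of NLTK's `FreqDist`. The toy documents and variable names are illustrative; the words with a count of exactly 1 play the role of the hapaxes.

```python
from collections import Counter

toy = [["sql", "server", "developer"],
       ["sql", "developer", "leedsare"]]

# term frequency: count every occurrence across the whole corpus
term_freq = Counter(tok for doc in toy for tok in doc)

# hapaxes are the words whose total count is exactly 1
hapaxes = {word for word, count in term_freq.items() if count == 1}

# drop the hapaxes from every tokenized document
filtered = [[tok for tok in doc if tok not in hapaxes] for doc in toy]
print(sorted(hapaxes), filtered)
```

Run-together strings such as `leedsare` tend to show up as hapaxes, which is one reason this filter shrinks the vocabulary so sharply while removing few tokens.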

In [40]:
lessFreqWords = set(term_fd.hapaxes())
lessFreqWords

{'juniorsouschefwestsussex',
 'accountsdarabaseassistant',
 'seniortestanalystjoinaleadingfundmanagercity',
 'elderlyyou',
 'solmeliawhitehousecareers',
 'amenity',
 'ryw',
 'mdmwytuzzjq',
 'xmlexperience',
 'emailproducerforonlinetravelbrand',
 'milesfromhighwycombe',
 'netccommunication',
 'anastasia',
 'kaitlin',
 'nostell',
 'weekperson',
 'learningadvisorearlyyearscare',
 'leedsare',
 'unitrends',
 'experienceexcellent',
 "cubic's",
 'unittest',
 'sharepointtechnicalspecialistsharepointconsultant',
 'unixadminsunsolaris',
 'receptionistdeverevenuesltdlatimerplace',
 'navisionnavdeveloper',
 'motorfactor',
 'systemsperformancetechniciananalyst',
 'rosettesouschefnorthampton',
 'inlondonclose',
 'businessanalystproductownerto',
 'gebruikt',
 'shiftallowance',
 'diffuses',
 'requiredapply',
 'determent',
 'sdet',
 'corvus',
 'namtesco',
 'tenuous',
 'paintspray',
 'nibble',
 'secteur',
 'waeth',
 'warmsalesadvisorexistingcustomers',
 'treq',
 'panchalmerco',
 'livelyassistantmanagerw

In [41]:
# number of less frequent words
len(lessFreqWords)

48975

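We see that there are `48975` words that appear only once, and we can proceed to remove them. Note that `term_fd` was computed before stop word removal, so this set may include stop words that have already been removed.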

In [42]:
def removeLessFreqWords(description):
    return [d for d in description if d not in lessFreqWords]

tk_descriptions = [removeLessFreqWords(description) for description in tk_descriptions]

#### The Updated Statistics

Another pre-processing step has been applied, so let's have a look at the statistics again.
The vocabulary size has been reduced from `89052` to `40088`, a difference of `48964`.

In [43]:
stats_print(tk_descriptions)

Vocabulary size:  40088
Total number of tokens:  7814343
Lexical diversity:  0.005130053799788415
Total number of descriptions: 55449
Average description length: 140.9284748146946
Maximun description length: 1121
Minimun description length: 7
Standard deviation of description length: 73.46663506985078


### Task 1.2.5 Removing the top 50 most frequent words based on document frequency.


Removing the most frequent words from each tokenized description text. 
Exploring the most frequent words in terms of document frequency:
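Document frequency counts how many descriptions contain a word, not how many times the word occurs. The sketch below contrasts the two on a toy corpus, again using `collections.Counter` as a stand-in for `FreqDist`; the toy documents are assumptions made for the example.

```python
from collections import Counter

toy = [["team", "team", "chef"],
       ["team", "nurse"],
       ["chef", "chef", "chef"]]

# term frequency counts every occurrence
term_freq = Counter(tok for doc in toy for tok in doc)

# document frequency counts each word at most once per document,
# which is why each document is converted to a set first
doc_freq = Counter(tok for doc in toy for tok in set(doc))

print(term_freq["chef"], doc_freq["chef"])
```

Here `chef` occurs 4 times but in only 2 documents, so it leads on term frequency while tying with `team` on document frequency.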

In [44]:
words = list(chain.from_iterable([set(description) for description in tk_descriptions]))
doc_fd = FreqDist(words)  # compute document frequency for each unique word/type
doc_fd.most_common(50)

[('experience', 43644),
 ('role', 34680),
 ('work', 33684),
 ('team', 32585),
 ('working', 30714),
 ('skills', 30412),
 ('client', 26899),
 ('job', 25552),
 ('business', 24739),
 ('uk', 24133),
 ('excellent', 22982),
 ('opportunity', 22678),
 ('company', 22263),
 ('management', 20620),
 ('required', 20555),
 ('development', 20223),
 ('apply', 20133),
 ('based', 19333),
 ('successful', 19118),
 ('join', 18682),
 ('www', 18421),
 ('salary', 18402),
 ('cv', 18383),
 ('support', 18286),
 ('knowledge', 17844),
 ('strong', 16475),
 ('environment', 16408),
 ('posted', 16398),
 ('jobseeking', 16342),
 ('candidate', 16304),
 ('originally', 16294),
 ('leading', 16194),
 ('high', 15922),
 ('service', 15623),
 ('manager', 15587),
 ('good', 15252),
 ('ability', 15154),
 ('including', 14857),
 ('position', 14564),
 ('services', 14501),
 ('benefits', 14434),
 ('training', 14218),
 ('essential', 13915),
 ('experienced', 13826),
 ('key', 13567),
 ('contact', 13551),
 ('level', 13523),
 ('recruitment', 

In [45]:
df_words = set(w[0] for w in doc_fd.most_common(50))
df_words

{'ability',
 'apply',
 'based',
 'benefits',
 'business',
 'candidate',
 'candidates',
 'client',
 'company',
 'contact',
 'cv',
 'development',
 'environment',
 'essential',
 'excellent',
 'experience',
 'experienced',
 'good',
 'high',
 'including',
 'job',
 'jobseeking',
 'join',
 'key',
 'knowledge',
 'leading',
 'level',
 'management',
 'manager',
 'opportunity',
 'originally',
 'position',
 'posted',
 'provide',
 'recruitment',
 'required',
 'role',
 'salary',
 'service',
 'services',
 'skills',
 'strong',
 'successful',
 'support',
 'team',
 'training',
 'uk',
 'work',
 'working',
 'www'}

In [46]:
# function to remove most frequent words in tk descriptions.
def removeMostFreqWords(description):
    return [d for d in description if d not in df_words]

tk_descriptions = [removeMostFreqWords(description) for description in tk_descriptions]

#### The Updated Statistics

Another pre-processing step has been applied, so let's have a look at the statistics again.
The vocabulary size has been reduced from `40088` to `40038`, a difference of `50`.

In [47]:
stats_print(tk_descriptions)

Vocabulary size:  40038
Total number of tokens:  6239169
Lexical diversity:  0.0064172007522155594
Total number of descriptions: 55449
Average description length: 112.52085700373316
Maximun description length: 990
Minimun description length: 4
Standard deviation of description length: 61.88637513583753


### Task 1.2.6 Extract the top 10 Bigrams based on term frequency
Exploring the bigrams (top 10) in the pre-processed description text. Also making sense of the vocabulary.
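One detail worth keeping in mind: when bigrams are built from a single flattened token list, the last token of one description can pair with the first token of the next. The sketch below is an alternative, not the approach taken in this notebook: it counts bigrams one description at a time with `zip` and `collections.Counter`. The `count_bigrams` helper and the toy corpus are hypothetical.

```python
from collections import Counter

def count_bigrams(tokenised_docs):
    """Count adjacent token pairs without crossing document boundaries."""
    counts = Counter()
    for doc in tokenised_docs:
        # zip(doc, doc[1:]) pairs each token with its right-hand neighbour
        counts.update(zip(doc, doc[1:]))
    return counts

toy = [["chef", "de", "partie"],
       ["sous", "chef", "de", "partie"]]

bigram_counts = count_bigrams(toy)
print(bigram_counts.most_common(2))
```

In the toy corpus no `('partie', 'sous')` pair is produced, because the two descriptions are processed separately.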

In [48]:
# adding all words 
words = list(chain.from_iterable(tk_descriptions))

In [49]:
bigrams = ngrams(words, n = 2)
fdbigram = FreqDist(bigrams)

In [50]:
bigrams = fdbigram.most_common(10) # top 10 bigrams
bigrams

[(('employment', 'agency'), 8055),
 (('track', 'record'), 5472),
 (('acting', 'employment'), 5095),
 (('sql', 'server'), 4804),
 (('asp', 'net'), 4687),
 (('relation', 'vacancy'), 3977),
 (('sales', 'executive'), 3619),
 (('chef', 'de'), 3586),
 (('nursing', 'home'), 3503),
 (('de', 'partie'), 3396)]

In [51]:
rep_patterns = [" ".join(bg[0]) for bg in bigrams]
rep_patterns

['employment agency',
 'track record',
 'acting employment',
 'sql server',
 'asp net',
 'relation vacancy',
 'sales executive',
 'chef de',
 'nursing home',
 'de partie']

In [52]:
replacements = [bg.replace(" ","_") for bg in rep_patterns] # convert the format of bigram into word1_word2
replacements

['employment_agency',
 'track_record',
 'acting_employment',
 'sql_server',
 'asp_net',
 'relation_vacancy',
 'sales_executive',
 'chef_de',
 'nursing_home',
 'de_partie']

#### The Updated Statistics

Extracting the bigrams does not modify `tk_descriptions`, so the statistics are unchanged from the previous step: the vocabulary size remains `40038`.

In [53]:
stats_print(tk_descriptions)

Vocabulary size:  40038
Total number of tokens:  6239169
Lexical diversity:  0.0064172007522155594
Total number of descriptions: 55449
Average description length: 112.52085700373316
Maximun description length: 990
Minimun description length: 4
Standard deviation of description length: 61.88637513583753


### 1.2.7 Constructing the Vocabulary

In [54]:
# generating the vocabulary

words = list(chain.from_iterable(tk_descriptions)) # we put all the tokens in the corpus in a single list
vocab = sorted(list(set(words))) # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words

len(vocab)

40038

In [55]:
tk_descriptions[test_ind]

['french',
 'restaurant',
 'club',
 'gascon',
 'michelin',
 'established',
 'heart',
 'london',
 'easy',
 'access',
 'bus',
 'train',
 'tube',
 'chef',
 'de',
 'rang',
 'waiter',
 'complete',
 'split',
 'straight',
 'shifts',
 'depending',
 'rota',
 'days',
 'week',
 'closed',
 'sunday',
 'bank',
 'holiday',
 'average',
 'week',
 'real',
 'career',
 'progression',
 'part',
 'small',
 'group',
 'quality',
 'restaurant',
 'london',
 'wages',
 'basic',
 'share',
 'gratuities',
 'staff',
 'discount',
 'restaurants',
 'check',
 'site',
 'clubgascon',
 'interested',
 'joining',
 'send',
 'detailed',
 'applied',
 'arrange',
 'meeting',
 'order',
 'discuss',
 'offer']

## Saving Pre-processing required outputs
Save the vocabulary, bigrams and job advertisement txt as per specification.
- vocab.txt
- bigram.txt
- job_ads.txt

* The unigram vocabulary is saved in the format `word_string:word_integer_index`, with the index starting from 0, in a .txt file named `vocab.txt`
    * each line contains one vocabulary word
* The bigrams are ordered by term frequency (from high to low) and stored in a .txt file named `bigram.txt`
    * contains the bigrams found in the whole document collection together with their term frequency, separated by a comma (each line contains one bigram).
* The job advertisement information and the pre-processed description text of every job advertisement are stored in a .txt file named `job_ads.txt`

The saved files are then double-checked.
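One way to check the vocabulary file is a round trip: write the `word_string:word_integer_index` lines, read them back, and compare. The sketch below does this against an in-memory buffer rather than the real `vocab.txt`; the helper names `write_vocab` and `read_vocab` and the three-word vocabulary are hypothetical.

```python
import io

def write_vocab(vocab, handle):
    """Write one 'word:index' line per vocabulary word, indices from 0."""
    for index, word in enumerate(vocab):
        handle.write(f"{word}:{index}\n")

def read_vocab(handle):
    """Parse 'word:index' lines back into a word -> index dictionary."""
    mapping = {}
    for line in handle:
        # split on the last colon so the index is always the final part
        word, index = line.rstrip("\n").rsplit(":", 1)
        mapping[word] = int(index)
    return mapping

vocab = sorted({"chef", "nurse", "sql"})
buffer = io.StringIO()          # stands in for the file on disk
write_vocab(vocab, buffer)
buffer.seek(0)                  # rewind before reading back
restored = read_vocab(buffer)
print(restored)
```

Swapping the buffer for `open("vocab.txt")` would apply the same check to the saved file, assuming it was written in this format.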

#### Saving the output for vocab

In [56]:
# create a txt file named 'vocab.txt' in write mode; the with block closes it automatically
with open("vocab.txt", 'w') as out_file:
    for index, word in enumerate(vocab):
        out_file.write("{}:{}\n".format(word, index)) # write each vocabulary word and its index, note that the index starts from 0

#### Saving the output for bigram

In [57]:
# create a txt file named 'bigram.txt' in write mode; the with block closes it automatically
with open("bigram.txt", 'w') as out_file:
    for bigram, freq in bigrams:
        # write one bigram per line followed by its term frequency, separated by a comma
        out_file.write("{},{}\n".format(" ".join(bigram), freq))

#### Saving the output for job_ads

In [58]:
# create a txt file named 'job_ads.txt' in write mode; the with block closes it so every line is flushed to disk
with open("job_ads.txt", 'w') as out_file:
    for i in range(len(job_id)):
        out_file.write("ID: {}\nCategory: {}\nWebindex: {}\nTitle: {}\nDescription: {}\n".
                       format(str(job_id[i]),str(job_category[i]),str(job_web_index[i]),
                              str(job_title[i]),str(" ".join(tk_descriptions[i]))))
    

## Summary
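Basic text pre-processing of the job ad descriptions has been completed successfully. Tokenization produced `13799127` tokens and a vocabulary of `89591` types. After removing single-character tokens, stop words, words that appear only once, and the 50 words with the highest document frequency, `6239169` tokens and `40038` types remain. The vocabulary, the top 10 bigrams and the pre-processed job ads were saved to `vocab.txt`, `bigram.txt` and `job_ads.txt`.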