# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Joyal Joy Madeckal
#### Student ID: S3860476

Date: XXXX

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* re
* numpy
* itertools
* nltk
* sklearn

## Introduction

The aim of this assessment is to pre-process the descriptions of the provided job advertisements. Before any pre-processing is carried out, the data has to be extracted into suitable structures. There are 55449 job advertisements, spread across 8 categories, to be pre-processed. The task gives insight into how to approach data pre-processing in general. The approach followed in this assignment is as follows:
- Load all the data files
- Extract the data into proper structures - job descriptions will be extracted as a list for this task
- Pre-processing of the data by tokenizing, converting to lower case, removing less frequent words, extracting bigrams etc.
- Saving the data into the required format.

## Importing libraries 

In [1]:
# Libraries required for the assessment
import re
import numpy as np
from itertools import chain
from nltk.util import ngrams
from nltk.probability import *
from sklearn.datasets import load_files
from nltk.tokenize import RegexpTokenizer

### 1.1 Examining and loading data

The data folder contains 8 sub-folders, one per job category, and each sub-folder contains .txt files. Each text file has the following format:
- First line consists of Title information
- Second line consists of WebIndex information
- Third line consists of Company information and this information is not available for all the job advertisements.
- Fourth line consists of Description and it can span multiple lines.

The different categories and number of job advertisements under each category is shown below.

- Accounting_Finance - 7407
- Engineering - 8210
- Healthcare_Nursing - 8808
- Hospitality_Catering - 4788
- IT - 14353
- PR_Advertising_Marketing - 2755
- Sales - 5349
- Teaching - 3779

We load all the data in these folders with the help of `load_files` from `sklearn`. Then we analyse the volume of the data and the number of items under each category. Next, we create `target_label_map`, in which the integer targets are mapped to the corresponding label names. `FreqDist` gives us the size of the data under each category. At the end of this section we extract the data into a form suitable for pre-processing.

In [2]:
# Loading the data files into memory.
job_ads = load_files('./data', encoding='utf-8')

In [3]:
# Printing the length of the total number of files available.
print('Total number of job advertisements =', len(job_ads.filenames))
# Printing all the job categories available.
print('\nJob categories =', ', '.join(job_ads.target_names), end = '\n\n')

Total number of job advertisements = 55449

Job categories = Accounting_Finance, Engineering, Healthcare_Nursing, Hospitality_Catering, IT, PR_Advertising_Marketing, Sales, Teaching



Now we map the labels to the targets and get the number of job ads under each category. There are 8 categories, so `target` takes values in the range 0-7, and each value has to be mapped to a name in `target_names`.
After analysing the structure, we use `np.where` (since `target` is a numpy array) to find the index of the first file with a given target, look up the corresponding entry in `filenames`, split the path on its separator, and take the name of the category folder, which is the label for that target.

The mapping is stored in the object `target_label_map`.

We also print the number of job ads under each category, using `FreqDist` from `nltk.probability`.

In [4]:
# Mapping the target names with target and creating the map object.
target_label_map = {}
for i in range(len(job_ads.target_names)):
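    # The category is the folder that directly contains the file; splitting on
    # either path separator keeps this working on Windows and on POSIX systems.
    label = re.split(r'[\\/]', job_ads.filenames[np.where(job_ads.target == i)[0][0]])[-2]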
    target_label_map[i] = label
    print(f'Label for target {i} =', label)

print()
# Printing the number of job ads under each category.
target_frequencies = FreqDist(job_ads.target)
for i in range(len(job_ads.target_names)):
    print(f'Job ads under category {target_label_map[i]} =', target_frequencies[i])

Label for target 0 = Accounting_Finance
Label for target 1 = Engineering
Label for target 2 = Healthcare_Nursing
Label for target 3 = Hospitality_Catering
Label for target 4 = IT
Label for target 5 = PR_Advertising_Marketing
Label for target 6 = Sales
Label for target 7 = Teaching

Job ads under category Accounting_Finance = 7407
Job ads under category Engineering = 8210
Job ads under category Healthcare_Nursing = 8808
Job ads under category Hospitality_Catering = 4788
Job ads under category IT = 14353
Job ads under category PR_Advertising_Marketing = 2755
Job ads under category Sales = 5349
Job ads under category Teaching = 3779
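
The mapping and the counts above can also be cross-checked without touching the file paths. The sketch below is only an illustration on made-up data: `toy_target_names` and `toy_targets` are hypothetical stand-ins for `job_ads.target_names` and `job_ads.target`, and `collections.Counter` is used in place of `FreqDist`. It relies on each integer target being the position of its category in the list of names.

```python
from collections import Counter

# Hypothetical stand-ins for job_ads.target_names and job_ads.target.
toy_target_names = ['Accounting_Finance', 'Engineering', 'IT']
toy_targets = [2, 0, 2, 1, 2, 0]

# Each integer target is the position of its category in the names list.
toy_label_map = dict(enumerate(toy_target_names))

# Count how many documents carry each target value.
toy_counts = Counter(toy_targets)

for target, label in toy_label_map.items():
    print(f'Job ads under category {label} = {toy_counts[target]}')
```

If the labels produced this way match the ones extracted from the file paths, the two approaches agree.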


Having seen the basic statistics of the data, we now need to make the data suitable for pre-processing. For that, let's first look at the structure of a single job ad.

In [5]:
# Taking a random job ad and printing the data.
random_index = 10
print(job_ads.data[random_index])

Title: IT Project Manager (C/ASPnet) ****K Leicester
Webindex: 68092333
Company: Computer People
Description: I'm working on behalf of a stable and global company based in Leicester; I'm on the hunt an IT development Project Manager who has a background in development, at one point in your career you would have been a developer ideally coming from a strong C / ASP.Net background. You will be one of several project managers within the company and this role has come about due to sheer growth of the business. As a Project Manager you will be responsible for defining the Scope of the projects, resource requirements and project governance. You will be accountable for end to end delivery, risk management, and managing relationships with the business, partners and suppliers. You will also develop a communication strategy, to engage stakeholders and maintain support within IT and the wider business. You will have a strong track record in Project Management, and ideally be qualified in Prince**

From the above we can see that a job ad carries Title, Webindex, Company and Description information. The assignment specification also mentions that not all job ads have Company information. Pre-processing is needed only on the description of the job ad, so working on `job_ads.data` directly is not a good option. We have to extract the data into a suitable form.

We are going to extract the data as per the following steps:

1. The Title, Webindex and Company information will be stored in a list of dictionaries, `job_ads_title_webindex_company`.
2. For job ads without a company name, the 'Company' key will not be present.
3. All the job descriptions will be kept in a list named `job_ads_descriptions`.

How is the logic written?

Title, Webindex, Company and Description each sit on a separate line of the job ad. So, if we split on line breaks, a job ad with a company gives 4 items and a job ad without a company gives only 3. The label at the start of each line is then removed to obtain the value. Before doing this, we should verify the assumption that only 3 or 4 items are present after splitting.

In [6]:
print('Number of different items after the proposed splitting =',set([len(item.split('\n')) for item in job_ads.data]))

Number of different items after the proposed splitting = {3, 4}


From the above we can see that the proposed logic fits this data: every job ad splits into either 3 or 4 lines. The next step is to work out how to extract the values from those lines. Looking at the sample above, a job ad appears to have the following structure:

- Title: --> First line
- Webindex: --> Second line
- Company: | Description: --> Third line
- Description: --> Fourth line

If the structure is as above, we can extract the data by left stripping the labels Title:, Webindex:, Company: and Description: from the respective lines. Before moving forward, we should confirm that this assumption holds. We check it in the following code block.

In [7]:
print('Number of files with Title: starting =', len([desc for desc in job_ads.data if desc.split('\n')[0].startswith('Title: ')]))
print('Number of files with Webindex: starting =', len([desc for desc in job_ads.data if desc.split('\n')[1].startswith('Webindex: ')]))
print('Number of files with Company: starting =', len([desc for desc in job_ads.data if desc.split('\n')[2].startswith('Company: ')]))
print('Number of files with Description: starting (at 3rd line, no company data) =', len([desc for desc in job_ads.data if desc.split('\n')[2].startswith('Description: ')]))
print('Number of files with Description: starting (at 4th line, with company data)=', len([desc for desc in job_ads.data if len(desc.split('\n')) == 4 and (desc.split('\n')[3].startswith('Description: '))]))

Number of files with Title: starting = 55449
Number of files with Webindex: starting = 55449
Number of files with Company: starting = 50061
Number of files with Description: starting (at 3rd line, no company data) = 5388
Number of files with Description: starting (at 4th line, with company data)= 50061


The conclusions we can draw from the above are:

- Every job ad has a first line starting with Title: 
- Every job ad has a second line starting with Webindex: 
- 50061 job ads have a third line starting with Company:, which leaves 5388 job ads without a company name
- 5388 job ads have Description: on the third line and 50061 job ads have it on the fourth line; together these account for all 55449 files.

The assumption made above is correct, and we can proceed with left stripping to extract the data.
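
One caveat about left stripping is worth spelling out: `str.lstrip` treats its argument as a set of characters, not as a prefix. It gives the expected result on this data only because every label is followed by a space, which is not in that set. The sketch below shows the difference on made-up lines; `strip_label` is a hypothetical helper, not part of the assignment code, that removes an exact prefix instead.

```python
def strip_label(line, label):
    """Remove an exact leading label such as 'Title:' and trim whitespace."""
    if line.startswith(label):
        line = line[len(label):]
    return line.strip()

# With a space after the colon, lstrip stops at the space.
with_space = 'Title: title examiner'
# Without the space, lstrip keeps removing characters from the set 'Title:'.
without_space = 'Title:title examiner'

print(repr(with_space.lstrip('Title:').strip()))     # 'title examiner'
print(repr(without_space.lstrip('Title:').strip()))  # 'examiner'
print(repr(strip_label(without_space, 'Title:')))    # 'title examiner'
```

Since the check above confirmed that every line starts with the label followed by a space, both approaches give the same values for this data set.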

In [8]:
# Creation of lists for storing the data.
job_ads_title_webindex_company = []
job_ads_descriptions = []

# Looping through each job and updating the corresponding lists.
for job_ad in job_ads.data:
    job_obj = {}
    split_data = job_ad.split('\n')
    # The 3-line branch is needed because some of the data files have no Company information.
    if len(split_data) == 3:
        job_obj['Title'] = split_data[0].lstrip('Title:').strip()
        job_obj['Webindex'] = split_data[1].lstrip('Webindex:').strip()
        job_ads_title_webindex_company.append(job_obj)
        job_ads_descriptions.append(split_data[2].lstrip('Description:').strip())
    elif len(split_data) == 4:
        job_obj['Title'] = split_data[0].lstrip('Title:').strip()
        job_obj['Webindex'] = split_data[1].lstrip('Webindex:').strip()
        job_obj['Company'] = split_data[2].lstrip('Company:').strip()
        job_ads_title_webindex_company.append(job_obj)
        job_ads_descriptions.append(split_data[3].lstrip('Description:').strip())

Now we need to ensure that the extraction was done correctly. For that, we print the data for the random index.

In [9]:
# Printing the loaded data
print('Loaded data:\n', job_ads.data[random_index], sep='')
print('\nData structure created: ')
print(job_ads_title_webindex_company[random_index], '\n', job_ads_descriptions[random_index])

Loaded data:
Title: IT Project Manager (C/ASPnet) ****K Leicester
Webindex: 68092333
Company: Computer People
Description: I'm working on behalf of a stable and global company based in Leicester; I'm on the hunt an IT development Project Manager who has a background in development, at one point in your career you would have been a developer ideally coming from a strong C / ASP.Net background. You will be one of several project managers within the company and this role has come about due to sheer growth of the business. As a Project Manager you will be responsible for defining the Scope of the projects, resource requirements and project governance. You will be accountable for end to end delivery, risk management, and managing relationships with the business, partners and suppliers. You will also develop a communication strategy, to engage stakeholders and maintain support within IT and the wider business. You will have a strong track record in Project Management, and ideally be qualifie

From the above we can see that the data structure was created properly and is in sync with the loaded data.

We have checked the initial statistics and extracted the data into a form suitable for pre-processing. We can now proceed with the pre-processing of the data.

### 1.2 Pre-processing data
Perform the required text pre-processing steps.

As per the specification of the assessment, we first tokenize each job advertisement description with the regex pattern: `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"`

In [10]:
# Defining the pattern for tokenization
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
tokenizer = RegexpTokenizer(pattern)

# Creation of the tokens for the description
tokenised_desc = [tokenizer.tokenize(desc) for desc in job_ads_descriptions]
tokenised_desc[random_index]

["I'm",
 'working',
 'on',
 'behalf',
 'of',
 'a',
 'stable',
 'and',
 'global',
 'company',
 'based',
 'in',
 'Leicester',
 "I'm",
 'on',
 'the',
 'hunt',
 'an',
 'IT',
 'development',
 'Project',
 'Manager',
 'who',
 'has',
 'a',
 'background',
 'in',
 'development',
 'at',
 'one',
 'point',
 'in',
 'your',
 'career',
 'you',
 'would',
 'have',
 'been',
 'a',
 'developer',
 'ideally',
 'coming',
 'from',
 'a',
 'strong',
 'C',
 'ASP',
 'Net',
 'background',
 'You',
 'will',
 'be',
 'one',
 'of',
 'several',
 'project',
 'managers',
 'within',
 'the',
 'company',
 'and',
 'this',
 'role',
 'has',
 'come',
 'about',
 'due',
 'to',
 'sheer',
 'growth',
 'of',
 'the',
 'business',
 'As',
 'a',
 'Project',
 'Manager',
 'you',
 'will',
 'be',
 'responsible',
 'for',
 'defining',
 'the',
 'Scope',
 'of',
 'the',
 'projects',
 'resource',
 'requirements',
 'and',
 'project',
 'governance',
 'You',
 'will',
 'be',
 'accountable',
 'for',
 'end',
 'to',
 'end',
 'delivery',
 'risk',
 'manageme

In [11]:
# Printing the data for the random index to cross check.
job_ads_descriptions[random_index]

"I'm working on behalf of a stable and global company based in Leicester; I'm on the hunt an IT development Project Manager who has a background in development, at one point in your career you would have been a developer ideally coming from a strong C / ASP.Net background. You will be one of several project managers within the company and this role has come about due to sheer growth of the business. As a Project Manager you will be responsible for defining the Scope of the projects, resource requirements and project governance. You will be accountable for end to end delivery, risk management, and managing relationships with the business, partners and suppliers. You will also develop a communication strategy, to engage stakeholders and maintain support within IT and the wider business. You will have a strong track record in Project Management, and ideally be qualified in Prince**** Line management experience would be beneficial. You will be delivering a range of systems from Business In

Comparing the tokens with the original description, we can see that the tokenization works as intended: for example, "I'm" is kept as a single token, while "ASP.Net" is split into 'ASP' and 'Net'.
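
To see what the pattern does and does not join, it can be tried on a short made-up sentence. The sketch below uses `re.findall` directly, on the assumption that `RegexpTokenizer` in its default (non-gap) mode returns the same matches; the sample sentence is invented for illustration.

```python
import re

pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
sample = "I'm on the hunt for a C / ASP.Net end-to-end role"

# The optional group allows at most one hyphen or apostrophe inside a token.
tokens = re.findall(pattern, sample)
print(tokens)
# ["I'm", 'on', 'the', 'hunt', 'for', 'a', 'C', 'ASP', 'Net', 'end-to', 'end', 'role']
```

Because the optional group can match only once, a doubly hyphenated word such as "end-to-end" comes out as 'end-to' and 'end'.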

The specification also asks us to convert the words to lower case. We do that in the next code block.

In [12]:
# Using list comprehension for conversion of each of the tokens to lower case.
tokenised_desc = [[token.lower() for token in desc] for desc in tokenised_desc]
tokenised_desc[random_index]

["i'm",
 'working',
 'on',
 'behalf',
 'of',
 'a',
 'stable',
 'and',
 'global',
 'company',
 'based',
 'in',
 'leicester',
 "i'm",
 'on',
 'the',
 'hunt',
 'an',
 'it',
 'development',
 'project',
 'manager',
 'who',
 'has',
 'a',
 'background',
 'in',
 'development',
 'at',
 'one',
 'point',
 'in',
 'your',
 'career',
 'you',
 'would',
 'have',
 'been',
 'a',
 'developer',
 'ideally',
 'coming',
 'from',
 'a',
 'strong',
 'c',
 'asp',
 'net',
 'background',
 'you',
 'will',
 'be',
 'one',
 'of',
 'several',
 'project',
 'managers',
 'within',
 'the',
 'company',
 'and',
 'this',
 'role',
 'has',
 'come',
 'about',
 'due',
 'to',
 'sheer',
 'growth',
 'of',
 'the',
 'business',
 'as',
 'a',
 'project',
 'manager',
 'you',
 'will',
 'be',
 'responsible',
 'for',
 'defining',
 'the',
 'scope',
 'of',
 'the',
 'projects',
 'resource',
 'requirements',
 'and',
 'project',
 'governance',
 'you',
 'will',
 'be',
 'accountable',
 'for',
 'end',
 'to',
 'end',
 'delivery',
 'risk',
 'manageme

As we have tokenised the description and converted it to lower case characters, we can check the stats of the tokens.

In [13]:
# The method can be used to print the basic statistics of the tokens we have generated.
def stats_print(tokenised_articles):
    words = list(chain.from_iterable(tokenised_articles))
    vocab = set(words)
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of articles:", len(tokenised_articles))
    lens = [len(article) for article in tokenised_articles]
    print("Average document length:", np.mean(lens))
    print("Maximum document length:", np.max(lens))
    print("Minimum document length:", np.min(lens))
    print("Standard deviation of document length:", np.std(lens))
    
stats_print(tokenised_desc)

Vocabulary size:  89591
Total number of tokens:  13799127
Lexical diversity:  0.006492512171240978
Total number of articles: 55449
Average document length: 248.861602553698
Maximum document length: 2001
Minimum document length: 10
Standard deviation of document length: 125.26507304982165


The next task is to remove words with fewer than 2 characters. We can do this with a list comprehension, as in the code block below.

In [14]:
tokenised_desc = [[token for token in desc if len(token) >= 2] for desc in tokenised_desc]
tokenised_desc[random_index]

["i'm",
 'working',
 'on',
 'behalf',
 'of',
 'stable',
 'and',
 'global',
 'company',
 'based',
 'in',
 'leicester',
 "i'm",
 'on',
 'the',
 'hunt',
 'an',
 'it',
 'development',
 'project',
 'manager',
 'who',
 'has',
 'background',
 'in',
 'development',
 'at',
 'one',
 'point',
 'in',
 'your',
 'career',
 'you',
 'would',
 'have',
 'been',
 'developer',
 'ideally',
 'coming',
 'from',
 'strong',
 'asp',
 'net',
 'background',
 'you',
 'will',
 'be',
 'one',
 'of',
 'several',
 'project',
 'managers',
 'within',
 'the',
 'company',
 'and',
 'this',
 'role',
 'has',
 'come',
 'about',
 'due',
 'to',
 'sheer',
 'growth',
 'of',
 'the',
 'business',
 'as',
 'project',
 'manager',
 'you',
 'will',
 'be',
 'responsible',
 'for',
 'defining',
 'the',
 'scope',
 'of',
 'the',
 'projects',
 'resource',
 'requirements',
 'and',
 'project',
 'governance',
 'you',
 'will',
 'be',
 'accountable',
 'for',
 'end',
 'to',
 'end',
 'delivery',
 'risk',
 'management',
 'and',
 'managing',
 'relation

Words with fewer than 2 characters have now been removed. Comparing the two outputs for the random index above confirms it: single-letter tokens such as 'a' and 'c' no longer appear.

The next task is to remove the given stopwords, which can also be done with a list comprehension, as in the code block below. First we read the stopwords from the provided file. We store them as a set, since membership tests on a set are much faster than on a list.

In [15]:
# Reading the stopwords from the provided file.
stop_words = set()
with open('stopwords_en.txt') as stopwords:
    stop_words = set([word.strip('\n') for word in stopwords])
stop_words

{'a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [16]:
# Performing the removal of stopwords 
tokenised_desc = [[token for token in desc if token not in stop_words] for desc in tokenised_desc]
tokenised_desc[random_index]

['working',
 'behalf',
 'stable',
 'global',
 'company',
 'based',
 'leicester',
 'hunt',
 'development',
 'project',
 'manager',
 'background',
 'development',
 'point',
 'career',
 'developer',
 'ideally',
 'coming',
 'strong',
 'asp',
 'net',
 'background',
 'project',
 'managers',
 'company',
 'role',
 'due',
 'sheer',
 'growth',
 'business',
 'project',
 'manager',
 'responsible',
 'defining',
 'scope',
 'projects',
 'resource',
 'requirements',
 'project',
 'governance',
 'accountable',
 'end',
 'end',
 'delivery',
 'risk',
 'management',
 'managing',
 'relationships',
 'business',
 'partners',
 'suppliers',
 'develop',
 'communication',
 'strategy',
 'engage',
 'stakeholders',
 'maintain',
 'support',
 'wider',
 'business',
 'strong',
 'track',
 'record',
 'project',
 'management',
 'ideally',
 'qualified',
 'prince',
 'line',
 'management',
 'experience',
 'beneficial',
 'delivering',
 'range',
 'systems',
 'business',
 'intelligence',
 'data',
 'warehousing',
 'distribution',


We have removed the stopwords and printed the data at the random index; comparing it with the previous result shows that stopwords such as "i'm", 'on' and 'of' are gone.
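
The two filtering steps so far can be summarised on a handful of tokens. This is a minimal sketch on made-up data: `toy_stop_words` is a hypothetical miniature list, whereas the real stopwords are read from `stopwords_en.txt`.

```python
# Hypothetical miniature stopword list; the real one is read from stopwords_en.txt.
toy_stop_words = {'a', 'an', 'and', 'in', 'of', 'on', 'the', "i'm"}

toy_tokens = ["i'm", 'working', 'on', 'behalf', 'of', 'a', 'stable',
              'and', 'global', 'company', 'c', 'asp', 'net']

# Step 1: drop tokens shorter than two characters.
long_enough = [t for t in toy_tokens if len(t) >= 2]
# Step 2: drop stopwords; membership tests on a set take constant time on average.
filtered = [t for t in long_enough if t not in toy_stop_words]

print(filtered)
# ['working', 'behalf', 'stable', 'global', 'company', 'asp', 'net']
```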

The next task is to remove the words that appear only once, based on term frequency. We will use `FreqDist` and `chain.from_iterable` for this. We will also create the vocabulary of the words.

In [17]:
# Creating the words list from all descriptions and the vocabulary.
words = list(chain.from_iterable(tokenised_desc))
vocabulary = set(words)

# Finding the term frequency
term_freq = FreqDist(words)
# Getting the words that have occurred only once.
terms_with_single_occurence = set(term_freq.hapaxes())

The variable `terms_with_single_occurence` will consist of all the words appearing only once in the document collection. Now, we will use this to remove these words from `tokenised_desc`. Also, we will have a stats check after this is done using `stats_print`

In [18]:
# Removing the words with a single occurrence
tokenised_desc = [[token for token in desc if token not in terms_with_single_occurence] for desc in tokenised_desc]

# Checking the statistics of the tokens.
stats_print(tokenised_desc)

Vocabulary size:  40088
Total number of tokens:  7814343
Lexical diversity:  0.005130053799788415
Total number of articles: 55449
Average document length: 140.9284748146946
Maximum document length: 1121
Minimum document length: 7
Standard deviation of document length: 73.46663506985078


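The vocabulary size was 89591 after tokenization and has now dropped to 40088, a reduction of about 55%. The total number of tokens has also fallen from 13799127 to 7814343. Note that the earlier figures were taken before the short words and stopwords were removed, so the drop reflects all three removal steps together.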
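
The removal of single-occurrence words can be illustrated on a tiny made-up corpus. The sketch below uses `collections.Counter` in place of `FreqDist` and selects the words whose count is exactly 1, which is what `hapaxes()` is used for above; `toy_docs` is hypothetical.

```python
from collections import Counter
from itertools import chain

toy_docs = [['sql', 'server', 'developer'],
            ['sql', 'server', 'administrator'],
            ['developer', 'sql', 'sql']]

# Term frequency counts every occurrence across the whole collection.
term_freq = Counter(chain.from_iterable(toy_docs))
# Words seen exactly once in the collection.
single_occurrence = {word for word, count in term_freq.items() if count == 1}

cleaned_docs = [[t for t in doc if t not in single_occurrence] for doc in toy_docs]
print(single_occurrence)  # {'administrator'}
print(cleaned_docs)
```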

The next task is to remove the top 50 most frequent words based on document frequency. For this we have to create the document frequency object, which we do in the code block below.

In [19]:
# Creating the list of words for document frequency
words_for_doc_freq = list(chain.from_iterable([set(desc) for desc in tokenised_desc]))
# Creating the document frequency object
document_freq = FreqDist(words_for_doc_freq)
# Finding the top 50 words with highest frequency.
words_with_high_doc_freq_with_freq = document_freq.most_common(50)
words_with_high_doc_freq_with_freq

[('experience', 43644),
 ('role', 34680),
 ('work', 33684),
 ('team', 32585),
 ('working', 30714),
 ('skills', 30412),
 ('client', 26899),
 ('job', 25552),
 ('business', 24739),
 ('uk', 24133),
 ('excellent', 22982),
 ('opportunity', 22678),
 ('company', 22263),
 ('management', 20620),
 ('required', 20555),
 ('development', 20223),
 ('apply', 20133),
 ('based', 19333),
 ('successful', 19118),
 ('join', 18682),
 ('www', 18421),
 ('salary', 18402),
 ('cv', 18383),
 ('support', 18286),
 ('knowledge', 17844),
 ('strong', 16475),
 ('environment', 16408),
 ('posted', 16398),
 ('jobseeking', 16342),
 ('candidate', 16304),
 ('originally', 16294),
 ('leading', 16194),
 ('high', 15922),
 ('service', 15623),
 ('manager', 15587),
 ('good', 15252),
 ('ability', 15154),
 ('including', 14857),
 ('position', 14564),
 ('services', 14501),
 ('benefits', 14434),
 ('training', 14218),
 ('essential', 13915),
 ('experienced', 13826),
 ('key', 13567),
 ('contact', 13551),
 ('level', 13523),
 ('recruitment', 

We have obtained the top 50 words with most document frequency. Now, we have to remove these words from `tokenised_desc`.

In [20]:
# Storing the words with high frequency
words_with_high_doc_freq = set([item[0] for item in words_with_high_doc_freq_with_freq])

# Removing the 50 words with most document frequency
tokenised_desc = [[token for token in desc if token not in words_with_high_doc_freq] for desc in tokenised_desc]

Now we have removed the 50 words with the highest document frequency. We run a final stats check here, as all the removal steps are done.

In [21]:
# Checking the statistics of the tokens.
stats_print(tokenised_desc)

Vocabulary size:  40038
Total number of tokens:  6239169
Lexical diversity:  0.0064172007522155594
Total number of articles: 55449
Average document length: 112.52085700373316
Maximum document length: 990
Minimum document length: 4
Standard deviation of document length: 61.88637513583753


Next, we have to identify the top 10 bigrams based on term frequency. For this we will use `ngrams` from `nltk.util`.

In [22]:
# Re-creating the words list, since more words have been removed since it was first built.
words = list(chain.from_iterable(tokenised_desc))
# Creating the vocabulary again
vocabulary = set(words)
# Creating the bigrams.
bigrams = ngrams(words, n = 2)
# Creating the frequency distribution for bigrams
bigrams_freq = FreqDist(bigrams)
# Getting the top 10 bigrams.
top_10_bigrams = bigrams_freq.most_common(10)
top_10_bigrams

[(('employment', 'agency'), 8055),
 (('track', 'record'), 5472),
 (('acting', 'employment'), 5095),
 (('sql', 'server'), 4804),
 (('asp', 'net'), 4687),
 (('relation', 'vacancy'), 3977),
 (('sales', 'executive'), 3619),
 (('chef', 'de'), 3586),
 (('nursing', 'home'), 3503),
 (('de', 'partie'), 3396)]
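
One detail of the approach above: because the bigrams are built from the single flattened `words` list, the last token of one description is paired with the first token of the next. Whether this affects the top 10 is not checked here. A minimal sketch of a per-document alternative follows, on made-up data and with plain `zip` in place of `ngrams`; it is an illustration, not the method used for the saved output.

```python
from collections import Counter
from itertools import chain

toy_docs = [['sql', 'server', 'developer'],
            ['track', 'record', 'sql', 'server']]

# Flattened approach: also produces ('developer', 'track'), which spans two documents.
flat = list(chain.from_iterable(toy_docs))
flat_bigrams = Counter(zip(flat, flat[1:]))

# Per-document approach: pairs are formed inside each description only.
doc_bigrams = Counter(chain.from_iterable(zip(doc, doc[1:]) for doc in toy_docs))

print(('developer', 'track') in flat_bigrams)  # True
print(('developer', 'track') in doc_bigrams)   # False
print(doc_bigrams.most_common(1))              # [(('sql', 'server'), 2)]
```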

We have performed all the tasks required by the assignment. Now we need to save the data in the format given in the assignment specification.

## Saving required outputs

First we will be saving the vocabulary as vocab.txt.

In [23]:
# Creating the sorted vocabulary
sorted_vocab = sorted(list(vocabulary))

# Writing the vocabulary to a file.
with open('vocab.txt', 'w') as vocab_file:
    for index, word in enumerate(sorted_vocab):
        print(f'{word}:{index}', file = vocab_file)
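
The `word:index` layout can be checked by reading it back. The sketch below does a round trip on a made-up vocabulary and uses an in-memory `io.StringIO` buffer instead of `vocab.txt`, so it does not touch the real output file.

```python
import io

toy_vocab = {'server', 'asp', 'sql', 'net'}

# Write one 'word:index' line per word, in alphabetical order, to an in-memory file.
buffer = io.StringIO()
for index, word in enumerate(sorted(toy_vocab)):
    print(f'{word}:{index}', file=buffer)

# Read the lines back into a word -> index dictionary.
# rsplit splits on the last ':' only, so the index is always the final field.
word_to_index = {}
for line in buffer.getvalue().splitlines():
    word, index = line.rsplit(':', 1)
    word_to_index[word] = int(index)

print(word_to_index)  # {'asp': 0, 'net': 1, 'server': 2, 'sql': 3}
```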

Now we will create the text file for bigrams.

In [24]:
# creating the top 10 bigrams file
with open('bigram.txt', 'w') as bigram_file:
    for bigram in top_10_bigrams:
        print(f'{" ".join(bigram[0])}, {bigram[1]}', file = bigram_file)

Now we have to create the job advertisements file with the details mentioned in the assignment specification. As per the specification, following information has to be there.

- First line should be ID: {ID obtained from the filename}
- Second line should be Category: {Category of the advertisement}
- Third line should be Webindex: {Webindex from the text file}
- Fourth line should be unprocessed Title: {Title from the text file}
- Fifth line should be Description: {Description obtained by joining all the tokens}

First we will create the list of IDs extracted from the file names using a regular expression.

In [25]:
job_ids = [re.search(r'\d{5}', filename).group() for filename in job_ads.filenames]
# Printing the first 5 job_ids
job_ids[:5]

['14624', '31567', '50131', '31419', '47238']
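
Note that `re.search` scans the whole path, so it returns the first run of five digits wherever it occurs. With the `./data` folder used here that run is in the file name, but a parent folder containing digits would be matched first. The sketch below restricts the search to the file name; `extract_job_id` is a hypothetical helper, and the example paths and naming scheme are assumptions for illustration.

```python
import re

def extract_job_id(path):
    """Return the first run of five digits in the file name part of a path, or None."""
    file_name = re.split(r'[\\/]', path)[-1]  # handles both '/' and '\\' separators
    match = re.search(r'\d{5}', file_name)
    return match.group() if match else None

# Hypothetical paths; the real naming scheme is an assumption here.
print(extract_job_id('./data/IT/Job_14624.txt'))        # 14624
print(extract_job_id('./data_20231/IT/Job_31567.txt'))  # 31567
# Searching the whole path picks up the digits in the folder name instead.
print(re.search(r'\d{5}', './data_20231/IT/Job_31567.txt').group())  # 20231
```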

For the remaining fields we already have the data. The categories can be obtained from `target_label_map` and `job_ads.target`. Webindex and Title can be obtained from `job_ads_title_webindex_company`. For Description we have `tokenised_desc`. With all the data available, we create the file in the following code block.

In [26]:
with open('job_ads.txt', 'w') as job_ads_file:
    for i in range(len(job_ads.data)):
        print(f"ID: {job_ids[i]}\n"
              f"Category: {target_label_map[job_ads.target[i]]}\n"
              f"Webindex: {job_ads_title_webindex_company[i]['Webindex']}\n"
              f"Title: {job_ads_title_webindex_company[i]['Title']}\n"
              f"Description: {' '.join(tokenised_desc[i])}", file = job_ads_file)

## Summary

The assessment focused on pre-processing a large corpus of job advertisements. It showed that, before starting any pre-processing, we have to analyse the structure of the data files provided. Proceeding directly to the pre-processing steps would produce wrong data that is of little use in the end. The data set here consists of 55449 job advertisement text files divided among 8 categories. On further analysis we saw that the files cannot be pre-processed directly, as each contains several pieces of information: Title, Webindex, Company and Description. Only the Description has to be pre-processed, so we first extracted the Description from each file and then carried out the pre-processing.

A couple of things to consider for any pre-processing task are:
- Have an idea of the volume of the data.
- Analyse the data files and identify their structure.