#### author: Bharti Sinha

Libraries used
* pandas
* numpy
* nltk (`RegexpTokenizer`, `nltk.tokenize.sent_tokenize`, `nltk.util.ngrams`, `nltk.probability`)
* sklearn.datasets (`load_files`)
* itertools
* random
* re
* csv
* os


#Task: perform basic pre-processing on all the job descriptions in the provided data folder. The folder holds 55449 files spread over 8 job categories. The pre-processing steps required by the assignment are tokenization, converting all words to lowercase, removal of words of length 1, removal of stop words, removal of tokens that appear only once in the collection, and removal of the 50 most frequent tokens by document frequency. After pre-processing, we extract bigrams and save the pre-processed job descriptions and the vocabulary in txt files.


- [x] Tokenization
- [x] Convert tokens to Lower case
- [x] Remove words with length < 2
- [x] Remove Stop words
- [x] Remove words that appear only once in the document collection, based on term frequency
- [x] Remove the top 50 most frequent words based on document frequency.

## Importing libraries 

In [1]:
# scientific computation
import numpy as np
import pandas as pd  
import os

# for natural language processing
import nltk
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from itertools import chain
from nltk.util import ngrams

# to read data folders
from sklearn.datasets import load_files 

# to check random data
import random

from nltk.probability import *

# for regex patterns
import re

# to write csv file
import csv

# 1. Examining and loading data

- This section examines the data folder, including the categories and the job advertisement txt documents. 
- The findings are reported at the end of the section. 
- Before doing any pre-processing, we need to load the data into a proper format.
- Since the given dataset is well organised (one sub-folder per category), we use the `load_files` function of the sklearn library.
- `load_files` returns a dictionary-like `Bunch` object whose keys can also be read as attributes.


In [2]:
# load the data folder

job_data = load_files(r"data")  

The loaded `job_data` is a dictionary-like object with the following attributes:
* `data` - a list of job ad descriptions (the raw file contents)
* `target` - the integer category label of each job ad
* `target_names` - the category names
* `filenames` - the path of each file in the dataset

In [3]:
# check the dictionary keys

job_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
# the length of data is the number of files in the data folder
# context: there are 55449 job advertisement txt files

len(job_data["data"])

55449

In [5]:
# data is a list holding the raw text of each of the 55449 job ads
job_data["data"][3:6]

[b'Title: Inpatient Ward Team Leader\nWebindex: 64752715\nDescription: Skills: Inpatient Ward Team Leader Description: In Patient Team Leader  Southampton  Permanent  up to **** We currently have an opportunity for a senior registered general nurse to join an established team within a inpatient ward at a Private Hospital located in Southampton. This is a full time permanent role offering a salary in the region of **** to **** You will be UK experienced and have a background in surgical nursing. You will have previous experience of leading a team of staff and being accountable for senior duties within a ward environment. This is a day duty position, although as a senior member of staff you will be required to make yourself available to work around the needs of the ward. You will be responsible for pre and post assessing patients and providing them with a holistic package of care through to discharge, to include clinical assessments. To apply and for a job description, please call Isobel

In [6]:
job_data["data"][1]

b'Title: Residential Care Worker\nWebindex: 66314490\nCompany: Timeout Children\xe2\x80\x99s Homes Ltd\nDescription: Timeout Children s Homes Ltd are a rapidly expanding company at the forefront of therapeutic care for young people aged 1017 years, who have experienced emotional behavioural difficulties in their lives. We are looking to recruit Residential Care Workers to be based in our homes in the Swindon area. The successful candidates will need to work collaboratively and cooperatively with other Timeout staff, young people and external agencies. They will also be required to work in consultation with families, social workers, and YOT s and any other professionals involved with the young person including the education team, to deliver effective educational programmes. Successful applicants are required to provide an enhanced disclosure. Disclosure expense will be met by employer. To apply click on the apply button where you will be redirected to our site to complete an application

In [7]:
# lists all the sub-folders
# in this case, the sub-folders are the category names

job_data['target_names']

['Accounting_Finance',
 'Engineering',
 'Healthcare_Nursing',
 'Hospitality_Catering',
 'IT',
 'PR_Advertising_Marketing',
 'Sales',
 'Teaching']

In [8]:
# display the file path of each job description

job_data['filenames']

array(['data/Engineering/Job_14624.txt',
       'data/Healthcare_Nursing/Job_31567.txt',
       'data/Hospitality_Catering/Job_50131.txt', ...,
       'data/IT/Job_13401.txt',
       'data/PR_Advertising_Marketing/Job_52696.txt',
       'data/Accounting_Finance/Job_25296.txt'], dtype='<U43')

In [9]:
# list the first 15 values of filenames
# this is later used to check that each file has its corresponding target
job_data['filenames'][:15]

array(['data/Engineering/Job_14624.txt',
       'data/Healthcare_Nursing/Job_31567.txt',
       'data/Hospitality_Catering/Job_50131.txt',
       'data/Healthcare_Nursing/Job_31419.txt',
       'data/Teaching/Job_47238.txt',
       'data/Healthcare_Nursing/Job_36205.txt',
       'data/Healthcare_Nursing/Job_30175.txt',
       'data/PR_Advertising_Marketing/Job_54070.txt',
       'data/IT/Job_05781.txt', 'data/Sales/Job_44021.txt',
       'data/IT/Job_02465.txt', 'data/IT/Job_13082.txt',
       'data/Healthcare_Nursing/Job_36404.txt', 'data/IT/Job_13998.txt',
       'data/Hospitality_Catering/Job_51419.txt'], dtype='<U43')

In [10]:
# there are 55449 job description files

job_data['filenames'].shape

(55449,)

In [11]:
# the category (integer) which the job description belongs to 
job_data['target']

array([1, 2, 3, ..., 4, 5, 0])

In [12]:
# the values show that the first 15 files in filenames have their corresponding targets

# ['Accounting_Finance',       ---> target 0
#  'Engineering',              ---> target 1
#  'Healthcare_Nursing',       ---> target 2
#  'Hospitality_Catering',     ---> target 3
#  'IT',                       ---> target 4
#  'PR_Advertising_Marketing', ---> target 5
#  'Sales',                    ---> target 6
#  'Teaching']                 ---> target 7

job_data['target'][:15]

array([1, 2, 3, 2, 7, 2, 2, 5, 4, 6, 4, 4, 2, 4, 3])

In [13]:
# the target array holds a value for each txt file 
# the value corresponds to a category

job_data['target'].shape

(55449,)

In [14]:
# check all unique values of the target array
set(job_data['target'].tolist())

{0, 1, 2, 3, 4, 5, 6, 7}

In [15]:
# the file at index 15 sits in the IT folder and its target is 4
# the mapping above shows that the IT category has label 4
# so the filename and the target agree for this file
job_data['filenames'][15], job_data['target'][15]

('data/IT/Job_07060.txt', 4)
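The spot check above looks at a single file. A minimal sketch of checking every file at once, assuming the folder name in each path is the category. The three paths and targets are copied from the outputs above, and `find_label_mismatches` is a hypothetical helper, not part of sklearn:

```python
import os

# Small stand-ins for job_data['filenames'], job_data['target'] and
# job_data['target_names'] (values copied from the outputs above).
filenames = [
    "data/Engineering/Job_14624.txt",
    "data/Healthcare_Nursing/Job_31567.txt",
    "data/IT/Job_07060.txt",
]
targets = [1, 2, 4]
target_names = [
    "Accounting_Finance", "Engineering", "Healthcare_Nursing",
    "Hospitality_Catering", "IT", "PR_Advertising_Marketing",
    "Sales", "Teaching",
]


def find_label_mismatches(filenames, targets, target_names):
    """Return (path, target) pairs whose folder name disagrees with the target."""
    mismatches = []
    for path, target in zip(filenames, targets):
        # the category is the name of the folder that holds the file
        folder = os.path.basename(os.path.dirname(path))
        if folder != target_names[target]:
            mismatches.append((path, target))
    return mismatches


mismatches = find_label_mismatches(filenames, targets, target_names)
print(len(mismatches))
```

With the real data the same helper could be called as `find_label_mismatches(job_data['filenames'], job_data['target'], job_data['target_names'])`; an empty result would mean every file matches its target.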

In [16]:
# how many job ads under each category

# accounting jobs
acc = job_data['target'].tolist().count(0)

# engineering jobs
eng = job_data['target'].tolist().count(1)

# healthcare jobs
health = job_data['target'].tolist().count(2)

# hospitality jobs
hosp = job_data['target'].tolist().count(3)

# IT jobs
it = job_data['target'].tolist().count(4)

# PR jobs
pr = job_data['target'].tolist().count(5)

# sales job
sale = job_data['target'].tolist().count(6)

# teaching job
teach = job_data['target'].tolist().count(7)



print(f"There are {acc} Accounting_Finance jobs.")
print(f"There are {eng} Engineering jobs.")
print(f"There are {health} Healthcare_Nursing jobs.")
print(f"There are {hosp} Hospitality_Catering jobs.")
print(f"There are {it} IT jobs.")
print(f"There are {pr} PR_Advertising_Marketing jobs.")
print(f"There are {sale} Sales jobs.")
print(f"There are {teach} Teaching jobs.")

print("total jobs: ", acc + eng + health + hosp + it + pr + sale + teach)

There are 7407 Accounting_Finance jobs.
There are 8210 Engineering jobs.
There are 8808 Healthcare_Nursing jobs.
There are 4788 Hospitality_Catering jobs.
There are 14353 IT jobs.
There are 2755 PR_Advertising_Marketing jobs.
There are 5349 Sales jobs.
There are 3779 Teaching jobs.
total jobs:  55449
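The eight separate `count` calls above each scan the whole target list. A more compact sketch of the same tally uses `collections.Counter` from the standard library. The targets below are made-up values and `count_per_category` is a hypothetical helper:

```python
from collections import Counter

# Illustrative stand-ins; with the real data these would be
# job_data['target_names'] and job_data['target'].tolist().
target_names = ["Accounting_Finance", "Engineering", "Healthcare_Nursing"]
targets = [1, 2, 0, 2, 1, 2]


def count_per_category(targets, target_names):
    """Map each category name to the number of targets carrying its label."""
    counts = Counter(targets)  # a single pass over the labels
    return {name: counts.get(label, 0) for label, name in enumerate(target_names)}


per_category = count_per_category(targets, target_names)
for name, n in per_category.items():
    print(f"There are {n} {name} jobs.")
print("total jobs: ", sum(per_category.values()))
```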


In [17]:
# store all job description contents into job_txt_contents
# store their corresponding categories into job_txt_category

job_txt_contents, job_txt_category = job_data.data, job_data.target  

In [18]:
x = random.choice(range(55449))
x

27953

In [19]:
# check contents of a job file at random
# the contents are of type bytes, not str
job_txt_contents[x]

b'Title: ISEB / ISTQB Test Manager  QTP\nWebindex: 72394615\nCompany: Definitive\nDescription: ISEB/ISTQB Testing Manager, Guildford area, ****K  Bens. You must have a background as a Testing Manager in the software development sector with experience of managing a testing department within an organisation. ISEB/ISTQB Testing Manager, Guildford area, ****K  Bens. You must have a background as a Testing Manager in the software development sector with experience of managing a testing department within an organisation. As a Testing Manager you will need to have experience of the following: &x**** Experience of using test tools, enterprise and open source. &x**** Excellent interpersonal skills to communicate at all levels and manage key stakeholders with regards to risk and testing progress. &x**** Thorough knowledge of structured test methods and processes. &x**** Detailed knowledge of both Manual and Automated Testing. &x**** Experience of testing using an automated test tool such as Sele

In [20]:
# the text is in bytes format

type(job_txt_contents[x])

bytes

In [21]:
job_txt_category[x]

4

<span style='font-family:Georgia;color:Blue;background:yellow'>Findings: The data folder contains following information:</span>

  -- There are 8 folders inside the data folder. Each of these sub-folders corresponds to a category of job advertisements.
  
  -- There are 55449 job files stored in txt format.
  
  -- Each of the job txt files belongs to one of the 8 categories
  
  1. `Accounting_Finance`,                             ---> target 0
  
  2. `Engineering`,                                    ---> target 1
  
  3. `Healthcare_Nursing`,                             ---> target 2
  
  4. `Hospitality_Catering`,                           ---> target 3
        
  5. `IT`,                                             ---> target 4
  
  6. `PR_Advertising_Marketing`,                       ---> target 5
  
  7. `Sales`,                                          ---> target 6
  
  8. `Teaching`                                        ---> target 7
  
  
      -- Each `Job_<ID>.txt` file has a job title, a webindex and a description. Additionally, some job files have a company name. 

      -- There are 7407 Accounting_Finance jobs.

      -- There are 8210 Engineering jobs.

      -- There are 8808 Healthcare_Nursing jobs.

      -- There are 4788 Hospitality_Catering jobs.

      -- There are 14353 IT jobs.

      -- There are 2755 PR_Advertising_Marketing jobs.

      -- There are 5349 Sales jobs.

      -- There are 3779 Teaching jobs.
  
  
  -- The job advertisements are loaded as bytes, not as strings.
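Because the contents are bytes, they have to be decoded before any text processing. A small sketch of what decoding does, using a fragment copied from one of the job files shown above. The `damaged` value is a made-up example of bytes that are not valid UTF-8:

```python
# Fragment of a job file as loaded: bytes, with a UTF-8 encoded apostrophe.
raw = b"Company: Timeout Children\xe2\x80\x99s Homes Ltd"

# decode() turns bytes into str; \xe2\x80\x99 becomes one character (U+2019).
text = raw.decode("utf-8")
print(type(raw).__name__, "->", type(text).__name__)
print(text)

# Hypothetical case: bytes that are not valid UTF-8 raise UnicodeDecodeError
# by default. errors="replace" substitutes U+FFFD instead of failing.
damaged = b"caf\xe9"
safe_text = damaged.decode("utf-8", errors="replace")
print(safe_text)
```

Whether to fail loudly or replace bad bytes is a design choice; failing loudly is usually safer on a first pass because it exposes encoding problems in the data.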
  

# 2. Save details mentioned below in appropriate data structures and txt files
- descriptions
- job id 
- category
- webindex
- un-processed title

## 2.1 ----------------------------------- Job Descriptions -------------------------------------

In [22]:
# extract all the job descriptions
job_txt_contents[1]

b'Title: Residential Care Worker\nWebindex: 66314490\nCompany: Timeout Children\xe2\x80\x99s Homes Ltd\nDescription: Timeout Children s Homes Ltd are a rapidly expanding company at the forefront of therapeutic care for young people aged 1017 years, who have experienced emotional behavioural difficulties in their lives. We are looking to recruit Residential Care Workers to be based in our homes in the Swindon area. The successful candidates will need to work collaboratively and cooperatively with other Timeout staff, young people and external agencies. They will also be required to work in consultation with families, social workers, and YOT s and any other professionals involved with the young person including the education team, to deliver effective educational programmes. Successful applicants are required to provide an enhanced disclosure. Disclosure expense will be met by employer. To apply click on the apply button where you will be redirected to our site to complete an application

In [23]:
# make a list containing only job descriptions from the data list 
job_descriptions = []

for job in job_txt_contents:
    
    # convert to string
    desc = job.decode('utf-8')
    
    # find the index where description starts
    result = desc.find('Description:')
    
    # append to the list everything that comes after 'Description: '
    # 'Description: ' (including the trailing space) is 13 characters long
    job_descriptions.append(desc[result+13:])   
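The loop above relies on every file containing `Description:`; if the marker were missing, `find` would return -1 and the slice would silently start at index 12. A hedged alternative sketch uses `str.partition`, which reports whether the marker was found. `extract_description` is a hypothetical helper and the sample bytes are made up:

```python
def extract_description(raw_bytes, marker="Description:"):
    """Return the text after the first marker, or '' if the marker is absent."""
    text = raw_bytes.decode("utf-8")
    # partition splits at the FIRST occurrence, like str.find in the loop above
    _, found, rest = text.partition(marker)
    if not found:
        return ""
    # strip() drops the space after the marker and any trailing whitespace
    return rest.strip()


# Made-up sample in the same layout as the job files.
sample = b"Title: Example Role\nWebindex: 12345678\nDescription: Skills: A Description: B"
print(extract_description(sample))
print(repr(extract_description(b"Title: Example Role\nWebindex: 12345678")))
```

Note that `strip()` also removes trailing whitespace, which the original slice keeps, so the two approaches can differ by a trailing newline.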

In [24]:
# save all the job_descriptions in a txt file 

with open("job_descriptions.txt", "w") as file:
    
    for i in range(len(job_descriptions)):
        
        file.write(job_descriptions[i])
        file.write("\n")
      

In [25]:
# job_txt_contents[1] shows the entire content of a job file
# job_descriptions[1] shows that the description of this job has been correctly extracted into the separate list

job_descriptions[1]

'Timeout Children s Homes Ltd are a rapidly expanding company at the forefront of therapeutic care for young people aged 1017 years, who have experienced emotional behavioural difficulties in their lives. We are looking to recruit Residential Care Workers to be based in our homes in the Swindon area. The successful candidates will need to work collaboratively and cooperatively with other Timeout staff, young people and external agencies. They will also be required to work in consultation with families, social workers, and YOT s and any other professionals involved with the young person including the education team, to deliver effective educational programmes. Successful applicants are required to provide an enhanced disclosure. Disclosure expense will be met by employer. To apply click on the apply button where you will be redirected to our site to complete an application form.'

In [26]:
print(job_txt_contents[3])
print("-"*110 + "\n")
print(job_txt_contents[500])
print("-"*110 + "\n")
print(job_txt_contents[10000])
print("-"*110 + "\n")
print(job_txt_contents[20000])
print("-"*110 + "\n")
print(job_txt_contents[30000])
print("-"*110 + "\n")


b'Title: Inpatient Ward Team Leader\nWebindex: 64752715\nDescription: Skills: Inpatient Ward Team Leader Description: In Patient Team Leader  Southampton  Permanent  up to **** We currently have an opportunity for a senior registered general nurse to join an established team within a inpatient ward at a Private Hospital located in Southampton. This is a full time permanent role offering a salary in the region of **** to **** You will be UK experienced and have a background in surgical nursing. You will have previous experience of leading a team of staff and being accountable for senior duties within a ward environment. This is a day duty position, although as a senior member of staff you will be required to make yourself available to work around the needs of the ward. You will be responsible for pre and post assessing patients and providing them with a holistic package of care through to discharge, to include clinical assessments. To apply and for a job description, please call Isobell

In [27]:
print(job_descriptions[3])
print("-"*110 + "\n")
print(job_descriptions[500])
print("-"*110 + "\n")
print(job_descriptions[10000])
print("-"*110 + "\n")
print(job_descriptions[20000])
print("-"*110 + "\n")
print(job_descriptions[30000])
print("-"*110 + "\n")

Skills: Inpatient Ward Team Leader Description: In Patient Team Leader  Southampton  Permanent  up to **** We currently have an opportunity for a senior registered general nurse to join an established team within a inpatient ward at a Private Hospital located in Southampton. This is a full time permanent role offering a salary in the region of **** to **** You will be UK experienced and have a background in surgical nursing. You will have previous experience of leading a team of staff and being accountable for senior duties within a ward environment. This is a day duty position, although as a senior member of staff you will be required to make yourself available to work around the needs of the ward. You will be responsible for pre and post assessing patients and providing them with a holistic package of care through to discharge, to include clinical assessments. To apply and for a job description, please call Isobelle at STR Health on **** **** **** or email your CV to ifishstrgroup.co

<span style='font-family:Georgia;color:Blue;background:yellow'>Observation:</span>

-- a job file may contain two descriptions, for example job_descriptions[3]

-- a job description may also contain the job title, responsibilities, required skills and role, for example job_descriptions[500]



## 2.2 -------------------------------------------- ID -------------------------------------------

In [28]:
# check the pattern that can fetch the id information from the filenames
# this is an example
string = "data/Engineering/Job_14624.txt"
s = re.search(r'\d{5}', string) 
s.group()

'14624'

In [29]:
# fetch all the ids from the txt filenames into a list data structure
job_ids = []

for filename in job_data['filenames']:
    #print(filename)
    s = re.search(r'\d{5}', filename) 
    job_ids.append(s.group())
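The pattern `\d{5}` takes the first run of five digits anywhere in the path, which works here because only the file name contains digits. A slightly stricter sketch ties the match to the `Job_<ID>.txt` file name. `extract_job_id` and the `archive_20231` path are hypothetical:

```python
import os
import re

# Match the digits between "Job_" and ".txt" at the end of the file name.
JOB_ID_PATTERN = re.compile(r"Job_(\d+)\.txt$")


def extract_job_id(path):
    """Return the id part of a Job_<ID>.txt file name, or None if it does not match."""
    match = JOB_ID_PATTERN.search(os.path.basename(path))
    return match.group(1) if match else None


print(extract_job_id("data/IT/Job_05781.txt"))

# Hypothetical path whose folder name also contains five digits:
tricky = "archive_20231/IT/Job_00042.txt"
print(re.search(r"\d{5}", tricky).group())  # picks up the folder digits
print(extract_job_id(tricky))               # picks up the file id
```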

In [30]:
# check if all ids were correctly extracted 
job_ids

['14624',
 '31567',
 '50131',
 '31419',
 '47238',
 '36205',
 '30175',
 '54070',
 '05781',
 '44021',
 '02465',
 '13082',
 '36404',
 '13998',
 '51419',
 '07060',
 '18399',
 '39194',
 '44876',
 '47390',
 '08890',
 '02696',
 '00468',
 '39671',
 '52467',
 '31620',
 '32012',
 '06821',
 '13089',
 '05164',
 '01998',
 '20455',
 '46705',
 '15154',
 '50111',
 '52754',
 '09944',
 '18991',
 '10976',
 '45145',
 '30000',
 '09713',
 '50285',
 '43066',
 '43015',
 '50205',
 '49829',
 '05736',
 '13136',
 '51362',
 '09825',
 '31939',
 '30884',
 '14843',
 '42277',
 '31834',
 '28281',
 '35064',
 '53016',
 '06594',
 '46647',
 '23599',
 '53176',
 '20842',
 '54347',
 '21895',
 '04243',
 '04431',
 '39265',
 '04778',
 '28251',
 '00205',
 '25094',
 '21655',
 '54708',
 '11778',
 '19802',
 '15400',
 '47553',
 '35715',
 '29910',
 '11775',
 '12634',
 '20517',
 '01946',
 '40929',
 '28502',
 '11197',
 '21182',
 '25544',
 '04297',
 '04836',
 '47960',
 '18773',
 '43268',
 '18032',
 '28013',
 '00835',
 '22435',
 '18149',


In [31]:
job_ids[7]

'54070'

In [32]:
job_ids[10:17]

['02465', '13082', '36404', '13998', '51419', '07060', '18399']

In [33]:
job_data['filenames'][10:17]

array(['data/IT/Job_02465.txt', 'data/IT/Job_13082.txt',
       'data/Healthcare_Nursing/Job_36404.txt', 'data/IT/Job_13998.txt',
       'data/Hospitality_Catering/Job_51419.txt', 'data/IT/Job_07060.txt',
       'data/Engineering/Job_18399.txt'], dtype='<U43')

In [34]:
len(job_ids)

55449

<span style='font-family:Georgia;color:Blue;background:yellow'>Observation</span>

There are 55449 job ids and they have been correctly extracted.

Each job id is of length 5

## 2.3 --------------------------------------------- Category ----------------------------------------------

In [35]:
# check the pattern that can fetch the category information from the filenames
# this is an example
string = "data/Engineering/Job_14624.txt"
s = re.search(r'data/(\w+)/Job_', string) 
print(s.group())
print(s.group(1))

data/Engineering/Job_
Engineering


In [36]:
# extract all the categories corresponding to each job file
job_categories = []

for filename in job_data['filenames']:
    #print(filename)
    s = re.search(r'data/(\w+)/Job_', filename) 
    job_categories.append(s.group(1))
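The regex above assumes the root folder is called `data` and that category names contain only word characters. As an alternative sketch, the category can be read as the name of the parent folder. This assumes forward-slash paths like the ones shown above; `extract_category` and the `Part-Time Jobs` folder are hypothetical:

```python
from pathlib import PurePosixPath


def extract_category(path):
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(path).parent.name


print(extract_category("data/Engineering/Job_14624.txt"))
print(extract_category("data/Healthcare_Nursing/Job_31567.txt"))

# Hypothetical folder name with a hyphen and a space, which \w+ would not match:
print(extract_category("data/Part-Time Jobs/Job_00001.txt"))
```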

In [37]:
job_categories

['Engineering',
 'Healthcare_Nursing',
 'Hospitality_Catering',
 'Healthcare_Nursing',
 'Teaching',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'PR_Advertising_Marketing',
 'IT',
 'Sales',
 'IT',
 'IT',
 'Healthcare_Nursing',
 'IT',
 'Hospitality_Catering',
 'IT',
 'Engineering',
 'Sales',
 'Teaching',
 'Teaching',
 'IT',
 'IT',
 'IT',
 'Sales',
 'Hospitality_Catering',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'IT',
 'IT',
 'IT',
 'IT',
 'Engineering',
 'Teaching',
 'Engineering',
 'Hospitality_Catering',
 'PR_Advertising_Marketing',
 'IT',
 'Engineering',
 'IT',
 'Teaching',
 'Healthcare_Nursing',
 'IT',
 'Hospitality_Catering',
 'Sales',
 'Sales',
 'Hospitality_Catering',
 'Hospitality_Catering',
 'IT',
 'IT',
 'Hospitality_Catering',
 'IT',
 'Healthcare_Nursing',
 'Healthcare_Nursing',
 'Engineering',
 'Sales',
 'Healthcare_Nursing',
 'Accounting_Finance',
 'Healthcare_Nursing',
 'PR_Advertising_Marketing',
 'IT',
 'Teaching',
 'Accounting_Finance',
 'PR_Advertising_Marketi

In [38]:
# there are 8 categories
set(job_categories)

{'Accounting_Finance',
 'Engineering',
 'Healthcare_Nursing',
 'Hospitality_Catering',
 'IT',
 'PR_Advertising_Marketing',
 'Sales',
 'Teaching'}

In [39]:
# avoid naming the variable `dict`, which would shadow the built-in type
id_category = {'Job_ID':job_ids, "Category":job_categories}

df = pd.DataFrame(id_category) 
    
# saving the dataframe 
df.to_csv('ids_categories.csv') 

In [40]:
df

Unnamed: 0,Job_ID,Category
0,14624,Engineering
1,31567,Healthcare_Nursing
2,50131,Hospitality_Catering
3,31419,Healthcare_Nursing
4,47238,Teaching
...,...,...
55444,55020,PR_Advertising_Marketing
55445,44874,Teaching
55446,13401,IT
55447,52696,PR_Advertising_Marketing


In [41]:
mylist = list(zip(job_ids, job_categories))

In [42]:
mylist

[('14624', 'Engineering'),
 ('31567', 'Healthcare_Nursing'),
 ('50131', 'Hospitality_Catering'),
 ('31419', 'Healthcare_Nursing'),
 ('47238', 'Teaching'),
 ('36205', 'Healthcare_Nursing'),
 ('30175', 'Healthcare_Nursing'),
 ('54070', 'PR_Advertising_Marketing'),
 ('05781', 'IT'),
 ('44021', 'Sales'),
 ('02465', 'IT'),
 ('13082', 'IT'),
 ('36404', 'Healthcare_Nursing'),
 ('13998', 'IT'),
 ('51419', 'Hospitality_Catering'),
 ('07060', 'IT'),
 ('18399', 'Engineering'),
 ('39194', 'Sales'),
 ('44876', 'Teaching'),
 ('47390', 'Teaching'),
 ('08890', 'IT'),
 ('02696', 'IT'),
 ('00468', 'IT'),
 ('39671', 'Sales'),
 ('52467', 'Hospitality_Catering'),
 ('31620', 'Healthcare_Nursing'),
 ('32012', 'Healthcare_Nursing'),
 ('06821', 'IT'),
 ('13089', 'IT'),
 ('05164', 'IT'),
 ('01998', 'IT'),
 ('20455', 'Engineering'),
 ('46705', 'Teaching'),
 ('15154', 'Engineering'),
 ('50111', 'Hospitality_Catering'),
 ('52754', 'PR_Advertising_Marketing'),
 ('09944', 'IT'),
 ('18991', 'Engineering'),
 ('10976',

In [43]:
# write a csv file with job id and category as columns

# newline='' is the setting recommended by the csv module documentation
# the with block closes the file, so no explicit close() is needed
with open("labels.csv", 'w', newline='') as myfile:
    write = csv.writer(myfile)
    write.writerow(["jobID", "Category"])
    
    for job_id, category in mylist:
        write.writerow([job_id, category])



In [44]:
# save all job_id in a txt file
# the with block closes the file, so no explicit close() is needed
with open("job_IDs.txt", "w") as f:
    
    for element in job_ids:
        f.write(element + "\n")

## 2.4 -------------------------------------- WEB INDICES ----------------------------------------

In [45]:
# web index is part of text document
job_txt_contents[1]

b'Title: Residential Care Worker\nWebindex: 66314490\nCompany: Timeout Children\xe2\x80\x99s Homes Ltd\nDescription: Timeout Children s Homes Ltd are a rapidly expanding company at the forefront of therapeutic care for young people aged 1017 years, who have experienced emotional behavioural difficulties in their lives. We are looking to recruit Residential Care Workers to be based in our homes in the Swindon area. The successful candidates will need to work collaboratively and cooperatively with other Timeout staff, young people and external agencies. They will also be required to work in consultation with families, social workers, and YOT s and any other professionals involved with the young person including the education team, to deliver effective educational programmes. Successful applicants are required to provide an enhanced disclosure. Disclosure expense will be met by employer. To apply click on the apply button where you will be redirected to our site to complete an application

In [46]:
# convert to string
desc = job_txt_contents[1].decode('utf-8')

# find the index where the 'Webindex: ' field starts
result = desc.find('Webindex: ')
    
result

31

In [47]:
desc[result+10: (result+18)]

'66314490'

In [48]:
# fetch all the webindices from all job txt files and store it in list data structure

job_web_indices = []

for job in job_txt_contents:
    
    # convert to string
    text = job.decode('utf-8')
    
    # find the index where webindex starts
    result = text.find('Webindex: ')
    
    # append to the list the 8 digits that come after 'Webindex: '
    job_web_indices.append(text[result+10: (result+18)])  
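The slice above assumes every web index has exactly 8 digits, which the length check further down confirms for this dataset. A sketch that does not depend on a fixed length captures all digits on the `Webindex:` line instead. `extract_webindex` is a hypothetical helper and the sample is shortened from a job file shown above:

```python
import re

# Capture the digits on the line that starts with "Webindex:".
WEBINDEX_PATTERN = re.compile(r"^Webindex:\s*(\d+)\s*$", re.MULTILINE)


def extract_webindex(raw_bytes):
    """Return the web index as a string, or None if the field is missing."""
    text = raw_bytes.decode("utf-8")
    match = WEBINDEX_PATTERN.search(text)
    return match.group(1) if match else None


sample = b"Title: Residential Care Worker\nWebindex: 66314490\nDescription: ..."
print(extract_webindex(sample))
print(extract_webindex(b"Title: No index here\nDescription: ..."))
```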

In [49]:
len(job_web_indices)

55449

In [50]:
# observe a few elements
job_web_indices[100:110]

['71225325',
 '66745598',
 '71804058',
 '66923286',
 '68097413',
 '67638904',
 '68308963',
 '72448231',
 '69088898',
 '71838226']

In [51]:
# check if any of the webindices are not of length 8

count = 0
for item in job_web_indices:
    if len(item) != 8:
        count +=1
        
print(f"There are {count} web indices which are not of length 8.")

There are 0 web indices which are not of length 8.


In [52]:
# save all web indices in txt file
with open("web_indices.txt", "w") as file:
    
    for i in range(0, len(job_web_indices)):
        file.write(job_web_indices[i] + "\n")

## 2.5 -------------------------------------- TITLE ----------------------------------------

In [53]:
# convert to string
text2 = job_txt_contents[10000].decode('utf-8')

# find the index where the title starts and where the webindex field starts
start = text2.find('Title: ')
end = text2.find('Webindex: ')
    
    
print(start)
print(end)

0
59


In [54]:
job_txt_contents[10000]

b'Title: Java Developer  Woking  ****  ****  Benefits  Bonus\nWebindex: 70782772\nCompany: Cornucopia IT Resourcing\nDescription: Java Developer  Woking  ****  ****  Benefits  Bonus. Developer, Programmer, Java, J****EE, Glassfish, Weblogic, Spring, Swing, JSF, JSP, Servlets, EJB, JDBC, Hibernate Cornucopia is currently working with a very highly regarded company which has an immediate requirement for a Java Developer to join their Leeds team on permanent basis. My client is looking for a Java Developer with the following experience: University Degree Proven Java development experience Experience with JSF, JSP, Servlets, EJB, JDBC, Glassfish, WebLogic, Spring, Swing or Hibernate Experience with at least one relational database (ideally Oracle) Excellent English communication skills The company offers an excellent working environment with genuine career progression opportunities and a chance to work on a number of high profile projects. This job was originally posted as www.cwjobs.co.uk

In [55]:
text2[start+7: end - 1]

'Java Developer  Woking  ****  ****  Benefits  Bonus'

In [56]:
job_title = []

In [57]:
# fetch all the titles from all job txt files into list data structure

job_title = []

for job in job_txt_contents:
    
    # convert to string
    text = job.decode('utf-8')
    
    # find the index where Title starts
    start = text.find('Title: ')
    
    # find the index where the Webindex field starts
    end = text.find('Webindex: ')
    
    # append to the list 
    job_title.append(text[start+7:end-1])  
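The slice `text[start+7:end-1]` works because `Webindex:` directly follows the title line in every file seen so far. A sketch that reads the title from its own line, without depending on what follows it, is given here. `extract_title` is a hypothetical helper and the sample is shortened from the job file shown above:

```python
def extract_title(raw_bytes):
    """Return the text after 'Title:' on its line, or None if there is no title line."""
    text = raw_bytes.decode("utf-8")
    for line in text.split("\n"):
        if line.startswith("Title:"):
            # drop the field name, then surrounding whitespace
            return line[len("Title:"):].strip()
    return None


sample = (
    b"Title: Java Developer  Woking  ****  ****  Benefits  Bonus\n"
    b"Webindex: 70782772\n"
    b"Company: Cornucopia IT Resourcing\n"
    b"Description: Java Developer"
)
print(extract_title(sample))
```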

In [58]:
job_title

['Plant Engineer',
 'Residential Care Worker',
 'CHEF DE RANG FOR MICHELIN STARRED RESTAURANT',
 'Inpatient Ward Team Leader',
 'Information Services Support Analyst',
 'Physiotherapist  Cambridge',
 'Youth Justice Officer Job Luton',
 'Front End Web Developer  Slough  ****k****k',
 'NET / Web Developer  Clayton West, Huddersfield, West Yorkshire',
 'Senior Account Manager (Telecoms / IT)',
 'IT Project Manager (C/ASPnet) ****K Leicester',
 'Cisco Call Manager Technician  Greater Manchester',
 'Senior Nurse  Small Specialist Care Home',
 'Senior Oracle DBA/Database AdministratorNorth Yorkshire',
 'Cluster Sales Manager  Maternity Cover',
 'Software Build Manager, MSBuild',
 'Electrical Fitter  Crawley ,Sussex',
 'Sales Manager Door Automation',
 'Year ****/6 Teacher, Cannock',
 'Experienced KS****/KS**** Science teacher Braintree',
 'Design Engineer  Software and Firmware (Python)',
 'C Developer  Top Asset Management Firm',
 'Senior Javascript developer',
 'RESEARCHER, BOLTON',
 'Seni

In [59]:
# save all the job titles into a txt file 
with open("job_title.txt", "w") as file:
    
    for i in range(0, len(job_title)):
        file.write(job_title[i] + "\n")

# 3. Pre-processing data



 - 3.1 Tokenize each job advertisement description. The word tokenization must use the following regular expression,
          r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    
    
 - 3.2 All the words must be converted into the lower case
 
    
    
 - 3.3 Remove words with length less than 2.
 
    
    
 - 3.4 Remove stopwords using the provided stop words list (i.e, stopwords_en.txt).
 


 - 3.5 Remove words that appear only once in the document collection, based on term frequency.
 
    
    
 - 3.6 Remove the top 50 most frequent words based on document frequency.
 
    
    
 - 3.7 Extract the top 10 Bigrams based on term frequency, save them as a txt file 
 
    
    
 - 3.8 Save all job advertisement text and information in a txt file
 
    
    
 - 3.9 Build a vocabulary of the cleaned job advertisement descriptions, save it in a txt file 
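Steps 3.5 and 3.6 use two different counts: term frequency counts every occurrence of a word in the collection, while document frequency counts the number of documents that contain it. A toy sketch of the difference, using `collections.Counter` as a stand-in for whatever frequency tool the notebook uses later. The three token lists are made up, `top_k = 1` replaces the real threshold of 50, and computing document frequency after step 3.5 is an assumption based on the order of the list:

```python
from collections import Counter
from itertools import chain

# Made-up, already tokenized and lower-cased descriptions.
docs = [
    ["java", "developer", "java", "team"],
    ["nurse", "team", "ward"],
    ["sales", "team", "developer"],
]

# Step 3.5: term frequency counts every occurrence across the collection.
term_freq = Counter(chain.from_iterable(docs))
once_only = {word for word, count in term_freq.items() if count == 1}
docs = [[w for w in doc if w not in once_only] for doc in docs]

# Step 3.6: document frequency counts each word at most once per document,
# hence the set() around each document.
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))
top_k = 1  # the assignment uses 50
most_frequent = {word for word, _ in doc_freq.most_common(top_k)}
cleaned = [[w for w in doc if w not in most_frequent] for doc in docs]

print(sorted(once_only))
print(most_frequent)
print(cleaned)
```

In this toy collection `java` has term frequency 2 but document frequency 1, which is why the two steps need separate counts.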


## 3.1 Tokenization

Tokenize each job advertisement description. The word tokenization must use the following regular expression, r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"



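Tokenization breaks text into pieces called tokens. Here it is performed after sentence segmentation: each description is first split into sentences, and each sentence is then tokenized. Because the regex pattern only matches alphabetic words (optionally joined by one hyphen or apostrophe), punctuation such as full stops is dropped, and so are digits.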
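To see what the required pattern keeps and drops, here is a small sketch with the standard `re` module. For a pattern without capturing groups, `re.findall` should return the same tokens as NLTK's `RegexpTokenizer`. The sample sentence and the `tokenize` helper are made up for illustration:

```python
import re

# The pattern required by the assignment: letters, optionally followed by
# ONE hyphen or apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")


def tokenize(text):
    """Return all non-overlapping matches of the token pattern, left to right."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("We're hiring a full-time nurse, aged 18+."))

# Only one hyphen is allowed per token, so longer compounds are split:
print(tokenize("state-of-the-art"))
```

Digits and punctuation never match the pattern, so `18+` and the commas and full stop disappear. Single-letter tokens such as `a` survive this step and are removed later in step 3.3.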




In [60]:
job_descriptions[1]

'Timeout Children s Homes Ltd are a rapidly expanding company at the forefront of therapeutic care for young people aged 1017 years, who have experienced emotional behavioural difficulties in their lives. We are looking to recruit Residential Care Workers to be based in our homes in the Swindon area. The successful candidates will need to work collaboratively and cooperatively with other Timeout staff, young people and external agencies. They will also be required to work in consultation with families, social workers, and YOT s and any other professionals involved with the young person including the education team, to deliver effective educational programmes. Successful applicants are required to provide an enhanced disclosure. Disclosure expense will be met by employer. To apply click on the apply button where you will be redirected to our site to complete an application form.'

In [61]:
# break the description into sentences
sentences = sent_tokenize(job_descriptions[1])
print(sentences) 

['Timeout Children s Homes Ltd are a rapidly expanding company at the forefront of therapeutic care for young people aged 1017 years, who have experienced emotional behavioural difficulties in their lives.', 'We are looking to recruit Residential Care Workers to be based in our homes in the Swindon area.', 'The successful candidates will need to work collaboratively and cooperatively with other Timeout staff, young people and external agencies.', 'They will also be required to work in consultation with families, social workers, and YOT s and any other professionals involved with the young person including the education team, to deliver effective educational programmes.', 'Successful applicants are required to provide an enhanced disclosure.', 'Disclosure expense will be met by employer.', 'To apply click on the apply button where you will be redirected to our site to complete an application form.']


In [62]:
# define pattern to tokenize the sentences
pattern = r'''[a-zA-Z]+(?:[-'][a-zA-Z]+)?'''

# initialise the RegexpTokenizer class with the pattern above
tokenizer = RegexpTokenizer(pattern)

print(tokenizer)
print("-----------------------------------------------------------------------------------------------------\n")

list_tokens = [tokenizer.tokenize(sen) for sen in sentences]
print(list_tokens)
print(len(list_tokens))

# flatten into a single list
tokenised_job_ad = list(chain.from_iterable(list_tokens))

print("------------------------------------------------------------------------------------------------------\n")
print(tokenised_job_ad)
print(len(tokenised_job_ad))

RegexpTokenizer(pattern="[a-zA-Z]+(?:[-'][a-zA-Z]+)?", gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL)
-----------------------------------------------------------------------------------------------------

[['Timeout', 'Children', 's', 'Homes', 'Ltd', 'are', 'a', 'rapidly', 'expanding', 'company', 'at', 'the', 'forefront', 'of', 'therapeutic', 'care', 'for', 'young', 'people', 'aged', 'years', 'who', 'have', 'experienced', 'emotional', 'behavioural', 'difficulties', 'in', 'their', 'lives'], ['We', 'are', 'looking', 'to', 'recruit', 'Residential', 'Care', 'Workers', 'to', 'be', 'based', 'in', 'our', 'homes', 'in', 'the', 'Swindon', 'area'], ['The', 'successful', 'candidates', 'will', 'need', 'to', 'work', 'collaboratively', 'and', 'cooperatively', 'with', 'other', 'Timeout', 'staff', 'young', 'people', 'and', 'external', 'agencies'], ['They', 'will', 'also', 'be', 'required', 'to', 'work', 'in', 'consultation', 'with', 'families', 'social', 'workers', 'and', 'YO

In [63]:
# function to segment a job description into sentences and then tokenize each sentence

def tokenizeAds(raw_job_ad):
    """
        Segment the raw job description into sentences, tokenize
        each sentence, and return the description as a single
        flat list of tokens.
    """        
    
    # perform sentence segmentation (i.e. segment each description text into sentences),
    sentences = sent_tokenize(raw_job_ad)
    
    # tokenize each sentence
    pattern = r'''[a-zA-Z]+(?:[-'][a-zA-Z]+)?'''

    tokenizer = RegexpTokenizer(pattern) 

    token_lists = [tokenizer.tokenize(sen) for sen in sentences]
    
    # merge them into a list of tokens
    tokenised_job_ad = list(chain.from_iterable(token_lists))
    
    return tokenised_job_ad

In [64]:
#call the method on all the job descriptions
tokenized_job_desc = [tokenizeAds(job) for job in job_descriptions]  

In [65]:
len(tokenized_job_desc)

55449

In [66]:
# observe a few elements
tokenized_job_desc[3:5]

[['Skills',
  'Inpatient',
  'Ward',
  'Team',
  'Leader',
  'Description',
  'In',
  'Patient',
  'Team',
  'Leader',
  'Southampton',
  'Permanent',
  'up',
  'to',
  'We',
  'currently',
  'have',
  'an',
  'opportunity',
  'for',
  'a',
  'senior',
  'registered',
  'general',
  'nurse',
  'to',
  'join',
  'an',
  'established',
  'team',
  'within',
  'a',
  'inpatient',
  'ward',
  'at',
  'a',
  'Private',
  'Hospital',
  'located',
  'in',
  'Southampton',
  'This',
  'is',
  'a',
  'full',
  'time',
  'permanent',
  'role',
  'offering',
  'a',
  'salary',
  'in',
  'the',
  'region',
  'of',
  'to',
  'You',
  'will',
  'be',
  'UK',
  'experienced',
  'and',
  'have',
  'a',
  'background',
  'in',
  'surgical',
  'nursing',
  'You',
  'will',
  'have',
  'previous',
  'experience',
  'of',
  'leading',
  'a',
  'team',
  'of',
  'staff',
  'and',
  'being',
  'accountable',
  'for',
  'senior',
  'duties',
  'within',
  'a',
  'ward',
  'environment',
  'This',
  'is',
  '

In [67]:
# function to print a few statistics about the tokenized job descriptions

def stats_print(tokenized_job_desc):
    
    # put all the tokens in the corpus in a single list
    words = list(chain.from_iterable(tokenized_job_desc)) 
    
    # compute the vocabulary by converting the list of words/tokens to a set,
    vocab = set(words) 
    
    # lexical diversity: number of unique words divided by the total number of tokens
    lexical_diversity = len(vocab)/len(words)
    
    lens = [len(job) for job in tokenized_job_desc]
    
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of job_ad: ", len(tokenized_job_desc))
    print("Average number of tokens in job descriptions: ", np.mean(lens))
    print("Maximun number of tokens in job descriptions: ", np.max(lens))
    print("Minimun number of tokens in job descriptions: ", np.min(lens))
    print("Standard deviation of number of tokens in job descriptions: ", np.std(lens))

In [68]:
stats_print(tokenized_job_desc)

Vocabulary size:  112580
Total number of tokens:  13799127
Lexical diversity:  0.008158487127482775
Total number of job_ad:  55449
Average number of tokens in job descriptions:  248.861602553698
Maximun number of tokens in job descriptions:  2001
Minimun number of tokens in job descriptions:  10
Standard deviation of number of tokens in job descriptions:  125.26507304982165


In [69]:
# check the un-tokenized and tokenized ads

print("Raw ad:\n",job_descriptions[15],'\n')

print("Tokenized ad:\n",tokenized_job_desc[15])

Raw ad:
 Software Build Manager, MSBuild London ****k  benefits (C, MS Build, Web Deploy) Software Build Manager is required by one of the UK's leading online film and TV providers. This is a newly created role which will see you taking responsibility for the design, development and implementation of their build solution. In this role, you will be responsible the following:  All aspects of the software build and deployment process.  Configuring and distributing release builds across multiple platforms  Expected to manage the branching and merging of branches across iterations, features and milestones.  Developing and monitoring continuous integration builds.  Developing scalable, repeatable automated deployment frameworks across multiple platforms.  Manage functional test environments and support automated testing.  Promote rigorous engineering practices across the platform. Required technical skill set:  End to end software development, predominantly in C.  MS Build  Version control (

## 3.2 Case Normalisation - lower case

All tokens must be converted to lower case.

After tokenization, the corpus contains both "client" and "Client" as separate tokens.
In text analysis tasks we care about the meaning of a word, so a capitalised or uppercase form should usually be treated no differently from its lowercase form.


We therefore convert all tokens to lower case.
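
A small sketch of the effect, assuming two made-up token lists: lowercasing leaves the number of tokens unchanged but merges variants such as "client" and "Client" into one vocabulary entry. The helper name `lowercase_tokens` is hypothetical.

```python
def lowercase_tokens(token_lists):
    """Lower-case every token in every document (a list of token lists)."""
    return [[tok.lower() for tok in doc] for doc in token_lists]

# made-up toy documents
docs = [["Our", "client", "is", "hiring"],
        ["The", "Client", "is", "our", "partner"]]
lowered = lowercase_tokens(docs)

# vocabulary = set of distinct tokens across all documents
vocab_before = {tok for doc in docs for tok in doc}
vocab_after = {tok for doc in lowered for tok in doc}

print(len(vocab_before), len(vocab_after))
```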

In [70]:
tokenized_job_desc[0]

['Our',
 'client',
 'has',
 'established',
 'itself',
 'as',
 'a',
 'leading',
 'manufacturer',
 'and',
 'supplier',
 'of',
 'quality',
 'water',
 'treatment',
 'plants',
 'ranging',
 'from',
 'basic',
 'water',
 'softeners',
 'and',
 'reverse',
 'osmosis',
 'equipment',
 'to',
 'customer',
 'specified',
 'complex',
 'water',
 'treatment',
 'solutions',
 'The',
 'company',
 'are',
 'able',
 'to',
 'meet',
 'their',
 'clients',
 'requirements',
 'through',
 'flexibility',
 'in',
 'tailoring',
 'their',
 'product',
 'to',
 'their',
 'needs',
 'and',
 'budgets',
 'Due',
 'to',
 'expansion',
 'and',
 'an',
 'increased',
 'workload',
 'they',
 'are',
 'seeking',
 'to',
 'recruit',
 'a',
 'Planet',
 'Engineer',
 'to',
 'cover',
 'accounts',
 'along',
 'the',
 'M',
 'Corridor',
 'Responsibilities',
 'will',
 'include',
 'conducting',
 'the',
 'routine',
 'sampling',
 'and',
 'analysis',
 'of',
 'water',
 'systems',
 'interpreting',
 'results',
 'maintenance',
 'and',
 'the',
 'installation',


In [71]:
# converting all to lower case
tokenized_job_desc_lower = []

for i in range(0, len(tokenized_job_desc)):
    
    # convert each token to lower case
    # create a new list of lower case tokens for each tokenised job description
    tokenized_job_desc_lower.append([token.lower() for token in tokenized_job_desc[i]])
   

In [72]:
# check the length to make sure all the descriptions were modified
print("Total number of job_ad:", len(tokenized_job_desc_lower))

Total number of job_ad: 55449


In [73]:
# compare this with tokenized_job_desc[0] obtained in the first cell in this sub-section
tokenized_job_desc_lower[0]

['our',
 'client',
 'has',
 'established',
 'itself',
 'as',
 'a',
 'leading',
 'manufacturer',
 'and',
 'supplier',
 'of',
 'quality',
 'water',
 'treatment',
 'plants',
 'ranging',
 'from',
 'basic',
 'water',
 'softeners',
 'and',
 'reverse',
 'osmosis',
 'equipment',
 'to',
 'customer',
 'specified',
 'complex',
 'water',
 'treatment',
 'solutions',
 'the',
 'company',
 'are',
 'able',
 'to',
 'meet',
 'their',
 'clients',
 'requirements',
 'through',
 'flexibility',
 'in',
 'tailoring',
 'their',
 'product',
 'to',
 'their',
 'needs',
 'and',
 'budgets',
 'due',
 'to',
 'expansion',
 'and',
 'an',
 'increased',
 'workload',
 'they',
 'are',
 'seeking',
 'to',
 'recruit',
 'a',
 'planet',
 'engineer',
 'to',
 'cover',
 'accounts',
 'along',
 'the',
 'm',
 'corridor',
 'responsibilities',
 'will',
 'include',
 'conducting',
 'the',
 'routine',
 'sampling',
 'and',
 'analysis',
 'of',
 'water',
 'systems',
 'interpreting',
 'results',
 'maintenance',
 'and',
 'the',
 'installation',


In [74]:
stats_print(tokenized_job_desc_lower)

Vocabulary size:  89591
Total number of tokens:  13799127
Lexical diversity:  0.006492512171240978
Total number of job_ad:  55449
Average number of tokens in job descriptions:  248.861602553698
Maximun number of tokens in job descriptions:  2001
Minimun number of tokens in job descriptions:  10
Standard deviation of number of tokens in job descriptions:  125.26507304982165


<span style='font-family:Georgia;color:Blue;background:yellow'>Observation</span>

The vocabulary size (the number of unique words) has decreased from 112580 in the tokenized list to 89591 in the lowercased list.

After lowercasing, the number of unique words decreases because words such as 'client' and 'Client' now count as a single entry in the set. The total number of tokens is unchanged at 13799127.

## 3.3 Remove words with length less than 2

In [75]:
# collect the tokens of length less than 2 in each job ad

tokens_length_lessThan2 = [[lowercase_token for lowercase_token in job if len(lowercase_token) < 2] for job in tokenized_job_desc_lower] 

# flatten into a single list of all such single-character tokens in the corpus
list(chain.from_iterable(tokens_length_lessThan2))

['a',
 'a',
 'm',
 'a',
 'a',
 'a',
 's',
 'a',
 's',
 'h',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'p',
 'a',
 'a',
 'a',
 'k',
 'k',
 'a',
 'a',
 'a',
 'x',
 'x',
 'a',
 'k',
 'k',
 'a',
 'k',
 'k',
 'k',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'c',
 'c',
 'a',
 'a',
 'a',
 'k',
 'k',
 'k',
 'k',
 'a',
 'b',
 'b',
 'a',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 'a',
 'e',
 'g',
 'a',
 'c',
 'k',
 'k',
 'i',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 's',
 'u',
 's',
 'i',
 'g',
 'o',
 's',
 'a',
 'a',
 'a',
 'k',
 'c',
 'a',
 'c',
 'a',
 'a',
 'a',
 'c',
 'k',
 'a',
 's',
 'a',
 'm',
 'm',
 'a',
 's',
 'a',
 'a',
 'a',
 'e',
 'm',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'c',
 'c',
 'a',
 'c',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 'j'

In [76]:
# remove words with length less than 2

lowercase_tokens_lenGreater2 = [[lowercase_token for lowercase_token in job if len(lowercase_token) >= 2] for job in tokenized_job_desc_lower]


In [77]:
print("Tokenized review WITH single character words:\n\n", tokenized_job_desc_lower[1000])

Tokenized review WITH single character words:

 ['sql', 'server', 'developer', 'required', 'for', 'growing', 'nottingham', 'based', 'company', 'technical', 'requirements', 'proven', 'commercial', 'experience', 'in', 'microsoft', 'sql', 'server', 'or', 'above', 'fullcycle', 'sql', 'server', 'database', 'development', 'database', 'design', 'datadefinition', 'layer', 'ddl', 'tables', 'triggers', 'constraints', 'keys', 'etc', 'datamanipulation', 'layer', 'dml', 'views', 'functions', 'stored', 'procedures', 'etc', 'data', 'transformation', 'using', 'technologies', 'such', 'as', 'ssis', 'beneficial', 'technical', 'experience', 'but', 'not', 'essential', 'database', 'administration', 'users', 'backups', 'etc', 'use', 'of', 'ssrs', 'and', 'or', 'ssas', 'using', 'red', 'gate', 'database', 'tools', 'using', 'business', 'intelligence', 'tools', 'other', 'than', 'ssrs', 'php', 'development', 'in', 'particular', 'with', 'code', 'igniter', 'or', 'other', 'mvc', 'frameworks', 'general', 'description'

In [78]:
print("Tokenized review WITHOUT any single character word:\n\n",lowercase_tokens_lenGreater2[1000])

Tokenized review WITHOUT any single character word:

 ['sql', 'server', 'developer', 'required', 'for', 'growing', 'nottingham', 'based', 'company', 'technical', 'requirements', 'proven', 'commercial', 'experience', 'in', 'microsoft', 'sql', 'server', 'or', 'above', 'fullcycle', 'sql', 'server', 'database', 'development', 'database', 'design', 'datadefinition', 'layer', 'ddl', 'tables', 'triggers', 'constraints', 'keys', 'etc', 'datamanipulation', 'layer', 'dml', 'views', 'functions', 'stored', 'procedures', 'etc', 'data', 'transformation', 'using', 'technologies', 'such', 'as', 'ssis', 'beneficial', 'technical', 'experience', 'but', 'not', 'essential', 'database', 'administration', 'users', 'backups', 'etc', 'use', 'of', 'ssrs', 'and', 'or', 'ssas', 'using', 'red', 'gate', 'database', 'tools', 'using', 'business', 'intelligence', 'tools', 'other', 'than', 'ssrs', 'php', 'development', 'in', 'particular', 'with', 'code', 'igniter', 'or', 'other', 'mvc', 'frameworks', 'general', 'descri

In [79]:
# there are three words which are of length less than 2 in description at index 1000
print(len(tokenized_job_desc_lower[1000]))
print(len(lowercase_tokens_lenGreater2[1000]))

265
262


In [80]:
# print the words which are not in lowercase_tokens_lenGreater2[1000] but are in tokenized_job_desc_lower[1000]

for word in tokenized_job_desc_lower[1000]:
    if word not in lowercase_tokens_lenGreater2[1000]:
        print(word)

a
a
a


In [81]:
stats_print(lowercase_tokens_lenGreater2)

Vocabulary size:  89565
Total number of tokens:  13342925
Lexical diversity:  0.006712546162104636
Total number of job_ad:  55449
Average number of tokens in job descriptions:  240.63418636945661
Maximun number of tokens in job descriptions:  1919
Minimun number of tokens in job descriptions:  10
Standard deviation of number of tokens in job descriptions:  121.91270721028921


<span style='font-family:Georgia;color:Blue;background:yellow'>Observation</span>

- The number of tokens has decreased from 13799127 to 13342925, as all tokens of length less than 2 have been removed.

- The vocabulary size has also decreased from 89591 to 89565, i.e. 26 distinct single-character tokens were removed.

## 3.4 Removing Stop words


Remove stop words using the provided stop words list (i.e., `stopwords_en.txt`).

Stop words provide little or no value to the NLP objective and can therefore be filtered out of the text to be processed.

Stop words can be removed by looking each token up in a pre-defined list; here, a list of stop words is provided.
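
The lookup described above can be sketched as follows, assuming a tiny made-up stop word list rather than the provided `stopwords_en.txt`. The helper name `remove_stopwords` is hypothetical; converting the list to a `set` is a design choice that makes each membership test average constant time.

```python
def remove_stopwords(token_lists, stopwords):
    """Drop every token that appears in the stop word collection."""
    stop_set = set(stopwords)  # set membership tests are fast on average
    return [[tok for tok in doc if tok not in stop_set] for doc in token_lists]

# made-up stop words and documents, for illustration only
toy_stopwords = ["the", "is", "in", "to", "and"]
docs = [["the", "role", "is", "based", "in", "london"],
        ["apply", "to", "join", "the", "team"]]

filtered = remove_stopwords(docs, toy_stopwords)
print(filtered)
```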

In [82]:
# read the stop words into a list data structure
stopwords_list = []

with open('./stopwords_en.txt') as f:
    
    stopwords_list = f.read().splitlines()

In [83]:
stopwords_list 

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [84]:
# There are 571 stop words in the provided list

len(stopwords_list)

571

In [85]:
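# remove stop words (a set gives average constant-time membership tests, unlike a list)

stopwords_set = set(stopwords_list)
tokens_without_stopwords = [[word for word in job if word not in stopwords_set] for job in lowercase_tokens_lenGreater2]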


In [86]:
len(tokens_without_stopwords)

55449

In [87]:
for word in lowercase_tokens_lenGreater2[5000]:
    if word not in tokens_without_stopwords[5000]:
        print(word, end=" ")

is for in to and they with and with in particular in your as you will be for of and have of of the will be very in of and these so are will have the following particularly but also and with of and some in is the over of in the last and are to there is on in the of plus and up to and on for the right please through this to be 

In [88]:
# observation: the word particular has been removed 
# let's check if the word 'particular' in the list of stopwords provided
print("particular" in stopwords_list)
print("following" in stopwords_list)
print("particularly" in stopwords_list)
print("please" in stopwords_list)

True
True
True
True


In [89]:
stats_print(tokens_without_stopwords)

Vocabulary size:  89052
Total number of tokens:  7863307
Lexical diversity:  0.011325006132915833
Total number of job_ad:  55449
Average number of tokens in job descriptions:  141.8115204963119
Maximun number of tokens in job descriptions:  1132
Minimun number of tokens in job descriptions:  7
Standard deviation of number of tokens in job descriptions:  73.78995293014496


<span style='font-family:Georgia;color:Blue;background:yellow'>Observation</span>

- The vocabulary size has decreased from 89565 to 89052.
- The number of tokens has decreased from 13342925 to 7863307.
- The lexical diversity has therefore increased from 0.0067 to 0.0113, because the token count fell far more than the vocabulary size.
- These changes are the result of removing stop words from the lowercased job descriptions, which already had single-character tokens removed.
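
The change in lexical diversity can be checked by hand from the statistics printed above. The snippet below only redoes that arithmetic; the function name `lexical_diversity` is introduced here for illustration.

```python
def lexical_diversity(vocab_size, token_count):
    """Ratio of distinct word types to total tokens."""
    return vocab_size / token_count

# figures copied from the stats_print outputs before and after stop word removal
before = lexical_diversity(89565, 13342925)
after = lexical_diversity(89052, 7863307)

# relative drop in vocabulary size and in token count
vocab_drop = (89565 - 89052) / 89565
token_drop = (13342925 - 7863307) / 13342925

print(round(before, 4), round(after, 4))
print(round(vocab_drop * 100, 1), round(token_drop * 100, 1))
```

The vocabulary shrinks by well under 1% while the token count shrinks by roughly 41%, so the ratio of the two goes up.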

## 3.5  Term Frequency

Remove the words that appear only once in the document collection, based on term frequency.

Term frequency counts the number of times a word occurs in the whole corpus, regardless of which document it is in. A frequency distribution based on term frequency tells us how the total number of word tokens is distributed across the word types.
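
As a minimal sketch of this step, the example below uses `collections.Counter` from the standard library in place of NLTK's `FreqDist`, on made-up documents. The helper name `find_hapaxes` is hypothetical.

```python
from collections import Counter
from itertools import chain

def find_hapaxes(token_lists):
    """Return the set of words whose corpus-wide term frequency is exactly 1."""
    term_counts = Counter(chain.from_iterable(token_lists))
    return {word for word, count in term_counts.items() if count == 1}

# made-up toy corpus
docs = [["nurse", "ward", "nurse"],
        ["ward", "seismologists"],
        ["chef", "chef"]]

hapaxes = find_hapaxes(docs)
cleaned = [[tok for tok in doc if tok not in hapaxes] for doc in docs]
print(hapaxes)
print(cleaned)
```

Note that "chef" is kept even though it appears in only one document, because term frequency counts occurrences, not documents.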

In [90]:
# create a list of all the words used in the corpus
words = list(chain.from_iterable(tokens_without_stopwords)) 

In [91]:
len(words)

7863307

In [92]:
# compute term frequency for all words
term_fd = FreqDist(words) 

In [93]:
# observe the 25 most common words in the corpus
term_fd.most_common(25)

[('experience', 104062),
 ('role', 66536),
 ('work', 64587),
 ('team', 62308),
 ('business', 59184),
 ('skills', 57872),
 ('working', 52280),
 ('job', 50814),
 ('client', 45209),
 ('sales', 43108),
 ('manager', 42023),
 ('management', 41295),
 ('development', 41099),
 ('support', 39567),
 ('company', 38961),
 ('uk', 36576),
 ('excellent', 34578),
 ('opportunity', 31093),
 ('service', 29859),
 ('required', 29783),
 ('knowledge', 28788),
 ('care', 28434),
 ('successful', 26012),
 ('services', 25557),
 ('apply', 25436)]

In [94]:
# find out the list of words that appear only once in the entire corpus
lessFreqWords = set(term_fd.hapaxes())
lessFreqWords

{'perio',
 'stephencoastsr',
 'sagecrmconsultantwithdevelopmentskillsallrounder',
 'axtradeandlogisticsconsultant',
 'ittrainerittraininganalysttechnicaltrainer',
 'bellamy',
 'hedgesapplying',
 'oppressive',
 'lifecyclekeen',
 'financecommercialmanagernorthenireland',
 'plannig',
 'trackhistory',
 'hoursoutofhours',
 'seismologists',
 'mobilefacilitiestechnician',
 'managerialin',
 'beautyfocused',
 'dataanalystdatabaseadministratorurgentsqlspss',
 'midlevelaspnetdeveloperabingdon',
 'skillsadvantageous',
 'seniorembeddedcrtossoftwareengineerswindon',
 'graphicsrelated',
 'oandm',
 'easeofuse',
 'brbrbenefits',
 'seniorlettingsnegotiatorinresidentiallettings',
 'theukliquidity',
 'liftservicemanager',
 'deputywebmanager',
 'assistantfrontofhousemanager',
 'snrphpdeveloperscrumbirmingham',
 'koteprogression',
 'teachermaternitycoverliverpool',
 'bolder',
 'acoenfrllondonlondon',
 'professionalb',
 'programmesskills',
 'planting',
 'senioronlinemarketingexecutivefrenchspeaking',
 'midla

In [95]:
len(lessFreqWords)

48964

In [96]:
# remove these less frequent words from each tokenized job description
def removeLessFreqWords(ad):
    return [w for w in ad if w not in lessFreqWords]

In [97]:
tokens_removeLessTermFreq = [removeLessFreqWords(job) for job in tokens_without_stopwords]

In [98]:
stats_print(tokens_removeLessTermFreq)

Vocabulary size:  40088
Total number of tokens:  7814343
Lexical diversity:  0.005130053799788415
Total number of job_ad:  55449
Average number of tokens in job descriptions:  140.9284748146946
Maximun number of tokens in job descriptions:  1121
Minimun number of tokens in job descriptions:  7
Standard deviation of number of tokens in job descriptions:  73.46663506985078


In [99]:
(89052 - 40088)/89052

0.5498360508466963

<span style='font-family:Georgia;color:Blue;background:yellow'>Observation</span>

- The vocabulary size has decreased from 89052 to 40088.
- The number of tokens has decreased from 7863307 to 7814343.

- Both differences equal 48964, the number of words that occurred only once: each such word accounts for exactly one vocabulary entry and one token.

- The vocabulary has been reduced by approximately 55%.

## 3.6 Document Frequency
Remove the top 50 most frequent words based on document frequency.

Document frequency is slightly different from term frequency, as it counts the number of documents in which a word occurs. For instance, if a word appears 3 times in a document, it adds 3 to the word's term frequency but only 1 to its document frequency.
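
The difference between the two counts can be sketched on a toy corpus, again using `collections.Counter` as a stand-in for `FreqDist`. The function names and documents are made up for this example.

```python
from collections import Counter
from itertools import chain

def term_frequency(token_lists):
    """Count every occurrence of every word across the corpus."""
    return Counter(chain.from_iterable(token_lists))

def document_frequency(token_lists):
    """Count each word at most once per document by de-duplicating with set()."""
    return Counter(chain.from_iterable(set(doc) for doc in token_lists))

# made-up toy corpus: "sales" occurs 3 times, but all in one document
docs = [["sales", "sales", "sales", "team"],
        ["team", "nurse"],
        ["team"]]

tf = term_frequency(docs)
df = document_frequency(docs)
print(tf["sales"], df["sales"])
print(tf["team"], df["team"])
```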

In [100]:
# build words list for document frequency

words_2 = list(chain.from_iterable([set(job) for job in tokens_removeLessTermFreq]))

# find the words that appear in the most documents
doc_fd = FreqDist(words_2)  
doc_fd.most_common(50)

[('experience', 43644),
 ('role', 34680),
 ('work', 33684),
 ('team', 32585),
 ('working', 30714),
 ('skills', 30412),
 ('client', 26899),
 ('job', 25552),
 ('business', 24739),
 ('uk', 24133),
 ('excellent', 22982),
 ('opportunity', 22678),
 ('company', 22263),
 ('management', 20620),
 ('required', 20555),
 ('development', 20223),
 ('apply', 20133),
 ('based', 19333),
 ('successful', 19118),
 ('join', 18682),
 ('www', 18421),
 ('salary', 18402),
 ('cv', 18383),
 ('support', 18286),
 ('knowledge', 17844),
 ('strong', 16475),
 ('environment', 16408),
 ('posted', 16398),
 ('jobseeking', 16342),
 ('candidate', 16304),
 ('originally', 16294),
 ('leading', 16194),
 ('high', 15922),
 ('service', 15623),
 ('manager', 15587),
 ('good', 15252),
 ('ability', 15154),
 ('including', 14857),
 ('position', 14564),
 ('services', 14501),
 ('benefits', 14434),
 ('training', 14218),
 ('essential', 13915),
 ('experienced', 13826),
 ('key', 13567),
 ('contact', 13551),
 ('level', 13523),
 ('recruitment', 

In [101]:
#view only top 50 common tokens and save them in a list
document_freq = doc_fd.most_common(50)
document_freq 

[('experience', 43644),
 ('role', 34680),
 ('work', 33684),
 ('team', 32585),
 ('working', 30714),
 ('skills', 30412),
 ('client', 26899),
 ('job', 25552),
 ('business', 24739),
 ('uk', 24133),
 ('excellent', 22982),
 ('opportunity', 22678),
 ('company', 22263),
 ('management', 20620),
 ('required', 20555),
 ('development', 20223),
 ('apply', 20133),
 ('based', 19333),
 ('successful', 19118),
 ('join', 18682),
 ('www', 18421),
 ('salary', 18402),
 ('cv', 18383),
 ('support', 18286),
 ('knowledge', 17844),
 ('strong', 16475),
 ('environment', 16408),
 ('posted', 16398),
 ('jobseeking', 16342),
 ('candidate', 16304),
 ('originally', 16294),
 ('leading', 16194),
 ('high', 15922),
 ('service', 15623),
 ('manager', 15587),
 ('good', 15252),
 ('ability', 15154),
 ('including', 14857),
 ('position', 14564),
 ('services', 14501),
 ('benefits', 14434),
 ('training', 14218),
 ('essential', 13915),
 ('experienced', 13826),
 ('key', 13567),
 ('contact', 13551),
 ('level', 13523),
 ('recruitment', 

In [102]:
freq_words = []

for word in document_freq:
    freq_words.append(word[0])
    
print(freq_words)
print("\n")
print(len(freq_words ))

['experience', 'role', 'work', 'team', 'working', 'skills', 'client', 'job', 'business', 'uk', 'excellent', 'opportunity', 'company', 'management', 'required', 'development', 'apply', 'based', 'successful', 'join', 'www', 'salary', 'cv', 'support', 'knowledge', 'strong', 'environment', 'posted', 'jobseeking', 'candidate', 'originally', 'leading', 'high', 'service', 'manager', 'good', 'ability', 'including', 'position', 'services', 'benefits', 'training', 'essential', 'experienced', 'key', 'contact', 'level', 'recruitment', 'candidates', 'provide']


50


In [103]:
def removeTop50(ad):
    return [w for w in ad if w not in freq_words]


In [104]:
tokens_removeMostDocumentFreq = [removeTop50(ad) for ad in tokens_removeLessTermFreq]

In [105]:
len(tokens_removeMostDocumentFreq)

55449

In [106]:
stats_print(tokens_removeMostDocumentFreq)


Vocabulary size:  40038
Total number of tokens:  6239169
Lexical diversity:  0.0064172007522155594
Total number of job_ad:  55449
Average number of tokens in job descriptions:  112.52085700373316
Maximun number of tokens in job descriptions:  990
Minimun number of tokens in job descriptions:  4
Standard deviation of number of tokens in job descriptions:  61.88637513583753


<span style='font-family:Georgia;color:Blue;background:yellow'>Observation</span>

- The vocabulary size has decreased from 40088 to 40038, i.e. by exactly the 50 removed words.
- The number of tokens has decreased from 7814343 to 6239169.

- These changes are the result of removing the 50 words with the highest document frequency from the tokenized job descriptions.

# 4. Bigrams

Extract the top 10 bigrams based on term frequency and save them to a txt file. 

A bigram is a pair of adjacent tokens. To compute bigrams, we slide a window of 2 words forward over the token sequence, one word at a time. 


N-grams can be used to build n-gram language models, which can in turn be used for speech recognition, spelling correction, entity detection, etc. In text mining tasks, n-grams are used for developing features for 
classification algorithms, such as SVMs, MaxEnt models, Naive Bayes, etc. [excerpt from activity 3, week 8]
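
The sliding window can be sketched with plain `zip`, without NLTK. This is a variant for illustration, not the method used in the cells of this section: it builds bigrams one document at a time, so no pair spans the end of one job ad and the start of the next, whereas bigrams computed over a single flattened word list also include such cross-document pairs. The function names and documents are made up.

```python
from collections import Counter

def doc_bigrams(tokens):
    """Pair each token with its successor: a window of size 2 moved one step at a time."""
    return list(zip(tokens, tokens[1:]))

def count_bigrams(token_lists):
    """Count bigrams per document so that pairs never cross document boundaries."""
    counts = Counter()
    for doc in token_lists:
        counts.update(doc_bigrams(doc))
    return counts

# made-up toy corpus
docs = [["sql", "server", "developer"],
        ["senior", "sql", "server"],
        ["nursing", "home"]]

bigram_counts = count_bigrams(docs)
print(bigram_counts.most_common(1))
```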

In [107]:
# term frequency
words = list(chain.from_iterable(tokens_removeMostDocumentFreq))

In [108]:
# call ngrams method 
bigrams = ngrams(words, n = 2)

In [109]:
fdbigram = FreqDist(bigrams)

In [110]:
# view 10 most common bigrams
final_bigrams = fdbigram.most_common(10)
final_bigrams

[(('employment', 'agency'), 8055),
 (('track', 'record'), 5472),
 (('acting', 'employment'), 5095),
 (('sql', 'server'), 4804),
 (('asp', 'net'), 4687),
 (('relation', 'vacancy'), 3977),
 (('sales', 'executive'), 3619),
 (('chef', 'de'), 3586),
 (('nursing', 'home'), 3503),
 (('de', 'partie'), 3396)]

In [111]:
bigram_words = [" ".join(bg[0]) for bg in final_bigrams]
bigram_words

['employment agency',
 'track record',
 'acting employment',
 'sql server',
 'asp net',
 'relation vacancy',
 'sales executive',
 'chef de',
 'nursing home',
 'de partie']

In [112]:
bigram_freq = [bg[1] for bg in final_bigrams]
bigram_freq

[8055, 5472, 5095, 4804, 4687, 3977, 3619, 3586, 3503, 3396]

In [113]:
def save_bigrams(bigramFilename, bigram_words, bigram_freq):
    
    # create (or overwrite) the txt file; the with-block closes it automatically
    with open(bigramFilename, 'w') as out_file:
        
        # write one "<bigram>,<frequency>" line per bigram
        for words, freq in zip(bigram_words, bigram_freq):
            out_file.write(words + "," + str(freq) + "\n")

In [114]:
save_bigrams("bigram.txt", bigram_words, bigram_freq)

# 5. Save all processed job advertisements

 Save all job advertisement text and information in a txt file
 
`job_ads.txt` This file contains the job advertisement information and the pre-processed description
text for all the job advertisement documents. Each job advertisement occupies 5 lines in the file:

○ The first line stores the id of the job advertisement document, written in the format of “ID: <5 digit id>”, for instance, “ID: 44128”. The job advertisement id matches the 5 digit part of the file
name of the document.

○ The second line stores the category (name of its parent folder) of the job advertisement,
written in the format of “Category: <category>”, for instance, “Category: Teaching”.
    
○ The third line stores the webIndex of the job advertisement, written in the format of
“Webindex: <8 digit web index>”, for instance, “Webindex: 36757414”.
    
○ The fourth line stores the un-processed title of the job description, in the format of “Title:
<title of the advertisement>”
    
    
○ The fifth line stores the pre-processed description of the job description, in the format of
“Description: <description of the advertisement>”. In order to do so, you need to rejoin the
tokens of each pre-processed description text into one string, with space as the delimiter. 
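
The five-line layout can be sketched as a small formatting helper. The function name `format_job_ad`, the title and the tokens are hypothetical; the ID, category and web index are the example values quoted in the description above.

```python
def format_job_ad(job_id, category, web_index, title, tokens):
    """Build the five-line text block for one job advertisement."""
    return (
        f"ID: {job_id}\n"
        f"Category: {category}\n"
        f"Webindex: {web_index}\n"
        f"Title: {title}\n"
        # re-join the pre-processed tokens with a single space as the delimiter
        f"Description: {' '.join(tokens)}\n"
    )

# title and tokens are made up for this example
record = format_job_ad("44128", "Teaching", "36757414", "Maths Teacher",
                       ["maths", "teacher", "secondary", "school"])
print(record)
```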

In [115]:
# job_ids
# job_categories
# job_web_indices
# job_title
# job_descriptions
# tokens_removeMostDocumentFreq

In [116]:
# the with-block closes the file when done, so every line is flushed to disk
with open("job_ads.txt", 'w') as out_file:

    for i in range(0, len(job_data["data"])):
        
        first_line = "ID: " + str(job_ids[i]) + "\n"
        second_line = "Category: " + str(job_categories[i]) + "\n"
        third_line = "Webindex: " + str(job_web_indices[i]) + "\n"
        fourth_line = "Title: " + job_title[i] + "\n"
        fifth_line = "Description: " + " ".join(tokens_removeMostDocumentFreq[i]) + "\n"
        
        out_file.write(first_line)
        out_file.write(second_line)
        out_file.write(third_line)
        out_file.write(fourth_line)
        out_file.write(fifth_line)


# 6. Save vocab 

In [117]:
words = list(chain.from_iterable(tokens_removeMostDocumentFreq))

In [118]:
vocab = set(words)
len(vocab)

40038

In [119]:
# sort the vocabulary alphabetically (sorted() returns a list)
vocab = sorted(vocab)

In [120]:
vocab

["a'level",
 'aa',
 'aaa',
 'aaappointments',
 'aab',
 'aac',
 'aacc',
 'aae',
 'aah',
 'aak',
 'aamm',
 'aan',
 'aand',
 'aantrekkelijk',
 'aap',
 'aar',
 'aarca',
 'aardman',
 'aareon',
 'aaron',
 'aaronwallis',
 'aarosette',
 'aarosettehotel',
 'aarosettelivein',
 'aarosetterestaurant',
 'aarosetterestaurantlondon',
 'aarosettes',
 'aarosettesinternationalchainofhotels',
 'aarosetteslivein',
 'aasl',
 'aasp',
 'aastra',
 'aat',
 'aatom',
 'ab',
 'aba',
 'abacus',
 'abaility',
 'abandoned',
 'abandonment',
 'abandons',
 'abap',
 'abaqus',
 'abatement',
 'abb',
 'abbas',
 'abbey',
 'abbeywood',
 'abbie',
 'abbot',
 'abbots',
 'abbott',
 'abby',
 'abbyy',
 'abc',
 'abd',
 'abdo',
 'abdominal',
 'abdul',
 'abdulla',
 'abel',
 'abellprotocoleducation',
 'abenefit',
 'aberdare',
 'aberdeen',
 "aberdeen's",
 'aberdeenshire',
 'aberdenshire',
 'aberfeldy',
 'abergavenny',
 'abertillery',
 'aberystwyth',
 'abfa',
 'abgeschlossenes',
 'abi',
 'abid',
 'abide',
 'abiding',
 'abigail',
 'abili'

In [121]:
# the with-block closes the file automatically
with open("vocab.txt", 'w') as out_file:

    # write one "word:index" line per vocabulary word; indices start from 0
    for ind, word in enumerate(vocab):
        out_file.write("{}:{}\n".format(word, ind))
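
As a quick sanity check of the `word:index` format, the sketch below parses such lines back into a dictionary. It reads from an in-memory sample rather than the real `vocab.txt`, and the helper name `parse_vocab` is hypothetical. Splitting on the last colon is an assumption that holds here because the tokenizer pattern never produces a colon inside a word.

```python
import io

def parse_vocab(lines):
    """Map each vocabulary word to its integer index from 'word:index' lines."""
    vocab_index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, idx = line.rsplit(":", 1)  # split on the last colon only
        vocab_index[word] = int(idx)
    return vocab_index

# in-memory sample in the same format as the saved file
sample = io.StringIO("a'level:0\naa:1\naaa:2\n")
vocab_index = parse_vocab(sample)
print(vocab_index)
```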