# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Vu Duc Anh: s3979839
#### Bui Thien Phuoc: s3634831

Date: 2024-08-17

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used: please include all the libraries you used in your assignment, e.g.,:
* pandas
* re
* numpy

## Introduction
You should give a brief information of this assessment task here.

<span style="color: red"> Note that this is a sample notebook only. You will need to fill in the proper markdown and code blocks. You might also want to make necessary changes to the structure to meet your own needs. Note also that any generic comments written in this notebook are to be removed and replace with your own words.</span>

## The Data
- Given a small collection of job advertisement documents (around 750 jobs) in data folder
- Inside the data folder, there are 4 different subfolders, namely: Accounting_Finance, Engineering, Healthcare_Nursing, and Sales, each folder name is a job category
- The job advertisement text documents of a particular category are located in the corresponding subfolder 
- Each job advertisement document is a txt file, named as "Job_<ID>.txt". It contains the title, the webindex, (some will also have information on the company name, some might not), and the full description of the job advertisement.

Job advertisement document having the following attributes:

| Variables | Description |
| --- | --- | 
| Title | Title of the advertised job position
| Webindex | 8 digit Id from the website of advertised job 
| Company | Company of the advertised job
| Description | Description of each job advertisement

## Importing libraries 

In [1]:
# Code to import libraries as you need in this assessment, e.g.,
import sys
import re

# Libraries for basic NLP tasks taken from tutorial
import sklearn
from sklearn.datasets import load_files  
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from nltk.probability import *
from itertools import chain

# Import datascience toolkits
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

### 1.1 Examining and loading data

Load the dataset given in **data** folder using *load_files* API from **sklear.datasets**

In [2]:
df = load_files(r'data')
print(type(df))

<class 'sklearn.utils._bunch.Bunch'>


The data type of the loaded data using sklearn API is <class 'sklearn.utils._bunch.Bunch'>. 

We can do analysis from Bunch API 
- Bunch Ref: https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html)

In [3]:
df.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
print(f"Print first five example: \n{df.data[:5]}")
print(f"Print first five filenames: \n{df.filenames[:5]}")
print(f"Print target names: \n{df.target_names}")
print(f"Print target: \n{df.target}")

Print first five example: 
[b'Title: Finance / Accounts Asst Bromley to ****k\nWebindex: 68997528\nCompany: First Recruitment Services\nDescription: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role', b'Title: Fund Accountant  Hedge Fund\nWebindex: 68063513\nCompany:

- The description is in the bytes-like format and need to be converted to python string
- The target class name is Accounting_Finance, Engineering, Healthcare_Nursing and Sales
- There are four target class corresponding to Integer 0 to 3

Let's look at the index of each of the target class name,

In [5]:
for i in range(len(df['target_names'])):
    print(f'Category at index {i} is {df["target_names"][i]}')

Category at index 0 is Accounting_Finance
Category at index 1 is Engineering
Category at index 2 is Healthcare_Nursing
Category at index 3 is Sales


 Create an test index variable. Check manually whether the text file attribute is the correct category index for target class,

In [6]:
idx = 20
df['filenames'][idx], df['target'][idx] 

('data/Healthcare_Nursing/Job_00491.txt', 2)

In [7]:
print(f"Job data: \n{df['data'][idx]}")
print(f"Corresponding to Job filenames: \n{df['filenames'][idx]}")
print(f"And to Job target: {df['target'][idx]} \n")



Job data: 
b'Title: PERM Unit Mgr RGN Kid minster Flexi ****K due\nWebindex: 71692209\nDescription: Job Title: Unit Manager Reporting to: Registered Manager Job Purpose: To manage in a professional manner the day to day running of the home\xe2\x80\x99s administration, clinical policies and procedures, training and care planning. To implement working practices that monitors the health and welfare of the home\xe2\x80\x99s service users and staff and their respective environments. To promote quality care within a warm friendly ambience. Key Result Areas Managing To work with the Directors to achieve the home\xe2\x80\x99s financial targets. To manage the home in a manner which will not bring the home or service users into disrepute. To maintain confidentiality on all aspects of care and staff management. To ensure all the home\xe2\x80\x99s policies and procedures are implemented and followed by all staff. To inform the Registered Manager immediately if a serious difficulty or event occurs.

Manually checked, we can confirm the Job data (including Title, Webindex, Decsription) match the file path and name, and job target.

### 1.2 Text Pre-processing
Before Text-Preprocessing, we need to converted data bytes-like format to Python string as finding from 1.1.

Then, performing the required text pre-processing steps as follow:

- 1.2.1.
Extract information from each job advertisement. Perform the following pre-processing steps to the
description of each job advertisement;

- 1.2.2.
Tokenize each job advertisement description. The word tokenization must use the following regular expression, r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?";

- 1.2.3.
All the words must be converted into the lower case;

- 1.2.4.
Remove words with length less than 2.

- 1.2.5.
Remove stopwords using the provided stop words list (i.e, stopwords_en.txt). It is located inside the same downloaded folder.

- 1.2.6.
Remove the word that appears only once in the document collection, based on term frequency.

- 1.2.7.
Remove the top 50 most frequent words based on document frequency.

- 1.2.8.
Save all job advertisement text and information in txt file(s) (you have flexibility to choose what format you want to save the preprocessed job ads, and you will need to retrieve the pre-processed job ads text in Task 2 & 3);

- 1.2.9.
Build a vocabulary of the cleaned job advertisement descriptions, save it in a txt file (please refer to the required output);

...... Sections and code blocks on basic text pre-processing


<span style="color: red"> You might have complex notebook structure in this section, please feel free to create your own notebook structure. </span>

In [8]:
# Function to convert bytes-like format to Python string
def bytes_to_string(data):
    """
    Converts bytes to a string using the 'utf-8' encoding.
    Parameters:
    - data (bytes or list): The bytes data, or list of bytes to be converted.
    Returns:
    - str: The converted string, or list of str.
    """
    if isinstance(data, list):
        return [bytes_to_string(d) for d in data]
    return data.decode('utf-8')

print(f"Before conversion: \n{type(df['data'][0])}")
print(f"After conversion: \n{type(bytes_to_string(df['data'][0]))}")
df.data = bytes_to_string(df.data)


Before conversion: 
<class 'bytes'>
After conversion: 
<class 'str'>


#### 1.2.1. Extract information from each job advertisement

In [9]:
print(f'Job data at index {idx}: \n{df["data"][idx]}')

Job data at index 20: 
Title: PERM Unit Mgr RGN Kid minster Flexi ****K due
Webindex: 71692209
Description: Job Title: Unit Manager Reporting to: Registered Manager Job Purpose: To manage in a professional manner the day to day running of the home’s administration, clinical policies and procedures, training and care planning. To implement working practices that monitors the health and welfare of the home’s service users and staff and their respective environments. To promote quality care within a warm friendly ambience. Key Result Areas Managing To work with the Directors to achieve the home’s financial targets. To manage the home in a manner which will not bring the home or service users into disrepute. To maintain confidentiality on all aspects of care and staff management. To ensure all the home’s policies and procedures are implemented and followed by all staff. To inform the Registered Manager immediately if a serious difficulty or event occurs. Managing Support To delegate respon

In [10]:
def extract_description(data):
    descriptions = []
    for item in data:
        match = re.search(r'Description: (.*)', str(item))
        if match:
            descriptions.append(match.group(1))
    return descriptions

descriptions = extract_description(df.data)
print(f'Job description at index {idx}:\n{descriptions[idx]}')
    

Job description at index 20:
Job Title: Unit Manager Reporting to: Registered Manager Job Purpose: To manage in a professional manner the day to day running of the home’s administration, clinical policies and procedures, training and care planning. To implement working practices that monitors the health and welfare of the home’s service users and staff and their respective environments. To promote quality care within a warm friendly ambience. Key Result Areas Managing To work with the Directors to achieve the home’s financial targets. To manage the home in a manner which will not bring the home or service users into disrepute. To maintain confidentiality on all aspects of care and staff management. To ensure all the home’s policies and procedures are implemented and followed by all staff. To inform the Registered Manager immediately if a serious difficulty or event occurs. Managing Support To delegate responsibility effectively and within legal boundaries. To ensure through clinical st

In [11]:
def extract_title(data):
    titles = []
    for item in data:
        match = re.search(r'Title: (.*)', str(item))
        if match:
            titles.append(match.group(1))
    return titles

titles = extract_title(df.data)
print(f'Job title at index {idx}:\n{titles[idx]}')

Job title at index 20:
PERM Unit Mgr RGN Kid minster Flexi ****K due


In [12]:
def extract_webindex(data):
    webindex = []
    for item in data:
        match = re.search(r'Webindex: (.*)', str(item))
        if match:
            webindex.append(match.group(1))
    return webindex

webindex = extract_webindex(df.data)
print(f'Webindex at index {idx}:\n{webindex[idx]}')

Webindex at index 20:
71692209
webindex len


In [13]:
def extract_company(data):
    companies = []
    for item in data:
        match = re.search(r'Company: (.*)', str(item))
        if match:
            companies.append(match.group(1))
        else:
            companies.append('NA')
    return companies

companies = extract_company(df.data)
print(f'Company at index {idx}:\n{companies[idx]}')

Company at index 20:
NA


#### 1.2.2-3: Converted Job description to lower case and Tokenization

In [14]:
def tokenizeDescription(description):
    # Convert all words to lowercase
    l_description = description.lower()
    
    # Segment into sentences
    sentences = sent_tokenize(l_description)
    
    # Tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern)
    token_list = [tokenizer.tokenize(sen) for sen in sentences]
    
    # Merged into a list of tokens
    tokenized_descriptions = list(chain.from_iterable(token_list))
    return tokenized_descriptions

tokenised_description = [tokenizeDescription(description) for description in descriptions]
print(f'Raw description at index {idx}:\n{descriptions[idx]}\n')
print(f'Tokenised description at index {idx}:\n{tokenised_description[idx]}')

Raw description at index 20:
Job Title: Unit Manager Reporting to: Registered Manager Job Purpose: To manage in a professional manner the day to day running of the home’s administration, clinical policies and procedures, training and care planning. To implement working practices that monitors the health and welfare of the home’s service users and staff and their respective environments. To promote quality care within a warm friendly ambience. Key Result Areas Managing To work with the Directors to achieve the home’s financial targets. To manage the home in a manner which will not bring the home or service users into disrepute. To maintain confidentiality on all aspects of care and staff management. To ensure all the home’s policies and procedures are implemented and followed by all staff. To inform the Registered Manager immediately if a serious difficulty or event occurs. Managing Support To delegate responsibility effectively and within legal boundaries. To ensure through clinical st

In [15]:
print(f'Length of tokenised description at index {idx}: {len(tokenised_description[idx])}')
print(f'Length of description at index {idx}: {len(descriptions[idx].lower().split())}')

Length of tokenised description at index 20: 815
Length of description at index 20: 813


In [16]:
def stats_print(tokenised_description):
    words = list(chain.from_iterable(tokenised_description)) # we put all the tokens in the corpus in a single list
    vocab = set(words) # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of descriptions:", len(tokenised_description))
    lens = [len(description) for description in tokenised_description]
    print("Average description length:", np.mean(lens))
    print("Maximun description length:", np.max(lens))
    print("Minimun description length:", np.min(lens))
    print("Standard deviation of description length:", np.std(lens))

stats_print(tokenised_description)

Vocabulary size:  9834
Total number of tokens:  186952
Lexical diversity:  0.052601737344345076
Total number of descriptions: 776
Average description length: 240.91752577319588
Maximun description length: 815
Minimun description length: 13
Standard deviation of description length: 124.97750685071483


#### 1.2.4. Remove words with length less than 2

In [17]:
# Find the tokenised description with words less than 2 characters
w_less_than_2 = [[w for w in description if len(w) < 2] for description in tokenised_description]
print(f'Print description with words less than 2 characters at index {idx}: \n{w_less_than_2[idx]}')

Print description with words less than 2 characters at index 20: 
['a', 's', 's', 'a', 's', 'a', 's', 'a', 's', 'a', 's', 'a', 'a', 'a', 's', 's', 's', 'a', 'a', 's']


In [18]:
# Filter the tokenised description with words greater than 2 characters
tokenised_description = [[w for w in description if len(w) > 2] for description in tokenised_description]

# After filtering
w_less_than_2 = [[w for w in description if len(w) < 2] for description in tokenised_description]
print(f'Description with words less than 2 characters at index {idx} after removed: \n{w_less_than_2[idx]}')

Description with words less than 2 characters at index 20 after removed: 
[]


In [19]:
stats_print(tokenised_description)

Vocabulary size:  9615
Total number of tokens:  153200
Lexical diversity:  0.06276109660574412
Total number of descriptions: 776
Average description length: 197.42268041237114
Maximun description length: 661
Minimun description length: 12
Standard deviation of description length: 103.46267418502396


===> Lexical diversity increased after removed words less than 2 characters.

#### 1.2.5. Remove stopwords using the provided stop words list from stopwords_en.txt

In [20]:
stopwords_en = 'stopwords_en.txt'
with open(stopwords_en, 'r') as f:
    stopwords = f.read().splitlines()
print(f" Stop words list to removed: \n{stopwords}")

 Stop words list to removed: 
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', "c'mon", "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', '

In [21]:
# Filter the tokenised description with stopwords
tokenised_description = [[w for w in description if w not in stopwords] for description in tokenised_description]
stats_print(tokenised_description)

Vocabulary size:  9245
Total number of tokens:  105768
Lexical diversity:  0.08740828984191816
Total number of descriptions: 776
Average description length: 136.29896907216494
Maximun description length: 482
Minimun description length: 12
Standard deviation of description length: 72.23068213879858


===> Lexical diversity also increased after filter out stopwords.

#### 1.2.6. Remove the word that appears only once in the document collection, based on term frequency

In [22]:
words = list(chain.from_iterable(tokenised_description))
term_fd = FreqDist(words)
lessFreqWords = set(term_fd.hapaxes())
print(f'The number of words that appear only once in the corpus: {len(lessFreqWords)}\n')
print(f'Including: \n{lessFreqWords}')


The number of words that appear only once in the corpus: 4105

Including: 
{'frameworks', 'draughting', 'undoubted', 'xpath', 'replying', 'ebucklecompassltd', 'preferredenthusiastic', 'suggesting', 'bigger', 'gauge', 'president', 'bullself', 'measurements', 'wedi', 'newbusinessaccountmanager', 'teamcenter', 'grad', 'rotary', 'testability', 'renovations', 'barring', 'joy', 'physiotherapists', 'careerswedbush', 'distributive', 'exmouth', 'yearly', 'eager', 'busienss', 'sainsbury', 'substantiation', 'dietry', 'competitiveness', 'wpf', 'photocopying', 'goodfellow', 'violent', 'predicated', 'stripping', 'implication', 'renown', 'communities', 'internalauditmanager', 'accountmanagermanchesterote', 'coachable', 'payrollmanager', 'frs', 'personalities', 'flows', 'lengthy', 'originating', 'justified', 'arises', 'cro', 'envolve', 'cobalt', 'tunbridge', 'coin', 'dmm', 'scientist', 'irregularity', "solution's", 'winn', 'payne', 'seniorsalesexecutive', 'titlebusiness', 'patience', 'appeared', 'data

In [23]:
def removeLessFreqWords(description):
    return [w for w in description if w not in lessFreqWords]

tokenised_description = [removeLessFreqWords(description) for description in tokenised_description]
stats_print(tokenised_description)

Vocabulary size:  5140
Total number of tokens:  101663
Lexical diversity:  0.05055920049575559
Total number of descriptions: 776
Average description length: 131.0090206185567
Maximun description length: 466
Minimun description length: 12
Standard deviation of description length: 69.61315176035262


===> Lexical diversity decreased. However, words that appear less frequency usually not contribute to distinguishing in terms of lexical or simply typo errors.

#### 1.2.7. Remove the top 50 most frequent words based on document frequency

In [24]:
words_2 = list(chain.from_iterable([set(description) for description in tokenised_description]))
doc_fd = FreqDist(words_2) # compute document frequency for each unique word/type
top50FreqWords = doc_fd.most_common(50)
print(f'Top 50 most frequent words and number of occurance in the corpus: \n{top50FreqWords}')

Top 50 most frequent words and number of occurance in the corpus: 
[('experience', 586), ('role', 499), ('work', 453), ('team', 431), ('working', 407), ('skills', 366), ('client', 358), ('job', 348), ('company', 343), ('business', 342), ('excellent', 309), ('management', 301), ('based', 287), ('apply', 286), ('opportunity', 280), ('salary', 270), ('required', 269), ('successful', 267), ('support', 261), ('join', 252), ('candidate', 248), ('service', 242), ('knowledge', 241), ('development', 235), ('leading', 234), ('high', 224), ('manager', 220), ('www', 220), ('training', 214), ('sales', 211), ('strong', 211), ('including', 209), ('provide', 209), ('services', 208), ('ability', 201), ('contact', 200), ('position', 199), ('recruitment', 196), ('full', 194), ('benefits', 193), ('posted', 192), ('originally', 191), ('jobseeking', 191), ('clients', 187), ('include', 187), ('good', 187), ('essential', 186), ('information', 184), ('customer', 182), ('environment', 182)]


In [25]:
def removetop50FreqWords(description):
    return [w for w in description if w not in top50FreqWords]

tokenised_description = [removetop50FreqWords(description) for description in tokenised_description]
print(f'Statisctics after top 50 frequent words as final statistics after pre-processing steps: \n')
stats_print(tokenised_description)

Statisctics after top 50 frequent words as final statistics after pre-processing steps: 

Vocabulary size:  5140
Total number of tokens:  101663
Lexical diversity:  0.05055920049575559
Total number of descriptions: 776
Average description length: 131.0090206185567
Maximun description length: 466
Minimun description length: 12
Standard deviation of description length: 69.61315176035262


#### 1.2.8. Save all job advertisement text and information in txt file(s)
##### `description.txt`
Each line is a Tokenised description text for advetised job, seperated by a whitespace.
##### `category.txt`
Corresponding labels for job descriptions (Labels value: 0,1,2,3)
- each line is a label
##### `title.txt`
Corresponding titles for job description
- each line is a title

In [26]:
def save_description(descriptionsFilename,tokenised_description):
    out_file = open(descriptionsFilename, 'w') # creates a txt file and open to save the reviews
    string = "\n".join([" ".join(description) for description in tokenised_description])
    out_file.write(string)
    out_file.close() # close the file

# save the tokenised description to txt file
descriptionsFilename = 'descriptions.txt'
save_description(descriptionsFilename,tokenised_description)

def save_category(categoryFilename,category):
    out_file = open(categoryFilename, 'w') # creates a txt file and open to save sentiments
    string = "\n".join([str(s) for s in category])
    out_file.write(string)
    out_file.close() # close the file

# save category to txt file
categoryFilename = 'category.txt'
save_category(categoryFilename,df.target)

def save_titles(titlesFilename,titles):
    out_file = open(titlesFilename, 'w') # creates a txt file and open to save sentiments
    string = "\n".join([str(s) for s in titles])
    out_file.write(string)
    out_file.close() # close the

# save titles to txt file
titlesFilename = 'titles.txt'
save_titles(titlesFilename,titles)

##### `job.csv`
Store information into Dataframe and save into CSV file


In [30]:
tokenised_title = [tokenizeDescription(title) for title in titles]
#print(len(titles))
#print(len(tokenised_title))
tokenised_company = [tokenizeDescription(company) for company in companies]
#print(len(companies))
#print(len(tokenised_company))

# Create job data frame
job_df = pd.DataFrame({'Title': titles, 'Tokenised Title': tokenised_title, 
                       'Webindex': webindex, 
                       'Company': companies, 'Tokenised Company': tokenised_company, 
                       'Description': descriptions, 'Tokenised Description': tokenised_description,
                       'Category': df.target})
job_df

Unnamed: 0,Title,Tokenised Title,Webindex,Company,Tokenised Company,Description,Tokenised Description,Category
0,Finance / Accounts Asst Bromley to ****k,"[finance, accounts, asst, bromley, to, k]",68997528,First Recruitment Services,"[first, recruitment, services]",Accountant (partqualified) to **** p.a. South ...,"[accountant, partqualified, south, east, londo...",0
1,Fund Accountant Hedge Fund,"[fund, accountant, hedge, fund]",68063513,Austin Andrew Ltd,"[austin, andrew, ltd]",One of the leading Hedge Funds in London is cu...,"[leading, hedge, funds, london, recruiting, fu...",0
2,Deputy Home Manager,"[deputy, home, manager]",68700336,Caritas,[caritas],An exciting opportunity has arisen to join an ...,"[exciting, opportunity, arisen, join, establis...",2
3,Brokers Wanted Imediate Start,"[brokers, wanted, imediate, start]",67996688,OneTwoTrade,[onetwotrade],OneTwoTrade is expanding their Sales Team and ...,"[expanding, sales, team, recruiting, junior, t...",0
4,RGN Nurses (Hospitals) Penarth,"[rgn, nurses, hospitals, penarth]",71803987,Swiis Healthcare,"[swiis, healthcare]",RGN Nurses (Hospitals) Immediate fulltime and ...,"[rgn, nurses, hospitals, fulltime, part, swiis...",2
...,...,...,...,...,...,...,...,...
771,"Apply Today, Start Tomorrow New Sales for 2013","[apply, today, start, tomorrow, new, sales, for]",70457475,Motion Marketing Ltd,"[motion, marketing, ltd]","Apply Today, Start Tomorrow New Sales for 2013...","[apply, today, start, tomorrow, sales, money, ...",3
772,Assembly/Production Technicians Milton Keynes,"[assembly, production, technicians, milton, ke...",71631590,Newstaff Employment Services Ltd,"[newstaff, employment, services, ltd]",Main Purpose of Job:To perform a range of mech...,"[main, purpose, job, perform, range, mechanica...",1
773,Medical Sales Executive/Associate Orthopaedics,"[medical, sales, executive, associate, orthopa...",70028343,Progress Sales Recruitment,"[progress, sales, recruitment]",Sales Associate – Hip and Knee Orthopaedics A ...,"[sales, associate, hip, knee, orthopaedics, ma...",3
774,Mobile Optometrist Oxford,"[mobile, optometrist, oxford]",71402732,Zest Optical,"[zest, optical]",A mobile Super Optometrist is required to join...,"[mobile, optometrist, required, join, leading,...",2


In [31]:
# update Webindex to integer
job_df['Webindex'] = job_df['Webindex'].astype(int)

# update Category to corresponding category names
job_df['Category'] = [df['target_names'][i] for i in job_df['Category']]

In [33]:
# Check dataframe at test index 
job_df.loc[idx]

Title                        PERM Unit Mgr RGN Kid minster Flexi ****K due
Tokenised Title          [perm, unit, mgr, rgn, kid, minster, flexi, k,...
Webindex                                                          71692209
Company                                                                 NA
Tokenised Company                                                     [na]
Description              Job Title: Unit Manager Reporting to: Register...
Tokenised Description    [job, title, unit, manager, reporting, registe...
Category                                                Healthcare_Nursing
Name: 20, dtype: object

In [36]:
# Save job data to csv file
job_df.to_csv('job_data.csv', index=False)

# Get job_data info
job_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 776 entries, 0 to 775
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Title                  776 non-null    object
 1   Tokenised Title        776 non-null    object
 2   Webindex               776 non-null    int64 
 3   Company                776 non-null    object
 4   Tokenised Company      776 non-null    object
 5   Description            776 non-null    object
 6   Tokenised Description  776 non-null    object
 7   Category               776 non-null    object
dtypes: int64(1), object(7)
memory usage: 48.6+ KB


#### 1.2.9. Build a vocabulary of the cleaned job advertisement descriptions, save as txt fille `vocab.txt`

This file contains the unigram vocabulary, one each line, in the following format: word_string:word_integer_index. Very importantly, words in the vocabulary must be sorted in alphabetical order, and the index value starts from 0. This file is the key to interpret the sparse encoding. For instance, in the following example, the word aaron is the 20th word (the corresponding integer_index as 19) in the vocabulary (note that the index values and words in the following image are artificial and used to demonstrate the required format only, it doesn't reflect the values of the actual expected output).

In [37]:
def write_vocab(vocab, filename):
    with open(filename, 'w') as f:  # creates a txt file open in write mode
        for i, word in enumerate(vocab):
            # write each index and vocabulary word, note that index start from 0
            f.write(word + ':' + str(i) + '\n')
            
# convert tokenized description into a alphabetically sorted list
vocab = sorted(list(set(chain.from_iterable(tokenised_description))))

# save the sorted vocabulary list according to the requirement
write_vocab(vocab, 'vocab.txt')

## Summary
Give a short summary and anything you would like to talk about the assessment task here.

## Couple of notes for all code blocks in this notebook
- please provide proper comment on your code
- Please re-start and run all cells to make sure codes are runable and include your output in the submission.   
<span style="color: red"> This markdown block can be removed once the task is completed. </span>