## Expanding Base Dataset

**Author:** Shaun Khoo  
**Date:** 18 Oct 2021  
**Context:** Investigating the original training data revealed serious data quality issues. Over 70% of the SSOCs have fewer than 10 samples in the full training set (and 50% have no examples at all), which will result in a poorer model that is unable to predict most of the available SSOCs.  
**Objective:** Instead of relying on the original training data, we construct a new training dataset that is smaller but of higher quality. It is assembled by finding the most relevant job postings for each SSOC, ensuring that we have at least 10 samples for each SSOC.
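
The coverage figures quoted in the context can be computed along the following lines. This is a minimal sketch on toy data: `all_ssocs`, `sample_labels` and `MIN_SAMPLES` are hypothetical names introduced for illustration, not objects from this notebook.

```python
from collections import Counter

# Hypothetical toy inputs: every known SSOC code, and the label of each training sample
all_ssocs = ["11110", "13430", "21222", "96293"]
sample_labels = ["11110"] * 12 + ["13430"] * 3

MIN_SAMPLES = 10  # target stated in the objective

counts = Counter(sample_labels)
# SSOCs that never appear in the training data get a count of 0
samples_per_ssoc = {ssoc: counts.get(ssoc, 0) for ssoc in all_ssocs}

under_covered = [s for s, n in samples_per_ssoc.items() if n < MIN_SAMPLES]
no_examples = [s for s, n in samples_per_ssoc.items() if n == 0]

share_under = len(under_covered) / len(all_ssocs)
share_none = len(no_examples) / len(all_ssocs)
print(f"Under {MIN_SAMPLES} samples: {share_under:.0%}, no samples: {share_none:.0%}")
```

Counting against the full list of SSOC codes, rather than only the codes seen in the training labels, is what makes the SSOCs with zero examples visible.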

#### A) Setting up

Importing the libraries and data here

In [2]:
import os
os.chdir('..')

In [3]:
import numpy as np
import pandas as pd
import copy
import re
import json
pd.options.mode.chained_assignment = None

In [5]:
SSOC_2020 = pd.read_csv('Data/Processed/Training/train-aws/SSOC_2020.csv')
data = pd.read_csv('Data/Processed/Training/train-aws/train_full.csv')
extra_info = pd.read_csv('Data/Archive/MCF_Training_Set_Full.csv')

#### B) Checking job titles

For the first step, we import the detailed definitions for SSOC 2020 and use the job titles to search for matching titles in the MCF job postings. We also use the "Examples of Job Classified Elsewhere" column, which points each listed title to the SSOC it actually belongs to, to enable a wider search and prevent misclassification.

In [4]:
detailed_definitions_raw = pd.read_excel('Data/Raw/SSOC2020 Detailed Definitions.xlsx', skiprows = 4)

  warn("""Cannot parse header or footer so it will be ignored""")


In [5]:
detailed_definitions = detailed_definitions_raw[(~detailed_definitions_raw['SSOC 2020'].astype('str').str.contains('X')) &
                                                (detailed_definitions_raw['SSOC 2020'].astype('str').apply(len) >= 5)].reset_index(drop = True)

Clean both the relevant and incorrect job titles

In [6]:
to_replace = {
    '•': '',
    '\n': '.',
    '<Blank>': '',
    # Raw string so that the backslashes reach the regex engine unchanged
    r"\([A-Za-z0-9 /.,*&'-]+\)": ''
}

detailed_definitions['Relevant_Job_Titles_Cleaned'] = detailed_definitions['Examples of Job Classified Under this Code']
detailed_definitions['Incorrect_Job_Titles_Cleaned'] = detailed_definitions['Examples of Job Classified Elsewhere']

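# The keys are regular expressions, so request regex matching explicitly
# (newer versions of pandas default to literal matching)
for k, v in to_replace.items():
    detailed_definitions['Relevant_Job_Titles_Cleaned'] = detailed_definitions['Relevant_Job_Titles_Cleaned'].str.replace(k, v, regex = True)
    detailed_definitions['Incorrect_Job_Titles_Cleaned'] = detailed_definitions['Incorrect_Job_Titles_Cleaned'].str.replace(k, v, regex = True)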
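
To make the effect of the cleaning rules concrete, here is a small sketch that applies them to a single string with `re.sub`. It assumes the first key of `to_replace` is the bullet character; the helper `clean_titles` and the raw value are hypothetical and only serve as an illustration.

```python
import re

# Mirrors the cleaning patterns used in this notebook
to_replace = {
    '•': '',                              # bullet markers
    '\n': '.',                            # one title per line becomes '.'-separated
    '<Blank>': '',                        # placeholder for empty cells
    r"\([A-Za-z0-9 /.,*&'-]+\)": ''       # parenthesised qualifiers
}

def clean_titles(text):
    """Apply every cleaning pattern in turn to one raw cell value."""
    for pattern, replacement in to_replace.items():
        text = re.sub(pattern, replacement, text)
    return text

# Hypothetical raw cell value
raw = "• Nursing director (hospital)\n• Director of nursing"
cleaned = clean_titles(raw)
titles = [t.strip() for t in cleaned.split('.')]
print(titles)  # ['Nursing director', 'Director of nursing']
```

Because newlines are turned into full stops and the result is later split on `'.'`, any title that itself contains a full stop (for example one ending in "n.e.c.") will be split into fragments.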



Use both data points to create a dictionary that helps us to map the job titles to the SSOC

In [7]:
ssoc_job_titles = {}
for i, row in detailed_definitions.iterrows():
    titles = [row['SSOC 2020 Title']]
    titles.extend([title.strip() for title in row['Relevant_Job_Titles_Cleaned'].split('.')])
    final_titles = list(set([title.lower() for title in titles]))
    ssoc_job_titles[row['SSOC 2020']] = final_titles

In [8]:
def extract_incorrect_job_titles(text):
    
    # If there are no jobs classified elsewhere, return nothing
    if len(text) == 0:
        return {}
    
    incorrect_job_titles = text.split('.')
    output = {}
    for entry in incorrect_job_titles:
        
        # Use the fact that the structure is consistent
        ssoc = entry.split(', see')[1].strip()
        title = entry.split(', see')[0].strip().lower()
        if ssoc in output.keys():
            output[ssoc].append(title)
        else:
            output[ssoc] = [title]
    return output

additions = detailed_definitions['Incorrect_Job_Titles_Cleaned'].apply(extract_incorrect_job_titles)

# Append the new 'jobs classified elsewhere' to their relevant SSOC
for addition in additions:
    if len(addition.keys()) != 0:
        for k, v in addition.items():
            ssoc_job_titles[k].extend(v)
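
As a worked example of the "title, see SSOC" structure that this step relies on, the sketch below parses one hypothetical cleaned string. The function `parse_classified_elsewhere`, the titles and the codes are all made up for illustration; unlike the cell above, this variant silently skips fragments that do not contain `', see'`.

```python
def parse_classified_elsewhere(text):
    """Map each referenced SSOC code to the titles that belong to it."""
    mapping = {}
    for entry in text.split('.'):
        if ', see' not in entry:
            continue  # skip empty or malformed fragments
        title, ssoc = entry.split(', see', 1)
        mapping.setdefault(ssoc.strip(), []).append(title.strip().lower())
    return mapping

# Hypothetical cleaned value with made-up codes
example = "Clinic manager, see 13420.Ward nurse, see 22200.Health services manager, see 13420"
parsed = parse_classified_elsewhere(example)
print(parsed)
# {'13420': ['clinic manager', 'health services manager'], '22200': ['ward nurse']}
```

Each parsed title is then appended to the title list of the SSOC it points to, which is how a title listed under one SSOC's "classified elsewhere" column widens the search for another SSOC.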

In [9]:
def find_matching_job_title(data,
                            include,
                            exclude,
                            exclude_desc = None):
    
    # Avoid a mutable default argument
    if exclude_desc is None:
        exclude_desc = []
    
    output = copy.deepcopy(data)
    output['title'] = output['title'].str.lower()
    output['description'] = output['description'].str.lower()
    
    # A posting is kept if its title contains every word of at least one phrase in `include`
    include_boolean = pd.Series(False, index = output.index)
    for words in include:
        entry_boolean = pd.Series(True, index = output.index)
        
        # This helps to clean out punctuation that trips up our functions below
        for k,v in to_replace.items():
            words = re.sub(k, v, words)
            
        for word in words.split(' '):
            try:
                entry_boolean = entry_boolean & output['title'].str.contains(word.lower())
            except re.error:
                # The word is not a valid regular expression: report the phrase and skip the word
                print(words)
                
        include_boolean = include_boolean | entry_boolean
    
    # Drop postings whose title contains any excluded word
    for words in exclude:
        for word in words.split(' '):
            include_boolean = include_boolean & ~output['title'].str.contains(word.lower())
    
    # Drop postings whose description contains any excluded word
    for words in exclude_desc:
        for word in words.split(' '):
            include_boolean = include_boolean & ~output['description'].str.contains(word.lower())
    
    job_titles_idx = output[include_boolean.values].index.tolist()
    return job_titles_idx
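
The matching rule used by `find_matching_job_title` can be illustrated without pandas. The sketch below is a simplified stand-in, not the function above: `title_matches` is a hypothetical helper, the titles are invented, and it uses plain substring tests rather than the regular-expression matching that `str.contains` performs by default.

```python
def title_matches(title, include, exclude=()):
    """Return True if the title satisfies the include rule and no exclude word."""
    title = title.lower()
    # OR across phrases, AND across the words within one phrase
    included = any(all(word in title for word in phrase.lower().split())
                   for phrase in include)
    # Any single excluded word rules the title out
    excluded = any(word in title
                   for phrase in exclude
                   for word in phrase.lower().split())
    return included and not excluded

# Invented titles for illustration
titles = [
    "Assistant Director of Nursing (Nursing Home)",
    "Director, Nursing Services (transport provided)",
    "Staff Nurse",
]
include = ["director of nursing", "nursing director"]
matches = [i for i, t in enumerate(titles)
           if title_matches(t, include, exclude=["provided"])]
print(matches)  # [0]
```

Two properties are worth keeping in mind when reading the results: word order within a phrase does not matter, and short words such as "of" match anywhere inside a longer word, so the rule favours recall over precision.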

Test our function to make sure it works

In [10]:
job_titles_idx = find_matching_job_title(extra_info,
                                         include = ssoc_job_titles['13430'],
                                         exclude = ['provided'])

In [11]:
print(job_titles_idx)
for i, title in extra_info.loc[job_titles_idx, 'title'].items():
    print(f"{i}: {title}")

[3007, 3415, 3416, 11933]
3007: ASSISTANT DIRECTOR OF NURSING ( NURSING HOME) #SGUnitedJobs
3415: ASSISTANT DIRECTOR OF NURSING ( NURSING HOME) #SGUnitedJobs
3416: ASSISTANT DIRECTOR OF NURSING ( NURSING HOME) #SGUnitedJobs
11933: ASSISTANT DIRECTOR OF NURSING ( NURSING HOME) #SGUnitedJobs


Now we run it for the entire group of SSOCs

In [None]:
output = {}
for ssoc in ssoc_job_titles.keys():
    job_titles_idx = find_matching_job_title(extra_info,
                                             include = ssoc_job_titles[ssoc],
                                             exclude = ['provided'])
    output[ssoc] = job_titles_idx

How many SSOCs have fewer than 10 matching job postings?

In [None]:
count = 0
for ssoc in output.keys():
    if len(output[ssoc]) < 10:
        count += 1
count

#### C) Using word embeddings

For the second step, we convert our current subset of MCF data into word embeddings using `spacy`'s built-in word vectors. These are used to find job descriptions that are similar to each SSOC description, enabling a more thorough search that is not confined to job titles.

Use `spacy` to convert the words into embeddings

In [12]:
import spacy
from spacy.language import Language
nlp = spacy.load('en_core_web_lg', disable = ['tagger', 'parser', 'ner', 'lemmatizer'])
stopwords = nlp.Defaults.stop_words

In [13]:
@Language.component("additional_preprocessing")
def additional_preprocessing(doc):
    lemma_list = [tok for tok in doc
                  if tok.is_alpha and tok.text.lower() not in stopwords] 
    return lemma_list
nlp.add_pipe('additional_preprocessing', last = True)

<function __main__.additional_preprocessing(doc)>

Run the `nlp` processing pipeline over the two corpora and convert the job postings into vectors

In [14]:
SSOC_2020_nlp = list(nlp.pipe(SSOC_2020['Description']))
data_nlp = list(nlp.pipe(data['Cleaned_Description']))

In [15]:
target_vecs = []
for i, desc in enumerate(data_nlp):
    if (i % 100 == 0) or (i+1 == len(data_nlp)):
        print(f'Job posting {i}/{len(data_nlp)}...\r', end = '')
    if len(desc) == 0:
        target_vecs.append(np.array([0]*300))
    else:
        target_vecs.append(np.mean([token.vector for token in desc], axis = 0))

Job posting 42841/42842...
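
The document vectors built above are simple averages of token vectors, with an all-zero vector used when no tokens survive preprocessing. A dependency-free sketch of that idea, using tiny 3-dimensional vectors instead of the 300-dimensional ones in the notebook (`mean_vector` is a hypothetical helper):

```python
def mean_vector(token_vectors, dim):
    """Average token vectors component-wise; return zeros for an empty document."""
    if not token_vectors:
        return [0.0] * dim
    n = len(token_vectors)
    return [sum(vec[j] for vec in token_vectors) / n for j in range(dim)]

# Two toy token vectors standing in for one job description
doc_tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
print(mean_vector(doc_tokens, dim=3))  # [2.0, 1.0, 1.0]
print(mean_vector([], dim=3))          # [0.0, 0.0, 0.0]
```

Averaging discards word order and gives every remaining token equal weight, which is why stop words and non-alphabetic tokens are filtered out beforehand.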

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

def identify_top_n(selected,
                   data,
                   extra_info,
                   target_vecs,
                   top_n = 15,
                   threshold = 0.8):
    
    source_vec = np.array([np.mean([token.vector for token in selected], axis = 0)])
    matrix = cosine_similarity(source_vec, target_vecs)
    indices = np.apply_along_axis(lambda x: x.argsort()[-top_n:][::-1], axis = 1, arr = matrix)
    above_threshold = matrix[0][indices][0] >= threshold
    indices = [idx for idx, above in zip(indices[0], above_threshold) if above]
    if len(indices) == 0:
        print('None meet the threshold required.')
    else:
        cosine_similarity_index = 0
        for i, row in data.loc[indices, :].iterrows():
            print(f'Index: {i}')
            print(f'Cosine similarity: {matrix[0][indices][cosine_similarity_index]}')
            print(f'Predicted SSOC: {row["SSOC 2020"]}')
            print(f'Job title: {extra_info["title"][i]}')
            print(f'Description: {row["Cleaned_Description"]}')
            print('================================================================')
            cosine_similarity_index += 1
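
The selection logic in `identify_top_n` (rank by cosine similarity, keep the top `top_n`, then drop anything below `threshold`) can be sketched in plain Python as follows. The helpers `cosine` and `top_n_above_threshold` and the 2-dimensional vectors are illustrative assumptions; returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def top_n_above_threshold(source, targets, top_n=15, threshold=0.8):
    """Return (index, similarity) pairs for the best matches that meet the threshold."""
    scored = [(i, cosine(source, t)) for i, t in enumerate(targets)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scored[:top_n] if s >= threshold]

source = [1.0, 0.0]
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.1]]
result = top_n_above_threshold(source, targets, top_n=3, threshold=0.8)
print([i for i, _ in result])  # [0, 4]
```

The threshold is applied after the top-n cut, so fewer than `top_n` results (possibly none) may be returned for an SSOC whose description has no close neighbours.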

#### D) Manual tagging

Use both the job titles and word embeddings to help identify the relevant job postings for each SSOC so as to improve coverage of the dataset.

In [17]:
# #Run this to initialise the dictionary object
# with open('manual_tagging.json', 'r') as outfile:
#     manual_tagging = json.load(outfile)

In [1752]:
# # Run this to export the manual tagging to the JSON file
# with open('manual_tagging.json', 'w') as outfile:
#     json.dump(manual_tagging, outfile)
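
One detail to keep in mind when saving and reloading the tagging dictionary: JSON object keys are always strings, so an SSOC stored under an integer key comes back as a string key. A short illustration using the standard `json` module (the dictionary content is a toy example):

```python
import json

# Toy dictionary with an integer key, as it might be built in memory
manual_tagging = {96293: [21687]}

restored = json.loads(json.dumps(manual_tagging))
print(restored)             # {'96293': [21687]}
print(96293 in restored)    # False
print('96293' in restored)  # True
```

Keeping the SSOC as a string everywhere in the tagging workflow avoids silently creating two entries for the same code.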

In [1747]:
# Simple guard to prevent accidentally overriding an existing entry
if ssoc in manual_tagging.keys():
    resp = input(f"SSOC {ssoc} is already in the dictionary. Are you sure you want to override? Y or N")
    if resp != 'Y':
        raise AssertionError("Stop")
        
# Input the indices here
inputting = [21687]


# Deduplicate the indices
inputting_dedup = list(set(inputting))
inputting_dedup_for_add = copy.deepcopy(inputting_dedup)

# Initialise and append the indices to the SSOC
manual_tagging[ssoc] = []
for key in manual_tagging.keys():
    for new_idx in inputting_dedup:
        if new_idx in manual_tagging[key]:
            print('---------------------------------------------------------------------')
            print(f'Duplicate detected for index {new_idx} which has already been marked for SSOC {key} ({len(manual_tagging[key])})')
            print(f'Job title for {new_idx}: {extra_info.loc[new_idx, "title"]}')
            print(f'SSOC title for {key}: {ssoc_job_titles[str(key)]}')
            print(f'Job description:')
            print(f'{extra_info.loc[new_idx, "description"]}')
            resp2 = input(f"Override or not? Y or N")
            if resp2 == 'Y':
                print(f'Removed {new_idx} which had been marked for SSOC {key}')
                manual_tagging[key].remove(new_idx)
            else:
                inputting_dedup_for_add.remove(new_idx)
                
print('---------------------------------------------------------------------')
manual_tagging[ssoc].extend(inputting_dedup_for_add)
print(f'SSOC {ssoc} updated successfully!')
print(manual_tagging[ssoc])

---------------------------------------------------------------------
SSOC 96293 updated successfully!
[21687]
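
The duplicate check in the cell above is interactive. For reference, the detection part on its own could look like the sketch below, which only reports conflicts and leaves the decision to the caller. `find_conflicts` is a hypothetical helper and the dictionary content is invented.

```python
def find_conflicts(manual_tagging, ssoc, new_indices):
    """Return {index: other_ssoc} for indices already tagged under a different SSOC."""
    conflicts = {}
    for key, tagged in manual_tagging.items():
        if key == ssoc:
            continue  # re-tagging under the same SSOC is not a conflict
        for idx in new_indices:
            if idx in tagged:
                conflicts[idx] = key
    return conflicts

# Invented tagging state
tagging = {'51311': [10, 11], '96293': [21687]}
conflicts = find_conflicts(tagging, '96299', [11, 42])
print(conflicts)  # {11: '51311'}
```

Separating detection from resolution makes it possible to review all conflicts for a batch of indices before changing the dictionary.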


##### Set the SSOC we are looking for here

In [1748]:
ssoc = str(96299)
print(ssoc_job_titles[str(ssoc)])

['other elementary workers n.e.c.', 'food delivery on foot', 'food delivery on foot']


In [1749]:
#import copy
search_titles = copy.deepcopy(ssoc_job_titles[str(ssoc)])
extra = input("Any other job titles to add?")
if len(extra) > 0:
    search_titles.extend([title.strip() for title in extra.split(',')])
#search_titles.extend(['teaching superintendent'])
if '' in search_titles:
    search_titles.remove('')
# search_titles.remove('housekeeper (hotels and other establishments)')
# search_titles.remove('car driver')

print(search_titles)

Any other job titles to add? food deliver


['other elementary workers n.e.c.', 'food delivery on foot', 'food delivery on foot', 'food deliver']


In [1750]:
job_titles_idx = find_matching_job_title(extra_info,
                                         include = search_titles,
                                         exclude = [],
                                         exclude_desc = [])

# job_titles_idx = extra_info[(extra_info['description'].str.lower().str.contains('hawker|food|restaurant')) &
#                              ~extra_info['title'].str.lower().str.contains('factory|cook|dishwasher') &
#                              ~extra_info['description'].str.lower().str.contains('classroom|teach|student|child') &
#                               (extra_info['title'].str.lower().str.contains('cleaner') | extra_info['title'].str.lower().str.contains('cleaning'))
# #                             ~extra_info['description'].str.lower().str.contains('construction')
#                             #extra_info['company_name'].str.lower().str.contains('school') ) &
# #                             (extra_info['description'].str.lower().str.contains('language') &
# #                              extra_info['description'].str.lower().str.contains('school')) &
# #                              extra_info['title'].str.lower().str.contains('supervisor') &
# #                              extra_info['description'].str.lower().str.contains('admin') &
# #                              extra_info['description'].str.lower().str.contains('account') 
#                              #extra_info['description'].str.lower().str.contains('office') 
#                             ].index.tolist()

print(job_titles_idx)
for i, title in extra_info.loc[job_titles_idx, 'title'].items():
    print(f"{i}: {title}")

[1951, 3586, 7831, 17873, 18673, 27564, 31630, 31637, 41203]
1951: Part Time Food Delivery Driver
3586: Motor Food Delivery Rider-Freelance with Bonus
7831: Product Lead - Food Delivery 🍔
17873: Food Delivery Driver
18673: GrabFood Delivery Partner
27564: 3453 -  Delivery Driver [ Frozen Food / Truck / Class 3 / Pioneer ] 
31630: Food Delivery Riders
31637: Food Delivery Riders
41203: Class 3 delivery driver [Food importer & distributer / Truck can drive back / Tuas] 9156


In [1756]:
# for i in job_titles_idx:
#     print(i)
#     print(extra_info.loc[i, 'company_name'])
#     print(extra_info.loc[i, 'title'])
#     print(extra_info.loc[i, 'Predicted_SSOC_2020'])
#     print(extra_info.loc[i, 'description'])
#     print('----------------------')

In [1755]:
# ssoc_index = SSOC_2020[SSOC_2020['SSOC 2020'] == int(ssoc)].index[0]
# identify_top_n(SSOC_2020_nlp[ssoc_index], data, extra_info, target_vecs, top_n = 50, threshold = 0.80)

#### E) Generating Labelled Dataset

Using the manual tagging, we generate the labelled dataset that will be used for the initial training.

In [6]:
# Run this to load the manual tagging dictionary from the JSON file
with open('manual_tagging.json', 'r') as infile:
    manual_tagging = json.load(infile)

In [7]:
print(f'Total number of labelled SSOCs: {len(manual_tagging.keys())}')

Total number of labelled SSOCs: 564


In [8]:
tagged_data_list = []

for ssoc, mcf_idx in manual_tagging.items():
    
    print(f'Processing SSOC {ssoc}...\r', end = '')
    
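    # Take an explicit copy so that adding the label column does not modify a view of extra_info
    tagged_data = extra_info.loc[mcf_idx, :].copy()
    tagged_data['Predicted_SSOC_2020'] = ssoc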
    
    tagged_data_list.append(tagged_data)

Processing SSOC 96293...

In [9]:
raw_labelled = pd.concat(tagged_data_list)

In [13]:
raw_labelled.shape

(14557, 19)

In [12]:
raw_labelled.to_csv('Data/Raw/Raw_Labelled.csv')
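
Before relying on the exported file, it may be worth checking the tagging dictionary against the objective stated at the top. The sketch below, on a toy dictionary, flattens the tagging into (index, SSOC) records, lists SSOCs that fall short of the 10-sample target and flags postings labelled more than once. All names and values here are illustrative assumptions.

```python
from collections import Counter

MIN_SAMPLES = 10  # target from the objective

# Toy tagging dictionary: SSOC code -> list of job posting indices
manual_tagging = {
    '13430': list(range(100, 112)),  # 12 postings
    '96293': [21687, 105],           # 2 postings, one shared with '13430'
}

# Flatten into one (posting index, SSOC) record per label
records = [(idx, ssoc) for ssoc, indices in manual_tagging.items() for idx in indices]

# SSOCs that do not yet meet the minimum number of samples
short = {ssoc: len(indices) for ssoc, indices in manual_tagging.items()
         if len(indices) < MIN_SAMPLES}

# Postings that were assigned to more than one SSOC
label_counts = Counter(idx for idx, _ in records)
duplicated = sorted(idx for idx, n in label_counts.items() if n > 1)

print(len(records), short, duplicated)  # 14 {'96293': 2} [105]
```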