## Assessing MRSD's model

**Author:** Shaun Khoo  
**Date:** 8 Sep 2021  
**Context:** Need to benchmark our model against existing alternatives to see how much better we do  
**Objective:** Generate labels for our dataset using Lucas's pre-trained SSOC autocoder model so we can compare it to our own model's performance

#### A) Importing libraries and data

In [1]:
import pandas as pd
import re

In [2]:
mcf_data = pd.read_csv('../Data/Processed/Artifacts/Raw_Text.csv')

In [3]:
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'
table = str.maketrans(punct,' '*len(punct))

def remove_html_tags_newline(text):
    """
    Removes HTML tags, newlines, punctuation, and digits from a string using generic regexes

    Parameters:
        text (str): Selected text

    Returns:
        cleaned_text (str): Lowercased text containing only space-separated alphabetic tokens
    """

    clean = re.compile(r'<.*?>')
    newline_clean = re.compile(r'\n')
    non_punc = re.compile(r'[^\w\s]')
    # Strip tags, turn newlines and punctuation into spaces, then lowercase
    output = re.sub(non_punc, ' ', re.sub(newline_clean, ' ', re.sub(clean, '', text))).lower()
    # Replace remaining symbols (e.g. underscores) with spaces and collapse whitespace
    a = ' '.join(output.translate(table).split())
    # Keep only alphabetic tokens, so digits are dropped
    return ' '.join(re.findall(r"[a-zA-Z]+", a))

In [4]:
mcf_data['Cleaned_Description'] = mcf_data['Title'].apply(remove_html_tags_newline) + " " + mcf_data['Description'].apply(remove_html_tags_newline)
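
To make the effect of the cleaning step concrete, here is a condensed sketch of the same idea applied to a made-up snippet. The function name `clean_text` and the sample string are illustrative assumptions, not part of the notebook; the sketch only keeps the steps that determine the final output (tag removal, newline replacement, lowercasing, alphabetic-token extraction).

```python
import re

def clean_text(text):
    # Hypothetical condensed version of the cleaning logic above
    no_tags = re.sub(r'<.*?>', '', text)        # drop HTML tags
    no_newlines = no_tags.replace('\n', ' ')    # newlines become spaces
    # Keep only runs of ASCII letters, so digits and punctuation disappear
    return ' '.join(re.findall(r'[a-zA-Z]+', no_newlines.lower()))

# Made-up job posting fragment for illustration
sample = '<p>Senior Engineer (3-year contract)</p>\n<ul><li>Design & build APIs</li></ul>'
cleaned = clean_text(sample)
print(cleaned)
```

Note that digits are dropped entirely, which is why the cleaned posting shown in the next cell reads "year contract" without a number.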

In [5]:
mcf_data['Cleaned_Description'][0]

'pega solution architect year contract technical specialists will be responsible for designing and building components of enterprise applications and providing consultative guidance on all project assignments he she will work as part of a project team to ensure that the business and technical architecture of the delivered solution matches customer requirements at times he she will be asked to lead aspects of design development and mentoring of resources below are the few vital things the resource need to possess strong communication and presentation skills primary skills must have good knowledge of general prpc architecture good understanding on bpm best practices implementation life cycles end to end experience of prpc based application design and implementation actively participate in the requirements design and construction phases to lead to successful delivery of the project able to plan and lead the execution of pprc implementation enhancements possess strong prpc knowledge in all

#### B) Generating predictions
Importing the `fasttext` model and generating the predictions

In [8]:
import fasttext # note you have to install fasttext==0.8.4
import numpy as np

In [9]:
def ft_output_single(x):
    return re.sub('__label__','',x[0][0])

In [10]:
model = fasttext.load_model("../Models/ft_epoch50_25wvs_mcf3.bin")

In [11]:
preds_raw = model.predict(np.array(mcf_data['Cleaned_Description']), k=1)


In [10]:
mcf_data['Predicted SSOC'] = [pred[0].replace('__label__', '') for pred in preds_raw]
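
The label clean-up above can be illustrated without loading the model. This sketch assumes, as the cell above does, that `predict` with `k=1` yields one list of labels per input and that each label carries the `__label__` prefix; the codes are made-up placeholders.

```python
# Hypothetical stand-in for the model output: one list of labels per posting
preds_raw = [['__label__25121'], ['__label__12211']]

# Take the top label of each prediction and strip the fastText prefix
labels = [pred[0].replace('__label__', '') for pred in preds_raw]
print(labels)
```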

Importing the SSOC mapping table (v2018)

In [11]:
ssoc = pd.read_csv('../Data/Raw/ssoc_v2018.csv', encoding='iso-8859-1')
ssoc.dropna(inplace = True)
ssoc['ssoc_f'] = ssoc['ssoc_f'].astype('float').astype('int').astype('str')
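
The `float` to `int` to `str` chain above presumably exists because the codes can arrive as floats or as strings such as `'21222.0'`, which would not match the plain string labels produced by the model. A minimal sketch with made-up codes:

```python
import pandas as pd

# Made-up codes for illustration; not taken from the real SSOC table
raw_codes = pd.Series(['21222.0', '13450.0', '25121'])

# Parse as float, truncate to int, then render as a plain digit string
normalised = raw_codes.astype('float').astype('int').astype('str')
print(normalised.tolist())
```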

Cleaning up the MCF data for the join

In [12]:
mcf_data = mcf_data[(mcf_data['SSOC_2015'] != 'X5000') & (mcf_data['SSOC_2015'].notnull())].copy()  # copy so the next assignment does not act on a slice

In [13]:
mcf_data['SSOC_2015'] = mcf_data['SSOC_2015'].astype('float').astype('int').astype('str')

In [14]:
mcf_data_final = mcf_data.merge(ssoc, left_on = 'SSOC_2015', right_on = 'ssoc_f', how = 'left').merge(ssoc, left_on = 'Predicted SSOC', right_on = 'ssoc_f', how = 'left')

In [15]:
mcf_data_final.drop(['ssoc_f_x', 'ssoc_f_y'], axis = 1, inplace = True)
mcf_data_final.rename({'ssoc_desc_x': "Reported SSOC Desc", "ssoc_desc_y": "Predicted SSOC Desc"}, axis = 1, inplace = True)
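
The `_x` and `_y` column names in the two cells above come from merging the same lookup table twice: the second merge finds `ssoc_f` and `ssoc_desc` on both sides and applies pandas' default suffixes. A small sketch with a made-up lookup table (the codes and descriptions are placeholders, not real SSOC entries):

```python
import pandas as pd

# Hypothetical lookup table and postings for illustration only
ssoc = pd.DataFrame({'ssoc_f': ['11111', '22222'],
                     'ssoc_desc': ['Role A', 'Role B']})
postings = pd.DataFrame({'SSOC_2015': ['11111', '22222'],
                         'Predicted SSOC': ['22222', '22222']})

merged = (
    postings
    # First join attaches the description of the reported code
    .merge(ssoc, left_on='SSOC_2015', right_on='ssoc_f', how='left')
    # Second join attaches the description of the predicted code
    .merge(ssoc, left_on='Predicted SSOC', right_on='ssoc_f', how='left')
)
print(list(merged.columns))
```

Here `ssoc_desc_x` belongs to the reported code and `ssoc_desc_y` to the predicted code, which matches the rename in the cell above.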

Checking some job postings

In [16]:
idx = 160
print("Job Title: " + mcf_data_final['Title'][idx])
print("Reported SSOC: " + mcf_data_final['Reported SSOC Desc'][idx])
print("Predicted SSOC: " + mcf_data_final['Predicted SSOC Desc'][idx])
mcf_data_final['Description'][idx]

Job Title: digital marketing executive
Reported SSOC: Other administrative and related associate professionals n.e.c. 
Predicted SSOC: Sales and marketing manager 


'<p>We are searching for a highly-creative Digital Marketing Executive/Manager to lead our marketing team. In this position, you will be responsible for all aspects of our marketing operations. Your central goal is to help grow our brand’s influence locally while also increasing brand loyalty and awareness.</p>\n<p>Your duties will include planning, implementing, and monitoring our digital marketing campaigns across all digital networks. Our ideal candidate is someone with experience in marketing, art direction, and social media management. In addition to being an outstanding communicator, you will also demonstrate excellent interpersonal and analytical skills.</p>\n<h3>Responsibilities:</h3>\n<ul>\n  <li>Design and oversee all aspects of our digital marketing department including our marketing database, email, and display advertising campaigns.</li>\n  <li>Develop and monitor campaign budgets.</li>\n  <li>Plan and manage our social media platforms.</li>\n  <li>Prepare accurate reports

Exporting the file

In [17]:
mcf_data_final.to_csv('../Data/Processed/Artifacts/MCF_Subset_WithLabels.csv', index = False)

#### C) Testing Lucas's model on the SSOC 2020 definitions

Import the SSOC 2020 definitions Excel file and combine the detailed definition for each SSOC with the job tasks (4D SSOC level)

In [17]:
SSOC_Definitions = pd.read_excel('../Data/Raw/SSOC2020 Detailed Definitions.xlsx', skiprows = 4)


In [18]:
SSOC_4D = SSOC_Definitions[SSOC_Definitions['SSOC 2020'].apply(len) == 4][['SSOC 2020', 'Tasks']]
SSOC_4D.columns = ['4D SSOC', 'Tasks']

In [19]:
SSOC_5D = SSOC_Definitions[(SSOC_Definitions['SSOC 2020'].apply(len) == 5) & ~SSOC_Definitions['SSOC 2020'].str.contains('X')].reset_index(drop = True)
SSOC_5D['4D SSOC'] = SSOC_5D['SSOC 2020'].str.slice(0, 4)
SSOC_5D.drop('Tasks', axis = 1, inplace = True)

In [20]:
SSOC_Final = SSOC_5D.merge(SSOC_4D, how = 'left', on = '4D SSOC')
SSOC_Final['Description'] = SSOC_Final['Detailed Definitions'] + " " + SSOC_Final['Tasks']
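
The 4D/5D combination above can be sketched on a toy table. The codes and text below are invented, and the sketch assumes (as the cells above do) that tasks are recorded only at the 4-digit level and that codes containing `X` should be excluded.

```python
import pandas as pd

# Toy definitions table; all values are made up for illustration
definitions = pd.DataFrame({
    'SSOC 2020': ['1111', '11111', '11112', 'X1000'],
    'Detailed Definitions': ['Group def', 'Def A', 'Def B', 'Def X'],
    'Tasks': ['Shared tasks', None, None, None],
})

code_len = definitions['SSOC 2020'].str.len()
is_4d = code_len == 4
is_5d = (code_len == 5) & ~definitions['SSOC 2020'].str.contains('X')

# Tasks live on the 4-digit rows
tasks_4d = definitions.loc[is_4d, ['SSOC 2020', 'Tasks']].rename(
    columns={'SSOC 2020': '4D SSOC'})

# Each 5-digit code inherits the tasks of its 4-digit parent
codes_5d = definitions.loc[is_5d, ['SSOC 2020', 'Detailed Definitions']].copy()
codes_5d['4D SSOC'] = codes_5d['SSOC 2020'].str.slice(0, 4)

combined = codes_5d.merge(tasks_4d, how='left', on='4D SSOC')
combined['Description'] = combined['Detailed Definitions'] + ' ' + combined['Tasks']
print(combined[['SSOC 2020', 'Description']])
```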

In [21]:
data = SSOC_Final[['SSOC 2020', 'Description']]

In [32]:
data.to_csv('../Data/Processed/SSOC_2020_Detailed_Descriptions_For_Benchmarking.csv')

We completed this test but did not pursue it further, as results on the SSOC 2020 definitions were not indicative of performance on MCF data.

#### D) Testing Lucas's model on our test set

Running Lucas's model on our own test set lets us compare the performance of the two models directly.

In [26]:
data = pd.read_csv('../Data/Train/Test.csv')

In [35]:
data['target'] = data['Predicted_SSOC_2020'].astype('str')
data['Predicted_SSOC_2020'] = data['Predicted_SSOC_2020'].astype('str')

Import the SSOC v2018 data and its mapping to SSOC 2020

In [5]:
ssoc_v18_2020_mapping = pd.read_excel('../Data/Reference/Correspondence Tables between SSOC2020 and 2015v18.xlsx', skiprows = 4, sheet_name = 'SSOC2015(v2018)-SSOC2020')

In [6]:
ssoc_v18 = pd.read_csv('../Data/Archive/ssoc_v2018.csv', encoding='iso-8859-1')
ssoc_v18.dropna(inplace = True)
ssoc_v18['SSOC 2015 (Version 2018)'] = ssoc_v18['ssoc_f'].astype('float').astype('int').astype('str')
ssoc_v2020 = ssoc_v18.merge(ssoc_v18_2020_mapping, how = 'left', on = 'SSOC 2015 (Version 2018)')

In [7]:
ssoc_mapping_final = ssoc_v2020[['SSOC 2015 (Version 2018)', 'SSOC 2020']].drop_duplicates('SSOC 2015 (Version 2018)')
ssoc_mapping_final.columns = ['SSOC 2015 v18', 'SSOC 2020']
mapping = ssoc_mapping_final.set_index('SSOC 2015 v18')['SSOC 2020']
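
One consequence of the `drop_duplicates` call above is worth spelling out: if a v2018 code corresponds to several SSOC 2020 codes, only the first row is kept, so the resulting lookup is strictly one-to-one. A sketch with a made-up correspondence table (the codes are placeholders):

```python
import pandas as pd

# Hypothetical correspondence rows; '11111' maps to two SSOC 2020 codes
table = pd.DataFrame({
    'SSOC 2015 v18': ['11111', '11111', '22222'],
    'SSOC 2020': ['11110', '11119', '22220'],
})

# drop_duplicates keeps the first occurrence by default
mapping = (
    table
    .drop_duplicates('SSOC 2015 v18')
    .set_index('SSOC 2015 v18')['SSOC 2020']
)
print(mapping['11111'])
```

Whether keeping the first match is appropriate depends on how the correspondence table is ordered, which this sketch does not assume anything about.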

Generate the predictions

In [36]:
data.drop('target', axis = 1, inplace = True)

In [39]:
preds_raw = model.predict(np.array(data['description']), k = 10)
# Map each predicted SSOC 2015 (v2018) label to its SSOC 2020 code
top_10_preds = [[mapping[p.replace('__label__', '')] for p in pred] for pred in preds_raw]
data['SSOC_5D_Top_10_Preds'] = [','.join(preds) for preds in top_10_preds]
# Check membership against the list of codes rather than the joined string
data['SSOC_5D_Top_10_Preds_Correct'] = [ssoc in preds for ssoc, preds in zip(data['Predicted_SSOC_2020'], top_10_preds)]

Generate the same customised fields for checking 5D SSOC accuracy and output the file

In [45]:
data['SSOC_5D_Top_Pred'] = data['SSOC_5D_Top_10_Preds'].str.slice(0,5)
data['SSOC_5D_Top_Pred_Correct'] = data['SSOC_5D_Top_Pred'] == data['Predicted_SSOC_2020']
data['SSOC_5D_Top_5_Preds_Correct'] = [ssoc in preds.split(',')[0:5] for ssoc, preds in zip(data['Predicted_SSOC_2020'], data['SSOC_5D_Top_10_Preds'])]
data['SSOC_5D_Top_10_Preds_Correct'] = [ssoc in preds.split(',') for ssoc, preds in zip(data['Predicted_SSOC_2020'], data['SSOC_5D_Top_10_Preds'])]
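
The correctness flags above amount to a top-k hit check, and averaging them gives top-k accuracy. A minimal sketch in plain Python, using a hypothetical helper `top_k_hit` and invented codes with only three predictions per row:

```python
def top_k_hit(true_code, joined_preds, k):
    # True if the true code is among the first k comma-separated predictions
    return true_code in joined_preds.split(',')[:k]

# Made-up true codes and ranked predictions for illustration
true_codes = ['25121', '12211', '33222']
joined = ['25121,25122,25123', '12212,12211,12213', '41101,41102,41103']

top1 = [top_k_hit(t, p, 1) for t, p in zip(true_codes, joined)]
top2 = [top_k_hit(t, p, 2) for t, p in zip(true_codes, joined)]

# Accuracy is the share of rows with a hit
top1_accuracy = sum(top1) / len(top1)
top2_accuracy = sum(top2) / len(top2)
print(top1_accuracy, top2_accuracy)
```

Splitting on commas before the membership test avoids relying on substring matches within the joined string.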

In [47]:
data.to_csv('../Notebooks/Exported Files/Test_MRSD.csv', index = False)