## Generating Labels for our Dataset

**Author:** Shaun Khoo  
**Date:** 8 Sep 2021  
**Context:** Need labelled data in order to train our model  
**Objective:** Generate labels for our dataset using Lucas's pre-trained SSOC autocoder model (deprecated - see the notebook on assessing data distribution)

#### A) Importing libraries and data

In [1]:
import pandas as pd
import re

In [2]:
mcf_data = pd.read_csv('../Data/Processed/Artifacts/Raw_Text.csv')

In [3]:
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'
table = str.maketrans(punct,' '*len(punct))

def remove_html_tags_newline(text):
    """
    Cleans a string: strips HTML tags, newlines and punctuation, lowercases it,
    and keeps only alphabetic words (digits are dropped as well)

    Parameters:
        text (str): Selected text

    Returns:
        str: Lowercased text made up of space-separated alphabetic words
    """

    clean = re.compile(r'<.*?>')        # HTML tags (non-greedy match)
    newline_clean = re.compile(r'\n')   # newline characters
    non_punc = re.compile(r'[^\w\s]')   # anything that is not a word character or whitespace
    output = re.sub(non_punc, ' ', re.sub(newline_clean, ' ', re.sub(clean, '', text))).lower()
    a = ' '.join([i for i in output.translate(table).split()])
    return ' '.join(re.findall("[a-zA-Z]+",a))

In [4]:
mcf_data['Cleaned_Description'] = mcf_data['Title'].apply(remove_html_tags_newline) + " " + mcf_data['Description'].apply(remove_html_tags_newline)

In [5]:
mcf_data['Cleaned_Description'][0]

'pega solution architect year contract technical specialists will be responsible for designing and building components of enterprise applications and providing consultative guidance on all project assignments he she will work as part of a project team to ensure that the business and technical architecture of the delivered solution matches customer requirements at times he she will be asked to lead aspects of design development and mentoring of resources below are the few vital things the resource need to possess strong communication and presentation skills primary skills must have good knowledge of general prpc architecture good understanding on bpm best practices implementation life cycles end to end experience of prpc based application design and implementation actively participate in the requirements design and construction phases to lead to successful delivery of the project able to plan and lead the execution of pprc implementation enhancements possess strong prpc knowledge in all
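As a quick illustration of the cleaning step, here is a minimal, self-contained sketch. `clean_text` is a hypothetical, simplified stand-in for `remove_html_tags_newline`, and the sample string is made up. It is meant to show the effect on typical input, not to replace the function above.

```python
import re

def clean_text(text):
    """Simplified, hypothetical stand-in for remove_html_tags_newline."""
    no_tags = re.sub(r'<.*?>', '', text)       # drop HTML tags
    no_newlines = no_tags.replace('\n', ' ')   # newlines become spaces
    # keep lowercase alphabetic runs only; digits and punctuation are dropped
    return ' '.join(re.findall(r'[a-z]+', no_newlines.lower()))

# Made-up posting text, for illustration only
sample = '<p>Pega Solution Architect (1-year contract)</p>\n<ul><li>PRPC & BPM</li></ul>'
cleaned = clean_text(sample)
print(cleaned)  # pega solution architect year contract prpc bpm
```

Note that the digit in "1-year" disappears, which matches the "year contract" fragment in the output above.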

#### B) Generating predictions
Importing the `fasttext` model and generating the predictions

In [6]:
import fasttext # note you have to install fasttext==0.8.4
import numpy as np

In [7]:
def ft_output_single(x):
    return re.sub('__label__','',x[0][0])

In [8]:
model = fasttext.load_model("../Models/ft_epoch50_25wvs_mcf3.bin")

In [9]:
preds_raw = model.predict(np.array(mcf_data['Cleaned_Description']), k=1)

In [10]:
mcf_data['Predicted SSOC'] = [pred[0].replace('__label__', '') for pred in preds_raw]
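The list comprehension above assumes that each element of `preds_raw` is a sequence of label strings carrying the `__label__` prefix. The sketch below shows that post-processing on mock data, so it runs without the model. `strip_label_prefix` is a hypothetical helper and the codes are illustrative only.

```python
def strip_label_prefix(preds, prefix='__label__'):
    """Remove the fastText label prefix from every predicted label."""
    return [[label.replace(prefix, '', 1) for label in row] for row in preds]

# Mock top-k output (k = 2) for two documents; the codes are illustrative only
mock_preds = [
    ['__label__25121', '__label__25122'],
    ['__label__12211', '__label__24311'],
]

cleaned_preds = strip_label_prefix(mock_preds)
top1 = [row[0] for row in cleaned_preds]  # top prediction per document, as in the k = 1 case
print(top1)  # ['25121', '12211']
```

Depending on the `fasttext` version installed, `predict` may instead return a `(labels, probabilities)` tuple, in which case the labels would need to be unpacked first.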

Importing the SSOC mapping table (v2018)

In [11]:
ssoc = pd.read_csv('../Data/Raw/ssoc_v2018.csv', encoding='iso-8859-1')
ssoc.dropna(inplace = True)
ssoc['ssoc_f'] = ssoc['ssoc_f'].astype('float').astype('int').astype('str')

Cleaning up the MCF data for the join

In [12]:
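# Take an explicit copy so that the column assignment in the next cell does not act on a slice
mcf_data = mcf_data[(mcf_data['SSOC_2015'] != 'X5000') & (mcf_data['SSOC_2015'].notnull())].copy()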

In [13]:
mcf_data['SSOC_2015'] = mcf_data['SSOC_2015'].astype('float').astype('int').astype('str')
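The float, int, str cast chain normalises codes that were read in as floats or as strings such as `'21422.0'`, so that both sides of the join use the same plain string form. A plain-Python sketch of the same idea follows. `normalise_code` is a hypothetical helper, shown only to make the conversion and its failure mode concrete.

```python
def normalise_code(raw):
    """Mimic .astype('float').astype('int').astype('str') for a single value."""
    return str(int(float(raw)))

print(normalise_code('21422.0'))  # 21422
print(normalise_code(21422.0))    # 21422

# A non-numeric code such as 'X5000' cannot be cast to float
try:
    normalise_code('X5000')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True
```

This is presumably why the `X5000` rows and the null values are filtered out before the cast.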

In [14]:
mcf_data_final = mcf_data.merge(ssoc, left_on = 'SSOC_2015', right_on = 'ssoc_f', how = 'left').merge(ssoc, left_on = 'Predicted SSOC', right_on = 'ssoc_f', how = 'left')

In [15]:
mcf_data_final.drop(['ssoc_f_x', 'ssoc_f_y'], axis = 1, inplace = True)
mcf_data_final.rename({'ssoc_desc_x': "Reported SSOC Desc", "ssoc_desc_y": "Predicted SSOC Desc"}, axis = 1, inplace = True)
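The two left joins attach a description to the reported code and to the predicted code respectively. As a rough plain-Python analogy (not the pandas code itself), each join behaves like a dictionary lookup that yields nothing when the code is missing from the SSOC table. All codes and descriptions below are hypothetical.

```python
# Hypothetical code -> description lookup standing in for the SSOC table
ssoc_desc = {'10001': 'Description A', '10002': 'Description B'}

postings = [
    {'Title': 'posting 1', 'SSOC_2015': '10001', 'Predicted SSOC': '10002'},
    {'Title': 'posting 2', 'SSOC_2015': '10002', 'Predicted SSOC': '99999'},  # code not in the table
]

for row in postings:
    # dict.get returns None for unmatched codes, much like NaN after a left join
    row['Reported SSOC Desc'] = ssoc_desc.get(row['SSOC_2015'])
    row['Predicted SSOC Desc'] = ssoc_desc.get(row['Predicted SSOC'])

print(postings[1]['Predicted SSOC Desc'])  # None
```

Rows whose code has no match keep a missing description, so any later string concatenation on those columns needs to allow for missing values.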

Checking some job postings

In [16]:
idx = 160
print("Job Title: " + mcf_data_final['Title'][idx])
print("Reported SSOC: " + mcf_data_final['Reported SSOC Desc'][idx])
print("Predicted SSOC: " + mcf_data_final['Predicted SSOC Desc'][idx])
mcf_data_final['Description'][idx]

Job Title: digital marketing executive
Reported SSOC: Other administrative and related associate professionals n.e.c. 
Predicted SSOC: Sales and marketing manager 


'<p>We are searching for a highly-creative Digital Marketing Executive/Manager to lead our marketing team. In this position, you will be responsible for all aspects of our marketing operations. Your central goal is to help grow our brand’s influence locally while also increasing brand loyalty and awareness.</p>\n<p>Your duties will include planning, implementing, and monitoring our digital marketing campaigns across all digital networks. Our ideal candidate is someone with experience in marketing, art direction, and social media management. In addition to being an outstanding communicator, you will also demonstrate excellent interpersonal and analytical skills.</p>\n<h3>Responsibilities:</h3>\n<ul>\n  <li>Design and oversee all aspects of our digital marketing department including our marketing database, email, and display advertising campaigns.</li>\n  <li>Develop and monitor campaign budgets.</li>\n  <li>Plan and manage our social media platforms.</li>\n  <li>Prepare accurate reports

Exporting the file

In [17]:
mcf_data_final.to_csv('../Data/Processed/Artifacts/MCF_Subset_WithLabels.csv', index = False)

#### C) Testing Lucas's model on the SSOC 2020 definitions

Import the SSOC 2020 definitions Excel file and combine the detailed definition for each SSOC with the job tasks (4D SSOC level)

In [17]:
SSOC_Definitions = pd.read_excel('../Data/Raw/SSOC2020 Detailed Definitions.xlsx', skiprows = 4)



In [18]:
SSOC_4D = SSOC_Definitions[SSOC_Definitions['SSOC 2020'].apply(len) == 4][['SSOC 2020', 'Tasks']]
SSOC_4D.columns = ['4D SSOC', 'Tasks']

In [19]:
SSOC_5D = SSOC_Definitions[(SSOC_Definitions['SSOC 2020'].apply(len) == 5) & ~SSOC_Definitions['SSOC 2020'].str.contains('X')].reset_index(drop = True)
SSOC_5D['4D SSOC'] = SSOC_5D['SSOC 2020'].str.slice(0, 4)
SSOC_5D.drop('Tasks', axis = 1, inplace = True)

In [20]:
SSOC_Final = SSOC_5D.merge(SSOC_4D, how = 'left', on = '4D SSOC')
SSOC_Final['Description'] = SSOC_Final['Detailed Definitions'] + " " + SSOC_Final['Tasks']
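The steps above give every 5D code the task list of its 4D parent and append it to the 5D detailed definition. A minimal plain-Python sketch of that logic is below. The definitions and tasks are placeholders, not values from the Excel file.

```python
# Placeholder inputs; the real values come from the SSOC 2020 definitions file
tasks_4d = {'2142': '- task one\n- task two'}
defs_5d = {'21421': 'Definition one.', '21422': 'Definition two.'}

descriptions = {}
for code, definition in defs_5d.items():
    parent = code[:4]                  # the 4D parent is the first four characters
    tasks = tasks_4d.get(parent, '')   # empty string if the parent has no task list
    descriptions[code] = f'{definition} {tasks}'.strip()

print(descriptions['21422'])
```

Unlike this sketch, concatenating a string column with a missing `Tasks` value in pandas produces a missing `Description`, so it may be worth checking for nulls before passing the column to the model.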

In [21]:
# Copy explicitly so that columns added later do not trigger SettingWithCopyWarning
data = SSOC_Final[['SSOC 2020', 'Description']].copy()

In [32]:
data.to_csv('../Data/Processed/SSOC_2020_Detailed_Descriptions_For_Benchmarking.csv')

Import the SSOC v2018 data and its mapping to SSOC 2020

In [22]:
ssoc_v18_2020_mapping = pd.read_excel('../Data/Raw/Correspondence Tables between SSOC2020 and 2015v18.xlsx', skiprows = 4, sheet_name = 'SSOC2015(v2018)-SSOC2020')

In [23]:
ssoc_v18 = pd.read_csv('../Data/Raw/ssoc_v2018.csv', encoding='iso-8859-1')
ssoc_v18.dropna(inplace = True)
ssoc_v18['SSOC 2015 (Version 2018)'] = ssoc_v18['ssoc_f'].astype('float').astype('int').astype('str')
ssoc_v2020 = ssoc_v18.merge(ssoc_v18_2020_mapping, how = 'left', on = 'SSOC 2015 (Version 2018)')

In [24]:
ssoc_mapping_final = ssoc_v2020[['SSOC 2015 (Version 2018)', 'SSOC 2020']].drop_duplicates('SSOC 2015 (Version 2018)')
ssoc_mapping_final.columns = ['SSOC 2015 v18', 'SSOC 2020']
mapping = ssoc_mapping_final.set_index('SSOC 2015 v18')['SSOC 2020']
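Because `drop_duplicates` keeps the first occurrence by default, a v2018 code that corresponds to several SSOC 2020 codes ends up mapped to whichever one appears first in the correspondence table. The plain-Python sketch below shows that keep-first behaviour on hypothetical code pairs.

```python
# Hypothetical (v2018 code, SSOC 2020 code) pairs; '11111' maps to two SSOC 2020 codes
pairs = [('11111', '22221'), ('11111', '22222'), ('11112', '22223')]

first_match = {}
for old_code, new_code in pairs:
    # setdefault only stores the value the first time a key is seen
    first_match.setdefault(old_code, new_code)

print(first_match)  # {'11111': '22221', '11112': '22223'}
```

Any one-to-many correspondence is therefore collapsed to a single target code, which can count against the model when accuracy is measured on SSOC 2020 codes.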

Generate the predictions

In [27]:
# Predict the top 5 labels for each SSOC 2020 description
preds_raw = model.predict(np.array(data['Description']), k = 5)
# Store the top-k SSOC 2015 v18 codes as one comma-separated string per row
data['Predicted SSOC 2015 v18'] = [','.join(p.replace('__label__', '') for p in pred) for pred in preds_raw]
# Translate each predicted code to SSOC 2020 using the mapping built above
data['Predicted SSOC 2020'] = [','.join(mapping[p.replace('__label__', '')] for p in pred) for pred in preds_raw]

In [28]:
data.head()

Unnamed: 0,SSOC 2020,Description,Predicted SSOC 2015 v18,Predicted SSOC 2020
0,11110,"Legislator determines, formulates and directs ...","33499,13499,12121,11201,33451","33499,13499,12121,11201,24233"
1,11121,"Senior government official plans, organises an...","13499,24220,12121,13301,13211","13499,24220,12121,13301,13210"
2,11122,"Senior statutory board official plans, organis...","12121,24220,13499,11201,13211","12121,24220,13499,11201,13210"
3,11140,"Senior official of political party organisatio...","33312,11201,11202,43231,26119","33312,11201,11202,43231,26119"
4,11150,"Senior official of employers', workers' and ot...","11201,11202,11203,33312,26431","11201,11202,11203,33312,26431"


Run below if k > 1 to calculate accuracy

In [29]:
# Flag whether the actual SSOC appears among the predicted SSOCs, at each level from 1D to 5D
level = 5
for lvl in range(1, level + 1):
    # Compare the first `lvl` characters of the actual code with those of every predicted code
    data[f'Correct_{lvl}D'] = [
        actual[:lvl] in [code[:lvl] for code in predicted.split(',')]
        for actual, predicted in zip(data['SSOC 2020'], data['Predicted SSOC 2020'])
    ]


In [30]:
# Check accuracy rate at the 5D level
data['Correct_5D'].value_counts(normalize = True)

False    0.677031
True     0.322969
Name: Correct_5D, dtype: float64
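To make the top-k check concrete, the sketch below applies the same prefix comparison to a single row. The actual code and the five predicted SSOC 2020 codes are taken from the first row of the `data.head()` output above, read as five 5-digit codes. `hit_at_level` is a hypothetical helper.

```python
def hit_at_level(actual, predicted_csv, level):
    """True if the first `level` characters of `actual` match any predicted code."""
    prefixes = {code[:level] for code in predicted_csv.split(',')}
    return actual[:level] in prefixes

actual = '11110'                                  # Legislator
predicted = '33499,13499,12121,11201,24233'       # top 5 predictions for that row

hits = {lvl: hit_at_level(actual, predicted, lvl) for lvl in range(1, 6)}
print(hits)  # {1: True, 2: True, 3: False, 4: False, 5: False}
```

For this row the model lands in the right 1D and 2D groups (via `11201`) but misses from the 3D level onwards, so it counts as `False` in `Correct_5D`.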

In [152]:
# Breakdown by any SSOC level to check accuracy
data['1D SSOC'] = data['SSOC 2020'].str.slice(0,1)
data['2D SSOC'] = data['SSOC 2020'].str.slice(0,2)
data['3D SSOC'] = data['SSOC 2020'].str.slice(0,3)
data['4D SSOC'] = data['SSOC 2020'].str.slice(0,4)
accuracy_1d = data.groupby('1D SSOC').agg({'SSOC 2020': 'count', 'Correct_1D': 'sum'})

In [153]:
accuracy_1d['Percentage Correct'] = accuracy_1d['Correct_1D'] / accuracy_1d['SSOC 2020'] * 100

In [154]:
accuracy_1d.rename({'SSOC 2020': 'Count'}, axis = 1)

1D SSOC,Count,Correct_1D,Percentage Correct
1,80,68,85.0
2,294,282,95.918367
3,216,202,93.518519
4,60,57,95.0
5,88,68,77.272727
6,12,3,25.0
7,99,51,51.515152
8,82,38,46.341463
9,66,52,78.787879
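The per-group figures can be rolled up into a single overall 1D accuracy. The sketch below does the arithmetic with the counts copied from the table above; nothing else is assumed.

```python
# (count, correct) per 1D SSOC group, copied from the table above
by_group = {
    '1': (80, 68), '2': (294, 282), '3': (216, 202),
    '4': (60, 57), '5': (88, 68), '6': (12, 3),
    '7': (99, 51), '8': (82, 38), '9': (66, 52),
}

total = sum(count for count, _ in by_group.values())
correct = sum(hits for _, hits in by_group.values())
overall_pct = correct / total * 100

print(f'{correct}/{total} = {overall_pct:.1f}%')  # 821/997 = 82.3%
```

The overall figure hides a wide spread between groups: groups 2 to 4 sit above 93%, while group 6 is at 25%, although it only contains 12 codes.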


Run below if k = 1

In [62]:
final = data.merge(ssoc_v2020[['SSOC 2020', 'SSOC 2020 Title']].drop_duplicates('SSOC 2020'), how = 'left', on = 'SSOC 2020').merge(ssoc_v2020[['SSOC 2020', 'SSOC 2020 Title']].drop_duplicates('SSOC 2020'), how = 'left', left_on = 'Predicted SSOC 2020', right_on = 'SSOC 2020')

In [63]:
final = final.rename({'SSOC 2020 Title_x': 'Actual SSOC Title', 'SSOC 2020 Title_y': 'Predicted SSOC Title', 'SSOC 2020_x': 'Actual SSOC 2020'}, axis = 1)

Calculate the score at the chosen SSOC level (`level = 5` gives the 5D score shown below; set it to 1 for the 1D level)

In [75]:
level = 5
(final['Actual SSOC 2020'].str.slice(0,level) == final['Predicted SSOC 2020'].str.slice(0,level)).value_counts(normalize = True)

False    0.865597
True     0.134403
dtype: float64

In [65]:
idx = 115
print(f"Actual: {final['Actual SSOC Title'][idx]} ({final['Actual SSOC 2020'][idx]})")
print(f"Predicted: {final['Predicted SSOC Title'][idx]} ({final['Predicted SSOC 2020'][idx]})")
print('------------------------------------------------------')
print(final['Description'][idx])

Actual: Building construction engineer (21422)
Predicted: Other engineering professionals n.e.c. (21499)
------------------------------------------------------
Building construction engineer determines and specifies construction methods, materials and quality standards, and directs construction work. He/she integrates engineering principles into designs of large buildings to ensure that structures are safe and structurally sound. He/she also plans, organises and supervises the erection, maintenance and repair of buildings. He/she works together with architects and other engineers to transform design ideas into executable plans. - conducting research and developing new or improved theories and methods related to civil engineering
- advising on and designing structures such as bridges, dams, docks, roads, airports, railways, canals, pipelines, waste-disposal and flood-control systems, and industrial and other large buildings
- determining and specifying construction methods, materials an

In [74]:
final[(final['Actual SSOC 2020'].str.slice(0,1) != '1') & (final['Actual SSOC 2020'].str.slice(0,level) != final['Predicted SSOC 2020'].str.slice(0,level))]

Unnamed: 0,Actual SSOC 2020,Description,Predictions,Predicted SSOC 2015 v18,Predicted SSOC 2020,Actual SSOC Title,SSOC 2020_y,Predicted SSOC Title
80,21110,"Physicist/Astronomer conducts research, improv...",32599,32599,32599,Physicist/Astronomer,32599,Other health associate professionals n.e.c.
81,21120,Meteorologist prepares short-term or long-term...,11202,11202,11202,Meteorologist,11202,Company director
83,21141,Geologist conducts research into the nature an...,29090,29090,Deleted,Geologist,Deleted,
84,21142,Geophysicist conducts research into the physic...,29090,29090,Deleted,Geophysicist,Deleted,
85,21149,This group includes physical science professio...,29090,29090,Deleted,Other physical science professionals,Deleted,
...,...,...,...,...,...,...,...,...
992,96272,Concierge (hotel) serves as the point of conta...,29090,29090,Deleted,Concierge (hotel),Deleted,
993,96291,Leaflet and newspaper distributor/deliverer ha...,83322,83322,83322,Leaflet and newspaper distributor/deliverer,83322,Trailer-truck driver (including prime mover dr...
994,96292,Meter reader/Vending-machine collector reads e...,83329,83329,83329,Meter reader/Vending-machine collector,83329,Other heavy truck and lorry drivers
995,96293,Odd job person performs tasks of a simple and ...,83441,83441,83441,Odd job person,83441,Fork lift truck operator
