## Generating Labels for Our Dataset
This notebook uses Lucas's pre-trained SSOC autocoder model to generate a predicted SSOC label for each MCF job posting.

In [1]:
import pandas as pd
import re

Importing and cleaning the text data

In [2]:
mcf_data = pd.read_csv('../Data/Processed/Artifacts/Raw_Text.csv')

In [3]:
# Punctuation and symbols that are replaced with spaces during cleaning
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'
table = str.maketrans(punct, ' ' * len(punct))

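def remove_html_tags_newline(text):
    """
    Strips HTML tags, newlines and punctuation from a string, lowercases it,
    and keeps only alphabetic tokens.

    Parameters:
        text (str): Selected text

    Returns:
        str: Lowercased text made up of space-separated runs of ASCII letters
    """

    # Raw strings avoid invalid-escape warnings for \w and \s
    clean = re.compile(r'<.*?>')
    newline_clean = re.compile(r'\n')
    non_punc = re.compile(r'[^\w\s]')
    # Drop tags, turn newlines and punctuation into spaces, then lowercase
    output = re.sub(non_punc, ' ', re.sub(newline_clean, ' ', re.sub(clean, '', text))).lower()
    # Replace any remaining symbols and collapse repeated whitespace
    a = ' '.join(output.translate(table).split())
    # Keep alphabetic tokens only (digits are dropped)
    return ' '.join(re.findall("[a-zA-Z]+", a))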
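One detail of the cleaning step worth checking: tags are replaced with an empty string, so text on either side of a tag is joined when no whitespace separates them. The sketch below illustrates this with two hypothetical helpers (`strip_tags` and `letters_only`, which are not part of the notebook) and a made-up HTML snippet. The sample posting shown later has newlines between its tags, so this may seldom occur in the actual data.

```python
import re

# Hypothetical helpers for illustration only; they mirror the regex steps
# used in the cleaning function but are not the notebook's code.
TAG_RE = re.compile(r'<.*?>')

def strip_tags(text, replacement=''):
    """Remove HTML tags, substituting `replacement` for each tag."""
    return re.sub(TAG_RE, replacement, text)

def letters_only(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    return ' '.join(re.findall(r'[a-z]+', text.lower()))

# Made-up snippet with no whitespace between adjacent list items.
sample = '<ul><li>Plan budgets</li><li>Manage 3 teams</li></ul>'

# Replacing tags with an empty string fuses "budgets" and "Manage".
merged = letters_only(strip_tags(sample, ''))
# Replacing tags with a space keeps the words apart.
separated = letters_only(strip_tags(sample, ' '))

print(merged)
print(separated)
```

If fused tokens turn out to matter, substituting a space for each tag is one possible adjustment. Whether it helps depends on how the model's training text was cleaned, which is not stated here.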

In [18]:
mcf_data['Cleaned_Description'] = mcf_data['Title'].apply(remove_html_tags_newline) + " " + mcf_data['Description'].apply(remove_html_tags_newline)

In [21]:
mcf_data['Cleaned_Description'][0]

'pega solution architect year contract technical specialists will be responsible for designing and building components of enterprise applications and providing consultative guidance on all project assignments he she will work as part of a project team to ensure that the business and technical architecture of the delivered solution matches customer requirements at times he she will be asked to lead aspects of design development and mentoring of resources below are the few vital things the resource need to possess strong communication and presentation skills primary skills must have good knowledge of general prpc architecture good understanding on bpm best practices implementation life cycles end to end experience of prpc based application design and implementation actively participate in the requirements design and construction phases to lead to successful delivery of the project able to plan and lead the execution of pprc implementation enhancements possess strong prpc knowledge in all

Importing the `fasttext` model and generating the predictions

In [6]:
import fasttext # note: this notebook requires fasttext==0.8.4
import numpy as np

In [7]:
def ft_output_single(x):
    return re.sub('__label__','',x[0][0])

In [8]:
model = fasttext.load_model("../Models/ft_epoch50_25wvs_mcf3.bin")

In [26]:
preds_raw = model.predict(np.array(mcf_data['Cleaned_Description']), k=1)

In [28]:
mcf_data['Predicted SSOC'] = [pred[0].replace('__label__', '') for pred in preds_raw]
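The list comprehension above assumes `predict` returns one list of labels per document. As a hedged sketch, the helper below extracts the top label from that shape and also from a `(labels, probabilities)` tuple, which is an assumption about how other fasttext releases return batch predictions. The function names and the sample predictions are hypothetical, and the codes are placeholders, not model output.

```python
def strip_label_prefix(label, prefix='__label__'):
    """Return the label without fastText's prefix, if present."""
    return label[len(prefix):] if label.startswith(prefix) else label

def top_labels(preds):
    """
    Extract the top-1 label for each document.

    Accepts a list of per-document label lists, or a
    (labels, probabilities) tuple (assumed shape, not verified here).
    """
    if isinstance(preds, tuple):
        preds = preds[0]  # keep the labels, discard the probabilities
    return [strip_label_prefix(doc_labels[0]) for doc_labels in preds]

# Made-up predictions in both shapes.
list_style = [['__label__12345'], ['__label__67890']]
tuple_style = ([['__label__12345'], ['__label__67890']], [[0.91], [0.55]])

print(top_labels(list_style))
print(top_labels(tuple_style))
```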

Importing the SSOC mapping table (v2018)

In [44]:
ssoc = pd.read_csv('../Data/Raw/ssoc_v2018.csv', encoding='iso-8859-1')
ssoc.dropna(inplace = True)
ssoc['ssoc_f'] = ssoc['ssoc_f'].astype('float').astype('int').astype('str')
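The float, int, str chain gives every code the same string form, so that the later join compares like with like. A minimal sketch on made-up codes (placeholders, not real SSOC entries) shows the effect:

```python
import pandas as pd

# Toy codes in the mixed forms a CSV read can produce.
codes = pd.Series(['21494.0', 21494, 12111.0, None], dtype='object')

# Drop missing values, then normalise: '21494.0' -> 21494.0 -> 21494 -> '21494'
normalised = codes.dropna().astype('float').astype('int').astype('str')

print(normalised.tolist())
```

Note that a non-numeric code such as `'X5000'` cannot pass through the float conversion, and any leading zeros would be lost, so this approach only suits purely numeric codes.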

Cleaning up the MCF data for the join

In [83]:
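# .copy() makes the filtered frame independent, so the column assignment in the next cell does not trigger SettingWithCopyWarning
mcf_data = mcf_data[(mcf_data['SSOC_2015'] != 'X5000') & (mcf_data['SSOC_2015'].notnull())].copy()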

In [84]:
mcf_data['SSOC_2015'] = mcf_data['SSOC_2015'].astype('float').astype('int').astype('str')

In [85]:
# First merge attaches the reported SSOC description, second merge the predicted one
mcf_data_final = (
    mcf_data
    .merge(ssoc, left_on='SSOC_2015', right_on='ssoc_f', how='left')
    .merge(ssoc, left_on='Predicted SSOC', right_on='ssoc_f', how='left')
)

In [86]:
mcf_data_final.drop(['ssoc_f_x', 'ssoc_f_y'], axis = 1, inplace = True)
mcf_data_final.rename({'ssoc_desc_x': "Reported SSOC Desc", "ssoc_desc_y": "Predicted SSOC Desc"}, axis = 1, inplace = True)
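Joining the same lookup table twice is what produces the `_x` and `_y` suffixes that are dropped and renamed above. As an alternative sketch (not the notebook's method), the lookup can be renamed before each merge so the final column names come out directly. All codes, titles and descriptions below are made up, and `ssoc_desc` is assumed to be the lookup's description column.

```python
import pandas as pd

# Toy lookup table and postings; every value is a placeholder.
lookup = pd.DataFrame({'ssoc_f': ['111', '222'],
                       'ssoc_desc': ['Role A', 'Role B']})
postings = pd.DataFrame({'Title': ['t1', 't2'],
                         'SSOC_2015': ['111', '222'],
                         'Predicted SSOC': ['222', '999']})

# One renamed copy of the lookup per join key.
reported = lookup.rename(columns={'ssoc_f': 'SSOC_2015',
                                  'ssoc_desc': 'Reported SSOC Desc'})
predicted = lookup.rename(columns={'ssoc_f': 'Predicted SSOC',
                                   'ssoc_desc': 'Predicted SSOC Desc'})

labelled = (postings
            .merge(reported, on='SSOC_2015', how='left')
            .merge(predicted, on='Predicted SSOC', how='left'))

print(labelled)
```

With `how='left'`, a code that is missing from the lookup (here `'999'`) keeps its row and gets a missing description, which makes unmatched predictions easy to spot.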

Checking some job postings

In [103]:
idx = 160
print("Job Title: " + mcf_data_final['Title'][idx])
print("Reported SSOC: " + mcf_data_final['Reported SSOC Desc'][idx])
print("Predicted SSOC: " + mcf_data_final['Predicted SSOC Desc'][idx])
mcf_data_final['Description'][idx]

Job Title: digital marketing executive
Reported SSOC: Other administrative and related associate professionals n.e.c. 
Predicted SSOC: Sales and marketing manager 


'<p>We are searching for a highly-creative Digital Marketing Executive/Manager to lead our marketing team. In this position, you will be responsible for all aspects of our marketing operations. Your central goal is to help grow our brand’s influence locally while also increasing brand loyalty and awareness.</p>\n<p>Your duties will include planning, implementing, and monitoring our digital marketing campaigns across all digital networks. Our ideal candidate is someone with experience in marketing, art direction, and social media management. In addition to being an outstanding communicator, you will also demonstrate excellent interpersonal and analytical skills.</p>\n<h3>Responsibilities:</h3>\n<ul>\n  <li>Design and oversee all aspects of our digital marketing department including our marketing database, email, and display advertising campaigns.</li>\n  <li>Develop and monitor campaign budgets.</li>\n  <li>Plan and manage our social media platforms.</li>\n  <li>Prepare accurate reports
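Spot checks like the one above can be complemented with a rough agreement rate between reported and predicted codes. The sketch below uses a toy frame in place of `mcf_data_final`. The codes are placeholders, and comparing the first two digits assumes that leading digits denote broader occupation groups.

```python
import pandas as pd

# Toy frame standing in for the merged data; codes are made up.
toy = pd.DataFrame({'SSOC_2015': ['11111', '22222', '33333', '44444'],
                    'Predicted SSOC': ['11111', '22999', '33333', '55555']})

# Share of postings where the two codes match exactly.
exact = (toy['SSOC_2015'] == toy['Predicted SSOC']).mean()

# Share where only the first two digits match (a coarser comparison).
same_prefix = (toy['SSOC_2015'].str[:2] == toy['Predicted SSOC'].str[:2]).mean()

print(exact, same_prefix)
```

A disagreement does not by itself show that the model is wrong. In the posting above, the predicted label arguably fits the description better than the reported one.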

Exporting the file

In [99]:
mcf_data_final.to_csv('../Data/Processed/Artifacts/MCF_Subset_WithLabels.csv', index = False)
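When the exported file is read back, pandas infers numeric types by default, so the string codes become integers again. A minimal sketch with an in-memory buffer and placeholder codes shows how passing `dtype` keeps them as strings:

```python
import io
import pandas as pd

# Toy frame with string codes, written to an in-memory CSV.
toy = pd.DataFrame({'SSOC_2015': ['21494'], 'Predicted SSOC': ['12111']})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the codes are parsed as integers.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: the codes stay as strings.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={'SSOC_2015': str, 'Predicted SSOC': str})

print(default_read['SSOC_2015'].iloc[0], typed_read['SSOC_2015'].iloc[0])
```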