This notebook builds the *full* [composite model](https://confluence.dhigroupinc.com/pages/viewpage.action?spaceKey=MT&title=Composite+Model) for Dice version 3.0.x, which includes the following submodels:
    
  1. TF-IDF cosine similarity between the job title/description and the resume
  2. Exact matching of Burning Glass skills by name, enriched with BG clusters and the BG-to-DHI skill crosswalk. See the [skills_model_feature_optimization notebook](./skills_model_feature_optimization.ipynb) for details.
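
As a rough illustration of the first submodel's idea, the sketch below scores job/resume pairs by TF-IDF cosine similarity using scikit-learn directly. It is not the `CosTfidfTransformer` implementation (which also cleans and stems the text); the toy strings and variable names are assumptions made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy job/resume pairs (hypothetical text, aligned by position).
jobs = ["python developer with aws experience",
        "registered nurse for hospital night shift"]
resumes = ["experienced python developer aws devops",
           "senior java engineer spring microservices"]

# Fit one vocabulary over both sides so the vectors are comparable.
vec = TfidfVectorizer(stop_words="english")
vec.fit(jobs + resumes)
J = vec.transform(jobs)
R = vec.transform(resumes)

# TfidfVectorizer L2-normalizes rows by default, so the row-wise dot
# product of J and R is the cosine similarity of each pair.
scores = np.asarray(J.multiply(R).sum(axis=1)).ravel()
print(scores)
```

The first pair shares several terms and gets a positive score; the second pair shares none and scores zero.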

These submodels are combined as two pipelines in a [CompositeTransformer](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/models/compositetransformer.py). The first pipeline is the TfIdf model, which takes the ['job_title', 'job_description', 'resume'] columns. The second pipeline is the skills model, which takes the ['description_bg_parse', 'resume_bg_parse', 'job_title'] columns.

This is the full model, which contains quantile data that can be purged after optimizing. Testing and optimizing are performed in [this notebook](./optimize_and_test_composite_model.ipynb).
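
The "quantile data" presumably refers to the training-score distributions used to turn raw scores into the `qtile` values that appear later in this notebook. How dhi-dsmatch stores or applies them is not shown here, so the following is only a generic sketch of an empirical quantile lookup; `train_scores` and `to_qtile` are hypothetical names.

```python
import numpy as np

# Stand-in for the raw scores a submodel produced on the training set.
rng = np.random.default_rng(0)
train_scores = np.sort(rng.random(1000))

def to_qtile(score, sorted_scores):
    """Return the fraction of training scores that are <= score."""
    rank = np.searchsorted(sorted_scores, score, side="right")
    return rank / len(sorted_scores)

print(to_qtile(0.5, train_scores))
```

A lookup like this needs the sorted training scores at inference time, which is one reason such data can make a saved model large.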

**Author:** Tom McTavish

**Date:** November 18, 2021

**Output:** models/dice/dice-composite_3.0.x_full.joblib

This notebook takes about 30 minutes to run.

In [1]:
%pip install -r requirements-inf.txt
# %pip install dhi-dsmatch[training]==1.1.0
# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git@MATCH-2343-remove-duplicate-skills-that-#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"

# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git@MATCH-2246-explore-bi-directional-crossw#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"
# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git@c485248#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"

# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git@feature/MATCH-2128-refactor-bg-skills-model-to-u#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"
# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"


Looking in indexes: https://pypi.org/simple, https://dhigroupinc.jfrog.io/artifactory/api/pypi/taxonomy-python/simple, https://dhigroupinc.jfrog.io/artifactory/api/pypi/match-python/simple, https://dhigroupinc.jfrog.io/dhigroupinc/api/pypi/match-python-lib-prod-local/simple
Collecting dhi-dsmatch==1.1.10
  Downloading https://dhigroupinc.jfrog.io/dhigroupinc/api/pypi/match-python-lib-prod-local/dhi-dsmatch/1.1.10/dhi_dsmatch-1.1.10-py3-none-any.whl (110 kB)
     |████████████████████████████████| 110 kB 29.3 MB/s            
Installing collected packages: dhi-dsmatch
  Attempting uninstall: dhi-dsmatch
    Found existing installation: dhi-dsmatch 1.1.12
    Uninstalling dhi-dsmatch-1.1.12:
      Successfully uninstalled dhi-dsmatch-1.1.12
Successfully installed dhi-dsmatch-1.1.10
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import json
from pathlib import Path

import yaml
import pandas as pd
import joblib
import boto3
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn import set_config
set_config(display='diagram')
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this

from dhi.dsmatch.sklearnmodeling.models.compositetransformer import CompositeTransformer
from dhi.dsmatch.sklearnmodeling.models.costfidftransformer import CosTfidfTransformer
from dhi.dsmatch.sklearnmodeling.models.bgleftoverlappingskillstransformer import BGLeftOverlappingSkillsTransformer, BGLeftOverlappingSkillsTransformerCore
from dhi.dsmatch.util.io import read_csv

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
pd.set_option('display.max_columns', 200)

# Load Parameters from config file

In [4]:
with open('config.yml', 'r') as f:
    params = yaml.safe_load(f)  # safe_load is sufficient for a plain key/value config
params

{'name': 'dice-composite',
 'version': '3.0.13',
 'cache_location': '/home/ec2-user/SageMaker/shared/data/dice/v2/dice-composite/cache',
 'model_dir': '/home/ec2-user/SageMaker/shared/models/dice/',
 'data_dir': '/home/ec2-user/SageMaker/shared/data/dice/',
 'train_data': 's3://dev-matchology-datascience/data/dice/v2/dice_train_20211217.csv.zip',
 'labeled_data': 's3://dev-matchology-datascience/data/dice/v2/dice_labeled_20211108_consensus.csv',
 'skills_labeled_data': 's3://dev-matchology-datascience/data/dice/v2/dice_labeled_20211108_skills_consensus.csv',
 'df_taxonomy_skills': 's3://dev-matchology-datascience/data/taxonomy/burning_glass_skills.csv',
 'df_bgt_skills_crosswalk': 's3://dev-matchology-datascience/data/taxonomy/bgt_skills_crosswalk.csv',
 'df_dhi_skills_crosswalk': 's3://dev-matchology-datascience/data/taxonomy/dhi_skills_crosswalk.csv',
 'dhi_job_titles_to_related_skills': 's3://dev-matchology-datascience/data/taxonomy/dhi_job_titles_to_related_skills.txt',
 's3_model_

# Read in training data

In [5]:
%%time
df = read_csv(params['train_data'])#, nrows=2_000)  # Uncomment nrows for quick iteration test/development.
df.head()

CPU times: user 1min 18s, sys: 7.76 s, total: 1min 26s
Wall time: 1min 26s


Unnamed: 0,snapshot_id,profile_id,year,month,day,previous_title,current_title,profile_skills,job_skills,desired_title,job_title,job_description,resume,application_id,description_bg_parse,resume_bg_parse,job_data_bg_skills,resume_data_bg_skills
0,4350cf17-f757-5c8a-b2e4-97a772b758a8,9e3468c4014bac5737dd0864543ab978,2021,2,24,Network Engineer I:: Technical Solutions Analyst,Network Engineer,,yaml:: amazon web services:: devops:: python::...,Network Engineer,AWS Cloud DevOps,<!DOCTYPE html><html><head></head><body><p><st...,D. Email: ❖ Cell: ❖ LinkedIn: Technical Skill...,2c92808277b1615f0177d4f21a543063,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...","[""AWS CloudFormation"", ""AngularJS"", ""Ansible"",...","[""Ansible"", ""Border Gateway Protocol (BGP)"", ""..."
1,1b18511f-3339-44c4-92b4-8ac7418aa227,7c486ad56923319f8369f3cabf9778aa,2021,11,19,,Android Developer,Software:: Android:: JSON:: RESTful:: Web serv...,cocoa:: xcode 10.x:: ios sdk:: swift,Android Developer,Mobile Application Developer,<!DOCTYPE html><html><head></head><body><p><sp...,﻿ Professional Summary: · Over 7 Years of tota...,c5de20d4d2d94d7c89110658f9c9fe9f,"{""status"": true, ""statusCode"": ""OK"", ""requestI...","{""status"": true, ""statusCode"": ""OK"", ""requestI...","[""Cocoa"", ""Communication Skills"", ""Creativity""...","[""Android Software Development Kit (SDK)"", ""An..."
2,421a2d27-06d8-5e86-8045-75df5bd32c41,9f8a9cbbfd69c11429f0576f77710129,2021,4,20,,,,,Sr. Project Manager,IT Project Manager,<b>IT Project Manager Job Opening!</b> <br /> ...,"﻿ Anna Frisco, TX 75036 ♦ ♦ Linkedin: Professi...",2c92808278eb4f3f0178ed929fa618d7,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...","[""Agile Development"", ""Budgeting"", ""Business P...","[""Account Adjustment"", ""Agile Project Manageme..."
3,68436bfa-c2c4-5849-b5e5-3252b97ae6af,744187a046aa197acad39dd325b43a07,2021,8,31,Field Service Technician:: Integration Special...,,Repair:: Network:: Technician:: Field service:...,it support:: desktop,Field Service Technician,Deskside Support Lead,<p><b><u>Job Title</u>:</b> <b>Deskside Suppor...,"﻿ • • San Diego, CA Analytical and solutions-f...",17a47e0c85ab4bcfbc84f21fda8c68ff,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...","[""Application Packaging"", ""Audio / Visual Know...","[""Application Support"", ""Building Effective Re..."
4,22e4dfce-db6c-5a98-bb2f-87ad33ac40d0,b32f76fef50fc6d9fa874ef2c35fb47e,2020,12,28,,,,.net core reactjs developer,Stack Developer,Sr .Net Core ReactJs Developer,<B> Principal duties/Roles and responsibilitie...,﻿ Email: Phone: Professional Summary: • 8+ yea...,2c928082762f2e340176a9ac56953b3d,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...","["".NET"", ""ASP.NET"", ""Active Server Pages (ASP)...","["".NET"", ""AJAX"", ""API Management"", ""ASP.NET"", ..."


In [6]:
# This model does not need the following columns, so delete them now.
df.drop(['profile_id', 'previous_title', 'current_title', 'profile_skills', 'desired_title'], axis=1, inplace=True)

In [7]:
eval_cols = ['job_title', 'job_description', 'resume', 'description_bg_parse', 'resume_bg_parse']
# eval_cols = ['description_bg_parse', 'resume_bg_parse']

The Burning Glass models need the taxonomy mapping.

In [8]:
df_taxonomy_skills = read_csv(params['df_taxonomy_skills'])
df_taxonomy_skills = df_taxonomy_skills[['skillLabel', 'skillClusterLabel']]
df_taxonomy_skills.sort_values(by=['skillLabel'], inplace=True)
df_taxonomy_skills.reset_index(drop=True, inplace=True)
df_taxonomy_skills.fillna('', inplace=True)
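# Blank out cluster labels that merely repeat the skill label (use .loc to avoid chained assignment).
df_taxonomy_skills.loc[df_taxonomy_skills.skillLabel == df_taxonomy_skills.skillClusterLabel, 'skillClusterLabel'] = ''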
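
The effect of the taxonomy cleanup in the cell above can be seen on a toy frame. The rows below are made-up examples rather than values from `burning_glass_skills.csv`; only the two column names come from the notebook.

```python
import pandas as pd

# Hypothetical taxonomy rows: one normal, one whose cluster repeats the
# skill name, and one with a missing cluster.
toy = pd.DataFrame({
    "skillLabel": ["Python", "SQL", "Agile Development"],
    "skillClusterLabel": ["Scripting Languages", "SQL", None],
})

toy = toy.fillna("")
# A cluster label identical to the skill label adds no information, so blank it.
toy.loc[toy["skillLabel"] == toy["skillClusterLabel"], "skillClusterLabel"] = ""

print(toy)
```

After the cleanup only the first row keeps a cluster label.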

In [9]:
def dhiskillsexactmatch_ignorecase(row):
    if row.dhiSkillsExactMatch != '':
        return row.dhiSkillsExactMatch
    bgt_lower = row.bgtSkill.lower()
    for x in row.dhiSkills.split(';'):
        if (x.lower() == bgt_lower) and (x != row.bgtSkill):
            return x
    return ''

def remove_exact_in_skills(row):
    newlist = row.dhiSkills.copy()
    if isinstance(row.dhiSkillsExactMatch, str):
        if row.dhiSkillsExactMatch in row.dhiSkills:
            newlist.remove(row.dhiSkillsExactMatch)
    return newlist
    
df_bgt_skills_crosswalk = read_csv(params['df_bgt_skills_crosswalk'])
df_bgt_skills_crosswalk = df_bgt_skills_crosswalk[['bgtSkill', 'dhiSkillsExactMatch', 'dhiSkills']]
df_bgt_skills_crosswalk.fillna('', inplace=True)
df_bgt_skills_crosswalk['dhiSkillsExactMatch'] = df_bgt_skills_crosswalk.apply(dhiskillsexactmatch_ignorecase, axis=1)
df_bgt_skills_crosswalk['dhiSkills'] = df_bgt_skills_crosswalk['dhiSkills'].apply(lambda x: x.split(';'))
df_bgt_skills_crosswalk['dhiSkills'] = df_bgt_skills_crosswalk.apply(remove_exact_in_skills, axis=1)
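df_bgt_skills_crosswalk['skill'] = df_bgt_skills_crosswalk.dhiSkillsExactMatch.values
# Fall back to the Burning Glass name where there is no exact DHI match (use .loc to avoid chained assignment).
no_exact = df_bgt_skills_crosswalk['skill'] == ''
df_bgt_skills_crosswalk.loc[no_exact, 'skill'] = df_bgt_skills_crosswalk.loc[no_exact, 'bgtSkill']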
df_bgt_skills_crosswalk = df_bgt_skills_crosswalk[['bgtSkill', 'dhiSkills', 'skill']]
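
To make the crosswalk normalization concrete, here is the same case-insensitive matching logic applied to three toy rows. The skill names are invented for illustration and are not taken from `bgt_skills_crosswalk.csv`; the helper is a copy of `dhiskillsexactmatch_ignorecase` under a shorter name.

```python
import pandas as pd

def exact_match_ignorecase(row):
    # Keep an existing exact match; otherwise look for a DHI skill that
    # differs from the Burning Glass skill only by letter case.
    if row.dhiSkillsExactMatch != "":
        return row.dhiSkillsExactMatch
    bgt_lower = row.bgtSkill.lower()
    for x in row.dhiSkills.split(";"):
        if (x.lower() == bgt_lower) and (x != row.bgtSkill):
            return x
    return ""

toy = pd.DataFrame({
    "bgtSkill": ["JavaScript", "Python", "Kubernetes"],
    "dhiSkillsExactMatch": ["", "Python", ""],
    "dhiSkills": ["Javascript;Node.js", "Python;Django", "Docker"],
})

toy["dhiSkillsExactMatch"] = toy.apply(exact_match_ignorecase, axis=1)
# Canonical skill: the exact DHI match if any, else the Burning Glass name.
toy["skill"] = toy["dhiSkillsExactMatch"].where(
    toy["dhiSkillsExactMatch"] != "", toy["bgtSkill"])

print(toy[["bgtSkill", "dhiSkillsExactMatch", "skill"]])
```

The first row picks up the differently cased DHI spelling, the second keeps its existing match, and the third falls back to the Burning Glass name.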

In [10]:
def bgtskillsexactmatch_ignorecase(row):
    if row.bgtSkillsExactMatch != '':
        return row.bgtSkillsExactMatch
    dhi_lower = row.dhiSkill.lower()
    for x in row.bgtSkills.split(';'):
        if (x.lower() == dhi_lower) and (x != row.dhiSkill):
            return x
    return ''

def remove_exact_in_bgt_skills(row):
    # Named differently from the helper in the previous cell so that re-running either cell uses the right function.
    newlist = row.bgtSkills.copy()
    if isinstance(row.bgtSkillsExactMatch, str):
        if row.bgtSkillsExactMatch in row.bgtSkills:
            newlist.remove(row.bgtSkillsExactMatch)
    return newlist

df_dhi_skills_crosswalk = read_csv(params['df_dhi_skills_crosswalk'])
# df_dhi_skills_crosswalk = df_dhi_skills_crosswalk[['bgtSkill', 'dhiSkills']]
df_dhi_skills_crosswalk.fillna('', inplace=True)
df_dhi_skills_crosswalk['bgtSkillsExactMatch'] = df_dhi_skills_crosswalk.apply(bgtskillsexactmatch_ignorecase, axis=1)
df_dhi_skills_crosswalk['bgtSkills'] = df_dhi_skills_crosswalk['bgtSkills'].apply(lambda x: x.split(';'))
df_dhi_skills_crosswalk['bgtSkills'] = df_dhi_skills_crosswalk.apply(remove_exact_in_bgt_skills, axis=1)
df_dhi_skills_crosswalk = df_dhi_skills_crosswalk[['dhiSkill', 'dhiSkillAliases', 'bgtSkills']]


# Create and train the composite model

Training on the full dataset takes about 30 minutes.

In [11]:
costfidf_tx = CosTfidfTransformer()
overlapping_skills_tx = BGLeftOverlappingSkillsTransformer(
        df_taxonomy_skills=df_taxonomy_skills,
        df_bgt_skills_crosswalk=df_bgt_skills_crosswalk,
        df_dhi_skills_crosswalk=df_dhi_skills_crosswalk)

In [12]:
ct_tx = CompositeTransformer([
        ('costfidf', costfidf_tx, ['job_title', 'job_description', 'resume']),
        ('overlapping_skills', overlapping_skills_tx, ['description_bg_parse', 'resume_bg_parse', 'job_title'])
    ], 
    instance_name=params['name'], instance_version=params['version'])

In [13]:
%%time
ct_tx.fit_transform(df[eval_cols])

clean_for_stemming:   0%|          | 0/150 [00:00<?, ?it/s]

stem:   0%|          | 0/150 [00:00<?, ?it/s]

_get_cossim_diag_helper:   0%|          | 0/100 [00:00<?, ?it/s]

_get_confidence_helper:   0%|          | 0/50 [00:00<?, ?it/s]

loads:   0%|          | 0/150 [00:00<?, ?it/s]

extract_canonical_skill_names:   0%|          | 0/150 [00:00<?, ?it/s]

_get_rand_scores:   0%|          | 0/50 [00:00<?, ?it/s]

_get_confidence_helper:   0%|          | 0/50 [00:00<?, ?it/s]

_drop_row:   0%|          | 0/50 [00:00<?, ?it/s]

CPU times: user 8min 59s, sys: 32.3 s, total: 9min 32s
Wall time: 25min 2s


Unnamed: 0,costfidf__score,costfidf__qtile,costfidf__confidence,overlapping_skills__score,overlapping_skills__qtile,overlapping_skills__confidence,overlapping_skills__details,confidence,costfidf__confimprt,overlapping_skills__confimprt,confidence_total,costfidf__relative_weight,overlapping_skills__relative_weight,costfidf__contribution,overlapping_skills__contribution,composite_score,qtile,pred
0,0.051712,0.176472,0.345585,0.370016,0.115344,0.301485,label descripti...,0.323535,0.415683,0.388256,0.803939,0.517058,0.482942,0.091246,0.055704,0.146951,0.079727,1
1,0.094521,0.454577,0.617729,0.420799,0.162078,0.360377,label descript...,0.489053,0.555756,0.424486,0.980242,0.566958,0.433042,0.257726,0.070187,0.327913,0.236688,2
2,0.156614,0.748753,0.868089,0.659337,0.519725,0.772557,label ...,0.820323,0.658821,0.621513,1.280334,0.514569,0.485431,0.385285,0.252291,0.637576,0.668248,3
3,0.081254,0.370896,0.597574,0.518037,0.282519,0.520606,label descript...,0.559090,0.546614,0.510199,1.056813,0.517229,0.482771,0.191838,0.136392,0.328230,0.237129,2
4,0.178119,0.810799,0.915683,0.904752,0.917659,0.73696,label descriptio...,0.826321,0.67664,0.607025,1.283665,0.527115,0.472885,0.427385,0.433947,0.861331,0.942829,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,0.141959,0.694449,0.996829,0.984691,0.984078,0.794255,label descri...,0.895542,0.705985,0.630181,1.336165,0.528366,0.471634,0.366923,0.464124,0.831048,0.917752,5
99996,0.105828,0.521731,0.870655,0.885445,0.896888,0.585767,label descri...,0.728211,0.659793,0.541187,1.200981,0.549379,0.450621,0.286628,0.404157,0.690785,0.749464,4
99997,0.027908,0.049423,0.157021,0.343562,0.095852,0.272635,...,0.214828,0.280197,0.369212,0.649409,0.431465,0.568535,0.021324,0.054495,0.075819,0.032281,1
99998,0.058864,0.221364,0.486593,0.625727,0.457692,0.572319,label description_...,0.529456,0.493251,0.534939,1.028190,0.479728,0.520272,0.106195,0.238125,0.344319,0.253869,2
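
The derived columns in this output appear to follow simple arithmetic. The sketch below reproduces row 0 from the printed values. The formulas are inferred from the numbers shown, together with the `importance` of 0.5 that `detailed_predict()` reports for each submodel; they are not taken from the `CompositeTransformer` source, so treat them as an assumption.

```python
import numpy as np

# Row 0 of the output, in the order [costfidf, overlapping_skills].
qtile = np.array([0.176472, 0.115344])
confidence = np.array([0.345585, 0.301485])
importance = np.array([0.5, 0.5])  # as reported by detailed_predict()

# Inferred: *__confimprt is the geometric mean of confidence and importance.
confimprt = np.sqrt(confidence * importance)
confidence_total = confimprt.sum()

# Inferred: each submodel's weight is its share of confidence_total,
# and its contribution is that weight times its quantile score.
relative_weight = confimprt / confidence_total
contribution = relative_weight * qtile
composite_score = contribution.sum()

# Inferred: the overall confidence is the plain mean of the submodel confidences.
overall_confidence = confidence.mean()

print(confimprt, relative_weight, contribution, composite_score, overall_confidence)
```

These values agree with the printed row to within rounding (for example `composite_score` of about 0.146951 and `confidence` of about 0.323535).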


# Confirm that we can do a transform on a few rows

In [14]:
ct_tx.transform(df.loc[:4, eval_cols])

Unnamed: 0,costfidf__score,costfidf__qtile,costfidf__confidence,overlapping_skills__score,overlapping_skills__qtile,overlapping_skills__confidence,overlapping_skills__details,confidence,costfidf__confimprt,overlapping_skills__confimprt,confidence_total,costfidf__relative_weight,overlapping_skills__relative_weight,costfidf__contribution,overlapping_skills__contribution,composite_score,qtile,pred
0,0.051712,0.176472,0.345585,0.370016,0.115344,0.301485,label descripti...,0.323535,0.415683,0.388256,0.803939,0.517058,0.482942,0.091246,0.055704,0.146951,0.079727,1
1,0.094521,0.454577,0.617729,0.420799,0.162078,0.360377,label descript...,0.489053,0.555756,0.424486,0.980242,0.566958,0.433042,0.257726,0.070187,0.327913,0.236688,2
2,0.156614,0.748753,0.868089,0.659337,0.519725,0.772557,label ...,0.820323,0.658821,0.621513,1.280334,0.514569,0.485431,0.385285,0.252291,0.637576,0.668248,3
3,0.081254,0.370896,0.597574,0.518037,0.282519,0.520606,label descript...,0.55909,0.546614,0.510199,1.056813,0.517229,0.482771,0.191838,0.136392,0.32823,0.237129,2
4,0.178119,0.810799,0.915683,0.904752,0.917659,0.73696,label descriptio...,0.826321,0.67664,0.607025,1.283665,0.527115,0.472885,0.427385,0.433947,0.861331,0.942829,5


# Confirm that we can do a transform on one row

In [15]:
idx = 6
X = ct_tx.transform(df.loc[idx:idx, eval_cols])
X

Unnamed: 0,costfidf__score,costfidf__qtile,costfidf__confidence,overlapping_skills__score,overlapping_skills__qtile,overlapping_skills__confidence,overlapping_skills__details,confidence,costfidf__confimprt,overlapping_skills__confimprt,confidence_total,costfidf__relative_weight,overlapping_skills__relative_weight,costfidf__contribution,overlapping_skills__contribution,composite_score,qtile,pred
0,0.255996,0.93307,0.997938,0.818052,0.806892,0.680141,label descript...,0.83904,0.706378,0.583156,1.289533,0.547778,0.452222,0.511115,0.364895,0.87601,0.953899,5
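
The `df.loc[idx:idx, eval_cols]` form used above matters because label slices in pandas are inclusive and return a DataFrame, whereas a scalar label returns a Series. A small sketch on a toy frame (hypothetical data) shows the difference:

```python
import pandas as pd

df_toy = pd.DataFrame({"a": [10, 20, 30], "b": ["x", "y", "z"]})

# A label slice keeps the 2-D shape, even for a single row.
one_row = df_toy.loc[1:1, ["a", "b"]]

# A scalar label collapses the row into a Series.
as_series = df_toy.loc[1, ["a", "b"]]

# Label slices include the end label, so :2 selects three rows here.
first_three = df_toy.loc[:2, ["a"]]

print(type(one_row).__name__, one_row.shape)
print(type(as_series).__name__)
print(len(first_three))
```

This is also why `df.loc[:4, eval_cols]` in the earlier cell yields five rows rather than four.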


# Confirm that `detailed_predict()` works

In [16]:
results = ct_tx.detailed_predict(df.loc[idx:idx, eval_cols].to_dict(orient='records'))
results

{'overall_name': 'dice-composite',
 'overall_version': '3.0.13',
 'composite_name': 'composite',
 'composite_version': '2.0.1',
 'composite_confidence': 0.8390396316186214,
 'composite_rawscore': 0.8760095984015217,
 'composite_score': 0.9538985173034121,
 'overall_class': 5,
 'composite_submodels': [{'domain': 'text',
   'name': 'costfidf',
   'version': '1.2',
   'importance': 0.5,
   'applied_weighting': 0.5477777500023132,
   'confidence': 0.9979384527942814,
   'score': 0.9330699492622891,
   'explain': '{"rawscore": 0.25599567477902774}'},
  {'domain': 'skills',
   'name': 'overlapping_skills',
   'version': '1.1.1',
   'importance': 0.5,
   'applied_weighting': 0.45222224999768684,
   'confidence': 0.6801408104429614,
   'score': 0.8068922769760196,
   'explain': '{"details": {"label": {"0": "Information Technology", "1": "I.T. Administration", "2": "Technical support", "3": "Data", "4": "Database", "5": "Database administration", "6": "SQL Databases and Programming", "7": "Syst
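
Each submodel's `explain` field in this result is a JSON string rather than a nested dict, so it needs a second decoding step. The sketch below uses values copied from the output above. The second submodel's `explain` string is truncated in this notebook and is therefore left out, and the relationship between `applied_weighting`, `score`, and `composite_rawscore` is inferred from the printed numbers, not from the library.

```python
import json

# Subset of the detailed_predict() result printed above.
result = {
    "composite_rawscore": 0.8760095984015217,
    "composite_submodels": [
        {"name": "costfidf",
         "applied_weighting": 0.5477777500023132,
         "score": 0.9330699492622891,
         "explain": '{"rawscore": 0.25599567477902774}'},
        {"name": "overlapping_skills",
         "applied_weighting": 0.45222224999768684,
         "score": 0.8068922769760196},
    ],
}

# Decode the JSON string stored in 'explain'.
costfidf = result["composite_submodels"][0]
rawscore = json.loads(costfidf["explain"])["rawscore"]

# Inferred: the composite raw score is the weighted sum of submodel scores.
weighted_sum = sum(m["applied_weighting"] * m["score"]
                   for m in result["composite_submodels"])

print(rawscore, weighted_sum)
```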

In [17]:
# import numpy as np
# rowcols = np.arange(1,6)
# model = ct_tx
# model.set_prediction_thresholds_from_labels(df_labeled.overall)
# df_labeled['pred'] = model.predict(df_labeled[eval_cols])
# df_xtab = labeled_xtab(df_labeled, pred_col='pred', labeled_col='overall', rownames=rowcols, colnames=rowcols)
# d_stats = aggregate_stats_from_xtab(df_xtab)
# print_aggregate_stats(d_stats)
# display(HTML(df_xtab.to_html()))


# Save the model locally, then upload to S3.

In [None]:
Path(params['model_dir']).mkdir(parents=True, exist_ok=True)
fname = f'{params["name"]}_{params["version"]}_full.joblib'
ct_tx.save_model(ct_tx, os.path.join(params['model_dir'], fname))
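
Before uploading, it can be worth checking that a saved model loads back and behaves the same. The sketch below assumes the file is a standard joblib artifact, which the `.joblib` extension suggests but this notebook does not confirm, and it uses a toy scikit-learn pipeline and a hypothetical file name in place of `ct_tx`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted composite model.
X = np.array([[0.0], [1.0], [2.0]])
pipe = Pipeline([("scale", StandardScaler())]).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-model_0.0.1_full.joblib")  # hypothetical name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored object should reproduce the original outputs.
same = np.allclose(pipe.transform(X), restored.transform(X))
print(same)
```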

In [None]:
path = params['s3_model_dir'].split('s3://')[-1].split('/')
bucket = path[0]
key = '/'.join(path[1:]) + '/' + fname
print(f'Uploading: s3://{bucket}/{key}')
boto3.Session().resource('s3').Bucket(bucket).Object(key)\
    .upload_file(os.path.join(params['model_dir'], fname))
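
The bucket/key split above builds the key by string concatenation, which yields a double slash if `s3_model_dir` ends with `/`. A slightly more defensive variant is sketched below; `split_s3_uri` is a hypothetical helper, and the bucket and paths in the usage lines are placeholders, not the real configuration values.

```python
import posixpath
from urllib.parse import urlparse

def split_s3_uri(s3_dir, filename):
    """Split an s3:// directory URI into (bucket, key) for the given file."""
    parsed = urlparse(s3_dir)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_dir!r}")
    # Strip leading/trailing slashes so the key never starts with '/'
    # or contains '//'.
    prefix = parsed.path.strip("/")
    key = posixpath.join(prefix, filename) if prefix else filename
    return parsed.netloc, key

# Placeholder values for illustration only.
print(split_s3_uri("s3://example-bucket/models/dice/", "model_full.joblib"))
print(split_s3_uri("s3://example-bucket/models/dice", "model_full.joblib"))
```

Both calls return the same `(bucket, key)` pair, which can then be passed to `Bucket(bucket).Object(key)` as in the cell above.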