This notebook builds upon what Tom McTavish accomplished. The acceptance criteria are:

        Append Job Description for Job Summary
        Retrain Unsupervised Model
        Compare Accuracy to Previous Version

This notebook builds, tests, and profiles the unsupervised NDimCosTfidf (N-Dimensional Cosine TF-IDF) model for eFC. Running the whole notebook takes about an hour.



**Author:** Tom McTavish
**Updated by:** Jeff Magouirk

**Date:** July 22, 2020
**Date of update:** September 2, 2020

**Confluence Page:** https://confluence.dhigroupinc.com/display/MATCH/MATCH-554-prototype-job-summary-job-desc

**Training Data:** Live-Feed CSV files from February - Mid July, 2020.
**New Training Data:** Live-Feed CSV files from February 8, 2020 - August 26, 2020.

* s3://dev-dhi-match-datascience/data/efc/live-feed/raw-2020<02-0826>.csv
  

  
**Testing Data:**
 
  * s3://dev-dhi-match-datascience/data/efc/train_test_20200716.csv
  * s3://dev-dhi-match-datascience/data/efc/Validation/train_test_with_jobsummary_09012020.cs
  
**Main Model Source:**

  Bitbucket: [dhi-match-datascience/dsmatch/sklearnmodeling/models/ndimcostfidf.py](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/MATCH-484-package-unsupervised-tfidf-mod/dsmatch/sklearnmodeling/models/ndimcostfidf.py)
  
**Model File Output:** s3://dev-dhi-match-datascience/models/efc/unsupervised/twodimcostfidf-20200722.joblib
**Combined Feature - Model File Output:** s3://dev-dhi-match-datascience/models/efc/unsupervised/twodimcostfidf-20200828.joblib

**JIRA:** [MATCH-484](https://jira.dhigroupinc.com/browse/MATCH-484)
**JIRA:** https://jira.dhigroupinc.com/browse/MATCH-554

In [1]:
### Run the uncommented pip installs first.
### This allows the imports below, and the rest of the notebook, to run.

!pip install --upgrade pip
!pip install --upgrade contractions
!pip install --upgrade line-profiler

# !pip install --upgrade tqdm
!pip install --upgrade joblib
# !conda update numpy --yes
# !conda remove scikit-learn --yes
!pip install --upgrade scikit-learn


Collecting pip
  Downloading pip-20.2.3-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.0.2
    Uninstalling pip-20.0.2:
      Successfully uninstalled pip-20.0.2
Successfully installed pip-20.2.3
Collecting contractions
  Downloading contractions-0.0.25-py2.py3-none-any.whl (3.2 kB)
Collecting textsearch
  Downloading textsearch-0.0.17-py2.py3-none-any.whl (7.5 kB)
Collecting Unidecode
  Downloading Unidecode-1.1.1-py2.py3-none-any.whl (238 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.0.tar.gz (312 kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick

In [3]:
import os

import joblib
import boto3
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
from IPython.display import HTML, display
from bs4 import BeautifulSoup  ###Added by Jeff Magouirk - 8/31/2020

from dsmatch.sklearnmodeling.models.ndimcostfidf import NDimCosTfidf
from dsmatch.analytics.modelevaluation import labeled_xtab, aggregate_stats_from_xtab, print_aggregate_stats
from dsmatch.analytics.modelevaluation import print_timing_performance, profile, profile_transform, profile_predict
from dsmatch import local_bucket, s3_ds_bucket
from dsmatch.util.io import read_csv
from dsmatch.util.s3 import list_files

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Set I/O parameters

In [5]:
data_subpath = os.path.join('data', 'efc', 'live-feed')

model_name_0 = 'twodimcostfidf-20200908_0' ###creating a model with the original form of the data
model_name = 'twodimcostfidf-20200908'  ###creating a new model, due to form of the data
model_subpath = os.path.join('models', 'efc', 'unsupervised')
cache_location = os.path.join(local_bucket, model_subpath, 'cache')
data_out_path = os.path.join(local_bucket, data_subpath)
model_profilerpath_0 = os.path.join(local_bucket, model_subpath, model_name_0+'_timings.txt')
model_profilerpath = os.path.join(local_bucket, model_subpath, model_name+'_timings.txt')
# Create the local model directory if it does not already exist.
os.makedirs(os.path.join(local_bucket, model_subpath), exist_ok=True)

# Load training data.

We obtain the list of files from our s3 bucket. Note that as `raw-YYYYMM.csv` files are added to our bucket, we add more training data. Therefore, updating the model only requires running this notebook again. This data comes from the live feed and covers February 8, 2020 to August 26, 2020.

In [None]:
bucket_object_list = list_files(prefix=data_subpath)

In [8]:
csv_files = [f for f in bucket_object_list if f.find('raw-') >= 0 and f.endswith('.csv')]
csv_files = csv_files[0:]  # Keeps every file (the slice is a no-op). Narrow it, e.g. [-3:] for the last 3 files, if we run out of memory.
print('Training files:')
csv_files

Training files:


['data/efc/live-feed/raw-202002.csv',
 'data/efc/live-feed/raw-202003.csv',
 'data/efc/live-feed/raw-202004.csv',
 'data/efc/live-feed/raw-202005.csv',
 'data/efc/live-feed/raw-202006.csv',
 'data/efc/live-feed/raw-202007.csv',
 'data/efc/live-feed/raw-20200826.csv']
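
The selection above depends on a naming convention: training files are `.csv` objects whose names contain `raw-`. A minimal sketch of that filter as a standalone function follows, assuming the listing is a plain list of key strings like the output shown. The helper name `select_training_files` is hypothetical, and it checks the basename prefix, which is slightly stricter than the `find('raw-')` test used in the cell.

```python
import posixpath


def select_training_files(keys):
    """Return the sorted keys whose basename looks like raw-<date>.csv."""
    selected = []
    for key in keys:
        name = posixpath.basename(key)
        if name.startswith('raw-') and name.endswith('.csv'):
            selected.append(key)
    return sorted(selected)


# Example listing; only the two raw-*.csv objects should survive the filter.
keys = [
    'data/efc/live-feed/raw-202003.csv',
    'data/efc/live-feed/raw-202002.csv',
    'data/efc/live-feed/notes.txt',
    'data/efc/live-feed/cache/raw-202002.parquet',
]
training_files = select_training_files(keys)
print(training_files)
```

Sorting the result keeps the file order stable between runs, which matters when the files are later concatenated into one DataFrame.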

We read each of the csv files into a dictionary of DataFrames, including only the columns of interest.

In [9]:
cols = ['resume', 'job.data.description', 'job.data.title','job.data.summary',
        'date_retrieved','Language_JD','Language_Resume']  ##change by jeff added the job.data.summary field.

<h3>Bringing in the data</h3>

The input can vary from run to run, depending on how much live-feed data is available.<br>
The sampling method accommodates additional data and also allows<br>
the same sample to be drawn every time.<br>
<i>Only records whose resume and job description are both in English are kept.</i>

In [10]:
dfs = {}
pbar = tqdm(csv_files)
for csv_file in pbar:
    k = csv_file.split('/')[-1].split('.csv')[0]
    pbar.set_description(k)
    df = read_csv(data_subpath, k + '.csv')
    df = df[cols]
    df = df[(df['Language_JD']=='en') & (df['Language_Resume']=='en')]
    df = df.drop(['Language_JD','Language_Resume'],axis=1)
    # Random seed is the month (or last digits) of the file. This approach allows us to
    # capture the specific sample again, if we want.
    rs = int(csv_file.split('2020')[-1].split('.csv')[0])
    df = df.sample(n=61_000, random_state=rs)
    df = df.fillna('')
    dfs[k] = df
df = pd.concat(dfs.values(), ignore_index=True)
print(f'Number of rows: {df.shape[0]}')

HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Number of rows: 427000
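
The loop above makes each file's sample repeatable by deriving the random seed from the digits that follow `2020` in the file name. The sketch below isolates that idea on toy data. The function names are hypothetical, and the `min(n, len(df))` guard is an addition that is not in the notebook: `DataFrame.sample` raises a `ValueError` when asked for more rows than exist (with the default `replace=False`).

```python
import re

import pandas as pd


def seed_from_filename(path):
    """Derive an integer seed from the digits after '2020' in raw-2020<digits>.csv."""
    match = re.search(r'raw-2020(\d+)\.csv$', path)
    if match is None:
        raise ValueError(f'Unexpected file name: {path}')
    return int(match.group(1))


def reproducible_sample(df, n, seed):
    """Sample up to n rows; the same seed always returns the same rows."""
    return df.sample(n=min(n, len(df)), random_state=seed)


# Toy stand-in for one month of live-feed data.
toy = pd.DataFrame({'resume': [f'resume {i}' for i in range(100)]})
seed = seed_from_filename('data/efc/live-feed/raw-20200826.csv')
first = reproducible_sample(toy, 10, seed)
second = reproducible_sample(toy, 10, seed)
print(seed, first.index.equals(second.index))
```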


In [11]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427000 entries, 0 to 426999
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   resume                427000 non-null  object
 1   job.data.description  427000 non-null  object
 2   job.data.title        427000 non-null  object
 3   job.data.summary      427000 non-null  object
 4   date_retrieved        427000 non-null  object
dtypes: object(5)
memory usage: 16.3+ MB


<h3> Combining the 'job.data.description' and 'job.data.summary' together </h3>
<br><h4>Using a sample of 427,000 records to stay within memory limits</h4> 
<br><h4>df_0 is the original dataset without the combined job.data.description and job.data.summary</h4>
<br><h4>df is the dataset with the combined job.data.description and job.data.summary</h4>

<i>Original dataset with new sample of 427,000 records training data</i> 

In [5]:
df_0 = df
df_0 = df_0[['resume','job.data.description','job.data.title']]
#df_0 = df_0.rename(columns={'job.data.description':'job_description','job.data.title':'job_title'})
print(df_0.info())
df_0.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427000 entries, 0 to 426999
Data columns (total 3 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   resume                427000 non-null  object
 1   job.data.description  427000 non-null  object
 2   job.data.title        427000 non-null  object
dtypes: object(3)
memory usage: 9.8+ MB
None


Unnamed: 0,resume,job.data.description,job.data.title
0,Reach Me at\r\n JASIM KOTTAKKARANTAVIDA Mob: ...,"<p>As an Accountant, you will report to the Ac...",Accountant
1,Georges E. Kairouz \r\n\r\nDate of birth: 05/0...,<p>We are a leading Banking Group headquartere...,Investment Banking Analyst
2,"SONAL NAGPAL, CFA \r\n Email: nagpalsonal89@g...",<p><strong>Responsibilities</strong></p> <ul> ...,Real Estate Portfolio Management
3,MAYO KOKU\r\nMobile Phone: - 07590 535685\r\nE...,<p><strong>Client</strong></p> <p>My client a ...,"Quantitative Analyst - C#, FRTB, DRC, IRC, VAR,"
4,Kush Chowdhary - CURRICULUM VITAE\r\n\r\nMobil...,<p><strong>Business Development Manager – Sale...,"Business Development Manager – Sales, Finance,..."


<i>Creating the dataframe of the combined fields of job.data.description and job.data.summary
<br> with the 427,000 record dataset </i>

In [6]:
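df['job_description'] = df['job.data.description'] + ' ' + df['job.data.summary']
# Train on the combined text: expose it under the column name used for the original model.
df_1 = df[['resume','job_description','job.data.title']].rename(
    columns={'job_description':'job.data.description'})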

df_1.head()

Unnamed: 0,resume,job.data.description,job.data.title
0,Reach Me at\r\n JASIM KOTTAKKARANTAVIDA Mob: ...,"<p>As an Accountant, you will report to the Ac...",Accountant
1,Georges E. Kairouz \r\n\r\nDate of birth: 05/0...,<p>We are a leading Banking Group headquartere...,Investment Banking Analyst
2,"SONAL NAGPAL, CFA \r\n Email: nagpalsonal89@g...",<p><strong>Responsibilities</strong></p> <ul> ...,Real Estate Portfolio Management
3,MAYO KOKU\r\nMobile Phone: - 07590 535685\r\nE...,<p><strong>Client</strong></p> <p>My client a ...,"Quantitative Analyst - C#, FRTB, DRC, IRC, VAR,"
4,Kush Chowdhary - CURRICULUM VITAE\r\n\r\nMobil...,<p><strong>Business Development Manager – Sale...,"Business Development Manager – Sales, Finance,..."


# Create and train the model.

This has a few parts to it:

  1. Preprocessing the data by cleaning and stemming it.  # Takes a while, but provides status bars
  2. Training the TfidfVectorizer.  # Takes a while and does not provide a status.
  3. Calculating the cosine similarities and setting the thresholds.  # Pretty fast, but no status.
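
The three steps above can be illustrated on toy data. This is a sketch only and not the `NDimCosTfidf` implementation: it uses scikit-learn's `TfidfVectorizer` directly, replaces cleaning and stemming with simple lowercasing, compares a single pair of fields, and assumes that the thresholds are quintile cut points of the training similarities. All of the sample text is invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

resumes = [
    'Senior accountant with audit and tax experience',
    'Python developer building trading systems',
    'Credit risk analyst with VaR modelling experience',
    'Sales manager in retail banking',
]
jobs = [
    'Accountant needed for audit and tax reporting',
    'Python developer for trading systems',
    'Quantitative analyst for market risk and VaR',
    'Head chef for a busy restaurant',
]

# Step 1 (stand-in): lowercase only; the real pipeline also cleans and stems.
resumes = [text.lower() for text in resumes]
jobs = [text.lower() for text in jobs]

# Step 2: fit one TF-IDF vocabulary on all text, then vectorize both sides.
vectorizer = TfidfVectorizer()
vectorizer.fit(resumes + jobs)
resume_vecs = vectorizer.transform(resumes)
job_vecs = vectorizer.transform(jobs)

# Step 3: row-wise cosine similarity. TfidfVectorizer L2-normalizes each row
# by default, so the element-wise product summed per row is the cosine.
sims = np.asarray(resume_vecs.multiply(job_vecs).sum(axis=1)).ravel()

# Assumed thresholding scheme: quintile cut points map similarity to 1..5.
thresholds = np.quantile(sims, [0.2, 0.4, 0.6, 0.8])
scores = np.digitize(sims, thresholds) + 1
print(np.round(sims, 3), scores)
```

The last pair shares no vocabulary, so its similarity is zero and it lands in the lowest score bucket.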

In [7]:
df_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427000 entries, 0 to 426999
Data columns (total 3 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   resume                427000 non-null  object
 1   job.data.description  427000 non-null  object
 2   job.data.title        427000 non-null  object
dtypes: object(3)
memory usage: 9.8+ MB


<h4>Original Model</h4>

In [8]:
model = NDimCosTfidf(memory=cache_location)
model.fit_transform(df_0)
df_0 = None # Try to free up memory
print('Done training the model.')

HBox(children=(FloatProgress(value=0.0, description='coupling job.data.title-job.data.description', max=150.0,…


Done training the model.


<h4>Model with combined job description and job summary</h4>

In [39]:
model = NDimCosTfidf(memory=cache_location)
model.fit_transform(df_1)
df_1 = None # Try to free up memory
print('Done training the model.')

HBox(children=(FloatProgress(value=0.0, description='coupling job.data.title-job.data.description', max=150.0,…


Done training the model.


## Write out the model file and upload to s3.

In [40]:
# To save the original model instead, build the key from model_name_0.

key = os.path.join(model_subpath, model_name + '.joblib')
joblib.dump(model, os.path.join(local_bucket, key))

['/home/ec2-user/SageMaker/shared/models/efc/unsupervised/twodimcostfidf-20200908_0.joblib']

In [41]:
boto3.Session().resource('s3').Bucket(s3_ds_bucket).Object(key).upload_file(os.path.join(local_bucket, key))

In [42]:
# And dump the tfidf vectorizer as a separate model, too.
key = os.path.join(model_subpath, 'onegram-tfidfvectorizer-20200908.joblib')
joblib.dump(model.vectorizer, os.path.join(local_bucket, key))
boto3.Session().resource('s3').Bucket(s3_ds_bucket).Object(key).upload_file(os.path.join(local_bucket, key))

# Evaluation

## Read the model from the s3 bucket, as if we were starting completely fresh.

In [12]:
# Original Model
# https://stackoverflow.com/a/59903472/394430

model_name = 'twodimcostfidf-20200908_0' ##Original model
key  = os.path.join(model_subpath, model_name + '.joblib')
print(key)
from io import BytesIO

with BytesIO() as data:
    boto3.resource('s3').Bucket(s3_ds_bucket).download_fileobj(key, data)
    data.seek(0)    # move back to the beginning after writing
    model = joblib.load(data)

models/efc/unsupervised/twodimcostfidf-20200908_0.joblib
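
The `data.seek(0)` call in the cell above matters because writing into a `BytesIO` buffer leaves the position at the end, so a load without rewinding would find nothing to read. The sketch below shows the same round trip without S3: `joblib.dump` into the buffer stands in for `download_fileobj`, and a plain dictionary stands in for the fitted model. Both substitutions are assumptions made so the example runs offline.

```python
from io import BytesIO

import joblib


def dumps_to_buffer(obj):
    """Serialize obj with joblib into an in-memory buffer, rewound to the start."""
    buffer = BytesIO()
    joblib.dump(obj, buffer)
    buffer.seek(0)  # rewind so the next read starts at the first byte
    return buffer


# Stand-in for a fitted model; any picklable object behaves the same way.
fake_model = {'name': 'twodimcostfidf-20200908_0', 'thresholds': [0.1, 0.2, 0.3, 0.4]}
buffer = dumps_to_buffer(fake_model)
restored = joblib.load(buffer)
print(restored == fake_model)
```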






## Pull in test data and run it through the model, evaluating accuracy.
First, load the new validated dataset, which contains the combined job description and job summary fields.

In [30]:
## Model with combined job summary and job description.
## 533 records
## This variable is job_description_clean
cols =['resume','job_description','job_title']
df_scored = read_csv(os.path.join('data', 'efc','Validated'), 'train_test_with_jobsummary_09012020.csv')
df_scored.info(verbose=True)
df_scored = df_scored.rename(columns={'resume_clean':'resume','job_description_clean':'job_description',
                                      'job_title_clean':'job_title'})
df_scored = df_scored.drop('current_job_title_clean',axis=1)
df_scored['combined'] = 1
df_scored.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533 entries, 0 to 532
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   match_score              533 non-null    float64
 1   job_description_clean    533 non-null    object 
 2   resume_clean             533 non-null    object 
 3   job_title_clean          533 non-null    object 
 4   current_job_title_clean  533 non-null    object 
dtypes: float64(1), object(4)
memory usage: 20.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533 entries, 0 to 532
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   match_score      533 non-null    float64
 1   job_description  533 non-null    object 
 2   resume           533 non-null    object 
 3   job_title        533 non-null    object 
 4   combined         533 non-null    int64  
dtypes: float64(1), int64(1), object(3)

In [31]:
### Original dataset (1,033 records)
df_scored_0 = read_csv(os.path.join('data', 'efc'), 'train_test_20200716.csv')
df_scored_0 = df_scored_0[['match_score','job_description_clean','resume_clean','job_title_clean']]
df_scored_0 = df_scored_0.rename(columns={'resume_clean':'resume','job_title_clean':'job_title'})
df_scored_0['Combined1'] = 0
df_scored_0.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1033 entries, 0 to 1032
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   match_score            1033 non-null   float64
 1   job_description_clean  1033 non-null   object 
 2   resume                 1033 non-null   object 
 3   job_title              1033 non-null   object 
 4   Combined1              1033 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 40.5+ KB


Create the original validation set: join the 533 new records to the original dataset to recover each record's original job description.

In [38]:
df_scored_00 = df_scored.merge(df_scored_0,on=['resume','job_title','match_score'], how='left')
df_scored_00.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 533 entries, 0 to 532
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   match_score            533 non-null    float64
 1   job_description        533 non-null    object 
 2   resume                 533 non-null    object 
 3   job_title              533 non-null    object 
 4   combined               533 non-null    int64  
 5   job_description_clean  533 non-null    object 
 6   Combined1              533 non-null    int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 33.3+ KB
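
A left merge keeps every row of the new set, but it can silently multiply rows when a key is duplicated, and it fills unmatched rows with `NaN`. The 533 non-null rows above suggest neither happened here. As a sketch of how to make that check explicit, pandas offers the `validate` and `indicator` arguments to `merge`; the toy frames below are invented and only mirror the key columns used in the notebook.

```python
import pandas as pd

new = pd.DataFrame({
    'resume': ['r1', 'r2', 'r3'],
    'job_title': ['Accountant', 'Analyst', 'Trader'],
    'match_score': [3.0, 4.0, 2.0],
    'job_description': ['desc + summary 1', 'desc + summary 2', 'desc + summary 3'],
})
original = pd.DataFrame({
    'resume': ['r1', 'r2', 'r4'],
    'job_title': ['Accountant', 'Analyst', 'Banker'],
    'match_score': [3.0, 4.0, 5.0],
    'job_description_clean': ['desc 1', 'desc 2', 'desc 4'],
})

merged = new.merge(
    original,
    on=['resume', 'job_title', 'match_score'],
    how='left',
    validate='one_to_one',  # raise MergeError if a key repeats on either side
    indicator=True,         # add a _merge column: 'both' or 'left_only'
)
# Rows of the new set that found no partner in the original set.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```

Here the third record has no counterpart in the original frame, so it is reported as unmatched instead of passing through with a missing description.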


In [15]:

df_scored_00.info(verbose=True)
df_scored_00 = df_scored_00[['match_score','resume','job_description_clean','job_title']]
df_scored_00 = df_scored_00.rename(columns={'job_description_clean':'job.data.description',
                                           'job_title':'job.data.title'})

df_scored_00.info(verbose=True)



<class 'pandas.core.frame.DataFrame'>
Int64Index: 533 entries, 0 to 532
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   match_score            533 non-null    float64
 1   job_description        533 non-null    object 
 2   resume                 533 non-null    object 
 3   job_title              533 non-null    object 
 4   combined               533 non-null    int64  
 5   job_description_clean  533 non-null    object 
 6   Combined1              533 non-null    int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 33.3+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 533 entries, 0 to 532
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   match_score           533 non-null    float64
 1   resume                533 non-null    object 
 2   job.data.description  533 non-null    object 
 3   j

In [17]:
cols =['resume','job.data.description','job.data.title']
df_scored_00['pred'] = model.predict(df_scored_00[cols])
df_xtab = labeled_xtab(df_scored_00, labeled_col='match_score')
d_stats = aggregate_stats_from_xtab(df_xtab)
print_aggregate_stats(d_stats)
display(HTML(df_xtab.to_html()))

HBox(children=(FloatProgress(value=0.0, description='clean', max=17.0, style=ProgressStyle(description_width='…




HBox(children=(FloatProgress(value=0.0, description='stem', max=17.0, style=ProgressStyle(description_width='i…


Total number of records: 533
Total exact matches: 272
Percent exact: 51.0%
Percent one-half 1 off: 71.8%
Percent Gaussian rolloff: 77.1%


match_score  1.0  2.0  3.0  4.0  5.0
pred
1             17   23    3    0    0
2             12   47   31    6    1
3              4   33  135   28    7
4              2    5   52   42   13
5              0    2   10   29   31


Load the model trained on the combined job summary and job description.

In [45]:
# Combined model of job summary and job description
# https://stackoverflow.com/a/59903472/394430
model_name = 'twodimcostfidf-20200908' ##Combined model
key  = os.path.join(model_subpath, model_name + '.joblib')
print(key)
from io import BytesIO

with BytesIO() as data:
    boto3.resource('s3').Bucket(s3_ds_bucket).download_fileobj(key, data)
    data.seek(0)    # move back to the beginning after writing
    model = joblib.load(data)

models/efc/unsupervised/twodimcostfidf-20200908.joblib


In [47]:
cols =['resume','job.data.description','job.data.title']
df_scored.rename(columns={'job_title':'job.data.title', 'job_description':'job.data.description'},
                 inplace=True)
df_scored.info(verbose=True)
df_scored['pred'] = model.predict(df_scored[cols])

print(df_scored.info())

df_xtab = labeled_xtab(df_scored, labeled_col='match_score')
d_stats = aggregate_stats_from_xtab(df_xtab)
print_aggregate_stats(d_stats)
display(HTML(df_xtab.to_html()))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533 entries, 0 to 532
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   match_score           533 non-null    float64
 1   job.data.description  533 non-null    object 
 2   resume                533 non-null    object 
 3   job.data.title        533 non-null    object 
 4   combined              533 non-null    int64  
 5   pred                  533 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 25.1+ KB


HBox(children=(FloatProgress(value=0.0, description='clean', max=17.0, style=ProgressStyle(description_width='…




HBox(children=(FloatProgress(value=0.0, description='stem', max=17.0, style=ProgressStyle(description_width='i…


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533 entries, 0 to 532
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   match_score           533 non-null    float64
 1   job.data.description  533 non-null    object 
 2   resume                533 non-null    object 
 3   job.data.title        533 non-null    object 
 4   combined              533 non-null    int64  
 5   pred                  533 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 25.1+ KB
None
Total number of records: 533
Total exact matches: 275
Percent exact: 51.6%
Percent one-half 1 off: 71.8%
Percent Gaussian rolloff: 77.0%


match_score  1.0  2.0  3.0  4.0  5.0
pred
1             18   25    5    0    0
2             11   46   33    8    1
3              4   32  133   28    8
4              2    5   52   48   13
5              0    2    8   21   30
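
To compare the two models side by side, the printed statistics can be recomputed from the two cross-tabulations. The sketch below is not the `dsmatch` implementation. The exact-match rate is the diagonal divided by the total. The "one-half 1 off" rate is assumed to be exact matches plus half credit for predictions that miss by one; that definition is inferred because it reproduces the printed values, and it is not confirmed from the library source. The Gaussian rolloff statistic is not reproduced because its weights are not stated.

```python
def xtab_metrics(xtab):
    """Return (total, exact %, exact plus half credit for one-off %).

    xtab[i][j] is the count of records with predicted score i + 1
    and labeled score j + 1.
    """
    size = len(xtab)
    total = sum(sum(row) for row in xtab)
    exact = sum(xtab[i][i] for i in range(size))
    one_off = sum(
        xtab[i][j]
        for i in range(size)
        for j in range(size)
        if abs(i - j) == 1
    )
    pct_exact = 100 * exact / total
    pct_half = 100 * (exact + 0.5 * one_off) / total
    return total, round(pct_exact, 1), round(pct_half, 1)


# Rows are pred 1..5, columns are match_score 1..5, copied from the outputs.
original = [
    [17, 23, 3, 0, 0],
    [12, 47, 31, 6, 1],
    [4, 33, 135, 28, 7],
    [2, 5, 52, 42, 13],
    [0, 2, 10, 29, 31],
]
combined = [
    [18, 25, 5, 0, 0],
    [11, 46, 33, 8, 1],
    [4, 32, 133, 28, 8],
    [2, 5, 52, 48, 13],
    [0, 2, 8, 21, 30],
]
print(xtab_metrics(original))  # (533, 51.0, 71.8)
print(xtab_metrics(combined))  # (533, 51.6, 71.8)
```

On these 533 records the combined model gains three exact matches (272 to 275), while the half-credit rate is unchanged at 71.8%.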
