This notebook analyzes the impact of incorporating the confidence score. We find that accuracy improves with the current definition of confidence, which is simply the logistic-regression probability that a score belongs to the training set rather than to a random selection. Through this analysis, we also found that our previous definition of confidence, $2 \cdot \max(P(x=0), P(x=1)) - 1$, where $P(x=0)$ is the probability of the null hypothesis and $P(x=1)$ is the probability of the actual distribution, is actually *worse* than excluding the confidence measure altogether. With the current definition, however, we gain 1.5% on our consensus dataset.

**Author:** Tom McTavish

**Date:** September 20, 2021

See also [the confidence notebook](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/notebooks/demos/confidence-score-by-shuffling.ipynb).
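
To make the two definitions in the summary concrete, here is a minimal sketch that evaluates both on a few probabilities. It assumes the two hypotheses are complementary, so that $P(x=0) = 1 - P(x=1)$; the function names `old_confidence` and `new_confidence` are illustrative and are not part of `dhi-dsmatch`.

```python
def old_confidence(p_actual):
    """Previous definition: 2 * max(P(x=0), P(x=1)) - 1.

    Assumes P(x=0) = 1 - P(x=1), which is an assumption of this sketch.
    """
    p_null = 1.0 - p_actual
    return 2.0 * max(p_null, p_actual) - 1.0


def new_confidence(p_actual):
    """Current definition: the probability of the actual distribution itself."""
    return p_actual


for p in (0.1, 0.5, 0.9):
    print(f"P(x=1)={p:.1f}  old={old_confidence(p):.2f}  new={new_confidence(p):.2f}")
```

Under this assumption the previous definition is symmetric about 0.5: a score that almost certainly comes from the null distribution receives the same confidence as one that almost certainly comes from the actual distribution, whereas the current definition keeps the two apart.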

In [13]:
%pip install -r requirements.txt
# %pip install dhi-dsmatch[training]==1.1.0
# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git@MATCH-1912-impact-of-confidence-score-in#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"


In [1]:
import os
from itertools import combinations_with_replacement, permutations

import yaml
import pandas as pd
import numpy as np
import joblib
from tqdm.auto import tqdm
from IPython.display import HTML

from dhi.dsmatch.util.io import read_csv
from dhi.dsmatch.analytics.modelevaluation import labeled_xtab, aggregate_stats_from_xtab, print_aggregate_stats
from dhi.dsmatch.analytics.modelevaluation import profile_predict

## Read in our parameters from the config file.

In [2]:
with open('config.yml', 'r') as f:
    params = yaml.safe_load(f)  # safe_load is sufficient for a plain config file.
params

{'name': 'dice-composite',
 'version': '3.0.8',
 'cache_location': '/home/ec2-user/SageMaker/shared/data/dice/v2/dice-composite/cache',
 'model_dir': '/home/ec2-user/SageMaker/shared/models/dice/',
 'data_dir': '/home/ec2-user/SageMaker/shared/data/dice/',
 'train_data': 's3://dev-matchology-datascience/data/dice/v2/dice_train_20210713.csv',
 'labeled_data': 's3://dev-matchology-datascience/data/dice/v2/dice_labeled_20210713_consensus.csv',
 'df_taxonomy_skills': 's3://dev-matchology-datascience/data/burning_glass_skills.csv',
 's3_model_dir': 's3://dev-matchology-datascience/models/dice',
 'gaussian_accuracy_threshold': 0.6,
 'median_timing_threshold': 1000}

## Load our validation data.

In [3]:
df = read_csv(params['labeled_data'])
df

index,snapshot_id,year,month,day,worker_email,experience,titles,skills,overall,previous_title,...,job_description,description_detected_lang,resume,application_id,resume_detected_lang,description_bg_parse,resume_bg_parse,set,job_data_bg_skills,resume_data_bg_skills
0,d23fa4d9-dbeb-55ac-a6b6-fc5c8f27172a,2021,1,22,thellmuth@iqclarity.com,2.0,2,3.0,2.0,,...,<!DOCTYPE html><html><head></head><body><p><st...,en,"\r\n Austin, TX • • • • \r\n\r\nTec...",2c928082762f2e6301772c3a64a407c3,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",test,"[""AngularJS"", ""Front-end Development"", ""JavaSc...","[""AWS Elastic Compute Cloud (EC2)"", ""Algebra"",..."
1,b1491953-2ddb-5188-9733-412c4008ebe2,2021,1,18,thellmuth@iqclarity.com,5.0,5,5.0,5.0,Programmer/Analyst,...,"<img src=""https://counter.adcourier.com/bWFyay...",en,"﻿\t F. , Jr.\r\n\r\n 375 Village Dr.\r\n ...",2c928082762f2e420177152c2b6c0221,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",test,"[""CA Endevor"", ""COBOL"", ""FileAID"", ""Mainframe""...","[""Business Management"", ""Business Systems Anal..."
2,8482bdb2-6c2b-54eb-bd0f-e5c26dd763cd,2021,1,6,thellmuth@iqclarity.com,4.0,4,4.0,4.0,Project Trainee,...,<!DOCTYPE html><html><head></head><body><p>We ...,en,Anand \r\n\r\n ...,2c928082762f2e340176d8eee7aa0e6c,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",train,"[""Communication Skills"", ""Verbal / Oral Commun...","[""Artificial Intelligence"", ""Autonomous System..."
3,54b821c1-4401-552b-9754-79bfc33974fd,2021,1,17,thellmuth@iqclarity.com,3.0,3,3.0,3.0,research:: Data Science Developer:: research::...,...,<!DOCTYPE html><html><head></head><body><p>Dat...,en,"\r\nCoconut Creek FL, , 412.614.\...",2c928082762f2e63017711a75bce74f2,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",test,"[""Communication Skills"", ""Data Analysis"", ""Dat...","[""Alteryx"", ""Analysis Of Variance (ANOVA)"", ""A..."
4,85ec22c4-5d4d-57cd-98e7-7ab2f4fdba8d,2021,1,11,thellmuth@iqclarity.com,5.0,4,4.0,4.0,Sr. Tableau Developer/Business Analyst:: Table...,...,<p><b>Please apply for this position by sendin...,en,﻿ ...,2c928082762f2e340176f2edfe314698,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",test,"[""Data Analysis"", ""Microsoft Excel"", ""Microsof...","[""AJAX"", ""AWS Redshift"", ""Active Server Pages ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1413,1b70a7aa-6c75-54c0-9212-ff3c41a803bc,2021,2,4,thellmuth@iqclarity.com,5.0,4,5.0,5.0,DevOps Cloud Engineer:: DevOps/AWS Engineer:: ...,...,<!DOCTYPE html><html><head></head><body><p><st...,en,﻿\r\n\r\n\r\n Engineer\r\n\r\nEmail: ...,2c928082773ade8c01776e0a4d675a6f,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",train,"[""Cloud Computing"", ""Communication Skills"", ""D...","[""API Management"", ""AWS Elastic Block Store (E..."
1414,c6ad065b-7fe5-54f9-8ab7-5ecfb0f16b99,2021,2,3,thellmuth@iqclarity.com,4.0,4,4.0,4.0,,...,<!DOCTYPE html><html><head></head><body><p><st...,en,﻿ \r\n\r\n (BIBY)\r\n \r\n\r\n\r\nMortgage und...,2c928082773ade9a017768b00d98028f,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",test,"[""Detail-Oriented"", ""Microsoft Office"", ""Mortg...","[""Account Closing"", ""Automated Underwriting Sy..."
1415,3276a097-a7e6-5d37-a4d3-f8019797c950,2021,2,4,thellmuth@iqclarity.com,5.0,4,5.0,5.0,Tibco Developer:: Lead TIBCO Developer\Adminis...,...,<p>Greetings !</p> <p> </p> <p>Our client is ...,en,﻿\r\n\r\n\r\n\r\n | | \r\n ...,2c928082773ade9a01776e1464324250,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",test,"[""Extensible Markup Language (XML)"", ""Hardware...","[""Apache Ant"", ""Apigee"", ""Atlassian JIRA"", ""Au..."
1416,214c2c95-1ae5-563f-bf03-74024304b213,2021,2,5,thellmuth@iqclarity.com,5.0,4,4.0,4.0,UC Contractor:: Consulting UC Engineer:: UC Co...,...,<!DOCTYPE html><html><head></head><body><div>W...,en,﻿\r\n\r\nCCIE#55003 - Collaboration\r\n\r\n\r\...,2c928082773adea00177729e3f0e7f29,en,"{""status"": true, ""status_code"": ""OK"", ""request...","{""status"": true, ""statusCode"": ""OK"", ""requestI...",train,"[""Avaya"", ""Cisco"", ""Cisco Contact Center"", ""Ci...","[""Avaya"", ""Business-to-Business"", ""Call-record..."


## Specify necessary functions.

Note that these functions are similar, if not identical, to those in the [optimize_and_test_composite_model](./optimize_and_test_composite_model.ipynb) notebook.

In [5]:
rowcols = np.arange(1,6)
    
def get_combs(n_submodels=3):
    """Get weight combinations that sum to 1."""
    combs = []
    for c in combinations_with_replacement(np.linspace(0, 1, 21), n_submodels):
        if np.isclose(np.sum(c), 1):
            combs.extend(permutations(c))
    return sorted(set(combs))

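def simulate_importances(combs, df, model):
    """Score every importance combination and return one row of aggregate stats per combination."""
    rowcols = np.arange(1,6)
    dfs = []
    # Keep only the columns whose names contain '__' and end in 'qtile' or 'confidence'.
    cols = [c for c in df.columns if (c.endswith('qtile') or c.endswith('confidence')) and c.find('__') >= 0]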

    for c in tqdm(combs):
        model.set_importances(c)
        model.update_fitted_quantiles()  # Needs to be called anytime after updating importances.
        df_ = model._transform_postprocess(df[cols])
        df_['overall'] = df.overall.values
        df_xtab = labeled_xtab(df_, labeled_col='overall', rownames=rowcols, colnames=rowcols)
        d_stats = aggregate_stats_from_xtab(df_xtab)
        df_sample = pd.DataFrame.from_dict(d_stats, orient='index').T
        df_sample['importances'] = str(c)
        dfs.append(df_sample)
    
    return pd.concat(dfs, ignore_index=True)
    
def importances_from_simulations(df_sims):
    """Rank importance combinations by mean Gaussian accuracy, best first."""
    s = df_sims.groupby('importances')['gaussian_accuracy'].mean()
    return s.sort_values(ascending=False)

def get_transformed(df, model):
    """Run the model on the input columns and attach the labeled 'overall' rating."""
    cols = ['job_title', 'job_description', 'resume', 'description_bg_parse', 'resume_bg_parse']
    df_results = model.transform(df[cols])
    df_results['overall'] = df.overall.values
    return df_results

def assign_optimal_importances(df, df_results, model):
    """This integrates the functions above in one sequence."""
#     model.set_prediction_thresholds_from_labels(df.overall)
#     df_results = get_transformed(df, model)
    model.update_fitted_quantiles()
    df_sims = simulate_importances(get_combs(), df_results, model)
    s_importances = importances_from_simulations(df_sims)
    model.set_importances(list(eval(s_importances.index[0])))
    print(f'model.importances: {model.importances}')
    model.update_fitted_quantiles()
    cols = [c for c in df_results.columns if (c.endswith('qtile') or c.endswith('confidence')) and c.find('__') >= 0]
    df_ = model._transform_postprocess(df_results[cols])
    df_['overall'] = df.overall.values
    df_xtab = labeled_xtab(df_, pred_col='pred', labeled_col='overall', rownames=rowcols, colnames=rowcols)
    d_stats = aggregate_stats_from_xtab(df_xtab)
    print_aggregate_stats(d_stats)
    display(HTML(df_xtab.to_html()))
    print(df_xtab.to_html())
    return df_xtab, d_stats, df_sims
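
`get_combs` enumerates importance vectors on a 0.05 grid and keeps those that sum to 1. As a cross-check on the size of that search space, the sketch below builds the same kind of grid with integer arithmetic, so no floating-point tolerance is needed. It is an alternative illustration, not the notebook's implementation; `weight_grid` and `steps` are hypothetical names.

```python
from itertools import product


def weight_grid(n_submodels=3, steps=20):
    """Return all weight tuples on a grid of 1/steps whose entries sum to exactly 1."""
    grid = []
    for counts in product(range(steps + 1), repeat=n_submodels):
        if sum(counts) == steps:  # Exact integer test, no float tolerance needed.
            grid.append(tuple(k / steps for k in counts))
    return grid


combs = weight_grid()
print(len(combs))            # Number of candidate importance vectors.
print(combs[0], combs[-1])   # First and last vectors in enumeration order.
```

For three submodels this yields 231 vectors, which is consistent with the simulation results being indexed 0 through 230. Dividing integers by `steps` also avoids values such as `0.15000000000000002` that arise from `np.linspace`.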

## Load the model.

In [6]:
with open(os.path.join(params["model_dir"], params["version"], 
                       f'{params["name"]}_{params["version"]}.joblib'), 'rb') as f:
    model = joblib.load(f)
                       
# Pull in the full model so we can copy over its fitted_quantiles_.
with open(os.path.join(params["model_dir"], 
                       f'{params["name"]}_{params["version"]}_full.joblib'), 'rb') as f:
    model_full = joblib.load(f)
model.fitted_quantiles_ = model_full.fitted_quantiles_.copy()


## Run the dataset through the model, which uses confidence.

In [7]:
rowcols = np.arange(1,6)
df_results = get_transformed(df, model)
df_xtab = labeled_xtab(df_results, pred_col='pred', labeled_col='overall', rownames=rowcols, colnames=rowcols)
d_stats = aggregate_stats_from_xtab(df_xtab)
print(f'model.importances: {model.importances}')
print_aggregate_stats(d_stats)
display(HTML(df_xtab.to_html()))

model.importances: [0.75, 0.05, 0.2]
Total number of records: 1416
Total exact matches: 606
Percent exact: 42.8%
Percent one-half 1 off: 66.0%
Percent Gaussian rolloff: 72.3%


pred \ overall,1.0,2.0,3.0,4.0,5.0
1,62,63,28,9,1
2,27,112,94,38,5
3,9,57,130,116,27
4,0,13,82,201,94
5,0,2,20,125,101
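
The summary statistics printed above can be recovered from the cross-tabulation itself. The sketch below is inferred from the printed numbers and is not taken from the source of `aggregate_stats_from_xtab`: it assumes that "one-half 1 off" gives half credit to predictions one rating away from the label, and that "Gaussian rolloff" weights a miss of distance $d$ by $e^{-d^2/2}$ while ignoring misses with $d > 2$. The helper `weighted_accuracy` is a hypothetical name.

```python
import math

# Cross-tabulation from the run with confidence: rows are predictions 1-5,
# columns are labeled "overall" ratings 1-5.
xtab = [
    [62, 63, 28, 9, 1],
    [27, 112, 94, 38, 5],
    [9, 57, 130, 116, 27],
    [0, 13, 82, 201, 94],
    [0, 2, 20, 125, 101],
]


def weighted_accuracy(xtab, weight_by_distance):
    """Weight each cell by |pred - label| and divide the credit by the total count."""
    total = sum(sum(row) for row in xtab)
    credit = 0.0
    for i, row in enumerate(xtab):
        for j, count in enumerate(row):
            credit += weight_by_distance.get(abs(i - j), 0.0) * count
    return credit / total


exact = weighted_accuracy(xtab, {0: 1.0})
one_half = weighted_accuracy(xtab, {0: 1.0, 1: 0.5})
# Assumed Gaussian weights with sigma = 1, truncated beyond a distance of 2.
gaussian = weighted_accuracy(xtab, {d: math.exp(-d * d / 2) for d in range(3)})

print(f"exact={exact:.3f} one_half={one_half:.3f} gaussian={gaussian:.3f}")
```

Under these assumptions the three values round to 42.8%, 66.0%, and 72.3%, matching the printed statistics for this run.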


## Turn off confidence.

Setting every confidence column to 1, both in the model's fitted quantiles and in the results we calculated above, effectively turns off confidence. We then sweep the space of submodel importances to find the best-performing model without confidence.

In [8]:
import inspect
from dhi.dsmatch.util import measures
print(inspect.getsource(measures))

from tqdm.auto import tqdm
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression


def columnwise_cosine_similarity(a, b):
    sims = []
    for idx in tqdm(range(a.shape[0])):
        a_tfidf = a[idx, :]
        b_tfidf = b[idx, :]
        sims.append(cosine_similarity(a_tfidf, b_tfidf).item())
    return sims

def logistic_regression_confidence(trained_scores, rand_scores, scores, return_classifier=False):
    """Given values of trained/actual scores and values from a random, representative distribution, and
    a new sample of score(s), return the confidence that each score is in the actual vs random distribution.
    We define the confidence of a score as the likelihood it belongs to the training distribution as
    compared to a random distribution as obtained via logistic regression.

    Args:
        trained_scores (np.array): 1D array of "actual" distribution scores as obtained possibly during training.
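
The source listing above is truncated, so the body of `logistic_regression_confidence` is not shown. As a rough illustration of the idea in its docstring, the sketch below fits a one-dimensional logistic regression that separates "trained" scores (label 1) from "random" scores (label 0) and reads off the predicted probability as the confidence. It uses plain gradient descent rather than scikit-learn's `LogisticRegression`, and the score values and the helper names `fit_logistic_1d` and `confidence` are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, lr=0.5, n_iter=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def confidence(score, w, b):
    """Probability that a score belongs to the trained rather than the random distribution."""
    return sigmoid(w * score + b)


# Hypothetical similarity scores, for illustration only.
trained_scores = [0.62, 0.70, 0.55, 0.81, 0.66, 0.74]
rand_scores = [0.10, 0.22, 0.15, 0.30, 0.05, 0.18]

xs = trained_scores + rand_scores
ys = [1] * len(trained_scores) + [0] * len(rand_scores)
w, b = fit_logistic_1d(xs, ys)

low_conf = confidence(0.12, w, b)   # Score typical of the random distribution.
high_conf = confidence(0.75, w, b)  # Score typical of the trained distribution.
print(f"low={low_conf:.3f} high={high_conf:.3f}")
```

A score that resembles the random distribution receives a confidence below 0.5, and one that resembles the trained distribution receives a confidence above 0.5.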
  

In [9]:
conf_cols = [c for c in model.fitted_quantiles_.columns if c.endswith('confidence')]
for c in conf_cols:
    model.fitted_quantiles_[c] = 1.
    df_results[c] = 1.
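
The notebook does not show how `_transform_postprocess` combines quantiles, confidences, and importances. Purely as an illustration of why a constant confidence of 1 can act as an off switch, assume a composite score that is a mean of submodel quantiles weighted by importance times confidence. The function `composite_score` and the sample values are hypothetical.

```python
def composite_score(qtiles, confidences, importances):
    """Hypothetical combiner: mean of quantiles weighted by importance * confidence."""
    weights = [w * c for w, c in zip(importances, confidences)]
    total = sum(weights)
    if total == 0:
        raise ValueError("All weights are zero; the composite score is undefined.")
    return sum(w * q for w, q in zip(weights, qtiles)) / total


qtiles = [0.9, 0.4, 0.6]          # Made-up submodel quantile scores.
confidences = [0.95, 0.30, 0.70]  # Made-up per-submodel confidences.
importances = [0.75, 0.05, 0.2]   # Importances reported for the model with confidence.

with_conf = composite_score(qtiles, confidences, importances)
without_conf = composite_score(qtiles, [1.0, 1.0, 1.0], importances)
print(f"with confidence={with_conf:.4f} without={without_conf:.4f}")
```

With every confidence fixed at 1, this hypothetical combiner reduces to the plain importance-weighted mean, so only the importances remain to be tuned.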

In [10]:
df_xtab, d_stats, df_sims = assign_optimal_importances(df, df_results, model)

model.importances: [0.8, 0.05, 0.15000000000000002]
Total number of records: 1416
Total exact matches: 600
Percent exact: 42.4%
Percent one-half 1 off: 65.4%
Percent Gaussian rolloff: 71.7%


pred \ overall,1.0,2.0,3.0,4.0,5.0
1,63,66,30,9,1
2,29,111,96,44,4
3,5,52,123,111,31
4,1,16,85,203,92
5,0,2,20,122,100



In [11]:
df_sims.sort_values(by='gaussian_accuracy', ascending=False)

index,n_records,n_correct,absolute_accuracy,one_half_accuracy,gaussian_accuracy,importances
217,1416.0,600.0,0.423729,0.654308,0.717389,"(0.8, 0.05, 0.15000000000000002)"
198,1416.0,598.0,0.422316,0.653955,0.717071,"(0.65, 0.15000000000000002, 0.2)"
212,1416.0,600.0,0.423729,0.653602,0.716820,"(0.75, 0.1, 0.15000000000000002)"
205,1416.0,604.0,0.426554,0.653249,0.716267,"(0.7000000000000001, 0.1, 0.2)"
206,1416.0,595.0,0.420198,0.652189,0.715858,"(0.7000000000000001, 0.15000000000000002, 0.15..."
...,...,...,...,...,...,...
3,1416.0,550.0,0.388418,0.605226,0.668527,"(0.0, 0.15000000000000002, 0.8500000000000001)"
2,1416.0,544.0,0.384181,0.602401,0.666099,"(0.0, 0.1, 0.9)"
1,1416.0,544.0,0.384181,0.599576,0.663437,"(0.0, 0.05, 0.9500000000000001)"
20,1416.0,499.0,0.352401,0.582980,0.650076,"(0.0, 1.0, 0.0)"


In [12]:
df_sims[df_sims.importances.str.find('1.0') >= 0]

index,n_records,n_correct,absolute_accuracy,one_half_accuracy,gaussian_accuracy,importances
0,1416.0,527.0,0.372175,0.584746,0.649725,"(0.0, 0.0, 1.0)"
20,1416.0,499.0,0.352401,0.58298,0.650076,"(0.0, 1.0, 0.0)"
230,1416.0,562.0,0.396893,0.624647,0.690758,"(1.0, 0.0, 0.0)"
