## Quality assurance when you have fully labelled data

In this example, our data contains a fully populated ground-truth column called `cluster`, which lets us analyse the accuracy of the final model.

In [13]:
#%pip install git+https://github.com/moj-analytical-services/splink.git@migrate-demos

In [14]:
from splink.datasets import splink_datasets
import altair as alt
alt.renderers.enable("html")

df = splink_datasets.fake_1000

df.head(2)

Unnamed: 0,unique_id,first_name,surname,dob,city,email,cluster
0,0,Robert,Alan,1971-06-24,,robert255@smith.net,0
1,1,Robert,Allen,1971-05-24,,roberta25@smith.net,0


In [15]:
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

In [16]:
linker = DuckDBLinker(df, settings, set_up_basic_logging=False)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
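The `recall` argument states what share of all true matches the deterministic rules are expected to find. The arithmetic behind the resulting prior can be sketched roughly as follows. This is an illustration of the idea, not Splink's actual implementation; `estimate_prior` is a hypothetical helper and the count of 1,400 pairs is a made-up number, not a result from this dataset.

```python
def estimate_prior(n_records, pairs_found_by_rules, recall):
    """Rough prior that two random records match, for a dedupe-only job."""
    # Number of distinct record pairs within a single dataset
    total_pairs = n_records * (n_records - 1) // 2
    # The rules are assumed to find only a `recall` share of the true matches,
    # so scale the observed count up to estimate the total number of matches
    estimated_true_matches = pairs_found_by_rules / recall
    return estimated_true_matches / total_pairs


# 1,000 records as in fake_1000; 1,400 is a hypothetical count for illustration
prior = estimate_prior(n_records=1000, pairs_found_by_rules=1400, recall=0.7)
print(f"{prior:.6f}")
```

A lower `recall` inflates the estimated number of true matches, and therefore the prior, so it is worth choosing this value conservatively.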


In [17]:
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

In [18]:
session_dob = linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")
session_email = linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")


Level Jaro_winkler_similarity Username >= 0.88 on comparison email not observed in dataset, unable to train m value


In [19]:
linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)


You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained


Unnamed: 0,truth_threshold,match_probability,row_count,p,n,tp,tn,fp,fn,P_rate,N_rate,tp_rate,tn_rate,fp_rate,fn_rate,precision,recall,f1
0,-24.3,4.8414e-08,4353.0,2031.0,2322.0,2031.0,0.0,2322.0,0.0,0.466575,0.533425,1.0,0.0,1.0,0.0,0.466575,1.0,0.636278
1,-23.8,6.846774e-08,4353.0,2031.0,2322.0,2030.0,0.0,2322.0,1.0,0.466575,0.533425,0.999508,0.0,1.0,0.000492,0.466452,0.999508,0.636065
2,-23.7,7.33819e-08,4353.0,2031.0,2322.0,2030.0,227.0,2095.0,1.0,0.466575,0.533425,0.999508,0.097761,0.902239,0.000492,0.492121,0.999508,0.659519
3,-22.6,1.572975e-07,4353.0,2031.0,2322.0,2030.0,419.0,1903.0,1.0,0.466575,0.533425,0.999508,0.180448,0.819552,0.000492,0.516145,0.999508,0.680751
4,-22.5,1.685873e-07,4353.0,2031.0,2322.0,2030.0,573.0,1749.0,1.0,0.466575,0.533425,0.999508,0.24677,0.75323,0.000492,0.537179,0.999508,0.698795
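The rate columns in the truth space table follow directly from the `tp`, `tn`, `fp` and `fn` counts. As a minimal sketch (the helper name `classification_metrics` is ours, not part of Splink), the following reproduces the values in row 2 of the table above using the standard definitions:

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive standard accuracy metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,  # identical to tp_rate
        "f1": f1,
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Counts taken from the row with truth_threshold = -23.7
metrics = classification_metrics(tp=2030, tn=227, fp=2095, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

At such a low threshold almost every pair is accepted, so recall is close to 1 while precision is only about 0.49.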


In [20]:
linker.roc_chart_from_labels_column("cluster")
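A ROC curve plots `tp_rate` against `fp_rate` as the threshold varies. The sketch below builds ROC points from the five truth space rows shown earlier and includes a generic trapezoidal area helper. It is only illustrative: five rows cover a small part of the curve, and `trapezoid_area` is a hypothetical helper, not something Splink exposes.

```python
P, N = 2031, 2322  # total positives and negatives, from the `p` and `n` columns

# (tp, fp) counts from the first five rows of the truth space table
rows = [(2031, 2322), (2030, 2322), (2030, 2095), (2030, 1903), (2030, 1749)]

# Each threshold gives one (fp_rate, tp_rate) point; sort by fp_rate for plotting
roc_points = sorted((fp / N, tp / P) for tp, fp in rows)


def trapezoid_area(points):
    """Area under a piecewise-linear curve given points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


for fp_rate, tp_rate in roc_points:
    print(f"fp_rate={fp_rate:.6f}  tp_rate={tp_rate:.6f}")
```

Applying `trapezoid_area` to the points for every threshold, from (0, 0) to (1, 1), would give the area under the full ROC curve.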




In [21]:
linker.precision_recall_chart_from_labels_column("cluster")




In [22]:
# Show some prediction errors (false positives and false negatives) as a table
linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)




Unnamed: 0,clerical_match_score,found_by_blocking_rules,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,bf_first_name,...,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,bf_email,cluster_l,cluster_r,match_key
0,0.0,True,0.072604,0.512579,110,844,Oliver,Oliver,4,85.511109,...,,1.0,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,24.909135,31,211,0
1,0.0,True,0.072604,0.512579,112,844,Oliver,Oliver,4,85.511109,...,,1.0,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,24.909135,31,211,0
2,0.0,True,0.072604,0.512579,114,844,Oliver,Oliver,4,85.511109,...,,1.0,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,24.909135,31,211,0
3,0.0,True,2.207924,0.822067,115,844,Oliver,Oliver,4,85.511109,...,,1.0,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,24.909135,31,211,0
4,0.0,True,0.987514,0.664741,461,603,Henry,Henry,4,85.511109,...,0.00738,0.429161,1.0,henry.w@miller-mitheiln.lnfo,henry.c35@love-banks.com,1,24.909135,117,149,0
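The `match_weight` and `match_probability` columns carry the same information on different scales. Assuming the usual convention that the match weight is the base-2 logarithm of the overall Bayes factor, the conversion can be sketched as below (`match_weight_to_probability` is an illustrative helper, not a Splink function). It reproduces the values in the table above.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2 Bayes factor (match weight) into a match probability."""
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


# Match weights taken from rows 0 and 3 of the prediction errors table
for weight in (0.072604, 2.207924):
    print(f"{weight} -> {match_weight_to_probability(weight):.6f}")
```

A match weight of 0 corresponds to a probability of 0.5, which is why these errors, with weights just above zero, sit close to the decision boundary.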


In [23]:
records = linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.waterfall_chart(records)


