# Combining multiple blocking passes

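In the [detailed deduplication tutorial](./deduplication_detailed_example.ipynb) we discussed a problem with blocking rules: it's rare to find a single rule with both very high recall and high precision, that is, one that captures nearly all true matches without generating a large number of non-matching comparisons.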

We recommend instead running multiple Splink jobs, each with a different blocking rule, and then combining their parameter estimates into a single set of global parameters.

This notebook contains an example of how to do this.
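
To make the motivation concrete, here is a minimal sketch of how the union of two blocking rules can recover more true matches than either rule alone. The records and the `blocked_pairs` and `recall` helpers are hypothetical and exist only for illustration; they are not part of Splink.

```python
from itertools import combinations

# Toy records (made-up data, not the notebook's dataset).
# Records 1-3 are the same person with typos; record 4 is someone else.
records = [
    {"id": 1, "first_name": "Julia", "surname": "Taylor"},
    {"id": 2, "first_name": "Julia", "surname": "Tayor"},
    {"id": 3, "first_name": "Juila", "surname": "Taylor"},
    {"id": 4, "first_name": "Noah", "surname": "Watson"},
]
true_matches = {(1, 2), (1, 3), (2, 3)}


def blocked_pairs(rows, column):
    """Return the id pairs whose values agree exactly on `column`."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(rows, 2)
        if a[column] is not None and a[column] == b[column]
    }


def recall(pairs):
    """Share of the true matches that a set of candidate pairs contains."""
    return len(pairs & true_matches) / len(true_matches)


by_first_name = blocked_pairs(records, "first_name")
by_surname = blocked_pairs(records, "surname")
combined = by_first_name | by_surname

print(recall(by_first_name), recall(by_surname), recall(combined))
```

In this toy data each rule finds a different true match, so the union has higher recall than either rule. The pair (2, 3) has a typo in both columns and is still missed, which is why further passes on other columns can help.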

## Step 1: Imports and setup

In [2]:
import pandas as pd 
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100

In [3]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

In [4]:
from utility_functions.demo_utils import get_spark
spark = get_spark() 

In [5]:
df = spark.read.parquet("data/fake_1000.parquet")
df.show(5)

+---------+----------+-------+----------+------+--------------------+-----+
|unique_id|first_name|surname|       dob|  city|               email|group|
+---------+----------+-------+----------+------+--------------------+-----+
|        0|    Julia |   null|2015-10-29|London| hannah88@powers.com|    0|
|        1|    Julia | Taylor|2015-07-31|London| hannah88@powers.com|    0|
|        2|    Julia | Taylor|2016-01-27|London| hannah88@powers.com|    0|
|        3|    Julia | Taylor|2015-10-29|  null|  hannah88opowersc@m|    0|
|        4|      oNah| Watson|2008-03-23|Bolton|matthew78@ballard...|    1|
+---------+----------+-------+----------+------+--------------------+-----+
only showing top 5 rows



## Job 1: Block on first name

In [37]:
settings_first_name = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.first_name = r.first_name"
    ],
    "comparison_columns": [
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}

In [38]:
from splink.estimate import estimate_u_values
settings_first_name_with_u = estimate_u_values(settings_first_name, df, spark, fix_u_probabilities=True)
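
The call above estimates the u probabilities before EM is run. Conceptually, u is the chance that two records that are not a match agree on a column. A common way to approximate it is to sample random record pairs and assume that almost all of them are non-matches. The sketch below illustrates that idea in plain Python; `estimate_u` is a hypothetical helper and is not Splink's implementation.

```python
import random


def estimate_u(values, n_pairs=10_000, seed=0):
    """Approximate P(two random records agree on a column).

    Assumes random pairs are overwhelmingly non-matches, so the
    agreement rate among them is a reasonable stand-in for u.
    """
    rng = random.Random(seed)
    agreements = 0
    compared = 0
    for _ in range(n_pairs):
        a, b = rng.sample(values, 2)  # two distinct records
        if a is None or b is None:
            continue  # nulls are skipped rather than counted as disagreement
        compared += 1
        agreements += a == b
    return agreements / compared


# Hypothetical city column with a skewed distribution.
cities = ["London"] * 50 + ["Leeds"] * 30 + ["Bolton"] * 20
u_city = estimate_u(cities)
print(round(u_city, 3))
```

A skewed column such as city gives a fairly high u, because two unrelated records often share a common value. This is part of the reason term frequency adjustments are enabled for that column in the settings.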

In [39]:
from splink import Splink
linker_fn = Splink(settings_first_name_with_u, df, spark)
df_e_fn = linker_fn.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.22885468006134035 for key dob, level 1
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.06432151794433594 for key dob, level 0
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.01809200644493103 for key email, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.0058144330978393555 for key email, level 0
INFO:splink.iterate:EM algorithm has converged
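
The log shows the largest parameter change shrinking on each iteration until it falls below the `em_convergence` value of 0.01. The following is a minimal sketch of that stopping rule for a Fellegi-Sunter model with two-level comparisons and fixed u probabilities. The function `em_with_fixed_u` and the agreement counts are hypothetical, and the real Splink implementation handles multiple levels, nulls and Spark execution.

```python
def em_with_fixed_u(gammas, u, m, lam, tol=0.01, max_iter=50):
    """Estimate m probabilities and the match proportion, holding u fixed.

    gammas: list of tuples of 0/1 agreement flags, one tuple per record pair.
    u, m:   per-column P(agree | non-match) and starting P(agree | match).
    lam:    starting proportion of matches among the compared pairs.
    """
    n_cols = len(u)
    for iteration in range(1, max_iter + 1):
        # E-step: posterior probability that each pair is a match.
        posteriors = []
        for g in gammas:
            p_match, p_non_match = lam, 1 - lam
            for k in range(n_cols):
                p_match *= m[k] if g[k] else 1 - m[k]
                p_non_match *= u[k] if g[k] else 1 - u[k]
            posteriors.append(p_match / (p_match + p_non_match))

        # M-step: re-estimate lambda and m from the posteriors (u stays fixed).
        expected_matches = sum(posteriors)
        new_lam = expected_matches / len(gammas)
        new_m = [
            sum(p * g[k] for p, g in zip(posteriors, gammas)) / expected_matches
            for k in range(n_cols)
        ]

        # Convergence test: largest absolute change in any parameter.
        max_change = max(
            [abs(new_lam - lam)] + [abs(a - b) for a, b in zip(new_m, m)]
        )
        m, lam = new_m, new_lam
        if max_change < tol:
            return m, lam, iteration, True
    return m, lam, max_iter, False


# Hypothetical agreement patterns for two columns (made-up counts).
gammas = [(1, 1)] * 20 + [(1, 0)] * 5 + [(0, 1)] * 5 + [(0, 0)] * 70
m_est, lam_est, n_iter, converged = em_with_fixed_u(
    gammas, u=[0.1, 0.1], m=[0.9, 0.9], lam=0.3
)
print(m_est, lam_est, n_iter, converged)
```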


## Job 2: Block on surname

In [40]:
settings_surname = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.surname = r.surname"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}

In [41]:
from splink.estimate import estimate_u_values
settings_surname_with_u = estimate_u_values(settings_surname, df, spark, fix_u_probabilities=True)

In [42]:
from splink import Splink
linker_sn = Splink(settings_surname_with_u, df, spark)
df_e_sn = linker_sn.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.18537762165069582 for key dob, level 1
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.05372816324234009 for key dob, level 0
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.017128735780715942 for key dob, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.005950570106506348 for key dob, level 1
INFO:splink.iterate:EM algorithm has converged


## Job 3: Block on dob

In [43]:
settings_dob = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.dob = r.dob"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}
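
Each job omits the column it blocks on from its comparison columns, so no single job estimates parameters for every column. The short sketch below tabulates, from the three settings dictionaries in this notebook, which jobs can supply an estimate for each column. The `jobs` and `coverage` names are illustrative only.

```python
# Blocking column -> comparison columns, as listed in the three settings dicts.
jobs = {
    "first_name": ["surname", "dob", "city", "email"],
    "surname": ["first_name", "dob", "city", "email"],
    "dob": ["first_name", "surname", "city", "email"],
}

# Invert the mapping: for each column, which jobs estimated its parameters?
coverage = {}
for blocked_on, compared in jobs.items():
    for column in compared:
        coverage.setdefault(column, []).append(blocked_on)

for column, sources in sorted(coverage.items()):
    print(f"{column}: estimated in jobs blocked on {sources}")
```

Every column is covered by at least two jobs, which is what makes it possible to combine the estimates into one global model.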

In [44]:
from splink.estimate import estimate_u_values
settings_dob_with_u = estimate_u_values(settings_dob, df, spark, fix_u_probabilities=True)

In [45]:
from splink import Splink
linker_db = Splink(settings_dob_with_u, df, spark)
df_e_db = linker_db.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.42148728370666505 for key proportion_of_matches
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.13432937860488892 for key proportion_of_matches
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.04863584041595459 for key proportion_of_matches
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.021563410758972168 for key proportion_of_matches
INFO:splink.iterate:Iteration 4 complete
INFO:splink.model:The maximum change in parameters was 0.010971009731292725 for key proportion_of_matches
INFO:splink.iterate:Iteration 5 complete
INFO:splink.model:The maximum change in parameters was 0.0061599016189575195 for key proportion_of_matches
INFO:splink.iterate:EM algorithm has converged
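
In this job the parameter that changes most is `proportion_of_matches`. That value describes only the pairs that passed the blocking rule, so it is not directly comparable across jobs or usable as a global figure. One way to reason about the relationship, offered here as an assumption rather than a description of Splink's internals, is Bayes' rule: conditioning on agreement in the blocking column multiplies the prior odds of a match by m/u for that column, so dividing by the same factor recovers the global odds. The function names below are hypothetical.

```python
def blocked_lambda(global_lam, m_agree, u_agree):
    """P(match | the pair agrees on the blocking column), by Bayes' rule."""
    matched = global_lam * m_agree
    unmatched = (1 - global_lam) * u_agree
    return matched / (matched + unmatched)


def global_lambda(lambda_blocked, m_agree, u_agree):
    """Invert the relationship: remove the blocking column's Bayes factor."""
    odds_blocked = lambda_blocked / (1 - lambda_blocked)
    odds_global = odds_blocked * u_agree / m_agree
    return odds_global / (1 + odds_global)


# Hypothetical values: 0.5% of all pairs match; for the blocking column,
# matches agree 90% of the time and non-matches 2% of the time.
lam_in_block = blocked_lambda(0.005, m_agree=0.9, u_agree=0.02)
lam_recovered = global_lambda(lam_in_block, m_agree=0.9, u_agree=0.02)
print(lam_in_block, lam_recovered)
```

The proportion of matches inside a block is much higher than the global proportion, which is consistent with the need to supply comparison columns for a global lambda in the combination step.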


## Combine parameter estimates


In [54]:
from splink.combine_models import ModelCombiner, combine_cc_estimates

fn_cc_1 = linker_sn.model.current_settings_obj.get_comparison_column("first_name")
fn_cc_2 = linker_db.model.current_settings_obj.get_comparison_column("first_name")
fn_cc = combine_cc_estimates([fn_cc_1, fn_cc_2])

m1 = {
    "name": "first_name",
    "model": linker_fn.model,
    "comparison_columns_for_global_lambda": [fn_cc]
}

sn_cc_1 = linker_fn.model.current_settings_obj.get_comparison_column("surname")
sn_cc_2 = linker_db.model.current_settings_obj.get_comparison_column("surname")
sn_cc = combine_cc_estimates([sn_cc_1, sn_cc_2])

m2 = {
    "name": "surname",
    "model": linker_sn.model,
    "comparison_columns_for_global_lambda": [sn_cc]
}

db_cc_1 = linker_fn.model.current_settings_obj.get_comparison_column("dob")
db_cc_2 = linker_sn.model.current_settings_obj.get_comparison_column("dob")
db_cc = combine_cc_estimates([db_cc_1, db_cc_2])

m3 = {
    "name": "dob",
    "model": linker_db.model,
    "comparison_columns_for_global_lambda": [db_cc]
}

mc = ModelCombiner([m1, m2, m3])

settings_combined = mc.get_combined_settings_dict()
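
The cell above pools, for each blocked column, the estimates from the two jobs that did compare that column. The exact rule used by `combine_cc_estimates` is not shown in this notebook. As a rough illustration only, the sketch below assumes a simple average of the m probabilities at each comparison level, renormalised to sum to one. The helper name and the numbers are hypothetical.

```python
def average_level_probabilities(estimates):
    """Average several probability vectors level by level, then renormalise.

    estimates: list of lists, one list of level probabilities per model.
    """
    n_levels = len(estimates[0])
    if any(len(e) != n_levels for e in estimates):
        raise ValueError("all estimates must have the same number of levels")
    averaged = [
        sum(e[level] for e in estimates) / len(estimates)
        for level in range(n_levels)
    ]
    total = sum(averaged)
    return [p / total for p in averaged]


# Hypothetical m probabilities for a 3-level first_name comparison,
# one vector from the surname-blocked job and one from the dob-blocked job.
m_from_surname_job = [0.10, 0.20, 0.70]
m_from_dob_job = [0.20, 0.20, 0.60]
m_combined = average_level_probabilities([m_from_surname_job, m_from_dob_job])
print(m_combined)
```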

In [55]:
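# Add two further blocking rules so that scoring also considers pairs
# that agree on email or city.
settings_combined["blocking_rules"].append('l.email=r.email')
settings_combined["blocking_rules"].append('l.city=r.city')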

In [60]:
from splink import Splink
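# Use a new name so the dob-blocked linker from Job 3 is not overwritten.
linker_combined = Splink(settings_combined, df, spark)
# Score pairs with the combined parameters as they stand; no EM is run here.
df_e = linker_combined.manually_apply_fellegi_sunter_weights()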
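
Applying the weights manually means scoring each pair with the already-estimated parameters instead of running EM again. The sketch below shows the underlying Fellegi-Sunter calculation for a single pair, assuming conditional independence between columns and treating a gamma of -1 (a null comparison, as seen in the outputs of this notebook) as carrying no evidence. The function and the parameter values are hypothetical, not Splink's code.

```python
def match_probability(gammas, m, u, lam):
    """Posterior match probability for one record pair.

    gammas: dict of column -> comparison level (-1 means a null comparison).
    m, u:   dict of column -> list of level probabilities.
    lam:    prior proportion of matches.
    """
    p_match, p_non_match = lam, 1 - lam
    for column, level in gammas.items():
        if level == -1:
            continue  # nulls contribute no evidence either way
        p_match *= m[column][level]
        p_non_match *= u[column][level]
    return p_match / (p_match + p_non_match)


# Hypothetical two-level parameters: index 0 = disagree, index 1 = agree.
m = {"dob": [0.1, 0.9], "email": [0.2, 0.8]}
u = {"dob": [0.99, 0.01], "email": [0.995, 0.005]}
lam = 0.01

p_agree = match_probability({"dob": 1, "email": 1}, m, u, lam)
p_disagree = match_probability({"dob": 0, "email": 0}, m, u, lam)
p_null = match_probability({"dob": -1, "email": -1}, m, u, lam)
print(p_agree, p_disagree, p_null)
```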

In [63]:
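# Build ground-truth labels from the known `group` column.
cols = ['unique_id', 'group']
dfpd_l = df.toPandas().sample(1000)[cols]
# A constant join key gives a cross join of every record with every other.
dfpd_l["join_col"] = 1
dfpd_r = dfpd_l.copy()
labels = dfpd_l.merge(dfpd_r, on="join_col", suffixes=('_l', '_r'))
# Keep each unordered pair once and drop self-pairs.
labels = labels[labels["unique_id_r"] > labels["unique_id_l"]]
# A pair is a true match (1.0) when both records belong to the same group.
labels["clerical_match_score"] = (labels["group_l"] == labels["group_r"]).astype(float)
labels = labels.drop(["group_l", "group_r", "join_col"], axis=1)
labels.head()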

Unnamed: 0,unique_id_l,unique_id_r,clerical_match_score
7,947,950,1.0
66,947,991,0.0
72,947,974,0.0
87,947,984,0.0
111,947,969,0.0
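
The labelling cell enumerates every unordered pair of records and marks it as a match when the two records share a group. The same logic can be sketched without pandas using `itertools.combinations`. The five records below are made up for illustration.

```python
from itertools import combinations

# (unique_id, group) for a handful of hypothetical records.
rows = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2)]

labels = [
    {
        "unique_id_l": id_l,
        "unique_id_r": id_r,
        # 1.0 when both records belong to the same group, else 0.0
        "clerical_match_score": float(group_l == group_r),
    }
    # Sorting first guarantees unique_id_l < unique_id_r in every pair.
    for (id_l, group_l), (id_r, group_r) in combinations(sorted(rows), 2)
]

n_pairs = len(labels)
n_matches = sum(label["clerical_match_score"] for label in labels)
print(n_pairs, n_matches)
```

For n records this produces n(n-1)/2 pairs, so the 1,000 records in this notebook yield 499,500 labelled pairs.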


In [72]:
from splink.truth import labels_with_splink_scores, roc_chart, precision_recall_chart
labels_sp = spark.createDataFrame(labels)
labels_and_scores = labels_with_splink_scores(labels_sp, df_e, "unique_id", spark, retain_all_cols=True)
roc_chart(labels_and_scores, spark)

In [66]:
from splink.truth import df_e_with_truth_categories

In [67]:
truth = df_e_with_truth_categories(labels_and_scores, 0.5, spark)
truth_pd = truth.toPandas()
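
The truth table assigns each labelled pair to a category by comparing its match probability with the chosen threshold. The sketch below reproduces that classification for a single pair. It assumes that a pair is predicted to be a match when its probability is at or above the threshold; the exact comparison used by `df_e_with_truth_categories` is not shown in this notebook, and the helper below is hypothetical.

```python
def truth_categories(clerical_match_score, match_probability, threshold):
    """Classify one labelled pair at a given probability threshold."""
    actual = clerical_match_score == 1.0
    predicted = match_probability >= threshold  # assumption: inclusive threshold
    return {
        "P": actual,
        "N": not actual,
        "TP": actual and predicted,
        "TN": (not actual) and (not predicted),
        "FP": (not actual) and predicted,
        "FN": actual and (not predicted),
    }


# A true match scored 0.013497 is a false negative at a threshold of 0.5.
low_scored_match = truth_categories(1.0, 0.013497, 0.5)

# A true match never generated by blocking has probability 0.0,
# so it is also a false negative.
missed_by_blocking = truth_categories(1.0, 0.0, 0.5)

# A non-match scored 0.010430 is a false positive at a threshold of 0.01.
low_threshold_fp = truth_categories(0.0, 0.010430, 0.01)

print(low_scored_match["FN"], missed_by_blocking["FN"], low_threshold_fp["FP"])
```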

In [68]:
f1 = truth_pd["FN"] == True
truth_pd[f1].sample(10)

Unnamed: 0,df_e__unique_id_l,df_e__unique_id_r,df_e__surname_l,df_e__surname_r,df_e__gamma_surname,df_e__dob_l,df_e__dob_r,df_e__gamma_dob,df_e__city_l,df_e__city_r,df_e__gamma_city,df_e__email_l,df_e__email_r,df_e__gamma_email,df_e__first_name_l,df_e__first_name_r,df_e__gamma_first_name,df_e__group_l,df_e__group_r,df_e__match_key,df_labels__unique_id_l,df_labels__unique_id_r,clerical_match_score,match_probability,found_by_blocking,truth_threshold,P,N,TP,TN,FP,FN
249370,973.0,974.0,Reuben,Taylor,0.0,1978-05-09,1978-05-09,1.0,London,Lonno,0.0,tiffanyrodriguez@rodriguez-yu.com,tiffanyrodrigueu@rodrigzez-u.com,0.0,Taylor,Reuben,0.0,175.0,175.0,2.0,973,974,1.0,0.013497,True,0.5,True,False,False,False,False,True
82835,,,,,,,,,,,,,,,,,,,,,279,282,1.0,0.0,False,0.5,True,False,False,False,False,True
124347,979.0,982.0,Ball,Layla,0.0,1992-07-03,1992-05-07,0.0,Newcastle-upon-Tyne,Newcastle-upoT-nye,0.0,stacykelly@brown.info,stacykelly@brown.info,1.0,Layla,Ball,0.0,176.0,176.0,3.0,979,982,1.0,0.015801,True,0.5,True,False,False,False,False,True
167429,,,,,,,,,,,,,,,,,,,,,528,533,1.0,0.0,False,0.5,True,False,False,False,False,True
4481,134.0,136.0,Joes,Harriet,0.0,1980-06-23,1980-06-19,0.0,London,London,1.0,trobinson@garza.com,trobinson@garza.com,1.0,Harrit,Jones,0.0,25.0,25.0,3.0,134,136,1.0,0.240478,True,0.5,True,False,False,False,False,True
120558,82.0,83.0,Carr,Cra,0.0,2013-01-21,2013-01-21,1.0,London,London,1.0,stacyball@medina.biz,,-1.0,Eh na,Ethan,0.0,16.0,16.0,2.0,82,83,1.0,0.448951,True,0.5,True,False,False,False,False,True
328861,239.0,240.0,Turner,Ryan,0.0,1990-03-21,1990-06-04,0.0,London,London,1.0,onorman@walker.info,onorman@walker.info,1.0,Ryan,Turner,0.0,41.0,41.0,3.0,239,240,1.0,0.240478,True,0.5,True,False,False,False,False,True
40097,495.0,496.0,nrGt,Grant,0.0,1991-06-27,1991-04-11,0.0,Ipswich,Iwspch,0.0,margaret50@garrett.com,margaret50@garrett.com,1.0,,Oscar,-1.0,86.0,86.0,3.0,495,496,1.0,0.063094,True,0.5,True,False,False,False,False,True
58438,831.0,833.0,Alexander,Rssell,0.0,1982-03-24,1982-03-24,1.0,Lndon,,-1.0,virginiarodriguez@holmes.org,virginiarerigudz@holmes.org,0.0,Russell,Alexander,0.0,145.0,145.0,2.0,831,833,1.0,0.036166,True,0.5,True,False,False,False,False,True
188657,,,,,,,,,,,,,,,,,,,,,11,12,1.0,0.0,False,0.5,True,False,False,False,False,True


In [69]:
# Lower the threshold to 0.01 and list the false positives that appear
truth = df_e_with_truth_categories(labels_and_scores, 0.01, spark)
truth_pd = truth.toPandas()
f1 = truth_pd["FP"] == True
truth_pd[f1]

Unnamed: 0,df_e__unique_id_l,df_e__unique_id_r,df_e__surname_l,df_e__surname_r,df_e__gamma_surname,df_e__dob_l,df_e__dob_r,df_e__gamma_dob,df_e__city_l,df_e__city_r,df_e__gamma_city,df_e__email_l,df_e__email_r,df_e__gamma_email,df_e__first_name_l,df_e__first_name_r,df_e__gamma_first_name,df_e__group_l,df_e__group_r,df_e__match_key,df_labels__unique_id_l,df_labels__unique_id_r,clerical_match_score,match_probability,found_by_blocking,truth_threshold,P,N,TP,TN,FP,FN
40,0.0,423.0,,Brown,-1.0,2015-10-29,2005-07-15,0.0,London,London,1.0,hannah88@powers.com,sarahbron@mckinney.com,0.0,Julia,,-1.0,0.0,71.0,4,0,423,0.0,0.010430,True,0.01,False,True,False,False,True,False
41,0.0,425.0,,Brown,-1.0,2015-10-29,2005-10-06,0.0,London,London,1.0,hannah88@powers.com,sarahbrown@mckinney.com,0.0,Julia,,-1.0,0.0,71.0,4,0,425,0.0,0.010430,True,0.01,False,True,False,False,True,False
55,0.0,514.0,,Taylor,-1.0,2015-10-29,2005-06-20,0.0,London,London,1.0,hannah88@powers.com,michellejackson@smith-trujillo.com,0.0,Julia,,-1.0,0.0,88.0,4,0,514,0.0,0.010430,True,0.01,False,True,False,False,True,False
69,0.0,628.0,,Long,-1.0,2015-10-29,2015-02-05,0.0,London,London,1.0,hannah88@powers.com,garciarichard@brady.com,0.0,Julia,,-1.0,0.0,106.0,4,0,628,0.0,0.010430,True,0.01,False,True,False,False,True,False
152,1.0,343.0,Taylor,,-1.0,2015-07-31,2016-02-18,0.0,London,London,1.0,hannah88@powers.com,duanejames@reyes.net,0.0,Julia,,-1.0,0.0,59.0,4,1,343,0.0,0.010430,True,0.01,False,True,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499175,97.0,737.0,Morris,Walker,0.0,1983-07-20,2019-05-01,0.0,Birmingham,Leicester,0.0,emilysmith@irwin-medina.biz,jonohan52@eaton.arg,0.0,Noah,Noah,2.0,20.0,127.0,0,97,737,0.0,0.013935,True,0.01,False,True,False,False,True,False
499344,98.0,894.0,Msrio,,-1.0,1983-07-20,2007-10-30,0.0,Birmingham,Birmingham,1.0,emilysmith@irwin-medina.biz,stephanieromero@smith.com,0.0,,Mila,-1.0,20.0,159.0,4,98,894,0.0,0.010430,True,0.01,False,True,False,False,True,False
499453,99.0,632.0,Morris,Gibson,0.0,1983-07-20,1987-05-18,0.0,Birmingham,London,0.0,emilysmith@irwin-medina.biz,avazquez@banks.com,0.0,Noah,Noah,2.0,20.0,107.0,0,99,632,0.0,0.013935,True,0.01,False,True,False,False,True,False
499455,99.0,638.0,Morris,Gibson,0.0,1983-07-20,1987-08-21,0.0,Birmingham,London,0.0,emilysmith@irwin-medina.biz,avazquez@banks.com,0.0,Noah,Noah,2.0,20.0,107.0,0,99,638,0.0,0.013935,True,0.01,False,True,False,False,True,False

