# Combining multiple blocking passes

In the [detailed deduplication tutorial](./deduplication_detailed_example.ipynb) we discussed a problem with blocking rules: it's rare to find a single rule that has both very high recall (it keeps nearly all true matches) and high precision (it generates few non-matching pairs).

We recommend instead running multiple Splink jobs, each with a different blocking rule, and then combining their parameter estimates into a single set of global parameters.

This notebook contains an example of how to do this.

## Step 1: Imports and setup

In [1]:
import pandas as pd 
pd.options.display.max_columns = 500
pd.options.display.max_rows = 100
import altair as alt
alt.renderers.enable('mimetype')

RendererRegistry.enable('mimetype')

In [2]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)

In [3]:
from utility_functions.demo_utils import get_spark
spark = get_spark() 

22/01/11 05:44:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/11 05:44:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/01/11 05:44:06 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/01/11 05:44:06 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
22/01/11 05:44:06 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
22/01/11 05:44:06 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.


In [4]:
df = spark.read.parquet("data/fake_1000.parquet")
df.show(5)

                                                                                

+---------+----------+-------+----------+------+--------------------+-----+
|unique_id|first_name|surname|       dob|  city|               email|group|
+---------+----------+-------+----------+------+--------------------+-----+
|        0|    Julia |   null|2015-10-29|London| hannah88@powers.com|    0|
|        1|    Julia | Taylor|2015-07-31|London| hannah88@powers.com|    0|
|        2|    Julia | Taylor|2016-01-27|London| hannah88@powers.com|    0|
|        3|    Julia | Taylor|2015-10-29|  null|  hannah88opowersc@m|    0|
|        4|      oNah| Watson|2008-03-23|Bolton|matthew78@ballard...|    1|
+---------+----------+-------+----------+------+--------------------+-----+
only showing top 5 rows



## Job 1: Block on first name

In [5]:
settings_first_name = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.first_name = r.first_name"
    ],
    "comparison_columns": [
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}

In [6]:
from splink.estimate import estimate_u_values
settings_first_name_with_u = estimate_u_values(settings_first_name, df, spark, fix_u_probabilities=True)
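
A u value is the probability that two records which are *not* the same entity nevertheless agree on a column. Because almost all randomly chosen pairs are non-matches, it can be approximated from random pairs. The sketch below illustrates that idea in plain Python; it is not Splink's implementation, and `estimate_u_exact_match` and the toy records are hypothetical. As the parameter name suggests, `fix_u_probabilities=True` presumably holds these estimates fixed during the later EM iterations.

```python
import random

def estimate_u_exact_match(records, column, n_pairs=10_000, seed=0):
    """Approximate P(values agree | records are not a match) for one column
    by sampling random record pairs, which are overwhelmingly non-matches."""
    rng = random.Random(seed)
    agree = 0
    compared = 0
    for _ in range(n_pairs):
        left, right = rng.sample(records, 2)  # two distinct records
        if left[column] is None or right[column] is None:
            continue  # nulls count as neither agreement nor disagreement
        compared += 1
        agree += left[column] == right[column]
    return agree / compared if compared else float("nan")

# Toy data for illustration only.
records = [
    {"city": "London"}, {"city": "London"}, {"city": "Bolton"}, {"city": None},
]
u_city = estimate_u_exact_match(records, "city")
```

With these four toy records, one of the three non-null pairs agrees, so `u_city` lands close to one third.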

                                                                                

In [7]:
from splink import Splink
linker_fn = Splink(settings_first_name_with_u, df, spark)
df_e_fn = linker_fn.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete                                        
INFO:splink.model:The maximum change in parameters was 0.2248885055412183 for key city, level 1
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.059231828436591805 for key dob, level 0
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.014398845120375037 for key email, level 1
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.004010643026036376 for key email, level 1
INFO:splink.iterate:EM algorithm has converged
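
The log shows the largest parameter change shrinking on each iteration until it drops below the `em_convergence` value of 0.01 from the settings. A minimal sketch of that stopping rule follows; the function name is hypothetical, and whether Splink compares with `<` or `<=` is an assumption.

```python
def first_converged_iteration(max_changes, em_convergence):
    """Return the index of the first iteration whose largest parameter change
    is below the threshold, or None if no iteration qualifies."""
    for iteration, change in enumerate(max_changes):
        if change < em_convergence:
            return iteration
    return None

# Maximum parameter changes logged for Job 1 (rounded).
job_1_changes = [0.2249, 0.0592, 0.0144, 0.0040]
stopped_at = first_converged_iteration(job_1_changes, em_convergence=0.01)
```

Here `stopped_at` is 3, which agrees with the log: convergence is reported straight after iteration 3.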


## Job 2: Block on surname

In [8]:
settings_surname = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.surname = r.surname"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}

In [9]:
from splink.estimate import estimate_u_values
settings_surname_with_u = estimate_u_values(settings_surname, df, spark, fix_u_probabilities=True)

In [10]:
from splink import Splink
linker_sn = Splink(settings_surname_with_u, df, spark)
df_e_sn = linker_sn.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.1800839835934427 for key city, level 1
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.04760766024182089 for key dob, level 0
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.01350498244309506 for key dob, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.0040167240284887384 for key dob, level 0
INFO:splink.iterate:EM algorithm has converged


## Job 3: Block on dob

In [11]:
settings_dob = {
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.dob = r.dob"
    ],
    "comparison_columns": [
        {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}
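
The three settings dictionaries differ only in the blocking rule and in which column is left out of `comparison_columns`. The blocking column is dropped because every pair produced by the rule agrees on it exactly, so that job has no information from which to estimate its parameters. A sketch of a helper that generates the same dictionaries; `make_blocked_settings` and `COMPARISON_COLUMNS` are hypothetical names, not part of Splink.

```python
import copy

# Comparison column definitions copied from the settings dictionaries above.
COMPARISON_COLUMNS = {
    "first_name": {"col_name": "first_name", "num_levels": 3, "term_frequency_adjustments": True},
    "surname": {"col_name": "surname", "num_levels": 3, "term_frequency_adjustments": True},
    "dob": {"col_name": "dob"},
    "city": {"col_name": "city", "term_frequency_adjustments": True},
    "email": {"col_name": "email"},
}

def make_blocked_settings(block_col):
    """Build a settings dict that blocks on `block_col` and compares every other column."""
    return {
        "link_type": "dedupe_only",
        "blocking_rules": [f"l.{block_col} = r.{block_col}"],
        "comparison_columns": [
            copy.deepcopy(cc) for name, cc in COMPARISON_COLUMNS.items() if name != block_col
        ],
        "additional_columns_to_retain": ["group"],
        "em_convergence": 0.01,
    }

settings_by_block = {col: make_blocked_settings(col) for col in ("first_name", "surname", "dob")}
```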

In [12]:
from splink.estimate import estimate_u_values
settings_dob_with_u = estimate_u_values(settings_dob, df, spark, fix_u_probabilities=True)

                                                                                

In [13]:
from splink import Splink
linker_db = Splink(settings_dob_with_u, df, spark)
df_e_db = linker_db.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete
INFO:splink.model:The maximum change in parameters was 0.42565220983708746 for key proportion_of_matches
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.14619790284831902 for key proportion_of_matches
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.044294029219279496 for key proportion_of_matches
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.017669303010503512 for key proportion_of_matches
INFO:splink.iterate:Iteration 4 complete
INFO:splink.model:The maximum change in parameters was 0.008723110957793212 for key proportion_of_matches
INFO:splink.iterate:EM algorithm has converged


## Combine parameter estimates


In [14]:
from splink.combine_models import ModelCombiner, combine_cc_estimates

fn_cc_1 = linker_sn.model.current_settings_obj.get_comparison_column("first_name")
fn_cc_2 = linker_db.model.current_settings_obj.get_comparison_column("first_name")
fn_cc = combine_cc_estimates([fn_cc_1, fn_cc_2])

m1 = {
    "name": "first_name",
    "model": linker_fn.model,
    "comparison_columns_for_global_lambda": [fn_cc]
}

sn_cc_1 = linker_fn.model.current_settings_obj.get_comparison_column("surname")
sn_cc_2 = linker_db.model.current_settings_obj.get_comparison_column("surname")
sn_cc = combine_cc_estimates([sn_cc_1, sn_cc_2])

m2 = {
    "name": "surname",
    "model": linker_sn.model,
    "comparison_columns_for_global_lambda": [sn_cc]
}

db_cc_1 = linker_fn.model.current_settings_obj.get_comparison_column("dob")
db_cc_2 = linker_sn.model.current_settings_obj.get_comparison_column("dob")
db_cc = combine_cc_estimates([db_cc_1, db_cc_2])

m3 = {
    "name": "dob",
    "model": linker_db.model,
    "comparison_columns_for_global_lambda": [db_cc]
}

mc = ModelCombiner([m1, m2, m3])

settings_combined = mc.get_combined_settings_dict()
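
Each job estimates its proportion of matches (lambda) only among pairs that passed its blocking rule, so that value is conditional on agreement on the blocking column. One reading of `comparison_columns_for_global_lambda`, not verified against Splink's source, is that the m and u values for the blocking column (borrowed from the other two jobs) are used to undo this conditioning with Bayes' rule. A sketch under that assumption, with a hypothetical function name and made-up numbers:

```python
def global_lambda_from_blocked(lambda_blocked, m_agree, u_agree):
    """Undo the conditioning on the blocking column.

    Pairs that pass the blocking rule all agree on the blocking column, so
    odds(match | agree) = odds(match) * m_agree / u_agree.
    Dividing by that Bayes factor recovers the unconditional odds.
    """
    odds_blocked = lambda_blocked / (1.0 - lambda_blocked)
    odds_global = odds_blocked * u_agree / m_agree
    return odds_global / (1.0 + odds_global)

# Hypothetical numbers chosen for illustration only.
lam = global_lambda_from_blocked(lambda_blocked=0.4, m_agree=0.8, u_agree=0.01)
```

With these inputs the blocked odds of 2/3 shrink by a factor of 80, giving a global lambda of 1/121, or roughly 0.0083.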

In [15]:
settings_combined["blocking_rules"].append('l.email=r.email')
settings_combined["blocking_rules"].append('l.city=r.city')
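
With several blocking rules, a pair is a candidate if it satisfies any one of them, and it should be scored only once. The sketch below shows one way to take that union while recording the index of the first rule that produced each pair. It is a brute-force illustration with a hypothetical function and toy records, not how Splink generates comparisons, and it assumes the `match_key` column in later outputs has this meaning.

```python
from itertools import combinations

def blocked_pairs(records, blocking_columns):
    """Return {(id_l, id_r): match_key}, where match_key is the index of the
    first blocking column on which the two records agree (nulls never agree)."""
    pairs = {}
    for left, right in combinations(records, 2):
        for match_key, col in enumerate(blocking_columns):
            if left[col] is not None and left[col] == right[col]:
                pairs[(left["unique_id"], right["unique_id"])] = match_key
                break  # keep only the first rule that generates the pair
    return pairs

# Toy records for illustration only.
records = [
    {"unique_id": 0, "first_name": "Julia", "surname": None, "email": "a@x.com"},
    {"unique_id": 1, "first_name": "Julia", "surname": "Taylor", "email": "a@x.com"},
    {"unique_id": 2, "first_name": "Jula", "surname": "Taylor", "email": None},
    {"unique_id": 3, "first_name": None, "surname": "Tayor", "email": "a@x.com"},
]
pairs = blocked_pairs(records, ["first_name", "surname", "email"])
```

Records 0 and 1 agree on both first name and email, but the pair appears once, with `match_key` 0. Pairs (0, 2) and (2, 3) satisfy no rule and are never compared.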

In [16]:
from splink import Splink
# Use a new name so that the Job 3 linker (linker_db) is not overwritten
linker_combined = Splink(settings_combined, df, spark)
df_e = linker_combined.manually_apply_fellegi_sunter_weights()

22/01/11 05:44:49 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
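
`manually_apply_fellegi_sunter_weights` scores the pairs with the combined parameters rather than running EM again. In the Fellegi-Sunter model the match weight is the log2 prior odds plus the log2 Bayes factor (m / u) of each observed comparison level. The sketch below uses hypothetical function names and ignores term frequency adjustments.

```python
import math

def match_weight(lambda_global, bayes_factors):
    """Sum of the log2 prior odds and the log2 Bayes factors (m / u) of the observed levels."""
    prior_odds = lambda_global / (1.0 - lambda_global)
    return math.log2(prior_odds) + sum(math.log2(bf) for bf in bayes_factors)

def weight_to_probability(weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** weight
    return odds / (1.0 + odds)

# A match weight taken from the false-negative table later in this notebook.
p = weight_to_probability(-4.610004)
```

The result is close to 0.0393, in line with the `match_probability` of 0.039339 reported next to that weight.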


In [17]:
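# Build clerical labels for every pair of records from the known `group` column
cols = ['unique_id', 'group']
dfpd_l = df.toPandas().sample(1000)[cols]
dfpd_l["join_col"] = 1  # constant key, so the merge below is a cross join
dfpd_r = dfpd_l.copy()
labels = dfpd_l.merge(dfpd_r, on = "join_col", suffixes=('_l', '_r'))
# Keep each unordered pair once and drop self-comparisons
labels = labels[labels["unique_id_r"]> labels["unique_id_l"]]
# Two records are a true match when they belong to the same group
labels["clerical_match_score"] = (labels["group_l"] == labels["group_r"]).astype(float)
labels = labels.drop(["group_l", "group_r", "join_col"], axis=1)
labels.head()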

Unnamed: 0,unique_id_l,unique_id_r,clerical_match_score
2,428,706,0.0
4,428,601,0.0
6,428,822,0.0
8,428,507,0.0
10,428,953,0.0


In [18]:
import altair as alt
alt.renderers.enable('mimetype')

from splink.truth import labels_with_splink_scores, roc_chart, precision_recall_chart
labels_sp = spark.createDataFrame(labels)
labels_and_scores = labels_with_splink_scores(labels_sp, df_e, "unique_id", spark, retain_all_cols=True)
roc_chart(labels_and_scores, spark)

22/01/11 05:45:00 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/11 05:45:00 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/11 05:45:08 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/11 05:45:08 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
                                                                                

<VegaLite 4 object>



In [19]:
from splink.truth import df_e_with_truth_categories

In [20]:
truth = df_e_with_truth_categories(labels_and_scores, 0.5, spark)
truth_pd = truth.toPandas()
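
The `TP`, `TN`, `FP` and `FN` columns in the tables that follow compare the clerical label with the model's prediction at the chosen threshold. A sketch of that classification for a single pair; the function is hypothetical, and the use of `>=` and the 0.5 cut-off on the label are assumptions rather than Splink's documented behaviour.

```python
def truth_category(clerical_match_score, match_probability, threshold):
    """Classify one labelled pair as TP, FP, TN or FN at the given threshold."""
    actual = clerical_match_score >= 0.5        # label says the pair is a match
    predicted = match_probability >= threshold  # model says the pair is a match
    if actual and predicted:
        return "TP"
    if actual:
        return "FN"
    if predicted:
        return "FP"
    return "TN"

# Values taken from rows of the output tables in this notebook.
category_a = truth_category(1.0, 0.039339, threshold=0.5)   # true match scored low
category_b = truth_category(0.0, 0.018441, threshold=0.01)  # non-match above a low threshold
```

The first pair is a false negative at a threshold of 0.5, and the second is a false positive at a threshold of 0.01.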

In [21]:
# Sample ten false negatives: true matches scored below the 0.5 threshold
f1 = truth_pd["FN"] == True
truth_pd[f1].sample(10)

Unnamed: 0,df_e__match_weight,df_e__unique_id_l,df_e__unique_id_r,df_e__surname_l,df_e__surname_r,df_e__gamma_surname,df_e__dob_l,df_e__dob_r,df_e__gamma_dob,df_e__city_l,df_e__city_r,df_e__gamma_city,df_e__email_l,df_e__email_r,df_e__gamma_email,df_e__first_name_l,df_e__first_name_r,df_e__gamma_first_name,df_e__group_l,df_e__group_r,df_e__match_key,df_labels__unique_id_l,df_labels__unique_id_r,clerical_match_score,match_probability,found_by_blocking,truth_threshold,P,N,TP,TN,FP,FN
62984,-4.610004,572.0,573.0,Cunningham,Oscar,0.0,1976-05-24,1976-08-07,0.0,,Brighton,-1.0,douglas53@watkins.info,douglas53@watkins.info,1.0,Oscar,Cunningham,0.0,95.0,95.0,3,572,573,1.0,0.039339,True,0.5,True,False,False,False,False,True
477182,-6.585094,9.0,10.0,Watson,Noah,0.0,2008-01-19,2008-03-23,0.0,Bolton,Bolton,1.0,,matthbw78eallard-mcdonald.net,-1.0,Noah,Watson,0.0,1.0,1.0,4,9,10,1.0,0.010308,True,0.5,True,False,False,False,False,True
64253,-1.958858,924.0,932.0,Thomas,Mills,0.0,1970-03-09,1970-03-09,1.0,London,London,1.0,hensondebbie@garcia.com,,-1.0,Mills,Thomas,0.0,167.0,167.0,2,924,932,1.0,0.204602,True,0.5,True,False,False,False,False,True
295363,-6.057207,826.0,831.0,Russell,Alexander,0.0,1982-05-26,1982-03-24,0.0,Londo,Lndon,0.0,virginiarodriguez@holmes.org,virginiarodriguez@holmes.org,1.0,Alexander,Russell,0.0,145.0,145.0,3,826,831,1.0,0.014795,True,0.5,True,False,False,False,False,True
47471,-0.201447,205.0,206.0,Thomas,Jacob,0.0,2007-05-28,2007-07-06,0.0,Sunderland,Sunderland,1.0,hknapp@davis-allen.info,hknapp@davis-allen.info,1.0,bacJ,Thomas,0.0,35.0,35.0,3,205,206,1.0,0.465148,True,0.5,True,False,False,False,False,True
190225,-2.578317,785.0,786.0,Mccarth,Jasmine,0.0,2002-08-23,2002-09-08,0.0,,Kingston-upon-Hull,-1.0,william04@martinez.info,william04@martinez.info,1.0,,Mccarthy,-1.0,138.0,138.0,3,785,786,1.0,0.143422,True,0.5,True,False,False,False,False,True
141656,-3.399717,144.0,146.0,Harry,Tarlo,0.0,2017-11-24,2017-10-24,0.0,London,London,1.0,coltonray@lee.com,coltonray@lee.com,1.0,Taylor,Hayr,0.0,26.0,26.0,3,144,146,1.0,0.08655,True,0.5,True,False,False,False,False,True
111364,-4.237016,125.0,126.0,Isabella,,-1.0,2000-02-01,2000-02-01,1.0,London,Lndoo,0.0,hillt.eres@pearsonhorg,hilltheresa@pearson.org,0.0,Wallace,Isabella,0.0,23.0,23.0,2,125,126,1.0,0.05036,True,0.5,True,False,False,False,False,True
53366,-4.041192,357.0,358.0,,George,-1.0,2018-09-08,2018-10-31,0.0,Stoke-nn-Teot,Stoke-on-Trent,0.0,jonathan74@glover.com,jonathan74@glover.com,1.0,George,Wallace,0.0,61.0,61.0,3,357,358,1.0,0.057263,True,0.5,True,False,False,False,False,True
459422,-4.346686,440.0,444.0,Charlie,,-1.0,1991-03-13,1990-12-14,0.0,Northampton,Northampton,1.0,jacobstafford@hamilton.com,,-1.0,Richards,Charlie,0.0,76.0,76.0,4,440,444,1.0,0.046847,True,0.5,True,False,False,False,False,True


In [22]:
# List the false positives at a low threshold of 0.01
truth = df_e_with_truth_categories(labels_and_scores, 0.01, spark)
truth_pd = truth.toPandas()
f1 = truth_pd["FP"] == True
truth_pd[f1]

Unnamed: 0,df_e__match_weight,df_e__unique_id_l,df_e__unique_id_r,df_e__surname_l,df_e__surname_r,df_e__gamma_surname,df_e__dob_l,df_e__dob_r,df_e__gamma_dob,df_e__city_l,df_e__city_r,df_e__gamma_city,df_e__email_l,df_e__email_r,df_e__gamma_email,df_e__first_name_l,df_e__first_name_r,df_e__gamma_first_name,df_e__group_l,df_e__group_r,df_e__match_key,df_labels__unique_id_l,df_labels__unique_id_r,clerical_match_score,match_probability,found_by_blocking,truth_threshold,P,N,TP,TN,FP,FN
55,-5.734126,428.0,530.0,Thompson,Cook,0.0,2012-02-04,2013-03-24,0.0,,Coventry,-1.0,amandn06@suttoa.cm,michelemorrow@herrera.com,0.0,Oliver,Oliver,2.0,72.0,90.0,0,428,530,0.0,0.018441,True,0.01,False,True,False,False,True,False
154,-4.097443,428.0,529.0,Thompson,Cok,0.0,2012-02-04,2013-03-24,0.0,,,-1.0,amandn06@suttoa.cm,,-1.0,Oliver,Oliver,2.0,72.0,90.0,0,428,529,0.0,0.055194,True,0.01,False,True,False,False,True,False
184,-5.734126,428.0,533.0,Thompson,Cook,0.0,2012-02-04,2013-03-24,0.0,,,-1.0,amandn06@suttoa.cm,michelemorrow@herrera.com,0.0,Oliver,Oliver,2.0,72.0,90.0,0,428,533,0.0,0.018441,True,0.01,False,True,False,False,True,False
220,-5.615012,428.0,650.0,Thompson,Thompson,2.0,2012-02-04,2009-03-25,0.0,,Newcastle-upon-Tyne,-1.0,amandn06@suttoa.cm,mike46@contreras.biz,0.0,Oliver,Logan,0.0,72.0,110.0,1,428,650,0.0,0.019996,True,0.01,False,True,False,False,True,False
238,-5.734126,428.0,532.0,Thompson,Cook,0.0,2012-02-04,2013-03-24,0.0,,Coventry,-1.0,amandn06@suttoa.cm,michelemorrow@herrera.om,0.0,Oliver,Oliver,2.0,72.0,90.0,0,428,532,0.0,0.018441,True,0.01,False,True,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493444,-5.425533,595.0,709.0,Foster,Foster,2.0,2001-08-30,2007-09-03,0.0,Loodn,London,0.0,,qtran@wheeler-werner.biz,-1.0,Maisie,Jack,0.0,99.0,120.0,1,595,709,0.0,0.022739,True,0.01,False,True,False,False,True,False
493493,-5.425533,595.0,707.0,Foster,Foster,2.0,2001-08-30,2007-08-12,0.0,Loodn,London,0.0,,qtran@wheeler-werner.biz,-1.0,Maisie,Jkc,0.0,99.0,120.0,1,595,707,0.0,0.022739,True,0.01,False,True,False,False,True,False
496637,-4.792324,41.0,643.0,Andrews,Gordon,0.0,2009-01-23,1993-01-19,0.0,London,London,1.0,hesterkurt@taylor-fitzgerald.com,xfuller@roy.biz,0.0,Olivia,Oivi a,1.0,9.0,108.0,4,41,643,0.0,0.034831,True,0.01,False,True,False,False,True,False
497730,-4.393367,232.0,905.0,Cooper,Cooper,2.0,1995-07-11,2001-11-25,0.0,Sunderland,,-1.0,toddsean@wilkins-burton.biz,,-1.0,Charlotte,rexandeA,0.0,40.0,162.0,1,232,905,0.0,0.045423,True,0.01,False,True,False,False,True,False


## Do we get a different ROC if we compare every row with every other row (cartesian blocking)?

Usually this is not feasible, but with a small test dataset the cartesian product has fewer than a million rows, so we can try:

In [23]:
settings_cartesian = {
    "link_type": "dedupe_only",
    "blocking_rules": [
    ],
    "comparison_columns": [
       {
            "col_name": "first_name",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "surname",
            "num_levels": 3,
            "term_frequency_adjustments": True
        },
        {
            "col_name": "dob"
        },
        {
            "col_name": "city",
            "term_frequency_adjustments": True
        },
        {
            "col_name": "email"
        }
    ],
    "additional_columns_to_retain": ["group"],
    "em_convergence": 0.01
}
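
With an empty `blocking_rules` list, every record is compared with every other record, so the number of comparisons grows quadratically with the size of the input. A quick count, assuming the 1,000 records implied by the file name `fake_1000.parquet` (the function name is hypothetical):

```python
def n_dedupe_comparisons(n_records):
    """Number of unordered record pairs when every record is compared with every other."""
    return n_records * (n_records - 1) // 2

n_pairs = n_dedupe_comparisons(1000)              # 499,500 pairs
n_pairs_large = n_dedupe_comparisons(1_000_000)   # 499,999,500,000 pairs
```

Half a million pairs is manageable here, but a million records would already need about half a trillion comparisons, which is why blocking is normally required.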

In [24]:
from splink import Splink
linker_c = Splink(settings_cartesian, df, spark)
df_e_c = linker_c.get_scored_comparisons()

INFO:splink.iterate:Iteration 0 complete                                        
INFO:splink.model:The maximum change in parameters was 0.34582416991101605 for key dob, level 0
INFO:splink.iterate:Iteration 1 complete
INFO:splink.model:The maximum change in parameters was 0.11784214611947069 for key dob, level 0
INFO:splink.iterate:Iteration 2 complete
INFO:splink.model:The maximum change in parameters was 0.01963293643091807 for key surname, level 0
INFO:splink.iterate:Iteration 3 complete
INFO:splink.model:The maximum change in parameters was 0.009142551770450535 for key first_name, level 0
INFO:splink.iterate:EM algorithm has converged


In [25]:
from splink.truth import labels_with_splink_scores, roc_chart, precision_recall_chart
labels_sp = spark.createDataFrame(labels)
labels_and_scores = labels_with_splink_scores(labels_sp, df_e_c, "unique_id", spark, retain_all_cols=True)
roc_chart(labels_and_scores, spark)

22/01/11 05:45:47 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/11 05:45:47 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/11 05:45:54 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/01/11 05:45:54 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
                                                                                

<VegaLite 4 object>

