# Use Active Learning to Link FEBRL People Data

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/03_Link_FEBRL_Data_with_Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In this tutorial, we'll use the [dedupe library](https://github.com/dedupeio/dedupe) to experiment with an active learning approach to linking our FEBRL people datasets.

Once again, we'll use the same training dataset and evaluation functions as the SimSum classification tutorial; these have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

## Google Colab Setup

In [4]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    r.raise_for_status()  # Fail early if the download did not succeed.
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    
    !pip install -q altair dedupe dedupe-variable-name jellyfish recordlinkage 

In [3]:
!pip install dedupe -I

Collecting dedupe
  Using cached dedupe-2.0.8-cp37-cp37m-manylinux1_x86_64.whl (90 kB)
Collecting fastcluster
  Using cached fastcluster-1.2.4-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (155 kB)
Collecting doublemetaphone
  Using cached DoubleMetaphone-0.1-cp37-cp37m-manylinux1_x86_64.whl (79 kB)
Collecting Levenshtein-search
  Using cached Levenshtein_search-1.4.5-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (71 kB)
Collecting BTrees>=4.1.4
  Using cached BTrees-4.9.2-cp37-cp37m-manylinux2010_x86_64.whl (3.6 MB)
Collecting typing-extensions
  Downloading typing_extensions-4.0.1-py3-none-any.whl (22 kB)
Collecting highered>=0.2.0
  Using cached highered-0.2.1-py2.py3-none-any.whl (3.3 kB)
Collecting rlr>=2.4.3
  Using cached rlr-2.4.5-py2.py3-none-any.whl (4.8 kB)
Collecting simplecosine>=1.2
  Using cached simplecosine-1.2-py2.py3-none-any.whl (3.2 kB)
Collecting affinegap>=1.3
  Using cached affinegap-1.11-cp37-cp37m-manylinux1_x86_64.whl (46 kB)
Collecting havers

In [1]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

## Define Working Filepaths

For convenience, we'll define a `pathlib.Path` to reference our current working directory.

In [2]:
WORKING_DIR = pathlib.Path.cwd()
WORKING_DIR

PosixPath('/content')

## Load Training Dataset and Ground Truth Labels

In [5]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

Let's take a quick look at our training dataset to refresh on the columns, formats, and data.

In [6]:
df_A.head()

| person_id_A | first_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | age | phone_number | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fbc4143d-15f9-4f27-b5f0-dedbadce6616 | matilda | struck | 8 | ballard place | | west perth | 2470 | qld | 19611002 | 32.0 | 03 05903135 | 8276847 |
| 48a56cad-7ba6-45e1-97cd-517ba65bdab5 | lachlan | eglinton | 36 | kambalda crescent | villa 427 | auburn | 5109 | | 19260108 | 27.0 | | 9937958 |
| b1792d21-e4be-4b86-8dea-454ffa5194c5 | mikayla | asher | 588 | britten-jones drive | | miami | 4218 | nsw | 19251102 | 32.0 | 03 33770501 | 7017310 |
| 96653d73-bebc-4459-94f3-c3f0a8c514d4 | grace | bristow | 7 | | wandella park snowy | cardiff | 6163 | nsw | 19400120 | | 07 37864073 | 3535974 |
| 41f038b8-77c0-45a5-9e1f-e62b8637ffd1 | wilson | bishop | 11 | chisholm street | | bronte | 2490 | nsw | 19210305 | 27.0 | 04 15209769 | 5573522 |


## Data Augmentation

We'll do minimal data augmentation before feeding our training data to `dedupe`; we just want to format the date of birth data as `mm/dd/yy`, and ensure all columns are in string format and stripped of trailing/leading whitespace. Additionally, `dedupe` requires input data to be in dictionaries, using the record id as the key and the record metadata as the value. So, we'll convert our dataframes to this format.

In [7]:
def format_dob(dob: str) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
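    try:
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is not a string (e.g. None).
        # ValueError: eight digits that do not form a valid calendar date.
        pass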

    return None

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()
    
    if x == "":
        return None
    else:
        return x
    
def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
    """    

    df = df.copy()  # Work on a copy so the caller's DataFrame is not mutated.

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")

In [8]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)

We can examine a small sample of the resulting transformed records:

In [9]:
[records_A[k] for k in list(records_A.keys())[0:2]]

[{'address_1': 'ballard place',
  'address_2': None,
  'age': '32',
  'date_of_birth': '10/02/61',
  'first_name': 'matilda',
  'phone_number': '03 05903135',
  'postcode': '2470',
  'soc_sec_id': '8276847',
  'state': 'qld',
  'street_number': '8',
  'suburb': 'west perth',
  'surname': 'struck'},
 {'address_1': 'kambalda crescent',
  'address_2': 'villa 427',
  'age': '27',
  'date_of_birth': '01/08/26',
  'first_name': 'lachlan',
  'phone_number': None,
  'postcode': '5109',
  'soc_sec_id': '9937958',
  'state': None,
  'street_number': '36',
  'suburb': 'auburn',
  'surname': 'eglinton'}]
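
One thing worth keeping in mind about the `mm/dd/yy` format is that it drops the century. The sketch below uses only the standard library to show what happens when a two-digit year is parsed back with `strptime`; the helper name `format_dob_short` is hypothetical, and how `dedupe`'s `DateTime` comparator interprets two-digit years is not covered here.

```python
import datetime


def format_dob_short(dob: str) -> str:
    """Hypothetical helper: the same YYYYMMDD -> mm/dd/yy conversion as format_dob."""
    return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")


short = format_dob_short("19260108")

# strptime applies the POSIX pivot to two-digit years:
# 00-68 map to 2000-2068 and 69-99 map to 1969-1999.
round_trip = datetime.datetime.strptime(short, "%m/%d/%y")

print(short, "->", round_trip.year)  # 01/08/26 -> 2026
```

If the century matters for your data, a four-digit year format such as `%m/%d/%Y` avoids the ambiguity.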

## Prepare Training

When we linked our data via SimSum and supervised learning, we defined our blockers and comparators manually with `recordlinkage`. The `dedupe` library takes an active learning approach to blocking and classification and will use our feedback gathered during the labeling session to learn blocking rules and train a classifier. 

To prepare our `dedupe.RecordLink` object for training, first we'll define the fields that we think `dedupe` should pay attention to when matching records - these definitions will serve as the comparators. The `field` contains the name of the attribute to use for comparison, and the `type` defines the comparison type.

In [10]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)
linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)


CPU times: user 54 s, sys: 1.18 s, total: 55.2 s
Wall time: 54.5 s
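
Since each `field` in the definition must name a key in the record dictionaries, a quick sanity check before constructing the linker can catch typos early. The following is a minimal sketch on toy data; `find_missing_fields` is a hypothetical helper, not part of `dedupe`.

```python
from typing import Dict, List, Optional


def find_missing_fields(
    field_definitions: List[Dict[str, str]],
    records: Dict[str, Dict[str, Optional[str]]],
) -> List[str]:
    """Return the defined field names that are absent from at least one record."""
    missing = []
    for definition in field_definitions:
        name = definition["field"]
        if any(name not in record for record in records.values()):
            missing.append(name)
    return missing


toy_fields = [
    {"field": "first_name", "type": "Name"},
    {"field": "postcode", "type": "Exact"},
    {"field": "sububr", "type": "ShortString"},  # Deliberate typo for the demo.
]

toy_records = {
    "a1": {"first_name": "matilda", "postcode": "2470", "suburb": "west perth"},
    "a2": {"first_name": "lachlan", "postcode": "5109", "suburb": "auburn"},
}

print(find_missing_fields(toy_fields, toy_records))  # ['sububr']
```

In the notebook, the same check could be run as `find_missing_fields(fields, records_A)` and `find_missing_fields(fields, records_B)`.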


## Active Learning Labeling Session!

At this point, we're ready to provide feedback to `dedupe` via an active learning labeling session. For this, `dedupe` supplies a convenience method that iterates through the pairs it is uncertain about. As you provide feedback for each pair, `dedupe` learns blocking rules and recalculates its linking model weights.

You can use `y` (yes, match), `n` (no, not a match), and `u` (unsure) to provide feedback on candidate links. After the first pair, the prompt also offers `p` (previous). When you're ready to exit the labeling session, use `f`.

In [11]:
dedupe.console_label(linker)

first_name : timothy
surname : guymer
address_1 : None
address_2 : None
suburb : forrest
postcode : 2217
state : nsw
date_of_birth : 02/11/33
soc_sec_id : 3547663

first_name : nika
surname : lillwn
address_1 : mcalpine place
address_2 : None
suburb : forrest
postcode : 2380
state : nsw
date_of_birth : None
soc_sec_id : 6508513

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : logan
surname : mac onochie
address_1 : mackellar crescent
address_2 : None
suburb : oatlands
postcode : 4207
state : vic
date_of_birth : 08/13/84
soc_sec_id : 4647965

first_name : logt
surname : mac onochie
address_1 : None
address_2 : None
suburb : oatlands
postcode : 4207
state : vic
date_of_birth : None
soc_sec_id : 4648775

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
first_name : isabella
surname : mewett
address_1 : dettmann close
address_2 : riley house
suburb : koroit
postcode : 6163
state : qld
date_of_birth : 06/04/36
soc_sec_id : 3868039

first_name : soden
surname : joshua
address_1 : None
address_2 : None
suburb : newington
postcode : 6005
state : qld
date_of_birth : 05/15/25
soc_sec_id : 8231355

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : david
surname : sznajder
address_1 : None
address_2 : bowtells caravn park
suburb : eaton
postcode : 2298
state : vic
date_of_birth : 09/15/78
soc_sec_id : 8971940

first_name : luk
surname : neville
address_1 : holman street
address_2 : sec 443 bellamour
suburb : raymond terrace
postcode : 3174
state : vic
date_of_birth : 11/11/90
soc_sec_id : 1631768

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, surname, Surname)
INFO:dedupe.training:SimplePredicate: (commonTwoTokens, surname)
first_name : amy
surname : maliyasena
address_1 : cowper street
address_2 : None
suburb : surrey hills
postcode : 4120
state : nsw
date_of_birth : 04/06/62
soc_sec_id : 2730947

first_name : lachlan
surname : leatham
address_1 : None
address_2 : None
suburb : clareville
postcode : 2154
state : sv
date_of_birth : 06/29/72
soc_sec_id : 1612297

3/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : daniel
surname : van akker
address_1 : None
address_2 : None
suburb : biggera waters
postcode : 4032
state : wa
date_of_birth : 01/17/08
soc_sec_id : 8117255

first_name : daniel
surname : bassnari
address_1 : hilton close
address_2 : rocklea
suburb : corrimal east
postcode : 4306
state : qld
date_of_birth : None
soc_sec_id : 9096559

3/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : luke
surname : jolly
address_1 : guthrie street
address_2 : None
suburb : greenacre
postcode : 2525
state : qld
date_of_birth : None
soc_sec_id : 1508620

first_name : chelhsea
surname : nan
address_1 : None
address_2 : None
suburb : mount helen
postcode : None
state : qld
date_of_birth : 11/25/79
soc_sec_id : 6948126

4/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, surname, Surname)
INFO:dedupe.training:SimplePredicate: (commonTwoTokens, surname)
INFO:dedupe.training:PartialPredicate: (commonSixGram, first_name, Surname)
first_name : kaitlin
surname : snelling
address_1 : bacchus circuit
address_2 : None
suburb : redhead
postcode : 2089
state : nsw
date_of_birth : 08/01/52
soc_sec_id : 6540838

first_name : kadin
surname : clarje
address_1 : None
address_2 : blueberry hill
suburb : alubzry
postcode : 3838
state : sa
date_of_birth : None
soc_sec_id : 5708629

4/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : joshua
surname : ryan
address_1 : davenport street
address_2 : None
suburb : north curl curl
postcode : 4020
state : vic
date_of_birth : 10/17/35
soc_sec_id : 1997553

first_name : sian
surname : pai ne
address_1 : britten-jones drive
address_2 : None
suburb : north ward
postcode : 6230
state : None
date_of_birth : None
soc_sec_id : 5676142

4/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : niamh
surname : tarrillo villalobos
address_1 : palmer street
address_2 : None
suburb : port macquarie
postcode : 3072
state : nsw
date_of_birth : 03/21/91
soc_sec_id : 3738554

first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : None
suburb : None
postcode : 3939
state : vic
date_of_birth : None
soc_sec_id : 8913923

5/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (firstTokenPredicate, suburb)
INFO:dedupe.training:PartialPredicate: (commonSixGram, first_name, Surname)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, surname, Surname)
first_name : isabella
surname : di manno
address_1 : kennedy street
address_2 : None
suburb : cowan
postcode : 4055
state : nsw
date_of_birth : 02/13/53
soc_sec_id : 2167459

first_name : nasyah
surname : nan
address_1 : buntine ercent
address_2 : None
suburb : alice sprngs
postcode : 2261
state : vic
date_of_birth : None
soc_sec_id : 5607474

5/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : nan
surname : brock
address_1 : wheatley street
address_2 : None
suburb : yorkeys knob
postcode : 3550
state : vic
date_of_birth : 08/18/72
soc_sec_id : 3402102

first_name : nan
surname : brock
address_1 : None
address_2 : None
suburb : yorkeyw knob
postcode : 3550
state : vic
date_of_birth : None
soc_sec_id : 3402102

6/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
first_name : joshua
surname : beelitz
address_1 : None
address_2 : None
suburb : margaret river
postcode : 2204
state : vic
date_of_birth : 02/10/77
soc_sec_id : 2754936

first_name : joshua
surname : beelz
address_1 : None
address_2 : None
suburb : margaret river
postcode : 2204
state : vic
date_of_birth : 02/10/77
soc_sec_id : 2754936

6/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : toby
surname : dent
address_1 : studley street
address_2 : sunnydale cottage
suburb : bowral
postcode : 2450
state : wa
date_of_birth : 04/02/81
soc_sec_id : 9402107

first_name : toby
surname : de nd
address_1 : None
address_2 : None
suburb : bowral
postcode : 2450
state : wa
date_of_birth : 04/02/81
soc_sec_id : 9402107

7/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
first_name : matthew
surname : white
address_1 : None
address_2 : mungo park
suburb : hopetoun
postcode : 6110
state : vic
date_of_birth : 09/26/33
soc_sec_id : 7343175

first_name : matthew
surname : wighfe
address_1 : None
address_2 : mungo park
suburb : hopetoun
postcode : 6110
state : vic
date_of_birth : None
soc_sec_id : 7343175

8/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : dylan
surname : ciplys
address_1 : dutton street
address_2 : norma
suburb : east maitland
postcode : 5485
state : wa
date_of_birth : 11/28/02
soc_sec_id : 5171268

first_name : ciplns
surname : dylan
address_1 : dutton etreet
address_2 : None
suburb : east mailand
postcode : 5485
state : wa
date_of_birth : 10/28/02
soc_sec_id : 5171268

8/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : jack
surname : kulshrestha
address_1 : malara street
address_2 : furmiston
suburb : batlow
postcode : 2594
state : vic
date_of_birth : 03/07/02
soc_sec_id : 9862701

first_name : jack
surname : kulshreqtha
address_1 : malara street
address_2 : None
suburb : batlow
postcode : 2594
state : vic
date_of_birth : None
soc_sec_id : 9862701

8/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : jack
surname : bissett
address_1 : conyers street
address_2 : None
suburb : goomalling
postcode : 6112
state : vic
date_of_birth : 06/08/64
soc_sec_id : 9409056

first_name : jack
surname : bisst
address_1 : conyers street
address_2 : None
suburb : goomalling
postcode : 6112
state : vic
date_of_birth : 07/08/65
soc_sec_id : 9409056

9/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : bailey
surname : au
address_1 : barritt street
address_2 : sunnyview
suburb : kinross
postcode : 4019
state : nsw
date_of_birth : 06/19/14
soc_sec_id : 4952744

first_name : baillet
surname : huggins
address_1 : barritt street
address_2 : sunnyview
suburb : kinross
postcode : 4019
state : nsw
date_of_birth : None
soc_sec_id : 4952744

10/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : ryley
surname : oddy
address_1 : mermaid street
address_2 : lakeside manor
suburb : nedlands
postcode : 2477
state : qld
date_of_birth : 12/01/93
soc_sec_id : 4513534

first_name : ryley
surname : oddt
address_1 : mermaid stireet
address_2 : lakeside manor
suburb : nedlands
postcode : 2477
state : qld
date_of_birth : 12/01/93
soc_sec_id : 6059952

10/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : holly
surname : brock
address_1 : greenvale street
address_2 : None
suburb : st kilda east
postcode : 5161
state : vic
date_of_birth : 03/19/02
soc_sec_id : 3384771

first_name : nan
surname : brlxk
address_1 : greenvale street
address_2 : None
suburb : st kilda east
postcode : 5116
state : vic
date_of_birth : 03/19/02
soc_sec_id : 3384172

10/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


first_name : charlotte
surname : copperstone
address_1 : eddy crescent
address_2 : moodwood angus stud
suburb : manunda
postcode : 4121
state : vic
date_of_birth : 08/22/99
soc_sec_id : 7446680

first_name : sarah
surname : bridgland
address_1 : None
address_2 : None
suburb : pymblel
postcode : 2340
state : None
date_of_birth : 04/22/00
soc_sec_id : 1069194

10/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


We can now train our linker, based on the labeling session feedback.

In [12]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.000100, score 0.3305338039941888
INFO:dedupe.training:Final predicate set:


CPU times: user 13.1 s, sys: 3.06 s, total: 16.1 s
Wall time: 13 s


Let's persist our training data (captured during the labeling session), as well as the learned model weights.

In [13]:
ACTIVE_LEARNING_DIR = WORKING_DIR / "dedupe_active_learning"
ACTIVE_LEARNING_DIR.mkdir(parents=True, exist_ok=True)

SETTINGS_FILE = ACTIVE_LEARNING_DIR / "dedupe_learned_settings"
TRAINING_FILE = ACTIVE_LEARNING_DIR / "dedupe_training.json"

with open(TRAINING_FILE, "w") as fh:
    linker.write_training(fh)
    
with open(SETTINGS_FILE, "wb") as sf:
    linker.write_settings(sf)
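
Before starting another labeling session, it can be handy to check how many labeled pairs the training file holds. The sketch below assumes the file is a JSON object whose top-level keys (for example `match` and `distinct`) each map to a list of labeled pairs; that layout is an assumption about the `dedupe` training format, and `count_labels` is a hypothetical helper. It is demonstrated on a toy file rather than the real `TRAINING_FILE`.

```python
import json
import pathlib
import tempfile
from typing import Dict


def count_labels(training_path: pathlib.Path) -> Dict[str, int]:
    """Count labeled pairs per label in a JSON training file.

    Assumes a top-level JSON object mapping each label to a list of pairs.
    """
    with open(training_path, "r") as fh:
        labeled = json.load(fh)
    return {label: len(pairs) for label, pairs in labeled.items()}


# Toy stand-in for a training file: one matching pair, no distinct pairs.
toy_training = {
    "match": [[{"surname": "brock"}, {"surname": "brock"}]],
    "distinct": [],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = pathlib.Path(tmp_dir) / "toy_training.json"
    toy_path.write_text(json.dumps(toy_training))
    toy_counts = count_labels(toy_path)

print(toy_counts)  # {'match': 1, 'distinct': 0}
```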

## Examine Learned Blockers

Now, let's take a look at the predicates (blockers) that `dedupe` learned during our active learning labeling session. Note that `dedupe` can learn composite predicates/blockers, i.e. individual predicates can be combined with logical operators.

In [14]:
linker.predicates

()
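
To build intuition for what a predicate does, here is a toy sketch that is independent of `dedupe`'s internals. A predicate maps a record to a set of block keys, and two records become a candidate pair when they share a key. The names `whole_field` and `block_pairs` are hypothetical, and combining predicates is shown with plain set operations.

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Set, Tuple

Record = Dict[str, Optional[str]]
Predicate = Callable[[Record], Set[str]]


def whole_field(field: str) -> Predicate:
    """Toy predicate: the block key is the entire field value."""
    def predicate(record: Record) -> Set[str]:
        value = record.get(field)
        return {value} if value else set()
    return predicate


def block_pairs(
    records_a: Dict[str, Record],
    records_b: Dict[str, Record],
    predicate: Predicate,
) -> Set[Tuple[str, str]]:
    """Return (id_a, id_b) pairs whose records share at least one block key."""
    index = defaultdict(list)
    for id_b, record in records_b.items():
        for key in predicate(record):
            index[key].append(id_b)

    pairs = set()
    for id_a, record in records_a.items():
        for key in predicate(record):
            for id_b in index.get(key, []):
                pairs.add((id_a, id_b))
    return pairs


toy_a = {
    "a1": {"suburb": "bowral", "soc_sec_id": "9402107"},
    "a2": {"suburb": "nedlands", "soc_sec_id": "4513534"},
}
toy_b = {
    "b1": {"suburb": "bowral", "soc_sec_id": "9402107"},
    "b2": {"suburb": "nedlands", "soc_sec_id": "6059952"},
    "b3": {"suburb": "north ward", "soc_sec_id": None},
}

by_ssn = block_pairs(toy_a, toy_b, whole_field("soc_sec_id"))
by_suburb = block_pairs(toy_a, toy_b, whole_field("suburb"))

print(sorted(by_ssn))              # [('a1', 'b1')]
print(sorted(by_ssn | by_suburb))  # OR:  [('a1', 'b1'), ('a2', 'b2')]
print(sorted(by_ssn & by_suburb))  # AND: [('a1', 'b1')]
```

Only 2 of the 6 possible toy pairs survive the OR blocker, which is the kind of search space reduction we measure next.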

Next, let's examine the resulting candidate pairs and look at our blocking efficiency. The `.pairs` method will give us all candidate record pairs that are generated by blocking with the learned blockers.

In [16]:
candidate_pairs = [x for x in linker.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

0 candidate pairs generated from blocking.


You'll notice that, in contrast to `recordlinkage`, our post-blocking candidate pairs contain both the record ids as well as the record metadata.

In [18]:
candidate_pairs

[]

We can assemble our candidate pair ids into an indexed pandas dataframe for easier comparison with our known true links.

In [None]:
# Naming the columns up front keeps set_index working even if no pairs were generated.
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs],
    columns=["person_id_A", "person_id_B"],
).set_index(["person_id_A", "person_id_B"])

df_candidate_links.head()

Now, let's take a look at our learned blocker performance.

In [None]:
max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction, as a percentage.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")
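
The two blocking metrics above can also be expressed on plain sets of id pairs, which makes the arithmetic easy to check by hand. This is a minimal sketch on toy data; `blocking_metrics` is a hypothetical helper and not one of the tutorial functions.

```python
from typing import Set, Tuple

IdPair = Tuple[str, str]


def blocking_metrics(
    candidate_ids: Set[IdPair],
    true_links: Set[IdPair],
    n_a: int,
    n_b: int,
) -> Tuple[float, float]:
    """Return (search space reduction %, true links retained %)."""
    total_possible = n_a * n_b
    reduction = 100 * (1 - len(candidate_ids) / total_possible)
    retained = 100 * len(candidate_ids & true_links) / len(true_links)
    return reduction, retained


toy_candidates = {("a1", "b1"), ("a2", "b2"), ("a2", "b3")}
toy_truth = {("a1", "b1"), ("a3", "b3")}

# 3 x 3 = 9 possible pairs, 3 candidates kept, 1 of 2 true links retained.
reduction, retained = blocking_metrics(toy_candidates, toy_truth, n_a=3, n_b=3)
print(f"{reduction:.1f}% search space reduction, {retained:.1f}% true links retained")
# 66.7% search space reduction, 50.0% true links retained
```

A good blocker pushes both numbers toward 100%: few candidate pairs, with nearly all true links among them.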

## Score Pairs and Examine Learned Classifier

After `dedupe` has trained blockers and a classification model based on our labeling session, we can link the records in our training dataset via the `.join` method.

In [None]:
%%time
linked_records = linker.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

`linker.join` will return the links, along with a model confidence.

In [None]:
linked_records[0:3]
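
A `one-to-one` constraint means that each record in A is linked to at most one record in B, and vice versa. One simple way to enforce this is greedy matching by descending score, sketched below on toy data shaped like the `((id_a, id_b), score)` tuples above. This is an illustration of the idea only, and not necessarily how `dedupe` implements the constraint internally; `greedy_one_to_one` is a hypothetical helper.

```python
from typing import List, Tuple

ScoredPair = Tuple[Tuple[str, str], float]


def greedy_one_to_one(scored_pairs: List[ScoredPair]) -> List[ScoredPair]:
    """Keep the highest-scoring pairs such that no id is used more than once."""
    used_a, used_b = set(), set()
    kept = []
    for (id_a, id_b), score in sorted(scored_pairs, key=lambda p: p[1], reverse=True):
        if id_a in used_a or id_b in used_b:
            continue  # One side of this pair is already linked.
        used_a.add(id_a)
        used_b.add(id_b)
        kept.append(((id_a, id_b), score))
    return kept


toy_scored = [
    (("a1", "b1"), 0.95),
    (("a1", "b2"), 0.60),
    (("a2", "b2"), 0.80),
    (("a3", "b2"), 0.40),
]

print(greedy_one_to_one(toy_scored))
# [(('a1', 'b1'), 0.95), (('a2', 'b2'), 0.8)]
```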

We'll format the `dedupe` linker predictions into a format that we can use with our existing evaluation functions.

In [None]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

# Pairs absent from the ground truth are not true links.
df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)
df_predictions

## Choosing a Linking Model Score Threshold

The `dedupe` `.join` method that we used to score our training data directly incorporates the learned blockers. Thus, the scored pairs appearing in the distribution are only those pairs that survived blocking, and our blockers *significantly* reduced the candidate pair search space.

### Model Score Distribution

In [None]:
df_predictions["ground_truth"].value_counts()

In [None]:
tutorial.plot_model_score_distribution(df_predictions)

### Precision and Recall vs. Model Score

In [None]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

In [None]:
df_eval.head()

In [None]:
tutorial.plot_precision_recall_vs_threshold(df_eval)
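
To make the threshold trade-off concrete, here is a minimal sketch of computing precision and recall at a few score thresholds. It is not the implementation of `tutorial.evaluate_linking`; `precision_recall_at` is a hypothetical helper, and the toy data assumes that one true link was lost during blocking, so it never receives a score.

```python
from typing import List, Tuple


def precision_recall_at(
    scored: List[Tuple[float, bool]],
    threshold: float,
    total_true_links: int,
) -> Tuple[float, float]:
    """Compute precision and recall when linking every pair scoring >= threshold.

    `scored` holds (model_score, is_true_link) for each scored pair.
    `total_true_links` counts all true links, including any dropped by blocking.
    """
    predicted = [is_true for score, is_true in scored if score >= threshold]
    true_positives = sum(predicted)
    # Precision is undefined when nothing is predicted; report NaN in that case.
    precision = true_positives / len(predicted) if predicted else float("nan")
    recall = true_positives / total_true_links
    return precision, recall


toy_scored_pairs = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
toy_total_true_links = 4  # Three appear above; one was lost in blocking.

for threshold in (0.85, 0.5, 0.3):
    p, r = precision_recall_at(toy_scored_pairs, threshold, toy_total_true_links)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.85: precision=1.00, recall=0.25
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.3: precision=0.75, recall=0.75
```

Lowering the threshold generally raises recall, but recall can never exceed the share of true links that the blockers retained.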

## Iterating with Active Learning

When using active learning, we iterate on our linking solution, and incorporate progressively more labeled training data. Perhaps we're not satisfied with the current performance of the blockers or classifier, and we'd like to create more labeled examples for dedupe to train on.

Recall that earlier, we saved our existing training data from the first labeling session. We can load this persisted data into a `dedupe` linker, and kick off another labeling session. Perhaps, after investigating the data during our first cycle, we don't think that `dedupe` should include `address_1` and `address_2` in its comparators.

### Tweak the Linker and Use Existing Training Data

In [None]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker2 = dedupe.RecordLink(fields)

with open(TRAINING_FILE, "r") as fh:
    linker2.prepare_training(records_A, records_B, training_file=fh)

Now, we can kick off a second active learning/labeling session.

In [None]:
dedupe.console_label(linker2)

### Retrain the Linker and Examine Blocking Performance

Now, let's retrain, and examine blocker performance. Ideally, we'll see improved true link retention following our second labeling session.

In [None]:
%%time
linker2.train()

In [None]:
candidate_pairs = [x for x in linker2.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

# Naming the columns up front keeps set_index working even if no pairs were generated.
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs],
    columns=["person_id_A", "person_id_B"],
).set_index(["person_id_A", "person_id_B"])

max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction, as a percentage.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

### Evaluate Classification Performance

In [None]:
%%time
linked_records = linker2.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

In [None]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

# Pairs absent from the ground truth are not true links.
df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)
df_predictions

In [None]:
df_predictions["ground_truth"].value_counts()

In [None]:
tutorial.plot_model_score_distribution(df_predictions)

In [None]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

tutorial.plot_precision_recall_vs_threshold(df_eval)