# Use Active Learning to Link FEBRL People Data

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/03_Link_FEBRL_Data_with_Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In this tutorial, we'll use the [dedupe library](https://github.com/dedupeio/dedupe) to experiment with an active learning approach to linking our FEBRL people datasets.

Once again, we'll use the same training dataset and evaluation functions as the SimSum classification tutorial; these have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

## Google Colab Setup

In [1]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    r.raise_for_status()  # Fail early if the download did not succeed.
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    !pip install numpy --upgrade
    !pip install -q altair dedupe dedupe-variable-name jellyfish recordlinkage 



In [2]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt


## Define Working Filepaths

For convenience, we'll define a `pathlib.Path` to reference our current working directory.

In [3]:
WORKING_DIR = pathlib.Path(os.path.abspath(''))
WORKING_DIR

PosixPath('/content')

## Load Training Dataset and Ground Truth Labels

In [4]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

Let's take a quick look at our training dataset to refresh on the columns, formats, and data.

In [5]:
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
fbc4143d-15f9-4f27-b5f0-dedbadce6616,matilda,struck,8,ballard place,,west perth,2470,qld,19611002,32.0,03 05903135,8276847
48a56cad-7ba6-45e1-97cd-517ba65bdab5,lachlan,eglinton,36,kambalda crescent,villa 427,auburn,5109,,19260108,27.0,,9937958
b1792d21-e4be-4b86-8dea-454ffa5194c5,mikayla,asher,588,britten-jones drive,,miami,4218,nsw,19251102,32.0,03 33770501,7017310
96653d73-bebc-4459-94f3-c3f0a8c514d4,grace,bristow,7,,wandella park snowy,cardiff,6163,nsw,19400120,,07 37864073,3535974
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,wilson,bishop,11,chisholm street,,bronte,2490,nsw,19210305,27.0,04 15209769,5573522


## Data Augmentation

We'll do minimal data augmentation before feeding our training data to `dedupe`: we just want to format the date of birth data as `mm/dd/yy`, and ensure all columns are in string format and stripped of trailing/leading whitespace, with empty strings converted to `None`. Additionally, `dedupe` requires input data as a dictionary keyed by record id, where each value is itself a dictionary mapping field names to values. So, we'll convert our dataframes to this format.

In [6]:
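def format_dob(dob: Optional[str]) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
    try:
        # Require exactly eight digits before attempting to parse.
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is None; ValueError: eight digits, but not a valid date.
        pass

    return None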

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()
    
    if x == "":
        return None
    else:
        return x
    
def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
    """    

    df = df.copy()  # Avoid mutating the caller's DataFrame.

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")

In [7]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)

We can examine a small sample of the resulting transformed records:

In [8]:
[records_A[k] for k in list(records_A.keys())[0:2]]

[{'address_1': 'ballard place',
  'address_2': None,
  'age': '32',
  'date_of_birth': '10/02/61',
  'first_name': 'matilda',
  'phone_number': '03 05903135',
  'postcode': '2470',
  'soc_sec_id': '8276847',
  'state': 'qld',
  'street_number': '8',
  'suburb': 'west perth',
  'surname': 'struck'},
 {'address_1': 'kambalda crescent',
  'address_2': 'villa 427',
  'age': '27',
  'date_of_birth': '01/08/26',
  'first_name': 'lachlan',
  'phone_number': None,
  'postcode': '5109',
  'soc_sec_id': '9937958',
  'state': None,
  'street_number': '36',
  'suburb': 'auburn',
  'surname': 'eglinton'}]
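
Since missing values are now `None`, it can be handy to check how sparse each field is before deciding which fields to hand to `dedupe`. The following is a minimal sketch: `count_missing` is a hypothetical helper (not part of `dedupe` or the tutorial functions file), and the two toy records use shortened ids and a subset of the fields shown above.

```python
from collections import Counter
from typing import Dict, Optional


def count_missing(records: Dict[str, Dict[str, Optional[str]]]) -> Counter:
    """Count, per field, how many records hold None for that field."""
    missing: Counter = Counter()
    for fields in records.values():
        for name, value in fields.items():
            if value is None:
                missing[name] += 1
    return missing


# Toy records in the same shape as records_A (ids shortened for illustration).
toy_records = {
    "fbc4143d": {
        "first_name": "matilda",
        "address_2": None,
        "state": "qld",
        "phone_number": "03 05903135",
    },
    "48a56cad": {
        "first_name": "lachlan",
        "address_2": "villa 427",
        "state": None,
        "phone_number": None,
    },
}

missing_counts = count_missing(toy_records)
print(missing_counts)
```

The same helper could be called as `count_missing(records_A)` on the real data, since it only relies on the dictionary-of-dictionaries shape.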

## Prepare Training

When we linked our data via SimSum and supervised learning, we defined our blockers and comparators manually with `recordlinkage`. The `dedupe` library takes an active learning approach to blocking and classification and will use our feedback gathered during the labeling session to learn blocking rules and train a classifier. 

To prepare our `dedupe.RecordLink` object for training, first we'll define the fields that we think `dedupe` should pay attention to when matching records - these definitions will serve as the comparators. The `field` contains the name of the attribute to use for comparison, and the `type` defines the comparison type.

In [9]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)
linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (commonThreeTokens, address_2)


CPU times: user 57.2 s, sys: 1.09 s, total: 58.3 s
Wall time: 57.5 s
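
To build some intuition for what a comparator does, here is a rough sketch contrasting an exact comparison with a fuzzy string comparison. This is only an illustration: `exact_compare` and `string_similarity` are hypothetical helpers, and `difflib.SequenceMatcher` is used as a stand-in similarity measure. `dedupe` ships its own comparator implementations for types like `Exact` and `ShortString`, which will produce different numbers.

```python
from difflib import SequenceMatcher
from typing import Optional


def exact_compare(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Return 1 if both values are equal, 0 if they differ, None if either is missing."""
    if a is None or b is None:
        return None
    return int(a == b)


def string_similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Return a similarity in [0, 1] (stand-in measure), or None if either value is missing."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a, b).ratio()


# An exact comparator is unforgiving: a single differing digit means "no match".
print(exact_compare("3252646", "3252666"))

# A fuzzy comparator gives partial credit for near-identical strings.
print(string_similarity("wilkins", "wiljish"))

# Missing values yield no evidence either way.
print(string_similarity("murchison street", None))
```

This is why identifiers such as `soc_sec_id` and `postcode` are declared as `Exact` above, while free-text fields prone to typos use string-based types.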


## Active Learning Labeling Session!

At this point, we're ready to provide feedback to `dedupe` via an active learning labeling session. For this, `dedupe` supplies a convenience function, `console_label`, that iterates through the pairs it is most uncertain about. As you provide feedback for each pair, `dedupe` learns blocking rules and recalculates its linking model weights.

You can use `y` (yes, a match), `n` (no, not a match), and `u` (unsure) to provide feedback on candidate links. When you're ready to exit the labeling session, use `f`. After the first pair, `p` returns to the previous pair.
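
The idea of asking about the pairs the model is least sure of is often called uncertainty sampling. The sketch below shows the general principle only; `most_uncertain_pair` and the toy scores are hypothetical, and `dedupe`'s actual selection logic is more involved than picking the score closest to 0.5.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def most_uncertain_pair(scores: Dict[Pair, float]) -> Pair:
    """Return the pair whose predicted match probability is closest to 0.5."""
    return min(scores, key=lambda pair: abs(scores[pair] - 0.5))


# Hypothetical match probabilities from a partially trained model.
toy_scores = {
    ("a1", "b1"): 0.97,  # confidently a match: little to learn from a label
    ("a2", "b2"): 0.52,  # near the decision boundary: a label is most informative
    ("a3", "b3"): 0.08,  # confidently distinct
}

next_pair = most_uncertain_pair(toy_scores)
print(next_pair)
```

Labeling the borderline pair tends to move the decision boundary more than labeling a pair the model already classifies confidently, which is why a handful of labels can go a long way.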

In [10]:
dedupe.console_label(linker)

first_name : ridley
surname : wasley
address_1 : lanley square
address_2 : None
suburb : magill
postcode : 2605
state : nsw
date_of_birth : 04/01/23
soc_sec_id : 9750456

first_name : wasley
surname : ridldh
address_1 : lanley square
address_2 : None
suburb : magill
postcode : 2605
state : nsw
date_of_birth : None
soc_sec_id : 9750456

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : mattheo
surname : bullock
address_1 : murchison street
address_2 : None
suburb : rye
postcode : 3041
state : nsw
date_of_birth : 12/12/05
soc_sec_id : 2296205

first_name : matteo
surname : bulloxk
address_1 : None
address_2 : None
suburb : rye
postcode : 3041
state : nsw
date_of_birth : 12/12/05
soc_sec_id : 2296205

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : callum
surname : nan
address_1 : britten-jones drive
address_2 : None
suburb : penshurst
postcode : 3799
state : nsw
date_of_birth : 11/08/68
soc_sec_id : 6212400

first_name : calkk
surname : nan
address_1 : britten-jones drive
address_2 : None
suburb : penshurst
postcode : 3799
state : ws
date_of_birth : 11/08/68
soc_sec_id : 6212480

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : joel
surname : lowe
address_1 : benjee place
address_2 : brindabella specialist centre
suburb : None
postcode : 2429
state : nsw
date_of_birth : 12/29/18
soc_sec_id : 8931185

first_name : mhary
surname : tilor
address_1 : None
address_2 : None
suburb : parap
postcode : 6051
state : vic
date_of_birth : None
soc_sec_id : 4657080

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : barnaby
surname : butt
address_1 : arabana street
address_2 : carinya lodge
suburb : broadbeach waters
postcode : 2195
state : nsw
date_of_birth : None
soc_sec_id : 6332332

first_name : barnaby
surname : haberfield
address_1 : arabana street
address_2 : None
suburb : broadbeach waters
postcode : 2195
state : nsw
date_of_birth : None
soc_sec_id : 4276489

3/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : jackson
surname : springthorpe
address_1 : werriwa crescent
address_2 : minore falls
suburb : blayney
postcode : 3068
state : nsw
date_of_birth : None
soc_sec_id : 5934407

first_name : tiarna
surname : grubb
address_1 : brickhilo place
address_2 : vincent court
suburb : craigie
postcode : 2745
state : nsw
date_of_birth : 07/12/50
soc_sec_id : 8508011

4/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_1)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
first_name : connor
surname : walkley
address_1 : william webb drive
address_2 : laurel bank
suburb : coombabah
postcode : 4506
state : nsw
date_of_birth : None
soc_sec_id : 5053738

first_name : talln
surname : pascoe
address_1 : crewsfplkce
address_2 : None
suburb : ashfield
postcode : 2088
state : nsw
date_of_birth : 04/27/95
soc_sec_id : 5504003

4/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : hamish
surname : lowe
address_1 : karrugang circuit
address_2 : None
suburb : keon park
postcode : 4215
state : qld
date_of_birth : 02/25/43
soc_sec_id : 1474368

first_name : hamish
surname : lowe
address_1 : None
address_2 : None
suburb : keon park
postcode : 4215
state : qld
date_of_birth : None
soc_sec_id : 1474368

4/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : chloe
surname : soroosh
address_1 : None
address_2 : None
suburb : kingston beach
postcode : 2038
state : wa
date_of_birth : 09/27/79
soc_sec_id : 8137719

first_name : chloe
surname : soroosh
address_1 : None
address_2 : None
suburb : kingston beach
postcode : 2038
state : wa
date_of_birth : None
soc_sec_id : 8137719

5/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, suburb)
first_name : logan
surname : mac onochie
address_1 : mackellar crescent
address_2 : None
suburb : oatlands
postcode : 4207
state : vic
date_of_birth : 08/13/84
soc_sec_id : 4647965

first_name : logt
surname : mac onochie
address_1 : None
address_2 : None
suburb : oatlands
postcode : 4207
state : vic
date_of_birth : None
soc_sec_id : 4648775

6/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : alexandra
surname : van rensburg
address_1 : astelia place
address_2 : None
suburb : woodcroft
postcode : 2756
state : nsw
date_of_birth : 01/31/22
soc_sec_id : 3123032

first_name : alexandra
surname : van rensburg
address_1 : None
address_2 : None
suburb : woodcroft
postcode : 2756
state : nsw
date_of_birth : None
soc_sec_id : 3123034

7/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, surname)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, suburb)
first_name : griffin
surname : bradshaw
address_1 : star close
address_2 : None
suburb : dunbogan
postcode : 3109
state : nsw
date_of_birth : 10/09/47
soc_sec_id : 8200588

first_name : griffin
surname : bradshaw
address_1 : star close
address_2 : None
suburb : dunbogan
postcode : 3109
state : nsw
date_of_birth : None
soc_sec_id : 3456267

8/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : georgia
surname : nguyen
address_1 : None
address_2 : brentwood vlge
suburb : sefton
postcode : 3101
state : wa
date_of_birth : None
soc_sec_id : 4084643

first_name : georgia
surname : nguyen
address_1 : None
address_2 : brentwoom vlge
suburb : sefton
postcode : 3101
state : wa
date_of_birth : None
soc_sec_id : 2139336

9/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_1)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, surname)
first_name : william
surname : wilkins
address_1 : None
address_2 : None
suburb : currumbin valley
postcode : 3166
state : nsw
date_of_birth : 11/12/56
soc_sec_id : 3252646

first_name : william
surname : wiljish
address_1 : None
address_2 : None
suburb : currumbinuvalley
postcode : 3166
state : nsw
date_of_birth : 11/12/56
soc_sec_id : 3252666

10/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_1)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, surname)
INFO:dedupe.training:LevenshteinSearchPredicate: (1, address_2)
first_name : kaitlin
surname : murton
address_1 : lucy gullett circuit
address_2 : None
suburb : dongara
postcode : 2768
state : wa
date_of_birth : 07/03/43
soc_sec_id : 5045300

first_name : kaitlin
surname : muron
address_1 : lucy gulletr circuit
address_2 : None
suburb : dongara
postcode : 2768
state : wa
date_of_birth : 07/03/43
soc_sec_id : 5439641

10/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : connor
surname : schaffarz
address_1 : None
address_2 : None
suburb : frankston
postcode : 2903
state : wa
date_of_birth : None
soc_sec_id : 3293827

first_name : connor
surname : schaffarz
address_1 : None
address_2 : None
suburb : frankston
postcode : 2936
state : wa
date_of_birth : None
soc_sec_id : 2687195

11/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:TfidfNGramSearchPredicate: (0.8, address_1)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, surname)
INFO:dedupe.training:LevenshteinSearchPredicate: (1, address_2)
first_name : bailee
surname : sheldon
address_1 : wentworth avenue
address_2 : montrose
suburb : altona meadows
postcode : 3132
state : nsw
date_of_birth : 10/05/30
soc_sec_id : 5034109

first_name : samara
surname : shelley
address_1 : bland place
address_2 : None
suburb : None
postcode : 2533
state : nsw
date_of_birth : None
soc_sec_id : 5902401

12/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n'


(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
first_name : daniel
surname : van akker
address_1 : None
address_2 : None
suburb : biggera waters
postcode : 4032
state : wa
date_of_birth : 01/17/08
soc_sec_id : 8117255

first_name : dann
surname : van akker
address_1 : None
address_2 : None
suburb : biggera wayers
postcode : 4032
state : wr
date_of_birth : None
soc_sec_id : 8117255

12/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


We can now train our linker, based on the labeling session feedback.

In [11]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  * (true_distinct + false_distinct)))
INFO:rlr.crossvalidation:optimum alpha: 1.000000, score 0.3743502093715684
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (doubleMetaphone, suburb), PartialPredicate: (fingerprint, surname, Surname), SimplePredicate: (wholeFieldPredicate, state))
INFO:dedupe.training:(SimplePredicate: (twoGramFingerprint, address_1), PartialIndexTfidfNGramSearchPredicate: (0.2, first_name, Surname), SimplePredicate: (sameSevenCharStartPredicate, suburb))
INFO:dedupe.training:(SimplePredicate: (threeDayPredicate, date_of_birth), SimplePredicate: (doubleMetaphone, first_name), SimplePredicate: (tokenFieldPredicate, suburb))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (commonSixGram, surname), SimplePredicate: (doubleMetaphone, first_name))


CPU times: user 5.85 s, sys: 835 ms, total: 6.68 s
Wall time: 5.93 s


Let's persist our training data (captured during the labeling session), as well as the learned model weights.

In [12]:
ACTIVE_LEARNING_DIR = WORKING_DIR / "dedupe_active_learning"
ACTIVE_LEARNING_DIR.mkdir(parents=True, exist_ok=True)

SETTINGS_FILE = ACTIVE_LEARNING_DIR / "dedupe_learned_settings"
TRAINING_FILE = ACTIVE_LEARNING_DIR / "dedupe_training.json"

with open(TRAINING_FILE, "w") as fh:
    linker.write_training(fh)
    
with open(SETTINGS_FILE, "wb") as sf:
    linker.write_settings(sf)
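
If we want to check how many labeled examples a saved training file contains, we can inspect it with the standard `json` module. The sketch below rests on an assumption: that the training file is a JSON object whose top-level keys (`match` and `distinct` in the `dedupe` versions we're aware of) each map to a list of labeled record pairs. The helper `summarize_training` is hypothetical, the toy file is a hand-written stand-in rather than real `dedupe` output, and the encoding of the individual pairs may differ in a real file.

```python
import json
import pathlib
import tempfile
from typing import Dict


def summarize_training(path: pathlib.Path) -> Dict[str, int]:
    """Count labeled pairs per top-level label in a training JSON file."""
    with open(path, "r") as fh:
        training = json.load(fh)
    return {label: len(pairs) for label, pairs in training.items()}


# Hand-written stand-in for a training file (assumed layout, not real dedupe output).
toy_training = {
    "match": [
        [{"first_name": "ridley"}, {"first_name": "wasley"}],
    ],
    "distinct": [
        [{"first_name": "joel"}, {"first_name": "mhary"}],
        [{"first_name": "jackson"}, {"first_name": "tiarna"}],
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = pathlib.Path(tmp_dir) / "toy_training.json"
    toy_path.write_text(json.dumps(toy_training))
    summary = summarize_training(toy_path)

print(summary)
```

On the real file, `summarize_training(TRAINING_FILE)` should line up with the positive/negative counters shown in the labeling prompt, provided the layout assumption holds for your `dedupe` version.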

## Examine Learned Blockers

Now, let's take a look at the predicates (blockers) that `dedupe` learned during our active learning labeling session. Note that `dedupe` can learn composite predicates/blockers: each tuple in the output below groups several individual predicates that are applied together, so two records must agree on all of them to land in the same block.

In [13]:
linker.predicates

((SimplePredicate: (doubleMetaphone, suburb),
  PartialPredicate: (fingerprint, surname, Surname),
  SimplePredicate: (wholeFieldPredicate, state)),
 (SimplePredicate: (twoGramFingerprint, address_1),
  PartialIndexTfidfNGramSearchPredicate: (0.2, first_name, Surname),
  SimplePredicate: (sameSevenCharStartPredicate, suburb)),
 (SimplePredicate: (threeDayPredicate, date_of_birth),
  SimplePredicate: (doubleMetaphone, first_name),
  SimplePredicate: (tokenFieldPredicate, suburb)),
 (SimplePredicate: (commonTwoTokens, suburb),
  SimplePredicate: (commonSixGram, surname),
  SimplePredicate: (doubleMetaphone, first_name)))
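
To make the idea of a composite blocker concrete, here is a simplified sketch of blocking with a single compound predicate. Everything in it is hypothetical: the key functions `first_three_chars` and `whole_field` are far simpler than `dedupe`'s predicates (such as `doubleMetaphone`), and `block_key` and `generate_candidate_pairs` are illustrative helpers, not `dedupe` functions.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]
Predicate = Tuple[str, Callable[[str], str]]  # (field name, key function)


def first_three_chars(value: str) -> str:
    return value[:3]


def whole_field(value: str) -> str:
    return value


def block_key(record: Record, compound: List[Predicate]) -> Optional[Tuple[str, ...]]:
    """Combine every predicate's key (logical AND). None if any field is missing."""
    parts = []
    for field, key_fn in compound:
        value = record.get(field)
        if value is None:
            return None
        parts.append(key_fn(value))
    return tuple(parts)


def generate_candidate_pairs(
    records_a: Dict[str, Record],
    records_b: Dict[str, Record],
    compound: List[Predicate],
) -> List[Tuple[str, str]]:
    """Pair up records from A and B that share the same block key."""
    blocks_b = defaultdict(list)
    for id_b, record in records_b.items():
        key = block_key(record, compound)
        if key is not None:
            blocks_b[key].append(id_b)

    pairs = []
    for id_a, record in records_a.items():
        key = block_key(record, compound)
        if key is not None:
            pairs.extend((id_a, id_b) for id_b in blocks_b.get(key, []))
    return pairs


toy_a = {
    "a1": {"surname": "lowe", "state": "qld"},
    "a2": {"surname": "bishop", "state": "nsw"},
}
toy_b = {
    "b1": {"surname": "lowe", "state": "qld"},
    "b2": {"surname": "lowry", "state": "nsw"},  # same surname prefix, different state
    "b3": {"surname": "bishop", "state": None},  # missing state: no block key
}

compound = [("surname", first_three_chars), ("state", whole_field)]
toy_pairs = generate_candidate_pairs(toy_a, toy_b, compound)
print(toy_pairs)
```

Only `a1` and `b1` agree on both parts of the key, so only that pair survives. Requiring agreement on several predicates shrinks the search space, at the cost of dropping true links whose records disagree or are missing a value on any one of them.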

Next, let's examine the resulting candidate pairs and look at our blocking efficiency. The `.pairs` method will give us all candidate record pairs that are generated by blocking with the learned blockers.

In [14]:
candidate_pairs = [x for x in linker.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

1,492 candidate pairs generated from blocking.


You'll notice that, in contrast to `recordlinkage`, our post-blocking candidate pairs contain both the record ids as well as the record metadata.

In [15]:
candidate_pairs[0]

(('48a56cad-7ba6-45e1-97cd-517ba65bdab5',
  {'address_1': 'kambalda crescent',
   'address_2': 'villa 427',
   'age': '27',
   'date_of_birth': '01/08/26',
   'first_name': 'lachlan',
   'phone_number': None,
   'postcode': '5109',
   'soc_sec_id': '9937958',
   'state': None,
   'street_number': '36',
   'suburb': 'auburn',
   'surname': 'eglinton'}),
 ('c77c2c04-4415-4c4d-b248-18dc28fd63d0',
  {'address_1': 'kambalda crescent',
   'address_2': None,
   'age': None,
   'date_of_birth': '01/08/26',
   'first_name': 'lachlan',
   'phone_number': None,
   'postcode': '5109',
   'soc_sec_id': '9937958',
   'state': None,
   'street_number': '366',
   'suburb': 'auburn',
   'surname': 'eglinton'}))

We can assemble our candidate pair ids into an indexed pandas dataframe for easier comparison with our known true links.

In [16]:
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

df_candidate_links.head()

person_id_A,person_id_B
48a56cad-7ba6-45e1-97cd-517ba65bdab5,c77c2c04-4415-4c4d-b248-18dc28fd63d0
b1792d21-e4be-4b86-8dea-454ffa5194c5,043d063f-3f72-46ca-bb66-e7f610d4c2cd
41f038b8-77c0-45a5-9e1f-e62b8637ffd1,337aa0c5-4a0a-4bcd-89db-6fa998fa783c
b4e3efc2-9c8f-4e3e-8b98-9bfa842094f9,e63f19ca-3f5b-4021-ac1e-05fc7495bd48
7264bfb0-bbcb-4f68-b9bf-03619237cfb2,8e5d98b8-9611-480e-8c65-b0e56520307b


Now, let's take a look at our learned blocker performance.

In [17]:
max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction, expressed as a percentage.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

10,562,500 total possible pairs.

1,492 pairs after full blocking: 99.9859% search space reduction.
49.5% true links retained after blocking.
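
The two blocking metrics above are commonly called the reduction ratio and pair completeness. As a small sketch, they can be written as standalone functions; the helper names are our own, and the toy id pairs are made up for illustration, while the candidate and total pair counts are the ones reported above.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def reduction_ratio(n_candidates: int, n_total_pairs: int) -> float:
    """Fraction of the full pair space that blocking removed."""
    return 1 - n_candidates / n_total_pairs


def pair_completeness(candidates: Set[Pair], true_links: Set[Pair]) -> float:
    """Fraction of true links that survived blocking."""
    return len(candidates & true_links) / len(true_links)


# Counts taken from the output above.
rr = reduction_ratio(1_492, 10_562_500)
print(f"{rr:.4%} search space reduction")

# Toy id pairs (hypothetical) to illustrate pair completeness.
toy_true_links = {("a1", "b1"), ("a2", "b2")}
toy_candidates = {("a1", "b1"), ("a3", "b9")}
pc = pair_completeness(toy_candidates, toy_true_links)
print(f"{pc:.1%} true links retained")
```

The two metrics pull against each other: stricter blockers raise the reduction ratio but tend to lower pair completeness, so both are worth tracking across labeling iterations.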


## Score Pairs and Examine Learned Classifier

After `dedupe` has trained blockers and a classification model based on our labeling session, we can link the records in our training dataset via the `.join` method.

In [18]:
%%time
linked_records = linker.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 2.99 s, sys: 120 ms, total: 3.11 s
Wall time: 4.39 s


`linker.join` returns each link as a pair of record ids (one from each dataset), along with a model confidence score.

In [19]:
linked_records[0:3]

[(('a2bbbe62-18b2-47fa-92c4-3cab4f944bad',
   'b31a7a9f-ab05-410c-8edc-af64a034a2f8'),
  0.971633),
 (('5e1aa714-e0d5-4e90-abeb-0313ee74fa26',
   'd7eab569-1224-467c-b41e-2bd8588ab8f2'),
  0.971633),
 (('060e3e91-345b-46c8-bea2-7357b1ee8cee',
   '7ba7ecbe-d00c-429e-aee1-87cdd4d554d1'),
  0.96907943)]
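
The `constraint="one-to-one"` argument means each record can appear in at most one link. One simple way to enforce such a constraint is a greedy pass over the scored pairs, sketched below. This is an illustration of the concept under our own assumptions, not a description of how `dedupe` resolves the constraint internally; `greedy_one_to_one` and the toy scores are hypothetical.

```python
from typing import List, Tuple

ScoredPair = Tuple[Tuple[str, str], float]


def greedy_one_to_one(scored_pairs: List[ScoredPair]) -> List[ScoredPair]:
    """Keep the highest-scoring pairs such that no record id is used twice."""
    used_a, used_b = set(), set()
    kept = []
    for (id_a, id_b), score in sorted(scored_pairs, key=lambda item: item[1], reverse=True):
        if id_a in used_a or id_b in used_b:
            continue  # one of the records is already linked to a better match
        kept.append(((id_a, id_b), score))
        used_a.add(id_a)
        used_b.add(id_b)
    return kept


toy_scored = [
    (("a1", "b1"), 0.97),
    (("a1", "b2"), 0.60),  # dropped: a1 is already linked to b1
    (("a2", "b2"), 0.55),
    (("a3", "b1"), 0.40),  # dropped: b1 is already linked to a1
]

links = greedy_one_to_one(toy_scored)
print(links)
```

The output has the same shape as `linked_records`: a list of `((id_a, id_b), score)` tuples, sorted from highest to lowest score.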

We'll convert the `dedupe` linker predictions into a dataframe that we can use with our existing evaluation functions.

In [20]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)
df_predictions

person_id_A,person_id_B,model_score,ground_truth
a2bbbe62-18b2-47fa-92c4-3cab4f944bad,b31a7a9f-ab05-410c-8edc-af64a034a2f8,0.971633,True
5e1aa714-e0d5-4e90-abeb-0313ee74fa26,d7eab569-1224-467c-b41e-2bd8588ab8f2,0.971633,True
060e3e91-345b-46c8-bea2-7357b1ee8cee,7ba7ecbe-d00c-429e-aee1-87cdd4d554d1,0.969079,True
53ac4048-4cae-4594-b7e9-99c42ece6214,8d667595-7dc6-400f-b10a-5d3ce823c257,0.966202,True
55560353-a703-4b73-8b0c-b6a02a257eaa,17126d49-d7c1-4eb6-b323-51efa7d02c77,0.963388,True
...,...,...,...
5aeb0acc-6757-4816-b9a5-73fc7ac045db,26ad1c99-b0c7-4e0b-a490-3eb08f8ad52f,0.108198,True
74dfc06d-62a5-473c-b5f7-4f7e1fa794ca,abfe5592-12ad-4dfb-9eed-1ab6dc1419fd,0.094371,True
a0f0298f-7de6-447c-8ddc-f2f921b9d1d8,8341b81f-1300-4b06-bc07-37d7b0196983,0.092118,True
5c7a8e0d-11e5-49ed-9a7a-6eb817dc3049,1eca555c-de48-4ecc-8b89-6a97b279b0ed,0.080167,False


## Choosing a Linking Model Score Threshold

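The `dedupe` `.join` method that we used to score our training data applies the learned blockers before scoring. Thus, the scored pairs in the distribution below are only the pairs that survived blocking, and our blockers *significantly* reduced the candidate pair search space. Keep in mind that the precision and recall figures in this section are therefore computed over blocked pairs only: true links that the blockers dropped are never scored, so they are not counted as false negatives.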

### Model Score Distribution

In [21]:
df_predictions["ground_truth"].value_counts()

True     1485
False       2
Name: ground_truth, dtype: int64

In [22]:
tutorial.plot_model_score_distribution(df_predictions)

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


### Precision and Recall vs. Model Score

In [23]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

In [24]:
df_eval.head()

,threshold,tp,fp,tn,fn,precision,recall,f1
0,0.0,1485,2,0,0,0.998655,1.0,0.999327
1,0.020408,1485,2,0,0,0.998655,1.0,0.999327
2,0.040816,1484,2,0,1,0.998654,0.999327,0.99899
3,0.061224,1484,2,0,1,0.998654,0.999327,0.99899
4,0.081633,1484,1,1,1,0.999327,0.999327,0.999327


In [25]:
tutorial.plot_precision_recall_vs_threshold(df_eval)
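
To see how a threshold sweep like this can be computed, here is a simplified stand-in. It is not the implementation of `tutorial.evaluate_linking`, which lives in the separate tutorial functions file; `evaluate_thresholds`, its `score >= threshold` rule, and the toy data are our own assumptions. One deliberate difference is that recall is measured against a `total_true_links` count supplied by the caller, so links lost during blocking can be included in the denominator.

```python
from typing import Dict, List, Optional, Tuple


def evaluate_thresholds(
    scored: List[Tuple[float, bool]],
    total_true_links: int,
    n_steps: int = 5,
) -> List[Dict[str, Optional[float]]]:
    """Compute precision and recall at evenly spaced thresholds in [0, 1]."""
    rows = []
    for i in range(n_steps):
        threshold = i / (n_steps - 1)
        # Pairs at or above the threshold are predicted to be links.
        predicted = [is_link for score, is_link in scored if score >= threshold]
        tp = sum(predicted)
        rows.append({
            "threshold": threshold,
            # Precision is undefined when nothing is predicted as a link.
            "precision": tp / len(predicted) if predicted else None,
            "recall": tp / total_true_links,
        })
    return rows


# Toy (model_score, ground_truth) pairs. Assume 6 true links exist in total,
# 3 of which were dropped by blocking and so never received a score.
toy_scored = [(0.95, True), (0.80, True), (0.60, False), (0.30, True)]
rows = evaluate_thresholds(toy_scored, total_true_links=6)

for row in rows:
    print(row)
```

In this toy example, recall can never exceed 0.5, no matter how low the threshold is set, because half of the true links were never scored.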

## Iterating with Active Learning

When using active learning, we iterate on our linking solution and incorporate progressively more labeled training data. Perhaps we're not satisfied with the current performance of the blockers or classifier, and we'd like to create more labeled examples for `dedupe` to train on.

Recall that earlier, we saved our existing training data from the first labeling session. We can load this persisted data into a new `dedupe` linker and kick off another labeling session. Suppose that, after investigating the data during our first cycle, we decide that `dedupe` should not include `address_1` and `address_2` in its comparators.

### Tweak the Linker and Use Existing Training Data

In [26]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker2 = dedupe.RecordLink(fields)

with open(TRAINING_FILE, "r") as fh:
    linker2.prepare_training(records_A, records_B, training_file=fh)

INFO:dedupe.api:reading training from file
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)


CPU times: user 44.4 s, sys: 905 ms, total: 45.3 s
Wall time: 44.6 s


Now, we can kick off a second active learning/labeling session.

In [27]:
dedupe.console_label(linker2)

first_name : tynan
surname : vaisey
suburb : burwood
postcode : 5114
state : nsw
date_of_birth : 02/10/80
soc_sec_id : 7613105

first_name : tynsn
surname : vaisey
suburb : None
postcode : 5114
state : nswl
date_of_birth : 02/20/80
soc_sec_id : 7613105

12/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : harrison
surname : paine
suburb : patterson lakes
postcode : 2777
state : nsw
date_of_birth : 08/22/62
soc_sec_id : 3409750

first_name : harruon
surname : paine
suburb : None
postcode : 2777
state : nx
date_of_birth : None
soc_sec_id : 3409750

13/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)


### Retrain the Linker and Examine Blocking Performance

Now, let's retrain and examine blocker performance. Ideally, we'll see improved true link retention following our second labeling session.

In [28]:
%%time
linker2.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  * (true_distinct + false_distinct)))
INFO:rlr.crossvalidation:optimum alpha: 1.000000, score 0.21568730250512277
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (wholeFieldPredicate, soc_sec_id), SimplePredicate: (wholeFieldPredicate, state), TfidfNGramSearchPredicate: (0.8, suburb))
INFO:dedupe.training:(PartialPredicate: (commonSixGram, surname, Surname), SimplePredicate: (suffixArray, first_name), SimplePredicate: (wholeFieldPredicate, state))
INFO:dedupe.training:(SimplePredicate: (monthPredicate, date_of_birth), PartialPredicate: (sortedAcronym, first_name, Surname), SimplePredicate: (tokenFieldPredicate, surname))
INFO:dedupe.training:(SimplePredicate: (firstTwoTokensPredicate, surname), SimplePredicate: (commonSixGram, suburb))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), PartialPredicate: (commonSixGram, first_name, Surname))
INFO:dedupe.training:(Sim

CPU times: user 6.6 s, sys: 969 ms, total: 7.57 s
Wall time: 6.76 s


In [29]:
candidate_pairs = [x for x in linker2.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction, expressed as a percentage.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

1,701 candidate pairs generated from blocking.
10,562,500 total possible pairs.

1,701 pairs after full blocking: 99.9839% search space reduction.
56.4% true links retained after blocking.


### Evaluate Classification Performance

In [30]:
%%time
linked_records = linker2.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 1.53 s, sys: 100 ms, total: 1.63 s
Wall time: 2.97 s


In [31]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)
df_predictions

person_id_A,person_id_B,model_score,ground_truth
ed588703-022b-449d-aa43-c5b5f9a596e0,c55f2fd1-976a-495c-8613-33639391df30,1.000000,True
dc7b2f36-4925-4455-af27-8afba03098c4,b705af01-db9d-4da6-82be-590acbf7a33f,1.000000,True
d91c733d-94be-4984-ae85-407a1e7433a7,0da76f3f-0910-4e18-b86d-2344fe5a3206,1.000000,True
9e8b210e-316d-4b69-adb3-9a6d16277c3b,8a830a4b-054d-4c47-a3a7-d90395065797,1.000000,True
7bd2a4e1-76d8-4993-8921-f624bb83db01,6fa74f0b-a0df-4fe1-8ffa-75a24045a43f,1.000000,True
...,...,...,...
2474e206-15f6-4182-8858-48a3717d495a,74ece867-2eba-4ca6-a288-ce53eb4cf426,0.156811,True
1233ad5a-e423-4bd6-acf0-a2fdf42f4a27,1b53dc8f-74fb-48b7-9090-fa3fe99fbb19,0.116145,True
2f84019c-ea3a-4105-8087-821c40328773,d61b8581-02bc-4f10-89b3-5a06a65547ae,0.109106,True
75963c5c-3389-49e9-9404-4900c45abe43,92292e33-95bd-4ff0-9d4f-904d54712fa8,0.088072,True


In [32]:
df_predictions["ground_truth"].value_counts()

True     1686
False       7
Name: ground_truth, dtype: int64

In [33]:
tutorial.plot_model_score_distribution(df_predictions)

In [34]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

tutorial.plot_precision_recall_vs_threshold(df_eval)