# Sample Example: SPLink

We provide two libraries for record linking, RecordLinkage and SPLink. Follow the RecordLinkage notebook [here](https://www.antigranular.com/notebooks/651a938d0f1e51b4fa0b651a).

SPLink is a Python package for probabilistic record linkage. Like RecordLinkage, it is used to link records and deduplicate datasets.

Participants can use SPLink to predict which rows link together and then further cluster these connections to generate an Individual ID. This can prove especially useful when unique identifiers are missing or differ significantly across datasets.


## Getting Started: Setting Up the Environment

In [1]:
!pip install antigranular

In [2]:
import antigranular as ag
session = ag.login(<client_id>,<client_secret>, competition = "Sandbox for Harvard Open DP Hackathon")

Loading dataset "Flight Company Dataset for Sandbox" to the kernel...
Dataset "Flight Company Dataset for Sandbox" loaded to the kernel as flight_company_dataset_for_sandbox
Loading dataset "Health Organisation Dataset for Sandbox" to the kernel...
Dataset "Health Organisation Dataset for Sandbox" loaded to the kernel as health_organisation_dataset_for_sandbox
Connected to Antigranular server session id: 6705adb8-c6d1-4c35-9776-eed601942579, the session will time out if idle for 60 minutes
Cell magic '%%ag' registered successfully, use `%%ag` in a notebook cell to execute your python code on Antigranular private python server
🚀 Everything's set up and ready to roll!


### Importing the Datasets

***In this competition we are provided with two datasets:***

The airline companies have information about passengers and their travel dates (`flight_company_dataset_for_sandbox`), and the national health organisation has records of patients who took a COVID test and whether their result was positive or negative (`health_organisation_dataset_for_sandbox`).

These are already provided within the AG environment.

In [3]:
%%ag
health = health_organisation_dataset_for_sandbox
flight = flight_company_dataset_for_sandbox

## Pre Processing the Data

In [4]:
%%ag
import numpy as np
import op_pandas as opd
import pandas as pd

### **Splink requires that you clean your data and assign unique IDs to rows before linking**

* **Unique IDs:** Each input dataset must have an ID column whose values are unique within that dataset. By default, Splink assumes this column is called `unique_id`.
* **Conformant input datasets:** Input datasets must be conformant, meaning they share the same column names and data formats.
* **Cleaning:** Ensure data consistency by cleaning your data. This process includes standardising date formats, matching text case, and handling invalid data.


### Creating Unique IDs

Since the number of records in both the datasets is public information, we can use these to create the `unique_id` columns.

For this, we use `numpy.arange`, which generates evenly spaced values within a given interval.

In [5]:
%%ag
num_health = 59230  # taken from the competition page
num_flight = 39028  # taken from the competition page

unique_id_health = opd.PrivateSeries(pd.Series(np.arange(num_health)))
unique_id_flight = opd.PrivateSeries(pd.Series(np.arange(num_flight)))

In [6]:
%%ag
health['unique_id'] = unique_id_health
flight['unique_id'] = unique_id_flight

### Conforming Input Datasets

In order to conform the various column names, let us examine what they are.

In [7]:
%%ag
ag_print("Health Columns:")
ag_print(health.columns)
ag_print("Flight Columns:")
ag_print(flight.columns)

Health Columns:
Index(['patient_firstname', 'patient_lastname', 'patient_date_of_birth',
       'covidtest_date', 'covidtest_result', 'patient_address', 'unique_id'],
      dtype='object')
Flight Columns:
Index(['flight_number', 'flight_date', 'flight_from', 'flight_to',
       'passenger_firstname', 'passenger_lastname', 'passenger_date_of_birth',
       'unique_id'],
      dtype='object')



As we can see, the column names are not the same in both of the datasets. For example, first name in the health dataset is `patient_firstname` and in the flight dataset it is `passenger_firstname`.

We also want to link on the basis of dates (`covidtest_date` in the health dataset and `flight_date` in the flight dataset).

Hence, we will write a function that gives matching columns the same name in both datasets.

In [None]:
%%ag
def conform_columns(df: opd.PrivateDataFrame) -> opd.PrivateDataFrame:
    final_columns = []
    for col in df.columns:
        if "firstname" in col:  # converting patient_firstname and passenger_firstname -> firstname
            final_columns.append("firstname")
            df["firstname"] = df[col]
        elif "lastname" in col:  # converting patient_lastname and passenger_lastname -> lastname
            final_columns.append("lastname")
            df["lastname"] = df[col]
        elif "date_of_birth" in col:  # converting patient_date_of_birth and passenger_date_of_birth -> date_of_birth
            final_columns.append("date_of_birth")
            df["date_of_birth"] = df[col]
        elif "covidtest_date" in col: # converting covidtest_date and flight_date -> date
            final_columns.append("date")
            df["date"] = df[col]
        elif "flight_date" in col:
            final_columns.append("date")
            df["date"] = df[col]
        else:
            final_columns.append(col)

    df = df[final_columns]
    return df

health = conform_columns(health)
flight = conform_columns(flight)
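
To make the renaming rule easier to check, here is a minimal plain-Python sketch that applies the same substring rules to the column names alone. It runs locally and does not touch `op_pandas`; the names `RENAME_RULES` and `conform_name` are hypothetical and exist only for this illustration.

```python
# Substring -> shared column name, checked in order (mirrors the if/elif chain above).
RENAME_RULES = [
    ("firstname", "firstname"),
    ("lastname", "lastname"),
    ("date_of_birth", "date_of_birth"),
    ("covidtest_date", "date"),
    ("flight_date", "date"),
]


def conform_name(col: str) -> str:
    """Return the shared name for a column, or the column unchanged."""
    for fragment, shared_name in RENAME_RULES:
        if fragment in col:
            return shared_name
    return col


health_columns = ["patient_firstname", "patient_lastname", "patient_date_of_birth",
                  "covidtest_date", "covidtest_result", "patient_address", "unique_id"]
flight_columns = ["flight_number", "flight_date", "flight_from", "flight_to",
                  "passenger_firstname", "passenger_lastname",
                  "passenger_date_of_birth", "unique_id"]

conformed_health = [conform_name(c) for c in health_columns]
conformed_flight = [conform_name(c) for c in flight_columns]

# Columns present in both datasets after renaming.
shared_columns = sorted(set(conformed_health) & set(conformed_flight))
print(shared_columns)
```

The shared columns are the ones that can be used for linking; everything else is specific to one dataset.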

We only need the records where covidtest_result is positive, so we will extract them.

In [None]:
%%ag
# Keep only patients who tested positive: mask every other result, then drop the masked rows.
health['covidtest_result'] = health['covidtest_result'].where(health['covidtest_result'] == 'positive')
health = health.dropna()
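
The filter works in two steps: `where` masks values that fail the condition, and `dropna` removes rows containing a masked value. The sketch below imitates that behaviour in plain Python on made-up records, assuming `op_pandas` mirrors pandas semantics here (an assumption, not something the AG documentation is quoted on). The helper `mask_where` is hypothetical.

```python
from typing import Optional

# Made-up records for illustration only.
records = [
    {"firstname": "Ann", "covidtest_result": "positive"},
    {"firstname": "Bob", "covidtest_result": "negative"},
    {"firstname": None, "covidtest_result": "positive"},  # missing name
]


def mask_where(value: str, keep: bool) -> Optional[str]:
    """Imitate Series.where: keep the value if the condition holds, else mask it."""
    return value if keep else None


# Step 1: mask results that are not 'positive'.
masked = [
    {**row, "covidtest_result": mask_where(row["covidtest_result"],
                                           row["covidtest_result"] == "positive")}
    for row in records
]

# Step 2: imitate dropna(): drop any row that has a missing value in ANY column.
positives = [row for row in masked if all(v is not None for v in row.values())]
print(positives)
```

Note that under this assumption a positive record with a missing value in another column is dropped as well, because `dropna` with default arguments in pandas removes rows with any missing value.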

Checking out the columns again, we can see that they are conformant.

In [10]:
%%ag
ag_print("Health Columns:")
ag_print(health.columns)
ag_print("Flight Columns:")
ag_print(flight.columns)

Health Columns:
Index(['firstname', 'lastname', 'date_of_birth', 'date', 'covidtest_result',
       'patient_address', 'unique_id'],
      dtype='object')
Flight Columns:
Index(['flight_number', 'date', 'flight_from', 'flight_to', 'firstname',
       'lastname', 'date_of_birth', 'unique_id'],
      dtype='object')



The column names are now conformant. However, some columns are not needed for the linkage, so we keep only the ones shared by both datasets.

In [None]:
%%ag
health_link = health[['firstname', 'lastname', 'date_of_birth', 'date', 'unique_id']]
flight_link = flight[['firstname', 'lastname', 'date_of_birth', 'date', 'unique_id']]

## Comparisons

A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types.

By tailoring the definitions of similarity, linking models can distinguish more effectively between different gradations of similarity, which leads to more accurate linkage.

For more information on comparisons, follow [this link](https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html).

Here, we will create 4 comparisons:

* Fuzzy matching of firstname
* Fuzzy matching of lastname
* Difference of `date` column to be within 14 days
* Fuzzy matching of `date_of_birth` column

In [12]:
%%ag
import op_splink.duckdb.comparison_template_library as ctl
from op_splink.duckdb.blocking_rule_library import block_on

first_name_comparison = ctl.name_comparison("firstname")
last_name_comparison = ctl.name_comparison("lastname", jaro_winkler_thresholds=[0.8])
date_difference = ctl.date_comparison("date", cast_strings_to_date = True, include_exact_match_level = False, damerau_levenshtein_thresholds = [], datediff_thresholds = [14], datediff_metrics = ["day"])
date_of_birth_comparison = ctl.date_comparison("date_of_birth", cast_strings_to_date = True)

SPLink evaluates the comparisons in SQL. We can use the `human_readable_description` attribute to inspect the comparison levels and the SQL rule behind each one.

In [13]:
%%ag
ag_print(date_difference.human_readable_description)

Comparison 'Dates within the following threshold Day(s): 14 vs. anything else' of "date".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "date_l" IS NULL OR "date_r" IS NULL
    - 'Within 14 days' with SQL rule: 
            abs(date_diff('day',
                strptime("date_l", '%Y-%m-%d'),
                strptime("date_r", '%Y-%m-%d'))
                ) <= 14
        
    - 'All other comparisons' with SQL rule: ELSE
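
The three levels printed above can be reproduced locally with the standard library. This is only a sketch of the same logic, assuming dates arrive as `%Y-%m-%d` strings as in the SQL rule; the function name `date_comparison_level` is hypothetical.

```python
from datetime import datetime
from typing import Optional

DATE_FORMAT = "%Y-%m-%d"  # same format string as in the SQL rule


def date_comparison_level(date_l: Optional[str], date_r: Optional[str],
                          max_days: int = 14) -> str:
    """Assign a pair of date strings to one of the three comparison levels."""
    if date_l is None or date_r is None:
        return "Null"
    left = datetime.strptime(date_l, DATE_FORMAT)
    right = datetime.strptime(date_r, DATE_FORMAT)
    # Absolute difference, so the order of the two dates does not matter.
    if abs((left - right).days) <= max_days:
        return f"Within {max_days} days"
    return "All other comparisons"


print(date_comparison_level("2021-03-01", "2021-03-15"))
print(date_comparison_level("2021-03-01", "2021-03-16"))
print(date_comparison_level(None, "2021-03-16"))
```

Because the difference is absolute, a test taken up to 14 days before the flight falls into the same level as one taken up to 14 days after it.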




In [14]:
%%ag
ag_print(first_name_comparison.human_readable_description)

Comparison 'Exact match vs. Firstname within levenshtein threshold 1 vs. Firstname within damerau-levenshtein threshold 1 vs. Firstname within jaro_winkler thresholds 0.9, 0.8 vs. anything else' of "firstname".
Similarity is assessed using the following ComparisonLevels:
    - 'Null' with SQL rule: "firstname_l" IS NULL OR "firstname_r" IS NULL
    - 'Exact match firstname' with SQL rule: "firstname_l" = "firstname_r"
    - 'Damerau_levenshtein <= 1' with SQL rule: damerau_levenshtein("firstname_l", "firstname_r") <= 1
    - 'Jaro_winkler_similarity >= 0.9' with SQL rule: jaro_winkler_similarity("firstname_l", "firstname_r") >= 0.9
    - 'Jaro_winkler_similarity >= 0.8' with SQL rule: jaro_winkler_similarity("firstname_l", "firstname_r") >= 0.8
    - 'All other comparisons' with SQL rule: ELSE
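
To give a feel for the `damerau_levenshtein(...) <= 1` level, the sketch below implements the optimal string alignment variant of the Damerau-Levenshtein distance in plain Python. It is an illustration only: DuckDB's own function may differ in edge cases, and the name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "nh" <-> "hn".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for left, right in [("john", "jonh"), ("anna", "anne"), ("kitten", "sitting")]:
    print(left, right, osa_distance(left, right))
```

A swapped pair of letters such as `john` / `jonh` counts as a single edit, so it would fall into the `<= 1` level rather than into "All other comparisons".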




## Creating a Linker

Comparisons are specified as part of the Splink settings, a Python dictionary which controls all of the configuration of a Splink model.

Currently, we only support the DuckDBLinker.

In [15]:
%%ag
from op_splink.duckdb.linker import DuckDBLinker

settings = {
                "link_type": "link_only",
                "comparisons":[
                    first_name_comparison,
                    last_name_comparison,
                    date_difference,
                    date_of_birth_comparison
                ],
                "blocking_rules_to_generate_predictions": [
                    block_on("firstname"),
                    block_on("lastname"),
                ]
}
linker = DuckDBLinker([health_link, flight_link], settings)


In other words, this setting dictionary says:

* We are performing a link_only (the other options are dedupe_only, or link_and_dedupe, which may be used if there are multiple input datasets).
* When comparing records, we will use information from the `firstname`, `lastname`, `date`, and `date_of_birth` columns to compute a match score.
* The `blocking_rules_to_generate_predictions` setting states that we will only compare pairs of records where either the `firstname` or the `lastname` is identical.
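
The effect of the blocking rules can be sketched in plain Python: of all possible cross-dataset pairs, only those that agree exactly on at least one blocking key are kept for scoring. The records and the helper `blocked_pairs` below are made up for illustration; Splink itself generates these pairs in SQL.

```python
from itertools import product

# Made-up records for illustration only.
health_rows = [
    {"unique_id": 0, "firstname": "ann", "lastname": "lee"},
    {"unique_id": 1, "firstname": "bob", "lastname": "ray"},
]
flight_rows = [
    {"unique_id": 0, "firstname": "ann", "lastname": "li"},
    {"unique_id": 1, "firstname": "rob", "lastname": "ray"},
    {"unique_id": 2, "firstname": "cat", "lastname": "fox"},
]


def blocked_pairs(left, right, keys):
    """Keep (left_id, right_id) pairs that agree exactly on at least one key."""
    pairs = []
    for l_row, r_row in product(left, right):
        if any(l_row[key] == r_row[key] for key in keys):
            pairs.append((l_row["unique_id"], r_row["unique_id"]))
    return pairs


candidate_pairs = blocked_pairs(health_rows, flight_rows, ["firstname", "lastname"])
all_pairs = len(health_rows) * len(flight_rows)
print(f"{len(candidate_pairs)} of {all_pairs} pairs survive blocking: {candidate_pairs}")
```

The trade-off is that a true match with typos in both the first name and the last name would never be compared, which is why more than one blocking rule is usually supplied.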

## Training the Model

Now that we have specified our linkage model, we need to estimate its u and m parameters, which are the parameters of the underlying Fellegi-Sunter model.

The u values are the proportion of records falling into each ComparisonLevel amongst truly non-matching records.

We estimate u using the estimate_u_using_random_sampling method.

In [16]:
%%ag
linker.estimate_u_using_random_sampling(max_pairs=1e6)

----- Estimating u probabilities using random sampling -----


Estimated u probabilities using random sampling


Your model is not yet fully trained. Missing estimates for:
    - firstname (no m values are trained).
    - lastname (no m values are trained).
    - date (no m values are trained).
    - date_of_birth (no m values are trained).
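
The idea behind random sampling can be sketched as follows: randomly drawn pairs are overwhelmingly non-matches, so the share of sampled pairs that land in a comparison level approximates u for that level. The sketch below does this for a single "exact match" level on made-up names; `estimate_u_exact_match` is a hypothetical helper, not part of `op_splink`.

```python
import random


def estimate_u_exact_match(left, right, max_pairs, seed=0):
    """Approximate u for an exact-match level from randomly sampled pairs."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(max_pairs):
        # Each draw is a random cross-dataset pair, assumed to be a non-match.
        if rng.choice(left) == rng.choice(right):
            agreements += 1
    return agreements / max_pairs


# Made-up first names for illustration only.
health_names = ["ann", "bob", "cat", "dan"]
flight_names = ["ann", "bob", "eve", "fay"]

u_firstname = estimate_u_exact_match(health_names, flight_names, max_pairs=10_000)
print(f"Estimated u for exact firstname match: {u_firstname:.3f}")
```

A small u means that agreement on the field is rare among non-matches, so observing agreement is strong evidence for a match.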



The m values are the proportion of record comparisons falling into each ComparisonLevel amongst truly matching records. They are the trickiest parameters to estimate, because we need some idea of what the true matches are.

If we have labels, we can estimate m directly. Without labelled data, the m parameters can be estimated using an iterative maximum likelihood approach called **Expectation Maximisation**.

Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.

In our first estimation pass, we block on `firstname`, meaning we will generate all record comparisons that have `firstname` exactly equal.

In [17]:
%%ag
linker.estimate_parameters_using_expectation_maximisation(block_on("firstname"))


----- Starting EM training session -----


Estimating the m probabilities of the model by blocking on:
l."firstname" = r."firstname"

Parameter estimates will be made for the following comparison(s):
    - lastname
    - date
    - date_of_birth

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - firstname



EM Converged successfully


Your model is not yet fully trained. Missing estimates for:
    - firstname (no m values are trained).
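
To make Expectation Maximisation more concrete, here is a heavily simplified sketch. It assumes each comparison has only two levels (agree or disagree), keeps the u values fixed, and estimates the m values and the overall match proportion from unlabelled pairs. The function `em_estimate_m` and the data are hypothetical; the real training in `op_splink` handles multiple levels and runs in SQL.

```python
def em_estimate_m(pairs, u, m_init=0.9, lam_init=0.1, iterations=20):
    """Estimate m per comparison and the match proportion lam, with u held fixed.

    pairs: list of tuples of 0/1 flags, one flag per comparison (1 = agree).
    u:     list of P(agree | non-match), one per comparison.
    """
    k = len(u)
    m = [m_init] * k
    lam = lam_init
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        posteriors = []
        for gamma in pairs:
            p_match, p_non = lam, 1.0 - lam
            for j in range(k):
                p_match *= m[j] if gamma[j] else 1.0 - m[j]
                p_non *= u[j] if gamma[j] else 1.0 - u[j]
            posteriors.append(p_match / (p_match + p_non))
        # M-step: re-estimate lam and m as posterior-weighted averages.
        total = sum(posteriors)
        lam = total / len(pairs)
        m = [sum(p * g[j] for p, g in zip(posteriors, pairs)) / total
             for j in range(k)]
    return m, lam


# Made-up agreement patterns for two comparisons: mostly full agreement or none.
pairs = [(1, 1)] * 10 + [(0, 0)] * 90 + [(1, 0)] * 5 + [(0, 1)] * 5
u_fixed = [0.1, 0.1]

m_hat, lam_hat = em_estimate_m(pairs, u_fixed)
print("m estimates:", [round(x, 3) for x in m_hat], "match proportion:", round(lam_hat, 3))
```

The loop alternates between guessing which pairs are matches (E-step) and updating the parameters as if those guesses were labels (M-step), which is why no labelled data is needed.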



In the second estimation pass, we block on `date_of_birth`. This allows us to estimate m for the `firstname` comparison, which could not be trained in the first pass, and gives a second estimate for `lastname` and `date`.

In [18]:
%%ag
linker.estimate_parameters_using_expectation_maximisation(block_on("date_of_birth"))


----- Starting EM training session -----


Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"

Parameter estimates will be made for the following comparison(s):
    - firstname
    - lastname
    - date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - date_of_birth



EM Converged successfully


Your model is fully trained. All comparisons have at least one estimate for their m and u values



## Predicting Results

Now we need to produce the linked dataset. This can be done with `predict_and_create_linked_df`, which takes a threshold and returns all linked record pairs whose match probability exceeds it.

In [19]:
%%ag
linked_df = linker.predict_and_create_linked_df(0.9)
ag_print(linked_df.columns)

Index(['unique_id_l', 'unique_id_r', 'firstname_l', 'firstname_r',
       'lastname_l', 'lastname_r', 'date_l', 'date_r', 'date_of_birth_l',
       'date_of_birth_r'],
      dtype='object')
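
For intuition on what the 0.9 threshold is applied to, the sketch below shows how the Fellegi-Sunter model turns m and u values into a match probability: each comparison contributes a Bayes factor m/u, which multiplies the prior odds of a match. The parameter values are invented for illustration, and `match_weight` and `match_probability` are hypothetical helpers, not `op_splink` functions.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight of one observed comparison level: log2 of the Bayes factor."""
    return math.log2(m / u)


def match_probability(prior: float, level_params) -> float:
    """Combine the prior with the (m, u) of each observed level.

    prior:        probability that a random pair is a match.
    level_params: list of (m, u) tuples, one per comparison.
    """
    odds = prior / (1.0 - prior)
    for m, u in level_params:
        odds *= m / u  # Bayes factor of the observed level
    return odds / (1.0 + odds)


# Invented values: two exact name matches and a date within the window.
observed = [(0.9, 0.01), (0.9, 0.01), (0.8, 0.05)]
probability = match_probability(prior=0.001, level_params=observed)

print("Match weights:", [round(match_weight(m, u), 2) for m, u in observed])
print("Match probability:", round(probability, 4))
```

Even with a very small prior, several levels that are common among matches and rare among non-matches push the probability above a threshold such as 0.9.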



Counting the number of records within this linked PrivateDataFrame can be done as follows:

In [20]:
%%ag
ag_print(linked_df.count(eps=0.1))

unique_id_l        2903
unique_id_r        2996
firstname_l        3201
firstname_r        3008
lastname_l         2718
lastname_r         3026
date_l             2868
date_r             2586
date_of_birth_l    2894
date_of_birth_r    2930
dtype: int64



### Finding Out Which Flights Should Be Notified

To find out which flights should be notified, we can use the following algorithm:

* Find all the unique IDs of flight records in the linked dataset. These are the flight records that the model linked to a COVID-positive patient, using (among other signals) a test date within 14 days of the flight date. Let us call this set of unique IDs `unique_ids`.
* Create a column within the flight dataset whose value is True if the `unique_id` of the record belongs to `unique_ids`, and False otherwise. This can be done with the `isin` method in `op_pandas`. Let us call this column `notify`.
* Extract the `flight_number` of all the flights where `notify` is true.

In [None]:
%%ag
unique_ids = linked_df['unique_id_r']

flights_to_notify_id = flight['unique_id'].isin(unique_ids)
flight = flight[['unique_id', 'flight_number']]
flight['notify'] = flights_to_notify_id

flight = flight.where(flight['notify'] == True)
flights_to_notify = flight[['flight_number']]
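
The same three steps can be sketched on made-up data with plain Python sets, which may help when reasoning about the result outside the AG environment. As in the cell above, the right-hand ID of each linked pair is taken to be the flight record, because the flight dataset was passed to the linker second; all values below are invented.

```python
# (health unique_id_l, flight unique_id_r) pairs, invented for illustration.
linked_pairs = [(10, 3), (11, 3), (12, 7)]

flight_rows = [
    {"unique_id": 3, "flight_number": "AB123"},
    {"unique_id": 5, "flight_number": "CD456"},
    {"unique_id": 7, "flight_number": "EF789"},
]

# Step 1: unique IDs of flight records that appear in the linked dataset.
linked_flight_ids = {right_id for _, right_id in linked_pairs}

# Step 2: flag each flight record (the equivalent of the `notify` column).
notify_flags = [row["unique_id"] in linked_flight_ids for row in flight_rows]

# Step 3: keep the flight numbers of the flagged records.
flights_to_notify = [row["flight_number"]
                     for row, flag in zip(flight_rows, notify_flags) if flag]
print(flights_to_notify)
```

Using a set in step 1 means a flight record linked to several patients is still flagged only once.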

Now we will submit the predictions using the `submit_predictions` method, which is preloaded within the AG environment.

In [22]:
%%ag
submit_predictions(flights_to_notify)

score: {'leaderboard': 0.9026014308070851, 'logs': {'LIN_EPS': -0.0, 'MCC': 0.9026014308070851}}



Now that we're all done, we use this line to close our work session neatly. It's like turning off the lights when you leave a room – it’s a good habit to wrap things up properly!

In [23]:
session.terminate_session()

{'status': 'ok'}