## Deduplication quickstart

In this demo we de-duplicate a small dataset, using simple settings. 

The aim is to demonstrate core Splink functionality succinctly, rather than to document all configuration options comprehensively.



## Step 1: Choose your backend

In `splink` version 3, you can choose the SQL backend that will perform the matching.

Currently, `splink` offers three different SQL backends:
- `duckdb` with: `from splink.duckdb.duckdb_linker import DuckDBLinker`

- `sqlite` with: `from splink.sqlite.sqlite_linker import SQLiteLinker`

- `spark` with: `from splink.spark.spark_linker import SparkLinker`

For smaller datasets (up to a few million records), `duckdb` is likely to give you the best performance.

The subsequent code is the same irrespective of the backend used.

In [1]:
from splink.duckdb.duckdb_linker import DuckDBLinker

## Step 1: Read in data
Read in a 1000-record dataset that contains duplicates.

Note that the `group` column represents the 'ground truth' - i.e. this is a labelled dataset, so we know which rows refer to the same person.  In reality, we wouldn't have this column - this is the information that Splink is trying to estimate.

In [2]:
import pandas as pd 
pd.options.display.max_rows = 1000
df = pd.read_csv("./data/fake_1000.csv")
df.head(5)

| | unique_id | first_name | surname | dob | city | email | group |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | | roberta25@smith.net | 0 |
| 2 | 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net | 0 |
| 3 | 3 | Robert | Alen | 1971-06-24 | Lonon | | 0 |
| 4 | 4 | Grace | | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 |
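
To make the role of `group` concrete, the sketch below derives the set of true duplicate pairs from the five rows shown above. It uses only the standard library; the `rows` list and the `true_pairs` name are illustrative and are not part of Splink.

```python
from itertools import combinations

# (unique_id, group) for the five rows displayed above
rows = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1)]

# Two records are a true match when they share the same `group` value
true_pairs = {
    (id_l, id_r)
    for (id_l, group_l), (id_r, group_r) in combinations(rows, 2)
    if group_l == group_r
}

print(sorted(true_pairs))
```

Records 0 to 3 all belong to group 0, so every pair among them is a true match, whereas record 4 pairs with none of them.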


## Step 2: Profile columns

Splink can perform exploratory analysis of columns (e.g. `first_name`) and of arbitrary SQL expressions (e.g. `concat(first_name, surname)`).

This is useful for understanding your data, whether it suffers from skew, and whether additional data cleaning may be necessary.

In [4]:
# Initialise the linker, passing in the input dataset(s)
linker = DuckDBLinker(input_tables = {"fake_1000": df})

c = linker.profile_columns(["first_name", "city", "substr(dob, 1,4)"], top_n=10, bottom_n=5)
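
As a rough idea of what profiling reveals, the sketch below counts how often each value occurs in a column and what share of non-null records it accounts for. It is a plain Python illustration, not Splink's implementation; `profile_column` and the `cities` data are hypothetical.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Return (value, count, share) for the most common non-null values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    return [(value, n, n / total) for value, n in counts.most_common(top_n)]


# Hypothetical city values, with None standing in for a missing value
cities = ["London", "London", "London", "Hull", "Leeds", None, "London", "Hull"]

top_values = profile_column(cities)
for value, n, share in top_values:
    print(f"{value:<8} {n:>3} {share:.0%}")
```

A column where one value dominates, as `London` does here, is skewed: agreement on that value is weaker evidence of a match than agreement on a rare value.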

## Step 3: Configure how Splink compares records using a `settings` dictionary

`splink` needs to know how to compare records from the input dataset:  Which columns should be compared, and how should Splink assess their similarity?

This is configured using a `settings` dictionary.  For the purposes of this simple example, we will make these comparisons simple:  

- For the `first_name` column, we will model the comparison as either:
  - an 'exact match' (e.g. `John` vs `John`)
  - similar but not exactly the same (e.g. `John` vs `Jon`).  Specifically, this will be defined as a Levenshtein distance of 1 or 2.
  - all other comparisons 

- For all other columns (`surname`, `dob`, `city` and `email`), Splink will categorise comparisons as either an 'exact match' (e.g. `Smith` vs `Smith`), or 'anything else' (e.g. `Smith` vs `Jones`, or even `Smith` vs `Smyth`).

- For `city`, we enable term frequency comparisons because we observed significant skew in the distribution of values
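
The three-level `first_name` comparison described above can be sketched in plain Python. This is an illustration rather than the SQL that Splink generates: `levenshtein_distance` and `first_name_level` are hypothetical helpers, and the level numbers (2 for an exact match, -1 for a null) are an assumption chosen to mirror the `gamma_` columns in the prediction output shown later.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def first_name_level(left, right, threshold=2):
    """Map a pair of values to a comparison level (higher = more similar)."""
    if left is None or right is None:
        return -1  # null level: no evidence either way
    if left == right:
        return 2  # exact match
    if levenshtein_distance(left, right) <= threshold:
        return 1  # similar but not identical
    return 0  # all other comparisons


for pair in [("John", "John"), ("John", "Jon"), ("Robert", "Rob"), (None, "Rob")]:
    print(pair, first_name_level(*pair))
```

Note that `Robert` vs `Rob` is three edits apart, so with a threshold of 2 it falls into the 'all other comparisons' level.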

In [5]:

from splink.comparison_library import exact_match, levenshtein
settings = {
    "proportion_of_matches": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01
}
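
To see what the two entries in `blocking_rules_to_generate_predictions` do, the sketch below applies the same logic to the five sample rows shown earlier. It is a plain Python approximation of the SQL conditions; the `records` list and the `equal_and_not_null` helper are hypothetical.

```python
from itertools import combinations

# (unique_id, first_name, surname) for the five sample rows shown earlier;
# None stands in for a missing value
records = [
    (0, "Robert", "Alan"),
    (1, "Robert", "Allen"),
    (2, "Rob", "Allen"),
    (3, "Robert", "Alen"),
    (4, "Grace", None),
]


def equal_and_not_null(a, b):
    # In SQL, NULL = NULL is not true, so nulls never satisfy a blocking rule
    return a is not None and a == b


candidate_pairs = [
    (left[0], right[0])
    for left, right in combinations(records, 2)
    if equal_and_not_null(left[1], right[1]) or equal_and_not_null(left[2], right[2])
]

all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(candidate_pairs)} of {all_pairs} possible pairs pass the blocking rules")
```

Only pairs that satisfy at least one rule are scored, which keeps the number of comparisons manageable on larger datasets. The cost is that true matches agreeing on neither column, such as records 2 and 3 here, are never compared.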

In words, this settings dictionary says:

* We are performing a deduplication task (the other options are `link_only`, or `link_and_dedupe`, which may be used if there are multiple input datasets)
* The blocking rules state that we will only check for duplicates amongst records where the `first_name`s or `surname`s are identical.
* When comparing records, we will use information from the first_name, surname, dob, city and email columns to compute a match score.
* We have enabled term frequency adjustments for the 'city' column, because some values (e.g. `London`) appear much more frequently than others
* We will retain the group column in the results even though this is not used as part of comparisons. This is a labelled dataset and group contains the true match status, so it is interesting to retain this information so it can be compared to the Splink estimates.
* We will consider the algorithm to have converged when no parameter changes by more than 0.01 between iterations.
* To ensure the notebook runs quickly, we will stop iterations at 10.
* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` for the purposes of the demo, because this means the output datasets contain additional information that, whilst not strictly needed by Splink, helps the user understand the calculations. If these were not included in the settings dictionary, they would be set to `False` (their default value).
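
The intuition behind the term frequency adjustment for `city` can be sketched as follows. This is a simplified illustration under stated assumptions, and Splink's actual adjustment may differ in detail: the `cities` data and `M_EXACT` value are hypothetical, and we assume that the chance of an unrelated record sharing a given city is roughly that city's relative frequency.

```python
from collections import Counter

# Hypothetical city column with heavy skew towards one value
cities = ["London"] * 5 + ["Hull"] * 2 + ["Leeds", "Bath", "York"]

counts = Counter(cities)
term_frequency = {city: n / len(cities) for city, n in counts.items()}

M_EXACT = 0.9  # assumed probability that true matches agree on city


def bayes_factor_for_agreement(city):
    # Assumption: given one record in this city, the chance that an unrelated
    # record is also in it is roughly the city's relative frequency
    return M_EXACT / term_frequency[city]


for city in ("London", "Hull", "York"):
    print(f"{city:<8} tf={term_frequency[city]:.1f} bf={bayes_factor_for_agreement(city):.1f}")
```

Agreement on a common value such as `London` yields a smaller Bayes factor than agreement on a rare value, which is the effect the adjustment is designed to capture.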

## Step 4: Estimate the parameters of your model

Estimate the parameters of a Fellegi-Sunter model, and use the model to generate predictions.

We start by using `train_u_using_random_sampling` to compute the `u` values of the model.




In [6]:
linker.initialise_settings(settings)
linker.train_u_using_random_sampling(target_rows=1e6)

Trained u using random sampling: u values have now been estimated for all comparisons
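
A `u` value is the probability that a comparison level is observed among records that do not match. Because randomly chosen pairs are overwhelmingly non-matches, the share of random pairs that agree on a column gives an estimate. The sketch below illustrates the idea on hypothetical data; for simplicity it enumerates every pair rather than sampling, and it is not Splink's implementation.

```python
from itertools import combinations

# Hypothetical city values for five records
cities = ["London", "London", "Hull", "Leeds", "London"]

pairs = list(combinations(cities, 2))
agreeing = sum(1 for left, right in pairs if left == right)

# Share of pairs that agree on city, used as the u estimate for 'exact match'
u_exact_match = agreeing / len(pairs)
u_anything_else = 1 - u_exact_match

print(f"u(exact match) = {u_exact_match:.2f}, u(anything else) = {u_anything_else:.2f}")
```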


We then use the expectation maximisation algorithm to train the `m` values.

Note that in this first EM training session we block on `first_name` and `surname`, meaning that all comparisons will have `first_name` and `surname` exactly equal.   This means that, in this training session, we cannot estimate parameters for the `first_name` or `surname` columns, as seen in their absence from the match weights chart.

In [10]:
blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
training_session_names = linker.train_m_using_expectation_maximisation(blocking_rule)


c = training_session_names.match_weights_interactive_history_chart()

----- Starting EM training session -----
Training the m probabilities of the model by blocking on: l.first_name = r.first_name and l.surname = r.surname
Parameter estimates will be made for the following comparison: dob, city, email
Parameter estimates cannot be made for the following comparison since they are used in the blocking rules: first_name, surname
Iteration 1: Largest change in params was 0.00387 in proportion_of_matches
EM converged after 1 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values, and the global proportion of matches can be estimated.
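
For readers unfamiliar with expectation maximisation, the following is a minimal sketch of the idea, not Splink's implementation. It assumes two columns with only 'agree' or 'disagree' levels, holds `u` fixed, and ignores blocking. The `observed` counts are hypothetical, `lam` plays the role of `proportion_of_matches`, and the `max_iterations` and `em_convergence` arguments mimic the settings of the same name.

```python
from math import prod

# Comparison vectors (one 0/1 agreement flag per column) with pair counts.
# Hypothetical data, constructed so the best fit is lam = 0.1, m = 0.9.
observed = {(1, 1): 90, (1, 0): 90, (0, 1): 90, (0, 0): 730}
u = [0.1, 0.1]  # held fixed, as if estimated beforehand by random sampling


def likelihood(gamma, probs):
    """Probability of a comparison vector given per-column agreement probs."""
    return prod(p if g else 1 - p for g, p in zip(gamma, probs))


def run_em(m, lam, max_iterations=100, em_convergence=1e-6):
    for iteration in range(1, max_iterations + 1):
        # E-step: probability that each comparison vector is a match
        p_match = {}
        for gamma in observed:
            match = lam * likelihood(gamma, m)
            non_match = (1 - lam) * likelihood(gamma, u)
            p_match[gamma] = match / (match + non_match)

        # M-step: re-estimate lam and m from the expected match counts
        expected_matches = sum(n * p_match[g] for g, n in observed.items())
        new_lam = expected_matches / sum(observed.values())
        new_m = [
            sum(n * p_match[g] * g[k] for g, n in observed.items()) / expected_matches
            for k in range(len(m))
        ]

        # Stop once no parameter moves by more than the convergence threshold
        largest_change = max(
            [abs(new_lam - lam)] + [abs(a - b) for a, b in zip(new_m, m)]
        )
        m, lam = new_m, new_lam
        if largest_change < em_convergence:
            break
    return m, lam, iteration


m, lam, iterations = run_em(m=[0.8, 0.8], lam=0.05)
print(f"m = {m}, lam = {lam:.4f}, iterations = {iterations}")
```

Starting from deliberately wrong guesses, the estimates move towards the values used to construct the data. The loop stops in the same way as the log above: when the largest parameter change drops below the threshold, or when the iteration limit is reached.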


In a second training session, we block on `dob`.  This allows us to estimate parameters for the `first_name` and `surname` comparisons.

Between the two training sessions, we now have parameter estimates for all comparisons.

In [12]:
blocking_rule = "l.dob = r.dob"
training_session_dob = linker.train_m_using_expectation_maximisation(blocking_rule)
c = training_session_dob.match_weights_interactive_history_chart()

----- Starting EM training session -----
Training the m probabilities of the model by blocking on: l.dob = r.dob
Parameter estimates will be made for the following comparison: first_name, surname, city, email
Parameter estimates cannot be made for the following comparison since they are used in the blocking rules: dob
Iteration 1: Largest change in params was 0.485 in proportion_of_matches
Iteration 2: Largest change in params was 0.0931 in proportion_of_matches
Iteration 3: Largest change in params was 0.0369 in proportion_of_matches
Iteration 4: Largest change in params was 0.0193 in proportion_of_matches
Iteration 5: Largest change in params was 0.0116 in proportion_of_matches
Iteration 6: Largest change in params was 0.00749 in proportion_of_matches
EM converged after 6 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values, and the global proportion of matches can be estimated.


The final match weights can be viewed in the match weights chart:

In [13]:
c = linker.settings_obj.match_weights_chart()
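
A match weight summarises how strongly a comparison level points towards a match. A minimal sketch of the relationship, using hypothetical `m` and `u` values rather than the ones estimated above:

```python
from math import log2


def bayes_factor(m, u):
    """How much more likely this comparison level is among matches."""
    return m / u


def match_weight(m, u):
    """Bayes factor on a log2 scale; positive values favour a match."""
    return log2(bayes_factor(m, u))


# Hypothetical m and u values for two levels of one comparison
levels = {
    "exact match": {"m": 0.9, "u": 0.1},
    "anything else": {"m": 0.1, "u": 0.9},
}
weights = {name: match_weight(**probs) for name, probs in levels.items()}

for name, weight in weights.items():
    print(f"{name:<14} {weight:+.3f}")
```

Levels that are more common among matches than non-matches get a positive weight, and levels that are more common among non-matches get a negative one.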

## Step 5: Predicting match weights using the trained model

In [14]:
df_predictions = linker.predict()

## Step 6: Visualising results

We can view the output table as follows:

In [15]:
df_predictions.as_pandas_dataframe().head(5)

Unnamed: 0,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,bf_first_name,surname_l,surname_r,...,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,bf_email,group_l,group_r,match_key
0,8.164401,0.996527,4,5,Grace,Grace,2,86.909413,,Kelly,...,,1.0,1.0,grace.kelly52@jones.com,grace.kelly52@jones.com,1,267.195855,1,1,0
1,-2.028731,0.196833,9,922,Evie,Evie,2,86.909413,Dean,Jones,...,,1.0,1.0,evihd56@earris-bailey.net,eviejones@brewer-sparks.org,0,0.414717,3,230,0
2,-2.028731,0.196833,14,998,Oliver,Oliver,2,86.909413,Griffiths,Bird,...,,1.0,1.0,o.griffiths90@reyes-coleman.com,oliver.b@smith.net,0,0.414717,5,250,0
3,-0.758928,0.371439,18,475,Caleb,Caleb,2,86.909413,Rwoe,Scott,...,,1.0,1.0,,c.scott@brooks.com,-1,1.0,8,119,0
4,-3.224175,0.096666,21,917,Darcy,Darcy,2,86.909413,Bernass,Rhodes,...,0.0492,0.436652,1.0,darcy.b@silva.com,drhodes16@johnson-robinson.com,0,0.414717,9,229,0
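
The `match_weight` and `match_probability` columns carry the same information on different scales: the weight is the log2 of the odds of a match. The sketch below converts one to the other and checks the result against three rows of the table above; the function name is illustrative.

```python
def probability_from_match_weight(match_weight):
    """Convert a log2 match weight (log odds) into a probability."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


# (match_weight, match_probability) copied from rows 0, 1 and 3 of the table above
rows = [(8.164401, 0.996527), (-2.028731, 0.196833), (-0.758928, 0.371439)]

for weight, reported in rows:
    computed = probability_from_match_weight(weight)
    print(f"{weight:>10.6f} -> {computed:.6f} (reported {reported})")
```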


You can also view rows in this dataset as a waterfall chart as follows:

In [17]:
from splink.charts import waterfall_chart
records_to_plot = df_predictions.as_pandas_dataframe().head(5).to_dict(orient="records")
c = waterfall_chart(records_to_plot, linker.settings_obj, filter_nulls=False)
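
A waterfall chart starts from the prior and adds the match weight contributed by each comparison until it reaches the final match weight for the pair. The arithmetic can be sketched as below; the prior and the Bayes factors are hypothetical values, not figures from the trained model.

```python
from itertools import accumulate
from math import log2

# Hypothetical prior and per-comparison Bayes factors for one record pair
prior_bayes_factor = 0.01
bayes_factors = {"first_name": 80.0, "surname": 1.0, "dob": 0.5, "email": 250.0}

# Each bar contributes log2(Bayes factor); the running total is the match weight
steps = [("prior", log2(prior_bayes_factor))] + [
    (name, log2(bf)) for name, bf in bayes_factors.items()
]
running_totals = list(accumulate(weight for _, weight in steps))

for (name, weight), total in zip(steps, running_totals):
    print(f"{name:<12} {weight:+8.3f}  running total {total:+8.3f}")

final_match_weight = running_totals[-1]
```

A Bayes factor of 1 (here `surname`) leaves the running total unchanged, which is how a null comparison appears in the chart.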

A histogram showing the distribution of match weights can be viewed as follows:

In [18]:
c = linker.match_weight_histogram(df_predictions)

If you have a sample of labels, you can output a ROC chart.  (A precision-recall chart is also available with `linker.precision_recall_from_labels`.)

Your labels need to be formatted as follows:

In [19]:
df_labels = pd.read_csv("./data/fake_1000_labels.csv")
df_labels.head(5)



| | unique_id_l | source_dataset_l | unique_id_r | source_dataset_r | clerical_match_score |
|---|---|---|---|---|---|
| 0 | 0 | fake_1000 | 1 | fake_1000 | 1.0 |
| 1 | 0 | fake_1000 | 2 | fake_1000 | 1.0 |
| 2 | 0 | fake_1000 | 3 | fake_1000 | 1.0 |
| 3 | 0 | fake_1000 | 4 | fake_1000 | 0.0 |
| 4 | 0 | fake_1000 | 5 | fake_1000 | 0.0 |


Then to produce the chart:

In [21]:
linker.con.register("labels", df_labels)

c = linker.roc_from_labels("labels")
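
For intuition, each point on a ROC curve is the false positive rate and true positive rate obtained at one match probability threshold. The sketch below computes a few such points from hypothetical scored pairs; `roc_point` and `scored_pairs` are illustrative names, not part of Splink.

```python
def roc_point(scored_pairs, threshold):
    """Return (false_positive_rate, true_positive_rate) at a threshold.

    scored_pairs holds (match_probability, clerical_match_score) tuples,
    where a clerical score of 1.0 marks a true match.
    """
    tp = sum(1 for p, label in scored_pairs if p >= threshold and label == 1.0)
    fp = sum(1 for p, label in scored_pairs if p >= threshold and label == 0.0)
    positives = sum(1 for _, label in scored_pairs if label == 1.0)
    negatives = len(scored_pairs) - positives
    return fp / negatives, tp / positives


# Hypothetical predictions joined to labels
scored_pairs = [
    (0.99, 1.0), (0.80, 1.0), (0.40, 1.0),
    (0.60, 0.0), (0.10, 0.0), (0.05, 0.0),
]

roc_curve = [roc_point(scored_pairs, t) for t in (0.9, 0.5, 0.0)]
for threshold, (fpr, tpr) in zip((0.9, 0.5, 0.0), roc_curve):
    print(f"threshold {threshold:.1f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the threshold accepts more pairs, so both rates rise; a good model reaches a high true positive rate while the false positive rate is still low.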

Create a [splink_comparison_viewer](https://www.youtube.com/watch?v=DNvCMqjipis) interactive dashboard and display it in an iframe:

In [22]:
from IPython.display import IFrame

# Write the dashboard to scv.html, then embed it in the notebook
linker.splink_comparison_viewer(df_predictions, "scv.html", True, 2)

IFrame(src="./scv.html", width=1400, height=1200)