## Deduplication quickstart

In this demo we de-duplicate a small dataset using simple settings. The aim is to demonstrate core Splink functionality succinctly, rather than to document all configuration options comprehensively.



## Step 0: Choose your backend

In `splink` version 3, you can choose the SQL backend that will perform the matching.

Currently, `splink` offers three different SQL backends:
- `duckdb` with: `from splink.duckdb.duckdb_linker import DuckDBLinker`

- `sqlite` with: `from splink.sqlite.sqlite_linker import SQLiteLinker`

- `spark` with: `from splink.spark.spark_linker import SparkLinker`

For smaller datasets (up to a few million records), `duckdb` is likely to give you the best performance.

The subsequent code is the same irrespective of the backend used.

In [20]:
from splink.duckdb.duckdb_linker import DuckDBLinker

## Step 1: Read in data
Read in a 1000-record dataset that contains duplicates.

Note that the `group` column represents the 'ground truth' - i.e. this is a labelled dataset, so we know which rows refer to the same person.  In reality, we wouldn't have this column - this is the information that Splink is trying to estimate.

In [2]:
import pandas as pd 
pd.options.display.max_rows = 1000
df = pd.read_csv("./data/fake_1000.csv")
df.head(5)

| | unique_id | first_name | surname | dob | city | email | group |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Julia | | 2015-10-29 | London | hannah88@powers.com | 0 |
| 1 | 1 | Julia | Taylor | 2015-07-31 | London | hannah88@powers.com | 0 |
| 2 | 2 | Julia | Taylor | 2016-01-27 | London | hannah88@powers.com | 0 |
| 3 | 3 | Julia | Taylor | 2015-10-29 | | hannah88opowersc@m | 0 |
| 4 | 4 | oNah | Watson | 2008-03-23 | Bolton | matthew78@ballard-mcdonald.net | 1 |


## Step 2: Profile columns

Splink can perform exploratory analysis of columns (e.g. `first_name`) or of arbitrary SQL expressions (e.g. `concat(first_name, surname)`).

This is useful for understanding your data, whether it suffers from skew, and whether additional data cleaning may be necessary.

In [22]:
# Initialise the linker, passing in the input dataset(s)
linker = DuckDBLinker(input_tables = {"fake_1000": df})

linker.profile_columns(["first_name", "city", "substr(dob, 1,4)"], top_n=10, bottom_n=5)
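
To give a feel for the kind of skew that column profiling surfaces, here is a minimal sketch in plain Python that counts value frequencies for a single column. It is not how Splink implements `profile_columns`; the `cities` list and the `value_frequencies` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical city values for illustration; not read from fake_1000.csv
cities = ["London", "London", "London", "London", "Bolton", "Leeds", None, "London"]


def value_frequencies(values):
    """Return (value, count, share) tuples for non-null values, most common first."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    return [(value, n, n / total) for value, n in counts.most_common()]


for value, n, share in value_frequencies(cities):
    print(f"{value:<8} {n:>3} {share:.2f}")
```

A column where one value takes a large share of the rows (as `London` does here) is a candidate for term frequency adjustments, because agreement on a common value is weaker evidence of a match than agreement on a rare one.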

## Step 3: Configure how Splink compares records using a `settings` dictionary

`splink` needs to know how to compare records from the input dataset:  Which columns should be compared, and how should Splink assess their similarity?

This is configured using a `settings` dictionary.  For the purposes of this simple example, we will make these comparisons simple:  

- For the `first_name` column, we will model the comparison as either:
  - an 'exact match' (e.g. `John` vs `John`)
  - similar but not exactly the same (e.g. `John` vs `Jon`).  Specifically, this will be defined as a Levenshtein distance of either 1 or 2.
  - all other comparisons 

- For all other comparisons, Splink will categorise comparisons as either an 'exact match' (e.g. `Smith` vs `Smith`), or 'anything else' (e.g. `Smith` vs `Jones`, or even `Smith` vs `Smyth`).

- For `city`, we enable term frequency comparisons because we observed significant skew in the distribution of values
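
The three-level `first_name` comparison described above can be sketched in plain Python as follows. This illustrates the idea only and is not Splink's own code, which expresses the levels in SQL. The function names, the numeric level codes, and the use of `-1` for missing values are assumptions made for this example.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def first_name_level(left, right, threshold=2):
    """Map a pair of first names to a comparison level.

    2 = exact match, 1 = within `threshold` edits, 0 = anything else,
    -1 = at least one value is missing (assumed convention for this sketch).
    """
    if left is None or right is None:
        return -1
    if left == right:
        return 2
    if levenshtein_distance(left, right) <= threshold:
        return 1
    return 0


print(first_name_level("John", "John"))  # exact match
print(first_name_level("John", "Jon"))   # one edit apart
print(first_name_level("Noah", "oNah"))  # two edits apart
print(first_name_level("John", "Mary"))  # anything else
```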

In [27]:

from splink.comparison_library import exact_match, levenshtein
settings = {
    "proportion_of_matches": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
    "em_convergence": 0.01
}

In words, this settings dictionary says:

* We are performing a deduplication task (the other options are `link_only`, or `link_and_dedupe`, which may be used if there are multiple input datasets)
* The blocking rules state that we will only check for duplicates amongst records where the `first_name`s or the `surname`s are identical.
* When comparing records, we will use information from the first_name, surname, dob, city and email columns to compute a match score.
* We have enabled term frequency adjustments for the 'city' column, because some values (e.g. `London`) appear much more frequently than others
* We will retain the group column in the results even though this is not used as part of comparisons. This is a labelled dataset and group contains the true match status, so it is interesting to retain this information so it can be compared to the Splink estimates.
* We will consider the algorithm to have converged when no parameter changes by more than 0.01 between iterations.
* To ensure the notebook runs quickly, we will stop iterations at 10.
* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` for the purposes of the demo, because this means the output datasets contain additional information that, whilst not strictly needed by Splink, helps the user understand the calculations. If these were not included in the settings dictionary, they would be set to `False` (their default value).
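
To make the effect of the two blocking rules concrete, the sketch below applies them to the first five rows of the sample data using plain Python. Splink generates these pairs with SQL joins rather than a nested loop; the `same_non_null` and `candidate_pairs` helpers are hypothetical names used only here, and the brute-force enumeration is only sensible for a handful of rows.

```python
from itertools import combinations

# First five rows of the sample data shown earlier (None marks a missing value)
records = [
    {"unique_id": 0, "first_name": "Julia", "surname": None},
    {"unique_id": 1, "first_name": "Julia", "surname": "Taylor"},
    {"unique_id": 2, "first_name": "Julia", "surname": "Taylor"},
    {"unique_id": 3, "first_name": "Julia", "surname": "Taylor"},
    {"unique_id": 4, "first_name": "oNah", "surname": "Watson"},
]


def same_non_null(left, right, column):
    """SQL-style equality: a null never equals anything, including another null."""
    return left[column] is not None and left[column] == right[column]


def candidate_pairs(rows):
    """Keep the pairs that satisfy at least one of the two blocking rules."""
    pairs = []
    for left, right in combinations(rows, 2):
        if same_non_null(left, right, "first_name") or same_non_null(left, right, "surname"):
            pairs.append((left["unique_id"], right["unique_id"]))
    return pairs


print(candidate_pairs(records))
```

Of the ten possible pairs among these five rows, only the six pairs among records 0 to 3 survive blocking. Record 4 is never compared with anything because neither its first name nor its surname matches another row exactly.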

## Step 4: Estimate the parameters of your model

Estimate the parameters of a Fellegi-Sunter model, and use the model to generate predictions.

We start by using `train_u_using_random_sampling` to compute the `u` values of the model.




In [28]:
linker.initialise_settings(settings)
linker.train_u_using_random_sampling(target_rows=1e6)

Iteration 0: Largest change in params was 0.26 in the {m_u} of {level_text}
Iteration 1: Largest change in params was -0.00989 in the {m_u} of {level_text}
EM converged after 1 iterations
Proportion of matches not fully trained, current estimates are [0.0014515057140222377, 0.12937934847213228, 0.0015196874913106597, 0.13881766380537874, 0.0010375306341629182]
Iteration 0: Largest change in params was 0.5 in proportion_of_matches
Iteration 1: Largest change in params was 0.163 in proportion_of_matches
Iteration 2: Largest change in params was 0.0478 in proportion_of_matches
Iteration 3: Largest change in params was 0.0194 in proportion_of_matches
Iteration 4: Largest change in params was 0.00994 in proportion_of_matches
EM converged after 4 iterations
Proportion of matches can now be estimated, estimates are [0.004420132662727674, 0.12937934847213228, 0.004627113992291026, 0.13881766380537874, 0.0031621748124566634, 0.0953550335258382]
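
The intuition behind estimating `u` from random pairs can be sketched as follows. A `u` value is the probability of observing a given comparison level amongst records that do not match, and when true duplicates are rare almost every randomly chosen pair is a non-match. The example below is a simplified illustration with made-up data, not Splink's implementation: it enumerates every pair of a tiny column rather than sampling, and the helper name `estimate_u_exact_match` is hypothetical.

```python
from itertools import combinations

# Hypothetical city column for illustration; not read from fake_1000.csv
cities = ["London", "London", "London", "Bolton", "Leeds", "Leeds"]


def estimate_u_exact_match(values):
    """Share of record pairs whose values agree exactly.

    When few pairs are true matches, this share approximates the u
    probability for the 'exact match' level of the column.
    """
    pairs = list(combinations(values, 2))
    agreeing = sum(1 for left, right in pairs if left == right)
    return agreeing / len(pairs)


u_city = estimate_u_exact_match(cities)
print(f"Estimated u for an exact match on city: {u_city:.3f}")
```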


We then use the expectation maximisation algorithm to train the `m` values.

Note that in this first EM training session we block on `first_name` and `surname`, meaning that all comparisons will have `first_name` and `surname` exactly equal.   This means that, in this training session, we cannot estimate parameters for the `first_name` or `surname` columns, as seen in their absence from the match weights chart.

In [31]:
blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
training_session_names = linker.train_m_using_expectation_maximisation(blocking_rule)


training_session_names.match_weights_interactive_history_chart()

Iteration 0: Largest change in params was 0.0409 in the {m_u} of {level_text}
Iteration 1: Largest change in params was -0.000963 in proportion_of_matches
EM converged after 1 iterations
Proportion of matches can now be estimated, estimates are [0.004442707489851205, 0.13581817638614724, 0.004650740991533665, 0.1456476290097235, 0.0031783454152726678, 0.10029563904936951, 0.10998945767101086, 0.04699190136047462, 0.03869678029841706]
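
For readers who want to see what an expectation maximisation loop of this kind looks like, here is a minimal sketch for a Fellegi-Sunter model with two-level (agree or disagree) comparisons. It is a simplified illustration under stated assumptions, not Splink's implementation: the `u` values are held fixed, every column has only two levels, the toy `gammas` data is invented, and `train_m_with_em` is a hypothetical name. The `max_iterations` and `em_convergence` arguments play the same role as the settings of the same name.

```python
def train_m_with_em(gammas, m, u, lam, max_iterations=10, em_convergence=0.01):
    """Estimate m probabilities and the match proportion with EM, holding u fixed.

    gammas: list of tuples of 0/1 flags, one flag per comparison column
            (1 = the pair agrees on that column).
    m, u:   lists of P(agree | match) and P(agree | non-match) per column.
    lam:    starting guess for the proportion of pairs that are matches.
    """
    n_cols = len(m)
    for _ in range(max_iterations):
        # E-step: probability that each pair is a match under the current parameters
        match_probs = []
        for gamma in gammas:
            p_match, p_non_match = lam, 1 - lam
            for k in range(n_cols):
                p_match *= m[k] if gamma[k] else 1 - m[k]
                p_non_match *= u[k] if gamma[k] else 1 - u[k]
            match_probs.append(p_match / (p_match + p_non_match))

        # M-step: re-estimate the parameters from the expected match counts
        expected_matches = sum(match_probs)
        new_lam = expected_matches / len(gammas)
        new_m = [
            sum(p * gamma[k] for p, gamma in zip(match_probs, gammas)) / expected_matches
            for k in range(n_cols)
        ]

        # Stop once no parameter moves by more than the convergence threshold
        largest_change = max(
            [abs(new_lam - lam)] + [abs(a - b) for a, b in zip(new_m, m)]
        )
        m, lam = new_m, new_lam
        if largest_change < em_convergence:
            break
    return m, lam


# Toy comparison vectors for two columns (hypothetical data)
gammas = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 88
m_estimates, match_proportion = train_m_with_em(
    gammas, m=[0.9, 0.9], u=[0.05, 0.05], lam=0.1
)
print(m_estimates, match_proportion)
```

Blocking on a column forces every pair in the training data to agree on it, so that column's flag would be 1 everywhere and carry no information. This is why a column used in the blocking rule cannot be trained in that session.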


In a second training session, we block on `dob`.  This allows us to estimate parameters for the `first_name` and `surname` comparisons.

Between the two training sessions, we now have parameter estimates for all comparisons.

In [34]:
blocking_rule = "l.dob = r.dob"
training_session_dob = linker.train_m_using_expectation_maximisation(blocking_rule)
training_session_dob.match_weights_interactive_history_chart()

Iteration 0: Largest change in params was 0.00621 in proportion_of_matches
EM converged after 0 iterations
Proportion of matches can now be estimated, estimates are [0.00446197393923154, 0.13581817638614724, 0.00467090537580908, 0.1456476290097235, 0.0031921463549355413, 0.10029563904936951, 0.10998945767101086, 0.047186942771483845, 0.03885879638580187, 0.10813694032383643, 0.10918772756914302, 0.11291361966040615]


The final match weights can be viewed in the match weights chart:

In [35]:
linker.settings_obj.match_weights_chart()

## Step 5: Predicting match weights using the trained model

In [36]:
df_e = linker.predict()

## Step 6: Visualising results

In [37]:
df_e.as_pandas_dataframe().head(5)

| | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | bf_first_name | surname_l | surname_r | ... | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | bf_email | group_l | group_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.778874 | 0.997729 | 0 | 3 | Julia | Julia | 2 | 76.389623 | | Taylor | ... | | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 0 | 0.304757 | 0 | 0 | 0 |
| 1 | 6.130421 | 0.985926 | 1 | 3 | Julia | Julia | 2 | 76.389623 | Taylor | Taylor | ... | | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 0 | 0.304757 | 0 | 0 | 0 |
| 2 | 6.130421 | 0.985926 | 2 | 3 | Julia | Julia | 2 | 76.389623 | Taylor | Taylor | ... | | 1.0 | 1.0 | hannah88@powers.com | hannah88opowersc@m | 0 | 0.304757 | 0 | 0 | 0 |
| 3 | -2.324511 | 0.166418 | 5 | 737 | Noah | Noah | 2 | 76.389623 | Watson | Walker | ... | 0.01459 | 0.351199 | 1.0 | matthew78@ballard-mcdonald.net | jonohan52@eaton.arg | 0 | 0.304757 | 1 | 127 | 0 |
| 4 | -2.324511 | 0.166418 | 7 | 737 | Noah | Noah | 2 | 76.389623 | Watson | Walker | ... | 0.01459 | 0.351199 | 1.0 | matthew78@ballard-mcdonald.net | jonohan52@eaton.arg | 0 | 0.304757 | 1 | 127 | 0 |
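
The `match_weight` and `match_probability` columns in the output above are consistent with the match weight being the base-2 logarithm of the odds of a match. The short sketch below shows that conversion in both directions. The function names are hypothetical and are not part of the Splink API.

```python
import math


def match_weight_to_probability(match_weight):
    """Treat the match weight as log2 of the odds, then convert odds to a probability."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


def probability_to_match_weight(probability):
    """Inverse conversion: probability to odds to log2 of the odds."""
    return math.log2(probability / (1 - probability))


# Match weights copied from the first, second and fourth rows of the output above
for weight in (8.778874, 6.130421, -2.324511):
    print(weight, round(match_weight_to_probability(weight), 6))
```

A match weight of 0 corresponds to a probability of 0.5, and each additional unit of match weight doubles the odds of a match.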


You can also view rows in this dataset as a waterfall chart as follows:

In [9]:
from splink.charts import waterfall_chart
records_to_plot = df_e.as_pandas_dataframe().head(5).to_dict(orient="records")
waterfall_chart(records_to_plot, linker.settings_obj, filter_nulls=False)