**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [Record Linkage](#toc2_)    
  - [Preprocessing](#toc2_1_)    
  - [Indexing](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pyreadr
import datetime as pydt
import missingno as msno
from thefuzz import fuzz, process

## <a id='toc2_'></a>[Record Linkage](#toc0_)

The term record linkage is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Record linkage is used to link data from multiple data sources or to find duplicates in a single data source. In computer science, record linkage is also known as data matching, data linking, entity resolution, or field matching.

Dataset A

Name     |   Age    |   Address
---------|----------|---------------
Alice    |   28     |   123 Main St
Bob      |   32     |   456 Elm St
Charlie  |   22     |   789 Oak St

Dataset B


person_name  |   age_years  |   street_address
-------------|--------------|--------------------
Allan    |   28     |   123 Main Street
Brian    |   30     |   457 Elm Street
Dave     |   22     |   789 Oak St


A close look at the two datasets shows that the column names differ (`Name` vs. `person_name`, `Age` vs. `age_years`, `Address` vs. `street_address`). The names differ in every row, but the age and address values in the first and third rows are practically the same: the only difference is `St` versus `Street` in the first row's address. Such pairs are potential matches, or duplicates, under a chosen similarity criterion.

If we merged these two datasets, we would not want duplicate entries in the new dataset. In such scenarios `pd.merge()` is not enough, because it joins rows only when the key values are exactly equal. Record linkage techniques identify the potential matches first, so that the datasets can then be merged.
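To illustrate the limitation, the following minimal sketch rebuilds the two example tables as DataFrames and joins them on the address columns with `pd.merge()`. The variable names `df_a`, `df_b` and `exact` are chosen here for illustration only.

```python
import pandas as pd

# The two example datasets from the tables above
df_a = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [28, 32, 22],
    "Address": ["123 Main St", "456 Elm St", "789 Oak St"],
})
df_b = pd.DataFrame({
    "person_name": ["Allan", "Brian", "Dave"],
    "age_years": [28, 30, 22],
    "street_address": ["123 Main Street", "457 Elm Street", "789 Oak St"],
})

# An inner join keeps only rows whose address strings are exactly equal
exact = pd.merge(
    df_a, df_b,
    left_on="Address", right_on="street_address",
    how="inner",
)
print(exact[["Name", "person_name", "Address"]])
```

Only the Charlie/Dave row survives the join. The Alice/Allan pair is lost because `123 Main St` and `123 Main Street` are not exactly equal strings.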

The record linkage procedure can be represented as a workflow. The steps are:
 
0. cleaning 
1. indexing 
2. comparing 
3. classifying 
4. evaluation

If needed, the classified record pairs flow back to improve the previous step.

The `recordlinkage` package in Python provides a simple interface to link records in or between data sources and follows the above workflow. 
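The workflow steps can also be sketched without the package, which may help to see what each step does. The sketch below uses only the Python standard library on the two example datasets. The similarity measure (`difflib.SequenceMatcher`), the threshold of 0.8 and the classification rule are assumptions made for illustration, not the method used by `recordlinkage`.

```python
from difflib import SequenceMatcher
from itertools import product

# Example records from the tables above: (name, age, address)
records_a = [("Alice", 28, "123 Main St"),
             ("Bob", 32, "456 Elm St"),
             ("Charlie", 22, "789 Oak St")]
records_b = [("Allan", 28, "123 Main Street"),
             ("Brian", 30, "457 Elm Street"),
             ("Dave", 22, "789 Oak St")]


def clean(text):
    """Step 0 (cleaning): lower-case the text and collapse whitespace."""
    return " ".join(text.lower().split())


# Step 1 (indexing): pair every record in A with every record in B
candidate_pairs = list(product(range(len(records_a)), range(len(records_b))))


def compare(rec_a, rec_b):
    """Step 2 (comparing): return (ages equal?, address similarity in [0, 1])."""
    same_age = rec_a[1] == rec_b[1]
    address_similarity = SequenceMatcher(
        None, clean(rec_a[2]), clean(rec_b[2])
    ).ratio()
    return same_age, address_similarity


# Step 3 (classifying): assumed rule, equal age and address similarity >= 0.8
ADDRESS_THRESHOLD = 0.8
matches = []
for i, j in candidate_pairs:
    same_age, address_similarity = compare(records_a[i], records_b[j])
    if same_age and address_similarity >= ADDRESS_THRESHOLD:
        matches.append((i, j))

# Step 4 (evaluation): compare with the pairs the text identifies by eye
expected = {(0, 0), (2, 2)}
true_positives = len(expected & set(matches))
precision = true_positives / len(matches) if matches else 0.0
recall = true_positives / len(expected)
print(matches, precision, recall)
```

With this rule the first and third rows of each dataset are linked, which agrees with the pairs spotted by eye. On real data the threshold and the compared columns would need tuning, which is where the feedback loop from classification to the earlier steps comes in.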

### <a id='toc2_1_'></a>[Preprocessing](#toc0_)

Preprocessing data, such as cleaning and standardising, may increase record linkage accuracy. Pandas is a very good package for data cleaning and preprocessing. The `recordlinkage` package also provides some preprocessing functions, such as `recordlinkage.preprocessing.clean()` and `recordlinkage.preprocessing.phonenumbers()`. See the [documentation](https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html) for more information.
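As a rough illustration of standardising with plain pandas string methods (not the `recordlinkage` functions), the sketch below normalises address strings. The helper name `standardise_address` and the rule that expands `st` to `street` are assumptions for this example.

```python
import pandas as pd


def standardise_address(addresses: pd.Series) -> pd.Series:
    """Lower-case, strip, drop punctuation and expand the 'st' abbreviation."""
    cleaned = addresses.str.lower().str.strip()
    # Remove anything that is not a word character or whitespace
    cleaned = cleaned.str.replace(r"[^\w\s]", "", regex=True)
    # Assumed rule: a standalone 'st' means 'street'
    cleaned = cleaned.str.replace(r"\bst\b", "street", regex=True)
    # Collapse runs of whitespace into a single space
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True)
    return cleaned


raw = pd.Series(["123 Main St", " 123 Main Street ", "789 Oak St."])
print(standardise_address(raw).tolist())
```

After standardising, the first two addresses become identical strings, so even an exact comparison would link them. A blanket `st` to `street` rule would be wrong for values such as "St Mary's Road", so real rules need more care.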

### <a id='toc2_2_'></a>[Indexing](#toc0_)

The indexing module is used to make pairs of records. These pairs are called candidate links or candidate matches. Several indexing algorithms are available, such as full, blocking and sorted neighborhood indexing. See the [documentation](https://recordlinkage.readthedocs.io/en/latest/ref-index.html) for detailed information on the implementation of each indexing algorithm.

Before jumping into coding, let's first familiarize ourselves with some of the relevant terminology.

*`Record pair:`* A record pair refers to a pair of records or data points from two different datasets that are considered potential matches or duplicates based on certain similarity criteria. 

Record linkage or entity resolution tasks often involve identifying record pairs that represent the same entity in different datasets, despite variations or discrepancies in the data.

*`Full record space:`* A full record space is the set of all possible record pairs between two datasets.

**For example**, to find all possible record pairs between dataset A and dataset B, we pair each record in dataset A with every record in dataset B, i.e. we take the Cartesian product of the two sets of records. With 3 records in each dataset, this results in 3 × 3 = 9 record pairs.

The number of record pairs in a full record space is given by $$\text{No. of records in dataset A} \times \text{No. of records in dataset B}$$

The size of the full record space does not depend on the number of columns, only on the number of records in each dataset. Comparing every pair ensures that no potential match is skipped, but for large datasets it is computationally expensive: two datasets of 1,000 rows each generate 1 million record pairs.
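The growth of the full record space can be checked with a small pandas-only sketch, independent of the `recordlinkage` indexers. The helper name `full_index` and the level names `a_idx` and `b_idx` are assumptions for illustration.

```python
import pandas as pd


def full_index(n_a: int, n_b: int) -> pd.MultiIndex:
    """Return every (row in A, row in B) pair as a two-level MultiIndex."""
    return pd.MultiIndex.from_product(
        [range(n_a), range(n_b)], names=["a_idx", "b_idx"]
    )


small = full_index(3, 3)        # the two example datasets
large = full_index(1000, 1000)  # two datasets of 1,000 rows each

print(len(small), len(large))
```

Each entry of the MultiIndex is one candidate record pair, so its length equals the product of the two record counts.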

*`Blocking:`* Blocking is a common technique used in record linkage to reduce the number of record pairs that need to be compared. It involves grouping records into blocks or buckets based on some common attribute(s) and certain criteria, to limit the number of potential comparisons. This is somewhat similar to groupby in pandas.

**For example**, when blocking is applied to the "Address" column (with a partial-matching constraint), the process would look like this:

Block 1

Name     |   Age    |   Address
---------|----------|----------------------
Alice    |   28     |   123 Main St
Allan    |   28     |   123 Main Street

Block 2

Name     |   Age    |   Address
---------|----------|----------------------
Bob      |   32     |   456 Elm St
Brian    |   30     |   457 Elm Street

Block 3

Name     |   Age    |   Address
---------|----------|----------------------
Charlie  |   22     |   789 Oak St
Dave     |   22     |   789 Oak St

Now, instead of comparing every record in Dataset A with every record in Dataset B, you compare only records within the same block. This reduces the number of candidate record pairs from 9 to 3, making the record linkage process more efficient.
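The blocks above can be reproduced with a short pandas sketch. It assumes a simple blocking key, the lower-cased street name taken as the second token of the address; the helper `street_key` and the column names `a_idx`, `b_idx` and `block` are hypothetical. This is one possible partial-matching rule, not the one `recordlinkage` applies.

```python
import pandas as pd

df_a = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [28, 32, 22],
    "Address": ["123 Main St", "456 Elm St", "789 Oak St"],
})
df_b = pd.DataFrame({
    "person_name": ["Allan", "Brian", "Dave"],
    "age_years": [28, 30, 22],
    "street_address": ["123 Main Street", "457 Elm Street", "789 Oak St"],
})


def street_key(addresses: pd.Series) -> pd.Series:
    """Assumed blocking key: the lower-cased street name (second token)."""
    return addresses.str.lower().str.split().str[1]


# Keep the original row labels next to each record's blocking key
left = pd.DataFrame({"a_idx": df_a.index, "block": street_key(df_a["Address"])})
right = pd.DataFrame({"b_idx": df_b.index, "block": street_key(df_b["street_address"])})

# Records are paired only when they share a blocking key
candidates = left.merge(right, on="block")
print(candidates)
```

The join yields one candidate pair per block (`main`, `elm`, `oak`) instead of the nine pairs of the full record space. Whether a candidate pair is a real match is still decided in the comparing and classifying steps.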

**Note:** Choose blocking criteria that are likely to place true matches in the same block. Good criteria keep the number of comparisons small without separating records that belong to the same entity.