# How to Perform Fuzzy Dataframe Row Matching With RecordLinkage
## An elite skill for the hardest of problems
<img src='images/chain.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@joey-kyber-31917?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Joey Kyber</a>
        on 
        <a href='https://www.pexels.com/photo/sea-nature-sunset-water-119562/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

In one of my previous articles, I wrote about using the `fuzzywuzzy` package to clean text data with string similarity. Learning the package and applying it in practice was really awesome. But wouldn't it be even better if we could perform the same process between rows of dataframes?

Actually, the first question should be: why would we even need it? Today, data is rarely collected in one place; it comes from several sources. A common challenge is converting all those pieces into the same format so that, once merged, they work smoothly with data manipulation tools such as SQL or `pandas`.

But it is just not always possible. Consider these two fake tables:
<img src='images/1.png'></img>

Assume they are schedules for NBA games that were scraped from different sites. If we merge them, the result will contain duplicates: the rows are not exact copies of each other, but they are fuzzy duplicates:
<img src='images/3.png'></img>
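
To make the problem concrete, here is a minimal sketch using only the standard library. The game strings are hypothetical and invented for illustration (they are not the tables in the images), and `difflib.SequenceMatcher` stands in for whatever similarity measure you prefer. It shows that exact deduplication keeps both spellings of the same game, even though the two strings are almost identical:

```python
from difflib import SequenceMatcher

# Hypothetical rows, invented for illustration (not the tables shown above).
site_a = ["Houston Rockets vs Chicago Bulls", "Miami Heat vs Boston Celtics"]
site_b = ["Houston Rockets v Chicago Bulls", "Miami Heat vs Boston Celtics"]

# Exact deduplication removes only the identical row.
exact_dedup = sorted(set(site_a + site_b))
print(len(exact_dedup))  # 3 rows survive, two of which describe the same game


def similarity(left: str, right: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, left, right).ratio()


# The two spellings of the same game are not equal, but they are very similar.
fuzzy_score = similarity(site_a[0], site_b[0])
print(site_a[0] == site_b[0], fuzzy_score > 0.9)  # False True
```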

To get the merge working, you would have to perform serious data cleaning. However, a dataset like this could easily have thousands of rows, and you would not be able to find all the edge cases by hand.

Real-world cases will be much more complex. Fuzzy row matching helps to remove duplicates and brings consistency to your data.

With that goal in mind, let me introduce you to the `recordlinkage` package. It provides all the tools needed for record linkage and deduplication. In the next sections, we will work through case studies of record linkage and build a solid foundation for your future data cleaning projects.

### Setup and Installation <small id='setup'></small>

`recordlinkage` can be installed using `pip`:

```pip install recordlinkage```

To use it, import it alongside `pandas`:

In [25]:
# Load necessary packages
import pandas as pd
import recordlinkage as rl
import time

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Record Linkage, Indexing <small id='example1'></small>

For the next case studies, we will load one of the built-in datasets of `recordlinkage` to showcase its powers:

In [2]:
from recordlinkage.datasets import load_febrl4

census_a, census_b = load_febrl4()

The two datasets above contain census data generated by the [Febrl](https://sourceforge.net/projects/febrl/) project. The data is split into two datasets of 5,000 rows each, which makes it well suited to record linkage.

For easy illustration, I will just take a random sample from both datasets:

In [76]:
rand_a = census_a.sample(5)
rand_b = census_b.sample(5)

In [77]:
rand_a

| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-4916-org | india | nicoll | 432 | buckmaster crescent | kurrajong | moree | 4812 | qld | 19111101 | 5165691 |
| rec-2323-org | rachael | yallop | 15 | fullagar crescent | rsde 668 | beenleigh | 4860 | qld | 19760214 | 1296128 |
| rec-4544-org | kiera | everett | 4 | rohan rivett crescent | brindabella specialist centre | alice springs | 5558 | nsw | 19371021 | 2315324 |
| rec-1632-org | jasmine | stancombe | 41 | roseby street |  | dulwich hill | 2086 | wa | 19980319 | 8947188 |
| rec-4449-org | olivia | boyle | 37 | chauncy crescent | cygnet river schoolhouse | plumpton | 2460 | nsw | 19581017 | 8738461 |


In [78]:
rand_b

| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-917-dup-0 | carly | georgetti | 155 | lockyer street | langi | rowviille | 2087 | qld | 19740602.0 | 7008776 |
| rec-4618-dup-0 | chloe | musoilno | 4 | warrumbmul street | gordoivale | brighton | 3728 | ni |  | 7147784 |
| rec-3700-dup-0 | steven | pokkias | 407 | may maxwell crescent | ardonachie | dalveen | 5046 | wa | 19350305.0 | 2928326 |
| rec-309-dup-0 | eliza | campbell | 113 | dines place | villa 74 village glen | figtree | 3550 | nsw | 19651205.0 | 4509107 |
| rec-85-dup-0 | dakin | joselyn | 57 | aberneth ystreet | kanangra hostel | manundza | 3028 | wa | 19261205.0 | 4852063 |


Assume we want to link the records of the two datasets without introducing duplicates. To start, we have to generate pairs of possible matches. Since we cannot know in advance which rows match, we have to take all the possible pairs. The pairs are generated from the indexes of the two datasets, which is why this step is called indexing. The `recordlinkage` package makes this process very easy:

In [79]:
# Create an indexing object
indexer = rl.Index()

We started by creating an indexing object. Next, we specify how the pairs should be generated. Since we need all the possible combinations of indexes, we call the `.full()` method on the indexing object:

In [83]:
# Set the mode of generation to full
indexer.full()



<Index>

Next, we pass in the datasets to generate the pairs, also called candidates, and assign the result to a new variable:

In [84]:
pairs = indexer.index(rand_a, rand_b)
pairs

MultiIndex([('rec-4916-org',  'rec-917-dup-0'),
            ('rec-4916-org', 'rec-4618-dup-0'),
            ('rec-4916-org', 'rec-3700-dup-0'),
            ('rec-4916-org',  'rec-309-dup-0'),
            ('rec-4916-org',   'rec-85-dup-0'),
            ('rec-2323-org',  'rec-917-dup-0'),
            ('rec-2323-org', 'rec-4618-dup-0'),
            ('rec-2323-org', 'rec-3700-dup-0'),
            ('rec-2323-org',  'rec-309-dup-0'),
            ('rec-2323-org',   'rec-85-dup-0'),
            ('rec-4544-org',  'rec-917-dup-0'),
            ('rec-4544-org', 'rec-4618-dup-0'),
            ('rec-4544-org', 'rec-3700-dup-0'),
            ('rec-4544-org',  'rec-309-dup-0'),
            ('rec-4544-org',   'rec-85-dup-0'),
            ('rec-1632-org',  'rec-917-dup-0'),
            ('rec-1632-org', 'rec-4618-dup-0'),
            ('rec-1632-org', 'rec-3700-dup-0'),
            ('rec-1632-org',  'rec-309-dup-0'),
            ('rec-1632-org',   'rec-85-dup-0'),
            ('rec-4449-org',  'rec-917-d

The result is a `pandas.MultiIndex` object. The first level contains the indexes from the first dataset, and the second level contains the indexes from the second dataset.

The length of the resulting `MultiIndex` will always be the product of the lengths of the two datasets. For our 5-row datasets, each index from the first table is paired with all 5 indexes from the second, giving 25 pairs:
<img src='images/4.png'></img>
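
Conceptually, full indexing is just a Cartesian product of the two indexes. The sketch below reproduces that idea with `itertools.product` from the standard library, using the record ids of the two samples above. It only illustrates the idea and is not how `recordlinkage` is implemented internally:

```python
from itertools import product

# Record ids copied from the two 5-row samples shown earlier.
ids_a = ["rec-4916-org", "rec-2323-org", "rec-4544-org",
         "rec-1632-org", "rec-4449-org"]
ids_b = ["rec-917-dup-0", "rec-4618-dup-0", "rec-3700-dup-0",
         "rec-309-dup-0", "rec-85-dup-0"]

# Full indexing pairs every id on the left with every id on the right.
full_pairs = list(product(ids_a, ids_b))

print(len(full_pairs))  # 5 * 5 = 25
print(full_pairs[0])    # ('rec-4916-org', 'rec-917-dup-0')

# The same arithmetic for the complete 5,000-row datasets.
print(5_000 * 5_000)    # 25000000 candidate pairs
```

The last line shows why this approach does not scale: the number of candidates grows with the product of the table sizes.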

However, if our datasets are large, generating all the possible pairs is computationally very expensive. To avoid that, we should choose a column whose values are consistent across both datasets. For our small datasets, there is the `state` column:
```
>>> rand_a[['state']], rand_b[['state']]
```
<img src='images/5.png'></img>

If you look closely, the unique values of `state` are consistent across both datasets, meaning a state is not spelled differently in the other table. This helps a lot because we can now exclude all the pairs that do not have a matching `state` value. To do this with `recordlinkage`, we switch from `.full()` to `.block()`:

In [91]:
# From scratch
indexer = rl.Index()
# Set the mode to blocking with `state`
indexer.block('state')
# Generate pairs
pairs = indexer.index(rand_a, rand_b)
pairs

MultiIndex([('rec-4916-org',  'rec-917-dup-0'),
            ('rec-2323-org',  'rec-917-dup-0'),
            ('rec-4544-org',  'rec-309-dup-0'),
            ('rec-4449-org',  'rec-309-dup-0'),
            ('rec-1632-org', 'rec-3700-dup-0'),
            ('rec-1632-org',   'rec-85-dup-0')],
           names=['rec_id_1', 'rec_id_2'])

> Remember, the logic behind blocking on a column is that we expect true duplicates to have the same value in that column in both datasets. If two rows do not match on that column, we can exclude the pair.
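
The blocking logic can be sketched in plain Python: group one table by the blocking key, then pair each row of the other table only with rows from the same group. The `(rec_id, state)` values below are copied from the two samples above. The grouping approach is an illustration of the idea, not a description of the internals of `recordlinkage`:

```python
from collections import defaultdict

# (rec_id, state) values copied from the two samples shown earlier.
rows_a = [("rec-4916-org", "qld"), ("rec-2323-org", "qld"),
          ("rec-4544-org", "nsw"), ("rec-1632-org", "wa"),
          ("rec-4449-org", "nsw")]
rows_b = [("rec-917-dup-0", "qld"), ("rec-4618-dup-0", "ni"),
          ("rec-3700-dup-0", "wa"), ("rec-309-dup-0", "nsw"),
          ("rec-85-dup-0", "wa")]

# Group the right-hand ids by their blocking key.
ids_by_state = defaultdict(list)
for rec_id, state in rows_b:
    ids_by_state[state].append(rec_id)

# Pair each left-hand id only with right-hand ids from the same block.
blocked_pairs = [
    (id_a, id_b)
    for id_a, state in rows_a
    for id_b in ids_by_state.get(state, [])
]

print(len(blocked_pairs))  # 6, down from 25 with full indexing
```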

As you can see, the number of pairs dropped significantly, from 25 to 6. These index pairs are exactly the ones that share the same value for `state`. Let's check some of the pairs:

In [94]:
rand_a.loc['rec-4916-org']['state'], rand_b.loc['rec-917-dup-0']['state']

('qld', 'qld')

In [97]:
rand_a.loc['rec-4449-org']['state'], rand_b.loc['rec-309-dup-0']['state']

('nsw', 'nsw')

Using these index pairs, we can go on to check whether any of them are actual duplicates.

If you block on a consistent common column, the number of pairs will be much smaller. We can even block on multiple columns, as long as the unique values of those columns are consistent in both tables.
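
To see how a second blocking column shrinks the candidate set further, here is a small standard-library sketch. The helper `block_pairs` and the records are hypothetical and invented for illustration. In `recordlinkage` itself, the equivalent should be passing a list of column names to `.block()`, but check the documentation for your version:

```python
from collections import defaultdict


def block_pairs(left, right, keys):
    """Pair record ids whose values agree on every column in `keys`."""
    buckets = defaultdict(list)
    for rec_id, row in right.items():
        buckets[tuple(row[k] for k in keys)].append(rec_id)
    return [
        (left_id, right_id)
        for left_id, row in left.items()
        for right_id in buckets.get(tuple(row[k] for k in keys), [])
    ]


# Hypothetical records, invented for illustration.
left = {
    "a1": {"state": "nsw", "postcode": "2000"},
    "a2": {"state": "nsw", "postcode": "2086"},
    "a3": {"state": "qld", "postcode": "4812"},
}
right = {
    "b1": {"state": "nsw", "postcode": "2000"},
    "b2": {"state": "nsw", "postcode": "2086"},
    "b3": {"state": "qld", "postcode": "4812"},
}

one_key = block_pairs(left, right, ["state"])
two_keys = block_pairs(left, right, ["state", "postcode"])
print(len(one_key), len(two_keys))  # 5 3
```

The trade-off is that every extra blocking column is another place where a typo can hide a true match, which is why the columns need to be consistent in both tables.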