# How to Perform Fuzzy Dataframe Row Matching With RecordLinkage
## An elite skill for the hardest of problems
<img src='images/chain.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@joey-kyber-31917?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Joey Kyber</a>
        on 
        <a href='https://www.pexels.com/photo/sea-nature-sunset-water-119562/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

In one of my previous articles, I wrote about how to use string similarity to clean text data with the `fuzzywuzzy` package. Learning about the package and using it in practice was really awesome. But wouldn't it be even greater if we could apply the same process to rows of dataframes?

Actually, the first question should be: why would we even need it? Today, data is rarely collected in one place; it comes from several locations. A common challenge is to convert all the little pieces of data into the same format so that, when you merge them, they work smoothly with data manipulation tools such as SQL or `pandas`.

But it is just not always possible. Consider these two fake tables:
<img src='images/1.png'></img>

Assume they are schedules for NBA games and they were scraped from different sites. If we merge them as they are, the result will contain duplicates: the rows are not exact copies of each other, but they are fuzzy duplicates:
<img src='images/3.png'></img>

To merge them you would have to perform serious data cleaning operations to get the merge working. However, this dataset could have easily been thousands of rows and you would not be able to find all the edge cases. 

Real-world cases will be much more complex. Fuzzy row matching helps to remove duplicates and introduces consistency to your data. 
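To make the problem concrete, here is a minimal sketch with two small schedule tables. The team names and dates are invented for illustration and are not the values from the images above. It shows that exact deduplication in `pandas` cannot catch a fuzzy duplicate:

```python
import pandas as pd

# Hypothetical schedules scraped from two sites (all values invented)
site_a = pd.DataFrame({
    "home": ["Lakers", "Celtics"],
    "away": ["Bulls", "Heat"],
    "date": ["2021-01-10", "2021-01-12"],
})
site_b = pd.DataFrame({
    "home": ["LA Lakers", "Warriors"],
    "away": ["Chicago Bulls", "Nets"],
    "date": ["2021-01-10", "2021-01-15"],
})

# Stack the tables and drop rows that are exact copies of each other
combined = pd.concat([site_a, site_b], ignore_index=True).drop_duplicates()

# The game on 2021-01-10 is still listed twice because the strings differ
print(combined)
```

`drop_duplicates` only removes rows that are identical in every column, so the two spellings of the same game both survive.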

With that goal in mind, let me introduce you to the `recordlinkage` package. It provides all the tools needed for record linkage and deduplication. In the next sections, we will work through case studies of record linkage and build a solid foundation for your future data cleaning projects.

### Setup and Installation <small id='setup'></small>

`recordlinkage` can be installed using `pip`:

```pip install recordlinkage```

To follow along, import it together with `pandas`:

In [25]:
# Load necessary packages
import pandas as pd
import recordlinkage as rl
import time

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Record Linkage, Indexing <small id='indexing'></small>

For the next examples, we will load one of the built-in datasets of `recordlinkage` to showcase its powers:

In [2]:
from recordlinkage.datasets import load_febrl4

census_a, census_b = load_febrl4()

The two datasets above contain census data generated by the [Febrl](https://sourceforge.net/projects/febrl/) project. The data was divided into two tables of 5k rows each, and both are well suited to practicing record linkage.

For easy illustration, I will just take a random sample from both datasets:

In [76]:
rand_a = census_a.sample(5)
rand_b = census_b.sample(5)

In [77]:
rand_a

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-4916-org,india,nicoll,432,buckmaster crescent,kurrajong,moree,4812,qld,19111101,5165691
rec-2323-org,rachael,yallop,15,fullagar crescent,rsde 668,beenleigh,4860,qld,19760214,1296128
rec-4544-org,kiera,everett,4,rohan rivett crescent,brindabella specialist centre,alice springs,5558,nsw,19371021,2315324
rec-1632-org,jasmine,stancombe,41,roseby street,,dulwich hill,2086,wa,19980319,8947188
rec-4449-org,olivia,boyle,37,chauncy crescent,cygnet river schoolhouse,plumpton,2460,nsw,19581017,8738461


In [78]:
rand_b

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-917-dup-0,carly,georgetti,155,lockyer street,langi,rowviille,2087,qld,19740602.0,7008776
rec-4618-dup-0,chloe,musoilno,4,warrumbmul street,gordoivale,brighton,3728,ni,,7147784
rec-3700-dup-0,steven,pokkias,407,may maxwell crescent,ardonachie,dalveen,5046,wa,19350305.0,2928326
rec-309-dup-0,eliza,campbell,113,dines place,villa 74 village glen,figtree,3550,nsw,19651205.0,4509107
rec-85-dup-0,dakin,joselyn,57,aberneth ystreet,kanangra hostel,manundza,3028,wa,19261205.0,4852063


Assume we want to link the records of the two datasets without introducing duplicates. To start the process, we have to generate pairs of possible matches. Since we cannot know in advance which rows match, we have to take all the possible pairs. The pairs are generated from the indexes of the two datasets, which is why this step is also called `indexing`. The `recordlinkage` package makes this process very easy:

In [79]:
# Create an indexing object
indexer = rl.Index()

We start by creating an indexing object. Next, we should specify the mode of generating the pairs. Since we need to generate all the possible combinations of indexes, we will use the `.full()` method on the indexing object:

In [83]:
# Set the mode of generation to full
indexer.full()



<Index>

Next, we will pass in the datasets to generate the pairs, also called candidates, and assign the result to a new variable:

In [84]:
pairs = indexer.index(rand_a, rand_b)
pairs

MultiIndex([('rec-4916-org',  'rec-917-dup-0'),
            ('rec-4916-org', 'rec-4618-dup-0'),
            ('rec-4916-org', 'rec-3700-dup-0'),
            ('rec-4916-org',  'rec-309-dup-0'),
            ('rec-4916-org',   'rec-85-dup-0'),
            ('rec-2323-org',  'rec-917-dup-0'),
            ('rec-2323-org', 'rec-4618-dup-0'),
            ('rec-2323-org', 'rec-3700-dup-0'),
            ('rec-2323-org',  'rec-309-dup-0'),
            ('rec-2323-org',   'rec-85-dup-0'),
            ('rec-4544-org',  'rec-917-dup-0'),
            ('rec-4544-org', 'rec-4618-dup-0'),
            ('rec-4544-org', 'rec-3700-dup-0'),
            ('rec-4544-org',  'rec-309-dup-0'),
            ('rec-4544-org',   'rec-85-dup-0'),
            ('rec-1632-org',  'rec-917-dup-0'),
            ('rec-1632-org', 'rec-4618-dup-0'),
            ('rec-1632-org', 'rec-3700-dup-0'),
            ('rec-1632-org',  'rec-309-dup-0'),
            ('rec-1632-org',   'rec-85-dup-0'),
            ('rec-4449-org',  'rec-917-d

The result will be a `pandas.MultiIndex` object. The first level contains the indexes from the first dataset and the second level contains the indexes from the second dataset.

The length of the resulting `MultiIndex` will always be the product of the lengths of the two datasets. For our 5-row datasets, each index from the first table is paired with all 5 indexes from the second, which gives 25 pairs:
<img src='images/4.png'></img>
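As a quick sanity check on the pair count, the same Cartesian product can be rebuilt with plain `pandas`. This is only an illustration using the record ids from the two samples above; it is not a claim about how `recordlinkage` builds the pairs internally:

```python
import pandas as pd

# Record ids taken from the rand_a and rand_b samples shown above
ids_a = ["rec-4916-org", "rec-2323-org", "rec-4544-org", "rec-1632-org", "rec-4449-org"]
ids_b = ["rec-917-dup-0", "rec-4618-dup-0", "rec-3700-dup-0", "rec-309-dup-0", "rec-85-dup-0"]

# Every id from the first table paired with every id from the second
all_pairs = pd.MultiIndex.from_product([ids_a, ids_b], names=["rec_id_1", "rec_id_2"])

# The number of candidates equals len(ids_a) * len(ids_b)
print(len(all_pairs), len(ids_a) * len(ids_b))
```

The same arithmetic explains why full indexing grows so quickly: two tables of 5,000 rows each already produce 25 million candidate pairs.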

However, if our datasets are large, generating all the possible pairs will be very computationally expensive. To avoid generating all the possible pairs, we should choose one column which has consistent values from both datasets. For our small datasets, there is a state column:
```
>>> rand_a[['state']], rand_b[['state']]
```
<img src='images/5.png'></img>

If you pay attention, the unique values of `state` are consistent in both datasets, meaning a state name in one table is not spelled differently in the other. This helps us a lot because now we can exclude all the pairs that do not have a matching state value. To do this with `recordlinkage`, we have to change the mode from `full` to `blocking`:

In [91]:
# From scratch
indexer = rl.Index()
# Set the mode to blocking with `state`
indexer.block('state')
# Generate pairs
pairs = indexer.index(rand_a, rand_b)
pairs

MultiIndex([('rec-4916-org',  'rec-917-dup-0'),
            ('rec-2323-org',  'rec-917-dup-0'),
            ('rec-4544-org',  'rec-309-dup-0'),
            ('rec-4449-org',  'rec-309-dup-0'),
            ('rec-1632-org', 'rec-3700-dup-0'),
            ('rec-1632-org',   'rec-85-dup-0')],
           names=['rec_id_1', 'rec_id_2'])

> Remember, the logic behind blocking on a certain column is that we expect duplicate rows to have the same value in that column in both datasets. If two rows do not match on that column, we can exclude the pair.

As you can see, the number of pairs dropped significantly, from 25 to 6. These index pairs are exactly the ones that have the same value for `state`. Let's check some of the pairs:

In [94]:
rand_a.loc['rec-4916-org']['state'], rand_b.loc['rec-917-dup-0']['state']

('qld', 'qld')

In [97]:
rand_a.loc['rec-4449-org']['state'], rand_b.loc['rec-309-dup-0']['state']

('nsw', 'nsw')

If you use blocking on a consistent common column, the number of pairs will be much smaller. We can even block on multiple columns, as long as the unique values of those columns are consistent in both tables.
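One way to build intuition for blocking is to emulate it with an inner join in plain `pandas`. The sketch below uses the `state` values from the two samples above and is only an illustration of the idea, not a description of the library's internals:

```python
import pandas as pd

# State values copied from the rand_a and rand_b samples above
states_a = pd.DataFrame(
    {"state": ["qld", "qld", "nsw", "wa", "nsw"]},
    index=pd.Index(
        ["rec-4916-org", "rec-2323-org", "rec-4544-org", "rec-1632-org", "rec-4449-org"],
        name="rec_id",
    ),
)
states_b = pd.DataFrame(
    {"state": ["qld", "ni", "wa", "nsw", "wa"]},
    index=pd.Index(
        ["rec-917-dup-0", "rec-4618-dup-0", "rec-3700-dup-0", "rec-309-dup-0", "rec-85-dup-0"],
        name="rec_id",
    ),
)

# An inner join on the blocking key keeps only pairs that agree on `state`
blocked = states_a.reset_index().merge(
    states_b.reset_index(), on="state", suffixes=("_1", "_2")
)
blocked_pairs = pd.MultiIndex.from_frame(blocked[["rec_id_1", "rec_id_2"]])

print(len(blocked_pairs))
print(blocked_pairs.tolist())
```

Rows whose state has no counterpart in the other table, such as `ni` here, produce no candidate pairs at all.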

### Record Linkage, Case Study <small id='case1'></small>

Now that you have an understanding of indexing, we can start record linkage with the full datasets:

In [108]:
# Create an indexing object
indexer = rl.Index()
# Block on state
indexer.block('state')
# Generate candidate pairs
pairs = indexer.index(census_a, census_b)
print(len(pairs))

5458951


For the full datasets, almost 5.5 million pairs are returned. Remember, full indexing would have produced 25 million (5,000 × 5,000).
Now, using these candidate pairs, we will compare the values of each column. To start comparing, we should create a comparing object:

In [109]:
# Create a comparing object
compare = rl.Compare()

This object has many useful methods for matching exact or fuzzy values of the columns. Let's start with the exact matches:

In [110]:
# Query the exact matches of state
compare.exact('state', 'state', label='state')
# Query the exact matches of date of birth
compare.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
# Query the exact matches of social security ID
compare.exact('soc_sec_id', 'soc_sec_id', label='soc_sec_id')
# Query the exact matches of postcode
compare.exact('postcode', 'postcode', label='postcode')

<Compare>

When we use `exact` for certain fields, we expect row pairs to have exactly the same values for these fields. The parameters of `exact`:
- `left_on`:  the column name of the left dataset
- `right_on`: the column name of the right dataset
- `label`: the column name of the resulting dataset

After we perform all the comparisons, the result will be a `pandas` dataframe, and `label` controls the name of the corresponding column in that dataframe.

Why did we choose exact matching? Because the postcode, social security ID, date of birth and state columns have to match exactly for two rows to be duplicates. This also depends on the values of those columns. If the unique values are consistent among the datasets, we should use `exact`.
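To see what an exact comparison amounts to, here is a small `pandas`-only sketch. The two tables and their index labels (`a1`, `b1`, and so on) are hypothetical, and the code only mimics the idea of scoring each candidate pair with 1 or 0:

```python
import pandas as pd

# Two tiny hypothetical tables sharing a `postcode` column
left = pd.DataFrame({"postcode": ["2086", "4812"]}, index=["a1", "a2"])
right = pd.DataFrame({"postcode": ["2086", "3550"]}, index=["b1", "b2"])

# Candidate pairs: every left row against every right row
pairs = pd.MultiIndex.from_product([left.index, right.index])

# Look up both sides of every pair, then compare element-wise
left_vals = left.loc[pairs.get_level_values(0), "postcode"].to_numpy()
right_vals = right.loc[pairs.get_level_values(1), "postcode"].to_numpy()
exact_score = pd.Series(
    (left_vals == right_vals).astype(int), index=pairs, name="postcode"
)

print(exact_score)
```

Only the pair whose postcodes are identical receives a 1; every other pair receives a 0.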

Now, for fuzzy matching. The given name, surname and address columns will probably have typos and inconsistencies, so we will use fuzzy string matching for them:

In [111]:
# Query the fuzzy matches for given name
compare.string('given_name', 'given_name', threshold=0.75, method='levenshtein', label='given_name')
# Query the fuzzy matches for surname
compare.string('surname', 'surname', threshold=0.75, method='levenshtein', label='surname')
# Query the fuzzy matches for address
compare.string('address_1', 'address_1', threshold=0.75, method='levenshtein', label='address')

<Compare>

For fuzzy string matching, we use the `.string` method. The parameters for column names are the same. Other parameters:
- `method`: controls the algorithm used to calculate string similarity
- `threshold`: the similarity score threshold. If the similarity is greater than or equal to the given score, it is a match
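The interaction of `method` and `threshold` can be sketched in pure Python. The normalization used below (1 minus the edit distance divided by the length of the longer string) is an assumption made for illustration, and the helper names `levenshtein_distance` and `similarity` are hypothetical, not part of `recordlinkage`:

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(s, t) / longest


THRESHOLD = 0.75  # same value as in the compare.string calls above

for a, b in [("rowville", "rowviille"), ("emily", "emiily"), ("boyle", "webb")]:
    score = similarity(a, b)
    # Print the pair, its similarity, and whether it clears the threshold
    print(a, b, round(score, 3), int(score >= THRESHOLD))
```

A single-character typo in a reasonably long string stays well above 0.75, while two unrelated surnames fall far below it.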

> There are other methods of matching values depending on the data type: `.numeric` and `.date`.

Now that we have the methods in place, it is time to compute them and assign the result to a variable:

In [112]:
# Compute the matches, this will take a while
matches = compare.compute(pairs, census_a, census_b)

`.compute` takes three arguments. The first is the `MultiIndex` object of candidate pairs. The next two are the data frames we are using. Note that they should be passed in the same order as in `indexer.index()`.

After the computation is done, we will have a dataset of this sort:

In [122]:
matches.sample(5)

rec_id_1,rec_id_2,state,date_of_birth,soc_sec_id,postcode,given_name,surname,address
rec-3254-org,rec-1416-dup-0,1,0,0,0,0.0,0.0,0.0
rec-809-org,rec-4389-dup-0,1,0,0,0,0.0,0.0,0.0
rec-155-org,rec-3088-dup-0,1,0,0,0,0.0,0.0,0.0
rec-3023-org,rec-3296-dup-0,1,0,0,0,0.0,0.0,0.0
rec-3157-org,rec-4670-dup-0,1,0,0,0,0.0,0.0,0.0


The resulting data frame also has a multi-level index: the first level comes from `census_a` and the second from `census_b`. The remaining columns contain either 1 for a match or 0 for no match. Let's interpret the first row of the above sample:

The rows with indexes `rec-3254-org` and `rec-1416-dup-0` only matched on the `state` column because there is 1 in that field. These rows failed to match in other fields.

Now, let's decide when two rows count as duplicates. For our dataset, I think that if the rows match on at least 4 columns, there is a pretty high chance that they are duplicates. We can easily subset the rows with an overall matching score of at least 4 using `sum` and boolean indexing:

In [124]:
# Query matches with a score of at least 4
full_matches = matches[matches.sum(axis='columns') >= 4]
full_matches.sample(5)

rec_id_1,rec_id_2,state,date_of_birth,soc_sec_id,postcode,given_name,surname,address
rec-3937-org,rec-3937-dup-0,1,1,1,1,1.0,1.0,0.0
rec-2723-org,rec-2723-dup-0,1,1,1,1,0.0,1.0,1.0
rec-2089-org,rec-2089-dup-0,1,1,1,1,1.0,1.0,1.0
rec-3818-org,rec-3818-dup-0,1,1,1,1,1.0,1.0,1.0
rec-2466-org,rec-2466-dup-0,1,0,1,1,0.0,1.0,1.0


In [125]:
full_matches.shape

(4676, 7)
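The thresholding step can be reproduced on a handful of rows. The sketch below rebuilds three score rows copied from the samples above and applies the same `sum` and boolean indexing:

```python
import pandas as pd

# Three score rows copied from the sampled outputs above
scores = pd.DataFrame(
    [
        [1, 0, 0, 0, 0.0, 0.0, 0.0],  # rec-3254-org / rec-1416-dup-0
        [1, 1, 1, 1, 1.0, 1.0, 0.0],  # rec-3937-org / rec-3937-dup-0
        [1, 0, 1, 1, 0.0, 1.0, 1.0],  # rec-2466-org / rec-2466-dup-0
    ],
    columns=["state", "date_of_birth", "soc_sec_id", "postcode",
             "given_name", "surname", "address"],
    index=pd.MultiIndex.from_tuples(
        [
            ("rec-3254-org", "rec-1416-dup-0"),
            ("rec-3937-org", "rec-3937-dup-0"),
            ("rec-2466-org", "rec-2466-dup-0"),
        ],
        names=["rec_id_1", "rec_id_2"],
    ),
)

# Row-wise count of agreeing comparisons
totals = scores.sum(axis="columns")

# Keep the pairs that agree on at least 4 of the 7 comparisons
kept = scores[totals >= 4]

print(totals.tolist())
print(kept.index.tolist())
```

The pair that agrees only on `state` is dropped, while the two pairs with five and six agreeing columns are kept.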

As you can see, 4676 pairs matched out of almost 5.5 million candidate pairs. Before merging our original tables together, we have to make sure that we do not include the 4676 duplicated rows. To do this, we will do a little bit of manipulation:

In [127]:
# Get the indexes from the second index level (census_b)
duplicates = full_matches.index.get_level_values('rec_id_2')
print(duplicates[:10])

Index(['rec-4405-dup-0', 'rec-1985-dup-0', 'rec-4302-dup-0', 'rec-4641-dup-0',
       'rec-1300-dup-0', 'rec-4178-dup-0', 'rec-1280-dup-0', 'rec-780-dup-0',
       'rec-4098-dup-0', 'rec-4663-dup-0'],
      dtype='object', name='rec_id_2')


To get indexes of some level from multi-level indexes, we use `.get_level_values` on `df.index`. 
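As a small, self-contained illustration of `.get_level_values`, the sketch below uses a hypothetical two-pair `MultiIndex` (the record ids are invented) and pulls out one level both by name and by position:

```python
import pandas as pd

# Hypothetical matched pairs with the same level names as above
matched = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)

# Select a level by its name...
by_name = matched.get_level_values("rec_id_2")
# ...or by its integer position (0 is the first level, 1 the second)
by_position = matched.get_level_values(1)

print(by_name.tolist())
print(by_name.equals(by_position))
```

Either call returns a flat `Index`, which is what makes it convenient to pass to `.isin` afterwards.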

Since we chose the second level index, we should exclude them from `census_b`:

In [128]:
# Exclude the indexes of duplicates from census_b
unique_b = census_b[~census_b.index.isin(duplicates)]
unique_b.shape

(324, 10)

Now, the `unique_b` is ready to be appended to the first dataset:

In [129]:
# Append deduplicated census_b to census_a
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
full_census = pd.concat([census_a, unique_b])
full_census.shape

(5324, 10)

In [130]:
full_census.sample(5)

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-3549-org,harry,thorpe,11.0,kambalda crescent,louisa tor 4,angaston,2777,qld,19421128,2701790
rec-2152-org,emiily,fitzpatrick,,aland place,keralland,rowville,2219,vic,19270130,1148897
rec-67-org,erin,matthews,10.0,williamson street,yaraan,kilcoy,4218,qld,19991129,7747845
rec-4239-org,bianca,cuming,9.0,albermarle place,harkness station,natimuk,3158,nsw,19911129,2315136
rec-1767-org,keaton,webb,2.0,mckinley circuit,solitaire,lemon tree passage,3943,nsw,19940724,6702216


There you go. From 10k rows full of duplicates, we got down to 5324 unique rows. Here is the full code:

In [131]:
# Create an indexing object
indexer = rl.Index()
# Block on state
indexer.block('state')
# Generate candidate pairs
pairs = indexer.index(census_a, census_b)

# Create a comparing object
compare = rl.Compare()

# Query the exact matches of state
compare.exact('state', 'state', label='state')
# Query the exact matches of date of birth
compare.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
# Query the exact matches of social security ID
compare.exact('soc_sec_id', 'soc_sec_id', label='soc_sec_id')
# Query the exact matches of postcode
compare.exact('postcode', 'postcode', label='postcode')
# Query the fuzzy matches for given name
compare.string('given_name', 'given_name', threshold=0.75, method='levenshtein', label='given_name')
# Query the fuzzy matches for surname
compare.string('surname', 'surname', threshold=0.75, method='levenshtein', label='surname')
# Query the fuzzy matches for address
compare.string('address_1', 'address_1', threshold=0.75, method='levenshtein', label='address')

# Compute the matches, this will take a while
matches = compare.compute(pairs, census_a, census_b)
# Query matches with a score of at least 4
full_matches = matches[matches.sum(axis='columns') >= 4]

# Get the indexes from the second index level (census_b)
duplicates = full_matches.index.get_level_values('rec_id_2')
# Exclude the indexes of duplicates from census_b
unique_b = census_b[~census_b.index.isin(duplicates)]

# Append deduplicated census_b to census_a
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
full_census = pd.concat([census_a, unique_b])

### Record Linkage, Case Study <small id='case2'></small>

To solidify your knowledge, we will perform record linkage with two ot