# Customize

The Python Record Linkage Toolkit contains several built-in indexing and record-comparing algorithms. Examples of indexing algorithms are *full indexing*, *blocking* and *sorted neighbourhood indexing*. Sometimes, these built-in indexing and comparing algorithms do not fit your needs. With the Python Record Linkage Toolkit, it is easy to implement your own algorithms. In this example, we will show how you can implement your own indexing and comparing algorithms. If you think your algorithm might help others, consider sharing it with us!


In [31]:
%precision 5

from __future__ import print_function

import pandas as pd
pd.set_option('display.precision', 5)
pd.options.display.max_rows = 10


## Custom indexing algorithms
This example shows how to make a set of candidate record pairs with a custom indexing algorithm. Consider the situation where you need only the record pairs in which both given names start with the letter 'w'.

Import ``recordlinkage``, ``numpy`` and ``pandas``, and load two sample census datasets.

In [45]:
import numpy
import pandas

import recordlinkage as rl
from recordlinkage.datasets import load_febrl4

dfA, dfB = load_febrl4()

dfA

| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
| rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
| rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| rec-2153-org | annabel | grierson | 97 | mclachlan crescent | lantana lodge | broome | 2480 | nsw | 19840224 | 7676186 |
| rec-1604-org | sienna | musolino | 22 | smeaton circuit | pangani | mckinnon | 2700 | nsw | 19890525 | 4971506 |
| rec-1003-org | bradley | matthews | 2 | jondol place | horseshoe ck | jacobs well | 7018 | sa | 19481122 | 8927667 |
| rec-4883-org | brodee | egan | 88 | axon street | greenslopes | wamberal | 2067 | qld | 19121113 | 6039042 |


### Basic

So far, nothing has changed. To make a custom indexing algorithm, write a function that takes the two dataframes and returns a ``pandas.MultiIndex`` of candidate record pairs. The function below selects the records in ``dfA`` and ``dfB`` whose given name starts with a 'w', then pairs every selected record in ``dfA`` with every selected record in ``dfB``.

In [33]:
def name_starts_with_w_index(A, B):

    # Select records whose given name starts with a 'w'. Missing names
    # are treated as non-matching (na=False).
    A_startswith_w = A[A['given_name'].str.startswith('w', na=False)]
    B_startswith_w = B[B['given_name'].str.startswith('w', na=False)]

    # Pair every selected record in A with every selected record in B
    return pandas.MultiIndex.from_product(
        [A_startswith_w.index.values, B_startswith_w.index.values],
        names=[A.index.name, B.index.name]
    )

This function is not one of the built-in methods of the ``Pairs`` class, so those methods cannot be used here. Instead, pass the function to the ``index`` method:

In [35]:
pcl = rl.Pairs(dfA, dfB)
candidate_pairs = pcl.index(name_starts_with_w_index)

print('Number of candidate record pairs starting with the letter w:', len(candidate_pairs))

Number of candidate record pairs starting with the letter w: 6072
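
To see what such an indexing function returns, it can help to call it directly on two tiny frames. The sketch below uses only ``pandas``; the frames, names and record ids are made up for illustration, and the function is called on its own rather than through ``Pairs``.

```python
import pandas

# Hypothetical toy data; names and record ids are invented for this sketch.
A = pandas.DataFrame(
    {"given_name": ["william", "anna", None, "wendy"]},
    index=pandas.Index(["a1", "a2", "a3", "a4"], name="rec_id_A"),
)
B = pandas.DataFrame(
    {"given_name": ["walter", "bob"]},
    index=pandas.Index(["b1", "b2"], name="rec_id_B"),
)


def name_starts_with_w_index(A, B):
    # na=False treats a missing name as "does not start with w".
    A_w = A[A["given_name"].str.startswith("w", na=False)]
    B_w = B[B["given_name"].str.startswith("w", na=False)]

    # Cartesian product of the selected record ids.
    return pandas.MultiIndex.from_product(
        [A_w.index, B_w.index], names=[A.index.name, B.index.name]
    )


toy_pairs = name_starts_with_w_index(A, B)
print(list(toy_pairs))  # [('a1', 'b1'), ('a4', 'b1')]
```

Two records in ``A`` and one record in ``B`` qualify, so the product holds 2 x 1 = 2 candidate pairs.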


### Additional arguments

Custom indexing functions can take additional arguments. Pass them to ``index`` after the function itself.

In [38]:
def name_starts_with_index(A, B, letter):

    # Select records whose given name starts with 'letter'. Missing names
    # are treated as non-matching (na=False).
    A_startswith = A[A['given_name'].str.startswith(letter, na=False)]
    B_startswith = B[B['given_name'].str.startswith(letter, na=False)]

    # Pair every selected record in A with every selected record in B
    return pandas.MultiIndex.from_product(
        [A_startswith.index.values, B_startswith.index.values],
        names=[A.index.name, B.index.name]
    )

In [44]:
pcl = rl.Pairs(dfA, dfB)

candidate_pairs_x = pcl.index(name_starts_with_index, 'x')
candidate_pairs_w = pcl.index(name_starts_with_index, 'w')
candidate_pairs_a = pcl.index(name_starts_with_index, 'a')

print('Number of candidate record pairs starting with the letter w:', len(candidate_pairs_w))
print('Number of candidate record pairs starting with the letter x:', len(candidate_pairs_x))
print('Number of candidate record pairs starting with the letter a:', len(candidate_pairs_a))

Number of candidate record pairs starting with the letter w: 6072
Number of candidate record pairs starting with the letter x: 132
Number of candidate record pairs starting with the letter a: 172431
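
If you would rather fix the extra argument up front, the standard library's ``functools.partial`` can bind it, which again gives a function of just the two dataframes. This is a minimal sketch on made-up frames; it calls the bound function directly and makes no claim about how ``Pairs`` handles it.

```python
from functools import partial

import pandas


def name_starts_with_index(A, B, letter):
    # Same logic as above, with the letter as a parameter.
    A_sel = A[A["given_name"].str.startswith(letter, na=False)]
    B_sel = B[B["given_name"].str.startswith(letter, na=False)]
    return pandas.MultiIndex.from_product(
        [A_sel.index, B_sel.index], names=[A.index.name, B.index.name]
    )


# Hypothetical toy data, invented for this sketch.
A = pandas.DataFrame(
    {"given_name": ["anna", "alex", "bob"]},
    index=pandas.Index(["a1", "a2", "a3"], name="rec_id_A"),
)
B = pandas.DataFrame(
    {"given_name": ["amy", "carl"]},
    index=pandas.Index(["b1", "b2"], name="rec_id_B"),
)

# Bind letter="a"; the result takes only the two dataframes.
starts_with_a = partial(name_starts_with_index, letter="a")

pairs_a = starts_with_a(A, B)
print(list(pairs_a))  # [('a1', 'b1'), ('a2', 'b1')]
```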


## Custom comparing algorithms

This section shows what to do if the built-in algorithms for comparing strings, numeric values or dates are not sufficient for your application.

Import ``recordlinkage``, ``numpy`` and ``pandas``, load two sample census datasets, and make an index of candidate record pairs by blocking on the given name.

In [57]:
import numpy
import pandas

import recordlinkage as rl
from recordlinkage.datasets import load_febrl4

dfA, dfB = load_febrl4()

# Make an index
pcl = rl.Pairs(dfA, dfB)
candidate_pairs = pcl.block('given_name')

In [58]:
def compare_zipcodes(s1, s2):
    """
    If zipcodes in both records are identical, the similarity 
    is 1. If the first two digits agree while the last two don't, then 
    the similarity is 0.5. Otherwise, the similarity is 0.
    """

    # check if the zipcodes are identical (gives 1.0 or 0.0)
    sim = (s1 == s2).astype(float)

    # for the non-identical pairs, check whether the first two digits agree
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5
    
    return sim

In [59]:
crl = rl.Compare(candidate_pairs, dfA, dfB)
crl.compare(compare_zipcodes, 'postcode', 'postcode')

rec_id        rec_id        
rec-1070-org  rec-3024-dup-0    0.0
              rec-2371-dup-0    0.0
              rec-4652-dup-0    0.0
              rec-4795-dup-0    0.0
              rec-1314-dup-0    0.0
                               ... 
rec-4528-org  rec-4528-dup-0    0.0
rec-4887-org  rec-4887-dup-0    1.0
rec-4350-org  rec-4350-dup-0    1.0
rec-4569-org  rec-4569-dup-0    1.0
rec-3125-org  rec-3125-dup-0    1.0
dtype: float64

In [61]:
crl.vectors[0].value_counts()

0.0    71229
0.5     3166
1.0     2854
Name: 0, dtype: int64
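
A comparison function like this is easy to check in isolation, because it only needs two aligned ``pandas.Series``. The following sketch repeats the function and applies it to three made-up postcode pairs, one for each similarity level.

```python
import pandas


def compare_zipcodes(s1, s2):
    # 1.0 for identical zipcodes, 0.0 otherwise
    sim = (s1 == s2).astype(float)

    # 0.5 where only the first two digits agree
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5

    return sim


# Hypothetical postcodes stored as strings, invented for this sketch.
s1 = pandas.Series(["2480", "2480", "2480"])
s2 = pandas.Series(["2480", "2467", "7018"])

toy_sim = compare_zipcodes(s1, s2)
print(toy_sim.tolist())  # [1.0, 0.5, 0.0]
```

Note that the function relies on the ``.str`` accessor, so it assumes the postcodes are stored as strings.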

## Indexing with large files

Sometimes, the input files are very large. In that case, it can be hard to make an index without running out of memory in the indexing step or in the comparing step. ``recordlinkage`` has a method to deal with large files. It is fast, although it was not primarily developed for speed, and SQL databases may outperform it. It was developed mainly for usability.
The idea is to split the input files into small blocks, compute the record pairs for each block, and iterate over the blocks. Consider full indexing:

In [6]:
pcl_blocks = rl.Pairs(dfA, dfB, chunks=(500,500))

for index_block in pcl_blocks.full():
    
    # Index returned
    print(type(index_block))

    # Length of index block
    print(len(index_block))
    
    # Your analysis here

<class 'pandas.indexes.multi.MultiIndex'>
250000
<class 'pandas.indexes.multi.MultiIndex'>
250000
<class 'pandas.indexes.multi.MultiIndex'>
250000
<class 'pandas.indexes.multi.MultiIndex'>
250000


The chunks of 500x500 result in four iterations (both files contain 1000 records).
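
The block-wise idea can be sketched without the toolkit. The generator below is not the toolkit's implementation; ``chunked_full_index`` is a hypothetical helper that slices two indexes into chunks and yields the full index of each pair of chunks, so only one block is held in memory at a time.

```python
from itertools import product

import pandas


def chunked_full_index(index_a, index_b, chunks=(500, 500)):
    """Yield the full index block by block (hypothetical helper)."""
    size_a, size_b = chunks

    # Slice each index into consecutive chunks.
    blocks_a = [index_a[i:i + size_a] for i in range(0, len(index_a), size_a)]
    blocks_b = [index_b[j:j + size_b] for j in range(0, len(index_b), size_b)]

    # Every chunk of A is paired with every chunk of B.
    for block_a, block_b in product(blocks_a, blocks_b):
        yield pandas.MultiIndex.from_product([block_a, block_b])


# Two stand-in indexes of 1000 records each.
index_a = pandas.RangeIndex(1000)
index_b = pandas.RangeIndex(1000)

block_sizes = [len(block) for block in chunked_full_index(index_a, index_b)]
print(block_sizes)  # [250000, 250000, 250000, 250000]
```

With 1000 records per file and chunks of 500, each file splits into two chunks, which gives 2 x 2 = 4 blocks of 500 x 500 = 250000 pairs.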