# Record linkage

Record linkage is a common data science problem. The idea is that you have duplicates of the same entity in your data, and you want to find them. For example, you might have a list of people who have made purchases at your store, and you want to find all the purchases made by the same person. Or you might have a list of people who have signed up for your email newsletter, and you want to find all the people who signed up more than once.

There are dedicated tools for this, such as [recordlinkage](https://github.com/J535D165/recordlinkage) and [Splink](https://github.com/moj-analytical-services/splink). However, record linkage is very problem-specific, and it's often more worthwhile to write your own code.

In [1]:
import pandas as pd

dataset = pd.read_csv(
    'https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/Restaurants/restaurants.tsv',
    sep='\t',
    index_col='id'
)
dataset.head(6)


id,name,address,city,phone,type
1,arnie morton's of chicago,435 s. la cienega blv.,los angeles,310/246-1501,american
2,arnie morton's of chicago,435 s. la cienega blvd.,los angeles,310-246-1501,steakhouses
3,art's delicatessen,12224 ventura blvd.,studio city,818/762-1221,american
4,art's deli,12224 ventura blvd.,studio city,818-762-1221,delis
5,hotel bel-air,701 stone canyon rd.,bel air,310/472-1211,californian
6,bel-air hotel,701 stone canyon rd.,bel air,310-472-1211,californian


In [2]:
duplicates = pd.read_csv(
    'https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/Restaurants/restaurants_DPL.tsv',
    sep='\t'
)
duplicates.head()


,id1,id2
0,1,2
1,3,4
2,5,6
3,7,8
4,9,10


This is nice because we have a ground truth: we can check whether the duplicates we find are correct or not.

How should we go about finding these duplicates? Well, pandas does have a `duplicated` method:

In [3]:
search_by = 'address'
is_duplicate = dataset[search_by].duplicated(keep=False)
is_duplicate.head(6)


id
1    False
2    False
3     True
4     True
5     True
6     True
Name: address, dtype: bool
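
The `keep=False` argument matters here. By default, `duplicated` leaves the first occurrence of each value unflagged, whereas `keep=False` flags every row whose value appears more than once. A small illustration on made-up addresses (the values below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy addresses, for illustration only
addresses = pd.Series(
    ['12 main st.', '12 main st.', '7 oak ave.', '12 main st.'],
    index=[1, 2, 3, 4],
)

# Default (keep='first'): the first occurrence is not flagged
first_kept = addresses.duplicated().tolist()

# keep=False: every row whose value appears more than once is flagged
all_flagged = addresses.duplicated(keep=False).tolist()

print(first_kept)
print(all_flagged)
```

That is why `is_duplicate` above marks both rows of each repeated address, rather than only the second one.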

We can use this to group row indices together:

In [4]:
indices = dataset[is_duplicate].groupby(search_by).apply(lambda x: x.index.tolist())
indices.head()


address
1 margaret mitchell sq.    [161, 162]
1 mission st.              [193, 194]
1 w. 67th st.                [55, 56]
100 e. 63rd st.            [103, 104]
1001 n. alameda st.          [35, 36]
dtype: object

Now we can compare this to the ground truth. There may be more than two row indices that are duplicates of each other. We can use `itertools.combinations` to get all the pairs of row indices:

In [5]:
import itertools

found_duplicates = pd.DataFrame(
    list(itertools.chain.from_iterable(
        itertools.combinations(index, 2) for index in indices
    )),
    columns=['id1', 'id2']
)
found_duplicates.head()


,id1,id2
0,161,162
1,193,194
2,55,56
3,103,104
4,35,36
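
To see why `itertools.combinations` is needed, consider a group of three rows sharing one address: it corresponds to three pairs, not one. A minimal sketch with hypothetical row ids:

```python
import itertools

# Hypothetical groups of row ids that share the same address
groups = [[10, 11], [20, 21, 22]]

# Each group of size n yields n * (n - 1) / 2 unordered pairs
pairs = list(itertools.chain.from_iterable(
    itertools.combinations(group, 2) for group in groups
))

print(pairs)
```

The group of two yields a single pair, while the group of three yields three.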


We'll convert each dataframe to a set of sorted pairs to ease the comparison.

In [6]:
TRUTH = set(tuple(sorted(pair)) for pair in duplicates.values)
FOUND = set(tuple(sorted(pair)) for pair in found_duplicates.values)


Let's look at how well we did:

In [7]:
print(f'#true_positives: {len(TRUTH & FOUND)}')
print(f'#false_positives: {len(FOUND - TRUTH)}')


#true_positives: 67
#false_positives: 35
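
These counts can be condensed into precision and recall, two standard ways of scoring a set of predicted pairs against a ground truth. The helper below is only a sketch: `pair_metrics` is a name made up for this example, and the sets are toy values rather than the restaurant data.

```python
def pair_metrics(truth, found):
    """Return (precision, recall) for two sets of sorted id pairs."""
    true_positives = len(truth & found)
    # Precision: share of the found pairs that are real duplicates
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of the real duplicates that were found
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example: 2 of the 3 found pairs are correct, 2 of the 4 true pairs are found
toy_truth = {(1, 2), (3, 4), (5, 6), (7, 8)}
toy_found = {(1, 2), (3, 4), (9, 10)}
precision, recall = pair_metrics(toy_truth, toy_found)
print(precision, recall)
```

With the counts printed above, precision would be 67 / (67 + 35), roughly 0.66.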


Let's take a look at the false negatives, which in this case are the duplicates we didn't find:

In [8]:
import random

for (a, b) in random.choices(list(TRUTH - FOUND), k=5):
    print(dataset.loc[a, search_by])
    print(dataset.loc[b, search_by])
    print()


156 2nd ave. at 10th st.
156 second ave.

3000 w. paradise rd.
3000 paradise rd.

2355 peachtree rd.  peachtree battle shopping center
2355 peachtree rd. ne

570 4th st.
570 fourth st.

in central park at 67th st.
central park west



Now for the false positives, which in this case are the pairs we flagged as duplicates but shouldn't have:

In [9]:
for (a, b) in random.choices(list(FOUND - TRUTH), k=5):
    print(dataset.loc[a, search_by])
    print(dataset.loc[b, search_by])
    print()


3570 las vegas blvd. s
3570 las vegas blvd. s

3570 las vegas blvd. s
3570 las vegas blvd. s

3799 las vegas blvd. s.
3799 las vegas blvd. s.

1248 clairmont rd.
1248 clairmont rd.

3570 las vegas blvd. s
3570 las vegas blvd. s



Why are these false positives? Because although the addresses are the same, the names are different. This could be because the restaurants existed at different times.

In [10]:
dataset.loc[a]


name                 palace court
address    3570 las vegas blvd. s
city                    las vegas
phone                702/731-7547
type                  continental
Name: 141, dtype: object

In [11]:
dataset.loc[b]


name                empress court
address    3570 las vegas blvd. s
city                    las vegas
phone                702/731-7888
type                        asian
Name: 551, dtype: object

We'll need a smarter duplicate detection algorithm. We *could* use a machine learning algorithm, but that's not a good way to start. It's usually a good idea to first try solving the problem manually.

The correct manual matching function depends entirely on your application. There is no silver bullet that will work for each and every case. But by proceeding incrementally, and by studying the false positives and false negatives, we can build up a good matching function.

Here we'll use a function from [this](https://maxhalford.github.io/blog/transitive-duplicates/) blog post. Don't worry too much about how it works.

In [12]:
import numpy as np
import pandas as pd


def find_partitions(df, match_func, max_size=None, block_by=None):
    """Recursive algorithm for finding duplicates in a DataFrame."""

    # If block_by is provided, then we apply the algorithm to each block and
    # stitch the results back together
    if block_by is not None:
        blocks = df.groupby(block_by).apply(lambda g: find_partitions(
            df=g,
            match_func=match_func,
            max_size=max_size
        ))

        keys = blocks.index.unique(block_by)
        for a, b in zip(keys[:-1], keys[1:]):
            blocks.loc[b, :] += blocks.loc[a].iloc[-1] + 1

        return blocks.reset_index(block_by, drop=True)

    def get_record_index(r):
        return r[df.index.name or 'index']

    # Records are easier to work with than a DataFrame
    records = df.to_records()

    # This is where we store each partition
    partitions = []

    def find_partition(at=0, partition=None, indexes=None):

        r1 = records[at]

        if partition is None:
            partition = {get_record_index(r1)}
            indexes = [at]

        # Stop if enough duplicates have been found
        if max_size is not None and len(partition) == max_size:
            return partition, indexes

        for i, r2 in enumerate(records):

            if get_record_index(r2) in partition or i == at:
                continue

            if match_func(r1, r2):
                partition.add(get_record_index(r2))
                indexes.append(i)
                find_partition(at=i, partition=partition, indexes=indexes)

        return partition, indexes

    while len(records) > 0:
        partition, indexes = find_partition()
        partitions.append(partition)
        records = np.delete(records, indexes)

    return pd.Series({
        idx: partition_id
        for partition_id, idxs in enumerate(partitions)
        for idx in idxs
    })
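
The key property of `find_partitions` is that matches are chained: if A matches B and B matches C, all three end up in the same partition, even if A and C don't match directly. As a rough illustration of that idea only, here is a sketch that uses a union-find structure instead of the recursive search above. It is a different technique, it naively compares every pair of records, and the names `group_transitively` and `close_enough` are invented for this example.

```python
def group_transitively(items, match_func):
    """Assign a group id to each item, merging groups through chained matches."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links up to the root, compressing the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the groups of matching items
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if match_func(items[i], items[j]):
                parent[find(j)] = find(i)

    # Relabel roots as consecutive group ids, in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(items))]


def close_enough(a, b):
    # Toy matching rule: two numbers match if they differ by at most 1
    return abs(a - b) <= 1


groups = group_transitively([1, 2, 3, 10, 11, 20], close_enough)
print(groups)
```

Here 1 and 3 don't match each other, but both match 2, so the three numbers share a group id.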


Let's use `find_partitions` to understand how it works. It takes as input a dataframe and a matching function. We'll make a first matching function that (fuzzy) matches on the name, and does an exact match on the phone.

In [118]:
from thefuzz import fuzz

def same_phone(r1, r2):
    return r1['phone'] == r2['phone']

def similar_name(r1, r2):
    return fuzz.partial_ratio(r1['name'], r2['name']) > 50

def same_restaurant(r1, r2):
    return (
        same_phone(r1, r2) and
        similar_name(r1, r2)
    )


What does a fuzzy match mean? Well, it's not an exact match. The idea is to measure the similarity between two strings.

In [119]:
a = "2355 peachtree rd.  peachtree battle shopping center"
b = "2355 peachtree rd. ne"
fuzz.partial_ratio(a, b)


95

In [120]:
fuzz.ratio(a, b)


58
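
To build some intuition for why the two scores differ, here is a rough approximation using only the standard library's `difflib`. This is not how `thefuzz` computes its scores, so the numbers will not necessarily agree with the outputs above. The names `simple_ratio` and `simple_partial_ratio` are made up for this sketch.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two whole strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    return max(
        simple_ratio(shorter, longer[start:start + len(shorter)])
        for start in range(len(longer) - len(shorter) + 1)
    )


whole = simple_ratio("art's deli", "art's delicatessen")
partial = simple_partial_ratio("art's deli", "art's delicatessen")
print(whole, partial)
```

The whole-string score is penalized by the extra characters in the longer string, whereas the partial score only asks whether the shorter string fits well somewhere inside the longer one.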

Anyway, let's see how `find_partitions` works:

In [121]:
duplicate_ids = find_partitions(
    df=dataset,
    match_func=same_restaurant
)


In [122]:
duplicate_ids


1        0
2        1
3        2
4        3
5        4
      ... 
860    856
861    857
862    858
863    859
864    860
Length: 864, dtype: int64

In [123]:
duplicate_ids.value_counts()


181    2
179    2
178    2
0      1
579    1
      ..
292    1
293    1
294    1
295    1
860    1
Length: 861, dtype: int64

In [124]:
duplicate_ids[duplicate_ids.eq(181)].index


Int64Index([184, 823], dtype='int64')

In [125]:
dataset.loc[184]


name       ritz-carlton restaurant
address          181 peachtree st.
city                       atlanta
phone                 404-659-0400
type              french (classic)
Name: 184, dtype: object

In [126]:
dataset.loc[823]


name       ritz-carlton cafe (atlanta)
address              181 peachtree st.
city                           atlanta
phone                     404-659-0400
type                    american (new)
Name: 823, dtype: object

These are paired because the names are similar and the phone numbers are the same:

In [127]:
(
    similar_name(dataset.loc[184], dataset.loc[823]) and
    same_phone(dataset.loc[184], dataset.loc[823])
)


True

Now that we understand how this works for a single case, let's measure the overall performance:

In [128]:
from pprint import pprint

def evaluate(duplicate_ids):
    dups = pd.DataFrame({
        'original_id': duplicate_ids.index,
        'duplicate_id': duplicate_ids
    })
    pairs = set()

    for duplicate_id, original_indices in dups.groupby('duplicate_id'):
        if len(original_indices) < 2:
            continue
        for a, b in itertools.combinations(original_indices['original_id'], 2):
            pairs.add(tuple(sorted((a, b))))

    print("FALSE NEGATIVES (duplicates we didn't find)")
    print(len(TRUTH - pairs))
    for (a, b) in random.choices(list(TRUTH - pairs), k=3):
        pprint(dataset.loc[a].to_dict())
        pprint(dataset.loc[b].to_dict())
        print('-' * 80)

    print('FALSE POSITIVES (duplicates we found that are not actual duplicates))')
    print(len(pairs - TRUTH))
    for (a, b) in random.choices(list(pairs - TRUTH), k=3):
        pprint(dataset.loc[a].to_dict())
        pprint(dataset.loc[b].to_dict())
        print('-' * 80)

evaluate(duplicate_ids)


FALSE NEGATIVES (duplicates we didn't find)
112


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
FALSE POSITIVES (duplicates we found that are not actual duplicates))
3


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------


The idea is to study these errors, and then design a better matching function.
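
One thing worth checking when studying the errors is formatting. In the first rows of the dataset, the same phone number appears as `310/246-1501` in one record and as `310-246-1501` in its duplicate, so an exact string comparison on the raw `phone` column cannot match them. A possible remedy, sketched here under the assumption that only the digits matter, is to normalize phone numbers before comparing them. The names `normalize_phone` and `same_phone_normalized` are hypothetical.

```python
import re


def normalize_phone(phone):
    """Keep only the digits of a phone number."""
    return re.sub(r'\D', '', phone)


def same_phone_normalized(r1, r2):
    # Compare digits only, so that '/' and '-' separators don't matter
    return normalize_phone(r1['phone']) == normalize_phone(r2['phone'])


# Phone numbers of records 1 and 2 from the dataset preview
r1 = {'phone': '310/246-1501'}
r2 = {'phone': '310-246-1501'}
matched = same_phone_normalized(r1, r2)
print(matched)
```

Whether this helps overall still has to be checked with `evaluate`, as for any other change to the matching function.
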

In [131]:
def same_phone(r1, r2):
    return r1['phone'] == r2['phone']


def same_area_code(r1, r2):
    return r1['phone'][:3] == r2['phone'][:3]


def same_name(r1, r2):
    return fuzz.ratio(r1['name'], r2['name']) > 75


def similar_address(r1, r2):
    return (
        fuzz.ratio(r1['address'], r2['address']) > 55 or
        fuzz.partial_ratio(r1['address'], r2['address']) > 75
    )

def similar_name(r1, r2):
    return fuzz.partial_ratio(r1['name'], r2['name']) > 50

def same_restaurant(r1, r2):
    return (
        (
            same_phone(r1, r2) and
            similar_name(r1, r2)
        ) or
        (
            same_area_code(r1, r2) and
            same_name(r1, r2) and
            similar_address(r1, r2)
        )
    )

duplicate_ids = find_partitions(
    df=dataset,
    match_func=same_restaurant
)

evaluate(duplicate_ids)


FALSE NEGATIVES (duplicates we didn't find)
23


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
FALSE POSITIVES (duplicates we found that are not actual duplicates))
14


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------


--------------------------------------------------------------------------------


This is better, because we reduced the number of false negatives from 112 to 23. But we also increased the number of false positives from 3 to 14.

As you can see, this manual solution isn't ideal. Another solution is to look at cosine similarities between rows. This is well explained [here](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html). Let's do a basic implementation:

In [175]:
from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32

    nnz_max = M*ntop

    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data
    )

    return csr_matrix((data,indices,indptr), shape=(M, N))

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame({
        'left_side': left_side,
        'right_side': right_side,
        'similarity': similarity
    }).query('left_side != right_side')


text = (
    dataset['name'].fillna('') + ' ' +
    dataset['address'].fillna('') + ' ' +
    dataset['city'].fillna('') + ' ' +
    dataset['phone'].fillna('') + ' ' +
    dataset['type'].fillna('')
)

tfidf = TfidfVectorizer().fit_transform(raw_documents=text)
matches = awesome_cossim_top(tfidf, tfidf.transpose(), 10, 0.8)
matches_df = get_matches_df(matches, dataset.index, top=1000)
matches_df


,left_side,right_side,similarity
1,1,2,0.883348
3,2,1,0.883348
5,3,4,0.838229
7,4,3,0.838229
8,5,6,1.0
11,6,5,1.0
13,7,8,0.966199
15,8,7,0.966199
17,9,10,0.955402
19,10,9,0.955402


How good is this? Let's take a look.

In [176]:
matches_set = set(tuple(sorted(pair)) for pair in matches_df[['left_side', 'right_side']].values)
len(TRUTH - matches_set)


23

In [177]:
len(matches_set - TRUTH)


4

Not bad for a generic algorithm!
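
For intuition about what the cosine-similarity approach measures, here is a tiny pure-Python sketch. It is not what `TfidfVectorizer` or `sparse_dot_topn` do internally (scikit-learn uses a smoothed IDF and a different tokenizer, for instance). It only shows the principle: weight each token by how rare it is, then compare the weighted vectors by the cosine of the angle between them. The function names are made up for this illustration, and the sample strings are the name, address, and city of records 3, 4, and 5 from the dataset preview.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document to a {token: weight} dict using a plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    return [
        {
            token: count * (math.log(n_docs / doc_freq[token]) + 1)
            for token, count in Counter(tokens).items()
        }
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Records 3 and 4 are duplicates of each other, record 5 is a different place
docs = [
    "art's delicatessen 12224 ventura blvd. studio city",
    "art's deli 12224 ventura blvd. studio city",
    "hotel bel-air 701 stone canyon rd. bel air",
]
vectors = tfidf_vectors(docs)
sim_same = cosine(vectors[0], vectors[1])
sim_diff = cosine(vectors[0], vectors[2])
print(round(sim_same, 2), round(sim_diff, 2))
```

The two records for the same restaurant share most of their tokens and score well above the unrelated pair, which shares none.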