# Classifying record pairs

There are dozens of classification algorithms for record linkage. This page shows several of them in use. The examples use the [Krebs register](http://recordlinkage.readthedocs.org/en/latest/reference.html#recordlinkage.datasets.krebsregister_cmp_data) dataset (Krebsregister is German for cancer registry). The dataset contains comparison vectors for record pairs, and the true match status of every pair is known from an extensive clerical review.

In [1]:
%precision 5

from __future__ import print_function

import pandas as pd
pd.set_option('display.precision', 5)
pd.options.display.max_rows = 10


## Introduction
First, import the recordlinkage module and the Krebs register data. 

In [2]:
import recordlinkage as rl
from recordlinkage.datasets import load_krebsregister

In [3]:
krebs_data, krebs_match = load_krebsregister()
krebs_data

Start downloading the data.
Data download succesfull.


| id1 | id2 | cmp_firstname1 | cmp_firstname2 | cmp_lastname1 | cmp_lastname2 | cmp_sex | cmp_birthday | cmp_birthmonth | cmp_birthyear | cmp_zipcode |
|---|---|---|---|---|---|---|---|---|---|---|
| 37291 | 53113 | 0.83333 | NaN | 1.00000 | NaN | 1 | 1.0 | 1.0 | 1.0 | 0.0 |
| 39086 | 47614 | 1.00000 | NaN | 1.00000 | NaN | 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 70031 | 70237 | 1.00000 | NaN | 1.00000 | NaN | 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 84795 | 97439 | 1.00000 | NaN | 1.00000 | NaN | 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 36950 | 42116 | 1.00000 | NaN | 1.00000 | 1.0 | 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32517 | 73116 | 1.00000 | NaN | 0.22222 | NaN | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 67707 | 83757 | 0.11111 | NaN | 1.00000 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 53258 | 91808 | 1.00000 | NaN | 0.00000 | NaN | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 31865 | 85285 | 1.00000 | NaN | 0.11111 | NaN | 1 | 0.0 | 1.0 | 0.0 | 0.0 |


The dataset contains 5749132 compared record pairs with the following attributes: first name, last name, sex, birthday, birth month, birth year and zip code.

In [4]:
krebs_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| cmp_firstname1 | 5748125.0 | 0.7129 | 0.38876 | 0.0 | 0.28571 | 1.0 | 1.0 | 1.0 |
| cmp_firstname2 | 103698.0 | 0.90002 | 0.27132 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| cmp_lastname1 | 5749132.0 | 0.31563 | 0.33423 | 0.0 | 0.1 | 0.18182 | 0.42857 | 1.0 |
| cmp_lastname2 | 2464.0 | 0.31841 | 0.36857 | 0.0 | 0.0 | 0.16667 | 0.375 | 1.0 |
| cmp_sex | 5749132.0 | 0.955 | 0.2073 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| cmp_birthday | 5748337.0 | 0.22447 | 0.41723 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| cmp_birthmonth | 5748337.0 | 0.48886 | 0.49988 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| cmp_birthyear | 5748337.0 | 0.22275 | 0.41609 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| cmp_zipcode | 5736289.0 | 0.00553 | 0.07415 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


The number of links in the data is:

In [5]:
len(krebs_match)

20931

Most of the classifiers in the ``recordlinkage`` package cannot handle missing values in the comparison data. In the Krebs register dataset, two features (``cmp_firstname2`` and ``cmp_lastname2``) are nearly always missing, and the other features are occasionally missing as well. A widely used way to avoid problems with the classification algorithms is to treat missing values as disagreeing comparisons. With ``pandas``, the missing values can be replaced in a single call:

In [6]:
krebs_data.fillna(0, inplace=True)
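
To make the effect of this replacement concrete, the sketch below applies the same `fillna(0)` idea to a tiny, made-up table of comparison vectors. The frame name `toy` and its values are illustrative assumptions, not Krebs register data. After the call, a missing comparison can no longer be told apart from a disagreement.

```python
import numpy as np
import pandas as pd

# Hypothetical comparison vectors; NaN marks a comparison that could not be made.
toy = pd.DataFrame({
    "cmp_firstname1": [1.0, 0.5, np.nan],
    "cmp_firstname2": [np.nan, np.nan, 1.0],
    "cmp_sex": [1, 1, 0],
})

# Count the missing values per column before the replacement.
missing_before = toy.isna().sum()

# Treat every missing comparison as a disagreement (similarity 0).
toy_filled = toy.fillna(0)

print(missing_before)
print(toy_filled)
```

This version returns a new frame instead of using `inplace=True`, so the original values stay available for comparison.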

## K-means classifier

The K-means clustering algorithm is well known and widely used in big data analysis. The K-means classifier in the ``recordlinkage`` package is configured so that it can be used for linking records. For more information about K-means clustering, see [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering).

In [7]:
kmeans = rl.KMeansClassifier()
result_kmeans = kmeans.learn(krebs_data)

# The predicted number of matches
len(result_kmeans)

371525

The classifier is now trained and the comparison vectors are classified. 

In [8]:
cm_kmeans = rl.confusion_matrix(krebs_match, result_kmeans, len(krebs_data))
fscore_kmeans = rl.fscore(cm_kmeans)

fscore_kmeans

0.10598
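
As a rough illustration of how K-means can split comparison vectors into two groups, here is a minimal two-cluster sketch in plain Python. It is not the `recordlinkage` implementation: the function name `two_means`, the toy vectors, and the choice to start one centre at total disagreement and the other at total agreement are assumptions made for this example.

```python
def two_means(vectors, n_iter=10):
    """Cluster comparison vectors into a non-match (0) and a match (1) group."""
    dim = len(vectors[0])
    # Assumed starting centres: total disagreement and total agreement.
    centres = [[0.0] * dim, [1.0] * dim]
    labels = [0] * len(vectors)

    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centre.
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centres]
            labels[i] = dists.index(min(dists))
        # Update step: move each centre to the mean of its members.
        for k in range(2):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical comparison vectors (first name, last name, birth year).
toy_vectors = [
    [1.0, 1.0, 1.0],
    [0.9, 1.0, 1.0],
    [0.1, 0.2, 0.0],
    [0.0, 0.1, 0.0],
    [0.2, 0.0, 1.0],
]
labels = two_means(toy_vectors)
print(labels)  # label 1 marks the cluster that started at total agreement
```

No labelled training data is involved here: the grouping comes only from the distances between the vectors and the two centres.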

## Logistic regression

For this example, assume that the true match status of the first 5000 record pairs is known.

In [9]:
golden_pairs = krebs_data[0:5000]
golden_matches_index = golden_pairs.index.intersection(krebs_match)
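
The selection of golden matches is a set intersection between two pair indexes. The sketch below shows the same operation on small, made-up `MultiIndex` objects. The names `pairs`, `true_matches`, `reviewed` and `reviewed_matches` and all identifiers in them are hypothetical.

```python
import pandas as pd

# Hypothetical candidate pairs, indexed by (id1, id2) like the Krebs data.
pairs = pd.MultiIndex.from_tuples(
    [(1, 2), (1, 3), (2, 3), (4, 5)], names=["id1", "id2"]
)
# Hypothetical set of all true matches.
true_matches = pd.MultiIndex.from_tuples(
    [(1, 2), (4, 5), (6, 7)], names=["id1", "id2"]
)

# Suppose only the first three pairs have been reviewed.
reviewed = pairs[:3]

# True matches among the reviewed pairs.
reviewed_matches = reviewed.intersection(true_matches)
print(list(reviewed_matches))
```

The pair `(4, 5)` is a true match but falls outside the reviewed slice, so it does not appear in the result.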

The logistic regression classifier is called in the same way as the K-means classifier. The only difference is that the golden pairs and their known match status are now used to train the classifier.

In [10]:
# Train the classifier
logreg = rl.LogisticRegressionClassifier()
logreg.learn(golden_pairs, golden_matches_index)

# Predict the match status for all record pairs
result_logreg = logreg.predict(krebs_data)

len(result_logreg)

22699

In [11]:
conf_logreg = rl.confusion_matrix(krebs_match, result_logreg, len(krebs_data))
conf_logreg

array([[  20925,       6],
       [   1774, 5726427]])

In [12]:
# The F-score for this classification is
rl.fscore(conf_logreg)

0.95920
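
The F-score can be checked by hand from the confusion matrix printed above. The sketch assumes the layout `[[true positives, false negatives], [false positives, true negatives]]`. This is an inference rather than something stated on this page, but it is consistent with the counts: the first row sums to the 20931 known matches and the first column sums to the 22699 predicted matches.

```python
# Confusion matrix printed above for logistic regression, assumed layout:
# [[true positives, false negatives], [false positives, true negatives]].
conf = [[20925, 6], [1774, 5726427]]
(tp, fn), (fp, tn) = conf

# Precision: share of predicted matches that are true matches.
precision = tp / (tp + fp)
# Recall: share of true matches that were found.
recall = tp / (tp + fn)
# F-score: harmonic mean of precision and recall.
fscore = 2 * precision * recall / (precision + recall)

print(round(precision, 5), round(recall, 5), round(fscore, 5))
```

The true negatives do not enter the calculation, which is why the F-score stays informative even though non-matches vastly outnumber matches.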

## Support Vector Machines

Support vector machines have become increasingly popular in record linkage. The example below shows how one performs on the Krebs register data when trained on the same golden pairs:

In [13]:
# Train the classifier
svm = rl.SVMClassifier()
svm.learn(golden_pairs, golden_matches_index)

# Predict the match status for all record pairs
result_svm = svm.predict(krebs_data)

len(result_svm)

22302

In [14]:
conf_svm = rl.confusion_matrix(krebs_match, result_svm, len(krebs_data))
conf_svm

array([[  20925,       6],
       [   1377, 5726824]])

In [15]:
# The F-score for this classification is
rl.fscore(conf_svm)

0.96801

## Expectation/Conditional Maximization Algorithm

In [16]:
# Train the classifier
ecm = rl.ECMClassifier()
result_ecm = ecm.learn((krebs_data > 0.8).astype(int))

len(result_ecm)

19817
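
The ECM call above is given thresholded comparison vectors rather than the raw similarity scores. The sketch below shows what the expression `(data > 0.8).astype(int)` does to a small, made-up frame. The name `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical similarity scores for three record pairs.
toy = pd.DataFrame({
    "cmp_firstname1": [1.0, 0.83333, 0.11111],
    "cmp_lastname1": [1.0, 0.22222, 0.8],
})

# Scores strictly above 0.8 become 1 (agree); everything else becomes 0.
binary = (toy > 0.8).astype(int)
print(binary)
```

Because the comparison is strict, a score of exactly 0.8 maps to 0.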

In [17]:
conf_ecm = rl.confusion_matrix(krebs_match, result_ecm, len(krebs_data))
conf_ecm

array([[  19813,    1118],
       [      4, 5728197]])

In [18]:
# The F-score for this classification is
rl.fscore(conf_ecm)

0.97246