# Classifying record pairs

There are dozens of classification algorithms for record linkage. This example demonstrates several of them. The examples use the [Krebs register](http://recordlinkage.readthedocs.org/en/latest/reference.html#recordlinkage.datasets.krebsregister_cmp_data) (German for cancer registry) dataset, which is available in the [recordlinkage-datasets package](https://github.com/J535D165/recordlinkage-datasets). Be sure that this package is installed before trying the examples below. The Krebs register dataset contains comparison vectors for which it is known whether the records belong to the same entity (a match) or not (a non-match). The true match status was established through an extensive clerical review.

## Introduction
First, import the recordlinkage module and the Krebs register data. 

In [1]:
import recordlinkage
from recordlinkage.datasets import krebsregister_cmp_data

In [2]:
krebs_data, krebs_match = krebsregister_cmp_data(block=range(1,11))
krebs_data.head()

| id1 | id2 | cmp_firstname1 | cmp_firstname2 | cmp_lastname1 | cmp_lastname2 | cmp_sex | cmp_birthday | cmp_birthmonth | cmp_birthyear | cmp_zipcode |
|---|---|---|---|---|---|---|---|---|---|---|
| 49 | 6439 | 0.142857 | NaN | 0.0 | NaN | 1 | 1.0 | 1.0 | 1.0 | 0.0 |
| 79667 | 83449 | 1.0 | NaN | 0.25 | NaN | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 51539 | 59550 | 0.166667 | NaN | 0.1 | NaN | 0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 63018 | 66603 | 1.0 | NaN | 0.0 | NaN | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 56779 | 79443 | 1.0 | NaN | 0.166667 | NaN | 1 | 0.0 | 0.0 | 1.0 | 0.0 |


The dataset contains 5749132 compared record pairs, with comparison features for first name, last name, sex, birthday, birth month, birth year and zip code.

In [3]:
krebs_data.describe()

| | cmp_firstname1 | cmp_firstname2 | cmp_lastname1 | cmp_lastname2 | cmp_sex | cmp_birthday | cmp_birthmonth | cmp_birthyear | cmp_zipcode |
|---|---|---|---|---|---|---|---|---|---|
| count | 5748125.0 | 103698.0 | 5749132.0 | 2464.0 | 5749132.0 | 5748337.0 | 5748337.0 | 5748337.0 | 5736289.0 |
| mean | 0.7129025 | 0.900018 | 0.3156278 | 0.318413 | 0.9550014 | 0.2244653 | 0.4888553 | 0.2227486 | 0.005528661 |
| std | 0.3887584 | 0.271318 | 0.3342336 | 0.368567 | 0.2073011 | 0.4172297 | 0.4998758 | 0.416091 | 0.07414915 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.2857143 | 1.0 | 0.1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 1.0 | 0.1818182 | 0.166667 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 1.0 | 0.4285714 | 0.375 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


The number of links in the data is:

In [4]:
len(krebs_match)

20931

Most of the classifiers in the ``recordlinkage`` package cannot handle missing values in the comparison data. In the Krebs register dataset, two features (cmp_firstname2 and cmp_lastname2) are nearly always missing, and the other features are occasionally missing as well. To prevent problems with the classification algorithms, a widely used approach is to treat missing values as disagreeing comparisons. ``pandas`` makes it easy to replace the missing values:

In [5]:
krebs_data.fillna(0, inplace=True)
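
To see what this replacement does, here is a small sketch on a made-up frame. The column names are borrowed from the dataset, but the values and the variable names `toy` and `filled` are invented for illustration. Every missing comparison becomes 0, the same score as a full disagreement.

```python
import pandas as pd

# Toy comparison vectors; NaN marks a comparison that could not be made.
nan = float("nan")
toy = pd.DataFrame({
    "cmp_firstname1": [1.0, nan, 0.25],
    "cmp_firstname2": [nan, nan, 1.0],
})

# Treat every missing comparison as a disagreement (score 0).
filled = toy.fillna(0)

print(toy.isna().sum())   # number of missing values per column
print(filled)
```

Note that this choice discards the distinction between "compared and disagreed" and "could not be compared", which is usually acceptable for the classifiers used here.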

## K-means classifier

The K-means clustering algorithm is well known and widely used in big data analysis. The K-means classifier in the ``recordlinkage`` package is configured so that it can be used for linking records. For more information about K-means clustering, see [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering).

In [6]:
kmeans = recordlinkage.KMeansClassifier()
result_kmeans = kmeans.learn(krebs_data)

type(result_kmeans)

pandas.indexes.multi.MultiIndex

The classifier is now trained and has classified the comparison vectors.
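
To make the idea of clustering comparison vectors concrete, the following is a minimal NumPy sketch of two-cluster K-means on toy data. It is not the ``recordlinkage`` implementation: the function name `two_means`, the toy vectors, and the choice of starting centres (all zeros for non-matches, all ones for matches) are assumptions made for illustration.

```python
import numpy as np

def two_means(vectors, n_iter=10):
    """Cluster comparison vectors into non-matches (0) and matches (1)."""
    X = np.asarray(vectors, dtype=float)
    # Assumed starting centres: total disagreement and total agreement.
    centres = np.vstack([np.zeros(X.shape[1]), np.ones(X.shape[1])])
    for _ in range(n_iter):
        # Squared Euclidean distance of every vector to both centres.
        dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each centre to the mean of its members (keep it if empty).
        for k in (0, 1):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
    return labels

toy_vectors = np.array([
    [1.0, 1.0, 1.0],   # agrees on everything
    [0.9, 1.0, 1.0],
    [0.1, 0.0, 0.0],   # disagrees on almost everything
    [0.0, 0.2, 0.0],
    [0.2, 0.0, 1.0],   # agrees on one feature only
])
labels = two_means(toy_vectors)
print(labels)
```

Because no labelled pairs are involved, the result depends entirely on how well the two clusters separate in the comparison space.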

In [7]:
conf_kmeans = recordlinkage.confusion_matrix(krebs_match, result_kmeans, len(krebs_data))
conf_kmeans

array([[  20797,     134],
       [ 350725, 5377476]])

In [8]:
# The F-score for this classification is
recordlinkage.fscore(conf_kmeans)

0.10598466567971196
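
The F-score can be recomputed by hand from the confusion matrix, which shows where the low value comes from. The sketch below assumes the matrix is laid out as `[[TP, FN], [FP, TN]]`; this layout is inferred from the fact that it reproduces the score printed above. The helper name `fscore_from_confusion` is hypothetical.

```python
def fscore_from_confusion(conf):
    """F-score for a confusion matrix laid out as [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, _tn) = conf
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)

# Values copied from the K-means confusion matrix above.
conf_kmeans = [[20797, 134], [350725, 5377476]]
f_kmeans = fscore_from_confusion(conf_kmeans)
print(f_kmeans)
```

The 350725 false positives dominate the denominator, so the score stays low even though only 134 true matches are missed.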

## Logistic regression

For this example, assume that the true match status of the first 5000 record pairs is known.

In [9]:
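# Use the first 5000 record pairs as labelled ("golden") training data
golden_pairs = krebs_data[0:5000]
# Keep only the true matches that fall inside the training pairs
golden_matches_index = golden_pairs.index.intersection(krebs_match)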
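
Selecting the matches that fall inside the labelled subset is an index intersection. A minimal sketch on toy data follows; the identifiers and the variable names `pairs`, `all_matches`, `labelled` and `labelled_matches` are made up for illustration.

```python
import pandas as pd

# All candidate record pairs, indexed by (id1, id2).
pairs = pd.MultiIndex.from_tuples(
    [(1, 10), (2, 20), (3, 30), (4, 40)], names=["id1", "id2"]
)
# Known true matches for the whole dataset.
all_matches = pd.MultiIndex.from_tuples(
    [(2, 20), (4, 40), (9, 90)], names=["id1", "id2"]
)

# Pretend only the first three pairs were reviewed.
labelled = pairs[:3]
# Matches that lie inside the reviewed subset.
labelled_matches = labelled.intersection(all_matches)
print(list(labelled_matches))
```

The pair `(4, 40)` is a true match but is excluded, because it was not part of the reviewed subset.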

The logistic regression classifier is called in the same way as the K-means classifier. The only difference is that the golden data is now used to train the classifier.

In [10]:
# Train the classifier
logreg = recordlinkage.LogisticRegressionClassifier()
logreg.learn(golden_pairs, golden_matches_index)

# Predict the match status for all record pairs
result_logreg = logreg.predict(krebs_data)

len(result_logreg)

20024

In [11]:
conf_logreg = recordlinkage.confusion_matrix(krebs_match, result_logreg, len(krebs_data))
conf_logreg

array([[  19831,    1100],
       [    193, 5728008]])

In [12]:
# The F-score for this classification is
recordlinkage.fscore(conf_logreg)

0.9684287632767672

## Support Vector Machines

Support vector machines have become increasingly popular in record linkage. The results below show why:

In [13]:
# Train the classifier
svm = recordlinkage.SVMClassifier()
svm.learn(golden_pairs, golden_matches_index)

# Predict the match status for all record pairs
result_svm = svm.predict(krebs_data)

len(result_svm)

20843

In [14]:
conf_svm = recordlinkage.confusion_matrix(krebs_match, result_svm, len(krebs_data))
conf_svm

array([[  20821,     110],
       [     22, 5728179]])

In [15]:
# The F-score for this classification is
recordlinkage.fscore(conf_svm)

0.9968401397998754

## Expectation/Conditional Maximization Algorithm

In [16]:
# Binarize the comparison vectors (similarity above 0.8 counts as agreement)
# and train the classifier on the result
ecm = recordlinkage.ECMClassifier()
result_ecm = ecm.learn((krebs_data > 0.8).astype(int))

len(result_ecm)

19817
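
The training call above first turns the similarity scores into 0/1 values with a threshold of 0.8. The sketch below shows that step in isolation on a made-up frame; the values and the variable names `toy` and `binary` are assumptions for illustration.

```python
import pandas as pd

# Toy comparison vectors with similarity scores between 0 and 1.
toy = pd.DataFrame({
    "cmp_firstname1": [1.0, 0.85, 0.8, 0.2],
    "cmp_sex": [1, 0, 1, 1],
})

# Scores strictly above 0.8 become 1 (agree); everything else becomes 0.
binary = (toy > 0.8).astype(int)
print(binary)
```

The comparison is strict, so a score of exactly 0.8 counts as a disagreement.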

In [17]:
conf_ecm = recordlinkage.confusion_matrix(krebs_match, result_ecm, len(krebs_data))
conf_ecm

array([[  19813,    1118],
       [      4, 5728197]])

In [18]:
# The F-score for this classification is
recordlinkage.fscore(conf_ecm)

0.9724649062530676
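
To compare the four classifiers side by side, precision and recall can be derived from the confusion matrices printed above. This sketch assumes the `[[TP, FN], [FP, TN]]` layout, which is consistent with the F-scores shown in this example. The helper `precision_recall` and the dictionary names are hypothetical.

```python
# Confusion matrices copied from the outputs above.
confusion = {
    "K-means": [[20797, 134], [350725, 5377476]],
    "Logistic regression": [[19831, 1100], [193, 5728008]],
    "SVM": [[20821, 110], [22, 5728179]],
    "ECM": [[19813, 1118], [4, 5728197]],
}

def precision_recall(conf):
    """Return (precision, recall) for a [[TP, FN], [FP, TN]] matrix."""
    (tp, fn), (fp, _tn) = conf
    return tp / (tp + fp), tp / (tp + fn)

summary = {name: precision_recall(conf) for name, conf in confusion.items()}
for name, (precision, recall) in summary.items():
    print(f"{name:<20} precision={precision:.4f} recall={recall:.4f}")
```

On these numbers, K-means finds almost all matches but has a precision below 0.06, ECM has the highest precision, and the SVM has the highest recall while remaining highly precise.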