# Linking two census datasets
This example shows how to link two datasets that contain names and contact details of individuals. We link the records on first name, last name, sex, birthdate, city, street address, job and email. The data used in this example is fictitious.

First, import the ``recordlinkage`` module. The submodule ``recordlinkage.datasets`` contains several datasets that can be used for testing. This example uses the datasets ``censusA`` and ``censusB``, which are loaded with the functions ``load_censusA`` and ``load_censusB`` respectively.

In [1]:
import recordlinkage
from recordlinkage.datasets import load_censusA, load_censusB

The datasets ``censusA`` and ``censusB`` are loaded with the following code. Both are ``pandas.DataFrame`` objects, which makes it easy to manipulate the data if desired. For details about data manipulation with ``pandas``, see the pandas documentation at http://pandas.pydata.org/.

In [2]:
dfA = load_censusA()
dfB = load_censusB()

## Making record pairs

An intuitive starting point is to compare each record of DataFrame ``dfA`` with every record of DataFrame ``dfB``. To do so, we make record pairs, each containing one record of ``dfA`` and one record of ``dfB``. This process of making record pairs is also called 'indexing'. With the ``recordlinkage`` module, indexing is easy. First, create an instance of the ``Pairs`` class. This class takes two dataframes as input arguments. For deduplication of a single dataframe, one dataframe is sufficient.

In [3]:
pcl = recordlinkage.Pairs(dfA, dfB)

With the method ``Pairs.full``, all possible (and unique) record pairs are made. The method returns a ``pandas.MultiIndex``. 

In [4]:
pairs = pcl.full()

The number of pairs is equal to the number of records in ``dfA`` times the number of records in ``dfB``.

In [5]:
len(dfA)*len(dfB) == len(pairs)

True

Most of these record pairs do not belong to the same person. With one-to-one matching, the number of matches can be at most the number of records in the smaller dataframe. With full indexing, ``min(len(dfA), len(dfB))`` is much smaller than ``len(pairs)``. The ``recordlinkage`` module offers more advanced indexing methods that reduce the number of record pairs by leaving obvious non-matches out of the index. Note that a matching record pair that is not included in the index can no longer be matched.
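The gap between the size of a full index and the largest possible number of one-to-one matches can be illustrated with a minimal sketch in plain Python. The record identifiers below are hypothetical and are not taken from the census datasets; ``itertools.product`` stands in for full indexing.

```python
from itertools import product

# Hypothetical record identifiers, not taken from censusA or censusB.
ids_a = ["a1", "a2", "a3", "a4"]
ids_b = ["b1", "b2", "b3"]

# Full indexing: every record of A is paired with every record of B.
full_pairs = list(product(ids_a, ids_b))

# Under one-to-one matching, the number of matches is bounded by
# the size of the smaller dataset.
max_matches = min(len(ids_a), len(ids_b))

print(len(full_pairs))  # 12
print(max_matches)      # 3
```

The number of pairs grows with the product of the dataset sizes, while the upper bound on matches grows only with the smaller size, which is why reducing the index pays off quickly on larger data.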

One of the best-known indexing methods is 'blocking'. This method includes only record pairs that are identical on one or more stored attributes of the person (or entity in general). Blocking is available in the ``recordlinkage`` module.

In [6]:
pairs = pcl.block('first_name')

The argument ``'first_name'`` is the blocking variable. It has to be the name of a column in both ``dfA`` and ``dfB``. It is possible to pass a list of column names to block on multiple variables. Blocking on multiple variables will reduce the number of record pairs even further.
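To make the idea of blocking concrete, here is a minimal sketch in plain Python. It is not the implementation used by ``recordlinkage``; the toy records and the helper ``block_pairs`` are hypothetical and serve only to show how blocking on one or on several variables restricts the candidate pairs.

```python
from collections import defaultdict

# Hypothetical toy records: (record id, first name, city).
records_a = [("a1", "anna", "utrecht"), ("a2", "bob", "leiden"), ("a3", "anna", "delft")]
records_b = [("b1", "anna", "utrecht"), ("b2", "carl", "leiden"), ("b3", "bob", "leiden")]


def block_pairs(left, right, key):
    """Return (left id, right id) pairs whose blocking key is identical."""
    # Group the ids of the right-hand records by their blocking key.
    buckets = defaultdict(list)
    for rec in right:
        buckets[key(rec)].append(rec[0])
    # Pair each left-hand record with the right-hand records in its bucket.
    return [(rec[0], rid) for rec in left for rid in buckets.get(key(rec), [])]


# Block on first name only.
by_name = block_pairs(records_a, records_b, key=lambda r: r[1])

# Block on first name and city together.
by_name_city = block_pairs(records_a, records_b, key=lambda r: (r[1], r[2]))

print(by_name)       # [('a1', 'b1'), ('a2', 'b3'), ('a3', 'b1')]
print(by_name_city)  # [('a1', 'b1'), ('a2', 'b3')]
```

Adding the city to the blocking key drops the pair ``('a3', 'b1')``, because the two records share a first name but not a city.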

Another implemented indexing method is sorted neighbourhood indexing (``Pairs.sortedneighbourhood``). This method is very useful when the strings used for indexing contain many misspellings. In fact, sorted neighbourhood indexing is a generalisation of blocking. See the documentation for details about sorted neighbourhood indexing.
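The following sketch shows one way to think about sorted neighbourhood indexing and why it generalises blocking. It is an illustration in plain Python under simplifying assumptions, not the code behind ``Pairs.sortedneighbourhood``; the helper ``sorted_neighbourhood``, its ``window`` parameter and the toy surnames are hypothetical.

```python
# Hypothetical toy data: record id -> surname, with one misspelling ("jansn").
names_a = {"a1": "jansen", "a2": "smit", "a3": "visser"}
names_b = {"b1": "jansen", "b2": "jansn", "b3": "smit", "b4": "bakker"}


def sorted_neighbourhood(left, right, window=3):
    """Pair records whose keys are close together in the sorted list of unique keys."""
    # Sort all unique key values and remember the rank of each value.
    keys = sorted(set(left.values()) | set(right.values()))
    position = {k: i for i, k in enumerate(keys)}
    # A window of size w reaches (w - 1) // 2 ranks to either side.
    reach = (window - 1) // 2
    # The nested loop is quadratic; it is kept for clarity only.
    return [
        (id_a, id_b)
        for id_a, key_a in left.items()
        for id_b, key_b in right.items()
        if abs(position[key_a] - position[key_b]) <= reach
    ]


print(sorted_neighbourhood(names_a, names_b, window=3))
# [('a1', 'b1'), ('a1', 'b2'), ('a1', 'b4'), ('a2', 'b2'), ('a2', 'b3'), ('a3', 'b3')]

print(sorted_neighbourhood(names_a, names_b, window=1))
# [('a1', 'b1'), ('a2', 'b3')]
```

With ``window=3`` the misspelled ``jansn`` still ends up paired with ``jansen``, at the cost of a few extra candidate pairs. With ``window=1`` only identical keys are paired, which is exactly blocking.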

## Comparing record pairs

Each record pair is a candidate match. To classify the candidate record pairs into matches and non-matches, compare the records on all attributes both records have in common. The ``recordlinkage`` module has a class named ``Compare``. This class is used to compare the records. The following code shows how to compare attributes. 

In [7]:
compare_cl = recordlinkage.Compare(pairs, dfA, dfB)

compare_cl.exact('first_name', 'first_name', name='first_name')
compare_cl.fuzzy('last_name', 'last_name', name='last_name', method='jarowinkler', threshold=0.85)
compare_cl.exact('sex', 'sex', name='sex')
compare_cl.exact('birthdate', 'birthdate', name='birthdate')
compare_cl.exact('city', 'city', name='city')
compare_cl.exact('street_address', 'street_address', name='street_address')
compare_cl.exact('job', 'job', name='job')
compare_cl.exact('email', 'email', name='email');

All comparisons are stored in a dataframe with the comparison features as columns and the record pairs as rows. The comparison vectors can be found in the ``vectors`` attribute of the ``Compare`` class. The first 10 comparison vectors are:

In [8]:
compare_cl.vectors.head(10)

index_A,index_B,first_name,last_name,sex,birthdate,city,street_address,job,email
1000000,1000349,1,1,1,1,1,1,1,1
1000002,1000027,1,1,1,1,1,1,0,0
1000003,1000404,1,1,1,1,1,0,0,1
1000004,1000691,1,1,1,1,1,0,1,1
1000005,1000411,1,1,1,1,1,1,0,1
1000006,1000301,1,1,1,1,1,1,1,1
1000008,1000644,1,1,1,1,1,1,1,1
1000009,1000676,1,1,1,1,1,1,1,0
1000010,1000112,1,1,1,1,1,1,1,0
1000011,1000173,1,1,1,1,1,1,0,1
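How a single row of such a table comes about can be sketched in plain Python. The two records and the helpers ``exact`` and ``fuzzy`` are hypothetical. Note that ``difflib.SequenceMatcher`` from the standard library is used here as a stand-in similarity measure; it is not the Jaro-Winkler method used in the comparison above, so the scores differ from what ``recordlinkage`` computes.

```python
from difflib import SequenceMatcher

# Hypothetical record pair, not taken from the census datasets.
rec_a = {"first_name": "anna", "last_name": "jansen", "city": "utrecht"}
rec_b = {"first_name": "anna", "last_name": "jansn", "city": "leiden"}


def exact(a, b):
    """Return 1 if both values are identical, otherwise 0."""
    return int(a == b)


def fuzzy(a, b, threshold=0.85):
    """Return 1 if the string similarity reaches the threshold, otherwise 0.

    Assumption: SequenceMatcher.ratio() replaces Jaro-Winkler for illustration.
    """
    return int(SequenceMatcher(None, a, b).ratio() >= threshold)


# One comparison vector: one 0/1 entry per compared attribute.
vector = {
    "first_name": exact(rec_a["first_name"], rec_b["first_name"]),
    "last_name": fuzzy(rec_a["last_name"], rec_b["last_name"]),
    "city": exact(rec_a["city"], rec_b["city"]),
}

print(vector)  # {'first_name': 1, 'last_name': 1, 'city': 0}
```

The misspelled last name still counts as agreement because its similarity is above the threshold, whereas an exact comparison on the same attribute would have produced 0.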


In [None]:
ecm_cl = recordlinkage.ExpectationMaximisationClassifier(method='ecm')

ecm_cl.learn(compare_cl.vectors)