<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.

# Record Linkage
----

## Table of Contents

- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
    - [Methods](#Methods)
- [The Principles of Record Linkage](#The-Principles-of-Record-Linkage)
- [Data Description](#Data-Description)
- [Python Setup](#Python-Setup)
- [Load the Data](#Load-the-Data)
- [Data Exploration](#Data-Exploration)
- [Record Linkage on Business Names](#Record-Linkage-on-Business-Names)
    - [The Importance of Pre-Processing](#The-Importance-of-Pre-Processing)
    - [Cleaning String Variables](#Cleaning-String-Variables)
    - [Regular Expressions – `regex`](#Regular-Expressions-–-regex)
    - [Handling Business Suffixes](#Handling-Business-Suffixes)
    - [Record Linkage: Exact Matching on One Field](#Record-Linkage:-Exact-Matching-on-One-Field)
- [Record Linkage on Addresses](#Record-Linkage-on-Addresses)
    - [Address Parsing](#Address-Parsing)
    - [Record Linkage: Exact Matching on Several Fields](#Record-Linkage:-Exact-Matching-on-Several-Fields)
    - [Record Linkage: Rule-Based Matching](#Record-Linkage:-Rule-Based-Matching)
    - [Record Linkage: Fellegi Sunter](#Record-Linkage:-Fellegi-Sunter)
- [Additional Resources](#Additional-Resources)

## Introduction
- Back to [Table of Contents](#Table-of-Contents)

This notebook provides an introduction to record linkage using Python. Upon completion of this notebook you will be able to apply record linkage techniques using the *recordlinkage* package to combine data from different sources in Python. 
It will lead you through all the steps necessary for a successful record linkage, starting with data preparation, including pre-processing, cleaning, and standardization of data.
The notebook follows the underlying lecture and provides examples of how to implement record linkage techniques. 

### Learning Objectives

The goal of this notebook is for you to understand the record linkage techniques. You will be responsible for linking the different datasets in the ADRF, and the subsequent dataset will be used in your later projects.

### Methods

For this notebook exercise we are interested in business data from Kansas City, MO. Business names appear in different datasets: Wage Records and Employer Data from the Missouri Department of Labor, Business Registrations from the Kansas City, MO, Department of Revenue, and Water Consumption data from the Kansas City, MO, Water Services. Here, we will combine the three.

- **Analytical Exercise**: Merge Employer Wage Records, Business Registrations, and Water Services Data. 

- **Approach**: We will look at the data available to us, and clean & pre-process it to enable better linkage. Since the only identifiers we have for this case study are the names of the employers and addresses, we will have to use string matching techniques, part of Python's record linkage package. 

## The Principles of Record Linkage
- Back to [Table of Contents](#Table-of-Contents)

The goal of record linkage is to determine whether pairs of records describe the same entity. For instance, this is important for removing duplicates from a data source or joining two separate data sources together. In various fields, record linkage also goes by the terms data matching, merge/purge, duplicate detection, de-duping, reference matching, entity resolution, disambiguation, and co-reference/anaphora.

There are several approaches to record linkage, including:
- **Exact matching**: joining records based on social security number, exact name, or geographic code information. This is what you have already done in SQL by joining tables on a unique identifier. 
- **Rule-based matching**: applying a cascading set of rules that reflect the domain knowledge of the records being linked. 
- **Probabilistic record linkage**: estimating linkage weights to calculate the probability of a certain match.
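
To make the difference between exact and rule-based matching concrete, here is a minimal sketch on two made-up records. The field names (`ein`, `name`, `zip`) and the two rules are hypothetical and only illustrate the idea; they are not the rules used later in this notebook.

```python
# Toy records; all values are invented for illustration.
rec_a = {"ein": "123", "name": "ACME TOOLS", "zip": "60601"}
rec_b = {"ein": None, "name": "ACME TOOLS", "zip": "60601"}

def exact_match(a, b, key):
    """Exact matching: link only if both records share a non-missing key value."""
    return a[key] is not None and a[key] == b[key]

def rule_based_match(a, b):
    """Rule-based matching: try a cascade of rules, from strictest to loosest."""
    if exact_match(a, b, "ein"):
        return "rule 1"   # same identifier
    if exact_match(a, b, "name") and exact_match(a, b, "zip"):
        return "rule 2"   # same name and same ZIP code
    return None           # no rule fired: treat the pair as a non-match

print(exact_match(rec_a, rec_b, "ein"))   # False: the identifier is missing in rec_b
print(rule_based_match(rec_a, rec_b))     # rule 2
```

The cascade still links the two records even though the identifier is missing in one of them, which is where domain knowledge about the data enters.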

In practical applications you will need record linkage techniques to combine information about the same entity that is stored in different data sources. Record linkage will also help you to address the quality of different data sources. For example, if one of your databases has missing values, you might be able to fill those by finding an identical pair in a different data source. Overall, the main applications of record linkage are:
1. Merging two or more data files 
2. Identifying the intersection of two data sets 
3. Updating data files (with the data rows of the other data files) and imputing missing data
4. Entity disambiguation and de-duplication

## Analytical Approach

For this notebook exercise we are interested in data on establishments.
- **Analytical Exercise**: Compare business names across the 2005 & 2015 employer data and find record pairs. 
- **Data Availability**: We have 'names', 'addresses' and 'setup_date' for the various businesses in the two time periods. 

- **Approach**: We will look at the data available to us, and clean & pre-process it to enable better linkage. Since the only identifiers we have for this case study are the names of the firms, we will have to use string matching techniques, which are provided by the record linkage package in Python. 

- *Caveat*: The data we use for this exercise has been taken from the QCEW data from the Illinois Department of Employment Security database. As some of you might know, we do have unique identifiers (EIN) in this database. However, we have removed those identifiers for this exercise. Our dataset is also a small sample of the overall data available, to reduce runtime & CPU usage in class. 
The goal of this notebook is for you to understand record linkage techniques. The analytical results at the end of this exercise are taken from a subset and hence might not hold true for the larger database. 


## Data Description
- Back to [Table of Contents](#Table-of-Contents)

The dataset used in this exercise comes from the IDES (Illinois Department of Employment Security Database). 
We use the QCEW data or the Quarterly Census of Employment & Wages for our exercise. Below are some more concrete details about the dataset used: 

Time Period: 2005 Q1, and 2015 Q1

**Variables Used**:
- Name_legal (legal name of the firm)
- Address (concatenated the different address strings such as house number, street address, etc to have one variable)
- Setup Date: concatenated the day, month & year to get one date variable which refers to the setup of the establishment in question. 

**Filter Applied**: for the purposes of this exercise, we also removed records that meet any of the following criteria: 
- All records which have a unique combination of ein & name_legal... This means any record for which one ein is linked to only one name_legal is not a part of this dataset.
- All records which have a null value for address or setup_date
- For 2005 data in particular, for some records, we have character values (A-Z) in the variable holding #employees. We have removed such records. 

In a last step, we drew a sample which ensures a certain degree of deterministic name matching between the two datasets for 2005 & 2015, to make sure we have enough matches in the data for this exercise. 

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

Python provides us with some tools we can use for record linkage, so we don't have to start from scratch and code our own linkage algorithms. Before we start, we need to load the package `recordlinkage`. To function fully, this package uses other packages, which also need to be imported. We are adding a couple more packages to the ones you are already familiar with.

In [None]:
# __future__ imports must be the first statement in the cell
from __future__ import print_function

# general use imports
%pylab inline
import datetime
import glob
import inspect
import numpy as np
import os
import six
import warnings
import matplotlib.pyplot as plt
import jellyfish
import re

# pandas-related imports
import pandas as pd
import scipy
import sklearn

# record linkage package
import recordlinkage as rl
from recordlinkage.standardise import clean

# CSV file reading-related imports
import csv

# database interaction imports
import psycopg2
import psycopg2.extras

print( "Imports loaded at " + str( datetime.datetime.now() ) )

## Load the Data
- Back to [Table of Contents](#Table-of-Contents)

For our case study we have already prepared some data for you. In order to demonstrate record linkage we need directly identifiable data. In typical applications you would link based on names or some account numbers. We can't do this in class because the data in the ADRF is de-identified to guarantee data privacy for the individuals who are in the data. Thus we need some other kind of information which can be linked. The QCEW (Quarterly Census of Employment and Wages) files we have for Illinois, however, contain the names of employers. We therefore prepared a subsample of the QCEW data for the years 2005 & 2015, consisting of legal name, physical address, and some other establishment characteristics (see [Data Description](#Data-Description)). The data is accessible in the class database.

Let's first set up our database schema and connect to the database using `psycopg2`, as you've done in previous notebooks.

In [None]:
# Database connection properties
db_host = "10.10.2.10"
db_port = -1
db_username = None
db_password = None
db_name = "appliedda"
schema = "ada_18_uchi"

# Create psycopg2 connection to Postgresql
pgsql_connection = psycopg2.connect( host = db_host, database = db_name )

# Read SQL 
sql_string2015 = "select distinct name_legal as orig_name, address, setup_date, total_wages, tot_empl from " + schema + "." + "rec_link_2015"
sql_string2005 = "select distinct name_legal as orig_name, address, setup_date, total_wages, tot_empl from " + schema + "." + "rec_link_2005"

# Save table in dataframe
d_2015 = pd.read_sql(sql_string2015, con=pgsql_connection)
d_2005 = pd.read_sql(sql_string2005, con=pgsql_connection)

## Data Exploration
- Back to [Table of Contents](#Table-of-Contents)

Next, we want to get to know the data a bit so that we know what kind of pre-processing we have to apply. Things you want to check are, for example, formats, missing values, and the quality of your data in general. 

__Visualizing the top rows of the different datasets__

In [None]:
d_2005.head()

In [None]:
d_2015.head()

__Shape and Properties__

In [None]:
# Shape of the data frame
print(d_2005.shape)
print(d_2015.shape)

In [None]:
d_2005.describe()

In [None]:
d_2015.describe()

In [None]:
#1. Checking for NULLS   
d_2015['address'].isnull().values.any()
# Output: a boolean telling us whether there are any null values

In [None]:
d_2015['address'].isnull().sum()
# Output: count of the null values

In [None]:
d_2015.isnull().sum()
# Output: count of null values for all of the columns

In [None]:
#2. Checking for specific values such as '.', '', etc. 
d_2015['address'].isin(['', '.']).values.any()

In [None]:
d_2015['address'].isin(['', '.']).sum()

## The Importance of Pre-Processing
Data pre-processing is an important step in a data analysis project in general, and in record linkage applications in particular. The goal of pre-processing is to transform messy data into a dataset that can be used in a project workflow.

Linking records from different data sources comes with different challenges that need to be addressed by the analyst. The analyst must determine whether or not two entities (individuals, businesses, geographical units) on two different files are the same. This determination is not always easy. In most cases there is no common, uniquely identifying characteristic for an entity. For example, is Bob Miller from New York the same person as Bob Miller from Chicago in a given dataset? This determination has to be made carefully because the consequences of wrong linkages may be substantial (for example, is person X the same person as the person X on the list of identified terrorists?). Pre-processing can help to make better informed decisions.

Pre-processing can be difficult because there are a lot of things to keep in mind. For example, data input errors, such as typos, misspellings, truncation, abbreviations, and missing values need to be corrected. Literature shows that preprocessing can improve matches. In some situations, 90% of the improvement in matching efficiency may be due to preprocessing. The most common reason why matching projects fail is lack of time and resources for data cleaning. 

In the following we will walk you through some pre-processing steps. These include, but are not limited to, removing spaces, parsing fields, and standardizing strings.

Let's look at the most frequently recurring business names in the different datasets.

In [None]:
d_2005['orig_name'].value_counts()

In [None]:
d_2015['orig_name'].value_counts()

***Right away, we notice that the record linkage between the different datasets will not be straightforward. The variable is messy and non-standardized: similar names can be written differently (in upper-case or lower-case characters, with or without suffixes, etc.). The essential next step is to process the variables so that the linkage is as effective and relevant as possible.***


### Parsing String Variables

By default, Python's `split` method returns a list of strings obtained by splitting the original string on whitespace; a different separator, such as a comma, can be passed as an argument. The record linkage package also comes with a built-in cleaning function that we can use. In addition, we can extract information from strings, for example by using regex search commands.
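
As a quick illustration of `str.split`, the sketch below uses two made-up strings (they are not taken from the class data). With no argument the method splits on whitespace, and a separator can be passed explicitly.

```python
name = "ACME TOOLS INC"                      # made-up business name
address = "100 MAIN ST, CHICAGO, IL 60601"   # made-up address

print(name.split())          # ['ACME', 'TOOLS', 'INC']
print(address.split(", "))   # ['100 MAIN ST', 'CHICAGO', 'IL 60601']
print(name.split()[-1])      # INC  (the last word of the name)
```

Taking the last element of the split result is the same idea that the notebook uses further down to isolate legal suffixes.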

In [None]:
# Uppercasing names and creating a new column of names to work with
d_2015['orig_name_clean']=d_2015.orig_name.str.upper()

# Cleaning names (using the record linkage package tool, see imports)
# Clean removes any characters such as '-', '.', '/', '\', ':', brackets of all types. 
d_2015['orig_name_clean']=clean(d_2015['orig_name_clean'], lowercase=False, strip_accents='ascii', \
                                remove_brackets=False)
d_2015.head()

### Regular Expressions – `regex`

- Back to [Table of Contents](#Table-of-Contents)


Regular expressions (regex) are a way of searching for a character pattern. They can be used for matching or replacing operations in strings.

When defining a regular expression search pattern, it is a good idea to start out by writing down, explicitly, in plain English, what you are trying to search for and exactly how you identify when you've found a match.
For example, if we look at an author field formatted as "&lt;last_name&gt; , &lt;first_name&gt; &lt;middle_name&gt;", in plain English, this is how I would explain where to find the last name: "starting from the beginning of the line, take all the characters until you see a comma."


In a regular expression, there are special reserved characters and character classes. For example:
- "`^`" matches the beginning of the line or cell
- "`.`" matches any character
- "`+`" means one or more repetitions of the preceding expressions

Anything that is not a special character or class is just looked for explicitly. A comma, for example, is not a special character in regular expressions, so inserting "`,`" in a regular expression will simply match that character in the string.

In our example, in order to extract the last name, the resulting regular expression would be:
"`^.+,`". We start at the beginning of the line ( "`^`" ), matching any characters ( "`.+`" ) until we come to the literal character of a comma ( "`,`" ).


_Note: if you want to actually look for one of these reserved characters, it must be escaped, so that, for example, the expression looks for a literal period, rather than the special regular expression meaning of a period. To escape a reserved character in a regular expression, precede it with a back slash ( "`\`" ). For example, "`\.`" will match a "`.`" character in a string._

__REGEX CHEATSHEET__


    - abc...     Letters
    - 123...     Digits
    - \d         Any Digit
    - \D         Any non-Digit Character
    - .          Any Character
    - \.         Period
    - [abc]      Only a, b or c
    - [^abc]     Not a, b, or c
    - [a-z]      Characters a to z
    - [0-9]      Numbers 0 to 9
    - \w         Any Alphanumeric character (or underscore)
    - \W         Any non-Alphanumeric character
    - {m}        m Repetitions
    - {m,n}      m to n repetitions
    - *          Zero or more repetitions
    - +          One or more repetitions
    - ?          Optional Character
    - \s         any Whitespace
    - \S         any non-Whitespace character
    - ^...$      Starts & Ends
    - (...)      Capture Group
    - (a(bc))    Capture sub-Group
    - (.*)       Capture All
    - (abc|def)  Capture abc or def

__Examples:__
    - `(\d\d\D)`       will match 22X, 23G, 56H, etc...
    - `(\w)`           will match any single alphanumeric character (0-9, a-z, A-Z) or an underscore
    - `(\w{1,3})`      will match a run of 1 to 3 alphanumeric characters
    - `(spell|spells)` will match spell or spells
    - `(corpo?)`       will match corp or corpo
    - `(feb 2.)`       will match feb 20, feb 21, feb 2a, etc.
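
The cheat-sheet constructs can be tried out directly with `re.fullmatch`, which succeeds only if the whole string fits the pattern. The sketch below checks a few patterns against made-up test strings; the list of checks is our own and is only meant for experimenting.

```python
import re

# (pattern, test string, do we expect the whole string to match?)
checks = [
    (r"\d\d\D", "22X", True),       # two digits followed by one non-digit
    (r"\w{1,3}", "ab1", True),      # one to three word characters
    (r"\w{1,3}", "abcd", False),    # four characters is one too many
    (r"corpo?", "corp", True),      # the trailing "o" is optional
    (r"feb 2.", "feb 2a", True),    # "." stands for any single character
    (r"[abc]", ",", False),         # a character class lists characters without commas
]

for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    print(pattern, repr(text), matched)
    assert matched == expected
```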


__Using REGEX to match characters:__

In python, to use a regular expression like this to search for matches in a given string, we use the built-in "`re`" package ( https://docs.python.org/2/library/re.html ), specifically the "`re.search()`" method. To use "`re.search()`", pass it first the regular expression you want to use to search, enclosed in quotation marks, and then the string you want to search within. 
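
A minimal sketch of `re.search()` applied to the last-name pattern discussed above. The author string is made up. `re.search()` returns a match object when the pattern is found and `None` otherwise, so it is worth checking the result before using it.

```python
import re

# Made-up value in "<last_name>, <first_name> <middle_name>" form
author = "Miller, Robert James"

match = re.search(r"^.+,", author)   # from the start of the line up to a comma
if match:
    print(match.group())             # Miller,  (the comma is part of the match)

# A capture group with a negated character class keeps the comma out of the result
last_name = re.search(r"^([^,]+),", author).group(1)
print(last_name)                     # Miller

print(re.search(r"^.+,", "no comma here"))   # None
```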



__Using REGEX for replacing characters:__

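The `re` package also has an "`re.sub()`" method used to replace matches of a regular expression with another string. The same kind of replacement can be applied to an entire pandas column (replacing expression1 with expression2) with the following syntax: `df['variable'].str.replace(r'expression1', 'expression2', regex=True)`. The `r` prefix marks a raw string, so backslashes in the pattern reach the regular expression engine unchanged. `regex=True` tells pandas to treat the pattern as a regular expression, which recent pandas versions no longer do by default.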
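
Here is a small sketch of `re.sub()` on a single made-up business name. Each call takes the pattern, the replacement, and the string to work on, and returns a new string. The three steps are only one possible cleaning sequence.

```python
import re

text = "SMITH & SONS  CO."   # made-up business name

step1 = re.sub(r"\.", "", text)              # remove literal periods
step2 = re.sub(r"\s*&\s*", " AND ", step1)   # spell out the ampersand
step3 = re.sub(r"\s+", " ", step2)           # collapse repeated whitespace

print(step1)   # SMITH & SONS  CO
print(step2)   # SMITH AND SONS  CO
print(step3)   # SMITH AND SONS CO
```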

In [None]:
# Extracting ZIPCODE
# A US zipcode typically follows a set pattern: 5 digits, or 5 digits followed by a hyphen and then 4 digits
# We can use this information to extract zipcodes. 

# Pattern1
d_2015['zipcode']=d_2015['address'].str.extract(r'(\d{5})', expand=False)

# Breaking the code down: 
# r'...' ---- a raw string, so the backslashes are passed to the regex engine unchanged
# \d ---- tells that we need a digit
# \d{5} ---- tells that we need 5 digits consecutively
# () enclosing brackets tell that we need to extract this information in the new variable
# expand=False ---- returns a single column (Series) instead of a DataFrame
d_2015[['zipcode', 'address','setup_date']].head()

In [None]:
# Pattern 2
# What if we want the full zipcode: '5 digits <hyphen> 4 digits'
d_2015['zipcode_full']=d_2015['address'].str.extract(r'(\d{5}-\d{4})', expand=False)

# Breaking the code down: 
# \d ---- tells that we need a digit
# \d{5} ---- tells that we need 5 digits consecutively
# '-' this is just passing the string exactly as we need it
# \d{4} tells us 4 more digits
# () enclosing brackets tell that we need to extract this information in the new variable

In [None]:
# Pattern 3
# We can also pass both the expressions in our query as an OR.
d_2015['zipcode_either']=d_2015['address'].str.extract(r'(\d{5}-\d{4}|\d{5})', expand=False)

# Breaking the code: 
# \d ---- tells that we need a digit
# \d{5} ---- tells that we need 5 digits consecutively
# '-' this is just passing the string exactly as we need it
# \d{4} tells us 4 more digits
# () enclosing brackets tell that we need to extract this information in the new variable

In [None]:
# Parsing the date into day, month & year

d_2015['day']=d_2015['setup_date'].str.extract(r'(\A\d{1,2})', expand=True)
# \d{1,2} matches one or two digits
# \A anchors the match at the start of the string

d_2015['month']=d_2015['setup_date'].str.extract(r'\d{1,2}.(\d{1,2})', expand=True)
# The leading \d{1,2}. skips the day (one or two digits) and the separator that follows it
# Note that an unescaped . matches any single character, not only a literal period
# (\d{1,2}) is enclosed in brackets, so only this part (the month) is extracted

d_2015['year']=d_2015['setup_date'].str.extract(r'\w+.(\d{4})', expand=True)
# \w+ matches a run of alphanumeric characters and . matches the separator after it
# (\d{4}) then captures the four digits of the year

d_2015[['setup_date', 'day', 'month', 'year']].head(10)

### Handling Business Suffixes

You may have noticed that several business names finish with a legal suffix ("CO", "INC", "LIMITED LIABILITY", etc.). Unfortunately, the examples below will show that these suffixes are not consistent between tables, which makes exact matching on the full name unreliable. Below we detail one possible way of dealing with legal suffixes. 

We will start by getting the legal suffix of the businesses. We will then isolate them into a separate variable, then standardize them. When matching the dataframes, we can now choose to match on both business name and legal suffixes, or on business name stripped of the legal suffix.
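
The idea can be sketched in plain Python before we apply it to the dataframes. The function name `split_suffix` and the `SUFFIXES` dictionary are hypothetical and cover only a handful of suffixes; the notebook cells below implement the same steps with pandas and a longer list.

```python
# Hypothetical mapping from raw legal suffix to a standardized form
SUFFIXES = {
    "INC": "INC", "INCORPORATED": "INC",
    "CORP": "CORP", "CORPORATION": "CORP",
    "LTD": "LTD", "LIMITED": "LTD",
    "LLC": "LLC",
}

def split_suffix(name):
    """Return (name without legal suffix, standardized suffix or None)."""
    words = name.split()
    if len(words) > 1 and words[-1] in SUFFIXES:
        return " ".join(words[:-1]), SUFFIXES[words[-1]]
    return " ".join(words), None

print(split_suffix("ACME TOOLS INCORPORATED"))   # ('ACME TOOLS', 'INC')
print(split_suffix("ACME TOOLS"))                # ('ACME TOOLS', None)
```

Both variants of the name end up with the same stripped name, so they can be matched regardless of how the suffix was written.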

In [None]:
# Getting the last word in the cleaned name variable
d_2015['lname'] = d_2015.orig_name_clean.str.split(' ').str.get(-1)

# List of legal suffixes we want to recognize
# (the original list was missing a comma between 'CORPORATION' and 'CO', which merged them into one string)
legal=pd.Series(['INC', 'LTD','LIMITED', 'INCORPORATED', 'INCORPORATION', \
                 'ASSOCIATION','CORP', 'CORPORATION', 'CO', 'LLC', 'ASSOC', 'ASSOCIATES', 'PTNRSHP', 'COMPANY'])

# Creating an indicator which tells us whether the last word of the name is a legal suffix
d_2015['ind']=d_2015['lname'].isin(legal)
d_2015.head()

# Getting the legal suffix: if the last word is not in the pre-set list, we assign a null value
d_2015['legal_name']=np.where(d_2015['ind']==1, d_2015['lname'], None)

# Number of characters in the name before the last word
d_2015['len']=d_2015.orig_name_clean.str.len()-d_2015.lname.str.len()
d_2015['name_2'] = d_2015.apply(lambda r: r.orig_name_clean[:r.len], axis=1)

# apply with axis=1 calls the function once for every row
# lambda defines a small anonymous function; here r is the current row

# Use the stripped name if the last word is a legal suffix, otherwise keep the cleaned name
d_2015['est_name']=np.where(d_2015['ind']==1, d_2015['name_2'], d_2015['orig_name_clean'])
d_2015[['orig_name_clean','est_name', 'legal_name', 'lname','ind', 'len','name_2']].head()

In [None]:
# standardizing the legal names
# ['INC', 'LTD', 'LIMITED', 'INCORPORATED', 'INCORPORATION', 'ASSOCIATION','CORP', 'CORPORATION', 'CO', 
# 'LLC', 'ASSOC', 'ASSOCIATES', 'PTNRSHP', 'COMPANY'])

conditions= [
    (d_2015['legal_name']=='INC'), 
    (d_2015['legal_name']=='LTD'), 
    (d_2015['legal_name']=='LIMITED'), 
    (d_2015['legal_name']=='INCORPORATED'), 
    (d_2015['legal_name']=='INCORPORATION'), 
    (d_2015['legal_name']=='ASSOCIATION'), 
    (d_2015['legal_name']=='CORP'), 
    (d_2015['legal_name']=='CORPORATION'),
    (d_2015['legal_name']=='CO'),
    (d_2015['legal_name']=='LLC'), 
    (d_2015['legal_name']=='ASSOC'), 
    (d_2015['legal_name']=='ASSOCIATES'), 
    (d_2015['legal_name']=='PTNRSHP') ,
    (d_2015['legal_name']=='COMPANY')
]

choices=['INC', 'LTD', 'LTD', 'INC', 'INC', 'ASSOCIATION', 'CORP', 'CORP', \
         'CO', 'LLC', 'ASSOCIATION', 'ASSOCIATES', 'PARTNERSHIP', 'COMPANY']

d_2015['legal_name_clean']=np.select(conditions, choices, default=None)

# This is how our prepared data looks
d_2015[['est_name', 'legal_name_clean', 'address', 'setup_date']].head()

Do you see any other possible standardizations? Insert them below!

Now we are done with the initial data prep work. Please keep in mind that we just provided some examples to demonstrate the process. You can add as many further steps as necessary. 

In [None]:
# We will just run the same code on the year 2005 data. You can also run this in a loop over both years
d_2005['orig_name_clean']=d_2005.orig_name.str.upper()
d_2005['orig_name_clean']=clean(d_2005['orig_name_clean'], lowercase=False, strip_accents='ascii', \
                                remove_brackets=False)
d_2005['lname'] = d_2005.orig_name_clean.str.split(' ').str.get(-1)
legal=pd.Series(['INC', 'LTD', 'COMPANY', 'CORPORATION', 'CO', 'LIMITED', 'INCORPORATED', 'INCORPORATION', \
                 'ASSOCIATION','CORP', 'LLC', 'ASSOC', 'ASSOCIATES', 'PTNRSHP'])
d_2005['ind']=d_2005['lname'].isin(legal)
d_2005['legal_name']=np.where(d_2005['ind']==1, d_2005['lname'], None)
d_2005['len']=d_2005.orig_name_clean.str.len()-d_2005.lname.str.len()
d_2005['name_2'] = d_2005.apply(lambda r: r.orig_name_clean[:r.len], axis=1)
d_2005['est_name']=np.where(d_2005['ind']==1, d_2005['name_2'], d_2005['orig_name_clean'])
conditions= [
    (d_2005['legal_name']=='INC'), 
    (d_2005['legal_name']=='LTD'), 
    (d_2005['legal_name']=='LIMITED'), 
    (d_2005['legal_name']=='INCORPORATED'), 
    (d_2005['legal_name']=='INCORPORATION'), 
    (d_2005['legal_name']=='ASSOCIATION'), 
    (d_2005['legal_name']=='CORP'), 
    (d_2005['legal_name']=='CORPORATION'),
    (d_2005['legal_name']=='CO'),
    (d_2005['legal_name']=='LLC'), 
    (d_2005['legal_name']=='ASSOC'), 
    (d_2005['legal_name']=='ASSOCIATES'), 
    (d_2005['legal_name']=='PTNRSHP') ,
    (d_2005['legal_name']=='COMPANY')
]
choices=['INC', 'LTD', 'LTD', 'INC', 'INC', 'ASSOCIATION', 'CORP', 'CORP', 'CO', 'LLC', \
         'ASSOCIATION', 'ASSOCIATES', 'PARTNERSHIP', 'COMPANY']
d_2005['legal_name_clean']=np.select(conditions, choices, default=None)
d_2005['zipcode']=d_2005['address'].str.extract(r'(\d{5})', expand=False)
d_2005['day']=d_2005['setup_date'].str.extract(r'(\A\d{1,2})', expand=True)
d_2005['month']=d_2005['setup_date'].str.extract(r'\d{1,2}.(\d{1,2})', expand=True)
d_2005['year']=d_2005['setup_date'].str.extract(r'\w+.(\d{4})', expand=True)

## Record Linkage
The record linkage package is a powerful tool to use when you want to link records within a dataset or across multiple datasets. It comes with different built-in distance metrics and comparison functions, and it also allows you to create your own. In general, record linkage is divided into several steps. 

In [None]:
# For the match later so we have different names for years
d_2015.rename(columns={'est_name':'est_name_2015'}, inplace=True)
d_2005.rename(columns={'est_name':'est_name_2005'}, inplace=True)

In [None]:
# Only keep variables relevant for linkage
d_2015 = d_2015[['est_name_2015', 'zipcode', 'year', 'legal_name_clean']]
d_2005 = d_2005[['est_name_2005', 'zipcode', 'year', 'legal_name_clean']]

In [None]:
d_2015.head()

In [None]:
d_2005.head()

We've already done the pre-processing, so the next step is indexing the data we would like to link. Indexing allows you to create candidate links, which basically means identifying pairs of data rows which might refer to the same real-world entity. This is also called the comparison space (matrix). There are different ways to index data. The easiest is to create a full index and consider every pair a candidate match. This is also the least efficient method, because we will be comparing every row of one dataset with every row of the other dataset. 
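
To see why a full index grows so quickly, here is a minimal sketch with made-up record identifiers. It builds every possible pair with `itertools.product`; the number of pairs is the product of the two dataset sizes.

```python
from itertools import product

ids_2015 = ["a1", "a2", "a3"]   # hypothetical record identifiers
ids_2005 = ["b1", "b2"]

pairs = list(product(ids_2015, ids_2005))
print(len(pairs))   # 6, i.e. len(ids_2015) * len(ids_2005)
print(pairs[:2])    # [('a1', 'b1'), ('a1', 'b2')]
```

With two datasets of 10,000 rows each, the same construction would already produce 100 million candidate pairs.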

In [None]:
# Let's generate a full index first (comparison table of all possible linkage combinations)
indexer = rl.FullIndex()
pairs = indexer.index(d_2015, d_2005)
# Returns a pandas MultiIndex object
print(len(pairs))

We can do better if we include our knowledge about the data to eliminate bad links from the start. This can be done through blocking. The recordlinkage package gives you multiple options for this. For example, you can block on variables, which means only pairs that are exactly equal on the specified variables will be kept. You can also use a neighbourhood index, in which the rows in your dataframe are ranked by some value and Python will only link rows that are close to each other in that ranking.
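
Blocking can be sketched in a few lines of plain Python. The records below are made up and the blocking key is the ZIP code; only records that share the key are paired. This is an illustration of the idea and not how the package implements it internally.

```python
from collections import defaultdict

# Made-up records: (record id, ZIP code)
recs_2015 = [("a1", "60601"), ("a2", "60614"), ("a3", "60601")]
recs_2005 = [("b1", "60601"), ("b2", "62701")]

# Group the 2005 records by blocking key
blocks = defaultdict(list)
for rec_id, zipcode in recs_2005:
    blocks[zipcode].append(rec_id)

# Only pair records whose blocking key is identical
pairs = [(a_id, b_id)
         for a_id, zipcode in recs_2015
         for b_id in blocks.get(zipcode, [])]

print(pairs)        # [('a1', 'b1'), ('a3', 'b1')]
print(len(pairs))   # 2 instead of the 6 pairs of a full index
```

The price of blocking is that a true match with a wrong or missing ZIP code in one dataset can no longer be found.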

In [None]:
indexerBL = rl.BlockIndex(on='zipcode')
pairs2 = indexerBL.index(d_2015, d_2005)
# Returns a pandas MultiIndex object
print(len(pairs2))

In [None]:
# Initiate compare object (we are using the blocked ones here)
# You want to give python the name of the MultiIndex and the names of the datasets
compare = rl.Compare(pairs2, d_2015,d_2005)

Now we have set up our comparison space. We can start to compare our files and see if we find matches. We will demonstrate an exact match and rule-based approaches using distance measures. 
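
To give a feeling for what a distance-based string comparison computes, here is a minimal sketch of the Levenshtein distance and one common way to scale it to a similarity between 0 and 1. The function names are our own, and the package's implementation may differ in its details.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))          # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def similarity(s, t):
    """Scale the distance to [0, 1]; 1 means the strings are identical."""
    if not s and not t:
        return 1.0
    return 1 - levenshtein(s, t) / max(len(s), len(t))

print(levenshtein("kitten", "sitting"))         # 3
print(similarity("ACME TOOLS", "ACME TOOLS"))   # 1.0
print(similarity("ACME TOOLS", "ACME TOOL"))    # 0.9
```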

In [None]:
# Exact comparison
# This compares all the pairs of strings for exact matches 
# It is similar to a JOIN-- 
exact = compare.exact('est_name_2015','est_name_2005',name='exact')

In [None]:
# This command gives us a similarity score between two strings based on the Levenshtein distance
# The measure is 0 if there are no similarities in the strings, 1 if they are identical
levenshtein = compare.string('est_name_2015','est_name_2005', name='levenshtein')

In [None]:
# This command gives us a similarity score between two strings based on the Jaro-Winkler distance
# The measure is 0 if there are no similarities in the strings, 1 if they are identical
jarowinkler_name = compare.string('est_name_2015','est_name_2005', method='jarowinkler', name='jarowinkler')

In [None]:
# Finally- we can compare the different metrics for an aggregate comparison of their performance 
print(compare.vectors.describe())

## Results
Once we have our comparison measures, we need to classify the pairs into matches and non-matches. A rule-based approach would be to say that if the highest similarity among our indicators is above 0.80, we consider the pair a match, and everything else a non-match. The cut-off is a decision that needs to be made by the analyst.
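
The thresholding rule can be sketched without pandas. The scores below are invented, and the cut-off of 0.80 mirrors the one used in the cell that follows; a pair is kept if its best score exceeds the cut-off.

```python
# Hypothetical similarity scores for three candidate pairs
scores = {
    ("a1", "b1"): {"exact": 1.0, "levenshtein": 1.00, "jarowinkler": 1.00},
    ("a2", "b7"): {"exact": 0.0, "levenshtein": 0.62, "jarowinkler": 0.85},
    ("a3", "b4"): {"exact": 0.0, "levenshtein": 0.30, "jarowinkler": 0.55},
}
THRESHOLD = 0.80

# Rule: a pair is a match if its best score exceeds the threshold
matches = [pair for pair, s in scores.items() if max(s.values()) > THRESHOLD]
print(matches)   # [('a1', 'b1'), ('a2', 'b7')]
```

Raising the threshold removes borderline pairs such as the second one, while lowering it admits more pairs at the risk of more false links.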

In [None]:
# Classify matches 
matches = compare.vectors[compare.vectors.max(axis=1) > 0.80]
matches = matches.sort_values("jarowinkler")
matches.head()

Now that we have the list of matches, we can fuse our datasets, because in the end we want to have a combined dataset. We are using a function for this task.

In [None]:
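def fuse(dfA, dfB, dfmatches, suffix='_B'):
    # Start from a copy of dfA and add the columns of dfB for every matched pair
    newDF = dfA.copy()
    
    # Columns that exist in both frames get a suffix so the values of dfA are not overwritten
    new_names = {col: (col + suffix if col in dfA.columns else col) for col in dfB.columns}
    for new_col in new_names.values():
        newDF[new_col] = ''
        
    # dfmatches is indexed by (index in dfA, index in dfB)
    for indexA, indexB in dfmatches.index:
        for col, new_col in new_names.items():
            # A single .loc call writes into newDF; chained indexing may write to a temporary copy
            # If a row of dfA has several matches, the last one is kept
            newDF.loc[indexA, new_col] = dfB.loc[indexB, col]
    return newDF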

In [None]:
result = fuse(d_2015, d_2005, matches)
result.head(10)

Another way of classifying records is the Fellegi-Sunter method. If Fellegi-Sunter is used to classify record pairs, you would follow all the steps we have done so far. However, we would now estimate probabilities to construct weights. These weights are then applied during the classification to give certain characteristics more importance. For example, we are more certain that two records sharing a very unique name are a match than two records sharing a common name like Bob Miller.
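
The weights can be illustrated with a small sketch. For each field we assume two probabilities: `m`, the probability that the field agrees when the pair is a true match, and `u`, the probability that it agrees for a non-match. The values below are invented; in practice they are estimated from the data. Agreement on a field adds log2(m/u) to the total weight, and disagreement adds log2((1-m)/(1-u)).

```python
from math import log2

# Hypothetical m- and u-probabilities per field:
# m = P(field agrees | true match), u = P(field agrees | non-match)
params = {"name": (0.95, 0.01), "zipcode": (0.90, 0.10)}

def total_weight(agreement):
    """Sum the agreement or disagreement weight of every field."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = params[field]
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total

print(round(total_weight({"name": True, "zipcode": True}), 2))    # 9.74
print(round(total_weight({"name": False, "zipcode": True}), 2))   # -1.14
```

A field that rarely agrees by chance (small `u`, as for a very unique name) contributes a large positive weight when it does agree, which is why it counts for more than agreement on a common value.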

#### Fellegi Sunter

In [None]:
name = compare.string('est_name_2015','est_name_2005', method='jarowinkler', name='name')
# legal = compare.string('legal_name_clean','legal_name_clean', method='jarowinkler', name='legal')
matches = compare.vectors[compare.vectors.max(axis=1) > 0.80]

In [None]:
# Running the Classifier
# fs = rl.ECMClassifier()
# model =fs.learn(matches)
# pred = fs.predict(matches)
# prob = fs.prob(matches)

In [None]:
# prob

In [None]:
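## Generate Training Data and index
ml_pairs = matches[0:40000]
# Index of the training pairs that are treated as true matches
# Index.intersection is used because the & operator is deprecated for index set operations in newer pandas versions
# Note: every pair in ml_pairs also appears in pairs2, so this marks all training pairs as matches;
# in a real application you would use pairs that are known to be true matches
ml_matches_index = ml_pairs.index.intersection(pairs2)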

The Naive Bayes classifier is a probabilistic classifier. The probabilistic record linkage framework by Fellegi and Sunter (1969) is the most well-known probabilistic classification method for record linkage. Later, it was proved that the Fellegi and Sunter method is mathematically equivalent to the Naive Bayes method in case of assuming independence between comparison variables.
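
The Naive Bayes view can be sketched as follows. We assume a prior match rate and, per field, the probability of agreement among matches (`m`) and among non-matches (`u`); all numbers are invented for illustration. Treating the fields as independent, the posterior probability of a match follows from Bayes' rule.

```python
# Hypothetical parameters: prior match rate and per-field (m, u) probabilities
prior = 0.10
params = {"name": (0.95, 0.01), "zipcode": (0.90, 0.10)}

def match_probability(agreement):
    """Naive Bayes posterior P(match | agreement pattern), fields assumed independent."""
    like_match, like_non = prior, 1 - prior
    for field, agrees in agreement.items():
        m, u = params[field]
        like_match *= m if agrees else 1 - m
        like_non *= u if agrees else 1 - u
    return like_match / (like_match + like_non)

print(round(match_probability({"name": True, "zipcode": True}), 3))     # 0.99
print(round(match_probability({"name": False, "zipcode": False}), 3))   # 0.001
```

The ratio of the two likelihood products is the same quantity whose logarithm gives the Fellegi-Sunter weight, so ranking pairs by this posterior orders them in the same way as ranking by total weight.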

In [None]:
## Train the classifier
nb = rl.NaiveBayesClassifier()
nb.learn(ml_pairs, ml_matches_index)

## Predict the match status for all record pairs
result_nb = nb.predict(matches)

## Predict probability for record to be a match
prob_nb = nb.prob(matches)

In [None]:
## Check header
prob_nb

### Evaluation

The last step is to evaluate the results of the record linkage. We will cover this in more detail in the machine learning session. This is just for completeness.
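
As a reminder of what the evaluation metrics measure, here is a minimal sketch that computes them from invented confusion counts (true positives, false positives, false negatives, true negatives). The numbers are not results from this notebook.

```python
# Hypothetical confusion counts
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)                    # share of predicted links that are correct
recall = tp / (tp + fn)                       # share of true links that were found
accuracy = (tp + tn) / (tp + fp + fn + tn)    # share of all pairs classified correctly
f_score = 2 * precision * recall / (precision + recall)

print(precision, round(recall, 3), accuracy, round(f_score, 3))   # 0.8 0.889 0.97 0.842
```

Because non-matches usually vastly outnumber matches in record linkage, accuracy tends to look high even for poor classifiers, so precision, recall, and the F-score are often more informative.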

In [None]:
## Confusion matrix
conf_nb = rl.confusion_matrix(pairs2, result_nb, len(matches))
conf_nb

In [None]:
## Precision and Accuracy
precision = rl.precision(conf_nb)
accuracy = rl.accuracy(conf_nb)

In [None]:
## Precision and Accuracy
print(precision)
print(accuracy)

In [None]:
## The F-score for this classification is
rl.fscore(conf_nb)

## References and Further Readings

### Parsing

* Python online documentation: https://docs.python.org/2/library/string.html#deprecated-string-functions
* Python 2.7 Tutorial(Splitting and Joining Strings): http://www.pitt.edu/~naraehan/python2/split_join.html

### Regular Expression

* Python documentation: https://docs.python.org/2/library/re.html#regular-expression-syntax
* Online regular expression tester (good for learning): http://regex101.com/

### String Comparators

* GitHub page of jellyfish: https://github.com/jamesturk/jellyfish
* Different distances that measure the differences between strings:
    - Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
    - Damerau–Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    - Jaro–Winkler distance: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
    - Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance
    - Match rating approach: https://en.wikipedia.org/wiki/Match_rating_approach

### Fellegi-Sunter Record Linkage 

* Introduction to Probabilistic Record Linkage: http://www.bristol.ac.uk/media-library/sites/cmm/migrated/documents/problinkage.pdf
* Paper Review: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf

