<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. "ADA-KCMO-2018." Coleridge Initiative GitHub Repositories. 2018. https://github.com/Coleridge-Initiative/ada-kcmo-2018. [![DOI](https://zenodo.org/badge/119078858.svg)](https://zenodo.org/badge/latestdoi/119078858)

# Record Linkage
----

## Table of Contents

- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
    - [Methods](#Methods)
- [The Principles of Record Linkage](#The-Principles-of-Record-Linkage)
- [Data Description](#Data-Description)
- [Python Setup](#Python-Setup)
- [Load the Data](#Load-the-Data)
- [Data Exploration](#Data-Exploration)
- [Record Linkage on Business Names](#Record-Linkage-on-Business-Names)
    - [The Importance of Pre-Processing](#The-Importance-of-Pre-Processing)
    - [Cleaning String Variables](#Cleaning-String-Variables)
    - [Regular Expressions – `regex`](#Regular-Expressions-–-regex)
    - [Handling Business Suffixes](#Handling-Business-Suffixes)
    - [Record Linkage: Exact Matching on One Field](#Record-Linkage:-Exact-Matching-on-One-Field)
- [Record Linkage on Addresses](#Record-Linkage-on-Addresses)
    - [Address Parsing](#Address-Parsing)
    - [Record Linkage: Exact Matching on Several Fields](#Record-Linkage:-Exact-Matching-on-Several-Fields)
    - [Record Linkage: Rule-Based Matching](#Record-Linkage:-Rule-Based-Matching)
    - [Record Linkage: Fellegi Sunter](#Record-Linkage:-Fellegi-Sunter)
- [Additional Resources](#Additional-Resources)

## Introduction
- Back to [Table of Contents](#Table-of-Contents)

This notebook provides an introduction to record linkage using Python. Upon completion, you will be able to apply record linkage techniques using the *recordlinkage* package to combine data from different sources in Python. 
It will lead you through all the steps necessary for a successful record linkage, starting with data preparation, including pre-processing, cleaning, and standardization of data.
The notebook follows the underlying lecture and provides examples of how to implement record linkage techniques. 

### Learning Objectives

The goal of this notebook is for you to understand record linkage techniques. You will be responsible for linking the different datasets in the ADRF, and the resulting linked dataset will be used in your later projects.

### Methods

For this notebook exercise we are interested in business data from Kansas City, MO. Business names appear in different datasets: Wage Records and Employer Data from the Missouri Department of Labor, Business Registrations from the Kansas City, MO, Department of Revenue, and Water Consumption data from the Kansas City, MO, Water Services. Here, we will combine the three.

- **Analytical Exercise**: Merge Employer Wage Records, Business Registrations, and Water Services Data. 

- **Approach**: We will look at the data available to us, and clean & pre-process it to enable better linkage. Since the only identifiers we have for this case study are the names of the employers and addresses, we will have to use string matching techniques, part of Python's record linkage package. 

## The Principles of Record Linkage
- Back to [Table of Contents](#Table-of-Contents)

The goal of record linkage is to determine if pairs of records describe the same entity. For instance, this is important for removing duplicates from a data source or joining two separate data sources together. In various fields, record linkage also goes by the terms data matching, merge/purge, duplicate detection, de-duping, reference matching, entity resolution, disambiguation, and co-reference/anaphora.

There are several approaches to record linkage, including exact matching, rule-based linking, and probabilistic linking:
- An example of **exact matching** is joining records based on social security number, exact name, or geographic code information. This is what you have already done in SQL by joining tables on a unique identifier. 
- **Rule-based matching** involves applying a cascading set of rules that reflect the domain knowledge of the records being linked. 
- In **probabilistic record linkage**, linkage weights are estimated to calculate the probability of a certain match.
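
To make the first two approaches concrete, the sketch below applies exact matching and a small rule cascade to two made-up lists. The records and both rules are illustrative assumptions, not taken from the ADRF tables. Probabilistic linkage is left out because it needs estimated weights.

```python
# Toy records (hypothetical): (name, zip code)
left = [("ACME INC", "64106"), ("BOB'S DINER", "64111"), ("KC TOOLS LLC", "64108")]
right = [("ACME INC", "64106"), ("BOBS DINER", "64111"), ("KC TOOLS", "64130")]

def exact_match(a, b):
    """Exact matching: every compared field must be identical."""
    return a == b

def simplify(name):
    """Keep only letters, digits and spaces, so "BOB'S" becomes "BOBS"."""
    return "".join(ch for ch in name if ch.isalnum() or ch == " ")

def rule_based_match(a, b):
    """Rule cascade: try the strictest rule first, then a relaxed one."""
    if a == b:
        return "rule 1"   # identical records
    if simplify(a[0]) == simplify(b[0]) and a[1] == b[1]:
        return "rule 2"   # same name once punctuation is ignored, same zip code
    return None

exact_pairs = [(a, b) for a in left for b in right if exact_match(a, b)]
rule_pairs = [(a[0], b[0], rule_based_match(a, b))
              for a in left for b in right if rule_based_match(a, b)]

print(exact_pairs)
print(rule_pairs)
```

Exact matching finds only the identical ACME records, while the second rule also recovers the diner despite the apostrophe. Neither approach links the two KC TOOLS records, which differ in both name and zip code.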

In practical applications you will need record linkage techniques to combine information about the same entity that is stored in different data sources. Record linkage will also help you to address the quality of different data sources. For example, if one of your databases has missing values you might be able to fill those by finding an identical pair in a different data source. Overall, the main applications of record linkage are:
    1. Merging two or more data files 
    2. Identifying the intersection of the two data sets 
    3. Updating data files (with the data row of the other data files) and imputing missing data
    4. Entity disambiguation and de-duplication

## Data Description
- Back to [Table of Contents](#Table-of-Contents)

The datasets used in this exercise will be:
- Employer Data from the Missouri Department of Labor (`kcmo_lehd.mo_qcew_employers`)
- Business Registrations from the Kansas City, MO, Department of Revenue (`public.mo_business_licenses`)
- Water Consumption data from the Kansas City, MO, Water Services (`kcmo_water.ucbprem_premises` and `kcmo_water.ubbchst_consumption_history`)

**Variables Used for Linking:**

We will link these datasets both on Business Name and Business Address.

The Employer Data and Business Registrations from KCMO both have variables for "Business Name":
- In the table `kcmo_lehd.mo_qcew_employers`: `legal_name`
- In the table `public.mo_business_licenses`: `legalname`

Employer Data, Business Registrations, and Water Data all have address information:
- Table `kcmo_lehd.mo_qcew_employers` has a UI address, a physical location and a MO location. We will use the physical location. Address parts are already parsed: `pl_addr1`, `pl_addr2`, `pl_city`, `pl_zip`.
- In the table `public.mo_business_licenses`: `address`
- In the table `kcmo_water.ucbprem_premises`, the address parts are already parsed: `ucbprem_street_name`, `ucbprem_street_number`, `ucbprem_ssfx_code`, `ucbprem_city`, and `ucbprem_zipc_code`

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

Python provides us with some tools we can use for record linkage so we don't have to start from scratch and code our own linkage algorithms. Before we start we need to load the package `recordlinkage`. To function fully, this package relies on other packages, which also need to be imported. We are adding a couple more packages to the ones you are already familiar with.

In [None]:
# __future__ imports must be the first statement in the cell
from __future__ import print_function

# general use imports
%pylab inline
import datetime
import glob
import inspect
import numpy as np
import os
import six
import warnings
import matplotlib.pyplot as plt
import jellyfish
import re

# pandas-related imports
import pandas as pd
import scipy
import sklearn

# record linkage package
import recordlinkage as rl
from recordlinkage.standardise import clean

# CSV file reading-related imports
import csv

# database interaction imports
import psycopg2
import psycopg2.extras

print( "Imports loaded at " + str( datetime.datetime.now() ) )

## Load the Data
- Back to [Table of Contents](#Table-of-Contents)

As we've done in previous notebooks, let's set up our database connection properties, connect to the database using `psycopg2`, and query the three datasets we would like to use.

In [None]:
# Database connection properties
db_name = "appliedda"
db_host = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host=db_host) #database connection

In [None]:
# Read SQL 
business_licenses_query = '''
SELECT address
        , legalname
        , filingperiod
FROM public.kcmo_business_licenses
WHERE filingperiod = '2016-12-31';
'''

employer_query = '''
SELECT legal_name
        , pl_addr1
        , pl_city
        , pl_zip
FROM kcmo_lehd.mo_qcew_employers
WHERE year = 2016
    AND qtr = 1
    AND pl_city = 'KANSAS CITY'
    AND pl_state = 'MO';
'''

water_services_query = '''
SELECT ucbprem_street_number
        , ucbprem_pdir_code_pre
        , ucbprem_street_name
        , ucbprem_ssfx_code
        , ucbprem_zipc_code
        , ucbprem_city
FROM kcmo_water.ucbprem_premises
WHERE ucbprem_city = 'KANSAS CITY'
        and ucbprem_stat_code_addr = 'MO';
'''

# Save table in dataframe
business = pd.read_sql(business_licenses_query, conn)
employer = pd.read_sql(employer_query, conn)
water = pd.read_sql(water_services_query, conn)

## Data Exploration
- Back to [Table of Contents](#Table-of-Contents)

Next, we want to get to know the data a bit so that we know what kind of pre-processing we have to apply. Things to check include formats, missing values, and the quality of your data in general. 

__Visualizing the top rows of the different datasets__

In [None]:
business.head()

In [None]:
employer.head()

In [None]:
water.head()

__Shape and Properties__

In [None]:
# Shape of the data frame
print(business.shape)
print(employer.shape)
print(water.shape)

In [None]:
business.describe()

In [None]:
employer.describe()

In [None]:
water.describe()
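
Missing values and inconsistent formats are often easier to spot with a few targeted checks than with `describe()` alone. The sketch below uses a small made-up dataframe (the column names mirror the business table, the rows are invented) to count missing values and to see how many distinct names remain after upper-casing.

```python
import pandas as pd

# Hypothetical stand-in for one of the tables loaded above
toy = pd.DataFrame({
    "legalname": ["Acme Inc", "ACME INC", None, "Bob's Diner"],
    "address": ["1 Main St Kansas City MO 64106", None, None,
                "22 Oak Ave Kansas City MO 64111-1234"],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Share of rows with at least one missing value
share_incomplete = toy.isnull().any(axis=1).mean()
print(share_incomplete)

# Distinct names before and after a simple upper-casing
n_raw = toy["legalname"].nunique()
n_upper = toy["legalname"].str.upper().nunique()
print(n_raw, n_upper)
```

A drop in the number of distinct values after a simple transformation is a quick sign that the field needs standardizing before it can be used as a linkage key.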

## Record Linkage on Business Names
- Back to [Table of Contents](#Table-of-Contents)

Our first record linkage example will be the linking of the MO Employer data with the KCMO Business Licenses data. We will link the tables on Business Name, a field that appears in both datasets.

### The Importance of Pre-Processing
- Back to [Table of Contents](#Table-of-Contents)

Data pre-processing is an important step in any data analysis project, and in record linkage applications in particular. The goal of pre-processing is to transform messy data into a dataset that can be used in a project workflow.

Linking records from different data sources comes with different challenges that need to be addressed by the analyst. The analyst must determine whether or not two entities (individuals, businesses, geographical units) on two different files are the same. This determination is not always easy. In most cases there is no common, uniquely identifying characteristic for an entity. For example, is Bob Miller from New York the same person as Bob Miller from Chicago in a given dataset? This determination has to be made carefully because the consequences of wrong linkages may be substantial (is person X the same person as the person X on the list of identified terrorists?). Pre-processing can help to make better-informed decisions.

Pre-processing can be difficult because there are a lot of things to keep in mind. For example, data input errors, such as typos, misspellings, truncation, abbreviations, and missing values need to be corrected. Literature shows that pre-processing can improve matches. In some situations, 90% of the improvement in matching efficiency may be due to pre-processing. The most common reason why matching projects fail is lack of time and resources for data cleaning. 

In the following cells we will walk you through some pre-processing steps. These include but are not limited to removing spaces, parsing fields, and standardizing strings.

Let's look at the most frequent business names in the different datasets.

In [None]:
business['legalname'].value_counts().head(20)

In [None]:
employer['legal_name'].value_counts().head(20)

***Right away, we notice that the record linkage between the different datasets will not be straightforward. The variable is messy and non-standardized: similar names can be written differently (in upper-case or lower-case characters, with or without suffixes, etc.). The essential next step is to process the variables so that the linkage is as effective and relevant as possible.***


### Cleaning String Variables

- Back to [Table of Contents](#Table-of-Contents)

In order to clean the Business Name variables, we will use various string transformations. The record linkage package comes with a built-in cleaning function we can also use. Finally, regular expressions can be used for further cleaning (`replace`) and to extract information from strings (`match`).

In [None]:
# Create new business name variables on which we will do the cleaning
business['name_clean'] = business['legalname']
employer['name_clean'] = employer['legal_name']

In [None]:
# Upcasing names
business['name_clean'] = business['name_clean'].str.upper()
employer['name_clean'] = employer['name_clean'].str.upper()

In [None]:
# Cleaning names (using the record linkage package tool, see imports)
# Clean removes any characters such as '-', '.', '/', '\', ':'. 
business['name_clean'] = clean(business['name_clean'], lowercase=False, strip_accents='ascii')
employer['name_clean'] = clean(employer['name_clean'], lowercase=False, strip_accents='ascii')

### Regular Expressions – `regex`

- Back to [Table of Contents](#Table-of-Contents)


Regular expressions (regex) are a way of searching for a character pattern. They can be used for matching or replacing operations in strings.

When defining a regular expression search pattern, it is a good idea to start out by writing down, explicitly, in plain English, what you are trying to search for and exactly how you identify when you've found a match.
For example, if we look at an author field formatted as "&lt;last_name&gt; , &lt;first_name&gt; &lt;middle_name&gt;", in plain English, this is how I would explain where to find the last name: "starting from the beginning of the line, take all the characters until you see a comma."


In a regular expression, there are special reserved characters and character classes. For example:
- "`^`" matches the beginning of the line or cell
- "`.`" matches any character
- "`+`" means one or more repetitions of the preceding expressions

Anything that is not a special character or class is matched literally. A comma, for example, is not a special character in regular expressions, so inserting "`,`" in a regular expression will simply match that character in the string.

In our example, in order to extract the last name, the resulting regular expression would be:
"`^.+,`". We start at the beginning of the line ( "`^`" ), matching any characters ( "`.+`" ) until we come to the literal character of a comma ( "`,`" ).


_Note: if you want to actually look for one of these reserved characters, it must be escaped, so that, for example, the expression looks for a literal period, rather than the special regular expression meaning of a period. To escape a reserved character in a regular expression, precede it with a back slash ( "`\`" ). For example, "`\.`" will match a "`.`" character in a string._

__REGEX CHEATSHEET__


    - abc...     Letters
    - 123...     Digits
    - \d         Any Digit
    - \D         Any non-Digit Character
    - .          Any Character
    - \.         Period
    - [abc]      Only a, b, or c
    - [^abc]     Not a, b, or c
    - [a-z]      Characters a to z
    - [0-9]      Numbers 0 to 9
    - \w         any Alphanumeric character
    - \W         any non-Alphanumeric character
    - {m}        m Repetitions
    - {m,n}      m to n repetitions
    - *          Zero or more repetitions
    - +          One or more repetitions
    - ?          Optional Character
    - \s         any Whitespace
    - \S         any non-Whitespace character
    - ^...$      Starts & Ends
    - (...)      Capture Group
    - (a(bc))    Capture sub-Group
    - (.*)       Capture All
    - (abc|def)  Capture abc or def

__Examples:__
    - `(\d\d\D)`       will match 22X, 23G, 56H, etc.
    - `(\w)`           will match any single character in 0-9, a-z, A-Z, or an underscore
    - `(\w{1,3})`      will match a run of 1 to 3 alphanumeric characters
    - `(spell|spells)` will match spell or spells
    - `(corpo?)`       will match corp or corpo
    - `(feb 2.)`       will match feb 20, feb 21, feb 2a, etc.


__Using REGEX to match characters:__

In Python, to use a regular expression like this to search for matches in a given string, we use the built-in "`re`" package ( https://docs.python.org/2/library/re.html ), specifically the "`re.search()`" method. To use "`re.search()`", pass it first the regular expression you want to use to search, enclosed in quotation marks, and then the string you want to search within. 
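
As a minimal sketch of `re.search()`, the example below applies the last-name pattern discussed earlier to a made-up author string (the value is invented for illustration).

```python
import re

author = "Lovelace, Ada Augusta"   # made-up example value

# "^.+," : from the start of the string, one or more characters, then a comma
match = re.search(r"^.+,", author)
print(match.group(0))              # matched text, comma included

# A capture group returns the last name without the comma
last_name = re.search(r"^(.+),", author).group(1)
print(last_name)

# re.search returns None when nothing matches, so test before calling .group()
no_match = re.search(r"^.+,", "no comma here")
print(no_match is None)
```

Because `.+` is greedy, a string containing several commas would be matched up to the last comma. Using `[^,]+` instead of `.+` stops at the first one.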



__Using REGEX for replacing characters:__

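The `re` package also has an "`re.sub()`" method, which replaces every match of a regular expression with another string. pandas offers the same operation for an entire column (replacing expression1 with expression2) through `df['variable'].str.replace(r'expression1', 'expression2', regex=True)`. The `r` prefix marks a raw string, so that backslashes such as the one in `\s` reach the regular expression engine unchanged. The `regex=True` argument tells pandas to treat the pattern as a regular expression; pandas 2.0 and later treat it as a literal string by default.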
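
The following sketch shows the same substitution on a single string and on a pandas column. The data are made up, and `regex=True` is passed explicitly so that the pattern is interpreted as a regular expression regardless of the pandas default.

```python
import re
import pandas as pd

# re.sub(pattern, replacement, string) works on a single string
single = re.sub(r"\s\s+", " ", "ACME   TOOLS  INC")
print(single)

# The pandas string method applies the substitution to every value in a column
df = pd.DataFrame({"variable": ["ACME   TOOLS  INC", "U S BANK", "BUS STOP CAFE"]})
df["variable"] = df["variable"].str.replace(r"\s\s+", " ", regex=True)
df["variable"] = df["variable"].str.replace(r"\bU\sS\b", "US", regex=True)
cleaned = df["variable"].tolist()
print(cleaned)
```

The word boundaries (`\b`) keep the second pattern from touching "BUS STOP", where the letters U and S are part of longer words.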

**Applications:**

In [None]:
# Remove multiple successive spaces:
business['name_clean'] = business['name_clean'].str.replace(r'\s\s+', ' ', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\s\s+', ' ', regex=True)

# replace U S A by USA
business['name_clean'] = business['name_clean'].str.replace(r'\bU\sS\sA\b', 'USA', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\bU\sS\sA\b', 'USA', regex=True)

# replace U S by US
business['name_clean'] = business['name_clean'].str.replace(r'\bU\sS\b', 'US', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\bU\sS\b', 'US', regex=True)

**Handling the word "THE":**

In the KCMO Business Licenses data, the word "THE" at the beginning of a company name has been moved to the end. 

Instead of moving it back to the front of the name, let's just remove "THE" when it is the first or last word of the Business Name.

In [None]:
# Remove "THE" when it is the First Word:
business['name_clean'] = business['name_clean'].str.replace(r'^THE\b', '', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'^THE\b', '', regex=True)

# Remove "THE" when it is the Last Word:
business['name_clean'] = business['name_clean'].str.replace(r'\bTHE$', '', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\bTHE$', '', regex=True)

### Handling Business Suffixes
- Back to [Table of Contents](#Table-of-Contents)

You may have noticed that several business names finish with a legal suffix ("CO", "INC", "LIMITED LIABILITY", etc.). Unfortunately, examples below will show that these suffixes are not consistent between tables, and they make record linkage impossible. Below we detail one possible way of dealing with legal suffixes. 

We will start by using regular expressions to standardize legal suffixes. We will then isolate them into a separate variable. When matching the dataframes, we can now choose to match on both business name and legal suffixes, or on business name stripped of the legal suffix.

**Standardizing legal suffixes:**

In [None]:
# replace Company by CO
business['name_clean'] = business['name_clean'].str.replace('COMPANY', 'CO')
employer['name_clean'] = employer['name_clean'].str.replace('COMPANY', 'CO')

# replace CORPORATION by CORP
business['name_clean'] = business['name_clean'].str.replace('CORPORATION', 'CORP')
employer['name_clean'] = employer['name_clean'].str.replace('CORPORATION', 'CORP')

# replace National Association by NA
business['name_clean'] = business['name_clean'].str.replace('NATIONAL ASSOCIATION', 'NA')
employer['name_clean'] = employer['name_clean'].str.replace('NATIONAL ASSOCIATION', 'NA')

# replace N A by NA
business['name_clean'] = business['name_clean'].str.replace(r'\bN\sA\b', 'NA', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\bN\sA\b', 'NA', regex=True)

# replace Limited Liability Company by LLC
# (COMPANY has already been shortened to CO above, so both spellings are matched)
business['name_clean'] = business['name_clean'].str.replace(r'LIMITED LIABILITY CO(MPANY)?\b', 'LLC', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'LIMITED LIABILITY CO(MPANY)?\b', 'LLC', regex=True)

# replace L L C by LLC
business['name_clean'] = business['name_clean'].str.replace(r'\bL\sL\sC\b', 'LLC', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\bL\sL\sC\b', 'LLC', regex=True)

# replace Limited Partnership by LP
business['name_clean'] = business['name_clean'].str.replace('LIMITED PARTNERSHIP', 'LP')
employer['name_clean'] = employer['name_clean'].str.replace('LIMITED PARTNERSHIP', 'LP')

# replace L P by LP
business['name_clean'] = business['name_clean'].str.replace(r'\bL\sP\b', 'LP', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\bL\sP\b', 'LP', regex=True)

# replace Partnership by PTNSHP
business['name_clean'] = business['name_clean'].str.replace('PARTNERSHIP', 'PTNSHP')
employer['name_clean'] = employer['name_clean'].str.replace('PARTNERSHIP', 'PTNSHP')

# replace ASSOCIATION by ASSOC
business['name_clean'] = business['name_clean'].str.replace('ASSOCIATION', 'ASSOC')
employer['name_clean'] = employer['name_clean'].str.replace('ASSOCIATION', 'ASSOC')

# replace ASSOCIATES or ASSOCIATE by ASSOC
business['name_clean'] = business['name_clean'].str.replace(r'ASSOCIATES?', 'ASSOC', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'ASSOCIATES?', 'ASSOC', regex=True)

Do you see any other possible standardizations? Insert them below!

**Creating a legal suffix variable and removing legal suffixes from business names:**

Next, let's isolate the last word of the business name and keep it as a separate variable. We can then remove all business suffixes from the legal names.

In [None]:
# Get last word of the clean name
business['legal_suffix'] = business['name_clean'].str.split(' ').str.get(-1)
employer['legal_suffix'] = employer['name_clean'].str.split(' ').str.get(-1)

In [None]:
# Make list of legal suffixes
legal = pd.Series(['INC', 'LTD', 'CORP', 'CO', 'LLC', 'LP', 'ASSOC', 'PTNSHP'])

# Only keep the legal suffix if it is in the list; otherwise use an empty string
business['legal_suffix'] = business['legal_suffix'].where(business['legal_suffix'].isin(legal), '')
employer['legal_suffix'] = employer['legal_suffix'].where(employer['legal_suffix'].isin(legal), '')

In [None]:
# Remove the legal suffixes from the clean name
business['name_clean'] = business['name_clean'].str.replace(r'\b(INC|LTD|CORP|CO|LLC|LP|ASSOC|PTNSHP)\b', '', regex=True)
employer['name_clean'] = employer['name_clean'].str.replace(r'\b(INC|LTD|CORP|CO|LLC|LP|ASSOC|PTNSHP)\b', '', regex=True)

In [None]:
# Final cleaning, stripping of strings
business['name_clean'] = clean(business['name_clean'], lowercase=False, strip_accents='ascii')
employer['name_clean'] = clean(employer['name_clean'], lowercase=False, strip_accents='ascii')
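
The cleaning steps can also be tried on single strings, which makes it easy to check what each rule does. The helper below is a hypothetical sketch using only the `re` module: `standardize_name` is not part of the `recordlinkage` package, it covers only a few of the rules above, and it differs from the notebook in one respect, namely that it removes a legal suffix only when it is the last word.

```python
import re

LEGAL_SUFFIXES = ("INC", "LTD", "CORP", "CO", "LLC", "LP", "ASSOC", "PTNSHP")

def standardize_name(raw):
    """Hypothetical helper: return (clean_name, legal_suffix) for one business name."""
    name = raw.upper()
    name = re.sub(r"[^A-Z0-9 ]", " ", name)             # punctuation -> space
    name = re.sub(r"\bL\sL\sC\b", "LLC", name)           # "L L C" -> "LLC"
    name = re.sub(r"\bCORPORATION\b", "CORP", name)
    name = re.sub(r"\bCOMPANY\b", "CO", name)
    name = re.sub(r"^THE\b|\bTHE$", "", name.strip())    # leading or trailing THE
    words = name.split()
    suffix = ""
    if words and words[-1] in LEGAL_SUFFIXES:            # only the last word counts
        suffix = words.pop()
    return " ".join(words), suffix

for raw in ["The Acme Company", "ACME CO.", "Smith & Sons, L.L.C.", "Coffee Co Roasters"]:
    print(raw, "->", standardize_name(raw))
```

The first two spellings end up with the same clean name and suffix, which is what makes an exact match possible afterwards. The last example keeps "CO" because it is not in final position, whereas the notebook's pattern removes it wherever it appears.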

### Record Linkage: Exact Matching on One Field
- Back to [Table of Contents](#Table-of-Contents)

Now that the name standardizing and cleaning is done, we will proceed to the record linkage. In this case we will present the simplest form of linkage: exact matching on a single variable (`name_clean`). We use the pandas `merge` for the join. Since we want to keep all data entries from both tables, we will do an outer merge on `name_clean`.

In [None]:
linked_names = pd.merge(business, employer, how = 'outer', on = 'name_clean', indicator = True)

In [None]:
# How did our merge perform?
linked_names['_merge'].value_counts(normalize = True)

> Only 20% of the business names were merged correctly. You can try improving the merge count by changing the name cleaning and standardizing.
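
To see how the `_merge` indicator should be read, here is a small sketch on made-up tables (all names and values are invented).

```python
import pandas as pd

# Made-up cleaned names standing in for the two tables
business_toy = pd.DataFrame({"name_clean": ["ACME", "BOBS DINER", "KC TOOLS"],
                             "address": ["1 MAIN ST", "22 OAK AVE", "5 ELM ST"]})
employer_toy = pd.DataFrame({"name_clean": ["ACME", "KC TOOLS", "ZED LABS", "OMEGA"],
                             "pl_zip": ["64106", "64108", "64111", "64112"]})

linked_toy = pd.merge(business_toy, employer_toy, how="outer",
                      on="name_clean", indicator=True)

# "_merge" records where each row came from: both, left_only or right_only
counts = linked_toy["_merge"].value_counts()
print(counts)

# Share of business names that found a partner in the employer table
matched = (linked_toy["_merge"] == "both").sum()
match_rate = matched / len(business_toy)
print(round(match_rate, 2))
```

Note that `value_counts(normalize=True)` on the merged table gives shares of merged rows, not of the original businesses. When a name occurs several times in either table, the merge produces one row per combination, so the two figures can differ.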

## Record Linkage on Addresses
- Back to [Table of Contents](#Table-of-Contents)

Now that we have the business name cleaned and standardized across datasets, we can use the different name pairs for string matching. However, before doing that we would also like to look at other variables of interest that might improve the linkage by increasing the probability of a correct match (and reducing errors). The variable that seems immediately important is the address of a given establishment. We can pre-process this attribute by using regex functions.

In [None]:
business.head()

In [None]:
employer.head()

In [None]:
water.head()

In [None]:
business = business[business['address'].notnull()]

In [None]:
employer = employer[(employer['pl_addr1'].notnull())&(employer['pl_zip'].notnull())]

In [None]:
water = water[(water['ucbprem_street_number'].notnull())
              &(water['ucbprem_pdir_code_pre'].notnull())
              &(water['ucbprem_street_name'].notnull())
              &(water['ucbprem_ssfx_code'].notnull())
              &(water['ucbprem_zipc_code'].notnull())]

The water data's address fields are entirely parsed into `street_name`, `street_number`, `ssfx_code` (street suffix code), `city`, and `zipcode`. In the employer data, the address is partially parsed, but the `pl_addr1` field still contains the street name, number and ssfx code. Finally, in the business data, all fields are concatenated into the `address` field.

The first step will therefore be to standardize the parsing across the 3 tables. All parsed address features will be prefixed with a `p_`.

### Address Parsing
- Back to [Table of Contents](#Table-of-Contents)


In [None]:
business['p_address'] = clean(business['address'], strip_accents = 'ascii'
                              , replace_by_none='[^ \\-\\_A-Za-z0-9#]+'
                              , replace_by_whitespace='[_]').str.upper()

In [None]:
employer['p_address'] = clean(employer['pl_addr1'], strip_accents = 'ascii'
                              , replace_by_none='[^ \\-\\_A-Za-z0-9#]+').str.upper()
employer['p_zipcode'] = clean(employer['pl_zip']).str.upper()

In [None]:
water['p_street_number'] = clean(water['ucbprem_street_number'], strip_accents = 'ascii').str.upper()
water['p_prefix'] = clean(water['ucbprem_pdir_code_pre'], strip_accents = 'ascii').str.upper().fillna('')
water['p_street_name'] = clean(water['ucbprem_street_name'], strip_accents = 'ascii').str.upper()
water['p_suffix'] = clean(water['ucbprem_ssfx_code'], strip_accents = 'ascii').str.upper().fillna('')
water['p_zipcode'] = clean(water['ucbprem_zipc_code'], strip_accents = 'ascii').str.upper()

**Extracting Zipcode:**

US zip codes typically follow a set pattern of 5 digits, or 5 digits followed by a hyphen and then 4 digits. We can use this information to extract zipcodes. In the end, we will use only the first 5 digits of the zip code (the first few lines of data inform us that not all observations have the 9-digit zipcode).

In [None]:
# Pattern 1
# 5 digits of principal zipcode
business['p_zipcode_short'] = business['p_address'].str.extract(r'(\d{5})$')
# Breaking the code: 
# \d tells that we need a digit
# \d{5} tells that we need 5 digits consecutively
# () enclosing brackets tell that we need to extract this information in the new variable
# $ tells us that this pattern has to come at the end of the cell

In [None]:
# Pattern 2
# Let's extract the full zipcode from the business data
# '5 digits <hyphen> 4 digits'
business['p_zipcode_full'] = business['p_address'].str.extract(r'(\d{5}-\d{4})$')
# Breaking the code: 
# \d{5} ---- tells that we need 5 digits consecutively
# '-' this is just passing the string exactly as we need it
# \d{4} tells us 4 more digits
# $ tells us that this pattern has to come at the end of the cell

In [None]:
# Since we only want to keep the first 5 digits of the zipcode, let's match either of 
# the 2 first patterns and keep the first 5 digits.
# Pattern 3
# We can also pass both the expressions in our query as an OR.
# expand=False returns a Series, so that .str[:5] can be applied to the result
business['p_zipcode'] = business['p_address'].str.extract(r'(\d{5}-\d{4}|\d{5})$', expand=False).str[:5]
# Breaking the code: 
# | tells us we need one or the other of the two possible zipcodes

In [None]:
business.head()

In [None]:
del business['p_zipcode_full']
del business['p_zipcode_short']

In [None]:
# Let's do the same for the water data and employer data
water['p_zipcode'] = water['p_zipcode'].str[:5]
employer['p_zipcode'] = employer['p_zipcode'].str[:5]
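
The zip code pattern can be checked on a few strings before it is applied to a whole column. The sketch below uses the `re` module directly; the addresses are made up and `extract_zip5` is a hypothetical helper name.

```python
import re

ZIP_PATTERN = re.compile(r"(\d{5}-\d{4}|\d{5})$")

def extract_zip5(address):
    """Return the first five digits of a trailing zip code, or None."""
    found = ZIP_PATTERN.search(address)
    return found.group(1)[:5] if found else None

# Made-up addresses
samples = ["414 E 12TH ST KANSAS CITY MO 64106",
           "1 MEMORIAL DR KANSAS CITY MO 64108-4616",
           "PO BOX 12 KANSAS CITY MO"]
for s in samples:
    print(s, "->", extract_zip5(s))
```

The `$` anchor is what keeps a five-digit street number at the start of an address from being mistaken for a zip code.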

**Extracting State:**

The business data seems to have businesses from both Missouri and Kansas. Let's parse out the state name.

In [None]:
business['p_state'] = business['p_address'].str.extract(r'\b(MO|KS)\b')

In [None]:
business.head()

We can then restrict to businesses in Missouri.

In [None]:
business = business[business['p_state']=="MO"].reset_index(drop = True)

**Extracting city:**

Let's extract the address city when the business is in Kansas City. We will then restrict our business data to these observations in Kansas City.

In [None]:
business['p_city'] = business['p_address'].str.extract(r'\b(KANSAS CITY)\b')

In [None]:
business = business[business['p_city']=="KANSAS CITY"].reset_index(drop = True)

In [None]:
# Now that we have extracted them, let's strip the address field 
# in the business data of zipcode, state, and city
business['p_address'] = business['p_address'].str.replace(r'(\d{5}-\d{4}|\d{5})$', '', regex=True)
business['p_address'] = business['p_address'].str.replace(r'\b(MO|KANSAS CITY)\b', '', regex=True)

**Extracting Street Name, Number, Suffix Code:**


In order to match the 3 datasets, we still have to parse out street name, number, and suffix code from the business and employer tables. We will proceed in similar fashion using regex.

In [None]:
# Cleaning: trimming spaces
business['p_address'] = business['p_address'].str.replace(r'\s\s+', ' ', regex=True).str.strip()
employer['p_address'] = employer['p_address'].str.replace(r'\s\s+', ' ', regex=True).str.strip()

In [None]:
business[business['p_address'].str.contains(r'\bSTE\b')].head(2)

A few businesses have addresses with suite information ("STE" followed by a number or a letter) or apartment information ("APT" followed by a number or letter). Since this does not appear in the other datasets, let's get rid of these.

In [None]:
# Pattern: a unit keyword (STE, SUITE, BLDG, ...) followed by one word or number
business['p_address'] = business['p_address'].str.replace(r'\b((STE|SUITE|BLDG|APT|FL|UNIT|HNGR)\s\w+)\b', '', regex=True)
employer['p_address'] = employer['p_address'].str.replace(r'\b((STE|SUITE|BLDG|APT|FL|UNIT|HNGR)\s\w+)\b', '', regex=True)
# Breaking up the code: 
# \b tells us we start at the beginning/end of a word
# STE|SUITE|BLDG|APT|FL|UNIT|HNGR tells us we need to match one of those strings
# \s tells us we need to match a space
# \w tells us that we need an alphanumeric character
# \w+ tells us that we need one or more consecutive alphanumeric characters (a word)

# Pattern: Remove # followed by a number
business['p_address'] = business['p_address'].str.replace(r'\s\#\s\w+\b', '', regex=True)
employer['p_address'] = employer['p_address'].str.replace(r'\s\#\s\w+\b', '', regex=True)

In [None]:
# Cleaning
business['p_address'] = business['p_address'].str.replace(r'\s\s+', ' ', regex=True).str.strip()
employer['p_address'] = employer['p_address'].str.replace(r'\s\s+', ' ', regex=True).str.strip()
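
The effect of the unit patterns is easiest to see on individual strings. The following sketch applies them with `re.sub`; the addresses are made up and `drop_unit` is a hypothetical helper name.

```python
import re

UNIT_PATTERN = r"\b((STE|SUITE|BLDG|APT|FL|UNIT|HNGR)\s\w+)\b"

def drop_unit(address):
    """Remove unit designators and '# <number>' parts, then tidy the spaces."""
    address = re.sub(UNIT_PATTERN, "", address)
    address = re.sub(r"\s\#\s\w+\b", "", address)
    return re.sub(r"\s\s+", " ", address).strip()

# Made-up addresses
print(drop_unit("1100 MAIN ST STE 200"))
print(drop_unit("4500 W 47TH ST APT B"))
print(drop_unit("901 OAK ST # 12"))
```

Since the pattern removes whichever word follows the keyword, it is worth looking at a sample of the changed rows to confirm that no street names were removed by accident.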

Now that the addresses on both the business and the employer tables are in similar format, let's parse out the different elements.

In [None]:
# Let's extract street number:
# The dashed form comes first; otherwise \d+ alone would already match the "14" of "14-C"
business['p_street_number'] = business['p_address'].str.extract(r'^(\d+\-\w+|\d+\w*)\b', expand=False)
employer['p_street_number'] = employer['p_address'].str.extract(r'^(\d+\-\w+|\d+\w*)\b', expand=False)
# \d+\-\w+ matches digits followed by a dash and more characters (14-C or 123-125)
# \d+\w* matches a succession of digits, optionally followed by letters (123 or 14C)

# Let's remove the street number from the address (same pattern as above)
business['p_address'] = business['p_address'].str.replace(r'^(\d+\-\w+|\d+\w*)\b', '', regex=True)
employer['p_address'] = employer['p_address'].str.replace(r'^(\d+\-\w+|\d+\w*)\b', '', regex=True)

In [None]:
# Cleaning
business['p_address'] = business['p_address'].str.replace(r'\s\s+', ' ', regex=True).str.strip()
employer['p_address'] = employer['p_address'].str.replace(r'\s\s+', ' ', regex=True).str.strip()
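
The order of alternatives in a regular expression matters, because the engine takes the first alternative that succeeds. The sketch below, on made-up addresses, compares a pattern that tries the dashed form first with one that tries plain digits first. `split_street_number` is a hypothetical helper name.

```python
import re

# Dashed form first, then digits optionally followed by letters
NUMBER_PATTERN = re.compile(r"^(\d+-\w+|\d+\w*)\b")

def split_street_number(address):
    """Return (street_number, rest_of_address); the number is None if absent."""
    found = NUMBER_PATTERN.search(address)
    if not found:
        return None, address
    return found.group(1), address[found.end():].strip()

for s in ["123 MAIN ST", "14C OAK AVE", "14-C OAK AVE", "123-125 ELM ST", "MAIN ST"]:
    print(s, "->", split_street_number(s))

# With plain digits listed first, the match stops at the dash
digits_first = re.search(r"^(\d+|\d+\w+|\d+-\w+)\b", "14-C OAK AVE").group(1)
print(digits_first)
```

With the digits-first ordering, "14-C" is cut down to "14", since the dash already counts as a word boundary.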

In [None]:
business.head()

The only thing left in the address is street name, with prefix and suffix when applicable.

In [None]:
# Let's create a list of street prefixes based on the ones in the water data
prefixes = water[water['p_prefix'].notnull()]['p_prefix'].unique()
print(prefixes)

In [None]:
# Let's create a list of street suffix codes based on the ones in the water data
suffixes = water[water['p_suffix'].notnull()]['p_suffix'].unique()
print(suffixes)

In [None]:
# Test if the first word is a prefix. Use an empty string when it is not (or when the address is missing)
business_first = business['p_address'].str.extract(r'^(\w+)', expand=False)
employer_first = employer['p_address'].str.extract(r'^(\w+)', expand=False)
business['p_prefix'] = business_first.where(business_first.isin(prefixes), '')
employer['p_prefix'] = employer_first.where(employer_first.isin(prefixes), '')

# Test if the last word is a suffix. Use an empty string when it is not (or when the address is missing)
business_last = business['p_address'].str.extract(r'(\w+)$', expand=False)
employer_last = employer['p_address'].str.extract(r'(\w+)$', expand=False)
business['p_suffix'] = business_last.where(business_last.isin(suffixes), '')
employer['p_suffix'] = employer_last.where(employer_last.isin(suffixes), '')

In [None]:
business.head()

The street name is whatever is left in the address field once we remove prefix and suffix.

In [None]:
business['p_street_name'] = business['p_address']
employer['p_street_name'] = employer['p_address']

# Remove prefix when applicable (regex=True is required in recent pandas versions)
business['p_street_name'] = np.where(business['p_prefix']=='', business['p_street_name']
                                     , business['p_street_name'].str.replace(r'^(\w+)', '', regex=True))
employer['p_street_name'] = np.where(employer['p_prefix']=='', employer['p_street_name']
                                     , employer['p_street_name'].str.replace(r'^(\w+)', '', regex=True))

# Remove suffix when applicable
business['p_street_name'] = np.where(business['p_suffix']=='', business['p_street_name']
                                     , business['p_street_name'].str.replace(r'(\w+)$', '', regex=True))
employer['p_street_name'] = np.where(employer['p_suffix']=='', employer['p_street_name']
                                     , employer['p_street_name'].str.replace(r'(\w+)$', '', regex=True))

del business['p_address']
del employer['p_address']
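
The prefix, suffix, and street-name steps can be summarized on plain strings. This is a sketch only: the `PREFIXES` and `SUFFIXES` sets are small invented stand-ins for the values taken from the water data, and `split_street` is a hypothetical helper that splits on whitespace rather than using the regular expressions above.

```python
# Invented stand-ins for the prefix and suffix codes found in the water data.
PREFIXES = {'N', 'S', 'E', 'W', 'NE', 'NW', 'SE', 'SW'}
SUFFIXES = {'ST', 'AVE', 'BLVD', 'RD', 'DR', 'TER'}

def split_street(address):
    """Split a street string into (prefix, street_name, suffix)."""
    words = address.split()
    prefix = suffix = ''
    # The first word counts as a prefix only if it is a known prefix code.
    if words and words[0] in PREFIXES:
        prefix = words.pop(0)
    # The last remaining word counts as a suffix only if it is a known suffix code.
    if words and words[-1] in SUFFIXES:
        suffix = words.pop()
    return prefix, ' '.join(words), suffix

# Made-up street strings for illustration only.
for sample in ['E 12TH ST', 'MAIN ST', 'BROADWAY', 'W PENNWAY TER']:
    print(split_street(sample))
```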

In [None]:
# Cleaning: collapse repeated whitespace and trim the ends
business['p_street_name'] = business['p_street_name'].str.replace(r'\s\s+', ' ', regex=True).str.strip()
employer['p_street_name'] = employer['p_street_name'].str.replace(r'\s\s+', ' ', regex=True).str.strip()

### Advanced Methods of Record Linkage
- Back to [Table of Contents](#Table-of-Contents)

Now that we have standardized all three datasets, we can proceed to the record linkage. This time, we will use all the fields we created for more accurate linkage. The `recordlinkage` package is a powerful tool when you want to link records on several variables without being limited to records that match perfectly on every one of them. It comes with built-in distance metrics and comparison functions, and it also allows you to create your own.

Below we will provide examples of record linkage according to 3 methods:
- Exact Record Linkage on all fields
- Rule-Based Record Linkage
- Probabilistic Record Linkage

In the examples, we will link employer data with water data, but all methods can be used to link all 3 dataframes as well.

In [None]:
business.head()

In [None]:
water.head()

In [None]:
employer.head()

### Record Linkage: Exact Matching on Several Fields
- Back to [Table of Contents](#Table-of-Contents)

Exact record linkage is similar to what we did in the first section with business names. Here we use all cleaned variables.

In [None]:
linked_data_exact = pd.merge(employer, water, how = 'outer'
                             , on = ['p_street_number', 'p_prefix', 'p_street_name', 'p_suffix', 'p_zipcode']
                             , indicator = True)

In [None]:
linked_data_exact['_merge'].value_counts()
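
Conceptually, an exact match on several fields is a join on a composite key. The sketch below reproduces that idea with plain Python sets and invented records. It illustrates the logic only and is not a replacement for `pd.merge`: it counts distinct keys rather than rows, and all field values are made up.

```python
KEY_FIELDS = ['p_street_number', 'p_prefix', 'p_street_name', 'p_suffix', 'p_zipcode']

def composite_key(record):
    """Build the tuple of fields that must all agree for an exact match."""
    return tuple(record[field] for field in KEY_FIELDS)

# Invented records for illustration only.
employer_rows = [
    {'id': 'e1', 'p_street_number': '1200', 'p_prefix': '', 'p_street_name': 'MAIN',
     'p_suffix': 'ST', 'p_zipcode': '64105'},
    {'id': 'e2', 'p_street_number': '55', 'p_prefix': 'E', 'p_street_name': '12TH',
     'p_suffix': 'ST', 'p_zipcode': '64106'},
]
water_rows = [
    {'id': 'w1', 'p_street_number': '1200', 'p_prefix': '', 'p_street_name': 'MAIN',
     'p_suffix': 'ST', 'p_zipcode': '64105'},
    {'id': 'w2', 'p_street_number': '55', 'p_prefix': 'E', 'p_street_name': '12 TH',
     'p_suffix': 'ST', 'p_zipcode': '64106'},
]

employer_keys = {composite_key(row) for row in employer_rows}
water_keys = {composite_key(row) for row in water_rows}

# Mirror the three categories reported by the merge indicator.
counts = {
    'both': len(employer_keys & water_keys),
    'left_only': len(employer_keys - water_keys),
    'right_only': len(water_keys - employer_keys),
}
print(counts)
```

The `12TH` / `12 TH` pair fails to link because of one stray space, which is the kind of near miss that similarity measures are meant to catch.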

### Record Linkage: Rule-Based Matching
- Back to [Table of Contents](#Table-of-Contents)

Instead of matching only on exact features, we can try to improve our results by looking at *similar* entries. Indexing creates candidate links: pairs of rows that might refer to the same real-world entity. The set of candidate pairs is also called the comparison space (matrix). There are different ways to index data. The simplest is a full index, which treats every possible pair as a candidate. This is also the least efficient method, because every row of one dataset is compared with every row of the other dataset.

In [None]:
# This line is very slow -- we will not run it here.

# # Let's generate a full index first (comparison table of all possible linkage combinations)
# indexer = rl.FullIndex()
# pairs = indexer.index(employer.head(100), business.head(100))
# # Returns a pandas MultiIndex object
# print(len(pairs))

We can do better if we use our knowledge of the data to eliminate bad links from the start. This can be done through blocking. The `recordlinkage` package gives you multiple options for this. For example, you can block on variables, which means only pairs that agree exactly on the specified variables are kept. You can also use a neighborhood index, in which the rows of your dataframe are sorted by some value and only rows that are close to each other in that ordering are paired.
> We suppose here that zip code, street number, prefix, and suffix are correct, and we block on them. We then only look for *similar* street names.

In [None]:
indexerBL = rl.BlockIndex(on=['p_zipcode', 'p_street_number', 'p_prefix', 'p_suffix'])
pairs2 = indexerBL.index(employer, water)
# Returns a pandas MultiIndex object
print(len(pairs2))
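
To make the effect of blocking concrete, the sketch below counts candidate pairs for a full index and for a blocked index on invented records, using only the standard library. The `block_pairs` helper and the sample rows are hypothetical and only mimic the idea that `recordlinkage` applies to DataFrames.

```python
from collections import defaultdict
from itertools import product

# Invented (zipcode, street_number, street_name) records.
left = [('64105', '1200', 'MAIN'), ('64105', '1200', 'MIAN'), ('64106', '55', '12TH')]
right = [('64105', '1200', 'MAIN'), ('64106', '55', '12 TH'), ('64108', '900', 'OAK')]

def block_pairs(left_rows, right_rows, key):
    """Return index pairs whose blocking key agrees exactly."""
    buckets = defaultdict(list)
    for j, row in enumerate(right_rows):
        buckets[key(row)].append(j)
    return [(i, j) for i, row in enumerate(left_rows) for j in buckets.get(key(row), [])]

# Full index: every left row paired with every right row.
full_pairs = list(product(range(len(left)), range(len(right))))
# Blocked index: only pairs sharing zipcode and street number.
blocked_pairs = block_pairs(left, right, key=lambda row: (row[0], row[1]))

print(len(full_pairs))
print(blocked_pairs)
```

The full index grows with the product of the two table sizes, while the blocked index only keeps pairs inside each block, so the later string comparisons run on far fewer candidates.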

In [None]:
# Initiate compare object (we are using the blocked ones here)
# You want to give python the name of the MultiIndex and the names of the datasets
compare = rl.Compare(pairs2, employer, water)

Now we have set up our comparison space. We can start to compare our files and see if we find matches. We will demonstrate an exact match and rule-based approaches using distance measures.

In [None]:
# Exact comparison
# This compares each candidate pair of strings for an exact match (1 if equal, 0 otherwise)
# It is similar to a JOIN
exact = compare.exact('p_street_name','p_street_name',name='exact')

In [None]:
# This command gives us a similarity score between two strings based on the Levenshtein distance
# The measure is 0 if the strings have nothing in common, 1 if they are identical
levenshtein = compare.string('p_street_name','p_street_name', name='levenshtein')
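
For intuition about what a Levenshtein-based score measures, here is a standard-library sketch. It normalizes the edit distance by the length of the longer string, which is one common way to obtain a 0-to-1 similarity. Treat the exact normalization as an assumption rather than a statement of what `recordlinkage` computes internally.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest

# Made-up street names for illustration only.
print(levenshtein_distance('MAIN', 'MIAN'))
print(levenshtein_similarity('12TH', '12 TH'))
```

Plain Levenshtein distance has no transposition operation, so swapping two adjacent letters (`MAIN` / `MIAN`) costs two edits rather than one.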

In [None]:
# This command gives us a similarity score between two strings based on the Jaro-Winkler distance
# The measure is 0 if the strings have nothing in common, 1 if they are identical
jarowinkler_name = compare.string('p_street_name','p_street_name', method='jarowinkler', name='jarowinkler')
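
The Jaro-Winkler score can also be written out by hand. The sketch below follows the textbook definition (a matching window, a transposition count, and a bonus for a shared prefix of up to four characters with scaling factor 0.1). Library implementations may differ in edge cases, so this is for intuition only.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for strings that share a prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)

print(round(jaro_winkler('MARTHA', 'MARHTA'), 4))
```

Because of the prefix bonus, Jaro-Winkler favors strings that agree at the start, which suits names and street names where errors tend to occur later in the string.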

In [None]:
# Finally, we can summarize the different metrics to compare how they behave across all candidate pairs
print(compare.vectors.describe())

Once we have our comparison measures, we need to classify each candidate pair as a match or a non-match. A rule-based approach is to set a threshold: here, a pair is considered a match if at least one of its similarity measures is above 0.80, and everything else is left unmatched. The choice of threshold must be made by the analyst.

In [None]:
# Classify matches 
matches = compare.vectors[compare.vectors.max(axis=1) > 0.80]
matches = matches.sort_values("jarowinkler")
matches.head()
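
The classification rule itself is simple enough to state in a few lines. The sketch below applies a threshold to invented similarity scores using only the standard library. The scores, the pair labels, and the `classify` helper are hypothetical.

```python
# Invented similarity scores per candidate pair: (exact, levenshtein, jarowinkler).
scores = {
    ('e1', 'w1'): (1.0, 1.00, 1.00),
    ('e2', 'w2'): (0.0, 0.80, 0.95),
    ('e3', 'w3'): (0.0, 0.25, 0.55),
}

def classify(pair_scores, threshold=0.80):
    """Keep pairs whose best score is strictly above the threshold."""
    return {pair: max(values) for pair, values in pair_scores.items()
            if max(values) > threshold}

kept = classify(scores)
print(sorted(kept))
```

Raising the threshold keeps fewer but more reliable links, and lowering it keeps more links at the cost of more false matches.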

Now that we have the list of matches, we can fuse the two datasets into a single combined dataset. We use a small helper function for this task.

In [None]:
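def fuse(dfA, dfB, dfmatches):
    # Start from a copy of dfA and add every column of dfB, blank by default.
    # Note: a column with the same name in both tables is replaced
    # (blank for unmatched rows, dfB's value for matched rows).
    newDF = dfA.copy()
    columns = dfB.columns.values

    for col in columns:
        newDF[col] = ''

    # The index of dfmatches is a MultiIndex of (index in dfA, index in dfB) pairs
    for indexA, indexB in dfmatches.index:
        for col in columns:
            # Use a single .loc call: chained indexing (df.loc[i][col] = ...) may write to a temporary copy
            newDF.loc[indexA, col] = dfB.loc[indexB, col]
    return newDF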

In [None]:
result = fuse(employer, water, matches)
result.head()

### Record Linkage: Fellegi Sunter
- Back to [Table of Contents](#Table-of-Contents)

Another way of classifying records is the Fellegi-Sunter method. If Fellegi-Sunter is used to classify record pairs, you would follow all the steps we have done so far. However, you would now estimate probabilities to construct weights. These weights are then applied during classification to give certain characteristics more importance. For example, agreement on a very rare name is stronger evidence of a match than agreement on a common name such as Bob Miller.

This method is detailed in the class presentation but will not be implemented here. 
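
Although the method is not implemented in this notebook, the weight calculation at its core can be sketched briefly. For each field, `m` is the probability that the field agrees given that two records are a true match, and `u` is the probability that it agrees given that they are not. The `m` and `u` values below are invented for illustration. In practice they are estimated from the data, for example with the EM algorithm.

```python
from math import log2

# Invented m- and u-probabilities per field (assumptions for illustration).
FIELD_PROBS = {
    'p_zipcode':     {'m': 0.95, 'u': 0.05},
    'p_street_name': {'m': 0.90, 'u': 0.01},
    'p_suffix':      {'m': 0.85, 'u': 0.30},
}

def field_weight(m, u, agrees):
    """Log-likelihood-ratio weight for one field."""
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

def match_score(agreement):
    """Sum the field weights for a pair, given {field: True/False} agreement flags."""
    return sum(field_weight(FIELD_PROBS[f]['m'], FIELD_PROBS[f]['u'], agrees)
               for f, agrees in agreement.items())

all_agree = match_score({'p_zipcode': True, 'p_street_name': True, 'p_suffix': True})
name_differs = match_score({'p_zipcode': True, 'p_street_name': False, 'p_suffix': True})
print(round(all_agree, 2), round(name_differs, 2))
```

Agreement on a field that rarely agrees by chance (low `u`), such as street name here, adds more weight than agreement on a field with common values, which mirrors the point about rare names being stronger evidence. Pairs are then classified by comparing the total score with an upper and a lower threshold.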

---

## Additional Resources

### Parsing

* Python online documentation: https://docs.python.org/2/library/string.html#deprecated-string-functions
* Python 2.7 Tutorial(Splitting and Joining Strings): http://www.pitt.edu/~naraehan/python2/split_join.html

### Regular Expression

* Python documentation: https://docs.python.org/2/library/re.html#regular-expression-syntax
* Online regular expression tester (good for learning): http://regex101.com/

### String Comparators

* GitHub page of jellyfish: https://github.com/jamesturk/jellyfish
* Different distances that measure the differences between strings:
    - Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
    - Damerau–Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    - Jaro–Winkler distance: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
    - Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance
    - Match rating approach: https://en.wikipedia.org/wiki/Match_rating_approach

### Fellegi-Sunter Record Linkage 

* Introduction to Probabilistic Record Linkage: http://www.bristol.ac.uk/media-library/sites/cmm/migrated/documents/problinkage.pdf
* Paper Review: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf