# Record Linkage: Introduction and Exercises

----

## Introduction

This notebook provides an introduction to record linkage using Python. Upon completion, you will be able to apply record linkage techniques using the *recordlinkage* package to combine data from different sources in Python.
It leads you through all the steps necessary for a successful record linkage, starting with data preparation, including pre-processing, cleaning, and standardization of data.
The notebook follows the underlying lecture and provides examples of how to implement record linkage techniques.

## Learning Objectives

The goal of this notebook is for you to understand record linkage techniques. You will be responsible for linking the different datasets in the ADRF, and the resulting dataset will be used in your later projects.

## Table of Contents
- [The Principles of Record Linkage](#The-Principles-of-Record-Linkage)
- [Setting Up the Case Study](#Setting-Up-the-Case-Study)
- [Data Description](#Data-Description)
- [Import of Packages](#Import-of-Packages)
- [Connecting to the Database](#Connecting-to-the-Database)
- [Data Exploration](#Data-Exploration)
- [The Importance of Pre-Processing](#The-Importance-of-Pre-Processing)
- [Pre-processing of Identifiers](#Pre-processing-of-Identifiers)
- [Record Linkage](#Record-Linkage)
- [Results](#Results)
- [References and Further Readings](#References-and-Further-Readings)

## Methods
- Back to [Table of Contents](#Table-of-Contents)

For this notebook exercise we are interested in data on the businesses of Kansas City, MO. Business names appear in different datasets: Business Licenses from KCMO Department of Revenue, Water Services data, and Patent Data. Here, we will combine the three.

- **Analytical Exercise**: Merge Patents Data and Water Services Data onto the Kansas City Business Licenses. 

- **Data Availability**: We have variables that refer to the Business Name and Business Address in all datasets. Kansas City Business Licenses have 4 years of data, Water Services <span style="background-color: #FFFF00">has 5</span>, and Patent Data goes back to 1976.

- **Approach**: We will look at the data available to us, then clean and pre-process it to enable better linkage. Since the only identifiers we have for this case study are the names and addresses of the firms, we will have to use string matching techniques, which are part of Python's *recordlinkage* package.



## The Principles of Record Linkage
- Back to [Table of Contents](#Table-of-Contents)

The goal of record linkage is to determine if pairs of records describe the same identity. For instance, this is important for removing duplicates from a data source or joining two separate data sources together. Record linkage also goes by the terms data matching, merge/purge, duplication detection, de-duping, reference matching, entity resolution, disambiguation, co-reference/anaphora in various fields.

There are several approaches to record linkage, including exact matching, rule-based linking, and probabilistic linking:

- An example of **exact matching** is joining records based on social security number, exact name, or geographic code information. This is what you have already done in SQL by joining tables on a unique identifier.
- **Rule-based matching** involves applying a cascading set of rules that reflect the domain knowledge of the records being linked.
- In **probabilistic record linkage**, linkage weights are estimated to calculate the probability of a certain match.
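
To make the difference between the first two approaches concrete, here is a minimal sketch in plain Python. The two rules and the patent-side spelling of "Commerce Bank" are assumptions made for illustration; they are not taken from the datasets or from the *recordlinkage* package.

```python
# Illustrative records only: the patent-side spellings are hypothetical.
licenses = ["HUSSEY SEATING CO", "COMMERCE BANK"]
patents = ["Hussey Seating Company", "Commerce Bank"]


def exact_match(a, b):
    """Exact matching: the two strings must be identical."""
    return a == b


def rule_based_match(a, b):
    """Rule-based matching: apply domain rules, then compare.

    Rule 1: ignore case.
    Rule 2: treat COMPANY and CO as the same suffix.
    """
    def normalize(name):
        words = name.upper().split()
        return ["CO" if word == "COMPANY" else word for word in words]

    return normalize(a) == normalize(b)


exact_hits = [(a, b) for a in licenses for b in patents if exact_match(a, b)]
rule_hits = [(a, b) for a in licenses for b in patents if rule_based_match(a, b)]

print(len(exact_hits), "exact matches;", len(rule_hits), "rule-based matches")
```

Exact matching finds nothing here because of case and suffix differences, while the two simple rules recover both pairs.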

In practical applications you will need record linkage techniques to combine information about the same entity that is stored in different data sources. Record linkage will also help you to address the quality of different data sources. For example, if one of your databases has missing values, you might be able to fill those by finding a matching record in a different data source. Overall, the main applications of record linkage are:

1. Merging two or more data files
2. Identifying the intersection of two data sets
3. Updating data files (with the data rows of other data files) and imputing missing data
4. Entity disambiguation and de-duplication

## Data Description
- Back to [Table of Contents](#Table-of-Contents)

The datasets used in this exercise come from the Kansas City, MO, Department of Revenue (<span style="background-color: #FFFF00">`business_licenses`</span>), from the Water Services (<span style="background-color: #FFFF00">`water`</span>), and from the United States Patent and Trademark Office (<span style="background-color: #FFFF00">`patent_data`</span>).

**Variables Used for Linking:**

The datasets all have a variable for Business Name:
- In dataset <span style="background-color: #FFFF00">`business_licenses`</span>: `legal_name`
- In dataset <span style="background-color: #FFFF00">`water`</span>: <span style="background-color: #FFFF00">`business_name`</span>
- In dataset <span style="background-color: #FFFF00">`patent_data`</span>: `organization`

There is also address information in both the Business Licenses data and in the Water Services data:
- In dataset <span style="background-color: #FFFF00">`business_licenses`</span>: `address`
- In dataset <span style="background-color: #FFFF00">`water`</span>: <span style="background-color: #FFFF00">`address`</span>

<span style="background-color: #FFFF00"> 
**Filter Applied**: for the purposes of this exercise, we also removed certain records with the following criterion: 
All records which have a unique combination of ein & name_legal... This means any record for which one ein is linked to only one name_legal is not a part of this dataset.
All records which have a null value for address or setup_date
For 2005 data in particular, for some records, we have character values (A-Z) in the variable holding #employees. We have removed such records. 
</span>

## Import of Packages
- Back to [Table of Contents](#Table-of-Contents)

Python provides us with some tools we can use for record linkage, so we don't have to start from scratch and code our own linkage algorithms. Before we start, we need to load the `recordlinkage` package. To function fully, this package relies on other packages that also need to be imported, so we are adding more packages to the ones you are already familiar with.

In [33]:
# general use imports
%pylab inline
import datetime
import glob
import inspect
import numpy as np
import os
import six
import warnings
import matplotlib.pyplot as plt
import jellyfish
import re

# pandas-related imports
from __future__ import print_function
import pandas as pd
import scipy
import sklearn

# record linkage package
import recordlinkage as rl
from recordlinkage.preprocessing import clean

# CSV file reading-related imports
import csv

# database interaction imports
import psycopg2
import psycopg2.extras


print( "Imports loaded at " + str( datetime.datetime.now() ) )

Populating the interactive namespace from numpy and matplotlib
Imports loaded at 2018-01-31 17:23:53.061443


## Load the Data
- Back to [Table of Contents](#Table-of-Contents)

As you've done in previous notebooks, let's set up our database schema, connect to the database using psycopg2, and query the three datasets we would like to use.

In [None]:
# Database connection properties
# db_name = "appliedda"
# db_host = "10.10.2.10"
# conn = psycopg2.connect(database=db_name, host=db_host)  # database connection
# schema = "ada_class3"

In [None]:
# Read SQL 
# business_licenses_query = '''
# SELECT *
# FROM business_licenses
# WHERE
# '''

# water_services_query = '''
# SELECT *
# FROM water_services
# WHERE
# '''

# patent_data_query = '''
# SELECT *
# FROM patent_data
# WHERE
# '''

# Save table in dataframe
# business = pd.read_sql(business_licenses_query, conn)
# water = pd.read_sql(water_services_query, conn)
# patent = pd.read_sql(patent_data_query, conn)

In [8]:
#TEMP
business = pd.read_csv('../../data/KCMO/BusinessLicense2013_2018NYU_01222018.csv')
# water_services = pd.read_csv('../../data/Water/????.csv')
patent = pd.read_csv('../../data/USPTO/patent_data.csv')


In [9]:
# Renaming Columns
business.columns = ['business_activity', 'address', 'legal_name', 'dba_name', 'filing_period']
patent.columns = ['id', 'patent_id', 'type', 'number', 'country', 'date'
                       , 'assignee_id', 'name_first', 'name_last', 'organization'
                      ]

## Data Exploration
- Back to [Table of Contents](#Table-of-Contents)

Next, we want to get to know the data a bit so that we know what kind of pre-processing we have to apply. For example, you want to check formats, missing values, and the quality of your data in general.

#### Visualizing the top rows of the different datasets

In [11]:
business.head()

Unnamed: 0,business_activity,address,legal_name,dba_name,filing_period
0,Consumer Lending,6005 NW 104TH TER KANSAS CITY MO 64154-1792,STATHOPOULOS INC,,12/31/13
1,Consumer Lending,6005 NW 104TH TER KANSAS CITY MO 64154-1792,STATHOPOULOS INC,,12/31/14
2,Temporary Help Services,6700 ANTIOCH RD STE 460 MERRIAM KS 66204-1200,COMPALLIANCE LLC,,12/31/16
3,Temporary Help Services,6700 ANTIOCH RD STE 460 MERRIAM KS 66204-1200,COMPALLIANCE LLC,,12/31/17
4,Temporary Help Services,6700 ANTIOCH RD STE 460 MERRIAM KS 66204-1200,COMPALLIANCE LLC,,12/31/13


In [12]:
patent.head()

Unnamed: 0,id,patent_id,type,number,country,date,assignee_id,name_first,name_last,organization
0,02/002761,D345393,2,2002761,US,1992-12-21,b6329498939970968a9366a03e957115,,,"Far Great Plastics Industrial Co., Ltd."
1,02/007691,5164715,2,2007691,US,1990-04-10,27c275396a1220e677472f96fe035340,,,"Stanley Electric Co., Ltd."
2,02/010248,5177974,2,2010248,US,1988-06-23,4637697c74b334d29644fe57cb4f7493,,,Pub-Gas International Pty. Ltd.
3,02/020141,5379515,2,2020141,US,1994-02-16,6c00cb129070696ef109f6264da00318,,,Canon Kabushiki Kaisha
4,02/020141,5379515,2,2020141,US,1994-02-16,4df0c42b474fe1b51374fb4f54c1d7a0,,,"Sumitomo Metal Industries, Ltd."


In [13]:
#water.head()

#### Shape and Properties

In [16]:
# Shape of the data frame
print(business.shape)
print(patent.shape)
#print(water.shape)

(104362, 5)
(6599785, 10)


In [19]:
business.describe()

Unnamed: 0,business_activity,address,legal_name,dba_name,filing_period
count,104362,104362,104362,36163,104362
unique,626,30654,33593,10351,6
top,Commercial and Institutional Building Construc...,2405 GRAND BLVD STE 1020 KANSAS CITY MO 64108-...,REDBOX AUTOMATED RETAIL LLC,AMERICAN TOWERS ASSET SUB II LLC,12/31/17
freq,20164,178,332,93,24961


In [17]:
patent.describe()

Unnamed: 0,id,patent_id,type,number,country,date,assignee_id,name_first,name_last,organization
count,6599785,6599785,6599785,6599785,6599785,6599785,5574600,62127,62128,5466526
unique,6204979,6422962,28,6204890,1,16720,374807,19936,22516,332498
top,8,4185912,9,8,US,1995-06-07,29a03fd21a4c9b1420a55ecba2105eae,Michael,Lee,International Business Machines Corporation
freq,131072,35,733624,151684,6599785,10537,110452,433,559,110452


In [None]:
# water.describe()

#### Exploring the Business Name variable
Let's look at the most frequently recurring business names in the different datasets.

In [25]:
business['legal_name'].value_counts().head(10)

REDBOX AUTOMATED RETAIL LLC        332
SP PLUS CORPORATION                200
MISSOURI CVS PHARMACY LLC          134
COMMERCE BANK                      117
BANK OF AMERICA N A                110
QUIKTRIP CORPORATION               109
AMERICAN TOWER ASSET SUB II LLC     93
FAMILY DOLLAR STORES OF MO INC      92
HRB TAX GROUP                       92
UMB BANK NA                         91
Name: legal_name, dtype: int64

In [26]:
# water['business_name'].value_counts().head(10)

In [23]:
patent['organization'].value_counts().head(10)

International Business Machines Corporation    110452
Samsung Electronics Co., Ltd.                   71621
Canon Kabushiki Kaisha                          64973
Sony Corporation                                46780
Kabushikikaisha Toshiba                         45828
GENERAL ELECTRIC COMPANY                        41385
Hitachi, Ltd.                                   38041
Fujitsu Limited                                 33349
Intel Corporation                               33287
Microsoft Corporation                           29533
Name: organization, dtype: int64

***Right away, we notice that record linkage between the different datasets will not be straightforward. The variable is messy and non-standardized: similar names can be written differently (in upper-case or lower-case characters, with or without suffixes, etc.). The essential next step is to process the variables to make the linkage as effective and relevant as possible.***


## The Importance of Pre-Processing
- Back to the [Table of Contents](#Table-of-Contents)

Data pre-processing is an important step in data analysis projects in general, and in record linkage applications in particular. The goal of pre-processing is to transform messy data into a dataset that can be used in a project workflow.

Linking records from different data sources comes with challenges that need to be addressed by the analyst. The analyst must determine whether or not two entities (individuals, businesses, geographical units) in two different files are the same. This determination is not always easy. In most cases there is no common, uniquely identifying characteristic for an entity. For example, is Bob Miller from New York the same person as Bob Miller from Chicago in a given dataset? This determination has to be made carefully because the consequences of wrong linkages may be substantial (is person X the same person as the person X on the list of identified terrorists?). Pre-processing can help to make better-informed decisions.

Pre-processing can be difficult because there are a lot of things to keep in mind. For example, data input errors, such as typos, misspellings, truncation, abbreviations, and missing values need to be corrected. Literature shows that preprocessing can improve matches. In some situations, 90% of the improvement in matching efficiency may be due to preprocessing. The most common reason why matching projects fail is lack of time and resources for data cleaning. 

In the following cells we will walk you through some pre-processing steps. These include but are not limited to removing spaces, parsing fields, and standardizing strings.

### Cleaning String Variables

In order to clean the Business Name variables, we will use various string transformations. The record linkage package comes with a built-in cleaning function we can also use. Finally, regex commands can be used for further cleaning (`replace`) and to extract information from strings (`match`).

In [99]:
# Create new business name variables on which we will do the cleaning
business['legal_name_clean'] = business['legal_name']
#water['business_name_clean'] = water['business_name']
patent['organization_clean'] = patent['organization']

In [100]:
# Upcasing names
business['legal_name_clean'] = business['legal_name_clean'].str.upper()
#water['business_name_clean'] = water['business_name_clean'].str.upper()
patent['organization_clean'] = patent['organization_clean'].str.upper()

In [101]:
# Cleaning names (using the record linkage package tool, see imports)
# Clean removes any characters such as '-', '.', '/', '\', ':', brackets of all types. 
# business['legal_name_clean']=clean(business['legal_name_clean']
#                                    , lowercase=False, strip_accents='ascii', remove_brackets=False)

In [102]:
# Replace dash character with spaces:
patent['organization_clean'] = patent['organization_clean'].str.replace('-', ' ')
#water['business_name_clean'] = water['business_name_clean'].str.replace('-', ' ')
business['legal_name_clean'] = business['legal_name_clean'].str.replace('-', ' ')

### Regular Expressions - regex
Regular expressions (regex) are a way of searching for a character pattern. They can be used for matching or replacing operations in strings.

When defining a regular expression search pattern, it is a good idea to start out by writing down, explicitly, in plain English, what you are trying to search for and exactly how you identify when you've found a match.
For example, if we look at an author field formatted as "<last_name> , <first_name> <middle_name>", in plain English, this is how I would explain where to find the last name: "starting from the beginning of the line, take all the characters until you see a comma."


In a regular expression, there are special reserved characters and character classes. For example:
- "`^`" matches the beginning of the line or cell
- "`.`" matches any character
- "`+`" means one or more repetitions of the preceding expressions

Anything that is not a special character or class is matched literally. A comma, for example, is not a special character in regular expressions, so inserting "`,`" in a regular expression will simply match that character in the string.

In our example, in order to extract the last name, the resulting regular expression would be:
"`^.+,`". We start at the beginning of the line ( "`^`" ), matching any characters ( "`.+`" ) until we come to the literal character of a comma ( "`,`" ).


_Note: if you want to actually look for one of these reserved characters, it must be escaped, so that, for example, the expression looks for a literal period, rather than the special regular expression meaning of a period. To escape a reserved character in a regular expression, precede it with a back slash ( "`\`" ). For example, "`\.`" will match a "`.`" character in a string._
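
As a small sketch of the two points above, the snippet below applies `^.+,` to a made-up author string (the value is hypothetical) and contrasts an escaped period with an unescaped one. Note that the match includes the comma itself, so it is stripped afterwards.

```python
import re

# Hypothetical value in the "<last_name> , <first_name> <middle_name>" layout
author = "Miller , Bob James"

# "^.+," starts at the beginning and takes everything up to and including the comma
match = re.search(r"^.+,", author)
matched_text = match.group(0)

# Drop the trailing space and comma to keep only the last name
last_name = matched_text.rstrip(" ,")
print(repr(matched_text), "->", repr(last_name))

# An escaped period matches only literal periods...
literal_periods = re.findall(r"\.", "A.B")
# ...while an unescaped period matches every character
any_characters = re.findall(r".", "A.B")
print(literal_periods, any_characters)
```

Because `.+` is greedy, the pattern runs to the last comma on the line; that is fine for this layout, which contains only one comma.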


#### REGEX CHEATSHEET


    - abc...     Letters
    - 123...     Digits
    - \d         Any Digit
    - \D         Any non-Digit Character
    - .          Any Character
    - \.         Period
    - [abc]      Only a, b or c
    - [^abc]     Not a, b, or c
    - [a-z]      Characters a to z
    - [0-9]      Numbers 0 to 9
    - \w         Any Alphanumeric character
    - \W         Any non-Alphanumeric character
    - {m}        m Repetitions
    - {m,n}      m to n repetitions
    - *          Zero or more repetitions
    - +          One or more repetitions
    - ?          Optional Character
    - \s         any Whitespace
    - \S         any non-Whitespace character
    - ^...$      Starts & Ends
    - (...)      Capture Group
    - (a(bc))    Capture sub-Group
    - (.*)       Capture All
    - (abc|def)  Capture abc or def

#### Examples:
    - `(\d\d\D)`       will match 22X, 23G, 56H, etc...
    - `(\w)`           will match any single alphanumeric character (0-9, a-z, A-Z) or an underscore
    - `(\w{1,3})`      will match a run of 1 to 3 alphanumeric characters
    - `(spell|spells)` will match spell or spells
    - `(corpo?)`       will match corp or corpo
    - `(feb 2.)`       will match feb 20, feb 21, feb 2a, etc.
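
A quick way to build intuition for the cheatsheet is to test a few patterns against short strings. The sketch below uses `re.fullmatch()`, which succeeds only if the whole string fits the pattern; the test strings are made up for illustration.

```python
import re

# (pattern, text, expected result of a full match)
checks = [
    (r"\d\d\D", "22X", True),      # two digits, then one non-digit
    (r"\w{1,3}", "ab1", True),     # one to three alphanumeric characters
    (r"\w{1,3}", "abcd", False),   # four characters is one too many
    (r"corpo?", "corp", True),     # the trailing "o" is optional
    (r"corpo?", "corpo", True),
    (r"feb 2.", "feb 2a", True),   # "." stands for any single character
    (r"[^abc]", "d", True),        # any character except a, b, or c
    (r"[^abc]", "a", False),
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]

for (pattern, text, _), got in zip(checks, results):
    print(f"{pattern!r:12} {text!r:10} {got}")
```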


#### Using REGEX to match characters

In python, to use a regular expression like this to search for matches in a given string, we use the built-in "`re`" package ( https://docs.python.org/2/library/re.html ), specifically the "`re.search()`" method. To use "`re.search()`", pass it first the regular expression you want to use to search, enclosed in quotation marks, and then the string you want to search within. 
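
A minimal sketch of `re.search()`: it returns a match object when the pattern is found and `None` otherwise, so the result should be checked before use. The pattern `\bLLC$` (the word LLC at the end of the string) is chosen here only as an example.

```python
import re

name = "MISSOURI CVS PHARMACY LLC"

# Pattern first, then the string to search within
match = re.search(r"\bLLC$", name)
if match:
    # group(0) is the matched text, start() is its position in the string
    print(match.group(0), "found at position", match.start())

# When nothing matches, re.search() returns None
no_match = re.search(r"\bLLC$", "COMMERCE BANK")
print(no_match)
```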



#### Using REGEX for replacing characters

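The `re` package also has an `re.sub()` function that replaces every match of a regular expression with another string. The same operation can be applied to an entire pandas column (replacing expression1 with expression2) with the following syntax: `df['variable'].str.replace(r'expression1', 'expression2', regex=True)`. The `r` prefix marks a raw string, so backslashes such as `\s` reach the regex engine unchanged; it does not by itself turn on regular expressions. Recent pandas versions treat the pattern as a literal string unless `regex=True` is passed.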
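
Before working on whole columns, it can help to try `re.sub()` on a couple of plain strings. This is a sketch only: the first name is hypothetical, the second appears in the patent data, and the whitespace-collapsing step is an extra added here for tidiness.

```python
import re

names = ["ACME TOOLS (USA), INC.", "R&R Packaging, Inc."]


def basic_clean(name):
    name = name.upper()
    # Remove anything in parentheses; "\(" and "\)" are literal parentheses
    name = re.sub(r"\(.+\)", "", name)
    # Keep only capital letters, digits, whitespace, and "&"
    name = re.sub(r"[^A-Z0-9\s&]", "", name)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", name).strip()


cleaned = [basic_clean(name) for name in names]
print(cleaned)
```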

In [103]:
# Remove all content in parentheses:
# Parentheses are regex reserved characters, so we escape them with a backslash to match them literally.
# regex=True is required in recent pandas versions, where str.replace treats patterns as literal text by default.
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\(.+\)', '', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\(.+\)', '', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\(.+\)', '', regex=True)

In [104]:
# Keep only alphanumeric characters, spaces, and '&' characters.
# Inside [...] commas and parentheses are literal characters, so they are left out of the class
# to make sure they get removed as well.
patent['organization_clean'] = patent['organization_clean'].str.replace(r'[^A-Z0-9\s&]', '', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'[^A-Z0-9\s&]', '', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'[^A-Z0-9\s&]', '', regex=True)

##### Dealing with the '&' character

In [105]:
business[business['legal_name_clean'].str.contains('R&R PACKAGING')]

Unnamed: 0,id,patent_id,type,number,country,date,assignee_id,name_first,name_last,organization,organization_clean
718,04/221366,4418169,4,4221366,US,1962-09-04,43492b06a5c68c0ed2a01fbdbc746fbb,,,M & T Chemicals Inc.,M & T CHEMICALS INC


In [178]:
patent[patent['organization_clean'].str.contains('R & R PACKAGING')]

Unnamed: 0,id,patent_id,type,number,country,date,assignee_id,name_first,name_last,organization,organization_clean
4422188,2008/12315944,8416059,12,12315944,US,2008-12-08,2f5cb927ba4e8aa74e2b352a1ff00036,,,"R&R Packaging, Inc.",R & R PACKAGING INC
4745997,2010/12661886,8215552,12,12661886,US,2010-03-25,8357d285ee114a21177d7fe46e3fde0b,,,"R & R Packaging, Inc.",R & R PACKAGING INC


In some cases, the & character is preceded and followed by spaces. In others, this is not the case.

Let's standardize this by replacing all '&' characters by ' & '

In [107]:
# \s*&\s* matches '&' together with any surrounding whitespace
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\s*&\s*', ' & ', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\s*&\s*', ' & ', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\s*&\s*', ' & ', regex=True)
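
To see what this standardization does to individual strings, the sketch below applies the same idea with `re.sub()` to three spellings of one name. The third variant is made up; the pattern `\s*&\s*` is one reasonable choice that also absorbs repeated spaces.

```python
import re

variants = [
    "R&R PACKAGING INC",
    "R & R PACKAGING INC",
    "R &R PACKAGING INC",   # hypothetical spelling
]

# Replace "&" plus any surrounding whitespace with " & "
standardized = [re.sub(r"\s*&\s*", " & ", name) for name in variants]
print(standardized)
```

All three spellings collapse to the same string, which is what an exact comparison needs later on.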

___Dealing with business suffixes:___

In [151]:
business[business['legal_name_clean'].str.contains('HUSSEY SEATING')]

Unnamed: 0,business_activity,address,legal_name,dba_name,filing_period,legal_name_clean
59700,Commercial and Institutional Building Construc...,38 DYER ST EXT NORTH BERWICK ME 03906-6763,HUSSEY SEATING CO,,12/31/17,HUSSEY SEATING CO


In [153]:
patent[patent['organization_clean'].str.contains('HUSSEY SEATING')].head(2)

Unnamed: 0,id,patent_id,type,number,country,date,assignee_id,name_first,name_last,organization,organization_clean
1413547,07/753746,5277001,7,7753746,US,1991-09-03,ac87d80dce24c30004bd6cb822925438,,,Hussey Seating Company,HUSSEY SEATING COMPANY
1487812,07/873474,5288128,7,7873474,US,1992-04-24,ac87d80dce24c30004bd6cb822925438,,,Hussey Seating Company,HUSSEY SEATING COMPANY


"Hussey Seating" has an office in Kansas City, MO, and has several patents in its name. In the Business Licenses Records, the legal name has the suffix "CO", while in the patent database, it is written with the entire word "COMPANY". It is therefore essential to standardize company suffixes.

In [154]:
# \b marks a word boundary, so only whole words are replaced
# (e.g. INCORPORATION is not turned into INCORP). regex=True enables pattern matching.

# replace COMPANY by CO
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bCOMPANY\b', 'CO', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bCOMPANY\b', 'CO', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bCOMPANY\b', 'CO', regex=True)

# replace NATIONAL ASSOCIATION by NA
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bNATIONAL ASSOCIATION\b', 'NA', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bNATIONAL ASSOCIATION\b', 'NA', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bNATIONAL ASSOCIATION\b', 'NA', regex=True)

# replace CORPORATION by CORP
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bCORPORATION\b', 'CORP', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bCORPORATION\b', 'CORP', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bCORPORATION\b', 'CORP', regex=True)

# replace N A by NA
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bN\sA\b', 'NA', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bN\sA\b', 'NA', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bN\sA\b', 'NA', regex=True)

# replace L L C by LLC
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bL\sL\sC\b', 'LLC', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bL\sL\sC\b', 'LLC', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bL\sL\sC\b', 'LLC', regex=True)
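
The suffix replacements above can be tried on single strings first. The sketch below gathers them in one hypothetical helper, `standardize_suffixes`, and uses word boundaries (`\b`) so that a suffix is only replaced when it is a whole word. The third test name is made up to show a case that must stay untouched.

```python
import re

# Long form -> short form; \b restricts each pattern to whole words
SUFFIX_PATTERNS = [
    (r"\bNATIONAL ASSOCIATION\b", "NA"),
    (r"\bCORPORATION\b", "CORP"),
    (r"\bCOMPANY\b", "CO"),
    (r"\bN\sA\b", "NA"),
    (r"\bL\sL\sC\b", "LLC"),
]


def standardize_suffixes(name):
    """Apply every suffix rule in turn to an upper-cased name."""
    for pattern, short_form in SUFFIX_PATTERNS:
        name = re.sub(pattern, short_form, name)
    return name


examples = [
    "HUSSEY SEATING COMPANY",
    "BANK OF AMERICA N A",
    "ACCOMPANY TRAVEL INCORPORATION",  # hypothetical; contains COMPANY and CORPORATION as substrings
]
standardized = [standardize_suffixes(name) for name in examples]
print(standardized)
```

Without the word boundaries, the last name would be mangled into something like "ACCO TRAVEL INCORP".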

___Dealing with the word "THE":___

In [174]:
business[business['legal_name_clean'].str.contains('SHERWIN WILLIAMS')].head(2)

Unnamed: 0,business_activity,address,legal_name,dba_name,filing_period,legal_name_clean
2211,Paint and Wallpaper Stores,10117 STATE LINE RD KANSAS CITY MO 64114-4262,SHERWIN WILLIAMS CO THE,SHERWIN WILLIAMS STORE #7206,12/31/16,SHERWIN WILLIAMS CO THE
2212,Paint and Wallpaper Stores,10117 STATE LINE RD KANSAS CITY MO 64114-4262,SHERWIN WILLIAMS CO THE,SHERWIN WILLIAMS STORE #7206,12/31/17,SHERWIN WILLIAMS CO THE


In [175]:
patent[patent['organization_clean'].str.contains('SHERWIN WILLIAMS')].head(2)

Unnamed: 0,id,patent_id,type,number,country,date,assignee_id,name_first,name_last,organization,organization_clean
3908,05/232319,3980488,5,5232319,US,1972-03-07,930c051692288ba87d18f8bb85a04009,,,The Sherwin-Williams Company,THE SHERWIN WILLIAMS CO
7763,05/362746,3940385,5,5362746,US,1973-05-22,930c051692288ba87d18f8bb85a04009,,,The Sherwin-Williams Company,THE SHERWIN WILLIAMS CO


In the KCMO Business Licenses data, the word "THE" at the beginning of a company name was moved to the end.

Instead of moving it back to the front of the name, let's just remove "THE" when it is the first or last word of the Business Name.

In [176]:
# First Word (also drop the whitespace that follows it):
patent['organization_clean'] = patent['organization_clean'].str.replace(r'^THE\b\s*', '', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'^THE\b\s*', '', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'^THE\b\s*', '', regex=True)

# Last Word (also drop the whitespace that precedes it):
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\s*\bTHE$', '', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\s*\bTHE$', '', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\s*\bTHE$', '', regex=True)
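
As a check on single strings, the sketch below uses a variant of these patterns that also consumes the adjacent whitespace, so no stray leading or trailing space is left behind. The helper name `drop_the` and the third test name are assumptions for illustration.

```python
import re


def drop_the(name):
    """Remove THE when it is the first or the last word of the name."""
    name = re.sub(r"^THE\b\s*", "", name)   # leading THE plus following spaces
    name = re.sub(r"\s*\bTHE$", "", name)   # trailing THE plus preceding spaces
    return name


examples = [
    "THE SHERWIN WILLIAMS CO",
    "SHERWIN WILLIAMS CO THE",
    "THEATER LEAGUE INC",   # hypothetical; starts with the letters THE but is a different word
]
without_the = [drop_the(name) for name in examples]
print(without_the)
```

Both spellings of the Sherwin Williams name end up identical, and the word boundary keeps "THEATER" intact.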

___Additional Replacements:___

In [None]:
# replace U S A by USA (must run before the U S rule below)
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bU\sS\sA\b', 'USA', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bU\sS\sA\b', 'USA', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bU\sS\sA\b', 'USA', regex=True)

# replace U S by US
patent['organization_clean'] = patent['organization_clean'].str.replace(r'\bU\sS\b', 'US', regex=True)
# water['business_name_clean'] = water['business_name_clean'].str.replace(r'\bU\sS\b', 'US', regex=True)
business['legal_name_clean'] = business['legal_name_clean'].str.replace(r'\bU\sS\b', 'US', regex=True)

___Additional Changes:___

Do you see any other possible standardizations? Insert them below!

In [179]:
# conditions= [
#     (d_2015['legal_name']=='INC'), 
#     (d_2015['legal_name']=='LTD'), 
#     (d_2015['legal_name']=='LIMITED'), 
#     (d_2015['legal_name']=='INCORPORATED'), 
#     (d_2015['legal_name']=='INCORPORATION'), 
#     (d_2015['legal_name']=='ASSOCIATION'), 
#     (d_2015['legal_name']=='CORP'), 
#     (d_2015['legal_name']=='CORPORATION'),
#     (d_2015['legal_name']=='CO'),
#     (d_2015['legal_name']=='LLC'), 
#     (d_2015['legal_name']=='ASSOC'), 
#     (d_2015['legal_name']=='ASSOCIATES'), 
#     (d_2015['legal_name']=='PTNRSHP') ,
#     (d_2015['legal_name']=='COMPANY')
# ]

# choices=['INC', 'LTD', 'LTD', 'INC', 'INC', 'ASSOCIATION', 'CORP', 'CORP', \
#          'CO', 'LLC', 'ASSOCIATION', 'ASSOCIATES', 'PARTNERSHIP', 'COMPANY']

# d_2015['legal_name_clean']=np.select(conditions, choices, default=None)

# # This is how our prepared data looks
# d_2015[['est_name', 'legal_name_clean', 'address', 'setup_date']].head()

Now that we have the business name cleaned and standardized across datasets, we can use the different name pairs for string matching. Before doing that, however, we would also like to look at other variables that might improve the linkage by increasing the probability of a correct match (and reducing errors). The variable that seems immediately important is the address of a given establishment. We can pre-process this attribute using regex functions.

## Address Parsing:

In [None]:
# Parsing out the address


# Extracting ZIPCODE
# A US ZIP code typically follows a set pattern: 5 digits, or 5 digits followed by a hyphen and then 4 digits.
# We can use this information to extract ZIP codes.
# Note: str.extract returns the FIRST match in each string, so a 5-digit street number
# at the start of an address would be picked up by the patterns below.
# Raw strings (r'...') keep the backslashes intact; expand=False returns a Series.

# Pattern 1
d_2015['zipcode'] = d_2015['address'].str.extract(r'(\d{5})', expand=False)

# Breaking the code:
# \d ---- tells that we need a digit
# \d{5} ---- tells that we need 5 digits consecutively
# () enclosing parentheses tell that we need to extract this information into the new variable
d_2015[['zipcode', 'address', 'setup_date']].head()

# Pattern 2
# What if we want the full ZIP code: '5 digits <hyphen> 4 digits'
d_2015['zipcode_full'] = d_2015['address'].str.extract(r'(\d{5}-\d{4})', expand=False)

# Breaking the code:
# \d ---- tells that we need a digit
# \d{5} ---- tells that we need 5 digits consecutively
# '-' this is just passing the string exactly as we need it
# \d{4} tells us 4 more digits
# () enclosing parentheses tell that we need to extract this information into the new variable

# Pattern 3
# We can also pass both expressions in our query as an OR.
# The longer alternative comes first so that it is tried before the 5-digit form.
d_2015['zipcode_either'] = d_2015['address'].str.extract(r'(\d{5}-\d{4}|\d{5})', expand=False)

# Breaking the code:
# \d{5}-\d{4} ---- the full '5 digits <hyphen> 4 digits' form
# | ---- OR
# \d{5} ---- the 5-digit form
# () enclosing parentheses tell that we need to extract this information into the new variable



d_2005['zipcode'] = d_2005['address'].str.extract(r'(\d{5})', expand=False)
d_2005['day'] = d_2005['setup_date'].str.extract(r'(\A\d{1,2})', expand=False)
d_2005['month'] = d_2005['setup_date'].str.extract(r'\d{1,2}.(\d{1,2})', expand=False)
d_2005['year'] = d_2005['setup_date'].str.extract(r'\w+.(\d{4})', expand=False)
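
One caveat is worth seeing on a real address from the Business Licenses data: a pattern like `\d{5}` returns the first run of five digits, which can be a street number rather than the ZIP code. The sketch below anchors the pattern to the end of the string instead. The helper `parse_zip` is hypothetical, and it assumes the ZIP code is the last token of the address, as in the sample rows shown earlier.

```python
import re

# 5 digits, optionally followed by "-" and 4 digits, at the very end of the string
ZIP_AT_END = re.compile(r"(\d{5})(?:-(\d{4}))?$")


def parse_zip(address):
    """Return (zip5, plus4); either part is None when it is absent."""
    match = ZIP_AT_END.search(address.strip())
    if match is None:
        return None, None
    return match.group(1), match.group(2)


address = "10117 STATE LINE RD KANSAS CITY MO 64114-4262"

# The unanchored pattern grabs the street number
first_five = re.search(r"\d{5}", address).group(0)

# The anchored pattern finds the actual ZIP code and its 4-digit extension
zip5, plus4 = parse_zip(address)
print(first_five, zip5, plus4)
```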

## Record Linkage
The record linkage package is a powerful tool to use when you want to link records within a dataset or across multiple datasets. It comes with different built-in distance metrics and comparison functions, and it also allows you to create your own. In general, record linkage is divided into several steps.

In [None]:
# For the match later so we have different names for years
d_2015.rename(columns={'est_name':'est_name_2015'}, inplace=True)
d_2005.rename(columns={'est_name':'est_name_2005'}, inplace=True)

In [None]:
# Only keep variables relevant for linkage
d_2015 = d_2015[['est_name_2015', 'zipcode', 'year', 'legal_name_clean']]
d_2005 = d_2005[['est_name_2005', 'zipcode', 'year', 'legal_name_clean']]

In [None]:
d_2015.head()

In [None]:
d_2005.head()

We've already done the pre-processing, so the next step is indexing the data we would like to link. Indexing allows you to create candidate links, which basically means identifying pairs of data rows which might refer to the same real-world entity. This is also called the comparison space (matrix). There are different ways to index data. The easiest is to create a full index and consider every pair a candidate match. This is also the least efficient method, because we will be comparing every row of one dataset with every row of the other dataset.

In [None]:
# Let's generate a full index first (comparison table of all possible linkage combinations)
indexer = rl.FullIndex()
pairs = indexer.index(d_2015, d_2005)
# Returns a pandas MultiIndex object
print(len(pairs))
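
To make the size of a full index concrete, here is a minimal sketch in plain Python. It does not use the *recordlinkage* package; the row labels are hypothetical stand-ins for the index of each dataset.

```python
from itertools import product

# Hypothetical row labels standing in for the index of each dataset
ids_2015 = ["a0", "a1", "a2"]
ids_2005 = ["b0", "b1"]

# A full index is the Cartesian product of the two sets of row labels
full_pairs = list(product(ids_2015, ids_2005))

print(full_pairs)
print(len(full_pairs))  # 3 * 2 = 6 candidate pairs
```

With n rows in one file and m rows in the other, a full index always contains n × m pairs, so two files of 10,000 rows each already produce 100 million comparisons.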

We can do better if we use our knowledge about the data to eliminate bad links from the start. This can be done through blocking. The *recordlinkage* package gives you multiple options for this. For example, you can block on variables, which means that only pairs that are exactly equal on the specified variables are kept. You can also use a neighbourhood index, in which the rows of your dataframes are ranked by some value and only rows that are close to each other in that ranking are paired.

In [None]:
indexerBL = rl.BlockIndex(on='zipcode')
pairs2 = indexerBL.index(d_2015, d_2005)
# Returns a pandas MultiIndex object
print(len(pairs2))
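
The idea behind blocking on a variable can be sketched in plain Python as follows. This is an illustration of the principle, not how the package implements it; the records and the helper `block_pairs` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (row label, zipcode)
rows_2015 = [("a0", "64106"), ("a1", "64108"), ("a2", "64106")]
rows_2005 = [("b0", "64106"), ("b1", "64111"), ("b2", "64108")]

def block_pairs(left, right):
    """Return pairs of row labels whose blocking key is exactly equal."""
    # Group the right-hand rows by their blocking key
    by_key = defaultdict(list)
    for label, key in right:
        by_key[key].append(label)
    # Pair each left-hand row only with right-hand rows in the same block
    return [(l_label, r_label)
            for l_label, key in left
            for r_label in by_key.get(key, [])]

blocked = block_pairs(rows_2015, rows_2005)
print(blocked)
print(len(blocked), "of", len(rows_2015) * len(rows_2005), "possible pairs")
```

The price of this reduction is that two records with different values of the blocking variable can never be paired, so a mistyped or missing zipcode rules out a true match before any comparison takes place.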

In [None]:
# Initiate the compare object (we are using the blocked pairs here)
# It takes the MultiIndex of candidate pairs and the two DataFrames to compare
compare = rl.Compare(pairs2, d_2015, d_2005)

Now we have set up our comparison space. We can start to compare our files and see if we find matches. We will demonstrate an exact match and rule-based approaches using distance measures.

In [None]:
# Exact comparison
# This compares each candidate pair of strings and flags the pairs that are identical
# It is similar to a JOIN on the name column
exact = compare.exact('est_name_2015', 'est_name_2005', name='exact')

In [None]:
# This command gives us a similarity score between two strings based on the Levenshtein distance
# (it is a score, not a probability of a match)
# The measure is 0 if there are no similarities in the strings, 1 if they are identical
levenshtein = compare.string('est_name_2015', 'est_name_2005', method='levenshtein', name='levenshtein')
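
As a rough illustration of what such a score measures, the sketch below computes the Levenshtein distance with dynamic programming and turns it into a similarity. It assumes the similarity is defined as 1 minus the distance divided by the length of the longer string; the package may normalise differently, so treat this as an illustration and check its documentation. The function names are hypothetical.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(s1, s2):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(s1, s2) / longest

print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_similarity("JOES PIZZA", "JOE'S PIZZA"))
```

Because the distance is divided by the string length, a single typo lowers the score of a short name much more than that of a long one.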

In [None]:
# This command gives us a similarity score between two strings based on the Jaro-Winkler distance
# (it is a score, not a probability of a match)
# The measure is 0 if there are no similarities in the strings, 1 if they are identical
jarowinkler_name = compare.string('est_name_2015', 'est_name_2005', method='jarowinkler', name='jarowinkler')
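
For comparison, here is a minimal plain-Python sketch of the Jaro-Winkler similarity. It follows the textbook definition with a prefix weight of 0.1 and a maximum prefix length of 4. Some implementations apply the prefix bonus only when the Jaro score exceeds a threshold; this sketch always applies it, so results may differ slightly from the package. The function names are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings (0 = nothing in common, 1 = identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    chars1 = [c for c, m in zip(s1, matched1) if m]
    chars2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(chars1, chars2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro score boosted for strings that share a prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix bonus rewards strings that agree at the beginning, which tends to suit names where typos are more common towards the end.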

In [None]:
# Finally, we can summarize the different metrics side by side to compare how they behave
print(compare.vectors.describe())

## Results
Once we have our comparison measures, we need to classify the pairs into matches and non-matches. A rule-based approach would be to say that if the similarity of any of our indicators is greater than 0.80, we consider the pair a match, and everything else a non-match. The threshold is a decision the analyst needs to make.

In [None]:
# Classify as a match every pair whose highest similarity score exceeds 0.80
matches = compare.vectors[compare.vectors.max(axis=1) > 0.80]
matches = matches.sort_values("jarowinkler")
matches.head()
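
The same rule can be written out in plain Python to show exactly what is being decided. The scores below are invented for illustration, and the helper `classify` is hypothetical.

```python
# Hypothetical comparison vectors: similarity scores for each candidate pair
vectors = {
    ("a0", "b0"): {"exact": 0, "levenshtein": 0.91, "jarowinkler": 0.96},
    ("a1", "b2"): {"exact": 0, "levenshtein": 0.40, "jarowinkler": 0.62},
    ("a2", "b0"): {"exact": 1, "levenshtein": 1.00, "jarowinkler": 1.00},
}

THRESHOLD = 0.80

def classify(vectors, threshold):
    """Keep pairs whose best score across all measures exceeds the threshold."""
    return [pair for pair, scores in vectors.items()
            if max(scores.values()) > threshold]

matched_pairs = classify(vectors, THRESHOLD)
print(matched_pairs)
```

Two things are worth noticing. Taking the maximum means that a single high score is enough for a match; a stricter rule could require the minimum or the mean to exceed the threshold. The rule also does not enforce one-to-one links: in this example both `a0` and `a2` are linked to `b0`.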

Now that we have the list of matches, we can fuse our datasets, because in the end we want a single combined dataset. We use a function for this task.

In [None]:
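def fuse(dfA, dfB, dfmatches):
    """Copy the columns of dfB onto the matching rows of dfA."""
    newDF = dfA.copy()
    columns = dfB.columns.values

    # Start with empty strings; note that columns shared by both
    # DataFrames (e.g. zipcode) are reset and then filled from dfB
    for col in columns:
        newDF[col] = ''

    # Each entry of the MultiIndex is an (index in dfA, index in dfB) pair
    for indexA, indexB in dfmatches.index:
        for col in columns:
            # A single .loc call avoids chained assignment, which may
            # write to a temporary copy instead of newDF
            newDF.loc[indexA, col] = dfB.loc[indexB, col]
    return newDF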

In [None]:
result = fuse(d_2015, d_2005, matches)
result.head(10)

Another way of classifying records is the Fellegi-Sunter method. If Fellegi-Sunter is used to classify record pairs, you follow all the steps we have done so far. The difference is that we now estimate probabilities to construct weights. These weights are applied during classification to give certain characteristics more importance. For example, agreement on a very unusual name is stronger evidence of a match than agreement on a common name such as Bob Miller.

#### Fellegi Sunter

In [None]:
name = compare.string('est_name_2015','est_name_2005', method='jarowinkler', name='name')
legal = compare.string('legal_name_clean','legal_name_clean', method='jarowinkler', name='legal')
matches = compare.vectors[compare.vectors.max(axis=1) > 0.80]

In [None]:
# Running the Classifier
fs = rl.ECMClassifier()
model = fs.learn(matches)
pred = fs.predict(matches)
prob = fs.prob(matches)

In [None]:
prob
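
To illustrate how Fellegi-Sunter weights work, the sketch below computes the weight of a record pair from m- and u-probabilities. In practice these probabilities are estimated from the data (for example with the ECM algorithm used above); the values here are invented for illustration, and the function names are hypothetical.

```python
from math import log2

# Hypothetical m- and u-probabilities, invented for illustration:
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
params = {
    "name":    {"m": 0.95, "u": 0.01},
    "zipcode": {"m": 0.90, "u": 0.10},
}

def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field."""
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

def total_weight(agreement):
    """Sum the field weights for one pair's agreement pattern."""
    return sum(field_weight(params[f]["m"], params[f]["u"], a)
               for f, a in agreement.items())

both_agree = total_weight({"name": True, "zipcode": True})
only_zip = total_weight({"name": False, "zipcode": True})
print(round(both_agree, 2), round(only_zip, 2))
```

A field on which non-matching records rarely agree by chance (a small u) contributes a large positive weight when it does agree. This is why agreement on an unusual name counts for more than agreement on a common one. Pairs are then classified by comparing the total weight with thresholds chosen by the analyst.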

## References and Further Readings

- Back to the [Table of Contents](#Table-of-Contents)

### Parsing

* Python online documentation: https://docs.python.org/2/library/string.html#deprecated-string-functions
* Python 2.7 Tutorial (Splitting and Joining Strings): http://www.pitt.edu/~naraehan/python2/split_join.html

### Regular Expression

* Python documentation: https://docs.python.org/2/library/re.html#regular-expression-syntax
* Online regular expression tester (good for learning): http://regex101.com/

### String Comparators

* GitHub page of jellyfish: https://github.com/jamesturk/jellyfish
* Different distances that measure the differences between strings:
    - Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
    - Damerau–Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    - Jaro–Winkler distance: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
    - Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance
    - Match rating approach: https://en.wikipedia.org/wiki/Match_rating_approach

### Fellegi-Sunter Record Linkage 

* Introduction to Probabilistic Record Linkage: http://www.bristol.ac.uk/media-library/sites/cmm/migrated/documents/problinkage.pdf
* Paper Review: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf

