# Blocking and Comparing
---------

### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-08-09  
**Last Updated:** 2023-08-09  
**Version:** 1  

### Description

In this notebook, we'll block our data to generate candidate record pairs.
Blocking limits the number of pairs by discarding many extremely disparate pairs while, ideally, discarding far fewer matching pairs.
See [this paper](https://usc-isi-i2.github.io/papers/michelson06-aaai.pdf) for more detail on blocking schemas.

Then we'll compare fields using established functions along with custom functions.
Here's where you get to be creative in crafting your feature columns.

Both of these steps are essential to record linkage.
Note that for your model to work properly, you must apply these **exact** same steps to the data you later run the model on.


### Notes

*\*If you are unfamiliar with the origins of this synthetic data, please see the [Synthetic-Gold](https://github.com/DOH-PJG1303/Synthetic-Gold) GitHub project. We ran the simulation for the state of Nebraska, so all data is relevant to that state.
To manage the size of the data we store publicly on GitHub, we only captured the relevant data in each table for the population living in the years 2019-2022.*

## 1. Import libraries

In [None]:
import pandas as pd
import numpy as np

import jellyfish

# Record linkage specific resources
import recordlinkage as rl
from recordlinkage.preprocessing import clean, phonetic
from recordlinkage.index import Block
from recordlinkage.base import BaseCompareFeature

## 2. Load Data

In [None]:
# Read in preprocessed data from last file
df = pd.read_parquet('../../Data/Training/03. Preprocessed Data Hep C.parquet')

# Take a couple of fields from cleaner data to help us with blocking schema
df_supplemental = pd.read_parquet('../../Data/Training/01. Wrangled Clean Data Hep C.parquet')[['unique_id','parents_partnership_id','building_id']]

### 2.1 Join extra fields

In [None]:
# Add on those extra 3 columns
df = pd.merge(df, df_supplemental, on='unique_id',how='left')

## 3. Blocking

Our blocking schema consists of 4 different blocks:

* People who have the same DOB
    * Captures twins
* People who have the same sounding (metaphone) firstname AND lastname
    * Captures Jrs (same fname,lname as parent.  Usually father/son)
* People who live in the same building
    * Captures similar addresses for apartments. Script 01 also sets a shared (congregate) landline phone for buildings with 75+ people.
* People who have the same parents
    * Captures siblings

In [None]:
# Create the index (pairs of records to compare)
indexer = rl.Index()

# Generate a blocking scheme as a union of the following blocks
indexer.add(Block('dob'))
indexer.add(Block(['meta_fname', 'meta_lname']))
indexer.add(Block('building_id'))
indexer.add(Block('parents_partnership_id'))

pairs = indexer.index(df, df)
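
To make the union-of-blocks idea concrete, here is a minimal sketch on hypothetical toy records that uses plain pandas rather than `recordlinkage`. The helper `block_pairs` and the toy values are assumptions for illustration only. The sketch keeps ordered pairs and self-pairs, which is what pairing a frame against itself with a plain join produces.

```python
import pandas as pd

# Hypothetical toy records (not taken from the notebook's data)
toy = pd.DataFrame(
    {
        "dob": ["1990-01-01", "1990-01-01", "1985-05-05", "1985-06-06"],
        "building_id": [1, 2, 3, 3],
    },
    index=["a", "b", "c", "d"],
)

def block_pairs(frame, key):
    """Return the set of ordered (left, right) index pairs that share `key`."""
    left = frame[[key]].dropna().reset_index()
    merged = left.merge(left, on=key, suffixes=("_l", "_r"))
    return set(zip(merged["index_l"], merged["index_r"]))

# The candidate set is the union of the pairs produced by each block
pairs_dob = block_pairs(toy, "dob")
pairs_building = block_pairs(toy, "building_id")
candidate_pairs = pairs_dob | pairs_building

# Without blocking, every ordered pair would be compared
total_possible = len(toy) ** 2
```

In this toy case, `candidate_pairs` holds 8 of the 16 possible ordered pairs: records `a` and `b` are paired through the shared date of birth, `c` and `d` through the shared building, and no pair crosses those two groups.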

## 4. Comparing

### 4.1 Define customized functions

In [None]:
class compare_dob_score(BaseCompareFeature):
    def _compute_vectorized(self, dob1, dob2):
        
        # Initialize an empty pandas series to hold the comparison scores
        score = pd.Series(np.nan, index=dob1.index)

        # If either date is missing, return -1
        score[(dob1.isnull()) | (dob2.isnull())] = -1

        # Extract the year, month, and day from each date
        dob_y_1 = dob1.str[2:4]
        dob_y_2 = dob2.str[2:4]

        dob_m_1 = dob1.str[5:7]
        dob_m_2 = dob2.str[5:7]

        dob_d_1 = dob1.str[8:]
        dob_d_2 = dob2.str[8:]

        # Check whether the year, month, and day are the same for each date
        same_y = (dob_y_1 == dob_y_2)
        same_m = (dob_m_1 == dob_m_2)
        same_d = (dob_d_1 == dob_d_2)

        # Check whether the month and day are swapped between the two dates
        swap_m_d = (dob_m_1 == dob_d_2) & (dob_d_1 == dob_m_2)
        swap_check = (dob_m_1 != dob_m_2)

        # If the year is the same and the month and day are swapped, return 2.5/3.0
        score[(same_y & swap_m_d & swap_check)] = 2.5/3.0

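        # Otherwise, return the fraction of year, month, and day that match.
        # Cast to int before summing: adding boolean Series acts as a logical OR.
        n_same = same_y.astype(int) + same_m.astype(int) + same_d.astype(int)
        score[(~same_y | ~swap_m_d | ~swap_check)] = n_same / 3.0

        # Re-apply the missing-value flag, which the assignment above would otherwise overwrite
        score[(dob1.isnull()) | (dob2.isnull())] = -1

        return score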
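
The scoring rule described in the comments above can be checked on single values. The following is a minimal scalar sketch, assuming dates are `YYYY-MM-DD` strings as the slicing in the class suggests. The function name `dob_score` is hypothetical and is not part of the notebook.

```python
def dob_score(dob1, dob2):
    """Scalar sketch of the date-of-birth rule (assumes 'YYYY-MM-DD' strings)."""
    # Missing data gets a sentinel value
    if dob1 is None or dob2 is None:
        return -1.0

    # Two-digit year, month, and day
    y1, m1, d1 = dob1[2:4], dob1[5:7], dob1[8:]
    y2, m2, d2 = dob2[2:4], dob2[5:7], dob2[8:]

    # Same year with month and day transposed (and not trivially equal)
    if y1 == y2 and m1 == d2 and d1 == m2 and m1 != m2:
        return 2.5 / 3.0

    # Otherwise, the fraction of the three components that match
    return ((y1 == y2) + (m1 == m2) + (d1 == d2)) / 3.0
```

For example, `dob_score("1990-03-04", "1990-04-03")` lands on the swapped-field branch, while `dob_score("1990-03-04", "1991-03-04")` matches on month and day only.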

In [None]:
class compare_name_score(BaseCompareFeature):
    def _compute_vectorized(self, raw1, sdx1, meta1, raw2, sdx2, meta2):
        
        # Initialize an empty pandas series to hold the comparison scores
        score = pd.Series(np.nan, index=raw1.index)

        # If any of the data is missing, return -1.0
        score[(raw1.isnull()) | (sdx1.isnull()) | (meta1.isnull()) | (raw2.isnull()) | (sdx2.isnull()) | (meta2.isnull())] = -1

        # Check if raw1 is in raw2 or vice versa
        out = raw1.combine(raw2, lambda x, y: x in y or y in x if isinstance(x, str) and isinstance(y, str) else False).astype(int)

        # Check if sdx1 is equal to sdx2
        out += (sdx1 == sdx2).astype(int)

        # Check if meta1 is equal to meta2
        out += (meta1 == meta2).astype(int)

        # Store the scaled score: the sum of the three checks divided by 4.0 (so the maximum is 0.75)
        score[~((raw1.isnull()) | (sdx1.isnull()) | (meta1.isnull()) | (raw2.isnull()) | (sdx2.isnull()) | (meta2.isnull()))] = out / 4.0

        return score
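
As a small illustration of how the three name checks combine, here is a scalar sketch. The function name `name_score` is hypothetical, and the Soundex and Metaphone codes in the usage note are typed by hand for illustration rather than computed by a library.

```python
def name_score(raw1, sdx1, meta1, raw2, sdx2, meta2):
    """Scalar sketch of the name rule: three checks, summed and divided by 4.0."""
    values = (raw1, sdx1, meta1, raw2, sdx2, meta2)

    # Missing data gets a sentinel value
    if any(v is None for v in values):
        return -1.0

    # Check 1: one raw name contains the other (e.g. a nickname prefix)
    contained = raw1 in raw2 or raw2 in raw1

    # Checks 2 and 3: the phonetic codes agree
    return (contained + (sdx1 == sdx2) + (meta1 == meta2)) / 4.0
```

With the hand-typed codes `name_score("ROB", "R100", "RB", "ROBERT", "R163", "RBRT")`, only the containment check passes. Two identical names pass all three checks, which gives the maximum of 0.75 under this scaling.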

In [None]:
class compare_custom_dlev_distance(BaseCompareFeature):
    def _compute_vectorized(self, s1, s2):
        """
        This function converts the Damerau-Levenshtein distance between two strings s1 and s2 into a similarity.
        It divides the distance by the average length of the two strings and subtracts the result from 1.0.
        If either string is missing, it returns -1.0.
        """
        out = [-1.0 if pd.isnull(x) or pd.isnull(y) 
               else 1.0 - (jellyfish.damerau_levenshtein_distance(x, y) / ((len(x) + len(y)) / 2.0)) 
               for x, y in zip(s1, s2)]
        return pd.Series(out)
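
To show how the length normalization behaves, here is a dependency-free sketch. It swaps in a hand-written optimal string alignment distance (a restricted variant of Damerau-Levenshtein) for the `jellyfish` call, so results may differ from `jellyfish` on some inputs. Both function names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, e.g. "ti" <-> "it"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def normalized_similarity(x, y):
    """1.0 minus the distance divided by the average length; -1.0 if missing."""
    if x is None or y is None:
        return -1.0
    return 1.0 - osa_distance(x, y) / ((len(x) + len(y)) / 2.0)
```

A single transposition such as `"smith"` versus `"smtih"` costs one edit over an average length of 5. Because the distance is divided by the average length rather than the maximum, strings of very different lengths can score below zero: `"a"` versus `"abcdefg"` needs 6 edits over an average length of 4.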

In [None]:
class compare_swapped_fields(BaseCompareFeature):
    def _compute_vectorized(self, s1_1, s1_2, s2_1, s2_2):
        """
        This function determines whether or not someone accidentally swapped two fields
        """
        return ((s1_1 == s2_2) & (s1_2 == s2_1) ).astype(int)


|    | Field1   | Field2   | Method    |
|---:|:---------|:---------|:----------|
|  0 | fname_1  | fname_2  | jaro_wink |
|  1 | fname_1  | fname_2  | dlev_len  |
|  2 | mname_1  | mname_2  | is_equal  |
|  3 | lname_1  | lname_2  | jaro_wink |
|  4 | lname_1  | lname_2  | dlev_len  |
|  5 | phone_1  | phone_2  | dlev_len  |
|  6 | add_1    | add_2    | jaro_wink |
|  7 | add_1    | add_2    | dlev_len  |
|  8 | county_1 | county_2 | dlev_len  |
|  9 | state_1  | state_2  | is_equal  |
| 10 | zip_1    | zip_2    | dlev_len  |
| 11 | sex_1    | sex_2    | is_equal  |
| 12 | email_1  | email_2  | jaro_wink |
| 13 | city_1   | city_2   | jaro_wink |



In [None]:
# Create the Compare object
compare = rl.Compare()

# Perform all of the comparisons described above
compare.string('fname','fname','jarowinkler', label='jwink_fname', missing_value=-1.0)
compare.add(compare_custom_dlev_distance('fname','fname',label='dlev_fname'))
compare.string('lname','lname','jarowinkler',label='jwink_lname', missing_value=-1.0)
compare.add(compare_custom_dlev_distance('lname','lname',label='dlev_lname'))
compare.add(compare_custom_dlev_distance('phone','phone',label='dlev_phone'))
compare.string('add','add','jarowinkler',label='jwink_add', missing_value=-1.0)
compare.add(compare_custom_dlev_distance('add','add',label='dlev_add'))
compare.add(compare_custom_dlev_distance('county','county',label='dlev_county'))
compare.exact('state','state',label='exact_state', missing_value=-1.0)
compare.add(compare_custom_dlev_distance('zip','zip',label='dlev_zip'))
compare.exact('ssn','ssn', label='exact_ssn',missing_value=-1.0)
compare.string('city','city','jarowinkler',label='jwink_city', missing_value=-1.0)
compare.add(compare_swapped_fields(('fname','lname'),('fname','lname'),label='swapped_fname_lname'))
compare.add(compare_dob_score('dob','dob',label='dob_comparison_score'))
compare.add(compare_name_score(('fname','sdx_fname','meta_fname'),('fname','sdx_fname','meta_fname'),label='fname_comparison_score'))
compare.add(compare_name_score(('lname','sdx_lname','meta_lname'),('lname','sdx_lname','meta_lname'),label='lname_comparison_score'))

# Add our label column
compare.exact('person_id','person_id',label='label')


In [None]:
# Perform the comparisons (should take nearly 13 minutes)
features = compare.compute(pairs,df,df)

## 5. Sample Data

In [None]:
# # First remove instances where the rows are the exact same
# features = features.loc[features.index.map(lambda x: x[0] != x[1])]

### 5.1 Split by label

The label=0 class has far more pairs than the label=1 class.
We'll treat the two classes differently when sampling.

In [None]:
# Separate majority and minority classes
df_majority = features[features.label==0]
df_minority = features[features.label==1]

### 5.2 Sample "majority" class

As described above, the label=0 class has the majority of the pairs we've created.
We want to sample them in a non-random way.
If we sampled 5 rows from the label=0 data at random, we'd see something like this (exaggerated for illustration):

| fname_comparison_score | dlev_fname | jwink_lname | dob_comparison_score | dlev_phone |
|------------------------|------------|-------------|----------------------|------------|
| low                    | low        | low         | medium               | low        |
| low                    | low        | low         | low                  | medium     |
| low                    | low        | low         | low                  | low        |
| low                    | low        | low         | medium               | low        |
| low                    | low        | low         | low                  | low        |

You'll notice that most of the scores are "low". This is because most of the generated pairs are easy to identify as non-matches.
We're more interested in the trickier cases for model training, so that the model can handle situations like twins, married couples, and passed-down names.

To do this, we split each of the five columns listed above into "low", "medium", and "high" buckets that reflect each score relative to the rest of the dataset.
For the sake of this example, think of these buckets as 0, 1, and 2.
A pair with all buckets low would be labeled 0_0_0_0_0.
A pair with all buckets high would be labeled 2_2_2_2_2.
The code draws the same number of pairs from every observed combination (stratum): up to 100000 divided by the number of strata, or the whole stratum when it is smaller than that.
This gives us a deliberately non-random sample that contains a wide variety of combinations.

#### Note: The following cell can take around 100 minutes to run

In [None]:
df_bins = pd.DataFrame()

# Discretize the fields into bins
for col in ['fname_comparison_score','dlev_fname', 'jwink_lname', 'dob_comparison_score', 'dlev_phone']:
    df_bins[col + '_bin'] = pd.qcut(df_majority[col], q=3, duplicates='drop')

# Create a 'strata' column that combines the bins
df_bins['strata'] = df_bins.apply(lambda x: '_'.join(x.astype(str)), axis=1)

# Combine the bins back to df_majority
df_majority = pd.concat([df_majority, df_bins], axis=1)

# Sample from each stratum
samples = []
for stratum, group in df_majority.groupby('strata'):
    samples.append(group.sample(min(len(group), 100000 // df_majority['strata'].nunique()), random_state=42, replace=False))

# Concatenate the samples into a single dataframe
df_majority_sampled = pd.concat(samples)

# Drop the bin and strata columns
df_majority_sampled = df_majority_sampled.drop(['fname_comparison_score_bin', 'dlev_fname_bin', 'jwink_lname_bin', 'dob_comparison_score_bin', 'dlev_phone_bin', 'strata'], axis=1)
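
The stratified sampling idea can be tried quickly on toy data. The sketch below uses two hypothetical score columns and a hypothetical target of 900 rows. It builds the strata key with vectorized string concatenation instead of a row-wise `apply`, which may run faster on large frames, though that is not benchmarked here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores (uniform random, not the notebook's features)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "score_a": rng.random(3000),
    "score_b": rng.random(3000),
})

# Integer bin codes per column: 0 = low, 1 = medium, 2 = high
codes = pd.DataFrame({
    col: pd.qcut(toy[col], q=3, labels=False, duplicates="drop")
    for col in toy.columns
})

# Strata key such as "0_2", built column-wise rather than row by row
strata = codes["score_a"].astype(str) + "_" + codes["score_b"].astype(str)

# Draw the same number of rows from every stratum, capped at its size
target_total = 900
per_stratum = target_total // strata.nunique()
samples = [
    group.sample(min(len(group), per_stratum), random_state=42)
    for _, group in toy.groupby(strata)
]
sampled = pd.concat(samples)
```

Every one of the nine strata contributes 100 rows here, so rare combinations are represented as heavily as common ones.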

### 5.3 Sample the minority class (randomly)

We're less concerned about how we sample the minority class, so we'll sample it randomly.

In [None]:
# Sample minority class so it matches in size with newly-sampled majority class 
df_minority_sampled = df_minority.loc[df_minority.index.map(lambda x: x[0] != x[1])].sample(len(df_majority_sampled), random_state=42, replace=False)
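
One thing to watch: `DataFrame.sample` with `replace=False` raises a `ValueError` when asked for more rows than the frame contains. The sketch below, using hypothetical toy values, caps the request at the number of rows available. Whether to cap the sample or to sample with replacement instead is a modeling choice that this notebook does not settle.

```python
import pandas as pd

# Hypothetical minority class with only 5 rows
minority = pd.DataFrame({"label": [1] * 5})

# Hypothetical size of the sampled majority class
requested = 8

# Cap the sample size so sampling without replacement cannot fail
n = min(requested, len(minority))
minority_sampled = minority.sample(n, random_state=42, replace=False)
```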

### 5.4 Combine minority and majority samples

In [None]:
# Combine sampled majority class with randomly sampled minority class
df_resampled = pd.concat([df_majority_sampled, df_minority_sampled])

## 6. Save!

In [None]:
# Write out to a .parquet file in our data folder
df_resampled.to_parquet('../../Data/Training/04. Training Data Hep C.parquet',index=True)