# Create Training Set
---------


### Author Information
**Author:** PJ Gibson  
**Email:** Peter.Gibson@doh.wa.gov  
**Github:**   https://github.com/DOH-PJG1303

### Project Information
**Created Date:** 2023-05-16  
**Last Updated:** 2023-05-16  
**Version:** 1  

### Description
This notebook prepares the training data used to train and test a machine learning model for record linkage.

### Notes


## 1. Import Libraries

In [7]:
# Standard data analysis tools
import pandas as pd
import numpy as np

# Record linkage specific resources
import recordlinkage as rl
from recordlinkage.preprocessing import clean, phonetic
from recordlinkage.index import Block

## 2. Read Data

Here, we're using synthetic data that I created to replicate Oregon's population.
The data represents the population of individuals in the years 2020-2022 who were born in Coos County in Oregon.
They may live elsewhere.
Interesting features that this training data includes:
* several people can live in the same building
* families exist. If a child is under 18, they live with their parent or parents
* fields change depending on the year. If someone gets married and changes their name in 2020, their `lname` in the 2019 data will differ from their `lname` in the 2020 data.

For more information, reach out to PJ and he'd be happy to elaborate.
Again, this data is synthetic.

In [8]:
# Columns kept from each file (same selection for both)
cols = ['ssn','fname','lname','dob','phone','add','unique_id','parents_partnership_id','house_id','building_id']
df1 = pd.read_csv('Data/synthetic_df1.csv', dtype=str)[cols]
df2 = pd.read_csv('Data/synthetic_df2.csv', dtype=str)[cols]

## 3. Record Linkage Steps

In [9]:
# Clean the data
for col in ['fname', 'lname', 'dob', 'phone', 'add']:
    df1[col] = clean(df1[col])
    df2[col] = clean(df2[col])

# Generate metaphone versions of the fields
for col in ['fname', 'lname']:
    df1['meta_'+col] = phonetic(df1[col], method='metaphone')
    df2['meta_'+col] = phonetic(df2[col], method='metaphone')

# Create the index (pairs of records to compare)
indexer = rl.Index()

# Generate a blocking scheme as a union of the following blocks
indexer.add(Block('dob'))
indexer.add(Block(['meta_fname', 'meta_lname']))
indexer.add(Block('building_id'))
indexer.add(Block('parents_partnership_id'))

pairs = indexer.index(df1, df2)

# Create the Compare object
compare_precursor = rl.Compare()
compare = rl.Compare()

##############################################################################################################

# First pass: score each field, marking missing values with a sentinel of -1
for col in ['fname', 'lname', 'dob', 'add']:
    compare_precursor.string(col, col, method='jarowinkler', missing_value=-1, label=col)

# Compute the comparison scores
features_precursor = compare_precursor.compute(pairs, df1, df2)

##############################################################################################################

# Second pass: use each field's mean observed score as its missing value
for col in ['fname', 'lname', 'dob', 'add']:
    # Mask the -1 sentinel as NaN so it is excluded from the mean
    average_score = features_precursor[col].replace(-1, np.nan).mean()
    compare.string(col, col, method='jarowinkler', missing_value=average_score, label=col)

# Compare the phone fields using damerau_levenshtein similarity
compare.string('phone', 'phone', method='damerau_levenshtein', label='phone')

# Exact agreement on ssn serves as the label (1 = same person, 0 = different people)
compare.exact('ssn','ssn',label='label')

# Compute the comparison scores
features = compare.compute(pairs, df1, df2)
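
To make the cell above easier to follow, here is a minimal plain-Python sketch of its two main ideas on toy records: candidate pairs formed as the union of several blocking keys, and a two-pass fill in which missing comparisons first receive a sentinel of -1 and are then replaced by the mean of the observed scores. The toy records, the `candidate_pairs` and `similarity` helpers, the choice to skip missing blocking keys, and the use of `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler are all assumptions made for illustration; none of them come from `recordlinkage`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from statistics import mean

# Toy records (hypothetical); None marks a missing value.
left = {
    "a1": {"fname": "maria", "lname": "lopez", "dob": "1990-01-02"},
    "a2": {"fname": "john", "lname": None, "dob": "1985-07-30"},
}
right = {
    "b1": {"fname": "maria", "lname": "lopes", "dob": "1990-01-02"},
    "b2": {"fname": "jon", "lname": "smith", "dob": "1985-07-30"},
    "b3": {"fname": "maria", "lname": "young", "dob": "2001-03-04"},
}

def candidate_pairs(left, right, blocks):
    """Union of the pairs that agree exactly on at least one blocking key."""
    pairs = set()
    for fields in blocks:
        # Bucket the right-hand records by their key for this block.
        buckets = defaultdict(list)
        for rid, rec in right.items():
            key = tuple(rec[f] for f in fields)
            if None not in key:  # assumption: missing keys never block together
                buckets[key].append(rid)
        # Pair every left-hand record with the right-hand records in its bucket.
        for lid, rec in left.items():
            key = tuple(rec[f] for f in fields)
            if None not in key:
                pairs.update((lid, rid) for rid in buckets.get(key, []))
    return pairs

def similarity(a, b, sentinel=-1.0):
    """String similarity in [0, 1], or the sentinel if either side is missing."""
    if a is None or b is None:
        return sentinel
    return SequenceMatcher(None, a, b).ratio()

pairs = sorted(candidate_pairs(left, right, blocks=[("dob",), ("fname",)]))

# Pass 1: score every pair, marking missing comparisons with -1.
first_pass = {p: similarity(left[p[0]]["lname"], right[p[1]]["lname"]) for p in pairs}

# Pass 2: replace the sentinel with the mean of the observed scores.
observed_mean = mean(s for s in first_pass.values() if s != -1.0)
lname_scores = {p: (observed_mean if s == -1.0 else s) for p, s in first_pass.items()}
```

The pair `("a1", "b3")` is a candidate only because of the `fname` block, which shows why a union of blocks recovers pairs that a single key would miss. After the second pass no score is left at -1, so a missing field neither rewards nor penalizes a pair relative to the average.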

## 4. Final Preprocessing Step

Right now, there are far more pairs with `label=0` than with `label=1`.
Roughly 86.5% of the pairs have `label=0`.

If we used this output to train our model, we could get a model that predicts a label of 0 for every record.
It would still have an accuracy score of 86.5%, which is misleading.
A model that predicts 0 for everything is essentially useless.
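
The arithmetic behind that claim can be checked directly. The sketch below builds a hypothetical set of 1,000 labels with the 86.5% / 13.5% split quoted above and scores a predictor that always answers 0. The counts are illustrative, not taken from the real pair table.

```python
# Hypothetical label vector with the class split quoted above.
n_pairs = 1000
n_nonmatch = 865                      # pairs with label = 0
labels = [0] * n_nonmatch + [1] * (n_pairs - n_nonmatch)

# A "model" that predicts 0 for every pair.
predictions = [0] * n_pairs

# Accuracy: the share of pairs whose prediction equals the label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / n_pairs

# Recall on the match class: the share of true matches the model finds.
true_matches = sum(y == 1 for y in labels)
found_matches = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = found_matches / true_matches

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# prints "accuracy=0.865, recall=0.000"
```

The accuracy looks respectable while the recall shows that not a single true match is found, which is why accuracy alone is a poor yardstick on unbalanced pairs.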

We'll format the data so that it has a roughly even number of each label class.
We use a somewhat involved stratification to sample the `label=0` class so that it contains "interesting" records.
If we sampled this majority class at random, nearly all of the sampled pairs would be VERY dissimilar.
By sampling non-randomly, we can find "interesting" `label=0` pairs where, for instance, first name, last name, dob, and address are all very similar, but the two records still represent different people.
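
A toy illustration of why stratifying helps, using only the standard library. The skewed similarity scores, the fixed bin edges, and the `bin_of` helper are assumptions made for this sketch; the notebook itself bins with quantiles across five fields.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)

# Hypothetical non-match pairs: most have low similarity, a few look alike.
scores = [rng.betavariate(1, 4) for _ in range(10_000)]

def bin_of(score):
    """Assign a similarity score to a coarse stratum (fixed edges for clarity)."""
    if score < 1 / 3:
        return "low"
    if score < 2 / 3:
        return "mid"
    return "high"

sample_size = 300

# Simple random sample: mirrors the skew of the full population.
random_sample = rng.sample(scores, sample_size)

# Stratified sample: take at most an equal share from every stratum.
strata = defaultdict(list)
for s in scores:
    strata[bin_of(s)].append(s)
per_stratum = sample_size // len(strata)
stratified_sample = []
for name, group in strata.items():
    stratified_sample.extend(rng.sample(group, min(len(group), per_stratum)))

random_counts = Counter(bin_of(s) for s in random_sample)
stratified_counts = Counter(bin_of(s) for s in stratified_sample)
print("random:    ", dict(random_counts))
print("stratified:", dict(stratified_counts))
```

The random sample is dominated by the "low" stratum, while the stratified sample draws the same number of pairs from each stratum whenever enough are available, so the rare look-alike non-matches are well represented.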

### 4.1 Split label classes

In [10]:
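# Separate majority and minority classes
# (.copy() gives independent frames, so adding columns later does not trigger SettingWithCopyWarning)
df_majority = features[features.label==0].copy()
df_minority = features[features.label==1].copy()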

### 4.2 Non-randomly sample majority class

In [11]:
# Discretize the fields into bins
for col in ['fname', 'lname', 'dob', 'phone', 'add']:
    df_majority[col + '_bin'] = pd.qcut(df_majority[col], q=3, duplicates='drop')

# Create a 'strata' column that combines the bins
df_majority['strata'] = df_majority[['fname_bin', 'lname_bin', 'dob_bin', 'phone_bin', 'add_bin']].apply(lambda x: '_'.join(x.astype(str)), axis=1)

# Sample from each stratum, capping the total at roughly 100,000 pairs
n_per_stratum = 100000 // df_majority['strata'].nunique()
samples = []
for stratum, group in df_majority.groupby('strata'):
    samples.append(group.sample(min(len(group), n_per_stratum), random_state=42))

# Concatenate the samples into a single dataframe
df_majority_sampled = pd.concat(samples)

# Drop the bin and strata columns
df_majority_sampled = df_majority_sampled.drop(['fname_bin', 'lname_bin', 'dob_bin', 'phone_bin', 'add_bin', 'strata'], axis=1)

### 4.3 Randomly sample minority class

In [None]:
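# Sample minority class so it matches in size with newly-sampled majority class
# (sample with replacement only if the minority class is too small to match otherwise)
n_target = len(df_majority_sampled)
df_minority_sampled = df_minority.sample(n_target, replace=len(df_minority) < n_target, random_state=42)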

### 4.4 Combine

In [None]:
# Combine the sampled majority class with the sampled minority class
df_resampled = pd.concat([df_majority_sampled, df_minority_sampled])

## 5. Save

In [None]:
# Write out to a .csv file in our data folder
df_resampled.to_csv('./Data/synthetic_training_data.csv',header=True,index=True)