# Fuzzy matching exploration

Run EDA to test the fuzzy matching code from [PR #35](https://github.com/CodeForPhilly/paws-data-pipeline/pull/35) to determine if the original email + name fuzzy matching business logic is sufficient for joining the tables, or if we should continue to explore alternative approaches (e.g. [#36](https://github.com/CodeForPhilly/paws-data-pipeline/issues/36)).

**Usage guide:** This notebook contains analysis code that was run on a data extract from PAWS.  The extract files are loaded from a local `Original/` directory and are not published to GitHub (see block 6 below).  **WARNING:** this notebook is intended to be read-only to inform future work in the Python data pipeline.  If you modify the .ipynb file and rerun on real data, be sure to triple-check any changes before committing any data into the repo.

**Future work:** Brainstorm the matching criteria in more detail, possibly with a flowchart or Sankey diagram showing how many records remain at each step. (E.g., there are only ~100 records in each dataset where the emails match but the names do not.  Since that number is small, fuzzy matching may not be necessary at all.)  This workflow should also be productionized, moving from an exploratory notebook into the full Python scripts.
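
As a starting point for the flowchart idea, the per-step counts could be tabulated along these lines. This is a minimal sketch on made-up toy data, not the PAWS extract; the column names `email` and `name`, the helper `match_funnel`, and the exact (case-insensitive) name comparison are all assumptions for illustration.

```python
import pandas as pd


def match_funnel(source, reference):
    """Count how many source records survive each matching step.

    Both frames are assumed (hypothetically) to have 'email' and 'name' columns.
    """
    # Step 1: keep records that have an email, compared case-insensitively
    has_email = source.dropna(subset=["email"])
    has_email = has_email.assign(key=has_email["email"].str.lower())
    ref = reference.dropna(subset=["email"])
    ref = ref.assign(key=ref["email"].str.lower())

    # Step 2: keep records whose email exists in the reference table
    email_hit = has_email[has_email["key"].isin(ref["key"])]

    # Step 3: join to the reference to compare names for the matched emails
    joined = email_hit.merge(ref[["key", "name"]], on="key", suffixes=("", "_ref"))
    same_name = joined[joined["name"].str.lower() == joined["name_ref"].str.lower()]

    return {
        "all records": len(source),
        "with email": len(has_email),
        "email found in reference": len(email_hit),
        "email and name both match": same_name["key"].nunique(),
    }


# Toy data only
source = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray", "Cy Doe", "Di Fox"],
    "email": ["ANN@x.org", "bob@x.org", None, "di@x.org"],
})
reference = pd.DataFrame({
    "name": ["Ann Lee", "Robert Ray"],
    "email": ["ann@x.org", "bob@x.org"],
})

funnel = match_funnel(source, reference)
for step, count in funnel.items():
    print(f"{step}: {count}")
```

The resulting dictionary has one entry per step, which is the shape a Sankey or funnel chart would need as input.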

**Current conclusions and understanding from this analysis:**

1. Using the email address by itself was surprisingly effective to match users between the datasets.  Running a case insensitive search (or normalizing all of the emails to lowercase) improves the matching effectiveness.
2. There are many records in salesforce that share an email address, including name changes, shared household emails, and duplicate entries from ticketed events.
3. The volgistics dataset was quite clean, with only three emails shared in the whole dataset.
4. Petpoint was a bit messier: about 100 records without an email address, and surprisingly ~1/3 of the given emails were not in salesforce.  More work should be done to understand this subset of emails and how to proceed.
5. In the petpoint and volgistics matches against salesforce, there were roughly 100 cases each (118 and 93 rows, respectively) where the emails matched but the names did not.  Given the nature of these mismatches (typically name changes, middle names, nicknames, etc.), it appears that improving the matching rules (e.g. preprocessing) and/or correcting the source data would be more effective for our use case than establishing a fuzzy match threshold.
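
Point 1 can be illustrated on toy data (made-up addresses, not the PAWS extract). The frames and column names below are hypothetical; the sketch only shows that the same join finds more matches once both sides are lowercased.

```python
import pandas as pd

volunteers = pd.DataFrame({"email": ["Pat@Example.org", "sam@example.org", "lee@example.org"]})
contacts = pd.DataFrame({"email": ["pat@example.org", "SAM@example.org", "kim@example.org"]})

# Case-sensitive join on the raw strings: no address matches exactly
raw_matches = volunteers.merge(contacts, on="email", how="inner")

# Lowercase both sides, then join on the normalized key
volunteers["lower_email"] = volunteers["email"].str.lower()
contacts["lower_email"] = contacts["email"].str.lower()
normalized_matches = volunteers.merge(contacts, on="lower_email", how="inner")

print(len(raw_matches), len(normalized_matches))  # 0 2
```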


## Import libraries and define generic utilities

Copying useful methods from load_paws_data.ipynb, minus the export to a sqlite database.

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import re
from fuzzywuzzy import fuzz

In [2]:
pd.options.display.max_columns = None

In [3]:
def load_df(csv_name, drop_first_col=False):
    # adapted from load_to_sqlite
    df = pd.read_csv(csv_name, encoding='cp1252')
    
    # drop the first column - so far all csvs have had a first column that's an index and doesn't have a name
    if drop_first_col:
        df = df.drop(df.columns[0], axis=1)
    
    # strip whitespace and periods from headers, convert to lowercase
    df.columns = df.columns.str.lower().str.strip()
    df.columns = df.columns.str.replace(' ', '_')
    df.columns = df.columns.map(lambda x: re.sub(r'\.+', '_', x))
    return df

In [4]:
def clean_entry(entry):
    """
    Function to clean up all values returned from the SQL statement, so this 
    should be performed on every entry in the dataframe with an applymap
    
    1 Change None or NaN value to an empty string
    2 Cast value as string
    3 Lowercase value
    4 Strip leading and trailing white space
    5 (currently disabled) Remove punctuation by only keeping letters, numbers and white space
    6 Replace internal multiple consecutive white spaces with a single white space
    """
    
    # convert None and NaN to an empty string
    # (NaN never compares equal to itself, so `entry == np.nan` is always False; pd.isna handles both)
    if pd.isna(entry):
        entry = ''
    
    # convert to string, lowercase, and strip leading and trailing whitespace
    entry = str(entry).lower().strip()
    
#    # remove all non alphanumeric characters except white space
#    alphanumeric_and_space = ' 1234567890abcdefghijklmnopqrstuvwxyz'
#    entry = ''.join([c for c in entry if c in alphanumeric_and_space])
    
    # cut down (internal) consecutive whitespaces to one white space
    entry = re.sub(r'\s+', ' ', entry)
    
    return entry

# usage example:
# df = df.applymap(clean_entry)
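
One caveat about the missing-value check in `clean_entry`: NaN never compares equal to anything, including itself, so an `==` comparison against `np.nan` cannot detect it. A small standard-library sketch of that behavior, using a hypothetical helper `is_missing`:

```python
import math

nan = float("nan")

# IEEE 754: NaN is not equal to anything, including itself
print(nan == nan)       # False
print(math.isnan(nan))  # True


def is_missing(entry):
    """Hypothetical helper: True for None or a float NaN."""
    return entry is None or (isinstance(entry, float) and math.isnan(entry))


cleaned = ["" if is_missing(v) else str(v).lower().strip() for v in [None, nan, " Bob "]]
print(cleaned)  # ['', '', 'bob']
```

Inside a pandas workflow, `pd.isna(entry)` covers both `None` and NaN in one call.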

In [5]:
from fuzzywuzzy import fuzz
def single_fuzzy_score(record1, record2):
    # Calculate a fuzzy matching score between two strings.
    # Uses a modified Levenshtein distance from the fuzzywuzzy package.
    # Update this function if a new fuzzy matching algorithm is selected.
    # Similar to the example of "New York Yankees" vs. "Yankees" in the documentation, we 
    # should use fuzz.partial_ratio instead of fuzz.ratio to more gracefully handle nicknames.
    # https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
    return fuzz.partial_ratio(record1, record2)

def df_fuzzy_score(df, column1_name, column2_name):
    # Calculates a new column of fuzzy scores from two columns of strings.
    # Slow in part due to a nonvectorized loop over rows
    return df.apply(lambda row: single_fuzzy_score(row[column1_name], row[column2_name]), axis=1)
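
To see why a partial ratio is more forgiving of nicknames and extra name parts, here is a simplified standard-library stand-in. It is not fuzzywuzzy's implementation: `simple_ratio` and `sliding_partial_ratio` are hypothetical helpers built on `difflib.SequenceMatcher`, and their scores may differ from `fuzz.ratio` and `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sliding_partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer string, on a 0-100 scale."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


# The whole-string score is penalized for the extra words...
print(simple_ratio("yankees", "new york yankees"))
# ...while the windowed score finds the exact substring
print(sliding_partial_ratio("yankees", "new york yankees"))  # 100
```

The trade-off is that a substring-style score also gives 100 to pairs such as a first name alone versus a full name, so a perfect score is weaker evidence than it would be with a whole-string ratio.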

## Load data 

In [6]:
orig_petpoint = load_df('./Original/PetPoint - AnimalIntakeWithResultsExtended.csv')
orig_salesforce = load_df('./Original/Salesforce - Accounts and Contacts.csv')
orig_volgistics = load_df('./Original/Volgistics - VolunteerInformation_Master.csv')

### EDA for new data sources

Before attempting matches, let's first understand the new data sources to extract the relevant fields.

In [7]:
DEFAULT_PREVIEW_ROWS = 1  # number of records to preview, or 0 to disable (empty preview)
def preview_df(df):
    return df.head(DEFAULT_PREVIEW_ROWS)

In [53]:
# First previewing some example data
orig_petpoint.pipe(preview_df)

In [54]:
orig_salesforce.pipe(preview_df)

In [55]:
orig_volgistics.pipe(preview_df)

In [56]:
# How big are the datasets?
print('Petpoint: {}'.format(orig_petpoint.shape))
print('Salesforce: {}'.format(orig_salesforce.shape))
print('Volgistics: {}'.format(orig_volgistics.shape))

Petpoint: (3122, 15)
Salesforce: (60187, 16)
Volgistics: (1242, 42)


In [12]:
# By going through the `orig_volgistics.columns`, etc., what are the minimal datasets we need?
petpoint = (
    orig_petpoint
    [['outcome_person_name', 'out_email', 'outcome_person_#']]
    .rename(columns={'outcome_person_name': 'petpoint_name', 'out_email': 'petpoint_email', 'outcome_person_#': 'petpoint_id'})
    .assign(lower_email=lambda df: df['petpoint_email'].str.lower())
)
salesforce = (
    orig_salesforce
    [['first_name', 'last_name', 'email', 'contact_id']]
    # contact_id=person in Salesforce nonprofit, account_id=household
    .assign(salesforce_name=lambda df: df['first_name'] + ' ' + df['last_name'])
    #.drop(columns=['first_name', 'last_name'])  # TODO: RE-ENABLE
    .rename(columns={'contact_id': 'salesforce_contact_id', 'email': 'salesforce_email'})  # 'account_id': 'sf_account_id'
    .assign(lower_email=lambda df: df['salesforce_email'].str.lower())
)
volgistics = (
    orig_volgistics
    [['first_name_last_name', 'email', 'number']]
    .rename(columns={'first_name_last_name': 'volgistics_name', 'number': 'volgistics_id', 'email': 'volgistics_email'})
    .assign(lower_email=lambda df: df['volgistics_email'].str.lower())
)  # note there are other interesting name fields available, such as nickname

In [13]:
# How many unique records are in each dataset, first by the ID's and then by name+email?  How do they compare
# vs. the raw shape of the dataframes above?
print("Petpoint: all rows ({}) vs. unique ID's ({}) vs. unique email ({}) vs. unique name+email ({})".format(
    petpoint.shape[0], petpoint['petpoint_id'].nunique(), petpoint['petpoint_email'].nunique(), petpoint[['petpoint_name', 'petpoint_email']].drop_duplicates().shape[0]
))  # petpoint has multiple adoptions 
print("Volgistics: all rows ({}) vs. unique ID's ({}) vs. unique email ({}) vs. unique name+email ({})".format(
    volgistics.shape[0], volgistics['volgistics_id'].nunique(), volgistics['volgistics_email'].nunique(), volgistics[['volgistics_name', 'volgistics_email']].drop_duplicates().shape[0]
))
print("Salesforce: all rows ({}) vs. unique ID's ({}) vs. unique email ({}) vs. unique name+email ({})".format(
    salesforce.shape[0], salesforce['salesforce_contact_id'].nunique(), salesforce['salesforce_email'].nunique(), salesforce[['salesforce_name', 'salesforce_email']].drop_duplicates().shape[0]
))

Petpoint: all rows (3122) vs. unique ID's (2746) vs. unique email (2642) vs. unique name+email (2745)
Volgistics: all rows (1242) vs. unique ID's (1242) vs. unique email (1239) vs. unique name+email (1242)
Salesforce: all rows (60187) vs. unique ID's (60187) vs. unique email (45418) vs. unique name+email (59665)


## Testing the original business rules

How effective is email?  Email + fuzzy name?  Is there a particular threshold we should use?

### Email uniqueness exploration

In [58]:
# Before starting the email join, how unique is the field?  How many duplicates?
duplicate_volgistics_emails = volgistics.groupby('lower_email').count().reset_index().query("volgistics_id > 1")
volgistics[volgistics['lower_email'].isin(duplicate_volgistics_emails['lower_email'])]
# Only three examples where the emails are nonunique, for review later

Unnamed: 0,volgistics_name,volgistics_email,volgistics_id,lower_email


In [15]:
# Also some emails in salesforce are repeated several times
orig_salesforce['email'].value_counts().head().values

array([12,  8,  5,  5,  5], dtype=int64)

In [16]:
# In the original iteration, we also found that there were a number of names that were fully uppercase in
# Salesforce and/or volgistics, so let's run the standardization there as well when we do the fuzzy scores

### volgistics-salesforce matching

In [17]:
# Flagging the volgistics duplicate emails for separate data cleanup
excluded_volgistics = volgistics[volgistics['lower_email'].isin(duplicate_volgistics_emails['lower_email'])].copy()
volgistics = volgistics[~volgistics['lower_email'].isin(duplicate_volgistics_emails['lower_email'])]

In [105]:
# Then, based on the number of unique emails, the business logic for Volgistics looks like it should be pretty straightforward
volgistics_match = salesforce.merge(volgistics[['lower_email', 'volgistics_id', 'volgistics_name']], on='lower_email', how='inner')
volgistics_match.shape

# Which emails do not match?
unmatched_volgistics = volgistics[~volgistics['lower_email'].isin(list(salesforce['lower_email']))]
unmatched_volgistics.head()
'hiding personal info';

In [106]:
volgistics_match.head(10)
'hiding personal info';

In [107]:
# showing a specific email example without saying it directly here.
# Presumably a name change plus a couple that is sharing an email address
orig_salesforce[orig_salesforce['email'] == volgistics_match['salesforce_email'][4]]
'hiding personal info';

In [108]:
volgistics_match['salesforce_name'] = volgistics_match['salesforce_name'].str.upper()
volgistics_match['volgistics_name'] = volgistics_match['volgistics_name'].str.upper()
volgistics_match['volgistics_fuzzy_name'] = df_fuzzy_score(volgistics_match, 'volgistics_name', 'salesforce_name')
volgistics_match.head()
'hiding personal info';

In [22]:
print("{}/{} names have a 100% fuzzy match score for volgistics-salesforce".format(
    volgistics_match[volgistics_match['volgistics_fuzzy_name']==100].shape[0], volgistics_match.shape[0]
))
# WARNING: this number is greater than the 1242 records in the original volgistics dataset,
# because some rows have become duplicates due to emails with multiple rows in Salesforce.
# See the block before last for more details.

1211/1304 names have a 100% fuzzy match score for volgistics-salesforce
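
The row inflation can be reproduced on toy data. In this sketch (made-up frames and column names, not the PAWS extract), one email appears twice on the Salesforce side, so the inner join returns more rows than there are volunteers. The `validate` argument of `DataFrame.merge` is one way to have pandas flag such cases; whether raising an error is appropriate here is a judgment call.

```python
import pandas as pd
from pandas.errors import MergeError

volunteers = pd.DataFrame({"lower_email": ["a@x.org", "b@x.org"], "volunteer_id": [1, 2]})
contacts = pd.DataFrame({
    "lower_email": ["a@x.org", "a@x.org", "b@x.org"],
    "contact_id": ["c1", "c2", "c3"],
})

matched = contacts.merge(volunteers, on="lower_email", how="inner")
print(len(volunteers), len(matched))  # 2 3

# Rows per volunteer after the join; values above 1 indicate fan-out
fan_out = matched.groupby("volunteer_id").size()
print(fan_out[fan_out > 1])

# Ask pandas to check that the join key is unique on both sides
try:
    contacts.merge(volunteers, on="lower_email", how="inner", validate="one_to_one")
    keys_unique = True
except MergeError:
    keys_unique = False
print(keys_unique)  # False
```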


In [109]:
volgistics_matched_emails_not_names = (
    volgistics_match
    [volgistics_match['volgistics_fuzzy_name']!=100]
    .sort_values('volgistics_fuzzy_name')
)
# Lots to look at here about mismatching names to feed back for manual analysis, starting in priority order.
# Many are nicknames, duplicates in the Salesforce record (see next block), name changes, middle names, etc.

volgistics_matched_emails_not_names
'hiding personal info';

In [110]:
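# value_counts() sorts by frequency, so .index[0] is the most common email
# (the column name produced by reset_index() depends on the pandas version)
email_that_repeats_four_times = volgistics_matched_emails_not_names['salesforce_email'].value_counts().index[0]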
volgistics[volgistics['volgistics_email'] == email_that_repeats_four_times]
orig_salesforce[orig_salesforce['email'] == email_that_repeats_four_times]
# four individual tickets for the same event in Salesforce
'hiding personal info';

### petpoint-salesforce matching

In [67]:
# First, recalling the uniqueness situation from above:
print("Petpoint: all rows ({}) vs. unique ID's ({}) vs. unique email ({}) vs. unique name+email ({})".format(
    petpoint.shape[0], petpoint['petpoint_id'].nunique(), petpoint['petpoint_email'].nunique(), petpoint[['petpoint_name', 'petpoint_email']].drop_duplicates().shape[0]
))  # petpoint has multiple adoptions

# Hypothesis: the unique ID will map 1:1 to name+email since the original ID field was labeled as `outcome_person_#`
print("Number of unique ID + email + name combinations: {}".format(petpoint[['petpoint_name', 'petpoint_email', 'petpoint_id']].drop_duplicates().shape[0]))
# Since the number of unique ID's matches the number of unique ID's + emails + names in the dataset, each ID
# has a single value for email and name, meaning that we can simplify the petpoint dataframe to remove the duplicate rows.

Petpoint: all rows (3122) vs. unique ID's (2746) vs. unique email (2642) vs. unique name+email (2745)
Number of unique ID + email + name combinations: 2746
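
The same reasoning can be turned into an explicit check. A sketch on toy data, with hypothetical column names and a hypothetical helper `ids_map_to_single_identity`: after dropping duplicate (ID, name, email) rows, any ID that still appears more than once has conflicting names or emails.

```python
import pandas as pd


def ids_map_to_single_identity(df, id_col, other_cols):
    """True if every ID is paired with exactly one combination of other_cols."""
    combos = df[[id_col] + list(other_cols)].drop_duplicates()
    return not combos[id_col].duplicated().any()


# Toy data: person 1 has two adoptions with identical name and email
adoptions = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "name": ["Ann Lee", "Ann Lee", "Bob Ray", "Cy Doe"],
    "email": ["ann@x.org", "ann@x.org", None, "cy@x.org"],
})
consistent = ids_map_to_single_identity(adoptions, "person_id", ["name", "email"])
print(consistent)  # True

# Add a second, different email for person 3
extra = pd.DataFrame({"person_id": [3], "name": ["Cy Doe"], "email": ["cy@y.org"]})
conflicting = pd.concat([adoptions, extra], ignore_index=True)
still_consistent = ids_map_to_single_identity(conflicting, "person_id", ["name", "email"])
print(still_consistent)  # False
```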


In [69]:
petpoint = petpoint.drop_duplicates()
petpoint.shape

(2746, 4)

In [71]:
# Then implement matching logic similar to the salesforce-volgistics joining
petpoint_match = salesforce.merge(petpoint[['lower_email', 'petpoint_id', 'petpoint_name']], on='lower_email', how='inner')
petpoint_match.shape

# This did not work (way too many matches).  Likely cause: unlike SQL, a pandas merge matches NULL keys
# to each other, so every record with a missing email on one side pairs with every such record on the other?

(1347993, 8)
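
The blow-up is consistent with how pandas joins work: unlike SQL, `merge` matches rows whose key is null on both sides, so every record with a missing email pairs with every other such record. A toy reproduction (made-up data and column names):

```python
import pandas as pd

left = pd.DataFrame({"lower_email": ["a@x.org", None, None], "left_id": [1, 2, 3]})
right = pd.DataFrame({"lower_email": ["a@x.org", None, None, None], "right_id": [10, 20, 30, 40]})

# Null keys match each other: 1 real match + 2 * 3 null pairings
naive = left.merge(right, on="lower_email", how="inner")
print(len(naive))  # 7

# Dropping rows without an email first leaves only the real match
clean = left.dropna(subset=["lower_email"]).merge(
    right.dropna(subset=["lower_email"]), on="lower_email", how="inner"
)
print(len(clean))  # 1
```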

In [80]:
print("Number of petpoint IDs without an email: {} out of {}".format(petpoint['lower_email'].isnull().sum(), petpoint.shape[0]))

Number of petpoint IDs without an email: 99 out of 2746


In [81]:
# Filtering out petpoint ID's without an associated email address for further analysis
petpoint_no_email = petpoint[petpoint['lower_email'].isnull()].copy()
petpoint = petpoint[~petpoint['lower_email'].isnull()]

In [89]:
# Then, trying the petpoint/salesforce match again
petpoint_match = salesforce.merge(petpoint[['lower_email', 'petpoint_id', 'petpoint_name']], on='lower_email', how='inner')
print(petpoint_match.shape)

# Which emails do not match?
unmatched_petpoint = petpoint[~petpoint['lower_email'].isin(list(salesforce['lower_email']))]
#unmatched_petpoint.head()
print("{} unmatched records from petpoint that have emails not in salesforce".format(unmatched_petpoint.shape[0]))

(1791, 8)
895 unmatched records from petpoint that have emails not in salesforce


In [111]:
# We'll need to figure out what to do with the ~1/3 of emails that are in petpoint but not salesforce.
# In the meantime, how do the names match up between the two systems, assuming the emails match and are accurate?

# Assigning fuzzy scores, closely following the same procedure as volgistics from above
petpoint_match['salesforce_name'] = petpoint_match['salesforce_name'].str.upper()
petpoint_match['petpoint_name'] = petpoint_match['petpoint_name'].str.upper()
petpoint_match['petpoint_fuzzy_name'] = df_fuzzy_score(petpoint_match, 'petpoint_name', 'salesforce_name')
petpoint_match.head()
'hiding personal info';

In [103]:
print("{}/{} records with matching emails have a 100% fuzzy match score on names for petpoint-salesforce".format(
    petpoint_match[petpoint_match['petpoint_fuzzy_name']==100].shape[0], petpoint_match.shape[0]
))

1673/1791 records with matching emails have a 100% fuzzy match score on names for petpoint-salesforce


In [112]:
petpoint_matched_emails_not_names = (
    petpoint_match
    [petpoint_match['petpoint_fuzzy_name']!=100]
    .sort_values('petpoint_fuzzy_name')
)
# Lots to look at here about mismatching names to feed back for manual analysis, starting in priority order.
# Many are nicknames, duplicates in the Salesforce record, name changes, middle names, etc.

petpoint_matched_emails_not_names
'hiding personal info';

## Exporting audit reports

These tables should be exported (e.g. using the pandas method `.to_csv()`) for audit/debugging purposes.  They will also be important to feed into a UI to help understand the nonmatching data and improve the data quality.
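
A minimal sketch of such an export, assuming a hypothetical helper `export_audit_reports` and a temporary output directory. The toy frames stand in for the real audit tables; with real data, the output directory should live outside the repository (or be git-ignored) so that personal information is not committed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_audit_reports(reports, out_dir):
    """Write each named DataFrame to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, df in reports.items():
        path = out_dir / f"{name}.csv"
        df.to_csv(path, index=False)  # index=False avoids an extra unnamed column
        paths.append(path)
    return paths


# Toy stand-ins for the audit tables (made-up values)
reports = {
    "unmatched_volgistics": pd.DataFrame({"volgistics_id": [1], "lower_email": ["a@x.org"]}),
    "petpoint_no_email": pd.DataFrame({"petpoint_id": [7], "petpoint_name": ["EXAMPLE NAME"]}),
}

with tempfile.TemporaryDirectory() as tmp:
    written = export_audit_reports(reports, tmp)
    written_names = sorted(p.name for p in written)
    round_trip = pd.read_csv(written[0])

print(written_names)
```

Writing with `index=False` also means the files can be read back without the `drop_first_col` workaround used in `load_df`.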

In [95]:
excluded_volgistics;  # duplicate emails in the volgistics table, which are excluded for now in the current business logic
unmatched_volgistics;  # volgistics records without a matching salesforce email
volgistics_matched_emails_not_names[['salesforce_email', 'salesforce_name', 'volgistics_name', 'salesforce_contact_id', 'volgistics_id', 'volgistics_fuzzy_name']];  # fuzzy match scores for mismatching names with the same email

petpoint_no_email[['petpoint_name', 'petpoint_id']];  # Petpoint entries without an email address
unmatched_petpoint;  # petpoint records with an email address not found in salesforce
petpoint_matched_emails_not_names[['salesforce_email', 'salesforce_name', 'petpoint_name', 'salesforce_contact_id', 'petpoint_id', 'petpoint_fuzzy_name']];  # fuzzy match scores for petpoint

In [101]:
# And the good matches towards the master table, keeping in mind that some rows may be duplicates for people
# with the same email (and possibly name) depending on how they're entered into salesforce
volgistics_match[['salesforce_name', 'salesforce_email', 'salesforce_contact_id', 'volgistics_id']];
petpoint_match[['salesforce_name', 'salesforce_email', 'salesforce_contact_id', 'petpoint_id']];