# OCR Matching Algorithm Results & Benchmarks

This notebook documents the performance of the matching algorithm, which checks the values returned by OCR against the voter registry dataframe. It compares results along two main axes: the search strategy and the scorer used by the fuzzy matching function.

In [1]:
# libraries used for matching
import pandas as pd
import numpy as np
from rapidfuzz import fuzz, process, utils
import time
from loguru import logger
import sys
import json
import glob

# creating dataframe from the registry - this file must be requested for use
voter_records_2023_df = pd.read_csv('../data/raw_feb_23_city_wide.csv', dtype=str)

# removing the default sink and adding standard output (shown in the Jupyter Notebook) as a sink for logger
logger.remove()
logger.add(sys.stdout, level="INFO", colorize=True, format="<r><u>{time}</u></r> <green>{level}</green> <black><bold>{message}</bold></black>")

# loading the OCR Results - this file must be requested for use
with open('../data/processed_ocr_data.json', 'r') as file:
    resulting_data = json.load(file)

## Vanilla

The original version generates the full name column with a lambda function applied row by row. We then create a separate list of full name values, and the matching function iterates through that list to score every registry name against each name returned by OCR.

In [2]:
# Creating Full Name column with lambda
start = time.time()
voter_records_2023_df['Full Name'] = voter_records_2023_df.apply(lambda x: f"{x['First_Name']} {x['Last_Name']}", axis=1)
full_name_list = list(voter_records_2023_df['Full Name'])
end = time.time()
logger.info(f"Columns generated in {end - start:.2f} seconds")

[31m[4m2024-08-20T22:59:38.891107-0400[0m[31m[0m [32mINFO[0m [30m[1mColumns generated in 3.81 seconds[0m[30m[0m


In [3]:
##
# FUZZY MATCHING FUNCTION
##

def score_function_fuzz(ocr_name, full_name_list):

    """
    Outputs the voter record indices of the names that are
    closest to `ocr_name`.
    """

    # empty dictionary of scores
    full_name_score_dict = dict()

    for idx in range(len(full_name_list)):

        # getting full name for row; ensuring string
        name_row = str(full_name_list[idx])

        # converting string to lower case to simplify matching
        name_row = name_row.lower()
        ocr_name = ocr_name.lower()

        # compiling scores; writing as between 0 and 1
        full_name_score_dict[idx] = fuzz.ratio(ocr_name, name_row)/100

    # sorting dictionary
    sorted_dictionary = dict(sorted(full_name_score_dict.items(), reverse=True, key=lambda item: item[1]))

    # top five key value pairs (indices and scores)
    indices_scores_list = list(sorted_dictionary.items())[:5]

    return indices_scores_list

### By Full Name

In [4]:
matched_list = list()
start_time = time.time()

for dict_ in resulting_data:
    temp_dict = dict()
    high_match_ids = score_function_fuzz(dict_['Name'], full_name_list)
    id_, score_ = high_match_ids[0]
    temp_dict['OCR NAME'] = str(dict_['Name'])
    temp_dict['MATCHED NAME'] = full_name_list[id_]
    temp_dict['SCORE'] = "{:.2f}".format(score_)
    temp_dict['VALID'] = False
    if score_ > 0.85:
        temp_dict['VALID'] = True
    matched_list.append(temp_dict)

match_df = pd.DataFrame(matched_list, columns=["OCR NAME", "MATCHED NAME", "SCORE", "VALID"])
end_time = time.time()
#match_df

In [5]:
valid_matches = match_df['VALID'].sum()
total_records = len(match_df)
logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

[31m[4m2024-08-20T23:00:46.032062-0400[0m[31m[0m [32mINFO[0m [30m[1mMatch Time 67.120 secs | Matched Records: 53 of 66 - 80.30%[0m[30m[0m


## Vectorized Columns

For greater specificity when matching voter records, we want to include the voter's address as well as their name in the search. Building the columns with vectorized string concatenation instead of a row-wise lambda speeds up column generation. We can also omit the separate list, since `process.extract` can iterate through a pandas Series directly.
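One caveat with `+` concatenation of string columns is worth checking: in pandas, a missing value in any one part makes the whole combined value missing. The sketch below uses a two-row toy table (the rows are made up, and whether the real registry has empty fields such as `Street_Dir_Suffix` is an assumption to verify, for example with `isna().sum()`). It contrasts the plain concatenation with a still-vectorized alternative based on `Series.str.cat`.

```python
import io
import pandas as pd

# Toy stand-in for the registry (hypothetical rows); the second row has no Street_Dir_Suffix.
csv_text = """First_Name,Last_Name,Street_Number,Street_Name,Street_Type,Street_Dir_Suffix
ANA,LOPEZ,12,MAIN,ST,N
BEN,OKAFOR,7,OAK,AVE,
"""
toy_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

address_parts = ["Street_Number", "Street_Name", "Street_Type", "Street_Dir_Suffix"]

# Plain `+` concatenation: one missing part makes the whole address missing.
naive = (
    toy_df["Street_Number"] + " " + toy_df["Street_Name"] + " "
    + toy_df["Street_Type"] + " " + toy_df["Street_Dir_Suffix"]
)

# Still vectorized: treat missing parts as empty strings, then tidy the spacing.
safe = (
    toy_df[address_parts[0]]
    .str.cat(toy_df[address_parts[1:]], sep=" ", na_rep="")
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(naive.isna().tolist())
print(safe.tolist())
```

If the registry does contain empty address parts, rows with a missing combined value cannot be matched on name + address, so it may be worth confirming before comparing match rates across searches.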



In [6]:
start_time = time.time()
voter_records_2023_df["Full Name"] = voter_records_2023_df["First_Name"] + ' ' + voter_records_2023_df["Last_Name"]
voter_records_2023_df["Full Address"] =  voter_records_2023_df["Street_Number"] + " " + voter_records_2023_df["Street_Name"] + " " + voter_records_2023_df["Street_Type"] + " " + voter_records_2023_df["Street_Dir_Suffix"]
voter_records_2023_df["Full Name and Full Address"] = voter_records_2023_df["Full Name"] + ' ' + voter_records_2023_df["Full Address"]
end_time = time.time()

logger.info(f"Initialized columns in {end_time - start_time:.2f} seconds")

[31m[4m2024-08-20T23:00:47.043231-0400[0m[31m[0m [32mINFO[0m [30m[1mInitialized columns in 1.00 seconds[0m[30m[0m


## Extract

`process.extract` simplifies the matching process significantly. We pass it a query and a collection of choices to check the query against, and it returns a list of tuples sorted from the highest match score to the lowest.

In [7]:
def score_extract_default(query_name, names_list):
    # process.extract defaults to the fuzz.WRatio scorer; we pass fuzz.ratio explicitly to get a plain normalized similarity on a 0-100 scale
    # default_process is a processor that trims surrounding whitespace, lowercases all letters, and strips non-alphanumeric characters
    # limiting to 5 saves a little time and code, since we previously scored every item, sorted the full list, and kept only the top 5
    list_of_match_tuples = process.extract(query=query_name, choices=names_list, scorer=fuzz.ratio, processor=utils.default_process, limit=5)
    # this will produce a list of tuples whose values are as follows:
    # (matched record: the choice that the query matched,
    # score: the result of fuzz.ratio between the query and the matched record,
    # index: a positional index when the choices are a plain iterable such as a Python list, or the index label (key) when the choices are a pandas Series)
    return list_of_match_tuples
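To make the shape of the returned tuples concrete without the registry file, here is a minimal stand-in that needs only the standard library. It is a sketch, not rapidfuzz: `toy_ratio` and `toy_extract` are hypothetical names, and `difflib.SequenceMatcher` is swapped in for `fuzz.ratio`, so the scores will differ from the real scorer. What it illustrates is the `(choice, score, key)` layout and the difference between positional indices and keys.

```python
from difflib import SequenceMatcher
import heapq

def toy_ratio(a, b):
    """Stand-in similarity on a 0-100 scale (NOT rapidfuzz's fuzz.ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def toy_extract(query, choices, limit=5):
    """Return up to `limit` (choice, score, key) tuples, best score first."""
    # Mappings yield their own keys; plain sequences yield positional indices.
    items = choices.items() if hasattr(choices, "items") else enumerate(choices)
    cleaned_query = query.lower().strip()
    scored = (
        (choice, toy_ratio(cleaned_query, str(choice).lower().strip()), key)
        for key, choice in items
    )
    return heapq.nlargest(limit, scored, key=lambda match: match[1])

names = ["Ana Lopez", "Ben Okafor", "Anna Lopes"]  # made-up names
top_two = toy_extract("ana lopez", names, limit=2)
by_key = toy_extract("ben okafor", {"v7": "Ben Okafor"}, limit=1)
print(top_two)
print(by_key)
```

The third element is what lets a later step look the matched row back up, which is why it matters whether the choices were a list or a keyed collection.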

### By Name + Address

The data returned by OCR is a dictionary with keys for Name, Address, Ward, and Date. We combine the values for Name and Address and check the result against the "Full Name and Full Address" column we created earlier. Note that `fuzz.ratio` returns scores on a 0-100 scale rather than the 0-1 scale used above, so the validity threshold changes from 0.85 to 85.0.

In [8]:
matched_list = list()
start_time = time.time()

for dict_ in resulting_data:
    name_address_combo = f"{dict_['Name']} {dict_['Address']}"
    temp_dict = dict()
    high_match_ids = score_extract_default(name_address_combo, voter_records_2023_df["Full Name and Full Address"])
    record_, score_, id_ = high_match_ids[0]
    temp_dict['OCR RECORD'] = name_address_combo
    temp_dict['MATCHED RECORD'] = record_
    temp_dict['SCORE'] = "{:.2f}".format(score_)
    temp_dict['VALID'] = False
    if score_ > 85.0:
        temp_dict['VALID'] = True
    matched_list.append(temp_dict)

match_df = pd.DataFrame(matched_list, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
end_time = time.time()
# match_df

In [9]:
valid_matches = match_df['VALID'].sum()
total_records = len(resulting_data)
logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

[31m[4m2024-08-20T23:01:04.648341-0400[0m[31m[0m [32mINFO[0m [30m[1mMatch Time 17.579 secs | Matched Records: 32 of 66 - 48.48%[0m[30m[0m


### By Ward

We are now searching by name and address, but combing through the entire registry dataframe is more time intensive than we'd like. Since the data returned by OCR includes a Ward value, we can search name and address only within that Ward and cut the search time considerably. Because the WARD value in the voter registry is formatted as a float, we have to do a little string formatting before comparing.
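The string formatting can be made a little more defensive. The helper below is a hypothetical sketch (`ward_key` is not part of the notebook) that assumes registry wards are whole numbers stored as strings with a trailing `.0`. It normalizes whatever OCR returns to that form and gives back `None` when the value cannot be read as a whole number, so the caller can skip the in-Ward search.

```python
import math

def ward_key(value):
    """Normalize an OCR Ward value to the registry's float-formatted string.

    Returns None when the value is not a finite whole number.
    """
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    if not math.isfinite(number) or not number.is_integer():
        return None
    return f"{int(number)}.0"

# A few inputs OCR might plausibly return (made-up examples).
for raw in ["5", 5, "5.0", " 12 ", "S", ""]:
    print(repr(raw), "->", ward_key(raw))
```

Compared with appending `.0` directly, this avoids building keys like `5.0.0` when OCR already returns a decimal, and it makes unreadable Ward values explicit.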

In [10]:
matched_list = list()
start_time = time.time()

for dict_ in resulting_data:
    name_address_combo = f"{dict_['Name']} {dict_['Address']}"
    temp_dict = dict()
    high_match_ids = score_extract_default(name_address_combo, voter_records_2023_df[voter_records_2023_df['WARD'] == f"{dict_['Ward']}.0"]["Full Name and Full Address"])
    record_, score_, id_ = high_match_ids[0]
    temp_dict['OCR RECORD'] = name_address_combo
    temp_dict['MATCHED RECORD'] = record_
    temp_dict['SCORE'] = "{:.2f}".format(score_)
    temp_dict['VALID'] = False
    if score_ > 85.0:
        temp_dict['VALID'] = True
    matched_list.append(temp_dict)

match_df = pd.DataFrame(matched_list, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
end_time = time.time()
# match_df

In [11]:
valid_matches = match_df['VALID'].sum()
total_records = len(match_df)
logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

[31m[4m2024-08-20T23:01:13.234195-0400[0m[31m[0m [32mINFO[0m [30m[1mMatch Time 8.572 secs | Matched Records: 26 of 66 - 39.39%[0m[30m[0m


## Hierarchical Search

At this point we can match by full name + address, either against the whole registry or within a Ward. OCR is imperfect, however, and has a tendency to return incorrect or entirely hallucinated Ward values. Searching by Ward is therefore not always optimal, but if we can find a valid match within the Ward returned by OCR, we'd like to keep it without searching further. A hierarchical search lets us split the difference between speed and accuracy. If a valid match is found in the Ward, we return it; if not, we move on to a name + address search against the rest of the registry. If that also fails to produce a valid match, we fall back to the original Full Name search and return whichever of the three candidates scored highest.
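The control flow described above can be sketched independently of the registry. In this minimal illustration, `first_passing_tier` and the `tier` factory are hypothetical names, each tier is a zero-argument callable returning a `(record, score, key)` tuple, and the scores are made up. The point is only that later, more expensive tiers never run once an earlier tier clears the threshold.

```python
def first_passing_tier(tiers, threshold=85.0):
    """Run tiers in order; return the first result at or above `threshold`.

    If no tier passes, return the highest-scoring result seen
    (the earliest tier wins ties).
    """
    best = None
    for run_tier in tiers:
        result = run_tier()
        if result[1] >= threshold:
            return result  # stop early; later tiers are never evaluated
        if best is None or result[1] > best[1]:
            best = result
    return best

calls = []

def tier(name, score):
    """Build a fake tier that records when it runs (made-up scores)."""
    def run():
        calls.append(name)
        return (name, score, None)
    return run

hit = first_passing_tier([tier("ward", 91.0), tier("rest", 99.0), tier("name", 70.0)])
calls_on_hit = list(calls)

calls.clear()
miss = first_passing_tier([tier("ward", 40.0), tier("rest", 62.5), tier("name", 80.0)])
calls_on_miss = list(calls)
print(hit, calls_on_hit)
print(miss, calls_on_miss)
```

Note the trade-off this makes visible: a passing in-Ward match is returned even if a later tier would have scored higher.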

The first step in implementing this search is to refactor the extract function so that we can use different scorers for different searches and adjust the number of results returned. The fuzz.token_ratio scorer seems to perform better when matching addresses, and fuzz.ratio seems to perform better when matching names; however, not every combination of the scorers offered by rapidfuzz has been tested so far.
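One plausible reason token-based scorers do better on addresses is that they are insensitive to word order, while a plain ratio is not. The sketch below illustrates the idea with the standard library only. `toy_ratio` and `toy_token_sort_ratio` are hypothetical stand-ins built on `difflib.SequenceMatcher`, not the rapidfuzz implementations, and the two strings are made up.

```python
from difflib import SequenceMatcher

def toy_ratio(a, b):
    """Stand-in similarity on a 0-100 scale (NOT rapidfuzz's fuzz.ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def toy_token_sort_ratio(a, b):
    """Sort whitespace-separated tokens before comparing, so word order is ignored."""
    def normalize(text):
        return " ".join(sorted(text.lower().split()))
    return toy_ratio(normalize(a), normalize(b))

ocr_text = "12 Main St Lopez Ana"       # same tokens, different order
registry_text = "ANA LOPEZ 12 MAIN ST"

plain_score = toy_ratio(ocr_text.lower(), registry_text.lower())
sorted_score = toy_token_sort_ratio(ocr_text, registry_text)
print(f"plain: {plain_score:.2f}  token-sorted: {sorted_score:.2f}")
```

The same insensitivity is a liability when word order carries meaning, which may be part of why the plain ratio holds up better on names.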

In [12]:
def score_extract(ocr_name, full_name_list, scorer_=fuzz.token_ratio, limit_=1):
    list_of_match_tuples = process.extract(query=ocr_name, choices=full_name_list, scorer=scorer_, processor=utils.default_process, limit=limit_)
    return list_of_match_tuples

Given the length of the tiered search logic, it makes sense to move it into its own function to keep the matching loop legible.

In [13]:
def tiered_search(name, address, ward):

    name_address_combo = f"{name} {address}"

    # rows whose WARD matches the Ward returned by OCR (the registry stores it as a float-formatted string)
    in_ward = voter_records_2023_df['WARD'] == f"{ward}.0"

    # searches for a match within the Ward returned by OCR
    name_address_matches1 = score_extract(name_address_combo, voter_records_2023_df[in_ward]["Full Name and Full Address"])
    if name_address_matches1:
        name_address__name1, name_address__score1, name_address__id1 = name_address_matches1[0]
    else:
        # no registry rows for this Ward value (e.g. a misread Ward), so fall through to the next tier
        name_address__name1, name_address__score1, name_address__id1 = None, 0.0, None

    # if score is at least 85, return the tuple
    if name_address__score1 >= 85.0:
        return (name_address__name1, name_address__score1, name_address__id1)

    # if score is below 85, do additional processing
    else:

        # computing matches based on name and address; considers only the other wards
        name_address_matches2 = score_extract(name_address_combo, voter_records_2023_df[~in_ward]["Full Name and Full Address"])
        name_address__name2, name_address__score2, name_address__id2 = name_address_matches2[0]

        # if the new voter records score is at least 85, return tuple
        if name_address__score2 >= 85.0:
            return (name_address__name2, name_address__score2, name_address__id2)

        # if score is less than 85, perform full records search based on name
        # and return results with highest score
        else:
            # computing matches based on name alone; considers full voter records
            full_name_matches = score_extract(name, voter_records_2023_df["Full Name"], scorer_=fuzz.ratio)
            full_name__name, full_name__score, full_name__id = full_name_matches[0]

            # find max from three scores
            max_indx = np.argmax([name_address__score1, name_address__score2, full_name__score])

            # return records associated with that max
            if max_indx == 0:
                return (name_address__name1, name_address__score1, name_address__id1)
            elif max_indx == 1:
                return (name_address__name2, name_address__score2, name_address__id2)
            else:
                address = voter_records_2023_df.loc[full_name__id, 'Full Address']
                full_name_address = f"{full_name__name} {address}"
                return (full_name_address, full_name__score, full_name__id)

In [14]:
matched_list = list()
start_time = time.time()

for dict_ in resulting_data:
    temp_dict = dict()
    # pass the Ward explicitly so tiered_search does not depend on the loop variable
    name_, score_, id_ = tiered_search(dict_['Name'], dict_['Address'], dict_['Ward'])
    temp_dict['OCR RECORD'] = f"{dict_['Name']} {dict_['Address']}"
    temp_dict['MATCHED RECORD'] = name_
    temp_dict['SCORE'] = "{:.2f}".format(score_)
    temp_dict['VALID'] = False
    if score_ > 85.0:
        temp_dict['VALID'] = True
    matched_list.append(temp_dict)

## Editable Table
match_df = pd.DataFrame(matched_list, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
end_time = time.time()
# match_df

In [15]:
total_records = len(match_df)
valid_matches = match_df["VALID"].sum()
logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

[31m[4m2024-08-20T23:03:02.109415-0400[0m[31m[0m [32mINFO[0m [30m[1mMatch Time 108.843 secs | Matched Records: 59 of 66 - 89.39%[0m[30m[0m


# Benchmarking Scorers

Below are benchmarks of all scorers available through rapidfuzz for the name + address search and the full name search. I've commented out the actual matching and tables so that this notebook runs faster and doesn't expose voter information. The match time is system dependent but should give a general idea of how fast the scorers are relative to each other. Unfortunately, visualizing this data is difficult, because the highest match rates usually come with the highest rates of false positives.

Cursory analysis would suggest fuzz.ratio as the best scorer for the Full Name search and either fuzz.token_set_ratio (accuracy) or fuzz.token_sort_ratio (speed, fewer false positives) as the best scorer for the Full Name + Full Address search.

First, let's create a function to simplify running the loop for test data.

In [16]:
def benchmark_scorer(record_list, scorer=fuzz.WRatio, name_address=False):
    matched_list = list()
    if name_address:
        df = voter_records_2023_df["Full Name and Full Address"]
    else:
        df = voter_records_2023_df["Full Name"]
        
    for dict_ in record_list:
        temp_dict = dict()
        if name_address:
            record_name_address = f"{dict_['Name']} {dict_['Address']}"
            matched_records = score_extract(record_name_address, df, scorer_=scorer)
            name_, score_, id_ = matched_records[0]
        else:
            matched_records = score_extract(dict_['Name'], df, scorer_=scorer)
            name_, score_, id_ = matched_records[0]
        temp_dict['OCR RECORD'] = f"{dict_['Name']} {dict_['Address']}"
        temp_dict['MATCHED RECORD'] = name_
        temp_dict['SCORE'] = "{:.2f}".format(score_)
        temp_dict['VALID'] = False
        if score_ > 85.0:
            temp_dict['VALID'] = True
        matched_list.append(temp_dict)
    
    ## Editable Table
    return matched_list
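Each benchmark below repeats the same timing and summary lines, so a pair of small helpers can cut the repetition. This is a sketch with hypothetical names (`timed`, `summarize`). It assumes only that each row is a dictionary with a boolean `VALID` key, as produced by `benchmark_scorer`, and it uses `time.perf_counter` rather than `time.time` because it is intended for measuring intervals.

```python
import time

def timed(fn, *args, **kwargs):
    """Call `fn` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def summarize(matched_list, elapsed):
    """Format the same summary line that the notebook logs after each run."""
    total = len(matched_list)
    valid = sum(1 for row in matched_list if row["VALID"])
    rate = 100.0 * valid / total if total else 0.0
    return f"Match Time {elapsed:.3f} secs | Matched Records: {valid} of {total} - {rate:.2f}%"

# Made-up rows standing in for benchmark_scorer output.
example_rows = [{"VALID": True}, {"VALID": False}]
example_line = summarize(example_rows, 1.5)
print(example_line)

# In the notebook this might be used as:
#   rows, secs = timed(benchmark_scorer, resulting_data, scorer=fuzz.ratio)
#   logger.info(summarize(rows, secs))
```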

## fuzz.WRatio
### Name and Address

Match time: 140s  
Match rate: 100.00%  
Notes: Returns obvious false positives. All records seem to have a match score floor of 85.50, which sits just above the 85.0 validity threshold. Without further research this is unusable for our purposes.
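One possible explanation for the 85.50 floor, offered as a hypothesis rather than a confirmed finding: as I understand the WRatio definition that rapidfuzz inherits from fuzzywuzzy, when the two strings differ in length by a factor between 1.5 and 8 it also considers a partial token score weighted by 0.95 and then by 0.9. The partial token score reaches 100 whenever the strings share a whole token, and 100 × 0.95 × 0.9 = 85.5. The constants below are assumptions taken from that definition and should be checked against the rapidfuzz documentation for the installed version.

```python
# Assumed WRatio weights (verify against the rapidfuzz docs for your version).
UNBASE_SCALE = 0.95   # weight applied to token-based scores
PARTIAL_SCALE = 0.90  # weight applied to partial scores for length ratios from 1.5 up to 8
VALID_THRESHOLD = 85.0

def weighted_partial_token_term(partial_token_score):
    """Contribution of a partial token score to WRatio under the assumed weights."""
    return partial_token_score * UNBASE_SCALE * PARTIAL_SCALE

# A shared token (for example a street type) is assumed to give a partial token score of 100.
score_floor = weighted_partial_token_term(100.0)
print(round(score_floor, 2), score_floor > VALID_THRESHOLD)
```

If this holds, any query that shares one token with some registry record of sufficiently different length would clear the 85.0 threshold, which would be consistent with the 100% match rate seen here.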

In [17]:
# start_time = time.time()
# name_address_results__WRatio = benchmark_scorer(resulting_data, name_address=True)
# match_df = pd.DataFrame(name_address_results__WRatio, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [18]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name

Match time: 64s  
Match rate: 98.48%  
Notes: Unusable for the current search. WRatio returns fewer false positives when used in the full name check, but only because it seems to find matches more consistently. Where there are no good matches, it still returns obvious mismatches with an 85.5 match score.

In [19]:
# start_time = time.time()
# full_name_results__WRatio = benchmark_scorer(resulting_data)
# match_df = pd.DataFrame(full_name_results__WRatio, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [20]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.ratio

### Name & Address
Match time: 14.3s  
Match rate: 48.48%

In [21]:
# start_time = time.time()
# name_address_results__r = benchmark_scorer(resulting_data, scorer=fuzz.ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__r, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [22]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name

Match time: 11.327s  
Match rate: 80.30%  
Notes: This is the most consistently performant and accurate scorer for the full name check.

In [23]:
# start_time = time.time()
# full_name_results__r = benchmark_scorer(resulting_data, scorer=fuzz.ratio)
# match_df = pd.DataFrame(full_name_results__r, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [24]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.partial_ratio

### Name & Address

Match time: 77.635s  
Match rate: 51.52%

In [25]:
# start_time = time.time()
# name_address_results__pr = benchmark_scorer(resulting_data, scorer=fuzz.partial_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__pr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [26]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name

Match time: 32.60s  
Match rate: 85.45%  
Notes: Unusable due to the rate of false positives.

In [27]:
# start_time = time.time()
# full_name_results__pr = benchmark_scorer(resulting_data, scorer=fuzz.partial_ratio)
# match_df = pd.DataFrame(full_name_results__pr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [28]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.token_set_ratio

### Name & Address

Match time: 122.430s  
Match rate: 57.58%  
Notes: A very accurate scorer with a low false positive rate in the OCR data set, but slow.

In [29]:
# start_time = time.time()
# name_address_results__psetr = benchmark_scorer(resulting_data, scorer=fuzz.token_set_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__psetr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [30]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name
Match time: 75.748s  
Match rate: 51.52%

In [31]:
# start_time = time.time()
# full_name_results__psetr = benchmark_scorer(resulting_data, scorer=fuzz.token_set_ratio)
# match_df = pd.DataFrame(full_name_results__psetr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [32]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.partial_token_set_ratio

### Name & Address
Match time: 0.003s  
Match rate: 100%  
Notes: Returns the first record in the database for nearly every query; unusable.
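A likely reason for this behavior, stated as my understanding rather than something verified in this notebook: partial_token_set_ratio returns 100 as soon as the two strings share any whole token, and combined name + address strings almost always share something like a street type. When every candidate scores 100, the first one encountered wins. The sketch below is a deliberately simplified stand-in (`toy_partial_token_set` is hypothetical, and the real scorer computes a partial ratio instead of 0 when no token is shared) over made-up records.

```python
def toy_partial_token_set(a, b):
    """Simplified stand-in: 100 when any whole token is shared, otherwise 0."""
    shared = set(a.lower().split()) & set(b.lower().split())
    return 100.0 if shared else 0.0

# Made-up registry rows; all of them share the token "ST" with the query.
registry = [
    "ANA LOPEZ 12 MAIN ST",
    "BEN OKAFOR 7 OAK ST",
    "CARL DIAZ 90 ELM ST",
]
query = "Carl Diaz 90 Elm St"

scores = [toy_partial_token_set(query, record) for record in registry]
# max() keeps the first of several equal maxima, so the tie goes to row 0.
best_index = max(range(len(registry)), key=scores.__getitem__)
print(scores, registry[best_index])
```

The near-instant match time may have the same cause, if the search stops as soon as it finds a perfect score.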

In [33]:
# start_time = time.time()
# name_address_results__ptsetr = benchmark_scorer(resulting_data, scorer=fuzz.partial_token_set_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__ptsetr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [34]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name

Match time: 1.766s  
Match rate: 100%  
Notes: Unusable/false positives.

In [35]:
# start_time = time.time()
# full_name_results__ptsetr = benchmark_scorer(resulting_data, scorer=fuzz.partial_token_set_ratio)
# match_df = pd.DataFrame(full_name_results__ptsetr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [36]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.token_sort_ratio

### Name & Address

Match time: 58.513s  
Match rate: 46.97%  
Notes: This is far faster than fuzz.token_set_ratio and does not return many false positives, which may be why its match rate is so low. I would recommend it for use. The match rate would suffer, but it would probably better represent the validity of the matches.

In [37]:
# start_time = time.time()
# name_address_results__tsr = benchmark_scorer(resulting_data, scorer=fuzz.token_sort_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__tsr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [38]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name
Match time: 25.767s  
Match rate: 80.30%  
Notes: A solid blend of performance and accuracy, but ratio is still probably the better scorer.

In [39]:
# start_time = time.time()
# full_name_results__tsr = benchmark_scorer(resulting_data, scorer=fuzz.token_sort_ratio)
# match_df = pd.DataFrame(full_name_results__tsr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [40]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.partial_token_sort_ratio

### Name & Address
Match time: 124.710s  
Match rate: 54.55%  
Notes: One of the slowest scorers, with some obvious false positives. 

In [41]:
# start_time = time.time()
# name_address_results__ptsr = benchmark_scorer(resulting_data, scorer=fuzz.partial_token_sort_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__ptsr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [42]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name
Match time: 54.77s  
Match rate: 96.97%  
Notes: Unusable/false positives.

In [43]:
# start_time = time.time()
# full_name_results__ptsr = benchmark_scorer(resulting_data, scorer=fuzz.partial_token_sort_ratio)
# match_df = pd.DataFrame(full_name_results__ptsr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [44]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.token_ratio

### Name & Address
Match time: 144.637s  
Match rate: 57.58%  
Notes: Slow, probably a bit too strict on some matching, but strong accuracy.

In [45]:
# start_time = time.time()
# name_address_results__tr = benchmark_scorer(resulting_data, scorer=fuzz.token_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__tr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [46]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name
Match time: 59.102s  
Match rate: 83.33%

In [47]:
# start_time = time.time()
# full_name_results__tr = benchmark_scorer(resulting_data, scorer=fuzz.token_ratio)
# match_df = pd.DataFrame(full_name_results__tr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [48]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

## fuzz.partial_token_ratio

### Name & Address
Match time: 0.002s  
Match rate: 100%  
Notes: Unusable/false positives.

In [49]:
# start_time = time.time()
# name_address_results__ptr = benchmark_scorer(resulting_data, scorer=fuzz.partial_token_ratio, name_address=True)
# match_df = pd.DataFrame(name_address_results__ptr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [50]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")

### Full Name
Match time: 1.6s  
Match rate: 100%  
Notes: Unusable/false positives.

In [51]:
# start_time = time.time()
# full_name_results__ptr = benchmark_scorer(resulting_data, scorer=fuzz.partial_token_ratio)
# match_df = pd.DataFrame(full_name_results__ptr, columns=["OCR RECORD", "MATCHED RECORD", "SCORE", "VALID"])
# end_time = time.time()
# match_df

In [52]:
# total_records = len(match_df)
# valid_matches = match_df["VALID"].sum()
# logger.info(f"Match Time {end_time-start_time:.3f} secs | Matched Records: {valid_matches} of {total_records} - {(valid_matches/total_records * 100):.2f}%")