# Evaluation

Let's compare StringPairFinder with the popular string matching tool "FuzzyWuzzy".

In [1]:
# !pip install fuzzywuzzy

In [None]:
import pandas as pd
import stringpairfinder as spf

from fuzzywuzzy import fuzz

Using the `fuzz.ratio` scorer (FuzzyWuzzy's basic similarity measure), let's build an equivalent of StringPairFinder's `match_strings` function:

In [3]:
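def fuzzywuzzy_match_strings(source_strings, target_strings):
    # Clean the targets once instead of once per source string.
    cleaned_targets = [spf.clean_string(option) for option in target_strings]
    mapping = {}
    for string in source_strings:
        cleaned_source = spf.clean_string(string)
        # Tuples compare by score first, then index, so ties go to the highest index.
        _, max_index = max(
            (fuzz.ratio(cleaned_source, cleaned), index)
            for index, cleaned in enumerate(cleaned_targets)
        )
        mapping[string] = target_strings[max_index]

    return mapping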
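
To make the selection logic above concrete without installing anything, here is a minimal standard-library sketch of the same "best candidate wins" idea. It is not FuzzyWuzzy or StringPairFinder: `similarity` uses `difflib.SequenceMatcher` as a stand-in scorer, and `simple_clean` is a hypothetical replacement for `spf.clean_string`, whose exact behavior is not shown here.

```python
from difflib import SequenceMatcher


def simple_clean(text):
    # Hypothetical stand-in for spf.clean_string: lowercase, keep letters, digits and spaces.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def similarity(a, b):
    # Stand-in scorer on a 0-100 scale, built on difflib's ratio.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match_mapping(source_strings, target_strings):
    # Map each source string to the target with the highest similarity score.
    cleaned_targets = [simple_clean(target) for target in target_strings]
    mapping = {}
    for source in source_strings:
        cleaned_source = simple_clean(source)
        best_index = max(
            range(len(target_strings)),
            key=lambda i: similarity(cleaned_source, cleaned_targets[i]),
        )
        mapping[source] = target_strings[best_index]
    return mapping


demo_mapping = best_match_mapping(
    ["UroLogix", "Bockorny Group"],
    ["BOCKORNY GROUP", "UROLOGIX"],
)
print(demo_mapping)
```

Note that `max` with a `key` returns the first best candidate on ties, so tie-breaking in this sketch may differ from the function above.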

The dataset used for this example contains pairs of differently formatted company names; the cell below loads up to the first 1,000 rows. The objective is to use an algorithm to recover the corresponding pairs.

In [4]:
companies = pd.read_csv('data/companies.csv').iloc[0:1000]
companies

Unnamed: 0,list1,list2
0,UroLogix,UROLOGIX
1,Tesla Motors,TESLA INC
2,"Lopez, John",John Lopez
3,TSC Group,TSC INC
4,Bockorny Group,BOCKORNY GROUP
...,...,...
995,PaxVax Inc,"PaxVax, Inc."
996,Biocatalytics,BIOCATALYTICS
997,Support Kids,SUPPORT KIDS
998,CancerVax Corp,CANCERVAX CORP


Let's simulate a situation where the two lists of company names are each shuffled independently, so their rows no longer line up.

In [5]:
list1 = companies['list1'].sample(frac=1).to_list()
list2 = companies['list2'].sample(frac=1).to_list()

print('Subsample of list 1: \n', list1[0:8])
print('\nSubsample of list 2: \n', list2[0:8])

Subsample of list 1: 
 ['From the Top', 'Rex Systems', 'David & Malito', 'Intrafusion', 'Owens-Illinois', 'US Oncology', 'Dueco Inc', 'Algoma Steel']

Subsample of list 2: 
 ['BBK', 'Duffin Newman', 'HARSCO CORP', 'HASTINGS, NE', 'Solazyme Inc', 'HALE & DORR', 'MEGAXESS, INC', 'WELLMARK INC']
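
Because each list is shuffled on its own, position carries no information anymore and the matcher can rely only on the strings themselves. As a small illustration of that idea (a standard-library analogue, not the pandas call used above), the sketch below shuffles two aligned lists with separate seeded generators so the result is reproducible. The seeds are arbitrary, and the three pairs are copied from the table shown earlier.

```python
import random

# Three aligned pairs copied from the table above.
names_a = ["UroLogix", "Tesla Motors", "TSC Group"]
names_b = ["UROLOGIX", "TESLA INC", "TSC INC"]

# Each list gets its own seeded generator, hence its own reproducible permutation.
shuffled_a = random.Random(0).sample(names_a, k=len(names_a))
shuffled_b = random.Random(1).sample(names_b, k=len(names_b))

print(shuffled_a)
print(shuffled_b)
```

Fixing a seed is optional, but it makes a comparison like this one repeatable from run to run.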


Now let's apply both the FuzzyWuzzy and StringPairFinder algorithms to the two lists.

In [6]:
fuzzy_output = fuzzywuzzy_match_strings(list1, list2)

In [7]:
spf_output = spf.match_strings(list1, list2)

Next, let's build a table that gathers the predictions of both algorithms next to the true values, then compute each algorithm's success rate.

In [8]:
tab = companies.rename(columns={'list1': 'source', 'list2': 'true_value'})
tab['fuzzy_prediction'] = tab['source'].map(fuzzy_output)
tab['spf_prediction'] = tab['source'].map(spf_output)

tab.head()

Unnamed: 0,source,true_value,fuzzy_prediction,spf_prediction
0,UroLogix,UROLOGIX,UROLOGIX,UROLOGIX
1,Tesla Motors,TESLA INC,THERMONOR AS,TESLA INC
2,"Lopez, John",John Lopez,"GHERINI, JOHN",John Lopez
3,TSC Group,TSC INC,TSX GROUP,TSX GROUP
4,Bockorny Group,BOCKORNY GROUP,BOCKORNY GROUP,BOCKORNY GROUP


In [9]:
accuracy_fuzzy = (tab['fuzzy_prediction'] == tab['true_value']).mean()
accuracy_spf = (tab['spf_prediction'] == tab['true_value']).mean()

print(f"Fuzzy Wuzzy -> Percentage of success: {100 * accuracy_fuzzy}%")
print(f"StringPairFinder -> Percentage of success: {100 * accuracy_spf}%")

Fuzzy Wuzzy -> Percentage of success: 85.2%
StringPairFinder -> Percentage of success: 94.0%
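
The success rate above is the share of rows where the prediction is exactly equal to the true value. As a worked example, the sketch below repeats that calculation in plain Python on the five rows shown in the `tab.head()` preview. The helper name `accuracy` is made up for illustration, and five rows are far too few to say anything about overall performance.

```python
# (source, true_value, fuzzy_prediction, spf_prediction), copied from the preview above.
rows = [
    ("UroLogix", "UROLOGIX", "UROLOGIX", "UROLOGIX"),
    ("Tesla Motors", "TESLA INC", "THERMONOR AS", "TESLA INC"),
    ("Lopez, John", "John Lopez", "GHERINI, JOHN", "John Lopez"),
    ("TSC Group", "TSC INC", "TSX GROUP", "TSX GROUP"),
    ("Bockorny Group", "BOCKORNY GROUP", "BOCKORNY GROUP", "BOCKORNY GROUP"),
]


def accuracy(predictions, truths):
    # Fraction of predictions that are exactly equal to the expected value.
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)


truths = [row[1] for row in rows]
fuzzy_accuracy = accuracy([row[2] for row in rows], truths)
spf_accuracy = accuracy([row[3] for row in rows], truths)

# Rows where the StringPairFinder prediction differs from the true value.
spf_misses = [(row[0], row[3], row[1]) for row in rows if row[3] != row[1]]

print(f"fuzzy: {fuzzy_accuracy:.0%}, spf: {spf_accuracy:.0%}")  # fuzzy: 40%, spf: 80%
print(spf_misses)  # [('TSC Group', 'TSX GROUP', 'TSC INC')]
```

Listing the mismatches alongside the score is a quick way to see which kinds of names each algorithm gets wrong.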
