<a href="https://colab.research.google.com/github/AntoinePinto/string-pair-finder/blob/master/evaluation/evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/AntoinePinto/string-pair-finder.git
import sys
sys.path.append('/content/string-pair-finder')

In [15]:
import pandas as pd
import numpy as np

import string_pair_finder

The performance of StringPairFinder is compared here with that of the existing FuzzyWuzzy library, which computes a similarity score between two strings.
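To give a feel for what a string similarity score measures, here is a minimal sketch that uses only the standard library. It is not FuzzyWuzzy itself: the helper name `similarity` is hypothetical, and the use of `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` is an assumption, so exact values may differ from the library's.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Hypothetical helper: similarity from 0 to 100, in the spirit of fuzz.ratio."""
    # SequenceMatcher.ratio() returns a float in [0, 1] based on matching blocks.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Identical strings get the maximum score.
same_case = similarity("Chubb Corp", "Chubb Corp")
# The comparison is character based, so a change of case lowers the score.
mixed_case = similarity("Chubb Corp", "CHUBB CORP")
# Lower-casing both sides first removes that effect.
normalized = similarity("Chubb Corp".lower(), "CHUBB CORP".lower())

print(same_case, mixed_case, normalized)
```

Because this kind of score is case-sensitive, normalizing case before comparing is a common preprocessing step.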

In [10]:
!pip install fuzzywuzzy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [11]:
from itertools import product
from fuzzywuzzy import fuzz

def fuzzy_wuzzy_pair_finder(list1: list, list2: list) -> pd.DataFrame:
    """Return each string in list1 paired with the string in list2 that has
    the highest FuzzyWuzzy similarity score."""
    # Score every (list1, list2) combination: len(list1) * len(list2) comparisons.
    combinations = list(product(list1, list2))
    data = pd.DataFrame(combinations, columns=['list1', 'list2'])
    data['score'] = [fuzz.ratio(i, j) for i, j in combinations]
    # For each list1 entry, keep only the row with the highest score.
    data = data.loc[data.groupby("list1")["score"].idxmax()].set_index('list1')

    return data
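
The selection step in the function above can be sketched without pandas. This is an illustration under assumptions, not the notebook's method: `best_matches` is a hypothetical helper, `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and the inputs are toy strings.

```python
from difflib import SequenceMatcher
from itertools import product


def best_matches(left, right):
    """Hypothetical helper: map each item of `left` to its best-scoring item of `right`."""
    best = {}
    for a, b in product(left, right):
        score = SequenceMatcher(None, a, b).ratio()
        # Keep the first candidate that reaches the highest score for `a`.
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return {a: b for a, (b, _) in best.items()}


# Toy inputs: each left item is scored against every right item.
pairs = best_matches(["esa inc", "tsc group"], ["tsc inc", "esa"])
print(pairs)
```

Each item is matched independently of the others, so nothing prevents two entries of the first list from being paired with the same entry of the second list, as happens with these toy inputs.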



The dataset used for this example contains 300 company names, each written in two different forms (columns `list1` and `list2`). The objective is to use an algorithm to recover the corresponding pairs.

In [13]:
companies = pd.read_csv('/content/string-pair-finder/evaluation/data/companies.csv').iloc[0:300]
companies

Unnamed: 0,list1,list2
0,UroLogix,UROLOGIX
1,Tesla Motors,TESLA INC
2,"Lopez, John",John Lopez
3,TSC Group,TSC INC
4,Bockorny Group,BOCKORNY GROUP
...,...,...
295,Chubb Corp,CHUBB CORP
296,Regence Group,REGENCE GROUP
297,XBRL US Inc,"XBRL US, INC."
298,ESA Inc,ESA


Let's simulate a situation in which the two lists of company names are shuffled independently, so that the order of the entries no longer reveals the pairs.

In [14]:
list1 = companies['list1'].sample(frac=1).to_list()
list2 = companies['list2'].sample(frac=1).to_list()

print('Subsample of list 1: \n', list1[0:8])
print('\nSubsample of list 2: \n', list2[0:8])

Subsample of list 1: 
 ['ESPP Coalition', 'Nanomech LLC', 'Abiomed Inc', 'Westvaco Corp', 'Kilkenny, Alan', 'Symonds NA', 'Lundbeck Inc', 'Curium US']

Subsample of list 2: 
 ['NOBLE VENTURES', 'HNTB CORP', 'KENNAMETAL INC', 'FLYTECOMM, INC', 'ENTIA VENTURES', 'AMGEN, INC.', 'U S STEEL CORP', 'AQUILA INC']


Apply both StringPairFinder and the FuzzyWuzzy-based pair finder to the two lists.

In [16]:
SPF_output = string_pair_finder.get_pairs(list1, list2)
fuzzy_output = fuzzy_wuzzy_pair_finder(list1, list2)

Combine the results of the two algorithms into a single DataFrame and add, for each algorithm, a column indicating whether the match is correct.

In [17]:
results = pd.DataFrame({'match_StringPairFinder': SPF_output['list2'],
                        'match_fuzzy': fuzzy_output['list2'],
                        'actual' : companies.set_index('list1')['list2']})
results['success_StringPairFinder'] = np.where(results['match_StringPairFinder'] == results['actual'], 'success', 'fail')
results['success_fuzzy'] = np.where(results['match_fuzzy'] == results['actual'], 'success', 'fail')

The StringPairFinder algorithm succeeded in matching 276 of the 300 company names (92% success rate).

The FuzzyWuzzy-based algorithm matched only 98 of the 300 company names (about 33% success rate).

In [18]:
results['success_StringPairFinder'].value_counts()

success    276
fail        24
Name: success_StringPairFinder, dtype: int64

In [19]:
results['success_fuzzy'].value_counts()

fail       202
success     98
Name: success_fuzzy, dtype: int64
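
The success rates quoted above follow directly from these counts. As a small worked check, with the counts copied by hand from the two outputs (the variable names are illustrative only):

```python
# Counts taken from the value_counts() outputs above.
total = 300
successes = {"StringPairFinder": 276, "FuzzyWuzzy": 98}

# Success rate in percent for each algorithm.
rates = {name: 100 * n / total for name, n in successes.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
# StringPairFinder: 92.0%
# FuzzyWuzzy: 32.7%
```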