# Fuzzy Matching

This notebook compares "official" organization names from two databases: the Orbis company name database, and the GDelt knowledge graph of news articles. "Official" Orbis company names are reconciled with alternative organization identifiers, taking in an Orbis company name and returning all alternative names mentioned in the GDelt dataset, such as alternative spellings and naming variations that refer to the same organization.

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Execution settings

Set the input sources, the number of Orbis company names, and the range of GDelt organization names (ranked by mention frequency) to process.

In [2]:
# Inputs
ORBIS_INPUT = './input/orbis_test_small.xlsx' # Excel
GDELT_INPUT = './input/gdelt_test.csv' # CSV

# Number of Orbis records.
NUM_ROWS_ORBIS = 3000

# Range/number of GDelt records.
NUM_ROWS_GDELT_START = 0 
NUM_ROWS_GDELT_END = 1000

#### Orbis test datasets:
- List of Orbis companies in Sierra Leone: orbis_sierra_leone.xlsx
- Large list of Orbis companies in Sierra Leone: orbis_test_large.xlsx
- Small list of Orbis companies for testing: orbis_test_small.xlsx

#### GDelt test datasets:
- List of GDelt articles in Sierra Leone in 2020: gdelt_test.csv

#### Test inputs:
- https://drive.google.com/drive/folders/15QiHluI3dIIWPW6VLMXcxmeM2KuYD0JL?usp=sharing

# Input

In [3]:
import pandas as pd

### Orbis Data

Load an Excel file of "official" company names, which can be obtained by querying the Orbis company database. You can limit search results by adding filter criteria, such as the country of registration.

In [4]:
!pip install openpyxl

indata_orbis = pd.read_excel(ORBIS_INPUT)





In [5]:
len(indata_orbis)

16

In [6]:
indata_orbis = indata_orbis[:NUM_ROWS_ORBIS]

In [7]:
indata_orbis = indata_orbis[['Company name Latin alphabet']].dropna()

In [8]:
indata_orbis['Company name Latin alphabet'].apply(str.lower)

0                  international business machines corp
1                                            pfizer inc
2                                 eli lilly and company
3                       south african airways (pty) ltd
4                                  ryanair holdings plc
5                                      associated press
6     u.s. international development finance corpora...
7                       bill & melinda gates foundation
8                                      world bank group
9                 european union trading and agency sae
10                                  royal navy reserves
11                                 africell holding sal
12                                   clinton foundation
13    united nations international fund for agricult...
14                                alliance news limited
15                            world health organization
Name: Company name Latin alphabet, dtype: object

In [9]:
indata_orbis['name_original'] = indata_orbis['Company name Latin alphabet']
indata_orbis['name'] = indata_orbis['Company name Latin alphabet'].str.lower()
# Copy the two columns so later edits do not act on a view of indata_orbis.
outdata_orbis = indata_orbis[['name_original', 'name']].copy()
outdata_orbis.sort_values(by='name', inplace=True)

#### Clean company names

We need to clean company names to get rid of odd artifacts and other wrinkles. Remove anything in parentheses, all punctuation, and any extra whitespace.

In [10]:
indata_orbis.head()

Unnamed: 0,Company name Latin alphabet,name_original,name
0,INTERNATIONAL BUSINESS MACHINES CORP,INTERNATIONAL BUSINESS MACHINES CORP,international business machines corp
1,PFIZER INC,PFIZER INC,pfizer inc
2,ELI LILLY AND COMPANY,ELI LILLY AND COMPANY,eli lilly and company
3,SOUTH AFRICAN AIRWAYS (PTY) LTD,SOUTH AFRICAN AIRWAYS (PTY) LTD,south african airways (pty) ltd
4,RYANAIR HOLDINGS PLC,RYANAIR HOLDINGS PLC,ryanair holdings plc


In [11]:
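import re
import string

# Helper methods for name cleaning.

def remove_parenthesis(name):
    # Remove everything from the first "(" to the last ")".
    return re.sub(r'\(.*\)', '', name)

def remove_punctuation(name):
    # Delete every character listed in string.punctuation.
    return name.translate(str.maketrans('', '', string.punctuation))

def remove_double_space(name):
    # Collapse runs of whitespace into single spaces and trim the ends.
    return ' '.join(name.split())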
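
As a rough illustration of how these helpers combine, the sketch below chains them on one sample name. The `clean_name` wrapper is hypothetical (it is not part of the notebook), and the sketch leaves out the suffix-stripping step that cleanco handles afterwards.

```python
import re
import string

def remove_parenthesis(name):
    return re.sub(r'\(.*\)', '', name)

def remove_punctuation(name):
    return name.translate(str.maketrans('', '', string.punctuation))

def remove_double_space(name):
    return ' '.join(name.split())

def clean_name(name):
    # Hypothetical wrapper: lowercase, strip, then apply each helper in turn.
    name = name.lower().strip()
    name = remove_parenthesis(name)
    name = remove_punctuation(name)
    return remove_double_space(name)

print(clean_name('SOUTH AFRICAN AIRWAYS (PTY) LTD'))  # south african airways ltd
```

The suffix "ltd" survives here because removing it is left to cleanco's `basename`.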

The "cleanco" package removes company suffixes such as "inc." and "limited".

In [12]:
!pip install cleanco
from cleanco import basename

outdata_orbis['name_clean'] = outdata_orbis['name'].apply(str.strip)
outdata_orbis['name_clean'] = outdata_orbis['name_clean'].apply(remove_parenthesis)
outdata_orbis['name_clean'] = outdata_orbis['name_clean'].apply(remove_punctuation)
outdata_orbis['name_clean'] = outdata_orbis['name_clean'].apply(basename)
outdata_orbis['name_clean'] = outdata_orbis['name_clean'].apply(basename) # run basename() twice because of multiple suffixes
outdata_orbis['name_clean'] = outdata_orbis['name_clean'].apply(remove_double_space)





In [13]:
outdata_orbis.sample(5)

Unnamed: 0,name_original,name,name_clean
14,ALLIANCE NEWS LIMITED,alliance news limited,alliance news
9,EUROPEAN UNION TRADING AND AGENCY SAE,european union trading and agency sae,european union trading and agency
3,SOUTH AFRICAN AIRWAYS (PTY) LTD,south african airways (pty) ltd,south african airways
1,PFIZER INC,pfizer inc,pfizer
2,ELI LILLY AND COMPANY,eli lilly and company,eli lilly


In [14]:
outdata_orbis.to_csv('./output/orbis_list.csv')

### GDelt Data

Load a CSV file of press mentions, which can be obtained by querying the GDelt global knowledge database. You can limit search results by adding filter criteria, such as press mentions by country.

In [15]:
indata_gdelt = pd.read_csv(GDELT_INPUT)

In [16]:
indata_gdelt.head()

Unnamed: 0,gkgrecordid,date,sourcecollectionidentifier,sourcecommonname,documentidentifier,counts,locations,organizations,themes,persons,tone,dates,amounts,translationinfo,country,year
0,20200316171500-563,2020-03-16 17:15:00.000,1,570news.com,https://www.570news.com/2020/03/16/new-africa-...,"[{count_type=KILL, count=11, object_type=, loc...","[{location_type=4, location_fullname=Monrovia,...","[{organization=Associated Press, character_off...","[{theme=UNGP_EDUCATION, character_offset=4791}...","[{person=Shakira Choonara, character_offset=40...","{tone=-3.86052303860523, positive_score=3.1133...",,"[{amount=30, object=of Africa 54 countries, ch...",,SL,2020
1,20200316171500-973,2020-03-16 17:15:00.000,1,yahoo.com,https://news.yahoo.com/africell-holding-comple...,,"[{location_type=1, location_fullname=Sierra Le...","[{organization=Africell Holding, character_off...","[{theme=GENERAL_GOVERNMENT, character_offset=1...","[{person=Ziad Dalloul, character_offset=817}, ...","{tone=4.67445742904841, positive_score=5.34223...",,"[{amount=12000000, object=customers, character...",,SL,2020
2,20200316171500-1417,2020-03-16 17:15:00.000,1,stuff.co.nz,https://www.stuff.co.nz/timaru-herald/news/120...,,"[{location_type=1, location_fullname=Sierra Le...",,"[{theme=TAX_WORLDFISH_BANJOS, character_offset...","[{person=John Trotter, character_offset=2650},...","{tone=1.6, positive_score=2.2, negative_score=...","[{date_resolution=1, month=0, day=0, year=1963...",,,SL,2020
3,20200316171500-1483,2020-03-16 17:15:00.000,1,bmj.com,https://www.bmj.com/content/368/bmj.m481/rr,,"[{location_type=1, location_fullname=Sierra Le...",,"[{theme=GENERAL_GOVERNMENT, character_offset=3...","[{person=Jenny Gibson, character_offset=63}]","{tone=1.36054421768708, positive_score=3.40136...",,,,SL,2020
4,20200316171500-2485,2020-03-16 17:15:00.000,1,venturesafrica.com,http://venturesafrica.com/apostories/25-police...,,"[{location_type=1, location_fullname=Sierra Le...",[{organization=Police Contributing Countries P...,"[{theme=TAX_FNCACT_ASSISTANT, character_offset...","[{person=Rex Dundun, character_offset=1319}, {...","{tone=1.04712041884817, positive_score=2.35602...",,"[{amount=25, object=newly deployed Individual ...",,SL,2020


In [17]:
len(indata_gdelt)

36000

In [18]:
indata_gdelt = indata_gdelt[['organizations']].dropna()

In [19]:
orgs_unextracted_gdelt = []

for index, row in indata_gdelt.iterrows():
    # Each row holds a single string of the form "[{...}, {...}]".
    # Take that string and strip the surrounding square brackets.
    orgs_unextracted_gdelt.append(row.iloc[0][1:-1])

In [20]:
import re
orgs_extracted_gdelt = []

# Each row is a JSON-like string with unquoted fields, including the
# organization names. Split the row into subrows, one per organization,
# and extract each name with a regex.
for row in orgs_unextracted_gdelt:
    for subrow in row.split('},'):
        # Capture the text between the first "n=" (the end of
        # "organization=") and the last comma in the subrow.
        match = re.findall(r'(?:n=)(.*)(?:,)', subrow)
        if match:  # skip subrows without a name instead of raising IndexError
            orgs_extracted_gdelt.append(match[0])
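
To make the extraction step concrete, here is a minimal sketch on a made-up row. The sample string and the `extract_organizations` helper are hypothetical, and the pattern assumes every organization entry is followed by a `character_offset` field, as the truncated `head()` output above suggests.

```python
import re

# Hypothetical sample, shaped like one value of the `organizations` column.
sample = ('[{organization=Associated Press, character_offset=120}, '
          '{organization=World Health Organization, character_offset=482}]')

def extract_organizations(raw):
    # Capture the text between "organization=" and ", character_offset=".
    return re.findall(r'organization=(.*?), character_offset=', raw)

print(extract_organizations(sample))
# ['Associated Press', 'World Health Organization']
```

Anchoring on the full field names avoids splitting on `},` first, at the cost of depending on the field order.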

In [21]:
orgs_extracted_gdelt = pd.DataFrame(orgs_extracted_gdelt)

In [22]:
orgs_extracted_gdelt.head()

Unnamed: 0,0
0,Associated Press
1,Associated Press
2,Associated Press
3,Africa National Institute For Communicable Dis...
4,United States


In [23]:
# Count mentions per organization name. Naming the columns explicitly
# avoids relying on the default column labels, which differ between
# pandas versions.
outdata_gdelt = (orgs_extracted_gdelt[0]
                 .value_counts()
                 .rename_axis('name_gdelt')
                 .reset_index(name='freq_gdelt'))

In [24]:
outdata_gdelt = outdata_gdelt[NUM_ROWS_GDELT_START:NUM_ROWS_GDELT_END].copy()

In [25]:
outdata_gdelt.head()

Unnamed: 0,name_gdelt,freq_gdelt
0,United States,39691
1,Foundation Trust,30845
2,Ambulance Service,10249
3,United Nations,7801
4,World Health Organization,6566


In [26]:
len(outdata_gdelt)

1000

In [27]:
outdata_gdelt['name_original'] = outdata_gdelt['name_gdelt']
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].str.lower()
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].apply(str.strip)
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].apply(remove_parenthesis)
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].apply(remove_punctuation)
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].apply(basename)
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].apply(basename) # run basename() twice because of multiple suffixes
outdata_gdelt['name_gdelt'] = outdata_gdelt['name_gdelt'].apply(remove_double_space)

In [28]:
outdata_gdelt.sample(5)

Unnamed: 0,name_gdelt,freq_gdelt,name_original
386,news agency,98,News Agency
895,finland national bureau of investigation,39,Finland National Bureau Of Investigation
815,university of queensland centre for clinical r...,44,University Of Queensland Centre For Clinical R...
87,pa mike brown,673,Pa Mike Brown
561,kotota international airport,67,Kotota International Airport


#### Acronyms

Although it's possible to compare company acronyms, there's not enough information embedded in an acronym to confidently match to full company names. For example, "US" could refer to "United States" or "United Steel" or "Universal Studios".
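
The collision described above is easy to reproduce. This is a small sketch using the same first-letter rule as the commented-out helper below; the three names are just the examples from the paragraph.

```python
from collections import defaultdict

def create_acronym(name):
    # First letter of each whitespace-separated word.
    return ''.join(word[0] for word in name.split())

names = ['united states', 'united steel', 'universal studios']

# Group the full names by the acronym they produce.
by_acronym = defaultdict(list)
for name in names:
    by_acronym[create_acronym(name)].append(name)

print(dict(by_acronym))
# {'us': ['united states', 'united steel', 'universal studios']}
```

All three names collapse to the same two letters, so the acronym alone cannot tell them apart.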

In [29]:
# Does not account for "company" or "inc" at the end of the full name.
# def create_acronym(name):
#     words = name.split() 
#     output = ''
#     for word in words: 
#         output += word[0] 
#     return output 

In [30]:
# Not really used at the moment.
# outdata_gdelt['acronym_gdelt'] = outdata_gdelt['name_gdelt'].apply(create_acronym)

In [31]:
outdata_gdelt.to_csv('./output/gdelt_list.csv')

# Scoring

Several scoring methods are implemented below, including those from py_stringsimjoin, py_stringmatching, FuzzyWuzzy, and Jellyfish.

In [6]:
# Import module for data manipulation
import pandas as pd
# Import module for linear algebra
import numpy as np
# Import module for Fuzzy string matching
from fuzzywuzzy import fuzz, process
# Import module for regex
import re
# Import module for iteration
import itertools
# Import module for function development
from typing import Union, List, Tuple
# Import module for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# Import module for cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
# Import module for KNN
from sklearn.neighbors import NearestNeighbors

In [7]:
# String matching - TF-IDF
def build_vectorizer(
    clean: pd.Series,
    analyzer: str = 'char', 
    ngram_range: Tuple[int, int] = (1, 4), 
    n_neighbors: int = 1, 
    **kwargs
    ) -> Tuple:
    # Create vectorizer
    vectorizer = TfidfVectorizer(analyzer = analyzer, ngram_range = ngram_range, **kwargs)
    X = vectorizer.fit_transform(clean.values.astype('U'))

    # Fit nearest neighbors corpus
    nbrs = NearestNeighbors(n_neighbors = n_neighbors, metric = 'cosine').fit(X)
    return vectorizer, nbrs

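# String matching - KNN
def tfidf_nn(
    messy, 
    clean, 
    n_neighbors = 1, 
    **kwargs
    ):
    # Fit clean data and transform messy data
    vectorizer, nbrs = build_vectorizer(clean, n_neighbors = n_neighbors, **kwargs)
    input_vec = vectorizer.transform(messy)

    # Assumed completion (the original cell ends above without returning):
    # look up the nearest clean names and their cosine distances.
    distances, indices = nbrs.kneighbors(input_vec, n_neighbors = n_neighbors)
    nearest_values = np.array(clean)[indices]
    return nearest_values, distances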

### py_stringsimjoin

Given two tables A and B, this package provides commands to perform string similarity joins between two columns of these tables, such as A.name and B.name, or A.city and B.city.

http://anhaidgroup.github.io/py_stringsimjoin/v0.1.x/overview.html

In [8]:
!pip install py_stringsimjoin

import py_stringsimjoin as ssj

Collecting py_stringsimjoin
  Using cached py_stringsimjoin-0.3.2.tar.gz (1.1 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting py_stringmatching>=0.2.1
  Using cached py_stringmatching-0.4.2.tar.gz (661 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: py_stringsimjoin, py_stringmatching
  Building wheel for py_stringsimjoin (setup.py): started
  Building wheel for py_stringsimjoin (setup.py): finished with status 'error'
  Running setup.py clean for py_stringsimjoin
  Building wheel for py_stringmatching (setup.py): started
  Building wheel for py_stringmatching (setup.py): finished with status 'error'
  Running setup.py clean for py_stringmatching
Failed to build py_stringsimjoin py_stringmatching
Installing collected packages: py_stringmatching, py_stringsimjoin
  Running setup.py install for py_stringmatching: started
  

  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [126 lines of output]

ModuleNotFoundError: No module named 'py_stringsimjoin'

In [None]:
outdata_orbis.reset_index(inplace=True)
outdata_gdelt.reset_index(inplace=True)

#### Distance join

Join two tables using edit distance measure.
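
For reference, the edit (Levenshtein) distance itself can be sketched in plain Python. This is a textbook dynamic-programming version for illustration only, not the implementation used by py_stringsimjoin.

```python
def edit_distance(a, b):
    # Levenshtein distance, keeping only one row of the table at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(edit_distance('africell', 'africel'))  # 1
print(edit_distance('kitten', 'sitting'))    # 3
```

A lower distance means the two names are closer, which is the opposite direction from the similarity scores used in the joins further down.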

In [None]:
output_pairs_distance_join = ssj.edit_distance_join(outdata_orbis, outdata_gdelt,
                                      'index', 'index', 
                                      'name_clean', 'name_gdelt', 
                                      50,
                                      l_out_attrs=['name_clean'], 
                                      r_out_attrs=['name_gdelt'])

NameError: name 'ssj' is not defined

#### py_stringmatching

In [None]:
!pip install py_stringmatching

import py_stringmatching as sm

ws = sm.WhitespaceTokenizer(return_set=True)

#### Jaccard join

Join two tables using Jaccard similarity measure.

In [None]:
output_pairs_jaccard_join = ssj.jaccard_join(outdata_orbis, outdata_gdelt, 
                                             'index', 'index', 
                                             'name_clean', 'name_gdelt', 
                                             ws, 0.1, 
                                             l_out_attrs=['name_clean'], 
                                             r_out_attrs=['name_gdelt'])

#### Cosine join

Join two tables using a variant of cosine similarity known as Ochiai coefficient.

In [None]:
output_pairs_cosine_join = ssj.cosine_join(outdata_orbis, outdata_gdelt, 
                                             'index', 'index', 
                                             'name_clean', 'name_gdelt', 
                                             ws, 0.1, 
                                             l_out_attrs=['name_clean'], 
                                             r_out_attrs=['name_gdelt'])

#### Dice join

Join two tables using Dice similarity measure.

In [None]:
output_pairs_dice_join = ssj.dice_join(outdata_orbis, outdata_gdelt, 
                                             'index', 'index', 
                                             'name_clean', 'name_gdelt', 
                                             ws, 0.1, 
                                             l_out_attrs=['name_clean'], 
                                             r_out_attrs=['name_gdelt'])

#### Overlap join

Join two tables using overlap measure.

In [None]:
output_pairs_overlap_join = ssj.overlap_join(outdata_orbis, outdata_gdelt, 
                                             'index', 'index', 
                                             'name_clean', 'name_gdelt', 
                                             ws, 0.1, 
                                             l_out_attrs=['name_clean'], 
                                             r_out_attrs=['name_gdelt'])

#### Overlap coefficient join

Join two tables using overlap coefficient.

In [None]:
output_pairs_overlap_coefficient_join = ssj.overlap_coefficient_join(outdata_orbis, outdata_gdelt, 
                                             'index', 'index', 
                                             'name_clean', 'name_gdelt', 
                                             ws, 0.1, 
                                             l_out_attrs=['name_clean'], 
                                             r_out_attrs=['name_gdelt'])
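
The token-set measures behind these joins can be written out in a few lines. The sketch below uses the standard textbook formulas on whitespace tokens; the helper names are hypothetical and the results are not guaranteed to match py_stringsimjoin exactly.

```python
import math

def tokens(name):
    # Whitespace tokens as a set, similar in spirit to
    # WhitespaceTokenizer(return_set=True).
    return set(name.split())

def set_similarities(left, right):
    a, b = tokens(left), tokens(right)
    shared = len(a & b)
    if not a or not b:
        # Assumption: treat an empty name as having no similarity.
        return {'overlap': 0, 'jaccard': 0.0, 'cosine': 0.0,
                'dice': 0.0, 'overlap_coefficient': 0.0}
    return {
        'overlap': shared,                                    # shared tokens
        'jaccard': shared / len(a | b),
        'cosine': shared / math.sqrt(len(a) * len(b)),        # Ochiai
        'dice': 2 * shared / (len(a) + len(b)),
        'overlap_coefficient': shared / min(len(a), len(b)),
    }

scores = set_similarities('world health organization', 'world health')
print(scores)
```

With three tokens on the left, two on the right, and two shared, Jaccard is 2/3, Dice is 0.8, and the overlap coefficient is 1.0 because the shorter name is fully contained in the longer one.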

#### Master list

We take the dataframe of Orbis organization names and cross-join it with the dataframe of GDelt organization names, so that every Orbis company name has a record associating it with every GDelt organization mention. For example, if there are 2 Orbis company names and 3 GDelt article mentions, then there will be six comparisons: 3 comparisons for each of the two Orbis company names.
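
The counting in that example can be checked with a plain-Python cross product. The names below are placeholders chosen for illustration.

```python
from itertools import product

orbis_names = ['pfizer', 'eli lilly']                              # 2 Orbis names
gdelt_mentions = ['pfizer', 'world health organization', 'lilly']  # 3 GDelt mentions

# Every Orbis name is paired with every GDelt mention.
pairs = list(product(orbis_names, gdelt_mentions))

print(len(pairs))  # 6
for orbis, gdelt in pairs:
    print(orbis, '<->', gdelt)
```

The number of pairs grows as the product of the two list sizes, so the row limits in the execution settings directly control how long scoring takes. In pandas 1.2 and later, `pd.merge(..., how='cross')` produces the same pairing without a temporary key.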

In [None]:
# To cross join, merge on a temporary key and then drop it.
outdata_gdelt['key'] = 1
outdata_orbis['key'] = 1

master_list = pd.merge(outdata_gdelt, outdata_orbis, on='key').drop(columns='key')
master_list.rename(columns={'name_x': 'name_gdelt', 
                             'name_original_x': 'name_original_gdelt', 
                             'name': 'name_orbis', 
                             'name_clean': 'name_clean_orbis', 
                             'name_original_y': 'name_original_orbis'}, 
                    inplace=True)

In [None]:
master_list.head(5)

In [None]:
master_list.to_csv('./output/master_list.csv')

#### FuzzyWuzzy and Jellyfish

1) FuzzyWuzzy: fuzzy string matching that uses Levenshtein distance to calculate the differences between sequences: https://pypi.org/project/fuzzywuzzy/

2) Jellyfish: a library for approximate and phonetic matching of strings: https://pypi.org/project/jellyfish/

In [None]:
!pip install fuzzywuzzy
!pip install jellyfish

import pandas as pd
from fuzzywuzzy import fuzz
import jellyfish

In [None]:
try:
    data = master_list
except NameError:
    data = pd.read_csv('./output/master_list.csv')
    data.drop(columns='Unnamed: 0', inplace=True)

data = data.dropna() # To prevent errors processing matches.

#### Calculate fuzz ratios and Jaro distances.

This cell calculates fuzz ratios and Jaro scores (computed with `jellyfish.jaro_distance`, which is the plain Jaro measure rather than Jaro-Winkler) for both the spelled-out organization names and their Metaphone phonetic encodings. A progress bar tracks execution on large datasets.
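
To give a feel for what the ratio scores measure, here is a rough standard-library approximation built on `difflib`. The function names are hypothetical and this is not FuzzyWuzzy's implementation, which may preprocess strings differently and can return slightly different numbers.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # Similarity on a 0-100 scale, based on matching character blocks.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_token_sort_ratio(a, b):
    # Sort the words first so that word order does not affect the score.
    sorted_a = ' '.join(sorted(a.split()))
    sorted_b = ' '.join(sorted(b.split()))
    return simple_ratio(sorted_a, sorted_b)

print(simple_ratio('gates foundation', 'foundation gates'))
print(simple_token_sort_ratio('gates foundation', 'foundation gates'))  # 100
```

The plain ratio penalizes the swapped word order, while the token-sort variant scores the pair as identical. That difference is why the notebook records several ratios per pair.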

In [None]:
# Get matches of names as well as meta information.
# This is where the heavy lifting happens.

display('Match processing will take some time...')
display(str(len(data)) + ' rows...')

!pip install tqdm
from tqdm import tqdm
tqdm.pandas() # Adds progress_apply() to pandas objects for progress bars.

# Name comparisons. Run an apply() on two columns.
display('Calculating fuzz ratio for names...')
data['fuzz_ratio'] = data.progress_apply(lambda x: fuzz.ratio(x.name_gdelt, x.name_clean_orbis), axis=1)
display('Calculating fuzz partial ratio for names...')
data['fuzz_partial_ratio'] = data.progress_apply(lambda x: fuzz.partial_ratio(x.name_gdelt, x.name_clean_orbis), axis=1)
display('Calculating token sort ratio for names...')
data['fuzz_token_sort_ratio'] = data.progress_apply(lambda x: fuzz.token_sort_ratio(x.name_gdelt, x.name_clean_orbis), axis=1)
display('Calculating jaro distance for names...')
data['jaro_distance'] = data.progress_apply(lambda x: jellyfish.jaro_distance(x.name_gdelt, x.name_clean_orbis), axis=1)

# Metaphone generation.
display('Generating metaphones for uncleaned orbis names...')
data['metaphone_unclean_orbis'] = data['name_orbis'].progress_apply(jellyfish.metaphone)
display('Generating metaphones for cleaned orbis names...')
data['metaphone_clean_orbis'] = data['name_clean_orbis'].progress_apply(jellyfish.metaphone)
display('Generating metaphones for gdelt names...')
data['metaphone_gdelt'] = data['name_gdelt'].progress_apply(jellyfish.metaphone)

# Metaphone comparisons. Run an apply() on two columns.
display('Calculating fuzz ratio for metaphones...')
data['metaphone_fuzz_ratio'] = data.progress_apply(lambda x: fuzz.ratio(x.metaphone_gdelt, x.metaphone_clean_orbis), axis=1)
display('Calculating fuzz partial ratio for metaphones...')
data['metaphone_fuzz_partial_ratio'] = data.progress_apply(lambda x: fuzz.partial_ratio(x.metaphone_gdelt, x.metaphone_clean_orbis), axis=1)
display('Calculating token sort ratio for metaphones...')
data['metaphone_fuzz_token_sort_ratio'] = data.progress_apply(lambda x: fuzz.token_sort_ratio(x.metaphone_gdelt, x.metaphone_clean_orbis), axis=1)
display('Calculating jaro distance for metaphones...')
data['metaphone_jaro_distance'] = data.progress_apply(lambda x: jellyfish.jaro_distance(x.metaphone_gdelt, x.metaphone_clean_orbis), axis=1)

display('Done.')

In [None]:
data.sample(5)

#### py_stringsimjoin

Edit distance join

In [None]:
data = pd.merge(data, 
                output_pairs_distance_join, 
                how='outer', 
                left_on=['index_x', 'index_y'], 
                right_on=['r_index', 'l_index'])

data.rename(columns={'_sim_score': 'sim_score_distance'}, inplace=True)

#### py_stringmatching

Jaccard join

In [None]:
data = pd.merge(data, 
                output_pairs_jaccard_join, 
                how='outer', 
                left_on=['index_x', 'index_y'], 
                right_on=['r_index', 'l_index'])

data.rename(columns={'_sim_score': 'sim_score_jaccard'}, inplace=True)

Cosine join

In [None]:
data = pd.merge(data, 
                output_pairs_cosine_join, 
                how='outer', 
                left_on=['index_x', 'index_y'], 
                right_on=['r_index', 'l_index'])

data.rename(columns={'_sim_score': 'sim_score_cosine'}, inplace=True)

Dice join

In [None]:
data = pd.merge(data, 
                output_pairs_dice_join, 
                how='outer', 
                left_on=['index_x', 'index_y'], 
                right_on=['r_index', 'l_index'])

data.rename(columns={'_sim_score': 'sim_score_dice'}, inplace=True)

Overlap join

In [None]:
data = pd.merge(data, output_pairs_overlap_join, 
                how='outer', 
                left_on=['index_x', 'index_y'], 
                right_on=['r_index', 'l_index'])

data.rename(columns={'_sim_score': 'sim_score_overlap'}, inplace=True)

Overlap coefficient join

In [None]:
data = pd.merge(data, 
                output_pairs_overlap_coefficient_join, 
                how='outer', 
                left_on=['index_x', 'index_y'], 
                right_on=['r_index', 'l_index'])

data.rename(columns={'_sim_score': 'sim_score_overlap_coefficient'}, inplace=True)

In [None]:
data.to_csv('./output/matches_raw.csv')

# Sorting

Sort by "official" Orbis name, then by the match scores. The score columns used for sorting can easily be changed for testing purposes.

In [None]:
import pandas as pd

In [None]:
try:
    indata = data
except NameError:
    indata = pd.read_csv('./output/matches_raw.csv')
    indata.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
# Index the match data by name pair in a MultiIndex, then sort by Orbis
# name and, within each name, by descending scores. A later sort_index()
# would discard the score ordering, so it is not applied here.
df_sorted = indata.set_index(['name_original_orbis', 'name_original_gdelt'])
df_sorted = df_sorted.sort_values(by=['name_original_orbis', 
                                      'fuzz_ratio', 
                                      'fuzz_partial_ratio', 
                                      'fuzz_token_sort_ratio'], 
                                  ascending=[True, False, False, False])

In [None]:
df_sorted.to_csv('./output/matches_sorted.csv')

# Matching

In [None]:
import pandas as pd

In [None]:
try:
    df_sorted
except NameError:
    indata = pd.read_csv('./output/matches_sorted.csv')
    df_sorted = indata.set_index(['name_original_orbis', 'name_original_gdelt'])

In [None]:
# Just in case we want to look at the df
# we should have the columns in a nice order.

df_unscored = df_sorted[[
    # 'acronym_gdelt', 
    'freq_gdelt', 
    'fuzz_ratio', 
    'fuzz_partial_ratio', 
    'fuzz_token_sort_ratio', 
    'jaro_distance', 
    'metaphone_unclean_orbis', 
    'metaphone_clean_orbis', 
    'metaphone_gdelt',
    'metaphone_jaro_distance',
    'metaphone_fuzz_ratio',
    'metaphone_fuzz_partial_ratio',
    'metaphone_fuzz_token_sort_ratio',
    'sim_score_distance',
    'sim_score_jaccard',
    'sim_score_cosine',
    'sim_score_dice',
    'sim_score_overlap',
    'sim_score_overlap_coefficient',
]]

In [None]:
len(df_unscored)

In [None]:
df_unscored.sample(10)

In [None]:
# Save progress here to allow fast manipulation of filtering below.
df_scored = df_unscored.copy()

In [None]:
#### Calculate fuzz similarity

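The "fuzz similarity" is the harmonic mean of the partial ratio and the token sort ratio. The three fuzz scores are also summed into cumulative totals, once for names and once for metaphones. Other scoring measures may also be introduced here.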

In [None]:
# An approach called "fuzz similarity"
# https://www.analyticsinsight.net/company-names-standardization-using-a-fuzzy-nlp-approach/
df_scored['fuzz_similarity'] = (2 * df_scored['fuzz_partial_ratio'] * df_scored['fuzz_token_sort_ratio']) / (df_scored['fuzz_partial_ratio'] + df_scored['fuzz_token_sort_ratio'])

# Cumulative scores.
df_scored['total_score_name'] = df_scored['fuzz_ratio'] + df_scored['fuzz_partial_ratio'] + df_scored['fuzz_token_sort_ratio']
df_scored['total_score_metaphone'] = df_scored['metaphone_fuzz_ratio'] + df_scored['metaphone_fuzz_partial_ratio'] + df_scored['metaphone_fuzz_token_sort_ratio']
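
A quick worked example of the harmonic mean used for the fuzz similarity, on made-up scores. The standalone `fuzz_similarity` function is hypothetical, and returning 0 when both inputs are 0 is an assumption of this sketch (the pandas expression above yields NaN in that case).

```python
def fuzz_similarity(partial_ratio, token_sort_ratio):
    # Harmonic mean of the two scores.
    total = partial_ratio + token_sort_ratio
    if total == 0:
        return 0.0  # assumption: define the score as 0 when both inputs are 0
    return 2 * partial_ratio * token_sort_ratio / total

print(fuzz_similarity(80, 80))   # 80.0
print(fuzz_similarity(100, 50))  # about 66.67
```

The harmonic mean sits closer to the lower of the two scores than a simple average would (75 for the second pair), so a pair needs to do well on both measures to score highly.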

#### Threshold settings

Change the match threshold scores to experiment with accuracy and sensitivity. You can mix and match different scores to refine results and test different approaches.

In [None]:
# Save progress here to allow fast manipulation of matching below.
df_matches = df_scored.copy()

In [None]:
# Filter matches.
df_matches = df_matches[((df_matches['total_score_name'] > 280.0) & (df_matches['jaro_distance'] > 0.9))]

# Additional scoring methods for experimentation:
# df_matches = df_matches[df_matches['sim_score_distance'] <= 1]
# df_matches = df_matches[df_matches['sim_score_jaccard'] <= 2]
# df_matches = df_matches[df_matches['sim_score_cosine'] <= 2]
# df_matches = df_matches[df_matches['sim_score_dice'] <= 2]
# df_matches = df_matches[df_matches['sim_score_overlap'] <= 2]
# df_matches = df_matches[df_matches['sim_score_overlap_coefficient'] <= 2]
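
The effect of the two thresholds can be seen on a handful of rows. The candidate pairs and their scores below are invented for illustration, and `is_match` is a hypothetical helper that mirrors the combined filter above.

```python
NAME_THRESHOLD = 280.0  # out of a maximum of 300 (three scores of 0-100)
JARO_THRESHOLD = 0.9    # the Jaro score ranges from 0 to 1

# Hypothetical candidate pairs with made-up scores.
candidates = [
    {'gdelt': 'pfizer', 'total_score_name': 300, 'jaro_distance': 1.0},
    {'gdelt': 'pfizer india', 'total_score_name': 250, 'jaro_distance': 0.92},
    {'gdelt': 'united nations', 'total_score_name': 95, 'jaro_distance': 0.45},
]

def is_match(row, name_threshold=NAME_THRESHOLD, jaro_threshold=JARO_THRESHOLD):
    # Both conditions must hold, like the `&` in the pandas filter.
    return (row['total_score_name'] > name_threshold
            and row['jaro_distance'] > jaro_threshold)

matches = [row['gdelt'] for row in candidates if is_match(row)]
print(matches)  # ['pfizer']
```

Lowering `name_threshold` to 200 would also admit "pfizer india", which shows the trade-off between catching more variants and letting in near misses.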

In [None]:
len(df_matches)

In [None]:
df_matches.head(50)

In [None]:
df_matches.to_csv('./output/matches_filtered.csv')

# Output

In [None]:
import pandas as pd

In [None]:
try:
    indata = df_matches
except NameError:
    indata = pd.read_csv('./output/matches_filtered.csv')
    indata = indata.set_index(['name_original_orbis', 'name_original_gdelt'])

In [None]:
# Clean up the final output.
dataout = indata[['fuzz_similarity', 
                  'total_score_name', 
                  'total_score_metaphone', 
                  'freq_gdelt', 
                  'jaro_distance', 
                  'metaphone_jaro_distance', 
                  'sim_score_distance',
                  'sim_score_jaccard',
                  'sim_score_cosine',
                  'sim_score_dice',
                  'sim_score_overlap',
                  'sim_score_overlap_coefficient',
                 ]]

In [None]:
dataout.head(50)

In [None]:
dataout.to_csv('./output/OUTPUT.csv')