Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: How to achieve matching for multiple fields and priorities #64

Open
berndnoll opened this issue Aug 21, 2021 · 81 comments
Open

Comments

@berndnoll
Copy link

Excellent work here. Great performance.
Can you please give some advice on how to achieve deduplication for multiple fields, some fuzzy, some 'hard' matches.

Like this one does:
https://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html

compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

features = compare_cl.compute(pairs, dfA)

Thank you!

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Aug 22, 2021

Hi @berndnoll

Thanks for the comment.

It seems to me that your request involves applying string_grouper’s function match_strings on several fields of your dataset one after the other (using one Series for each field, preferably indexed) and merging the resulting similarity scores on the output fields left_id and right_id (that is, withhow='outer' and on=['left_id', 'right_id']. See pandas documentation).

Exact comparisons may be achieved by specifying a threshold similarity score very close to 1.0 (for example, min_similarity=0.9999) while fuzzy matches can be specified with a lower desired score.

Furthermore, note that before applying each merge operation, the output fields similarity would need to be renamed in both operands to the name of the corresponding field that produced the similarity score.

I hope this answers your question.

@berndnoll
Copy link
Author

Hi @ParticularMiner,
thank you very much for your helpful answer.
I managed to implement the merge operation and got this to work according to your advice.
Performance is fine with a small dataset (40k records).

However, I have a pretty large dataset (>3Mil records) and matching takes foreeeever. (already for one field)
So my follow up question is if you have any great ideas on how to improve performance for larger datasets?
Some record linkage libs provide the ability of blocking (e.g. by country, zipcode) which can significantly reduce matching time.

Thanks

@ParticularMiner
Copy link
Contributor

Hi @berndnoll

I’m glad to hear you succeeded in achieving your goal.

You could play around with the string_grouper option max_n_matches which may be used to limit the number of matches found per record and thus speed things up a bit. Start small, say with max_n_matches=20 and increase until your output doesn’t change anymore.

@iibarant
Copy link

Hi @berndnoll,

I also use this approach and found it very good and quick compared to recordlinkage that you mentioned above. Since you want to use exact match for some fields, why don't you break your whole data set by subsets like by state AND / OR suburb,
and run the match_strings procedure for each subset separately. That would save a LOT of computation. Then you would merge the resulting dataframes.

@ParticularMiner, what do you think?

Thank you!

@berndnoll
Copy link
Author

@iibarant, thanks for the advice.
That's exactly what I implemented today. However, running multiple matches for small subsets takes actually way more time than running one match over a big set... I am using country and zip code to break the data into subsets.
I think ideally blocking/exact matching should be part of the "inner" algorithm. Just my 2 cents.

@ParticularMiner
Copy link
Contributor

@iiberant @berndnoll

Thanks for your contributions. If you don't mind, could you share your code (and perhaps some accompanying sample data) so I can properly examine/assess your methods? I'm quite interested to know.

@iibarant
Copy link

iibarant commented Aug 30, 2021

Hi guys,

Unfortunately I am not allowed to share the data, but here's the experiment I ran with 7 states and the code I used.
I merged the address fields into 'full address' column and kept 'foad' (State) separately.

The existing column names are ['document_id','name','foad','full address'].

Please see the RunTime results:
image

The code for merging subsets is here:

import time
t0 = time.time() 

states = ['MT','DE','HI','ID','RI','DC','ME']

match_list = []

for state in states:
    df = df_cleaned[df_cleaned['foad'] == state].copy()
    matches = match_strings(df['full address'],max_n_matches = 300, min_similarity = 0.85)
    match_list.append(matches)

final_match = pd.concat(match_list, ignore_index=True)

t = time.time()-t0
print("Ran for:", t, " seconds with match size = ",len(final_match)

Running data separately by states and merging them after words takes significantly less time.

I have to try that for big states like California, NY, Florida and Texas. It will surely be faster to run separately and merge after.

Thank you.

@ParticularMiner
Copy link
Contributor

Thank you @iibarant !

I suppose separating the data by state makes complete sense especially if any two addresses not in the same state never need to be matched (that is, if I have interpreted your data correctly). Then, certainly, separating the data will proceed much faster than not separating the data.

It seems to me that @berndnoll (correct me if I’m wrong) has a somewhat different task which requires matching different fields for the same records and merging the results. (That is, he needs to merge along the horizontal axis rather than the vertical axis, as you do. So he might proceed differently.

@berndnoll
Copy link
Author

berndnoll commented Sep 1, 2021

Hi gents,
unfortunately not able to share the original file, but the one attached will do for our purposes.
My code below. Grouping/blocking can be switched on/off by setting attribute 'group' to True/False in line 13.

I did concatenate zipcode and city to be as close to my original data.
In the original data file I am doing this with country + zipcode, as country keys like NL, DE, US, ... are too short to deliver good results.

My observations are:
Running this grouped: 27 seconds
Ungrouped: 3.2 seconds

However, this might be due to my poor py skills... Looking forward to some pro comments :)
Ideally, grouping/blocking would be part of the actual string_grouper implementation, but I leave that up to @ParticularMiner ...

Cheers

us-cities-real-estate-sample-zenrows.zip

Removing scrambled code --- see below for correclty formatted version.

@ParticularMiner
Copy link
Contributor

Hi @berndnoll

Many thanks for your code and sample data. Unfortunately important indentations/tabs in your code got lost in your message, making some portions of the code unreadable. It would be helpful to the reader if you would bracket your code-snippets with the required markdown directives (```python and ```) as follows:

Code such as the following:

if using_github == True:
   sample_code(goes, here)

should be coded as follows:

```python
if using_github == True:
    sample_code(goes, here)
```

In this way, the indentations are preserved. Follow this link for more information.

@berndnoll
Copy link
Author

@ParticularMiner: Sorry for that, trying one more time...

import sys, csv, json, pandas as pd
import numpy
import string_grouper as sg
import time

print('Dedupe script')
t0 = time.time()

inputfilename = 'us-cities-real-estate-sample-zenrows.csv'
outputfilename = 'dupes.csv'
# match field definitions - groups have to go first
comps = [
        {'fieldname':'addressZipcode+addressCity','threshold':0.99,'weight':1,'valsep':'|','group':False},
        {'fieldname':'addressStreet','threshold':0.8,'weight':1,'valsep':'|'}
        ]  
keyfield = 'id'
mintotalscore = 0.8

df = pd.read_csv(inputfilename, dtype = str)
print('#Records:',len(df)-1)
df.fillna('', inplace=True)

lkfld = 'left_'+keyfield
rkfld = 'right_'+keyfield
keyfields = [lkfld,rkfld]
allfields = keyfields.copy()
allfields.append('totalscore')

outmatches = pd.DataFrame()
first = True
gfield = ''

for comp in comps:
  sim = 0.8 if 'threshold' not in comp else comp['threshold']
  weight = 1 if 'weight' not in comp else comp['weight']
  group = False if 'group' not in comp else comp['group']
  valsep = '|' if 'valsep' not in comp else comp['valsep']

  field = comp['fieldname']
  flds = field.split('+')
  if len(flds)>1: # Multiple fields? Create concatenated field
    field = '_'.join(flds)
    df[field] = df[flds].apply(lambda row: valsep.join(row.values.astype(str)), axis=1)

  allfields.append('left_'+field)
  allfields.append('right_'+field)
  allfields.append(field+'_sim')

  simfn = field + '_sim'

  if group == True:
    gfield = field
    gweight = weight
    gvals = df[gfield].unique().tolist()
    continue  # No matches for this field, as it's group

  print('Field: ',field)
  if gfield != '':
    print(' group field=',gfield)
    # Perform match per each group
    matches = pd.DataFrame()
    for gval in gvals:
      gdf = df[ df[gfield] == gval ]
      strs = gdf[field].astype(str)
      keys = gdf[keyfield].astype(str)
      gmatches = pd.DataFrame()
      if len(strs)>1:
        #print('group value:', gval)
        # Perform the match
        gmatches = sg.match_strings(strs, min_similarity=sim, master_id = keys, ignore_index=True, max_n_matches=20)
        gmatches = gmatches[gmatches[lkfld] != gmatches[rkfld]]
        gmatches['similarity'] = gmatches['similarity'] + 1 # Add a full score for the group
        gmatches['left_'+gfield] = gval
        gmatches['right_'+gfield] = gval
        gmatches[gfield+'_sim'] = 1
        matches = matches.append(gmatches)
  else:  
    strs = df[field].astype(str)
    keys = df[keyfield].astype(str)
    # Perform the match
    matches = sg.match_strings(strs, min_similarity=sim, master_id = keys, ignore_index=True, max_n_matches=20)

  # Only keep matches that have different keys
  matches = matches[matches[lkfld] != matches[rkfld]]
  matches['similarity'] = matches['similarity']*weight
  
  matches.rename(columns={'similarity': simfn }, inplace=True )
  # Merge results
  if first == True: # First field
    outmatches = matches
    outmatches['totalscore'] = matches[simfn]
    first = False
  else:
    outmatches = outmatches.merge(matches,on=keyfields,how='outer')
    outmatches['totalscore'] = outmatches['totalscore'].add(outmatches[simfn])

def onetwo(row): 
  if not hasattr(onetwo, "number"):
    onetwo.number = 0
  if onetwo.number == 1:
    onetwo.number = 2
  else:
    onetwo.number = 1
  return onetwo.number  

# Filter
outmatches = outmatches[outmatches[lkfld] != outmatches[rkfld]]
outmatches = outmatches[outmatches.totalscore >= mintotalscore]

# Eliminate duplicates
outmatches.sort_values([lkfld,rkfld])
outmatches['onetwo'] = outmatches.apply(lambda row: onetwo(row), axis=1)
delkeys = outmatches[rkfld].unique().tolist()
outmatches = outmatches[(outmatches[lkfld].isin(delkeys)) & (outmatches['onetwo'] != 2)]

# Output
nummatches = len(outmatches)
if nummatches > 0:
  t = time.time()-t0  
  print('duration:', t)
  print('output:', nummatches)
  outmatches.to_csv(outputfilename, sep='\t', index=False, columns=allfields)

print('Done.')

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Sep 2, 2021

@berndnoll @iibarant

Thanks again for your contributions and suggestions. I have borrowed from the code snippets you both provided to produce the following code:

[Follow this link to obtain the most recent code.]

I'd appreciate it if you would test and review it, so as to determine if it would be useful enough to add to string_grouper's utility package string_grouper_utils.

Of course feel free to discuss any issues.

FYI, using the sample data that @berndnoll provided, I found that grouping the data for exact matches before computing the similarities does not necessarily lead to faster nor slower computation. In fact, the speed is very much data-dependent.

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Sep 3, 2021

@iibarant @berndnoll

CAUTION:

You may have already noticed that grouping the exact matches before similarity-matching often leads to results that are different from those obtained without grouping. This is because of the way similarity-matching works: similarity-scores are dependent on the vocabulary-set (or corpus, as it's called in Natural Language processing).

This means that calling match_strings() on only a subset, or group, of the data (for example, all records whose 'state' field has the value 'AL') would in general yield different similarity-scores for those same records if match_strings() were called on the entire dataset. That's because the corpus in each case is different — in the former, it is the group, and in the latter, it is the entire dataset.

This is why it might be better not to group at all. Nevertheless, the choice is always up to the user, as long as he/she knows what he/she is doing.

Somehow, I had forgotten about this fact before. Sorry about that. I hope that hasn't caused you too much inconvenience.

@iibarant
Copy link

iibarant commented Sep 3, 2021

Thank you @ParticularMiner,

This is a good point. That's why I decided to combine states with much lower data collection (in my case < 10000) all together. Hope this would be acceptable.

Thank you for your solution. I will explore it next week.

Have a great weekend!

@ParticularMiner
Copy link
Contributor

@berndnoll @iibarant

By modifying the StringGrouper class a little bit, I have been able to fix the problem of inconsistent results between the two methods. So please note the update to the code in my previous post. Now you can obtain the same result whether or not you first group by exact fields.

@ParticularMiner
Copy link
Contributor

@berndnoll @iibarant

Latest update now includes total similarity scores.

@berndnoll
Copy link
Author

@ParticularMiner
Thank you so much for looking into this and for collaborating on this. Very much appreciated!

I basically copy and pasted your code to test it before applying to a bigger dataset.
After 30 minutes I restarted the script ... same result.
Reducing the input volume from 10,000 to 1,000 records makes it run through, but ehm ... I will not let it work on 3 mil rows for sure.

Here is the output:

Dedupe script
T1: 1.159604787826538
T2: 55.857362031936646
Done.

So ... either I got something wrong or you have a quantum computer?
What was the runtime of your examples?

Here is my code - based on the latest update you posted.

Cheers

import sys, csv, json, pandas as pd
import numpy as np
import string_grouper as sg
import time

import functools
from string_grouper import StringGrouper

def record_linkage(data_frame,
                   fields_2b_matched_fuzzily,
                   fields_2b_matched_exactly=None,
                   hierarchical=True,
                   max_n_matches=5000,
                   similarity_dtype=np.float32
                  ):
    '''
    Function that combines similarity-matching results of several fields of a DataFrame and returns them in another
    DataFrame
    :param data_frame: pandas.DataFrame of strings.
    :param fields_2b_matched_fuzzily: List of tuples.  Each tuple is a triple 
        (<field name>, <threshold>, <ngram_size>).
        <field name> is the name of a field in data_frame which is to be matched using a threshold similarity score of 
        <threshold> and an ngram size of <ngram_size>.
    :param fields_2b_matched_exactly: List of strings.  Each string is a field name of data_frame.  Defaults to None.
    :param hierarchical: bool.  Determines if the output DataFrame will have a hierarchical column-structure (True) or 
        not (False). Defaults to True.
    :param max_n_matches: int. Maximum number of matches allowed per string.
    :param similarity_dtype: numpy type.  Either np.float32 (the default) or np.float64.  A value of np.float32 allows
        for less memory overhead during computation but less numerical precision, while np.float64 allows for greater
        numerical precision but a larger memory overhead.
    :return: pandas.DataFrame containing matching results.
    '''
    class VariableStringSetStringGrouper(StringGrouper):
        # This class enables StringGrouper to apply matching on different succesive master datasets without 
        # resetting the underlying corpus, as long as each master dataset is contained in the corpus. 
        # This class inherits from StringGrouper, overwriting only two of its methods:
        # __init__() and _get_tf_idf_matrices()
        def __init__(self, corpus, **kwargs):
            # initializer is the same as before except that it now also sets the corpus
            super().__init__(corpus, **kwargs)
            self._vectorizer = self._fit_vectorizer()
            
        def _get_tf_idf_matrices(self):
            # _get_tf_idf_matrices() now no longer sets the corpus but rather builds the matrices from
            # the existing corpus
            master_matrix = self._vectorizer.transform(self._master)
            return master_matrix, master_matrix

    def match_strings(master, sg):
        sg._master = master
        sg = sg.fit()
        return sg.get_matches()

    def get_only_field_names(fields_2b_matched_fuzzily):
        return list(list(zip(*fields_2b_matched_fuzzily))[0])
                        
    def build_column_precursor_to(df, exact_field_value_pairs):
        exact_df = df.iloc[:, 0:0]
        for field_name, field_value in exact_field_value_pairs:
            exact_df[field_name] = field_value
        return exact_df

    def horizontal_linkage(df,
                           match_indexes,
                           fields_2b_matched_fuzzily,
                           exact_field_value_pairs=None,
                           hierarchical=True):
        horizontal_merger_list = []
        for field, sg in fields_2b_matched_fuzzily:
            matches = match_strings(df[field], sg)
            matches.set_index(match_indexes, inplace=True)
            matches = weed_out_trivial_matches(matches)
            
            if hierarchical:
                merger = matches[[f'left_{field}', 'similarity', f'right_{field}']]
                merger.rename(columns={f'left_{field}': 'left', f'right_{field}': 'right'}, inplace=True)
            else:
                merger = matches[['similarity']]
                merger.rename(columns={'similarity': field}, inplace=True)
            horizontal_merger_list += [merger]
            
        field_list = get_only_field_names(fields_2b_matched_fuzzily)
        key_list = None if not hierarchical else field_list
        merged_df = pd.concat(horizontal_merger_list, axis=1, keys=key_list, join='inner')
        
        title_exact = 'Exactly Matched Fields'
        title_fuzzy = 'Fuzzily Matched Fields'
        if exact_field_value_pairs:
            exact_df = build_column_precursor_to(merged_df, exact_field_value_pairs)
            merged_df = pd.concat([exact_df, merged_df],
                                  axis=1,
                                  keys=[title_exact, title_fuzzy],
                                  join='inner')
        
        totals = compute_totals(merged_df, field_list, hierarchical, exact_field_value_pairs, title_fuzzy)
        return pd.concat([totals, merged_df], axis=1)
    
    def weed_out_trivial_matches(matches):
        num_indexes = matches.index.nlevels//2
        return matches[
            functools.reduce(
                lambda a, b: a | b, 
                [
                    (matches.index.get_level_values(i) != matches.index.get_level_values(i + num_indexes)) \
                    for i in range(num_indexes)
                ]
            )
        ]
    
    def compute_totals(merged_df, field_list, hierarchical, exact_field_value_pairs, title_fuzzy):
        title_total = 'Total Similarity Score'
        if hierarchical:
            if exact_field_value_pairs:
                totals = merged_df[
                    [(title_fuzzy, field, 'similarity') for field in field_list]
                ].sum(axis=1) + len(exact_field_value_pairs)
                totals = pd.concat([totals], axis=1, keys=[('', '', title_total)])
            else:
                totals = merged_df[
                    [(field, 'similarity') for field in field_list]
                ].sum(axis=1) 
                totals = pd.concat([totals], axis=1, keys=[('', title_total)])
        else:
            if exact_field_value_pairs:
                totals = merged_df[
                    [(title_fuzzy, field) for field in field_list]
                ].sum(axis=1) + len(exact_field_value_pairs)
                totals = pd.concat([totals], axis=1, keys=[('', title_total)])
            else:
                totals = merged_df[field_list].sum(axis=1)
                totals.rename(title_total, inplace=True)
        return totals
    
    def get_index_names(df):
        empty_df = df.iloc[0:0]
        return [field for field in empty_df.reset_index().columns \
                if field not in empty_df.columns]
    
    def prepend(strings, prefix):
        return [f'{prefix}{i}' for i in strings]
    
    def append_to_first_of_each_tuple(list_of_tuples, list_of_appendices):
        return [(tupl[0], ) + (x, ) for tupl, x in list(zip(list_of_tuples, list_of_appendices))]
    
    index_name_list = get_index_names(data_frame)
    match_indexes = prepend(index_name_list, prefix='left_') + prepend(index_name_list, prefix='right_')
    
    # set the corpus for each fuzzy field
    stringGroupers = []
    for field, threshold, ngram_sz in fields_2b_matched_fuzzily:
        stringGroupers += [VariableStringSetStringGrouper(data_frame[field],
                                                          min_similarity=threshold,
                                                          ngram_size=ngram_sz,
                                                          max_n_matches=max_n_matches,
                                                          tfidf_matrix_dtype=similarity_dtype)]
 
    fuzzy_bundles = append_to_first_of_each_tuple(fields_2b_matched_fuzzily, stringGroupers)
    
    if not fields_2b_matched_exactly:
        return horizontal_linkage(data_frame,
                                  match_indexes,
                                  fuzzy_bundles,
                                  hierarchical=hierarchical)
    else:
        groups = data_frame.groupby(fields_2b_matched_exactly)
        vertical_merger_list = []
        for group_value, group_df in groups:
            values_matched_exactly = [group_value] if not isinstance(group_value, tuple) else list(group_value)
            exact_field_value_pairs = list(zip(fields_2b_matched_exactly, values_matched_exactly))
            vertical_merger_list += [horizontal_linkage(group_df,
                                                        match_indexes,
                                                        fuzzy_bundles,
                                                        exact_field_value_pairs=exact_field_value_pairs,
                                                        hierarchical=hierarchical)]
        return pd.concat(vertical_merger_list)

# =========== MAIN ===========
print('Dedupe script')
t0 = time.time()
inputfilename = 'us-cities-real-estate-sample-zenrows.csv'
outputfilename1 = 'deduped_noneexact.csv'
outputfilename2 = 'deduped_someexact.csv'

df = pd.read_csv(inputfilename, dtype=str)
df = df[['zpid', 'statusText', 'addressZipcode', 'addressState', 'address', 'imgSrc', 'hasVideo']]
df.set_index('zpid', inplace=True)

record_matches_none_exact = record_linkage(df,
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
                                          ('address', 0.8, 3),
                                          ('addressState', 0.99999, 2),
                                          ('addressZipcode', 0.99999, 3),
                                          ('hasVideo', 0.99999, 3)],
               fields_2b_matched_exactly=None,
               hierarchical=True,
               max_n_matches=100,
               similarity_dtype=np.float32)
outmatches = \
record_matches_none_exact[
    record_matches_none_exact.index.get_level_values(0) < record_matches_none_exact.index.get_level_values(1)
]

print('T1:',time.time()-t0)
t0 = time.time()
outmatches.to_csv(outputfilename1, sep='\t', index=False)

record_matches_some_exact = record_linkage(df,
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
                                          ('address', 0.8, 3)],
               fields_2b_matched_exactly=['addressState', 'addressZipcode', 'hasVideo'],
               hierarchical=True,
               max_n_matches=100,
               similarity_dtype=np.float32)
outmatches = record_matches_some_exact[record_matches_some_exact.index.get_level_values(0) < record_matches_some_exact.index.get_level_values(1)]

print('T2:',time.time()-t0)
t0 = time.time()
outmatches.to_csv(outputfilename2, sep='\t', index=False)

print('Done.')

@ParticularMiner
Copy link
Contributor

Hi @berndnoll

Very curious indeed! I'm not sure what's going on here.

It doesn't look like you are doing anything different from what I did. Except that you are using quite a small value (100) for max_n_matches (I used max_n_matches = 10000). As you can see, the results I posted before were from the same 10 000 row dataset you sent me in that csv file. And they took

T1: 3m 52.98s
T2: 5m 18.61s

What are your machine specs, by the way?
Mine are: Intel Core i7 CPU@2.20GHz; 16GB RAM; Cache Memory: 256 KB/ 6 MB with a standard Flash Memory Solid State Drive running the latest Microsoft Windows Home Operating System. The code was ran in a Jupyter notebook in the Microsoft Edge browser using Python 3.9.2. So no quantum computer here. 😃 But about six months ago I was using a hard disk drive which made everything run extremely slow especially with large datasets. Upgrading to a solid state drive made a HUGE difference.

@berndnoll
Copy link
Author

Hi @ParticularMiner,
my machine specs are similar to yours...Intel(R) Core(TM) i7-8650U CPU @ 2.11 GHz, 16 GB RAM, SSD, Win 10 Prof.

Looking at the runtimes I had with my (way less elegant) hack vs. your beautiful code I wonder what's causing the slower performance. Just the additional fields? I am not too familiar with code profiling in py, though.
Will do some more testing and digging starting Tue when labor day weekend is over and let you know what I find.

Cheers
Bernd

@ParticularMiner
Copy link
Contributor

Hi @iibarant @berndnoll

This can wait till any time of your convenience. No pressure! I'm on the other side of the Atlantic Ocean anyway. So I do not celebrate Labor Day today. 😃

Lessons learnt so far through experimentation:

  1. Prior grouping by fields that are to be matched exactly leads to a performance boost ONLY IF the resulting number of groups is not too large.

    Recommendation: do not group by any field that has many unique data-values.

    For example, from @berndnoll's sample dataset, the field 'hasVideo' has only two unique data-values ('true' and 'false') leading to only two groups. So here, yes, group by this field if you want to do exact comparisons on this field. Otherwise, calling match_strings() on this field would take so much longer since so many matches will be found.

    On the other hand, for the same dataset, the concatenated field 'addressZipcode+addressCity' has more than 6000 unique values (df['addressZipcode+addressCity'].nunique() = 6579). So calling string_grouper's match_strings() function on each of these 6579 groups one after the other would result in an utter loss of efficiency.

    Somewhere between these two limits is a threshold number of groups (or unique field values) below which grouping leads to faster code..

  2. (I already mentioned this in a previous comment.)
    Be aware of the fact that applying string_grouper's match_strings() function on different subsets of a single pandas Series of strings causes the underlying corpus to change from subset to subset leading to inconsistent similarity-scores across the subsets. Therefore one needs to modify the StringGrouper class slightly in order to fix the corpus between subsets. See my previous code posting on one way to do this.

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Sep 6, 2021

@iibarant @berndnoll

By way of illustration of the judicious and injudicious use of grouping, here are some runtimes of the examples I gave in my last post. The following function call with grouping on field 'hasVideo' (which has only two unique data-values: 'true' and 'false') takes just ~35 seconds:

record_matches_some_exact = record_linkage(df.iloc[:, :],
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
                                          ('address', 0.8, 3),
                                          ('addressState', 0.999999, 2),
                                          ('addressZipcode', 0.999999, 3)],
               fields_2b_matched_exactly=['hasVideo'],
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32)

whereas the following call without grouping takes ~4.5 minutes to produce the same result:

record_matches_none_exact = record_linkage(df.iloc[:, :],
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
                                          ('address', 0.8, 3),
                                          ('addressState', 0.999999, 2),
                                          ('addressZipcode', 0.999999, 3),
                                          ('hasVideo', 0.999999, 3)],
               fields_2b_matched_exactly=None,
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32)

Finally, the following call with grouping on field 'addressZipcode' (which has 6446 unique values) takes ~9 minutes: to produce the same result

record_matches_some_exact = record_linkage(df.iloc[:, :],
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
                                          ('address', 0.8, 3),
                                          ('addressState', 0.999999, 2),
                                          ('hasVideo', 0.999999, 3)],
               fields_2b_matched_exactly=['addressZipcode'],
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32)

@ParticularMiner
Copy link
Contributor

Hi @berndnoll

Thanks for the compliment! However your code is essentially not that different from mine.

It is true that we both are using different numbers of fields but the main reason for the runtime difference between us is the value of the option parameter max_n_matches. You used max_n_matches = 20; I used max_n_matches = 10000. By so doing you miss a lot of matches, while I capture all possible matches.

Applying my code to the same fields you used (but with max_n_matches=10000) produced the following runtimes:
T1: 1.74s
T2: 3m 23.55s

@ParticularMiner
Copy link
Contributor

@berndnoll @iibarant

Looks like the magic number is ~375. (See plot below.)

So, at least, for a 10 000-row dataset 'grouping' is faster than 'not grouping' if the data can be grouped into less than 375 groups by the exact fields., otherwise 'not grouping' is faster.

Plot comparing Runtimes of record_linkage() calls with and without grouping on a test field having a varying number of unique values in a 10 000-row DataFrame

Figure 1

@iibarant
Copy link

iibarant commented Sep 7, 2021

@ParticularMiner,

That's a great experiment.
I ran your exact code on data from @berndnoll and received the following run time figures:
T1 = 3 min 8.07 sec
T2 = 2 min 38.43 sec

But setting up the only addressState field for the exact match reduced the runtime to only 5.07 seconds !!!
So, my intuitive approach was proved.

Thank you very much for the professional code that is very convenient for any user.

@ParticularMiner
Copy link
Contributor

Hi @iibarant

Indeed your intuition was right! And it was right because we have only 51 states to group 10 000 records by. Such a small number of unique values is well below the crossover number of 375.

Hope you had an enjoyable Labor Day!

@iibarant
Copy link

iibarant commented Sep 7, 2021

Thank you! The Labor Day was great.

I tried the following code on my 3.5 million data frame:

record_matches = record_linkage(df,
               fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
               fields_2b_matched_exactly=['foad'],
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32)
interesting_record_matches_some_exact = \
record_matches[
    record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)].sort_values(('', '', 'Total Similarity Score'), ascending=False)

t = time.time()-t1
print("Runtime (sec): ", t)

And I got the following error:
Traceback (most recent call last):

File "", line 3, in
record_matches = record_linkage(df,

File "/Users/iibarant/Desktop/Python scripts/Explorations", line 174, in record_linkage
vertical_merger_list += [horizontal_linkage(group_df,

File "/Users/iibarant/Desktop/Python scripts/Explorations", line 74, in horizontal_linkage
matches = match_strings(df[field], sg)

File "/Users/iibarant/Desktop/Python scripts/Explorations", line 55, in match_strings
sg = sg.fit()

File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 264, in fit
matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)

File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 467, in _build_matches
return awesome_cossim_topn(

File "/opt/anaconda3/lib/python3.8/site-packages/sparse_dot_topn/awesome_cossim_topn.py", line 119, in awesome_cossim_topn
alt_indices, alt_data = ct_thread.sparse_dot_topn_extd_threaded(

File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 133, in sparse_dot_topn.sparse_dot_topn_threaded.__pyx_fuse_0sparse_dot_topn_extd_threaded

File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 168, in sparse_dot_topn.sparse_dot_topn_threaded.sparse_dot_topn_extd_threaded

OverflowError: value too large to convert to int

@ParticularMiner
Copy link
Contributor

@iibarant

You may have to reduce the value of max_n_matches. 3.5 million records is testing the limits of your computer.

@ParticularMiner
Copy link
Contributor

@iiberant

Though this is an ambitious suggestion, you could also try to figure out how to use string_grouper to work on smaller, manageable pieces of your dataset, and afterwards piece together all the different results to form the whole, thus avoiding the OverflowError.

@iibarant
Copy link

iibarant commented Sep 7, 2021

@ParticularMiner
Reducing the max_n_matches to 5000 allowed to run the whole data for 1 hour 53 minutes which is not bad for the size of the dataframe (over3.5 million records).

@ParticularMiner
Copy link
Contributor

@iibarant
That's great. Did you ascertain that that captured all possible matches?

@ParticularMiner
Copy link
Contributor

@iibarant

Wow. Well your machine is performing as expected, I guess. Better than expected in fact.

And it looks like I've solved the OverflowError issue, right? Now I'm also beginning to wonder if spiltting is sometimes actually faster than not splitting ...

Are you satisfied with the performance? If so, I'll submit record_linkage() to @Bergvca (the maintainer) for inclusion into string_grouper.

I'm still not sure about what happened with the California data. I'll keep it in mind though. But if you have any ideas about it please let me know.

@iibarant
Copy link

iibarant commented Sep 9, 2021

@ParticularMiner,
your solution is great, thanks a lot for your contribution. I am satisfied with the performance and you can certainly submit the record_linkage() into string_grouper.

What I see as update is to add weights into similarity scores while combining the 'Total Similarity Score' when there are more than one 'fields_2b_matched_fuzzily'. So, 'Total Similarity Score' = weight1 x similarity1 + weight2 x similarity2 ...

I believe @berndnoll wanted to use that approach.

Thank you very much!

@ParticularMiner
Copy link
Contributor

@iibarant

Ok. Good. And many thanks to you and @berndnoll for collaborating and sharing data with me despite your professional constraints. Without your input I likely wouldn't have been able to come this far. By the way, do you and @berndnoll work together?

Yes indeed I already saw that @berndnoll had included weights in his code. These can be added relatively easily. Would you like to add them? Typically the sum of all the weights should be 1, right?

@berndnoll
Copy link
Author

@iibarant
Yes, I had implemented the weight in my solution. Default weight is 1.
That is especially helpful when you have identifiers in the data that indicate clear dupes - e.g. DUNS number.

@iibarant
Copy link

iibarant commented Sep 9, 2021

@ParticularMiner @berndnoll

using weights would be really helpful for fuzzily matching records in data sets with more than one string columns. Like we consider matching phone numbers less important than zip code (or vice versa).
When sum(weights) = 1, then it would be easier to interpret the results.

Thank you!

@ParticularMiner
Copy link
Contributor

@iibarant

FYI, out of curiosity, I just re-ran the call on the company data while forcing the program to split the data into two parts like your machine did. And I obtained a total runtime of 8 min 7.22 s (down from 11 min 53.76 s minutes without splitting). So indeed, it seems that if the dataset is large enough, splitting can prove to be a faster and more efficient alternative. Still my machine couldn't beat your machine time of 4.5 minutes, though. 😃

BUT ... Force-splitting down even further into four parts brought the runtime down to 5 min 12.54 s !

Afterwards, further splitting into eight parts increased the runtime to 7 min 6.21 s. So there's a limit.

I feel like there might be a relationship to our earlier case of splitting by grouping by another field. But that is research for another time.

@berndnoll
Copy link
Author

@ALL
Remarkable collaboration, I love this thread.
Re: weight
I am not sure if the sum of weight has to be 1.
Having a default of x1 per rule gives more opportunity to put emphasis on identifying attributes.

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Sep 10, 2021

@berndnoll @iibarant

How about this: the user may specify any number > 0 for each weight. The sum of the weights need not be 1. However, internally, the weights are scaled (or divided) by their sum so that we have:

<weighted mean similarity score> = (Σfield weightfield × similarityfield) / (Σfield weightfield)

where Σfield means "sum over fields"

Is this satisfactory?

The code has now been updated with the above feature.

I've also cleaned-up the clutter to improve the speed a bit. @iibarant could you please try this new code on your California data again? I want to know whether these improvements made any difference to the runtimes on your machine.

See updated code below:

import pandas as pd
import numpy as np
import functools
from string_grouper import StringGrouper
from scipy.sparse.csr import csr_matrix
from topn import awesome_topn


def record_linkage(data_frame,
                   fields_2b_matched_fuzzily,
                   fields_2b_matched_exactly=None,
                   hierarchical=True,
                   max_n_matches=None,
                   similarity_dtype=np.float32,
                   force_corrections=True,
                   n_blocks=None
                  ):
    '''
    Function that combines similarity-matching results of several fields of a DataFrame and returns them in another
    DataFrame
    :param data_frame: pandas.DataFrame of strings.
    :param fields_2b_matched_fuzzily: List of tuples.  Each tuple is a triple 
        (<field name>, <threshold>, <ngram_size>, <weight>).
        <field name> is the name of a field in data_frame which is to be matched using a threshold similarity score of 
        <threshold> and an ngram size of <ngram_size>. <weight> is a number that defines the **relative** importance 
        of the field to other fields -- the field's contribution to the total similarity will be weighted by this
        number.
    :param fields_2b_matched_exactly: List of tuples.  Each tuple is a pair (<field name>, <weight>).  <field name> is
        the name of a field in data_frame which is to be matched exactly.  <weight> has the same meaning as in
        parameter fields_2b_matched_fuzzily. Defaults to None.
    :param hierarchical: bool.  Determines if the output DataFrame will have a hierarchical column-structure (True) or 
        not (False). Defaults to True.
    :param max_n_matches: int. Maximum number of matches allowed per string.
    :param similarity_dtype: numpy type.  Either np.float32 (the default) or np.float64.  A value of np.float32 allows
        for less memory overhead during computation but less numerical precision, while np.float64 allows for greater
        numerical precision but a larger memory overhead.
    :param force_corrections: bool. Specifies whether corrections should be made to the results to account for 
        symmetry thus removing certain errors due to lack of numerical precision
    :param n_blocks: Tuple[(int, int)]. This parameter is provided to boost performance, if possible, by splitting the 
        dataset into n_blocks[0] blocks for the left operand (of the "comparison operator") and into n_blocks[1] blocks
        for the right operand before performing the string-comparisons blockwise.
    :return: pandas.DataFrame containing matching results.
    '''
    def get_field_names(fields_tuples):
        return list(list(zip(*fields_tuples))[0])
    
    def get_exact_weights(exact_field_tuples):
        return list(list(zip(*exact_field_tuples))[1])
    
    def get_fuzzy_weights(fuzzy_field_tuples):
        return list(list(zip(*fuzzy_field_tuples))[3])
    
    def get_field_value_pairs(field_names, values):
        return list(zip(field_names, values))
    
    def get_field_stringGrouper_pairs(fuzzy_field_tuples, string_groupers):
        return [(tupl[0], ) + (x, ) for tupl, x in list(zip(fuzzy_field_tuples, string_groupers))]
    
    def get_index_names(df):
        empty_df = df.iloc[0:0]
        return [field for field in empty_df.reset_index().columns \
                if field not in empty_df.columns]
    
    def prepend(strings, prefix):
        return [f'{prefix}{i}' for i in strings]
    
    def horizontal_linkage(df,
                           match_indexes,
                           fuzzy_field_grouper_pairs,
                           fuzzy_field_names,
                           fuzzy_field_weights,
                           exact_field_value_pairs=None,
                           exact_field_weights=None,
                           hierarchical=True,
                           force_corrections=False,
                           n_blocks=None):
        horizontal_merger_list = []
        for field, sg in fuzzy_field_grouper_pairs:
            matches = match_strings(df[field], sg, force_corrections=force_corrections, n_blocks=n_blocks)
            matches.set_index(match_indexes, inplace=True)
            matches = weed_out_trivial_matches(matches)
            if hierarchical:
                merger = matches[[f'left_{field}', 'similarity', f'right_{field}']]
                merger.rename(columns={f'left_{field}': 'left', f'right_{field}': 'right'}, inplace=True)
            else:
                merger = matches[['similarity']]
                merger.rename(columns={'similarity': field}, inplace=True)
            horizontal_merger_list += [merger]
            
        key_list = None if not hierarchical else fuzzy_field_names
        merged_df = pd.concat(horizontal_merger_list, axis=1, keys=key_list, join='inner')
        
        title_exact = 'Exactly Matched Fields'
        title_fuzzy = 'Fuzzily Matched Fields'
        if exact_field_value_pairs:
            exact_df = build_column_precursor_to(merged_df, exact_field_value_pairs)
            merged_df = pd.concat([exact_df, merged_df],
                                  axis=1,
                                  keys=[title_exact, title_fuzzy],
                                  join='inner')
        
        totals = compute_totals(merged_df,
                                fuzzy_field_names,
                                fuzzy_field_weights,
                                exact_field_weights,
                                hierarchical,
                                exact_field_value_pairs,
                                title_fuzzy)
        return pd.concat([totals, merged_df], axis=1)
    
    def weed_out_trivial_matches(matches):
        num_indexes = matches.index.nlevels//2
        return matches[
            functools.reduce(
                lambda a, b: a | b, 
                [
                    (matches.index.get_level_values(i) != matches.index.get_level_values(i + num_indexes)) \
                    for i in range(num_indexes)
                ]
            )
        ]
    
    def build_column_precursor_to(df, exact_field_value_pairs):
        exact_df = df.iloc[:, 0:0]
        for field_name, field_value in exact_field_value_pairs:
            exact_df[field_name] = field_value
        return exact_df

    def compute_totals(merged_df,
                       fuzzy_field_names,
                       fuzzy_field_weights,
                       exact_field_weights,
                       hierarchical,
                       exact_field_value_pairs,
                       title_fuzzy):
        title_total = 'Weighted Mean Similarity Score'
        fuzzy_weight_array = np.array(fuzzy_field_weights, dtype=float)
        if exact_field_value_pairs:
            exact_weight_array = np.array(exact_field_weights, dtype=float)
            total = fuzzy_weight_array.sum() + exact_weight_array.sum()
            fuzzy_weight_array /= total
            exact_field_contribution = (exact_weight_array/total).sum()
            if hierarchical:
                totals = merged_df[
                    [(title_fuzzy, field, 'similarity') for field in fuzzy_field_names]
                ].dot(fuzzy_weight_array) + exact_field_contribution
                totals = pd.concat([totals], axis=1, keys=[('', '', title_total)])
            else:
                totals = merged_df[
                    [(title_fuzzy, field) for field in fuzzy_field_names]
                ].dot(fuzzy_weight_array) + exact_field_contribution
                totals = pd.concat([totals], axis=1, keys=[('', title_total)])
        else:
            fuzzy_weight_array /= fuzzy_weight_array.sum()
            if hierarchical:
                totals = merged_df[
                    [(field, 'similarity') for field in fuzzy_field_names]
                ].dot(fuzzy_weight_array) 
                totals = pd.concat([totals], axis=1, keys=[('', title_total)])
            else:
                totals = merged_df[fuzzy_field_names].dot(fuzzy_weight_array)
                totals.rename(title_total, inplace=True)
        return totals
    
    index_name_list = get_index_names(data_frame)
    match_indexes = prepend(index_name_list, prefix='left_') + prepend(index_name_list, prefix='right_')
    
    # set the corpus for each fuzzy field
    stringGroupers = []
    for field, threshold, ngram_sz, _ in fields_2b_matched_fuzzily:
        stringGroupers += [FixedCorpusStringGrouper(data_frame[field],
                                                    min_similarity=threshold,
                                                    ngram_size=ngram_sz,
                                                    max_n_matches=max_n_matches,
                                                    tfidf_matrix_dtype=similarity_dtype)]
 
    fuzzy_field_grouper_pairs = get_field_stringGrouper_pairs(fields_2b_matched_fuzzily, stringGroupers)
    fuzzy_field_names = get_field_names(fields_2b_matched_fuzzily)
    fuzzy_field_weights = get_fuzzy_weights(fields_2b_matched_fuzzily)
    
    if not fields_2b_matched_exactly:
        return horizontal_linkage(data_frame,
                                  match_indexes,
                                  fuzzy_field_grouper_pairs,
                                  fuzzy_field_names,
                                  fuzzy_field_weights,
                                  hierarchical=hierarchical,
                                  force_corrections=force_corrections,
                                  n_blocks=n_blocks)
    else:
        exact_field_names = get_field_names(fields_2b_matched_exactly)
        exact_field_weights = get_exact_weights(fields_2b_matched_exactly)
        groups = data_frame.groupby(exact_field_names)
        vertical_merger_list = []
        for group_value, group_df in groups:
            values_matched_exactly = [group_value] if not isinstance(group_value, tuple) else list(group_value)
            exact_field_value_pairs = get_field_value_pairs(exact_field_names, values_matched_exactly)
            vertical_merger_list += [horizontal_linkage(group_df,
                                                        match_indexes,
                                                        fuzzy_field_grouper_pairs,
                                                        fuzzy_field_names,
                                                        fuzzy_field_weights,
                                                        exact_field_value_pairs,
                                                        exact_field_weights,
                                                        hierarchical=hierarchical,
                                                        force_corrections=force_corrections,
                                                        n_blocks=n_blocks)]
        return pd.concat(vertical_merger_list)
    
def match_strings(master, sg, force_corrections=True, n_blocks=None):
    sg._master = master
    sg = sg.fit(force_corrections=force_corrections, n_blocks=n_blocks)
    out = sg.get_matches()
    sg._matches_list = None
    sg._master = None
    return out
        

class FixedCorpusStringGrouper(StringGrouper):
    # This class enables StringGrouper to apply matching on different succesive master datasets without 
    # resetting the underlying corpus, as long as each master dataset is contained in the corpus. 
    # This class inherits from StringGrouper, overwriting only two of its methods:
    # __init__() and _get_tf_idf_matrices()
    def __init__(self, corpus, **kwargs):
        # initializer is the same as before except that it now also sets the corpus
        super().__init__(corpus, **kwargs)
        self._vectorizer = self._fit_vectorizer()
        self._master = None

    def _get_tf_idf_matrices(self, left_partition, right_partition):
        # _get_tf_idf_matrices() now no longer sets the corpus but rather builds the matrices from
        # the existing corpus
        # Build the two matrices
        left_matrix = self._vectorizer.transform(self._master.iloc[slice(*left_partition)])
        right_matrix = self._vectorizer.transform(self._master.iloc[slice(*right_partition)])
        return left_matrix, right_matrix

    def fit_blockwise_manual(self, n_blocks=(1, 1)):
        def divide_by(n):
            # mark blocks
            equal_block_sz = len(self._master)//n
            block_rem = len(self._master)%n
            block_ranges = []
            start = 0
            for block_id in range(n):
                block_ranges += [(start, start + equal_block_sz + (1 if block_id < block_rem else 0))]
                start = block_ranges[-1][1]
            return block_ranges
            
        block_ranges_left = divide_by(n_blocks[0])
        block_ranges_right = divide_by(n_blocks[1])
        max_n_matches = self._max_n_matches
        for left_block in block_ranges_left:
            for right_block in block_ranges_right:
                self._max_n_matches = min(right_block[1] - right_block[0], max_n_matches)
                master_matrix, duplicate_matrix = self._get_tf_idf_matrices(left_block, right_block)

                # Calculate the matches using the cosine similarity
                matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
                
                # build match-lists from matrix
                r, c = matches.nonzero()
                d = matches.data
                (self._r, self._c, self._d) = (
                    np.append(self._r, r + left_block[0]),
                    np.append(self._c, c + right_block[0]),
                    np.append(self._d, d)
                )
                
        self._max_n_matches = max_n_matches
        return True

    def fit_blockwise_auto(self, left_partition=(None, None), right_partition=(None, None)):
        """Builds the _matches list which contains string matches indices and similarity"""
        # fit() has been extended here to enable StringGrouper to handle large datasets which otherwise
        # would lead to an OverflowError
        def begin(partition):
            return partition[0] if partition[0] else 0

        def end(partition):
            return partition[1] if partition[1] else len(self._master)
        
        def explicit(partition):
            return begin(partition), end(partition)

        master_matrix, duplicate_matrix = self._get_tf_idf_matrices(left_partition, right_partition)

        try:
            # Calculate the matches using the cosine similarity
            matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
        except OverflowError:
            master_matrix = None
            duplicate_matrix = None
            max_n_matches = self._max_n_matches
            
            def split_partition(partition):
                data_begin = begin(partition)
                data_end = end(partition)
                data_mid = data_begin + (data_end - data_begin)//2
                return [(data_begin, data_mid), (data_mid, data_end)]
            
            left_halves = split_partition(left_partition)
            right_halves = split_partition(right_partition)
            for lhalf in left_halves:
                for rhalf in right_halves:
                    self._max_n_matches = min(rhalf[1] - rhalf[0], max_n_matches)
                    self.fit_blockwise_auto(left_partition=lhalf, right_partition=rhalf)
            self._max_n_matches = max_n_matches
            return True

        # build match-lists from matrix
        r, c = matches.nonzero()
        d = matches.data
        (self._r, self._c, self._d) = (
            np.append(self._r, r + begin(left_partition)),
            np.append(self._c, c + begin(right_partition)),
            np.append(self._d, d)
        )

        return False

    def fit(self, n_blocks=None, force_corrections=True):
        # initialize match-lists
        self._r = np.array([], dtype=np.int64)
        self._c = np.array([], dtype=np.int64)
        self._d = np.array([], dtype=self._config.tfidf_matrix_dtype)
        self._matches_list = pd.DataFrame()
        
        # do the matching
        if n_blocks:
            split_occurred = self.fit_blockwise_manual(n_blocks=n_blocks)
        else:
            split_occurred = self.fit_blockwise_auto()
        
        # self._matches_list = pd.DataFrame({'master_side': self._r, 'dupe_side': self._c, 'similarity': self._d})

        if split_occurred:
            # trim the matches to max_n_matches
            self._r, self._c, self._d = awesome_topn(self._r,
                                                     self._c,
                                                     self._d,
                                                     self._max_n_matches,
                                                     self._config.number_of_processes)
        
        # force symmetries to be respected?
        if force_corrections:
            matrix_sz = len(self._master)
            matches = csr_matrix(
                (
                    self._d, (self._r, self._c)
                ),
                shape=(matrix_sz, matrix_sz)
            )
            # release memory
            self._r = self._c = self._d = np.array([])

            # convert to lil format for best efficiency when setting matrix-elements
            matches = matches.tolil()
            # matrix diagonal elements must be exactly 1 (numerical precision errors introduced by
            # floating-point computations in awesome_cossim_topn sometimes lead to unexpected results)
            matches = StringGrouper._fix_diagonal(matches)
            # the list of matches must be symmetric! (i.e., if A != B and A matches B; then B matches A)
            matches = StringGrouper._symmetrize_matrix(matches)
            matches = matches.tocsr()
            self._matches_list = self._get_matches_list(matches)
        else:
            self._matches_list = pd.DataFrame(
                {
                    key: value for key, value in zip(
                        ('master_side', 'dupes_side', 'similarity'), (self._r, self._c, self._d)
                    )
                }
            )
            # release memory
            self._r = self._c = self._d = np.array([])
            
        self.is_build = True
        return self

Prepare Sample Data:

Use @berndnoll 's sample data:

inputfilename = 'data/us-cities-real-estate-sample-zenrows.csv'
df = pd.read_csv(inputfilename, dtype=str)

Examine data to determine which columns to use for test (that is, use only columns without null data):

for field in df.columns:
    if len(df[field].unique()) > 1 and not df[field].isna().values.any():
        print(f'{field} : {len(df[field].unique())}')
zpid : 10000
id : 10000
imgSrc : 9940
detailUrl : 10000
statusText : 24
address : 10000
addressState : 51
addressZipcode : 6446
isUndisclosedAddress : 2
isZillowOwned : 2
has3DModel : 2
hasVideo : 2
isFeaturedListing : 2
list : 2

Set the index:

df = df[['zpid', 'statusText', 'addressZipcode', 'addressState', 'address', 'imgSrc', 'hasVideo', 'addressCity', 'addressStreet']]
df.set_index('zpid', inplace=True)

Test the routine:

1. Grouping by fields that are to be matched exactly

Note that grouping by those fields that have very few unique values, such as 'addressState' (51 unique values) and 'hasVideo' (2 unique values), can lead to a significant performance boost. On the other hand, grouping by 'addressZipcode' (6446 unique values) degrades performance.

The following call took 8.11 seconds to run:

record_matches_some_exact = record_linkage(df.iloc[:, :],
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
                                          ('address', 0.8, 3, 1),
                                          ('addressZipcode', 0.999999, 3, 2)],
               fields_2b_matched_exactly=[('addressState', 4),
                                          ('hasVideo', 1)],
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32,
               force_corrections=False)
interesting_record_matches_some_exact = \
record_matches_some_exact[
    record_matches_some_exact.index.get_level_values(0) < record_matches_some_exact.index.get_level_values(1)
].sort_values(('', '', 'Weighted Mean Similarity Score'), ascending=False)
interesting_record_matches_some_exact
Exactly Matched Fields Fuzzily Matched Fields
addressState hasVideo statusText address addressZipcode
Weighted Mean Similarity Score left similarity right left similarity right left similarity right
left_zpid right_zpid
2077667803 2077679643 1.000000 WY false Lot / Land for sale 1.0 Lot / Land for sale Jlda Minor Sub Division LOT C, Buffalo, WY 82834 1.000000 Jlda Minor Subdivision LOT C, Buffalo, WY 82834 82834 1.0 82834
2075244057 2075358943 0.997100 OH false Lot / Land for sale 1.0 Lot / Land for sale 0 Township Road 118, Kimbolton, OH 43749 0.973904 Township Road 118, Kimbolton, OH 43749 43749 1.0 43749
2077676622 2077676809 0.993867 ND false Lot / Land for sale 1.0 Lot / Land for sale 4 55th St SE, Christine, ND 58015 0.944802 2 55th St SE, Christine, ND 58015 58015 1.0 58015
2077093064 2078843498 0.993328 SD false Lot / Land for sale 1.0 Lot / Land for sale 17 Sidney Park Rd, Custer, SD 57730 0.939948 Sidney Park Rd, Custer, SD 57730 57730 1.0 57730
150690392 2076123604 0.992909 NJ false Lot / Land for sale 1.0 Lot / Land for sale 5 Windsor Ln, Gladstone, NJ 07934 0.936180 0 Windsor Ln, Gladstone, NJ 07934 7934 1.0 7934
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 HI false New construction 1.0 New construction D12C Plan, Kaikoi at Hoopili 0.802290 D12B Plan, Kaikoi at Hoopili 96706 1.0 96706
305578084 90035758 0.977991 MO false Condo for sale 1.0 Condo for sale 210 N 17th St UNIT 203, Saint Louis, MO 63103 0.801920 210 N 17th St UNIT 1202, Saint Louis, MO 63103 63103 1.0 63103
2071195670 88086529 0.977983 MI false Condo for sale 1.0 Condo for sale 6533 E Jefferson Ave APT 426, Detroit, MI 48207 0.801844 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 48207 1.0 48207
247263033 247263136 0.977941 IA false New construction 1.0 New construction 1 University Way #511, Iowa City, IA 52246 0.801474 1 University Way #503, Iowa City, IA 52246 52246 1.0 52246
2083656138 2083656146 0.977873 IN false Condo for sale 1.0 Condo for sale 3789 S Anderson Dr, Terre Haute, IN 47803 0.800855 3776 S Anderson Dr, Terre Haute, IN 47803 47803 1.0 47803

94 rows × 12 columns

2. Without grouping

The results above can also be obtained in another way. However, it can take much longer to compute in cases where some fuzzily matched fields have very few uniques values.

The following call took 3 minutes 33.52 seconds to run:

record_matches_none_exact = record_linkage(df.iloc[:, :],
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
                                          ('address', 0.8, 3, 1),
                                          ('addressZipcode', 0.999999, 3, 2),
                                          ('hasVideo', 0.999999, 3, 1),
                                          ('addressState', 0.999999, 2, 4)],
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32,
               force_corrections=False)
interesting_record_matches_none_exact = \
record_matches_none_exact[
    record_matches_none_exact.index.get_level_values(0) < record_matches_none_exact.index.get_level_values(1)
].sort_values(('', 'Weighted Mean Similarity Score'), ascending=False)
interesting_record_matches_none_exact
statusText address addressZipcode hasVideo addressState
Weighted Mean Similarity Score left similarity right left similarity right left similarity right left similarity right left similarity right
left_zpid right_zpid
2077667803 2077679643 1.000000 Lot / Land for sale 1.0 Lot / Land for sale Jlda Minor Sub Division LOT C, Buffalo, WY 82834 1.000000 Jlda Minor Subdivision LOT C, Buffalo, WY 82834 82834 1.0 82834 false 1.0 false WY 1.0 WY
2075244057 2075358943 0.997100 Lot / Land for sale 1.0 Lot / Land for sale 0 Township Road 118, Kimbolton, OH 43749 0.973904 Township Road 118, Kimbolton, OH 43749 43749 1.0 43749 false 1.0 false OH 1.0 OH
2077676622 2077676809 0.993867 Lot / Land for sale 1.0 Lot / Land for sale 4 55th St SE, Christine, ND 58015 0.944802 2 55th St SE, Christine, ND 58015 58015 1.0 58015 false 1.0 false ND 1.0 ND
2077093064 2078843498 0.993328 Lot / Land for sale 1.0 Lot / Land for sale 17 Sidney Park Rd, Custer, SD 57730 0.939948 Sidney Park Rd, Custer, SD 57730 57730 1.0 57730 false 1.0 false SD 1.0 SD
150690392 2076123604 0.992909 Lot / Land for sale 1.0 Lot / Land for sale 5 Windsor Ln, Gladstone, NJ 07934 0.936180 0 Windsor Ln, Gladstone, NJ 07934 7934 1.0 7934 false 1.0 false NJ 1.0 NJ
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 New construction 1.0 New construction D12C Plan, Kaikoi at Hoopili 0.802290 D12B Plan, Kaikoi at Hoopili 96706 1.0 96706 false 1.0 false HI 1.0 HI
305578084 90035758 0.977991 Condo for sale 1.0 Condo for sale 210 N 17th St UNIT 203, Saint Louis, MO 63103 0.801920 210 N 17th St UNIT 1202, Saint Louis, MO 63103 63103 1.0 63103 false 1.0 false MO 1.0 MO
2071195670 88086529 0.977983 Condo for sale 1.0 Condo for sale 6533 E Jefferson Ave APT 426, Detroit, MI 48207 0.801844 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 48207 1.0 48207 false 1.0 false MI 1.0 MI
247263033 247263136 0.977941 New construction 1.0 New construction 1 University Way #511, Iowa City, IA 52246 0.801474 1 University Way #503, Iowa City, IA 52246 52246 1.0 52246 false 1.0 false IA 1.0 IA
2083656138 2083656146 0.977873 Condo for sale 1.0 Condo for sale 3789 S Anderson Dr, Terre Haute, IN 47803 0.800855 3776 S Anderson Dr, Terre Haute, IN 47803 47803 1.0 47803 false 1.0 false IN 1.0 IN

94 rows × 16 columns

One may choose to remove the field-values and output single-level column-headings by setting hierarchical to False:

record_matches_none_exact = record_linkage(df.iloc[:, :],
               fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
                                          ('address', 0.8, 3, 1),
                                          ('addressZipcode', 0.999999, 3, 2),
                                          ('hasVideo', 0.999999, 3, 1),
                                          ('addressState', 0.999999, 2, 4)],
               hierarchical=False,
               max_n_matches=10000,
               similarity_dtype=np.float32,
               force_corrections=False)
c:\users\heamu\.virtualenvs\distribution-test-bed\lib\site-packages\pandas\core\frame.py:4441: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(
interesting_record_matches_none_exact = \
record_matches_none_exact[
    record_matches_none_exact.index.get_level_values(0) < record_matches_none_exact.index.get_level_values(1)
].sort_values(('Weighted Mean Similarity Score'), ascending=False)
interesting_record_matches_none_exact
Weighted Mean Similarity Score statusText address addressZipcode hasVideo addressState
left_zpid right_zpid
2077667803 2077679643 1.000000 1.0 1.000000 1.0 1.0 1.0
2075244057 2075358943 0.997100 1.0 0.973904 1.0 1.0 1.0
2077676622 2077676809 0.993867 1.0 0.944802 1.0 1.0 1.0
2077093064 2078843498 0.993328 1.0 0.939948 1.0 1.0 1.0
150690392 2076123604 0.992909 1.0 0.936180 1.0 1.0 1.0
... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 1.0 0.802290 1.0 1.0 1.0
305578084 90035758 0.977991 1.0 0.801920 1.0 1.0 1.0
2071195670 88086529 0.977983 1.0 0.801844 1.0 1.0 1.0
247263033 247263136 0.977941 1.0 0.801474 1.0 1.0 1.0
2083656138 2083656146 0.977873 1.0 0.800855 1.0 1.0 1.0

94 rows × 6 columns

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Sep 10, 2021

@berndnoll @iibarant

record_linkage() now has another optional parameter, n_blocks, for you to play with and attempt to improve performance (See latest update). It denotes the number of 'blocks' into which record_linkage() will divide your dataset before performing the string comparison.

The discovery yesterday was that there exists an optimum number of blocks which gives the fastest speed of comparison. See plot below:

Plot of Runtimes of record_linkage() vs the number of blocks (#blocks) into which the dataset (380 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison.

Figure1

@ParticularMiner
Copy link
Contributor

Hi @berndnoll @iibarant

Recall that last week I was able, on my computer, to shave the run-time from ≈12 minutes to ≈4 minutes of a call of record_linkage() on 380 000 records of the sample company name data (which is essentially the same as using match_strings() from string_grouper)?

Well, I have just extended last week's discovery and brought down that runtime further to ≈3 minutes!

The trick here is the manner in which the splitting is done, which I will try to explain below.

String comparison, as implemented by string_grouper, is essentially matrix multiplication. Your DataFrame of strings is converted (tokenized) into a matrix. Then that matrix is multiplied by itself.

Here is an illustration of multiplication of two matrices M and D:
Block Matrix 1 1

The discovery we made last week is that when the matrix (or DataFrame) is very large, the computer proceeds quite slowly with the multiplication (apparently due to the RAM being too full). Even @iibarant's powerful computer spat out an OverflowError.

So we decided to divide the DataFrame into smaller chunks (or blocks) and multiply the chunks one pair at a time instead to get the same result:

Block Matrix 2 2

But surprise ... the run-time of the process was drastically reduced as a result (from ≈12 minutes to ≈4 minutes when the DataFrame was divided into 4 blocks, that is, 4 blocks on the left × 4 on the right).

Encouraged by this discovery, I explored the block number space a bit further by deciding not to split both right and left matrix operands into the same number of blocks. I consequently discovered that if the left operand is not split but the right operand is, then even more gains in speed can be made.

Block Matrix 1 2

Here are some plots of the results of experiments that I performed:

Plots of Runtimes of record_linkage() vs the number of blocks (#blocks) into which the left matrix-operand of the dataset (380 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number of blocks into which the left matrix-operand was split.

BlockSpaceExploration

From the plot above, tt can be seen that the optimum split-configuration (run-time ≈3 minutes) is when the left operand is not split (#blocks = 1) and the right operand is split into six blocks (#nblocks = 6).

Here is an example of how the latest update of record_linkage() may be used with the just described "right-left splitting":

record_matches_some_exact = record_linkage(companies,
               fields_2b_matched_fuzzily=[('Company Name', 0.8, 3, 1)],
               fields_2b_matched_exactly=None,
               hierarchical=True,
               max_n_matches=10000,
               similarity_dtype=np.float32,
               force_corrections=True,
               n_blocks=(1, 6))

@iibarant
Copy link

@ParticularMiner, this is great. I didn't have a chance to test the previous update, but I will try this one.

@ParticularMiner
Copy link
Contributor

@iibarant @berndnoll

By the way, to run the latest code you first need to install topn from pypi. Installation command is as usual:

pip install topn

@ParticularMiner
Copy link
Contributor

Sorry about that @iibarant .

Then use the previous version for now, which you can get from the same comment where my code is. But click on the “edited’ link at the top of the comment and choose the previous edit.

@ParticularMiner
Copy link
Contributor

@iibarant

You posted 2 screens with the log of your topn installation error: one at the beginning and the other at the end. Somewhere in between the actual error that caused all this is missing. I'd appreciate it if you would send me what's missing.

I wrote topn and posted it to pypi, but it installs fine on my Windows laptop. So I'll need to see the missing errors to determine why your computer is unable to install it as is.

@ParticularMiner
Copy link
Contributor

Thanks @iibarant

Looks like your computer does not use a C++14 compiler, which is the reason for the error. Never mind. I'll rewrite the code to something more backward compatible.

@ParticularMiner
Copy link
Contributor

@iibarant

Please try installing topn again. I think it should go through now.

@iibarant
Copy link

iibarant commented Sep 13, 2021

@ParticularMiner, I was able to install now, thank you.
I deleted the log messages as they are not relevant anymore.

@ParticularMiner
Copy link
Contributor

Sure. No problem.

@iibarant
Copy link

iibarant commented Sep 13, 2021

@ParticularMiner, I started to run 3.5 million records with the following code and got a caveat warning:

Screen Shot 2021-09-13 at 4 24 17 PM

@ParticularMiner
Copy link
Contributor

@iibarant

I get that too. So far I've been ignoring the warning.

I don't condone it, but you could just suppress all warnings with this:

import warnings
warnings.filterwarnings("ignore")

placed at the beginning of your program.

@iibarant
Copy link

@ParticularMiner, 1 hour 19 minutes is really good for the size of the dataframe.

@ParticularMiner
Copy link
Contributor

@iibarant

Which value of n_blocks did you use? Or did you leave it up to the machine to choose?

@iibarant
Copy link

iibarant commented Sep 14, 2021 via email

@ParticularMiner
Copy link
Contributor

@iiberant

Ok. Thanks. I agree that it doesn't look bad at all. It could have easily been >6 hours especially if you consider that theoretically, matching-time scales as the square of the DataFrame size.

All the best with your work!

@berndnoll
I hope your code is also going smoothly?

@ParticularMiner
Copy link
Contributor

@iibarant @berndnoll @Bergvca @justasojourner @ALL

Please see https://github.com/ParticularMiner/red_string_grouper

@iibarant
Copy link

@ParticularMiner, did you make any changes to the latest version above?

@ParticularMiner
Copy link
Contributor

@iibarant

No significant changes. Just changed the name of one of the parameters.

@iibarant
Copy link

iibarant commented Sep 16, 2021 via email

@Bergvca
Copy link
Owner

Bergvca commented Sep 20, 2021

Amazing work! Love red_string_grouper. Is it possible to add the block splitting to string_grouper as well? Or do you think it doesn't make sense?

This is also the way to do it in a distributed environment (e.g. using Spark) - you send the different blocks to different machines and then gather and merge the results. I never thought it would improve performance on a single machine though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants