Return complete similarity matrix with get_matches() - including elements with 0 similarity #42

nbcvijanovic · 2021-04-01T06:58:09Z

Is it possible to return the full similarity matrix when getting matches from the string grouper class?
Example:
string_grouper = StringGrouper(master = master, duplicates=duplicates[:1], master_id=master_ID, duplicates_id=duplicates_ID[:1], min_similarity = 0.0, max_n_matches = 10000, regex = "[,-./#]").fit() matches_df = string_grouper.get_matches()
Matches_df would ideally contain a dataframe with the same number of rows as master. So a complete similarity comparison of the one duplicate to all the master examples. But it seems to do a cutoff at some point (0) due to low similarity and I can't change that no matter how low (negative) I set the min_similarity. Is there a way to allow the 0 similarities to be returned as well? I can pad them later but it would be convenient.

The text was updated successfully, but these errors were encountered:

ParticularMiner · 2021-04-01T14:35:17Z

Hi @nbcvijanovic,

Good question.

Unfortunately, this is a side-effect of using sparse matrices: string_grouper.get_matches() will always return only non-zero similarities. And since often the overwhelming majority of similarities are 0, it would defeat the very purpose of using sparse matrices to make string_grouper use up computer resources to store all these zeroes. So the short answer to your question is "no".

However, I suggest you solve your problem by taking the following example and modifying it to fit your needs:
(See update in my next message)

ParticularMiner · 2021-04-01T20:04:38Z

In the previous solution (now deleted), duplicates had only one element. For the more general case, you may use the following example.
(For large datasets, use with memory resource-limitations in mind)

import pandas as pd
import numpy as np
from string_grouper import StringGrouper

companies_df = pd.read_csv('data/sec__edgar_company_info.csv')[0:50000]

master = companies_df['Company Name']
master_id = companies_df['Line Number']
duplicates = pd.Series(["ADVISORS DISCIPLINED TRUST", "ADVISORS DISCIPLINED TRUST '18"])
duplicates_id = pd.Series([3, 5])

string_grouper = StringGrouper(
    master = master,
    duplicates=duplicates,
    master_id=master_id,
    duplicates_id=duplicates_id,
    ignore_index=True, # this option exists only in the latest unstable version. Ignore it if you don't have it
    min_similarity = 0,
    max_n_matches = 10000,
    regex = "[,-./#]"
).fit()
matches_df = string_grouper.get_matches()
matches_df

	left_Company Name	left_Line Number	similarity	right_id	right_side
0	#1 ARIZONA DISCOUNT PROPERTIES LLC	3	0.091157	3	ADVISORS DISCIPLINED TRUST
1	#1 ARIZONA DISCOUNT PROPERTIES LLC	3	0.063861	5	ADVISORS DISCIPLINED TRUST '18
2	05 CAT THIEF/GOLD IN MY STARS LLC	21	0.015313	3	ADVISORS DISCIPLINED TRUST
3	05 CAT THIEF/GOLD IN MY STARS LLC	21	0.010728	5	ADVISORS DISCIPLINED TRUST '18
4	05 DIXIE UNION/UNDER FIRE LLC	22	0.025397	3	ADVISORS DISCIPLINED TRUST
...	...	...	...	...	...
28754	BAAPLIFE3-2015, LLC	49976	0.021830	5	ADVISORS DISCIPLINED TRUST '18
28755	BAAPLIFE4-2016, LLC	49977	0.030983	3	ADVISORS DISCIPLINED TRUST
28756	BAAPLIFE4-2016, LLC	49977	0.021706	5	ADVISORS DISCIPLINED TRUST '18
28757	BABA JOE DIAMOND VENTURES US INC.	49989	0.027064	3	ADVISORS DISCIPLINED TRUST
28758	BABA JOE DIAMOND VENTURES US INC.	49989	0.018960	5	ADVISORS DISCIPLINED TRUST '18

28759 rows × 5 columns

# I can only suggest you try the following:

# 1. find the missing positional index pairs of (master_id, duplicates_id):

left_id_col = 'left_Line Number'
right_id_col = 'right_id'
left_idx = pd.Index(master_id).get_indexer(matches_df[left_id_col].values)
right_idx = pd.Index(duplicates_id).get_indexer(matches_df[right_id_col].values)
matched_pairs = zip(left_idx, right_idx)

M, D = len(master_id), len(duplicates_id)
all_pairs = pd.MultiIndex.from_product([range(M), range(D)])

missing_pairs = set(all_pairs) - set(matched_pairs)
missing_pairs = np.array(list(missing_pairs))

# 2. construct the missing-zeroes-matrix:
#    ensure the missing-zeroes-matrix has the same order of columns as matches_df:
missing_df = pd.DataFrame(
    {
        'left_Company Name': master.iloc[missing_pairs[:, 0]].reset_index(drop=True),
        'left_Line Number': master_id.iloc[missing_pairs[:, 0]].reset_index(drop=True),
        'similarity': pd.Series(np.full(len(missing_pairs), 0)),
        'right_id': duplicates_id.iloc[missing_pairs[:, 1]].reset_index(drop=True),
        'right_side': duplicates.iloc[missing_pairs[:, 1]].reset_index(drop=True)
    }
)

# 2. concatenate string_grouper's results with the missing-values-matrix:
full_matches_df = pd.concat([matches_df, missing_df], axis=0, ignore_index=True)
full_matches_df.sort_values(['similarity', 'right_id', 'left_Line Number'], ascending=[False, True, True])

	left_Company Name	left_Line Number	similarity	right_id	right_side
11775	ADVISORS DISCIPLINED TRUST	14940	1.000000	3	ADVISORS DISCIPLINED TRUST
11777	ADVISORS DISCIPLINED TRUST	14941	1.000000	3	ADVISORS DISCIPLINED TRUST
9929	ADVISORS DISCIPLINED TRUST 200	14017	0.902835	3	ADVISORS DISCIPLINED TRUST
9931	ADVISORS DISCIPLINED TRUST 201	14018	0.894230	3	ADVISORS DISCIPLINED TRUST
10055	ADVISORS DISCIPLINED TRUST 25	14080	0.874724	3	ADVISORS DISCIPLINED TRUST
...	...	...	...	...	...
52615	BABB DOUGLAS J	49996	0.000000	5	ADVISORS DISCIPLINED TRUST '18
57518	BABB HENRY C	49997	0.000000	5	ADVISORS DISCIPLINED TRUST '18
48830	BABB INTERNATIONAL INC	49998	0.000000	5	ADVISORS DISCIPLINED TRUST '18
30616	BABB JACK J	49999	0.000000	5	ADVISORS DISCIPLINED TRUST '18
70100	BABB JAMES G. III	50000	0.000000	5	ADVISORS DISCIPLINED TRUST '18

100000 rows × 5 columns

# OR you try the following:

# 1. build the full matrix with only zero similarities
M, D = len(master_id), len(duplicates_id)
all_pairs = pd.MultiIndex.from_product([range(M), range(D)])
all_pairs = np.array(list(all_pairs))
#    ensure the full-zeroes-matrix has the same order of columns as matches_df:
full_df = pd.DataFrame(
    {
        'left_Company Name': master.iloc[all_pairs[:, 0]].reset_index(drop=True),
        'left_Line Number': master_id.iloc[all_pairs[:, 0]].reset_index(drop=True),
        'similarity': pd.Series(np.full(len(all_pairs), 0)),
        'right_id': duplicates_id.iloc[all_pairs[:, 1]].reset_index(drop=True),
        'right_side': duplicates.iloc[all_pairs[:, 1]].reset_index(drop=True)
    }
)

# 2. combine string_grouper's results with the full matrix:
left_id_col = 'left_Line Number'
right_id_col = 'right_id'
full_matches_df = \
matches_df.set_index([left_id_col, right_id_col])\
.combine_first(
    full_df.set_index([left_id_col, right_id_col])
).reset_index()
# full_matches_df.right_id = full_matches_df.right_id.astype(duplicates_id.dtype)
full_matches_df.sort_values(['similarity', 'right_id', 'left_Line Number'], ascending=[False, True, True])

	left_Line Number	right_id	left_Company Name	similarity	right_side
29878	14940	3	ADVISORS DISCIPLINED TRUST	1.000000	ADVISORS DISCIPLINED TRUST
29880	14941	3	ADVISORS DISCIPLINED TRUST	1.000000	ADVISORS DISCIPLINED TRUST
28032	14017	3	ADVISORS DISCIPLINED TRUST 200	0.902835	ADVISORS DISCIPLINED TRUST
28034	14018	3	ADVISORS DISCIPLINED TRUST 201	0.894230	ADVISORS DISCIPLINED TRUST
28158	14080	3	ADVISORS DISCIPLINED TRUST 25	0.874724	ADVISORS DISCIPLINED TRUST
...	...	...	...	...	...
99991	49996	5	BABB DOUGLAS J	0.000000	ADVISORS DISCIPLINED TRUST '18
99993	49997	5	BABB HENRY C	0.000000	ADVISORS DISCIPLINED TRUST '18
99995	49998	5	BABB INTERNATIONAL INC	0.000000	ADVISORS DISCIPLINED TRUST '18
99997	49999	5	BABB JACK J	0.000000	ADVISORS DISCIPLINED TRUST '18
99999	50000	5	BABB JAMES G. III	0.000000	ADVISORS DISCIPLINED TRUST '18

100000 rows × 5 columns

ParticularMiner · 2021-04-02T07:17:07Z

On second thought @nbcvijanovic I think this issue you've pointed out is an important bug. So thank you for pointing it out!
See PR #43 for the fix.

Cheers!

nbcvijanovic · 2021-04-05T21:23:46Z

Glad I could help :) And thanks for the example code!

ParticularMiner mentioned this issue Apr 2, 2021

Fixed a bug which occurs when min_similarity=0 (User discovery) #43

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return complete similarity matrix with get_matches() - including elements with 0 similarity #42

Return complete similarity matrix with get_matches() - including elements with 0 similarity #42

nbcvijanovic commented Apr 1, 2021

ParticularMiner commented Apr 1, 2021 •

edited

ParticularMiner commented Apr 1, 2021 •

edited

ParticularMiner commented Apr 2, 2021

nbcvijanovic commented Apr 5, 2021

Return complete similarity matrix with get_matches() - including elements with 0 similarity #42

Return complete similarity matrix with get_matches() - including elements with 0 similarity #42

Comments

nbcvijanovic commented Apr 1, 2021

ParticularMiner commented Apr 1, 2021 • edited

ParticularMiner commented Apr 1, 2021 • edited

ParticularMiner commented Apr 2, 2021

nbcvijanovic commented Apr 5, 2021

ParticularMiner commented Apr 1, 2021 •

edited

ParticularMiner commented Apr 1, 2021 •

edited