Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return complete similarity matrix with get_matches() - including elements with 0 similarity #42

Open
nbcvijanovic opened this issue Apr 1, 2021 · 4 comments

Comments

@nbcvijanovic
Copy link

Is it possible to return the full similarity matrix when getting matches from the string grouper class?
Example:
string_grouper = StringGrouper(master = master, duplicates=duplicates[:1], master_id=master_ID, duplicates_id=duplicates_ID[:1], min_similarity = 0.0, max_n_matches = 10000, regex = "[,-./#]").fit() matches_df = string_grouper.get_matches()
Matches_df would ideally contain a dataframe with the same number of rows as master. So a complete similarity comparison of the one duplicate to all the master examples. But it seems to do a cutoff at some point (0) due to low similarity and I can't change that no matter how low (negative) I set the min_similarity. Is there a way to allow the 0 similarities to be returned as well? I can pad them later but it would be convenient.

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Apr 1, 2021

Hi @nbcvijanovic,

Good question.

Unfortunately, this is a side-effect of using sparse matrices: string_grouper.get_matches() will always return only non-zero similarities. And since often the overwhelming majority of similarities are 0, it would defeat the very purpose of using sparse matrices to make string_grouper use up computer resources to store all these zeroes. So the short answer to your question is "no".

However, I suggest you solve your problem by taking the following example and modifying it to fit your needs:
(See update in my next message)

@ParticularMiner
Copy link
Contributor

ParticularMiner commented Apr 1, 2021

In the previous solution (now deleted), duplicates had only one element. For the more general case, you may use the following example.
(For large datasets, use with memory resource-limitations in mind)

import pandas as pd
import numpy as np
from string_grouper import StringGrouper
companies_df = pd.read_csv('data/sec__edgar_company_info.csv')[0:50000]
master = companies_df['Company Name']
master_id = companies_df['Line Number']
duplicates = pd.Series(["ADVISORS DISCIPLINED TRUST", "ADVISORS DISCIPLINED TRUST '18"])
duplicates_id = pd.Series([3, 5])
string_grouper = StringGrouper(
    master = master,
    duplicates=duplicates,
    master_id=master_id,
    duplicates_id=duplicates_id,
    ignore_index=True, # this option exists only in the latest unstable version. Ignore it if you don't have it
    min_similarity = 0,
    max_n_matches = 10000,
    regex = "[,-./#]"
).fit()
matches_df = string_grouper.get_matches()
matches_df
left_Company Name left_Line Number similarity right_id right_side
0 #1 ARIZONA DISCOUNT PROPERTIES LLC 3 0.091157 3 ADVISORS DISCIPLINED TRUST
1 #1 ARIZONA DISCOUNT PROPERTIES LLC 3 0.063861 5 ADVISORS DISCIPLINED TRUST '18
2 05 CAT THIEF/GOLD IN MY STARS LLC 21 0.015313 3 ADVISORS DISCIPLINED TRUST
3 05 CAT THIEF/GOLD IN MY STARS LLC 21 0.010728 5 ADVISORS DISCIPLINED TRUST '18
4 05 DIXIE UNION/UNDER FIRE LLC 22 0.025397 3 ADVISORS DISCIPLINED TRUST
... ... ... ... ... ...
28754 BAAPLIFE3-2015, LLC 49976 0.021830 5 ADVISORS DISCIPLINED TRUST '18
28755 BAAPLIFE4-2016, LLC 49977 0.030983 3 ADVISORS DISCIPLINED TRUST
28756 BAAPLIFE4-2016, LLC 49977 0.021706 5 ADVISORS DISCIPLINED TRUST '18
28757 BABA JOE DIAMOND VENTURES US INC. 49989 0.027064 3 ADVISORS DISCIPLINED TRUST
28758 BABA JOE DIAMOND VENTURES US INC. 49989 0.018960 5 ADVISORS DISCIPLINED TRUST '18

28759 rows × 5 columns

# I can only suggest you try the following:

# 1. find the missing positional index pairs of (master_id, duplicates_id):

left_id_col = 'left_Line Number'
right_id_col = 'right_id'
left_idx = pd.Index(master_id).get_indexer(matches_df[left_id_col].values)
right_idx = pd.Index(duplicates_id).get_indexer(matches_df[right_id_col].values)
matched_pairs = zip(left_idx, right_idx)

M, D = len(master_id), len(duplicates_id)
all_pairs = pd.MultiIndex.from_product([range(M), range(D)])

missing_pairs = set(all_pairs) - set(matched_pairs)
missing_pairs = np.array(list(missing_pairs))

# 2. construct the missing-zeroes-matrix:
#    ensure the missing-zeroes-matrix has the same order of columns as matches_df:
missing_df = pd.DataFrame(
    {
        'left_Company Name': master.iloc[missing_pairs[:, 0]].reset_index(drop=True),
        'left_Line Number': master_id.iloc[missing_pairs[:, 0]].reset_index(drop=True),
        'similarity': pd.Series(np.full(len(missing_pairs), 0)),
        'right_id': duplicates_id.iloc[missing_pairs[:, 1]].reset_index(drop=True),
        'right_side': duplicates.iloc[missing_pairs[:, 1]].reset_index(drop=True)
    }
)

# 2. concatenate string_grouper's results with the missing-values-matrix:
full_matches_df = pd.concat([matches_df, missing_df], axis=0, ignore_index=True)
full_matches_df.sort_values(['similarity', 'right_id', 'left_Line Number'], ascending=[False, True, True])
left_Company Name left_Line Number similarity right_id right_side
11775 ADVISORS DISCIPLINED TRUST 14940 1.000000 3 ADVISORS DISCIPLINED TRUST
11777 ADVISORS DISCIPLINED TRUST 14941 1.000000 3 ADVISORS DISCIPLINED TRUST
9929 ADVISORS DISCIPLINED TRUST 200 14017 0.902835 3 ADVISORS DISCIPLINED TRUST
9931 ADVISORS DISCIPLINED TRUST 201 14018 0.894230 3 ADVISORS DISCIPLINED TRUST
10055 ADVISORS DISCIPLINED TRUST 25 14080 0.874724 3 ADVISORS DISCIPLINED TRUST
... ... ... ... ... ...
52615 BABB DOUGLAS J 49996 0.000000 5 ADVISORS DISCIPLINED TRUST '18
57518 BABB HENRY C 49997 0.000000 5 ADVISORS DISCIPLINED TRUST '18
48830 BABB INTERNATIONAL INC 49998 0.000000 5 ADVISORS DISCIPLINED TRUST '18
30616 BABB JACK J 49999 0.000000 5 ADVISORS DISCIPLINED TRUST '18
70100 BABB JAMES G. III 50000 0.000000 5 ADVISORS DISCIPLINED TRUST '18

100000 rows × 5 columns

# OR you try the following:

# 1. build the full matrix with only zero similarities
M, D = len(master_id), len(duplicates_id)
all_pairs = pd.MultiIndex.from_product([range(M), range(D)])
all_pairs = np.array(list(all_pairs))
#    ensure the full-zeroes-matrix has the same order of columns as matches_df:
full_df = pd.DataFrame(
    {
        'left_Company Name': master.iloc[all_pairs[:, 0]].reset_index(drop=True),
        'left_Line Number': master_id.iloc[all_pairs[:, 0]].reset_index(drop=True),
        'similarity': pd.Series(np.full(len(all_pairs), 0)),
        'right_id': duplicates_id.iloc[all_pairs[:, 1]].reset_index(drop=True),
        'right_side': duplicates.iloc[all_pairs[:, 1]].reset_index(drop=True)
    }
)

# 2. combine string_grouper's results with the full matrix:
left_id_col = 'left_Line Number'
right_id_col = 'right_id'
full_matches_df = \
matches_df.set_index([left_id_col, right_id_col])\
.combine_first(
    full_df.set_index([left_id_col, right_id_col])
).reset_index()
# full_matches_df.right_id = full_matches_df.right_id.astype(duplicates_id.dtype)
full_matches_df.sort_values(['similarity', 'right_id', 'left_Line Number'], ascending=[False, True, True])
left_Line Number right_id left_Company Name similarity right_side
29878 14940 3 ADVISORS DISCIPLINED TRUST 1.000000 ADVISORS DISCIPLINED TRUST
29880 14941 3 ADVISORS DISCIPLINED TRUST 1.000000 ADVISORS DISCIPLINED TRUST
28032 14017 3 ADVISORS DISCIPLINED TRUST 200 0.902835 ADVISORS DISCIPLINED TRUST
28034 14018 3 ADVISORS DISCIPLINED TRUST 201 0.894230 ADVISORS DISCIPLINED TRUST
28158 14080 3 ADVISORS DISCIPLINED TRUST 25 0.874724 ADVISORS DISCIPLINED TRUST
... ... ... ... ... ...
99991 49996 5 BABB DOUGLAS J 0.000000 ADVISORS DISCIPLINED TRUST '18
99993 49997 5 BABB HENRY C 0.000000 ADVISORS DISCIPLINED TRUST '18
99995 49998 5 BABB INTERNATIONAL INC 0.000000 ADVISORS DISCIPLINED TRUST '18
99997 49999 5 BABB JACK J 0.000000 ADVISORS DISCIPLINED TRUST '18
99999 50000 5 BABB JAMES G. III 0.000000 ADVISORS DISCIPLINED TRUST '18

100000 rows × 5 columns

@ParticularMiner
Copy link
Contributor

On second thought @nbcvijanovic I think this issue you've pointed out is an important bug. So thank you for pointing it out!
See PR #43 for the fix.

Cheers!

@nbcvijanovic
Copy link
Author

Glad I could help :) And thanks for the example code!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants