Return complete similarity matrix with get_matches() - including elements with 0 similarity #42
Comments
Hi @nbcvijanovic, good question. Unfortunately, this is a side effect of using sparse matrices: pairs with zero similarity are never stored, so get_matches() has nothing to return for them. However, I suggest you solve your problem by taking the following example and modifying it to fit your needs:
In the previous solution (now deleted),
import pandas as pd
import numpy as np
from string_grouper import StringGrouper

companies_df = pd.read_csv('data/sec__edgar_company_info.csv')[0:50000]
master = companies_df['Company Name']
master_id = companies_df['Line Number']
duplicates = pd.Series(["ADVISORS DISCIPLINED TRUST", "ADVISORS DISCIPLINED TRUST '18"])
duplicates_id = pd.Series([3, 5])

string_grouper = StringGrouper(
master = master,
duplicates=duplicates,
master_id=master_id,
duplicates_id=duplicates_id,
ignore_index=True, # this option exists only in the latest unstable version. Ignore it if you don't have it
min_similarity = 0,
max_n_matches = 10000,
regex = "[,-./#]"
).fit()
matches_df = string_grouper.get_matches()
matches_df
28759 rows × 5 columns

# I can only suggest you try the following:
# 1. find the missing positional index pairs of (master_id, duplicates_id):
left_id_col = 'left_Line Number'
right_id_col = 'right_id'
left_idx = pd.Index(master_id).get_indexer(matches_df[left_id_col].values)
right_idx = pd.Index(duplicates_id).get_indexer(matches_df[right_id_col].values)
matched_pairs = zip(left_idx, right_idx)
M, D = len(master_id), len(duplicates_id)
all_pairs = pd.MultiIndex.from_product([range(M), range(D)])
missing_pairs = set(all_pairs) - set(matched_pairs)
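# sort for a deterministic order; reshape keeps two columns even if nothing is missing
missing_pairs = np.array(sorted(missing_pairs), dtype=int).reshape(-1, 2)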
# 2. construct the missing-zeroes-matrix:
# ensure the missing-zeroes-matrix has the same order of columns as matches_df:
missing_df = pd.DataFrame(
{
'left_Company Name': master.iloc[missing_pairs[:, 0]].reset_index(drop=True),
'left_Line Number': master_id.iloc[missing_pairs[:, 0]].reset_index(drop=True),
'similarity': pd.Series(np.full(len(missing_pairs), 0)),
'right_id': duplicates_id.iloc[missing_pairs[:, 1]].reset_index(drop=True),
'right_side': duplicates.iloc[missing_pairs[:, 1]].reset_index(drop=True)
}
)
# 3. concatenate string_grouper's results with the missing-zeroes-matrix:
full_matches_df = pd.concat([matches_df, missing_df], axis=0, ignore_index=True)
full_matches_df.sort_values(['similarity', 'right_id', 'left_Line Number'], ascending=[False, True, True])
100000 rows × 5 columns

# OR you try the following:
# 1. build the full matrix with only zero similarities
M, D = len(master_id), len(duplicates_id)
all_pairs = pd.MultiIndex.from_product([range(M), range(D)])
all_pairs = np.array(list(all_pairs))
# ensure the full-zeroes-matrix has the same order of columns as matches_df:
full_df = pd.DataFrame(
{
'left_Company Name': master.iloc[all_pairs[:, 0]].reset_index(drop=True),
'left_Line Number': master_id.iloc[all_pairs[:, 0]].reset_index(drop=True),
'similarity': pd.Series(np.full(len(all_pairs), 0)),
'right_id': duplicates_id.iloc[all_pairs[:, 1]].reset_index(drop=True),
'right_side': duplicates.iloc[all_pairs[:, 1]].reset_index(drop=True)
}
)
# 2. combine string_grouper's results with the full matrix:
left_id_col = 'left_Line Number'
right_id_col = 'right_id'
full_matches_df = \
matches_df.set_index([left_id_col, right_id_col])\
.combine_first(
full_df.set_index([left_id_col, right_id_col])
).reset_index()
# full_matches_df.right_id = full_matches_df.right_id.astype(duplicates_id.dtype)
full_matches_df.sort_values(['similarity', 'right_id', 'left_Line Number'], ascending=[False, True, True])
100000 rows × 5 columns
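For reference, the same padding idea can be sketched more compactly with `reindex`. This is only a sketch on toy data: the column names `left_id` and `right_id`, and the small inputs below, are hypothetical stand-ins rather than real string_grouper output, and it assumes each (left, right) pair appears at most once in the matches.

```python
import pandas as pd

# Toy stand-ins for master_id, duplicates_id and a sparse result (hypothetical data).
master_id = pd.Series([10, 11, 12])
duplicates_id = pd.Series([3, 5])
matches_df = pd.DataFrame({
    'left_id': [10, 12],
    'right_id': [3, 5],
    'similarity': [1.0, 0.4],
})

# Every (left_id, right_id) combination, including the pairs that were dropped.
all_pairs = pd.MultiIndex.from_product(
    [master_id, duplicates_id], names=['left_id', 'right_id']
)

# Reindex onto the full set of pairs; pairs absent from matches_df get similarity 0.
full_df = (
    matches_df.set_index(['left_id', 'right_id'])
    .reindex(all_pairs, fill_value=0.0)
    .reset_index()
)
print(full_df)
```

This only carries the id columns; the string columns would still have to be joined back on from master and duplicates by id.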
On second thought, @nbcvijanovic, I think the issue you've pointed out is an important bug. So thank you for pointing it out! Cheers!
Glad I could help :) And thanks for the example code!
Is it possible to return the full similarity matrix when getting matches from the string grouper class?
Example:
string_grouper = StringGrouper(master=master, duplicates=duplicates[:1], master_id=master_ID, duplicates_id=duplicates_ID[:1], min_similarity=0.0, max_n_matches=10000, regex="[,-./#]").fit()
matches_df = string_grouper.get_matches()
Ideally, matches_df would be a dataframe with the same number of rows as master, i.e. a complete similarity comparison of the one duplicate against every master entry. But it seems to cut off at a similarity of 0, and I can't change that no matter how low (even negative) I set min_similarity. Is there a way to have the 0 similarities returned as well? I can pad them later, but it would be convenient.
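To illustrate why lowering min_similarity cannot bring the zero rows back, here is a minimal pure-Python sketch of sparse storage. It is not string_grouper's actual implementation, and the scores are made up: it only shows that a sparse structure keeps non-zero entries, so a zero pair is absent rather than filtered out by a threshold.

```python
# Dense similarity scores of one duplicate against four master strings (made-up numbers).
dense_row = [0.0, 0.73, 0.0, 0.12]

# Dictionary-of-keys sparse storage: only non-zero entries are kept.
sparse_row = {i: s for i, s in enumerate(dense_row) if s != 0.0}

# Reading matches back from the sparse structure yields only the stored entries,
# so no threshold, however low, can recover the zero pairs.
matches = sorted(sparse_row.items())

# Padding afterwards: visit every master index and default to 0.0 when absent.
padded = [(i, sparse_row.get(i, 0.0)) for i in range(len(dense_row))]

print(matches)
print(padded)
```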