Adding similarity column in the group_similar_strings output #14

Open
selfcontrol7WC opened this issue Nov 4, 2020 · 2 comments

@selfcontrol7WC

selfcontrol7WC commented Nov 4, 2020

Hi,

Thank you for this amazing code; it has worked great so far in my use case.
How can I add the computed cosine similarity values to the output of the group_similar_strings function?
The output I am trying to produce is a pandas.Series containing each duplicated name together with its cosine similarity to the deduplicated_name.

So it would be something like this:
Line Number | Company Name | Company CIK Key | Similarity | deduplicated_name

Could you please help?
Thank you.

@Bergvca
Owner

Bergvca commented Nov 4, 2020

Hi @selfcontrol7WC

The deduplicated name can be seen as a kind of "group identifier": all strings that have the same deduplicated_name belong to the same group. All strings in that group are similar to each other, and the group identifier is just a "random" string within the group of similar strings. So it's not necessarily clear which similarity to pick. For example, suppose you have 3 similar strings with the following similarities:

string_a - string_b - 0.80
string_a - string_c - 0.99
string_b - string_c - 0.75

The deduplicated name will be "string_a", but which value do you pick for the entry "string_c", for example? If you pick 0.99, the low similarity of 0.75 will be lost. It is also possible to have another string (string_d) with similarity 0.99 to string_c, but 0.74 to string_a. If your cutoff value is 0.75, there will be no match between string_a and string_d, but string_d will still be in the same group.
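To illustrate why string_d still lands in the same group, here is a minimal sketch in plain Python. It assumes that groups are the connected components of the "matches" graph and that the cutoff is inclusive; both are assumptions made for illustration, not a description of the library's internals. The string_b/string_d similarity is not given above, so it is left out.

```python
from collections import defaultdict

# Pairwise similarities from the example above.
# The string_b/string_d value is not stated, so that pair is omitted.
similarities = {
    ("string_a", "string_b"): 0.80,
    ("string_a", "string_c"): 0.99,
    ("string_b", "string_c"): 0.75,
    ("string_c", "string_d"): 0.99,
    ("string_a", "string_d"): 0.74,
}
CUTOFF = 0.75  # assumed to be inclusive for this sketch

# Keep only the pairs that reach the cutoff; these are the "matches".
matches = {pair for pair, sim in similarities.items() if sim >= CUTOFF}

# Build an undirected graph whose edges are the matches.
neighbours = defaultdict(set)
for left, right in matches:
    neighbours[left].add(right)
    neighbours[right].add(left)


def connected_groups(graph):
    """Return the connected components of graph as a list of sorted lists."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


groups = connected_groups(neighbours)
print(groups)
```

The string_a/string_d pair is dropped by the cutoff, yet string_d is reachable from string_a through string_c, so all four strings end up in one group.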

Another possibility is to show, for each entry, the lowest similarity it has with any other string in the group. I think this might give a better indication of how similar a string is to the rest of its group. I think this is possible to do with some hacking.
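A rough sketch of that "lowest similarity per entry" idea, in plain Python and using the three-string example above. It assumes the pairwise similarities are already available as (left, right, similarity) tuples and that all of them belong to one group; the function name `lowest_similarity` is hypothetical. In practice the pairs would presumably come from the library's pairwise matching output, which is an assumption worth checking against the README.

```python
# Similarities from the three-string example above, all within one group.
pairs = [
    ("string_a", "string_b", 0.80),
    ("string_a", "string_c", 0.99),
    ("string_b", "string_c", 0.75),
]


def lowest_similarity(pairs):
    """Map each string to the lowest similarity it has with any other string."""
    lowest = {}
    for left, right, sim in pairs:
        # Each pair contributes a candidate minimum to both of its members.
        for name in (left, right):
            lowest[name] = min(sim, lowest.get(name, sim))
    return lowest


lowest = lowest_similarity(pairs)
for name in sorted(lowest):
    print(name, lowest[name])
```

With these numbers string_a gets 0.80, while string_b and string_c both get 0.75, so the weak string_b/string_c link stays visible even though string_c is very close to string_a.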

@selfcontrol7WC
Author

Hi,

Thank you for your prompt reply and your detailed explanation; I appreciate it.

Yes, I clearly understand the tricky part about the group identifier being picked arbitrarily, and also the fact that the low similarities will be lost. I had not thought about this before.

In your example, if string_a is selected as the group identifier, I would therefore pick 0.99 as the similarity of string_c, but then lose the other similarities related to string_c.

1. Thinking again about it, for simplicity, in my use case having the similarity values of each string within a group and their group identifier would be great for now.

How I see it:
Company Name | Similarity | deduplicated_name
string_a | 1 | string_a
string_b | 0.80 | string_a
string_c | 0.99 | string_a
string_d | 0.74 | string_a

2. Also, I like your approach of tracking and showing the lowest similarity of each string within the group. In that case, I cannot see how it would be part of the same single data frame returned by the group_similar_strings function, as in point 1 above. Would it go in a separate, second data frame? Also, if we consider our same example, would this be a three-column data frame with 10 rows?

Something like this?
Company Name1 | Company Name2 | similarity
string_a | string_a | 1
string_a | string_b | 0.80
string_a | string_c | 0.99
string_a | string_d | 0.74
string_b | string_b | 1
string_b | string_c | 0.75
string_b | string_d | 0.77
string_c | string_c | 1
string_c | string_d | 0.99
string_d | string_d | 1
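Both tables above can be derived from one lookup of pairwise similarities. The following is a minimal plain-Python sketch, not the library's API: the values are copied from the tables in this comment, `group_id` is assumed to be string_a, and the helper `sim` is hypothetical. Note that a value below the cutoff (such as 0.74) would probably not appear in real match output, in which case `sim` returns None for that pair.

```python
from itertools import combinations_with_replacement

names = ["string_a", "string_b", "string_c", "string_d"]

# Pairwise similarities copied from the tables above; keys are unordered pairs.
similarity = {
    frozenset(("string_a", "string_b")): 0.80,
    frozenset(("string_a", "string_c")): 0.99,
    frozenset(("string_a", "string_d")): 0.74,
    frozenset(("string_b", "string_c")): 0.75,
    frozenset(("string_b", "string_d")): 0.77,
    frozenset(("string_c", "string_d")): 0.99,
}
group_id = "string_a"  # assumed group identifier for this example


def sim(left, right):
    """Similarity of two strings: 1.0 with itself, None if the pair is unknown."""
    if left == right:
        return 1.0
    return similarity.get(frozenset((left, right)))


# Table 1: each string with its similarity to the group identifier.
by_group = [(name, sim(name, group_id), group_id) for name in names]

# Table 2: every unordered pair, including each string paired with itself.
pairwise = [
    (left, right, sim(left, right))
    for left, right in combinations_with_replacement(names, 2)
]

for row in by_group:
    print(" | ".join(map(str, row)))
for row in pairwise:
    print(" | ".join(map(str, row)))
```

Four strings give 4 self-pairs plus 6 distinct pairs, which is where the 10 rows come from. Either list of tuples can be passed to a data frame constructor if a tabular result is needed.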

Sorry, I really have no idea where to start, which is why I drew these tables: to make it clear in my own mind as well.

Could you please guide me on how I can hack the code to get these results?

Thank you again for your time.
