Adding similarity column in the group_similar_strings output #14

selfcontrol7WC · 2020-11-04T05:29:14Z

Hi,

Thank you for this great code; it has worked very well in my use case so far.
How can I add the similarity values from the computed cosine similarity to the output of the group_similar_strings function?
The output I am trying to produce is a pandas.Series containing each duplicated name together with its cosine similarity to the deduplicated_name.

So it would be something like this:
Line Number | Company Name | Company CIK Key | Similarity | deduplicated_name

Could you please help?
Thank you.

Bergvca · 2020-11-04T19:39:32Z

Hi @selfcontrol7WC

The deduplicated name can be seen as a kind of "group identifier": all strings that have the same deduplicated_name belong to the same group. All strings in that group are similar to each other, and the group identifier is just a "random" string from within the group of similar strings. So it is not necessarily clear which similarity to pick. For example, suppose you have 3 similar strings with the following similarities:

string_a - string_b - 0.80
string_a - string_c - 0.99
string_b - string_c - 0.75

The deduplicated name will be "string_a", but for the entry "string_c", for example, do you pick 0.99? In that case the low similarity of 0.75 is lost. It is also possible to have another string (string_d) with a similarity of 0.99 to string_c, but only 0.74 to string_a. If your cutoff value is 0.75, there will be no match between string_a and string_d, but string_d will still be in the same group.
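
To make the string_d case concrete, here is a minimal sketch of transitive grouping. It is not string_grouper's actual code: the function name `group_by_connected_components` is hypothetical, the pairs are the example values from this comment, and it assumes a pair counts as a match when its similarity is at or above the cutoff.

```python
# Illustrative pairs taken from the example above (string, string, similarity).
pairs = [
    ("string_a", "string_b", 0.80),
    ("string_a", "string_c", 0.99),
    ("string_b", "string_c", 0.75),
    ("string_c", "string_d", 0.99),
    ("string_a", "string_d", 0.74),  # below the cutoff, so not a direct match
]
CUTOFF = 0.75


def group_by_connected_components(pairs, cutoff):
    """Hypothetical helper: group strings that are linked by a chain of matches."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, similarity in pairs:
        find(left)
        find(right)
        if similarity >= cutoff:
            # Merge the two groups only when the pair is a match.
            parent[find(left)] = find(right)

    groups = {}
    for s in list(parent):
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())


groups = group_by_connected_components(pairs, CUTOFF)
print(groups)
```

string_d ends up in the same group as string_a through string_c, even though the direct string_a / string_d pair is below the cutoff.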

Another possibility is to show, for each entry, the lowest similarity it has with any other string in the group. I think this might give a better indication of how similar a string is to the rest of its group. I think this is possible to do with some hacking.
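
A rough sketch of that idea in plain pandas, assuming you already have a table of matched pairs and a mapping from each string to its group identifier. The column names `left_side`, `right_side` and `similarity` and the variable names are assumptions for illustration, not a guaranteed string_grouper output format.

```python
import pandas as pd

# Assumed input: one row per matched pair (values from the example above).
matches = pd.DataFrame({
    "left_side": ["string_a", "string_a", "string_b"],
    "right_side": ["string_b", "string_c", "string_c"],
    "similarity": [0.80, 0.99, 0.75],
})

# Assumed input: each string mapped to its group identifier.
deduplicated_name = pd.Series({
    "string_a": "string_a",
    "string_b": "string_a",
    "string_c": "string_a",
})

# Mirror the pairs so every string appears on the left at least once.
mirrored = matches.rename(
    columns={"left_side": "right_side", "right_side": "left_side"}
)
both = pd.concat([matches, mirrored], ignore_index=True)

# Keep only pairs whose two members belong to the same group.
same_group = (
    both["left_side"].map(deduplicated_name)
    == both["right_side"].map(deduplicated_name)
)
both = both[same_group]

# Lowest recorded similarity per string within its group.
min_similarity = both.groupby("left_side")["similarity"].min()
print(min_similarity)
```

Note that only pairs above the cutoff are recorded as matches, so this is the lowest similarity among the recorded matches, not among all possible pairs in the group.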

selfcontrol7WC · 2020-11-05T03:05:15Z

Hi,

Thank you for your prompt reply and your detailed explanation. I appreciate it.

Yes, I now understand the tricky part: the group identifier is picked more or less at random, and the low similarities would be lost. I had not thought about that before.

In your example, if string_a is selected as the group identifier, I would pick 0.99 as the similarity for string_c, but I would then lose the other similarities related to string_c.

1. Thinking about it again, for simplicity, having the similarity value of each string within a group together with its group identifier would be enough for my use case for now.

2. Also, I like your approach of tracking and showing the lowest similarity of each string within the group. In that case, I cannot see how it would be part of the same single data frame returned by the group_similar_strings function, as in point 1 above. Would it go in a separate, second data frame? Also, if we consider the same example, would this data frame be something like a 3-dimensional data frame with 10 rows?

Sorry, I really have no idea where to start, which is why I drew these tables to make it clear in my own mind as well.

Could you please guide me on how to hack the code to get these results?

Thank you again for your time.
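
For point 1, one possible starting point is sketched below. It assumes a table of matched pairs with the hypothetical column names `left_side`, `right_side` and `similarity`, plus the grouping result as a series of group identifiers; the helper `similarity_to_representative` is made up for illustration and is not part of string_grouper.

```python
import pandas as pd

# Assumed inputs, using the example values from this thread.
names = pd.Series(["string_a", "string_b", "string_c"])
deduplicated_name = pd.Series(["string_a", "string_a", "string_a"])
matches = pd.DataFrame({
    "left_side": ["string_a", "string_a", "string_b"],
    "right_side": ["string_b", "string_c", "string_c"],
    "similarity": [0.80, 0.99, 0.75],
})


def similarity_to_representative(names, representatives, matches):
    """Hypothetical helper: look up each name's similarity to its group identifier."""
    lookup = {}
    for left, right, similarity in matches.itertuples(index=False):
        # Store both directions so the lookup is order-independent.
        lookup[(left, right)] = similarity
        lookup[(right, left)] = similarity

    similarities = [
        # A string compared with itself has cosine similarity 1.0.
        # NaN marks a member that has no direct match with its identifier.
        1.0 if name == rep else lookup.get((name, rep), float("nan"))
        for name, rep in zip(names, representatives)
    ]
    return pd.DataFrame({
        "Company Name": names,
        "Similarity": similarities,
        "deduplicated_name": representatives,
    })


result = similarity_to_representative(names, deduplicated_name, matches)
print(result)
```

A member that is only linked to the group identifier through other strings (like string_d in the earlier example) would get NaN here, which is exactly the ambiguity described above.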
