Question about version string_grouper group_similar_strings #80

Open
dariswan opened this issue Jan 26, 2022 · 4 comments

Comments

@dariswan

Dear developer,

Could you explain the differences between the versions of string_grouper?
I only use one function, group_similar_strings. I am currently on version 0.1.1, but the latest version is 0.6.1.

This library is very helpful, but when I use group_similar_strings with a custom similarity threshold, it sometimes fails to group strings that look like they belong together when I check them by eye.
Is it worth upgrading to the latest version? What has improved?
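
When comparing behaviour across releases, it can help to confirm which version is actually installed in the current environment. The following is a minimal sketch using only the Python standard library (Python 3.8+); the helper name `installed_version` is hypothetical and not part of string_grouper.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string, or None when the package is not installed.
    print(installed_version("string_grouper"))
```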

@ParticularMiner
Contributor

Hi @dariswan

The latest version is supposed to be much faster than older versions as your dataset-size increases. I would be interested to see how group_similar_strings failed. If possible, could you send me a code/data sample that reproduces the failure?

Thanks.

@dariswan
Author

Hi @ParticularMiner

group_similar_strings does not fail outright, but judging by eye it sometimes gives inaccurate results for individual strings.
In my case, I tried to group similar emails with the default similarity (80%), for example:

These three emails should be in one group, judging by eye.

@ParticularMiner
Contributor

ParticularMiner commented Jan 27, 2022

Hi @dariswan

For such a small set of strings, the default similarity threshold (80%) is too high. Try a lower value, such as 64%:

import pandas as pd
from string_grouper import group_similar_strings

emails = pd.Series(['messi1@gmail.com', 'messi12@gmail.com', 'messi21@gmail.com'])
email_df = emails.to_frame()
email_df[['group_id', 'group_rep']] = group_similar_strings(emails, min_similarity=0.64)
email_df

                   0  group_id         group_rep
0   messi1@gmail.com         0  messi1@gmail.com
1  messi12@gmail.com         0  messi1@gmail.com
2  messi21@gmail.com         0  messi1@gmail.com
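
To get a feel for why 80% is too strict for strings like these, here is a rough, standard-library-only sketch of cosine similarity over TF-IDF-weighted character 3-grams. It is an illustration under assumptions, not the library's implementation: the stripping pattern, the lower-casing, the smoothed IDF formula, and all function names below are my own choices for the sketch.

```python
import math
import re
from collections import Counter

# Assumption: these characters are dropped and case is ignored before
# n-grams are built. Treat this as illustrative, not as the library's code.
STRIP = re.compile(r"[,-./]|\s")


def ngrams(text, n=3):
    """Return the character n-grams of a cleaned, lower-cased string."""
    cleaned = STRIP.sub("", text.lower())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one TF-IDF-weighted n-gram vector (as a dict) per input string."""
    counts = [Counter(ngrams(s, n)) for s in strings]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


emails = ["messi1@gmail.com", "messi12@gmail.com", "messi21@gmail.com"]
vectors = tfidf_vectors(emails)

# Print the similarity of every distinct pair.
for i in range(len(emails)):
    for j in range(i + 1, len(emails)):
        print(emails[i], emails[j], round(cosine(vectors[i], vectors[j]), 3))
```

Under these assumptions no pair comes close to 0.8: with only three short strings, the few n-grams that differ carry a lot of weight, so near-identical emails still score well below the default threshold.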

@dariswan
Author

Hi @ParticularMiner

Yes, I agree with you. My threshold is now 70%, and the results are much better.
Back to the main question: does this module cluster strings the same way in 0.6.1 as in 0.3.2? (I upgraded a little.)

Thank you for the answer
