Question about version string_grouper group_similar_strings #80

dariswan · 2022-01-26T20:27:53Z

Dear developer,

Could you explain the differences between the versions of string_grouper?
I only use one function, "group_similar_strings". I am currently on version 0.1.1, but the latest version is now 0.6.1.

This library is great and very helpful, but when I use group_similar_strings with a custom similarity threshold, it sometimes misses groupings that look obvious when I check the results by eye.
Is it worth upgrading to the latest version? What are the improvements?

ParticularMiner · 2022-01-26T21:08:34Z

Hi @dariswan

The latest version is supposed to be much faster than older versions as your dataset-size increases. I would be interested to see how group_similar_strings failed. If possible, could you send me a code/data sample that reproduces the failure?

Thanks.

dariswan · 2022-01-27T09:05:11Z

Hi @ParticularMiner

group_similar_strings does not fail as such, but when I check the output by eye, it sometimes gives inaccurate results for individual strings.
In my case, I tried to group similar email addresses with the default similarity threshold (80%). For example:

messi1@gmail.com --> group_1
messi12@gmail.com --> group_2
messi21@gmail.com --> group_3

To a human eye, those 3 emails are supposed to be in one group.

ParticularMiner · 2022-01-27T09:42:18Z

Hi @dariswan

For such a small set of strings, the default similarity threshold (80%) is too high. Try a lower value, for example 64%:

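import pandas as pd
from string_grouper import group_similar_strings

emails = pd.Series(['messi1@gmail.com', 'messi12@gmail.com', 'messi21@gmail.com'])
email_df = emails.to_frame()
# Lower min_similarity from the default 0.8 so that near-duplicates are grouped;
# the two result columns are stored as group_id and group_rep.
email_df[['group_id', 'group_rep']] = group_similar_strings(emails, min_similarity=0.64)
email_df  # display the result (notebook-style)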

                   0         group_rep
0   messi1@gmail.com  messi1@gmail.com
1  messi12@gmail.com  messi1@gmail.com
2  messi21@gmail.com  messi1@gmail.com
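
To see why 80% is too strict for these three addresses, here is a rough, standard-library-only sketch of the kind of score involved. It assumes that string_grouper compares strings by the cosine similarity of TF-IDF-weighted character 3-grams, as its documentation describes. The helper names (trigrams, tfidf_vectors, cosine), the smoothed IDF formula, and the stripping of punctuation and whitespace are assumptions made for illustration, not the library's actual code, so the real scores may differ.

```python
import math
import re
from collections import Counter


def trigrams(text, n=3):
    """Lower-case the text, drop , - . / and whitespace, then count character n-grams."""
    # Assumption: this cleaning step approximates the library's default preprocessing.
    cleaned = re.sub(r"[,\-./\s]", "", text.lower())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def tfidf_vectors(strings):
    """Return one {ngram: weight} dict per string, using a smoothed IDF."""
    counts = [trigrams(s) for s in strings]
    n_docs = len(counts)
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    # Assumed IDF formula: ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


emails = ["messi1@gmail.com", "messi12@gmail.com", "messi21@gmail.com"]
vectors = tfidf_vectors(emails)
for i in range(len(emails)):
    for j in range(i + 1, len(emails)):
        print(f"{emails[i]} vs {emails[j]}: {cosine(vectors[i], vectors[j]):.3f}")
```

Under these assumptions, messi1@gmail.com scores just above 0.64 against each of the other two addresses, and the other two score about 0.49 against each other. All three pairs fall well below 0.8, but a threshold of 0.64 is enough to link both longer addresses to messi1@gmail.com. The weights also depend on how often each 3-gram occurs across the whole input, so the same pair can score differently in a larger dataset.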

dariswan · 2022-01-27T23:11:39Z

Hi @ParticularMiner

Yes, I agree with you. My threshold is now 70%, and the results are much better.
So, back to the main question: does this module still cluster strings the same way in 0.6.1 as in 0.3.2? (I have upgraded a little bit.)

Thank you for the answer
