Tips for working with large datasets #88

ryangdar · 2022-10-05T16:00:14Z

Hi, I'm working with a 200MB file and using the group_similar_strings function, but it takes so long that it never completes (it has been running for several days). I've tried several n_gram sizes with no luck. Do you have any tips for running it on large datasets?

ajinnah · 2023-01-17T02:12:47Z

I'm having the same issue and have found no solution so far. As an alternative, https://github.com/louistsiattalou/tfidf_matcher can handle much larger datasets without getting stuck.
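
One general way to shrink the input before any fuzzy matcher runs is to collapse exact and near-exact duplicates first. The sketch below is not something either commenter reports trying, and it is not part of either library: the `normalize` and `collapse_duplicates` helpers and the sample names are hypothetical, and the normalization rule (lowercasing and collapsing whitespace) is an assumption that may not suit every dataset.

```python
def normalize(s: str) -> str:
    """Hypothetical normalization: lowercase and collapse runs of whitespace."""
    return " ".join(s.lower().split())


def collapse_duplicates(strings):
    """Return (unique, index_map).

    unique holds each distinct normalized string once, in first-seen order.
    index_map[i] is the position in unique of the normalized strings[i].
    """
    unique = []
    position = {}   # normalized string -> its index in unique
    index_map = []
    for s in strings:
        key = normalize(s)
        if key not in position:
            position[key] = len(unique)
            unique.append(key)
        index_map.append(position[key])
    return unique, index_map


# Made-up sample data, for illustration only.
names = ["Acme Corp", "acme  corp", "Globex", "ACME CORP", "Globex "]
unique, index_map = collapse_duplicates(names)
print(unique)
print(index_map)
```

The matcher would then only need to run on `unique`, and its group assignments can be mapped back to the original rows through `index_map`. How much this helps depends entirely on how many such duplicates the data contains.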
