Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about GoogleSuffixArrayDeduplicator #5

Closed
TristanThrush opened this issue Sep 28, 2022 · 3 comments
Closed

Question about GoogleSuffixArrayDeduplicator #5

TristanThrush opened this issue Sep 28, 2022 · 3 comments

Comments

@TristanThrush
Copy link

Hi, this repo makes a lot of deduplication code very easy to use; I'm very grateful for it!

I just noticed that the suffix array is built with this command:

https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/exact_dedup/suffix_array/suffix_array_google.py#L60

But right above that command, the suffix array is also built with this command:

https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/exact_dedup/suffix_array/suffix_array_google.py#L57

I could be misunderstanding the code, but if lines 60 and 57 actually do the same thing, it could be good to remove line 57. This is because line 57 tries to load the whole suffix array into memory which fails for large datasets.

Thanks!

@ChenghaoMou
Copy link
Owner

You are right, that cargo command is redundant. You can either use the latest commit or pip install --upgrade text-dedup to 0.2.1 for the fix.

@ChenghaoMou
Copy link
Owner

And thanks for the issue. It's much appreciated!

@TristanThrush
Copy link
Author

thanks; so fast!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants