Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Jaccard index? #2

Closed
luizirber opened this issue Aug 14, 2020 · 5 comments
Closed

Implement Jaccard index? #2

luizirber opened this issue Aug 14, 2020 · 5 comments

Comments

@luizirber
Copy link

Hi @Daniel-Liu-c0deb0t!

I didn't dig down too much into the codebase, but I wonder if there is interest in adding a Jaccard index function (as described in Faster Population Counts Using AVX2 Instructions.
The implementation from the paper is at https://github.com/CountOnes/hamming_weight/blob/1dd7554c0fc39e01c9d7fa54372fd4eccf458875/src/avx_jaccard_index.c
and while it is a bit heavy on the macros I think it could be translated using the infra you already have in triple_accel?

(This would be incredibly useful for me, so I can start looking deeper into it if you're also interested =])

@Daniel-Liu-c0deb0t
Copy link
Owner

Hey @luizirber, I think it would be a bit difficult to integrate Jaccard distance into the current codebase, since it is mainly adapted for the Levenshtein distance calculation. It would likely be easier to make a separate crate for Jaccard index. I considered this feature before, but I thought it would be better to only focus on edit distance. Additionally, I'm current too busy with other projects to implement it. Sorry. Eventually, I would likely implement SIMD popcount in Rust in a separate project, since I don't believe anyone has done it yet and its very useful. If you have time to look into it, then I think it shouldn't be too hard to get a Rust port, as the code is not too long. The dense macros are just to make it work on multiple platforms.

@luizirber
Copy link
Author

No problem =]

@Daniel-Liu-c0deb0t
Copy link
Owner

Hey @luizirber, I am interested in implementing an SSE/AVX popcount routine in another project: https://github.com/natir/nuc2bit. Perhaps you can use the code (when I am done) to implement a Jaccard distance function?

@luizirber
Copy link
Author

Sounds great, and happy to see you working with Pierre =]

@Daniel-Liu-c0deb0t
Copy link
Owner

Daniel-Liu-c0deb0t commented Aug 26, 2020

Hey, I wrote up a quick popcount implementation here using the shuffle-based algorithm. For Jaccard distance, you are going to have to modify the code to AND and OR the vectors before applying internal_popcount, in the innermost loop.

I haven't thoroughly benchmarked my implementation since I just finished it, but it should be pretty fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants