
Using CategoricalMatcher on massive amounts of Hashes #40

Open
tr7zw opened this issue Aug 30, 2019 · 0 comments

Comments

tr7zw commented Aug 30, 2019

Hey, first of all, awesome lib.
I'm currently tinkering on a database to collect Minecraft skins (64x64 images). Before adding them I clean them up (upgrade older 64x32 skins to 64x64 and remove data from unseen/unused areas) and then save them with a sha256 hash in order to deduplicate them. Now I'm trying to use the CategoricalMatcher to group together visually similar skins. To speed things up I precompute the JImageHash hash (PerceptiveHash(128) seems to work really well for this use case) and save it as JSON (using Gson for the conversion) to the database, roughly as in the sketch below. I created a tiny fork of JImageHash to add a "categorizeImageAndAdd(Hash[] hash, String id)" method that just skips the BufferedImage -> Hash[] conversion. It's also noteworthy that all of this data is unlabeled, but normalized.
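Roughly what the precompute-and-serialize step looks like (simplified sketch, not the exact code from my fork; the helper names are placeholders and the Hash package path may differ between JImageHash versions):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

import com.github.kilianB.hash.Hash;                      // may be com.github.kilianB.matcher.Hash in older releases
import com.github.kilianB.hashAlgorithms.PerceptiveHash;
import com.google.gson.Gson;

public class SkinHashPrecompute {

    private static final PerceptiveHash HASHER = new PerceptiveHash(128);
    private static final Gson GSON = new Gson();

    /** Hash a cleaned-up 64x64 skin once and return the JSON to store in the database. */
    public static String hashToJson(File skinFile) throws Exception {
        BufferedImage skin = ImageIO.read(skinFile);      // already normalized skin image
        Hash hash = HASHER.hash(skin);                    // the expensive step, done once per skin
        return GSON.toJson(hash);                         // stored alongside the sha256 key
    }

    /** Restore a hash from the database without touching the image again. */
    public static Hash jsonToHash(String json) {
        return GSON.fromJson(json, Hash.class);
    }
}
```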
Now to my problem: the first ~20k hashes can be added relatively fast, but the further along I get, the slower everything becomes. The current test database has 111,000 skins = 111,000 hashes inside, and together with recomputeCategories() the process takes about 3 hours. The goal is to have millions of skins in that database, so this approach won't work anymore.
Is there a better way of doing this? Testing a newly added hash against all other hashes comes to mind, which doesn't sound too practical, or using chunks of the database and matching against them. Maybe the database could be split up into clusters of 10-100k that can be computed separately, and for which only the category's fuzzy hash gets stored. New hashes could then be compared against these fuzzy hashes; a rough sketch of that idea follows below.
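To make the clustering idea a bit more concrete, here's a rough sketch. It does not use the actual CategoricalMatcher/FuzzyHash API; the Cluster class, addSkin method and the threshold value are placeholders, and I'm assuming Hash#normalizedHammingDistance is available for the distance check:

```java
import java.util.ArrayList;
import java.util.List;

import com.github.kilianB.hash.Hash;   // package may differ between versions

public class ClusteredSkinIndex {

    /** One cluster = a representative hash plus the ids of its members. */
    static class Cluster {
        Hash representative;                 // e.g. the first member, or later a fuzzy/average hash
        List<String> memberIds = new ArrayList<>();

        Cluster(Hash representative, String firstId) {
            this.representative = representative;
            memberIds.add(firstId);
        }
    }

    private final List<Cluster> clusters = new ArrayList<>();
    private final double threshold;          // max normalized distance to join an existing cluster

    public ClusteredSkinIndex(double threshold) {
        this.threshold = threshold;
    }

    /** O(#clusters) per insert instead of O(#hashes): compare only against representatives. */
    public void addSkin(Hash hash, String id) {
        Cluster best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Cluster c : clusters) {
            double d = hash.normalizedHammingDistance(c.representative);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        if (best != null && bestDistance <= threshold) {
            best.memberIds.add(id);               // close enough: join the existing cluster
        } else {
            clusters.add(new Cluster(hash, id));  // otherwise start a new cluster
        }
    }
}
```

With millions of skins the number of clusters should stay far below the number of hashes, so each insert stays cheap; the representatives could then periodically be recomputed (e.g. as fuzzy hashes over their members) to keep the clusters tight.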
