BanditPAM 200x slower than quadratic algorithms at 10k MNIST #175
Comments
First numbers for 20k of MNIST (one run each only, on Colab):
Thanks for the report @kno10 --- I've requested access to the Colab notebooks from two of my personal email addresses; would you mind granting access so I can investigate?
Sorry, I didn't click the right Colab buttons; the link was meant to be public. It should work now.
Hi @kno10, thank you for filing this issue and providing an easily reproducible benchmark. It led us to discover a number of issues that are being worked on:
The first three points above are all addressed in the branch. Thank you for reporting all of these bugs in easily reproducible ways. Please let me know if you have any other questions or comments while I continue to work on this.
- Distance matrix: Indeed, the distance computations are the main cost, but that is also the baseline any non-quadratic method needs to beat. Even pairwise distance computations can be vectorized (e.g., with AVX; MNIST should benefit from this), so I would not expect the benefit of precomputing the full matrix to be that large, unless you recompute the values very often. In my opinion, the benefits of vectorization at this level are often overestimated, because people tend to look at interpreted code, where "vectorized" also means calling a compiled library function instead of running in the Python interpreter, regardless of whether the underlying code is actually vectorized.
- Multithreading: The colab sheet uses
- Max iterations: It converges long before the maximum iteration count is reached - usually fewer than 10 iterations are enough. I added an "iter" counter to the colab sheet, and it was just 3 iterations on average. I also set it to
- Swap complexity: Don't use the FastPAM1 version of the trick anymore; the FasterPAM version is both theoretically better (guaranteed, not just expected, gains - FastPAM1 still has a theoretical worst case of O(k)), better understood, and more elegant.
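To illustrate the vectorization point above: the full pairwise distance matrix can be computed in a handful of BLAS-backed NumPy operations, so the "distance matrix" baseline is itself already vectorized. This is a minimal sketch using the standard identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; the function name and data are illustrative, not from the thread.

```python
import numpy as np

def pairwise_sq_euclidean(X):
    """Vectorized squared Euclidean distance matrix for rows of X."""
    # Row norms ||x_i||^2, computed without forming intermediates.
    sq = np.einsum("ij,ij->i", X, X)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, via one matrix product.
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clamp tiny negative values caused by floating-point roundoff.
    np.maximum(D, 0.0, out=D)
    return D

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 784))  # MNIST-like dimensionality
D = pairwise_sq_euclidean(X)
```

The single `X @ X.T` matrix product dominates the cost and is exactly where AVX and multithreaded BLAS kick in.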
I've been comparing BanditPAM to FasterPAM on the first 10k instances of MNIST:
https://colab.research.google.com/drive/1-8fMll3QpsdNV5widn-PrPHa5SGXdAIW?usp=sharing
BanditPAM took 791684.18ms
FasterPAM took 3971.87ms, of which 90% is the time needed to compute the pairwise distance matrix.
That is 200x slower. I will now try 20k instances.