Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BanditPAM 200x slower than quadratic algorithms at 10k MNIST #175

Open
kno10 opened this issue Jan 13, 2022 · 5 comments
Open

BanditPAM 200x slower than quadratic algorithms at 10k MNIST #175

kno10 opened this issue Jan 13, 2022 · 5 comments
Assignees

Comments

@kno10
Copy link

kno10 commented Jan 13, 2022

I've been comparing BanditPAM to FasterPAM on the first 10k instances of MNIST:

https://colab.research.google.com/drive/1-8fMll3QpsdNV5widn-PrPHa5SGXdAIW?usp=sharing

BanditPAM took 791684.18ms
FasterPAM took 3971.87ms, of which 90% are the time needed to compute the pairwise distance matrix

That is 200x slower. I will now try 20k instances.

@kno10
Copy link
Author

kno10 commented Jan 13, 2022

First numbers for 20k of MNIST (one run each only, on colab):
BanditPAM: 4390447.39ms
FasterPAM: 16993.11ms
258x slower, so the gap has widened despite the latter being O(n²). But there is variance here, it may be similar.

@motiwari motiwari self-assigned this Jan 14, 2022
@motiwari
Copy link
Owner

Thanks for the report @kno10 --- I've requested access to the colab notebooks from 2 of my personal email addresses, would you mind granting access so I can investigate?

@kno10
Copy link
Author

kno10 commented Jan 15, 2022

Sorry, I didn't click the right colab buttons, the link was meant to be public. It should work now.

@motiwari
Copy link
Owner

motiwari commented Apr 22, 2022

Hi @kno10, thank you for filing this issue and providing an easily reproducible benchmark. It led us to discovering a number of issues that are being worked on:

  • Distance Matrix: In your benchmark script, the whole distance matrix is passed to FasterPAM. However, it's still not apples-to-apples with BanditPAM because computing the distance matrix up front is much faster (due to vectorization) than on-the-fly as is done in BanditPAM. To fix this point of comparison, I've allowed for a distance matrix to be passed to BanditPAM explicitly in the slowdown branch.

  • Multithreading: I think you may have misunderstood my comment here; my apologies for being unclear. Even when multithreaded, BanditPAM almost always returns the same output, even though it is a random algorithm. The call to banditpam.set_num_threads(1) is only necessary for instruction-level determinism, which is a stronger condition than necessary. For apples-to-apples comparison with FasterPAM, we need to multithread BanditPAM as well (I don't know Rust, but I believe the Rust implementation of FasterPAM is multithreaded? ). [Aside: due to incompatibility with M1 Macs, I've removed the .set_num_threads functionality altogether; see v3.0.4)

  • Max Iterations: BanditPAM had a much higher default value for the maximum number of iterations than FasterPAM (1000 vs. 100); this led to many more swap steps being executed. So BanditPAM was finding a very slightly better solution but executing for much longer. I've updated this to 100 in the slowdown branch as well.

  • Swap Complexity: Most interestingly, the dependency on k of the SWAP step in BanditPAM is still O(k). I'm accelerating this to O(1) using the FastPAM1 trick in the slowdown branch, which was nontrivial to implement in our framework. This is also the root cause of Much Slower than Scikit-Learn kMedoids on Large Dimensions #179.

The first three points above are all addressed in the branch slowdown. The fourth one is implemented as well, but for some reason is not working properly and often returns bad results. I'm going to continue working on this but need to put this on hold for some time due to my other commitments.

Thank you for providing all of these bugs in easily-reproducible ways. Please let me know if you have any other questions or comments while I continue to work on this.

@kno10
Copy link
Author

kno10 commented Apr 25, 2022

Distance matrix: Indeed, the distance computations are the main cost, but that is also the baseline any non-quadratic method will need to beat. But even pairwise distance computations can be vectorized (e.g., with AVX - mnist should benefit from this) so I would not expect the benefits to be that huge to really have the matrix, unless you recompute the values very often. In my opinion, the benefits of vectorization at this level are often overestimated (because people tend to look at interpreted code, and "vectorized" then also means calling a compiled library function as opposed to using the Python interpreter, no matter whether the actual underlying code is vectorized or not).

Multithreading: The colab sheet uses n_cpu=1 for FasterPAM, i.e., no parallelism. The wrapper then calls a non-multithreaded implementation; the parallelized version came much later and is not as parallelism-efficient as we would like. I set both to use a single thread for a more fair comparison.

Max Iterations: It converges long before the maximum iteration counter - usually <10 will be enough. I added an "iter" counter to the colab sheet, and it was just 3 iterations on average. I also set it to max_iter=1000 but it will not make a difference. How much quality do you lose in BanditPAM when reducing the iteration limit?

Swap Complexity: Don't use the FastPAM1 version of the trick anymore, the FasterPAM version is both theoretically better (guaranteed, not just expected gains - FasterPAM1 still has a theoretical worst case of O(k)), better understood, and more elegant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants