Beginning infrastructure for Vecchia GP's #2311

Open
wants to merge 68 commits into
base: master

alexpeters1208

This PR adds a new feature that makes training Vecchia GPs with GPyTorch much easier. It introduces an Index, which provides an interface for partitioning datasets with different clustering algorithms, measuring distances between partitions via DistanceMetrics, and reordering datasets via OrderingStrategies. The Index ensures that the neighborhood structure in the data adheres to the Vecchia ordering constraints, which are what guarantee a valid joint density, and it provides an easy way to access blocks of observations and their neighbor sets for both training and testing data. These are needed to compute block mean and covariance estimates for Vecchia models. We provide a function that does this with an index in the __init__.py file, though it can certainly be improved and almost certainly belongs somewhere else.

Custom clustering algorithms, distance metrics, and ordering strategies are supported so that new approaches to Vecchia GPs can be tested; tutorials for this customization are still in progress.
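
As a rough illustration of the ordering constraint mentioned above (this is not the code in this PR): in a Vecchia approximation each point may only condition on points that come earlier in the ordering, which keeps the product of conditionals a valid joint density. The function name `ordered_neighbors` and the brute-force search are assumptions made for this sketch.

```python
import math


def ordered_neighbors(points, k):
    """For each point i (in the given order), return the indices of its k
    nearest neighbors among points 0..i-1 only.

    Hypothetical helper for illustration; brute force, O(n^2) distance evaluations.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Only earlier points are eligible, so the conditioning graph is acyclic.
        earlier = sorted(range(i), key=lambda j: math.dist(p, points[j]))
        neighbors.append(earlier[:k])
    return neighbors


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.9, 0.1)]
nbrs = ordered_neighbors(points, k=2)
```

The first point has an empty neighbor set and the second has exactly one neighbor, so any code consuming these sets has to handle fewer than `k` neighbors at the start of the ordering.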

alexpeters1208 and others added 30 commits November 1, 2022 12:33
…pproximated MVN distribution, a Vecchia MLL wrapper to apply any MLL to small blocks of data with the VeccMVN distribution, and a Block class to perform the data blocking functionality, keep track of block order, neighbors, and constructing testing data blocks and neighbors.
@gpleiss
Member

gpleiss commented May 26, 2023

@alexpeters1208 sorry for the very slow reply on this PR!

I'm a fan of the various indexing strategies that you've built out here. It would also be good if we could get VNNGPVariationalStrategy using these indexing strategies. I want to be sure that all vecchia-like GPs in GPyTorch are using the same nearest neighbor ideas, and so VNNGPVariationalStrategy should be using any code that you plan on adding for vecchia GPs.

Finally, I would rename gpytorch/vecchia to something like gpytorch/nearest_neighbors (since ideally we could add more nearest neighbor methods). Moreover, it's probably not worth adding a new examples folder for vecchia, but instead add it to the scalable regression methods folder.

@gpleiss
Member

gpleiss commented May 26, 2023

Finally as a timeline I'll be offline for the next month or so, so it's probably best to pick this PR up later this summer.

@alexpeters1208
Author

@gpleiss I am also sorry to get back so late; many things have come up lately and this project has been on the back burner. As for getting this work merged (which I would still really love to do), are you asking that I implement VNNGPVariationalStrategy with my new machinery to get the PR approved? That sounds reasonable, and it is separate from the renames and moving things around a bit, which is the easier part. I'm just trying to put together a checklist of exactly what needs to be updated or improved to get this merged and to build on it going forward.

@gpleiss
Member

gpleiss commented Sep 21, 2023

@alexpeters1208 sorry for the slow reply! Here's some thoughts:

I like the idea of adding more Vecchia machinery to GPyTorch. Here's some requirements that I'd like to see though:

  1. Compute nearest neighbors using faiss/scikit learn. Both have fast nearest neighbor algorithms, and that'll be better than building our own :)
  2. Given a nearest neighbor structure, we should compute the conditional factors (i.e. $p(y_i | \mathrm{NN}_i)$) in a parallel fashion. The VNNGPVariationalStrategy should demonstrate how to do this through our "batch-mode hack." Happy to explain this in more detail if it isn't clear.
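
For reference, the conditional factors in item 2 take the standard Gaussian form. The notation here is chosen for this note and is not taken from the codebase: $K$ is the kernel matrix with observation noise folded in, and a zero prior mean is assumed for brevity.

$$
\begin{aligned}
p(y) &\approx \prod_{i=1}^{n} p\left(y_i \mid y_{\mathrm{NN}_i}\right), \\
p\left(y_i \mid y_{\mathrm{NN}_i}\right) &= \mathcal{N}\left(y_i;\ \mu_i,\ \sigma_i^2\right), \\
\mu_i &= K_{i,\mathrm{NN}_i}\, K_{\mathrm{NN}_i,\mathrm{NN}_i}^{-1}\, y_{\mathrm{NN}_i}, \\
\sigma_i^2 &= K_{ii} - K_{i,\mathrm{NN}_i}\, K_{\mathrm{NN}_i,\mathrm{NN}_i}^{-1}\, K_{\mathrm{NN}_i,i}.
\end{aligned}
$$

Each factor needs only a solve against a $k \times k$ matrix, where $k$ is the neighbor-set size, and the $n$ solves do not depend on one another. When every neighbor set has the same size, they can be stacked along a batch dimension and solved in one call.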

Ultimately, it would be great to use the Vecchia machinery you have as part of VNNGP. I want to be careful though that we are using faiss/scikit-learn and the batch conditional computation strategy, because we've put in a bit of work into ensuring that these mechanisms are quite efficient and fast ;)

Also, I can't see the notebook you've created because the docs won't compile. Can you solve the warnings so I can take a look?

@alexpeters1208
Author

alexpeters1208 commented Sep 21, 2023

Thanks for the quick reply.

  1. Currently, the KMeansIndex uses FAISS for its implementation. I can make it consistent with the current VNNGP implementation, so that it can also use scikit-learn if that is what is installed. However, the KMeansIndex is only one implementation of the underlying BaseIndex, which provides an interface for nearest-neighbor models using any algorithm. We also implement Voronoi-based nearest neighbors with VoronoiIndex, but again, that is only one implementation of the underlying Index. Anything that uses one will be able to use them all, which allows a great deal of flexibility.
  2. I will need to look a bit more into this. The contributions I've made do not actually involve using the NN structure for model training (though that is the obvious end goal), but rather only provide an interface for NN calculations across the board, and some implementations to get started (KMeans and Voronoi). I do take your point that the algorithm built from them for computing all the relevant joint densities should be parallelizable. I have an algorithm for doing this in the __init__.py file of my directory, but it is probably pretty lame.
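
To make point 1 concrete, here is a minimal sketch of the kind of pluggable interface being described. It is not the PR's code: `IndexSketch`, `PrecomputedIndex`, `ChainIndex`, and `neighbor_counts` are hypothetical names, and the real BaseIndex may expose different methods.

```python
from abc import ABC, abstractmethod


class IndexSketch(ABC):
    """Hypothetical stand-in for a nearest-neighbor index interface."""

    @abstractmethod
    def neighbors(self, i):
        """Return the indices of the points that point i conditions on."""


class PrecomputedIndex(IndexSketch):
    """Wraps a neighbor table produced by any external library."""

    def __init__(self, table):
        # Enforce the ordering constraint: only earlier points are allowed.
        for i, row in enumerate(table):
            if any(j >= i for j in row):
                raise ValueError(f"point {i} conditions on a later point")
        self.table = [list(row) for row in table]

    def neighbors(self, i):
        return self.table[i]


class ChainIndex(IndexSketch):
    """Conditions each point on the k immediately preceding points."""

    def __init__(self, k):
        self.k = k

    def neighbors(self, i):
        return list(range(max(0, i - self.k), i))


def neighbor_counts(index, n):
    """Model-side code depends only on the interface, not on the backend."""
    return [len(index.neighbors(i)) for i in range(n)]


counts_a = neighbor_counts(PrecomputedIndex([[], [0], [0, 1], [2, 1]]), 4)
counts_b = neighbor_counts(ChainIndex(k=2), 4)
```

Because `neighbor_counts` only calls `neighbors`, swapping one backend for another requires no change on the model side, which is the flexibility argued for above.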

I believe I've solved the warnings and am waiting for the checks to finish. I had renamed vecchia to nearest_neighbors in the source code but not in the documentation.

UPDATE: My checks are now failing due to faiss not being a direct GPyTorch dependency. I will look into how other parts of the package get around this, as I'm pretty sure this is not the only module using FAISS in examples.

@alexpeters1208
Author

@gpleiss The failing checks are due to a "No module named faiss found". I've looked around a little bit and don't understand how to ensure that these CI checks can use faiss, despite the fact that it's not a direct GPyTorch dependency. Other parts of this package probably do the same thing, correct? VNNGP CI will either depend on Scikit-learn or faiss. How can I resolve this issue so that you can read my notebook outlining the new features?
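
One common way to keep a library such as FAISS optional at runtime is to check for it before importing. The sketch below is an assumption about how that could look, not a description of what GPyTorch does, and `pick_nn_backend` is a hypothetical name. It only guards runtime imports: a docs or CI build that executes a notebook calling FAISS still needs the package installed in that environment.

```python
import importlib.util


def pick_nn_backend(preferred=("faiss", "sklearn")):
    """Return the name of the first importable backend, or None.

    Hypothetical helper; it checks availability without importing anything.
    """
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_nn_backend()
```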

@alexpeters1208
Author

@gpleiss Poking you on this

@gpleiss
Member

gpleiss commented Nov 15, 2023

Sorry for the slow reply - you would need to add faiss-cpu to the doc requirements.

@alexpeters1208
Author

alexpeters1208 commented Nov 15, 2023

Sorry for the slow reply - you would need to add faiss-cpu to the doc requirements.

Thank you Geoff - the docs are now rendered and you can view the notebook. I still have not gotten around to re-implementing VNNGP with this framework, but I plan to sometime soon. Hopefully the notebook will help clarify what this contribution is about - it is a unified framework for doing nearest-neighbor computations, not a re-implementation of any specific clustering algorithm. Any clustering algorithm available in Python can be utilized in a custom Index, enabling lots of flexibility with trying out different models.

@alexpeters1208
Author

@gpleiss I also have another notebook that details how this new feature actually gets used in training GPs, but it may contain data that I'm not sure I should share publicly. I would be happy to send you this notebook via email or something, so you can see how this feature is intended to be used in practice.

@alexpeters1208
Author

@gpleiss @Balandat Any word on how to move forward with this? I started reimplementing VNNGP with this framework some time ago, but have not completed that work yet. If possible, reading the rendered notebooks I provided would be the best place to start to understand the feature.
