Episode 18 - PyMC3 - Open Source Directions
Welcome to Open Source Directions, the webinar that informs you about the future of open source projects. This episode highlights PyMC3.
- Host: Anthony Scopatz
- Co-Host: Carol Willing
- Guest 1: Chris Fonnesbeck, a senior analyst for the New York Yankees and an adjunct professor of biostatistics at the Vanderbilt University Medical Center. He’s also the BDFL and a core contributor to the PyMC3 project.
PyMC3 is a probabilistic programming package for Python that allows users to fit Bayesian models using a variety of numerical methods, most notably Markov chain Monte Carlo (MCMC) and variational inference (VI). At the time of this episode on April 5th, PyMC3 had 4k stars on GitHub and about 44,452 downloads in March 2019 according to the Python Package Index.
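To make the MCMC idea concrete, here is a minimal sketch of a random-walk Metropolis sampler in plain Python. It is not PyMC3 code and does not use PyMC3's samplers. The model (a normal mean with known noise), the names `log_posterior` and `metropolis`, and every tuning value are assumptions chosen purely for illustration.

```python
import math
import random


def log_posterior(mu, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalised log posterior for mu.

    Assumed model: mu ~ Normal(0, prior_sd), each x ~ Normal(mu, noise_sd).
    """
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * ((x - mu) / noise_sd) ** 2 for x in data)
    return log_prior + log_lik


def metropolis(data, n_samples=5000, step=0.5, seed=0):
    """Draw samples of mu with a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


# Synthetic data (hypothetical): 50 draws around a true mean of 3.0.
data_rng = random.Random(42)
data = [data_rng.gauss(3.0, 1.0) for _ in range(50)]

samples = metropolis(data)
kept = samples[1000:]  # discard early samples as burn-in
posterior_mean = sum(kept) / len(kept)

# Closed-form posterior mean for this conjugate model, for comparison.
analytic_mean = sum(data) / (len(data) + 1.0 / 10.0 ** 2)

print("MCMC estimate:", round(posterior_mean, 3))
print("Analytic mean:", round(analytic_mean, 3))
```

A library like PyMC3 takes care of this kind of sampling loop for you, so the user only has to describe the model.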
The project began as some code that Chris wrote for himself in 2003 while a graduate student at the University of Georgia. At the time, there was not a wide selection of software packages for doing Bayesian statistics, and none of them were open source. The only widely-used application was called WinBUGS, which was a ground-breaking piece of software that played a major role in democratizing Bayesian methods. Its drawbacks were that it only ran on Windows, it was not open source (so you could not look at the source code or contribute to the project), and it used a DSL that was not very expressive. Ultimately, these limitations led Chris to decide that he wanted to do Bayes in Python!
Luckily, things have come a long way since those early days. Now there is a large suite of probabilistic programming projects available, notably Stan (from Andrew Gelman’s group at Columbia), Google’s TensorFlow Probability, and Edward, which is also based on TensorFlow, to name just a few. While they all ostensibly do similar things, that is, build and estimate Bayesian models using MCMC or related methods, each project takes a very different approach to accomplishing the task.
PyMC3 involved a complete rewrite of the PyMC project. It did not reuse any of the PyMC code, which was essentially a pile of Fortran extensions. PyMC3 relies on the calculation of gradients, and to facilitate that, John Salvatier used Theano as a computational back end. Theano is a framework for building deep neural networks, but PyMC3 uses that infrastructure to do probabilistic programming. Just like with neural networks, a Bayesian model is constructed as a graph, and then computations are performed on that graph. Importantly, modern methods for fitting Bayesian models rely on taking the gradient of the model probability, so it's important to be able to do automatic or symbolic differentiation, and Theano provides that. Unfortunately, the Theano project recently halted development, so the next version of PyMC will be built on another technology, and we are currently exploring TensorFlow as a replacement.
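The idea of building a model as a graph and then differentiating it can be sketched with a tiny reverse-mode automatic differentiation example in plain Python. This is only an illustration of the concept, not how Theano or PyMC3 are implemented. The `Node` class, the `backward` function, and the toy normal log-probability are all hypothetical names and assumptions introduced here.

```python
class Node:
    """A value in a computation graph that remembers how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    @staticmethod
    def wrap(x):
        return x if isinstance(x, Node) else Node(x)

    def __add__(self, other):
        other = Node.wrap(other)
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __sub__(self, other):
        other = Node.wrap(other)
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = Node.wrap(other)
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def backward(output):
    """Propagate gradients from the output back to every node in the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # children first, then their parents
        for parent, local in node.parents:
            parent.grad += node.grad * local


def normal_logp(mu, data):
    """Unnormalised log-probability of data under Normal(mu, 1)."""
    total = Node(0.0)
    for x in data:
        diff = mu - x
        total = total + (-0.5) * diff * diff
    return total


mu = Node(1.5)
logp = normal_logp(mu, [1.0, 2.0, 4.0])
backward(logp)

print("log-probability:", logp.value)
print("gradient with respect to mu:", mu.grad)
```

For this toy model the gradient can be checked by hand, since the derivative of the log-probability with respect to `mu` is the sum of `x - mu` over the data.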
PyMC3 was the brainchild of John Salvatier, who was one of the first to recognize the power of the gradient-based methods for Bayesian analysis. He was persuaded to use the PyMC project as a platform for implementing these methods, and he went about coding a prototype from scratch that eventually became PyMC3.
The PyMC-devs group now consists of over 12 core developers, each of whom makes important contributions to the maintenance of PyMC3, not only by contributing code, but also by writing docs and tutorials, answering questions on the Discourse page, and giving talks and short courses at conferences. Prior to PyMC3, the team was always small and had a high rate of turnover, but the project is now quite stable and very active.
Most developers tend to start out as users, and early on PyMC tended to be used by ecologists, astronomers, and statisticians. Now, the reach is quite broad having expanded outside of academia to include finance, economics, data science, biomedicine, and more. Scientists and analysts who utilize observational data seem to be the most common users.
PyMC3 is part of NumFOCUS, and with their help and advice, improvements are being made in the governance of the project. Additionally, the team is doing simple things like updating the code of conduct and contribution guidelines, as well as driving more comprehensive community engagement efforts on the Discourse page and GitHub repository. The team always encourages new contributors to the project and invites those who contribute repeatedly to become core developers.
This is a critical juncture in the PyMC process as far as current and future development is concerned. While it has been wonderful to have PyMC3 built upon Theano, because of its stability and efficiency, the Theano project has been discontinued. There are now a number of open source frameworks like Theano which are quite robust and can take on the same level of work. Because of this, PyMC3 will continue to run on Theano, but future iterations like PyMC4 will need a different computational backend. The team has spent the last year researching alternatives to Theano. Among the options are PyTorch, MXNet, TensorFlow, and other backends. Right now there is some work being done with Google to potentially go with the TensorFlow option. Some people are already helping to contribute here, so anyone interested in probabilistic programming should hop into the Discourse or GitHub and help out. PyMC3 and PyMC4 will be developed in parallel as separate projects, and both are planned to continue in development for a while, since not everyone wants to or needs to make the jump.
Q. Can PyMC3 integrate with Dask, to scale out and run on a cluster of machines? And on a different note, since SMC is inherently parallel, it is a good candidate for running with Dask. How would you use Dask.distributed to take advantage of a cluster?
A. No, it doesn’t integrate with Dask at all. If you are really interested in doing the things that Dask does, then on the PyMC3 side you would be looking at GPUs. Particularly when we move to PyMC4, we will be looking at TensorFlow being able to help us with TPUs. One idea is to build out the backend of PyMC4 with something like Dask + Numba + Cython and whatever else to go for a community-based open source approach. This would be something like PyMC5 though, since this is beyond the current scope.
Q. Can I use PyMC3 in a truly online fashion? I.e. use the posterior as prior when processing the next batch of data.
A. This alludes to the idea of Bayesian updating. One of the touted benefits of Bayesian inference is that you can run an analysis and then use the resulting posterior as the prior for the next one. The question comes up when you get the next set of data: it is nice to be able to take the posteriors from your previous set of data as priors for your next one. You can see how this could work in an online environment with streaming data. This concept is easier said than done in many respects, because you have to take the posterior samples and turn them back into a prior form. What we do have in PyMC3 now is an empirical distribution, which essentially allows you to construct a distribution not from a parametric form but as a histogram. I don't know of a good example in production where this happens, but in theory you could iteratively update your model, though sometimes it just has to be done manually.
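The posterior-as-prior idea is easiest to see in a conjugate model, where the posterior keeps a closed parametric form and no histogram approximation is needed. The sketch below uses a Beta prior with binary (0/1) observations in plain Python. It is not PyMC3 code, and the `update_beta` helper and the example batches are hypothetical. The general case described in the answer, where the posterior only exists as samples, is harder than this.

```python
def update_beta(alpha, beta, batch):
    """Return Beta posterior parameters after observing a batch of 0/1 outcomes."""
    successes = sum(batch)
    failures = len(batch) - successes
    return alpha + successes, beta + failures


prior = (1.0, 1.0)  # flat Beta(1, 1) prior on the success probability
batches = [[1, 0, 1, 1], [0, 0, 1], [1, 1, 1, 0, 1]]  # made-up streaming data

# Online updating: each posterior becomes the prior for the next batch.
posterior = prior
for batch in batches:
    posterior = update_beta(*posterior, batch)

# Same analysis done in one shot on all of the data.
all_data = [x for batch in batches for x in batch]
one_shot = update_beta(*prior, all_data)

print("sequential posterior:", posterior)
print("one-shot posterior:  ", one_shot)
```

Both routes give the same Beta parameters, which is the property that makes online updating attractive in the first place.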
Q. I just started reading "Probabilistic Programming and Bayesian Methods for Hackers". Could you suggest other resources for people just starting out?
A. I’m inferring from the question that they are just starting out, either with Bayesian methods in general or with PyMC3 specifically. If you are talking about PyMC, then there is a book by one of our older developers called [Bayesian Analysis with Python](https://www.amazon.com/Bayesian-Analysis-Python-Introduction-probabilistic/dp/1789341655/ref=pd_lpo_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=JJXMYXB29M9Q8PJ8RTM2), and I would highly recommend that as well. In general, for someone just starting out with Bayesian methods, I would suggest [Bayesian Data Analysis](https://www.ebay.com/p/Chapman-and-Hall-CRC-Texts-in-Statistical-Science-Bayesian-Data-Analysis-106-by-Andrew-Gelman-2003/102883921?iid=232601693788) by Gelman, as I would consider it to be the Bible of Bayesian statistics.
Q. How do you add additional dimensions to a model? In this case you might want to add IQ by gender.
A. That model can be generalized by making those means into a hierarchical model. What that means is that you have parameters that have their own sets of parameters. So rather than just having means and standard deviations, the means can be a model themselves. You can model that mean as a function of, in this case, group (placebo or drug), but then you can add other covariates to your model. You can make your list of covariates as long as you want. Even those parameters can then have hyperparameters as well. The nice thing is that these models can be as simple or complex as your data allows.
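As a rough illustration of "the means can be a model themselves", the sketch below simulates data from such a structure in plain Python. It only covers the generative direction (drawing data from the model), not fitting, which is what a library like PyMC3 would do. The function name `simulate_iq`, the baseline of 100, and all standard deviations are assumptions made up for this example.

```python
import random


def simulate_iq(n_per_cell=5, seed=1):
    """Simulate IQ scores whose mean depends on treatment group and gender."""
    rng = random.Random(seed)

    # Hyperparameters: describe the distribution the effects are drawn from.
    baseline = 100.0
    effect_sd = 5.0

    # Parameters: one effect per level, each drawn using the hyperparameters.
    group_effect = {g: rng.gauss(0.0, effect_sd) for g in ("placebo", "drug")}
    gender_effect = {s: rng.gauss(0.0, effect_sd) for s in ("female", "male")}

    rows = []
    for group in group_effect:
        for gender in gender_effect:
            # The mean is itself a small model: baseline plus covariate effects.
            mean = baseline + group_effect[group] + gender_effect[gender]
            for _ in range(n_per_cell):
                rows.append((group, gender, rng.gauss(mean, 15.0)))
    return rows


rows = simulate_iq()
for group, gender, score in rows[:4]:
    print(group, gender, round(score, 1))
```

Adding another covariate would mean adding another dictionary of effects and another term in the mean.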
Q. Since you are strongly considering TensorFlow, where do you see PyMC4 fitting in with regard to TensorFlow Probability? Is the goal to have healthy competition within the TensorFlow ecosystem? Are there any particular advantages to PyMC over TFP? (I'm assuming it's much more mature)
A. The plan is to build on top of it. TFP is a library built on TensorFlow that provides probability distributions, which TensorFlow did not previously have. It also includes a number of fitting methods, such as MCMC and variational inference. We will build on top of these where possible, and we also plan to automate their usage. TFP may not be able to do everything that PyMC wants to do, so there will perhaps be extensions in PyMC that are not in TFP, but we plan to extend it as much as possible.
Q. What are your opinions about Uber's Pyro?
A. I don’t have a lot of direct experience with it, but I think it looks impressive. It is definitely part of the ecosystem. Pyro is a probabilistic programming language developed by Uber, and it has some really smart people working on it. I think it is great to not just have one but to provide a multitude of options because there are always things that one does better than the other. Some of these approaches/packages use different strategies, some work in what is called a graph mode, whereas others are more iterative and interactive.