Modify NUTS to output timed results and finish README!!
JamesYang007 committed Jul 18, 2020
1 parent a1e3f83 commit 9c20447
Showing 14 changed files with 715 additions and 127 deletions.
272 changes: 214 additions & 58 deletions README.md
- [Sampling from Joint Distribution](#sampling-from-joint-distribution)
- [Sampling Posterior Mean and Standard Deviation](#sampling-posterior-mean-and-standard-deviation)
- [Bayesian Linear Regression](#bayesian-linear-regression)
- [Benchmarks](#benchmarks)
- [Bayesian Linear Regression](#benchmarks-bayesian-linear-regression)
- [Gaussian Model](#gaussian-model)
- [Contributors](#contributors)
- [Third Party Tools](#third-party-tools)


A model object makes no heap-allocations and is minimal in size.
It is simply a small, contiguous slab of memory representing the binary tree.
The `model` object in the [previous section](#intuitive-model-specification)
is about `120 bytes` on `x86_64-apple-darwin17.7.0` using `clang-11.0.3`.

For a more complicated model such as the following:
```cpp
ppl::Data<double> X;

ppl::Param<double, ppl::vec> theta(6);
auto model = (
theta[0] |= ppl::uniform(-1., 1.),
theta[1] |= ppl::uniform(theta[0], theta[0] + 2.),
    // ... (relationships for theta[2] through theta[5] and X omitted here) ...
);
```

The size of the model is `616 bytes` on the same machine and compiler.

A model expression simply references the variables used in the expression
such as `theta[0], theta[1], ..., theta[5], X`, i.e. it does not copy any data.
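
A minimal sketch of what this implies (assuming `<iostream>` is included and using the
`model` defined above): the expression's footprint is fixed by its structure, not by
how much data is bound to `X`.

```cpp
// The expression type is determined entirely by the structure of the model,
// so its size does not grow with the number of observations held by X.
std::cout << sizeof(model) << " bytes\n";
```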

### High-performance Inference Methods

Users interface with the inference methods via model expressions and other
configuration parameters for that particular method.
Hence, the inference algorithms are completely general and work with any model
so long as the model expression is properly constructed.
Due to the statically-known model specification, algorithms
have opportunities to make compile-time optimizations.
See [Benchmarks](#benchmarks) for more detail.
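
As a rough sketch of the interface (assuming a `model` expression like the ones above;
both calls appear in full in the examples below):

```cpp
ppl::mh(model, 1000);   // Metropolis-Hastings with 1000 samples
ppl::nuts(model);       // NUTS with the default configuration
```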

We were largely inspired by STAN and followed their
[reference](https://mc-stan.org/docs/2_23/reference-manual/index.html)
and also their
[implementation](https://github.com/stan-dev/stan)
to make statistical optimizations such as
computing ESS, performing adaptations, and stabilizing sampling algorithms.
However, our library works quite differently underneath, especially
with automatic differentiation and handling model expressions.

## Installation (dev)

## Examples

### Sampling from Joint Distribution

Although AutoPPL was designed to perform inference on posterior distributions,
one can certainly use it to sample from any joint distribution defined by the priors and conditional distributions.
As an example, we can sample `1000` points from a
standard normal distribution using Metropolis-Hastings in the following way:
```cpp
std::array<double, 1000> samples;
ppl::Param<double> theta {samples.data()};

auto model = (
    theta |= ppl::normal(0., 1.)
);

ppl::mh(model, 1000);
```
The scalar parameter `theta` is bound to the storage `samples` at construction.
After calling `mh`, the `1000` samples are placed into `samples`.
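
As a quick follow-up sketch (assuming `<numeric>` is included), the draws can be
summarized directly from the bound storage:

```cpp
double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
```
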
In general, so long as the joint PDF is known,
or equivalently and more commonly if the conditional and prior PDFs are known,
one can build a model expression for the joint distribution and sample from it:

```cpp
std::array<double, 1000> theta1_samples, theta2_samples;
ppl::Param<double> theta1 {theta1_samples.data()};
ppl::Param<double> theta2 {theta2_samples.data()};

auto model = (
    theta1 |= ppl::uniform(-1., 1.),   // illustrative prior on theta1
    theta2 |= ppl::normal(theta1, 1.)  // theta2 conditioned on theta1
);

ppl::mh(model, 1000);
```

Here, `theta2` depends on `theta1` and hence defines
the conditional distribution `theta2 | theta1`.

### Sampling Posterior Mean and Standard Deviation

The following is an example of fitting a Gaussian model to some data.
We put a `Normal(0,3)` prior on the mean and `Uniform(0,2)` prior on the
standard deviation.
While the previous section used Metropolis-Hastings to demonstrate the interface,
it is recommended to use the state-of-the-art NUTS sampler to sample
from the posterior distribution.

```cpp
std::array<double, 1000> mu_samples, sigma_samples;

ppl::Data<double, ppl::vec> x {1.0, 1.5, 1.7, 1.2, 1.5};
ppl::Param<double> mu {mu_samples.data()};
ppl::Param<double> sigma {sigma_samples.data()};

auto model = (
    mu |= ppl::normal(0., 3.),
    sigma |= ppl::uniform(0., 2.),
x |= ppl::normal(mu, sigma)
);

ppl::nuts(model);
```
By default, NUTS runs 1000 iterations each for warmup and actual sampling
(2000 iterations in total).
The default adaptation method estimates a diagonal precision matrix.
To change the configuration, pass in a config object:
```cpp
NUTSConfig<...> config; // replace ... with ppl::unit_var or ppl::diag_var
config.n_samples = ...; // number of samples
config.warmup = ...; // number of warmup iters
config.seed = ...; // set seed (default: random)
config.max_depth = ...; // max depth of tree built during NUTS (default: 10)
ppl::nuts(model, config);
```
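
For instance, a concrete configuration might look like the following
(a sketch; `ppl::diag_var` is the adaptation tag mentioned in the comment above):

```cpp
ppl::NUTSConfig<ppl::diag_var> config;  // adapt a diagonal precision matrix
config.warmup = 4000;                   // 4000 warmup iterations
config.n_samples = 4000;                // 4000 posterior samples
config.seed = 42;                       // fixed seed for reproducibility
ppl::nuts(model, config);
```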

### Bayesian Linear Regression

This example covers ridge regression in a Bayesian setting.
We created a fictitious dataset consisting of `(x,y)` coordinates.
The true relationship is the following: `y = x + 1`.
By specifying two parameters for the weight and bias, we propose
the following probabilistic model:

```
y ~ Normal(x*w + b, 0.5)
w ~ Uniform(0, 2)
b ~ Uniform(0, 2)
```

In AutoPPL, we can write the following code and sample from the posterior:
```cpp
// (definitions of the data `x`, `y` and the storage arrays `w_storage`, `b_storage`
//  are omitted here)
ppl::Param<double> w {w_storage.data()};
ppl::Param<double> b {b_storage.data()};

auto model = (
w |= ppl::uniform(0., 2.),
b |= ppl::uniform(0., 2.),
y |= ppl::normal(x * w + b, 0.5)
);

ppl::nuts(model);
```

## Benchmarks

In the following examples, we show benchmarks against STAN.
We list the benchmark settings for completeness:

- Machine: x86_64-apple-darwin19.5.0
- CPU: 3.4 GHz Quad-Core Intel Core i5
- Compiler: Clang 11.0.3

### Bayesian Linear Regression <a name="benchmarks-bayesian-linear-regression"></a>

We collected a dataset on life expectancy released by WHO
([source](https://www.kaggle.com/kumarajarshi/life-expectancy-who)).
After cleaning and extracting three predictors ("Alcohol", "HIV/AIDS", and "GDP"),
the dataset consisted of 157 points.
We performed a Bayesian linear regression with this data and the following model:

```
y ~ Normal(X*w + b, s*s + 2.)
w ~ Normal(0., 5.)
b ~ Normal(0., 5.)
s ~ Uniform(0.5, 8.)
```

where `w` is a 3-dimensional parameter vector,
and `b` and `s` are scalar parameters.

Using the same dataset and model specification,
we ran NUTS with varying numbers of samples.
We set the number of chains and cores to 1,
and used the diagonal precision matrix adaptation.

The following plots compare run times and effective sample size (ESS) per second:

![](docs/figures/regression_benchmark_plot/runtime.png)
![](docs/figures/regression_benchmark_plot/ess_s.png)

The reported mean, standard deviation, and ESS values were
almost identical in all cases,
which is not surprising since we used the same algorithm to estimate ESS
and to perform NUTS.

The run times show similar log-like growth,
but it is clear that STAN (dotted lines) takes far longer
in both sampling and warmup times.
As for ESS/s, comparing lines of the same color (one color per parameter)
between dotted (STAN) and solid (AutoPPL) lines,
we see that AutoPPL has uniformly larger ESS/s,
and the difference grows quickly with sample size.
Since the sampling results were otherwise identical,
these plots show that the drastic difference in ESS/s comes purely from faster
computation, such as automatic differentiation during NUTS sampling,
and from the compile-time optimizations enabled by how we interpret the model expression.

ESS was computed as outlined
[here](https://mc-stan.org/docs/2_23/reference-manual/effective-sample-size-section.html).
We also made some adjustments to use Geyer's biased estimator for ESS,
as in the current implementation of STAN
([source](https://github.com/stan-dev/stan/blob/525998129ea838ec685f1d1f65dc76063d0fd40d/src/stan/analyze/mcmc/compute_effective_sample_size.hpp)).
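
For reference, the estimator roughly takes the following form
(a summary of the linked reference; notation ours):

```
ESS = m * n / tau_hat,    tau_hat = 1 + 2 * sum_{t=1}^{T} rho_hat_t
```

where `m` is the number of chains, `n` the number of draws per chain,
`rho_hat_t` the combined within- and between-chain autocorrelation estimate at lag `t`,
and `T` the truncation point chosen by Geyer's initial monotone sequence criterion.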

The following is the AutoPPL code for the model specification, without data loading.
The full code can be found
[here](benchmark/regression_autoppl.cpp):

```cpp
auto X = ppl::make_data_view<ppl::mat>(X_data);
auto y = ppl::make_data_view<ppl::vec>(y_data);

ppl::Param<double, ppl::vec> w(3);
ppl::Param<double> b;
ppl::Param<double> s;

// bind each parameter to a column of the sample storage matrix
arma::mat storage(num_samples, w.size() + b.size() + s.size());
for (size_t i = 0; i < w.size(); ++i) {
    w.storage(i) = storage.colptr(i);
}
b.storage() = storage.colptr(w.size());
s.storage() = storage.colptr(w.size() + b.size());

auto model = (s |= ppl::uniform(0.5, 8.),
              b |= ppl::normal(0., 5.),
              w |= ppl::normal(0., 5.),
              y |= ppl::normal(ppl::dot(X, w) + b, s * s + 2.));

NUTSConfig<> config;
config.warmup = num_samples;
config.n_samples = num_samples;

auto res = ppl::nuts(model, config);
```
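
As a usage note (not part of the benchmark code, and assuming the `storage`
layout set up above), posterior summaries can be read directly from the bound matrix:

```cpp
// column i holds the draws for w[i]; the next two columns hold b and s
double w0_mean = arma::mean(storage.col(0));
double b_mean  = arma::mean(storage.col(w.size()));
double s_mean  = arma::mean(storage.col(w.size() + b.size()));
```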

### Gaussian Model

We generated 1000 values from a standard normal distribution to form our data.
Our model is defined as follows:

```
y ~ Normal(l1 + l2, s)
l1 ~ Normal(0., 10.)
l2 ~ Normal(0., 10.)
s ~ Uniform(0., 20.)
```

where all parameters are scalar.

The benchmark configurations are exactly the same as in the
[previous section](#benchmarks-bayesian-linear-regression).

The following plots show benchmarks between
run-times and effective sample size (ESS) per second:

![](docs/figures/gaussian_benchmark_plot/runtime.png)
![](docs/figures/gaussian_benchmark_plot/ess_s.png)

We note that both STAN and AutoPPL produced almost identical
means, standard deviations, and ESS values.

The runtimes have a similar linear trend,
and it is clear that STAN (dotted lines) takes far longer
in both sampling and warmup times.
The ESS/s lines for `l1` and `l2` (red and blue) overlap completely in both STAN and AutoPPL,
which is expected since the two parameters are symmetric in the model specification.
With the exception of the two smallest sample sizes (100, 500),
ESS/s is fairly constant as sample size varies.
It is quite evident that AutoPPL (solid) has a larger ESS/s by a significant factor.

The reason for this difference is in how we handle expressions
where the elements of a data vector are iid (independent and identically distributed).
For most distributions, especially those in an exponential family,
differentiation can be heavily optimized in the iid setting.
However, it is worth noting that this optimization does not apply when
the data are independent but not identically distributed
(as in the [linear regression](#benchmarks-bayesian-linear-regression) case),
or when the variable is a parameter rather than data.
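
To illustrate the kind of simplification involved (a sketch of the general idea,
not AutoPPL's actual implementation): for iid `y_i ~ Normal(mu, s)`, the gradient of the
log-likelihood with respect to `mu` only needs the cached sum of the data, so
differentiation does not have to touch every element on every iteration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// d/dmu log p(y | mu, s) = sum_i (y_i - mu) / s^2 = (sum_y - n * mu) / s^2
struct IIDNormalMuGrad {
    double sum_y;   // cached once when the data is bound
    std::size_t n;  // number of observations

    explicit IIDNormalMuGrad(const std::vector<double>& y)
        : sum_y(std::accumulate(y.begin(), y.end(), 0.0)), n(y.size()) {}

    double operator()(double mu, double s) const {
        return (sum_y - static_cast<double>(n) * mu) / (s * s);
    }
};
```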

The following is the AutoPPL code without data generation.
The full code can be found
[here](benchmark/normal_two_prior_distribution.cpp).

```cpp
ppl::Param<double> l1, l2, s;

auto model = (
s |= ppl::uniform(0.0, 20.0),
l1 |= ppl::normal(0.0, 10.0),
l2 |= ppl::normal(0.0, 10.0),
y |= ppl::normal(l1 + l2, s)
);

arma::mat storage(n_samples, 3);
l1.storage() = storage.colptr(0);
l2.storage() = storage.colptr(1);
s.storage() = storage.colptr(2);

ppl::NUTSConfig<> config;
config.n_samples = n_samples;
config.warmup = n_samples;

auto res = ppl::nuts(model, config);
```
## Contributors