Modify NUTS to output timed results and finish README!!
JamesYang007 committed Jul 18, 2020
1 parent a1e3f83 commit 9c20447
Showing 14 changed files with 715 additions and 127 deletions.
272 changes: 214 additions & 58 deletions README.md
- [Sampling from Joint Distribution](#sampling-from-joint-distribution)
- [Sampling Posterior Mean and Standard Deviation](#sampling-posterior-mean-and-standard-deviation)
- [Bayesian Linear Regression](#bayesian-linear-regression)
- [Benchmarks](#benchmarks)
- [Bayesian Linear Regression](#benchmarks-bayesian-linear-regression)
- [Gaussian Model](#gaussian-model)
- [Contributors](#contributors)
- [Third Party Tools](#third-party-tools)


A model object makes no heap-allocations and is minimal in size.
It is simply a small, contiguous slab of memory representing the binary tree.
The `model` object in the [previous section](#intuitive-model-specification)
is about `120 bytes` on `x86_64-apple-darwin17.7.0` using `clang-11.0.3`.

For a more complicated model such as the following:
```cpp
ppl::Data<double> X;

ppl::Param<double, ppl::vec> theta(6);
auto model = (
theta[0] |= ppl::uniform(-1., 1.),
theta[1] |= ppl::uniform(theta[0], theta[0] + 2.),
    // ... (relationships for theta[2] through theta[5] and X omitted here) ...
);
```

The size of the model is `616 bytes` on the same machine and compiler.

A model expression simply references the variables used in the expression
such as `theta[0], theta[1], ..., theta[5], X`, i.e. it does not copy any data.
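
A minimal sketch of what this implies (assuming `<iostream>` is included and using the
`model` defined above): the expression's footprint is fixed by its structure, not by
how much data is bound to `X`.

```cpp
// The expression type is determined entirely by the structure of the model,
// so its size does not grow with the number of observations held by X.
std::cout << sizeof(model) << " bytes\n";
```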

### High-performance Inference Methods

Users interface with the inference methods via model expressions and other
configuration parameters for that particular method.
Hence, the inference algorithms are completely general and work with any model
so long as the model expression is properly constructed.
Due to the statically-known model specification, algorithms
have opportunities to make compile-time optimizations.
See [Benchmarks](#benchmarks) for more detail.
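
As a rough sketch of the interface (assuming a `model` expression like the ones above;
both calls appear in full in the examples below):

```cpp
ppl::mh(model, 1000);   // Metropolis-Hastings with 1000 samples
ppl::nuts(model);       // NUTS with the default configuration
```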

We were largely inspired by STAN and followed their
[reference](https://mc-stan.org/docs/2_23/reference-manual/index.html)
and also their
[implementation](https://github.com/stan-dev/stan)
to make statistical optimizations such as
computing ESS, performing adaptations, and stabilizing sampling algorithms.
However, our library works quite differently underneath, especially
with automatic differentiation and handling model expressions.

## Installation (dev)

## Examples

### Sampling from Joint Distribution

Although AutoPPL was designed to perform inference on posterior distributions,
one can certainly use it to sample from any joint distribution defined by the priors and conditional distributions.
As an example, we can sample `1000` points from a
standard normal distribution using Metropolis-Hastings in the following way:
```cpp
std::array<double, 1000> samples;
ppl::Param<double> theta {samples.data()};

auto model = (
    theta |= ppl::normal(0., 1.)
);

ppl::mh(model, 1000);
```
The scalar parameter `theta` is bound to the storage `samples` at construction.
After calling `mh`, the `1000` samples are placed into `samples`.
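
As a quick follow-up sketch (assuming `<numeric>` is included), the draws can be
summarized directly from the bound storage:

```cpp
double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
```
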
In general, so long as the joint PDF is known,
or equivalently and more commonly if the conditional and prior PDFs are known,
one can build a model expression for the joint distribution and sample from it:

```cpp
std::array<double, 1000> theta1_samples, theta2_samples;
ppl::Param<double> theta1 {theta1_samples.data()};
ppl::Param<double> theta2 {theta2_samples.data()};

auto model = (
    theta1 |= ppl::uniform(-1., 1.),   // illustrative prior on theta1
    theta2 |= ppl::normal(theta1, 1.)  // theta2 conditioned on theta1
);

ppl::mh(model, 1000);
```

Here, `theta2` depends on `theta1` and hence defines
the conditional distribution `theta2 | theta1`.

### Sampling Posterior Mean and Standard Deviation

The following is an example of fitting a Gaussian model to some data.
We put a `Normal(0,3)` prior on the mean and `Uniform(0,2)` prior on the
standard deviation.
While the previous section used Metropolis-Hastings to demonstrate the interface,
it is recommended to use the state-of-the-art NUTS sampler to sample
from the posterior distribution.

```cpp
std::array<double, 1000> mu_samples, sigma_samples;

ppl::Data<double, ppl::vec> x {1.0, 1.5, 1.7, 1.2, 1.5};
ppl::Param<double> mu {mu_samples.data()};
ppl::Param<double> sigma {sigma_samples.data()};

auto model = (
    mu |= ppl::normal(0., 3.),
    sigma |= ppl::uniform(0., 2.),
x |= ppl::normal(mu, sigma)
);

ppl::nuts(model);
```
By default, NUTS runs 1000 iterations each for warmup and actual sampling
(2000 iterations in total).
The default adaptation method estimates a diagonal precision matrix.
To change the configuration, pass in a config object:
```cpp
NUTSConfig<...> config; // replace ... with ppl::unit_var or ppl::diag_var
config.n_samples = ...; // number of samples
config.warmup = ...; // number of warmup iters
config.seed = ...; // set seed (default: random)
config.max_depth = ...; // max depth of tree built during NUTS (default: 10)
ppl::nuts(model, config);
```
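
For instance, a concrete configuration might look like the following
(a sketch; `ppl::diag_var` is the adaptation tag mentioned in the comment above):

```cpp
ppl::NUTSConfig<ppl::diag_var> config;  // adapt a diagonal precision matrix
config.warmup = 4000;                   // 4000 warmup iterations
config.n_samples = 4000;                // 4000 posterior samples
config.seed = 42;                       // fixed seed for reproducibility
ppl::nuts(model, config);
```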

### Bayesian Linear Regression

This example covers ridge regression in a Bayesian setting.
We created a fictitious dataset consisting of `(x,y)` coordinates.
The true relationship is the following: `y = x + 1`.
By specifying two parameters for the weight and bias, we propose
the following probabilistic model:

```
y ~ Normal(x*w + b, 0.5)
w ~ Uniform(0, 2)
b ~ Uniform(0, 2)
```

In AutoPPL, we can write the following code and sample from the posterior:
```cpp
// (definitions of the data `x`, `y` and the storage arrays `w_storage`, `b_storage`
//  are omitted here)
ppl::Param<double> w {w_storage.data()};
ppl::Param<double> b {b_storage.data()};

auto model = (
w |= ppl::uniform(0., 2.),
b |= ppl::uniform(0., 2.),
y |= ppl::normal(x * w + b, 0.5)
);

ppl::nuts(model);
```

## Benchmarks

In the following examples, we show benchmarks against STAN.
We list the benchmark settings for completeness:

- Machine: x86_64-apple-darwin19.5.0
- CPU: 3.4 GHz Quad-Core Intel Core i5
- Compiler: Clang 11.0.3

### Bayesian Linear Regression <a name="benchmarks-bayesian-linear-regression"></a>

We collected a dataset on life expectancy released by WHO
([source](https://www.kaggle.com/kumarajarshi/life-expectancy-who)).
After cleaning and extracting three predictors ("Alcohol", "HIV/AIDS", and "GDP"),
the dataset consisted of 157 points.
We performed a Bayesian linear regression with this data and the following model:

```
y ~ Normal(X*w + b, s*s + 2.)
w ~ Normal(0., 5.)
b ~ Normal(0., 5.)
s ~ Uniform(0.5, 8.)
```

where `w` is a 3-dimensional parameter vector,
and `b` and `s` are scalar parameters.

Using the same dataset and model specification,
we ran NUTS with varying numbers of samples.
We set the number of chains and cores to 1,
and used the diagonal precision matrix adaptation.

The following plots compare run times and effective sample size (ESS) per second:

![](docs/figures/regression_benchmark_plot/runtime.png)
![](docs/figures/regression_benchmark_plot/ess_s.png)

The reported mean, standard deviation, and ESS values were
almost identical in all cases,
which is not surprising since we used the same algorithm to estimate ESS
and to perform NUTS.

The run times show similar log-like growth,
but it is clear that STAN (dotted lines) takes far longer
in both sampling and warmup times.
As for ESS/s, comparing lines of the same color (one color per parameter)
between dotted (STAN) and solid (AutoPPL) lines,
we see that AutoPPL has uniformly larger ESS/s,
and the difference grows quickly with sample size.
Since the sampling results were otherwise identical,
these plots show that the drastic difference in ESS/s comes purely from faster
computation, such as automatic differentiation during NUTS sampling,
and from the compile-time optimizations enabled by how we interpret the model expression.

ESS was computed as outlined
[here](https://mc-stan.org/docs/2_23/reference-manual/effective-sample-size-section.html).
We also made some adjustments to use Geyer's biased estimator for ESS,
as in the current implementation of STAN
([source](https://github.com/stan-dev/stan/blob/525998129ea838ec685f1d1f65dc76063d0fd40d/src/stan/analyze/mcmc/compute_effective_sample_size.hpp)).
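
For reference, the estimator roughly takes the following form
(a summary of the linked reference; notation ours):

```
ESS = m * n / tau_hat,    tau_hat = 1 + 2 * sum_{t=1}^{T} rho_hat_t
```

where `m` is the number of chains, `n` the number of draws per chain,
`rho_hat_t` the combined within- and between-chain autocorrelation estimate at lag `t`,
and `T` the truncation point chosen by Geyer's initial monotone sequence criterion.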

The following is the AutoPPL code for the model specification, without data loading.
The full code can be found
[here](benchmark/regression_autoppl.cpp):

```cpp
auto X = ppl::make_data_view<ppl::mat>(X_data);
auto y = ppl::make_data_view<ppl::vec>(y_data);

ppl::Param<double, ppl::vec> w(3);
ppl::Param<double> b;
ppl::Param<double> s;

// bind each parameter to a column of the sample storage matrix
arma::mat storage(num_samples, w.size() + b.size() + s.size());
for (size_t i = 0; i < w.size(); ++i) {
    w.storage(i) = storage.colptr(i);
}
b.storage() = storage.colptr(w.size());
s.storage() = storage.colptr(w.size() + b.size());

auto model = (s |= ppl::uniform(0.5, 8.),
              b |= ppl::normal(0., 5.),
              w |= ppl::normal(0., 5.),
              y |= ppl::normal(ppl::dot(X, w) + b, s * s + 2.));

NUTSConfig<> config;
config.warmup = num_samples;
config.n_samples = num_samples;

auto res = ppl::nuts(model, config);
```
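
As a usage note (not part of the benchmark code, and assuming the `storage`
layout set up above), posterior summaries can be read directly from the bound matrix:

```cpp
// column i holds the draws for w[i]; the next two columns hold b and s
double w0_mean = arma::mean(storage.col(0));
double b_mean  = arma::mean(storage.col(w.size()));
double s_mean  = arma::mean(storage.col(w.size() + b.size()));
```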

### Gaussian Model

We generated 1000 values from a standard normal distribution to form our data.
Our model is defined as follows:

```
y ~ Normal(l1 + l2, s)
l1 ~ Normal(0., 10.)
l2 ~ Normal(0., 10.)
s ~ Uniform(0., 20.)
```

where all parameters are scalar.

The benchmark configurations are exactly the same as in the
[previous section](#benchmarks-bayesian-linear-regression).

The following plots show benchmarks between
run-times and effective sample size (ESS) per second:

![](docs/figures/gaussian_benchmark_plot/runtime.png)
![](docs/figures/gaussian_benchmark_plot/ess_s.png)

We note that both STAN and AutoPPL produced almost identical
means, standard deviations, and ESS values.

The runtimes have a similar linear trend,
and it is clear that STAN (dotted lines) takes far longer
in both sampling and warmup times.
The ESS/s lines for `l1` and `l2` (red and blue) overlap completely in both STAN and AutoPPL,
which is expected since the two parameters are symmetric in the model specification.
With the exception of the two smallest sample sizes (100, 500),
ESS/s is fairly constant as sample size varies.
It is quite evident that AutoPPL (solid) has a larger ESS/s by a significant factor.

The reason for this difference is in how we handle expressions
where the elements of a data vector are iid (independent and identically distributed).
For most distributions, especially those in an exponential family,
differentiation can be heavily optimized in the iid setting.
However, it is worth noting that this optimization does not apply when
the data are independent but not identically distributed
(as in the [linear regression](#benchmarks-bayesian-linear-regression) case),
or when the variable is a parameter rather than data.
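
To illustrate the kind of simplification involved (a sketch of the general idea,
not AutoPPL's actual implementation): for iid `y_i ~ Normal(mu, s)`, the gradient of the
log-likelihood with respect to `mu` only needs the cached sum of the data, so
differentiation does not have to touch every element on every iteration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// d/dmu log p(y | mu, s) = sum_i (y_i - mu) / s^2 = (sum_y - n * mu) / s^2
struct IIDNormalMuGrad {
    double sum_y;   // cached once when the data is bound
    std::size_t n;  // number of observations

    explicit IIDNormalMuGrad(const std::vector<double>& y)
        : sum_y(std::accumulate(y.begin(), y.end(), 0.0)), n(y.size()) {}

    double operator()(double mu, double s) const {
        return (sum_y - static_cast<double>(n) * mu) / (s * s);
    }
};
```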

The following is the AutoPPL code without data generation.
The full code can be found
[here](benchmark/normal_two_prior_distribution.cpp).

```cpp
ppl::Param<double> l1, l2, s;

auto model = (
s |= ppl::uniform(0.0, 20.0),
l1 |= ppl::normal(0.0, 10.0),
l2 |= ppl::normal(0.0, 10.0),
y |= ppl::normal(l1 + l2, s)
);

arma::mat storage(n_samples, 3);
l1.storage() = storage.colptr(0);
l2.storage() = storage.colptr(1);
s.storage() = storage.colptr(2);

ppl::NUTSConfig<> config;
config.n_samples = n_samples;
config.warmup = n_samples;

auto res = ppl::nuts(model, config);
```
## Contributors