## Optimization

To apply SGHMC to a variety of problems, the implementation takes functions as arguments that compute the gradients of the log density for the problem at hand. We want users to be able to write their own customized gradient functions in Python and pass them to our `sghmc` function.
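As an illustration of this design, the sketch below defines gradient functions for a normal model with a standard normal prior and combines them into a stochastic gradient of $U=-\log p(\theta\mid y)$. The names `grad_log_prior`, `grad_log_lik`, and `stochastic_grad_U` are hypothetical and chosen for this example only; they are not the package's actual interface.

```python
import numpy as np

def grad_log_prior(theta):
    """Gradient of log N(theta | 0, I) with respect to theta."""
    return -theta

def grad_log_lik(theta, batch):
    """Gradient of sum_i log N(y_i | theta, I) over the rows y_i of batch."""
    return (batch - theta).sum(axis=0)

def stochastic_grad_U(theta, batch, n_total, grad_log_prior, grad_log_lik):
    """Mini-batch estimate of the gradient of U = -log posterior.

    The likelihood term is rescaled by n_total / batch size so that it
    estimates the gradient over the full data set.
    """
    scale = n_total / batch.shape[0]
    return -(scale * grad_log_lik(theta, batch) + grad_log_prior(theta))

# Tiny check: 4 observations in the batch, 8 in the full data set.
theta = np.zeros(3)
batch = np.ones((4, 3))
g = stochastic_grad_U(theta, batch, 8, grad_log_prior, grad_log_lik)
```

Passing the gradient functions as arguments keeps the sampler independent of the model: switching to another problem only requires another pair of functions.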

Unlike the theory, the algorithm itself is simple. The function takes user-specified gradient functions together with the data, an initial guess, and the other parameters of the algorithm. During sampling, the main cost is computing the gradients and the matrix products. Consequently, our first implementation is vectorized using NumPy matrices. Building on the NumPy version, we also tried several optimizations:

- Precompilation of gradient functions. When profiling the algorithm, we found that evaluating the gradient functions accounts for a large share of the total time. The first optimization we tried was therefore to pass in Numba-compiled gradient functions.

- Rewriting the matrix computations in C++. Another heavy computation is the update of $r$, which involves several matrix products that become expensive when the dimension of the sampled parameters is large. We therefore extracted the matrix computations, rewrote them in C++, and wrapped them with pybind. We also tried to implement the whole algorithm in C++, but this proved difficult: the gradient functions are user-specified Python functions whose evaluation is embedded in the sampling process, so that part cannot be moved to C++.

- Applying multiprocessing to run multiple simulations simultaneously. We implemented a parallel function, `sghmc_chains`, that runs multiple independent sampling processes.

- Removing the data shuffle from the sampling procedure. SGHMC splits the data into mini-batches. We found that shuffling the data before splitting is very expensive: it takes 3/4 of the total time. We therefore removed the shuffling step from our code for efficiency.
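To make the sampling loop concrete, here is a minimal sketch of an SGHMC update with sequential, unshuffled mini-batches. It is a simplified illustration, not the package's implementation: the function name `sghmc_sketch`, its arguments, and the default values (`C`, `M_inv`, `B_hat`, `eps`) are assumptions, and momentum resampling is omitted. The model is $y_i\sim N(\theta_i,1)$ with prior $\theta_i\sim N(0,1)$.

```python
import numpy as np

def grad_U(theta, batch, n_total):
    """Mini-batch gradient of U = -log posterior for y ~ N(theta, I), theta ~ N(0, I)."""
    scale = n_total / batch.shape[0]
    return -(scale * (batch - theta).sum(axis=0) - theta)

def sghmc_sketch(grad_U, data, theta0, n_iter=5000, batch_size=100,
                 eps=1e-3, C=None, M_inv=None, B_hat=None, seed=0):
    """Simplified SGHMC: friction C, inverse mass M_inv, noise estimate B_hat."""
    rng = np.random.default_rng(seed)
    dim = theta0.shape[0]
    C = 10.0 * np.eye(dim) if C is None else C
    M_inv = np.eye(dim) if M_inv is None else M_inv
    B_hat = np.zeros((dim, dim)) if B_hat is None else B_hat
    # Cholesky factor of the injected-noise covariance 2 * eps * (C - B_hat).
    noise_chol = np.linalg.cholesky(2.0 * eps * (C - B_hat))
    n_total = data.shape[0]
    n_batches = n_total // batch_size
    theta = theta0.astype(float).copy()
    r = np.zeros(dim)  # momentum, started at zero for simplicity
    samples = np.empty((n_iter, dim))
    for t in range(n_iter):
        # Sequential mini-batches: cycle through the data without shuffling.
        start = (t % n_batches) * batch_size
        batch = data[start:start + batch_size]
        theta = theta + eps * M_inv @ r
        noise = noise_chol @ rng.standard_normal(dim)
        r = r - eps * grad_U(theta, batch, n_total) - eps * C @ M_inv @ r + noise
        samples[t] = theta
    return samples

# Simulated data; the exact posterior mean is sum(y) / (n + 1).
data = np.random.default_rng(1).normal([1.0, -2.0], 1.0, size=(1000, 2))
samples = sghmc_sketch(grad_U, data, np.zeros(2))
posterior_mean = data.sum(axis=0) / (data.shape[0] + 1)
```

After discarding the first 1000 draws, the sample mean should be close to `posterior_mean`. Skipping the shuffle assumes the rows of `data` are not ordered in a way that makes consecutive mini-batches unrepresentative.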

### Timing

#### sghmc, sghmc with numba, and sghmc with cpp

First we timed `sghmc`, `sghmc` with Numba-compiled gradient functions, and the version of `sghmc` partially rewritten in C++. The test problem is sampling the posterior of $\{\theta_i\}$ in a multivariate normal model: $y_i\sim N(\theta_i,1)$ with prior $\theta_i\sim N(0,1)$ for $i=1,2,\dots,ndim$. In all cases we pass in 10,000 simulated observations. The timing results are shown in the following table (time unit: ms):

| Problem dimension (ndim) |    sghmc   | sghmc with numba funcs | sghmc with cpp |
|:-----------------:|:----------:|:----------------------:|:--------------:|
|         10        | 209 ± 4.59 |        233 ± 38        |   182 ± 7.46   |
|        100        | 751 ± 60.8 |        691 ± 105       |   496 ± 5.35   |
|        1000       | 6290 ± 222 |       4870 ± 291       |   20100 ± 284  |

From the table, we can see that

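- sghmc with numba funcs performs better than sghmc when the problem dimension is large (4870 ms vs. 6290 ms at ndim = 1000). At ndim = 10 and 100 the difference is within the measurement spread.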

- sghmc with cpp performs better than both Python versions when the problem dimension is small (ndim = 10 and 100). However, when the problem dimension gets very large, the efficiency of the cpp version drops dramatically (20100 ms vs. 6290 ms for sghmc at ndim = 1000).

Based on these results, we decided to remove the cpp version from our package because of its unstable performance.
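Timings of this kind can be collected with the standard-library `timeit` module. The helper below is a minimal sketch, assuming the ± values denote a standard deviation over repeated runs; `time_ms` and `dummy_workload` are hypothetical names, and the workload stands in for a call to the sampler.

```python
import statistics
import timeit

def time_ms(func, repeat=7, number=1):
    """Return (mean, std) of the per-call run time of func() in milliseconds."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call_ms = [1000.0 * t / number for t in totals]
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)

def dummy_workload():
    """Placeholder for a sampler call such as one full sghmc run."""
    return sum(i * i for i in range(10_000))

mean_ms, std_ms = time_ms(dummy_workload)
print(f"{mean_ms:.3g} ± {std_ms:.3g} ms")
```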

#### `sghmc` vs. `sghmc_chains`

We also compared `sghmc` and `sghmc_chains` in terms of the average time per chain. We used the same multivariate normal problem as in the comparison above, with the same amount of data and problem dimensions of 10 and 100, respectively.

![a](mpt1.png)

![b](mpt2.png)

The comparison shows that, for a small problem, simulating each chain with `sghmc_chains` is more efficient than simulating a single chain with `sghmc`. However, when the problem gets heavy, `sghmc_chains` becomes inefficient.

We also compared the average time on the mixture normal problem (described in the <b>examples</b> section):

|           | sghmc | sghmc_chains (20 chains) |
|-----------|:-----:|:------------------------:|
| time (ms) |   98  |      25 (498 total)      |

In this example, `sghmc_chains` is efficient for drawing a large sample: the 20 chains take 498 ms in total, about 25 ms per chain, compared with 98 ms for a single `sghmc` run.
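The idea of running independent chains in parallel can be sketched with `concurrent.futures`. This is an illustration only and not the code of `sghmc_chains`: `run_chain` and `run_chains_sketch` are hypothetical names, and the chain is a one-dimensional SGHMC-style sampler targeting a standard normal.

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import repeat

def run_chain(seed, n_samples, eps=0.1, friction=1.0):
    """One independent chain targeting N(0, 1), for which grad U(theta) = theta."""
    rng = random.Random(seed)  # a separate seed per chain keeps chains independent
    theta, r = 0.0, 0.0
    noise_sd = math.sqrt(2.0 * friction * eps)
    draws = []
    for _ in range(n_samples):
        theta += eps * r
        r += -eps * theta - eps * friction * r + rng.gauss(0.0, noise_sd)
        draws.append(theta)
    return draws

def run_chains_sketch(n_chains, n_samples, executor_cls=ProcessPoolExecutor):
    """Run n_chains chains concurrently and return one list of draws per chain.

    ThreadPoolExecutor can be passed as executor_cls as a drop-in alternative.
    """
    seeds = range(n_chains)
    with executor_cls(max_workers=n_chains) as pool:
        return list(pool.map(run_chain, seeds, repeat(n_samples)))
```

With the default `ProcessPoolExecutor`, call `run_chains_sketch` from a script under an `if __name__ == "__main__":` guard so that worker processes can import `run_chain`. Starting worker processes has a fixed cost, which matters more when each chain is short.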

## Examples
### Simulated data
->machao

### Real example: Bayesian Neural Network

Because of the complexity of the Bayesian convolutional neural network in the original paper, we did not implement that example. Instead, we performed Bayesian neural network regression on the MPG data (https://archive.ics.uci.edu/ml/index.php). The goal is to predict the MPG of a car from other factors. We use SGHMC to sample the posterior of the weights.

The neural network has 3 layers: an input layer with 9 input nodes, a hidden layer with 10 nodes, and an output layer with 1 node. The nodes in the hidden layer use the sigmoid function as their activation function. The output node simply returns the weighted sum of its inputs.

We place priors $weights\sim N(0,1)$ on the weights and assume the output is distributed as $y\sim N(output,1)$. We ran 2000 iterations, with the first 100 iterations as burn-in.
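The log posterior gradient for such a network can be sketched as follows. This is a minimal illustration under stated assumptions: the network has no separate bias terms, the weights are stored as `W1` (9 × 10) and `w2` (length 10), and all function names are hypothetical. The data in the usage lines are random placeholders, not the MPG data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, w2):
    """Hidden activations and network output for inputs X of shape (n, 9)."""
    h = sigmoid(X @ W1)   # (n, 10) sigmoid hidden layer
    return h, h @ w2      # output node: weighted sum of hidden activations

def log_post(X, y, W1, w2, scale=1.0):
    """Unnormalized log posterior: y ~ N(output, 1), weights ~ N(0, 1)."""
    _, out = forward(X, W1, w2)
    log_lik = -0.5 * np.sum((y - out) ** 2)
    log_prior = -0.5 * (np.sum(W1 ** 2) + np.sum(w2 ** 2))
    return scale * log_lik + log_prior

def grad_log_post(X, y, W1, w2, scale=1.0):
    """Gradient of log_post with respect to W1 and w2 (manual backpropagation).

    scale = N / batch size rescales the likelihood term for mini-batches.
    """
    h, out = forward(X, W1, w2)
    resid = y - out
    g_w2 = scale * (h.T @ resid) - w2
    delta = (resid[:, None] * w2) * h * (1.0 - h)
    g_W1 = scale * (X.T @ delta) - W1
    return g_W1, g_w2

# Placeholder inputs with the layer sizes described above.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((5, 9)), rng.standard_normal(5)
W1, w2 = rng.standard_normal((9, 10)), rng.standard_normal(10)
g_W1, g_w2 = grad_log_post(X, y, W1, w2)
```

The negative of these gradients is what a sampler would use as the gradient of $U$ for the weights.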

The following figures show how our network performed on this problem.

![c](bnnReg.png)

![d](bnnRMSE.png)

The first figure shows fitted vs. true values on the test set. The second figure shows the RMSE on the training and test sets over the iterations.

The figures show that our fitted BNN works on the test data. The more SGHMC iterations we run, the more accurate the predictions become. We can also see that our model did not overfit the data within 2000 iterations. And it
