# Push Bayesian Deep Learning Tutorial
## Introduction

In this notebook we will introduce the concept of Bayesian Deep Learning and demonstrate its usage in Push by running a deep ensemble.

## The Posterior Predictive Distribution
The goal of Bayesian Deep Learning (BDL) methods is to estimate the posterior predictive distribution.

$$p(y|x, D) = \int p(y|x, w) p(w|D) \, dw
$$

where $y$ is an output, $x$ is an input, $w$ are the parameters, and $D$ is the data. This integral is intractable and must be approximated. The typical method of approximation is Monte Carlo:

$$p(y|x, D) = \int p(y|x, w) p(w|D) \, dw \approx \frac{1}{J} \sum_{j} p(y|x, w_j), \quad w_j \sim p(w|D)$$
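To make the Monte Carlo approximation concrete, here is a minimal sketch on a toy problem that is not part of Push. It assumes a one-parameter linear-Gaussian model, chosen only because its predictive integral has a closed form to compare against. The names `noise_std`, `post_mean`, `post_std` and all numeric values are illustrative assumptions.

```python
import math
import random

def gaussian_pdf(y, mean, std):
    """Density of N(mean, std^2) evaluated at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Toy setup (all values are illustrative assumptions):
#   likelihood  p(y|x, w) = N(y; w * x, noise_std^2)
#   posterior   p(w|D)    = N(post_mean, post_std^2)
noise_std, post_mean, post_std = 0.3, 1.0, 0.5
x, y = 2.0, 2.5

# Monte Carlo estimate: average the likelihood over J posterior samples w_j.
rng = random.Random(0)
J = 20000
samples = [rng.gauss(post_mean, post_std) for _ in range(J)]
mc_estimate = sum(gaussian_pdf(y, w * x, noise_std) for w in samples) / J

# For this linear-Gaussian toy the integral has a closed form:
#   p(y|x, D) = N(y; post_mean * x, noise_std^2 + x^2 * post_std^2)
exact = gaussian_pdf(y, post_mean * x, math.sqrt(noise_std**2 + (x * post_std) ** 2))

print(f"Monte Carlo: {mc_estimate:.4f}  exact: {exact:.4f}")
```

In a neural network the posterior cannot be sampled this easily, which is why the choice of the $w_j$ matters so much in practice.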


The sum on the right is a Monte Carlo approximation of the Bayesian Model Average, taken over $J$ parameter settings. A Deep Ensemble likewise averages the predictions of several independently trained, randomly initialized models, which is the primary justification for regarding Deep Ensembles as a Bayesian Deep Learning method.

## Approximating the Posterior Predictive Distribution

In Bayesian Deep Learning the integral we are approximating may be over a multi-million-dimensional parameter space, and the posterior is likely non-Gaussian and multimodal. Computational cost also limits the number of samples we can draw to approximate the posterior. We therefore want (i) typical points in the posterior, representing regions with a lot of mass, and (ii) a diversity of points.

Deep Ensembles have these two properties: by retraining a neural network multiple times with different initializations, distinct low-loss solutions in different basins of attraction can typically be found. With optimizers such as SGD, these solutions tend to settle in large basins of attraction.
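The effect of different initializations can be sketched on a one-dimensional toy loss with two basins. This is an illustration under stated assumptions, not how Push trains its ensembles: the loss function, the learning rate, and the fixed starting points (standing in for random initializations) are all made up, and plain gradient descent is used in place of SGD.

```python
def loss(w):
    """Toy non-convex loss with two minima (basins) at w = -1 and w = +1."""
    return (w * w - 1.0) ** 2

def grad(w):
    """Derivative of the toy loss."""
    return 4.0 * w * (w * w - 1.0)

def train(w_init, lr=0.05, steps=200):
    """Plain gradient descent from a given initialization."""
    w = w_init
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Fixed starting points stand in for random initializations of ensemble members.
inits = [-1.5, -0.3, 0.4, 1.7]
solutions = [train(w0) for w0 in inits]

for w0, w in zip(inits, solutions):
    print(f"init {w0:+.1f} -> w = {w:+.4f}, loss = {loss(w):.2e}")
```

Each run reaches a low-loss solution, but which basin it ends up in depends on where it started, which is the source of diversity in an ensemble.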

We can also view Deep Ensembles as approximating the posterior by point masses at different modes, combined with simple Monte Carlo integration. With this posterior, the Bayesian predictive distribution is $p(y|x, D) = \frac{1}{J} \sum_{j} p(y|x, w_j)$, where the $w_j$ are the weights of the individual ensemble members. This is exactly the standard deep ensemble procedure [1,3,8].
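As a small worked example of this equal-weight average, suppose each ensemble member predicts a Gaussian over $y$ with its own mean and a shared noise level. The member means and `noise_std` below are hypothetical numbers, not outputs of any trained model. The ensemble predictive is then a Gaussian mixture whose mean and variance can be computed directly:

```python
import statistics

# Hypothetical predictions of J = 3 ensemble members for one test input x.
# Each member j is assumed to predict p(y|x, w_j) = N(mu_j, noise_std^2).
member_means = [1.8, 2.1, 2.4]
noise_std = 0.3

# The ensemble predictive (1/J) * sum_j p(y|x, w_j) is an equal-weight
# Gaussian mixture. Its mean is the average of the member means.
ensemble_mean = statistics.fmean(member_means)

# By the law of total variance, the mixture variance is the shared noise
# variance plus the spread (population variance) of the member means.
disagreement = statistics.pvariance(member_means)
ensemble_var = noise_std**2 + disagreement

print(f"mean = {ensemble_mean:.3f}, variance = {ensemble_var:.3f}")
```

The `disagreement` term is zero when all members agree and grows as they diverge, so the ensemble reports more uncertainty on inputs where its members differ.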
![](posterior.png)


**Figure 1.** $p(y|x, D) = \int p(y|x, w) p(w|D) \, dw$. **Top**: $p(w|D)$, with representations from VI (orange), deep ensembles (blue), and MultiSWAG (red).

**Middle**: $p(y|x, w)$ as a function of $w$ for a test input $x$. This function does not vary much within modes, but changes significantly between modes.

**Bottom**: Distance between the true predictive distribution and the approximation, as a function of representing a posterior at an additional point $w$, assuming we have sampled the mode in dark green. There is more to be gained by exploring new basins than by continuing to explore the same basin.

This idea is shown above in Figure 1, taken from [1]. The top panel is a multimodal posterior. The middle panel displays the predictive distribution $p(y|x, w)$ conditioned on parameters $w$. Within a single basin the predictive distribution does not change much, but between basins it differs considerably. Therefore, we would prefer to select points from different basins of attraction to obtain a good approximation to the Bayesian Model Average integral.
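The same argument can be checked numerically on a toy problem. The sketch below assumes a one-dimensional bimodal posterior and a sigmoid likelihood, both invented for illustration and unrelated to the figure's actual data. It compares two-point approximations of the predictive integral: both points in one basin versus one point per basin.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_pdf(w, mean, std):
    z = (w - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(w):
    """Toy bimodal posterior p(w|D): two equally weighted modes at -1 and 3."""
    return 0.5 * gaussian_pdf(w, -1.0, 0.2) + 0.5 * gaussian_pdf(w, 3.0, 0.2)

def likelihood(w, x=1.0):
    """Toy p(y=1|x, w) = sigmoid(w * x)."""
    return sigmoid(w * x)

# Reference value of the integral via a fine Riemann sum over w in [-4, 6].
step = 0.001
grid = [-4.0 + i * step for i in range(10001)]
true_predictive = sum(likelihood(w) * posterior(w) for w in grid) * step

def average(points):
    """Equal-weight average of p(y=1|x, w) over a few parameter settings."""
    return sum(likelihood(w) for w in points) / len(points)

same_basin = average([3.0, 3.1])   # both points in the mode near w = 3
two_basins = average([3.0, -1.0])  # one point per mode

err_same = abs(same_basin - true_predictive)
err_two = abs(two_basins - true_predictive)
print(f"true = {true_predictive:.4f}")
print(f"same basin error = {err_same:.4f}, two basins error = {err_two:.4f}")
```

With the same budget of two points, covering both modes gives a much smaller error than adding a second point to a basin that is already represented.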


In [1]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import push.bayes.ensemble

# =============================================================================
# Simple Dataset + Neural Network
# =============================================================================

class RandDataset(Dataset):
    def __init__(self, D):
        self.xs = torch.randn(128*10, D)
        self.ys = torch.randn(128*10, 1)

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]


class MiniNN(nn.Module):
    """Two-layer MLP block mapping D features to D features."""
    def __init__(self, D):
        super().__init__()
        self.fc1 = nn.Linear(D, D)
        self.fc2 = nn.Linear(D, D)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        return x


class BiggerNN(nn.Module):
    """Stack of n MiniNN blocks followed by a linear layer with one output."""
    def __init__(self, n, D):
        super().__init__()
        self.n = n
        # nn.ModuleList registers each block so its parameters are tracked.
        self.minis = nn.ModuleList([MiniNN(D) for _ in range(n)])
        self.fc = nn.Linear(D, 1)

    def forward(self, x):
        for mini in self.minis:
            x = mini(x)
        return self.fc(x)






In [2]:


# L = 10
# D = 20
# dataset = RandDataset(D)
# dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

# epochs = 10
# num_ensembles = 3
# push.bayes.ensemble.train_deep_ensemble(
#     dataloader,
#     torch.nn.MSELoss(),
#     epochs,
#     BiggerNN, L, D,
#     num_ensembles=num_ensembles
# )

References:
https://cims.nyu.edu/~andrewgw/deepensembles/

[1] A.G. Wilson, P. Izmailov. Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Advances in Neural Information Processing Systems, 2020.

[2] P. Izmailov, S. Vikram, M.D. Hoffman, A.G. Wilson. What Are Bayesian Neural Network Posteriors Really Like? International Conference on Machine Learning, 2021.

[3] A.G. Wilson. The Case for Bayesian Deep Learning. 2019.

[4] A.G. Wilson. Thread on Deep Ensembles as Approximate Bayesian Inference. 2020.

[5] A.G. Wilson. Examining Critiques in Bayesian Deep Learning. Video. April 2021.

[6] C.E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. Advances in Neural Information Processing Systems, 2003.

[7] M. Osborne. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature, PhD Thesis, 2010.

[8] F. Gustafsson, M. Danelljan, T. Schon. Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision. CVPR Workshop, 2020.

[9] P. Izmailov, P. Nicholson, S. Lotfi, A.G. Wilson. Dangers of Bayesian Model Averaging under Covariate Shift. Advances in Neural Information Processing Systems, 2021.