**Task 1: Restricted Boltzmann Machine** </br>
Citation: *Deep Learning* (Goodfellow et al., 2016), as referenced in the lecture slides </br>

```
@book{Goodfellow-et-al-2016,
    title={Deep Learning},
    author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
    publisher={MIT Press},
    note={\url{http://www.deeplearningbook.org}},
    year={2016}
}
```


Like the general Boltzmann machine, the restricted Boltzmann machine is an energy-based model with the joint probability distribution specified by its energy function:
> $P\left ( v = \mathit{v},h = \mathit{h} \right ) = \frac{1}{Z}\exp\left ( -E\left ( v,h \right ) \right )$

The energy function for an RBM is given by
> $E\left ( v,h \right ) = -b^{T}v - c^{T}h - v^TWh$

and Z is the normalizing constant known as the partition function:
> $Z = \sum_{\overline{v}} \sum _{\overline{h}}\exp\left ( -E\left ( \overline{v},\overline{h} \right ) \right )$
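
To make the three definitions above concrete, here is a minimal sketch that evaluates the energy, the partition function and the joint probability by brute force. The parameter values `b`, `c` and `W` are hypothetical toy numbers chosen only for illustration, and the enumeration is only feasible because the model has just 3 visible and 2 hidden binary units.

```python
import itertools
import math


def energy(v, h, b, c, W):
    """E(v, h) = -b^T v - c^T h - v^T W h for binary vectors v and h."""
    bias_v = sum(b_i * v_i for b_i, v_i in zip(b, v))
    bias_h = sum(c_j * h_j for c_j, h_j in zip(c, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -bias_v - bias_h - interaction


def partition_function(b, c, W):
    """Z computed by summing exp(-E) over every binary state (v, h)."""
    n_v, n_h = len(b), len(c)
    return sum(math.exp(-energy(v, h, b, c, W))
               for v in itertools.product([0, 1], repeat=n_v)
               for h in itertools.product([0, 1], repeat=n_h))


def joint_probability(v, h, b, c, W):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, b, c, W)) / partition_function(b, c, W)


# Hypothetical toy parameters: 3 visible units, 2 hidden units.
b = [0.1, -0.2, 0.3]
c = [0.5, -0.4]
W = [[0.2, -0.1],
     [0.0, 0.3],
     [-0.5, 0.4]]

Z = partition_function(b, c, W)
total = sum(joint_probability(v, h, b, c, W)
            for v in itertools.product([0, 1], repeat=3)
            for h in itertools.product([0, 1], repeat=2))
print(f"Z = {Z:.4f}, sum of P(v, h) over all states = {total:.6f}")
```

The number of terms in the sum doubles with every additional unit, so this approach does not scale beyond toy models.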

Computing Z by exhaustively summing over all states is intractable in general: with binary units the double sum has $2^{n_{v}+n_{h}}$ terms, where $n_{v}$ and $n_{h}$ are the numbers of visible and hidden units. For restricted Boltzmann machines, Long and Servedio (2010) formally proved that the partition function Z is intractable. An intractable Z implies that the normalized distribution P(v) is also intractable to evaluate. However, the bipartite graph structure of the RBM gives it a special property: the hidden units are conditionally independent of one another given the visible units, and vice versa. The conditional distribution over the hidden units follows directly from the joint distribution:

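> $P\left ( h|v \right )=\frac{P\left ( h,v \right )}{P\left ( v \right )} $ </br>
> $= \frac{1}{P\left ( v \right )}\frac{1}{Z}\exp\left \{ b^{T}v+c^{T}h+v^{T}Wh \right \}$ </br>
> $= \frac{1}{Z'}\exp\left \{ c^{T}h + v^{T}Wh\right \}$ </br>
> $= \frac{1}{Z'}\exp\left \{\sum _{j=1}^{n_{h}}c_{j}h_{j} + \sum _{j=1}^{n_{h}}v^{T}W_{j}h_{j} \right \}$ </br>
> $= \frac{1}{Z'}\prod_{j=1}^{n_{h}}\exp\left \{ c_{j}h_{j}+v^{T}W_{j}h_{j} \right \}$</br>

Here $Z'$ absorbs every factor that does not depend on $h$, namely $P\left ( v \right )$, $Z$ and $\exp\left \{ b^{T}v \right \}$, and $W_{j}$ denotes the $j$-th column of $W$.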

It is now a simple matter of normalizing the distributions over the individual binary $h_{j}$.
>$P\left ( h_{j}=1 | \mathbf{v} \right )=\frac{\widetilde{P}\left ( h_{j}=1|\mathbf{v} \right )}{\widetilde{P}\left ( h_{j}=0|\mathbf{v} \right ) + \widetilde{P}\left ( h_{j}=1|\mathbf{v} \right )}$</br>
>$= \frac{\exp\left \{ c_{j}+\mathbf{v}^{T}W_{j} \right \}}{\exp\left \{ 0 \right \}+\exp\left \{ c_{j}+\mathbf{v}^{T}W_{j} \right \}}$</br>
>$=\sigma \left ( c_{j}+\mathbf{v}^{T}W_{j} \right )$</br>
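
The sigmoid form of the hidden conditional can be checked numerically on a small model. The sketch below compares the closed form against a brute-force computation that sums the unnormalized joint over all hidden states; the parameters `b`, `c` and `W` are hypothetical toy values, and `W[i][j]` is assumed to connect visible unit `i` to hidden unit `j`.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def unnormalized_joint(v, h, b, c, W):
    """exp(-E(v, h)) with E(v, h) = -b^T v - c^T h - v^T W h."""
    s = sum(b[i] * v[i] for i in range(len(v)))
    s += sum(c[j] * h[j] for j in range(len(h)))
    s += sum(v[i] * W[i][j] * h[j]
             for i in range(len(v)) for j in range(len(h)))
    return math.exp(s)


def hidden_conditional_brute_force(j, v, b, c, W):
    """P(h_j = 1 | v) from the joint; Z cancels in the ratio."""
    states = list(itertools.product([0, 1], repeat=len(c)))
    numerator = sum(unnormalized_joint(v, h, b, c, W)
                    for h in states if h[j] == 1)
    denominator = sum(unnormalized_joint(v, h, b, c, W) for h in states)
    return numerator / denominator


def hidden_conditional_closed_form(j, v, c, W):
    """P(h_j = 1 | v) = sigmoid(c_j + v^T W_j), W_j being column j of W."""
    return sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))


# Hypothetical toy parameters: 3 visible units, 2 hidden units.
b = [0.1, -0.2, 0.3]
c = [0.5, -0.4]
W = [[0.2, -0.1],
     [0.0, 0.3],
     [-0.5, 0.4]]

v = (1, 0, 1)
for j in range(len(c)):
    brute = hidden_conditional_brute_force(j, v, b, c, W)
    closed = hidden_conditional_closed_form(j, v, c, W)
    print(f"j={j}: brute force {brute:.6f}, closed form {closed:.6f}")
```

The two columns agree for every `v` and `j`, even though the closed form never touches the partition function.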

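A similar derivation shows that the other conditional of interest, P(v | h), also factorizes over the individual visible units:
> $P\left ( v|\mathbf{h} \right )=\frac{1}{Z''}\prod _{k}\exp\left \{ b_{k}v_{k}+v_{k}\left ( Wh \right )_{k} \right \}$

Here $Z''$ collects the factors that do not depend on $v$, and $\left ( Wh \right )_{k}$ is the $k$-th entry of the vector $Wh$. Normalizing over each binary $v_{k}$ therefore gives:
>$P\left ( v_{k}=1 | h \right )=\sigma \left ( b_{k} +\left ( Wh \right )_{k} \right )$</br>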
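
Because both conditionals factorize, every unit in one layer can be sampled independently once the other layer is fixed. One common use of this property is block Gibbs sampling, which alternates between the two layers. The following is a minimal sketch of a single such step, assuming binary units, hypothetical toy parameters `b`, `c`, `W`, and a seeded `random.Random` generator for reproducibility.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, c, W, rng):
    """Sample every h_j independently with P(h_j = 1 | v) = sigmoid(c_j + v^T W_j)."""
    probs = [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(c))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, b, W, rng):
    """Sample every v_k independently with P(v_k = 1 | h) = sigmoid(b_k + (W h)_k)."""
    probs = [sigmoid(b[k] + sum(W[k][j] * h[j] for j in range(len(h))))
             for k in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, b, c, W, rng):
    """One alternating update: sample h given v, then a new v given h."""
    h = sample_hidden(v, c, W, rng)
    v_new = sample_visible(h, b, W, rng)
    return v_new, h


# Hypothetical toy parameters: 3 visible units, 2 hidden units.
b = [0.1, -0.2, 0.3]
c = [0.5, -0.4]
W = [[0.2, -0.1],
     [0.0, 0.3],
     [-0.5, 0.4]]

rng = random.Random(0)
v = [1, 0, 1]
for step in range(3):
    v, h = gibbs_step(v, b, c, W, rng)
    print(f"step {step}: h = {h}, v = {v}")
```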












**Task 2: Variational Autoencoder** </br>
Suppose we want to approximate the posterior P(z|X) with a distribution Q(z|X). The KL divergence between the two is:
> $D_{KL}\left [ Q\left ( z|X \right )\parallel P\left ( z|X \right ) \right ] = \sum _{z}Q\left ( z|X \right )\log\frac{Q\left ( z|X \right )}{P\left ( z|X \right )}$ </br>
> $=E\left [ \log\frac{Q\left ( z|X \right )}{P\left ( z|X \right )} \right ]$</br>
>$=E\left [ \log Q\left ( z|X \right ) - \log P\left ( z|X \right ) \right ]$</br>
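
As a small numerical illustration of the definition above, the sketch below computes the KL divergence between two discrete distributions over three latent states. The probability vectors `q` and `p` are hypothetical values standing in for Q(z|X) and P(z|X).

```python
import math


def kl_divergence(q, p):
    """D_KL[Q || P] = sum_z Q(z) * log(Q(z) / P(z)) for discrete distributions.

    Terms with Q(z) = 0 contribute zero by convention.
    """
    return sum(q_z * math.log(q_z / p_z) for q_z, p_z in zip(q, p) if q_z > 0)


q = [0.7, 0.2, 0.1]  # hypothetical approximation Q(z|X)
p = [0.5, 0.3, 0.2]  # hypothetical posterior P(z|X)

print("KL[Q || P] =", kl_divergence(q, p))
print("KL[P || Q] =", kl_divergence(p, q))
print("KL[Q || Q] =", kl_divergence(q, q))
```

The divergence is non-negative, vanishes when the two distributions coincide, and is not symmetric in its arguments.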

Applying Bayes' rule, $P\left ( z|X \right ) = \frac{P\left ( X|z \right )P\left ( z \right )}{P\left ( X \right )}$, gives:
>$D_{KL}\left [ Q\left ( z|X \right )\parallel P\left ( z|X \right ) \right ] = E\left [ \log Q\left ( z|X \right )-\log\frac{P\left ( X|z \right )P\left ( z \right )}{P\left ( X \right )} \right ]$</br>
>$=E\left [ \log Q\left ( z|X \right )-\left ( \log P\left ( X|z \right ) + \log P\left ( z \right )- \log P\left ( X \right ) \right ) \right ]$</br>
>$=E\left [ \log Q\left ( z|X \right )-\log P\left ( X|z \right ) - \log P\left ( z \right )+ \log P\left ( X \right ) \right ]$</br>

The expectation is taken over z drawn from Q(z|X), and P(X) does not depend on z, so $\log P\left ( X \right )$ can be moved outside the expectation. Rearranging and multiplying both sides by -1 then gives:
>$D_{KL}\left [ Q\left ( z|X \right )\parallel P\left ( z|X \right ) \right ]=E\left [ \log Q\left ( z|X \right )-\log P\left ( X|z \right )-\log P\left ( z \right ) \right ] +\log P\left ( X \right )$</br>
>$D_{KL}\left [ Q\left ( z|X \right )\parallel P\left ( z|X \right ) \right ]-\log P\left ( X \right )=E\left [ \log Q\left ( z|X \right )-\log P\left ( X|z \right )-\log P\left ( z \right ) \right ]$</br>
>$\log P\left ( X \right )-D_{KL}\left [ Q\left ( z|X \right )\parallel P\left ( z|X \right ) \right ]=E\left [\log P\left ( X|z \right ) -\left ( \log Q\left ( z|X \right )-\log P\left ( z \right ) \right )\right ]$</br>
>$= E\left [ \log P\left ( X|z \right ) \right ] - E\left [ \log Q\left ( z|X \right )-\log P\left ( z \right )\right ]$</br>
>$= E\left [ \log P\left ( X|z \right ) \right ] - D_{KL}\left [Q\left ( z|X \right )\parallel P\left ( z \right )\right ]$</br>

This is the VAE objective function. Because the KL divergence on the left-hand side is non-negative, the right-hand side is a lower bound on $\log P\left ( X \right )$:
> $\log P\left ( X \right ) - D_{KL}\left [ Q\left ( z|X \right )\parallel P\left ( z|X \right ) \right ] =E\left [ \log P\left ( X|z \right ) \right ] - D_{KL}\left [Q\left ( z|X \right )\parallel P\left ( z \right )\right ]$</br>
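
The identity can be verified numerically on a toy model in which every quantity is computable exactly. The sketch below assumes a discrete latent variable with three states and a single fixed observation X; the values of `prior`, `likelihood` and `q` are hypothetical.

```python
import math


def kl_divergence(q, p):
    """D_KL[Q || P] for discrete distributions given as probability lists."""
    return sum(q_z * math.log(q_z / p_z) for q_z, p_z in zip(q, p) if q_z > 0)


# Hypothetical toy model: z takes values in {0, 1, 2}, X is one fixed observation.
prior = [0.5, 0.3, 0.2]          # P(z)
likelihood = [0.10, 0.40, 0.70]  # P(X|z) evaluated at the observed X

# Exact evidence P(X) and posterior P(z|X), available only because z is tiny.
evidence = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]


def objective(q):
    """Right-hand side: E_q[log P(X|z)] - D_KL[Q(z|X) || P(z)]."""
    reconstruction = sum(q_z * math.log(l) for q_z, l in zip(q, likelihood))
    return reconstruction - kl_divergence(q, prior)


q = [0.2, 0.3, 0.5]  # an arbitrary approximation Q(z|X)

lhs = math.log(evidence) - kl_divergence(q, posterior)
rhs = objective(q)
print(f"left-hand side  = {lhs:.6f}")
print(f"right-hand side = {rhs:.6f}")
print(f"log P(X)        = {math.log(evidence):.6f}")
```

Both sides agree, and the objective never exceeds log P(X); it reaches that value only when `q` equals the exact posterior.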
