# SVI Part 1: An introduction to Stochastic Variational Inference
This is taken from [this page](https://pyro.ai/examples/svi_part_i.html) in the Pyro examples section.  
Miguel Fuentes  
Created: 4/29/2020  
Last Updated: 4/30/2020

In [1]:
import math
import os
from tqdm import tqdm

import torch
import torch.distributions.constraints as constraints

import pyro
from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO
import pyro.distributions as dist

pyro.set_rng_seed(101)

## Setup
We can perform SVI on more or less arbitrary stochastic functions with Pyro. Besides its inputs, a Pyro model has three main components:
1) Observations (declared with pyro.sample using the obs keyword)  
2) Latent variables (declared with pyro.sample, without the obs keyword)  
3) Parameters (declared with pyro.param)   

Each setting of the parameters $\theta$ defines a joint probability distribution $p_{\theta}(x,z)$ over the observations $x$ and the latent variables $z$. To perform SVI we need the following assumptions about these joint pdfs:  
- We can sample from the pdfs
- We can compute the pointwise log pdf at any point
- The pdf is differentiable w.r.t. the parameters
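
The three assumptions can be illustrated on a single distribution. The following is a minimal sketch, assuming a Beta distribution from torch.distributions stands in for the pdf; the parameter values (2.0, 3.0) and the evaluation point 0.6 are arbitrary choices for illustration and are not part of the model used later.

```python
import torch
from torch.distributions import Beta

# Illustrative parameters (arbitrary values); requires_grad lets autograd track them
alpha = torch.tensor(2.0, requires_grad=True)
beta = torch.tensor(3.0, requires_grad=True)
d = Beta(alpha, beta)

# 1) we can sample from the pdf
sample = d.sample()

# 2) we can compute the pointwise log pdf at any point of the support
log_p = d.log_prob(torch.tensor(0.6))

# 3) the log pdf is differentiable w.r.t. the parameters
log_p.backward()
grad_alpha, grad_beta = alpha.grad, beta.grad

print(sample.item(), log_p.item(), grad_alpha.item(), grad_beta.item())
```

All of the distributions in pyro.distributions used in this notebook support the first two operations, and the Beta distribution also supports the third through autograd.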

## What exactly are we trying to learn?
We want to find the parameters that maximize the log evidence, that is, the marginal log likelihood of the observations. This can be written as 
$$\theta_{max} = \underset{\theta}{\text{argmax}} \log p_{\theta}(x)$$  
To compute this quantity we must integrate over the latent variables, since $p_{\theta}(x) = \int dz \, p_{\theta}(x,z)$. Doing this is often intractable, and even when we can do it we usually end up with a hard non-convex optimization problem.  
Additionally, we want to compute the posterior over the latent variables once we have the most likely parameters. This requires another challenging computation:  
$$p_{\theta_{max}}(z|x) = \frac{p_{\theta_{max}}(x,z)}{\int dz \, p_{\theta_{max}}(x,z)}$$  
We don't want to, or often can't, do these calculations, so we need a better way. Variational inference gives us a scheme for calculating $\theta_{max}$ and an approximation to $p_{\theta_{max}}(z|x)$. For this we need a few ingredients, one of the most important being a guide.

## Guide
The idea here is to introduce a family of distributions $q_{\phi}(z)$ over the latent variables, parameterized by $\phi$. We will search this family for the best possible approximation of $p_{\theta_{max}}(z|x)$. In the literature we call $\phi$ the variational parameters and $q_{\phi}(z)$ the variational distribution. In Pyro, this is called the guide because that is shorter and easier to remember.  
We define our guide function the same way we would define any other model in Pyro. However, since we need the guide to produce a joint distribution over the latent variables, we need to impose some constraints:  
1) The model and guide should have the same call signature (args and kwargs)  
2) The guide should not include any observations  
3) Any latent variable which appears in the model (with a pyro.sample call) must also appear in the guide, under the same name  

Once we have defined the guide we can search the family of distributions for the best posterior approximation. To do this we need an objective function.

## ELBO
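The objective we use is the evidence lower bound (ELBO), an expectation taken with respect to the guide:  
$$\text{ELBO} = \mathbb{E}_{q_{\phi}(z)}\left[\log p_{\theta}(x,z) - \log q_{\phi}(z)\right]$$  
For any $\theta$ and $\phi$ the ELBO is a lower bound on the log evidence, $\log p_{\theta}(x) \geq \text{ELBO}$, and the gap is the KL divergence between the guide and the posterior:  
$$\log p_{\theta}(x) - \text{ELBO} = \text{KL}\left(q_{\phi}(z) \,\|\, p_{\theta}(z|x)\right)$$  
Maximizing the ELBO over $\theta$ and $\phi$ therefore pushes the log evidence up while pulling the guide toward the posterior. The remaining cells apply this to a coin whose fairness has a Beta prior.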
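
To make the bound concrete, the ELBO, $\mathbb{E}_{q_{\phi}(z)}[\log p_{\theta}(x,z) - \log q_{\phi}(z)]$, can be estimated by averaging over samples drawn from the guide. The sketch below does this in plain Python, without Pyro, for the coin example used in the rest of this notebook (Beta(10, 10) prior, 6 heads, 4 tails). The helper names (log_beta_fn, log_beta_pdf, log_joint, elbo_estimate), the sample count, and the seed are assumptions made for illustration; this is not how Trace_ELBO is implemented internally.

```python
import math
import random

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(z, a, b):
    """Log density of Beta(a, b) at z in (0, 1)."""
    return (a - 1) * math.log(z) + (b - 1) * math.log(1 - z) - log_beta_fn(a, b)

# Same setup as the notebook: Beta(10, 10) prior, 6 heads and 4 tails
ALPHA0, BETA0 = 10.0, 10.0
HEADS, TAILS = 6, 4

def log_joint(z):
    """log p(x, z): log prior plus Bernoulli log likelihood of the data."""
    log_prior = log_beta_pdf(z, ALPHA0, BETA0)
    log_lik = HEADS * math.log(z) + TAILS * math.log(1 - z)
    return log_prior + log_lik

def elbo_estimate(alpha_q, beta_q, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for q = Beta(alpha_q, beta_q)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.betavariate(alpha_q, beta_q)
        total += log_joint(z) - log_beta_pdf(z, alpha_q, beta_q)
    return total / n_samples

# The Beta prior is conjugate to the Bernoulli likelihood, so the log evidence is exact
log_evidence = log_beta_fn(ALPHA0 + HEADS, BETA0 + TAILS) - log_beta_fn(ALPHA0, BETA0)

elbo_initial = elbo_estimate(15.0, 15.0)    # the guide's initial values in this notebook
elbo_posterior = elbo_estimate(16.0, 14.0)  # the exact posterior

print("log evidence:", log_evidence)
print("ELBO at Beta(15, 15):", elbo_initial)
print("ELBO at Beta(16, 14):", elbo_posterior)
```

With the guide at Beta(15, 15) the estimate sits below the log evidence by roughly $\log(15/14) \approx 0.069$, which is the KL divergence from that guide to the exact posterior. With the guide equal to the exact posterior Beta(16, 14), every sample contributes exactly the log evidence and the gap vanishes.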

In [2]:
# create some data with 6 observed heads and 4 observed tails
data = []
for _ in range(6):
    data.append(torch.tensor(1.0))
for _ in range(4):
    data.append(torch.tensor(0.0))

In [3]:
# clear the param store in case we're in a REPL
pyro.clear_param_store()

def model(data):
    # define the hyperparameters that control the beta prior
    alpha0 = torch.tensor(10.0)
    beta0 = torch.tensor(10.0)
    # sample f from the beta prior
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # loop over the observed data
    for i in range(len(data)):
        # observe datapoint i using the bernoulli likelihood
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])

def guide(data):
    # register the two variational parameters with Pyro
    # - both parameters will have initial value 15.0.
    # - because we invoke constraints.positive, the optimizer
    # will take gradients on the unconstrained parameters
    # (which are related to the constrained parameters by a log)
    alpha_q = pyro.param("alpha_q", torch.tensor(15.0),
                         constraint=constraints.positive)
    beta_q = pyro.param("beta_q", torch.tensor(15.0),
                        constraint=constraints.positive)
    # sample latent_fairness from the distribution Beta(alpha_q, beta_q)
    pyro.sample("latent_fairness", dist.Beta(alpha_q, beta_q))

In [4]:
# setup the optimizer
adam_params = {"lr": 0.0005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)

# setup the inference algorithm
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

n_steps = 1500
# do gradient steps
for step in tqdm(range(n_steps)):
    svi.step(data)

# grab the learned variational parameters
alpha_q = pyro.param("alpha_q").item()
beta_q = pyro.param("beta_q").item()

# here we use some facts about the beta distribution
# compute the inferred mean of the coin's fairness
inferred_mean = alpha_q / (alpha_q + beta_q)
# compute inferred standard deviation
factor = beta_q / (alpha_q * (1.0 + alpha_q + beta_q))
inferred_std = inferred_mean * math.sqrt(factor)

print("\nbased on the data and our prior belief, the fairness " +
      "of the coin is %.3f +- %.3f" % (inferred_mean, inferred_std))

100%|█████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:05<00:00, 297.51it/s]


based on the data and our prior belief, the fairness of the coin is 0.535 +- 0.090
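
Because the Beta prior is conjugate to the Bernoulli likelihood, this particular posterior is available in closed form, which gives a reference point for the SVI result. The following sketch computes it with the standard Beta mean and variance formulas; the variable names are chosen for illustration.

```python
import math

# Prior hyperparameters and data counts from the cells above
alpha0, beta0 = 10.0, 10.0
heads, tails = 6, 4

# Conjugate update: the posterior is Beta(alpha0 + heads, beta0 + tails)
alpha_post = alpha0 + heads
beta_post = beta0 + tails

# Mean and standard deviation of a Beta(alpha_post, beta_post) distribution
total = alpha_post + beta_post
exact_mean = alpha_post / total
exact_std = math.sqrt(alpha_post * beta_post / (total ** 2 * (total + 1.0)))

print("exact posterior: %.3f +- %.3f" % (exact_mean, exact_std))
# prints: exact posterior: 0.533 +- 0.090
```

The SVI estimate of 0.535 +- 0.090 after 1500 steps is close to this exact answer.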



