### SVI Part I: An Introduction to Stochastic Variational Inference in Pyro
- The different pieces of model() are encoded via the mapping:
  - observations ⟺ pyro.sample with the obs argument
  - latent random variables ⟺ pyro.sample
  - parameters ⟺ pyro.param
- Now let’s establish some notation. The model has observations $x$ and latent random variables $z$ as well as parameters $\theta$. It has a joint probability density of the form: $p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z)$
  - We assume that the various probability distributions $p_i$ that make up $p_\theta(x, z)$ have the following properties:
    - we can sample from each $p_i$
    - we can compute the pointwise log pdf $\log p_i$
    - $p_i$ is differentiable w.r.t. the parameters $\theta$
  - Model Learning
    - In this context our criterion for learning a good model will be maximizing the log evidence, i.e. we want to find the value of $\theta$ given by $\theta_{\max} = \operatorname{argmax}_\theta \log p_\theta(x)$
    - Variational inference offers a scheme for finding $\theta_{\max}$ and computing an approximation to the posterior $p_{\theta_{\max}}(z|x)$. Let’s see how that works.
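
The three assumptions on each distribution listed above (sampling, a pointwise log density, differentiability in the parameters) can be made concrete with a single Bernoulli distribution whose parameter `f` plays the role of $\theta$. The following is a minimal, dependency-free sketch; the helper names `bernoulli_sample`, `bernoulli_log_prob` and `bernoulli_grad_log_prob` are hypothetical and not part of Pyro, which obtains such gradients through PyTorch's automatic differentiation rather than a hand-written derivative.

```python
import math
import random

def bernoulli_sample(f, rng):
    """Property 1: draw one sample (1.0 or 0.0) from Bernoulli(f)."""
    return 1.0 if rng.random() < f else 0.0

def bernoulli_log_prob(x, f):
    """Property 2: pointwise log pmf, x*log(f) + (1-x)*log(1-f)."""
    return x * math.log(f) + (1.0 - x) * math.log(1.0 - f)

def bernoulli_grad_log_prob(x, f):
    """Property 3: analytic derivative of the log pmf w.r.t. the parameter f."""
    return x / f - (1.0 - x) / (1.0 - f)

rng = random.Random(0)  # seeded generator so the sketch is reproducible
f = 0.6
x = bernoulli_sample(f, rng)

# sanity check: compare the analytic gradient with a central finite difference
eps = 1e-6
numeric_grad = (bernoulli_log_prob(x, f + eps) - bernoulli_log_prob(x, f - eps)) / (2 * eps)
analytic_grad = bernoulli_grad_log_prob(x, f)
print("sample:", x)
print("log prob:", bernoulli_log_prob(x, f))
print("analytic grad:", analytic_grad, "numeric grad:", numeric_grad)
```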
- (Guide function) The basic idea is that we introduce a parameterized distribution $q_\phi(z)$, where $\phi$ are known as the variational parameters. This distribution is called the variational distribution in much of the literature, and in the context of Pyro it’s called the guide (one syllable instead of nine!). The guide will serve as an approximation to the posterior.

In [0]:
"""
Do NOT run
"""
# random variables are specified in Pyro with primitive statement pyro.sample() 
# the first argument denotes the name of the random variable
def model():
    pyro.sample("z_1", ...)
# then the guide needs to have a matching sample statement
def guide():
    pyro.sample("z_1", ...)

- Learning will be set up as an optimization problem where each iteration of training takes a step in $(\theta, \phi)$ space that moves the guide closer to the exact posterior. To do this we need to define an appropriate objective function.

### ELBO - the evidence lower bound
- The ELBO, which is a function of both $\theta$ and $\phi$, is defined as an expectation w.r.t. samples from the guide: $\mathrm{ELBO} \equiv \mathbb{E}_{q_\phi(z)}\left[\log p_\theta(x, z) - \log q_\phi(z)\right]$
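
Why this quantity is a sensible objective follows from a standard identity, sketched here in the notation of this section. It uses only the factorization $\log p_\theta(x, z) = \log p_\theta(z|x) + \log p_\theta(x)$ and the fact that $\log p_\theta(x)$ does not depend on $z$:

```latex
\begin{aligned}
\log p_\theta(x) - \mathrm{ELBO}
  &= \log p_\theta(x) - \mathbb{E}_{q_\phi(z)}\left[\log p_\theta(x, z) - \log q_\phi(z)\right] \\
  &= \mathbb{E}_{q_\phi(z)}\left[\log q_\phi(z) - \log p_\theta(z|x)\right] \\
  &= \mathrm{KL}\left(q_\phi(z) \,\|\, p_\theta(z|x)\right) \;\ge\; 0
\end{aligned}
```

So the ELBO is a lower bound on the log evidence, and for fixed $\theta$, increasing the ELBO over $\phi$ decreases the KL divergence between the guide and the exact posterior.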

### SVI Class
- At present SVI only provides support for the ELBO objective
- The user needs to provide three things: the model, the guide, and an optimizer.

In [0]:
import pyro
from pyro.infer import SVI, Trace_ELBO
# The SVI object provides two methods, step(arg) and evaluate_loss(arg)
# svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

# arguments used to instantiate PyTorch optimizers for all the parameters
from pyro.optim import Adam
adam_params = {"lr": 0.005, "betas": (0.95, 0.999)}
optimizer = Adam(adam_params)

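# simple example to illustrate the API: a callable that returns
# optimizer arguments per parameter.
# NOTE (version-dependent): older Pyro releases call this with
# (module_name, param_name); newer releases pass only param_name.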
def per_param_callable(module_name, param_name):
    if param_name == 'my_special_parameter':
        return {"lr": 0.010}
    else:
        return {"lr": 0.001}
optimizer = Adam(per_param_callable)

### A simple example - fairness of a two-sided coin
- We encode heads and tails as 1s and 0s. We encode the fairness of the coin as a real number $f$, where $f \in [0.0, 1.0]$ and $f = 0.50$ corresponds to a perfectly fair coin.
- Our prior belief about $f$ will be encoded by a beta distribution, specifically $\mathrm{Beta}(10, 10)$, which is a symmetric probability distribution on the interval $[0.0, 1.0]$ that is peaked at $f = 0.5$.
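
The claim that Beta(10, 10) is symmetric and peaked at 0.5 can be checked numerically. The sketch below uses only the standard library; `beta_pdf` is a hypothetical helper written for this illustration (in Pyro the same density is available through `dist.Beta(...).log_prob`).

```python
import math

def beta_pdf(f, alpha, beta):
    """Density of Beta(alpha, beta) at a point f in the open interval (0, 1)."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_density = log_norm + (alpha - 1.0) * math.log(f) + (beta - 1.0) * math.log(1.0 - f)
    return math.exp(log_density)

# evaluate the prior on a grid 0.01, 0.02, ..., 0.99 and locate its maximum
grid = [i / 100.0 for i in range(1, 100)]
densities = [beta_pdf(f, 10.0, 10.0) for f in grid]
peak = grid[densities.index(max(densities))]

left = beta_pdf(0.3, 10.0, 10.0)
right = beta_pdf(0.7, 10.0, 10.0)
print("peak at f =", peak)
print("pdf(0.3) =", left, "pdf(0.7) =", right)
```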

In [0]:
import torch
import pyro
import pyro.distributions as dist

def model(data):
    # define the hyperparameters that control the beta prior
    alpha0 = torch.tensor(10.0)
    beta0 = torch.tensor(10.0)
    # sample f from the beta prior
    # a single latent random variable ('latent_fairness'), 
    # which is distributed according to Beta(10,10).
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # loop over the observed data
    for i in range(len(data)):
        # observe datapoint i using the bernoulli
        # likelihood Bernoulli(f)
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])

- Our next task is to define a corresponding guide, i.e. an appropriate variational distribution for the latent random variable $f$. The only real requirement here is that $q(f)$ should be a probability distribution over the range $[0.0, 1.0]$, since $f$ doesn’t make sense outside of that range.
- A simple choice is to use another beta distribution parameterized by two trainable parameters $\alpha_q$ and $\beta_q$. Actually, in this particular case this is the ‘right’ choice, since conjugacy of the Bernoulli and beta distributions means that the exact posterior is a beta distribution.
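
Because the exact posterior is itself a beta distribution, this example allows a direct check of the ELBO as a lower bound on the log evidence. The sketch below is a dependency-free Monte Carlo estimate, not Pyro's `Trace_ELBO` implementation; the helper names are hypothetical, and the counts (6 heads, 4 tails) are taken from the data set created later in this section. With the Beta(10, 10) prior, conjugacy gives the exact posterior Beta(10 + 6, 10 + 4) = Beta(16, 14).

```python
import math
import random

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_joint(f, heads, tails, alpha0=10.0, beta0=10.0):
    """log p(x, f): Beta(alpha0, beta0) prior plus one Bernoulli(f) term per toss."""
    log_prior = ((alpha0 - 1.0) * math.log(f) + (beta0 - 1.0) * math.log(1.0 - f)
                 - log_beta_fn(alpha0, beta0))
    log_lik = heads * math.log(f) + tails * math.log(1.0 - f)
    return log_prior + log_lik

def log_guide(f, alpha_q, beta_q):
    """log q(f) for a Beta(alpha_q, beta_q) guide."""
    return ((alpha_q - 1.0) * math.log(f) + (beta_q - 1.0) * math.log(1.0 - f)
            - log_beta_fn(alpha_q, beta_q))

def elbo_estimate(alpha_q, beta_q, heads, tails, num_samples=10000, seed=0):
    """Average of log p(x, f) - log q(f) over samples f drawn from the guide."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        f = rng.betavariate(alpha_q, beta_q)
        total += log_joint(f, heads, tails) - log_guide(f, alpha_q, beta_q)
    return total / num_samples

heads, tails = 6, 4
# closed-form log evidence of this particular sequence of tosses
log_evidence = log_beta_fn(10.0 + heads, 10.0 + tails) - log_beta_fn(10.0, 10.0)

elbo_initial = elbo_estimate(15.0, 15.0, heads, tails)  # guide at its initial values
elbo_exact = elbo_estimate(16.0, 14.0, heads, tails)    # guide equal to the exact posterior
print("log evidence:", log_evidence)
print("ELBO at Beta(15, 15):", elbo_initial)
print("ELBO at Beta(16, 14):", elbo_exact)
```

When the guide equals the exact posterior, every sample contributes exactly the log evidence, so the bound is tight; for any other guide the estimate falls below it by the KL divergence, up to Monte Carlo noise.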

In [0]:
import torch.distributions.constraints as constraints

# model(data) and guide(data) take the same arguments
# use constraint=constraints.positive to ensure that
# alpha_q and beta_q remain strictly positive during optimization
def guide(data):
    # register the two variational parameters with Pyro.
    # - both parameters will have initial value 15.0.
    # - because we invoke constraints.positive, the optimizer
    # will take gradients on the unconstrained parameters
    # (which are related to the constrained parameters by a log)
    alpha_q = pyro.param("alpha_q", torch.tensor(15.0),
                         constraint=constraints.positive)
    beta_q = pyro.param("beta_q", torch.tensor(15.0),
                        constraint=constraints.positive)
    # sample latent_fairness from the distribution Beta(alpha_q, beta_q)
    pyro.sample("latent_fairness", dist.Beta(alpha_q, beta_q))

In [9]:
import math
import os
import torch
import torch.distributions.constraints as constraints
import pyro
from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO
import pyro.distributions as dist

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)
n_steps = 2 if smoke_test else 2000

# enable validation (e.g. validate parameters of distributions)
pyro.enable_validation(True)

# clear the param store in case we're in a REPL
pyro.clear_param_store()

# create some data with 6 observed heads and 4 observed tails
data = []
for _ in range(6):
    data.append(torch.tensor(1.0))
for _ in range(4):
    data.append(torch.tensor(0.0))
data

[tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(0.),
 tensor(0.),
 tensor(0.),
 tensor(0.)]

In [17]:
# set up the optimizer
adam_params = {"lr": 0.0005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)

# set up the inference algorithm
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

# do gradient steps
for step in range(n_steps):
    # in the step() method we pass in the data,
    # which then get passed to the model and guide
    svi.step(data)
    # print a progress marker every 100 steps
    if step % 100 == 0:
        print('.', end='')

# grab the learned variational parameters
alpha_q = pyro.param("alpha_q").item()
beta_q = pyro.param("beta_q").item()

# here we use some facts about the beta distribution
# compute the inferred mean of the coin's fairness
inferred_mean = alpha_q / (alpha_q + beta_q)
# compute inferred standard deviation
factor = beta_q / (alpha_q * (1.0 + alpha_q + beta_q))
inferred_std = inferred_mean * math.sqrt(factor)

print("\nalpha_q value: ", alpha_q)
print("beta_q value: ", beta_q)
print("based on the data and our prior belief, the fairness " +
      "of the coin is %.3f + - %.3f" % (inferred_mean, inferred_std))

..................................................
alpha_q value:  16.174606323242188
beta_q value:  14.03137493133545
based on the data and our prior belief, the fairness of the coin is 0.535 + - 0.089
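
For comparison, the exact posterior is available in closed form here. With a Beta(10, 10) prior and 6 heads and 4 tails, conjugacy gives Beta(16, 14). The following minimal sketch computes its mean and standard deviation with the same beta-distribution facts used above; `beta_mean_std` is a hypothetical helper, not a Pyro function.

```python
import math

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return mean, math.sqrt(var)

alpha_post = 10.0 + 6  # prior alpha0 plus number of observed heads
beta_post = 10.0 + 4   # prior beta0 plus number of observed tails
exact_mean, exact_std = beta_mean_std(alpha_post, beta_post)
print("exact posterior: Beta(%.0f, %.0f)" % (alpha_post, beta_post))
print("exact mean %.3f, exact std %.3f" % (exact_mean, exact_std))
```

The exact mean is 16/30 ≈ 0.533 with a standard deviation of about 0.090, so the learned guide above (0.535 and 0.089) is close to the exact posterior.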
