<a href="https://colab.research.google.com/github/Antony-gitau/probabilistic_AI_playgraound/blob/main/stochastic_variational_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we implement a simple example that helps us understand stochastic variational inference (SVI) while building a probabilistic model.

Recap:

- Probabilistic models are statistical models that include one or more probability distributions to account for uncertainty.

- Pyro is a probabilistic programming language, built on PyTorch, that helps us code probabilistic models. Other options include PyMC.

- Stochastic variational inference is a way of performing variational inference with stochastic (noisy) gradient estimates. [Variational inference](https://en.wikipedia.org/wiki/Variational_Bayesian_methods) is a technique for approximating the intractable integrals that arise in Bayesian inference.
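
As a rough sketch of what variational inference optimizes (the symbols $x$ for the observed data, $z$ for the latent variables, and $\phi$ for the variational parameters are notation assumed here for illustration), the variational distribution $q_\phi(z)$ is fitted by maximizing the evidence lower bound (ELBO):

$$\log p(x) \;\ge\; \mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(z)}\big[\log p(x, z) - \log q_\phi(z)\big]$$

$$\log p(x) - \mathrm{ELBO}(\phi) = \mathrm{KL}\big(q_\phi(z) \,\|\, p(z \mid x)\big)$$

Because $\log p(x)$ does not depend on $\phi$, raising the ELBO is the same as shrinking the KL divergence between $q_\phi(z)$ and the true posterior. SVI does this with gradient steps computed from samples of $q_\phi(z)$.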


In this simple example, we will try to determine whether a coin is fair or not.

We start by generating some data points: 6 observed heads and 4 observed tails.

In [1]:
%%capture
!pip install -q --upgrade pyro-ppl torch
import torch
import torch.distributions.constraints as constraints
import pyro
import pyro.distributions as dist

In [2]:
# create an empty list to hold the observations
data = []

# populate the list with tensors: six 1s (heads) and four 0s (tails)
for _ in range(6):
  data.append(torch.tensor(1.0))

for _ in range(4):
  data.append(torch.tensor(0.0))


In [3]:
data

[tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(1.),
 tensor(0.),
 tensor(0.),
 tensor(0.),
 tensor(0.)]

Let's now create the probabilistic model.

We know that:
- heads are encoded as 1 and tails as 0
- a fair coin lands heads with probability 0.5. We encode the fairness as a value $f \in [0.0, 1.0]$, where $f = 0.5$ corresponds to a perfectly fair coin
- our prior belief is that the coin is probably close to fair, but we don't know how unfair it might be. That is what we want to quantify.
- the beta distribution can be used to model probabilities of binary events. We only need to choose the α and β parameters. In this case, we experiment with α = β = 10 and a data set of just 10 observations.
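
To get a feel for what the choice of α and β implies, here is a minimal sketch in plain Python (no Pyro needed) that summarizes a few symmetric beta priors using the standard formulas for the mean and variance of a beta distribution. The helper name `beta_mean_std` and the alternative values 1 and 100 are assumptions made for illustration. Only α = β = 10 is used in this notebook.

```python
import math

def beta_mean_std(alpha, beta):
  """Mean and standard deviation of a Beta(alpha, beta) distribution."""
  mean = alpha / (alpha + beta)
  var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
  return mean, math.sqrt(var)

# symmetric priors are all centered on 0.5; larger values give a tighter belief
for a in (1.0, 10.0, 100.0):
  mean, std = beta_mean_std(a, a)
  print("Beta(%g, %g): mean = %.3f, std = %.3f" % (a, a, mean, std))
```

With α = β = 10 the prior is centered on 0.5 with a standard deviation of roughly 0.11, so it favors a fair coin while still leaving room for a moderate bias.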



In [19]:
# define the model function, which takes in the data
def model(data):
  # define the prior information: we chose a beta distribution
  alpha = torch.tensor(10.0)
  beta = torch.tensor(10.0)

  # sample f (the fairness) from a beta distribution with hyperparameters alpha and beta
  f = pyro.sample("latent_fairness", dist.Beta(alpha, beta))

  # loop over the observed data
  for i in range(len(data)):
    # model the likelihood of observing heads or tails, given the fairness f,
    # with a Bernoulli distribution
    pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])


Notes on the thought process of choosing a likelihood function:

(In the code above, we chose the Bernoulli distribution as the likelihood. It is a good choice for binary outcomes.)

- the type of observed data: continuous, discrete, and binary data require different likelihood functions

- the assumptions of the model, that is, the parameters of the model
- the distribution should also be tractable
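
To make the Bernoulli likelihood concrete, here is a minimal sketch in plain Python that evaluates the log-likelihood of the 6 heads and 4 tails for a few candidate fairness values. The names `flips` and `bernoulli_log_likelihood` and the candidate values are assumptions made for illustration. The observations are stored as plain floats rather than tensors.

```python
import math

# same observations as the notebook's data: 6 heads (1) and 4 tails (0)
flips = [1.0] * 6 + [0.0] * 4

def bernoulli_log_likelihood(f, flips):
  """Sum of log Bernoulli(x | f) over all flips, valid for 0 < f < 1."""
  return sum(x * math.log(f) + (1.0 - x) * math.log(1.0 - f) for x in flips)

for f in (0.3, 0.5, 0.6):
  print("f = %.1f: log-likelihood = %.3f" % (f, bernoulli_log_likelihood(f, flips)))
```

Among these candidates, the likelihood alone is highest at f = 0.6, the observed frequency of heads. The prior in the model pulls the posterior back toward 0.5.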

---
Next up, we define a guide. Guide is the name Pyro uses for the variational distribution, that is, the parameterized distribution $q_\phi(z)$.

- the guide function must use the same names for the latent random variables as the model function

- the difference between alpha and beta in the model function and alpha_q and beta_q in the guide function is that the latter require gradients and are therefore learnable parameters, whereas the former are fixed constants that define the shape of the prior distribution.

- we initialize the parameters at 15.0, from which SVI optimizes them so that the guide gets as close to the true posterior as possible.

In [11]:
# the guide takes the same arguments as the model
def guide(data):

  # these are the learnable parameters of the variational distribution
  # over the latent variable; they are adjusted during optimization
  alpha_q = pyro.param("alpha_q", torch.tensor(15.0), constraint=constraints.positive)
  beta_q = pyro.param("beta_q", torch.tensor(15.0), constraint=constraints.positive)

  # sample the latent fairness from Beta(alpha_q, beta_q)
  pyro.sample("latent_fairness", dist.Beta(alpha_q, beta_q))

Now we optimize stochastically.

Notes on the learning rate:
A higher rate means larger steps and potentially faster convergence, but may result in overshooting the optimal values. A lower rate means smaller, more stable steps but slower convergence.

A note on betas: these parameters control how much weight is given to past gradients when computing the update direction for the model parameters. That is, they control the exponential moving averages. A larger beta means we are averaging over more past gradients.

A higher beta1 means more weight is given to past gradients in the moving average of the gradient (more momentum), and a lower beta1 means more weight is given to the most recent gradient. beta2 plays the same role for the moving average of the squared gradient, which scales the step size: a higher beta2 means this estimate of the gradient's magnitude changes more slowly.
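
To see where the learning rate and the two betas enter, here is a minimal sketch of the standard Adam update rule in plain Python. It is not Pyro's or PyTorch's implementation. The function name `adam_minimize`, the `eps` constant, and the toy objective $(x - 3)^2$ are assumptions made for illustration.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.005, betas=(0.90, 0.999), eps=1e-8, n_steps=5000):
  """Minimize a 1-D function with the Adam update rule, given its gradient."""
  beta1, beta2 = betas
  x, m, v = x0, 0.0, 0.0
  for t in range(1, n_steps + 1):
    g = grad_fn(x)
    m = beta1 * m + (1.0 - beta1) * g      # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * g * g  # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)         # bias correction for starting m at zero
    v_hat = v / (1.0 - beta2 ** t)         # bias correction for starting v at zero
    x -= lr * m_hat / (math.sqrt(v_hat) + eps)
  return x

# hypothetical toy objective (x - 3)^2, whose gradient is 2 * (x - 3)
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print("x after optimization: %.3f" % x_final)
```

The direction of each step comes from the beta1 average of the gradient, while the beta2 average of the squared gradient rescales the step, so the learning rate acts as an approximate upper limit on how far a parameter moves per step.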

In [8]:
# set up an adam optimizer
from pyro.optim import Adam
adam_params = {"lr": 0.005, "betas": (0.90, 0.999)}
optimizer = Adam(adam_params)

Then we can use SVI as the inference algorithm.

In [20]:
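from pyro.infer import SVI, Trace_ELBO

# clear parameters left over from earlier runs so the guide starts from its initial values
pyro.clear_param_store()

svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

n_steps = 5000
for step in range(n_steps):
  # take one gradient step on the ELBO
  svi.step(data)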

In [24]:
# read the learned variational parameters into Python variables
alpha_q = pyro.param("alpha_q").item()
alpha_q

15.508711814880371

In [25]:
beta_q = pyro.param("beta_q").item()
beta_q

13.823347091674805

We can now compute the inferred mean and standard deviation of the coin's fairness.
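
For reference, these are the standard mean and variance of a beta distribution, written here for the guide $\mathrm{Beta}(\alpha_q, \beta_q)$. The last line rewrites the standard deviation as the mean times the square root of the quantity that the next cell calls `factor`:

$$\mathbb{E}[f] = \frac{\alpha_q}{\alpha_q + \beta_q}, \qquad \mathrm{Var}[f] = \frac{\alpha_q \beta_q}{(\alpha_q + \beta_q)^2 (\alpha_q + \beta_q + 1)}$$

$$\mathrm{std}[f] = \sqrt{\mathrm{Var}[f]} = \frac{\alpha_q}{\alpha_q + \beta_q} \sqrt{\frac{\beta_q}{\alpha_q (1 + \alpha_q + \beta_q)}}$$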

In [30]:
import math
# compute the inferred mean of the coin's fairness
inferred_mean = alpha_q / (alpha_q + beta_q)
print("Inferred mean is " + str(inferred_mean))

# compute inferred standard deviation
factor = beta_q / (alpha_q * (1.0 + alpha_q + beta_q))
print("This is the factor " + str(factor))

inferred_std = inferred_mean * math.sqrt(factor)
print("Inferred std " + str(inferred_std))

print("\nBased on the data and our prior belief, the fairness " +
      "of the coin is %.3f +- %.3f" % (inferred_mean, inferred_std))

Inferred mean is 0.5287290559550342
This is the factor 0.029385669935780522
Inferred std 0.09063605108814138

Based on the data and our prior belief, the fairness of the coin is 0.529 +- 0.091


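Recall that our data had 6 heads and 4 tails, meaning the observed frequency of heads was $\frac{6}{10}$ = 0.60.

To compute the exact posterior mean, we need to integrate the product of the likelihood and the prior.

In this case, we used a Bernoulli likelihood and a beta prior. Because the beta distribution is conjugate to the Bernoulli likelihood, the exact posterior is also a beta distribution: Beta(α + heads, β + tails) = Beta(16, 14).

prior mean = $\frac{\alpha}{\alpha + \beta}$ = $\frac{10}{10 + 10}$ = 0.5

exact posterior mean = $\frac{16}{16 + 14}$ ≈ 0.533, which lies between the prior mean and the observed frequency and is close to the inferred mean of 0.529.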
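
As a check on the SVI result, here is a minimal sketch of the conjugate update in plain Python, assuming the Beta(10, 10) prior from the model and the 6 heads and 4 tails in the data. For a beta prior with Bernoulli observations, the exact posterior is Beta(α + heads, β + tails). The variable names are chosen for illustration.

```python
import math

# prior hyperparameters from the model and observed counts from the data
prior_alpha, prior_beta = 10.0, 10.0
heads, tails = 6, 4

# conjugate update: add the observed counts to the prior hyperparameters
post_alpha = prior_alpha + heads
post_beta = prior_beta + tails

total = post_alpha + post_beta
exact_mean = post_alpha / total
exact_std = math.sqrt(post_alpha * post_beta / (total ** 2 * (total + 1.0)))

print("exact posterior: Beta(%g, %g)" % (post_alpha, post_beta))
print("exact posterior mean +- std: %.3f +- %.3f" % (exact_mean, exact_std))
```

This gives Beta(16, 14) with a mean of 0.533 and a standard deviation of 0.090, so the SVI estimate of 0.529 +- 0.091 reported above is close to the exact posterior.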
