### KLD 

To calculate the Kullback-Leibler (KL) divergence between two Beta distributions, you can use the following formula:

KLD(Beta1 || Beta2) = ln(Beta(α2, β2) / Beta(α1, β1)) + (α1 - α2)ψ(α1) + (β1 - β2)ψ(β1) + (α2 - α1 + β2 - β1)ψ(α1 + β1)

where:

Beta(α, β) is the Beta function.

ψ( ) is the digamma function.

α1, β1 are the parameters of the first Beta distribution.

α2, β2 are the parameters of the second Beta distribution.
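As a minimal sketch, the formula above can be evaluated directly in Python. The helper names `digamma`, `log_beta`, and `kl_beta` are illustrative and not taken from any library. The `digamma` helper is a small series approximation, assumed here only to keep the example dependency-free; `scipy.special.digamma` would be the usual choice.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    # Shift x upward with psi(x) = psi(x + 1) - 1/x until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    # psi(x) ~ ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2)) from the closed-form expression."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

kl_value = kl_beta(2, 5, 3, 4)
print(f"KL(Beta(2, 5) || Beta(3, 4)) = {kl_value:.4f}")
```

For α1, β1 = 2, 5 and α2, β2 = 3, 4 this evaluates to about 0.3902, and it is exactly zero when both distributions share the same parameters.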

In [None]:
'''In this code, we evaluate the probability density function (PDF) of each of the two Beta distributions at 1000 points evenly spaced between 0 and 1.
We then set any zero values in the PDFs to a small non-zero value to avoid divide-by-zero errors when calculating the Kullback-Leibler divergence.
Finally, we approximate the integral of pdf1 * log(pdf1 / pdf2) with a Riemann sum: numpy.sum() adds up the integrand and the result is multiplied by the grid spacing. Without the grid spacing the result would be too large by a factor of about 1000.'''


In [1]:
import math

import numpy as np

# Define the parameters for two beta distributions
alpha1, beta1 = 2, 5
alpha2, beta2 = 3, 4

# Generate 1000 points in the range [0, 1] to evaluate the PDF
x = np.linspace(0, 1, 1000)
dx = x[1] - x[0]  # grid spacing used by the Riemann sum

# Calculate the PDF for the two beta distributions
# (math.gamma is used because np.math is no longer available in NumPy 2.0)
pdf1 = x ** (alpha1 - 1) * (1 - x) ** (beta1 - 1) / \
       (math.gamma(alpha1) * math.gamma(beta1) / math.gamma(alpha1 + beta1))
pdf2 = x ** (alpha2 - 1) * (1 - x) ** (beta2 - 1) / \
       (math.gamma(alpha2) * math.gamma(beta2) / math.gamma(alpha2 + beta2))

# Set any zero values in the PDFs to a small non-zero value to avoid divide-by-zero errors
pdf1[pdf1 == 0] = np.finfo(float).eps
pdf2[pdf2 == 0] = np.finfo(float).eps

# Approximate the integral of pdf1 * log(pdf1 / pdf2) with a Riemann sum
kld = np.sum(pdf1 * np.log(pdf1 / pdf2)) * dx

print(f"Kullback-Leibler divergence: {kld:.4f}")


Kullback-Leibler divergence: 0.3902


### Reparametrization

The Beta distribution can be reparametrized in terms of its mean and variance. Let's say you have a Beta distribution with parameters α and β. The mean and variance of this Beta distribution are given by:

mean = α / (α + β)

variance = (α * β) / ((α + β)**2 * (α + β + 1))

To reparametrize the Beta distribution in terms of its mean and variance, you can solve the above equations for α and β:

α = mean * (mean * (1 - mean) / variance - 1)

β = (1 - mean) * (mean * (1 - mean) / variance - 1)

Once you have the mean and variance, you can use these equations to compute the new values of α and β, and then use the reparametrized Beta distribution to generate random samples.

Here is a Python example that computes α and β from a given mean and variance.

Reference:

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.


In [3]:
import numpy as np

def reparametrize_beta(mean, variance):
    alpha = mean * ((mean * (1 - mean)) / variance - 1)
    beta = (1 - mean) * ((mean * (1 - mean)) / variance - 1)
    return alpha, beta

# Define the original mean and variance of the Beta distribution
mean = 0.4
variance = 0.06

# Reparametrize the Beta distribution using the mean and variance
alpha, beta = reparametrize_beta(mean, variance)

# Print the original and reparametrized parameters
print(f"Original mean and variance: mean={mean}, variance={variance}")
print(f"New alpha and beta: alpha={alpha}, beta={beta}")


Original mean and variance: mean=0.4, variance=0.06
New alpha and beta: alpha=1.2000000000000002, beta=1.7999999999999998
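The formulas for α and β only give valid (positive) parameters when the mean lies strictly between 0 and 1 and the variance is smaller than mean * (1 - mean). A minimal sketch of a round-trip check with that guard follows; the helper names `beta_moments` and `beta_from_moments` are hypothetical and chosen only for this illustration.

```python
def beta_moments(alpha, beta):
    """Return (mean, variance) of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance

def beta_from_moments(mean, variance):
    """Recover (alpha, beta) from a mean and variance, validating the inputs."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    common = mean * (1.0 - mean) / variance - 1.0  # equals alpha + beta
    return mean * common, (1.0 - mean) * common

# Round trip: parameters -> moments -> parameters
mean, variance = beta_moments(2.0, 5.0)
alpha, beta = beta_from_moments(mean, variance)
print(alpha, beta)
```

Starting from Beta(2, 5), the round trip recovers α and β up to floating-point error, while a request such as mean 0.5 with variance 0.3 is rejected because no Beta distribution has those moments.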


### The conjugate prior

The Beta distribution is the conjugate prior for the Binomial (and Bernoulli) likelihood. This means that if we assume a Beta prior for the success probability of a Binomial distribution, the posterior distribution after observing Binomial data is also a Beta distribution.

More specifically, if we have a Binomial distribution with parameter θ, and we assume a Beta prior with parameters α and β, then the posterior distribution for θ after observing data from the Binomial distribution is a Beta distribution with updated parameters α' = α + x and β' = β + n - x, where x is the number of successes observed and n is the total number of trials.

The Beta distribution is a conjugate prior for the Binomial distribution because the product of a Binomial likelihood and a Beta prior results in a Beta posterior. This property is useful because it allows us to update our beliefs about the parameter of the Binomial distribution in a closed form, which can be computationally efficient and mathematically elegant.

Conjugate priors are often used in Bayesian statistics because they allow for easy computation of posterior distributions, and they often have intuitive interpretations that make them easy to work with.
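The closed-form update described above can be sketched in a few lines. The function name `update_beta_prior` and the example counts are hypothetical and only illustrate the rule α' = α + x, β' = β + n - x.

```python
def update_beta_prior(alpha, beta, successes, trials):
    """Return the Beta posterior parameters after observing Binomial data."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return alpha + successes, beta + trials - successes

# Hypothetical data: a Beta(2, 2) prior and 7 successes in 10 trials
alpha_post, beta_post = update_beta_prior(2, 2, successes=7, trials=10)
posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, round(posterior_mean, 4))
```

With these numbers the posterior is Beta(9, 5), whose mean 9/14 (about 0.643) sits between the prior mean of 0.5 and the observed success rate of 0.7.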

References:

Murphy, K. (2012). Machine learning: a probabilistic perspective. MIT Press.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC.

In VAE models with Beta-distributed latent variables, the KL divergence (KLD) can be calculated as follows:

Let p(z) be the prior distribution on the latent variable z, which is a Beta distribution with parameters (α, β), and let q(z|x) be the approximate posterior distribution of z given the observed data x. We can then compute the KLD between q(z|x) and p(z) as follows:

D_KL(q(z|x) || p(z)) = ∫ q(z|x) log(q(z|x) / p(z)) dz

We can further simplify this expression as follows:

D_KL(q(z|x) || p(z)) = -H(q(z|x)) - ∫ q(z|x) log(p(z)) dz

where H(q(z|x)) is the entropy of the posterior distribution, which can be computed analytically for the Beta distribution.

To compute the second term, assume that q(z|x) is itself a Beta distribution with parameters (α', β'). In a VAE these parameters are typically produced by the encoder network. In the special case of exact Bayesian inference with a Bernoulli likelihood p(x|z), conjugacy of the Beta prior makes the true posterior p(z|x) a Beta distribution as well, with updated parameters:

α' = α + ∑ᵢ₌₁ᴺ xᵢ

β' = β + ∑ᵢ₌₁ᴺ (1 - xᵢ)

where N is the number of observed data points and xᵢ is the ith data point.

Using E_q[log z] = ψ(α') - ψ(α' + β') and E_q[log(1 - z)] = ψ(β') - ψ(α' + β'), the second term is:

∫ q(z|x) log(p(z)) dz = -log B(α, β) + (α - 1)(ψ(α') - ψ(α' + β')) + (β - 1)(ψ(β') - ψ(α' + β'))

where B is the Beta function and ψ is the digamma function. The entropy of the Beta posterior is:

H(q(z|x)) = log B(α', β') - (α' - 1) ψ(α') - (β' - 1) ψ(β') + (α' + β' - 2) ψ(α' + β')

Finally, combining the two terms, the KLD can be computed as:

D_KL(q(z|x) || p(z)) = log(B(α, β) / B(α', β')) + (α' - α) ψ(α') + (β' - β) ψ(β') + (α + β - α' - β') ψ(α' + β')
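Because D_KL(q || p) is the expectation of log(q(z) / p(z)) under q, it can also be estimated by sampling, which gives an independent sanity check on any closed-form expression. The sketch below uses only the standard library; the function names, the sample size, and the example parameters q = Beta(2, 5), p = Beta(3, 4) are assumptions made for illustration.

```python
import math
import random

def beta_logpdf(z, a, b):
    """Log density of Beta(a, b) at z in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(z) + (b - 1) * math.log(1 - z) - log_norm

def kl_beta_monte_carlo(q_a, q_b, p_a, p_b, n_samples=100_000, seed=0):
    """Estimate D_KL(q || p) as the sample mean of log q(z) - log p(z), z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.betavariate(q_a, q_b)
        # Keep z strictly inside (0, 1) so the logarithms stay finite.
        z = min(max(z, 1e-12), 1.0 - 1e-12)
        total += beta_logpdf(z, q_a, q_b) - beta_logpdf(z, p_a, p_b)
    return total / n_samples

estimate = kl_beta_monte_carlo(2, 5, 3, 4)
print(f"Monte Carlo estimate of KL: {estimate:.3f}")
```

For these parameters the estimate should land close to the closed-form value of about 0.390, with a sampling error of a few thousandths at this sample size.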

For background on how the KLD term is weighted when training a VAE, you can refer to the following paper:

"β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" by Higgins et al., ICLR 2017. (https://openreview.net/forum?id=Sy2fzU9gl)

This paper introduces the β-VAE framework, which is a modification of the standard VAE that includes a tunable hyperparameter β to control the trade-off between reconstruction accuracy and latent space regularization. Note that β in that paper is a scalar weight on the KLD term, not a Beta distribution: the paper works with Gaussian latent variables and does not derive the Beta KLD given above, so it is a reference for how the KLD term is used rather than for the derivation itself.

In [14]:
import tensorflow as tf

def kl_divergence_beta(p_alpha, p_beta, q_alpha, q_beta):
    """Closed-form KL(q || p) for q = Beta(q_alpha, q_beta) and p = Beta(p_alpha, p_beta)."""
    # Accept Python floats or tensors and work in float32 throughout
    p_alpha, p_beta, q_alpha, q_beta = [
        tf.cast(v, tf.float32) for v in (p_alpha, p_beta, q_alpha, q_beta)
    ]

    def log_beta_func(alpha, beta):
        # log B(alpha, beta), computed with log-gamma for numerical stability
        return tf.math.lgamma(alpha) + tf.math.lgamma(beta) - tf.math.lgamma(alpha + beta)

    # log(B(p_alpha, p_beta) / B(q_alpha, q_beta))
    kld = log_beta_func(p_alpha, p_beta) - log_beta_func(q_alpha, q_beta)
    # Digamma terms are evaluated at the parameters of q, the first argument of the KL
    kld += (q_alpha - p_alpha) * tf.math.digamma(q_alpha) + (q_beta - p_beta) * tf.math.digamma(q_beta)
    kld += (p_alpha - q_alpha + p_beta - q_beta) * tf.math.digamma(q_alpha + q_beta)
    return kld

In [16]:
kl_divergence_beta(1.,2., 1., 2.)

<tf.Tensor: shape=(), dtype=float32, numpy=0.0>

Note that this code assumes that p_alpha and p_beta are the prior parameters of the beta distribution on the latent variable, and q_alpha and q_beta are the approximate posterior parameters learned by the VAE. 

You may need to modify the code to match the specific implementation of your VAE.

For a fuller VAE implementation that you can adapt, you may refer to the following GitHub repository (check its framework and latent distributions before reusing it): https://github.com/Schlumberger/joint-vae/blob/master/jointvae/models.py




