# VAE with Normalizing Flows: Theory Explanation

Variational Autoencoders (VAEs) and Normalizing Flows (NFs) are both established techniques in generative modeling, each with its own strengths. Combining them makes the latent distribution of a VAE considerably more flexible.

## Variational Autoencoders (VAE):
A VAE is a generative model designed to learn the underlying distribution of the data. It learns a probabilistic mapping from the input data to a lower-dimensional latent space and then reconstructs the input from the latent variables.

### Key components of a VAE:
- **Encoder**: Maps each input to the parameters (mean and variance) of a distribution over the latent space.
- **Decoder**: Reconstructs the data from a latent variable sampled from that distribution.
- **Reparameterization Trick**: Sampling is not differentiable, so the latent variable is written as the mean plus the standard deviation times independent standard Gaussian noise. The randomness is then isolated in the noise term, and gradients can flow back to the encoder's mean and variance.
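
As a rough illustration of the reparameterization trick, the sketch below draws a latent sample from a mean and a log-variance using only the Python standard library. The function name `reparameterize` and the example values are assumptions made for this illustration; a real VAE would do the same thing on tensors inside an autodiff framework.

```python
import math
import random

def reparameterize(mean, logvar, rng=random):
    """Draw z = mean + std * eps elementwise, with eps ~ N(0, 1)."""
    z = []
    for m, lv in zip(mean, logvar):
        std = math.exp(0.5 * lv)   # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on the parameters
        z.append(m + std * eps)    # deterministic in (mean, logvar) given eps
    return z

# Hypothetical encoder outputs for a 2-dimensional latent space
rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)
print(z)
```

Because `eps` is drawn independently of `mean` and `logvar`, the sample is a differentiable function of both, which is what lets gradients reach the encoder.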

### Loss Function in VAE:
The VAE loss consists of two components:
- **Reconstruction Loss** (e.g., MSE or Binary Cross-Entropy): Measures how well the decoder reconstructs the input data from the latent representation.
- **KL Divergence**: Regularizes the latent space by measuring the difference between the learned distribution (from the encoder) and the prior distribution (usually a standard Gaussian).
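
The two loss components can be sketched in a few lines, assuming a diagonal Gaussian encoder and a standard Gaussian prior, in which case the KL divergence has a well-known closed form. The helper names and the `beta` weight are assumptions for this illustration, not part of the model described here.

```python
import math

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, logvar))

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mean, logvar, beta=1.0):
    """Reconstruction loss plus (optionally weighted) KL regularizer."""
    return mse(x, x_hat) + beta * kl_to_standard_normal(mean, logvar)

# Hypothetical input, reconstruction, and encoder outputs
print(vae_loss([1.0, 2.0], [0.9, 2.2], mean=[0.5], logvar=[-0.1]))
```

The KL term is zero only when the encoder outputs a zero mean and unit variance, so it pulls every posterior toward the prior while the reconstruction term pulls it toward the data.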

## Normalizing Flows (NF):
Normalizing Flows are a family of generative models that apply a sequence of invertible transformations to a simple distribution (e.g., a Gaussian) to turn it into a more complex one. The density of the transformed variable follows from the change-of-variables formula, which requires the determinant of each transformation's Jacobian (the matrix of partial derivatives). Flow layers are therefore designed so that this determinant is cheap to compute.

By using invertible transformations, we can sample from the transformed distribution and also compute the log-likelihood of the samples efficiently.

### How Normalizing Flows work:
- **Base Distribution**: Start with a simple base distribution (e.g., Gaussian).
- **Transformation**: Apply a series of invertible transformations that progressively map the base distribution to a more complex one.
- **Log-Determinant**: For each transformation, we compute the log-determinant of the Jacobian, which is required to calculate the likelihood of the transformed samples.
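
The three ingredients above can be checked on the simplest possible flow: a one-dimensional affine map applied to a standard Gaussian. This is a minimal sketch, and the function names are assumptions chosen for the illustration. The density obtained from the change-of-variables formula should agree with the known Gaussian density of the transformed variable.

```python
import math

def standard_normal_logpdf(x):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (x * x + math.log(2.0 * math.pi))

def affine_flow(z0, scale, shift):
    """Invertible map z1 = scale * z0 + shift and its log|det Jacobian|."""
    z1 = scale * z0 + shift
    log_det = math.log(abs(scale))  # the Jacobian of a 1-D affine map is `scale`
    return z1, log_det

def flow_logpdf(z1, scale, shift):
    """log q1(z1) = log q0(z0) - log|det J|, where z0 is the inverse image of z1."""
    z0 = (z1 - shift) / scale
    return standard_normal_logpdf(z0) - math.log(abs(scale))

z1, log_det = affine_flow(0.0, scale=2.0, shift=1.0)
print(z1, log_det, flow_logpdf(z1, 2.0, 1.0))
```

With `scale=2.0` and `shift=1.0` the transformed variable is Gaussian with mean 1 and standard deviation 2, and `flow_logpdf` reproduces that density without ever writing it down explicitly.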

## Combining VAEs with Normalizing Flows:
The combination of VAEs and normalizing flows allows us to improve the flexibility of the latent space. Normalizing flows act as an additional layer on top of the VAE, providing more expressive power to the latent variable distribution. This helps model more complex data distributions that a standard VAE might struggle with.

Here’s how the combination works:
- **VAE Encoder**: The encoder maps the input to a distribution in the latent space (mean and log variance).
- **Normalizing Flows**: After sampling the latent variable from the VAE's encoder, normalizing flows transform it into a more flexible distribution, allowing for more complex data generation.
- **VAE Decoder**: The decoder reconstructs the data from the transformed latent variables.

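Because the flows change the density of the latent sample, the loss function must change as well. Each transformation subtracts its log-determinant from the log-density of the sample, so the KL term is no longer computed in closed form from the encoder's mean and variance. It is instead estimated from the sampled latent variable, using the initial Gaussian log-density, the sum of the log-determinants, and the prior log-density of the transformed sample.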
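
Written out, the modified objective takes the following form. The notation is an assumption introduced here: $q_0$ is the encoder's Gaussian, $f_1, \dots, f_K$ are the flow transformations with $z_k = f_k(z_{k-1})$, $p(z)$ is the prior, and $p(x \mid z)$ is the decoder likelihood.

$$
\log q_K(z_K \mid x) = \log q_0(z_0 \mid x) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
$$

$$
\mathcal{L}(x) = \mathbb{E}_{q_0(z_0 \mid x)} \left[ -\log p(x \mid z_K) + \log q_0(z_0 \mid x) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right| - \log p(z_K) \right]
$$

The first term inside the expectation is the reconstruction loss, and the remaining three together form a sample-based estimate of the KL divergence between the transformed posterior and the prior.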

## Training Process:
### Forward Pass:
1. The input data is passed through the encoder to obtain the mean and log variance of the latent distribution.
2. Samples are drawn from this distribution using the reparameterization trick.
3. Normalizing flows are applied to the sampled latent variable.
4. The decoder reconstructs the input data from the transformed latent variable.
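
Steps 2 and 3 of the forward pass can be sketched with planar flows, one common choice of flow layer. This is an illustrative assumption, since the text does not commit to a specific flow. The encoder and decoder are omitted: `mean` and `logvar` stand in for encoder outputs, and the flow parameters `u`, `w`, `b` are hypothetical values that would normally be learned.

```python
import math
import random

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w.z + b); returns f(z) and log|det Jacobian|."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # det J = 1 + u . psi(z), with psi(z) = (1 - tanh^2(a)) * w
    u_dot_psi = (1.0 - h * h) * sum(ui * wi for ui, wi in zip(u, w))
    return z_new, math.log(abs(1.0 + u_dot_psi))

def forward_latent(mean, logvar, flows, rng):
    """Sample z0 by reparameterization, then push it through each flow."""
    z0 = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
          for m, lv in zip(mean, logvar)]
    z, sum_log_det = list(z0), 0.0
    for u, w, b in flows:
        z, log_det = planar_flow(z, u, w, b)
        sum_log_det += log_det
    return z0, z, sum_log_det

# Two hypothetical planar layers; w.u > -1 keeps each layer invertible
flows = [([0.5, 0.1], [1.0, 0.2], 0.0), ([0.2, 0.3], [0.4, 0.6], -0.1)]
z0, z_k, sum_log_det = forward_latent([0.0, 0.0], [0.0, 0.0], flows, random.Random(0))
print(z0, z_k, sum_log_det)
```

The decoder would then receive `z_k`, while `z0` and `sum_log_det` are kept for the loss calculation.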

### Loss Calculation:
- **Reconstruction Loss**: Measures how accurately the decoder reconstructs the input.
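- **KL Divergence**: Regularizes the latent space by encouraging the flow-transformed posterior to stay close to the prior (usually a standard Gaussian). With flows it is estimated from the sampled latent variable rather than computed in closed form.
- **Log-Determinant of the Jacobian**: Each flow transformation contributes a log-determinant term to that estimate. The sum of these terms is subtracted from the log-density of the initial sample, so it enters the minimized loss with a negative sign.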
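
The sketch below combines the three terms into a single-sample loss estimate. It assumes a diagonal Gaussian encoder and a standard Gaussian prior. The function names are hypothetical, and `recon_loss` stands for whatever reconstruction term the model uses, computed elsewhere.

```python
import math

def diag_gaussian_logpdf(z, mean, logvar):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return sum(-0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
               for zi, m, lv in zip(z, mean, logvar))

def flow_vae_loss(recon_loss, z0, z_k, mean, logvar, sum_log_det):
    """Single-sample estimate: reconstruction + [log q_K(z_K) - log p(z_K)]."""
    log_q0 = diag_gaussian_logpdf(z0, mean, logvar)  # density of the initial sample
    log_q_k = log_q0 - sum_log_det                   # change of variables through the flows
    zeros = [0.0] * len(z_k)
    log_prior = diag_gaussian_logpdf(z_k, zeros, zeros)  # standard Gaussian prior
    return recon_loss + (log_q_k - log_prior)

# Hypothetical values for one training example
loss = flow_vae_loss(recon_loss=0.25, z0=[0.3, -0.1], z_k=[0.5, 0.2],
                     mean=[0.1, 0.0], logvar=[-0.2, 0.1], sum_log_det=0.4)
print(loss)
```

Note the sign: a larger `sum_log_det` lowers the loss, because an expanding flow spreads the posterior mass and lowers its log-density.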

### Optimization:
- The model parameters (encoder, decoder, and flow layers) are optimized using gradient descent to minimize the overall loss.

## Key Takeaways:
- **Normalizing Flows** make the latent distribution more flexible by passing latent samples through invertible transformations with tractable Jacobian determinants.
- Combining VAEs and normalizing flows lets the approximate posterior depart from a simple Gaussian, which helps on data that a standard VAE fits poorly.
- The loss function in a VAE with normalizing flows includes the reconstruction loss, the KL divergence, and the log-determinants of the flow transformations.

Integrating normalizing flows into VAEs can yield generative models that capture complex data such as images and text more accurately than a VAE with a plain Gaussian posterior.


In [None]:
# Install required libraries
!pip install tensorflow_probability matplotlib pandas

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras import layers, Model
import matplotlib.pyplot as plt

In [None]:
# Load a time-series dataset (monthly airline passenger counts)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
data = pd.read_csv(url)
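# Extract time-series data
data_series = data['Passengers'].values.astype(np.float32)
# Split into train and test sets
train_data = data_series[:100]
test_data = data_series[100:]
# Normalize both splits with training-set statistics to avoid leaking test information
train_mean, train_std = np.mean(train_data), np.std(train_data)
train_data = (train_data - train_mean) / train_std
test_data = (test_data - train_mean) / train_std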
# Create sliding windows for time-series forecasting
def create_windows(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])
        y.append(data[i + window_size])
    return np.array(X), np.array(y)
window_size = 10
X_train, y_train = create_windows(train_data, window_size)
X_test, y_test = create_windows(test_data, window_size)

In [None]:
# Define GP-VAE components
tfd = tfp.distributions

class GPVAE(Model):
    def __init__(self, latent_dim, time_steps, kernel_scale=1.0):
        super(GPVAE, self).__init__()
        self.latent_dim = latent_dim
        self.time_steps = time_steps
        self.kernel_scale = kernel_scale
        # Encoder: Maps inputs to latent space
        self.encoder = tf.keras.Sequential([
            layers.InputLayer(input_shape=(time_steps,)),
            layers.Dense(64, activation='relu'),
            layers.Dense(latent_dim * 2)  # Mean and log-variance
        ])
        # Decoder: Reconstructs data from latent space
        self.decoder = tf.keras.Sequential([
            layers.InputLayer(input_shape=(latent_dim,)),
            layers.Dense(64, activation='relu'),
            layers.Dense(time_steps)  # Reconstructs time series
        ])
    def call(self, inputs):
        # Encode inputs
        encoded = self.encoder(inputs)
        z_mean, z_logvar = tf.split(encoded, num_or_size_splits=2, axis=-1)
        z_std = tf.exp(0.5 * z_logvar)
        # Sample latent variable z
        eps = tf.random.normal(shape=tf.shape(z_mean))
        z = z_mean + eps * z_std
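        # GP-style prior on z (assumption: an RBF kernel over the latent indices, so the
        # prior is a Gaussian of the same dimension as z and the KL between the two
        # Gaussians has a closed form)
        gp_kernel = tfp.math.psd_kernels.ExponentiatedQuadratic(amplitude=self.kernel_scale)
        index_points = tf.range(self.latent_dim, dtype=tf.float32)[:, None]
        prior_cov = gp_kernel.matrix(index_points, index_points)
        prior_cov += 1e-3 * tf.eye(self.latent_dim)  # jitter keeps the Cholesky stable
        gp_prior = tfd.MultivariateNormalTriL(
            loc=tf.zeros(self.latent_dim), scale_tril=tf.linalg.cholesky(prior_cov))
        # KL divergence between the encoder posterior and the prior
        kl_divergence = tfd.kl_divergence(
            tfd.MultivariateNormalDiag(loc=z_mean, scale_diag=z_std),
            gp_prior
        )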
        # Decode reconstructed sequence
        reconstruction = self.decoder(z)
        return reconstruction, kl_divergence
    def compute_loss(self, x):
        reconstruction, kl_divergence = self(x)
        # Reconstruction loss
        reconstruction_loss = tf.reduce_mean(tf.square(x - reconstruction))
        # Total loss = reconstruction + KL divergence
        return reconstruction_loss + tf.reduce_mean(kl_divergence)

In [None]:
# Initialize the model
latent_dim = 5
time_steps = window_size
gp_vae = GPVAE(latent_dim=latent_dim, time_steps=time_steps)
# Optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# Training loop
epochs = 50
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        loss = gp_vae.compute_loss(X_train)
    gradients = tape.gradient(loss, gp_vae.trainable_variables)
    optimizer.apply_gradients(zip(gradients, gp_vae.trainable_variables))
    # Print progress
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.numpy():.4f}")

In [None]:
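# Reconstruct the test windows several times; each call draws a fresh latent sample
n_samples = 50
samples = np.stack([gp_vae(X_test)[0].numpy()[:, -1] for _ in range(n_samples)])
recon_mean = samples.mean(axis=0)
recon_std = samples.std(axis=0)
# The model is an autoencoder: its output reconstructs the input window, so the last
# reconstructed step is compared with the last observed step of each window
plt.figure(figsize=(10, 6))
plt.plot(X_test[:, -1], label="True")
plt.plot(recon_mean, label="Reconstructed")
plt.fill_between(
    range(len(recon_mean)),
    recon_mean - 2 * recon_std,
    recon_mean + 2 * recon_std,
    color='gray', alpha=0.3, label='+/- 2 std over latent samples'
)
plt.legend()
plt.title("GP-VAE: Time-Series Reconstruction with Uncertainty")
plt.show()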