# Tabular Variational Autoencoder
This tutorial shows how to train a Variational Autoencoder (VAE) on a small table and use it to generate synthetic rows. It is written for a non-technical audience: run the cells in order and read the short note above each one.

## 1. Setup
Run the cell below to install the required libraries. If you're running in an environment that already has them, this step will finish quickly.

In [1]:
!pip install torch pandas scikit-learn --quiet


## 2. Load a sample dataset
We'll use the classic Iris flower dataset that comes with scikit-learn.

In [2]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
print('Dataset shape:', data.shape)
data.head()

Dataset shape: (150, 4)


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


## 3. Build a simple VAE
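This VAE has two main parts. The encoder compresses each row into a small latent code, described by a mean and a log-variance for each of its 2 latent dimensions. The decoder rebuilds a full row (here, 4 feature values) from a latent code.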

In [3]:
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim=2):
        super().__init__()
        # Encoder: maps a row to 2 * latent_dim numbers (means and log-variances).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU(),
            nn.Linear(8, latent_dim*2)
        )
        # Decoder: maps a latent vector back to input_dim feature values.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 8),
            nn.ReLU(),
            nn.Linear(8, input_dim)
        )

    def encode(self, x):
        h = self.encoder(x)
        # Split the encoder output in half: first half is mu, second is logvar.
        mu, logvar = h.chunk(2, dim=1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        # Sample z = mu + eps * std, keeping the sampling step differentiable.
        std = (0.5*logvar).exp()
        eps = torch.randn_like(std)
        return mu + eps*std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        return recon, mu, logvar
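
To make the `reparameterize` step above more concrete, here is a minimal sketch of the same arithmetic in plain Python for a single number, without PyTorch. The values of `mu` and `logvar`, the seed, and the sample count are illustrative assumptions, not taken from the trained model.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Draw z = mu + eps * std, where std = exp(0.5 * logvar) and eps ~ N(0, 1)."""
    std = math.exp(0.5 * logvar)
    eps = rng.gauss(0.0, 1.0)
    return mu + eps * std

rng = random.Random(0)  # fixed seed so the demo is repeatable
# Illustrative values: a variance of 0.25 corresponds to a std of 0.5.
mu, logvar = 1.5, math.log(0.25)
samples = [reparameterize(mu, logvar, rng) for _ in range(20000)]

# The samples should be centred on mu with a spread close to std.
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"sample mean: {sample_mean:.2f}, sample std: {sample_std:.2f}")
```

The randomness comes only from `eps`, while `mu` and `logvar` enter through ordinary arithmetic. That is what lets the training step adjust them with gradients.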


## 4. Train the model
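We'll train for 50 epochs, which is quick because the dataset is small. The loss adds two terms: a reconstruction error (how far the rebuilt rows are from the originals) and a KL term that keeps the latent codes close to a standard normal distribution.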

In [4]:
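import torch
from torch.utils.data import DataLoader, TensorDataset

# Convert the table to a float tensor and serve it in shuffled mini-batches.
X = torch.tensor(data.values, dtype=torch.float32)
dataset = TensorDataset(X)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = VAE(input_dim=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(50):
    for batch in loader:
        batch = batch[0]  # each batch is a one-element sequence; take the tensor
        recon, mu, logvar = model(batch)
        # Reconstruction error: mean squared difference between rebuilt and real rows.
        recon_loss = nn.functional.mse_loss(recon, batch)
        # KL term (averaged over all latent entries): pulls codes toward N(0, 1).
        kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = recon_loss + kl_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Every 10 epochs, print the loss of the last batch seen in that epoch.
    if epoch % 10 == 0:
        print(f'Epoch {epoch}: loss={loss.item():.4f}')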


Epoch 0: loss=14.2252
Epoch 10: loss=1.6809
Epoch 20: loss=1.3177
Epoch 30: loss=0.9026
Epoch 40: loss=0.8603
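
The `kl_loss` line in the training loop can be hard to read at first. As a rough sketch, the same formula is written below in plain Python for small nested lists, so you can see how the term behaves. The helper name `kl_term` and the input values are illustrative assumptions.

```python
import math

def kl_term(mu, logvar):
    """Mean over all latent entries of -0.5 * (1 + logvar - mu^2 - exp(logvar))."""
    terms = [
        -0.5 * (1.0 + lv - m * m - math.exp(lv))
        for row_mu, row_lv in zip(mu, logvar)
        for m, lv in zip(row_mu, row_lv)
    ]
    return sum(terms) / len(terms)

# One row with two latent dimensions in each case.
matches_prior = kl_term([[0.0, 0.0]], [[0.0, 0.0]])   # mean 0, variance 1
shifted_mean = kl_term([[1.0, 1.0]], [[0.0, 0.0]])    # mean moved away from 0
narrow = kl_term([[0.0, 0.0]], [[-2.0, -2.0]])        # variance well below 1

print(matches_prior, shifted_mean, narrow)
```

The term is zero when the latent code already matches a standard normal distribution. It grows when the mean drifts away from 0 or the variance moves away from 1, which is why minimizing it keeps the latent space well behaved for sampling later.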


## 5. Generate new synthetic rows
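After training, we can generate new rows by drawing random latent vectors from a standard normal distribution and passing them through the decoder. No random seed is set, so your values will differ from the ones shown below.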

In [5]:
with torch.no_grad():
    # 5 latent vectors of size 2; the 2 must match the model's latent_dim.
    z = torch.randn(5, 2)
    synthetic = model.decode(z).numpy()
synthetic_df = pd.DataFrame(synthetic, columns=iris.feature_names)
synthetic_df


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.900254 | 2.9039 | 4.240979 | 1.386844 |
| 1 | 5.583257 | 2.989634 | 3.455221 | 1.041633 |
| 2 | 5.535187 | 3.204916 | 2.698673 | 0.713613 |
| 3 | 5.912194 | 2.908473 | 4.24741 | 1.389859 |
| 4 | 5.309346 | 3.204262 | 2.320914 | 0.550678 |


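You're now ready to adapt this notebook to your own tables. Replace the dataset loading step with your own data (numeric columns only, because the rows are converted to a float tensor), use your own column names in place of `iris.feature_names` in section 5, and retrain the model.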
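
One preparation step that is often useful for your own tables, though this notebook does not use it, is to put every column on a similar scale before training and to undo the scaling on the generated rows. The sketch below shows the idea in plain Python. The helper names `fit_scaler`, `scale`, and `unscale` are hypothetical, and the demo table reuses the five Iris rows printed in section 2.

```python
import statistics

def fit_scaler(rows):
    """Return per-column means and standard deviations for a list of numeric rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Use 1.0 for constant columns so the division below never hits zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def scale(rows, means, stds):
    """Shift and rescale each column using the fitted means and stds."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def unscale(rows, means, stds):
    """Map scaled rows (for example, generated ones) back to original units."""
    return [[v * s + m for v, m, s in zip(row, means, stds)] for row in rows]

# First five Iris rows from section 2, used only as a small demo table.
rows = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
]
means, stds = fit_scaler(rows)
scaled = scale(rows, means, stds)         # what you would train on
restored = unscale(scaled, means, stds)   # round trip back to original units
```

If you take this route, train on the scaled values and apply `unscale` to the decoder output, so the synthetic rows come back in the same units as your original table.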