# Dimensionality Reduction

In this lab we first look at linear dimensionality reduction via PCA, and then at non-linear dimensionality reduction via (variational) auto-encoders.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Download the MIT CBCL Faces Database from [here](https://github.com/HyTruongSon/Pattern-Classification/blob/master/MIT-CBCL-database/svm.train.normgrey).

In [None]:
def load_data(fname):
    raw_data = []
    with open(fname, 'r') as f:
        for line in f.readlines()[2:]:
            if line.strip() == "":
                continue
            if int(line.split()[-1]) == -1:
                # not face, skip
                continue
            else:
                raw_data.append([float(yy) for yy in line.split()[:-1]])
    return np.array(raw_data)

data = load_data('svm.train.normgrey')
data.shape

## Principal Component Analysis

### Visualizing Faces

In [None]:
viz_idx = np.random.permutation(len(data))[:10]

fig, axes = plt.subplots(figsize=(10, 3), nrows=2, ncols=5)

for i in range(2):
    for j in range(5):
        k = i * 5 + j
        axes[i, j].imshow(data[viz_idx[k]].reshape(19, 19), cmap=plt.cm.gray, interpolation='bilinear')
        axes[i, j].axis('off')

fig.tight_layout()

### Compute PCA

Dimensionality reduction is essentially a basis selection problem. PCA aims to identify the "best" linear transform, one that lets us view the data along its directions of maximum variance. The key intuition is that the directions of highest variance in the data space also carry the most signal to "learn" from.

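As it turns out, the principal components that best describe the data in this sense are the eigenvectors of the covariance matrix. Equivalently, they are the right-singular vectors of the centered data matrix in which each row represents a sample. In the code below we take the SVD of the transposed matrix `X.T`, so these vectors appear as its left-singular vectors, i.e. the columns of `W`.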

In [None]:
X = data - np.mean(data, axis=0, keepdims=True)
W, S, _ = np.linalg.svd(X.T)
S.shape, W.shape
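
As a sanity check of the equivalence described above, the following sketch compares the eigenvectors of the sample covariance with the right-singular vectors of a centered data matrix. It uses small synthetic data (an assumption for illustration, not the faces dataset), and it compares directions only up to sign, since both decompositions leave the sign of each vector undetermined.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features with distinct variances (illustrative only).
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X_demo = X_demo - X_demo.mean(axis=0, keepdims=True)

# Eigen-decomposition of the sample covariance (eigh returns ascending eigenvalues).
cov = X_demo.T @ X_demo / (len(X_demo) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Right-singular vectors of X_demo are the rows of Vt.
_, S_demo, Vt = np.linalg.svd(X_demo, full_matrices=False)

# Directions agree up to sign; eigenvalues equal S^2 / (n - 1).
same_directions = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(5), atol=1e-6)
same_variances = np.allclose(eigvals, S_demo**2 / (len(X_demo) - 1))
print(same_directions, same_variances)
```

The second check also shows that the variance along each component is the squared singular value divided by $n - 1$.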

### Visualizing Eigen Faces

The columns of $W$ form the eigenfaces. We take the first $q$ column vectors to form the "eigenbasis" of face vectors, i.e. the eigenfaces.

In [None]:
q = 100
W = W[:, :q]
W.shape
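
To make the "reduction" explicit, the sketch below projects centered samples onto the first $q$ principal directions and then maps them back to the original space. The helper name `pca_reconstruct` and the synthetic low-rank data are assumptions for illustration; with the faces one would presumably pass `data` and reshape each reconstructed row to 19 x 19.

```python
import numpy as np

def pca_reconstruct(samples, q):
    """Project rows of `samples` onto the top-q principal directions and map back."""
    mean = samples.mean(axis=0, keepdims=True)
    centered = samples - mean
    # Columns of `basis` are the principal directions (left-singular vectors of centered.T).
    basis, _, _ = np.linalg.svd(centered.T, full_matrices=False)
    basis = basis[:, :q]
    codes = centered @ basis              # low-dimensional representation, shape (n, q)
    return codes, codes @ basis.T + mean  # reconstruction in the original space

rng = np.random.default_rng(0)
# Synthetic rank-3 data embedded in 10 dimensions (illustrative only).
samples = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 10))

codes, recon = pca_reconstruct(samples, q=3)
err_q3 = np.linalg.norm(samples - recon)
_, recon_q1 = pca_reconstruct(samples, q=1)
err_q1 = np.linalg.norm(samples - recon_q1)
print(codes.shape, err_q3, err_q1)
```

Because the synthetic data has rank 3, three components reconstruct it up to floating-point error, while a single component leaves a visible residual.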

In [None]:
fig, axes = plt.subplots(figsize=(10, 5), nrows=2, ncols=5)

for i in range(2):
    for j in range(5):
        k = i * 5 + j
        axes[i, j].imshow(W[:, k].reshape(19, 19), cmap=plt.cm.gray, interpolation='bilinear')
        axes[i, j].axis('off')
        axes[i, j].set_title(f'Dimension = {k}')

fig.tight_layout()

Interestingly, we see that different structures/features are highlighted in each of the eigenfaces.

### Generate a new face

How about creating a new face that has both a smile and a moustache? By visual inspection, it seems that dimension 4 and dimension 7 can combine to give us a moustached person with a smile. This is exactly how we have always thought about vector spaces: new vectors are built as linear combinations of basis vectors. Here the abstract idea is applied to the vector space of faces!

**NOTE**: This is purely by inspection for now but one can imagine how controlled generation can happen if you control for such facial features, and attach semantics to them. Getting nice faces may be much harder.

In [None]:
gen = W[:, 4] + W[:, 7]
fig, ax = plt.subplots()
ax.imshow(gen.reshape(19, 19), cmap=plt.cm.gray, interpolation='bilinear')
ax.axis('off');

### Explained Variance

How much of the total variance do we explain by keeping only the first $q$ principal components? Remember that we have essentially diagonalized the system into orthogonal components, such that all covariances are zero. The variance along each component is proportional to the square of the corresponding singular value, so the explained variance can be deduced from the squared entries of $S$ above.

In [None]:
f'Explained variance: {np.sum(S[:q] ** 2) / np.sum(S ** 2) * 100:.2f}%'
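
Going the other way, one may want the smallest $q$ that reaches a target fraction of the variance. The sketch below does this with a cumulative sum over squared singular values; the helper name `components_for_variance` and the example singular values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def components_for_variance(singular_values, target=0.95):
    """Smallest number of leading components whose explained variance reaches `target`."""
    variances = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(variances) / np.sum(variances)
    # searchsorted finds the first index where cumulative >= target.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(variances)), cumulative

# Hypothetical singular values, in descending order as np.linalg.svd returns them.
s_demo = np.array([4.0, 2.0, 1.0, 1.0])
q_needed, cumulative = components_for_variance(s_demo, target=0.9)
print(q_needed, cumulative)
```

Here the squared values are 16, 4, 1, 1, so the first two components already explain 20/22 of the variance and `q_needed` is 2.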

## Variational Autoencoders

Variational Autoencoders (VAEs) provide us a way to do non-linear dimensionality reduction. We will compress the information contained in the MNIST dataset into two-dimensional vectors. By interpolating in this 2-D space, we will be able to visualize whether our learned latent variables have captured enough information about the digits.

Formally, we posit that the generative process behind each MNIST digit involves a continuous latent variable $z \sim \mathcal{N}(\mu, \Sigma)$. For simplicity, we assume that the covariance is diagonal. The digit $x$ is generated via an appropriate model $p(x \mid z)$. We assume a standard normal prior over $z$, $p(z) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.

### Encoder

We first define a _variational_ encoder $q(z \mid x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x)))$ (this is our variational posterior). Note that both the mean and the diagonal covariance are parameterized by some parameters, collectively denoted by $\phi$. In our case, this will be a simple neural network with one hidden layer that outputs the mean and the per-dimension standard deviation.

In [None]:
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_size, latent_size):
        '''
        Arguments:
            in_size (int): Dimension of inputs
            latent_size (int): Dimension of final latent space.
        '''
        super().__init__()
        
        self.enc = nn.Sequential(
            nn.Linear(in_size, 512),
            nn.ReLU()
        )
        self.mu_net = nn.Linear(512, latent_size)
        self.sigma_net = nn.Sequential(
            nn.Linear(512, latent_size),
            nn.Softplus()  # standard deviation has to be positive
        )

    def forward(self, x):
        '''
        Arguments:
            x (Tensor): Shape batch_size x in_size
        Returns:
            mu (Tensor): Variational mean, shape (batch_size x latent_size)
            sigma (Tensor): Variational standard deviation (diagonal), shape (batch_size x latent_size)
        '''
        encoded = self.enc(x)
        mu = self.mu_net(encoded)
        sigma = self.sigma_net(encoded)

        return mu, sigma

### Decoder

For the decoder, we will model $p(x \mid z)$ as $\mathcal{N}(\mu_\theta(z), \sigma^2)$ independently for each pixel of the observed MNIST image sample. Note that here only the mean is parameterized for simplicity. In this case, our decoder will mirror the encoder and reverse the process for reconstruction.

In [None]:
class Decoder(nn.Module):
    def __init__(self, latent_size, out_size):
        '''
        Arguments:
            latent_size (int): Dimension of final latent space.
            out_size (int): Dimension of output vectors
        '''
        super().__init__()

        self.dec = nn.Sequential(
            nn.Linear(latent_size, 512),
            nn.ReLU(),
            nn.Linear(512, out_size),
            nn.Sigmoid()
        )

    def forward(self, z):
        '''
        Arguments:
            z (Tensor): batch_size x latent_size
        Returns:
            x_hat (Tensor): batch_size x out_size
        '''

        x_hat = self.dec(z)

        return x_hat


### Training VAEs

VAEs are trained by minimizing a regularized reconstruction loss. This can in fact be derived as a consequence of maximizing the evidence lower bound $\mathcal{L}$.

$$
\log{p(\mathbf{X}\mid\theta)} = \underbrace{\mathbb{E}_q[\log{p(\mathbf{X},\mathbf{Z}\mid\theta)} - \log{q(\mathbf{Z})}]}_{\mathcal{L}(q,\theta)} + \mathrm{KL}(q(\mathbf{Z}) \mid\mid  p(\mathbf{Z}\mid\mathbf{X},\theta))
$$

For any approximate posterior $q$, the lower bound can be rewritten as,

$$
\mathcal{L}(q,\theta) = \underbrace{\mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X})}\left[ \log{p(\mathbf{X}\mid \mathbf{Z}, \theta)} \right]}_{\text{Reconstruction term}} - \overbrace{\mathrm{KL}(q(\mathbf{Z} \mid \mathbf{X}) \mid\mid p(\mathbf{Z}))}^{\text{Regularizer}}
$$

When the output is Gaussian, this reconstruction loss is just the squared error as we've observed a few times before. This is what we will use below. Further, the KL-divergence between two Gaussians can be easily computed in closed form. This loss is easily computable and we can optimize using gradient descent.
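
To see why, write out the negative log-likelihood of the decoder defined earlier, $p(x \mid z) = \mathcal{N}(\mu_\theta(z), \sigma^2 \mathbf{I})$, assuming $D$ independent pixels and a fixed $\sigma^2$:

$$
-\log{p(x \mid z)} = \frac{1}{2\sigma^2} \lVert x - \mu_\theta(z) \rVert^2 + \frac{D}{2} \log{(2\pi\sigma^2)}
$$

The second term does not depend on $\theta$, so maximizing the expected log-likelihood amounts to minimizing the squared error. The factor $1/(2\sigma^2)$ only sets the weight of the reconstruction term relative to the KL term.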

**NOTE**: While the specifics of this loss are interesting to understand, it is enough to realize that our intuitive way of encoding images into low-dimensional vectors via simply a reconstruction loss can be justified theoretically. As generally done, we bias the solutions via a regularizer. Previously, we have been using the L2-norm of the parameters as a regularizer. In this case, we use the KL-divergence, which simply says: don't stray too far from the prior.
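
The closed form of the KL term for a diagonal Gaussian against a standard normal prior is $\frac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - 2\log{\sigma_i})$. As a rough check, the sketch below compares this expression with a direct numerical integration in one dimension. The function names and the values of `mu` and `sigma` are hypothetical, and the integration is a plain Riemann sum rather than anything tuned for accuracy.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def kl_numerical_1d(mu, sigma, half_width=12.0, n=200_001):
    """Riemann-sum approximation of the same KL for a single dimension."""
    z = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    dz = z[1] - z[0]
    # Integrand is q(z) * (log q(z) - log p(z)).
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dz

closed = kl_to_standard_normal([0.7], [0.5])
numeric = kl_numerical_1d(0.7, 0.5)
print(closed, numeric)
```

The divergence is zero exactly when `mu = 0` and `sigma = 1`, i.e. when the variational posterior coincides with the prior.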


In [None]:
class VAELoss(nn.Module):
    def __init__(self, beta=1.):
        super().__init__()
        self.beta = beta

    def forward(self, x, x_hat, mu, sigma):
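        # Gaussian likelihood with fixed variance: squared error summed over pixels and batch
        mse_loss = (x - x_hat).pow(2).sum()
        # closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions and batch
        kl_loss = (sigma.pow(2) + mu.pow(2) - 1.0 - 2. * sigma.log()).div(2.).sum()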

        return mse_loss + self.beta * kl_loss

class VAE(nn.Module):
    def __init__(self, in_size, latent_size):
        super().__init__()
        self.encoder = Encoder(in_size, latent_size)
        self.decoder = Decoder(latent_size, in_size)

    def forward(self, x):
        mu, sigma = self.encoder(x)
        # reparameterization: sample z as a differentiable function of mu and sigma
        z = mu + sigma * torch.randn_like(mu)
        x_hat = self.decoder(z)
        return x_hat, mu, sigma

In [None]:
import torch
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader

mnist = MNIST(root='/tmp', download=True, train=True, transform=ToTensor())

In [None]:
from tqdm.auto import tqdm

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

vae = VAE(784, 2).to(device)
criterion = VAELoss(beta=1.)
optim = torch.optim.Adam(vae.parameters())

for _ in tqdm(range(20)):
    for b_x, _ in tqdm(DataLoader(mnist, batch_size=2000), leave=False):
        B, *_ = b_x.shape
        x = b_x.view(B, -1).to(device)

        optim.zero_grad()
        
        x_hat, mu, sigma = vae(x)
        loss = criterion(x, x_hat, mu, sigma)

        loss.backward()
        optim.step()
    print(f'Loss: {loss.item()}')

### Navigating the latent space

Now let us take any two interesting samples from the dataset (e.g. a digit 0 and a digit 8). We will interpolate between the two digits by first encoding these samples into the latent space, then moving along the direction between the two encodings in that space, and finally decoding back into the original space for visualization.

Formally, consider two samples $x_0$ and $x_1$ with corresponding mean encodings $z_0$ and $z_1$. A convex interpolation in the latent space follows the line segment between them, for a scalar parameter $t \in [0, 1]$:

$$
z = z_0 + t \times (z_1 - z_0)
$$
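
The interpolation formula above can be sketched in plain NumPy before applying it to real encodings. The two latent codes here are hypothetical stand-ins for encoder means, and `interpolate` is an illustrative helper name.

```python
import numpy as np

def interpolate(z_start, z_end, steps):
    """Return `steps` points evenly spaced on the segment from z_start to z_end."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # shape (steps, 1) broadcasts over latent dims
    return z_start + t * (z_end - z_start)

# Hypothetical 2-D latent codes standing in for the encoder means of two digits.
z_a = np.array([-1.0, 2.0])
z_b = np.array([3.0, 0.0])
path = interpolate(z_a, z_b, steps=5)
print(path)
```

The first and last rows of `path` reproduce the two endpoints, and the middle row is their midpoint; each row would then be passed through the decoder for visualization.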

In [None]:
with torch.no_grad():
  x_0 = mnist.data[mnist.targets == 0][0].to(device).view(1, -1).float() / 255
  x_1 = mnist.data[mnist.targets == 8][0].to(device).view(1, -1).float() / 255
  z_0, _ = vae.encoder(x_0)
  z_1, _ = vae.encoder(x_1)

  z = (z_0 + torch.linspace(0, 1, 20).to(device).unsqueeze(-1) * (z_1 - z_0))
  x_hat = vae.decoder(z)

x_hat = x_hat.reshape(z.size(0), 28, 28).cpu().numpy()

fig, axes = plt.subplots(figsize=(20, 5), ncols=x_hat.shape[0])
for i in range(x_hat.shape[0]):
  axes[i].imshow(x_hat[i], cmap=plt.cm.gray)
  axes[i].axis('off')
  axes[i].set_aspect('equal')

fig.tight_layout()