In [None]:
!nvidia-smi

# Variational Autoencoders

In this notebook, we'll introduce and explore "variational autoencoders," which are a very successful family of models in modern deep learning. In particular we will:


1.   Illustrate the connection between autoencoders and classical *Principal Component Analysis (PCA)*
2.   Train a non-linear variational autoencoder that uses a deep neural network

### Overview
As explained in lecture, variational autoencoders are a way of discovering *latent, low-dimensional structure* in a dataset. In particular, a random data vector $W \in \mathbb{R}^d$ can be said to have low-dimensional structure if we can find some functions $e: \mathbb{R}^d \to \mathbb{R}^k$ and $d: \mathbb{R}^k \to \mathbb{R}^d$, with $k \ll d$, such that $$d(e(W)) \approx W.$$
In other words, $e(W)$ is a parsimonious, $k$-dimensional representation of $W$ that contains all of the information necessary to approximately reconstruct the full vector $W$. Traditionally, $e(W)$ is called an *encoding* of $W$.

It turns out that this is meaningless unless we restrict what kinds of functions $d$ and $e$ are allowed to be, because it's possible to write down some (completely ugly) one-to-one function $\mathbb{R}^d \to \mathbb{R}^1$ for any dimension $d$. This gives rise to the notion of *variational autoencoders* where, given some sets of reasonable functions $D$ and $E$, we aim to minimize
$$\mathbb{E}[\mathrm{loss}(W, d(e(W)))]$$ over functions $d \in D$ and $e \in E$. As usual, this is done by minimizing the sample analog.
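
To make "minimizing the sample analog" concrete, here is a minimal sketch that evaluates the empirical squared reconstruction loss for a linear encoder/decoder pair. The synthetic data and the helper names `encode`, `decode`, and `empirical_loss` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): n samples in R^d lying near a k-dimensional subspace
n, d, k = 500, 10, 2
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]      # d x k, orthonormal columns
W = rng.normal(size=(n, k)) @ basis.T + 0.01 * rng.normal(size=(n, d))

def encode(W, B):
    """Linear encoder e(w) = B w, applied to each row of W."""
    return W @ B.T

def decode(Z, A):
    """Linear decoder d(z) = A z, applied to each row of Z."""
    return Z @ A.T

def empirical_loss(W, A, B):
    """Sample analog of E[ ||W - d(e(W))||^2 ]."""
    residual = W - decode(encode(W, B), A)
    return np.mean(np.sum(residual ** 2, axis=1))

# A pair built from the true subspace versus a pair built from a random subspace
good_loss = empirical_loss(W, basis, basis.T)
random_basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
bad_loss = empirical_loss(W, random_basis, random_basis.T)
print(good_loss, bad_loss)
```

Training an autoencoder amounts to searching over the allowed encoder/decoder pairs for the one that makes this empirical loss smallest.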



## Linear Autoencoders and PCA: Theory


Let $W$ be a centered random vector in $\mathbb{R}^d$ with observations $W_i$, and let $\Sigma_n = \mathbb{E}_n[W_iW_i'] \in \mathbb{R}^{d \times d}$ be its empirical covariance matrix.


Consider mutually orthogonal rotations $$X_{ik}:= c_k'W_i,$$ of the original $W_i$'s, where $$c_\ell'c_k = 0 \text{ for } \ell \neq k \text{ and } c_k'c_k=1 \text{ for each } k.$$

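The $c_k$'s are orthonormal. If, in addition, they are eigenvectors of $\Sigma_n$ (as in the lemma below), then for all $\ell \neq k$:
 $$
  \mathbb{E}_n X_{ik} X_{i\ell} = c_k'\Sigma_n c_\ell = 0.$$
The rotations $X_{ik}$ are then called the principal components of $W_i$.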
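
As a quick numerical check of this orthogonality, the sketch below takes the $c_k$ to be eigenvectors of the empirical covariance matrix of synthetic data (an assumption made for illustration) and verifies that the resulting components are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, correlated, centered data (assumption: stands in for the W_i)
n, d = 1000, 4
W = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
W = W - W.mean(axis=0)

# Empirical covariance Sigma_n = E_n[W_i W_i']
sigma_n = W.T @ W / n

# Orthonormal eigenvectors c_k (columns of C), sorted by decreasing eigenvalue
eigenvalues, C = np.linalg.eigh(sigma_n)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, C = eigenvalues[order], C[:, order]

# Principal components X_ik = c_k' W_i, stacked as an n x d matrix
X = W @ C

# E_n[X_ik X_il] should be diagonal, with the eigenvalues on the diagonal
component_cov = X.T @ X / n
print(np.round(component_cov, 6))
```

The off-diagonal entries vanish up to floating-point error, and the diagonal entries are the variances $\lambda_k$ of the individual components.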

In applications, $W_i$ represent high-dimensional raw features (images, for example), and $X^K_{i} = (X_{ik})_{k=1}^K$  represent a lower-dimensional
encoding or embedding of $W_i$.   


The principal components can be seen as the solution of the least squares problem: minimize
$$
\sum_{j} \mathbb{E}_n  (W_{ij} - \hat W_{ij})^2$$
subject to
$$
\hat W_{ij} :=  a_j' X^K_{i}, \quad X_{ik}:= c_k'W_i.
$$
Therefore they give the variational autoencoder corresponding to linear classes with $\text{loss}(w, v) = \|w - v\|^2$.





**Some Theoretical Details$^*$.** Let $c_1, \ldots, c_d$ denote eigenvectors of $\Sigma$, where the $c_k$ are normalized so that $\|c_k\| = 1$ and ordered so that the corresponding eigenvalues $\lambda_k$, which satisfy $\Sigma c_k = \lambda_k c_k$, are decreasing in $k$.

**Lemma.** The following statements are equivalent.
1. Vectors $c_1, c_2, \ldots, c_d$ are the eigenvectors of $\Sigma$
2. For each $1 \le k \le d$, the subspace $E_k = \mathrm{span}(c_1, \ldots, c_k)$ minimizes $$\mathbb{E}_n\|W - \Pi_S W\|^2$$ over subspaces $S\subset \mathbb{R}^d$ of dimension $k$, where $\Pi_S$ denotes the orthogonal projection onto $S$
3. For each $1 \le k \le d$, the minimum of $$\mathbb{E}_n\|W - ABW\|^2$$ over matrices $A \in\mathbb{R}^{d \times k}$ and $B \in \mathbb{R}^{k \times d}$ is attained at $A = C^k$, $B=(C^k)'$, where $C^k$ is the $d \times k$ matrix whose $j^{\text{th}}$ column is $c_j$. 

To interpret, since the matrices $A$ and $B$ of statement 3 are the same as linear functions between $\mathbb{R}^d$ and $\mathbb{R}^k$, the lemma shows that PCA gives the variational autoencoder corresponding to *linear* classes  with square loss.
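
The lemma can be checked numerically. The following sketch, on synthetic data chosen purely for illustration, compares the projection error of the leading eigenvector subspace with that of randomly drawn subspaces of the same dimension. The helper `projection_error` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic centered data (assumption), with unequal variances across directions
n, d, k = 2000, 6, 2
W = rng.normal(size=(n, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.1])
W = W - W.mean(axis=0)
sigma = W.T @ W / n

eigenvalues, eigenvectors = np.linalg.eigh(sigma)      # returned in ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

def projection_error(W, basis):
    """E_n ||W - Pi_S W||^2 for S spanned by the orthonormal columns of `basis`."""
    residual = W - W @ basis @ basis.T
    return np.mean(np.sum(residual ** 2, axis=1))

C_k = eigenvectors[:, :k]                               # the d x k matrix C^k
pca_error = projection_error(W, C_k)

# Compare against randomly drawn k-dimensional subspaces
random_errors = [
    projection_error(W, np.linalg.qr(rng.normal(size=(d, k)))[0])
    for _ in range(100)
]

print("PCA error:                  ", pca_error)
print("Sum of trailing eigenvalues:", eigenvalues[k:].sum())
print("Best random subspace:       ", min(random_errors))
```

The PCA error coincides with the sum of the discarded eigenvalues $\lambda_{k+1} + \cdots + \lambda_d$, and no random subspace does better.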

## Linear Autoencoders and PCA: Practice

Having just seen that linear autoencoders are the same as PCA, let's do a small sanity check. In particular, let's perform PCA two ways: first using a standard (linear algebra) toolkit, and second as a linear autoencoder using a neural network library.
If all goes well, they should give you the same reconstructions! 

To make it a bit more fun, we will use the [*Labeled Faces in the Wild*](https://www.kaggle.com/jessicali9530/celeba-dataset) dataset, which consists of standardized images of roughly 5,000 celebrities' faces. In this data, PCA amounts to looking for a small number of "proto-faces" such that a linear combination of them can accurately reconstruct any celebrity's face.

In [None]:
# First, let's download and inspect the data!
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(slice_=(slice(70, 198), slice(78, 174)))

# 3D Array "faces.images" contains images as 2d arrays, stacked along dimension 0
n_examples, height, width = faces.images.shape

# 2D Array "design_matrix" encodes each image as a 1d numeric row, as is conventional in statistics
design_matrix = faces.images.reshape((n_examples, -1))

n_features = design_matrix.shape[1]

print(
    "Labeled Faces in the Wild Dataset: \n\
    Number of examples: {}\n\
    Number of features: {}\n\
    Image height: {}\n\
    Image width: {}".format(
        n_examples,
        n_features,
        height,
        width))

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Let's gather all the images corresponding to Clint Eastwood to use as examples

# Make a list (of length one!) of labels corresponding to Clint Eastwood
# Array "faces.target_names" tells us which numeric label (index) corresponds to which person name (value)
clint_labels = np.where(faces.target_names == 'Clint Eastwood')

# Get indices of all images corresponding to this label
# Array "faces.target" tells us which image (index) corresponds to which numeric image labels (value)
clint_pics = np.where(np.isin(faces.target, clint_labels))[0]

# Make a helper function so we can take a look at our target images
def plot_faces(images, n_row=2, n_col=3):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(3.5* n_col, 2.2 * n_row))
    plt.subplots_adjust(0.6, 0.5, 1.5, 1.5)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((height, width)), cmap=plt.cm.gray)
        plt.xticks(())
        plt.yticks(())
    plt.tight_layout()
    plt.show()

# Let's try it out!
plot_faces(
    faces.images[clint_pics[:6],:,:] # first six images of Clint Eastwood appearing in the dataset
)

In [None]:
# 1. Find the first 256 principal components of the dataset using the Scikit-learn library
# For extra fun, you can do so directly using the singular value decomposition (your mileage may vary!)

# We'll use a standard library, which uses linear algebra to compute the principal components.
from sklearn.decomposition import PCA

# There's no need to de-mean the data. Can you explain why?
pca = PCA(n_components=256, svd_solver='randomized').fit(design_matrix)
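
For readers who want to try the singular value decomposition route mentioned in the comments above, here is a minimal NumPy sketch on synthetic data. The matrix `X` is a stand-in for `design_matrix`, and all variable names are assumptions for illustration. It makes the centering step explicit, which library implementations typically handle internally.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for design_matrix (assumption): rows are examples, mean is nonzero
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20)) + 10.0
k = 5

# Center explicitly, then take the thin SVD: X - mean = U diag(s) Vt
mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

components = Vt[:k]                          # k x d, rows are principal directions
explained_variance = s[:k] ** 2 / (X.shape[0] - 1)

# Encode and reconstruct; the mean is added back after decoding
codes = (X - mean) @ components.T            # n x k
X_hat = mean + codes @ components            # n x d

# Cross-check against the eigenvalues of the sample covariance matrix
top_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1][:k]
print(np.allclose(explained_variance, top_eigenvalues))
print(np.sum((X - X_hat) ** 2), np.sum(s[k:] ** 2))
```

The squared singular values, divided by $n-1$, match the leading eigenvalues of the sample covariance, and the total squared reconstruction error equals the sum of the discarded squared singular values.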

In [None]:
# 2. Plot the first 6 "eigenfaces," the six images whose linear span best explains the variation in our dataset
eigenfaces = pca.components_
plot_faces(eigenfaces[:6])

In [None]:
# 3. Plot Clint Eastwood's face (any image will do!) reconstructed using 1, 2, 4, 8, 64, and 256 principal components
face_vector = design_matrix[clint_pics[1]]

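def reconstruct(image_vector, n_components):
  # Center with the PCA mean, project onto the leading components, then add the mean back
  components = eigenfaces[:n_components]
  return pca.mean_ + components.T @ (components @ (image_vector - pca.mean_))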

reconstructions = [reconstruct(face_vector, k) for k in [1,2,4,8,64,256]]
plot_faces(reconstructions)

In [None]:
# 4. Train linear autoencoder with 8, 64, and 256 neurons using Keras (example below has dim. 256)
# 5. Compare reconstructions of Clint Eastwood's face both using MSE and visually

In [None]:
import tensorflow as tf
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense

encoding_dimension = 256
input_image = Input(shape=(n_features,))
encoded = Dense(encoding_dimension, activation='linear')(input_image)
decoded = Dense(n_features, activation='linear')(encoded)

autoencoder = Model(input_image, decoded)

autoencoder.compile(optimizer='adam', loss='mse')

In [None]:
autoencoder.fit(design_matrix, design_matrix,
                epochs=100,
                batch_size=20,
                shuffle=True)

In [None]:
# Compute neural reconstruction
reconstruction = autoencoder.predict(face_vector.reshape(1,-1))

# Do visual comparison
plot_faces([reconstructions[5],reconstruction],n_row=1,n_col=2)

# Do numeric comparison
# We also normalize the black/white gradient to take values in [0,1] (divide by 255)
print('Mean-squared discrepancy: {}'.format(np.mean(np.power((reconstructions[5].T - reconstruction)/255,2))))

Note that the neural-network autoencoder is not doing as well as PCA. Can we improve its performance by changing parameters (training for more epochs, changing the batch size, adding penalization)?
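
To build intuition for the effect of training length, the sketch below trains a linear autoencoder by plain full-batch gradient descent in NumPy on small synthetic data and compares its loss with the PCA benchmark. This is not the Keras model above: the data, the learning rate, the step counts, and the helpers `mse` and `train_linear_autoencoder` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic centered data (assumption): 8 features with decaying variances, randomly rotated
n, d, k = 400, 8, 2
scales = np.array([1.0, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05])
rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
X = (rng.normal(size=(n, d)) * scales) @ rotation.T
X = X - X.mean(axis=0)

def mse(X, A, B):
    """Mean squared reconstruction error of the linear autoencoder x -> A B x."""
    return np.mean(np.sum((X - X @ B.T @ A.T) ** 2, axis=1))

def train_linear_autoencoder(X, k, steps, learning_rate=0.05, seed=0):
    """Full-batch gradient descent on the squared reconstruction loss."""
    init = np.random.default_rng(seed)
    n, d = X.shape
    A = 0.1 * init.normal(size=(d, k))    # decoder weights
    B = 0.1 * init.normal(size=(k, d))    # encoder weights
    for _ in range(steps):
        Z = X @ B.T                       # codes, n x k
        R = Z @ A.T - X                   # residuals, n x d
        grad_A = (2.0 / n) * (R.T @ Z)
        grad_B = (2.0 / n) * (A.T @ R.T @ X)
        A -= learning_rate * grad_A
        B -= learning_rate * grad_B
    return A, B

# PCA benchmark: project onto the top-k right singular vectors
Vt = np.linalg.svd(X, full_matrices=False)[2]
pca_loss = mse(X, Vt[:k].T, Vt[:k])

loss_short = mse(X, *train_linear_autoencoder(X, k, steps=20))
loss_long = mse(X, *train_linear_autoencoder(X, k, steps=5000))
print(pca_loss, loss_short, loss_long)
```

In this toy setting the PCA loss is a lower bound for any linear autoencoder of the same width, so the remaining gap measures how far the optimizer is from convergence rather than any limitation of the model class.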

## Neural Autoencoders

Finally, let's train a nonlinear autoencoder for the same data where the encoder and decoder are neural networks, and we restrict the dimension to be $k=256$ (the `latent_dim` used in the code below).

Visually compare the reconstructions of some of Clint Eastwood's faces. How much better does the convolutional (nonlinear) model perform than the linear one?

The convolutional NNs feature layers that "summarize" information by taking local averages over segments of the image. (For more details, see, e.g., the book by Smola et al.)

**Note: you will want to click ```Edit > Notebook Settings``` and enable a GPU hardware accelerator, otherwise this will take excruciatingly long.**
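
To make the idea of local summaries concrete, here is a minimal NumPy sketch of non-overlapping 2x2 pooling. The helper `pool2x2` is hypothetical and is not a Keras API. The Keras `MaxPool2D(2)` layers used in the model below take the maximum of each window rather than the average. The 64 x 48 image size is an assumption, consistent with the decoder's `Reshape((16, 12, ...))` followed by two stride-2 layers.

```python
import numpy as np

def pool2x2(image, reducer=np.max):
    """Summarize each non-overlapping 2x2 block of a 2D array with `reducer`."""
    h, w = image.shape
    # Drop a trailing odd row/column, then expose each 2x2 block on axes 1 and 3
    blocks = image[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return reducer(blocks, axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(image, np.max))    # max pooling
print(pool2x2(image, np.mean))   # average pooling

# Three rounds of 2x2 pooling shrink an assumed 64 x 48 image to 8 x 6
x = np.zeros((64, 48))
for _ in range(3):
    x = pool2x2(x)
print(x.shape)
```

Each pooling step halves the height and width, which is how the encoder progressively compresses spatial information before the dense layer produces the latent code.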



In [None]:
from tensorflow.keras.layers import Add, Conv2D, Conv2DTranspose, MaxPool2D, Flatten, Reshape, Concatenate, Dropout,GaussianNoise

# 1. Train convolutional autoencoder using Keras
latent_dim = 256

input_image = Input(shape=(height*width,))
y = Reshape((height, width, 1))(input_image)
y = GaussianNoise(0.1)(y, training=True)
x = Conv2D(latent_dim*2, 3, activation="relu",padding='same')(y)
x = MaxPool2D(2)(x)
x = Conv2D(latent_dim*4, 3, activation="relu",padding='same')(x)
x = MaxPool2D(2)(x)
x = Conv2D(latent_dim*4, 3, activation="relu",padding='same')(x)
x = MaxPool2D(2)(x)
x = Flatten()(x)
y = Flatten()(y)
x = Concatenate()([x,y])
encoding = Dense(latent_dim, activation="relu")(x)

z = Dense(16*12*latent_dim*4,activation="relu")(encoding)
# z = Dense(16*12*latent_dim*4,activation="relu")(z)
z = Reshape((16,12,latent_dim*4))(z)
z = Conv2DTranspose(latent_dim*2,3,strides=2, activation="relu", padding='same')(z)
z = Conv2DTranspose(1,3, strides=2, activation="relu", padding='same')(z)
output_image = Reshape((height*width,))(z)
convolutional_autoencoder = Model(input_image, output_image)
convolutional_autoencoder.summary()

# 149.5731

In [None]:
opt = tf.keras.optimizers.Adamax()
convolutional_autoencoder.compile(optimizer=opt, loss='mse')
convolutional_autoencoder.fit(design_matrix, design_matrix,
                epochs=300,
                batch_size=50,
                shuffle=True)

Results


In [None]:
# Compute neural reconstruction
reconstruction = convolutional_autoencoder.predict(face_vector.reshape(1,-1))

# Do visual comparison
plot_faces([reconstructions[5],reconstruction],n_row=1,n_col=2)
print('Mean-squared discrepancy: {}'.format(np.mean(np.power((reconstructions[5].T - reconstruction)/255,2))))