---
layout: post
title:  "Autoencoders in molecular simulations"
date:   2023-04-07 10:14:54 +0700
categories: DeepLearning
---

# Introduction

An autoencoder is a neural network architecture for unsupervised learning that maps its input onto itself. Say we have a set of unlabeled training examples $$ \{ x^{(1)}, x^{(2)}, \ldots \} $$; an autoencoder aims for the transformation $$ \hat{y}^{(i)} = x^{(i)} $$. This sounds trivial at first, but we usually make the hidden layer smaller than the input (a bigger hidden layer is also possible, as with the denoising autoencoders discussed later). During training, this bottleneck (a limit on representational power) forces the network to learn patterns in the input, so that it can compress the input into a latent representation z and then reconstruct it with as little information loss as possible. The hypothesis function is simple:

$$ h_{\theta} (x) \approx x $$

It is the identity function, and the network learns to approximate it. For example, let the input x be the pixel intensity values of a 28x28 image (784 pixels), so n = 784, and let the hidden layer have 300 units. The output layer has 784 units as well. Since there are only 300 units in the middle, the network must learn the correlations among pixels (the structure) and reconstruct the 784-pixel input x from that structure.
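The 784 → 300 → 784 shape described above can be sketched in a few lines of NumPy. This is only an untrained forward pass under assumptions that are not in the post: the sigmoid activation, the weight initialisation, and the names `W_enc`, `W_dec`, `encode`, and `decode` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_input, n_hidden = 784, 300  # 28x28 pixels, 300-unit bottleneck

# Randomly initialised (untrained) weights; training would adjust these
# so that decode(encode(x)) gets close to x.
W_enc = rng.normal(0.0, 0.01, size=(n_hidden, n_input))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(0.0, 0.01, size=(n_input, n_hidden))
b_dec = np.zeros(n_input)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def encode(x):
    """Compress a 784-vector into a 300-vector."""
    return sigmoid(W_enc @ x + b_enc)


def decode(z):
    """Expand a 300-vector back into a 784-vector."""
    return sigmoid(W_dec @ z + b_dec)


x = rng.random(n_input)  # stand-in for a flattened image with values in [0, 1]
z = encode(x)
x_hat = decode(z)

# The quantity that training would minimise.
reconstruction_error = np.mean((x - x_hat) ** 2)
print(z.shape, x_hat.shape, reconstruction_error)
```

Because the weights are random, the reconstruction error here is large; the point is only to show how the data changes shape as it passes through the bottleneck.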

The space of compressed representations is called the latent space. The network has two parts: an encoder that compresses the input and a decoder that expands the pattern back to its original size and content. Encoding is like doing a PCA: it reduces the dimension of the input. Mathematically, for a linear autoencoder described by a matrix U, the encoder computes $$ y = U^T x $$ and the decoder computes $$ \hat{x} = U y $$, so $$ \hat{x} = U U^T x $$. In other words, x is transformed by $$ U^T $$ into y and then transformed by U back into an approximation of x. During training, the autoencoder learns to minimize the distance between the input and the output (its reconstruction): $$ \min \mid x - \hat{x} \mid $$, which is $$ \min_U \mid x - U U^T x \mid $$. Autoencoders can have different architectures, including simple feedforward networks, convolutional networks for images, and recurrent networks for sequential data. Several variations exist, such as denoising autoencoders, sparse autoencoders, variational autoencoders (VAEs), and adversarial autoencoders (AAEs).
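The PCA analogy can be checked numerically. The sketch below builds toy data (an assumption, not data from this post), takes the top principal directions from an SVD as the columns of U, and then encodes with $$ U^T $$ and decodes with U.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5 dimensions that lie close to a 2-D subspace.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)  # PCA assumes centred data

# Rows of Vt are the principal directions; keep the top k as columns of U.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U = Vt[:k].T  # shape (5, k), orthonormal columns

Y = X @ U        # encode every row: y = U^T x
X_hat = Y @ U.T  # decode every row: x_hat = U y

rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(Y.shape, rel_error)
```

With k equal to the true subspace dimension, the relative error is only as large as the added noise. A nonlinear autoencoder generalises this idea by replacing the two matrix multiplications with learned nonlinear maps.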

Denoising autoencoders add noise to the input data during training and learn to reconstruct the original data from the noisy inputs. This prevents the trivial solution of simply copying the input to the output (a risk for an overcomplete autoencoder, whose hidden layer is bigger than the input). They work well for tasks such as removing noise from images. VAEs are probabilistic models that learn to generate new data points from the learned latent space. AAEs use adversarial training to push the distribution of the latent codes toward a chosen prior distribution.
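The denoising setup differs from a plain autoencoder only in what is fed in and what the output is compared against. A minimal sketch, assuming Gaussian pixel noise and an MSE loss (the helper names `corrupt` and `denoising_loss` and the noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)


def corrupt(x, noise_std=0.3):
    """Add Gaussian noise and clip back to the valid pixel range [0, 1]."""
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    return np.clip(noisy, 0.0, 1.0)


def denoising_loss(model, x_clean):
    """Feed the corrupted input, but score against the clean original."""
    x_noisy = corrupt(x_clean)
    x_hat = model(x_noisy)
    return np.mean((x_hat - x_clean) ** 2)


x_clean = rng.random((4, 784))  # stand-in batch of flattened images


def identity(x):
    """The 'copy the input' shortcut that denoising is meant to rule out."""
    return x


loss_identity = denoising_loss(identity, x_clean)
print(loss_identity)
```

Copying the input no longer gives zero loss, because the copy still contains the noise. The network has to learn the structure of clean data to do better.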

# Variational autoencoder

A variational autoencoder follows the usual autoencoder architecture, with an encoder and a decoder. The layer in the middle, however, is probabilistic. Instead of producing a single vector representation for the input, the encoder outputs a mean $$ \mu $$ and a standard deviation $$ \sigma $$ of a Gaussian distribution. A coding is then sampled randomly from the Gaussian with that mean and standard deviation, and the decoder decodes the sampled vector as usual. During training, the cost function pushes the distribution of the codings in the latent space toward a simple Gaussian distribution. This is what makes generation possible: to generate a new instance, we just sample a random point from that Gaussian distribution and decode it.
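The sampling step can be written as $$ z = \mu + \sigma \epsilon $$ with $$ \epsilon \sim N(0, 1) $$, a form commonly called the reparameterization trick. The sketch below assumes the encoder outputs the log-variance rather than $$ \sigma $$ directly, which is a common convention but not something stated in this post; `sample_latent` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(mu, log_var):
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


# Pretend the encoder returned these values for one input.
mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 4.0]))  # i.e. sigma = 0.5 and 2.0

samples = np.stack([sample_latent(mu, log_var) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

The empirical mean and standard deviation of the samples match the encoder's outputs. Writing the sample this way keeps the randomness in `eps`, so gradients can flow through `mu` and `sigma` during training.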

The cost function has two parts. The first is the usual reconstruction loss (for example MSE, the mean squared error). The second is the Kullback-Leibler (KL) divergence between the distribution of the codings produced by the encoder and the target Gaussian distribution.

The first part is to ensure that the output resembles the input:

$$ min \mid x - \hat{x} \mid^{2} $$

The second part minimizes the distance between the encoder's latent distribution $$ q(z \mid x) = N(\mu, \sigma^2) $$ and a target Gaussian, usually the standard normal $$ N(0, I) $$, measured by the KL divergence:

$$ KL(q(z \mid x) || N(0, I)) $$
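When the target Gaussian is the standard normal $$ N(0, I) $$ (the common choice, assumed here) and the encoder outputs a diagonal Gaussian, the KL term has the closed form $$ -\frac{1}{2} \sum_j (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2) $$. A minimal sketch of the two-part loss follows; the function name `vae_loss` and the equal weighting of the two terms are illustrative assumptions.

```python
import numpy as np


def vae_loss(x, x_hat, mu, log_var):
    """Return (total, reconstruction, KL), each averaged over the batch.

    x, x_hat:    arrays of shape (batch, n_features)
    mu, log_var: arrays of shape (batch, n_latent)
    """
    # Part 1: squared reconstruction error, summed over features.
    recon = np.sum((x - x_hat) ** 2, axis=1)
    # Part 2: KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)


x = np.array([[0.0, 1.0, 0.5]])
x_hat = np.array([[0.1, 0.9, 0.5]])

# Encoder output already equal to N(0, I): the KL term vanishes.
total_a, recon_a, kl_a = vae_loss(x, x_hat, np.zeros((1, 2)), np.zeros((1, 2)))

# Shifting one latent mean to 1 costs 0.5 * 1^2 = 0.5 in KL.
total_b, recon_b, kl_b = vae_loss(
    x, x_hat, np.array([[1.0, 0.0]]), np.zeros((1, 2))
)
print(recon_a, kl_a, kl_b)
```

The reconstruction term pulls each coding toward whatever helps rebuild its input, while the KL term pulls all codings toward the same standard Gaussian. Training balances the two.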

The resulting decoder can act as a generator: take a random vector z sampled from the Gaussian and decode it into an image. We can also do image interpolation: take two points in the latent space and decode the points along the line between them, which yields images that change gradually from one to the other.
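Interpolation itself is just a straight line in latent space. The sketch below assumes a 2-D latent space and uses an untrained stand-in decoder (`W_dec` and `decode` are hypothetical placeholders for a trained VAE decoder), so the decoded frames are not meaningful images; it only shows the mechanics.

```python
import numpy as np


def interpolate_latent(z_start, z_end, n_steps):
    """Return n_steps points on the straight line from z_start to z_end."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - t) * z_start + t * z_end for t in ts])


rng = np.random.default_rng(0)
W_dec = rng.normal(size=(784, 2))  # hypothetical, untrained decoder weights


def decode(z):
    """Stand-in decoder: maps (batch, 2) latent points to (batch, 784) pixels."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec.T)))


z_a = np.array([-1.0, 0.5])
z_b = np.array([2.0, -1.5])

path = interpolate_latent(z_a, z_b, n_steps=8)
frames = decode(path).reshape(-1, 28, 28)  # one 28x28 image per step
print(path.shape, frames.shape)
```

With a trained decoder, plotting `frames` side by side gives a smooth transition from the first image to the second.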

# Code example
![interpolation](https://user-images.githubusercontent.com/7457301/230882269-00fa231e-a634-4fb3-9049-4c5da264725f.png)
![reconstructed](https://user-images.githubusercontent.com/7457301/230882274-794a0c39-c474-4a60-9cf3-460121ce9eff.png)

# VAE in molecular simulation
<img width="605" alt="Screen Shot 2023-04-10 at 17 35 23" src="https://user-images.githubusercontent.com/7457301/230885884-56fed235-d70d-43b9-804f-e60c54a457b2.png">

In physics, coarse-graining (CG) is a technique that simplifies the particle representation by grouping selected atoms into pseudo-beads. This greatly reduces the cost of simulation and thereby facilitates the study of chemical space (for example, protein folding). Some information is lost in the process. If we want to study interactions at the atomic level, we need the opposite process, called backmapping: restoring fine-grained (FG) coordinates from CG coordinates, that is, adding the lost atomistic details back into the CG representation. Backmapping is difficult for several reasons. First, it is inherently random: many FG configurations map to the same CG representation, so the reverse generative map is one-to-many. Second, there is a geometric consistency requirement: the FG coordinates should abide by the CG geometry, in the sense that they can be reduced back to the same CG structure. Relatedly, the CG transformation is equivariant, so the FG transformation should be equivariant as well. Third, mapping protocols differ: there are different dimensionality reduction choices for CG, depending on the tradeoff between accuracy and speed, so a general backmapping approach should be robust across CG mapping choices.
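The first two difficulties can be made concrete with a toy example. The sketch assumes one particular CG mapping, in which each bead sits at the mass-weighted center of its atoms, and it interprets equivariance as equivariance under rotations and translations. The six-atom "molecule", its masses, and the bead assignment are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy molecule: 6 atoms grouped into 2 beads of 3 atoms each.
fg_coords = rng.normal(size=(6, 3))
masses = np.array([12.0, 1.0, 1.0, 16.0, 1.0, 1.0])
assignment = np.array([0, 0, 0, 1, 1, 1])  # bead index of each atom
n_beads = 2


def coarse_grain(coords, masses, assignment, n_beads):
    """Map FG coordinates to CG beads as mass-weighted centers."""
    weights = np.zeros((n_beads, len(masses)))
    weights[assignment, np.arange(len(masses))] = masses
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ coords


cg_coords = coarse_grain(fg_coords, masses, assignment, n_beads)

# One-to-many: move two equal-mass atoms of bead 0 in opposite directions.
# The FG structure changes, but the bead position does not.
fg_alt = fg_coords.copy()
d = np.array([0.3, -0.2, 0.1])
fg_alt[1] += d
fg_alt[2] -= d
cg_alt = coarse_grain(fg_alt, masses, assignment, n_beads)

# Equivariance: rotating and translating the atoms moves the beads the same way.
theta = 0.7
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
shift = np.array([1.0, -2.0, 0.5])
cg_of_moved = coarse_grain(fg_coords @ R.T + shift, masses, assignment, n_beads)
moved_cg = cg_coords @ R.T + shift

print(np.allclose(cg_coords, cg_alt), np.allclose(cg_of_moved, moved_cg))
```

A backmapping model has to respect both facts: it should be able to produce either `fg_coords` or `fg_alt` from the same beads, and whatever it produces should coarse-grain back to the beads it was given.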

<img width="607" alt="Screen Shot 2023-04-10 at 17 49 07" src="https://user-images.githubusercontent.com/7457301/230887881-9dd37fa9-c05f-4643-8799-4e635cb779e2.png">
Image: CG conversion and back (FG)

Recently, researchers have made significant progress on this task by using machine learning for generation, specifically the CGVAE (Coarse-Graining Variational Auto-Encoder) model. This approach can approximate atomistic coordinates from coarse-grained structures quite well. Mathematically, the problem is to model the conditional distribution $$ p(x \mid X) $$, where x denotes the FG molecular structure and X the CG structure. The distribution over x is treated as latent: it is hidden, but it can be approximated through a similar, tractable distribution. The molecular geometry is parametrized, and the geometric equivariance requirement is built into the backmapping function. These are novel improvements in the field.



