## $\beta$-VAE Model

This notebook experiments with the $\beta$-VAE CV. First we try to understand why the $\beta$ factor is needed to balance regularization against reconstruction. Then we try to produce a good 1D CV for biasing.
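For reference, the objective usually associated with a $\beta$-VAE can be written as below. This is the standard textbook form and only an assumption about what the loss used here computes; the symbols $q_\phi$ (encoder), $p_\theta$ (decoder) and $p(z)$ (prior) are notation introduced for this note.

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[-\log p_\theta(x \mid z)\right] + \beta \, D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
$$

The first term is the reconstruction error and the second is the regularization term. With $\beta = 1$ this reduces to the usual negative ELBO; smaller values of $\beta$ weaken the pull of the posterior towards the prior.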

In [2]:
from deep_cartograph.run import deep_cartograph 
import importlib.resources as resources
from deep_cartograph import data

import logging
import yaml
import os

# Get the path to the data
data_folder = resources.files(data)

# Set logging level
logging.basicConfig(level=logging.INFO)

def run_deep_carto(
    configuration: dict,
    output_folder: str):
    """
    Run the deep_cartograph workflow
    
    Parameters
    ----------
    
    configuration : dict
        Configuration dictionary containing the parameters for the workflow.
    output_folder : str
        Path to the output folder where results will be saved.

    """
    
    # Input trajectory and topology
    
    input_path = f"{data_folder}/protein_1AH7/input"
    traj_path = os.path.join(input_path, 'GaMD_traj.xtc')
    top_path = os.path.join(input_path, 'topology.pdb')

    ################
    # Run workflow #
    ################
    deep_cartograph(
        configuration = configuration,
        trajectory_data = traj_path,
        topology_data = top_path,
        output_folder = output_folder,
        restart = True)

    return




### Test 1: Naive training with $\beta$ = 1

Here we use $\beta = 1$, so the regularization (KL) term enters the loss at full weight, without any damping.

In [None]:
# Input configuration
config_path = f"{data_folder}/protein_1AH7/config.yml"

with open(config_path) as config_file:
    configuration = yaml.load(config_file, Loader = yaml.FullLoader)
    
# Output folder
output_folder = f"{data_folder}/protein_1AH7/output_1"

# Modify annealing parameters
kl_annealing_args = {
    'type': 'linear',
    'start_beta': 1.0,
    'max_beta': 1.0,
}
configuration['train_colvars']['common']['training']['kl_annealing'] = kl_annealing_args
run_deep_carto(configuration, output_folder)

INFO:deep_cartograph:Analyze geometry
INFO:deep_cartograph:Elapsed time (Analyze geometry): 00 h 00 min 00 s
INFO:deep_cartograph:Compute features
INFO:MDAnalysis.core.universe:The attribute(s) types have already been read from the topology file. The guesser will only guess empty values for this attribute, if any exists. To overwrite it by completely guessed values, you can pass the attribute to the force_guess parameter instead of the to_guess one
INFO:MDAnalysis.guesser.base:There is no empty types values. Guesser did not guess any new values for types attribute
INFO:MDAnalysis.core.universe:attribute masses has been guessed successfully.



INFO: 
Detected KeyboardInterrupt, attempting graceful shutdown ...
INFO:lightning.pytorch.utilities.rank_zero:
Detected KeyboardInterrupt, attempting graceful shutdown ...
ERROR:deep_cartograph.tools.train_colvars.cv_calculator:VAE training failed. Error message: name 'exit' is not defined
INFO:deep_cartograph.tools.train_colvars.cv_calculator:Retrying VAE training...
INFO:deep_cartograph.tools.train_colvars.cv_calculator:Model architecture: VariationalAutoEncoderCV(
  (loss_fn): ELBOGaussiansLoss()
  (norm_in): Normalization(in_features=97, out_features=97, mode=mean_std)
  (encoder): FeedForward(
    (nn): Sequential(
      (0): Linear(in_features=97, out_features=32, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): Dropout(p=0.0, inplace=False)
      (3): Linear(in_features=32, out_features=16, bias=True)
      (4): LeakyReLU(negative_slope=0.01, inplace=True)
      (5): Dropout(p=0.0, inplace=False)
    )
  )
  (mean_nn): Linear(in_features=16, out_fea

SystemExit: 1

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


Here we observe what is known as posterior collapse, or KL vanishing. The model reduces the loss without learning a useful representation, simply by driving the KL divergence between the approximate posterior (learned by the encoder) and the standard Gaussian prior (mean 0, variance 1) towards zero. The latent space then loses its ability to represent information specific to the input data: the encoder effectively ignores the input $x$ and outputs the prior. This is why all data points end up clustered around (0,0): the model has found a "lazy" way to satisfy the KL constraint without encoding anything useful.

This is a common situation when training VAEs, and it is influenced by the following factors:

- $\beta$ too high: the regularization term has too much weight early on and the encoder stays trapped learning an uninformative distribution.
- Overly powerful or flexible decoder: if the decoder has too much capacity, it can learn to reconstruct the input data $x$ even when the latent code $z$ contains little to no information.
- Learning rate too high or too low
- Simple or small datasets
- Large batch sizes
- Initialization
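One way to check for posterior collapse numerically, rather than only by looking at the latent-space plot, is to compute the average KL divergence of each latent dimension against the standard Gaussian prior. The sketch below assumes you can obtain the encoder means and log-variances as arrays of shape `(n_samples, n_latent)`; the function names and the `threshold` value are illustrative choices and not part of deep_cartograph.

```python
import numpy as np

def kl_per_dimension(mean, log_var):
    """Batch-averaged KL( N(mean, exp(log_var)) || N(0, 1) ) for each latent dimension."""
    mean = np.asarray(mean, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Closed-form KL between a diagonal Gaussian and the standard Gaussian
    kl = 0.5 * (mean**2 + np.exp(log_var) - 1.0 - log_var)
    return kl.mean(axis=0)

def collapsed_dimensions(mean, log_var, threshold=1e-2):
    """Flag dimensions whose average KL is below a (hypothetical) threshold."""
    return kl_per_dimension(mean, log_var) < threshold

# Synthetic example: dimension 0 is informative, dimension 1 matches the prior exactly
rng = np.random.default_rng(0)
n_samples = 1000
mean = np.stack([rng.normal(0.0, 2.0, n_samples), np.zeros(n_samples)], axis=1)
log_var = np.stack([np.full(n_samples, -2.0), np.zeros(n_samples)], axis=1)

kl = kl_per_dimension(mean, log_var)
flags = collapsed_dimensions(mean, log_var)
print("KL per dimension:", kl)
print("Collapsed:", flags)
```

A dimension whose KL stays close to zero throughout training carries almost no information about the input, which is the situation described above.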


### Test 2: Linear annealing from $\beta$ = 0 to $\beta$ = 0.01

In [None]:
# Input configuration
config_path = f"{data_folder}/protein_1AH7/config.yml"

with open(config_path) as config_file:
    configuration = yaml.load(config_file, Loader = yaml.FullLoader)
    
# Output folder
output_folder = f"{data_folder}/protein_1AH7/output_2"

# Modify annealing parameters
kl_annealing_args = {
    'type': 'linear',
    'start_beta': 0.0,
    'max_beta': 0.001,
    'start_epoch': 1000,
    'n_epochs_anneal': 5000
}
configuration['train_colvars']['common']['training']['kl_annealing'] = kl_annealing_args
run_deep_carto(configuration, output_folder)
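To make the schedule concrete, here is a minimal sketch of how a linear annealing of $\beta$ could behave with the parameters above. It assumes that $\beta$ is held at `start_beta` until `start_epoch` and then increases linearly to `max_beta` over `n_epochs_anneal` epochs; this reading of the `kl_annealing` keys is an assumption, and the function `linear_beta` is hypothetical, not the deep_cartograph implementation.

```python
def linear_beta(epoch, start_beta=0.0, max_beta=0.001,
                start_epoch=1000, n_epochs_anneal=5000):
    """Hypothetical linear KL-annealing schedule (assumed semantics, see text)."""
    # Before the annealing window: keep beta at its starting value
    if epoch < start_epoch:
        return start_beta
    # Degenerate window: jump straight to the final value
    if n_epochs_anneal <= 0:
        return max_beta
    # Fraction of the annealing window completed, capped at 1
    progress = min((epoch - start_epoch) / n_epochs_anneal, 1.0)
    return start_beta + progress * (max_beta - start_beta)

for epoch in (0, 1000, 3500, 6000, 10000):
    print(epoch, linear_beta(epoch))
```

Under these assumptions the model trains as a plain autoencoder for the first 1000 epochs, and the KL term is only introduced gradually afterwards.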

### Test 3: Linear annealing from $\beta$ = 0.00001 to $\beta$ = 0.01

### Test 4: Cyclical annealing
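
As a starting point for this test, the sketch below shows one common form of cyclical schedule: within each cycle $\beta$ ramps linearly from 0 to `max_beta` and then stays there until the cycle restarts. The function `cyclical_beta` and its parameters `cycle_length` and `ramp_fraction` are hypothetical names chosen for illustration; whether deep_cartograph exposes such a schedule, and under which keys, is not assumed here.

```python
def cyclical_beta(epoch, max_beta=0.01, cycle_length=1000, ramp_fraction=0.5):
    """Hypothetical cyclical KL-annealing schedule (illustrative only)."""
    if not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("ramp_fraction must be in (0, 1]")
    # Position within the current cycle, in [0, 1)
    position = (epoch % cycle_length) / cycle_length
    # Linear ramp during the first part of the cycle, constant afterwards
    if position < ramp_fraction:
        return max_beta * position / ramp_fraction
    return max_beta

for epoch in (0, 250, 500, 999, 1000):
    print(epoch, cyclical_beta(epoch))
```

Resetting $\beta$ to zero at the start of each cycle periodically gives the encoder a chance to store information in the latent space before the regularization pressure returns.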