# Methods for preventing Posterior Collapse in Sentence Variational Autoencoders

## Introduction

This notebook studies the phenomenon of posterior collapse in sentence variational autoencoders, and ways to counter it. A formal introduction to the nature of this phenomenon can be found in the attached paper (or in the GitHub repository). This notebook provides a high-level overview of the experimental design of that paper.

The notebook consists of three stages:
 - Training
 - Testing (Inference)
 - Analysis
 
Furthermore, the notebook relies on various implementations which were abstracted into their own modules. These dependencies can be split into a number of items:
- `models/`: The model definitions of the VAE and the RNNLM, defined using PyTorch, and factory functions which create them from a config
- `trainers/`: Model-specific trainers, where each trainer module defines three things: the complete training loop, what training on a single batch looks like, and which results can be extracted from it.
- `tools/`: Model-agnostic helper objects that are instrumental in setting up, such as results storage, tokenizers, and data preprocessing
- `inference/`: Evaluation-specific logic, such as evaluating individual model-types, or using a trained model to predict next words
- `utils`: Simple functional utilities
- `config`: Configuration schema which defines the various parameters needed in this pipeline
- `metrics`: Custom loss functions such as the ELBO definition, and calculating the perplexity from negative log likelihood
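
As an illustration of the last item, the relation between negative log-likelihood and perplexity can be sketched as follows. This is a minimal sketch that assumes the NLL is summed over tokens and measured in nats; the function name `perplexity_from_nll` is hypothetical, and the repository's `metrics` module may normalise differently.

```python
import math


def perplexity_from_nll(total_nll: float, total_tokens: int) -> float:
    """Perplexity is the exponential of the average per-token NLL (in nats)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return math.exp(total_nll / total_tokens)


# A model that gives every token probability 1/100 should have perplexity ~100.
nll = -math.log(1 / 100) * 50  # NLL summed over 50 tokens
print(perplexity_from_nll(nll, 50))
```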

The results of this experiment will be stored in `results/`. These results include tabular n-iteration records of training and m-iteration records of validation. Events for tensorboard are also written here. Saved models are stored in a separate folder.

## Setting up

The assumption is that users who run this notebook have the following libraries installed:
- PyTorch (1.4.0)
- nltk
- pandas
- numpy
- tensorboard (though, roughly similar analysis is shown below using pandas)

In [2]:
# Pytorch-specific imports
import torch
import torch.nn as nn

# Import our config and utils
import utils
from config import Config

We parametrize each experimental run using a mutable `config` object. The idea is to define a generic config at the start and, when running multiple experiments within the notebook, to __clone__ the config and mutate the appropriate properties.

In [3]:
config = Config(
    run_label='Notebook-training', #custom label assigned to training
    batch_size=64,
    vae_latent_size=16,
    embedding_size=256,
    rnn_hidden_size=256,
    vae_encoder_hidden_size=320,
    vae_decoder_hidden_size=320,
    vocab_size=10000,
    
    # Hyperparameters
    param_wdropout_k=[0, 0.5, 1], #list of potential parameters for word-dropout
    mu_force_beta_param=[0, 2, 3, 5, 10], #list of potential parameters for mu-forcing
    freebits_param=[-1, 0.25, 0.5, 1, 2, 8], #list of potential parameters for freebits
    
    #Booleans we define when automatically running the notebook / script
    will_train_rnn=True, 
    will_train_vae=True,
    will_grid_search=False, # Set to true to run through all possible parameters
    
    nr_epochs=5,
    
    # Various paths
    results_path = 'results',
    train_path = '/data/02-21.10way.clean',
    valid_path = '/data/22.auto.clean',
    test_path  = '/data/23.auto.clean',
    device = 'cuda' if torch.cuda.is_available() else 'cpu',
    
    # Defining intervals for inside the training loops
    validate_every = 300,
    print_every = 100,
    train_text_gen_every = 3000
)
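
The clone-and-mutate pattern described above can be sketched in isolation. `ToyConfig` and `clone_with` are hypothetical stand-ins and not part of the repository; the point is only that a deep copy keeps the base config (including its list-valued fields) untouched.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyConfig:
    # Hypothetical stand-in for the notebook's Config object
    run_label: str = "demo"
    freebits_param: float = 0.0
    search_values: List[float] = field(default_factory=lambda: [0.25, 0.5])


def clone_with(config, **overrides):
    """Return a deep copy of `config` with the given attributes overwritten."""
    clone = deepcopy(config)
    for name, value in overrides.items():
        if not hasattr(clone, name):
            raise AttributeError(f"unknown config field: {name}")
        setattr(clone, name, value)
    return clone


base = ToyConfig()
run = clone_with(base, freebits_param=2.0)
run.search_values.append(1.0)  # only the clone's list changes

print(base.freebits_param, run.freebits_param)
print(base.search_values, run.search_values)
```

A shallow copy would share `search_values` between both configs, which is why `deepcopy` is used here.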

## Training the models

For training, we will use the Penn Treebank dataset. We have defined a custom dataset class which loads this into data loaders. All models will use PyTorch's default implementation of Adam as the optimization algorithm (with an `lr` of `0.001`).

Note: we assume you have stored three fragments of this dataset at the locations given by `config.train_path`, `config.valid_path` and `config.test_path`.

In [4]:
from tools.customdata import CustomData
cd = CustomData(config)
train_loader = cd.get_data_loader(type="train", shuffle=True)
valid_loader = cd.get_data_loader(type='valid', shuffle=False)
print(f"Length of training data: {len(train_loader)}")
print(f"Length of validation data: {len(valid_loader)}")

Init Custom Dataset
Data loaders are ready
Created a train data loader. Shuffle = True, Batch Size = 64
Created a valid data loader. Shuffle = False, Batch Size = 64
Length of training data: 623
Length of validation data: 27


### RNN: The Baseline

For the baseline, we define an RNN with hyperparameters taken from Bowman et al. These parameters are fixed and set in the config (`vocab_size`, `rnn_hidden_size` and `embedding_size`). The creation of the RNN is abstracted into its own function, which relies only on this config. The RNN will be trained (if the config's `will_train_rnn` allows it) with cross-entropy loss as the criterion to minimize.

In [4]:
from models.make_rnnlm import make_rnnlm

rnn_lm = make_rnnlm(config, trained=False)

criterion = nn.CrossEntropyLoss(
    ignore_index=0,
    reduction='sum'
)
optim = torch.optim.Adam(rnn_lm.parameters(), lr=0.001)
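
To make the effect of `ignore_index=0` and `reduction='sum'` concrete, the following dependency-free sketch reproduces the same arithmetic by hand: positions whose target equals the ignored index contribute nothing, and the remaining per-token losses are summed. That index 0 is the padding token is an assumption here, and `summed_cross_entropy` is a hypothetical helper, not the PyTorch implementation.

```python
import math


def summed_cross_entropy(logits, targets, ignore_index=0):
    """Sum of -log softmax(logits)[target] over targets != ignore_index."""
    total = 0.0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # assumed padding position: skipped entirely
        max_logit = max(row)  # subtract the max for numerical stability
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_norm - row[target]
    return total


# Three positions, vocabulary of 4, uniform logits; the last target is padding.
logits = [[0.0] * 4 for _ in range(3)]
targets = [2, 3, 0]
loss = summed_cross_entropy(logits, targets)
print(loss, 2 * math.log(4))  # two counted positions, each contributing log(4)
```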

For the training loop, we define a `ResultsWriter` as an abstraction over Tensorboard's SummaryWriter and Pandas, which will store results both in tensorboard, as well as save results in a CSV file, in a subdirectory under results corresponding to `config.run_label` + `rnn`.

In [5]:
from tools.results_writer import ResultsWriter
import trainers.trainer_rnn
from trainers.trainer_rnn import train_rnn

rnn_results_writer = ResultsWriter(label=f'{config.run_label}--rnn')

if config.will_train_rnn:
    train_rnn(
        rnn_lm,
        optim,
        train_loader,
        valid_loader,
        config=config,
        nr_epochs=config.nr_epochs,
        device=config.device,
        results_writer=rnn_results_writer
    )
    
# We should not forget to close the writer once we are done
rnn_results_writer.tensorboard_writer.close()

NameError: name 'rnn_lm' is not defined

### Training Sentence VAEs

In this section, we demonstrate posterior collapse. One rather basic way is to simply train a Sentence VAE and pay attention to its KL loss. As the model proceeds through its iterations, the KL gets closer to 0 (thus pushing the posterior towards the prior).
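
For intuition about why a KL near 0 signals collapse, here is a minimal sketch of the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. It assumes the encoder outputs a mean and a log-variance per latent dimension, which is a common choice but not confirmed for this repository; `kl_to_standard_normal` is a hypothetical name.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Per-dimension KL( N(mu, exp(log_var)) || N(0, 1) ), in nats."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, log_var)]


# A collapsed posterior equals the prior: mean 0 and unit variance everywhere.
collapsed = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# An informative posterior moves its mean and/or shrinks its variance.
informative = kl_to_standard_normal([1.5, -0.5], [-1.0, 0.0])

print(sum(collapsed), sum(informative))
```

When the KL is exactly 0 for every input, the latent code is drawn from the same distribution regardless of the sentence, so the decoder cannot use it.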

In [18]:
from copy import deepcopy
from tools.results_writer import ResultsWriter

from models.make_vae import make_vae
vanilla_params = {
    'free_bits_param': 0,
    'mu_force_beta_param': 0,
    'param_wdropout_k': 1
}

# Clone config to not mutate original, set the parameters in there
vanilla_run_config = deepcopy(config)
vanilla_run_config.freebits_param = vanilla_params['free_bits_param']
vanilla_run_config.mu_force_beta_param = vanilla_params['mu_force_beta_param']
vanilla_run_config.param_wdropout_k = vanilla_params['param_wdropout_k']

# Create VAE model, its optimizer
vae = make_vae(vanilla_run_config).to(vanilla_run_config.device)
vae_optim = torch.optim.Adam(vae.parameters(), lr=0.001)

# Create a results-writer
vae_results_writer = ResultsWriter(label=f'{vanilla_run_config.run_label}--vae')

Additionally, prior to running, we create a 'decoding' tool that transforms the word indices the VAE greedily selected during inference back into the original source language (English in this case). This decoding tool uses the same tokenizer as the CustomData to reverse-map the sampled indices to their original words. The resulting text is shown in this notebook and stored in tensorboard.

Note: the predicted sentence is generated up until the final padding tag. Cutting off the padding during generation was not finalized, so expect the predicted and true sentences to differ in length.
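
The reverse mapping can be illustrated with a toy vocabulary. This is only a sketch: `decode_ids`, the vocabulary, and the choice of 0 as padding index are assumptions, and the trailing-padding trim shown here is one possible way to handle the cut-off, not the repository's behaviour.

```python
def decode_ids(token_ids, index_to_word, pad_id=0):
    """Map token ids back to words, dropping any trailing padding ids."""
    end = len(token_ids)
    while end > 0 and token_ids[end - 1] == pad_id:
        end -= 1
    return " ".join(index_to_word.get(i, "[UNK]") for i in token_ids[:end])


# Hypothetical toy vocabulary
vocab = {0: "[PAD]", 1: "the", 2: "chairman", 3: "proposed"}
print(decode_ids([1, 2, 3, 0, 0], vocab))  # the chairman proposed
```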

In [19]:
from trainers.trainer_vae import train_vae

# Decoding tool (different from a neural network decoder: samples from categorical distribution, 
# and reverses back to original word using tokenizer and its vocab.)
sentence_decoder = utils.make_sentence_decoder(cd.tokenizer, 1)

if vanilla_run_config.will_train_vae:
    train_vae(
        vae,
        vae_optim,
        train_loader,
        valid_loader,
        nr_epochs=vanilla_run_config.nr_epochs,
        device=vanilla_run_config.device,
        results_writer=vae_results_writer,
        config=vanilla_run_config,
        decoder=sentence_decoder,
    )

vae_results_writer.tensorboard_writer.close()

Iteration: 0 || NLLL: 229.60691833496094 || Perp: 7065.600082313294 || KL Loss: 5.269302845001221 || MuLoss: 0.0 || Total: 234.876220703125
VAE is generating sentences on 0: 

	 The true sentence is: "and both the trust and manville are seeking to avoid the bad publicity that , in the asbestos era , [UNK] the manville name ." 

	 The predicted sentence is: "ge salmonella realistic complain styles reopen onerous conform repaid eurocom thinking jitters latin harvard 9.6 marketer trusts singled ldp sponsored ig edison looked cool stevens telerate campus contras change murdoch vice 50 fare wachovia eggs republics yamaichi pennsylvania dressed 38 high-grade ship manhattan filing fluor equal fundamentals dominated rubles proof bethlehem placed drove cleanup suggest target suez seize werner worrying ufo estimates catholic bush" 

Iteration: 100 || NLLL: 143.69952392578125 || Perp: 277.2543991298171 || KL Loss: 0.031154032796621323 || MuLoss: 0.0 || Total: 143.73068237304688
Iteration: 200 || 

### Countering the Posterior collapse: Training three methods

As one might notice, the KL is very low (about 0.03 at iteration 100). This suggests that the approximate posterior carries almost no information about the input and has collapsed onto the standard Gaussian prior. We previously ran a grid search over all candidate values. To keep things brief, we define here only the three best setups we found for our three main hyperparameters. More details about these can be found in the paper.
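
For readers unfamiliar with the two main methods, the sketch below shows commonly used formulations of free bits and mu-forcing. Both functions are assumptions about the general technique, not the code in this repository's `metrics` module: free bits puts a floor under the KL of each latent dimension, and mu-forcing adds a hinge penalty when the batch-averaged squared norm of the posterior means is too small.

```python
def free_bits_kl(kl_per_dim, free_bits):
    """Sum the per-dimension KL, counting each dimension as at least `free_bits`."""
    return sum(max(free_bits, kl) for kl in kl_per_dim)


def mu_forcing_loss(batch_mu, beta):
    """Hinge penalty max(0, beta - mean_over_batch(mu . mu / 2))."""
    half_sq_norms = [sum(m * m for m in mu) / 2.0 for mu in batch_mu]
    return max(0.0, beta - sum(half_sq_norms) / len(half_sq_norms))


# A fully collapsed posterior with 16 latent dimensions:
collapsed_kl = [0.0] * 16
print(free_bits_kl(collapsed_kl, free_bits=2.0))        # 32.0
print(mu_forcing_loss([[0.0] * 16] * 2, beta=2.0))      # 2.0
```

Under this assumed formulation, a latent size of 16 with a free-bits value of 2 gives a floor of 32, which is consistent with the `KL Loss: 32.0` logged at iteration 0 of the free-bits run in this notebook.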

Note: word dropout was not found to improve results, hence it is set to 1 (meaning no word dropout is applied).
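
To clarify what a value of 1 means, here is a minimal sketch of word dropout in which the parameter is interpreted as a keep rate, as the note above implies. The function name, the `unk_id` value and the use of Python's `random` module are illustrative assumptions.

```python
import random


def word_dropout(token_ids, keep_rate, unk_id, rng):
    """Replace each token by `unk_id` with probability 1 - keep_rate."""
    return [tok if rng.random() < keep_rate else unk_id for tok in token_ids]


rng = random.Random(0)
sentence = [5, 6, 7, 8]
print(word_dropout(sentence, keep_rate=1.0, unk_id=3, rng=rng))  # unchanged
print(word_dropout(sentence, keep_rate=0.0, unk_id=3, rng=rng))  # all unknown
```

With a keep rate of 0, the decoder receives no previous words at all and has to rely entirely on the latent code.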

In [22]:
from collections import OrderedDict
import numpy as np
# Subset of the parameter grid, containing the best hyperparameter settings.
# These can be adjusted and played around with. 
param_grid = [
    # Very best setup (validation perplexity of 103)
    {
        'free_bits_param': 2,
        'mu_force_beta_param': 2,
        'param_wdropout_k': 1
    },
    # Best only free-bits (validation perplexity of 102.3)
    {
        'free_bits_param': 2,
        'mu_force_beta_param': 0,
        'param_wdropout_k': 1
    },
    
    # Best only mu-force (validation perplexity of 111)
    {
        'free_bits_param': 0,
        'mu_force_beta_param': 5,
        'param_wdropout_k': 1
    },
]

# However, we could do grid search as well.

# Uncomment if you want to do grid search
# config.will_grid_search = True
if config.will_grid_search:
    potential_params = OrderedDict({
        'free_bits_param': np.hstack([config.freebits_param]),
        'param_wdropout_k': np.hstack([config.param_wdropout_k]),
        'mu_force_beta_param': np.hstack([config.mu_force_beta_param]),
    })
    
    param_grid = utils.make_param_grid(potential_params)
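
The full grid is the Cartesian product of the candidate values. The sketch below shows one way such a grid could be built with `itertools.product`; `make_param_grid_sketch` is a hypothetical stand-in, and the actual `utils.make_param_grid` may differ in ordering or output type.

```python
from itertools import product


def make_param_grid_sketch(potential_params):
    """Return one dict per combination of the candidate values."""
    names = list(potential_params)
    value_lists = [potential_params[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


grid = make_param_grid_sketch({
    'free_bits_param': [-1, 0.25, 0.5, 1, 2, 8],
    'param_wdropout_k': [0, 0.5, 1],
    'mu_force_beta_param': [0, 2, 3, 5, 10],
})
print(len(grid))  # 6 * 3 * 5 = 90 combinations
print(grid[0])
```

With the candidate lists from the config this yields 90 training runs, which is why the notebook defaults to the small hand-picked subset.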

We can then run training for all these parameter settings, using a similar approach as before. The best models are saved in `config.results_path/saved_models`, and the recorded results in our `config.results_path/{run}` directory.

In [24]:
from trainers.trainer_vae import train_vae

for param_setting in param_grid:

    # Copy config, set new variables
    run_config = deepcopy(config)
    run_config.freebits_param = param_setting['free_bits_param']
    run_config.mu_force_beta_param = param_setting['mu_force_beta_param']
    run_config.param_wdropout_k = param_setting['param_wdropout_k']

    vae = make_vae(run_config).to(run_config.device)
    optimizer = torch.optim.Adam(params=vae.parameters())

    # Format to be a compact string representation
    path_to_results = f'{run_config.results_path}/vae'
    params2string = '-'.join([f"{i}:{param_setting[i]}" for i in param_setting.keys()])

    results_writer = ResultsWriter(
        label=f'{run_config.run_label}--vae-{params2string}',
    )

    sentence_decoder = utils.make_sentence_decoder(cd.tokenizer, 1)

    if run_config.will_train_vae:
        print(f"Training params: {params2string}")
        train_vae(
            vae,
            optimizer,
            train_loader,
            valid_loader,
            nr_epochs=run_config.nr_epochs,
            device=run_config.device,
            results_writer=results_writer,
            config=run_config,
            decoder=sentence_decoder,
        )

    results_writer.tensorboard_writer.close()

    print(f"Finished training for {params2string}!!!")

Training params: free_bits_param:2-mu_force_beta_param:2-param_wdropout_k:1
Iteration: 0 || NLLL: 238.51205444335938 || Perp: 7224.250485393984 || KL Loss: 32.0 || MuLoss: 1.9688620567321777 || Total: 272.48089599609375
VAE is generating sentences on 0: 

	 The true sentence is: "british life insurer london & general , which firmed 2 pence to [UNK] pence -lrb- $ [UNK] -rrb- , and composite insurer royal insurance , which finished 13 lower at 475 , were featured in the talk ." 

	 The predicted sentence is: "formula severely dominate scientist authors accuse pricings municipals comedy offensive guinness jacobson teagan distributes preclude prosperity ousted memorandum proceed defaulted eroded relocation ecological shown leaks color command intellectual criminal committees oliver phone pork-barrel institution democratic studies creation reinforcement desirable windsor pitney bourbon arise enthusiasm usage suppliers traffic tiger snow corn delta meets n.j pushing turbines ingersoll cheney

### Intermezzo: Analysis - Best model selections

In this project, the decision was made to vary only the hyperparameters of the methods that counter posterior collapse, and to keep everything else as close as possible to the parameters Bowman et al. chose in their paper.

Based on the parameter grid defined above, we can select our best models using any suitable metric. One such choice is to pick the models that minimize perplexity. To do this, we use __tensorboard__ to explore the trajectories of the models' metrics (such as perplexity and KL). As far as we know, Tensorboard could not be embedded into notebooks at the time of writing, so instead CSVs with the validation performance of each model are stored.

In [None]:
import pandas as pd
from pathlib import Path

validation_csv_paths = list(Path('results/runs').glob('**/valid.csv'))
validation_df = pd.concat([pd.read_csv(path) for path in validation_csv_paths])


In the resulting dataframe, `validation_df`, all results of the runs so far are stored. It is also possible to query for the specific run_label by editing the glob search string, though this might not always be desired as multiple runs could provide a larger scope of which models are interesting.

Using this dataframe, we can naively select the best models by taking the mean across all runs. The intuition is that as this experiment is repeated, models with the same parameters accumulate more measurements, so their mean approximates their actual performance better and becomes less sensitive to evaluation noise. However, due to the scope of this notebook, we run this experiment only once.

In [None]:
mean_performances = validation_df.groupby('label').mean().sort_values('perp_metric')
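
The group-then-sort step can be mirrored without pandas to show what it computes. The labels and perplexities below are made-up illustrative numbers, not results from this experiment, and `rank_by_mean_perplexity` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def rank_by_mean_perplexity(rows):
    """rows: (label, perplexity) pairs; returns (label, mean) sorted ascending."""
    by_label = defaultdict(list)
    for label, perp in rows:
        by_label[label].append(perp)
    means = [(label, mean(values)) for label, values in by_label.items()]
    return sorted(means, key=lambda item: item[1])


# Illustrative values only
rows = [("fb2-mu0", 104.0), ("fb2-mu0", 100.0), ("fb0-mu5", 111.0), ("fb0-mu0", 120.0)]
ranking = rank_by_mean_perplexity(rows)
print(ranking[0])  # the label with the lowest mean perplexity comes first
```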

For the best VAE, we currently make the simple assumption of selecting the model with the lowest perplexity.
This can similarly be done for other selections, such as a model with only free-bits or only mu-force.

In [None]:
best_vae_parameter = mean_performances.iloc[0]

best_muforce_only_parameter = mean_performances.query('freebits_param == 0 and word_dropout == 1').iloc[0]
best_freebits_only_parameter = mean_performances.query('mu_force_beta_param == 0 and word_dropout == 1').iloc[0]

print(f"Best VAE has parameters: {best_vae_parameter.name} \n"
     f"Best VAE with only mu force has parameters: {best_muforce_only_parameter.name} \n"
     f"Best VAE with only freebits has parameters {best_freebits_only_parameter.name}")

Notice how these optimal parameters have higher KL values than the vanilla VAE, whose KL is close to 0. The effect appears stronger for the freebits-only model, which supports the idea that freebits has a much stronger effect on the KL in our implementation than mu-force.

In [None]:
print(best_vae_parameter)
print(best_muforce_only_parameter)
print(best_freebits_only_parameter)

#### Tensorboard based
If you have tensorboard open, you can examine all trained models. We based our decision on the models with the lowest perplexity, which were almost always also the models with the highest KL.

## Testing: Evaluating these selected models on our test-set

Given our selection of models, all that remains is to load them, load our test dataset, and evaluate each model on it while measuring the perplexity.

In [6]:
# Predefined selected best models
best_saved_models = [
    'vae_best_mu0_wd1_fb0.pt', # Vanilla VAE
    'vae_best_mu2_wd1_fb2.pt', # Best in total: Freebits and mu
    'vae_best_mu0_wd1_fb2.pt', # Freebits only
    'vae_best_mu5_wd1_fb0.pt', # Mu only
]

test_loader = cd.get_data_loader(type='test', shuffle=False)
print(f"Length of test data: {len(test_loader)}")

Created a test data loader. Shuffle = False, Batch Size = 64
Length of test data: 38


In [6]:
from metrics import make_elbo_criterion
from models.make_vae import make_vae
from inference.evaluate_vae import evaluate_vae

for model_path in best_saved_models:
    if 'vae' in model_path:
        vae = make_vae(
            config, 
            trained=True, 
            model_path=f'{config.results_path}/saved_models/{model_path}'
        ).to(config.device)

        loss_fn = make_elbo_criterion(config.vocab_size, config.vae_latent_size, -1, 0)

        (test_total_loss, test_total_kl_loss, test_total_nlll, test_total_mu_loss), test_perp = evaluate_vae(
            vae,
            test_loader,
            -1,
            config.device,
            loss_fn,
            0,
            'test'
        )

        print(f'For model {model_path}: \n')
        print(f'Test Results || Elbo loss: {test_total_loss} || KL loss: {test_total_kl_loss} || NLLL {test_total_nlll} || Perp: {test_perp} ||MU loss {test_total_mu_loss}')



For model vae_best_mu2_wd1_fb2.pt: 

Test Results || Elbo loss: 125.95756892154091 || KL loss: 24.57783538416812 || NLLL 101.3797331358257 || Perp: 56.05236310996025 ||MU loss 61.16006439610531
For model vae_best_mu0_wd1_fb2.pt: 

Test Results || Elbo loss: 125.70883118478875 || KL loss: 23.484046333714534 || NLLL 102.22478505184776 || Perp: 58.11954359587994 ||MU loss 70.91709869786312
For model vae_best_mu5_wd1_fb0.pt: 

Test Results || Elbo loss: 107.74100183185779 || KL loss: 0.5465875628747439 || NLLL 107.19441433956749 || Perp: 70.62220585998166 ||MU loss 6.876160006774099


Once the models have been evaluated on the test set, they should not be tuned any further. For further intuition into these results, the paper provides a general overview. To summarize: freebits seems to improve performance the most, while driving the KL up.
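
As a quick sanity check on the printed test results, the snippet below verifies that each reported ELBO loss equals the sum of the reported KL and NLL terms up to floating-point rounding. The numbers are copied from the output above; that the mu loss is therefore not part of the reported total is an inference from these numbers, not something confirmed by the code.

```python
# (ELBO loss, KL loss, NLL) as printed in the test output above
results = {
    'vae_best_mu2_wd1_fb2.pt': (125.95756892154091, 24.57783538416812, 101.3797331358257),
    'vae_best_mu0_wd1_fb2.pt': (125.70883118478875, 23.484046333714534, 102.22478505184776),
    'vae_best_mu5_wd1_fb0.pt': (107.74100183185779, 0.5465875628747439, 107.19441433956749),
}

gaps = {}
for name, (elbo, kl, nll) in results.items():
    gaps[name] = abs(elbo - (kl + nll))
    print(f"{name}: |ELBO - (KL + NLL)| = {gaps[name]:.2e}")
```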

### Test Analysis: Text generation

Now, given these models, the next check is to see how well they perform on text generation. The idea is to give all these models the start of a particular sentence and see how well they continue it. This sampling is done in a greedy fashion.

As a start, a simple sentence is chosen: 'The chairman proposed that next fall a new treaty __is signed__'. The bolded words are to be predicted, but anything goes as long as it makes sense. Each time this is run, the models will try new predictions on the same sentence.
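
Greedy continuation itself can be sketched with a toy model. Everything here is hypothetical: `greedy_continue`, the `</s>` end marker and the hand-written score table stand in for `generate_next_words` and the trained VAE, whose actual interface is not shown in this notebook.

```python
def greedy_continue(prefix, next_token_scores, max_new_tokens, eos="</s>"):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens


# Toy "model": hypothetical scores that depend only on the last token
TOY_SCORES = {
    "treaty": {"is": 0.6, "was": 0.3, "</s>": 0.1},
    "is": {"signed": 0.7, "is": 0.1, "</s>": 0.2},
    "signed": {"</s>": 0.9, "is": 0.1},
}


def toy_scores(tokens):
    return TOY_SCORES.get(tokens[-1], {"</s>": 1.0})


continued = greedy_continue("a new treaty".split(), toy_scores, max_new_tokens=10)
print(" ".join(continued))  # a new treaty is signed
```

Greedy selection is deterministic for fixed scores, so any variation between runs has to come from elsewhere, for example from a freshly sampled latent code (an assumption about the implementation).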

In [13]:
from inference.text_generation import generate_next_words
from models.make_vae import make_vae

best_saved_models = [
    'vae_best_mu0_wd1_fb0.pt', # Vanilla VAE
    'vae_best_mu2_wd1_fb2.pt', # Best in total: Freebits and mu
    'vae_best_mu0_wd1_fb2.pt', # Freebits only
    'vae_best_mu5_wd1_fb0.pt', # Mu only
]

sample_sentence = 'the chairman proposed that next fall a new treaty'

for model_path in best_saved_models:
    vae = make_vae(
        config, 
        trained=True, 
        model_path=f'{config.results_path}/saved_models/{model_path}'
    ).to(config.device)
    
    # Set to only return predictions, not the posterior
    vae.graph_mode = True
    
    sent = generate_next_words(vae, cd, sample_sentence, device=config.device)
    decoded_sent = cd.tokenizer.decode(sent)
    print(f'The model from {model_path} predicts the following: \n'
         f'\t {decoded_sent} \n')


Start of the sentence: the chairman proposed that next fall a new treaty || Max Length 10 .
[1, 9060, 1937, 7154, 9058, 6192, 3743, 585, 6177, 9286]
The model from vae_best_mu0_wd1_fb0.pt predicts the following: 
	 the chairman proposed that next fall a new treaty , for the french electronics , loan rows of 

Start of the sentence: the chairman proposed that next fall a new treaty || Max Length 10 .
[1, 9060, 1937, 7154, 9058, 6192, 3743, 585, 6177, 9286]
The model from vae_best_mu2_wd1_fb2.pt predicts the following: 
	 the chairman proposed that next fall a new treaty has headed to take advantage of permission in maturing 

Start of the sentence: the chairman proposed that next fall a new treaty || Max Length 10 .
[1, 9060, 1937, 7154, 9058, 6192, 3743, 585, 6177, 9286]
The model from vae_best_mu0_wd1_fb2.pt predicts the following: 
	 the chairman proposed that next fall a new treaty in that case , moody 's inc. of the trust 

Start of the sentence: the chairman proposed that next fal

While the quality of the generated sentences varies, the difference between the vanilla VAE and the 'better' models (with higher KL values) is noticeable. The vanilla VAE generally shows no specific knowledge of what the sentence is about ('the chairman proposed that next fall a new treaty , for the french electronics , loan rows of'), whereas the other models start to make sense. Our 'best' model, for instance, produces 'the chairman proposed that next fall a new treaty drove margin ex-dividend over a year or next month', suggesting a semantic topic of formal agreements over a date range, while the model with the most freebits leans towards monetary and policy themes: 'that next fall a new treaty increase against $ 6.25 a pound and that no policy'.