# Neural Network Basics

**Kyle Caverly - June 2020**

A few helpful resources related to Neural Networks along with additional notes.

## Building Blocks of a Neural Network

##### [Concepts and Models: Artificial Neural Nets](https://missinglink.ai/guides/neural-network-concepts/complete-guide-artificial-neural-networks/)

Extensive article that covers the most basic elements of a Neural Network, including:

* Neurons/Perceptrons, Layers, Weights and Activations
* Backpropagation
* Activation Functions
* Bias Neurons
* Hyperparameters

## Activation Functions

##### [7 Types of Activation Functions](https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/)

Helpful overview of the primary activation functions, and the importance of using a non-linear activation layer.

##### [Visualizing Activation Functions](https://dashee87.github.io/deep%20learning/visualising-activation-functions-in-neural-networks/)

Interactive web tool, visualizing both the activation functions and their derivatives.

## Loss Functions

##### [Understanding Softmax & Negative Log Likelihood Loss](https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/)

Simple visualization exploring the function of Softmax probabilities and the negative log likelihood loss calculations, primarily used in multi-class classification problems.
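
As a rough illustration of how Softmax and the negative log likelihood fit together, here is a minimal sketch. The logits are made-up values for a three-class problem, and the function names are just for this example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(probs, target_index):
    """Negative log likelihood of the correct class."""
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]           # hypothetical scores for three classes
probs = softmax(logits)
loss_correct = nll_loss(probs, 0)  # true class is the highest-scoring one
loss_wrong = nll_loss(probs, 2)    # true class is the lowest-scoring one
```

The loss is small when the model puts high probability on the true class and grows quickly as that probability shrinks.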

##### [Kullback-Leibler Divergence Explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)

Kullback-Leibler (KL) Divergence measures the amount of information we lose when we use one distribution to approximate another. It is frequently used to compare the approximate posterior against the prior in models like Variational Autoencoders.
* KL Divergence is heavily related to the idea of information entropy. We can interpret information entropy as "the minimum number of bits it would take us to encode our information." Entropy ultimately provides us with a way to measure the information available in our original distribution.
* KL Divergence measures the expected information loss when an approximating distribution is used in place of a known reference distribution. It equals the cross-entropy between the two distributions minus the entropy of the reference distribution.
* KL Divergence is not a distance metric because it is not symmetric: if you swap which distribution is the reference and which is the approximation, you will generally end up with a different KL Divergence value.
* Optimizing for KL Divergence: You may notice that KL Divergence measures exactly the kind of mismatch we care about when performing classification tasks using a softmax function. However, with Softmax functions, it is most common to optimize using "Cross-Entropy" Loss, not KL Divergence. This is not because KL Divergence is inappropriate in this use case. Mathematically, KL Divergence is simply the Cross-Entropy minus the Original Entropy. As the Original Entropy does not depend on the model's parameters, it is a constant and its gradient is zero. As such, optimizing for KL Divergence is mathematically the same as optimizing for Cross-Entropy loss.
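
The relationships in the notes above can be checked numerically. This is a minimal sketch with two made-up distributions `p` (the reference) and `q` (the approximation); it shows that KL Divergence equals cross-entropy minus entropy, and that swapping the arguments changes the result.

```python
import math

def entropy(p):
    """Entropy of p in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected bits needed to encode samples from p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for using q in place of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical reference distribution
q = [0.8, 0.1, 0.1]    # hypothetical approximation

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)              # differs from kl_pq: not symmetric
gap = cross_entropy(p, q) - entropy(p)   # matches kl_pq
```

Because `entropy(p)` involves only the fixed target distribution, it contributes nothing to the gradient with respect to the model that produces `q`.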

## Optimization

#### Algorithms

##### [Overview of Gradient Descent Optimization Algorithms](https://ruder.io/optimizing-gradient-descent/)


Great explanation specifically surrounding SGD and the challenges associated with different Gradient Descent variants.

Overview of the Three Primary Gradient Descent Variants:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent
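
The three variants differ only in how many examples feed each parameter update. The sketch below assumes a toy one-dimensional linear regression with noise-free data generated from `y = 3x + 1`; the learning rate, epoch count and helper names are illustrative choices, not taken from the article.

```python
import random

def gradient(w, b, batch):
    """Gradient of mean squared error for y ~ w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(data, batch_size, epochs=1000, lr=0.05, seed=0):
    """Run gradient descent, updating once per batch of `batch_size` examples."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[start:start + batch_size])
            w -= lr * gw
            b -= lr * gb
    return w, b

# Toy data from y = 3x + 1, purely for illustration.
data = [(k / 10, 3 * (k / 10) + 1) for k in range(20)]

w_batch, b_batch = train(data, batch_size=len(data))  # batch: one update per epoch
w_sgd, b_sgd = train(data, batch_size=1)              # stochastic: one update per example
w_mini, b_mini = train(data, batch_size=5)            # mini-batch: in between
```

On this easy problem all three settings recover roughly the same line; the trade-offs discussed in the article (update noise, cost per step) only show up on larger, noisier problems.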

##### [Visualization of ANN Training](https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)

Really helpful article that sums up what is going on behind the scenes during Neural Network training.

##### [Andrej Karpathy: Yes You Should Understand Backprop](https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b)

##### [ADAM: A Method for Stochastic Optimization](https://arxiv.org/pdf/1412.6980.pdf)

#### Challenges

##### [Vanishing Gradients](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)

##### [Exploding Gradients](https://machinelearningmastery.com/exploding-gradients-in-neural-networks/)

#### Techniques

##### [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf)

"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout."
* Discusses the value of mini-batches: "Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms."
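
To make "normalizing layer inputs for each training mini-batch" concrete, here is a minimal sketch of the training-time forward pass. The batch values are made up, and `gamma` and `beta` stand for the learned per-feature scale and shift. At inference time the paper uses population statistics instead of batch statistics, which this sketch leaves out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
        for i, v in enumerate(column):
            x_hat = (v - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out

# Two features on very different scales (hypothetical activations).
batch = [[1.0, 200.0], [2.0, 220.0], [3.0, 240.0], [4.0, 260.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

After normalization both features have roughly zero mean and unit variance within the batch, regardless of their original scale.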

##### [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf?utm_source=sciontist.com&utm_medium=refer&utm_campaign=promote)

##### [On the Importance of Initialization and Momentum in Deep Learning](http://proceedings.mlr.press/v28/sutskever13.pdf)

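Paper showing that SGD with momentum, combined with a well-designed random initialization and a slowly increasing momentum schedule, can train deep and recurrent networks to levels of performance previously reached only with Hessian-free optimization.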

## Transfer Learning and Multi-Task Learning

##### [Overcoming Catastrophic Forgetting in Neural Networks](https://arxiv.org/pdf/1612.00796.pdf)

Discusses the ability of neural networks to learn tasks sequentially without forgetting, or losing performance on, previous tasks. Catastrophic forgetting has been widely thought of as an inevitable feature of connectionist models. The authors argue that it is possible to overcome this limitation and maintain expertise on tasks which have not been trained on for a long time.
* Critically, intelligent agents must demonstrate a capacity for continual learning: that is, the ability to learn consecutive tasks without forgetting how to perform previously trained tasks. While this is immediately applicable in areas like reinforcement learning, it also applies to transfer learning and multi-task learning generally.
* **Catastrophic forgetting** is the tendency for knowledge of previously learnt task(s) to be abruptly lost as information relevant to the current task is incorporated. It occurs specifically when the network is trained sequentially on multiple tasks, because the weights in the network that are important for task A are changed to meet the objectives of task B.
* Current approaches have typically ensured that data for all tasks are simultaneously available during training, so that the weights of the network are jointly optimized for performance on all tasks in a multi-task learning setting.
* They discuss a method in which the weights that were important for previous tasks are made stiff, so they resist change as a new task is learned, with the aim of maintaining the previous learning while adapting to the new task. Ultimately, however, it comes down to the capacity of an individual network: learning too many tasks will lead to a decrease in performance across all tasks.
* For context, as of 2020 Google has found that their Machine Translation algorithms can handle 8 language pairs at a specific time, trained inside a multi-task setting before seeing sharp decreases in accuracy across all tasks.
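
The "stiff weights" idea can be sketched as the quadratic penalty used by the paper's elastic weight consolidation (EWC) method: each weight is pulled back toward its value after task A, with a strength given by an importance estimate (the diagonal of the Fisher information in the paper). The numbers below are made up, and the function name is only for this example.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Quadratic penalty for moving weights away from their task-A values.

    fisher[i] is the importance of weight i for task A; lam sets how much
    the old task matters relative to the new one.
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

anchor = [1.0, -2.0]   # hypothetical weights after training on task A
fisher = [10.0, 0.01]  # first weight matters a lot for task A, second barely

moved_important = [1.5, -2.0]    # shift the important weight by 0.5
moved_unimportant = [1.0, -1.5]  # shift the unimportant weight by 0.5

penalty_important = ewc_penalty(moved_important, anchor, fisher, lam=1.0)
penalty_unimportant = ewc_penalty(moved_unimportant, anchor, fisher, lam=1.0)
```

When training on task B, this penalty would be added to the task-B loss, so the same size of change costs far more on weights that task A depends on.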

##### [A Survey on Deep Transfer Learning](https://arxiv.org/pdf/1808.01974.pdf)

"**Transfer learning** relaxes the hypothesis that the training data must be independent and identically distributed (i.i.d.) with the test data, which motivates us to use transfer learning to solve the problem of insufficient training data."
* "Data dependence is one of the most serious problem in deep learning. Deep learning has a very strong dependence on massive training data compared to traditional machine learning methods, because it needs a large amount of data to understand the latent patterns of data."
* In transfer learning, the training data and test data are not required to be i.i.d., and the model in the target domain does not need to be trained from scratch, which can significantly reduce the demand for training data and training time in the target domain.
* Instances-based deep transfer learning uses a specific weight adjustment strategy to select partial instances from the source domain as supplements to the training set in the target domain, assigning appropriate weight values to the selected instances.
* Mapping-based deep transfer learning refers to mapping instances from the source domain and target domain into a new data space. In this new data space, instances from the two domains are similar and suitable for a union deep neural network.
* Network-based deep transfer learning reuses part of a network that is pre-trained in the source domain, including its network structure and connection parameters, as part of a deep neural network used in the target domain. This is the common approach in NLP, and the reason for the success of deep learning based models like BERT.
* Adversarial-based deep transfer learning introduces adversarial technology inspired by generative adversarial nets (GAN) to find transferable representations that are applicable to both the source domain and the target domain. It is based on the assumption that "For effective transfer, good representation should be discriminative for the main learning task and indiscriminate between the source domain and target domain."

## Generalization and Model Capacity

#### Generalization

##### [Improving Neural Networks by Preventing Co-Adaption of Feature Detectors](https://arxiv.org/pdf/1207.0580.pdf)

##### [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.
* "The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets."
* Dropout is a technique that addresses both of these issues (overfitting and generating various model architectures). It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. The term "dropout" refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections. The choice of which units to drop is random.
* Applying dropout to a neural network amounts to sampling a "thinned" network from it. The thinned network consists of all the units that survived dropout. A neural net with n units can be seen as a collection of 2^n possible thinned neural networks. These networks all share weights so that the total number of parameters is still O(n^2), or less. For each presentation of each training case, a new thinned network is sampled and trained. So training a neural network with dropout can be seen as training a collection of 2^n thinned networks with extensive weight sharing, where each thinned network gets trained very rarely, if at all.
* The authors relate dropout to the role of sexual reproduction in evolution, where genes must work well with a random set of other genes. "Similarly, each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes. However, the hidden units within a layer will still learn to do different things from each other."
* "In a standard neural network, the derivative received by each parameter tells it how it should change so that the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units."
* "We found that as a side-effect of doing dropout, the activations of the hidden units become sparse, even when no sparsity inducing regularizers are present. Thus, dropout automatically leads to sparse representations."
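
A minimal sketch of the mechanism described above. It uses the "inverted" variant that is common in practice, which rescales the surviving activations during training instead of shrinking the weights at test time as the paper describes; the two approaches aim for the same expected activation. The names and the drop probability are illustrative.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Randomly zero units during training; pass everything through at test time."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Surviving units are scaled up so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000  # hypothetical hidden-layer activations

train_out = dropout(hidden, drop_prob=0.5, training=True, rng=rng)
test_out = dropout(hidden, drop_prob=0.5, training=False, rng=rng)
```

Each training call samples a different "thinned" network, while the test-time call uses the full network unchanged.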

##### [Understanding Deep Learning Requires Rethinking Generalization](https://arxiv.org/pdf/1611.03530.pdf)

"Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice."
* "Deep neural networks easily fit random labels. More precisely when trained on a completely random labeling of the true data, neural networks achieve 0 training error. In such a scenario, learning is impossible. Thus the effective capacity of neural networks is sufficient for memorizing the entire data set." Furthermore, when trained on random labels, "several properties of the training process for multiple standard architectures is largely unaffected by this transformation of the labels." As in, training was not substantially slowed, and still converged, despite no learnable structure in the data.
* "Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. In contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error."
* Even when explicit regularization techniques like dropout and weight decay are applied, common models are still able to fit random training data extremely well, if not perfectly.

##### [Deep Double Descent: Where Bigger Models and More Data Hurt](https://arxiv.org/abs/1912.02292) [(Blog)](https://openai.com/blog/deep-double-descent/)

"We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better." Heavily relates to the interpolation threshold, overfitting, and model generalization.

* Discusses the **bias-variance** tradeoff in traditional statistical learning: "The idea is that models of higher complexity have lower bias but higher variance. According to this theory, once model complexity passes a certain threshold, models "overfit" with the variance term dominating the test error, and hence from this point onward, increasing model complexity will only decrease performance (i.e. increase test error). Hence conventional wisdom in classical statistics is that, once we pass a certain threshold, "larger models are worse."" However, modern neural networks exhibit no such phenomenon, and the conventional wisdom among practitioners is that "larger models are better".
* Shows that many deep learning settings have two different regimes. In the under-parameterized regime, where the model complexity is small compared to the number of samples, the test error as a function of model complexity follows the U-like behaviour predicted by the classical bias/variance tradeoff. However, once model complexity is sufficiently large to interpolate (i.e. achieve close to zero training error), increasing complexity only decreases test error, following the modern intuition of "bigger models are better".
* Models with more noise often show even greater effects of the Double Descent phenomenon, making the importance of model size even greater, as it can be difficult to identify whether a smaller, noisy model has overfit or is experiencing double descent.



##### [Stiffness: A New Perspective on Generalization in Neural Networks](https://arxiv.org/pdf/1901.09491.pdf)

Develops a perspective on generalization of neural networks by proposing and investigating the concept of neural network stiffness. The authors measure how stiff a network is by looking at how a small gradient step in the network's parameters on one example affects the loss on another example.
* Stiffness captures the resistance of the functional approximation learned to deformation by gradient steps.
* As training begins, the model is often seen as "stiff" within classes, i.e. a gradient step on one example of a class benefits the other members of the same class. However, as the model approaches overfitting, the model becomes less stiff, as a gradient step on one example no longer leads to a consistent improvement on other examples within the same class. This is reasonable, as with overfitting, more detailed and specific features have to be learned, which may not truly be present throughout the entire class.
* They argue that when trained on CIFAR10, stiffness is noticeably higher for the hierarchical super-classes present, which reveals that the network is aware of higher-order semantically meaningful categories to which the images belong.
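
One way to make the stiffness idea concrete is to compare the loss gradients of two examples: if they point in a similar direction, a step that helps one example also helps the other. The sketch below assumes a tiny logistic regression model and uses the cosine between gradients as the stiffness score; the inputs are made up and the helper names are only for this example.

```python
import math

def logistic_gradient(w, x, y):
    """Gradient of the log loss for one example under logistic regression."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def cosine_stiffness(g1, g2):
    """Cosine of the angle between two per-example gradients."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm = math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2))
    return dot / norm if norm > 0 else 0.0

w = [0.0, 0.0]                               # untrained model
g_a = logistic_gradient(w, [1.0, 0.2], 1)
g_b = logistic_gradient(w, [0.9, 0.3], 1)    # same class, similar input
g_c = logistic_gradient(w, [1.0, 0.2], 0)    # same input as a, opposite label

same_class = cosine_stiffness(g_a, g_b)      # close to +1: the examples help each other
opposite = cosine_stiffness(g_a, g_c)        # -1: a step for one hurts the other
```

Averaging such scores over pairs of examples, within or across classes, gives the kind of class-level stiffness discussed in the notes above.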

##### [Do CIFAR-10 Classifiers Generalize to CIFAR-10?](https://arxiv.org/pdf/1806.00451.pdf)

Argues that the impressive accuracy numbers of the best performing models are questionable because the same test sets have been used to select these models for years now. They look to understand the danger of overfitting on test data. While the majority of the paper explores this from a Computer Vision perspective, the ideas broadly are applicable across a variety of problem spaces.
* "Properly evaluating progress in machine learning is subtle. After all, the goal of a learning algorithm is to produce a model that generalizes well to unseen data. Since we usually do not have access to the ground truth data distribution, we instead evaluate a model's performance on a separate test set. This is indeed a principled evaluation protocol, as long as we do not use the test set to select our models. Unfortunately, we typically have limited access to new data from the same distribution. It is now commonly accepted to re-use the same test set multiple times throughout the algorithm and model design process."
* To explore the result of overfitting on a specific test dataset, they test pretrained CIFAR-10 models on a truly new, unseen set of image data which has only a minute distributional shift. They find a large drop in accuracy (4% to 10%), ultimately identifying that even slight distributional shifts are a large area of concern for true generalization.

#### Model Efficiency

##### [OpenAI: Algorithmic Efficiency](https://openai.com/blog/ai-and-efficiency/)

* "Our results suggest that for AI tasks with high levels of investment (researcher time and/or compute) algorithmic efficiency might outpace gains from hardware efficiency (Moore’s Law)."  
* Conclusion that algorithmic efficiency improvements are more valuable than scalable compute in SOTA tasks.
* Interesting note that SOTA tasks are often ones where a process is moving from being infeasible to merely hard to implement.

Overall, this is further support that research shouldn't lean on increasing model size, and should instead focus on sample efficiency (learning from less data) and algorithmic efficiency.

## Confidence, Trust and Explainability in Neural Networks

#### Probability vs. Confidence

##### [The Importance of What We Don't Know](http://mlg.eng.cam.ac.uk/yarin/thesis/1_introduction.pdf)

Deep Learning models are generally seen as deterministic functions; as such, we often only have point estimates of parameters and predictions at hand. This thesis discusses tools which can be leveraged to get a better understanding of uncertainty in deep learning models.

* Define **model expressiveness** as "the complexity of functions a model can capture"
* Intuitively explain out of distribution test data: "For example, given several pictures of dog breeds as training data -- when a user uploads a photo of his dog -- the hypothetical website should return a prediction with rather high confidence. But what should happen if a user uploads a photo of a cat and asks the website to decide on a dog breed? The above is an example of out of distribution test data."
* Given an out of distribution input, a possible desired behaviour of a model would be to return a prediction (attempting to extrapolate far away from our observed data), but with the added information that the point lies outside of the data distribution. As such, we want our model to possess some quantity conveying a high level of uncertainty with such inputs (alternatively, conveying low confidence).
* Discusses how, in classification models, the probability vector obtained at the end of the pipeline (the softmax output) is often erroneously interpreted as model confidence, even though a model can be uncertain in its predictions even with a high softmax output.

##### [On Calibration of Modern Neural Networks](https://arxiv.org/pdf/1706.04599.pdf)

Paper discussing confidence calibration (the gap between probability and model accuracy), and how different training strategies and post training calibration methods like temperature scaling help.
* "Specifically, a network should provide a calibrated confidence measure in addition to its prediction. In other words, the probability associated with the predicted class label should reflect its ground truth correctness likelihood."
* "While neural networks today are undoubtedly more accurate than they were a decade ago, we discover with great surprise that modern neural networks are no longer well-calibrated."
* They measure calibration with Expected Calibration Error (ECE), which bins predictions into equally spaced confidence bins and takes a weighted average of the bins' accuracy/confidence difference.
* "Although increasing depth and width may reduce classification error, we observe that these increases negatively affect model calibration... ECE Metric grows substantially with model capacity."
* "We do observe that models trained with Batch Normalization tend to be more miscalibrated."
* "Model calibration continues to improve when more regularization is added, well after the point of achieving optimal accuracy."
* Discuss the disconnect between NLL and accuracy, which occurs because neural networks can overfit to NLL without overfitting to the 0/1 loss. This phenomenon provides a concrete explanation of miscalibration: the network learns better classification accuracy at the expense of well-modeled probabilities. Argue that overfitting may manifest itself in probabilistic error rather than classification error.
* "Our most important discovery is the surprising effectiveness of temperature scaling despite its remarkable simplicity. Temperature scaling outperforms all other methods on vision tasks, and performs comparably to other methods on the NLP dataset."
* Temperature scaling is a post-training process in which a single scalar T is learned by minimizing NLL on a validation set. Dividing the logits by T "softens" the softmax without affecting model accuracy.
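
ECE and temperature scaling from the notes above can be sketched together. The validation set here is made up: ten two-class examples where the model always outputs the same overconfident logits but is right only 70% of the time. The temperature is picked by a simple grid search over candidate values, which is a stand-in for the optimizer used in the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

def mean_nll(logit_rows, labels, temperature):
    """Average negative log likelihood of the true labels."""
    losses = [-math.log(softmax(z, temperature)[y]) for z, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels):
    """Grid search for the T that minimizes validation NLL (assumed range 0.5 to 5.0)."""
    candidates = [0.5 + 0.05 * k for k in range(91)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

def evaluate(logit_rows, labels, temperature):
    probs = [softmax(z, temperature) for z in logit_rows]
    preds = [p.index(max(p)) for p in probs]
    confs = [max(p) for p in probs]
    correct = [int(pred == y) for pred, y in zip(preds, labels)]
    return preds, expected_calibration_error(confs, correct)

val_logits = [[4.0, 0.0]] * 10  # hypothetical overconfident logits
val_labels = [0] * 7 + [1] * 3  # the model is right on 7 of 10

preds_before, ece_before = evaluate(val_logits, val_labels, 1.0)
best_t = fit_temperature(val_logits, val_labels)
preds_after, ece_after = evaluate(val_logits, val_labels, best_t)
```

The fitted temperature is greater than 1, the predicted classes stay the same, and the gap between confidence and accuracy shrinks.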

##### [Evidential Deep Learning to Quantify Classification Uncertainty](https://arxiv.org/pdf/1806.01768.pdf) ([Blog](https://towardsdatascience.com/softmax-and-uncertainty-c8450ea7e064))

Discussion of the drawbacks of Softmax as a means of producing a probability vector. Proposes a method that slightly augments the loss function so that the network forms opinions and can withhold classification when it is not confident.
* Outlines how, with basic data augmentation, a classifier on MNIST data can have 100% confidence for an incorrect guess.
* Discusses how a neural network is capable of forming opinions for classification tasks, thus quantifying the uncertainty surrounding multi-class predictions.
* They argue that such a method performs better on out-of-distribution samples and is more robust to adversarial attacks.

#### Explainability & Trust

##### [LIME: Local Interpretable Model-Agnostic Explanations](https://www.oreilly.com/content/introduction-to-local-interpretable-model-agnostic-explanations-lime/)

Discussion of LIME, a leading technique for explaining the predictions of any machine learning classifier. It fits an interpretable approximation of the black-box model locally around a specific prediction, which can be used to explore the model and build trust.

##### [Shapley Values](https://christophm.github.io/interpretable-ml-book/shapley.html)

A comprehensive resource discussing Shapley values, commonly used to assess feature importance and to interpret ML models.

## Pruning, Lottery Tickets and Distillation

#### Pruning & Lottery Tickets

##### [What's Hidden in a Randomly Weighted Neural Network?](https://arxiv.org/abs/1911.13299)

Interesting work exploring successful sub-networks found in a random initialization of a neural network.

"Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever modifying the weight values."

##### [The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/pdf/1803.03635.pdf)

Really interesting work exploring model pruning, initialization and the selection of sub-networks inside Neural Networks. The authors found that if you randomly initialize a network, train it to convergence, prune it, and then "rewind" by resetting the surviving sub-network to its original randomly initialized weights and retraining, it is possible to match or surpass the accuracy of the full network. This further supports the general consensus that neural networks are aggressively overparameterized, with model performance still dependent on initialization to a certain extent.
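
The prune-and-rewind step can be sketched on a flat list of weights. This is a simplified illustration that assumes magnitude-based pruning of the trained weights; the weight values are made up, and the training and retraining loops are left out.

```python
def magnitude_mask(trained_weights, prune_fraction):
    """Return a 0/1 mask that drops the smallest-magnitude trained weights."""
    n_prune = int(len(trained_weights) * prune_fraction)
    order = sorted(range(len(trained_weights)), key=lambda i: abs(trained_weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained_weights))]

def rewind(initial_weights, mask):
    """Reset surviving weights to their original initialization; zero the rest."""
    return [w * m for w, m in zip(initial_weights, mask)]

initial = [0.3, -0.1, 0.05, -0.4, 0.2, 0.01]  # hypothetical random initialization
trained = [0.9, -0.02, 0.6, -1.1, 0.03, 0.5]  # hypothetical weights after training

mask = magnitude_mask(trained, prune_fraction=0.5)
ticket = rewind(initial, mask)  # the sub-network that would then be retrained
```

Note that the mask is chosen from the trained weights, while the values that survive come from the initialization.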

##### [The Lottery Ticket Hypothesis: A Survey](https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/)

An outstanding summary article walking through developments related to the Lottery Ticket Hypothesis and weight pruning in general.

#### Distillation

##### [What is Knowledge Distillation](https://nervanasystems.github.io/distiller/knowledge_distillation.html)

## Neural Networks in Unsupervised/Semi-Supervised Settings

#### Variational Autoencoders

##### [An Introduction to Variational Auto-Encoders](https://arxiv.org/pdf/1906.02691.pdf)

"Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models. In this work, we provide an introduction to variational autoencoders and some important extensions."
* "One major division in machine learning is generative versus discriminative modeling. While in discriminative modeling one aims to learn a predictor given the observations, in generative modeling one aims to solve the more general problem of learning a joint distribution over all the variables. A generative model simulates how the data is generated in the real world."
* "Generative modeling can be useful more generally. One can think of it as an auxiliary task. For instance, predicting the immediate future may help us build useful abstractions of the world that can be used for multiple prediction tasks downstream. This quest for disentangled, semantically meaningful, statistically independent and causal factors of variation in data is generally known as unsupervised representation learning, and the variational autoencoder (VAE) has been extensively employed for that purpose."
* "The VAE can be viewed as two coupled, but independently parameterized models: the encoder or recognition model, and the decoder or generative model. These two models support each other. The recognition model delivers to the generative model an approximation to its posterior over latent random variables, which it needs to update its parameters inside an iteration of "expectation maximization" learning. Reversely, the generative model is a scaffolding of sorts for the recognition model to learn meaningful representations of the data, including possible class-labels. The recognition model is the approximate inverse of the generative model according to Bayes rule."
* "The most common criterion for probabilistic models (including VAE) is maximum log-likelihood. As we will explain, maximization of the log-likelihood criterion is equivalent to minimization of a Kullback Leibler divergence between the observed data and model distributions."
* Discuss KL annealing and posterior collapse: "In our work, we found that stochastic optimization with the unmodified lower bound objective can get stuck in an undesirable stable equilibrium. At the start of training, the likelihood term log p(x|z) is relatively weak, such that an initially attractive state is where q(z|x) = p(z), resulting in a stable equilibrium from which it is difficult to escape. The solution proposed is to use an optimization schedule where the weights of the latent cost Dkl is slowly annealed from 0 to 1 over many epochs."
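
The annealing schedule in the last note can be sketched as a weight on the KL term that grows from 0 to 1. A linear ramp over a fixed number of warm-up epochs is assumed here; the quote only says the weight is "slowly annealed", so the exact shape and the function names are illustrative.

```python
def kl_weight(epoch, warmup_epochs):
    """Linearly increase the KL weight from 0 to 1, then hold it at 1."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)

def annealed_objective(reconstruction_loss, kl_term, epoch, warmup_epochs):
    """Training loss with the KL term down-weighted early in training."""
    return reconstruction_loss + kl_weight(epoch, warmup_epochs) * kl_term

schedule = [kl_weight(e, warmup_epochs=10) for e in range(15)]
```

Early on the model is trained almost purely on reconstruction, which gives the likelihood term time to become informative before the KL term starts pulling q(z|x) toward the prior.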

##### [Balancing Reconstruction Error and Kullback-Leibler Divergence in Variational Autoencoders](https://arxiv.org/pdf/2002.07514.pdf)

"In the loss function of a Variational Autoencoder there is a well known tension between two components: the reconstruction loss, improving the quality of the resulting images, and the Kullback-Leibler divergence, acting as a regularizer of the latent space." This paper explores this balance and explores solutions.
* The loss function of a VAE is composed of two parts: one is just the log-likelihood of the reconstruction, while the second is a term aimed at enforcing a known prior distribution on the latent space, typically a spherical normal distribution. Technically, this is achieved by minimizing the KL Divergence.
* Log-likelihood and KL-Divergence are typically balanced by a suitable beta parameter, since they have somewhat contrasting effects: the former will try to improve the quality of the reconstruction, neglecting the shape of the latent space; on the other side, KL-Divergence normalizes and smooths the latent space, possibly at the cost of some additional "overlapping" between latent variables, eventually resulting in a more noisy encoding.
* If not properly tuned, KL-Divergence can also easily induce a sub-optimal use of network capacity, where only a limited number of latent variables are exploited for generation; this is the so called overpruning/variable-collapse/sparsity phenomenon.
* Tuning down beta typically reduces the number of collapsed variables and improves the quality of reconstructed samples. However, this may not result in a better quality of generated samples, since we lose control over the shape of the latent space, which becomes harder for a random generator to exploit.
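
The two-part loss can be written out for the common case of a diagonal Gaussian encoder and a standard normal prior, where the KL term has a closed form. This is a sketch under those assumptions; the reconstruction loss is passed in as a number, and the latent statistics are made up.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def beta_vae_loss(reconstruction_loss, mu, log_var, beta):
    """Reconstruction term plus a beta-weighted KL regularizer."""
    return reconstruction_loss + beta * gaussian_kl(mu, log_var)

mu, log_var = [1.0, -0.5], [-1.0, 0.2]   # hypothetical encoder outputs for one input

kl_active = gaussian_kl(mu, log_var)              # positive: the latent code carries information
kl_collapsed = gaussian_kl([0.0, 0.0], [0.0, 0.0])  # zero: the posterior equals the prior

loss_high_beta = beta_vae_loss(3.0, mu, log_var, beta=1.0)
loss_low_beta = beta_vae_loss(3.0, mu, log_var, beta=0.1)
```

A latent variable whose posterior matches the prior contributes nothing to the KL term, which is why a heavily weighted KL term can push unused variables into collapse.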

##### [Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing](https://arxiv.org/abs/1903.10145)

##### [Diagnosing and Enhancing VAE Models](https://arxiv.org/abs/1903.05789)

##### [On the Importance of the Kullback-Leibler Divergence Term in Variational Autoencoders for Text Generation](https://www.aclweb.org/anthology/D19-5612.pdf)

"Variational Autoencoders (VAEs) are known to suffer from learning uninformative latent representation of the input due to issues such as approximated posterior collapse, or entanglement of the latent space. We impose an explicit constraint on the KL Divergence term inside the VAE objective function. While the explicit constraint naturally avoids posterior collapse, we use it to further understand the significance of the KL term in controlling the information transmitted through the VAE channel. Within this framework, we explore different properties of the estimated posterior distribution, and highlight the trade-off between the amount of information encoded in a latent code during training, and the generative capacity of the model."