# Random Text Conditioning Notes

Given some input text to condition on, the algorithm currently produces audio samples that are nearly identical to one another, regardless of the input latent vector 'z'. This is likely a form of mode collapse caused by feeding the generator data that contains discontinuities (text input is discrete). Because we have a limited, fixed set of descriptions for sound effects, the text embeddings built from them form a discrete set of conditioning variables. When these are fed into the generator, it converges to a single example it can use to fool the discriminator, regardless of the input latent vector.

To solve this, we need to feed continuous data to the generator. We can do so by sampling the conditioning vector from a Gaussian distribution, N(mu(t), sigma(t)), where mu(t) and sigma(t) are functions of the original text embedding t. Note that at generation time we should use the provided text embedding directly, without re-sampling from the distribution. Since we don't know in advance what good functions for mu(t) and sigma(t) would be, we can simply have the network learn them, using a fully connected dense layer that outputs both.
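The resampling step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the training code: `resample_condition`, its arguments, and the toy values are hypothetical, and the mean and log standard deviation are passed in directly rather than produced by a learned dense layer. At generation time the sketch returns the mean with no noise added, which is one reading of "without re-sampling".

```python
import math
import random


def resample_condition(mean, log_sigma, training, rng=random):
    """Return a conditioning vector built from a per-dimension Gaussian.

    mean, log_sigma: lists of floats (hypothetical stand-ins for the
    outputs of the learned dense layer).
    training: if True, draw c = mean + sigma * epsilon; if False,
    return the mean unchanged so generation is deterministic.
    """
    if not training:
        return list(mean)
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mean, log_sigma)]


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
log_sigma = [-1.0, -1.0, -1.0]

# Two training-time draws differ, so the generator sees a continuous
# neighbourhood around each text embedding instead of a single point.
print(resample_condition(mean, log_sigma, training=True, rng=rng))
print(resample_condition(mean, log_sigma, training=True, rng=rng))

# At generation time no noise is added.
print(resample_condition(mean, log_sigma, training=False))
```

Parameterising the spread as a log standard deviation keeps sigma positive without constraining the layer's output.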

## Relevant Code from StackGAN implementation

```python
# g-net
def generate_condition(self, c_var):
    # Dense layer maps the text embedding to 2 * ef_dim units.
    conditions =\
        (pt.wrap(c_var).
         flatten().
         custom_fully_connected(self.ef_dim * 2).
         apply(leaky_rectify, leakiness=0.2))
    # First half is the mean, second half the log standard deviation.
    mean = conditions[:, :self.ef_dim]
    log_sigma = conditions[:, self.ef_dim:]
    return [mean, log_sigma]
```

```python
def sample_encoded_context(self, embeddings):
    '''Helper function for init_opt'''
    c_mean_logsigma = self.model.generate_condition(embeddings)
    mean = c_mean_logsigma[0]
    if cfg.TRAIN.COND_AUGMENTATION:
        # epsilon = tf.random_normal(tf.shape(mean))
        epsilon = tf.truncated_normal(tf.shape(mean))
        stddev = tf.exp(c_mean_logsigma[1])
        c = mean + stddev * epsilon

        kl_loss = KL_loss(c_mean_logsigma[0], c_mean_logsigma[1])
    else:
        c = mean
        kl_loss = 0

    return c, cfg.TRAIN.COEFF.KL * kl_loss

# cfg.TRAIN.COEFF.KL = 2
```

```python
# reduce_mean also averages over the embedding dimension
def KL_loss(mu, log_sigma):
    with tf.name_scope("KL_divergence"):
        loss = -log_sigma + .5 * (-1 + tf.exp(2. * log_sigma) + tf.square(mu))
        loss = tf.reduce_mean(loss)
        return loss
```
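To get a feel for what the KL term penalises, the same per-element expression can be evaluated by hand on a few toy inputs. The sketch below is a plain-Python restatement of `KL_loss` for illustration only; the function name `kl_to_standard_normal` and the input values are hypothetical. It shows that the penalty is zero when the conditioning distribution is exactly N(0, 1) and grows when the mean drifts or when sigma shrinks towards zero, which is the case that would bring back near-discrete conditioning.

```python
import math


def kl_to_standard_normal(mu, log_sigma):
    """Mean over dimensions of KL(N(mu, sigma^2) || N(0, 1)).

    Uses the same per-element expression as KL_loss:
    -log_sigma + 0.5 * (sigma^2 + mu^2 - 1).
    """
    terms = [-ls + 0.5 * (-1.0 + math.exp(2.0 * ls) + m * m)
             for m, ls in zip(mu, log_sigma)]
    return sum(terms) / len(terms)


# Zero when the conditioning distribution is exactly N(0, 1).
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))    # 0.0
# Grows when the mean drifts away from 0 ...
print(kl_to_standard_normal([1.0, 1.0], [0.0, 0.0]))    # 0.5
# ... and when sigma collapses towards 0 (log_sigma very negative).
print(kl_to_standard_normal([0.0, 0.0], [-3.0, -3.0]))
```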

After applying the resampling to the text embedding, we start to get some decent output after 10,000 steps. However, the generator still ignores the latent 'z' vector and generates all of its output directly from the input context. Part of the problem is that the text embedding currently occupies over two thirds of the generator's input space (z vector dims = 100, text embedding dims = 256). Since the text embedding is also highly correlated with the output that the discriminator expects, the generator tends to rely heavily on it. The generator therefore learns to ignore 'z', because using 'z' to create variety doesn't help it fool the discriminator. To mitigate this over-reliance on the text embedding, dropout is added to the layer that produces the embedding in both the discriminator and the generator. The size of the text embedding is also reduced to 128 to bring it closer in size to the latent 'z' vector.
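The effect of the two changes can be illustrated with a small sketch. It is a rough illustration only: the helper names, the dropout rate of 0.5, and the toy vectors are assumptions, and dropout is applied here to the output of the embedding layer, which is just one simple placement. The first helper computes how much of the generator input the embedding occupies; the second applies inverted dropout so the generator cannot count on every embedding dimension being present.

```python
import random


def embedding_share(z_dim, embed_dim):
    """Fraction of the generator input taken up by the text embedding."""
    return embed_dim / (z_dim + embed_dim)


def dropout(values, rate, rng=random):
    """Inverted dropout: zero each element with probability `rate` and
    rescale the survivors so the expected value is unchanged."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def generator_input(z, embedding, rate, rng=random):
    """Concatenate z with a dropped-out embedding (training time only)."""
    return list(z) + dropout(embedding, rate, rng)


print(round(embedding_share(100, 256), 3))  # 0.719
print(round(embedding_share(100, 128), 3))  # 0.561

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]
embedding = [1.0] * 8
# z passes through untouched; embedding entries are either 0.0 or 2.0.
print(generator_input(z, embedding, rate=0.5, rng=rng))
```

Even at 128 dimensions the embedding still makes up more than half of the input, so the size reduction narrows the imbalance rather than removing it.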

Another approach to mitigating this issue would be to save the text embeddings, along with their re-sampled versions, from previous training runs and occasionally replay them during training. It is important that the generator uses the saved re-sampled embedding as is, without re-sampling it again. The generator then has to rely on the latent 'z' vector (which is not saved) in order to fool the discriminator a second time.
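One way this replay idea could look is sketched below. This is an untested design sketch rather than something from the training code: `ConditionReplayBuffer`, `next_generator_input`, the buffer capacity, and the replay probability are all hypothetical, and for brevity only the re-sampled conditioning vectors are stored, not the original text embeddings.

```python
import random


class ConditionReplayBuffer:
    """Fixed-size store of already re-sampled conditioning vectors."""

    def __init__(self, capacity, rng=random):
        self.capacity = capacity
        self.items = []
        self.rng = rng

    def add(self, condition):
        # Store a copy; overwrite a random slot once the buffer is full.
        if len(self.items) < self.capacity:
            self.items.append(list(condition))
        else:
            self.items[self.rng.randrange(self.capacity)] = list(condition)

    def sample(self):
        return list(self.rng.choice(self.items))


def next_generator_input(fresh_condition, buffer, z_dim, replay_prob, rng=random):
    """Pair a freshly drawn z with either a new or a replayed condition."""
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]  # z is never stored
    if buffer.items and rng.random() < replay_prob:
        condition = buffer.sample()  # used as is, no second re-sampling
    else:
        condition = list(fresh_condition)
        buffer.add(condition)
    return z, condition


rng = random.Random(0)
buffer = ConditionReplayBuffer(capacity=1000, rng=rng)
# fresh_condition stands in for an embedding that was already re-sampled.
z, c = next_generator_input([0.3, -0.7], buffer, z_dim=100,
                            replay_prob=0.1, rng=rng)
```

Because a replayed condition is identical to one the discriminator has already seen, any variety in the output for that condition has to come from the new z.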