# Random Text Conditioning Notes

Given some input text to condition on, the algorithm currently produces audio samples that are very similar to each other, regardless of the input latent vector 'z'. This is likely a form of mode collapse caused by feeding the generator data that contains discontinuities (text input is discrete). Because we have a limited, fixed set of descriptions for sound effects, the text embeddings built from them form a discrete set of conditioning variables. When these are fed into the generator, it converges to a single example it can use to fool the discriminator, regardless of the input latent vector.

To solve this, we need to feed continuous data to the generator. We can do this by sampling the conditioning vector from a Gaussian distribution, N(u(t), sigma(t)), where the mean u(t) and the standard deviation sigma(t) are functions of the original text embedding t. Note that at generation time, we should use the provided text embedding directly, without re-sampling from the distribution. Since we don't know in advance what good functions for u(t) and sigma(t) would be, we can simply have the model learn them by using a fully connected (dense) layer that outputs u(t) and sigma(t).
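
The sampling step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the project's implementation: the helper names (`init_dense`, `condition_stats`, `sample_condition`), the dimensions, and the weight initialization are all hypothetical, and the dense layer is left untrained. At generation time the sketch returns the predicted mean with no noise added, which is one reading of "without re-sampling" and mirrors the non-augmented branch of the StackGAN code below.

```python
import numpy as np

def init_dense(in_dim, out_dim, rng):
    """Hypothetical helper: small random weights for a single dense layer."""
    w = rng.normal(0.0, 0.02, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return w, b

def condition_stats(embedding, w, b, cond_dim):
    """Map text embeddings (batch, in_dim) to a mean and a log-sigma, each (batch, cond_dim)."""
    h = embedding @ w + b
    mean = h[:, :cond_dim]        # first half of the layer output
    log_sigma = h[:, cond_dim:]   # second half of the layer output
    return mean, log_sigma

def sample_condition(mean, log_sigma, rng, train=True):
    """Training: c = mean + sigma * eps with eps ~ N(0, I). Otherwise: the mean, no noise."""
    if not train:
        return mean
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
embed_dim, cond_dim = 16, 4  # illustrative sizes only
w, b = init_dense(embed_dim, 2 * cond_dim, rng)

# The same text embedding repeated three times, as with a small fixed set of descriptions.
one_embedding = rng.standard_normal((1, embed_dim))
batch = np.repeat(one_embedding, 3, axis=0)

mean, log_sigma = condition_stats(batch, w, b, cond_dim)
c_train = sample_condition(mean, log_sigma, rng, train=True)   # differs per row
c_eval = sample_condition(mean, log_sigma, rng, train=False)   # identical rows
```

Even though all three rows start from the same embedding, `c_train` gives the generator a different conditioning vector for each row, which is the continuity the paragraph above asks for.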

## Relevant Code from StackGAN implementation

# g-net
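def generate_condition(self, c_var):
    # One fully connected layer maps the text embedding to 2 * ef_dim values.
    conditions =\
        (pt.wrap(c_var).
         flatten().
         custom_fully_connected(self.ef_dim * 2).
         apply(leaky_rectify, leakiness=0.2))
    # First half is the mean, second half is the log standard deviation.
    mean = conditions[:, :self.ef_dim]
    log_sigma = conditions[:, self.ef_dim:]
    return [mean, log_sigma]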

def sample_encoded_context(self, embeddings):
    '''Helper function for init_opt'''
    # Predict the mean and log standard deviation from the text embeddings.
    c_mean_logsigma = self.model.generate_condition(embeddings)
    mean = c_mean_logsigma[0]
    if cfg.TRAIN.COND_AUGMENTATION:
        # Sample c = mean + stddev * epsilon, with truncated normal noise.
        # epsilon = tf.random_normal(tf.shape(mean))
        epsilon = tf.truncated_normal(tf.shape(mean))
        stddev = tf.exp(c_mean_logsigma[1])
        c = mean + stddev * epsilon

        kl_loss = KL_loss(c_mean_logsigma[0], c_mean_logsigma[1])
    else:
        # No augmentation: condition on the mean only, with no KL penalty.
        c = mean
        kl_loss = 0

    return c, cfg.TRAIN.COEFF.KL * kl_loss
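
The body of `KL_loss` is not shown in these notes. A common choice for this kind of regularizer is the KL divergence between the predicted Gaussian N(u, sigma^2) and a standard normal N(0, I), which keeps sigma from shrinking to zero. The NumPy sketch below assumes that form; the function name `kl_to_standard_normal` is hypothetical, and averaging over all entries (rather than summing over dimensions) is also an assumption.

```python
import numpy as np

def kl_to_standard_normal(mean, log_sigma):
    """KL( N(mean, sigma^2) || N(0, I) ), averaged over batch entries and dimensions.

    Per dimension: -log_sigma + 0.5 * (sigma^2 + mean^2 - 1), with sigma = exp(log_sigma).
    """
    per_dim = -log_sigma + 0.5 * (np.exp(2.0 * log_sigma) + mean ** 2 - 1.0)
    return float(np.mean(per_dim))

# A predicted distribution that already equals N(0, I) incurs no penalty.
kl_zero = kl_to_standard_normal(np.zeros((2, 4)), np.zeros((2, 4)))

# Shifting the mean away from zero is penalized.
kl_shifted = kl_to_standard_normal(np.ones((2, 4)), np.zeros((2, 4)))

# So is collapsing sigma toward zero, which would undo the augmentation.
kl_narrow = kl_to_standard_normal(np.zeros((2, 4)), np.full((2, 4), -3.0))
```

With a penalty of this shape, the coefficient `cfg.TRAIN.COEFF.KL` trades off how closely the conditioning distribution stays near a standard normal against how much information about the text it carries.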