# Unconditional natural language generation: language modeling (recap)

A language model is a probability distribution over sequences of words that models language production: if the set of words is $W$, then for an arbitrary sequence $\mathbf w = \langle w_1,\dots, w_n\rangle$ ($w_i\in W$) it defines a probability $P(\mathbf w)$.

By the chain rule, this probability factorizes as:

$$P(\mathbf w)= P(w_1)\cdot P(w_2 \vert w_1 )\cdot P(w_3\vert w_1, w_2)\cdot\dots\cdot P(w_n\vert w_1,\dots, w_{n-1})$$

This means that to model the language we only need to give the conditional probability of the "continuation", the next word: for a word $w$ and a sequence $\langle w_1,\dots,w_n\rangle$, the probability that the next word will be $w$ is

$$P(w ~\vert ~ w_1,\dots,w_n)$$

There are also character-based models, which take individual characters rather than words as units and model language as a distribution over sequences of characters (think T9...).

## Measurement of performance: Perplexity

A language model $\mathcal M$'s perplexity over the word series $\mathbf w = \langle w_1,\dots, w_n\rangle$ is:

$$\mathbf{PP}_{\mathcal M}(\mathbf w) = \sqrt[n]{\frac{1}{P_{\mathcal M}(\mathbf w)}}$$

Using the chain rule, this can be rewritten as:

$$\mathbf{PP}_{\mathcal M}(\mathbf w) = {\sqrt[n]{\frac{1}{P_{\mathcal M}(w_1)}\cdot \frac{1}{P_{\mathcal M}(w_2 \vert w_1 )}\cdot \frac{1}{P_{\mathcal M}(w_3\vert w_1, w_2)}\cdot\dots\cdot \frac{1}{P_{\mathcal M}(w_n\vert w_1,\dots, w_{n-1})}}}$$

which is exactly the geometric mean of the reciprocals of the conditional probabilities of all words in the corpus.

In the case of a bigram model, this simplifies further to:
$$\mathbf{PP}_{\mathcal M}(\mathbf w) = \sqrt[n]{\frac{1}{P_{\mathcal M}(w_1)}\cdot \frac{1}{P_{\mathcal M}(w_2 \vert w_1 )}\cdot \frac{1}{P_{\mathcal M}(w_3\vert w_2)}\cdot\dots\cdot \frac{1}{P_{\mathcal M}(w_n\vert w_{n-1})}}$$

Taking the logarithm of perplexity we get:

$$
\log \mathbf{PP}_{\mathcal M}(\mathbf w) = -\frac{1}{n}\left(\log P_{\mathcal M}(w_1) + \log P_{\mathcal M}(w_2 \vert w_1 ) + \log P_{\mathcal M}(w_3\vert w_1,w_2)+\dots+ \log P_{\mathcal M}(w_n\vert w_1,\dots, w_{n-1})\right)
$$

which is exactly the average (per-word) negative log-likelihood of the corpus, so minimizing perplexity is equivalent to maximizing the log-likelihood of the data.
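The relation between the two forms of perplexity can be checked numerically. The following is a minimal sketch; the conditional probabilities in `probs` are made-up values standing in for a model's outputs, and `perplexity` is an illustrative helper name.

```python
import math

def perplexity(cond_probs):
    """Perplexity from the model's conditional probability of each word."""
    n = len(cond_probs)
    # Average negative log-probability per word (natural logarithm).
    avg_nll = -sum(math.log(p) for p in cond_probs) / n
    return math.exp(avg_nll)

# Hypothetical values of P(w_i | w_1, ..., w_{i-1}) for a 4-word sequence.
probs = [0.2, 0.5, 0.1, 0.25]

pp = perplexity(probs)
# The same quantity from the n-th root definition.
pp_root = (1.0 / math.prod(probs)) ** (1.0 / len(probs))
```

A model that assigns probability $1/4$ to every word has perplexity 4, which is why perplexity is often read as an "effective branching factor".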

## But what is it good for?
For example:
- Predictive text input ("autocomplete")
- Generating text
- Spell checking
- Language understanding
- And most importantly, representation learning, which we will study in detail in a later lecture

## Generating text with a language model

The language model produces a tree with probable continuations of the text:

<img src="https://4.bp.blogspot.com/-Jjpb7iyB37A/WBZI4ImGQII/AAAAAAAAA9s/ululnUWt2vw9NMKuEr-F9H8tR2LEv36lACLcB/s1600/prefix_probability_tree.png" width=400 heigth=400>

Using this tree we can try different algorithms to search for the best "continuations". A full breadth-first search is usually impossible due to the high branching factor of the tree.

Alternatives:
- "Greedy": we always choose the continuation with the highest immediate probability. This will most probably be suboptimal, since the probability of the full sequence is the product of the probabilities of the continuations, and had we chosen a different path, we might have been able to choose later words with high probabilities.
- Beam search: we always store a fixed number $k$ of partial sequences and try to expand these, always keeping the $k$ most probable of the possible continuations.
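The difference between the two strategies can be seen on a tiny example. This is only a sketch: `TOY_MODEL` is a made-up lookup table standing in for a language model, and greedy search is obtained as beam search with $k=1$.

```python
import math

# Hypothetical toy model: maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, k, length):
    """Keep the k most probable partial sequences at every step."""
    beams = [((), 0.0)]  # (prefix, log-probability of the prefix)
    for _ in range(length):
        candidates = []
        for prefix, logp in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        # Highest log-probability first; keep only the best k expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

greedy = beam_search(TOY_MODEL, k=1, length=2)[0]  # greedy = beam search with k = 1
best = beam_search(TOY_MODEL, k=2, length=2)[0]
```

Here greedy search commits to "a" (0.6) and ends with a sequence of probability 0.33, while a beam of size 2 also keeps "b" and finds the sequence "b x" with probability 0.36.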

Example ($k$=5):

<img src="http://opennmt.net/OpenNMT/img/beam_search.png" width=600 heigth=600>
 

## The "old way": N-gram based solutions

With a _gross_ simplification we assume that the distribution depends only on the previous $n-1$ words (where $n$ is typically $\leq 4$), thus we assume a Markov chain of order $n-1$:

 $$P(w ~\vert ~ w_1,\dots,w_k) = P(w ~\vert ~ w_{k- n + 2},\dots,w_k)$$

We simply compute these probabilities in a frequentist style by calculating the $n$-gram statistics of the corpus at hand:

$$P(w_2 ~\vert ~w_1) = \frac{c(\langle w_1, w_2 \rangle)}{c(w_1)}$$

$$P(w_{k+1} \vert~ w_1,\dots,w_k) = \frac{c(\langle w_1,\dots,w_k, w_{k+1} \rangle)}{c(\langle w_1, \dots, w_k\rangle)}$$

Please note that in this case we are using "memorization", a form of database learning with minimal compression: "counting".
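The counting estimate for bigrams takes only a few lines. This is a minimal sketch on a made-up toy corpus; `bigram_mle` is an illustrative name, and as a small assumption the denominator counts $w_1$ only where it starts a bigram, so that the estimates for a given history sum to one.

```python
from collections import Counter

def bigram_mle(tokens):
    """Return a function P(w2 | w1) estimated from raw bigram counts."""
    history_counts = Counter(tokens[:-1])             # w1 as the start of a bigram
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # c(<w1, w2>)

    def prob(w2, w1):
        if history_counts[w1] == 0:
            return 0.0  # unseen history: the estimate is undefined, 0 is a placeholder
        return bigram_counts[(w1, w2)] / history_counts[w1]

    return prob

corpus = "the cat sat on the mat the cat ran".split()
p = bigram_mle(corpus)
p_cat_given_the = p("cat", "the")   # 2 of the 3 bigrams starting with "the"
p_dog_given_the = p("dog", "the")   # never seen, so the estimate is 0
```

The zero estimate for the unseen bigram is exactly the problem that smoothing has to address.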

But what do we do if the given $n$-grams occur rarely or never? We have to employ some __smoothing__ technique, such as:

##### Additive smoothing
We pretend that we have seen each $n$-gram a fixed number $\delta$ more times than we actually did; in the simplest case, for $n=2$:

$$P(w_2 ~\vert ~w_1) = \frac{c(\langle w_1, w_2 \rangle) + \delta}{\sum_{w\in V} [c(\langle w_1, w\rangle) + \delta]}$$

A widespread choice for $\delta$ is $1$.
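A minimal sketch of the additive smoothing formula above, on a made-up toy corpus (the function name `additive_bigram` and the corpus are illustrative assumptions):

```python
from collections import Counter

def additive_bigram(tokens, delta=1.0):
    """Return (P(w2 | w1) with add-delta smoothing, vocabulary)."""
    vocab = sorted(set(tokens))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(w2, w1):
        numerator = bigram_counts[(w1, w2)] + delta
        # Every vocabulary item receives delta pseudo-counts after w1.
        denominator = sum(bigram_counts[(w1, w)] + delta for w in vocab)
        return numerator / denominator

    return prob, vocab

corpus = "the cat sat on the mat the cat ran".split()
p_add, vocab = additive_bigram(corpus, delta=1.0)
seen = p_add("cat", "the")     # (2 + 1) / (3 + 6)
unseen = p_add("ran", "the")   # (0 + 1) / (3 + 6)
```

Every unseen continuation of "the" gets the same probability $1/9$ here, regardless of how frequent the word itself is.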

The main problem with this kind of smoothing is that, when "supplementing" the data, it does not take into account the frequencies of the shorter $n$-grams that make up the unseen one: e.g., if neither $\langle w_1, w_2 \rangle$ nor $\langle w_1, w_3 \rangle$ occurs in the corpus, it assumes the frequency of both bigrams to be $\delta$, irrespective of the ratio of the frequencies of $w_2$ and $w_3$.
Most smoothing techniques try to accommodate this, e.g., simple interpolation:

##### Interpolation

In the case of bigrams, we add, with a certain weight, the probability coming from the individual frequency of the predicted word:

$$P(w_2 ~\vert ~w_1)_{\mathrm{interp}} = \lambda_1\frac{c(\langle w_1, w_2 \rangle)}{c(w_1)} + (1 - \lambda_1)\frac{c(w_2)}{\sum_{w\in V}c(w)}$$

Recursive solution for arbitrary $k$:

$$P(w_{k+1} \vert~ w_1,\dots,w_k)_\mathrm{interp} = \lambda_k\frac{c(\langle w_1,\dots,w_k, w_{k+1} \rangle)}{c(\langle w_1, \dots, w_k\rangle)} + (1-\lambda_k)P(w_{k+1} \vert~ w_2,\dots,w_k)_\mathrm{interp}$$

$\lambda_k$ is set empirically by examining the corpus, typically with the [Expectation Maximization algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm), which, as we have mentioned, iteratively tunes the parameters to maximize the likelihood.
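For bigrams, interpolation can be sketched as mixing the bigram estimate with the relative unigram frequency of the predicted word. The corpus, the function name and the fixed weight `lam=0.7` are illustrative assumptions; in practice the weight would be tuned as described above.

```python
from collections import Counter

def interpolated_bigram(tokens, lam):
    """Return P(w2 | w1) as a mixture of bigram and unigram estimates."""
    unigram_counts = Counter(tokens)
    history_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w2, w1):
        if history_counts[w1]:
            bigram = bigram_counts[(w1, w2)] / history_counts[w1]
        else:
            bigram = 0.0  # unseen history: fall back entirely on the unigram term
        unigram = unigram_counts[w2] / total
        return lam * bigram + (1 - lam) * unigram

    return prob

corpus = "the cat sat on the mat the cat ran".split()
p_int = interpolated_bigram(corpus, lam=0.7)
# Neither "on cat" nor "on ran" occurs, but "cat" is twice as frequent as "ran".
p_cat = p_int("cat", "on")
p_ran = p_int("ran", "on")
```

Unlike additive smoothing, the two unseen bigrams now get different probabilities, in proportion to the unigram frequencies of their second words.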


Good overview about the smoothing methods: [MacCartney, NLP Lunch Tutorial: Smoothing](https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf)

 
#### General problems

- Even the core assumption is not too realistic, since the probabilities are surely influenced in some way by words further back than $n$, but for practical reasons the context has to be limited (sparsity, computational capacity).
- On a large enough corpus, the memory footprint of $n$-gram models is _huge_, e.g., for Google's 1T $n$-gram corpus ([see here](https://catalog.ldc.upenn.edu/LDC2006T13)), which contains 1,024,908,267,229 tokens, the $n$-gram counts are as follows:
    - unigrams: 13,588,391
    - bigrams: 314,843,401
    - trigrams: 977,069,902
    - fourgrams: 1,313,818,354
    - fivegrams: 1,176,470,663

## Language modeling with RNNs

One way to circumvent the Markov assumption is to use RNNs, which are capable of modeling long-term dependencies within the sequence of words. The text is thus treated as a time series, and an appropriate architecture can be used (as we have already seen):

<img src="http://drive.google.com/uc?export=view&id=1y8QYr9ftTvXAxgzS-ldnGlijVpmK2l21" width=600 heigth=600>



Notable features:

- Input is a "one-hot" encoded vector, which is immediately transformed into an "embedding vector"
- At each output step, we get a probability distribution over the whole vocabulary with softmax
- The above is a simple RNN, but LSTMs can be used without any problems

### Training: cross-entropy/negative log-likelihood loss with _teacher forcing_


<img src="http://drive.google.com/uc?export=view&id=1XsBoRp7cNay3svFLRDv2JEDyC7m7CUdC" width=600 heigth=600>


- The loss is generally the well-known cross-entropy, which in this case (since the target is a one-hot vector) is:
  $$J^{(i)}(\Theta) = -\log (\hat y[x^{(i+1)}])$$
  the negative logarithm of the probability assigned by the network to the correct next word.

- Again, we are trying to maximize the likelihood of the data set!
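The loss with teacher forcing can be sketched as follows. This is a simplified illustration: `toy_model` is a hypothetical lookup table standing in for the RNN, and the important point is only that each prediction is conditioned on the ground-truth prefix.

```python
import math

def teacher_forced_loss(next_token_probs, tokens):
    """Mean cross-entropy when every step sees the ground-truth history."""
    losses = []
    for i in range(len(tokens) - 1):
        prefix = tuple(tokens[: i + 1])    # ground truth, not the model's own output
        y_hat = next_token_probs(prefix)   # predicted distribution over the vocabulary
        losses.append(-math.log(y_hat[tokens[i + 1]]))
    return sum(losses) / len(losses)

def toy_model(prefix):
    # Hypothetical stand-in for an RNN: it only looks at the last token.
    table = {
        "the": {"cat": 0.5, "dog": 0.5},
        "cat": {"purrs": 0.8, "barks": 0.2},
    }
    return table[prefix[-1]]

loss = teacher_forced_loss(toy_model, ["the", "cat", "purrs"])
```

The result is the average of $-\log 0.5$ and $-\log 0.8$, i.e. the per-token negative log-likelihood of the sequence.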

# Conditional language modeling

Model the probability 

$$
P(\langle y_1,\dots,y_n\rangle|C)
$$
where $C$ expresses some conditions on the generated $\langle y_1,\dots,y_n\rangle$ text.

+ Text in a certain style
+ Text about a certain topic
+ Machine translation
+ Question answering
+ Chatbots
+ Summarization
+ Image captioning
+ Natural language generation from structured data (e.g., news)

## Encoder-decoder architectures for conditional text generation

Encode the condition and use a sequence-producing decoder, e.g., an LSTM:

E.g., image-captioning:

<img src="https://miro.medium.com/max/512/1*vzFwXFJOrg6WRGNsYYT6qg.png">

(Image from [Image Captioning in Deep Learning](https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2)) 

Seq2seq architectures for conditional generation on the basis of text input:

<img src="https://docs.chainer.org/en/stable/_images/seq2seq.png" width="600px">

## Sampling strategies during generation (again)

Although the most well known strategies are deterministic,
- greedy
- beam-search

the need for "creative" generation increased the use of stochastic sampling. An important example is

- stochastic beam-search, where the $k$ alternatives are always sampled from the continuation distributions, instead of simply choosing the $k$ most probable choices.

Recently, some alternatives have been proposed to increase the variance of the samples while still keeping quality:
- top-$k$ sampling: at each time step, sample from the $k$ most probable tokens.
- top-$p$ sampling: at each time step, sample from the smallest set of most probable tokens whose probabilities together add up to at least $p$:

<img src="https://miro.medium.com/max/1284/0*J37qonVPJvKZpzv2" width="400px">

(Image source: [Ben Mann: How to sample from language models](https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277))
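Both truncation strategies can be sketched as filters on the next-token distribution, followed by ordinary sampling. The distribution `dist` and the helper names are made up for illustration.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    kept, mass = [], 0.0
    for tok, q in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, q))
        mass += q
        if mass >= p:
            break
    return {tok: q / mass for tok, q in kept}

def sample(probs, rng=random):
    """Draw one token from a (filtered) distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution.
dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
top2 = top_k_filter(dist, 2)
nucleus = top_p_filter(dist, 0.9)
token = sample(nucleus, random.Random(0))
```

With top-$k$ the number of candidates is fixed, while with top-$p$ it adapts to how peaked the distribution is.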

## Evaluation

How to evaluate the quality of the produced texts? The most used automated metrics make use of one or more __reference texts__, e.g.,
- the correct translation(s)
- correct image caption(s)
- a corpus of "good texts" in the case of unconditional language modeling,
and don't require exact matches, which would be neither realistic nor desirable for most use cases.

__BLEU-n (bilingual evaluation understudy)__

Perhaps the most important family of metrics, which comes from machine translation. It basically gives the percentage of $n$-grams in the generated text that match the reference text(s) (on a token level):

<img src="https://x-wei.github.io/images/Ng_DLMooc_c5wk3/pasted_image013.png" width="500px">

([Image source](http://x-wei.github.io/Ng_DLMooc_c5wk3.html))

$n$ is typically in the 2-4 range.

__ROUGE-n (Recall-Oriented Understudy for Gisting Evaluation)__

A modification of BLEU: while BLEU is precision-oriented, ROUGE measures recall, i.e., the proportion of the $n$-grams in the reference texts that also appear in the generated text.
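The precision/recall contrast between the two families can be sketched with a single helper that counts matching $n$-grams. Note that this is a simplification: the full BLEU score also combines several $n$-gram orders and applies a brevity penalty. The example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n):
    """Return (precision, recall) of candidate n-grams against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection clips each n-gram count at its count in the reference.
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)   # BLEU-style
    recall = matches / max(sum(ref.values()), 1)       # ROUGE-style
    return precision, recall

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
p1, r1 = ngram_overlap(candidate, reference, 1)
p2, r2 = ngram_overlap(candidate, reference, 2)
```

For these sentences the bigram precision is 2/5 (two of the candidate's five bigrams occur in the reference) and the bigram recall is 2/6.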

## Problems with token-by-token + likelihood-based training/models

### The gap between global evaluation metrics and local objective

The BLEU, ROUGE etc. scores and, especially, human quality judgments are __global__ (top-down) evaluation metrics that concern the generated texts as a whole, while the maximum likelihood training objective simply concentrates on bringing the local continuation probabilities close to that of the corpus.

### "Exposure bias"

> The current approach to training them consists of maximizing the likelihood of
each token in the sequence given the current (recurrent) state and the previous
token. At inference, the unknown previous token is then replaced by a token
generated by the model itself. This discrepancy between training and inference
can yield errors that can accumulate quickly along the generated sequence.

>The main problem is that mistakes made early in the sequence generation process are fed as input to the model and can be quickly amplified because the model might be in a part of the state space it has never seen at training time.

(From: [Bengio et al (2015) "Scheduled sampling for sequence prediction with recurrent neural networks."](https://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks.pdf))

### "Linguistic creativity" 

For many applications, "linguistic creativity" is very important: we want to produce distributions which can be sampled to produce __both high quality and highly varying__ texts: e.g., a chat bot should not always produce the same answer to the same question.

# A solution to "Exposure bias": Scheduled sampling (2015)

An influential idea introduced in  [Bengio et al (2015) "Scheduled sampling for sequence prediction with recurrent neural networks."](https://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks.pdf)

As the radical difference between training and inference comes from using the ground truth $y_{t-1}$ instead of the predicted $\hat{y}_{t-1}$ to predict $y_t$, we can gradually switch to using the model's own predictions during training:

> We propose to change the training process in order to gradually force the model to deal with its own mistakes, as it would have to during inference. Doing so, the model explores more during training and is thus more robust to correct its own mistakes at inference as it has learned to do so during training.

At every decoding time step during training we "toss a coin" to decide which input to use according to a Bernoulli distribution. At the start, we always choose the ground truth, but the probability of using the model's own prediction is gradually increased to 1 according to a schedule:

<img src="https://d3i71xaburhd42.cloudfront.net/df137487e20ba7c6e1e2b9a1e749f2a578b5ad99/4-Figure2-1.png" width="800">

(Figure from the paper.)

They used two sampling strategies:
+ A greedy arg-max based strategy
+ Sampling from the distribution
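The coin toss described above can be sketched as follows. The linear decay in `teacher_forcing_prob` is only one possible schedule and is an assumption of this sketch, as are the function names; the predicted tokens are taken as given, however they were obtained (arg-max or sampling).

```python
import random

def teacher_forcing_prob(step, total_steps):
    """Hypothetical linear schedule: probability of feeding the ground truth."""
    return max(0.0, 1.0 - step / total_steps)

def choose_inputs(gold_prev, predicted_prev, epsilon, rng):
    """Per position, use the gold token with probability epsilon, else the prediction."""
    return [g if rng.random() < epsilon else p
            for g, p in zip(gold_prev, predicted_prev)]

gold = ["the", "cat", "purrs"]
pred = ["the", "dog", "barks"]
rng = random.Random(0)

start = choose_inputs(gold, pred, teacher_forcing_prob(0, 100), rng)    # all gold
end = choose_inputs(gold, pred, teacher_forcing_prob(100, 100), rng)    # all predicted
mixed = choose_inputs(gold, pred, teacher_forcing_prob(50, 100), rng)   # a mixture
```

The `mixed` case illustrates the "hybrid" input sequences that this method feeds to the decoder in the middle of training.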

__Evaluation__

This solution led to notable improvements (e.g., from 28.8 to 30.6 in image captioning BLEU-4 score), but is still limited in a number of ways:

- Since the sampling operations used were not differentiable, the errors were not backpropagated through the sampling, the $\hat y$-s were basically treated as "external data augmentation":

> Note that when we sample the previous token $\hat y_{t−1}$ from the model itself while training, We could back-propagate the gradient of the losses at times t→T through that decision.  This was not done in the experiments described in this paper and is left for future work.

- Sampling the model's predictions only for specific token positions produces strange "hybrid" input sequences which cannot diverge naturally from the ground truth $y$ and are constrained to stay aligned (and be constantly realigned) with it.
- The method does not address the problem of not connecting the global evaluation metrics (BLEU etc.) with the training objective.

# Differentiable sampling methods (2017)

A number of solutions focused on solving the first problem: nondifferentiable sampling.

As, for instance, the paper [Goyal et al. (2017) Differentiable Scheduled Sampling for Credit Assignment](https://arxiv.org/pdf/1704.06970.pdf) shows, cascading errors cannot really be corrected if the sampling is not backpropagated or the sampling operator is not differentiable.

Their example is the following. Reference output:

> The cat purrs.

Incorrect prediction:

> The dog barks.

Here __barks__ was produced by the model on the basis of its own previous prediction __dog__. As this is a case of "cascading error", it would be important to change the weights that led to predicting "dog", but without backpropagating through the sampling, training only increases the probability of predicting "purrs" after "dog", which is obviously not a good strategy.


The ideal is to have a smooth sampling operation through which we can backpropagate this type of cascading error:

<img src="https://d3i71xaburhd42.cloudfront.net/c14f416bab5a4936deded9797883e14ac0b55fff/2-Figure1-1.png">

(Figure from the paper [Goyal et al. (2017): Differentiable Scheduled Sampling for Credit Assignment](https://arxiv.org/pdf/1704.06970.pdf))

As the figure shows, normal greedy sampling is not continuous at the points where the maximum changes, so there is no informative gradient that could guide the parameters towards the optimum.

## Differentiable greedy sampling ("Soft Argmax")

How could we make the sampling operation differentiable? Relying on the fact that the RNN inputs are immediately converted to token embeddings, one possible strategy is to replace the arg-max operation with the sum of all embeddings weighted by the (peaked) probabilities assigned by the model. (The $\alpha$ parameter controls how soft or hard the weight distribution is: as $\alpha \rightarrow \infty$, the weighted sum approaches the embedding of the arg-max token.)

<img src="https://d3i71xaburhd42.cloudfront.net/c14f416bab5a4936deded9797883e14ac0b55fff/2-Figure2-1.png">

(Figure from [Goyal et al. (2017): Differentiable Scheduled Sampling for Credit Assignment](https://arxiv.org/pdf/1704.06970.pdf))
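A minimal sketch of the soft argmax idea in plain Python: the embedding fed to the next step is a softmax-weighted average of all token embeddings, with $\alpha$ sharpening the weights. The scores and the two-dimensional embeddings are made-up values.

```python
import math

def softmax(scores, alpha=1.0):
    """Softmax of alpha * scores, computed in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(alpha * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def soft_argmax_embedding(scores, embeddings, alpha):
    """Weighted sum of embeddings; approaches the argmax embedding as alpha grows."""
    weights = softmax(scores, alpha)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]

# Hypothetical scores and embeddings for a 3-token vocabulary.
scores = [1.0, 2.0, 0.5]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft = soft_argmax_embedding(scores, embeddings, alpha=1.0)
hard = soft_argmax_embedding(scores, embeddings, alpha=50.0)   # close to embeddings[1]
flat = soft_argmax_embedding(scores, embeddings, alpha=0.0)    # plain average
```

Since the output is a smooth function of the scores for any finite $\alpha$, gradients can flow back through this operation.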

##  Soft reparametrized sampling

A drawback of the (soft) argmax strategy is its low randomness: the distributions are not really explored if we always choose (or let dominate) the token predicted to be most probable. A soft version of actually __sampling__ from the predicted distribution can be optimized by using an analogue of the VAE reparametrization trick, injecting randomness from a standard distribution:

Instead of the soft greedy
$$
\overline e_{i-1}=\sum_y e_y \frac{\exp(\alpha s_{i-1}(y))}{\sum_{y'}\exp(\alpha s_{i-1}(y'))}
$$

(where $\overline e_{i-1}$ is the softly selected embedding and $s_{i-1}$ is the scoring function for the $(i-1)$-th output),
we sample a value $G_y$ from the same distribution $G$ for every element of the vocabulary, and modify the scores with the sampled values:

$$
\overline e_{i-1}=\sum_y e_y \frac{\exp(\alpha (s_{i-1}(y) + G_y))}{\sum_{y'}\exp(\alpha (s_{i-1}(y')+G_{y'}))}
$$

How do we choose the $G$ distribution? $G$ refers to the so-called standard Gumbel distribution with probability density function

$$
f(x) = e^{-(x+ e^{-x})}
$$

It is called standard because Gumbel is actually a parametric family of distributions with parameters $\mu$ and $\beta$; for the standard version, $\mu = 0$ and $\beta = 1$.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Gumbel-Density.svg/1280px-Gumbel-Density.svg.png" width="500px">

(Image source: [Wikipedia](https://en.wikipedia.org/wiki/Gumbel_distribution))

The standard Gumbel distribution has the nice property that one can sample from a categorical distribution described by the log probabilities (logits)
$s_1,\dots,s_n$ by drawing independent samples $z_1,\dots,z_n$ from it, adding the samples to the logits and choosing the alternative with the maximal value:
$$
\underset{i}{\operatorname{argmax}} (s_i + z_i)
$$

(See, e.g., https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ for a discussion of this so called Gumbel-max trick.) 
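The Gumbel-max trick can be checked empirically with a few lines of standard-library Python. This is a sketch under the assumption that Gumbel noise is generated by inverse transform sampling, $-\log(-\log U)$ with $U$ uniform on $(0,1)$; the target probabilities are made up.

```python
import math
import random

def sample_gumbel(rng):
    """Standard Gumbel sample via inverse transform: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:          # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(logits, rng):
    """Index of the largest noisy logit; distributed as softmax(logits)."""
    noisy = [s + sample_gumbel(rng) for s in logits]
    return max(range(len(logits)), key=noisy.__getitem__)

target = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in target]
rng = random.Random(0)

n_draws = 20000
counts = [0] * len(target)
for _ in range(n_draws):
    counts[gumbel_max_sample(logits, rng)] += 1
frequencies = [c / n_draws for c in counts]
```

The empirical `frequencies` should be close to `target`, up to sampling noise.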

Accordingly, the above definition of $\overline e_{i-1}$ differs from a proper sampling of the output distribution only in using a softened version of argmax: as $\alpha \rightarrow \infty$, its value gets closer and closer to the argmax value, i.e., to (the embedding of) a proper sample from the distribution.



## Improvements with differentiable sampling

Compared to the original scheduled sampling method, making sampling differentiable and backpropagating errors noticeably improved the results:

<img src="https://d3i71xaburhd42.cloudfront.net/c14f416bab5a4936deded9797883e14ac0b55fff/5-Table1-1.png" width="600px">

# Autoencoders (2016--) 

Inspired by the success of sequence2sequence encoder-decoder architectures for conditional text generation, text generating __autoencoders__ learning a mapping into and reconstruction from latent representations were also developed in the last few years. 

__VAEs__

The first attempts were with variational autoencoders, and standard RNN-based seq2seq architectures: 

<img src= "https://d3i71xaburhd42.cloudfront.net/d82b55c35c8673774a708353838918346f6c006f/3-Figure1-1.png">

(Image from [Generating Sentences from a Continuous Space (2016)](https://arxiv.org/pdf/1511.06349.pdf))

but performance problems led to experiments with other types of encoder/decoder architectures such as (de)convolutional encoders/decoders:

<img src="https://d3i71xaburhd42.cloudfront.net/81aee1c76e6bd4b915b016f7a8b70abe42841dd8/4-Figure2-1.png" width="700px">

(Image from [A Hybrid Convolutional Variational Autoencoder for Text Generation](https://arxiv.org/pdf/1702.02390.pdf))

__Adversarial autoencoder__

As the quality of the generated text was still unsatisfactory, research shifted to so-called adversarial autoencoders, which also make use of an adversarial discriminator that tries to distinguish between embeddings sampled from the imposed latent space distribution (e.g., standard normal) and those produced by the encoder from real examples. Text-generating adversarial autoencoders are still an actively researched area and lead naturally to the idea of text-generating GANs...

# The advent of text-generating GANs: SeqGAN (2016)

## Why would you use GANs for text generation and why is it difficult?

As we know, __exposure bias__ is the problem of difference between the training and test setting of traditional neural encoder-decoder based text generating models. A radical solution is to switch to a Generative Adversarial setting, in which the training and inference setting does not differ in that way, since the training objective is simply to improve the quality of the generated text (to mislead the discriminator).

Switching to GANs also solves the problem of the missing global objective (as opposed to the local maximum likelihood one), which is important because not all text generation tasks have well-established global evaluation metrics like BLEU for MT.

__Nice thought, but there seem to be some formidable problems__:

As the "GAN-father", Ian Goodfellow  himself explained in 2016 on Reddit(!) in an [answer](https://www.reddit.com/r/MachineLearning/comments/40ldq6/generative_adversarial_networks_for_text/) to a question regarding the possibility of text generating GANs:

> GANs work by training a generator network that outputs synthetic data, then running a discriminator network on the synthetic data. The gradient of the output of the discriminator network with respect to the synthetic data tells you how to slightly change the synthetic data to make it more realistic.

> You can make slight changes to the synthetic data only if it is based on continuous numbers.  If it is based on discrete numbers, there is no way to make a slight change.

> For example, if you output an image with a pixel value of 1.0, you can change that pixel value to 1.0001 on the next step.

> If you output the word "penguin", you can't change that to "penguin + .001" on the next step, because there is no such word as "penguin + .001". You have to go all the way from "penguin" to "ostrich".

> Since all NLP is based on discrete values like words, characters, or bytes, no one really knows how to apply GANs to NLP yet.

That is, the problem is (again) non-differentiability: for training the generator, we need gradients from the discriminator's fake vs real judgments, but the discreteness of the text output makes this loss non-differentiable with respect to the generator's parameters. However, Goodfellow added a very important final remark:

> In principle, you could use the REINFORCE algorithm, but REINFORCE doesn't work very well, and no one has made the effort to try it yet as far as I know.

## Text generation as a reinforcement learning problem

As can be expected, it didn't take long until people actually implemented the first RL-based text generating GAN, called SeqGAN. The model is described in the paper [Yu et al. (2017), SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient](https://arxiv.org/pdf/1609.05473.pdf). As they explain, using RL solves both the problem of

- global reward function vs local decisions, and
- non-differentiability of the generator's loss.

How do we transfer stepwise text generation to an RL setting?

The SeqGAN solution: for the GAN's generator

- states are the generated (possibly partial) token sequences
- actions are adding a new token from the vocabulary, consequently the number of actions = the number of types in the vocabulary,
- state transitions are deterministic (adding a token deterministically produces the extended sequence),
- the policy, in contrast, is stochastic, since in the vast majority of cases there are several alternative continuations for a sequence,
- the reward is coming from the discriminator: it is the probability that the discriminator assigns to the generated sample's being real.

The parameters of the generator are learned with policy gradient, while the discriminator is a classifier trained with standard GD and binary cross-entropy loss:
 
<img src="https://thu-coai.github.io/cotk_docs/_images/seqgan.png" width=600>

(Figure from the paper)
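To give a feel for the policy gradient update, here is a minimal single-step REINFORCE sketch for a softmax policy over tokens. It is not the SeqGAN implementation: the policy is reduced to a bare vector of logits, and `reward` is a placeholder number standing in for the discriminator's output.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_logit_gradient(logits, action, reward):
    """reward * d log pi(action) / d logits = reward * (one_hot(action) - pi)."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

# Two-token vocabulary, uniform policy; token 0 was sampled and rewarded.
logits = [0.0, 0.0]
grad = reinforce_logit_gradient(logits, action=0, reward=1.0)

# One step of gradient ascent with a hypothetical learning rate of 1.0.
new_logits = [s + 1.0 * g for s, g in zip(logits, grad)]
old_prob = softmax(logits)[0]
new_prob = softmax(new_logits)[0]
```

The update makes the rewarded token more probable without ever differentiating through the discrete sampling step itself.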

## Generator and discriminator architectures

__Generator__

The original SeqGAN generator was basically the simplest possible LSTM language model. In contrast to classic image-generating GANs, the generator does not need a random sample as an input -- stochasticity comes from sampling from the stochastic policy.

__Discriminator__

In contrast to the generator, the discriminator was a CNN-based model, concretely a HighwayNet variant (with sigmoid output for binary classification).

__Important limitation: fixed length sequences__

SeqGAN learns from and generates only token sequences of a given length (T in the paper).

## Training

Training pseudo-code from the paper:

<img src="http://shaofanlai.com/archive/storage/pEEps8sn9v1kHcQy4cvKumDuUXM1C4D0B9HETBVgmPTxwUdTvR" width="500px">

Notable details:

+ The generator is first pretrained with MLE on the text corpus.
+ Policy gradient training uses Monte Carlo search.
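The role of Monte Carlo search can be illustrated as follows: since the discriminator only scores complete sequences, the value of a partial sequence is estimated by completing it several times with the current policy and averaging the scores. This is a simplified sketch; `sample_next` and `discriminator` are toy stand-ins, not the models from the paper.

```python
import random

def rollout_value(prefix, length, sample_next, discriminator, n_rollouts, rng):
    """Average discriminator score over random completions of a partial sequence."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < length:
            seq.append(sample_next(seq, rng))   # complete with the current policy
        total += discriminator(seq)
    return total / n_rollouts

def sample_next(seq, rng):
    # Hypothetical policy: uniform over a two-token vocabulary.
    return rng.choice(["a", "b"])

def discriminator(seq):
    # Hypothetical "realness" score: the fraction of "a" tokens.
    return seq.count("a") / len(seq)

rng = random.Random(0)
partial_value = rollout_value(["a"], 4, sample_next, discriminator, 50, rng)
full_value = rollout_value(["a", "b", "a", "a"], 4, sample_next, discriminator, 5, rng)
```

For a complete sequence no rollout is needed, and the estimate is simply the discriminator's score.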

## SeqGAN results
Results seemed rather convincing.

__Corpus-level BLEU__: In the absence of a reference output, the paper used a BLEU metric which counted shared n-grams between the output and the whole corpus.

On a synthetic data set produced by a randomly initialized LSTM "oracle", performance was measured by the average NLL that the oracle RNN assigned to the generated sequences (PG-BLEU is a non-adversarial model trained by RL with MCTS and the BLEU score as reward):

<img src="https://pbs.twimg.com/media/Ct7seAgVYAQ01dD?format=jpg&name=large" width="600">


<img src="https://d3i71xaburhd42.cloudfront.net/2966ecd82505ecd55ead0e6a327a304c8f9868e3/6-Table4-1.png" width="450px">

(Figures from the paper)

## Further RL-based GANs

SeqGAN was followed by a large number of "more modern" RL-based architectures, following the basic template of solving the problem of differentiability by using some type of Policy Gradient-based learning. From the RL point of view the main difference has been a switch to __Actor-Critic__ algorithms/models, see, e.g.,

+ [Fedus, Goodfellow et al.: MaskGAN: Better Text Generation via Filling in the ______](https://arxiv.org/abs/1801.07736) and, for an overview,
+ [Keneshloo et al. (2019): Deep Reinforcement Learning for Sequence-to-Sequence Models](https://arxiv.org/pdf/1805.09461.pdf).

# New evaluation metric proposals for GANs

In addition to the well known $n$-gram based metrics (BLEU, ROUGE), researchers tried to use LM-based metrics:

- __Language Model score:__ take a good LM, and look at the probability it assigns to the generated texts. Of course, this has the problem that it doesn't reward "creativity/variance" and so tolerates "mode collapse":
>a model that always generates a few highly likely sentences will score very well.
(from: [Semeniuta et al. (2019): On Accurate Evaluation of GANs for Language Generation](https://arxiv.org/pdf/1806.04936.pdf))

- __Reverse Language Model score:__ a language model is trained on generated samples and then is evaluated on (held out) real texts. This is imperfect as well, as results depend on how well the LM can represent the generating distribution.

A new proposal is

- __Fréchet InferSent Distance:__ By analogy with the __Fréchet Inception Distance__ used for image GANs, this metric uses a specific sentence embedding model, the [InferSent sentence embeddings](https://arxiv.org/abs/1705.02364) learned from supervised Natural Language Inference data (classifying sentence pairs into the classes Entailment, Contradiction and Neutral), samples the embeddings of the "real" and "generated" distributions, and calculates the Fréchet distance between them.

See [Semeniuta et al. (2019): On Accurate Evaluation of GANs for Language Generation](https://arxiv.org/pdf/1806.04936.pdf) for details.

# Text GANs without RL

## RL problems

In spite of the results of RL-based GANs, the notorious difficulty of RL training plus performance problems (typically, __mode collapse__) led to the question: is RL training unavoidable for text/sequence producing GANs? The answer is no, as a number of alternatives have been developed:

## The Gumbel softmax-based solution
Remember: the trick was based on the fact that one can sample from a categorical distribution given by the log probabilities (logits)
$s_1,\dots,s_n$ by drawing samples $z_1,\dots,z_n$ from the standard Gumbel distribution, adding the samples to the logits and choosing the alternative with the maximal value:
$$
\underset{i}{\operatorname{argmax}} (s_i + z_i)
$$
Representing the result as a distribution, the full sampling operation can be described as
$$
\operatorname{one-hot}(\underset{i}{\operatorname{argmax}} (s_i + z_i))
$$
This is, of course, not differentiable with respect to the $s_i$-s, but its Softmax version with temperature $\tau$,
$$
\operatorname{Softmax} (\frac{1}{\tau}(\mathbf s + \mathbf z))
$$
is a differentiable (in $\mathbf s$) approximation of sampling the categorical distribution represented by $\mathbf s$.
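A minimal standard-library sketch of the Gumbel-softmax relaxation: the same Gumbel noise that would select a one-hot sample is passed through a temperature-scaled softmax instead. The logits, the seed and the temperature values are arbitrary illustrative choices.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Softmax((s + z) / tau) with z drawn from the standard Gumbel distribution."""
    noisy = []
    for s in logits:
        u = max(rng.random(), 1e-12)                  # avoid log(0)
        noisy.append((s - math.log(-math.log(u))) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

logits = [math.log(0.7), math.log(0.2), math.log(0.1)]

# Same seed, hence the same noise; only the temperature differs.
smooth = gumbel_softmax(logits, tau=1.0, rng=random.Random(1))
sharp = gumbel_softmax(logits, tau=0.1, rng=random.Random(1))
```

Lowering $\tau$ moves the output towards the one-hot vector of the Gumbel-max sample, at the price of gradients that become less smooth.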

Using this trick we can build a __differentiable__ "traditional" sequence GAN, which, as usual, starts generation from random noise, in the form of randomly initialized LSTM hidden states and cell states:

<img src="https://media.arxiv-vanity.com/render-output/2475123/x1.png">

(Figure from [Kusner, Hernandez-Lobato (2016), GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution](https://arxiv.org/pdf/1611.04051.pdf))

In the paper [Kusner, Hernandez-Lobato (2016), GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution](https://arxiv.org/pdf/1611.04051.pdf) the discriminator is also an LSTM, which reads the generator's output at every step:

<img src="https://media.arxiv-vanity.com/render-output/2475123/x2.png">

(Figure from the paper)


+ Note that because of the use of Gumbel-softmax sampling, the outputs of the generator are smooth approximations of the one-hot encoded random samples. In this sense the generator's output is no longer discrete.

+ To make the real sequences a bit noisy and more similar to the smooth fakes, they were also converted first by label smoothing (0.9 for the correct character, uniform for the rest) and then by Gumbel-softmax before being fed into the discriminator.

## Cooperative training

As the [original GAN paper](http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf) proved, GANs minimize the so called Jensen-Shannon divergence between the original distribution and the learned distribution, which is a symmetrized and smoothed form of the KL-divergence:

$$
\operatorname{JSD}(P_{\mathrm{Data}} \| P_G) = \frac{1}{2}(\operatorname{KL}(P_{\mathrm{Data}} \| M) + \operatorname{KL}(P_G \| M)) 
$$

where $M=\frac{1}{2}(P_{\mathrm{Data}} + P_G)$ is the mean distribution of $P_{\mathrm{Data}}$ and $P_G$. 
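The definition can be checked on small discrete distributions. This is a minimal sketch with natural logarithms; the distributions `p` and `q` are made up.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mean distribution M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

d_pq = jsd(p, q)
d_qp = jsd(q, p)                        # symmetric, unlike KL
d_max = jsd([1.0, 0.0], [0.0, 1.0])     # disjoint supports give log 2
```

Unlike the KL-divergence, the JSD stays finite (at most $\log 2$) even when the two distributions have disjoint supports.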

Using this insight, cooperative training as described in [Lu et al. (2019), CoT: Cooperative Training for Generative Modeling of Discrete Data](https://arxiv.org/pdf/1804.03782.pdf) tries optimizing for this objective using a so-called "Mediator" model representing the mean distribution $M$ instead of a Discriminator:

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/7efebd73c871990272aaca85a6a94e75801bd736/3-Figure1-1.png">

(Figure from the paper.)

Optimization is an alternating iteration of Mediator and Generator training steps.

__Mediator training__

The Mediator is trained with balanced samples from the data and the generator distribution, and is optimized simply with an MLE objective.

__Generator training__

The Generator, in contrast, is trained with a JSD objective using the Mediator's current approximation of $M$. Interestingly enough, empirically it turns out that this can be done without REINFORCE, only minimizing the KL-divergence between the continuation probability distributions from the Generator and the Mediator -- see the paper, [Lu et al. (2019), CoT: Cooperative Training for Generative Modeling of Discrete Data](https://arxiv.org/pdf/1804.03782.pdf) for the (very technical!!) details.
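The division of labor between the two models can be illustrated on a deliberately simplified toy: a single categorical distribution without any context, where MLE reduces to counting relative frequencies. This is only a sketch under these assumptions, not the paper's algorithm; the distributions, the sample size, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 4
p_data = np.array([0.7, 0.1, 0.1, 0.1])     # toy "data" distribution
p_gen = np.array([0.25, 0.25, 0.25, 0.25])  # toy current generator

# Mediator step: MLE on a balanced mixture of data and generator samples.
# For a context-free categorical model, the MLE is the relative frequency.
n = 50_000
data_samples = rng.choice(vocab, size=n, p=p_data)
gen_samples = rng.choice(vocab, size=n, p=p_gen)
mixed = np.concatenate([data_samples, gen_samples])
mediator = np.bincount(mixed, minlength=vocab) / mixed.size

def kl(p, q):
    # KL(p || q); both arguments are strictly positive in this toy.
    return float(np.sum(p * np.log(p / q)))

# Generator signal: divergence of the generator's distribution from the
# mediator's. In a sequence model this would be computed per time step on
# the next-token ("continuation") distributions.
gen_loss = kl(p_gen, mediator)

print(np.round(mediator, 3), round(gen_loss, 4))
```

With balanced sampling, the mediator's estimate approaches $M=\frac{1}{2}(P_{\mathrm{Data}} + P_G)$, which is what makes it usable as a stand-in for $M$ in the generator's objective.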

# Are GANs really superior? (2020)

A very recent paper entitled [Language GANs falling short?](https://arxiv.org/pdf/1811.02549.pdf) argues that the perceived superiority of GANs over traditional MLE generative models is actually an illusion: the authors examine GANs and MLE models with respect to the quality/diversity trade-off by controlling the models' entropy with the softmax temperature parameter during generation and find that MLE models in general perform better on both fronts, and on several metrics. The following is a typical figure:

<img src="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_199319%2Fproject_315481%2Fimages%2Fx5.png" width="400">

(Figure from the paper)
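The temperature mechanism used to move along the quality/diversity trade-off can be sketched as follows. The logits are made up for illustration, and `softmax_with_temperature` and `entropy` are names chosen here, not taken from the paper.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Dividing the logits by the temperature sharpens (T < 1) or
    # flattens (T > 1) the resulting next-token distribution.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    # Shannon entropy in nats.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([3.0, 1.0, 0.5, 0.0])  # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, t)
    print(t, np.round(p, 3), round(entropy(p), 3))
```

Lower temperatures concentrate the probability mass on the most likely tokens (higher quality, lower diversity), while higher temperatures increase the entropy of the sampling distribution (higher diversity, lower quality).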

Oracle LM based tests:
+ NLL oracle is the NLL that the oracle assigns to the sequences produced by the generator. (It measures the "quality" or "realism" of the sequences.)
+ NLL test, in contrast, measures the "diversity" of the samples, because it is the generator's NLL with respect to held-out data from the oracle.
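The two metrics differ only in which model scores which samples. A minimal sketch, assuming context-free categorical toy models in place of the oracle and generator language models (all distributions and names here are hypothetical):

```python
import numpy as np

def avg_nll(model_probs, samples):
    # Average negative log-likelihood of the samples under the model.
    return float(-np.mean(np.log(model_probs[samples])))

rng = np.random.default_rng(0)
oracle = np.array([0.5, 0.3, 0.15, 0.05])    # toy oracle distribution
generator = np.array([0.7, 0.2, 0.05, 0.05])  # toy generator, too "peaked"

generated = rng.choice(4, size=10_000, p=generator)  # samples from the generator
held_out = rng.choice(4, size=10_000, p=oracle)      # held-out oracle data

nll_oracle = avg_nll(oracle, generated)  # oracle scores generator samples
nll_test = avg_nll(generator, held_out)  # generator scores oracle data

print(round(nll_oracle, 3), round(nll_test, 3))
```

A generator that only produces a few very likely sequences can obtain a good NLL oracle value, but it is penalized on NLL test, because it assigns low probability to the parts of the oracle distribution it fails to cover.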

# A note on transformers

Transformers are not RNNs but are still used for text generation. Do the problems of "exposure bias", "global reward vs local decisions", etc. apply to them?

The answer is, unfortunately, positive: although transformer-based language models like OpenAI's GPT are not RNNs but feed-forward networks, they are still __autoregressive__: they generate text token by token, on the basis of the already generated previous tokens:

<img src="http://jalammar.github.io/images/xlnet/gpt-2-autoregression-2.gif" width="700px">

(Image from [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/))

Consequently, the problems associated with MLE training can occur in transformer-based models as well. In fact, the GPT models were trained with the MLE objective.
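The autoregressive loop itself is independent of the architecture that computes the next-token distribution. A minimal sketch, in which `toy_next_token_probs` is a hypothetical stand-in for a trained model (RNN or transformer) and the token ids are arbitrary:

```python
import numpy as np

def generate(next_token_probs, prefix, max_new_tokens, eos_id, rng):
    # Autoregressive sampling: every new token is drawn from a distribution
    # conditioned on all tokens generated so far.
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = int(rng.choice(len(probs), p=probs))
        tokens.append(token)
        if token == eos_id:
            break
    return tokens

def toy_next_token_probs(tokens, vocab_size=5):
    # Hypothetical "model": prefers the successor of the last token.
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 2.0
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
sample = generate(toy_next_token_probs, prefix=[0], max_new_tokens=10,
                  eos_id=4, rng=rng)
print(sample)
```

Since each step conditions on the model's own earlier samples rather than on ground-truth tokens, the mismatch between training and generation known as exposure bias arises in exactly the same way as with RNN generators.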

The decoders of transformer-based seq2seq models are, of course, also __autoregressive__ in this sense:

<img src="http://jalammar.github.io/images/t/transformer_decoding_2.gif">

(Image source: [The illustrated transformer](http://jalammar.github.io/images/t/transformer_decoding_2.gif))

This means that the problems and the suggested solutions can be adapted to transformer decoders as well; in fact, there are attempts to use

+ [scheduled sampling for transformers](https://www.aclweb.org/anthology/P19-2049.pdf)
+ [transformer-based adversarial text generation](https://arxiv.org/pdf/1809.11155.pdf)

etc.

# What about conditional generation?

So far we have concentrated on advanced methods of unconditional text generation: given a data set of texts, the task was "simply" to learn the generative distribution, i.e., generate texts that are similar to those in the data set.

However, as we have already seen, in real-life applications, generated texts typically have to satisfy some conditions, e.g., they might have to

- be suitable in a certain context (e.g., after a dialog history)
- express certain emotions
- contain some personalization features
- convey some specific content
- contain certain words in certain positions:

<img src="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_33479%2Fproject_389323%2Fimages%2Fmodel.jpg">

We have also seen that, from a probabilistic point of view, instead of learning an unconditional distribution over the possible sequences in the language's alphabet/vocabulary, the models have to learn conditional distributions of the form

$$
P(\langle y_1,\dots,y_n\rangle|C)
$$
where $C$ expresses all conditions which the generated $\langle y_1,\dots,y_n\rangle$ has to satisfy.
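
Applying the same chain rule as in the unconditional case, this distribution factorizes into conditional continuation probabilities, each of which additionally depends on $C$:

$$
P(\langle y_1,\dots,y_n\rangle \vert C) = P(y_1 \vert C)\cdot P(y_2 \vert y_1, C)\cdot\dots\cdot P(y_n \vert y_1,\dots,y_{n-1}, C) = \prod_{t=1}^{n} P(y_t \vert y_1,\dots,y_{t-1}, C)
$$

So an autoregressive decoder only has to model the next-token distribution given the previously generated tokens and the condition.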

Conditional text generation is a huge area (way larger than unconditional), so we can only scratch "the surface of the surface" here with a few remarks.

## How can the conditions be represented?

Conditions can vary from simple class-like (e.g., a short list of emotions to be expressed by the generated text), to complex, partly or wholly structured ones (like a dialog history), and the difficulty of representing them usefully for the decoder is a function 

- __Fixed length embeddings:__ Before the appearance of attention mechanisms, in the age of (attention-less) RNN decoders/generators, information about conditions was exclusively accessed/represented by __fixed size embeddings__: e.g., the context, the dialog history, or the image to be captioned was simply compressed into a fixed size vector and used for decoding in this form.

- __Attended representations:__ The introduction of attention mechanisms made it possible for the decoders to access/query way more complex representations of (pre)conditions, e.g., external memory, token-level dialog history, etc.

An example: in state-of-the-art GPT-2 based chatbots, the (immediate) dialog history and "persona information" are part of the token-level input:

<img src="https://miro.medium.com/max/2617/1*r7vi6tho6sfpVx-ZQLPDUA.png" width="700px">

(Image from [Wolf: How to build a State-of-the-Art Conversational AI with Transfer Learning](https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313))

- __"Disentangled" representations__: An important research direction is to generate texts based on interpretable, "natural" latent codes describing conditions to be satisfied. Autoencoders are especially important architectures in this area, see 

- __Global vs token level condition representations__:
The possibility of attended structured conditions makes it possible to have finer, token-level, representations of seemingly global contexts, e.g. emotions can be represented by adding "emotion-embeddings" to word embeddings, 
see, e.g., [Affective Neural Response Generation](https://arxiv.org/pdf/1709.03968.pdf).

- __Hybrid solutions__: In the long run, condition representation will probably converge to (cognitively more plausible) hybrid solutions combining compressed and attended/queried, more detailed elements, e.g., for dialog history there can be a detailed, attended representation of the immediately preceding part and fixed size embedding(s) of earlier history.
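
The difference between some of the representation types listed above can be made concrete with a small NumPy sketch. Everything here is an illustrative assumption: random vectors stand in for learned embeddings, mean pooling stands in for an encoder that compresses the history, and a single scaled dot-product attention step stands in for a full attention mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
history = rng.normal(size=(6, d))  # token vectors of a (toy) dialog history
query = rng.normal(size=d)         # current decoder state

# Fixed length embedding: the whole history is compressed into one vector,
# which is the same at every decoding step.
fixed = history.mean(axis=0)

# Attended representation: the decoder state queries the token-level
# history, so the summary can differ from step to step.
scores = history @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended = weights @ history

# Token-level representation of a "global" condition: an emotion embedding
# is added to every word embedding of the input.
word_emb = rng.normal(size=(4, d))
emotion_emb = rng.normal(size=d)
conditioned = word_emb + emotion_emb  # broadcast over the 4 tokens

print(fixed.shape, attended.shape, conditioned.shape)
```

A hybrid solution in the sense of the last item would combine both kinds of summaries, e.g., by feeding the decoder the `attended` vector for recent history together with a `fixed` vector for earlier history.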

## The relevance of advanced methods from unconditional generator models

Since the decoders of conditional text generators are the same architectures that are used for unconditional generation, the same problems of quality, variance, and tension between token-by-token generation and global reward apply to conditional models as well. Consequently, the techniques developed for unconditional generation can be expected to gradually appear in this area as well, especially in those domains where generating "creative" texts is an important requirement.