# natural language generation

## definition

NLP = Natural Language Understanding (NLU) + Natural Language Generation (NLG)

**text generation task**: 

input: $\{w_{1:k}\} \sim p_∗$, a text sequence (prefix) sampled from the true distribution.

output: ${\hat w}_{k+1:T} \sim p_{\theta}(\cdot|\{w_{1:k}\} )$, a continuation decoded by language model $p_{\theta}$

goal: the resulting completion $\hat x = (w_1, \ldots, w_k, \hat w_{k+1}, \ldots, \hat w_T)$ resembles a sample from the true distribution $p_∗$

example tasks:

- language modeling: $k = 0$ (no prefix)

- dialogue system: the input is the dialogue history and the output is the next utterance.

decoding algorithm: given a language model $p_{\theta}$ and a prefix, finding the most probable continuation is intractable because the number of candidate continuations grows exponentially with length, so in practice deterministic or stochastic decoding algorithms are used to generate continuations.

## application

close-ended generation:

- machine translation

- summarization

- table-to-text

- task-driven chat

- reading comprehension QA

open-ended generation:

- ChitChat

- story/poem generation

- image description

- open-domain QA

## model: encoder-decoder or decoder-only

## training objective: language modeling

**language modeling**: model a probability distribution $p_∗(x)$ over variable-length text sequences $x = (w_1, ... , w_T)$ composed of tokens from a vocabulary, $w_t \in \mathcal{V}$. 

We approximate the true distribution $p_∗$ by a neural language model $p_{\theta}$ parameterized by $\theta$, which factorizes the sequence probability autoregressively:

$$p_{\theta}(x) = \prod^T_{t=1} p_{\theta}(w_t|w_{<t})$$
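
As a small illustration of this factorization, the sketch below scores a sequence by summing per-token log-probabilities. `TOY_MODEL` is a hypothetical lookup table that conditions only on the previous token; a real $p_{\theta}$ would condition on the whole prefix $w_{<t}$.

```python
import math

# Hypothetical toy model (an assumption for illustration): next-token
# probabilities conditioned only on the previous token.
TOY_MODEL = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}


def sequence_log_prob(tokens):
    """Return log p(x) = sum_t log p(w_t | w_{<t}) under the toy model."""
    prev = "<bos>"
    total = 0.0
    for tok in tokens:
        # Summing logs is equivalent to multiplying probabilities,
        # but avoids numerical underflow on long sequences.
        total += math.log(TOY_MODEL[prev][tok])
        prev = tok
    return total


log_p = sequence_log_prob(["the", "cat", "<eos>"])
print(math.exp(log_p))  # product 0.6 * 0.5 * 1.0
```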

training objective: **maximum likelihood**: find parameters $\theta$ that maximize the log-likelihood of each next token $w_t$ given its preceding context $\{w_{< t}\}$, over a finite set of samples $\mathcal{D}$ from the true distribution $p_∗$. Equivalently, minimize the negative log-likelihood:

$$
L_{MLE}=-\sum_{i=1}^{|\mathcal{D}|}\sum_{t=1}^T \log p_{\theta}(w_t^{(i)}|\{w_{< t}\}^{(i)})=-\sum_{i=1}^{|\mathcal{D}|}\sum_{t=1}^T \log \frac{\exp(S_w)}{\sum_{w' \in \mathcal{V}} \exp(S_{w'})}
$$

$S=f(\{w_{< t}\}) \in \mathbb{R}^{|\mathcal{V}|}$ is the logit vector output by the model $p_{\theta}$; it contains a score for each token in the vocabulary, and $S_w$ denotes the entry for token $w$ (in the loss, the target token $w = w_t^{(i)}$). Applying softmax turns $S$ into a probability distribution.
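
The loss for a single sequence can be sketched in plain Python as follows. The logits and target indices are made-up values over a hypothetical 3-token vocabulary, and `log_softmax` and `nll_loss` are illustrative helper names rather than any library's API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector S."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return [s - log_z for s in logits]


def nll_loss(logits_per_step, target_ids):
    """Sum over time steps of -log p(w_t | w_{<t}) for one sequence."""
    loss = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        loss -= log_softmax(logits)[target]
    return loss


# Hypothetical logits for a 2-token sequence over a 3-token vocabulary.
logits_per_step = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
target_ids = [0, 2]
loss = nll_loss(logits_per_step, target_ids)
print(loss)
```

The second step has uniform logits, so it contributes exactly $\log 3$ to the loss regardless of the target.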


### neural text degeneration

text degeneration: generated text is nonsensical, repetitive, or inconsistent, especially in open-ended tasks

some phenomena:

- **Repetition**: Models repeat phrases or sentences that were previously generated.

- **Overuse of Common Phrases**: generated text sounds generic and dull and lacks diversity.

- Inconsistency: e.g., describe the weather as sunny and then, a few sentences later, mention that it's raining.

- Nonsensical Text: model generates text that is syntactically correct but semantically meaningless.

reason: 

- trained by maximum likelihood objective that focuses on optimizing next-token conditional distributions

- deterministic decoding.

- exposure bias: discrepancy between the training and inference stages of a model.

    training: the model is "exposed" to the true data distribution because it is fed the ground-truth previous tokens (teacher forcing).

    inference: the model generates text autoregressively, using its own previous outputs as inputs for future predictions. If an early prediction is wrong, the model conditions on a prefix unlike those seen in training and often cannot recover, so one mistake propagates into subsequent errors in the generated text.


### solution of repetition

new training objective:

- [token-level unlikelihood (Welleck et al., 2020)](https://arxiv.org/pdf/1908.04319.pdf): MLE loss plus an unlikelihood loss that penalizes probability mass placed on repeated and overly frequent tokens

$$
L_{t}(p_{\theta}(\cdot|w_{<t}), C_{t}) = - \underbrace{\log p_{\theta}(w_{t}|w_{<t})}_{\text{likelihood}}- \alpha \underbrace{\sum_{c \in C_{t}} \log(1 - p_{\theta}(c|w_{<t}))}_{\text{unlikelihood}} 
$$
  
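$C_t = \{w_1, \ldots, w_{t-1}\} \setminus \{w_t\}$: set of negative candidates, defined as the previous context tokens excluding the current target token $w_t$; $\alpha$ weights the unlikelihood term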
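
A minimal sketch of the per-step loss, assuming the next-token distribution is given as a plain dictionary; `unlikelihood_step_loss` is an illustrative name, and the probabilities are made up. The negative candidates are the context tokens other than the gold target, following Welleck et al. (2020).

```python
import math


def unlikelihood_step_loss(probs, target, context, alpha=1.0):
    """Likelihood term plus alpha-weighted unlikelihood term for one step.

    probs:   dict token -> p(token | w_{<t})
    target:  gold token w_t
    context: previous tokens w_1..w_{t-1}
    """
    # Negative candidates: context tokens, excluding the gold target.
    candidates = set(context) - {target}
    likelihood = -math.log(probs[target])
    unlikelihood = -sum(math.log(1.0 - probs.get(c, 0.0)) for c in candidates)
    return likelihood + alpha * unlikelihood


context = ["the", "cat", "the"]
# A model that puts a lot of mass on already-seen tokens...
probs_repeat = {"the": 0.5, "cat": 0.3, "sat": 0.2}
# ...versus one that puts most mass on the gold token "sat".
probs_clean = {"the": 0.05, "cat": 0.05, "sat": 0.9}

loss_repeat = unlikelihood_step_loss(probs_repeat, "sat", context)
loss_clean = unlikelihood_step_loss(probs_clean, "sat", context)
print(loss_repeat, loss_clean)
```

The repetition-prone distribution is penalized twice: once for low probability on the gold token and once for high probability on tokens already in the context.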

- coverage loss (See et al., 2017): penalizes the model if it attends to the same source words repeatedly (over-translation) or ignores certain source words (under-translation).

new decoding objective:

- contrastive decoding (Li et al., 2022): find the string $x$ that maximizes the difference between its log-likelihood under a large language model and under a smaller one. This encourages text that is likely under the large model but unlikely under the small model.

    $$\max_x \; \log P_{\text{largeLM}} (x) - \log P_{\text{smallLM}} (x)$$
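
The contrastive objective can be sketched at the level of a single decoding step. The two distributions below are made-up numbers, and `contrastive_scores` is an illustrative name. The sketch also applies the plausibility constraint from Li et al. (2022), which is not stated above: only tokens whose large-model probability is at least `alpha` times the maximum are scored, so that rare tokens cannot win purely because the small model dislikes them.

```python
import math


def contrastive_scores(p_large, p_small, alpha=0.1):
    """Score next-token candidates by log p_large - log p_small.

    p_large, p_small: dict token -> next-token probability under each model.
    alpha: plausibility threshold relative to the large model's top token.
    """
    threshold = alpha * max(p_large.values())
    scores = {}
    for tok, p in p_large.items():
        if p >= threshold:  # keep only tokens the large model finds plausible
            scores[tok] = math.log(p) - math.log(p_small[tok])
    return scores


# Hypothetical next-token distributions (each sums to 1).
p_large = {"Honolulu": 0.30, "Hawaii": 0.45, "the": 0.24, "zzz": 0.01}
p_small = {"Honolulu": 0.05, "Hawaii": 0.50, "the": 0.4499, "zzz": 0.0001}

scores = contrastive_scores(p_large, p_small)
best = max(scores, key=scores.get)
print(best, scores)
```

Without the threshold, `"zzz"` would receive the highest score (its probability ratio is 100) even though the large model assigns it only 1% probability.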

stochastic decoding algorithm:

- temperature: can be tuned for both beam search and sampling. Apply a temperature hyperparameter $\tau$ to the softmax to rebalance the distribution.

    $$
    p_{\theta}(w_t^{(i)}|\{w_{< t}\}^{(i)})=\frac{\exp(S_w/\tau)}{\sum_{w' \in \mathcal{V}} \exp(S_{w'}/\tau)}
    $$

    $\tau > 1$: $p_{\theta}$ becomes more uniform, more diverse output

    $\tau < 1$: $p_{\theta}$ becomes more peaked, less diverse output

- Typical Sampling (Meister et al., 2022): keep only tokens whose information content $-\log p_{\theta}(w|w_{<t})$ is close to the entropy of the next-token distribution, then renormalize and sample.

- Epsilon Sampling (Hewitt et al., 2022): truncate tokens whose probability falls below a threshold $\epsilon$, then renormalize and sample.

- reranking: define a score that approximates sequence quality (e.g., style, discourse, entailment, logical consistency) and re-rank a set of decoded sequences by this score.
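
Two of the items above, temperature scaling and epsilon truncation, are simple enough to sketch directly. The logits are made-up values, the function names are illustrative, and the fallback to the single most probable token when nothing passes the threshold is an assumption of this sketch.

```python
import math
import random


def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits divided by temperature tau (tau > 0)."""
    scaled = [s / tau for s in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def epsilon_truncate(probs, epsilon):
    """Zero out tokens with probability below epsilon, then renormalize."""
    kept = [p if p >= epsilon else 0.0 for p in probs]
    if not any(kept):
        # Assumed fallback: keep the most probable token if none pass.
        best = max(range(len(probs)), key=lambda i: probs[i])
        kept[best] = probs[best]
    z = sum(kept)
    return [p / z for p in kept]


logits = [2.0, 1.0, 0.0, -3.0]
base = softmax_with_temperature(logits, tau=1.0)
flat = softmax_with_temperature(logits, tau=2.0)   # closer to uniform
sharp = softmax_with_temperature(logits, tau=0.5)  # more peaked

truncated = epsilon_truncate(base, epsilon=0.01)
rng = random.Random(0)
token_id = rng.choices(range(len(truncated)), weights=truncated)[0]
print(base, flat, sharp, truncated, token_id)
```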

### solution of exposure bias 

- Scheduled sampling (Bengio et al., 2015): with some probability $p$, feed the model's own decoded token as the next input rather than the gold token (plain teacher forcing always feeds the gold token), and increase $p$ over the course of training.

- Dataset Aggregation (DAgger; Ross et al., 2011): at various intervals during training, generate sequences from the current model and add them to the training set as additional examples.

- retrieval augmentation (Guu, Hashimoto, et al., 2018)

    input: a prompt (e.g., dialogue query)

    output: human-like edited sequence (e.g., dialogue responses)

    two-step process:

    Retrieve: The model queries a prototype database of human-written text to find a sequence that is relevant to the input.

    Edit: The model then modifies the retrieved sequence to better fit the context of the input. This could involve adding, removing, or changing words or phrases in the retrieved text.


- Reinforcement Learning: cast the text generation model as a Markov decision process and learn behaviors by rewarding the model when it exhibits them.

    reward function: 1. use evaluation metrics, e.g., BLEU for machine translation, ROUGE for summarization, CIDEr and SPIDEr for image captioning. 2. learn a reward model from human preferences (RLHF)

    rewardable behaviors: formality, politeness, consistency, sentence simplicity
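
The scheduled sampling item above can be sketched as a choice of decoder inputs. This is a simplified illustration: the linear schedule, the cap `p_max`, and the function names are assumptions of this sketch, and the model predictions are passed in as a fixed list rather than produced by a real decoder.

```python
import random


def sampling_probability(step, total_steps, p_max=0.5):
    """Assumed linear schedule: p grows from 0 to p_max over training."""
    return p_max * min(1.0, step / total_steps)


def choose_decoder_inputs(gold_tokens, model_predictions, p, rng):
    """Per position, feed the model's own token with probability p, else gold.

    gold_tokens[t] and model_predictions[t] are the two candidates for the
    token fed to the decoder at the step after position t.
    """
    inputs = []
    for gold, pred in zip(gold_tokens, model_predictions):
        inputs.append(pred if rng.random() < p else gold)
    return inputs


gold = ["the", "cat", "sat", "down"]
predicted = ["the", "dog", "sat", "up"]  # hypothetical model outputs
rng = random.Random(0)

for step in (0, 500, 1000):
    p = sampling_probability(step, total_steps=1000)
    print(step, p, choose_decoder_inputs(gold, predicted, p, rng))
```

With `p = 0` this reduces to teacher forcing; as `p` grows, the model increasingly trains on prefixes containing its own mistakes, which is the condition it faces at inference.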



## decoding algorithm

the model produces a probability distribution over all possible output tokens at each time step

a decoding algorithm $g$ uses these probabilities to choose each output token, either by selecting high-probability tokens (deterministic) or by sampling (stochastic).

$$
\hat w_t = g(P(w_t |\{w_{< t}\}))
$$

decoding algorithms can be divided into deterministic (greedy search, beam search) and stochastic (various sampling methods)

<table>
    <thead>
        <tr>
            <th>Deterministic or Stochastic</th>
            <th>Algorithm</th>
            <th>Description</th>
            <th>Pros</th>
            <th>Cons</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan="3">Deterministic</td>
            <td>Greedy Search</td>
            <td>Select the most probable token at each step</td>
            <td>Computationally efficient</td>
            <td>Suboptimal; can get stuck in local optima</td>
        </tr>
        <tr>
            <td>Beam Search</td>
            <td>Keep the K highest-scoring partial hypotheses at each step</td>
            <td>Better quality sequences; more efficient than exhaustive search</td>
            <td>Can suffer from duplicate hypotheses; higher computational cost than greedy search</td>
        </tr>
        <tr>
            <td>Length-Normalized Search</td>
            <td>Normalize sequence probability by length</td>
            <td>Generates sequences of desirable length</td>
            <td>Can be sensitive to the normalization factor</td>
        </tr>
        <tr>
            <td rowspan="3">Stochastic</td>
            <td>Top-k Sampling</td>
            <td>Sample from the k most probable tokens (e.g., k = 50)</td>
            <td>More diverse sequences; introduces randomness</td>
            <td>Risk of generating incoherent sequences</td>
        </tr>
        <tr>
            <td>Top-p (Nucleus) Sampling</td>
            <td>Sample from the smallest set of most probable tokens whose cumulative probability reaches p</td>
            <td>Dynamic token selection; more diverse sequences</td>
            <td>Risk of generating incoherent sequences; higher computational cost than greedy search</td>
        </tr>
        <tr>
            <td>Temperature-Scaled Sampling</td>
            <td>Modify token probabilities using temperature</td>
            <td>Control over diversity and randomness</td>
            <td>Requires tuning the temperature parameter; risk of generating incoherent sequences</td>
        </tr>
    </tbody>
</table>
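
To make the difference between the sampling rows concrete, here is a minimal sketch of greedy selection, top-k filtering, and top-p (nucleus) filtering over one next-token distribution. The probabilities are made up and the function names are illustrative.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:k]
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}


def sample(filtered, rng):
    """Draw one token id from a filtered, renormalized distribution."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids])[0]


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
greedy = max(range(len(probs)), key=lambda i: probs[i])
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)

rng = random.Random(0)
print(greedy, top2, nucleus, sample(nucleus, rng))
```

Top-k always keeps a fixed number of tokens, while top-p keeps more tokens when the distribution is flat and fewer when it is peaked.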


### beam search

- beam search is a heuristic search algorithm widely used in sequence generation tasks to approximate the most likely output sequence in a computationally efficient manner. Greedy search is the special case K = 1.

- Beam search provides a trade-off between the computational cost and search quality.

- hyperparameter: beam width (K: number of hypotheses kept at each step)

    A larger beam width increases the likelihood of finding a higher-probability sequence but requires more computational resources.

**Algorithm**

1. Initialize the beam with the single sequence [BOS]

2. At each time step

    expand each sequence in the beam by considering all possible next tokens.

    Score each expanded sequence using a scoring function (typically the sum of token log-probabilities).

    Keep only the top K sequences (where K is the beam width) based on their scores and discard the rest.

3. Repeat step 2 until every sequence in the beam has reached [EOS] (or a maximum length).

4. Choose the sequence with the highest score as the final output.

<img src='https://d2l.ai/_images/beam-search.svg' />
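
The steps above can be sketched as follows. `toy_model` is a hypothetical next-token table that conditions only on the last token, the score is the sum of token log-probabilities without length normalization, and `max_len` is an added safeguard. These are assumptions of this sketch.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, bos="[BOS]", eos="[EOS]"):
    """Return (best_sequence, log_prob) found with the given beam width.

    next_log_probs(prefix) -> dict token -> log p(token | prefix)
    """
    beam = [([bos], 0.0)]  # step 1: start from [BOS]
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished: carry over as is
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))  # expand + score
        # Keep only the top K hypotheses.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beam):
            break
    return max(beam, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical model: the first token 'a' looks best but leads to worse totals."""
    table = {
        "[BOS]": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"x": 0.9, "y": 0.1},
        "x": {"[EOS]": 1.0},
        "y": {"[EOS]": 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}


greedy_seq, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
beam_seq, beam_score = beam_search(toy_model, beam_width=2, max_len=5)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With K = 1 the search commits to `a` (probability 0.6) and ends with total probability 0.3, while K = 2 keeps `b` alive and finds the sequence through `b`, `x` with probability 0.36.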

## evaluation

|  | Content Overlap Metrics | Model-Based Metrics | Human Evaluation |
|---|---|---|---|
| Description | lexical similarity between generated and gold-standard text | semantic similarity between generated and gold-standard text, <br>computed with pretrained word/sentence embeddings | human judgment along various dimensions (fluency, coherence/consistency, factuality and correctness, commonsense, style/formality, grammaticality, typicality, redundancy) |
| Example Metrics | BLEU, ROUGE, METEOR, CIDEr | Word-level: Vector Similarity, Word Mover’s Distance, BERTScore<br>Sentence-level: Sentence Mover’s Similarity, BLEURT, MAUVE | ADEM, HUSE (automatic or hybrid metrics built on human judgments) |
| Pros | simple and widely used; work reasonably well for close-ended tasks | better suited to open-ended tasks | gold standard |
| Cons | poorly suited to open-ended tasks, where many different outputs are acceptable | require a pretrained model for embeddings; fixed embeddings may not capture all nuances | inconsistent, unreproducible, time-consuming and expensive |
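
As an illustration of what a content overlap metric measures, the sketch below computes clipped n-gram precision, the core quantity inside BLEU. It is deliberately not full BLEU: there is no brevity penalty, no combination across n-gram orders, and only a single reference. The function names and example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with clipping.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference, so repeating a matching word does not inflate the score.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


candidate = "the the the cat".split()
reference = "the cat sat on the mat".split()

unigram_precision = clipped_ngram_precision(candidate, reference, n=1)
bigram_precision = clipped_ngram_precision(candidate, reference, n=2)
print(unigram_precision, bigram_precision)
```

The example also shows the weakness noted in the table: the score depends only on surface overlap with one reference, so a fluent paraphrase with different wording would score poorly.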


Overall, it's important to remember that evaluation metrics are just tools, and they may not capture all aspects of text quality. 

Researchers should also examine the model's outputs themselves and release large samples of these outputs for the community to review.