Sheet 3.3: Prompting & Decoding
=======
**Author**: Polina Tsvilodub & Michael Franke

This sheet provides more detail on concepts that were mentioned in passing in the previous sheets, and offers practical examples and exercises for the prompting techniques covered in lecture four. The learning goals for this sheet are to:
* take a closer look at, and understand, various decoding schemes,
* understand the temperature parameter,
* see a few practical examples of prompting techniques from the lecture.

## Decoding schemes

This part of this sheet is a close replication of [this](https://michael-franke.github.io/npNLG/06-LSTMs/06d-decoding-GPT2.html) sheet.

This topic addresses the following question: Given a language model that outputs a next-word probability distribution, how do we use it to actually generate natural-sounding text? For that, we need to choose a single next token from the distribution, which we then feed back to the model, together with the preceding tokens, so that it can generate the next one. This inference procedure is repeated until the EOS token is chosen or a maximal sequence length is reached. The procedure for choosing that single token from the distribution is called a *decoding scheme*. Note that "decoding schemes" and "decoding strategies" refer to the same concept and are used interchangeably.
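
The loop described above can be sketched without any neural network. The following is a minimal illustration, not the `transformers` implementation: `TOY_LM`, `next_token_distribution`, `generate`, `greedy` and `make_pure_sampler` are hypothetical names, and the toy "model" only looks at the last token, whereas a real LM conditions on the whole prefix.

```python
import random

# Hypothetical toy "language model": maps the last token to a
# next-token distribution (a real LM conditions on the full prefix).
TOY_LM = {
    "<BOS>": {"I": 0.6, "You": 0.4},
    "I": {"walk": 0.7, "sleep": 0.3},
    "You": {"walk": 0.5, "sleep": 0.5},
    "walk": {"<EOS>": 0.8, "I": 0.2},
    "sleep": {"<EOS>": 1.0},
}


def next_token_distribution(prefix):
    """Return P(next token | prefix) for the toy model."""
    return TOY_LM[prefix[-1]]


def generate(choose, max_new_tokens=10):
    """Pick a token, append it, and repeat until EOS or the length limit."""
    tokens = ["<BOS>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = choose(dist)  # this call is the decoding scheme
        tokens.append(token)
        if token == "<EOS>":
            break
    return tokens


def greedy(dist):
    """Decoding scheme 1: always take the most likely token."""
    return max(dist, key=dist.get)


def make_pure_sampler(seed):
    """Decoding scheme 2: sample with exactly the model's probabilities."""
    rng = random.Random(seed)  # seeded for replicable results

    def choose(dist):
        words = list(dist)
        return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

    return choose


print(generate(greedy))  # ['<BOS>', 'I', 'walk', '<EOS>']
print(generate(make_pure_sampler(seed=0)))
```

Swapping the `choose` argument is all it takes to change the decoding scheme; the surrounding loop stays the same.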

We have already discussed decoding schemes in lecture 02 (slide 25). The following introduces these schemes in more detail again and provides example code for configuring some of them. 

> <strong><span style="color:#D83D2B;">Exercise 3.3.1: Decoding schemes</span></strong>
>
> Please read through the following introduction and look at the provided code.
> 1. With the help of the example and the documentation, please complete the code (where it says "YOUR CODE HERE") for all the decoding schemes.

Common decoding strategies are:
* **pure sampling**: In a pure sampling approach, we sample each next word with exactly the probability assigned to it by the LM. This process is therefore non-deterministic, but we can force replicable results by setting a *seed*.
* **softmax sampling**: In softmax sampling, the probability of sampling word $w_i$ is $P_{\text{sample}}(w_i \mid w_{1:i-1}) \propto \exp(\frac{1}{\tau} \log P_{LM}(w_i \mid w_{1:i-1}))$, where $\tau$ is a *temperature parameter*.
  * The *temperature parameter* is also often available for closed-source models like the GPT family. It is often said to change the "creativity" of the output.
* **greedy sampling**: In greedy sampling, we don’t actually sample but just take the most likely next-word at every step. Greedy sampling is the limiting case of softmax sampling as $\tau \to 0$. It is also sometimes referred to as *argmax* decoding.
* **beam search**: In simplified terms, beam search is a parallel search procedure that keeps a number $k$ of the most probable partial sequences (paths) open at each choice point, dropping the least likely as we go along. (There is actually no unanimity in what exactly beam search means for NLG.)
* **top-$k$ sampling**: This sampling scheme looks at the $k$ most likely next-words and samples from them so that: $$P_{\text{sample}}(w_i  \mid w_{1:i-1}) \propto \begin{cases} P_{LM}(w_i \mid w_{1:i-1}) & \text{if} \; w_i \text{ in top-}k \\ 0 & \text{otherwise} \end{cases}$$
* **top-$p$ sampling**: Top-$p$ sampling is similar to top-$k$ sampling, but restricts sampling not to the top-$k$ most likely words (so always the same number of words), but to the smallest set of most likely words whose summed probability reaches or exceeds the threshold $p$.
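
To make the difference between the two truncation schemes concrete, here is a small sketch on a made-up next-word distribution. The helper names (`renormalize`, `top_k_filter`, `top_p_filter`) and the example probabilities are hypothetical. The top-$p$ variant assumes the convention of keeping the smallest set of most likely words whose cumulative probability reaches $p$, which is how the `top_p` parameter of `transformers` is documented.

```python
def renormalize(dist):
    """Rescale the kept probabilities so that they sum to 1."""
    total = sum(dist.values())
    return {w: p / total for w, p in dist.items()}


def top_k_filter(dist, k):
    """Keep only the k most likely words, then renormalize."""
    kept = sorted(dist, key=dist.get, reverse=True)[:k]
    return renormalize({w: dist[w] for w in kept})


def top_p_filter(dist, p):
    """Keep the smallest set of most likely words whose mass reaches p."""
    kept, cumulative = {}, 0.0
    for w in sorted(dist, key=dist.get, reverse=True):
        kept[w] = dist[w]
        cumulative += dist[w]
        if cumulative >= p:
            break
    return renormalize(kept)


# Made-up next-word distribution, for illustration only
dist = {"dog": 0.5, "cat": 0.25, "bird": 0.15, "fish": 0.1}

print(top_k_filter(dist, k=2))    # keeps "dog" and "cat"
print(top_p_filter(dist, p=0.8))  # keeps "dog", "cat" and "bird"
```

Note that top-$k$ always keeps the same number of words, while the number of words kept by top-$p$ depends on how peaked the distribution is.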

Within the `transformers` package, the `.generate()` function is available for all causal LMs and allows us to sample text from the model (remember the brief introduction in [sheet 2.5](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02e-intro-to-hf.html)). Configuring this function via different values and combinations of its parameters allows us to generate text with the different decoding schemes described above. The respective documentation can be found [here](https://huggingface.co/docs/transformers/v4.40.2/en/generation_strategies#decoding-strategies). The same configurations can be passed to the `pipeline` endpoint which we have seen in the same sheet.

Check out [this](https://medium.com/@harshit158/softmax-temperature-5492e4007f71) blog post for very nice visualizations and more details on the *temperature* parameter.
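
The effect of the temperature can also be inspected numerically. The sketch below assumes the common formulation in which log-probabilities (equivalently, logits) are divided by $\tau$ before renormalizing. `apply_temperature` and the example distribution are hypothetical and only meant for illustration.

```python
import math


def apply_temperature(dist, tau):
    """Rescale so that P_tau(w) is proportional to exp(log P(w) / tau).

    Assumes all probabilities are strictly positive and tau > 0.
    """
    scaled = {w: math.exp(math.log(p) / tau) for w, p in dist.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


# Made-up next-word distribution, for illustration only
dist = {"dog": 0.5, "cat": 0.25, "bird": 0.15, "fish": 0.1}

for tau in (0.1, 1.0, 10.0):
    rescaled = apply_temperature(dist, tau)
    print(tau, {w: round(p, 3) for w, p in rescaled.items()})
```

With a low temperature almost all probability mass moves to the most likely word, with $\tau = 1$ the distribution is unchanged, and with a high temperature it approaches a uniform distribution.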

Please complete the code below. GPT-2 is used as an example model, but this works exactly the same with any other causal LM from HF.

In [None]:
# import relevant packages
import torch 
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# convenience function for nicer output
def pretty_print(s):
    print("Output:\n" + 100 * '-')
    print(tokenizer.decode(s, skip_special_tokens=True))

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='pt')


In [None]:
# set a seed for reproducibility (if you want)
torch.manual_seed(199)

# below, greedy decoding is implemented
# NOTE: while it is the default for .generate(), it is NOT for pipeline()

greedy_output = model.generate(input_ids, max_new_tokens=10)
# pretty_print already prints and returns None, so it is not wrapped in print()
pretty_print(greedy_output[0])

# here, beam search is shown
# option `early_stopping=True` stops the search as soon as `num_beams` finished candidates are available
beam_output = model.generate(
    input_ids, 
    max_new_tokens=10, 
    num_beams=3, 
    early_stopping=True
) 

pretty_print(beam_output[0])


#  pure sampling
sample_output = model.generate(
    input_ids,        # context to continue
    #### YOUR CODE HERE ####
    max_new_tokens=10, # return maximally 10 new tokens (following the input)
)

pretty_print(sample_output[0])

# same as pure sampling before but with the `temperature` parameter
SM_sample_output = model.generate(
    input_ids,        # context to continue
    #### YOUR CODE HERE ####
    max_new_tokens=10,
)

pretty_print(SM_sample_output[0])

# top-k sampling 
top_k_output = model.generate(
    input_ids, 
    #### YOUR CODE HERE ####
    max_new_tokens=10,
)

pretty_print(top_k_output[0])

# top-p sampling
top_p_output = model.generate(
    input_ids, 
    #### YOUR CODE HERE ####
    max_length=50, 
)

pretty_print(top_p_output[0])


> <strong><span style="color:#D83D2B;">Exercise 3.3.2: Understanding decoding schemes</span></strong>
>
> Think about the following questions about the different decoding schemes.
>  
> 1. Why is the temperature parameter in softmax sampling sometimes referred to as a creativity parameter? Hint: Think about the shape of the distribution from which the next word is sampled, and how it compares to the "pure" distribution when the temperature parameter is varied.
> 2. Just for yourself, draw a diagram of how beam decoding that starts with the BOS token and results in the sentence "BOS Attention is all you need" might work, assuming k=3 and random other tokens of your choice.
> 3. Which decoding scheme seems to work best for GPT-2? 
> 4. Which of the decoding schemes included in this work sheet is a special case of which other decoding scheme(s)? E.g., X is a special case of Y if the behavior of X is obtained when we set certain parameters of Y to specific values.
> 5. Can you see pros and cons to using some of these schemes over others?
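
For checking a hand-drawn beam search diagram, a toy implementation can be handy. The following is a simplified sketch under stated assumptions, not the algorithm used by `.generate()` (which, among other things, supports length penalties): `TOY_LM` and `beam_search` are hypothetical, the toy model only looks at the last token, and finished hypotheses are simply set aside and compared by total log-probability at the end.

```python
import math

# Hypothetical toy model: next-token distribution given the last token only
TOY_LM = {
    "<BOS>": {"The": 0.6, "A": 0.4},
    "The": {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    "A": {"dog": 0.9, "cat": 0.1},
    "dog": {"<EOS>": 1.0},
    "cat": {"<EOS>": 1.0},
    "bird": {"<EOS>": 1.0},
}


def beam_search(lm, k=3, max_new_tokens=5):
    """Return (log-probability, tokens) of the best hypothesis found."""
    beams = [(0.0, ["<BOS>"])]  # open hypotheses
    finished = []               # hypotheses that ended in <EOS>
    for _ in range(max_new_tokens):
        # Expand every open hypothesis by every possible next token
        candidates = []
        for score, tokens in beams:
            for word, p in lm[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [word]))
        # Keep only the k most probable expansions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:k]:
            if tokens[-1] == "<EOS>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])


score, tokens = beam_search(TOY_LM, k=2)
print(tokens)  # ['<BOS>', 'A', 'dog', '<EOS>']

score, tokens = beam_search(TOY_LM, k=1)
print(tokens)  # ['<BOS>', 'The', 'dog', '<EOS>']
```

With $k=1$ the procedure reduces to greedy decoding and commits to "The", while $k=2$ keeps "A" open long enough to find the overall more probable sequence.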

**Outlook** 

There are also other more recent schemes, e.g., [locally typical sampling](https://arxiv.org/abs/2202.00666) introduced by Meister et al. (2022).
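
As a rough illustration of the idea behind locally typical sampling, the sketch below keeps the words whose surprisal ($-\log P$) is closest to the entropy of the next-word distribution, until a chosen probability mass is covered. This is a simplified reading of the scheme, not the authors' reference implementation. `typical_filter`, its `mass` parameter and the example distribution are hypothetical.

```python
import math


def typical_filter(dist, mass=0.9):
    """Keep words with surprisal closest to the entropy until `mass` is covered.

    Assumes all probabilities are strictly positive.
    """
    entropy = -sum(p * math.log(p) for p in dist.values())
    # Rank words by the distance between their surprisal and the entropy
    ranked = sorted(dist, key=lambda w: abs(-math.log(dist[w]) - entropy))
    kept, cumulative = {}, 0.0
    for w in ranked:
        kept[w] = dist[w]
        cumulative += dist[w]
        if cumulative >= mass:
            break
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}


# Made-up next-word distribution, for illustration only
dist = {"the": 0.4, "a": 0.2, "my": 0.2, "dog": 0.1, "cat": 0.1}

print(typical_filter(dist, mass=0.35))  # keeps "a" and "my", not "the"
```

Unlike top-$p$ sampling, this scheme can exclude the single most likely word when its surprisal is far below the entropy of the distribution.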

## Prompting strategies