Sheet 3.3: Prompting & Decoding
=======
**Author**: Polina Tsvilodub & Michael Franke

This sheet provides more details on concepts that were mentioned in passing in the previous sheets, and offers practical examples and exercises for prompting techniques covered in lecture four. The learning goals for this sheet are:
* take a closer look at and understand various decoding schemes,
* understand the temperature parameter,
* see a few practical examples of prompting techniques from the lecture.

## Decoding schemes

This part of this sheet is a close replication of [this](https://michael-franke.github.io/npNLG/06-LSTMs/06d-decoding-GPT2.html) sheet.

This topic addresses the following question: Given a language model that outputs a next-word probability distribution, how do we use this to actually generate natural-sounding text? For that, we need to choose a single next token from the distribution, which we then feed back to the model, together with the preceding tokens, so that it can generate the next one. This inference procedure is repeated until the EOS token is chosen or a maximal sequence length is reached. The procedure for choosing that single token from the distribution is called a *decoding scheme*. Note that "decoding schemes" and "decoding strategies" refer to the same concept and are used interchangeably.
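
The loop described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the `transformers` implementation: `TOY_LM`, `next_token_distribution`, `generate`, `greedy` and `pure_sampling` are hypothetical names, and the hand-written table stands in for a real LM (which would condition on the whole prefix, not just the last token).

```python
import random

# Hypothetical toy "language model": maps the previous token to a
# next-token distribution. A real LM conditions on the whole prefix.
TOY_LM = {
    "<BOS>": {"the": 0.6, "a": 0.4},
    "the": {"dog": 0.5, "cat": 0.3, "<EOS>": 0.2},
    "a": {"dog": 0.7, "<EOS>": 0.3},
    "dog": {"barks": 0.6, "<EOS>": 0.4},
    "cat": {"sleeps": 0.5, "<EOS>": 0.5},
    "barks": {"<EOS>": 1.0},
    "sleeps": {"<EOS>": 1.0},
}


def next_token_distribution(tokens):
    """Return P(next token | prefix); here only the last token matters."""
    return TOY_LM[tokens[-1]]


def generate(choose, max_new_tokens=10):
    """Repeatedly pick one token and feed the extended prefix back in."""
    tokens = ["<BOS>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = choose(dist)      # the decoding scheme is plugged in here
        tokens.append(token)
        if token == "<EOS>":      # stop once EOS is chosen
            break
    return tokens


def greedy(dist):
    """Pick the most likely token."""
    return max(dist, key=dist.get)


def pure_sampling(dist, rng=random.Random(0)):
    """Sample a token with exactly the probability the model assigns."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]


greedy_tokens = generate(greedy)
sampled_tokens = generate(pure_sampling)
print(greedy_tokens)
print(sampled_tokens)
```

The only part that changes between decoding schemes is the `choose` function; the surrounding loop stays the same.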

We have already discussed decoding schemes in lecture 02 (slide 25). The following introduces these schemes in more detail again and provides example code for configuring some of them. 

> <strong><span style="color:#D83D2B;">Exercise 3.3.1: Decoding schemes</span></strong>
>
> Please read through the following introduction and look at the provided code. 
> 1. With the help of the example and the documentation, please complete the code (where it says "### YOUR CODE HERE ####") for all the decoding schemes.

Common decoding strategies are:
* **pure sampling**: In a pure sampling approach, we just sample each next word with exactly the probability assigned to it by the LM. Notice that this process, therefore, is non-deterministic. We can force replicable results, though, by setting a *seed*.
* **Softmax sampling**: In soft-max sampling, the probability of sampling word $w_i$ is $P_{\text{sample}} (w_i \mid w_{1:i-1}) \propto \exp(\frac{1}{\tau} \log P_{LM}(w_i \mid w_{1:i-1}))$, where $\tau$ is a *temperature parameter*. For $\tau = 1$, this reduces to pure sampling.
  * The *temperature parameter* is also often available for closed-source models like the GPT family. It is often said to change the "creativity" of the output.
* **greedy sampling**: In greedy sampling, we don’t actually sample but just take the most likely next word at every step. Greedy sampling corresponds to the limit $\tau \to 0$ of soft-max sampling. It is also sometimes referred to as *argmax* decoding.
* **beam search**: In simplified terms, beam search is a parallel search procedure that keeps a number $k$ of path probabilities open at each choice point, dropping the least likely as we go along. (There is actually no unanimity in what exactly beam search means for NLG.)
* **top-$k$ sampling**: This sampling scheme looks at the $k$ most likely next words and samples from them so that: $$P_{\text{sample}}(w_i  \mid w_{1:i-1}) \propto \begin{cases} P_{LM}(w_i \mid w_{1:i-1}) & \text{if} \; w_i \text{ in top-}k \\ 0 & \text{otherwise} \end{cases}$$
* **top-$p$ sampling**: Top-$p$ sampling is similar to top-$k$ sampling, but restricts sampling not to the top-$k$ most likely words (so always the same number of words), but to the smallest set of most likely words whose summed probability reaches the threshold $p$.
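
The following sketch shows how three of these schemes reshape a next-word distribution before sampling. It is an illustration on a made-up distribution, not the `transformers` implementation. The function names are hypothetical. Two assumptions are made: the temperature is applied to log-probabilities (equivalently, to logits), and top-$p$ keeps the smallest set of words whose cumulative probability reaches $p$.

```python
import math


def apply_temperature(probs, tau):
    """Rescale log-probabilities by 1/tau and renormalize (requires tau > 0)."""
    scaled = {w: math.exp(math.log(p) / tau) for w, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {w: p / total for w, p in kept}


def top_p_filter(probs, p_threshold):
    """Keep the smallest set of most likely words whose mass reaches p_threshold."""
    kept, mass = [], 0.0
    for w, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((w, p))
        mass += p
        if mass >= p_threshold:
            break
    return {w: p / mass for w, p in kept}


# made-up next-word distribution for illustration
probs = {"dog": 0.5, "cat": 0.3, "bird": 0.15, "fish": 0.05}

sharp = apply_temperature(probs, tau=0.5)   # more peaked than `probs`
flat = apply_temperature(probs, tau=2.0)    # closer to uniform than `probs`
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p_threshold=0.9)

for name, dist in [("tau=0.5", sharp), ("tau=2.0", flat), ("top-2", top2), ("top-p=0.9", nucleus)]:
    print(name, {w: round(p, 3) for w, p in dist.items()})
```

After any of these transformations, the next word is sampled from the resulting distribution exactly as in pure sampling.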

Within the `transformers` package, the `.generate()` function is available for all causal LMs and allows us to sample text from the model (remember the brief introduction in [sheet 2.5](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02e-intro-to-hf.html)). Configuring this function via different values and combinations of various parameters allows us to sample text with the different decoding schemes described above. The respective documentation can be found [here](https://huggingface.co/docs/transformers/v4.40.2/en/generation_strategies#decoding-strategies). The same configurations can be passed to the `pipeline` endpoint which we have seen in the same sheet.

Check out [this](https://medium.com/@harshit158/softmax-temperature-5492e4007f71) blog post for very nice visualizations and more details on the *temperature* parameter.

Please complete the code below. GPT-2 is used as an example model, but this works exactly the same with any other causal LM from HF.

In [None]:
# import relevant packages
import torch 
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# convenience function for nicer output
def pretty_print(s):
    print("Output:\n" + 100 * '-')
    print(tokenizer.decode(s, skip_special_tokens=True))

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='pt')


In [None]:
# set a seed for reproducibility (if you want)
torch.manual_seed(199)

# below, greedy decoding is implemented
# NOTE: while it is the default for .generate(), it is NOT for pipeline()

greedy_output = model.generate(input_ids, max_new_tokens=10)
pretty_print(greedy_output[0])

# here, beam search is shown
# option `early_stopping` implies stopping when all beams reach the end-of-sentence token
beam_output = model.generate(
    input_ids, 
    max_new_tokens=10, 
    num_beams=3, 
    early_stopping=True
) 

pretty_print(beam_output[0])


#  pure sampling
sample_output = model.generate(
    input_ids,        # context to continue
    #### YOUR CODE HERE ####
    max_new_tokens=10, # return maximally 10 new tokens (following the input)
)

pretty_print(sample_output[0])

# same as pure sampling before but with the `temperature` parameter
SM_sample_output = model.generate(
    input_ids,        # context to continue
    #### YOUR CODE HERE ####
    max_new_tokens=10,
)

pretty_print(SM_sample_output[0])

# top-k sampling 
top_k_output = model.generate(
    input_ids, 
    ### YOUR CODE HERE #### 
    max_new_tokens=10,
)

pretty_print(top_k_output[0])

# top-p sampling
top_p_output = model.generate(
    input_ids, 
    ### YOUR CODE HERE #### 
    max_length=50, 
)

pretty_print(top_p_output[0])


> <strong><span style="color:#D83D2B;">Exercise 3.3.2: Understanding decoding schemes</span></strong>
>
> Think about the following questions about the different decoding schemes.
>  
> 1. Why is the temperature parameter in softmax sampling sometimes referred to as a creativity parameter? Hint: Think about the shape of the distribution from which the next word is sampled, and how it compares to the "pure" distribution when the temperature parameter is varied.
> 2. Just for yourself, draw a diagram of how beam decoding that starts with the BOS token and results in the sentence "BOS Attention is all you need" might work, assuming k=3 and random other tokens of your choice.
> 3. Which decoding scheme seems to work best for GPT-2? 
> 4. Which of the decoding schemes included in this work sheet is a special case of which other decoding scheme(s)? E.g., X is a special case of Y if the behavior of X is obtained when we set certain parameters of Y to specific values.
> 5. Can you see pros and cons to using some of these schemes over others?
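
To check a hand-drawn beam search diagram against code, a toy version of the procedure can help. The sketch below is a simplified illustration under stated assumptions: `TOY_LM` and `beam_search` are hypothetical names, the probabilities are made up, and finished sequences are simply carried over (real implementations differ in details such as length normalization and stopping criteria).

```python
import math

# Hypothetical toy model: next-token probabilities given only the last token.
TOY_LM = {
    "<BOS>": {"the": 0.5, "a": 0.4, "dogs": 0.1},
    "the": {"dog": 0.4, "cat": 0.3, "<EOS>": 0.3},
    "a": {"dog": 0.9, "<EOS>": 0.1},
    "dogs": {"<EOS>": 1.0},
    "dog": {"<EOS>": 1.0},
    "cat": {"<EOS>": 1.0},
}


def beam_search(k=2, max_steps=5):
    """Keep the k highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<BOS>"])]            # (log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "<EOS>":     # finished beams are carried over
                candidates.append((score, tokens))
                continue
            for word, p in TOY_LM[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [word]))
        # prune: keep only the k most probable sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(tokens[-1] == "<EOS>" for _, tokens in beams):
            break
    return beams


best_score, best_tokens = beam_search(k=2)[0]
greedy_score, greedy_tokens = beam_search(k=1)[0]   # k=1 behaves like greedy decoding
print(best_tokens, round(math.exp(best_score), 2))
print(greedy_tokens, round(math.exp(greedy_score), 2))
```

In this toy example, the beam with $k=2$ finds a sequence with higher overall probability than the one found with $k=1$, because the locally best first word does not lead to the globally best sequence.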

**Outlook** 

There are also other more recent schemes, e.g., [locally typical sampling](https://arxiv.org/abs/2202.00666) introduced by Meister et al. (2022).
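
As a rough sketch of the idea behind locally typical sampling: instead of keeping the most likely words, one keeps the words whose surprisal ($-\log p$) is closest to the entropy of the next-word distribution, until a chosen probability mass is covered. The code below is a simplified illustration on a made-up distribution; the names `typical_filter` and `mass` are hypothetical and not taken from the paper or from any library.

```python
import math


def typical_filter(probs, mass=0.9):
    """Keep words whose surprisal is closest to the entropy, up to `mass`."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # rank words by how far their surprisal (-log p) is from the entropy
    ranked = sorted(
        ((w, p) for w, p in probs.items() if p > 0),
        key=lambda item: abs(-math.log(item[1]) - entropy),
    )
    kept, total = [], 0.0
    for w, p in ranked:
        kept.append((w, p))
        total += p
        if total >= mass:
            break
    # renormalize the kept words; sampling then proceeds as usual
    return {w: p / total for w, p in kept}


# made-up next-word distribution for illustration
probs = {"dog": 0.5, "cat": 0.3, "bird": 0.15, "fish": 0.05}
typical = typical_filter(probs, mass=0.25)
print(typical)
```

Note that, unlike in top-$k$ or top-$p$ sampling, the most likely word is not guaranteed to be kept: here, "cat" is closer to the entropy than "dog".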

## Prompting strategies

The lecture introduced different prompting techniques. (Note: "prompting technique" and "prompting strategy" refer to the same concept and are used interchangeably.)
Prompting techniques refer to the way (one could almost say -- the art) of constructing the inputs to the LM, so as to get optimal outputs for the task at hand. Note that prompting is complementary to choosing the right decoding scheme -- one still has to choose the decoding scheme for predicting the completion, given the prompt constructed via a particular prompting strategy.

Below, we provide a practical example of a simple prompting strategy, namely *few-shot prompting* (which is said to elicit *in-context learning*), and a more advanced example, namely *generated knowledge prompting*. These should serve as inspiration for your own implementations and explorations of other prompting schemes out there. Also, feel free to play around with the examples below to build your intuitions!
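
When experimenting with few-shot prompts, it can be convenient to assemble the prompt from a list of labeled examples rather than editing one long string. The helper below is a minimal sketch; `build_few_shot_prompt` and its parameters are hypothetical names, and the "Input: ... Sentiment: ..." format is just one possible choice.

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Sentiment"):
    """Format labeled examples plus an unlabeled query into one prompt string."""
    lines = [
        f"{input_label}: {text} {output_label}: {label}"
        for text, label in examples
    ]
    # the final line leaves the label open for the model to complete
    lines.append(f"{input_label}: {query} {output_label}:")
    return "\n".join(lines)


examples = [
    ("This class is awesome.", "positive"),
    ("This class is terrible.", "negative"),
    ("The class is informative.", "neutral"),
]
prompt = build_few_shot_prompt(examples, "The class is my favourite!")
print(prompt)
```

Keeping the examples in a list makes it easy to vary their number, order, or labels and observe how the prediction changes.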

**TODO** note on using this large model on Colab.

In [None]:
# uncomment and run in your environment / on Colab, if you haven't installed these packages yet
# !pip install "transformers[torch]" "huggingface_hub[inference]"

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
# few shot prompting 

few_shot_prompt = """
Input: This class is awesome. Sentiment: positive
Input: This class is terrible. Sentiment: negative
Input: The class is informative. Sentiment: neutral
"""
input_text = "The class is my favourite!"

full_prompt = few_shot_prompt + "\nInput: " + input_text + " Sentiment: "

input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids
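few_shot_prediction = model.generate(
    input_ids, max_new_tokens=10, 
    do_sample=True, 
    temperature=0.1
)
# decode and print the prompt together with the predicted continuation
print(tokenizer.decode(few_shot_prediction[0], skip_special_tokens=True))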

In [None]:
# generated knowledge prompting 
# TODO NOTE: the code below is deprecated and uses an old LangChain version -- it will be updated
# better explanations will also be added
# for now, it's for the vibe

import os
import pandas as pd
from dotenv import load_dotenv
import re
import numpy as np
from langchain import PromptTemplate, FewShotPromptTemplate
from langchain.prompts.example_selector.base import BaseExampleSelector
from langchain.example_generator import generate_example
from langchain.chains import TransformChain
from langchain.chains.llm import LLMChain
import openai

from utils import init_model

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

class RandomExampleSelector(BaseExampleSelector):
    """
    Convenience class for loading few-shot examples from a file.
    """
    
    def __init__(self, examples: pd.DataFrame, num_facts=1):
        self.examples = examples
        self.num_facts = num_facts
       
        self.keys = ["input", "knowledge"]

    def add_example(self, example: pd.DataFrame) -> None:
        """Add new example to store for a key."""
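        # DataFrame.append was removed in pandas 2.0; concatenate and reassign instead
        self.examples = pd.concat([self.examples, example], ignore_index=True)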

    def select_examples(self, num_examples: int = 5) -> list[dict]:
        """Select which examples to use based on the inputs."""
        num_examples = min(num_examples, len(self.examples))
        selected_rows = self.examples.sample(n=num_examples, replace=False, random_state=42)
        
        selected_examples = []
        for _, row in selected_rows.iterrows():
            example = {}
            for key in self.keys:
                example[key] = row[key]
            selected_examples.append(example)

        return selected_examples

def transform_fct(inputs: dict) -> dict:
    """
    Transform raw output of the facts sampler (string)
    into a list of strings for easier further processing

    Parameters:
    -----------
    inputs: dict
        Dict with the key "knowledge" containing the proposed facts.
    Returns:
    --------
    facts: dict
        Dict with the key "facts" with a list of facts.
    """

    # strip leading dashes and digits (bullet / enumeration markers) from each fact
    if "\n" in inputs["knowledge"]:
        facts = [
            re.sub(r"^-|(\d)+", "", u).strip() 
            for u 
            in inputs["knowledge"].split("\n") 
            if len(u) > 3
        ]
    else:
        facts = [
            re.sub(r"^-|(\d)+", "", u).strip().replace("\n", "") 
            for u 
            in inputs["knowledge"].split(".") 
            if len(u) > 3
        ]
    return {"facts": facts}   


def sample_knowledge(model_name, temperature, num_facts=2, **kwargs):
    """
    Function for retrieving knowledge statements
    for generated knowledge prompting.
    Includes a string postprocessing step.

    Parameters:
    -----------
    model_name: str
        Name of the model to be used.
    temperature: float
        Temperature for sampling.
    num_facts: int
        Number of facts to be sampled per input question.
    **kwargs:
        Args for the backbone LLM.

    Returns:
    --------
    knowledge_chain: langchain.chains.llm.LLMChain
        Chain for generating knowledge statements. Takes question as input.
    transform_facts_chain: langchain.chains.TransformChain
        Chain for parsing the raw output of the knowledge sampler.
    """
    # read in instructions
    instructions_text = f"""Instructions:
    Generate {str(num_facts)} numerical fact(s) about objects. Please provide the facts in a bullet list.
    
    Examples: 
    """
    # define template for few-shot examples
    example_template = """
    Input: {input}
    Knowledge: {knowledge}    
    """

    # read in examples
    examples = pd.read_csv("data/session5/knowledge_examples.csv", sep = "|")
    # sample random examples from file
    example_selector = RandomExampleSelector(examples)
    selected_examples = example_selector.select_examples(num_examples=2)
    # instantiate the LLM backbone
    if model_name == "text-davinci-003" or model_name == "gpt-4":
        model = init_model(
            model_name=model_name, 
            temperature=temperature, 
         )
    elif "flan-t5" in model_name:
        print("Initting HF model")
        model = init_model(
            model_name=model_name, 
        )
    else:
        raise ValueError(f"Model {model_name} cannot be used for knowledge based QA.")

    # parse few-shot examples into template
    example_prompt = PromptTemplate(
        template = example_template,
        input_variables = ['input', 'knowledge'],
    )
    input_template = """
    Input: {input}
    Knowledge:
    """
    # format the few_shot prompt
    few_shot_prompt = FewShotPromptTemplate(
        prefix=instructions_text, 
        examples=selected_examples,
        example_prompt=example_prompt, 
        input_variables=["input"],
        suffix=input_template,
        example_separator="\n\n",
    )    
    # define the LLM
    knowledge_chain = LLMChain(
        llm=model, 
        prompt=few_shot_prompt, 
        verbose=True,
        output_key="knowledge"
    )

    # parse the outputs into list
    transform_facts_chain = TransformChain(
        input_variables=["knowledge"], 
        output_variables=["facts"], 
        transform=transform_fct, 
        verbose=True
    )
    
    return knowledge_chain, transform_facts_chain

def answer_question(question, answers, knowledge, model_name, temperature, **kwargs):
    """
    Function for answering multiple choice questions
    based on question and knowledge statements (rough replication of system by Liu et al, 2022).

    Parameters:
    -----------
    question: str
        Question to be answered and for which facts are generated.
    answers: list
        List of possible answers.
    knowledge: list
        List of knowledge facts (all are used for answer scoring).
    model_name: str
        Name of the model to be used.
    temperature: float
        Temperature for sampling.
    Returns:
    --------
    answer: str
        Answer with highest probability when conditioned on the facts.
    """
    answer_logprobs = []
    # define the template for scoring answers based on facts
    template = "{knowledge} {question} {answer}"
    # instantiate the LLM backbone
    # in particular, add params for retrieving log probs
    if model_name == "text-davinci-003":
        model = init_model(
            model_name=model_name, 
            temperature=temperature, 
            logprobs=0,
            max_tokens=0,
            echo=True,
        )
    elif "flan-t5" in model_name or "gpt-4" in model_name:
        model = init_model(
            model_name="text-davinci-003", 
            temperature=temperature, 
            logprobs=0,
            max_tokens=0,
            echo=True,
        )
    else:
        raise ValueError(f"Model {model_name} cannot be used for knowledge based QA.")
    
    # format the prompt
    prompt = PromptTemplate(
        template=template,
        input_variables = ['question', 'answer', 'knowledge'],
    )
    qa_chain = LLMChain(   
        llm=model,
        prompt=prompt,
        verbose=True,
    )
    # plain model request in order to get answer probabilities
    for answer in answers:
        # note that all facts are used for scoring
        result = qa_chain.generate(input_list=[{
            "question": question,
            "answer": answer,
            "knowledge": knowledge,
            }]
        )
        # retrieve log probs from LLM results object       
        log_p = result.generations[0][0].generation_info["logprobs"]["token_logprobs"]
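        # the first token has no log probability (None), so skip it
        answer_logprobs.append(np.sum(np.array(log_p[1:])))
    # renormalize: turn summed log probabilities into probabilities via a softmax
    # (dividing negative log probabilities by their sum would reverse the ranking)
    answer_logprobs = np.array(answer_logprobs)
    answer_probs = np.exp(answer_logprobs - np.max(answer_logprobs))
    answer_probs = answer_probs / np.sum(answer_probs)
    # find max probability
    max_prob_idx = np.argmax(answer_probs)
    # return answer with max probability
    print("All answers ", answers)
    print("Answer probabilities ", answer_probs)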
    print("Selected answer ", answers[max_prob_idx])
    return answers[max_prob_idx]
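
The final scoring step of `answer_question` can be looked at in isolation. The sketch below uses made-up token log probabilities and a hypothetical helper `pick_answer`; it does not call any LLM. It sums the token log probabilities of each answer and normalizes the sums with a softmax. Since log probabilities are negative, dividing them by their sum would reverse their ranking, which is why they are exponentiated first.

```python
import math


def pick_answer(answers, token_logprobs):
    """Sum token log-probabilities per answer and normalize with a softmax."""
    scores = [sum(lps) for lps in token_logprobs]
    max_score = max(scores)                      # subtract max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    probs = [e / total for e in exp_scores]
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# made-up token log-probabilities, for illustration only
answer, probs = pick_answer(
    ["yes", "no"],
    [[-0.5, -1.0], [-2.0, -3.0]],
)
print(answer, [round(p, 3) for p in probs])
```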

> <strong><span style="color:#D83D2B;">Exercise 3.3.3: Prompting techniques</span></strong>
>
> For the following exercises, use the same model as used above.
> 1. Implement an example of a few-shot chain-of-thought prompt.
> 2. Try to vary the few-shot and the chain-of-thought prompt by introducing mistakes and inconsistencies. Do these mistakes affect the result of your prediction?

**Outlook**

**TODO** to finish.

* prompting webbook
* some utils for already using pre-built utils for stuff like ToT (LangChain)