# Homework 2: Prompting & Generation with LMs (50 points)

The second homework focuses on two skills: gaining a deeper understanding of different state-of-the-art prompting techniques, and training your critical conceptual thinking about research on LMs.

### Logistics

* submission deadline: June 2nd, 23:59 German time via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW2.ipynb** containing your solutions of the homework.
* please solve and submit the homework **individually**! 
* if you use Colab, you can speed up the execution of your code by using the available GPU (if Colab resources allow). To do so, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.


## Exercise 1: Advanced prompting strategies (16 points)

The lecture discussed various sophisticated ways of prompting language models to generate text. Please answer the following questions about prompting techniques in the context of different models, and write down your answers, briefly explaining them (max. 3 sentences). Feel free to actually implement some of the prompting strategies to play around with them and build your intuitions.

> Consider the following language models: 
> * GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
>  
> Consider the following prompting / generation strategies: 
> * beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.
> 
> For each model, which strategies do you think work well, and why? Do you think there are particular tasks or contexts in which they work better than in others?

**Solution**
4p per model. Aspects that can be mentioned include: 
* GPT-2: 
  * beam search: it has been shown to improve results for "standard" LLMs
  * few-shot prompting: GPT-2 might be able to do in-context learning if the examples are formatted more like text completion.
  * other strategies (CoT, ToT) are probably too demanding for a small model that was not instruction-tuned
* GPT-4:
  * anything except beam search should work (it is probably too costly). depending on the task, few-shot CoT or tree of thought could be best for reasoning tasks
* Vicuna:
  * few-shot prompting or zero-shot CoT could work because it was instruction-tuned
* Llama-base: 
  * few-shot prompting or few-shot CoT could work; ToT or zero-shot CoT might be too advanced because the model wasn't instruction- / RL-tuned
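
To make the difference between the prompt-based strategies concrete, here is a minimal sketch of how the prompts could be assembled as plain strings. The `Q:` / `A:` layout and the function names are assumptions chosen for illustration, not a prescribed format; the reasoning trigger "Let's think step by step." is the phrase commonly used for zero-shot CoT.

```python
def few_shot_prompt(examples, query):
    """Few-shot: (input, output) pairs as plain text, ending with the open query."""
    blocks = [f"Q: {x}\nA: {y}" for x, y in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question):
    """Zero-shot CoT: no examples, only a trigger phrase that elicits reasoning."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(examples, query):
    """Few-shot CoT: each example is a (question, reasoning, answer) triple."""
    blocks = [f"Q: {q}\nA: {r} The answer is {a}." for q, r, a in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# toy arithmetic examples, made up for illustration
print(few_shot_prompt([("2 + 2", "4"), ("3 + 5", "8")], "7 + 6"))
print(zero_shot_cot_prompt("What is 7 + 6?"))
```

Base models such as GPT-2 or Llama-2-7b-base only see these strings as text to continue, which is why the plain few-shot layout tends to be the safest choice for them.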

## Exercise 2: Prompting for NLI & Multiple-choice QA (14 points)

In this exercise, you can let your creativity flow -- your task is to come up with prompts for language models such that they achieve maximal accuracy on the following example tasks. Feel free to take inspiration from the in-class examples of the sentiment classification task. Also feel free to play around with the decoding scheme and see how it interacts with the different prompts.

**TASK:**
> Use the code that was introduced in the Intro to HF sheet to load the model and generate predictions from it with your sample prompts.
> 
> * Please provide your code.
> * Please report the best prompt that you found for each model and task (i.e., NLI and multiple choice QA), and the decoding scheme parameters that you used. 
> * Please write a brief summary of your explorations, stating what you tried, what worked (better), why you think that is.

* Models: Pythia-410m, Pythia-1.4b
* Tasks: please **test** the model on the following sentences and report the accuracy of the model with your best prompt and decoding configurations.
  * Natural language inference: the task is to classify whether two sentences form a "contradiction" or an "entailment", or the relation is "neutral". The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
    * A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
    * Children smiling and waving at camera. There are children present. entailment
    * A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
    * An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
    * High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction
  * Multiple choice QA: the task is to predict the correct answer option for the question, given the question and the options (like in the task of Ex. 3 of homework 1). The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
    * To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
    * The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
    * What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
    * Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- "get smart"
    * Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality

**Partial solution suggestion**

* 6 pts / model, 2 pts for code
  * for each model, there should be: a prompt, decoding parameters, accuracy for NLI, accuracy for QA, conclusion / summary
  * the actual accuracies don't matter that much as long as the response sensibly reflects upon what's going on
* any kind of code that does what is asked for in this task is of course acceptable, but below is one possibility (for one model). If people manually evaluated the accuracy, it's also fine (code is not required here).
* intuition suggests that some kind of few shot prompting should work, especially if the prompt is formatted as text continuation rather than some structured format for the smaller model; for the larger model, even more advanced things might work, e.g., formatting the QA as multiple choice could work.
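
Since the task also asks about decoding schemes, the following sketch collects a few keyword-argument sets that can be passed to `model.generate`. The parameter names (`do_sample`, `num_beams`, `temperature`, `top_k`, `top_p`, `max_new_tokens`) are standard `generate` arguments; the concrete values are illustrative assumptions, not recommended settings.

```python
# Keyword arguments for `model.generate`; all values are illustrative assumptions.
decoding_configs = {
    # always pick the most likely next token
    "greedy": {"do_sample": False, "max_new_tokens": 3},
    # keep the 5 most likely partial continuations at every step
    "beam_search": {"do_sample": False, "num_beams": 5, "max_new_tokens": 3},
    # sample from the 50 most likely tokens with a sharpened distribution
    "top_k_sampling": {"do_sample": True, "temperature": 0.5, "top_k": 50, "max_new_tokens": 3},
    # sample from the smallest token set covering 90% of the probability mass
    "nucleus_sampling": {"do_sample": True, "top_p": 0.9, "max_new_tokens": 3},
}

for name, kwargs in decoding_configs.items():
    print(name, kwargs)
```

A configuration would then be used as `model.generate(input_ids, **decoding_configs["greedy"])`. For classification-style tasks with very short answers, deterministic decoding makes the reported accuracy reproducible across runs.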

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "EleutherAI/pythia-410m" # "EleutherAI/pythia-1.4b"
model = AutoModelForCausalLM.from_pretrained(
    model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# prompt
input_text = "Given a pair of sentences, decide whether they entail or contradict each other, or are neutral to each other. The answer should be one of: \
entailment, contradiction, neutral. Here are 3 examples:\n\
1. A man inspects the uniform of a figure in some East Asian country. The man is sleeping. contradiction.\
\n2. An older and younger man smiling. Two men are smiling and laughing at the cats playing on the floor. neutral.\
\n3. A soccer game with multiple males playing. Some men are playing a sport. entailment"

# there are various ways of doing this, e.g., people could have converted the examples above into a csv, or just use lists.
# a list is used below. code should be functionally identical for both tasks (although the accuracy computation differs somewhat)
# nli example, with labels in form of actual words (rather than e g A B C labels)
nli_inputs = [
    ("\nA person on a horse jumps over a broken down airplane. A person is training his horse for a competition."),
    ("\nA person on a horse jumps over a broken down airplane. A person is outdoors, on a horse."),
    ("\nChildren smiling and waving at camera. There are children present."),
    ("\nA boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk."),
    ("\nAn older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work."),
    ("\nHigh fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear.")
]
nli_labels = [
    "neutral",
    "entailment",
    "entailment",
    "contradiction",
    "neutral",
    "contradiction"
]
tokenizer.pad_token_id = tokenizer.eos_token_id
predicted_answers = []
predicted_answers_correctness = []
for input_test, label in zip(nli_inputs, nli_labels):
    # tokenize the few-shot prompt followed by the test item (returns input ids and attention mask)
    inputs = tokenizer(input_text + input_test, return_tensors="pt", max_length=240, truncation=True).to(device)
    # generate predictions
    prediction = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.5,
        max_new_tokens=3,
        pad_token_id=tokenizer.eos_token_id,
    )
    # decode only the newly generated tokens (i.e., convert tokens back to text)
    answer = tokenizer.decode(prediction[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("\n\nPredicted continuation: ", answer)
    predicted_answers.append(answer)
    # the prediction counts as correct if the gold label appears in the continuation
    if label in answer.lower():
        predicted_answers_correctness.append(1)
    else:
        predicted_answers_correctness.append(0)

print("Accuracy: ", sum(predicted_answers_correctness) / len(predicted_answers_correctness))

The few-shot approach at least gets the models to output one of the three labels. However, the accuracy of 1/3 is most likely due to chance. The result is the same for both models, and both seem to exhibit a strong bias towards "contradiction".
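
The accuracy check above counts a prediction as correct whenever the gold label appears somewhere in the continuation. A slightly stricter variant, sketched below, first maps each continuation to the label that occurs earliest and only then compares it with the gold label. The helper names and the example continuations are made up for illustration; they are not actual model outputs.

```python
NLI_LABELS = ("entailment", "contradiction", "neutral")


def extract_label(continuation, labels=NLI_LABELS):
    """Return the label that occurs first in the continuation, or None if none occurs."""
    text = continuation.lower()
    found = {label: text.find(label) for label in labels if label in text}
    return min(found, key=found.get) if found else None


def accuracy(continuations, gold_labels):
    """Fraction of continuations whose extracted label matches the gold label."""
    predictions = [extract_label(c) for c in continuations]
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)


# hypothetical continuations, not real model outputs
example_outputs = [" neutral.\n4", " Contradiction", " the man"]
print(accuracy(example_outputs, ["neutral", "entailment", "neutral"]))
print("chance level:", 1 / len(NLI_LABELS))
```

Comparing the measured accuracy with the chance level of 1/3 (the six test items are balanced over the three labels) makes it easier to judge whether a prompt actually helps.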

In [None]:
tokenizer.pad_token_id = tokenizer.eos_token_id
prompt = "Select the correct label from the list of options, that answers the question. Here is an example:\n \
Sammy wanted to go to where the people were. Where might he go? answer options: A: race track, B: populated areas, C: the desert, D: apartment, E: roadblock\nanswer: B"
questions = ["The only baggage the woman checked was a drawstring bag, where was she heading with it?", 
              "To prevent any glare during the big football game he made sure to clean the dust of his what?", 
              "The president is the leader of what institution?",
              "What kind of driving leads to accidents?",
              "Can you name a good reason for attending school?",
              "Stanley had a dream that was very vivid and scary. He had trouble telling it from what?"]
"""
answer_options = ["garbage can, military, jewelry store, safe, airport",
                  "television, attic, corner, they cannot clean corner and library during football match they cannot need that, ground",
                  "walmart, white house, country, corporation, government",
                  "stressful, dangerous, fun, illegal, deadly",
                  "get smart, boredom, colds and flu, taking tests, spend time",
                  "imagination, reality, dreamworker, nightmare, awake"]
"""

answer_options = [["garbage can", "military", "jewelry store", "safe", "airport"],
                  ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"],
                  ["walmart", "white house", "country", "corporation", "government"],
                  ["stressful", "dangerous", "fun", "illegal", "deadly"],
                  ["get smart", "boredom", "colds and flu", "taking tests", "spend time"],
                  ["imagination", "reality", "dreamworker", "nightmare", "awake"]]

labels = ["airport", "television", "country", "dangerous", "get smart", "reality"]
predicted_answers = []
predicted_answers_correctness = []
chars = ["A", "B", "C", "D", "E"]
for question, answer_option, label in zip(questions, answer_options, labels):
    # format the options like the example in the prompt, e.g. "A: garbage can, B: military, ..."
    option_list = ", ".join(c + ": " + option for c, option in zip(chars, answer_option))
    # the prompt asks for a letter, so the gold answer is the letter of the correct option
    gold_char = chars[answer_option.index(label)]
    # no trailing space after "answer:", so that the model can generate " B" as in the example
    input_text = prompt + "\n" + question + " answer options: " + option_list + "\nanswer:"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=240, truncation=True).to(device)
    # generate predictions
    prediction = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=3,
        pad_token_id=tokenizer.eos_token_id,
    )
    # decode only the newly generated tokens (i.e., convert tokens back to text)
    answer = tokenizer.decode(prediction[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("\n\nPredicted continuation: ", answer)
    predicted_answers.append(answer)
    # correct if the continuation starts with the gold letter or contains the gold answer text
    if answer.strip().upper().startswith(gold_char) or label in answer.lower():
        predicted_answers_correctness.append(1)
    else:
        predicted_answers_correctness.append(0)

print("Accuracy: ", sum(predicted_answers_correctness) / len(predicted_answers_correctness))

## Exercise 3: First neural LM (20 points)

Besides reading and understanding package documentation, a key skill for NLP researchers and practitioners is reading and critically assessing NLP literature. The density, but also the style, of NLP literature has undergone a significant shift in recent years as progress has accelerated. Your task in this exercise is to read a paper about one of the first successful neural language models, understand its key architectural components, and compare how these key components have evolved in the modern systems that were discussed in the lecture.

> Specifically, please read this paper and answer the following questions: [Bengio et al. (2003)](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
>
> * How were words / tokens represented? What is the difference / similarity to modern LLMs?
> * How was the context represented? What is the difference / similarity to modern LLMs?
> * What is the curse of dimensionality? Give a concrete example in the context of language modeling.
> * Which training data was used? What is the difference / similarity to modern LLMs?
> * Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
> 
> * Please formulate one question about the paper (not the same as the questions above) and post it to the dedicated **Forum** space, and **answer 1 other question** about the paper.

Furthermore, your task is to carefully dissect the paper by Bengio et al. (2003) and analyse its structure and style in comparison to another more recent paper:  [Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

**TASK:**

> For each section of the Bengio et al. (2003) paper, what are key differences between the way it is written, the included contents, to the BERT paper (Devlin et al., 2019)? What are key similarities? Write max. 2 sentences per section.


> ### Answers
> #### How were words / tokens represented? What is the difference / similarity to modern LLMs?
> Each word was represented by a continuous real-valued vector that captures similarity between words (instead of discrete random or deterministic variables). Each word is thus associated with a point in a vector space whose number of features is much smaller than the size of the vocabulary.
This idea is reminiscent of the feature vectors used in information retrieval. However, the model does not rely on word co-occurrence counts; the vectors are learned while modeling the probability distribution of word sequences in natural language text.
Like Bengio et al., modern LLMs represent their inputs as learned continuous vectors. However, LLMs additionally compute context-sensitive representations in their deeper layers, and they usually split words into subword tokens.
> 
> #### How was the context represented? What is the difference / similarity to modern LLMs?
> The context is represented by a vector formed from a given sequence of words: the word vectors of the preceding words are concatenated and combined by a feed-forward neural network.
In contrast, modern LLMs build the context representation with self-attention in Transformers. Furthermore, the context size is fixed to the n-1 preceding words in the Bengio et al. model, whereas LLMs handle contexts of variable length up to a maximum context window.
> 
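
The fixed-context computation described above can be sketched in a few lines of NumPy. The sketch follows the form y = b + Wx + U tanh(d + Hx) from the paper, where x is the concatenation of the n-1 context word vectors; the toy sizes and the random, untrained parameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes (assumptions): vocabulary, embedding dim, hidden units, n-gram order
V, m, h, n = 20, 5, 8, 4

C = rng.normal(size=(V, m))              # word feature vectors, one row per word
H = rng.normal(size=(h, (n - 1) * m))    # input-to-hidden weights
d = np.zeros(h)                          # hidden bias
U = rng.normal(size=(V, h))              # hidden-to-output weights
W = rng.normal(size=(V, (n - 1) * m))    # direct input-to-output connections
b = np.zeros(V)                          # output bias


def next_word_probs(context_ids):
    """Distribution over the next word given exactly n-1 preceding word ids."""
    assert len(context_ids) == n - 1, "the context size is fixed"
    x = C[context_ids].reshape(-1)           # concatenate the n-1 word vectors
    y = b + W @ x + U @ np.tanh(d + H @ x)   # unnormalized scores per word
    e = np.exp(y - y.max())                  # numerically stable softmax
    return e / e.sum()


probs = next_word_probs([3, 7, 1])
print(probs.shape, probs.sum())
```

The assertion on the context length makes the contrast with self-attention explicit: this architecture cannot consume a longer or shorter context without changing the shapes of `H` and `W`.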
> #### What is the curse of dimensionality? Give a concrete example in the context of language modelling.
>The curse of dimensionality occurs while analysing data in high-dimensional spaces. When the dimensionality rises, the volume of the space increases exponentially. 
One example is the joint distribution of 10 consecutive words with a vocabulary size of 100,000. Here we have $100{,}000^{10} - 1 = 10^{50} - 1$ free parameters.
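Bengio et al. address this by expressing the joint probability function of word sequences as a smooth function of the words' feature vectors, computed by a neural network, so that similar word sequences receive similar probabilities.
In this respect, the method differs crucially from count-based n-gram models, while modern LLMs still rely on the same principle.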
> 
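
The arithmetic behind this example can be checked directly. The sketch below compares the size of a full joint probability table with the parameter count |V|(1 + nm + h) + h(1 + (n-1)m) that the paper gives for its neural model; the values of `m` and `h` are illustrative assumptions, not the settings of a specific experiment.

```python
V = 100_000   # vocabulary size from the example above
n_words = 10  # number of consecutive words

# a full table over all 10-word sequences, minus one for the sum-to-one constraint
joint_table_params = V ** n_words - 1
assert joint_table_params == 10 ** 50 - 1


def neural_lm_params(V, n, m, h):
    """Free parameters of the neural model: |V|(1 + nm + h) + h(1 + (n - 1)m)."""
    return V * (1 + n * m + h) + h * (1 + (n - 1) * m)


# m (feature dimension) and h (hidden units) are illustrative choices
nn_params = neural_lm_params(V, n=n_words, m=60, h=100)

print(f"joint table: {joint_table_params:.3e} parameters")
print(f"neural LM:   {nn_params:,} parameters")
```

The table grows exponentially with the sequence length, whereas the neural model grows only linearly with both the vocabulary size and the context length.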
> #### Which training data was used? What is the difference / similarity to modern LLMs?
> The training set is a sequence of words from a large but finite vocabulary.
Comparative experiments were performed on the Brown corpus, where the first 800,000 words were used as the training set.
Furthermore, an experiment was run on Associated Press News texts, where the training set consists of a stream of about 14 million words.
Modern LLMs are trained on far larger corpora, typically hundreds of billions to trillions of tokens.
> 
> #### Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
> Bengio et al.:
> - Using neural networks with softmax
> - generalizing to unseen word sequences via similarity between word vectors
> 
>LLMs:
> - Self-Attention
> - embeddings
> - Masking
> - Special Tokens 
> - Fine-Tuning

> ### differences per section
> The sections below follow the main sections of the Bengio et al. paper.
> #### Abstract
> Similar in both papers: introducing problem and solution.
> 
> #### Introduction
> Bengio et al. describe the challenges of statistical language modelling and offer a neural network solution, with a special focus on the curse of dimensionality.
Devlin et al. explain the limitations of existing pre-training techniques and introduce BERT.
> Therefore, Bengio et al. introduce a new theoretical framework, whereas Devlin et al. introduce an entire alternative model.
> Bengio et al. look into earlier neural networks and statistical models, whereas Devlin et al. focus on feature-based models like ELMo.
> 
> #### A neural model
> Bengio et al. describe a neural network architecture. Devlin et al. explain a Transformer that is pre-trained on a giant corpus with masked language modelling and then fine-tuned.
> Both papers explain their architectures in detail and with figures.
> 
> #### Parallel Implementation
> Bengio et al. emphasize parallel computation on hardware that is no longer in use. There is no comparable section in the paper by Devlin et al.
> 
> #### Experimental Results
> Bengio et al. report only perplexity reductions on the Brown corpus and the AP News corpus, and compare against the state of the art at the time. Devlin et al. evaluate on several different NLP benchmarks.
> 
> #### Extensions and Future Work
> Bengio et al. describe several possible improvements that can be tried out in the future. Devlin et al. keep this part very short.
> 
> #### Conclusion
> Both papers summarize their breakthroughs. Nonetheless, the conclusion of the BERT paper is shorter than that of the paper introducing the neural network solution.
