# Homework 2: Prompting & Generation with LMs (50 points)

The second homework focuses on the following skills: gaining a deeper understanding of different state-of-the-art prompting techniques and training your critical conceptual thinking regarding research on LMs.

### Logistics

* submission deadline: June 2nd, 23:59 German time via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW2.ipynb** containing your solutions of the homework.
* please solve and submit the homework **individually**! 
* if you use Colab, you can speed up the execution of your code by using the available GPU (if Colab resources allow). To do so, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.


## Exercise 1: Advanced prompting strategies (16 points)

The lecture discussed various sophisticated ways of prompting language models for generating texts. Please answer the following questions about prompting techniques in the context of different models and write down your answers, briefly explaining them (max. 3 sentences). Feel free to actually implement some of the prompting strategies to play around with them and build your intuitions.

> Consider the following language models: 
> * GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
>  
> Consider the following prompting / generation strategies: 
> * beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.
> 
> For each model, which strategies do you think work well, and why? Do you think there are particular tasks or contexts in which they work better than in others?

**Solution**
4p per model. Aspects that can be mentioned include: 
* GPT-2: 
  * beam search: it has been shown that it improves results for "standard" LLMs
  * few-shot prompting: GPT-2 might be able to do in-context learning if the examples are more like text completion.
  * the other strategies are likely too demanding for a small model that was not instruction-tuned
* GPT-4:
  * anything except beam search should work (beam search is probably too costly). For reasoning tasks, few-shot CoT or tree-of-thought could work best
* Vicuna:
  * few-shot prompting or zero-shot CoT could work because it was instruction-tuned
* Llama-base: 
  * few-shot prompting  or few-shot CoT could work, ToT or zero-shot might be too advanced because it wasn't instruction- / RL-tuned
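
To make the differences between the prompt-based strategies concrete, here is a minimal sketch of how the three prompt types could be assembled as plain strings. The helper names, the `Q:`/`A:` template, and the toy arithmetic examples are assumptions made for illustration only; they are not part of the exercise. The trigger phrase "Let's think step by step." is the one commonly used for zero-shot CoT.

```python
# Hypothetical helpers: names, templates, and toy examples are illustrative assumptions.

def few_shot_prompt(examples, query):
    """Plain few-shot prompt: (question, answer) demonstrations, then the query."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {query}\nA:"


def zero_shot_cot_prompt(query):
    """Zero-shot CoT: no demonstrations, only a reasoning trigger phrase."""
    return f"Q: {query}\nA: Let's think step by step."


def few_shot_cot_prompt(examples, query):
    """Few-shot CoT: each demonstration spells out its reasoning before the answer."""
    demos = "\n".join(
        f"Q: {q}\nA: {reasoning} The answer is {a}." for q, reasoning, a in examples
    )
    return f"{demos}\nQ: {query}\nA:"


examples = [("What is 2 + 3?", "5")]
cot_examples = [("What is 2 + 3?", "Adding 3 to 2 gives 5.", "5")]
query = "What is 4 + 7?"

for prompt in (
    few_shot_prompt(examples, query),
    zero_shot_cot_prompt(query),
    few_shot_cot_prompt(cot_examples, query),
):
    print(prompt)
    print("-" * 40)
```

The few-shot variants read like a text to be continued, which is why they are plausible candidates for base models, whereas the zero-shot CoT variant relies on the model following an implicit instruction.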

## Exercise 2: Prompting for NLI & Multiple-choice QA (14 points)

In this exercise, you can let your creativity flow -- your task is to come up with prompts for language models such that they achieve maximal accuracy on the following example tasks. Feel free to take inspiration from the in-class examples of the sentiment classification task. Also feel free to play around with the decoding scheme and see how it interacts with the different prompts.

**TASK:**
> Use the code that was introduced in the Intro to HF sheet to load the model and generate predictions from it with your sample prompts.
> 
> * Please provide your code.
> * Please report the best prompt that you found for each model and task (i.e., NLI and multiple choice QA), and the decoding scheme parameters that you used. 
> * Please write a brief summary of your explorations, stating what you tried, what worked (better), why you think that is.

* Models: Pythia-410m, Pythia-1.4b
* Tasks: please **test** the model on the following sentences and report the accuracy of the model with your best prompt and decoding configurations.
  * Natural language inference: the task is to classify whether two sentences form a "contradiction" or an "entailment", or the relation is "neutral". The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
    * A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
    * Children smiling and waving at camera. There are children present. entailment
    * A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
    * An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
    * High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction
  * Multiple choice QA: the task is to predict the correct answer option for the question, given the question and the options (like in the task of Ex. 3 of homework 1). The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
    * To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
    * The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
    * What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
    * Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- get smart
    * Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality

## Exercise 2

**Partial solution suggestion**

* 6 pts / model, 2 pts for code
  * for each model, there should be: a prompt, decoding parameters, accuracy for NLI, accuracy for QA, conclusion / summary
  * the actual accuracies don't matter that much as long as the response sensibly reflects upon what's going on
* any kind of code that does what is asked for in this task is of course acceptable, but below is one possibility (for one model). If people manually evaluated the accuracy, it's also fine (code is not required here).
* intuition suggests that some kind of few shot prompting should work, especially if the prompt is formatted as text continuation rather than some structured format for the smaller model; for the larger model, even more advanced things might work, e.g., formatting the QA as multiple choice could work.

The following solution was created by Karahan Sarıtaş. It is much more than what was expected, but shows a very good and systematic approach to investigating different combinations of prompting/decoding strategies.

In [None]:
# import packages
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import numpy as np

# define computational device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m").to(device)



Device: cuda


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from transformers import set_seed  # reproducibility
torch.manual_seed(42)
set_seed(42)

In [None]:
test_set = [
    "Input: A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. Relation:", # neutral
    "Input: A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. Relation:",  # entailment
    "Input: Children smiling and waving at camera. There are children present. Relation:",  # entailment
    "Input: A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. Relation:", # contradiction
    "Input: An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. Relation:", # neutral
    "Input: High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. Relation:" # contradiction
]

In [None]:
def pretty_print(s, text = None):
    decoded = tokenizer.decode(s, skip_special_tokens=True)
    if text is not None:
        decoded = decoded.replace(text, "")
    print(decoded)
    print(100 * '-' + '\n')

In [None]:
# greedy decoding
def greedy_decoding(input_ids, max_new_tokens=2):
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    return output

def beam_search_decoding(input_ids, max_new_tokens=2, num_beams=5):
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
        early_stopping=True,   # stop as soon as `num_beams` finished candidates have been found
        pad_token_id=tokenizer.eos_token_id
    )
    return output

def pure_sampling_decoding(input_ids, max_new_tokens=2):
    # activate sampling and deactivate top-k filtering by setting top_k to 0
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=0,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

def softmax_sampling_decoding(input_ids, max_new_tokens=2, temperature=0.7):
    # activate sampling with a temperature and deactivate top-k filtering by setting top_k to 0
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=0,
        temperature=temperature,  # higher temperature means more randomness
        pad_token_id=tokenizer.eos_token_id
    )
    return output

def top_k_sampling_decoding(input_ids, max_new_tokens=2, k=50):
    # activate sampling and restrict it to the k most likely tokens
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=k,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

# nucleus sampling: sample from the smallest set of most likely tokens whose cumulative probability exceeds p
def top_p_sampling_decoding(input_ids, max_new_tokens=2, p=0.9):
    # activate sampling and deactivate top-k filtering (top_k=0) so that only top-p applies
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=0,
        top_p=p,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

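# note: passing `penalty_alpha` together with `top_k` selects what Hugging Face calls contrastive search
def contrastive_decoding(input_ids, max_new_tokens=2, penalty_alpha=0.6, top_k=4):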
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        penalty_alpha=penalty_alpha,
        top_k=top_k,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

## Experiment Scheme
In our experiments we worked on the following configurations:
* Two models, `pythia-410m` and `pythia-1.4b` are used to generate the output.


### Natural Language Inference

* We use "Instruction Prompting" to help the LM better understand the user's intention and follow the instruction.
* Prefix of the input is given as follows:\
 "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”."
* For each model, we test the following prompt engineering techniques:
    * $k$-shot learning
        * Zero-shot learning (no example is provided)
        * One-shot learning (only one example is provided)
        * Few-shot learning ($k$ = 3) (1 example for each relation)
        * Few-shot learning ($k$ = 9) (3 examples for each relation)
    * Self-consistency prompting (softmax sampling) with majority vote
* For each $k$-shot learning scenario, we test the following decoding strategies:
    * Greedy decoding
    * Pure sampling
    * Softmax sampling (temperature = 0.7)
    * Top-$k$ sampling ($k$ = 50)
    * Top-$p$ sampling ($p$ = 0.9)
    * Beam search (beam size = 5)
    * Contrastive decoding (`penalty_alpha` = 0.6, $k$ = 4)

In self-consistency, we apply a majority vote to the outputs generated using softmax sampling.

Disclaimer: All of these methods have different hyperparameters to tune, so the comparison is not completely fair. However, we believe that it still gives a general idea of how these methods perform.
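
To build intuition for how the sampling-based schemes above differ, the following sketch applies temperature scaling, top-$k$ filtering, and top-$p$ filtering to a toy next-token distribution. It uses only the standard library; the function names and the toy logits are assumptions for illustration, and the code is a simplified picture of the idea rather than what `model.generate` does internally.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
base = softmax(toy_logits)                    # pure sampling distribution
sharp = softmax(toy_logits, temperature=0.7)  # softmax sampling, temperature = 0.7

print("pure:       ", [round(x, 3) for x in base])
print("temperature:", [round(x, 3) for x in sharp])
print("top-k (k=2):", [round(x, 3) for x in top_k_filter(base, 2)])
print("top-p (0.9):", [round(x, 3) for x in top_p_filter(base, 0.9)])
```

Greedy decoding corresponds to always picking the highest-probability entry, which is the limit of lowering the temperature towards zero.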

## Natural Language Inference
Natural language inference: the task is to classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”.
* A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. **neutral**
* A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. **entailment**
* Children smiling and waving at camera. There are children present. **entailment**
* A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. **contradiction**
* An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. **neutral**
* High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. **contradiction**

Few-shot examples are collected from the Stanford Natural Language Inference (SNLI) Corpus. \
Paper: https://arxiv.org/pdf/1508.05326

We experimented with two sets of few-shot examples. The first set consists of entirely independent sentence pairs, while the second set consists of pairs that are related to each other: three different "hypothesis" sentences are created for each "premise" sentence. The idea is to prevent the model from relying solely on the first sentence for the relation prediction.


| Model | Prompt Engineering | Few-Shot Examples | Best Accuracy | Majority Vote Accuracy |
| :---: | :---: | :---: |:---: |:---: |
| pythia-410m | Zero-shot learning | -  | 0/6 | 0/6|
| pythia-410m | One-shot learning |  - | 2/6 | 2/6 |
| pythia-410m | Few-shot learning ($k = 3$) |  Independent | 3/6 | 3/6|
| pythia-410m | Few-shot learning ($k = 9$) | Independent | 3/6 | 2/6 |
| pythia-410m | Few-shot learning ($k = 3$) |Related | 3/6| 2/6 |
| pythia-410m | Few-shot learning ($k = 9$) |  Related | 3/6 | 2/6 |
| pythia-410m | Self-consistency  ($k = 3$) |  Independent | 2/6 | - |
| pythia-410m | Self-consistency  ($k = 9$) |  Related | 2/6 | - |
| pythia-1.4b | Zero-shot learning | -  | 0/6 | 0/6|
| pythia-1.4b | One-shot learning |  - | 2/6  | 2/6|
| pythia-1.4b | Few-shot learning ($k = 3$) |  Independent | 3/6 | 2/6|
| pythia-1.4b | Few-shot learning ($k = 9$) | Independent | **4/6** (softmax sampling) | 2/6 |
| pythia-1.4b | Few-shot learning ($k = 3$) |Related | **4/6** (contrastive decoding) | 3/6 |
| pythia-1.4b | Few-shot learning ($k = 9$) |  Related | 3/6 | 2/6 |
| pythia-1.4b | Self-consistency  ($k = 3$) |  Independent | 2/6 | - |
| pythia-1.4b | Self-consistency  ($k = 9$) |  Related | 2/6 | - |

* Best Accuracy: we generate outputs with each decoding scheme and report the accuracy of the best-performing scheme, disregarding the others.
* Majority Vote Accuracy (few-shot learning): Accuracy is determined by considering all outputs from different decoding schemes for a given input, with the final output decided by majority vote.
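
The two metrics defined above can be sketched as follows. The gold labels are the ones from the task description; the predictions per decoding scheme are made-up placeholders, not results from the experiments, and the function names are assumptions for illustration.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def best_accuracy(preds_by_scheme, gold):
    """Return (scheme, accuracy) of the best decoding scheme; ties keep the first one."""
    scored = [(name, accuracy(preds, gold)) for name, preds in preds_by_scheme.items()]
    return max(scored, key=lambda pair: pair[1])


def majority_vote_accuracy(preds_by_scheme, gold):
    """For each item, take the label predicted by most schemes, then score the result."""
    voted = []
    for i in range(len(gold)):
        votes = Counter(preds[i] for preds in preds_by_scheme.values())
        # most_common breaks ties in favour of the label that was seen first
        voted.append(votes.most_common(1)[0][0])
    return accuracy(voted, gold)


gold = ["neutral", "entailment", "entailment", "contradiction", "neutral", "contradiction"]

# hypothetical predictions, for illustration only
preds_by_scheme = {
    "greedy": ["neutral", "entailment", "neutral", "neutral", "neutral", "neutral"],
    "sampling": ["entailment", "entailment", "entailment", "neutral", "contradiction", "neutral"],
    "beam": ["neutral", "neutral", "neutral", "contradiction", "neutral", "neutral"],
}

print("best:", best_accuracy(preds_by_scheme, gold))
print("majority vote:", majority_vote_accuracy(preds_by_scheme, gold))
```

Because the best-accuracy metric picks the winning scheme after seeing the gold labels, it is an optimistic estimate, whereas the majority vote needs no access to the labels.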

Observations:
* When only the input is provided without any example or CoT prompting, the model doesn't even seem to understand the task - although the task is explicitly stated in the prompt. It attempts to complete the sentence with irrelevant information.
* We tested all these approaches with and without CoT prompts. In this particular case, using the given models and few-shot examples, there doesn't seem to be any difference between prompting for reasoning and not.
* There doesn't appear to be a hierarchy among the different decoding strategies, as the leading one varies from experiment to experiment.
* The models seem to achieve similar accuracy for independent and related few-shot examples.
* `pythia-1.4b` appears to outperform `pythia-410m` when the few-shot examples are provided. However, the test set needs to be larger to draw a solid conclusion. For both models, the best accuracies are already reached with $k=3$; increasing to $k=9$ does not improve them further.

In [None]:
# independent
few_shot_examples_v1 = [
    "Input: Children smiling and waving at camera. There are children present. Relation: entailment",  # entailment
    "Input: A man inspects the uniform of a figure in some East Asian country. The man is sleeping. Relation: contradiction", # contradiction
    "Input: An older and younger man smiling. Two men are smiling and laughing at the cats playing on the floor. Relation: neutral", # neutral
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. The church is filled with song. Relation: entailment",  # entailment
    "Input: A black race car starts up in front of a crowd of people. A man is driving down a lonely road. Relation: contradiction", # contradiction
    "Input: A smiling costumed woman is holding an umbrella. A happy woman in a fairy costume holds an umbrella. Relation: neutral", # neutral
    "Input: A soccer game with multiple males playing. Some men are playing a sport. Relation: entailment",  # entailment
    "Input: Four dirty and barefooted children. Four kids won awards for 'cleanest feet'. Relation: contradiction", # contradiction
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman is young. Relation: neutral", # neutral
]

In [None]:
# related
few_shot_examples_v2 = [
    "Input: Children smiling and waving at camera. There are children present. Relation: entailment",  # entailment
    "Input: Children smiling and waving at camera. The kids are frowning. Relation: contradiction", # contradiction
    "Input: Children smiling and waving at camera. They are smiling at their parents. Relation: neutral", # neutral
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. The church is filled with song. Relation: entailment",  # entailment
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. A choir singing at a baseball game. Relation: contradiction", # contradiction
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. The church has cracks in the ceiling. Relation: neutral", # neutral
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman is very happy. Relation: entailment", # entailment
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman has been shot. Relation: contradiction", # contradiction
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman is young. Relation: neutral", # neutral
]

In [None]:
def experiment(k, decoding, few_shot_examples, test_set):
    print(f"Decoding scheme: {decoding.__name__}\n")
    for idx, test in enumerate(test_set):
        suffix = ""
        if(k):
            few_shot = few_shot_examples[:k]
            suffix = "\n".join(few_shot) + "\n"
        task = "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”. \n"
        input_text = task + suffix +  test

        input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
        output = decoding(input_ids)
        print(f"{idx}) " + test)
        pretty_print(output[0], text=input_text)


In [None]:
def majority_vote(k, few_shot_examples, test_set):
  for idx, test in enumerate(test_set):
    suffix = ""
    if(k):
        few_shot = few_shot_examples[:k]
        suffix = "\n".join(few_shot) + "\n"
    task = "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”. \n"
    input_text = task + suffix +  test

    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    print(f"{idx}) " + test)
    for decoding in [greedy_decoding, beam_search_decoding, pure_sampling_decoding, softmax_sampling_decoding, top_k_sampling_decoding, top_p_sampling_decoding, contrastive_decoding]:
      output = decoding(input_ids)
      decoded = tokenizer.decode(output[0], skip_special_tokens=True)
      decoded = decoded.replace(input_text, "").replace("\n","")
      print(decoded)


In [None]:
# decoding strategies: greedy, beam search, pure sampling, softmax sampling, top-k sampling, top-p sampling, contrastive decoding
for decoding in [greedy_decoding, beam_search_decoding, pure_sampling_decoding, softmax_sampling_decoding, top_k_sampling_decoding, top_p_sampling_decoding, contrastive_decoding]:
    experiment(3, decoding, few_shot_examples_v2, test_set)

## GT:  neutral entailment entailment contradiction neutral contradiction

In [None]:
import gc
torch.cuda.empty_cache()
gc.collect()

In [None]:
majority_vote(1, few_shot_examples_v1, test_set)
## GT:  neutral entailment entailment contradiction neutral contradiction

In [None]:
def self_consistency(k, decoding, few_shot_examples, test_set, n = 5):
  for idx, test in enumerate(test_set):
    suffix = ""
    if(k):
        few_shot = few_shot_examples[:k]
        suffix = "\n".join(few_shot) + "\n"
    task = "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”.\n"
    input_text = task + suffix +  test


    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    print(f"{idx}) " + test)

    for i in range(n):
      output = decoding(input_ids)
      decoded = tokenizer.decode(output[0], skip_special_tokens=True)
      decoded = decoded.replace(input_text, "").replace("\n","")
      print(decoded)


In [None]:
self_consistency(9, softmax_sampling_decoding, few_shot_examples_v1, test_set)
## GT:  neutral entailment entailment contradiction neutral contradiction
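
The `self_consistency` function above prints the sampled continuations, and the vote is then read off by hand. A minimal sketch of automating that step is shown below. The helper names and the matching rule are assumptions: since only two new tokens are generated, a label may be cut off (e.g. " contrad"), so the first generated word is matched against the label names in both directions.

```python
from collections import Counter

LABELS = ("entailment", "contradiction", "neutral")


def extract_label(generated):
    """Map a raw continuation to a label, or None if no label is recognisable."""
    words = generated.strip().lower().split()
    if not words or len(words[0]) < 3:
        return None
    word = words[0]
    for label in LABELS:
        # accept complete labels ("neutral.") as well as truncated ones ("contrad")
        if word.startswith(label) or label.startswith(word):
            return label
    return None


def self_consistency_vote(samples):
    """Majority vote over the labels that could be extracted from the samples."""
    labels = [extract_label(s) for s in samples]
    labels = [label for label in labels if label is not None]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# hypothetical sampled continuations for one test item
samples = [" neutral\n", " entail", " neutral", " The man", " neutral"]
print(self_consistency_vote(samples))
```

Continuations that do not resemble any label are simply dropped from the vote, which is one of several reasonable choices; counting them as errors would be stricter.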

## Multiple-choice QA

Multiple-choice QA: the task is to predict the correct answer option for the question, given the question and the options.


## Experiment Scheme
In our experiments we worked on the following configurations:
* Two models, `pythia-410m` and `pythia-1.4b` are used to generate the output.
* Few-shot examples are collected from two different datasets: [CommonSenseQA](https://huggingface.co/datasets/tau/commonsense_qa) and [NumerSense](https://inklab.usc.edu/NumerSense/). CommonSenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions, each with one correct answer and four distractor answers. NumerSense is a numerical commonsense reasoning probing task with a diagnostic dataset of 3,145 masked-word-prediction probes.
* We primarily explored two prompting techniques: Generated Knowledge Prompting (with few-shot examples) and Few-shot Prompting. In Generated Knowledge Prompting, the model first generates knowledge statements pertaining to the given question. We then evaluate the log probability of each answer option given these knowledge statements. In this way, the model can either support or oppose its own answer. We also employ two output extraction techniques: scoring the answers based on log probabilities and classical generation. Classical generation also relies on the model's token probabilities to produce new tokens, so the two approaches appear similar at first. Scoring, however, restricts the model's output to one of the given options. We worked on the following combinations:
  * Generated Knowledge Prompting with NumerSense dataset + scoring
  * Generated Knowledge Prompting with CommonSenseQA dataset + scoring
  * Few Shot Learning with CommonSenseQA dataset + scoring
  * Few Shot Learning with CommonSenseQA dataset + classical generation (softmax sampling)

For the generated knowledge prompting, we generated five knowledge statements for each question. For each answer choice $a$, we identified the knowledge statement that best supports it by calculating the log probability of generating that answer for each augmented prompt (with the knowledge statement). This process allows us to determine the maximum log probability for each answer. Ultimately, we select the answer with the highest maximum log probability.
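
The selection rule described in this paragraph amounts to a maximum over knowledge statements followed by an argmax over answers. The sketch below illustrates it on made-up scores; the function name and the numbers are assumptions for illustration, not outputs of the models.

```python
def select_answer(log_probs):
    """Pick the answer whose best-supporting knowledge statement scores highest.

    log_probs maps each answer option to a list of average log probabilities,
    one entry per knowledge statement.
    """
    best_per_answer = {answer: max(scores) for answer, scores in log_probs.items()}
    best_answer = max(best_per_answer, key=best_per_answer.get)
    return best_answer, best_per_answer[best_answer]


# hypothetical scores for three options and three knowledge statements
toy_log_probs = {
    "airport": [-2.1, -0.8, -1.5],
    "safe": [-1.9, -1.2, -3.0],
    "military": [-4.0, -2.5, -2.2],
}

answer, score = select_answer(toy_log_probs)
print(answer, score)
```

Taking the maximum rather than the mean over knowledge statements means that a single strongly supportive statement is enough to select an answer.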

All the few-shot prompts can be found below.

| Model | Prompt Engineering | Dataset | Generation Technique | Accuracy |
| :---: | :---: | :---: |:---: |:---: |
| pythia-410m | Generated Knowledge Prompting | NumerSense  |  Log-Probability scoring | 3/6 |
| pythia-410m | Generated Knowledge Prompting | CommonSenseQA  |  Log-Probability scoring | **4/6** |
| pythia-410m | Few Shot learning | CommonSenseQA  |  Log-Probability scoring | 0/6 |
| pythia-410m | Few Shot learning | CommonSenseQA  |  Generation with Softmax Sampling | 1/6 |
| pythia-1.4b | Generated Knowledge Prompting | NumerSense  |  Log-Probability scoring | 2/6 |
| pythia-1.4b| Generated Knowledge Prompting | CommonSenseQA  |  Log-Probability scoring | 2/6 |
| pythia-1.4b | Few Shot learning | CommonSenseQA  |  Log-Probability scoring | 1/6 |
| pythia-1.4b | Few Shot learning | CommonSenseQA  |  Generation with Softmax Sampling | 2/6 |


Observations:
* Overall, generated knowledge prompting seems to outperform vanilla few shot learning (no knowledge statement is generated).
* Maximum accuracy is achieved using generated knowledge prompting on CommonSenseQA with `pythia-410m`.


### Generated Knowledge Prompting with NumerSense Dataset

In [None]:
qa = {
    "The only baggage the woman checked was a drawstring bag, where was she heading with it?": ["garbage can", "military", "jewelry store", "safe", "airport"],
    "To prevent any glare during the big football game he made sure to clean the dust of his what?": ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"],
    "The president is the leader of what institution?": ["walmart", "white house", "country", "corporation", "government"],
    "What kind of driving leads to accidents?": ["stressful", "dangerous", "fun", "illegal", "deadly"],
    "Can you name a good reason for attending school?": ["get smart", "boredom", "colds and flu", "taking tests", "spend time"],
    "Stanley had a dream that was very vivid and scary. He had trouble telling it from what?": ["imagination", "reality", "dreamworker", "nightmare", "awake"]
}

# GT: airport - television - country - dangerous - get smart - reality

In [None]:
# Run this to get NumerSense dataset few shot prompts
import pandas as pd

inputs = [
    "How many wings do penguins have?",
    "How many sides does a parallelogram have?",
    "What is the number of limbs a typical human being has?",
    "How many feet are there in a yard?",
]

knowledges = [
    "Birds have two wings. Penguin is a kind of bird.",
    "A rectangular is a parallelogram. A square is a parallelogram.",
    "Human beings have four limbs.",
    "A yard is three feet.",
]

# Creating the dataframe
df = pd.DataFrame({
    'input': inputs,
    'knowledge': knowledges
})
df


few_shot_template = """{q} We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=df.loc[i, "input"],
        k=df.loc[i, "knowledge"].lower()
    )
    for i in range(len(df))
])
print("Constructed few shot prompt\n" + few_shot_prompt)

Constructed few shot prompt
How many wings do penguins have? We know that birds have two wings. penguin is a kind of bird.
How many sides does a parallelogram have? We know that a rectangular is a parallelogram. a square is a parallelogram.
What is the number of limbs a typical human being has? We know that human beings have four limbs.
How many feet are there in a yard? We know that a yard is three feet.


In [None]:
# Run this to get CommonSenseQA dataset few shot prompts

inputs = [
    "Google Maps and other highway and street GPS services have replaced what?",
    "The fox walked from the city into the forest, what was it looking for?",
    "You can share files with someone if you have a connection to a what?",
    "Too many people want exotic snakes. The demand is driving what to carry them?",
    "The bodyguard was good at his duties, he made the person who hired him what?"
]

knowledges = [
    "Electronic maps are the modern version of paper atlas.",
    "Natural habitats are usually away from cities.",
    "Files can be shared over the Internet.",
    "Some people raise snakes as pets.",
    "The job of bodyguards is to ensure the safety and security of the employer."
]
# Creating the dataframe
df = pd.DataFrame({
    'input': inputs,
    'knowledge': knowledges
})
df


few_shot_template = """{q} We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=df.loc[i, "input"],
        k=df.loc[i, "knowledge"].lower()
    )
    for i in range(len(df))
])
print("Constructed few shot prompt\n" + few_shot_prompt)

Constructed few shot prompt
Google Maps and other highway and street GPS services have replaced what? We know that electronic maps are the modern version of paper atlas.
The fox walked from the city into the forest, what was it looking for? We know that natural habitats are usually away from cities.
You can share files with someone if you have a connection to a what? We know that files can be shared over the internet.
Too many people want exotic snakes. The demand is driving what to carry them? We know that some people raise snakes as pets.
The bodyguard was good at his duties, he made the person who hired him what? We know that the job of bodyguards is to ensure the safety and security of the employer.


In [None]:
# Only for visualizing a generated knowledge:
question = list(qa)[0]
choices = qa[question]
prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
).input_ids.to(device)

knowledge_statements = model.generate(
    prompt_input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.5
)
# access the knowledge statements (i.e., only text that comes after prompt)
knowledge = tokenizer.decode(
    knowledge_statements[0, prompt_input_ids.shape[-1]:],
    skip_special_tokens=True
)
print("Generated knowledge: ", knowledge)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Generated knowledge:  
bags are usually carried by the person carrying them.

So,


In [None]:
# Score each answer to the question based on the generated knowledge statements.
# As the score, we take the average log probability of the tokens in the answer.
import numpy as np
no_knowledge_statements = 5  # number of knowledge statements sampled per question

for question in list(qa):
  answers = qa[question]
  answer_log_probs = []
  prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
    ).input_ids.to(device)

  knowledge_statements = []
  for i in range(no_knowledge_statements):
    knowledge = model.generate(
        prompt_input_ids,
        max_new_tokens=15,
        do_sample=True,
        temperature=0.5
    )

    # access the knowledge statements (i.e., only text that comes after prompt)
    knowledge = tokenizer.decode(
        knowledge[0, prompt_input_ids.shape[-1]:],
        skip_special_tokens=True
    )
    knowledge_statements.append(knowledge)
    print("Generated knowledge: ", knowledge)

  # now we have knowledge statements in hand
  maximizing_answer = None
  maximizing_log_prob = -float("inf")
  for a in answers:
    log_probs_for_a = []
    for knowledge in knowledge_statements:
        # construct the full prompt
        prompt = f"{knowledge} {question} {a}"
        # construct the prompt without the answer to create a mask which will
        # allow to retrieve the token probabilities for tokens in the answer only
        context_prompt = f"{knowledge} {question}"
        # tokenize the prompt
        input_ids = tokenizer(prompt,
                            return_tensors="pt").input_ids.to(device)
        # tokenize the context prompt
        context_input_ids = tokenizer(context_prompt,
                                    return_tensors="pt").input_ids
        # create a mask with -100 for all tokens in the context prompt
        # the -100 indicates that the token should be ignored in the loss computation
        masked_labels = torch.ones_like(input_ids) * -100
        masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
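        # forward pass (no generation): score the answer tokens given the context
        with torch.no_grad():
            preds = model(
                input_ids,
                labels=masked_labels
            )
        # preds.loss is the mean negative log-likelihood over the unmasked
        # (answer) tokens, so its negation is the average log probability
        log_p = -preds.loss.item()
        log_probs_for_a.append(log_p)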
    max_prob = np.max(log_probs_for_a)
    if max_prob > maximizing_log_prob:
        maximizing_log_prob = max_prob
        maximizing_answer = a
    print("Answer ", a, "Answer probabilities for each knowledge statement: ", log_probs_for_a)
  print("Selected answer ", maximizing_answer, "with log P ", maximizing_log_prob)
  print(100 * "-")


Generated knowledge:  
the woman checked her luggage, but she didn't take any of her


Generated knowledge:  
baggage is usually checked in checked baggage.
The man was not


Generated knowledge:  
women carry small bags, especially when they're on the go.



Generated knowledge:  
the woman checked her bag before she left.
What was the reason
Generated knowledge:  
bags are usually carried in a carrier bag or a briefcase.

Answer  garbage can Answer probabilities for each knowledge statement:  [-8.651183128356934, -8.358133316040039, -9.2638578414917, -8.854644775390625, -8.918338775634766]
Answer  military Answer probabilities for each knowledge statement:  [-13.746541976928711, -13.27782917022705, -14.510452270507812, -14.339978218078613, -14.297685623168945]
Answer  jewelry store Answer probabilities for each knowledge statement:  [-9.776968002319336, -10.095916748046875, -10.659232139587402, -9.95555591583252, -10.486684799194336]
Answer  safe Answer probabilities for each knowledge statement:  [-11.570777893066406, -11.832415580749512, -13.639629364013672, -12.748946189880371, -13.790425300598145]


Answer  airport Answer probabilities for each knowledge statement:  [-11.538839340209961, -9.888652801513672, -12.38808536529541, -11.434816360473633, -11.631083488464355]
Selected answer  garbage can with log P  -8.358133316040039
----------------------------------------------------------------------------------------------------


Generated knowledge:  umpires are also responsible for cleaning the field.
The police officer was


Generated knowledge:  icing is used to stop glare on the ice.
The computer software was


Generated knowledge:  icing is a technique used to prevent glare in the game of football.



Generated knowledge:  umpires are responsible for preventing any glare.
The person who wanted to
Generated knowledge:  umpires are the workers who make sure the game goes well.
You
Answer  television Answer probabilities for each knowledge statement:  [-10.235928535461426, -9.407893180847168, -8.890667915344238, -9.370429039001465, -11.198206901550293]
Answer  attic Answer probabilities for each knowledge statement:  [-13.596774101257324, -12.11440658569336, -11.61604118347168, -11.67751693725586, -13.281587600708008]
Answer  corner Answer probabilities for each knowledge statement:  [-11.031961441040039, -10.703117370605469, -10.687712669372559, -11.021014213562012, -11.909016609191895]
Answer  they cannot clean corner and library during football match they cannot need that Answer probabilities for each knowledge statement:  [-6.271496772766113, -6.655148983001709, -6.51792049407959, -6.4584479331970215, -6.452688217163086]


Answer  ground Answer probabilities for each knowledge statement:  [-8.554006576538086, -9.23339557647705, -8.114116668701172, -8.754240989685059, -9.520474433898926]
Selected answer  they cannot clean corner and library during football match they cannot need that with log P  -6.271496772766113
----------------------------------------------------------------------------------------------------


Generated knowledge:  umpires are the people who make the decisions about what to call an 


Generated knowledge:  
presidents can be elected by the people.

The president is


Generated knowledge:  
the president is the leader of what institution? We know that the president


Generated knowledge:  
the president has a lot of power, but the president is also the
Generated knowledge:  umpires are the people who officiate at baseball games.
The cat
Answer  walmart Answer probabilities for each knowledge statement:  [-8.142420768737793, -8.972271919250488, -8.90455436706543, -8.03041934967041, -7.8204474449157715]
Answer  white house Answer probabilities for each knowledge statement:  [-8.612419128417969, -7.63491153717041, -7.091569900512695, -6.645547866821289, -8.576825141906738]
Answer  country Answer probabilities for each knowledge statement:  [-10.13504695892334, -12.369648933410645, -13.64605712890625, -11.518067359924316, -11.83149242401123]
Answer  corporation Answer probabilities for each knowledge statement:  [-12.276028633117676, -14.06977367401123, -16.650859832763672, -13.893194198608398, -13.85256576538086]


Answer  government Answer probabilities for each knowledge statement:  [-10.173294067382812, -9.73061466217041, -11.749650001525879, -9.258166313171387, -10.929475784301758]
Selected answer  white house with log P  -6.645547866821289
----------------------------------------------------------------------------------------------------


Generated knowledge:  
driving is a very dangerous job.
What kind of car was the


Generated knowledge:  ####### has a very high accident rate.
What do you mean by


Generated knowledge:  
There are many types of accidents:

Faulty brakes



Generated knowledge:  
- the driver is under the influence of drugs
- the driver is
Generated knowledge:  
accidents happen because of what? We know that we can’t
Answer  stressful Answer probabilities for each knowledge statement:  [-14.708761215209961, -15.258934020996094, -15.46780014038086, -14.419212341308594, -14.7927885055542]
Answer  dangerous Answer probabilities for each knowledge statement:  [-10.114299774169922, -11.092577934265137, -11.165120124816895, -9.094646453857422, -10.583547592163086]
Answer  fun Answer probabilities for each knowledge statement:  [-14.501969337463379, -14.986560821533203, -16.233652114868164, -14.622293472290039, -13.889074325561523]
Answer  illegal Answer probabilities for each knowledge statement:  [-11.216497421264648, -13.336479187011719, -12.792975425720215, -9.515718460083008, -11.635777473449707]


Answer  deadly Answer probabilities for each knowledge statement:  [-12.690098762512207, -14.63901424407959, -15.035849571228027, -12.710489273071289, -13.199512481689453]
Selected answer  dangerous with log P  -9.094646453857422
----------------------------------------------------------------------------------------------------


Generated knowledge:  
school is expensive.

A:

There are several reasons


Generated knowledge:  
schools are meant to prepare students for the future.

The


Generated knowledge:  
schools are for learning and education, they do not teach people how


Generated knowledge:  
schools will be the place for you to learn what? We know
Generated knowledge:  ith the education system.

A:

I'm not sure
Answer  get smart Answer probabilities for each knowledge statement:  [-8.589640617370605, -8.743338584899902, -8.013982772827148, -8.673873901367188, -9.343545913696289]
Answer  boredom Answer probabilities for each knowledge statement:  [-7.858763694763184, -7.82045841217041, -7.263005256652832, -7.9718918800354, -7.668207168579102]
Answer  colds and flu Answer probabilities for each knowledge statement:  [-5.3271989822387695, -5.516082286834717, -5.24860143661499, -5.370843410491943, -5.465941429138184]
Answer  taking tests Answer probabilities for each knowledge statement:  [-8.29901123046875, -7.830539703369141, -7.338224411010742, -8.12620735168457, -7.884116172790527]


Answer  spend time Answer probabilities for each knowledge statement:  [-7.690305233001709, -7.893522262573242, -7.179699897766113, -7.780394554138184, -8.239646911621094]
Selected answer  colds and flu with log P  -5.24860143661499
----------------------------------------------------------------------------------------------------


Generated knowledge:  
Stanley had trouble telling it from what? We know that Stanley had


Generated knowledge:  umpires are often called to decide games.
A person who is very


Generated knowledge:  
Stanley had a dream that was very vivid and scary. He had


Generated knowledge:  
Stanley had an aversion to heights.
The man was a
Generated knowledge:  icing is used in medicine to treat pain.
The computer was a big
Answer  imagination Answer probabilities for each knowledge statement:  [-14.361309051513672, -11.466972351074219, -12.520139694213867, -9.94018840789795, -11.04466724395752]
Answer  reality Answer probabilities for each knowledge statement:  [-12.76948356628418, -9.612512588500977, -9.671875, -7.55483865737915, -8.412652015686035]
Answer  dreamworker Answer probabilities for each knowledge statement:  [-14.165630340576172, -10.419055938720703, -12.055522918701172, -12.118217468261719, -11.015838623046875]
Answer  nightmare Answer probabilities for each knowledge statement:  [-15.019416809082031, -11.555357933044434, -11.577406883239746, -10.60135555267334, -11.841059684753418]
Answer  awake Answer probabilities for each knowledge statement:  [-15.815916061401367, -12.528425216674805, -13.598419189453125, -13.201595306396484, -13

### Few-Shot Prompting with CommonSenseQA

In [None]:
data = [
    ('A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?', ['A. bank', 'B. library', 'C. department store', 'D. mall', 'E. new york'], 'A'),
    ('What do people aim to do at work?', ['A. complete job', 'B. learn from each other', 'C. kill animals', 'D. wear hats', 'E. talk to each other'], 'A'),
    ('Where would you find magazines alongside many other printed works?', ['A. doctor', 'B. bookstore', 'C. market', 'D. train station', 'E. mortuary'], 'B'),
    ('Where are you likely to find a hamburger?', ['A. fast food restaurant', 'B. pizza', 'C. ground up dead cows', 'D. mouth', 'E. cow carcass'], 'A'),
    ('James was looking for a good place to buy farmland. Where might he look?', ['A. midwest', 'B. countryside', 'C. estate', 'D. farming areas', 'E. illinois'], 'A'),
    ('What island country is ferret popular?', ['A. own home', 'B. north carolina', 'C. great britain', 'D. hutch', 'E. outdoors'], 'C')
]
df = pd.DataFrame(data, columns=['Question', 'Answer_Options', 'Correct_Answer'])

# Define task and instructions prompts
task_prompt = "Task: Predict the correct answer option for the question provided, considering the available options."
instructions_prompt = "Instructions: Review the question and choices provided, then select the option you believe is the correct answer."

# Generate prompt with multiple examples
few_shot_prompt = f"{task_prompt}\n{instructions_prompt}\n"
for index, row in df.iterrows():
    few_shot_prompt += "\n"
    few_shot_prompt += f"Question: {row['Question']}\nOptions:\n"
    for option in row['Answer_Options']:
        few_shot_prompt += f"{option}\n"
    few_shot_prompt += f"Selected Choice: {row['Correct_Answer']}\n"

# Display prompt
print(few_shot_prompt)

Task: Predict the correct answer option for the question provided, considering the available options.
Instructions: Review the question and choices provided, then select the option you believe is the correct answer.

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Options:
A. bank
B. library
C. department store
D. mall
E. new york
Selected Choice: A

Question: What do people aim to do at work?
Options:
A. complete job
B. learn from each other
C. kill animals
D. wear hats
E. talk to each other
Selected Choice: A

Question: Where would you find magazines alongside many other printed works?
Options:
A. doctor
B. bookstore
C. market
D. train station
E. mortuary
Selected Choice: B

Question: Where are you likely to find a hamburger?
Options:
A. fast food restaurant
B. pizza
C. ground up dead cows
D. mouth
E. cow carcass
Selected Choice: A

Question: James was looking for a good place to buy farmland. Where might he look?

In [None]:
question = list(qa)[0]
choices = qa[question]

print(few_shot_prompt + "\nQuestion: " + question + "\nOptions:\n" + "\n".join([chr(ord('A') + i) + ". " + choices[i] for i in range(len(choices))]) + "\nSelected Choice:")

Task: Predict the correct answer option for the question provided, considering the available options.
Instructions: Review the question and choices provided, then select the option you believe is the correct answer.

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Options:
A. bank
B. library
C. department store
D. mall
E. new york
Selected Choice: A

Question: What do people aim to do at work?
Options:
A. complete job
B. learn from each other
C. kill animals
D. wear hats
E. talk to each other
Selected Choice: A

Question: Where would you find magazines alongside many other printed works?
Options:
A. doctor
B. bookstore
C. market
D. train station
E. mortuary
Selected Choice: B

Question: Where are you likely to find a hamburger?
Options:
A. fast food restaurant
B. pizza
C. ground up dead cows
D. mouth
E. cow carcass
Selected Choice: A

Question: James was looking for a good place to buy farmland. Where might he look?

In [None]:
import numpy as np

# Score each answer label (A-E) by its log probability under the few-shot prompt
# NOTE: This can take a moment

for question in list(qa):
  answers = qa[question]
  answer_log_probs = []
  # the context (few-shot examples, question, options) is identical for all labels
  context_prompt = few_shot_prompt + "\nQuestion: " + question + "\nOptions:\n" + "\n".join([chr(ord('A') + i) + ". " + answers[i] for i in range(len(answers))]) + "\nSelected Choice:"
  # tokenize the context once; only its length is needed to build the label mask
  context_input_ids = tokenizer(context_prompt,
                                return_tensors="pt").input_ids

  for choice in ['A','B','C','D','E']:
    # construct the full prompt by appending the candidate label
    prompt = context_prompt + " " + choice
    # tokenize the full prompt
    input_ids = tokenizer(prompt,
                          return_tensors="pt").input_ids.to(device)
    # create a mask with -100 for all tokens in the context prompt
    # the -100 indicates that the token should be ignored in the loss computation
    masked_labels = torch.ones_like(input_ids) * -100
    masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
    # forward pass (no generation): score the label tokens given the context
    with torch.no_grad():
      preds = model(
          input_ids,
          labels=masked_labels
      )
    # preds.loss is the mean negative log-likelihood of the unmasked tokens,
    # so its negation is the average log probability of the answer label
    answer_log_probs.append(-preds.loss.item())
  print("All answers ", answers)
  print("Answer probabilities ", answer_log_probs)
  max_prob_idx = np.argmax(answer_log_probs)
  print("Selected answer ", answers[max_prob_idx], "with log P ", answer_log_probs[max_prob_idx])
  print("-" * 100)

All answers  ['garbage can', 'military', 'jewelry store', 'safe', 'airport']
Answer probabilities  [-1.149327039718628, -0.8839691281318665, -1.7674624919891357, -2.4614875316619873, -5.297186851501465]
Selected answer  military with log P  -0.8839691281318665
----------------------------------------------------------------------------------------------------
All answers  ['television', 'attic', 'corner', 'they cannot clean corner and library during football match they cannot need that', 'ground']
Answer probabilities  [-1.6051456928253174, -1.3155701160430908, -1.452256441116333, -1.5575783252716064, -2.7076313495635986]
Selected answer  attic with log P  -1.3155701160430908
----------------------------------------------------------------------------------------------------
All answers  ['walmart', 'white house', 'country', 'corporation', 'government']
Answer probabilities  [-1.0540814399719238, -0.9338192343711853, -1.921186923980713, -2.363598346710205, -4.736490726470947]
Selected 

In [None]:
for question in list(qa):
  answers = qa[question]
  prompt = few_shot_prompt + "\nQuestion: " + question + "\nOptions:\n" + "\n".join([chr(ord('A') + i) + ". " + answers[i] for i in range(len(answers))]) + "\nSelected Choice:"

  input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

  output = softmax_sampling_decoding(input_ids)
  decoded = tokenizer.decode(output[0], skip_special_tokens=True)
  decoded = decoded.replace(prompt, "").replace("\n","")
  print(question)
  print(decoded)

The only baggage the woman checked was a drawstring bag, where was she heading with it?
 B
To prevent any glare during the big football game he made sure to clean the dust of his what?
 B
The president is the leader of what institution?
 A
What kind of driving leads to accidents?
 B
Can you name a good reason for attending school?
 B
Stanley had a dream that was very vivid and scary. He had trouble telling it from what?
 D
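
The scoring cells above read the answer score off `preds.loss` after masking the context tokens with `-100`. The following is a minimal NumPy sketch of what that quantity amounts to, assuming the usual causal-LM cross-entropy (labels shifted by one position, ignored positions excluded, mean over the remaining tokens). The function name `average_answer_log_prob` and the random toy logits are hypothetical and only stand in for a real model's output.

```python
import numpy as np

def average_answer_log_prob(logits, token_ids, n_context):
    """Average log probability of the tokens after the first n_context tokens.

    logits:    array of shape [T, V]; row t scores the token at position t + 1
    token_ids: array of shape [T] holding the full (context + answer) sequence
    """
    # log-softmax over the vocabulary, computed in a numerically stable way
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # causal shift: position t predicts token t + 1, so the last row is unused
    targets = token_ids[1:]
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    # keep only the answer tokens (the context plays the role of the -100 labels)
    answer_log_probs = token_log_probs[n_context - 1:]
    return answer_log_probs.mean()

# toy example (assumption): random logits instead of a language model
rng = np.random.default_rng(0)
vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])
logits = rng.normal(size=(len(token_ids), vocab_size))
score = average_answer_log_prob(logits, token_ids, n_context=3)
print(score)

# sanity check: with uniform logits every token has probability 1 / vocab_size
uniform_score = average_answer_log_prob(
    np.zeros((len(token_ids), vocab_size)), token_ids, n_context=3
)
print(uniform_score)
```

Because the score is an average over answer tokens, answers of different lengths are compared per token rather than by their total log probability.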


## Exercise 3: First neural LM (20 points)

Besides reading and understanding package documentation, a key skill for NLP researchers and practitioners is reading and critically assessing NLP literature. Both the density and the style of NLP literature have shifted significantly in recent years as progress has accelerated. Your task in this exercise is to read a paper about one of the first successful neural language models, understand its key architectural components, and compare how these components have evolved in the modern systems that were discussed in the lecture. 

> Specifically, please read this paper and answer the following questions: [Bengio et al. (2003)](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
>
> * How were words / tokens represented? What is the difference / similarity to modern LLMs?
> * How was the context represented? What is the difference / similarity to modern LLMs?
> * What is the curse of dimensionality? Give a concrete example in the context of language modeling.
> * Which training data was used? What is the difference / similarity to modern LLMs?
> * Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
> 
> * Please formulate one question about the paper (not the same as the questions above) and post it to the dedicated **Forum** space, and **answer 1 other question** about the paper.

Furthermore, your task is to carefully dissect the paper by Bengio et al. (2003) and analyse its structure and style in comparison to another more recent paper:  [Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

**TASK:**

> For each section of the Bengio et al. (2003) paper, what are the key differences in writing style and included content compared to the BERT paper (Devlin et al., 2019)? What are the key similarities? Write max. 2 sentences per section.


> ### Answers
> #### How were words / tokens represented? What is the difference / similarity to modern LLMs?
>Each word was represented by a continuous real-valued feature vector, so that similarity between words can be captured (instead of treating words as discrete random or deterministic variables). Each word is thus associated with a specific point in a vector space whose number of features is much smaller than the size of the vocabulary.
This idea is reminiscent of the feature vectors used in information retrieval. Here, however, the vectors are used to model the probability distribution of word sequences in natural language text.
Like Bengio et al., modern LLMs use vector representations. However, modern LLMs work with context-sensitive embeddings and split words into subword tokens.
> 
> #### How was the context represented? What is the difference / similarity to modern LLMs?
>The context is a fixed number of preceding words: their learned feature vectors are concatenated and fed into a neural network that predicts the next word, so the vector learnt for a word is shaped by the contexts in which it occurs. This follows the intuition that words which occur in similar contexts have similar meanings.
In contrast, modern LLMs work with self-attention and Transformers. They also use much wider context windows than the model described here.
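
As a rough illustration of the fixed-window idea, the sketch below looks up one feature vector per context word, concatenates them, and passes the result through a tanh hidden layer and a softmax over the vocabulary. It is a simplified toy version under stated assumptions: all sizes and the random weights are made up, the paper's optional direct input-to-output connections are omitted, and `next_word_probs` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy sizes (assumptions, far smaller than in the paper)
vocab_size, emb_dim, hidden_dim, n_context = 20, 4, 8, 3

C = rng.normal(size=(vocab_size, emb_dim))              # word feature matrix
H = rng.normal(size=(hidden_dim, n_context * emb_dim))  # input-to-hidden weights
d = np.zeros(hidden_dim)                                # hidden bias
U = rng.normal(size=(vocab_size, hidden_dim))           # hidden-to-output weights
b = np.zeros(vocab_size)                                # output bias

def next_word_probs(context_ids):
    """Distribution over the next word given n_context preceding word ids."""
    # look up one feature vector per context word and concatenate them
    x = C[context_ids].reshape(-1)
    # hidden layer with tanh non-linearity
    h = np.tanh(d + H @ x)
    # softmax over the whole vocabulary (stabilized by subtracting the max)
    y = b + U @ h
    e = np.exp(y - y.max())
    return e / e.sum()

probs = next_word_probs(np.array([5, 2, 7]))
print(probs.shape, probs.sum())
```

The input size is tied to `n_context`, so the window cannot grow without changing the weight matrix `H`, whereas self-attention handles variable-length contexts with the same parameters.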
> 
> #### What is the curse of dimensionality? Give a concrete example in the context of language modelling.
>The curse of dimensionality arises when analysing data in high-dimensional spaces: as the dimensionality rises, the volume of the space increases exponentially. The problem is that words which are similar will still be different in a high-dimensional space. Even though they appear relatively often in similar contexts, most of their contexts will still differ once every possible context is considered. 
One example is the joint distribution of 10 consecutive words with a vocabulary size of 100,000. Here we have 100,000^10 - 1 = 10^50 - 1 free parameters. 
The model addresses this by expressing the joint probability function of word sequences in terms of the feature vectors of the words, as a smooth function of these feature values computed by a neural network. 
In doing so, the method differs crucially from modern LLMs.
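
The arithmetic in the example above can be checked directly. The short sketch below computes the size of the full joint table and, for contrast, the size of a word feature matrix. The feature dimension `emb_dim = 60` is an assumption chosen only for illustration, and the feature matrix is just one part of a neural model's parameters.

```python
vocab_size = 100_000
n_words = 10

# a full joint table has one probability per possible word sequence, minus one
# because the probabilities must sum to 1
free_params_table = vocab_size ** n_words - 1
assert free_params_table == 10 ** 50 - 1
print(f"joint table: {free_params_table:.3e} free parameters")

# hypothetical feature-vector size, used only for comparison
emb_dim = 60
embedding_params = vocab_size * emb_dim
print(f"feature matrix with {emb_dim} features per word: {embedding_params:,} parameters")
```

The table grows exponentially with the number of words, while the feature matrix grows only linearly with the vocabulary size.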
> 
> #### Which training data was used? What is the difference / similarity to modern LLMs?
>The training set is a sequence of words drawn from a large but finite vocabulary. 
Comparative experiments were performed on the Brown corpus, where the first 800,000 words were used as the training set. 
Furthermore, an experiment was run on Associated Press (AP) News texts, where the training set consists of a stream of about 14 million words.
The training data of modern LLMs is much larger and much more diverse. 
> 
> #### Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
> Bengio et al.:
> - Neural networks with a softmax output layer
> - Generalization to unseen word sequences via similarity between word vectors
> 
>LLMs:
> - Self-Attention
> - contextualised embeddings
> - Masking
> - Special Tokens 
> - Fine-Tuning

> ### Differences per section
> The sections below follow the main sections of the Bengio et al. paper.
> #### Abstract
> Both abstracts introduce the problem and the proposed solution.
> 
> #### Introduction
> Bengio et al. describe the challenges of statistical language modelling and offer a neural network solution, with a special focus on the curse of dimensionality.
Devlin et al. explain the limitations of existing pre-training techniques and introduce BERT. 
> Thus, Bengio et al. introduce a new theoretical framework, whereas Devlin et al. introduce an entire alternative model.
> Bengio et al. look into earlier neural networks and statistical models, whereas Devlin et al. focus on feature-based models like ELMo.
> 
> #### A neural model
> Bengio et al. describe a neural network architecture. Devlin et al. explain a Transformer pre-trained on a giant corpus, with masked language modelling and fine-tuning.
> Both papers explain their architecture in detail and with figures. However, Bengio et al. give a much more detailed description of the theoretical background, while Devlin et al. presuppose knowledge of the general Transformer architecture and emphasize the differences and improvements of BERT.
> 
> #### Parallel Implementation
> Bengio et al. emphasize parallel computation on hardware that is no longer in use. There is no comparable section in the paper by Devlin et al.
> 
> #### Experimental Results
> Bengio et al. evaluate only perplexity reduction, on the Brown corpus and the AP News corpus, and compare against the state of the art at the time. Devlin et al. evaluate on a range of NLP benchmarks.
> 
> #### Extensions and Future Work
> Bengio et al. describe various possible improvements that could be tried in the future. Devlin et al. keep this part very short. 
> 
> #### Conclusion
> Both papers summarize their breakthroughs. Nonetheless, the conclusion of the BERT paper is shorter than that of the Bengio et al. paper.
