# Exercise #1

* Consider the following language models: GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
* Consider the following prompting / generation strategies: Beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.

For each model, which strategies do you think work well, and why? Are there particular tasks or contexts in which they work better than in others?

Let's start by briefly describing each model and technique.
* GPT-2: A large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages, called WebText (an internal OpenAI corpus created by scraping web pages with emphasis on document quality).
* GPT-4: A multimodal transformer-based language model, fine-tuned with reinforcement learning from human and AI feedback.
* Vicuna: An open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
* Llama-2-7b-base: A large transformer-based open-source language model with 7 billion parameters, trained on a mix of data from publicly available sources.

Techniques:
* Beam search: A decoding algorithm that keeps the $k$ most probable partial sequences (beams) at each step and discards the rest. Instead of committing to the single most likely token at every step, it retains a few alternatives, since one of them may lead to a more probable complete output.
* Tree-of-thought (ToT) reasoning: A technique that structures problem solving as a search over a tree in which each node is an intermediate "thought" (a coherent reasoning step). The model generates several candidate thoughts at each step, evaluates them itself, and explores the tree with lookahead and backtracking [[6]](https://arxiv.org/abs/2305.10601).
* Few-shot prompting: A technique that gives the LM $k$ demonstration pairs $(x_i, y_i)$ in the prompt so that it can infer the task-specific mapping from input to output in context.
* Few-shot CoT prompting: A variant of few-shot prompting in which each demonstration also contains a chain of thought, i.e. the intermediate reasoning steps that lead from the input $x$ to the output $y$. This helps when the connection between $x$ and $y$ is non-trivial, and the model is expected to generate more accurate and coherent responses.
* Zero-shot CoT prompting: A technique that elicits chain-of-thought reasoning without any demonstrations, for example by appending a trigger phrase such as "Let's think step by step" to the prompt [[5]](https://arxiv.org/abs/2205.11916).
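
To make the beam search description above concrete, here is a minimal sketch on a toy problem. The "language model" is a hypothetical lookup table with made-up probabilities (`TOY_LM` and `next_token_probs` are illustrative names, not part of any library); it only shows how keeping more than one beam can find a more probable sequence than greedy decoding.

```python
import math

# Hypothetical toy "language model": maps a prefix (tuple of tokens) to a
# next-token probability distribution. All numbers are made up.
TOY_LM = {
    (): {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    ("the",): {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
    ("a",): {"cat": 0.9, "dog": 0.05, "<eos>": 0.05},
}


def next_token_probs(prefix):
    # Prefixes that are not in the table end the sequence with probability 1.
    return TOY_LM.get(prefix, {"<eos>": 1.0})


def beam_search(num_beams, max_len):
    # Each beam is a pair (cumulative log-probability, token tuple).
    beams = [(0.0, ())]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((score, seq))  # finished beams carry over
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((score + math.log(p), seq + (tok,)))
        # Keep only the num_beams most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams


greedy = beam_search(num_beams=1, max_len=2)  # greedy decoding is beam size 1
wide = beam_search(num_beams=2, max_len=2)
print(greedy[0][1], round(math.exp(greedy[0][0]), 3))
print(wide[0][1], round(math.exp(wide[0][0]), 3))
```

With one beam the search commits to "the" (the most likely first token) and ends with a sequence of probability 0.2; with two beams it also keeps "a" alive and finds "a cat" with probability 0.36.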

Comments:

* Beam search is a widely used decoding strategy for text generation, but it can be prone to repetition and a lack of diversity in the generated text. It is particularly useful for machine translation [[1]](https://arxiv.org/abs/1808.09582) [[2]](https://arxiv.org/abs/1808.10006). I believe it can improve the performance of the models, but the improvement is upper-bounded by the model's capabilities.

* GPT-2's relatively small architecture limits its zero-shot performance on most tasks. It has been shown that while GPT-2 can match supervised baselines in zero-shot reading comprehension (still far below human level), its performance in summarization and other tasks like translation and question answering remains rudimentary compared to established techniques [[3]](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf). Few-shot prompting and beam search should work well with GPT-2.

* Larger models have been shown to make increasingly efficient use of in-context information [[4]](https://arxiv.org/abs/2005.14165). Strategies such as few-shot prompting and few-shot CoT prompting should therefore have the most impact on GPT-4 (estimated to have ~1.8 trillion parameters[*](https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/)), followed by Vicuna (assuming Vicuna-v1.5, which is based on Llama 2; earlier versions are based on Llama 1), Llama-2-7b-base (7 billion parameters) and GPT-2 (1.5 billion parameters), in that order. Zero-shot CoT should work well with Vicuna because it was fine-tuned on instruction-following conversations.
The same hierarchy applies to zero-shot chain-of-thought (CoT) prompting, but the gap is narrower when no example pair is provided, as shown in [[4]](https://arxiv.org/abs/2005.14165), [[5]](https://arxiv.org/abs/2205.11916).


* ToT is a complex technique that relies heavily on the "thoughts" generated by the LM, making it most beneficial for higher-capability models like GPT-4. It is recommended for tasks requiring deliberate reasoning, such as mathematical problems, where CoT struggles [[6]](https://arxiv.org/abs/2305.10601). Surprisingly, ToT might also boost the performance of models like Vicuna or Llama-2-7b-base, as GPT-3.5 with ToT has been shown to outperform GPT-4 with plain input-output prompting on various tasks [[6]](https://arxiv.org/abs/2305.10601).

* In their blog post[*](https://lmsys.org/blog/2023-03-30-vicuna/), the Vicuna team shows that Vicuna-13b, fine-tuned on 70,000 user-shared ChatGPT conversations, can compete with GPT-3.5 in areas like humanities and writing, with GPT-4 serving as the judge [[7]](https://arxiv.org/abs/2306.05685). However, Vicuna still lags behind GPT-3.5 in coding, mathematical questions, and reasoning. In summary, LLaMA-13b is significantly outperformed in all fields by Vicuna-13b-v1.3, which is competitive with (though slightly worse than) the Llama-2 models on most tasks and with GPT-3.5 on writing and humanities, while GPT-3.5 decisively outperforms both Vicuna-13b-v1.3 and Llama-2 in reasoning and mathematical tasks. Surprisingly, Vicuna-33b-v1.3 (based on LLaMA-1) is slightly better than the Llama-2 models with 7b and 13b parameters, and slightly worse than Llama-2-70b-chat[*](https://chat.lmsys.org/?leaderboard).

* Llama-2 chat models are shown to be slightly outperformed by GPT-3.5 in benchmarks [[8]](https://arxiv.org/pdf/2307.09288), making GPT-4 the best model in the list. However, Llama-2 and Vicuna models are preferable for tasks requiring fine-tuning or architectural modifications, as they are open-source. Although Llama models currently perform worse than GPT models, their open-source nature makes them appealing for advanced usage and customization.

Vicuna versions (source: https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md)

| Weights version | Link | FastChat version compatibility | Base Model | Release Date | Fine-tuning Data |
| ---- | ---- | ---- | ---- | ---- | ---- |
| v1.5 | [7B](https://huggingface.co/lmsys/vicuna-7b-v1.5), [7B-16k](https://huggingface.co/lmsys/vicuna-7b-v1.5-16k), [13B](https://huggingface.co/lmsys/vicuna-13b-v1.5), [13B-16k](https://huggingface.co/lmsys/vicuna-13b-v1.5-16k) | `>=0.2.21` | Llama 2 | Aug. 1, 2023 | 370M tokens |
| v1.3 | [7B](https://huggingface.co/lmsys/vicuna-7b-v1.3), [13B](https://huggingface.co/lmsys/vicuna-13b-v1.3), [33B](https://huggingface.co/lmsys/vicuna-33b-v1.3) | `>=0.2.1` | Llama 1 | Jun. 22, 2023 | 370M tokens |
| v1.1 | [7B](https://huggingface.co/lmsys/vicuna-7b-v1.1), [13B](https://huggingface.co/lmsys/vicuna-13b-v1.1) | `>=0.2.1` | Llama 1 | Apr. 12, 2023 | - |
| v0 | [7B-delta](https://huggingface.co/lmsys/vicuna-7b-delta-v0), [13B-delta](https://huggingface.co/lmsys/vicuna-13b-delta-v0) | `<=0.1.10` | Llama 1 | Mar. 30, 2023 | - |

In [2]:
!wget https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_judgment/gpt-4_single.jsonl
!wget https://huggingface.co/spaces/lmsys/mt-bench/resolve/main/data/mt_bench/model_judgment/gpt-4_pair.jsonl
!pip install -U plotly kaleido


In [3]:
# Disclaimer: Radar plot from the notebook: https://colab.research.google.com/drive/15O3Y8Rxq37PuMlArE291P4OC6ia37PQK#scrollTo=5i8R0l-XqkgO

import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go


CATEGORIES = ["Writing", "Roleplay", "Reasoning", "Math", "Coding", "Extraction", "STEM", "Humanities"]


def get_model_df():
    # Load the GPT-4 single-answer judgments. Question ids are mapped to
    # categories in blocks of 10, starting at id 81.
    q2result = []
    with open("gpt-4_single.jsonl", "r") as fin:
        for line in fin:
            obj = json.loads(line)
            obj["category"] = CATEGORIES[(obj["question_id"]-81)//10]
            q2result.append(obj)
    df = pd.DataFrame(q2result)
    return df

def toggle(res_str):
    if res_str == "win":
        return "loss"
    elif res_str == "loss":
        return "win"
    return "tie"

def get_model_df_pair():
    # Load the GPT-4 pairwise judgments. A comparison counts as a win (loss) for
    # model_1 only if both judgments (g1_winner, g2_winner) agree; otherwise it is a tie.
    q2result = []
    with open("gpt-4_pair.jsonl", "r") as fin:
        for line in fin:
            obj = json.loads(line)

            result = {}
            result["qid"] = str(obj["question_id"])
            result["turn"] = str(obj["turn"])
            if obj["g1_winner"] == "model_1" and obj["g2_winner"] == "model_1":
                result["result"] = "win"
            elif obj["g1_winner"] == "model_2" and obj["g2_winner"] == "model_2":
                result["result"] = "loss"
            else:
                result["result"] = "tie"
            result["category"] = CATEGORIES[(obj["question_id"]-81)//10]
            result["model"] = obj["model_1"]
            q2result.append(result)

    df = pd.DataFrame(q2result)

    return df

df = get_model_df()
df_pair = get_model_df_pair()

all_models = df["model"].unique()
print(all_models)
scores_all = []
for model in all_models:
    for cat in CATEGORIES:
        # filter category/model, and score format error (<1% case)
        res = df[(df["category"]==cat) & (df["model"]==model) & (df["score"] >= 0)]
        score = res["score"].mean()
        scores_all.append({"model": model, "category": cat, "score": score})

target_models = ["Llama-2-7b-chat", "vicuna-33b-v1.3", "Llama-2-13b-chat", "Llama-2-70b-chat", "gpt-3.5-turbo",  "vicuna-13b-v1.3", "gpt-4", "llama-13b"]

scores_target = [s for s in scores_all if s["model"] in target_models]

# sort by target_models
scores_target = sorted(scores_target, key=lambda x: target_models.index(x["model"]), reverse=True)

df_score = pd.DataFrame(scores_target)
df_score = df_score[df_score["model"].isin(target_models)]

rename_map = {"llama-13b": "LLaMA-13B",
              "alpaca-13b": "Alpaca-13B",
              "vicuna-33b-v1.3": "Vicuna-33B-v1.3",
              "vicuna-13b-v1.3": "Vicuna-13B-v1.3",
              "gpt-3.5-turbo": "GPT-3.5-turbo",
              "gpt-4": "GPT-4"}

for k, v in rename_map.items():
    df_score.replace(k, v, inplace=True)

fig = px.line_polar(df_score, r = 'score', theta = 'category', line_close = True, category_orders = {"category": CATEGORIES},
                    color = 'model', markers=True, color_discrete_sequence=px.colors.qualitative.Pastel)

fig.show()

['alpaca-13b' 'baize-v2-13b' 'chatglm-6b' 'claude-instant-v1' 'claude-v1'
 'dolly-v2-12b' 'falcon-40b-instruct' 'fastchat-t5-3b' 'gpt-3.5-turbo'
 'gpt-4' 'gpt4all-13b-snoozy' 'guanaco-33b' 'guanaco-65b'
 'h2ogpt-oasst-open-llama-13b' 'koala-13b' 'llama-13b' 'mpt-30b-chat'
 'mpt-30b-instruct' 'mpt-7b-chat' 'nous-hermes-13b'
 'oasst-sft-4-pythia-12b' 'oasst-sft-7-llama-30b' 'palm-2-chat-bison-001'
 'rwkv-4-raven-14b' 'stablelm-tuned-alpha-7b' 'tulu-30b' 'vicuna-13b-v1.3'
 'vicuna-33b-v1.3' 'vicuna-7b-v1.3' 'wizardlm-13b' 'wizardlm-30b'
 'Llama-2-7b-chat' 'Llama-2-13b-chat' 'Llama-2-70b-chat']


# Exercise #2

In [3]:
# import packages
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import numpy as np

# define computational device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m").to(device)

Device: cuda


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
from transformers import set_seed  # reproducibility
torch.manual_seed(42)
set_seed(42)

In [None]:
test_set = [
    "Input: A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. Relation:", # neutral
    "Input: A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. Relation:",  # entailment
    "Input: Children smiling and waving at camera. There are children present. Relation:",  # entailment
    "Input: A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. Relation:", # contradiction
    "Input: An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. Relation:", # neutral
    "Input: High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. Relation:" # contradiction
]

In [None]:
def pretty_print(s, text = None):
    decoded = tokenizer.decode(s, skip_special_tokens=True)
    if text is not None:
        decoded = decoded.replace(text, "")
    print(decoded)
    print(100 * '-' + '\n')

In [None]:
# greedy decoding
def greedy_decoding(input_ids, max_new_tokens=2):
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    return output

def beam_search_decoding(input_ids, max_new_tokens=2, num_beams=5):
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
        early_stopping=True,   # stop as soon as `num_beams` finished candidates have been found
        pad_token_id=tokenizer.eos_token_id
    )
    return output

def pure_sampling_decoding(input_ids, max_new_tokens=2):
    # activate sampling and deactivate top_k by setting top_k to 0
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=0,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

def softmax_sampling_decoding(input_ids, max_new_tokens=2, temperature=0.7):
    # activate sampling, deactivate top_k (top_k=0) and rescale the logits with the temperature
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=0,
        temperature=temperature,  # higher temperature means more randomness
        pad_token_id=tokenizer.eos_token_id
    )
    return output

def top_k_sampling_decoding(input_ids, max_new_tokens=2, k=50):
    # activate sampling and keep only the k most likely tokens at each step
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=k,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

# nucleus (top-p) sampling: sample from the smallest set of most likely tokens whose cumulative probability exceeds p
def top_p_sampling_decoding(input_ids, max_new_tokens=2, p=0.9):
    # activate sampling; top_k=0 disables the top-k filter, which may otherwise
    # still apply through the default generation config
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=p,
        top_k=0,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

# setting `penalty_alpha` together with `top_k` selects the contrastive search mode of `generate`
def contrastive_decoding(input_ids, max_new_tokens=2, penalty_alpha=0.6, top_k=4):
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        penalty_alpha=penalty_alpha,
        top_k=top_k,
        pad_token_id=tokenizer.eos_token_id
    )
    return output

## Experiment Scheme
In our experiments we worked on the following configurations:
* Two models, `pythia-410m` and `pythia-1.4b` are used to generate the output.


### Natural Language Inference

* We use "Instruction Prompting" to help the LM better understand the user's intention and follow the instruction.
* The following prefix is prepended to the input:\
 "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”."
* For each model, we test the following prompt engineering techniques:
    * $k$-shot learning
        * Zero-shot learning (no example is provided)
        * One-shot learning (only one example is provided)
        * Few-shot learning ($k$ = 3) (1 example for each relation)
        * Few-shot learning ($k$ = 9) (3 examples for each relation)
    * Self-consistency prompting (softmax sampling) with majority vote
* For each $k$-shot learning scenario, we test the following decoding strategies:
    * Greedy decoding
    * Pure sampling
    * Softmax sampling (temperature = 0.7)
    * Top-$k$ sampling ($k$ = 50)
    * Top-$p$ sampling ($p$ = 0.9)
    * Beam search (beam size = 5)
    * Contrastive decoding (penalty=0.6, $k$=4)
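
The sampling-based strategies listed above differ only in how they reshape the next-token distribution before drawing from it. The following is a minimal, library-free sketch of that idea, using made-up logits for a four-token vocabulary; the helper names are hypothetical, and the actual `generate` implementation may differ in details such as tie handling and the exact top-$p$ cutoff.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    # Zero out every token that is not in `keep` and rescale the rest to sum to 1.
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    # Keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    # Keep the smallest set of most likely tokens whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


logits = [2.0, 1.0, 0.5, -1.0]            # made-up logits
pure = softmax(logits)                    # pure sampling: T = 1, no filtering
sharp = softmax(logits, temperature=0.7)  # softmax sampling with T = 0.7
top2 = top_k_filter(pure, k=2)            # top-k sampling with k = 2
nucleus = top_p_filter(pure, p=0.9)       # top-p (nucleus) sampling with p = 0.9
greedy_token = max(range(len(pure)), key=pure.__getitem__)  # greedy = argmax

for name, dist in [("pure", pure), ("T=0.7", sharp), ("top-2", top2), ("top-p", nucleus)]:
    print(name, [round(q, 3) for q in dist])
```

Greedy decoding always picks the argmax, while each sampling variant draws the next token from its (filtered and renormalized) distribution.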

In self-consistency, we apply a majority vote to the outputs generated using softmax sampling.

Disclaimer: All these methods have different hyperparameters to tune, so the comparison is not completely fair. However, we believe that it gives a general idea of how these methods perform.

## Natural Language Inference
Natural language inference: the task is to classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”.
* A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. **neutral**
* A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. **entailment**
* Children smiling and waving at camera. There are children present. **entailment**
* A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. **contradiction**
* An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. **neutral**
* High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. **contradiction**

Few-shot examples are collected from the Stanford Natural Language Inference (SNLI) Corpus. \
Paper: https://arxiv.org/pdf/1508.05326

We experimented with two sets of few-shot examples. The first set consists of entirely independent sentence pairs, while the second set consists of related ones: three different "hypothesis" sentences are created for each "premise" sentence. The idea is to prevent the model from relying solely on the first sentence for the relation prediction.


| Model | Prompt Engineering | Few-Shot Examples | Best Accuracy | Majority Vote Accuracy |
| :---: | :---: | :---: |:---: |:---: |
| pythia-410m | Zero-shot learning | -  | 0/6 | 0/6|
| pythia-410m | One-shot learning |  - | 2/6 | 2/6 |
| pythia-410m | Few-shot learning ($k = 3$) |  Independent | 3/6 | 3/6|
| pythia-410m | Few-shot learning ($k = 9$) | Independent | 3/6 | 2/6 |
| pythia-410m | Few-shot learning ($k = 3$) |Related | 3/6| 2/6 |
| pythia-410m | Few-shot learning ($k = 9$) |  Related | 3/6 | 2/6 |
| pythia-410m | Self-consistency  ($k = 3$) |  Independent | 2/6 | - |
| pythia-410m | Self-consistency  ($k = 9$) |  Related | 2/6 | - |
| pythia-1.4b | Zero-shot learning | -  | 0/6 | 0/6|
| pythia-1.4b | One-shot learning |  - | 2/6  | 2/6|
| pythia-1.4b | Few-shot learning ($k = 3$) |  Independent | 3/6 | 2/6|
| pythia-1.4b | Few-shot learning ($k = 9$) | Independent | **4/6** (softmax sampling) | 2/6 |
| pythia-1.4b | Few-shot learning ($k = 3$) |Related | **4/6** (contrastive decoding) | 3/6 |
| pythia-1.4b | Few-shot learning ($k = 9$) |  Related | 3/6 | 2/6 |
| pythia-1.4b | Self-consistency  ($k = 3$) |  Independent | 2/6 | - |
| pythia-1.4b | Self-consistency  ($k = 9$) |  Related | 2/6 | - |

* Best Accuracy: We evaluate each decoding scheme separately and report the accuracy of the best-performing scheme, disregarding the others.
* Majority Vote Accuracy (few-shot learning): For each input, we collect the outputs of all decoding schemes and take the majority label as the final prediction; accuracy is computed on these voted predictions.
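
As a minimal sketch of how the two metrics defined above could be computed, assume the predicted label of every decoding scheme has already been extracted for each test input. The predictions below are hypothetical and only illustrate that the majority vote can score lower than the best single scheme; ties are broken in favour of the label seen first, which is an assumption of this sketch.

```python
from collections import Counter

# Ground-truth labels of the six test inputs.
gold = ["neutral", "entailment", "entailment", "contradiction", "neutral", "contradiction"]

# Hypothetical predictions per decoding scheme (not actual experiment outputs).
preds_by_decoder = {
    "greedy": ["entailment"] * 6,
    "softmax_sampling": ["neutral", "entailment", "contradiction",
                         "contradiction", "entailment", "neutral"],
    "beam_search": ["entailment", "entailment", "entailment",
                    "neutral", "neutral", "entailment"],
}


def accuracy(preds, gold):
    # Number of correct predictions.
    return sum(p == g for p, g in zip(preds, gold))


def best_accuracy(preds_by_decoder, gold):
    # Score every decoding scheme on its own and report the best one.
    return max(accuracy(preds, gold) for preds in preds_by_decoder.values())


def majority_vote_accuracy(preds_by_decoder, gold):
    # For each input, take the most frequent label across all decoding schemes.
    voted = []
    for i in range(len(gold)):
        votes = Counter(preds[i] for preds in preds_by_decoder.values())
        voted.append(votes.most_common(1)[0][0])
    return accuracy(voted, gold)


best = best_accuracy(preds_by_decoder, gold)
voted = majority_vote_accuracy(preds_by_decoder, gold)
print(f"Best accuracy: {best}/{len(gold)}")
print(f"Majority vote accuracy: {voted}/{len(gold)}")
```

The same voting function applies to self-consistency if the dictionary holds several samples from one decoding scheme instead of one output per scheme.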

Observations:
* When only the input is provided without any example or CoT prompting, the model doesn't even seem to understand the task - although the task is explicitly stated in the prompt. It attempts to complete the sentence with irrelevant information.
* We tested all these approaches with and without CoT prompts. In this particular case, using the given models and few-shot examples, there doesn't seem to be any difference between prompting for reasoning and not.
* There doesn't appear to be a hierarchy among the different decoding strategies, as the leading one varies from experiment to experiment.
* The models achieve similar accuracy with independent and related few-shot examples.
* `pythia-1.4b` appears to outperform `pythia-410m` when few-shot examples are provided (best accuracy 4/6 vs. 3/6). However, the test set needs to be larger to draw a solid conclusion. For both models, the best accuracy is already reached with $k=3$; increasing to $k=9$ does not improve it.

In [None]:
# independent
few_shot_examples_v1 = [
    "Input: Children smiling and waving at camera. There are children present. Relation: entailment",  # entailment (same pair as test_set[2])
    "Input: A man inspects the uniform of a figure in some East Asian country. The man is sleeping. Relation: contradiction", # contradiction
    "Input: An older and younger man smiling. Two men are smiling and laughing at the cats playing on the floor. Relation: neutral", # neutral
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. The church is filled with song. Relation: entailment",  # entailment
    "Input: A black race car starts up in front of a crowd of people. A man is driving down a lonely road. Relation: contradiction", # contradiction
    "Input: A smiling costumed woman is holding an umbrella. A happy woman in a fairy costume holds an umbrella. Relation: neutral", # neutral
    "Input: A soccer game with multiple males playing. Some men are playing a sport. Relation: entailment",  # entailment
    "Input: Four dirty and barefooted children. Four kids won awards for 'cleanest feet'. Relation: contradiction", # contradiction
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman is young. Relation: neutral", # neutral
]

In [None]:
# related
few_shot_examples_v2 = [
    "Input: Children smiling and waving at camera. There are children present. Relation: entailment",  # entailment
    "Input: Children smiling and waving at camera. The kids are frowning. Relation: contradiction", # contradiction
    "Input: Children smiling and waving at camera. They are smiling at their parents. Relation: neutral", # neutral
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. The church is filled with song. Relation: entailment",  # entailment
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. A choir singing at a baseball game. Relation: contradiction", # contradiction
    "Input: This church choir sings to the masses as they sing joyous songs from the book at a church. The church has cracks in the ceiling. Relation: neutral", # neutral
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman is very happy. Relation: entailment", # entailment
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman has been shot. Relation: contradiction", # contradiction
    "Input: A woman with a green headscarf, blue shirt and a very big grin. The woman is young. Relation: neutral", # neutral
]

In [None]:
def experiment(k, decoding, few_shot_examples, test_set):
    # Run one decoding scheme on every test input with a k-shot prompt and print the completions.
    print(f"Decoding scheme: {decoding.__name__}\n")
    task = "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”. \n"
    # prepend the first k demonstrations (k = 0 gives the zero-shot prompt)
    few_shot = "\n".join(few_shot_examples[:k]) + "\n" if k else ""
    for idx, test in enumerate(test_set):
        input_text = task + few_shot + test

        input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
        output = decoding(input_ids)
        print(f"{idx}) " + test)
        # decode only the newly generated tokens instead of string-replacing the prompt
        pretty_print(output[0][input_ids.shape[-1]:])


In [None]:
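def majority_vote(k, few_shot_examples, test_set):
  # Print the completion of every decoding scheme for each test input
  # (the vote itself is not computed in this function).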
  for idx, test in enumerate(test_set):
    suffix = ""
    if(k):
        few_shot = few_shot_examples[:k]
        suffix = "\n".join(few_shot) + "\n"
    task = "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”. \n"
    input_text = task + suffix +  test

    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    print(f"{idx}) " + test)
    for decoding in [greedy_decoding, beam_search_decoding, pure_sampling_decoding, softmax_sampling_decoding, top_k_sampling_decoding, top_p_sampling_decoding, contrastive_decoding]:
      output = decoding(input_ids)
      decoded = tokenizer.decode(output[0], skip_special_tokens=True)
      decoded = decoded.replace(input_text, "").replace("\n","")
      print(decoded)


In [None]:
# decoding strategies: greedy, beam search, pure sampling, softmax sampling, top-k sampling, top-p sampling, contrastive decoding
for decoding in [greedy_decoding, beam_search_decoding, pure_sampling_decoding, softmax_sampling_decoding, top_k_sampling_decoding, top_p_sampling_decoding, contrastive_decoding]:
    experiment(3, decoding, few_shot_examples_v2, test_set)

## GT:  neutral entailment entailment contradiction neutral contradiction

In [None]:
import gc
torch.cuda.empty_cache()
gc.collect()

In [None]:
majority_vote(1, few_shot_examples_v1, test_set)
## GT:  neutral entailment entailment contradiction neutral contradiction

In [None]:
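def self_consistency(k, decoding, few_shot_examples, test_set, n = 5):
  # Print n sampled completions for each test input
  # (the vote itself is not computed in this function).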
  for idx, test in enumerate(test_set):
    suffix = ""
    if(k):
        few_shot = few_shot_examples[:k]
        suffix = "\n".join(few_shot) + "\n"
    task = "Please classify whether two sentences form a “contradiction” or an “entailment”, or the relation is “neutral”.\n"
    input_text = task + suffix +  test


    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    print(f"{idx}) " + test)

    for i in range(n):
      output = decoding(input_ids)
      decoded = tokenizer.decode(output[0], skip_special_tokens=True)
      decoded = decoded.replace(input_text, "").replace("\n","")
      print(decoded)


In [None]:
self_consistency(9, softmax_sampling_decoding, few_shot_examples_v1, test_set)
## GT:  neutral entailment entailment contradiction neutral contradiction

## Multiple-choice QA

Multiple-choice QA: the task is to predict the correct answer option for the question, given the question and the options.


## Experiment Scheme
In our experiments we worked on the following configurations:
* Two models, `pythia-410m` and `pythia-1.4b` are used to generate the output.
* Few-shot examples are collected from two different datasets: [CommonSenseQA](https://huggingface.co/datasets/tau/commonsense_qa) and [NumerSense](https://inklab.usc.edu/NumerSense/). CommonSenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions, each with one correct answer and four distractor answers. NumerSense is a numerical commonsense reasoning probing task, with a diagnostic dataset consisting of 3,145 masked-word-prediction probes.
* We primarily explored two prompting techniques: Generated Knowledge Prompting (with few-shot examples) and Few-shot Prompting. In Generated Knowledge Prompting, the model first generates knowledge statements about the given question; we then evaluate the log probability of each answer option given these knowledge statements. This lets the generated knowledge either support or contradict a candidate answer. We also compare two output extraction techniques: scoring the answer options by their log probabilities, and classical generation. Classical generation also relies on the model's token probabilities to produce new tokens, so the two approaches may appear similar at first. However, scoring constrains the model's output to one of the given options, whereas free generation can produce arbitrary text. We worked on the following combinations:
  * Generated Knowledge Prompting with NumerSense dataset + scoring
  * Generated Knowledge Prompting with CommonSenseQA dataset + scoring
  * Few Shot Learning with CommonSenseQA dataset + scoring
  * Few Shot Learning with CommonSenseQA dataset + classical generation (softmax sampling)
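
The difference between scoring and generation can be sketched without a model. In the snippet below, each answer option is scored by summing the log probabilities of its tokens, and the option with the highest score is returned, so the output is always one of the given options. The logits and token ids are made up, and `sequence_log_prob` and `pick_option` are hypothetical helper names; with a real model the per-step logits would come from a forward pass over the prompt followed by the option.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_prob(step_logits, token_ids):
    # step_logits[t] are the logits used to predict the t-th answer token;
    # the score is the sum of the log probabilities of the answer tokens.
    return sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))


def pick_option(option_scores):
    # Constrain the output to the given options by taking the arg max.
    return max(option_scores, key=option_scores.get)


# Made-up logits over a 3-token vocabulary for two answer options.
options = {
    "airport": ([[2.0, 0.0, 0.0]], [0]),                      # one-token answer
    "safe": ([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [1, 1]),     # two-token answer
}
scores = {name: sequence_log_prob(l, t) for name, (l, t) in options.items()}
best_option = pick_option(scores)
print(scores, best_option)
```

Summed log probabilities favour shorter options; whether the scores in the experiments were length-normalized is not stated here, so this sketch leaves them unnormalized.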

For generated knowledge prompting, we generated five knowledge statements for each question. For each answer choice $a$, we identified the knowledge statement that best supports it by computing the log probability of generating $a$ under each knowledge-augmented prompt and taking the maximum over the knowledge statements. Finally, we selected the answer choice with the highest maximum log probability.
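
The selection rule described above (maximum over knowledge statements, then arg max over answer choices) can be sketched as follows. The scoring function is passed in as a callable; `log_prob(question, knowledge, answer)` is a hypothetical interface, and the toy scores and placeholder knowledge statements `"K1"` and `"K2"` are made up for illustration.

```python
def select_answer(question, choices, knowledge_statements, log_prob):
    # For each choice, keep the score of the knowledge statement that supports it best.
    best_per_choice = {
        choice: max(log_prob(question, k, choice) for k in knowledge_statements)
        for choice in choices
    }
    # The final answer is the choice with the highest of these maxima.
    answer = max(best_per_choice, key=best_per_choice.get)
    return answer, best_per_choice


# Made-up log probabilities standing in for a real model.
toy_scores = {
    ("K1", "country"): -1.2, ("K2", "country"): -0.4,
    ("K1", "government"): -0.9, ("K2", "government"): -1.5,
}


def toy_log_prob(question, knowledge, answer):
    return toy_scores[(knowledge, answer)]


answer, per_choice = select_answer(
    "The president is the leader of what institution?",
    ["country", "government"],
    ["K1", "K2"],
    toy_log_prob,
)
print(answer, per_choice)
```

Because the maximum is taken per choice, different choices may be supported by different knowledge statements, which is why a single misleading statement can still change the final answer.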

All the few-shot prompts can be found below.

| Model | Prompt Engineering | Dataset | Generation Technique | Accuracy |
| :---: | :---: | :---: |:---: |:---: |
| pythia-410m | Generated Knowledge Prompting | NumerSense  |  Log-Probability scoring | 3/6 |
| pythia-410m | Generated Knowledge Prompting | CommonSenseQA  |  Log-Probability scoring | **4/6** |
| pythia-410m | Few Shot learning | CommonSenseQA  |  Log-Probability scoring | 0/6 |
| pythia-410m | Few Shot learning | CommonSenseQA  |  Generation with Softmax Sampling | 1/6 |
| pythia-1.4b | Generated Knowledge Prompting | NumerSense  |  Log-Probability scoring | 2/6 |
| pythia-1.4b| Generated Knowledge Prompting | CommonSenseQA  |  Log-Probability scoring | 2/6 |
| pythia-1.4b | Few Shot learning | CommonSenseQA  |  Log-Probability scoring | 1/6 |
| pythia-1.4b | Few Shot learning | CommonSenseQA  |  Generation with Softmax Sampling | 2/6 |


Observations:
* Overall, generated knowledge prompting seems to outperform vanilla few shot learning (no knowledge statement is generated).
* Maximum accuracy is achieved using generated knowledge prompting on CommonSenseQA with `pythia-410m`.


### Generated Knowledge Prompting with NumerSense Dataset

In [None]:
qa = {
    "The only baggage the woman checked was a drawstring bag, where was she heading with it?": ["garbage can", "military", "jewelry store", "safe", "airport"],
    "To prevent any glare during the big football game he made sure to clean the dust of his what?": ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"],
    "The president is the leader of what institution?": ["walmart", "white house", "country", "corporation", "government"],
    "What kind of driving leads to accidents?": ["stressful", "dangerous", "fun", "illegal", "deadly"],
    "Can you name a good reason for attending school?": ["get smart", "boredom", "colds and flu", "taking tests", "spend time"],
    "Stanley had a dream that was very vivid and scary. He had trouble telling it from what?": ["imagination", "reality", "dreamworker", "nightmare", "awake"]
}

# GT: airport - television - country - dangerous - get smart - reality

In [None]:
# Run this to get NumerSense dataset few shot prompts
import pandas as pd

inputs = [
    "How many wings do penguins have?",
    "How many sides does a parallelogram have?",
    "What is the number of limbs a typical human being has?",
    "How many feet are there in a yard?",
]

knowledges = [
    "Birds have two wings. Penguin is a kind of bird.",
    "A rectangular is a parallelogram. A square is a parallelogram.",
    "Human beings have four limbs.",
    "A yard is three feet.",
]

# Creating the dataframe
df = pd.DataFrame({
    'input': inputs,
    'knowledge': knowledges
})
df


few_shot_template = """{q} We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=df.loc[i, "input"],
        k=df.loc[i, "knowledge"].lower()
    )
    for i in range(len(df))
])
print("Constructed few shot prompt\n" + few_shot_prompt)

Constructed few shot prompt
How many wings do penguins have? We know that birds have two wings. penguin is a kind of bird.
How many sides does a parallelogram have? We know that a rectangular is a parallelogram. a square is a parallelogram.
What is the number of limbs a typical human being has? We know that human beings have four limbs.
How many feet are there in a yard? We know that a yard is three feet.


In [None]:
# Run this to get CommonSenseQA dataset few shot prompts

inputs = [
    "Google Maps and other highway and street GPS services have replaced what?",
    "The fox walked from the city into the forest, what was it looking for?",
    "You can share files with someone if you have a connection to a what?",
    "Too many people want exotic snakes. The demand is driving what to carry them?",
    "The bodyguard was good at his duties, he made the person who hired him what?"
]

knowledges = [
    "Electronic maps are the modern version of paper atlas.",
    "Natural habitats are usually away from cities.",
    "Files can be shared over the Internet.",
    "Some people raise snakes as pets.",
    "The job of bodyguards is to ensure the safety and security of the employer."
]
# Creating the dataframe
df = pd.DataFrame({
    'input': inputs,
    'knowledge': knowledges
})
df


few_shot_template = """{q} We know that {k}"""

few_shot_prompt = "\n".join([
    few_shot_template.format(
        q=df.loc[i, "input"],
        k=df.loc[i, "knowledge"].lower()
    )
    for i in range(len(df))
])
print("Constructed few shot prompt\n" + few_shot_prompt)

Constructed few shot prompt
Google Maps and other highway and street GPS services have replaced what? We know that electronic maps are the modern version of paper atlas.
The fox walked from the city into the forest, what was it looking for? We know that natural habitats are usually away from cities.
You can share files with someone if you have a connection to a what? We know that files can be shared over the internet.
Too many people want exotic snakes. The demand is driving what to carry them? We know that some people raise snakes as pets.
The bodyguard was good at his duties, he made the person who hired him what? We know that the job of bodyguards is to ensure the safety and security of the employer.


In [None]:
# Only for visualizing a generated knowledge:
question = list(qa)[0]
choices = qa[question]
prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
).input_ids.to(device)

knowledge_statements = model.generate(
    prompt_input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.5
)
# access the knowledge statements (i.e., only text that comes after prompt)
knowledge = tokenizer.decode(
    knowledge_statements[0, prompt_input_ids.shape[-1]:],
    skip_special_tokens=True
)
print("Generated knowledge: ", knowledge)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Generated knowledge:  
bags are usually carried by the person carrying them.

So,


In [None]:
# 3. Score each answer to the question based on the knowledge statements
# as the score, we take the average log probability of the tokens in the answer
# iterate over the answer options
import numpy as np
no_knowledge_statements = 5

for question in list(qa):
  answers = qa[question]
  answer_log_probs = []
  prompt_input_ids = tokenizer(
    few_shot_prompt + "\n" + question + " We know that ",
    return_tensors="pt"
    ).input_ids.to(device)

  knowledge_statements = []
  for i in range(no_knowledge_statements):
    knowledge = model.generate(
        prompt_input_ids,
        max_new_tokens=15,
        do_sample=True,
        temperature=0.5
    )

    # access the knowledge statements (i.e., only text that comes after prompt)
    knowledge = tokenizer.decode(
        knowledge[0, prompt_input_ids.shape[-1]:],
        skip_special_tokens=True
    )
    knowledge_statements.append(knowledge)
    print("Generated knowledge: ", knowledge)

  # now we have knowledge statements in hand
  maximizing_answer = None
  maximizing_log_prob = -float("inf")
  for a in answers:
    log_probs_for_a = []
    for knowledge in knowledge_statements:
        # construct the full prompt
        prompt = f"{knowledge} {question} {a}"
        # construct the prompt without the answer to create a mask which will
        # allow to retrieve the token probabilities for tokens in the answer only
        context_prompt = f"{knowledge} {question}"
        # tokenize the prompt
        input_ids = tokenizer(prompt,
                            return_tensors="pt").input_ids.to(device)
        # tokenize the context prompt
        context_input_ids = tokenizer(context_prompt,
                                    return_tensors="pt").input_ids
        # create a mask with -100 for all tokens in the context prompt
        # the -100 indicates that the token should be ignored in the loss computation
        masked_labels = torch.ones_like(input_ids) * -100
        masked_labels[:, context_input_ids.shape[-1]:] = input_ids[:, context_input_ids.shape[-1]:]
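        # score the answer with a forward pass (nothing is generated here)
        with torch.no_grad():
            preds = model(
                input_ids,
                labels=masked_labels
            )
        # preds.loss is the mean negative log-likelihood of the answer tokens,
        # so its negation is their average log probability
        nll = preds.loss.item()
        log_probs_for_a.append(-nll)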
    max_prob = np.max(log_probs_for_a)
    if max_prob > maximizing_log_prob:
        maximizing_log_prob = max_prob
        maximizing_answer = a
    print("Answer ", a, "Answer probabilities for each knowledge statement: ", log_probs_for_a)
  print("Selected answer ", maximizing_answer, "with log P ", maximizing_log_prob)
  print(100 * "-")


Generated knowledge:  
the woman checked her luggage, but she didn't take any of her


Generated knowledge:  
baggage is usually checked in checked baggage.
The man was not


Generated knowledge:  
women carry small bags, especially when they're on the go.



Generated knowledge:  
the woman checked her bag before she left.
What was the reason
Generated knowledge:  
bags are usually carried in a carrier bag or a briefcase.

Answer  garbage can Answer probabilities for each knowledge statement:  [-8.651183128356934, -8.358133316040039, -9.2638578414917, -8.854644775390625, -8.918338775634766]
Answer  military Answer probabilities for each knowledge statement:  [-13.746541976928711, -13.27782917022705, -14.510452270507812, -14.339978218078613, -14.297685623168945]
Answer  jewelry store Answer probabilities for each knowledge statement:  [-9.776968002319336, -10.095916748046875, -10.659232139587402, -9.95555591583252, -10.486684799194336]
Answer  safe Answer probabilities for each knowledge statement:  [-11.570777893066406, -11.832415580749512, -13.639629364013672, -12.748946189880371, -13.790425300598145]


Answer  airport Answer probabilities for each knowledge statement:  [-11.538839340209961, -9.888652801513672, -12.38808536529541, -11.434816360473633, -11.631083488464355]
Selected answer  garbage can with log P  -8.358133316040039
----------------------------------------------------------------------------------------------------


Generated knowledge:  umpires are also responsible for cleaning the field.
The police officer was


Generated knowledge:  icing is used to stop glare on the ice.
The computer software was


Generated knowledge:  icing is a technique used to prevent glare in the game of football.



Generated knowledge:  umpires are responsible for preventing any glare.
The person who wanted to
Generated knowledge:  umpires are the workers who make sure the game goes well.
You
Answer  television Answer probabilities for each knowledge statement:  [-10.235928535461426, -9.407893180847168, -8.890667915344238, -9.370429039001465, -11.198206901550293]
Answer  attic Answer probabilities for each knowledge statement:  [-13.596774101257324, -12.11440658569336, -11.61604118347168, -11.67751693725586, -13.281587600708008]
Answer  corner Answer probabilities for each knowledge statement:  [-11.031961441040039, -10.703117370605469, -10.687712669372559, -11.021014213562012, -11.909016609191895]
Answer  they cannot clean corner and library during football match they cannot need that Answer probabilities for each knowledge statement:  [-6.271496772766113, -6.655148983001709, -6.51792049407959, -6.4584479331970215, -6.452688217163086]


Answer  ground Answer probabilities for each knowledge statement:  [-8.554006576538086, -9.23339557647705, -8.114116668701172, -8.754240989685059, -9.520474433898926]
Selected answer  they cannot clean corner and library during football match they cannot need that with log P  -6.271496772766113
----------------------------------------------------------------------------------------------------


Generated knowledge:  umpires are the people who make the decisions about what to call an 


Generated knowledge:  
presidents can be elected by the people.

The president is


Generated knowledge:  
the president is the leader of what institution? We know that the president


Generated knowledge:  
the president has a lot of power, but the president is also the
Generated knowledge:  umpires are the people who officiate at baseball games.
The cat
Answer  walmart Answer probabilities for each knowledge statement:  [-8.142420768737793, -8.972271919250488, -8.90455436706543, -8.03041934967041, -7.8204474449157715]
Answer  white house Answer probabilities for each knowledge statement:  [-8.612419128417969, -7.63491153717041, -7.091569900512695, -6.645547866821289, -8.576825141906738]
Answer  country Answer probabilities for each knowledge statement:  [-10.13504695892334, -12.369648933410645, -13.64605712890625, -11.518067359924316, -11.83149242401123]
Answer  corporation Answer probabilities for each knowledge statement:  [-12.276028633117676, -14.06977367401123, -16.650859832763672, -13.893194198608398, -13.85256576538086]


Answer  government Answer probabilities for each knowledge statement:  [-10.173294067382812, -9.73061466217041, -11.749650001525879, -9.258166313171387, -10.929475784301758]
Selected answer  white house with log P  -6.645547866821289
----------------------------------------------------------------------------------------------------


Generated knowledge:  
driving is a very dangerous job.
What kind of car was the


Generated knowledge:  ####### has a very high accident rate.
What do you mean by


Generated knowledge:  
There are many types of accidents:

Faulty brakes



Generated knowledge:  
- the driver is under the influence of drugs
- the driver is
Generated knowledge:  
accidents happen because of what? We know that we can’t
Answer  stressful Answer probabilities for each knowledge statement:  [-14.708761215209961, -15.258934020996094, -15.46780014038086, -14.419212341308594, -14.7927885055542]
Answer  dangerous Answer probabilities for each knowledge statement:  [-10.114299774169922, -11.092577934265137, -11.165120124816895, -9.094646453857422, -10.583547592163086]
Answer  fun Answer probabilities for each knowledge statement:  [-14.501969337463379, -14.986560821533203, -16.233652114868164, -14.622293472290039, -13.889074325561523]
Answer  illegal Answer probabilities for each knowledge statement:  [-11.216497421264648, -13.336479187011719, -12.792975425720215, -9.515718460083008, -11.635777473449707]


Answer  deadly Answer probabilities for each knowledge statement:  [-12.690098762512207, -14.63901424407959, -15.035849571228027, -12.710489273071289, -13.199512481689453]
Selected answer  dangerous with log P  -9.094646453857422
----------------------------------------------------------------------------------------------------


Generated knowledge:  
school is expensive.

A:

There are several reasons


Generated knowledge:  
schools are meant to prepare students for the future.

The


Generated knowledge:  
schools are for learning and education, they do not teach people how


Generated knowledge:  
schools will be the place for you to learn what? We know
Generated knowledge:  ith the education system.

A:

I'm not sure
Answer  get smart Answer probabilities for each knowledge statement:  [-8.589640617370605, -8.743338584899902, -8.013982772827148, -8.673873901367188, -9.343545913696289]
Answer  boredom Answer probabilities for each knowledge statement:  [-7.858763694763184, -7.82045841217041, -7.263005256652832, -7.9718918800354, -7.668207168579102]
Answer  colds and flu Answer probabilities for each knowledge statement:  [-5.3271989822387695, -5.516082286834717, -5.24860143661499, -5.370843410491943, -5.465941429138184]
Answer  taking tests Answer probabilities for each knowledge statement:  [-8.29901123046875, -7.830539703369141, -7.338224411010742, -8.12620735168457, -7.884116172790527]


Answer  spend time Answer probabilities for each knowledge statement:  [-7.690305233001709, -7.893522262573242, -7.179699897766113, -7.780394554138184, -8.239646911621094]
Selected answer  colds and flu with log P  -5.24860143661499
----------------------------------------------------------------------------------------------------


Generated knowledge:  
Stanley had trouble telling it from what? We know that Stanley had


Generated knowledge:  umpires are often called to decide games.
A person who is very


Generated knowledge:  
Stanley had a dream that was very vivid and scary. He had


Generated knowledge:  
Stanley had an aversion to heights.
The man was a
Generated knowledge:  icing is used in medicine to treat pain.
The computer was a big
Answer  imagination Answer probabilities for each knowledge statement:  [-14.361309051513672, -11.466972351074219, -12.520139694213867, -9.94018840789795, -11.04466724395752]
Answer  reality Answer probabilities for each knowledge statement:  [-12.76948356628418, -9.612512588500977, -9.671875, -7.55483865737915, -8.412652015686035]
Answer  dreamworker Answer probabilities for each knowledge statement:  [-14.165630340576172, -10.419055938720703, -12.055522918701172, -12.118217468261719, -11.015838623046875]
Answer  nightmare Answer probabilities for each knowledge statement:  [-15.019416809082031, -11.555357933044434, -11.577406883239746, -10.60135555267334, -11.841059684753418]
Answer  awake Answer probabilities for each knowledge statement:  [-15.815916061401367, -12.528425216674805, -13.598419189453125, -13.201595306396484, -13

### Few-Shot Prompting with CommonSenseQA

In [None]:
data = [
    ('A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?', ['A. bank', 'B. library', 'C. department store', 'D. mall', 'E. new york'], 'A'),
    ('What do people aim to do at work?', ['A. complete job', 'B. learn from each other', 'C. kill animals', 'D. wear hats', 'E. talk to each other'], 'A'),
    ('Where would you find magazines alongside many other printed works?', ['A. doctor', 'B. bookstore', 'C. market', 'D. train station', 'E. mortuary'], 'B'),
    ('Where are you likely to find a hamburger?', ['A. fast food restaurant', 'B. pizza', 'C. ground up dead cows', 'D. mouth', 'E. cow carcass'], 'A'),
    ('James was looking for a good place to buy farmland. Where might he look?', ['A. midwest', 'B. countryside', 'C. estate', 'D. farming areas', 'E. illinois'], 'A'),
    ('What island country is ferret popular?', ['A. own home', 'B. north carolina', 'C. great britain', 'D. hutch', 'E. outdoors'], 'C')
]
df = pd.DataFrame(data, columns=['Question', 'Answer_Options', 'Correct_Answer'])

# Define task and instructions prompts
task_prompt = "Task: Predict the correct answer option for the question provided, considering the available options."
instructions_prompt = "Instructions: Review the question and choices provided, then select the option you believe is the correct answer."

# Generate prompt with multiple examples
few_shot_prompt = f"{task_prompt}\n{instructions_prompt}\n"
for index, row in df.iterrows():
    few_shot_prompt += "\n"
    few_shot_prompt += f"Question: {row['Question']}\nOptions:\n"
    for option in row['Answer_Options']:
        few_shot_prompt += f"{option}\n"
    few_shot_prompt += f"Selected Choice: {row['Correct_Answer']}\n"

# Display prompt
print(few_shot_prompt)

Task: Predict the correct answer option for the question provided, considering the available options.
Instructions: Review the question and choices provided, then select the option you believe is the correct answer.

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Options:
A. bank
B. library
C. department store
D. mall
E. new york
Selected Choice: A

Question: What do people aim to do at work?
Options:
A. complete job
B. learn from each other
C. kill animals
D. wear hats
E. talk to each other
Selected Choice: A

Question: Where would you find magazines alongside many other printed works?
Options:
A. doctor
B. bookstore
C. market
D. train station
E. mortuary
Selected Choice: B

Question: Where are you likely to find a hamburger?
Options:
A. fast food restaurant
B. pizza
C. ground up dead cows
D. mouth
E. cow carcass
Selected Choice: A

Question: James was looking for a good place to buy farmland. Where might he look?

In [None]:
question = list(qa)[0]
choices = qa[question]

print(few_shot_prompt + "\nQuestion: " + question + "\nOptions:\n" + "\n".join([chr(ord('A') + i) + ". " + choices[i] for i in range(len(choices))]) + "\nSelected Choice:")

Task: Predict the correct answer option for the question provided, considering the available options.
Instructions: Review the question and choices provided, then select the option you believe is the correct answer.

Question: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?
Options:
A. bank
B. library
C. department store
D. mall
E. new york
Selected Choice: A

Question: What do people aim to do at work?
Options:
A. complete job
B. learn from each other
C. kill animals
D. wear hats
E. talk to each other
Selected Choice: A

Question: Where would you find magazines alongside many other printed works?
Options:
A. doctor
B. bookstore
C. market
D. train station
E. mortuary
Selected Choice: B

Question: Where are you likely to find a hamburger?
Options:
A. fast food restaurant
B. pizza
C. ground up dead cows
D. mouth
E. cow carcass
Selected Choice: A

Question: James was looking for a good place to buy farmland. Where might he look?

In [None]:
import numpy as np

# Score each answer letter under the few-shot prompt and pick the most likely one
# NOTE: This can take a moment

for question in list(qa):
  answers = qa[question]
  answer_log_probs = []
  # the context (few-shot examples, question and options) is shared by all choices
  context_prompt = few_shot_prompt + "\nQuestion: " + question + "\nOptions:\n" + "\n".join([chr(ord('A') + i) + ". " + answers[i] for i in range(len(answers))]) + "\nSelected Choice:"
  # tokenize the context once; its length marks where the answer tokens start
  context_len = tokenizer(context_prompt,
                          return_tensors="pt").input_ids.shape[-1]

  for choice in ['A','B','C','D','E']:
    # construct the full prompt, i.e. the context followed by the answer letter
    prompt = context_prompt + " " + choice
    input_ids = tokenizer(prompt,
                          return_tensors="pt").input_ids.to(device)
    # the label -100 marks tokens that are ignored in the loss computation,
    # so only the answer tokens contribute to the loss
    masked_labels = torch.full_like(input_ids, -100)
    masked_labels[:, context_len:] = input_ids[:, context_len:]
    # score the answer with a forward pass (nothing is generated here)
    with torch.no_grad():
      preds = model(
          input_ids,
          labels=masked_labels
      )
    # preds.loss is the mean negative log-likelihood of the answer tokens,
    # so its negation is their average log probability
    answer_log_probs.append(-preds.loss.item())
  print("All answers ", answers)
  print("Answer probabilities ", answer_log_probs)
  max_prob_idx = np.argmax(answer_log_probs)
  print("Selected answer ", answers[max_prob_idx], "with log P ", answer_log_probs[max_prob_idx])
  print("-" * 100)

All answers  ['garbage can', 'military', 'jewelry store', 'safe', 'airport']
Answer probabilities  [-1.149327039718628, -0.8839691281318665, -1.7674624919891357, -2.4614875316619873, -5.297186851501465]
Selected answer  military with log P  -0.8839691281318665
----------------------------------------------------------------------------------------------------
All answers  ['television', 'attic', 'corner', 'they cannot clean corner and library during football match they cannot need that', 'ground']
Answer probabilities  [-1.6051456928253174, -1.3155701160430908, -1.452256441116333, -1.5575783252716064, -2.7076313495635986]
Selected answer  attic with log P  -1.3155701160430908
----------------------------------------------------------------------------------------------------
All answers  ['walmart', 'white house', 'country', 'corporation', 'government']
Answer probabilities  [-1.0540814399719238, -0.9338192343711853, -1.921186923980713, -2.363598346710205, -4.736490726470947]
Selected 

In [None]:
for question in list(qa):
  answers = qa[question]
  prompt = few_shot_prompt + "\nQuestion: " + question + "\nOptions:\n" + "\n".join([chr(ord('A') + i) + ". " + answers[i] for i in range(len(answers))]) + "\nSelected Choice:"

  input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

  output = softmax_sampling_decoding(input_ids)
  decoded = tokenizer.decode(output[0], skip_special_tokens=True)
  decoded = decoded.replace(prompt, "").replace("\n","")
  print(question)
  print(decoded)

The only baggage the woman checked was a drawstring bag, where was she heading with it?
 B
To prevent any glare during the big football game he made sure to clean the dust of his what?
 B
The president is the leader of what institution?
 A
What kind of driving leads to accidents?
 B
Can you name a good reason for attending school?
 B
Stanley had a dream that was very vivid and scary. He had trouble telling it from what?
 D
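As a quick check, the sampled letters printed above can be compared with the ground truth listed in the `# GT:` comment next to the `qa` dictionary. The sketch below is an assumption-labeled illustration: `predicted_letters`, `gt` and `choices` are hypothetical names, and the values are copied by hand from the notebook.

```python
# Answer options per question, in the order of the qa dictionary
choices = [
    ["garbage can", "military", "jewelry store", "safe", "airport"],
    ["television", "attic", "corner",
     "they cannot clean corner and library during football match they cannot need that",
     "ground"],
    ["walmart", "white house", "country", "corporation", "government"],
    ["stressful", "dangerous", "fun", "illegal", "deadly"],
    ["get smart", "boredom", "colds and flu", "taking tests", "spend time"],
    ["imagination", "reality", "dreamworker", "nightmare", "awake"],
]

# Ground truth from the "# GT:" comment
gt = ["airport", "television", "country", "dangerous", "get smart", "reality"]

# Letters sampled by the model, copied from the output above
predicted_letters = ["B", "B", "A", "B", "B", "D"]

# Map each letter back to its answer text (A -> index 0, B -> index 1, ...)
predicted = [
    options[ord(letter) - ord("A")]
    for options, letter in zip(choices, predicted_letters)
]

correct = sum(p == g for p, g in zip(predicted, gt))
total = len(gt)
print(f"Accuracy: {correct}/{total}")
```

Only the question about driving is answered correctly in this run.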


# Exercise #3

**Q1**: How were words / tokens represented? What is the difference / similarity to modern LLMs?

**A**: Bengio et al. propose a word-level model* in which each word is mapped to a feature vector through an embedding matrix $C$ of size $|V| \times m$, where $|V|$ is the vocabulary size and $m$ is the embedding dimension. The mapping is a table lookup: row $i$ of $C$ is the feature vector of word $i$, which is equivalent to multiplying a one-hot encoding of the word by $C$. This approach to word embeddings is also common in modern large language models (LLMs). For instance, transformers map input tokens to embeddings using an embedding layer, whose weights can be learned from scratch or initialized from pretrained values. A key difference is that modern LLMs combine these embeddings with positional encodings to capture the order of tokens in the sequence, whereas Bengio et al. encode order implicitly by concatenating the context vectors in a fixed order.
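The equivalence between an embedding lookup and a one-hot matrix product can be checked on a toy example. The vocabulary and the values of `C` below are made up for illustration and are not taken from the paper.

```python
# Toy vocabulary and embedding matrix C of shape |V| x m (made-up values)
vocab = ["the", "cat", "sat"]
C = [
    [0.1, 0.2],  # feature vector of "the"
    [0.3, 0.4],  # feature vector of "cat"
    [0.5, 0.6],  # feature vector of "sat"
]


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def embed_by_matmul(index):
    """Multiply the one-hot row vector by C (a 1 x |V| times |V| x m product)."""
    v = one_hot(index, len(vocab))
    m = len(C[0])
    return [sum(v[i] * C[i][d] for i in range(len(vocab))) for d in range(m)]


def embed_by_lookup(index):
    """Read row `index` of C directly."""
    return C[index]


idx = vocab.index("cat")
print(embed_by_matmul(idx), embed_by_lookup(idx))
```

Both functions return the same vector; the lookup simply avoids the multiplications by zero.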

Since Bengio et al.'s model operates at the word level, it is more prone to encountering out-of-vocabulary (OOV) words in the test set than modern LLMs, which handle unseen words using strategies like subword or character-level tokenization and fine-tuning the embedding layer. However, they claim that their model outperforms $n$-gram models in handling OOV words, and they propose initializing the feature vector of an OOV word as a weighted convex combination of the feature vectors of other words that could occur in the same context. Another key difference is that Bengio et al. use static embeddings, while modern LLMs can make use of contextualised embeddings.

*: Rare words with frequency ≤ 3 were merged into a single symbol.
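
The equivalence between "one-hot vector times $C$" and a plain row lookup can be checked with a minimal sketch. The toy vocabulary, the values in `C`, and the helper names are made up for illustration. Sending unseen words to the `<rare>` symbol is also an assumption, one that mirrors the footnote above rather than the paper's stated procedure.

```python
def one_hot(index, size):
    """Return a one-hot row vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector of length |V| by a |V| x m matrix."""
    cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(cols)]


# Hypothetical toy vocabulary and embedding matrix C with |V| = 4 and m = 3.
vocab = {"the": 0, "cat": 1, "sat": 2, "<rare>": 3}
C = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def embed_by_matmul(word):
    """Embed a word by multiplying its one-hot vector with C."""
    idx = vocab.get(word, vocab["<rare>"])  # assumption: unseen words share one symbol
    return vec_mat(one_hot(idx, len(vocab)), C)


def embed_by_lookup(word):
    """Embed a word by reading the corresponding row of C directly."""
    idx = vocab.get(word, vocab["<rare>"])
    return C[idx]
```

Both functions return the same vector for any word, which is why embedding layers are usually implemented as a lookup rather than an explicit matrix product.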

---
**Q2**: How was the context represented? What is the difference / similarity to modern LLMs?

**A**: As in $n$-gram models, the context in Bengio et al.'s model is the $n-1$ preceding words. However, because the model represents these words with learned embeddings rather than discrete counts, it is reported to benefit from more context (on Brown, going from 2 words of context to 4 words brought improvements to the neural network, not to the $n$-grams). The idea of a "context window" is still present in modern LLMs, but the window is far longer. For instance, GPT-4 was released in two versions with context windows of 8,192 and 32,768 tokens*. Another difference is that the context in modern models is not always limited to preceding words: bidirectional transformer encoders such as BERT condition on both preceding and following tokens. Autoregressive models such as GPT-2, in contrast, still condition only on preceding tokens, just as Bengio et al.'s model does.

*: https://en.wikipedia.org/wiki/GPT-4
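
To make the notion of a fixed context window concrete, here is a minimal sketch that slices a token stream into (context, target) pairs. It also builds the two attention patterns commonly contrasted for transformers: a causal mask (left-to-right) and a full mask (bidirectional). The function names and the `context_size` parameter are illustrative and not taken from the paper.

```python
def context_windows(tokens, context_size):
    """Yield (context, target) pairs using `context_size` preceding tokens."""
    for t in range(context_size, len(tokens)):
        yield tuple(tokens[t - context_size:t]), tokens[t]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def bidirectional_mask(length):
    """Every position may attend to every other position."""
    return [[True] * length for _ in range(length)]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = list(context_windows(tokens, context_size=2))
# pairs[0] is (("the", "cat"), "sat"): two words of context predict the third.
```

With a fixed window, everything earlier than `context_size` tokens back is invisible to the model, which is the main limitation that longer context windows relax.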

---
**Q3**: What is the curse of dimensionality? Give a concrete example in the context of language modeling.

**A**: The _curse of dimensionality_ refers to the problems that arise when working with high-dimensional data. Initially, the term described the phenomenon where algorithms that perform efficiently in low-dimensional spaces become intractable in high dimensions. As discussed in the article [A Few Useful Things to Know About Machine Learning](https://homes.cs.washington.edu/%7Epedrod/papers/cacm12.pdf), intuition fails in high dimensions. Classical distance metrics, such as the Euclidean distance, become less informative as the number of dimensions grows. For instance, a nearest neighbor classifier can become ineffective due to noise from irrelevant features or because all examples appear similar, making the classification essentially random.

In the context of language modeling, as pointed out in the [paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), modeling the joint probability distribution of a sentence becomes increasingly complex as the sentence grows. If a sentence has $n$ words and the vocabulary size is $|V|$, there are $|V|^n$ possible sentences, leading to $|V|^n - 1$ free parameters (1 is subtracted because the probabilities must sum to 1, which makes the last probability dependent on the others). Modelling the joint distribution of that many discrete random variables directly is intractable, so another approach is needed, which is why the paper proposes distributed representations. The main idea is to represent each word by a low-dimensional embedding. Even in models with billions of parameters, each representation has only a few hundred to a few thousand dimensions (set as a hyperparameter). The embedding matrix has $|V| \times m$ entries, where $m$ is the embedding dimension, so the number of free parameters scales **linearly** with $|V|$. Similarly, linear projections are used in transformers to map inputs to a different, often lower, dimensionality. One could loosely say that a fully connected layer can project a high-dimensional input to a lower-dimensional space so that the computations remain tractable.
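
The gap between the two parameter counts is easy to see with concrete numbers. The sketch below uses illustrative values ($|V| = 100{,}000$, $n = 10$, $m = 100$) that are assumptions chosen for the arithmetic, not settings from the experiments.

```python
def joint_table_params(vocab_size, length):
    """Free parameters of a full joint table over `length` words: |V|^n - 1."""
    return vocab_size ** length - 1


def embedding_params(vocab_size, dim):
    """Entries of a |V| x m embedding matrix."""
    return vocab_size * dim


V, n, m = 100_000, 10, 100  # illustrative values, not from the experiments

joint = joint_table_params(V, n)   # 10**50 - 1
embed = embedding_params(V, m)     # 10_000_000

# Doubling the vocabulary doubles the embedding matrix,
# but multiplies the size of the joint table by 2**n.
embed_doubled = embedding_params(2 * V, m)
joint_doubled = joint_table_params(2 * V, n)
```

Here the joint table would need $10^{50} - 1$ free parameters, while the embedding matrix needs $10^7$ entries.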

---

**Q4**: Which training data was used? What is the difference / similarity to modern LLMs?

**A**: Two datasets are used for the experiments: the Brown corpus, a stream of 1,181,041 words from a large variety of English texts and books, and the Associated Press (AP) News corpus from 1995 and 1996, a stream of about 14 million (13,994,528) words. At a very high level, the data is similar to that used by modern LLMs: sequences of words used to train the model. However, modern LLMs use vastly larger datasets (GPT-3's training data consists of approximately 500 billion tokens*), and they are trained on diverse data sources from different domains to capture a wide range of language patterns. Bengio et al.'s model is trained on these small datasets for experimental purposes only.

*: https://en.wikipedia.org/wiki/GPT-3

---

**Q5**: Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?

**A**: First and foremost, the step that maps tokens to word embeddings is found in modern LMs as well. You can even use the `nn.Embedding` layer in `PyTorch` to train embeddings from scratch. Modern LMs are also neural language models: after Bengio et al. showed that neural models can outperform $n$-gram models, the field gradually shifted to neural networks. In contrast to Bengio et al.'s model, however, the embedding lookup is only the first step in modern LLMs, which stack many further components such as self-attention and feed-forward layers. It is almost impossible to find a modern LM with an architecture as simple as the one proposed by Bengio et al. Nevertheless, the idea of distributed representations is still valid and used in modern LLMs, as are the feed-forward layers and the softmax output over the vocabulary.
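
The components that carry over (embedding lookup, a feed-forward layer, a softmax over the vocabulary) can be seen in a minimal sketch of a Bengio-style forward pass. It uses only the standard library and computes $\mathrm{softmax}(b + U \tanh(d + Hx))$, where $x$ is the concatenation of the context embeddings. The class name, sizes, and initialisation are assumptions for illustration. The optional direct input-to-output connections described in the paper are omitted, and no training is shown.

```python
import math
import random


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix as a list of rows (illustrative initialisation)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(mat, vec):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyNPLM:
    """Toy feed-forward LM: embeddings -> tanh hidden layer -> softmax."""

    def __init__(self, vocab_size, emb_dim, context_size, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.C = init_matrix(vocab_size, emb_dim, rng)                  # embeddings
        self.H = init_matrix(hidden_dim, context_size * emb_dim, rng)   # hidden weights
        self.d = [0.0] * hidden_dim                                     # hidden biases
        self.U = init_matrix(vocab_size, hidden_dim, rng)               # output weights
        self.b = [0.0] * vocab_size                                     # output biases

    def next_word_probs(self, context_ids):
        # x: concatenation of the embeddings of the context words.
        x = [value for idx in context_ids for value in self.C[idx]]
        hidden = [math.tanh(a + bias) for a, bias in zip(matvec(self.H, x), self.d)]
        logits = [s + bias for s, bias in zip(matvec(self.U, hidden), self.b)]
        return softmax(logits)


model = TinyNPLM(vocab_size=5, emb_dim=3, context_size=2, hidden_dim=4)
probs = model.next_word_probs([1, 3])  # distribution over the 5 vocabulary entries
```

A transformer language model keeps the first and last steps and replaces the single hidden layer with a stack of self-attention and feed-forward blocks.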

---

**Q6**: For each section of the Bengio et al. (2003) paper, what are key differences between the way it is written, the included contents, to the BERT paper (Devlin et al., 2019)? What are key similarities? Write max. 2 sentences per section.

**A**: Sections of the Bengio et al. (2003) paper:
- Abstract
- Introduction
- Neural Model
- Parallel Implementation
- Experimental Results
- Extensions and Future Work
- Conclusion

Sections of the BERT paper (Devlin et al., 2019):
- Abstract
- Introduction
- Related Work
- BERT
- Experiments
- Ablation Studies
- Conclusion
- Appendix


1. Abstract: Both papers begin with an abstract that highlights the advantages of their models over previous approaches. However, Bengio et al. do not provide numerical details about their experimental results, while the BERT paper includes specific percentages showing improvements over previous state-of-the-art models.
2. Introduction: The introduction of the Bengio et al. paper covers both the motivation behind the model and the related work in the field. The BERT paper provides a high-level description of the model and outlines the main contributions of the paper, leaving the related work to a separate section.
3. Neural Model - BERT: In the main sections where the model details are explained, Bengio et al. provide low-level details such as the equations for the neural network, probability distribution calculations, and the gradient descent update rule. In contrast, the BERT paper, published about 15 years later, doesn't delve deeply into the essentials but instead focuses more on how to pre-train the model and fine-tune it for downstream tasks.
4. Parallel Implementation: One of the major differences between the two papers is the "Parallel Implementation" section. Bengio et al. discuss the parallelization of their model for training on multiple processors using the Message Passing Interface (MPI). In contrast, the BERT paper does not include such a section, thanks to the significant advancements in computational power and the widespread availability of GPUs for training large models (large-scale GPU training of deep models was popularised by work such as [this paper](http://robotics.stanford.edu/~ang/papers/icml09-LargeScaleUnsupervisedDeepLearningGPU.pdf) in 2009).
5. Experiments: Both papers follow a similar pattern in which they introduce the standard datasets used for evaluation, report the results, and show that their models outperform the previous state-of-the-art models. However, the BERT paper includes more detailed ablation studies to analyze the impact of different components of the model.
6. Extensions and Future Work: Bengio et al. provide a section on "Extensions and Future Work" where they discuss potential improvements and future research directions. Surprisingly, the BERT paper does not include any section on future work, possibly due to the rapid pace of advancements in the field of NLP and the need to publish results quickly.
7. Conclusion: Both papers conclude by summarizing the key findings and contributions of their work. Bengio et al. emphasize that "an important priority of future research should be to improve speed-up techniques"*. In contrast, the BERT paper does not mention such a priority, likely because computational resources were less of a bottleneck (access to Cloud TPUs).
8. Appendix: Lastly, the BERT paper includes an appendix that provides additional details about the model architecture and the hyperparameters used in the experiments. Bengio et al. do not include such an appendix, possibly due to the simplicity of their model compared to the complexity of BERT.


*: They ran 5 epochs over 3 weeks using 40 CPUs on the AP News corpus.