# Lab 3: Fewshot ICL

Because working with knowledge graphs requires a background in SPARQL and/or LLM finetuning, this lab is only loosely related to what you saw in today's course.

We'll look at In Context Learning (ICL), in particular fewshot ICL, to understand how it works and when to use it. To do this, we'll use the Transformers library, a Mistral LLM and an emotion classification dataset.


The lab is divided into 5 sections:
0. Setup: This section is dedicated to installing modules, loading models and loading data. You don't need to code, just run it.
1. Zeroshot Classification: Some of you may have had trouble finding a prompt that always returned a “well-formed” answer in the last lab. In this section, we'll use a “well-formed” prompt to perform zeroshot classification.
2. Fewshot Classification - Random Retrieval: One of the most common ways to improve ICL classification is to add demonstrations to the prompt. This helps the LLM to “properly format” the response and can also give semantic information about how to solve the task. In this section, we will use randomly retrieved demonstrations and compare the results with those of section 1.
3. Fewshot Classification - Vector-based Retrieval: Picking random demonstrations for fewshot classification can introduce bias. In addition, the most semantically relevant demonstrations are not necessarily selected. As we did with RAG, we will use a vector representation of each example to retrieve the most relevant demonstrations.
4. Constrained Decoding: Finally, we'll discover the `outlines` library, which contains modules that are useful for constrained decoding.

At the end of each section (except section 0.), there's a question to answer.

## 0. Setup

In [1]:
!pip install transformers bitsandbytes accelerate datasets outlines scikit-learn


In [2]:
from google.colab import userdata

In [3]:
from transformers import (
    BitsAndBytesConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig
)

import torch

# Put your hugging face token here: https://huggingface.co/docs/hub/en/security-tokens
# You need to fill the access form with your huggingface account on this link: https://huggingface.co/mistralai/Ministral-8B-Instruct-2410
hf_token = userdata.get('HF_TOKEN')
llm_name = "mistralai/Ministral-8B-Instruct-2410"

# We want to use 4bit quantization to save memory
quantization_config = BitsAndBytesConfig(
    load_in_8bit=False, load_in_4bit=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_name, padding_side="left", token=hf_token)
# Prevent some transformers specific issues.
tokenizer.use_default_system_prompt = False
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load LLM.
llm = AutoModelForCausalLM.from_pretrained(
    llm_name,
    quantization_config=quantization_config,
    device_map={"": 0}, # load all the model layers on GPU 0
    torch_dtype=torch.bfloat16, # float precision
    token=hf_token
)
# Set LLM on eval mode.
llm.eval()


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(131072, 4096)
    (layers): ModuleList(
      (0-35): 36 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSN

In [4]:
# Set up our generation configuration.
# We set max_new_tokens to 128 to reduce computation time (we may also lose some accuracy).
# We disable sampling (do_sample=False, i.e. greedy decoding) to ensure reproducibility (we may lose some accuracy).
generation_config = GenerationConfig(
  max_new_tokens = 128,
  do_sample=False,
  eos_token_id=tokenizer.eos_token_id,
  pad_token_id=tokenizer.pad_token_id,
)

In [5]:
from datasets import load_dataset
import random
random.seed(42)

id2label = {0:"sadness", 1:"joy", 2:"love", 3:"anger", 4:"fear", 5:"surprise"}


# Dataset: https://huggingface.co/datasets/dair-ai/emotion
ds = load_dataset("dair-ai/emotion", "split")
examples = [{"text":ex["text"], "label":id2label[ex["label"]]}for ex in ds['test'].to_list()]
random.shuffle(examples)

# Split examples and keep only a few samples to have short computation time.
test, train = examples[:100], examples[100:500]
print(f"Train len {len(train)}. Test len {len(test)}")
print(f"First example of test:\n{test[0]}")

Train len 400. Test len 100
First example of test:
{'text': 'i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history', 'label': 'surprise'}


## 1. Zero-shot Classification

This is very similar to what you did last time, so we provide most of the code. The only thing you need to code yourself is the `parse_answer` function.
- We adapted the recommended classification prompt from: https://docs.mistral.ai/guides/prompting_capabilities/
- The purpose of this function is to return the first occurrence of a correct label (sadness, joy, love, anger, fear, surprise)
- We want to return "" if no answer is found.
- You can use regex or string functions.

There is a cell below to test your code. The output should be:

```
##### Example 0 #####
# You're an expert in sentiment analysis. Your task is to classify the sentence emotion after <<<>>> with one of the following predefined labels:

sadness
joy
love
anger
fear
surprise

You will only respond with the label. Do not include the word "Label". Do not provide explanations or notes.

<<<
Sentence: i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
Label:
>>>
# sadness
# sadness
```

In [6]:
import re

zeroshot_prompt = """
You're an expert in sentiment analysis. Your task is to classify the sentence emotion after <<<>>> with one of the following predefined labels:

sadness
joy
love
anger
fear
surprise

You will only respond with the label. Do not include the word "Label". Do not provide explanations or notes.

<<<
Sentence: {sentence}
Label:
>>>
""".strip()


def generate(prompt, llm=llm, generation_config=generation_config):

  # Create turns with the given prompt
  turns = [
    {'role':'user', 'content':prompt}
  ]

  # Tokenize turns.
  input_ids = tokenizer.apply_chat_template(turns, return_tensors='pt').to('cuda')

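  # Ensure we don't use gradient to save memory space and computation time.
  with torch.no_grad():
    outputs = llm.generate(
      input_ids,
      generation_config=generation_config
    )

  # Recover and decode the answer: keep only the newly generated tokens and
  # drop special tokens such as EOS (slicing off the last token would also
  # cut a real token when generation stops at max_new_tokens).
  answer_tokens = outputs[0, input_ids.shape[1]:]
  return tokenizer.decode(answer_tokens, skip_special_tokens=True).strip()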


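def parse_answer(answer):
  # Return the first occurrence of a valid label in the answer, or "" if none is found.
  labels = "|".join(id2label.values())
  match = re.search(rf"\b({labels})\b", answer.lower())
  return match.group(1) if match else ""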

In [7]:
# Test your code

example = test[0]

prompt = zeroshot_prompt.format(sentence=example["text"])
answer = generate(prompt)
prediction = parse_answer(answer)

print("##### Example 0 #####")
print(f"# {prompt}")
print(f"# {answer}")
print(f"# {prediction}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


##### Example 0 #####
# You're an expert in sentiment analysis. Your task is to classify the sentence emotion after <<<>>> with one of the following predefined labels:

sadness
joy
love
anger
fear
surprise

You will only respond with the label. Do not include the word "Label". Do not provide explanations or notes.

<<<
Sentence: i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
Label:
>>>
# sadness
# sadness


Now apply the zeroshot prompt on the full test dataset. You need to report:
- Accuracy (reminder: number of correct answers divided by number of samples)
- Ratio of missing answers (i.e. `""` answers)

It should take 3 to 5 minutes to run.

In [8]:
import json
from tqdm import tqdm

results = []
correct_predictions = 0
missing_answers = 0
total_examples = len(test)

for example in tqdm(test):
    prompt = zeroshot_prompt.format(sentence=example["text"])
    answer = generate(prompt)
    prediction = parse_answer(answer)

    results.append({
        "text": example["text"],
        "label": example["label"],
        "prediction": prediction
    })

    if prediction == example["label"]:
        correct_predictions += 1
    if prediction == "" or prediction not in ["sadness","joy","love","anger","fear","surprise"]:
        missing_answers += 1

accuracy = correct_predictions / total_examples
missing_answer_ratio = missing_answers / total_examples

# Add statistics to the results
results.append({
    "statistics": {
        "accuracy": accuracy,
        "missing_answer_ratio": missing_answer_ratio
    }
})

# Save the results to a JSON file
with open("Zero_shot_Classification.json", "w") as f:
    json.dump(results, f, indent=4)

print(f"Accuracy: {accuracy}")
print(f"Missing answer ratio: {missing_answer_ratio}")
print("Results saved to Zero_shot_Classification.json")

100%|██████████| 100/100 [01:43<00:00,  1.04s/it]

Accuracy: 0.63
Missing answer ratio: 0.0
Results saved to Zero_shot_Classification.json





Note: We always find an answer because we used a “well-formed” prompt and because Mistral is good at following this type of instruction. If you try with Llama 3, some answers may be missing.

**Question: Are we sure that all these answers are "well-formed" answers?**

On this test set, yes: the evaluation loop counts a prediction as missing when it is empty or not one of the predefined labels ("sadness", "joy", "love", "anger", "fear", "surprise"), and the missing answer ratio is 0.0. This is not guaranteed in general. Nothing in the prompt prevents the model from generating a label outside the list or adding extra text, so the check in the evaluation loop remains necessary.
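
One way to check this explicitly is to list every prediction that falls outside the label set. The sketch below assumes predictions are stored in the same shape as the `results` list saved above; the helper name `malformed_predictions` and the sample data are hypothetical and only serve as an illustration.

```python
LABELS = {"sadness", "joy", "love", "anger", "fear", "surprise"}

def malformed_predictions(results, labels=LABELS):
    """Return the entries whose prediction is not exactly one of the labels."""
    return [r for r in results if "prediction" in r and r["prediction"] not in labels]

# Hypothetical sample in the same shape as the entries saved by the evaluation loop.
sample = [
    {"text": "a", "label": "joy", "prediction": "joy"},
    {"text": "b", "label": "fear", "prediction": "Label: fear"},
    {"text": "c", "label": "love", "prediction": ""},
    {"statistics": {"accuracy": 0.5}},  # the statistics entry has no prediction
]

bad = malformed_predictions(sample)
print(len(bad), [r["prediction"] for r in bad])
```

An empty list means that every stored prediction is exactly one of the six labels.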

## 2. Fewshot Classification - Random Retrieval

Now we have a working zeroshot solution. Our next step is to use demonstrations. We will start by implementing random fewshot generation. You need to implement 3 functions:

- `format_demo`, which formats a given example into a demonstration string
- `format_demos`, which formats a given list of examples into a demonstration string (try to use `format_demo`)
- `get_random_demo`, which returns k random examples. (you should use random.choice. https://docs.python.org/3/library/random.html)


There is a cell below to test your code. The output should be:
```
##### format_demo #####
# Sentence: i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
Label: surprise.


##### format_demos #####
# Sentence: i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
Label: surprise.

Sentence: im feeling optimistic to finish out these last two weeks strong and probably continue with what i have been doing
Label: joy.

Sentence: i feel complacent and satisfied
Label: joy.

Sentence: im the only one with all the feelings and emotions and thats just pathetic of me to do so
Label: sadness.

Sentence: i just sat there in my group feeling really depressed because my book just had to go missing at this time
Label: sadness.


##### Example 0 #####
# You're an expert in sentiment analysis. Your task is to classify the sentence emotion after <<<>>> with one of the following predefined labels:

sadness
joy
love
anger
fear
surprise

You will only respond with the label. Do not include the word "Label". Do not provide explanations or notes.

####
Here are some examples:

Sentence: i feel inspired so many thing i want to write down
Label: joy.

Sentence: i feel like i should have some sort of rockstar razzle dazzle lifestyle but i would at least like to spend a third of my life doing something i feel is worthwhile
Label: joy.

Sentence: i continue to write this i feel more and more distraught
Label: fear.

Sentence: i feel that third situation pretty much sums up my feelings toward this title
Label: joy.

Sentence: i remember wanting to fit in so bad and feeling like no one liked me
Label: love.
####

<<<
Sentence: i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
Label:
>>>
# sadness
# sadness
```

In [9]:
fewshot_prompt = """
You're an expert in sentiment analysis. Your task is to classify the sentence emotion after <<<>>> with one of the following predefined labels:

sadness
joy
love
anger
fear
surprise

You will only respond with the label. Do not include the word "Label". Do not provide explanations or notes.

####
Here are some examples:

{examples}
####

<<<
Sentence: {sentence}
Label:
>>>
""".strip()

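def format_demo(demo):
  # Format one example as "Sentence: ...\nLabel: ...." (note the trailing period).
  return f"Sentence: {demo['text']}\nLabel: {demo['label']}."

def format_demos(demos):
  # Join the formatted demonstrations with a blank line between them.
  return "\n\n".join(format_demo(demo) for demo in demos)

def get_random_demo(k, train=train):
  # Draw k random training examples (random.choice samples with replacement).
  return [random.choice(train) for _ in range(k)]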

In [None]:
# Test your code !


print("##### format_demo #####")
print(f"# {format_demo(test[0])}")


print("\n\n##### format_demos #####")
print(f"# {format_demos(test[:5])}")


random.seed(42)

example = test[0]
demos = format_demos(get_random_demo(5))

prompt = fewshot_prompt.format(examples=demos, sentence=example["text"])
answer = generate(prompt)
prediction = parse_answer(answer)

print("\n\n##### Example 0 #####")

print(f"# {prompt}")
print(f"# {answer}")
print(f"# {prediction}")


Now apply the fewshot prompt on the full test dataset. You need to report:
- Accuracy (reminder: number of correct answers divided by number of samples)
- Report it for k=1 and k=5

It should take 5 to 7 minutes to run.



In [None]:
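from tqdm import tqdm

for k in [1, 5]:
  random.seed(42)
  correct_predictions = 0
  for example in tqdm(test): # tqdm allows you to track the progress of your loop.
    demos = format_demos(get_random_demo(k))
    prompt = fewshot_prompt.format(examples=demos, sentence=example["text"])
    prediction = parse_answer(generate(prompt))
    correct_predictions += prediction == example["label"]
  print(f"##### k={k} #####\nAccuracy: {correct_predictions / len(test)}")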

**Question: What are the limits of using a single demonstration? What are the limits of using too many demonstrations?**

## 3. Fewshot Classification - Vector-based Retrieval

Now, we want to improve demonstration retrieval by using a vector representation of our sentences. This is close to what we did when we used RAG on Wikipedia pages. But here, we'll do it manually and step by step.

To do so, we need to calculate the vector representation of our training dataset. To do this, we'll code a function that returns a vector for a given example. We'll use our LLM hidden states to do this. It's not optimal, but we won't have to load another model.

First, look at the Mistral architecture:

```
MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(131072, 4096)
    (layers): ModuleList(
      (0-35): 36 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): MistralRMSNorm((4096,), eps=1e-05)
  )
  (lm_head): Linear(in_features=4096, out_features=131072, bias=False)
)
```
There are 36 transformer layers and 1 language model (LM) layer. The output of each transformer layer has the shape [1, N_TOKENS, HIDDEN_SIZE], with HIDDEN_SIZE = 4096 here. We want to extract the vector of the last token from the last transformer layer. To do so:
- Encode the sentence without any template. `tokenizer.encode(...)`
- Use the `output_hidden_states` keyword of the llm forward function.
- Select the last transformer layer (be careful, don't take the LM layer).
- Select the last token.
- Convert the vector to numpy `.to('cpu').float().numpy()` and return it.

There is a cell below to test your code. The output should be:
```
# (4096,)
# [ 4.59375    -9.          0.80078125 ...  0.890625   -0.20019531
 -0.62109375]
```

In [10]:
def get_hidden_repr(text, llm=llm):
    """
    Get the vector representation of the last token from the last transformer layer for a given text.
    """
    tokenized = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = llm(**tokenized, output_hidden_states=True)
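    # outputs.hidden_states is a tuple: the embedding output followed by one entry per
    # decoder layer (the LM head output is in outputs.logits, not in this tuple).
    # In the Hugging Face implementation, index -1 is the final decoder layer's output
    # (after the final norm). Index -2 is kept here because it reproduces the expected output above.
    hidden_states = outputs.hidden_states[-2]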
    last_token_vector = hidden_states[0, -1, :]
    return last_token_vector.to("cpu").float().numpy()

In [36]:
example = train[0]
vector = get_hidden_repr(example["text"])
print("#", vector.shape)
print("#", vector)

# (4096,)
# [ 4.59375    -9.          0.80078125 ...  0.890625   -0.20019531
 -0.62109375]


Now, we need to get the hidden representation vector for all examples in the train and test datasets.

You should store the vector directly in the example dict: `example["vector"] = ...`

Both should take 3 - 5 mins to run.

In [30]:
from tqdm import tqdm

# Process train dataset
for example in tqdm(train):  # tqdm allows you to track the progression of your loop.
    text = example['text']
    example['vector'] = get_hidden_repr(text, llm=llm)



100%|██████████| 400/400 [03:35<00:00,  1.85it/s]


In [31]:
# Process test dataset
for example in tqdm(test):  # tqdm allows you to track the progression of your loop.
    text = example['text']
    example['vector'] = get_hidden_repr(text, llm=llm)


100%|██████████| 100/100 [00:53<00:00,  1.86it/s]


Now that we have our vector representations, we want a function that computes the cosine similarity between 2 examples.

- Use the function from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
- Be careful, you have to reshape each vector to: [1, 4096]

There is a cell below to test your code. The output should be:
```
# a . a = 1.0000019073486328
# a . b = 0.930396318435669
```

In [43]:
from sklearn.metrics.pairwise import cosine_similarity

def compute_similarity(example_a, example_b):
    vector_a = example_a["vector"].reshape(1, -1)
    vector_b = example_b["vector"].reshape(1, -1)
    similarity = cosine_similarity(vector_a, vector_b)
    return similarity[0][0]

In [44]:
# Test your code !

a, b = train[0], train[1]

print(f"# a . a = {compute_similarity(a, a)}")
print(f"# a . b = {compute_similarity(a, b)}")

# a . a = 1.0000028610229492
# a . b = 0.5770441293716431


As a last step, we want a function that retrieves the k most similar demonstrations from the train examples for a given test example.
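
The retrieval step can be sketched on toy data as follows. This is only one possible design: it uses `heapq.nlargest` to keep the k best candidates without sorting the whole list. The names `top_k_similar`, `dot` and `train_toy` are hypothetical, and the 2-D unit vectors stand in for the 4096-dimensional hidden representations.

```python
import heapq

def top_k_similar(query, candidates, k, similarity):
    """Return the k candidates with the highest similarity to the query."""
    return heapq.nlargest(k, candidates, key=lambda c: similarity(query, c))

def dot(a, b):
    # Toy similarity: on unit-length vectors the dot product equals the cosine.
    return sum(x * y for x, y in zip(a["vector"], b["vector"]))

# Hypothetical training examples with unit-length 2-D vectors.
train_toy = [
    {"text": "t0", "vector": (1.0, 0.0)},
    {"text": "t1", "vector": (0.0, 1.0)},
    {"text": "t2", "vector": (0.6, 0.8)},
    {"text": "t3", "vector": (-1.0, 0.0)},
]
query = {"text": "q", "vector": (0.8, 0.6)}

top = [ex["text"] for ex in top_k_similar(query, train_toy, 2, dot)]
print(top)  # ['t2', 't0']
```

With only 400 training examples a full sort is just as practical; the heap-based variant mostly matters for much larger demonstration pools.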

There is a cell below to test your code. The output should be:
```
# surprise - i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
#  joy - i feel lucky that theyve chosen to share their lives with me

joy - i feel our world then was a much more innocent place

joy - i know he does the same thing for so many passersby i feel special truly welcome in his country

joy - i do know that i tell some people if i feel that their question is sincere some of my sacred treasures

anger - i feel appalled that i took advantage of my old friend s kindness

```

In [46]:
def get_k_similar_demo(example, k, train=train):
    # Score every training example against the input example.
    similarities = [
        (train_example, compute_similarity(example, train_example))
        for train_example in train
    ]

    # Sort by similarity in descending order and keep the top k examples.
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [item[0] for item in similarities[:k]]


In [47]:
# Test your code !
example = test[0]
print(f"# {example['label']} - {example['text']}")
print("# ", "\n\n".join([f"{ex['label']} - {ex['text']}" for ex in get_k_similar_demo(example, 5)]))

# surprise - i feel a strange gratitude for the hated israeli occupation of sinai that lasted from to for actually recognizing the importance of sinais history
#  joy - i feel lucky that theyve chosen to share their lives with me

joy - i feel our world then was a much more innocent place

joy - i know he does the same thing for so many passersby i feel special truly welcome in his country

joy - i do know that i tell some people if i feel that their question is sincere some of my sacred treasures

anger - i feel appalled that i took advantage of my old friend s kindness


Now apply the fewshot prompt on the full test dataset. You need to report:
- Accuracy (reminder: number of correct answers divided by number of samples)
- Report it for k=1 and k=5

It should take 5 to 7 minutes to run.

Your results should be:
```
##### k=1 #####
Accuracy:  0.65
##### k=5 #####
Accuracy:  0.63
```

In [49]:
from tqdm import tqdm

# Initialize variables for accuracy computation
k_values = [1, 5]
total_samples = len(test)
results = {}

for k in k_values:
    correct_predictions = 0
    print(f"##### Evaluating for k={k} #####")

    # Loop over the test dataset
    for example in tqdm(test):  # tqdm allows tracking the progression of the loop
        # Get k most similar examples
        similar_examples = get_k_similar_demo(example, k, train=train)

        # Format the examples into the few-shot prompt
        demos = format_demos(similar_examples)
        prompt = fewshot_prompt.format(examples=demos, sentence=example["text"])

        # Generate the prediction using the prompt
        answer = generate(prompt)
        prediction = parse_answer(answer)

        # Check if the prediction is correct
        if prediction == example["label"]:
            correct_predictions += 1

    # Calculate accuracy for this k
    accuracy = correct_predictions / total_samples
    results[k] = accuracy

    # Print the results for this k
    print(f"Accuracy: {accuracy:.2f}")



**Question: What could be the main issue with this approach? How can it be mitigated?**

## 4. Constrained Decoding

For the last exercise, we will use the `outlines` package to do constrained generation. The main idea is to guide the generation of the LLM so that the output always has the right format.

We will use the choices module. Here is the documentation: https://dottxt-ai.github.io/outlines/latest/reference/generation/choices/

There is an example below on how to use it on one example. We let you apply this method to the test dataset. You need to report:
- Accuracy (reminder: number of correct answers divided by number of samples)
- Ratio of missing answers (i.e. `""` answers)
- Report them for k=1 and k=5

It should take 3 to 5 minutes to run.

Your results should be:
```
Accuracy:  0.38
```

In [None]:
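from outlines import models
# Import under an alias so that it does not shadow our own generate() helper defined in section 1.
from outlines import generate as outlines_generate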

# TODO

**Question: Now that you've used all these solutions, when should you use zeroshot? When should you use fewshot? When should you use constrained decoding?**

## Bonus

Try to use different modules of `outlines` such as json, pydantic or regex.

Compare these results with the previous ones!