In [1]:
import random
import time
from random import sample
from typing import Dict, List, Optional, Tuple

import kscope
import pandas as pd
from metrics import report_metrics
from tqdm import tqdm
from transformers import AutoTokenizer, LlamaTokenizerFast
from utils import get_label_token_ids, get_label_with_highest_likelihood, split_prompts_into_batches

# Getting Started

There is a bit of documentation on how to interact with the large models [here](https://kaleidoscope-sdk.readthedocs.io/en/latest/). The relevant github links to the SDK are [here](https://github.com/VectorInstitute/kaleidoscope-sdk) and underlying code [here](https://github.com/VectorInstitute/kaleidoscope).

First, we connect to the service through which we'll interact with the LLMs and see which models are available to us.

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)

Show all supported models

In [3]:
client.models

['gpt2',
 'llama2-7b',
 'llama2-7b_chat',
 'llama2-13b',
 'llama2-13b_chat',
 'llama2-70b',
 'llama2-70b_chat',
 'falcon-7b',
 'falcon-40b',
 'sdxl-turbo']

Show all model instances that are currently active

In [4]:
client.model_instances

[{'id': '09987df5-eb97-41e0-8b58-d0c16b6555ac',
  'name': 'llama2-7b',
  'state': 'ACTIVE'}]

To start, we obtain a handle to a model. In this example, let's use the LLaMA-2 model.

**NOTE**: This notebook uses activation retrieval to extract responses from the model: 
* This functionality is available for LLaMA-2 models (non-chat). 
* It is **NOT**, however, currently available for Falcon models of any size.

In [5]:
model = client.load_model("llama2-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)

print("The model is active!")

The model is active!
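
The polling loop above waits indefinitely if the model never becomes active. A minimal sketch of the same idea with a timeout is shown below. The helper `wait_for_state` and the `"LAUNCHING"` state name are hypothetical and are not part of the kscope SDK; only the `"ACTIVE"` state comes from this notebook.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str = "ACTIVE",
    timeout_s: float = 600.0,
    poll_s: float = 1.0,
) -> bool:
    # Poll until the target state is reached or the deadline passes.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_state() == target:
            return True
        time.sleep(poll_s)
    return False


# Hypothetical stand-in for `model.state`: reports "ACTIVE" on the third poll.
states = iter(["LAUNCHING", "LAUNCHING", "ACTIVE"])
print(wait_for_state(lambda: next(states), timeout_s=5.0, poll_s=0.01))
```

With the real client, this would presumably be called as `wait_for_state(lambda: model.state)`, with a `False` return value signalling that the launch should be investigated.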


We're going to have the model attempt to answer questions based on some context. The answer to each question is either yes or no based on the question and context. We're going to focus on zero-shot prompting for this task, as LLaMA-2 is fairly good at it and the examples are quite long, making few-shot prompting difficult. We'll consider a few-shot example at the end.

In [6]:
def boolq_preprocessor(path: str) -> Tuple[List[str], List[str], List[str], List[int]]:
    boolq_df = pd.read_csv(path)
    titles = boolq_df["Title"].tolist()
    passages = boolq_df["Passage"].tolist()
    questions = boolq_df["Question"].tolist()
    labels = boolq_df["Answer"].apply(lambda x: 1 if x else 0).tolist()
    return titles, passages, questions, labels

In [7]:
# Read in a sampling of the BoolQ test dataset and a small sample of training examples from the training dataset for
# few-shot prompting
bool_q_test_titles, bool_q_test_passages, bool_q_test_questions, bool_q_test_labels = boolq_preprocessor(
    "resources/boolq_task_datasets/test_sample_dataset.csv"
)
bool_q_train_titles, bool_q_train_passages, bool_q_train_questions, bool_q_train_labels = boolq_preprocessor(
    "resources/boolq_task_datasets/example_dataset.csv"
)

In this notebook, we'll be working on the BoolQ task. In this task, the model is given a passage that contains the answer to a question. The model should be able to answer the question based on the passage provided. The answer to the questions is either yes or no. An example is below.

**Title**: Rabies transmission

**Passage**: Transmission between humans is extremely rare, although it can happen through organ transplants, or through bites.

**Question**: can a person transmit rabies to another person?

In creating prompts, demonstrations are used as few-shot examples. If the `demonstrations` argument of the `create_prompts` function is an empty list, then the prompt is zero-shot (that is, it includes no demonstrations). We follow the prompt structure used by the original [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf) for the BoolQ task. That is:

{title} -- {passage}

question: {question}

answer: {answer}
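
As a small illustration of this template, the sketch below renders the rabies example from above as a zero-shot prompt. The helper name `format_zero_shot_prompt` is hypothetical and exists only for this example. The answer slot is left empty, since that is what the model is asked to fill in.

```python
def format_zero_shot_prompt(title: str, passage: str, question: str) -> str:
    # Mirrors the template above; the answer slot is left empty for the model to fill.
    return f"{title} -- {passage}\nquestion: {question}\nanswer:"


prompt = format_zero_shot_prompt(
    "Rabies transmission",
    "Transmission between humans is extremely rare, although it can happen through organ transplants, "
    "or through bites.",
    "can a person transmit rabies to another person?",
)
print(prompt)
```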

In [8]:
def create_demonstrations(
    demo_titles: List[str],
    demo_passages: List[str],
    demo_questions: List[str],
    demo_labels: List[int],
    label_map: Dict[int, str],
    n_demos: Optional[int] = None,
) -> List[str]:
    # n_demos controls how many demonstration examples are sampled, which yields n_demos-shot prompts. If
    # n_demos is None, all available demonstrations are used.
    demonstrations = []
    for demo_title, demo_passage, demo_question, demo_label in zip(
        demo_titles, demo_passages, demo_questions, demo_labels
    ):
        label_str = label_map[demo_label]
        demonstration = f"{demo_title} -- {demo_passage}\nquestion: {demo_question}?\nanswer: {label_str}\n\n"
        demonstrations.append(demonstration)
    return sample(demonstrations, n_demos) if n_demos else demonstrations

The examples in the BoolQ task are fairly long, and our implementation of LLaMA-2, which we are using here, caps the input length at 512 tokens, so it's difficult to pack in a lot of examples for this task. The `create_prompts` function below puts as many of the demonstrations into each prompt as possible without exceeding that length.

In [9]:
def create_prompts(
    demonstrations: List[str],
    test_titles: List[str],
    test_passages: List[str],
    test_questions: List[str],
    tokenizer: Optional[LlamaTokenizerFast] = None,
) -> List[str]:
    prompts = []
    for test_title, test_passage, test_question in zip(test_titles, test_passages, test_questions):
        prompt = f"{test_title} -- {test_passage}\nquestion: {test_question}?\nanswer:"
        n_shot = 0
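        # Add all demonstrations, or as many as fit without exceeding the LLaMA-2 max prompt length. Note that
        # demonstrations are only added when a tokenizer is provided to measure the prompt length.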
        for demonstration in demonstrations:
            candidate_prompt = f"{demonstration}{prompt}"
            if tokenizer is not None:
                if len(tokenizer.encode(candidate_prompt)) < 512:
                    n_shot += 1
                    prompt = candidate_prompt
                else:
                    print(f"Prompt is too long: {len(tokenizer.encode(candidate_prompt))}. Using {n_shot}-shot prompt")
                    break
        prompts.append(prompt)
    return prompts
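
The packing logic can be exercised without the real tokenizer. The sketch below isolates the token-budget check and uses a whitespace word count in place of the LLaMA-2 tokenizer. The names `pack_demonstrations` and `whitespace_token_count` are hypothetical, and the budget of 8 is chosen only to keep the example small.

```python
from typing import Callable, List


def pack_demonstrations(
    demonstrations: List[str],
    prompt: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> str:
    # Prepend demonstrations one at a time, stopping at the first that would break the budget.
    for demonstration in demonstrations:
        candidate = f"{demonstration}{prompt}"
        if count_tokens(candidate) >= max_tokens:
            break
        prompt = candidate
    return prompt


# Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word.
def whitespace_token_count(text: str) -> int:
    return len(text.split())


demos = ["a b c\n\n", "d e f g\n\n"]
packed = pack_demonstrations(demos, "q r s", whitespace_token_count, max_tokens=8)
print(repr(packed))
```

Note that, as in `create_prompts`, each accepted demonstration is placed in front of the ones already added, so the last accepted demonstration appears first in the final prompt.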

The last-layer activations of the model are closely related to the probabilities of each token in the model vocabulary. That is, they correspond to the conditional probability
$$
P(y_t \vert y_{<t}, x),
$$
the probability distribution over the vocabulary of the next token $y_t$ given the preceding tokens $y_{<t}$ and the prompt text $x$. Thus, for each token $y_{t}$ in our input, we get back a vector of dimension $32000$ (the vocabulary size of LLaMA-2), which encodes the probability distribution of $y_{t+1}$ over the vocabulary. For this example, we only care about the last token in our input, as it houses the probability distribution of the as-yet-unseen token the model will generate.

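**NOTE**: The last layer for LLaMA-2, named "output," actually holds the logits (pre-softmax values), not probabilities. Logits are not proportional to probabilities, but softmax is monotonic, so a token with a larger logit always has a larger probability and the ranking of tokens is preserved.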
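
To make the relationship between logits and probabilities concrete, here is a minimal sketch of softmax on a made-up three-token vocabulary. The logit values are illustrative only. It checks that ranking tokens by logit and ranking them by probability give the same order.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max logit for numerical stability; this does not change the result.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 0.5, -1.0]  # made-up values for a three-token vocabulary
probabilities = softmax(logits)
# The ranking of tokens is the same under logits and probabilities.
ranking_by_logit = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
ranking_by_prob = sorted(range(len(logits)), key=lambda i: probabilities[i], reverse=True)
print(probabilities, ranking_by_logit == ranking_by_prob)
```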


In [10]:
# Below, we're interested in the activations from the last layer of the model, because this will allow us to calculate
# the likelihoods.
last_layer_name = model.module_names[-1]
last_layer_name

'output'

First, we're going to consider zero-shot prompting with two different generation configurations. For a discussion of possible configuration parameters see the [Configuration README](src/reference_implementations/prompting_vector_llms/CONFIG_README.md).

For the first experiment, we'll consider sampling in our response generation, letting `temperature = 0.8`. In the second experiment, we'll narrow the generation settings to greedy decoding (`temperature = 0.0`), where the model always selects the most probable token, at least from its perspective. In both cases, we'll just generate a single token, as we're looking for a yes/no answer.

In [11]:
stochastic_generation_config = {"max_tokens": 1, "temperature": 0.8}
greedy_generation_config = {"max_tokens": 1, "temperature": 0.0}
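
The difference between the two configurations can be sketched with a toy sampler. This is an illustration of temperature scaling in general, not the service's implementation. In particular, treating `temperature = 0.0` as an argmax is an assumption made here, and the logits are made up.

```python
import math
import random
from typing import List


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    # Temperature 0 is treated as greedy decoding: return the index of the largest logit.
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise divide the logits by the temperature, exponentiate, and sample.
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)
    weights = [math.exp(value - max_scaled) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.5, 0.1]  # made-up scores for three candidate tokens
greedy_draws = {sample_token(logits, 0.0, rng) for _ in range(100)}
stochastic_draws = {sample_token(logits, 0.8, rng) for _ in range(100)}
print(greedy_draws, stochastic_draws)
```

Greedy decoding always returns the same token, while sampling at `temperature = 0.8` spreads its draws across less likely tokens as well. That spread is what allows off-label answers to appear more often under stochastic generation.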

### Zero-Shot Prompting (Stochastic Generation)

In this section, we won't include any demonstrations in our prompts and we'll sample from the answer distribution.

In [12]:
label_map = {0: "No", 1: "Yes"}
label_ordering = ["No", "Yes"]
lowercase_labels = [label.lower() for label in label_ordering]
prompts = create_prompts([], bool_q_test_titles, bool_q_test_passages, bool_q_test_questions)

In [13]:
# Let's check one of the prompts.
print(prompts[0])

Shear wall -- In structural engineering, a shear wall is a structural system composed of braced panels (also known as shear panels) to counter the effects of lateral load acting on a structure. Wind and seismic loads are the most common building codes, including the International Building Code (where it is called a braced wall line) and Uniform Building Code, all exterior wall lines in wood or steel frame construction must be braced. Depending on the size of the building some interior walls must be braced as well.
question: is a shear wall a load bearing wall?
answer:


In [14]:
predicted_labels = []
unmatched_predictions = []
# For memory management, we split the prompts into batches of size 10
prompt_batches = split_prompts_into_batches(prompts, 10)
for prompt_batch in tqdm(prompt_batches):
    responses = model.generate(prompt_batch, stochastic_generation_config)
    processed_responses = [generation.strip().lower() for generation in responses.generation["sequences"]]
    # If a token doesn't correspond to one of our labels, we'll randomly select one
    for potential_prediction in processed_responses:
        if potential_prediction in lowercase_labels:
            predicted_labels.append(potential_prediction)
        else:
            unmatched_predictions.append(potential_prediction)
            predicted_labels.append(random.choice(lowercase_labels))

100%|██████████| 10/10 [03:28<00:00, 20.82s/it]


In [15]:
print(f"{len(unmatched_predictions)} Responses did not match the label space {lowercase_labels}")
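# Cap the sample size so that sample() does not raise a ValueError when there are fewer than 10 unmatched responses
print(f"Some examples of unmatched responses: {sample(unmatched_predictions, min(10, len(unmatched_predictions)))}")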
# Map the labels from integers to strings for comparison to the string predicted labels in the confusion matrix
bool_q_text_labels_string = [label_map[label].lower() for label in bool_q_test_labels]
report_metrics(predicted_labels, bool_q_text_labels_string, labels_order=lowercase_labels)

42 Responses did not match the label space ['no', 'yes']
Some examples of unmatched responses: ['the', 'lead', 'the', 'fred', 'an', 'there', '', 'the', 'the', 'ty']
Prediction Accuracy: 0.57
Confusion Matrix with ordering ['no', 'yes']
[[19 20]
 [23 38]]
Label: no, F1: 0.4691358024691358, Precision: 0.4523809523809524, Recall: 0.48717948717948717
Label: yes, F1: 0.6386554621848739, Precision: 0.6551724137931034, Recall: 0.6229508196721312


With stochastic sampling, it's pretty clear that the model doesn't always answer in our label space. This makes it hard to map the response to a prediction that we can score, and we end up falling back to a random guess much of the time (42 of 100 responses here), which leads to poor accuracy on the task. Let's see if we can do better with greedy decoding (setting `temperature = 0.0`).
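
The per-label scores in the report above can be recomputed by hand from the confusion matrix. The sketch below assumes that rows index true labels and columns index predicted labels. That orientation is an assumption, although it reproduces the reported precision and recall. The helper `per_label_scores` is hypothetical and is not the `report_metrics` implementation.

```python
from typing import List, Tuple


def per_label_scores(matrix: List[List[int]], index: int) -> Tuple[float, float, float]:
    # Assumes matrix[i][j] counts examples with true label i and predicted label j.
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)
    actual_total = sum(matrix[index])
    precision = true_positive / predicted_total
    recall = true_positive / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


confusion = [[19, 20], [23, 38]]  # stochastic zero-shot run, ordering ['no', 'yes']
accuracy = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))
print(f"accuracy={accuracy:.2f}")
for index, label in enumerate(["no", "yes"]):
    precision, recall, f1 = per_label_scores(confusion, index)
    print(f"{label}: precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```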

### Zero-shot Prompting (Greedy Generation)

In [16]:
# Set the random seed so the random fallback guesses (and later demonstration sampling) are reproducible
random.seed(2024)
predicted_labels = []
unmatched_predictions = []
# For memory management, we split the prompts into batches of size 10
prompt_batches = split_prompts_into_batches(prompts, 10)
for prompt_batch in tqdm(prompt_batches):
    responses = model.generate(prompt_batch, greedy_generation_config)
    processed_responses = [generation.strip().lower() for generation in responses.generation["sequences"]]
    # If a token doesn't correspond to one of our labels, we'll randomly select one
    for potential_prediction in processed_responses:
        if potential_prediction in lowercase_labels:
            predicted_labels.append(potential_prediction)
        else:
            unmatched_predictions.append(potential_prediction)
            predicted_labels.append(random.choice(lowercase_labels))

100%|██████████| 10/10 [03:22<00:00, 20.25s/it]


In [17]:
print(f"{len(unmatched_predictions)} Responses did not match the label space {lowercase_labels}")
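# Cap the sample size so that sample() does not raise a ValueError when there are fewer than 5 unmatched responses
print(f"Some examples of unmatched responses: {sample(unmatched_predictions, min(5, len(unmatched_predictions)))}")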
# Map the labels from integers to strings for comparison to the string predicted labels in the confusion matrix
bool_q_text_labels_string = [label_map[label].lower() for label in bool_q_test_labels]
report_metrics(predicted_labels, bool_q_text_labels_string, labels_order=lowercase_labels)

11 Responses did not match the label space ['no', 'yes']
Some examples of unmatched responses: ['the', 'the', 'anne', 'a', 'oliv']
Prediction Accuracy: 0.8
Confusion Matrix with ordering ['no', 'yes']
[[27 12]
 [ 8 53]]
Label: no, F1: 0.7297297297297296, Precision: 0.7714285714285715, Recall: 0.6923076923076923
Label: yes, F1: 0.8412698412698412, Precision: 0.8153846153846154, Recall: 0.8688524590163934


Despite not properly matching 11 responses to our label space, we get a really big boost in task accuracy for this problem just by using greedy decoding. However, can we do better if we don't miss matching those 11 responses? Let's use the activations!

### Zero-shot Prompting (Activation Retrieval/Likelihood Estimation)

For activation retrieval, we need to instantiate a tokenizer to obtain appropriate token indices for our labels. 

__NOTE__: All LLaMA-2 models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

If you are on the cluster, the tokenizer may be loaded from `/model-weights/Llama-2-7b-hf`. Otherwise, you'll need to download the `config.json`, `tokenizer.json`, `tokenizer.model`, and `tokenizer_config.json` from there to your local machine.

In [18]:
tokenizer = AutoTokenizer.from_pretrained("/model-weights/Llama-2-7b-hf")
# Let's test out how the tokenizer works on an example sentence. Note that the token with ID = 1 is the
# Beginning of sentence token ("BOS")
encoded_tokens = tokenizer.encode("Hello this is a test")
print(f"Encoded Tokens: {encoded_tokens}")
# If you ever need to move back from token ids, you can use tokenizer.decode or tokenizer.batch_decode
decoded_tokens = tokenizer.decode(encoded_tokens)
print(f"Decoded Tokens: {decoded_tokens}")

Encoded Tokens: [1, 15043, 445, 338, 263, 1243]
Decoded Tokens: <s> Hello this is a test


In [19]:
# extract the tokenizer ids associated with our labels
label_token_ids = get_label_token_ids(tokenizer, prompts[0], ["No", "no", "Yes", "yes"])
print(label_token_ids)
# If you ever need to move back from token ids, you can use tokenizer.decode or tokenizer.batch_decode
print(tokenizer.decode([1939, 694, 3869, 4874]))

tensor([1939,  694, 3869, 4874])
No no Yes yes


We need the token ids of our labels to extract the probabilities from the vocabulary of the model. Note that we're actually assigning two tokens to each label (for example, "No" and "no"), as the model might think one is more likely than the other in its probability distribution, but they are essentially the same answer.

The token id corresponds to the index of the token in the vocabulary matrix of the underlying model. For a discussion and demonstration of how this extraction is done, see the [AG's News Notebook](src/reference_implementations/prompting_vector_llms/llm_prompting_examples/llm_prompt_ag_news.ipynb) and the comments in the `get_label_with_highest_likelihood` function.
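
The idea behind the extraction can be sketched in a few lines. Look up the final-position logits at the label token ids, take the largest, and map its position back to a label. This is a hypothetical illustration with made-up logits, not the actual `get_label_with_highest_likelihood` implementation, which may differ (for example, in how it combines the two variants of each label).

```python
from typing import Dict, List


def pick_label(
    last_token_logits: Dict[int, float],
    label_token_ids: List[int],
    index_to_label: Dict[int, str],
) -> str:
    # Compare only the logits at the label token ids and return the label of the largest.
    label_logits = [last_token_logits[token_id] for token_id in label_token_ids]
    best_index = max(range(len(label_logits)), key=lambda i: label_logits[i])
    return index_to_label[best_index]


# Made-up logits for the final prompt position, keyed by vocabulary index.
# Index 278 stands in for some non-label token that happens to score highest overall.
toy_logits = {1939: 1.2, 694: 0.3, 3869: 2.7, 4874: 0.9, 278: 3.5}
label_ids = [1939, 694, 3869, 4874]  # "No", "no", "Yes", "yes"
index_to_label = {0: "No", 1: "No", 2: "Yes", 3: "Yes"}
print(pick_label(toy_logits, label_ids, index_to_label))
```

Because only the label tokens are compared, a non-label token with a higher score is ignored, so every prompt yields a prediction inside the label space.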

In [20]:
activation_map_to_label = {0: "No", 1: "No", 2: "Yes", 3: "Yes"}
predicted_labels = []
# For memory management, we split the prompts into batches of size 1, as activations are fairly heavy
prompt_batches = split_prompts_into_batches(prompts, 1)
for prompt_batch in tqdm(prompt_batches):
    # Note the configuration temperature won't matter for our setting, as we're not using sampling when
    # extracting activations
    activations = model.get_activations(prompt_batch, [last_layer_name], greedy_generation_config)
    for activations_single_prompt in activations.activations:
        # For each prompt we extract the activations and calculate which label had the highest likelihood.
        last_layer_matrix = activations_single_prompt[last_layer_name]
        predicted_label = get_label_with_highest_likelihood(
            last_layer_matrix, label_token_ids, activation_map_to_label
        )
        predicted_labels.append(predicted_label)

100%|██████████| 100/100 [04:58<00:00,  2.98s/it]


In [21]:
bool_q_text_labels_string = [label_map[label] for label in bool_q_test_labels]
report_metrics(predicted_labels, bool_q_text_labels_string, labels_order=label_ordering)

Prediction Accuracy: 0.82
Confusion Matrix with ordering ['No', 'Yes']
[[28 11]
 [ 7 54]]
Label: No, F1: 0.7567567567567569, Precision: 0.8, Recall: 0.717948717948718
Label: Yes, F1: 0.8571428571428572, Precision: 0.8307692307692308, Recall: 0.8852459016393442


We get a small improvement in performance on this task using activations (0.82 vs. 0.80 accuracy), but, importantly, we are no longer failing to match 11 answers in the generation and falling back to a random guess in those instances. This means that more of our answers are grounded in the model's probability estimates.



### Zero-shot Prompting (Poor Label Space Choice)

There are only two categories in our label space (yes/no) and the question is fairly "leading." As a result, the model tends to answer within our label space fairly well under greedy decoding (though there are still 11 instances where it doesn't). However, it's not always easy to select a label space in which the model consistently answers for zero-shot prompts (see the [AG's News Notebook](src/reference_implementations/prompting_vector_llms/llm_prompting_examples/llm_prompt_ag_news.ipynb)). There are techniques that can help with this (like rewriting the prompt or using few-shot prompts), but what happens if we choose a bad label space?

Let's try it out below. Instead of expecting a "Yes" or "No" answer, let's "expect" True or False.

In [22]:
label_map = {0: "False", 1: "True"}
label_ordering = ["False", "True"]
lowercase_labels = [label.lower() for label in label_ordering]

yes_no_prompts = []
for test_title, test_passage, test_question in zip(bool_q_test_titles, bool_q_test_passages, bool_q_test_questions):
    prompt = f"{test_title} -- {test_passage}\nquestion: {test_question}?\nanswer (True or False):"
    yes_no_prompts.append(prompt)

print(f"Prompt Example\n{yes_no_prompts[0]}")

Prompt Example
Shear wall -- In structural engineering, a shear wall is a structural system composed of braced panels (also known as shear panels) to counter the effects of lateral load acting on a structure. Wind and seismic loads are the most common building codes, including the International Building Code (where it is called a braced wall line) and Uniform Building Code, all exterior wall lines in wood or steel frame construction must be braced. Depending on the size of the building some interior walls must be braced as well.
question: is a shear wall a load bearing wall?
answer (True or False):


In [23]:
# extract the tokenizer ids associated with our labels
label_token_ids = get_label_token_ids(tokenizer, yes_no_prompts[0], ["False", "false", "True", "true"])
print(label_token_ids)
# If you ever need to move back from token ids, you can use tokenizer.decode or tokenizer.batch_decode
print(tokenizer.decode([7700, 2089, 5852, 1565]))

tensor([7700, 2089, 5852, 1565])
False false True true


In [24]:
predicted_labels = []
unmatched_predictions = []
# For memory management, we split the prompts into batches of size 10
prompt_batches = split_prompts_into_batches(yes_no_prompts, 10)
for prompt_batch in tqdm(prompt_batches):
    responses = model.generate(prompt_batch, greedy_generation_config)
    processed_responses = [generation.strip().lower() for generation in responses.generation["sequences"]]
    # If a token doesn't correspond to one of our labels, we'll randomly select one
    for potential_prediction in processed_responses:
        if potential_prediction in lowercase_labels:
            predicted_labels.append(potential_prediction)
        else:
            unmatched_predictions.append(potential_prediction)
            predicted_labels.append(random.choice(lowercase_labels))

100%|██████████| 10/10 [02:44<00:00, 16.45s/it]


In [25]:
print(f"{len(unmatched_predictions)} Responses did not match the label space {lowercase_labels}")
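# Cap the sample size so that sample() does not raise a ValueError when there are fewer than 10 unmatched responses
print(f"Some examples of unmatched responses: {sample(unmatched_predictions, min(10, len(unmatched_predictions)))}")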
# Map the labels from integers to strings for comparison to the string predicted labels in the confusion matrix
bool_q_text_labels_string = [label_map[label].lower() for label in bool_q_test_labels]
report_metrics(predicted_labels, bool_q_text_labels_string, labels_order=lowercase_labels)

94 Responses did not match the label space ['false', 'true']
Some examples of unmatched responses: ['there', 'there', 'stephen', 'krist', 'pat', 'the', 'no', 'green', 'white', 'the']
Prediction Accuracy: 0.48
Confusion Matrix with ordering ['false', 'true']
[[19 20]
 [32 29]]
Label: false, F1: 0.42222222222222217, Precision: 0.37254901960784315, Recall: 0.48717948717948717
Label: true, F1: 0.5272727272727273, Precision: 0.5918367346938775, Recall: 0.47540983606557374


Clearly, the responses don't match up well with our label space, which we expected based on our question phrasing. The question is whether a poorly chosen label space is essentially useless from a prediction standpoint. To help answer this, let's consider the results when using activation extraction.

### Zero-shot Prompting (Poor Label Choice Activations)

In [26]:
activation_map_to_label = {0: "False", 1: "False", 2: "True", 3: "True"}
predicted_labels = []
# For memory management, we split the prompts into batches of size 1, as activations are fairly heavy
prompt_batches = split_prompts_into_batches(yes_no_prompts, 1)
for prompt_batch in tqdm(prompt_batches):
    # Note the configuration temperature won't matter for our setting, as we're not using sampling when
    # extracting activations
    activations = model.get_activations(prompt_batch, [last_layer_name], greedy_generation_config)
    for activations_single_prompt in activations.activations:
        # For each prompt we extract the activations and calculate which label had the highest likelihood.
        last_layer_matrix = activations_single_prompt[last_layer_name]
        predicted_label = get_label_with_highest_likelihood(
            last_layer_matrix, label_token_ids, activation_map_to_label
        )
        predicted_labels.append(predicted_label)

100%|██████████| 100/100 [05:15<00:00,  3.15s/it]


In [27]:
# Map the labels from integers to strings for comparison to the string predicted labels in the confusion matrix
bool_q_text_labels_string = [label_map[label] for label in bool_q_test_labels]
report_metrics(predicted_labels, bool_q_text_labels_string, labels_order=label_ordering)

Prediction Accuracy: 0.58
Confusion Matrix with ordering ['False', 'True']
[[22 17]
 [25 36]]
Label: False, F1: 0.5116279069767442, Precision: 0.46808510638297873, Recall: 0.5641025641025641
Label: True, F1: 0.631578947368421, Precision: 0.6792452830188679, Recall: 0.5901639344262295


While the accuracy is still not great, we get an improvement using activations over pure generation and the results suggest that there is still some information encoded in the True/False answers.

### N=2 Few-Shot Examples (Greedy Decoding)

In this example, we consider the effect of adding demonstrations to our prompts. That is, for this problem, does adding a small number of demonstrations improve accuracy over zero-shot prompting?

**NOTE**: Some of the BoolQ examples are quite long. Our LLaMA-2 implementation limits input lengths to 512 tokens, so sometimes adding two, or even just one, demonstration puts us over this limit. The `create_prompts` function handles this by putting as many demonstrations as possible (up to 2) into our prompts. In many few-shot prompting settings, the benefit from demonstrations scales with the number of demonstrations provided, so we may not get an increase in task accuracy.

In [28]:
label_map_two = {0: "No", 1: "Yes"}
label_ordering_two = ["No", "Yes"]
lowercase_labels = [label.lower() for label in label_ordering_two]
demonstrations_two = create_demonstrations(
    bool_q_train_titles, bool_q_train_passages, bool_q_train_questions, bool_q_train_labels, label_map_two, 2
)
prompts_two = create_prompts(
    demonstrations_two, bool_q_test_titles, bool_q_test_passages, bool_q_test_questions, tokenizer
)

Prompt is too long: 534. Using 1-shot prompt
Prompt is too long: 593. Using 0-shot prompt
Prompt is too long: 517. Using 1-shot prompt
Prompt is too long: 557. Using 1-shot prompt
Prompt is too long: 552. Using 1-shot prompt
Prompt is too long: 548. Using 1-shot prompt
Prompt is too long: 598. Using 1-shot prompt


In [29]:
print(prompts_two[0])

Check (chess) -- In friendly games, the checking player customarily says ``check'' when making a checking move. Announcing ``check'' is not required under the rules of chess and it is usually not done in formal games. Until the early 20th century a player was expected to announce ``check'', and some sources of rules even required it (Hooper & Whyld 1992:74).
question: do we have to say check in chess?
answer: No

The 100 (TV series) -- In March 2017, The CW renewed the series for a fifth season, which premiered on April 24, 2018. In May 2018, the series was renewed for a sixth season.
question: are they gonna make a season 5 of the 100?
answer: Yes

Shear wall -- In structural engineering, a shear wall is a structural system composed of braced panels (also known as shear panels) to counter the effects of lateral load acting on a structure. Wind and seismic loads are the most common building codes, including the International Building Code (where it is called a braced wall line) and Uni

In [30]:
predicted_labels = []
unmatched_predictions = []
# For memory management, we split the prompts into batches of size 10
prompt_batches = split_prompts_into_batches(prompts_two, 10)
for prompt_batch in tqdm(prompt_batches):
    responses = model.generate(prompt_batch, greedy_generation_config)
    processed_responses = [generation.strip().lower() for generation in responses.generation["sequences"]]
    # If a token doesn't correspond to one of our labels, we'll randomly select one
    for potential_prediction in processed_responses:
        if potential_prediction in lowercase_labels:
            predicted_labels.append(potential_prediction)
        else:
            unmatched_predictions.append(potential_prediction)
            predicted_labels.append(random.choice(lowercase_labels))

100%|██████████| 10/10 [02:16<00:00, 13.65s/it]


In [31]:
print(f"{len(unmatched_predictions)} Responses did not match the label space {lowercase_labels}")
print(f"Some examples of unmatched responses: {unmatched_predictions}")
# Map the labels from integers to strings for comparison to the string predicted labels in the confusion matrix
bool_q_text_labels_string = [label_map_two[label].lower() for label in bool_q_test_labels]
report_metrics(predicted_labels, bool_q_text_labels_string, labels_order=lowercase_labels)

0 Responses did not match the label space ['no', 'yes']
Some examples of unmatched responses: []
Prediction Accuracy: 0.83
Confusion Matrix with ordering ['no', 'yes']
[[27 12]
 [ 5 56]]
Label: no, F1: 0.7605633802816902, Precision: 0.84375, Recall: 0.6923076923076923
Label: yes, F1: 0.868217054263566, Precision: 0.8235294117647058, Recall: 0.9180327868852459


Unfortunately, we don't see a significant increase in accuracy for this task with 2-shot (at most) prompts: 0.83, compared with 0.80 for zero-shot greedy decoding. However, one immediate benefit is that the number of responses that we were unable to map to our label space drops from 11 to 0, which is a useful result.