In [2]:
import random
import re
import time
from random import choice, sample
from typing import Any, Dict, List, Tuple

import evaluate
import kscope
import numpy as np
import torch
import torch.nn as nn
from evaluate import EvaluationModule
from tqdm import tqdm
from transformers import AutoTokenizer
from utils import (
    copa_preprocessor,
    create_first_prompt,
    create_first_prompt_label,
    create_mc_prompt,
    create_mc_prompt_answer,
    create_second_prompt,
    create_second_prompt_label,
    split_prompts_into_batches,
)

# Setting seed for reproducibility
random.seed(2024)

# Getting Started

Documentation on how to interact with the large models is available [here](https://kaleidoscope-sdk.readthedocs.io/en/latest/). The SDK's GitHub repository is [here](https://github.com/VectorInstitute/kaleidoscope-sdk), and the underlying code is [here](https://github.com/VectorInstitute/kaleidoscope).

First, we connect to the service through which we'll interact with the LLMs and see which models are available to us.

## Establish a client connection to the Kaleidoscope service


In [3]:
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)

In [4]:
client.models

['gpt2',
 'llama2-7b',
 'llama2-7b_chat',
 'llama2-13b',
 'llama2-13b_chat',
 'llama2-70b',
 'llama2-70b_chat',
 'falcon-7b',
 'falcon-40b',
 'sdxl-turbo']

In [5]:
client.model_instances

[{'id': 'f0973d9f-d6d7-41ab-974c-e398ed1abc21',
  'name': 'llama2-7b',
  'state': 'ACTIVE'},
 {'id': '38c590c0-f3b3-4331-8296-74e6068a7106',
  'name': 'falcon-7b',
  'state': 'ACTIVE'}]

In this notebook, we'll perform some experimentation with both the LLaMA-2 and Falcon 7B parameter models.

In [6]:
llama_model = client.load_model("llama2-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while llama_model.state != "ACTIVE":
    time.sleep(1)

In [7]:
falcon_model = client.load_model("falcon-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while falcon_model.state != "ACTIVE":
    time.sleep(1)
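
The polling loops above wait indefinitely if a model never becomes active. A minimal sketch of a bounded wait is shown below. It assumes only that the model object exposes a `state` attribute, as used above; `wait_until_active`, `timeout_s`, and `poll_s` are hypothetical names and are not part of the Kaleidoscope SDK.

```python
import time
from typing import Any


def wait_until_active(model: Any, timeout_s: float = 600.0, poll_s: float = 1.0) -> None:
    """Poll model.state until it is "ACTIVE", raising TimeoutError after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while model.state != "ACTIVE":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Model still in state {model.state!r} after {timeout_s} seconds")
        time.sleep(poll_s)
```

With such a helper, `wait_until_active(llama_model)` could stand in for the bare `while` loop, at the cost of having to pick a sensible timeout for your cluster.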

### Target Task: Balanced Choice of Plausible Alternatives (CoPA)

We'll use an updated (harder) version of the CoPA dataset. We're only going to work with a small subset of the true development set in order to expedite LLM evaluation. 

In short, the task is as follows: given a sentence and a premise type of either cause or effect, the model must choose which of two candidate sentences is the more plausible cause or effect of that sentence.

Two examples are:

From the following choices,
1) The author faded into obscurity.
2) It was adapted into a movie.

and an __effect__ premise, which logically follows the sentence "The book became a huge failure." The answer is "The author faded into obscurity." 

From the following choices,
1) The shop was undergoing renovation.
2) The owner was helping customers.

and a __cause__ premise, which logically follows the sentence "The shop was closed." The answer is "The shop was undergoing renovation." 

You can inspect the preprocessed dataset at 

`src/reference_implementations/prompting_vector_llms/prompt_ensembling/resources/copa_sample.tsv`

We print out some of the demonstrations that we've set up below for additional reference.

__NOTE__: Construction of the prompts and some other functions that are used throughout this notebook have been pulled into a utils file 

`src/reference_implementations/prompting_vector_llms/prompt_ensembling/utils.py`

### A Generation Formulation

For our first approach to this task, we'll attempt to have the model generate exact matches to the possible choices. We expect this to be somewhat brittle, as it relies on the model producing a copy (or something "nearby") of the selected completion.

__NOTE__: In general, when we see an "effect" premise, the string ", so" is added to the phrase to be completed. If the premise is "cause", then the string ", because" is added to the phrase to be completed. We strip out the ending period to improve fluency. See the demonstrations below for an example.
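
The actual prompt construction lives in the utils file and is not reproduced here. As a rough illustration of the rule in the note, the sketch below builds the phrase to be completed. The helper name `build_phrase_stub` and the premise encoding as the strings `"cause"` and `"effect"` are assumptions made for this example.

```python
def build_phrase_stub(phrase: str, premise: str) -> str:
    """Strip the trailing period and append the connector for the premise type."""
    # Assumption: premises are encoded as the strings "cause" and "effect"
    if premise not in ("cause", "effect"):
        raise ValueError(f"Unknown premise type: {premise!r}")
    connector = ", so" if premise == "effect" else ", because"
    return phrase.rstrip().rstrip(".") + connector


print(build_phrase_stub("The book became a huge failure.", "effect"))
# The book became a huge failure, so
print(build_phrase_stub("The shop was closed.", "cause"))
# The shop was closed, because
```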

In [8]:
# How many of the initial data points should be reserved for demonstrations
demonstration_candidates = 50
# Number of demonstrations to be used per prompt
n_demonstrations = 10

copa_data_set = copa_preprocessor("resources/copa_sample.tsv")

demonstration_pool = copa_data_set[0:demonstration_candidates]
test_pool = copa_data_set[demonstration_candidates:]
demonstrations = sample(demonstration_pool, n_demonstrations)
prompts: List[str] = []
labels: List[str] = []
int_labels: List[int] = []
choices: List[Tuple[str, str]] = []
for premise, label, phrase, first_choice, second_choice in test_pool:
    int_labels.append(label)
    choices.append((first_choice, second_choice))
    labels.append(create_first_prompt_label(first_choice.lower(), second_choice.lower(), label))
    prompts.append(create_first_prompt(demonstrations, premise, phrase, first_choice, second_choice))

In [9]:
print(prompts[0])

Choose the sentence that best completes the phrase

"the student's phone rang." or "the student took notes."
Everyone in the class turned to stare at the student, because the student's phone rang.

"the teapot whistled." or "the teapot cooled."
The water in the teapot started to boil, so the teapot whistled.

"the girl applied her makeup." or "the girl turned on the fan."
The mirror in the bathroom fogged up, so the girl turned on the fan.

"the surveillance camera was out of focus." or "he noticed some suspicious activity."
The security guard could not identify the thief, because the surveillance camera was out of focus.

"he trusted the therapist." or "he disagreed with the therapist."
The man revealed personal information to the therapist, because he trusted the therapist.

"she contacted her lawyer." or "she cancelled her appointments."
The woman was summoned for jury duty, so she cancelled her appointments.

"the student's phone rang." or "the student took notes."
The teacher cove

Let's see how this prompt performs on a small sample of the data

In [10]:
def process_generation_text(original_texts: List[str]) -> List[str]:
    responses = []
    for single_generation in original_texts:
        generation_text: List[str] = re.findall(r".*?[.!\?]", single_generation)
        response_text = generation_text[0] if len(generation_text) > 0 else single_generation
        responses.append(response_text)
    return responses
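
To make this post-processing concrete, here is the same regular expression applied to two made-up generations (the strings are illustrative, not model output). The non-greedy pattern keeps everything up to the first `.`, `!`, or `?`; text without terminal punctuation is returned unchanged. `first_sentence` is a hypothetical single-string version of the function above.

```python
import re
from typing import List


def first_sentence(text: str) -> str:
    # Same pattern as process_generation_text: shortest span ending in ., ! or ?
    matches: List[str] = re.findall(r".*?[.!\?]", text)
    return matches[0] if matches else text


print(repr(first_sentence(" they rested. The couple woke up late.")))  # ' they rested.'
print(repr(first_sentence("no terminal punctuation here")))  # 'no terminal punctuation here'
```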

In [11]:
# Note that both of these configurations specify GREEDY decoding strategies for the different models
llama_generation_config = {"max_tokens": 35, "top_p": 1.0, "temperature": 0.0}
falcon_generation_config = {"max_tokens": 35, "top_k": 1, "temperature": 1.0, "do_sample": False}

In [12]:
n_samples_to_run = 3
llama_generations = llama_model.generate(prompts[0:n_samples_to_run], llama_generation_config)
llama_responses = process_generation_text(llama_generations.generation["sequences"])

falcon_generations = falcon_model.generate(prompts[0:n_samples_to_run], falcon_generation_config)
falcon_responses = process_generation_text(falcon_generations.generation["sequences"])

In [13]:
for response, label in zip(llama_responses, labels[0:n_samples_to_run]):
    print(f"LLaMA Response: {response}\nLabel: {label}\n")

for response, label in zip(falcon_responses, labels[0:n_samples_to_run]):
    print(f"Falcon Response: {response}\nLabel: {label}\n")

LLaMA Response: "the student's phone rang.
Label: they rested.

LLaMA Response: I stopped receiving new issues.
Label: i discarded the new issue.

LLaMA Response: She felt self-conscious.
Label: she felt self-conscious.

Falcon Response:  they rested.
Label: they rested.

Falcon Response:  I discarded the new issue.
Label: i discarded the new issue.

Falcon Response:  she felt self-conscious.
Label: she felt self-conscious.



#### Scoring generated Responses for the 10-shot Prompt Above

Here we consider the performance of the demonstration prompt above on our subsampling of the CoPA dataset.

Let's run all of the examples through the models and collect the responses into lists.
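
The batching helper `split_prompts_into_batches` is imported from the utils file and is not shown in this notebook. A plausible sketch of such a helper follows, under the assumption that it simply chunks the list in order and lets the final batch be smaller; `chunk_prompts` is a hypothetical name used only here.

```python
from typing import List


def chunk_prompts(prompts: List[str], batch_size: int) -> List[List[str]]:
    """Split prompts into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]


dummy_prompts = [f"prompt {i}" for i in range(100)]
print(len(chunk_prompts(dummy_prompts, 10)))  # 10
print(len(chunk_prompts(dummy_prompts, 8)))  # 13
```

For 100 prompts, batch sizes of 10 and 8 give 10 and 13 batches respectively.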

In [14]:
llama_responses = []
prompt_batches = split_prompts_into_batches(prompts, 10)
for prompt_batch in tqdm(prompt_batches):
    generations = llama_model.generate(prompt_batch, llama_generation_config)
    llama_responses.extend(process_generation_text(generations.generation["sequences"]))

falcon_responses = []
# Falcon requires a batch size of 8 or less
prompt_batches = split_prompts_into_batches(prompts, 8)
for prompt_batch in tqdm(prompt_batches):
    generations = falcon_model.generate(prompt_batch, falcon_generation_config)
    falcon_responses.extend(process_generation_text(generations.generation["sequences"]))

100%|██████████| 10/10 [00:44<00:00,  4.46s/it]
100%|██████████| 13/13 [04:24<00:00, 20.31s/it]


We can score the generated text using the ROUGE metric. For each response, we treat each of the two candidate completions in turn as the reference and compute the ROUGE score of the response against it. The candidate with the higher ROUGE score is taken as the model's prediction.
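
To illustrate the selection rule without the `evaluate` dependency, the sketch below swaps in a simple unigram-overlap F1 as a stand-in for ROUGE. It is not the ROUGE implementation used in this notebook, and `tokenize`, `unigram_f1`, and `pick_choice` are hypothetical names. The response string is made up; the two choices come from the first CoPA example above.

```python
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercase, split on whitespace, and strip surrounding punctuation from each word
    words = (word.strip(".,!?\"'") for word in text.lower().split())
    return [word for word in words if word]


def unigram_f1(prediction: str, reference: str) -> float:
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Multiset intersection counts each shared word at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def pick_choice(response: str, first_choice: str, second_choice: str) -> int:
    # Return 0 if the response overlaps more with the first choice, otherwise 1
    return 0 if unigram_f1(response, first_choice) > unigram_f1(response, second_choice) else 1


print(pick_choice("the author faded away.", "The author faded into obscurity.", "It was adapted into a movie."))  # 0
```

Note that with a strict `>` comparison, a tie (including the case where the response overlaps with neither choice) falls through to the second choice.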

In [15]:
rouge_metric = evaluate.load("rouge")

In [16]:
def score_response_via_rouge(
    response: str, first_choice: str, second_choice: str, rouge_metric: EvaluationModule
) -> int:
    response = response.lower()
    first_choice = first_choice.lower()
    second_choice = second_choice.lower()
    # Use the rouge metric to score the response against the first choice or second choice as reference
    rouge_0 = rouge_metric.compute(predictions=[response], references=[first_choice])
    rouge_1 = rouge_metric.compute(predictions=[response], references=[second_choice])
    # We take the average of the unigram and bi-gram rouge scores for the first and second choice results.
    score_0 = (rouge_0["rouge1"] + rouge_0["rouge2"]) / 2.0
    score_1 = (rouge_1["rouge1"] + rouge_1["rouge2"]) / 2.0
    # If the first score is larger we select the first choice
    return 0 if score_0 > score_1 else 1

In [17]:
total = 0
correct = 0
for response, label_int, (first_choice, second_choice) in zip(llama_responses, int_labels, choices):
    predicted_label = score_response_via_rouge(response, first_choice, second_choice, rouge_metric)
    if predicted_label == label_int:
        correct += 1
    total += 1

print(f"LLaMA Accuracy: {correct/total}")

total = 0
correct = 0
for response, label_int, (first_choice, second_choice) in zip(falcon_responses, int_labels, choices):
    predicted_label = score_response_via_rouge(response, first_choice, second_choice, rouge_metric)
    if predicted_label == label_int:
        correct += 1
    total += 1

print(f"Falcon Accuracy: {correct/total}")

LLaMA Accuracy: 0.66
Falcon Accuracy: 0.68


We score above random chance for this problem, but we'd certainly like to do better. We'll try a few additional formulations below.

### A Multiple Choice Formulation

Instead of formulating the task as a completion, we can try using a multiple choice type construction and see how well the model does the task.

In [18]:
prompts = []
labels = []
int_labels = []
choices = []
for premise, label, phrase, first_choice, second_choice in test_pool:
    int_labels.append(label)
    choices.append((first_choice, second_choice))
    labels.append(create_mc_prompt_answer(label))
    prompts.append(create_mc_prompt(demonstrations[2:], premise, phrase, first_choice, second_choice))

In [19]:
print(prompts[0])

From A or B which choice best completes the phrase?
Phrase: The mirror in the bathroom fogged up, so
A: the girl applied her makeup.
B: the girl turned on the fan.
Answer: B

From A or B which choice best completes the phrase?
Phrase: The security guard could not identify the thief, because
A: the surveillance camera was out of focus.
B: he noticed some suspicious activity.
Answer: A

From A or B which choice best completes the phrase?
Phrase: The man revealed personal information to the therapist, because
A: he trusted the therapist.
B: he disagreed with the therapist.
Answer: A

From A or B which choice best completes the phrase?
Phrase: The woman was summoned for jury duty, so
A: she contacted her lawyer.
B: she cancelled her appointments.
Answer: B

From A or B which choice best completes the phrase?
Phrase: The teacher covered a lot of material, because
A: the student's phone rang.
B: the student took notes.
Answer: B

From A or B which choice best completes the phrase?
Phrase: Th

In [20]:
# Note that both of these are GREEDY decoding strategies for the different models.
# Since we're generating a multiple choice answer, we shorten the max tokens.
llama_generation_config = {"max_tokens": 4, "top_p": 1.0, "temperature": 0.0}
falcon_generation_config = {"max_tokens": 4, "top_k": 1, "temperature": 1.0, "do_sample": False}

In [21]:
def process_mc_generation_text(original_texts: List[str]) -> List[str]:
    responses = []
    for single_generation in original_texts:
        generation_text: List[str] = re.findall(r"(A|B)", single_generation)
        # If you find an A or B in the answer use the first occurrence. Otherwise randomly select one
        if len(generation_text) == 0:
            print(f"Selecting Randomly. No selection match was found in: {single_generation}")
        response_text = generation_text[0] if len(generation_text) > 0 else choice(["A", "B"])
        responses.append(response_text)
    return responses
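
One caveat with the pattern `(A|B)`: it matches those letters anywhere, including inside words, so a generation such as `"Answer: B"` would yield `A` from the first letter of "Answer". If this turns out to matter for your model's outputs, one possible alternative is to require word boundaries. `extract_mc_answer` is a hypothetical helper, not part of the notebook's utils, and the input strings are made up.

```python
import re
from typing import Optional


def extract_mc_answer(generation: str) -> Optional[str]:
    """Return the first standalone A or B in the text, or None if there is none."""
    match = re.search(r"\b(A|B)\b", generation)
    return match.group(1) if match else None


print(re.findall(r"(A|B)", "Answer: B")[0])  # A
print(extract_mc_answer("Answer: B"))  # B
print(extract_mc_answer(" A\n\nFrom"))  # A
print(extract_mc_answer("no letter here"))  # None
```

Returning `None` leaves the fallback decision (for example, random selection as above) to the caller.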

In [22]:
n_samples_to_run = 3
llama_generations = llama_model.generate(prompts[0:n_samples_to_run], llama_generation_config)
llama_responses = process_mc_generation_text(llama_generations.generation["sequences"])

falcon_generations = falcon_model.generate(prompts[0:n_samples_to_run], falcon_generation_config)
falcon_responses = process_mc_generation_text(falcon_generations.generation["sequences"])

In [23]:
for response, label in zip(llama_responses, labels[0:n_samples_to_run]):
    print(f"LLaMA Response: {response}\nLabel: {label}\n")

for response, label in zip(falcon_responses, labels[0:n_samples_to_run]):
    print(f"Falcon Response: {response}\nLabel: {label}\n")

LLaMA Response: A
Label: A

LLaMA Response: A
Label: B

LLaMA Response: A
Label: A

Falcon Response: B
Label: A

Falcon Response: A
Label: B

Falcon Response: A
Label: A



In [24]:
llama_responses = []
prompt_batches = split_prompts_into_batches(prompts, 10)
for prompt_batch in tqdm(prompt_batches):
    generations = llama_model.generate(prompt_batch, llama_generation_config)
    llama_responses.extend(process_mc_generation_text(generations.generation["sequences"]))

falcon_responses = []
# Falcon requires a batch size of 8 or less
prompt_batches = split_prompts_into_batches(prompts, 8)
for prompt_batch in tqdm(prompt_batches):
    generations = falcon_model.generate(prompt_batch, falcon_generation_config)
    falcon_responses.extend(process_mc_generation_text(generations.generation["sequences"]))

100%|██████████| 10/10 [00:25<00:00,  2.53s/it]
100%|██████████| 13/13 [00:38<00:00,  2.94s/it]


In [25]:
total = 0
correct = 0
for predicted_label_str, label in zip(llama_responses, labels):
    if predicted_label_str == label:
        correct += 1
    total += 1

print(f"LLaMA Accuracy: {correct/total}")

total = 0
correct = 0
for predicted_label_str, label in zip(falcon_responses, labels):
    if predicted_label_str == label:
        correct += 1
    total += 1

print(f"Falcon Accuracy: {correct/total}")

LLaMA Accuracy: 0.68
Falcon Accuracy: 0.53


For this example, the LLaMA model performs significantly better than Falcon (0.68 versus 0.53 accuracy). However, we're still not performing the task as accurately as we'd like. In the [Bootstrap Ensembling Notebook](bootstrap_ensembling.ipynb), we'll consider ways to improve this formulation through ensembling.

### A Log-Likelihood Formulation

Alternatively, for each prompt and the two responses, we can score the candidate responses by log likelihood (from the models' perspective) and choose the higher one as our label. That is, we complete the prompt with both labels and then extract the log-likelihoods of that input text from the perspective of the model. See the comments in the code below for more details on how this is done.
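
Written out, and following the length normalization used in the code below, the score can be stated as follows. Let $d$ denote the tokens of the demonstrations (everything up to and including the last endline) and let $w^{(i)} = (w^{(i)}_1, \dots, w^{(i)}_{m_i})$ denote the tokens of the final line, that is, the phrase followed by candidate completion $i \in \{0, 1\}$. The symbols here are introduced for this note only.

$$
s_i = \frac{1}{m_i} \sum_{t=1}^{m_i} \log p_\theta\left(w^{(i)}_t \mid d, w^{(i)}_{<t}\right), \qquad \hat{y} = \arg\max_{i \in \{0, 1\}} s_i
$$

Dividing by $m_i$ keeps a longer completion from being penalized simply for containing more tokens.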

__NOTE__: In our current implementation, __only LLaMA-2__ is configured to produce activations, so we'll only use that model in this example.

#### Tokenizer 

For activation retrieval, we need to instantiate a tokenizer to obtain appropriate token indices for our labels. 

__NOTE__: All LLaMA-2 models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

If you are on the cluster, the tokenizer may be loaded from `/model-weights/Llama-2-7b-hf`. Otherwise, you'll need to download the `config.json`, `tokenizer.json`, `tokenizer.model`, and `tokenizer_config.json` from there to your local machine.

In [26]:
# Tokenizer prepares the input of the model. LLaMA models of all sizes use the same underlying tokenizer
tokenizer = AutoTokenizer.from_pretrained("/model-weights/Llama-2-7b-hf")
# Let's test out how the tokenizer works on an example sentence. Note that the token with ID = 1 is the
# Beginning of sentence token ("BOS")
encoded_tokens = tokenizer.encode("Hello this is a test")
print(f"Encoded Tokens: {encoded_tokens}")
# If you ever need to move back from token ids, you can use tokenizer.decode or tokenizer.batch_decode
decoded_tokens = tokenizer.decode(encoded_tokens)
print(f"Decoded Tokens: {decoded_tokens}")

# We're interested in the activations from the last layer of the model, because this will allow us to calculate the
# likelihoods
last_layer_name = llama_model.module_names[-1]
print(f"Last Layer Name: {last_layer_name}")
# Get a log softmax function to compute log probabilities from the output layer.
log_softmax = nn.LogSoftmax(dim=1)

endline_token_id = tokenizer.encode("Hello\n")[-1]
print(f"Endline Token Id: {endline_token_id}")

Encoded Tokens: [1, 15043, 445, 338, 263, 1243]
Decoded Tokens: <s> Hello this is a test
Last Layer Name: output
Endline Token Id: 13


In [27]:
def compute_log_probability_from_activations(logits: torch.Tensor, token_ids: List[int]) -> float:
    # First we get the logprobs associated with each token, logits is n_tokens x vocabulary size
    log_probs = log_softmax(logits.type(torch.float32))
    # Drop the first token ID (as it corresponds to the <s> token) and add a placeholder to the end. This shift aligns
    # the tokens with the output activations corresponding to their logprobs. We build a new list rather than
    # mutating the caller's token_ids in place.
    token_ids = token_ids[1:] + [1]
    # We only really care about the logprobs associated with the sentence to be completed
    # (i.e. not the demonstrations or the question). So search for the last endline in the tokens and only
    # sum the logprobs thereafter.
    endline_index = len(token_ids) - list(reversed(token_ids)).index(endline_token_id)
    # Turn token ids into the appropriate column indices
    token_id_slicer = torch.Tensor(token_ids).reshape(-1, 1).type(torch.int64)
    log_probs_per_token = log_probs.gather(1, token_id_slicer)
    # We sum the log probabilities, except for the last one (which corresponds to the not-yet-predicted token),
    # and then normalize by the number of tokens summed.
    selected_log_probs_per_token = log_probs_per_token[endline_index:-1]
    normalized_log_prob = torch.sum(selected_log_probs_per_token) / len(selected_log_probs_per_token)
    return normalized_log_prob.item()
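
The index arithmetic above is easy to get wrong, so here is a toy walk-through in plain Python. The token IDs are made up (only 1 for `<s>` and 13 for the endline mirror the values printed earlier), and the per-token log probabilities are invented numbers standing in for the gathered values.

```python
from typing import List

BOS_ID = 1
ENDLINE_ID = 13

# Made-up sequence: <s>, two demonstration tokens, endline, three tokens of the final line
token_ids: List[int] = [BOS_ID, 500, 501, ENDLINE_ID, 600, 601, 602]

# Shift left by one and pad, so shifted[i] is the token predicted from position i
shifted = token_ids[1:] + [BOS_ID]

# Index of the first token after the last endline in the shifted sequence
endline_index = len(shifted) - list(reversed(shifted)).index(ENDLINE_ID)

# Invented log probability of each shifted token (the last entry belongs to the placeholder)
log_probs_per_token = [-2.0, -1.5, -0.5, -1.0, -3.0, -2.0, -9.0]

# Keep only the final-line tokens and drop the placeholder entry
selected = log_probs_per_token[endline_index:-1]
normalized = sum(selected) / len(selected)
print(endline_index, selected, normalized)  # 3 [-1.0, -3.0, -2.0] -2.0
```

Only the three tokens after the last endline contribute, and the placeholder's entry is excluded by the `:-1` slice.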

In [28]:
# We're running a lot of activation retrievals. Once in a while there is a json decoding or Triton error. If that
# happens, we retry the activations request.
def get_activations_with_retries(prompt: str, layers: List[str], config: Dict[str, Any], retries: int = 5) -> Any:
    for _ in range(retries):
        try:
            return llama_model.get_activations(prompt, layers, config)
        except Exception as e:  # noqa: F841
            print("Something went wrong in activation retrieval...retrying")
    raise ValueError("Exceeded retry limit. Exiting Process")

In [29]:
def pair_prompts_with_choices(prompt_batch: List[Tuple[str, Tuple[str, str]]]) -> List[str]:
    # We want to complete our prompt with the two possible choices and score those completions using our LM.
    prompts_with_choices = []
    for prompt, (first_choice, second_choice) in prompt_batch:
        prompts_with_choices.append(f"{prompt}{first_choice.lower()}")
        prompts_with_choices.append(f"{prompt}{second_choice.lower()}")
    return prompts_with_choices

Once we have a log likelihood for each prompt completion corresponding to completion with the first or second potential phrase, we pair those up and compute which has the higher likelihood between the two options. This then becomes our "predicted" label.
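
As a small illustration of this pairing step in plain Python (the scores are invented, and `pick_labels` is a hypothetical name):

```python
from typing import List


def pick_labels(logprobs: List[float]) -> List[int]:
    """Group scores in consecutive pairs and return the index (0 or 1) of the larger one."""
    if len(logprobs) % 2 != 0:
        raise ValueError("Expected an even number of log probabilities")
    pairs = [logprobs[i : i + 2] for i in range(0, len(logprobs), 2)]
    # On an exact tie this picks index 0, matching np.argmax's first-occurrence behaviour
    return [0 if first >= second else 1 for first, second in pairs]


print(pick_labels([-1.2, -2.5, -3.1, -0.7]))  # [0, 1]
```

The first pair favours the first completion (-1.2 > -2.5) and the second pair favours the second completion (-0.7 > -3.1).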

In [30]:
def post_process_logprobs_to_labels(logprobs: List[float]) -> Tuple[List[int], List[List[float]]]:
    # Need to group logprobs in twos because they represent likelihoods of the two completions
    assert len(logprobs) % 2 == 0
    paired_logprobs = [logprobs[x : x + 2] for x in range(0, len(logprobs), 2)]
    predicted_labels: List[int] = []
    predicted_logprobs = []
    for logprob_pair in paired_logprobs:
        # Paired logprob for first and second choice together
        predicted_labels.append(int(np.argmax(logprob_pair, axis=0)))
        predicted_logprobs.append(logprob_pair)
    return predicted_labels, predicted_logprobs

In [31]:
prompts = []
labels = []
int_labels = []
choices = []
for premise, label, phrase, first_choice, second_choice in test_pool:
    int_labels.append(label)
    choices.append((first_choice, second_choice))
    labels.append(create_second_prompt_label(first_choice.lower(), second_choice.lower(), label))
    prompts.append(create_second_prompt(demonstrations, premise, phrase))

In [32]:
print(prompts[0])

Complete the phrase with a logical phrase.

Everyone in the class turned to stare at the student, because the student's phone rang.

The water in the teapot started to boil, so the teapot whistled.

The mirror in the bathroom fogged up, so the girl turned on the fan.

The security guard could not identify the thief, because the surveillance camera was out of focus.

The man revealed personal information to the therapist, because he trusted the therapist.

The woman was summoned for jury duty, so she cancelled her appointments.

The teacher covered a lot of material, because the student took notes.

The truck crashed into the motorcycle on the bridge, so the motorcyclist died.

The pants had no defects, because the pants were new.

I took off the rubber gloves, because i was preparing to wash my hands.

The couple was very tired, so 


In [33]:
all_logprobs = []
prompts_and_choices = pair_prompts_with_choices(list(zip(prompts, choices)))
# prompts and choices is now twice as long as the original prompts and choices because the prompts have been completed
# with the two possible choices
# We split the prompts into batches of 1 for memory management since activation retrieval is a bit heavy.
prompt_batches = split_prompts_into_batches(prompts_and_choices, 1)
llama_generation_config = {"max_tokens": 1, "top_p": 1.0, "temperature": 0.0}
for prompt_batch in tqdm(prompt_batches):
    # Process below only works for batches of size 1
    assert len(prompt_batch) == 1
    single_prompt = prompt_batch[0]
    # The score for a completion is the length-normalized sum of the token log probabilities on the final line.
    prompt_activations = get_activations_with_retries(single_prompt, [last_layer_name], llama_generation_config)  # type: ignore # noqa: E501
    token_ids = tokenizer.encode(single_prompt)
    last_layer_matrix = prompt_activations.activations[0][last_layer_name]
    prompt_log_probs = compute_log_probability_from_activations(last_layer_matrix, token_ids)
    all_logprobs.append(prompt_log_probs)

100%|██████████| 200/200 [15:28<00:00,  4.64s/it]


In [34]:
predicted_labels, _ = post_process_logprobs_to_labels(all_logprobs)
total = 0
correct = 0
for predicted_label, label_int in zip(predicted_labels, int_labels):
    if predicted_label == label_int:
        correct += 1
    total += 1
print(f"Accuracy: {correct/total}")

Accuracy: 0.79


This is a nice boost over the accuracy that we've been seeing with our other formulations. In some follow-up notebooks, we'll see if we can improve upon the results of this notebook with a few basic ensembling techniques. These notebooks are [Bootstrap Ensembling](bootstrap_ensembling.ipynb) and [Prompt Ensembling](prompt_ensembling.ipynb).