In [1]:
import re
import time

import evaluate
import kscope
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Getting Started

Documentation on how to interact with the large models is available [here](https://kaleidoscope-sdk.readthedocs.io/en/latest/). The SDK is hosted on GitHub [here](https://github.com/VectorInstitute/kaleidoscope-sdk), and the underlying service code is [here](https://github.com/VectorInstitute/kaleidoscope).

First, we connect to the service through which we'll interact with the LLMs and see which models are available to us.

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)

Show all supported models

In [3]:
client.models

['gpt2',
 'llama2-7b',
 'llama2-7b_chat',
 'llama2-13b',
 'llama2-13b_chat',
 'llama2-70b',
 'llama2-70b_chat',
 'falcon-7b',
 'falcon-40b',
 'sdxl-turbo']

Show all model instances that are currently active

In [15]:
client.model_instances

[{'id': '561d26c9-04db-4670-8ab6-503cb408fb66',
  'name': 'falcon-7b',
  'state': 'ACTIVE'},
 {'id': '9f03d4f8-3d70-4838-bf5a-c68a0ad7dbb7',
  'name': 'llama2-7b',
  'state': 'ACTIVE'}]
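
The instance list above is a plain list of dictionaries, so checking whether a given model is already running can be done with ordinary Python. The sketch below works on a hard-coded copy of the output shown above rather than a live `client.model_instances` call, and `active_model_names` is a hypothetical helper, not part of the kscope SDK.

```python
from typing import Dict, List, Set


def active_model_names(instances: List[Dict[str, str]]) -> Set[str]:
    # Collect the names of all instances reporting an "ACTIVE" state
    return {inst["name"] for inst in instances if inst.get("state") == "ACTIVE"}


# Copy of the output above, used in place of client.model_instances
example_instances = [
    {"id": "561d26c9-04db-4670-8ab6-503cb408fb66", "name": "falcon-7b", "state": "ACTIVE"},
    {"id": "9f03d4f8-3d70-4838-bf5a-c68a0ad7dbb7", "name": "llama2-7b", "state": "ACTIVE"},
]

print("falcon-7b" in active_model_names(example_instances))
```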

In [4]:
model = client.load_model("falcon-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)
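
The loop above waits indefinitely if the model never becomes active. A minimal sketch of a bounded wait follows. `wait_for_state` and its parameters are hypothetical and not part of the kscope SDK, and `FakeModel` only stands in for a real model handle so that the example runs without the service.

```python
import time
from typing import Any


def wait_for_state(obj: Any, target: str = "ACTIVE", timeout_s: float = 300.0, poll_s: float = 1.0) -> bool:
    # Poll obj.state until it equals target or the timeout elapses
    deadline = time.monotonic() + timeout_s
    while obj.state != target:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


class FakeModel:
    # Stand-in that reports "LAUNCHING" a fixed number of times, then "ACTIVE"
    def __init__(self, polls_until_active: int) -> None:
        self._remaining = polls_until_active

    @property
    def state(self) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "LAUNCHING"
        return "ACTIVE"


print(wait_for_state(FakeModel(polls_until_active=3), poll_s=0.01))
```

With a real model handle, a `False` return value would signal that the launch did not complete within the allotted time, which is a cue to inspect the service rather than keep waiting.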

We need to configure how the model generates text, so we set a number of important parameters. For a discussion of the configuration parameters, see `src/reference_implementations/prompting_vector_llms/CONFIG_README.md`.

In [9]:
long_generation_config = {"max_tokens": 128, "top_p": 1.0, "temperature": 1.0}
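
To give some intuition for `temperature` and `top_p`, the sketch below applies both to a toy next-token distribution. It follows the standard definitions of temperature scaling and nucleus (top-p) filtering and is not taken from the kscope service, whose exact sampling implementation is not shown here. The function name and the toy logits are hypothetical.

```python
import math
from typing import Dict


def sampling_distribution(logits: Dict[str, float], temperature: float = 1.0, top_p: float = 1.0) -> Dict[str, float]:
    # Temperature scaling: divide logits by the temperature, then softmax
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}

    # Nucleus filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p, then renormalize
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    kept_total = sum(kept.values())
    return {tok: prob / kept_total for tok, prob in kept.items()}


# Hypothetical logits for four candidate next tokens
toy_logits = {"Ottawa": 3.0, "Toronto": 1.5, "Montreal": 1.0, "Paris": -1.0}
print(sampling_distribution(toy_logits, temperature=1.0, top_p=1.0))
print(sampling_distribution(toy_logits, temperature=0.5, top_p=0.9))
```

Under these definitions, the values used in this notebook (`temperature` of 1.0 and `top_p` of 1.0) leave the distribution unchanged, while lower values concentrate probability on the most likely tokens.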

Let's try a basic prompt for factual information.

__Note__ that if you run the cell multiple times, you'll get different responses due to sampling.

In [7]:
generation = model.generate("What is the capital of Canada?", long_generation_config)
# Extract the text from the returned generation
print(generation.generation["sequences"][0])


Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa is the capital of Canada.


In [10]:
def post_process_generations(generation_text: str) -> str:
    # This simply attempts to extract the first three "sentences" within a generated string
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return " ".join(split_text)
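
Because the post-processing relies on a simple regular expression, it helps to see how it treats awkward inputs. The sketch below repeats the function from the cell above so that it runs on its own, then applies it to two made-up strings: one containing an abbreviation and one with no terminal punctuation.

```python
import re


def post_process_generations(generation_text: str) -> str:
    # Same logic as the cell above: keep the first three "sentences"
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return " ".join(split_text)


# Each period in an abbreviation is treated as a sentence boundary
print(post_process_generations("The watch is in effect from 4 a.m. to noon. Stay safe! Really?"))
# Text with no ".", "!" or "?" produces an empty string
print(repr(post_process_generations("No terminal punctuation here")))
```

The first call prints `The watch is in effect from 4 a. m. to noon.` because "a." and "m." each count as one of the three sentences. The second prints an empty string, since the pattern only matches text that ends in terminal punctuation.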

### Basic Prompts

Now let's create a basic prompt template that we can reuse for multiple text inputs. This will be an instruction prompt with an unconstrained answer space, as we're going to try to get Falcon to summarize texts. We'll try several different templates and examine performance for each. Note that this section relies only on "manual" or "human-level" inspection to judge the quality of the summaries. At the bottom of this notebook, we measure the quality of two prompts on a sample of the CNN Daily Mail task using the ROUGE-1 score.

In [9]:
prompt_template_summary_1 = "Summarize the preceding text."
prompt_template_summary_2 = "Short Summary:"
prompt_template_summary_3 = "TLDR;"

In [10]:
with open("resources/news_summary_datasets/examples_news.txt", "r", encoding="utf-8") as file:
    news_stories = file.readlines()

In [11]:
prompts_with_template_1 = [f"{news_story} {prompt_template_summary_1}" for news_story in news_stories]
prompts_with_template_2 = [f"{news_story} {prompt_template_summary_2}" for news_story in news_stories]
prompts_with_template_3 = [f"{news_story} {prompt_template_summary_3}" for news_story in news_stories]

In these examples, we use the prompt structures

* (text) Summarize the preceding text.
* (text) Short Summary:
* (text) TLDR;
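
One detail worth noting: `file.readlines()` keeps the newline at the end of each line, so the prompts built above contain a line break between the story and the template. The sketch below shows a hypothetical helper, `build_prompts`, that strips surrounding whitespace first. Whether this changes the generations is not tested here.

```python
from typing import List


def build_prompts(stories: List[str], template: str) -> List[str]:
    # Strip surrounding whitespace (including the trailing newline kept by
    # readlines) before appending the template after a single space
    return [f"{story.strip()} {template}" for story in stories]


# Made-up stories standing in for the lines read from the news file
example_stories = ["First story text.\n", "Second story text.\n"]
for prompt in build_prompts(example_stories, "TLDR;"):
    print(repr(prompt))
```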

In [12]:
generation_1 = []
for prompt_with_template_1, original_story in zip(prompts_with_template_1, news_stories):
    generation_1.append(model.generate(prompt_with_template_1, long_generation_config))
    print(f"Prompt: {prompt_template_summary_1}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_1[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Summarize the preceding text.
Original Length: 1262, Summary Length: 278
The US believes that Russia is sending captured Western weapons to Iran to reverse-engineer. The US believes that Russia is sending captured Western weapons to Iran to reverse-engineer. The US believes that Russia is sending captured Western weapons to Iran to reverse-engineer.

Prompt: Summarize the preceding text.
Original Length: 1181, Summary Length: 599

Prompt: Summarize the preceding text.
Original Length: 1260, Summary Length: 410
The state’s request comes as the Supreme Court is considering a case that could have a major impact on transgender rights. The justices are scheduled to hear arguments in a case involving a Virginia student who was barred from using the boys’ bathroom at his high school. The court is also considering a case involving a Colorado transgender student who was barred from using the girls’ bathroom at her school.



In [13]:
generation_2 = []
for prompt_with_template_2, original_story in zip(prompts_with_template_2, news_stories):
    generation_2.append(model.generate(prompt_with_template_2, long_generation_config))
    print(f"Prompt: {prompt_template_summary_2}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_2[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Short Summary:
Original Length: 1262, Summary Length: 589
- Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. - Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons and equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN.

Prompt: Short Summary:
Original Length: 1181, Summary Length: 200
- The National Weather Service has issued a flash flood watch for the San Francisco Bay Area, including the East Bay, the North Bay, and the Santa Cruz Mountains. - The watch is in effect from 4 a. m.

Prompt: Short Summary:
Original Length: 1260, Summary Length: 586
West Virg

In [14]:
generation_3 = []
for prompt_with_template_3, original_story in zip(prompts_with_template_3, news_stories):
    generation_3.append(model.generate(prompt_with_template_3, long_generation_config))
    print(f"Prompt: {prompt_template_summary_3}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_3[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: TLDR;
Original Length: 1262, Summary Length: 266
Russia is sending captured US weapons to Iran to reverse engineer. The US is concerned that Iran will use the captured weapons to attack US forces in the Middle East. The US is concerned that Iran will use the captured weapons to attack US forces in the Middle East.

Prompt: TLDR;
Original Length: 1181, Summary Length: 369

Prompt: TLDR;
Original Length: 1260, Summary Length: 454
West Virginia is asking the Supreme Court to allow it to enforce a state law that prohibits transgender women and girls from participating in public school sports. The state’s request comes as the Supreme Court is considering whether to take up a case that could have a major impact on transgender rights. The justices are scheduled to hear arguments in a case involving a Virginia student who was barred from using the boys’ bathroom at his high school.



Story 2 is about the possibility of severe flooding in California and an evacuation order being issued. Let's see what each of the three summaries captures and which one worked best.

In [15]:
print(f"{prompt_template_summary_1}|| {post_process_generations(generation_1[1].generation['sequences'][0])}")
print("====================================================================================")
print(f"{prompt_template_summary_2}|| {post_process_generations(generation_2[1].generation['sequences'][0])}")
print("====================================================================================")
print(f"{prompt_template_summary_3}|| {post_process_generations(generation_3[1].generation['sequences'][0])}")

Short Summary:|| - The National Weather Service has issued a flash flood watch for the San Francisco Bay Area, including the East Bay, the North Bay, and the Santa Cruz Mountains. - The watch is in effect from 4 a. m.


### Can we improve the results by providing additional context to our instructions?

In this example, we prompt the model to provide a summary and try to do so in a compact way. We still post-process the text to grab the first three sentences, but hopefully the model tries to pack more information into those first sentences.

In [16]:
prompt_template_summary_4 = "Summarize the text in as few words as possible:"
prompts_with_template_4 = [f"{news_story} {prompt_template_summary_4}" for news_story in news_stories]

generation_4 = []
for prompt_with_template_4, original_story in zip(prompts_with_template_4, news_stories):
    generation_4.append(model.generate(prompt_with_template_4, long_generation_config))
    print(f"Prompt: {prompt_template_summary_4}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_4[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Summarize the text in as few words as possible:
Original Length: 1262, Summary Length: 508
Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems. The US doesn’t believe that the issue is widespread or systematic, and the Ukrainian military has made it a habit since the beginning of the war to report to the Pentagon any losses of US-provided equipment to Russian forces. Still, US officials acknowledge that the issue is difficult to track.

Prompt: Summarize the text in as few words as possible:
Original Length: 1181, Summary Length: 549

Prompt: Summarize the text in as few words as possible:
Original Length: 1260, Summary Length: 593
The West Virginia Attorney General’s Office has asked the US Supreme Court to allow the state to enforce a law that prohibits transgender women and girls from participating in public school sp

Generative models have, in general, been reported to perform better when not prompted with "declarative" instructions or direct interrogatives (see the [OPT Paper](https://arxiv.org/abs/2205.01068)). To see how this plays out with Falcon, let's ask for the summary as a question!

In [17]:
prompt_template_summary_5 = "How would you briefly summarize the text?"
prompts_with_template_5 = [f"{news_story} {prompt_template_summary_5}" for news_story in news_stories]

generation_5 = []
for prompt_with_template_5, original_story in zip(prompts_with_template_5, news_stories):
    generation_5.append(model.generate(prompt_with_template_5, long_generation_config))
    print(f"Prompt: {prompt_template_summary_5}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_5[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: How would you briefly summarize the text?
Original Length: 1262, Summary Length: 416
The text is a very important document. It is a very important document because it is the first time that the United States and the European Union have agreed on a common position on the issue of the Russian invasion of Ukraine. It is a very important document because it is the first time that the United States and the European Union have agreed on a common position on the issue of the Russian invasion of Ukraine.

Prompt: How would you briefly summarize the text?
Original Length: 1181, Summary Length: 151
The text is about the California’s weather. What is the main idea of the text? The main idea of the text is that California is facing a severe weather.

Prompt: How would you briefly summarize the text?
Original Length: 1260, Summary Length: 264
The text of the law is very straightforward. It says that a student who is a biological male cannot participate in a girls’ sport or a girls’ athletic

Rephrasing the question will likely produce different summaries and may improve the results.

In [18]:
prompt_template_summary_6 = "Briefly, what is this story about?"
prompts_with_template_6 = [f"{news_story} {prompt_template_summary_6}" for news_story in news_stories]

generation_6 = []
for prompt_with_template_6, original_story in zip(prompts_with_template_6, news_stories):
    generation_6.append(model.generate(prompt_with_template_6, long_generation_config))
    print(f"Prompt: {prompt_template_summary_6}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_6[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Briefly, what is this story about?
Original Length: 1262, Summary Length: 488
Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. What is the significance of this story? The US and its allies have been providing Ukraine with a wide range of weapons and equipment since the beginning of the war, including Javelin anti-tank and Stinger anti-aircraft systems.

Prompt: Briefly, what is this story about?
Original Length: 1181, Summary Length: 157
The story is about the California weather. What is the main point of the story? The main point of the story is that California is experiencing a lot of rain.

Prompt: Briefly, what is this story about?
Original Length: 1260, Summary Length: 327
West Virginia is asking the US Supreme Court to allow it to enforce a state law that prohibit

As a final example, rather than asking a question, we put the task in a context that might be more natural for a generative model. That is, we ask it to "sum up" the article with a natural phrase prefix to be completed in a "conversational" way.

In [19]:
prompt_template_summary_7 = "In short,"
prompts_with_template_7 = [f"{news_story} {prompt_template_summary_7}" for news_story in news_stories]

generation_7 = []
for prompt_with_template_7, original_story in zip(prompts_with_template_7, news_stories):
    generation_7.append(model.generate(prompt_with_template_7, long_generation_config))
    print(f"Prompt: {prompt_template_summary_7}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(generation_7[-1].generation["sequences"][0])
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: In short,
Original Length: 1262, Summary Length: 416
the US is worried that Russia is giving Iran the blueprints for the weapons it has captured. The US has been providing Ukraine with weapons since the beginning of the war, and the US has been providing Ukraine with weapons since the beginning of the war. The US has been providing Ukraine with weapons since the beginning of the war, and the US has been providing Ukraine with weapons since the beginning of the war.

Prompt: In short,
Original Length: 1181, Summary Length: 488
the storm is expected to bring “significant” rainfall to the region, with some areas receiving up to 10 inches of rain, according to the National Weather Service. The storm is expected to bring “significant” rainfall to the region, with some areas receiving up to 10 inches of rain, according to the National Weather Service. The storm is expected to bring “significant” rainfall to the region, with some areas receiving up to 10 inches of rain, according to t

### Measuring Performance on CNN Daily Mail

In [5]:
dataset = load_dataset("cnn_dailymail", "3.0.0")

We load the data from the CNN Daily Mail test set, the ROUGE metric scorer from Hugging Face, and a Falcon tokenizer. The tokenizer is used to truncate each article so that it fits within the Falcon model's context. We truncate the text to 1023 tokens, so that it has length 1024 when the start-of-sentence token (`<s>`) is added.

__NOTE__: All Falcon models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

In [6]:
falcon_tokenizer = AutoTokenizer.from_pretrained("Rocketknight1/falcon-rw-1b")
dataloader = DataLoader(dataset["test"], shuffle=False, batch_size=10)
rouge = evaluate.load("rouge")
prompt_template_summary_1 = "How would you briefly summarize the text?"
prompt_template_summary_2 = "In summary,"

In [7]:
def truncate_article_text(article_text: str, tokenizer: AutoTokenizer, max_sequence_length: int = 1023) -> str:
    tokenized_article = tokenizer.encode(article_text, truncation=True, max_length=max_sequence_length)
    return tokenizer.decode(tokenized_article, skip_special_tokens=True)
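
To make the truncation step concrete without downloading a tokenizer, the sketch below applies the same encode, truncate, decode pattern with a toy whitespace tokenizer. `WhitespaceTokenizer` and `truncate_text` are hypothetical stand-ins. The real Falcon tokenizer operates on subword units, so its token counts will differ from word counts.

```python
from typing import List


class WhitespaceTokenizer:
    # Toy tokenizer: one token per whitespace-separated word
    def __init__(self) -> None:
        self.vocab: List[str] = []

    def encode(self, text: str, max_length: int) -> List[int]:
        ids = []
        for word in text.split()[:max_length]:
            if word not in self.vocab:
                self.vocab.append(word)
            ids.append(self.vocab.index(word))
        return ids

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


def truncate_text(text: str, tokenizer: WhitespaceTokenizer, max_sequence_length: int) -> str:
    # Encode with a length cap, then decode the kept tokens back to text
    return tokenizer.decode(tokenizer.encode(text, max_length=max_sequence_length))


print(truncate_text("one two three four five", WhitespaceTokenizer(), max_sequence_length=3))
```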

We'll try two prompts modeled on the examples above and consider how well each does in terms of ROUGE score against reference summaries on the CNN Daily Mail task, which is a common summarization benchmark. You can see a discussion of this dataset here: [CNN Daily Mail](https://huggingface.co/datasets/cnn_dailymail).
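
For intuition about what ROUGE-1 measures, the sketch below computes unigram-overlap precision, recall, and F1 for a single prediction and reference. It is a simplified illustration using lowercase whitespace tokenization. The Hugging Face `rouge` metric applies its own tokenization and aggregation, so its numbers will not match this function exactly. The name `rouge1` and the example strings are hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1(prediction: str, reference: str) -> Dict[str, float]:
    # Count unigrams using lowercase whitespace tokenization (a simplification)
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge1("the cat sat", "the cat sat on the mat"))
```

In this example every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5), which gives an F1 of about 0.67.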

Running the first prompt structure

(text) How would you briefly summarize the text?

In [13]:
# Running the first prompt type
max_batches = 10
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    print(f"Processing Batch: {batch_number}")
    truncated_articles = [truncate_article_text(text, falcon_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text} {prompt_template_summary_1}" for article_text in truncated_articles]

    summaries = []
    for p in prompts:
        summaries.append(model.generate(p, long_generation_config).generation["sequences"][0])

    # Let's just take the first 3 sentences, split by periods
    summaries = [post_process_generations(summary) for summary in summaries]
    # References for the metric need to be in the form of list of lists
    # (ROUGE can admit multiple references per prediction)
    highlights = [[highlight] for highlight in batch["highlights"]]
    results = rouge.compute(
        predictions=summaries,
        references=highlights,
        rouge_types=["rouge1"],
    )
    batch_rouge_scores.append(results["rouge1"])
# Average all the ROUGE-1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Processing Batch: 1
Processing Batch: 2
Processing Batch: 3
Processing Batch: 4
Processing Batch: 5
Processing Batch: 6
Processing Batch: 7
Processing Batch: 8
Processing Batch: 9
Processing Batch: 10
Final Rouge Score: 0.19531457510166395
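
The evaluation cell for the second prompt differs from the one above only in the template, so the loop could be factored into a function. The sketch below shows one way to do that. `evaluate_template`, `generate_fn`, and `score_fn` are hypothetical names, truncation and post-processing are omitted for brevity, and the stub functions at the bottom replace the model and ROUGE calls so that the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List


def evaluate_template(
    batches: Iterable[Dict[str, List[str]]],
    template: str,
    generate_fn: Callable[[str], str],
    score_fn: Callable[[List[str], List[str]], float],
    max_batches: int = 10,
) -> float:
    # Score each batch, then average the per-batch scores
    batch_scores = []
    for batch_number, batch in enumerate(batches, 1):
        if batch_number > max_batches:
            break
        prompts = [f"{article} {template}" for article in batch["article"]]
        summaries = [generate_fn(prompt) for prompt in prompts]
        batch_scores.append(score_fn(summaries, batch["highlights"]))
    return sum(batch_scores) / len(batch_scores)


# Stubs standing in for model.generate and rouge.compute
def fake_generate(prompt: str) -> str:
    return prompt.split()[0]


def exact_match(predictions: List[str], references: List[str]) -> float:
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)


toy_batches = [
    {"article": ["alpha beta", "gamma delta"], "highlights": ["alpha", "x"]},
    {"article": ["eps zeta"], "highlights": ["eps"]},
]
print(evaluate_template(toy_batches, "In summary,", fake_generate, exact_match))
```

Averaging per-batch scores matches the per-example average only when all batches have the same size, as in the cells of this notebook (10 batches of 10 articles). In the toy data the batch sizes differ, so the batch average is 0.75 while the per-example average would be 2/3.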


Running the second prompt structure

(text) In summary,

In [14]:
max_batches = 10
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    print(f"Processing Batch: {batch_number}")
    truncated_articles = [truncate_article_text(text, falcon_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text} {prompt_template_summary_2}" for article_text in truncated_articles]

    summaries = []
    for p in prompts:
        summaries.append(model.generate(p, long_generation_config).generation["sequences"][0])

    # Let's just take the first 3 sentences, split by periods
    summaries = [post_process_generations(summary) for summary in summaries]
    # References for the metric need to be in the form of list of lists
    # (ROUGE can admit multiple references per prediction)
    highlights = [[highlight] for highlight in batch["highlights"]]
    results = rouge.compute(
        predictions=summaries,
        references=highlights,
        rouge_types=["rouge1"],
    )
    batch_rouge_scores.append(results["rouge1"])
# Average all the ROUGE-1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Processing Batch: 1
Processing Batch: 2
Processing Batch: 3
Processing Batch: 4
Processing Batch: 5
Processing Batch: 6
Processing Batch: 7
Processing Batch: 8
Processing Batch: 9
Processing Batch: 10
Final Rouge Score: 0.22200761130138016


The second prompt, as measured by ROUGE-1 (roughly 0.222 versus 0.195 on this sample of 100 test articles), appears to produce summaries of higher quality than the first prompt. This is likely due to the way it is structured: it fits the "generative" training setting a bit better than asking a point-blank question.