# Fine-tuning and Prompting with Large Language Models

**Programming Data Intensive Applications** - Alma Mater Studiorum, University of Bologna

Professor:
- Gianluca Moro

Teaching Assistants:
- Giacomo Frisoni
- Lorenzo Molfetta
- Alessio Cocchieri

name.surname@unibo.it

## 📜 Outline

- [ 0 - Set up Kernel and Required Dependencies](#0)
- [ 1 - Prompting vs Fine-tuning in Text Classification](#1)
  - [ 1.1 - Dataset and Model Loading](#1.1)
  - [ 1.2 - Finetuning](#1.2)
  - [ 1.3 - Prompting](#1.3)
- [ 2 - Summarize Dialogue with Prompt Engineering](#2)
  - [ 2.1 - Dataset and Model Loading](#2.1)
  - [ 2.2 - Summarize Dialogue without Prompt Engineering](#2.2)
  - [ 2.3 - Summarize Dialogue with an Instruction Prompt](#2.3)
    - [ 2.3.1 - Zero Shot Inference with an Instruction Prompt](#2.3.1)
    - [ 2.3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5](#2.3.2)
  - [ 2.4 - Summarize Dialogue with One Shot and Few Shot Inference](#2.4)
    - [ 2.4.1 - One Shot Inference](#2.4.1)
    - [ 2.4.2 - Few Shot Inference](#2.4.2)
  - [ 2.5 - Generative Configuration Parameters for Inference](#2.5)


<a name='0'></a>
## ⚙️ 0 - Set up Kernel and Required Dependencies

First, check that the Runtime type is `Python 3` with a `GPU`-based hardware accelerator.

Go to "Runtime" $→$ "Change runtime type".

<a name='1'></a>
## 👨‍💻 1 - Prompting vs Fine-tuning in Text Classification

<a name='1.1'></a>
### 1.1 - Dataset and Model Loading

In [None]:
!pip install datasets
!pip install transformers[torch]
!pip install accelerate
!pip install bitsandbytes

In [None]:
from datasets import load_dataset

data = load_dataset('ag_news')

train_dataset = data["train"].shuffle(seed=42).select(range(2000))
test_dataset = data['test'].shuffle(seed=42).select(range(100))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

<a name='1.2'></a>
### 1.2 - Finetuning
Let's now fine-tune the model for text classification. Given the text of a news article as input, the model is trained to `generate` the corresponding topic, choosing among the following classes:
- "World",
- "Sports",
- "Business",
- "Sci/Tech"

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM
)

model_id = "google/flan-t5-base"

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
from sklearn.metrics import f1_score, accuracy_score
import numpy as np

def compute_metrics(output_info):
    """
    Compute metrics for the generation.
    """
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)

    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}

def one_hot_label(label_id):
  """
  Transform label id into one-hot vector.
  """
  base = [0,0,0,0]
  if label_id != -1:
    base[label_id] = 1

  return base


def parse_labels_txt(label):
  """
  Parse generated label into one-hot vector
  """
  conversion = {
      'World' : 0,
      'Sports' : 1,
      'Business' : 2,
      'Sci/Tech' : 3
  }

  try:
    conv_label = conversion[label]
  except KeyError:
    # The generated text is not one of the four known labels
    return -1

  out_label = one_hot_label(conv_label)

  return out_label

Let's preprocess the dataset for training.

In [None]:
# Map label id to label name
id2lbl = {
     0: "World",
     1: "Sports",
     2: "Business",
     3: "Sci/Tech"
 }

def formatting_prompt(x):
    """
    Define the input and output fields.
    """
    return f"Input: {x['text']}\nOutput:"


from datasets import Dataset

def preprocess_fun(x):
    results = tokenizer(formatting_prompt(x), max_length=128, padding='max_length', truncation=True)

    results['labels'] = tokenizer(id2lbl[x['label']], max_length=16, padding='max_length', truncation=True).input_ids

    return results


train_dataset_formatted = train_dataset
train_dataset_formatted = train_dataset_formatted.map(preprocess_fun, remove_columns=train_dataset_formatted.column_names)
train_dataset_formatted

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 2000
})
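
The labels above are padded to 16 tokens with the tokenizer's pad token. Because they then all have the same length, a data collator that pads labels with `-100` has nothing left to pad, so the padding positions would still count toward the loss. A minimal sketch of one common remedy is to replace pad ids with `-100` (the index that PyTorch's cross-entropy loss ignores by default) before training. The helper name `mask_pad_tokens` and the example ids are illustrative, and a pad id of `0` is assumed, as in the T5 tokenizer:

```python
def mask_pad_tokens(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in a label sequence with the index ignored by the loss."""
    return [ignore_index if tok == pad_token_id else tok for tok in label_ids]


# Illustrative label ids: two content tokens, an end-of-sequence id (1), then padding (0).
example_labels = [1150, 7, 1, 0, 0, 0]
masked_labels = mask_pad_tokens(example_labels, pad_token_id=0)
print(masked_labels)  # [1150, 7, 1, -100, -100, -100]
```

Inside `preprocess_fun`, this could be applied as `results['labels'] = mask_pad_tokens(results['labels'], tokenizer.pad_token_id)`.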

In [None]:
_ = model.to('cuda')

In [None]:
from transformers import TrainingArguments, default_data_collator
import transformers
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=32,
    learning_rate=1e-3,
    logging_steps=10,
    num_train_epochs=1,
    optim="paged_adamw_32bit"
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset_formatted,
    args=training_args,
    data_collator=data_collator,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

trainer.train()

Step,Training Loss
10,10.9602
20,0.6481
30,0.044
40,0.0334
50,0.0205
60,0.0254


TrainOutput(global_step=63, training_loss=1.8636217447263854, metrics={'train_runtime': 63.8016, 'train_samples_per_second': 31.347, 'train_steps_per_second': 0.987, 'total_flos': 342378676224000.0, 'train_loss': 1.8636217447263854, 'epoch': 1.0})
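
The `global_step=63` in the output follows from the dataset and batch sizes: 2,000 training examples in batches of 32 give 62 full batches plus one partial batch. A quick check, assuming a single device and no gradient accumulation (the variable names are illustrative):

```python
import math

num_examples = 2000   # size of the training subset selected earlier
batch_size = 32       # per_device_train_batch_size
num_epochs = 1        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 63
```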

In [None]:
from tqdm import tqdm

preds = []
for item in tqdm(test_dataset, total=len(test_dataset)):
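    # Tokenize as in training (same prompt, max length, and truncation)
    inputs = tokenizer(f"Input: {item['text']}\nOutput:", max_length=128, padding='max_length', truncation=True, return_tensors='pt').to('cuda')
    # Pass input_ids together with the attention mask so that padding tokens are ignored
    output = model.generate(**inputs, max_new_tokens=16)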
    preds.append(output)

100%|██████████| 100/100 [00:07<00:00, 14.18it/s]


In [None]:
actual_preds = [tokenizer.decode(pred[0], skip_special_tokens=True) for pred in preds]

actual_preds

['Sports',
 'Business',
 'Sports',
 'Business',
 'Sci/Tech',
 'Business',
 'Business',
 'Business',
 'Sci/Tech',
 'Sci/Tech',
 'Sports',
 'Sports',
 'Sports',
 'World',
 'World',
 'Sci/Tech',
 'World',
 'Sports',
 'World',
 'Sci/Tech',
 'World',
 'Business',
 'Sports',
 'Sports',
 'World',
 'Business',
 'Business',
 'Sci/Tech',
 'Business',
 'Business',
 'Business',
 'Sci/Tech',
 'Sci/Tech',
 'Business',
 'Sci/Tech',
 'Business',
 'Sports',
 'Business',
 'World',
 'Sports',
 'Sports',
 'World',
 'World',
 'Sci/Tech',
 'World',
 'Sports',
 'Business',
 'Sports',
 'Business',
 'Sports',
 'Sci/Tech',
 'World',
 'Business',
 'Business',
 'Sports',
 'Business',
 'World',
 'Sci/Tech',
 'Business',
 'Sports',
 'Sci/Tech',
 'Sports',
 'Sports',
 'Business',
 'Sports',
 'Sci/Tech',
 'Business',
 'Sports',
 'Sports',
 'Business',
 'World',
 'Sports',
 'World',
 'World',
 'World',
 'Sports',
 'Sports',
 'Sci/Tech',
 'Sci/Tech',
 'Business',
 'World',
 'Sports',
 'Sports',
 'World',
 'Sports',
 'S

In [None]:
preds, labels = [], []
for i in range(len(actual_preds)):
  parsed_pred = parse_labels_txt(actual_preds[i])
  # parse_labels_txt returns -1 when the generated text is not a known label
  if parsed_pred != -1:
    preds.append(parsed_pred)
    labels.append(test_dataset['label'][i])



metrics = compute_metrics([np.array(preds), np.array(labels)])
print(metrics)

{'f1': 0.7912203539636777, 'acc': 0.8}
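
The evaluation above is meant to skip any generation that cannot be mapped to one of the four labels, so the reported scores cover only the parseable outputs. A stricter alternative, sketched below, counts unparseable generations as errors. The helper `strict_accuracy` and the sample data are hypothetical and not part of the lab:

```python
LABEL_TO_ID = {"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3}


def strict_accuracy(pred_texts, gold_ids):
    """Accuracy where any generation outside the label set counts as wrong."""
    if len(pred_texts) != len(gold_ids):
        raise ValueError("pred_texts and gold_ids must have the same length")
    # Unknown strings map to -1, which never equals a gold id.
    pred_ids = [LABEL_TO_ID.get(text.strip(), -1) for text in pred_texts]
    correct = sum(p == g for p, g in zip(pred_ids, gold_ids))
    return correct / len(gold_ids)


# Hypothetical generations: the third one is not a valid label.
sample_preds = ["Sports", "Business", "Sci", "World"]
sample_gold = [1, 2, 3, 0]
print(strict_accuracy(sample_preds, sample_gold))  # 0.75
```

Comparing this number with the score on parseable outputs only shows how much of the gap comes from malformed generations rather than wrong labels.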


<a name='1.3'></a>
### 1.3 - Prompting

Prompting is an input formatting technique that has proven effective at eliciting latent capabilities of a language model. By specifying the task at hand in just a few lines, we can significantly boost performance.

Since a few lines can do the trick, their wording is crucial. The process of finding the best instruction text is known as __Prompt Engineering__.
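
One simple way to make this search systematic is to score several candidate templates on a small labelled sample and keep the best one. The sketch below assumes a hypothetical `predict` callable that maps a prompt string to a label string (for instance, a thin wrapper around a text-generation pipeline). The templates and the toy `keyword_predict` stand-in are illustrative only:

```python
def template_accuracy(template, examples, predict):
    """Fraction of (text, label) pairs for which predict(prompt) returns the label."""
    hits = 0
    for text, label in examples:
        prompt = template.format(text=text)
        if predict(prompt).strip() == label:
            hits += 1
    return hits / len(examples)


def best_template(templates, examples, predict):
    """Return the (template, accuracy) pair with the highest accuracy."""
    scored = [(t, template_accuracy(t, examples, predict)) for t in templates]
    return max(scored, key=lambda pair: pair[1])


def keyword_predict(prompt):
    """Toy stand-in for a model: it only answers well when asked to classify."""
    if "Classify" not in prompt:
        return "unknown"
    return "Sports" if "match" in prompt else "Business"


candidate_templates = [
    "{text}",
    "Classify the following news article.\n{text}\nLabel:",
]
labelled_sample = [("The match ended 2-1.", "Sports"), ("Shares fell on Monday.", "Business")]
winner, accuracy = best_template(candidate_templates, labelled_sample, keyword_predict)
print(repr(winner), accuracy)
```

In practice the labelled sample should be held out from the final test set, otherwise the chosen template is tuned to the data it is later evaluated on.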

In [None]:
from transformers import pipeline
pipeline = pipeline(model='google/flan-t5-base')

In [None]:
INSTRUCTION_PROMPT = """
## Instruction:
You are an expert in text classification. Answer in a concise manner. Classify the following news article according to one of these domain:
- World
- Business
- Sports
- Sci/Tech

Answer only with the correct label.

## News:
{text}

## Label:
"""

In [None]:
IDS = 0

news = data['train']['text'][IDS]
input_prompt = INSTRUCTION_PROMPT.format(text=news)

out = pipeline(input_prompt, max_new_tokens=4, num_return_sequences=1)

In [None]:
print(out[0]['generated_text'])

Business


Let's now test the performance of prompting on the whole test dataset.

In [None]:
from tqdm import tqdm

predictions = []

for n_ids, news in tqdm(enumerate(test_dataset['text']), total=len(test_dataset['text'])):
  input_prompt = INSTRUCTION_PROMPT.format(text=news)

  out = pipeline(input_prompt, max_new_tokens=4, num_return_sequences=1)
  pred_label = out[0]['generated_text'].split('## Label:')[-1].strip()

  predictions.append(pred_label)

100%|██████████| 100/100 [01:08<00:00,  1.46it/s]


In [None]:
preds, labels = [], []
for i in range(len(predictions)):
  parsed_pred = parse_labels_txt(predictions[i])
  # parse_labels_txt returns -1 when the generated text is not a known label
  if parsed_pred != -1:
    preds.append(parsed_pred)
    labels.append(test_dataset['label'][i])


metrics = compute_metrics([np.array(preds), np.array(labels)])
print(metrics)

{'f1': 0.8113839285714286, 'acc': 0.82}


<a name='2'></a>
## 👨‍💻 2 - Summarize Dialogue with Prompt Engineering

In this part of the lab, you will perform dialogue summarization with a generative language model. You will explore **how the input text affects the output of the model** and perform **prompt engineering** to steer it towards the task you need. By comparing zero shot, one shot, and few shot inference, you will take a first step into prompt engineering and see how it can improve the output of Large Language Models.

Now install the required packages to use PyTorch and Hugging Face transformers and datasets.



In [None]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0  --quiet



Import the classes needed to load the dataset, the Large Language Model (LLM), the tokenizer, and the generation configuration. Do not worry if you do not yet understand all of these components: they are described and discussed later in the notebook.

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

<a name='2.1'></a>
### 2.1 - Dataset and Model Loading

In this use case, you will be generating a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging Face. The list of available models in the Hugging Face `transformers` package can be found [here](https://huggingface.co/docs/transformers/index).

Let's load some simple dialogues from the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. This dataset contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

Print a couple of dialogues with their baseline summaries.

In [None]:
example_indices = [40, 200]

dash_line = '-' * 99  # separator line for the printed output

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Exa

Load the [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5), creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method.

In [None]:
model_name='google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

To perform encoding and decoding, you need to work with text in a tokenized form. **Tokenization** is the process of splitting text into smaller units that can be processed by the LLM.

Download the tokenizer for the FLAN-T5 model using the `AutoTokenizer.from_pretrained()` method. The `use_fast` parameter enables the fast tokenizer. At this stage, there is no need to go into the details, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Test the tokenizer by encoding and decoding a simple sentence:

In [None]:
sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0],
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

DECODED SENTENCE:
What time is it, Tom?


<a name='2.2'></a>
### 2.2 - Summarize Dialogue without Prompt Engineering

Now it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. **Prompt engineering** is the practice of modifying the **prompt** (input) to improve the response for a given task.

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION - WITHOUT PROMPT ENGINEERING:
Person1: It's ten to nine.

-------------------------------

You can see that the model's guesses make some sense, but it does not seem to know what task it is supposed to accomplish. It appears to simply make up the next sentence in the dialogue. Prompt engineering can help here.

<a name='2.3'></a>
### 2.3 - Summarize Dialogue with an Instruction Prompt

Prompt engineering is an important concept in using foundation models for text generation.

<a name='2.3.1'></a>
#### 2.3.1 - Zero Shot Inference with an Instruction Prompt

To instruct the model to perform a task (here, summarizing a dialogue), you can wrap the dialogue in an instruction prompt. Because the prompt contains no solved examples of the task, this is often called **zero shot inference**.

Wrap the dialogue in a descriptive instruction and see how the generated text will change:

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

Summary:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
The train is about to

This is much better! However, the model still does not pick up on the nuance of the conversations.

**Exercise:**

- Experiment with the `prompt` text and see how the outputs change. Do they change if you end the prompt with an empty string instead of `Summary: `?
- Try rephrasing the beginning of the `prompt` text from `Summarize the following conversation.` to something different, and see how it influences the generated output.

<a name='2.3.2'></a>
#### 2.3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5

Let's use a slightly different prompt. FLAN-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/tree/main/flan/v2). In the following code, you will use one of the [pre-built FLAN-T5 prompts](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py):

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Tom is late for the train.

--------------

Notice that this prompt from FLAN-T5 helped a bit, but the model still struggles to pick up on the nuance of the conversation. This is what you will try to solve with few shot inference.

<a name='2.4'></a>
### 2.4 - Summarize Dialogue with One Shot and Few Shot Inference

**One shot and few shot inference** are the practices of providing an LLM with one or more full examples of prompt-response pairs that match your task, placed before the actual prompt that you want completed. This is called "in-context learning": the examples condition the model on your specific task without updating its weights. You can read more about it in [this blog from Hugging Face](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api).

<a name='2.4.1'></a>
#### 2.4.1 - One Shot Inference

Let's build a function that takes a list of `example_indices_full`, generates a prompt with full examples, and then appends the prompt that you want the model to complete (`example_index_to_summarize`). You will use the same FLAN-T5 prompt template as in Section [2.3.2](#2.3.2).

In [None]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""

    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""

    return prompt

Construct the prompt to perform one shot inference:

In [None]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also ne

Now pass this prompt to perform the one shot inference:

In [None]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to add a CD-ROM drive.


<a name='2.4.2'></a>
#### 2.4.2 - Few Shot Inference

Let's explore few shot inference by adding two more full dialogue-summary pairs to your prompt.

In [None]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: May, do you mind helping me prepare for the picnic?
#Person2#: Sure. Have you checked the weather report?
#Person1#: Yes. It says it will be sunny all day. No sign of rain at all. This is your father's favorite sausage. Sandwiches for you and Daniel.
#Person2#: No, thanks Mom. I'd like some toast and chicken wings.
#Person1#: Okay. Please take some fruit salad and crackers for me.
#Person2#: Done. Oh, don't forget to take napkins disposable plates, cups and picnic blanket.
#Person1#: All set. 

Now pass this prompt to perform a few shot inference:

In [None]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (819 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to upgrade his hardware.
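
The warning above reports an 819-token prompt against a specified maximum of 512 tokens for this model, so the few-shot prompt is longer than the tokenizer expects. One possible safeguard is to add examples only while the prompt stays within a token budget. The sketch below is tokenizer-agnostic: `count_tokens` is a hypothetical callable (with the lab's tokenizer it could be `lambda s: len(tokenizer(s).input_ids)`), and the whitespace counter used in the demo is only a rough stand-in:

```python
def build_prompt_within_budget(example_blocks, query_block, count_tokens, max_tokens=512):
    """Prepend as many example blocks as fit, always keeping the final query block."""
    kept = []
    for block in example_blocks:
        candidate = "".join(kept + [block]) + query_block
        if count_tokens(candidate) > max_tokens:
            break  # adding this example would exceed the budget
        kept.append(block)
    return "".join(kept) + query_block, len(kept)


def whitespace_count(text):
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Made-up blocks that mimic the "examples first, query last" prompt layout.
examples = ["Dialogue: a b c\nSummary: x\n\n", "Dialogue: d e f g h\nSummary: y\n\n"]
query = "Dialogue: i j\nSummary:"
prompt, n_used = build_prompt_within_budget(examples, query, whitespace_count, max_tokens=12)
print(n_used)  # 1
```

Examples are added in order and the loop stops at the first one that does not fit, so the most useful examples should come first in `example_blocks`.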


<a name='2.5'></a>
### 2.5 - Generative Configuration Parameters for Inference

You can change the configuration parameters of the `generate()` method to see a different output from the LLM. So far the only parameter that you have been setting was `max_new_tokens=50`, which defines the maximum number of tokens to generate. A full list of available parameters can be found in the [Hugging Face Generation documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig).

A convenient way of organizing the configuration parameters is to use the `GenerationConfig` class.

<img src="https://media.licdn.com/dms/image/D4D12AQGfm8FgLin7pQ/article-cover_image-shrink_720_1280/0/1690498156500?e=2147483647&v=beta&t=rWdwF388L4aZwSPERYEjcDRoGLuQO-mPQXBtwP4OwCE" width="800"/>

</br>
</br>
</br>
</br>

<img src="https://pbs.twimg.com/media/F31h_9VWIAEuw41.jpg:large" width="1400"/>






**Exercise:**

Change the configuration parameters to investigate their influence on the output.

Setting `do_sample=True` switches from greedy decoding to sampling: the next token is drawn from the probability distribution over the entire vocabulary instead of always being the most likely one. You can then shape that distribution with `temperature` and restrict the candidate tokens with `top_k` and `top_p`.

Uncomment the lines in the cell below and rerun the code. Try to analyze the results. You can read some comments below.

In [None]:
generation_config = GenerationConfig(max_new_tokens=50)
# generation_config = GenerationConfig(max_new_tokens=10)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to upgrade his hardware.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.



Comments related to the choice of the parameters in the code cell above:
- Choosing `max_new_tokens=10` makes the output too short, so the dialogue summary is cut off.
- Setting `do_sample=True` and raising the `temperature` makes the output more varied; values close to 0 make it nearly deterministic.
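
To make the effect of `temperature` and `top_k` concrete, the sketch below applies them to a toy set of next-token scores. It is a simplified illustration in plain Python, not the `transformers` implementation, and the logits and function names are made up for the example:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for temperature in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(top_k_filter(softmax_with_temperature(toy_logits), k=2))
```

At a temperature of 0.1 almost all of the probability goes to the top-scoring token, which approaches greedy decoding, while at 1.0 the scores are converted to probabilities without rescaling.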

As you can see, prompt engineering can take you a long way for this use case, but there are some limitations. Next, you will start to explore how you can use fine-tuning to help your LLM to understand a particular use case in better depth!

# 🏁 The End!