# 1. Introduction to the Project
This project aims to explore the basics of prompt engineering with large language models (LLMs). We'll see how different prompts affect the generative outputs of LLMs, focusing on summarizing dialogues. We'll compare zero-shot, one-shot, and few-shot inference methods to understand their impact on model performance.

# 2. Setup: Installing Dependencies
To get started, we need to install the necessary Python libraries, including torch for PyTorch, transformers for accessing pre-trained models, and datasets to load and handle data.

In [1]:
!pip install torch transformers
!pip install -U datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

# 3. Loading the LLM and Tokenizer
We'll use the FLAN-T5 model from Hugging Face, which is suitable for text-to-text tasks such as summarization. This model needs a tokenizer to preprocess text inputs into tokens that the model can understand.


In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
from transformers import GenerationConfig

Now we load the FLAN-T5 model, using the google/flan-t5-large checkpoint.


In [3]:
model_name='google/flan-t5-large'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

LLMs do not operate on raw, human-readable text, and they do not see it word by word. Instead, they are trained on and run with tokens. Tokenization is the process of splitting text into smaller units, each mapped to an integer ID that the model can process.
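To make the idea of splitting text into smaller units concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary and the `toy_encode` / `toy_decode` helpers are hypothetical and exist only for illustration; FLAN-T5's real tokenizer uses a learned SentencePiece vocabulary and works differently in detail.

```python
# Toy illustration only: a hypothetical vocabulary, not FLAN-T5's real one.
TOY_VOCAB = {"<unk>": 0, "sum": 1, "mar": 2, "ize": 3, "the": 4,
             "dialog": 5, "ue": 6, " ": 7}
ID_TO_PIECE = {idx: piece for piece, idx in TOY_VOCAB.items()}


def toy_encode(text):
    """Greedily split `text` into the longest known pieces and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            # No piece matched: emit the unknown token and skip one character.
            ids.append(TOY_VOCAB["<unk>"])
            pos += 1
    return ids


def toy_decode(ids):
    """Map IDs back to their pieces and join them into a string."""
    return "".join(ID_TO_PIECE[i] for i in ids)


ids = toy_encode("summarize the dialogue")
print(ids)
print(toy_decode(ids))
```

The word "summarize" is not in the toy vocabulary as a whole, so it is covered by three smaller pieces. This is the same general idea that lets a subword tokenizer handle words it has never seen as a single unit.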


So let's download the tokenizer for the FLAN-T5 model.

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Let's pass a sample sentence to the tokenizer and see how it encodes and decodes it.

In [5]:
sentence = "When does a man die? When he is hit by a bullet? No! When he suffers a disease? No! When he ate a soup made out of a poisonous mushroom? No! A man dies when he is forgotten!"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0],
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([  366,   405,     3,     9,   388,    67,    58,   366,     3,    88,
           19,  1560,    57,     3,     9, 11126,    58,   465,    55,   366,
            3,    88,  5696,     7,     3,     9,  1994,    58,   465,    55,
          366,     3,    88,     3,   342,     3,     9,  5759,   263,    91,
           13,     3,     9, 14566,  1162, 25415,    58,   465,    55,    71,
          388,    67,     7,   116,     3,    88,    19, 11821,    55,     1])

DECODED SENTENCE:
When does a man die? When he is hit by a bullet? No! When he suffers a disease? No! When he ate a soup made out of a poisonous mushroom? No! A man dies when he is forgotten!


# 4. Loading and Exploring the Dataset
We'll use the DialogSum dataset (knkarthick/dialogsum), which contains over 10,000 dialogues with labeled summaries, split into 12,460 training, 500 validation, and 1,500 test examples. This dataset will help us test how well our model can summarize conversations.

In [6]:
dataset = load_dataset('knkarthick/dialogsum')

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Let's have a look at our dataset. We will print a few dialogues and their corresponding summaries to get an idea of what we are working with. These human-written summaries serve as our baseline.

In [7]:
example_indices = [400, 600]

dash_line = '-' * 99  # 99-dash separator, same result as '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:


# 5. Baseline Summarization without Prompt Engineering
In this section, we'll see how the model performs without any specific prompt engineering. We'll feed raw dialogues into the model and observe the outputs compared to human-written summaries.

In [8]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#P

The model captures the main topic of each dialogue (a storm, Chinese cuisine) but lacks detail and context. Without explicit guidance, it tends to overlook nuances and produce overly simplified summaries. This shows the limitation of feeding in the raw dialogue with no instruction: the model has to rely on what it learned during training to guess the task, and it may not grasp the specific context or requirements.

# 6. Summarization with Prompt Engineering
Prompt engineering involves crafting specific instructions that direct the model toward desired outputs. In this project, we explore three primary methods of prompt engineering: Zero-Shot, One-Shot, and Few-Shot inference. Each method progressively provides more context to the model, which improves the accuracy and relevance of the generated summaries.
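As a rough sketch of how the three methods differ, the only thing that changes is how many worked examples are placed before the target dialogue. The `build_prompt` helper and the toy dialogues below are hypothetical and are not taken from the dataset; the template text mirrors the one used later in this notebook.

```python
def build_prompt(examples, dialogue):
    """Assemble a prompt from (dialogue, summary) example pairs plus the target dialogue."""
    parts = []
    for example_dialogue, example_summary in examples:
        # Each worked example ends with its summary and blank lines as a separator.
        parts.append(
            f"Dialogue:\n\n{example_dialogue}\n\nWhat was going on?\n{example_summary}\n\n\n"
        )
    # The target dialogue comes last, with the summary left for the model to fill in.
    parts.append(f"Dialogue:\n\n{dialogue}\n\nWhat was going on?\n")
    return "".join(parts)


# Hypothetical toy data for illustration.
examples = [
    ("#Person1#: Is the shop open?\n#Person2#: Yes, until six.",
     "#Person2# tells #Person1# the shop is open until six."),
    ("#Person1#: Did you get my email?\n#Person2#: Not yet, I'll check now.",
     "#Person2# will check for #Person1#'s email."),
]
target = "#Person1#: Can you lend me a pen?\n#Person2#: Sure, here you go."

zero_shot = build_prompt([], target)            # instruction only, no examples
one_shot = build_prompt(examples[:1], target)   # one worked example
few_shot = build_prompt(examples, target)       # several worked examples

for name, prompt in [("zero", zero_shot), ("one", one_shot), ("few", few_shot)]:
    print(name, "shot prompt length in characters:", len(prompt))
```

Each added example makes the prompt longer, which matters because the model accepts only a limited number of input tokens.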

## 6.1 Zero-Shot Inference
Definition:
Zero-shot inference involves asking the model to perform a task without providing any examples. The model relies solely on the instruction provided to understand what is expected.

In [9]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation. As briefly as you can while retaining all the important details.

{dialogue}

Summary:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation. As briefly as you can while retaining all the important details.

#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.

Summary:
    
------

With a simple instruction such as "Summarize the following conversation," the model produces output that better reflects the dialogue. It begins to incorporate elements from both participants, showing that even minimal prompt engineering can steer the model toward the broader context. However, it may still miss nuances such as emotional tone or opinions, which motivates more structured prompting.

## 6.2 Zero-Shot Inference with Template from FLAN-T5
Definition:
Using a model-specific template enhances zero-shot inference by leveraging predefined structures that the model is familiar with. These templates are designed to align with the model's training data, improving comprehension and output quality.

In [10]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Dialogue:

#Person1#: It was a heavy storm last night, wasn't it?
#Person2#: It certainly was. The wind broke several windows. What weather!
#Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night.
#Person2#: Really? Did it do any damage to your home?
#Person1#: Thank goodness! It is far away from that.
#Person2#: I really hate storms. It's about time we had some nice spring weather.
#Person1#: It's April, you know. The flowers are beginning to blossom.
#Person2#: Yes, that's true. But I still think the weather is terrible.
#Person1#: I suppose we should not complain. We had a fine March after all.

What was going on?

---------------------------------------------------------------------------------------------

Using a model-specific prompt template aligns the task with how the model was trained, resulting in more accurate and contextually rich outputs. The summaries are more detailed and capture key events and recommendations, showing that template-based prompts enhance the model's ability to focus on important details. This approach proves effective in directing the model's attention and improving output relevance.

## 6.3 One-Shot Inference
Definition:
One-shot inference provides the model with a single example alongside the instruction. This example serves as a reference, helping the model understand the expected output format and style.



In [11]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""

    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""

    return prompt

In [12]:
example_indices_full = [40]
example_index_to_summarize = 600

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: Oh, I'm starving. It's my first time to China. And I'd like to try some real Chinese cuisine. What would you recommend?
#Person2#: Well, depends. You see, there are eight famous Chinese food cuisines, for instance, Sichuan cuisine and Hunan cuisine.
#Person1#: There're all spicy or hot of heard.
#Person2#: That's right. If you have hot dishes, you can try some.
#Person1#: I cannot have it. Last time I had some in the US. It almost killed me.
#Person2#: And there are Cantonese and Kiangsu cuisi

In [13]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (634 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
It's #Person1#'s first time to China and #Person1# wants some Chinese cuisine. #Person2# recommends some but it's too far and #Person1# is starving. Then #Person2# suggests a nearby Quanjude restaurant and its Beijing roast duck. #Person1# will go there.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
Person1 is on his first trip to China. He wants to try some Chinese cuisine. He is not sure about the Cantonese restaurant. He wants to try the Beijing dishes restaurant. He will show the taxi driver the name of the Qu


Providing one example helps the model infer the expected output format, so the summary reads closer to the human-written one. Here the model picks up more of the context and interaction details (the first trip to China, the choice of restaurant, the taxi), although the output appears to be cut off, most likely by the max_new_tokens=50 limit. The example is used only as context in the prompt; the model is not retrained on it.
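The tokenizer warning in the output above (634 > 512) indicates that this prompt was longer than the maximum sequence length the tokenizer reports for the model. One possible way to guard against this, sketched below under stated assumptions, is to drop in-context examples until the prompt fits a token budget. The `fit_examples` helper and the whitespace-based `count_tokens_stub` are hypothetical; in the notebook one could instead count tokens with `len(tokenizer(prompt)["input_ids"])`.

```python
def count_tokens_stub(text):
    """Stand-in token counter (whitespace split), used here so the sketch runs offline."""
    return len(text.split())


def fit_examples(example_blocks, target_block, max_tokens, count_tokens=count_tokens_stub):
    """Drop the earliest examples until the assembled prompt fits in `max_tokens`.

    Returns the prompt and the number of examples that were kept. If even the
    target block alone is too long, it is returned unchanged with zero examples.
    """
    kept = list(example_blocks)
    while True:
        prompt = "".join(kept) + target_block
        if count_tokens(prompt) <= max_tokens or not kept:
            return prompt, len(kept)
        kept.pop(0)  # remove the earliest example and try again


# Hypothetical blocks: 5 + 3 example tokens and a 2-token target.
blocks = ["a b c d e\n", "f g h\n"]
target = "x y\n"
prompt, n_kept = fit_examples(blocks, target, max_tokens=6)
print(n_kept, repr(prompt))
```

Dropping the earliest example first is only one policy; keeping the shortest examples or truncating the target dialogue are alternatives with different trade-offs.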

## 6.4 Few-Shot Inference
Definition:
Few-shot inference provides multiple examples, giving the model a broader understanding of the task requirements. The examples are supplied in the prompt only; the model's weights are not updated, so this is in-context learning rather than training.

In [14]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: May, do you mind helping me prepare for the picnic?
#Person2#: Sure. Have you checked the weather report?
#Person1#: Yes. It says it will be sunny all day. No sign of rain at all. This is your father's favorite sausage. Sandwiches for you and Daniel.
#Person2#: No, thanks Mom. I'd like some toast and chicken wings.
#Person1#: Okay. Please take some fruit salad and crackers for me.
#Person2#: Done. Oh, don't forget to take napkins disposable plates, cups and picnic blanket.
#Person1#: All set. 

In [15]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Person2 is considering upgrading her system. She would like to add a painting program to her software. She needs a faster processor, more memory, a faster modem and a CD-ROM drive.


With few-shot inference, the model sees several worked examples in its prompt and can infer the expected format and level of detail from them. In the example above, the generated summary lists specific details from the dialogue (the painting program, processor, memory, modem, and CD-ROM drive) that the one-sentence human summary leaves out, although it does not mention #Person1#'s role. Adding examples also lengthens the prompt, so the model's input length limit constrains how many examples can be included.

# 7. Configuring Generation Parameters
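Different configurations in the generation process can affect the quality of the output. We experiment with settings such as max_new_tokens, do_sample, and temperature to control the length and randomness of the generated summaries: max_new_tokens caps the number of generated tokens, do_sample switches from deterministic decoding to sampling, and temperature rescales the token distribution when sampling is enabled.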

In [16]:
generation_config = GenerationConfig(max_new_tokens=50)
# generation_config = GenerationConfig(max_new_tokens=10)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
Person2 is considering upgrading her system. She would like to add a painting program to her software. She needs a faster processor, more memory, a faster modem and a CD-ROM drive.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.



This section shows how tuning generation parameters allows fine control over the summary characteristics. These configurations help balance between detail and conciseness, ensuring that the output meets specific requirements such as clarity, brevity, or creativity. The results illustrate the flexibility of LLMs in adapting outputs based on parameter settings, crucial for tailoring responses to diverse use cases.
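To see why temperature changes the randomness of sampled output, recall that in sampling-based decoding the scores (logits) for the next token are commonly divided by the temperature before being turned into probabilities with a softmax. The sketch below uses hypothetical logits for three candidate tokens; it illustrates the general mechanism and is not the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
for temperature in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the highest-scoring token, so sampling behaves much like greedy decoding. As the temperature rises, the distribution flattens and lower-scoring tokens are sampled more often.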

# 8. Results and Observations
We compare the model-generated summaries against the baseline human-written summaries. This comparison helps us evaluate the effectiveness of prompt engineering and the model's performance across zero-shot, one-shot, and few-shot settings.
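The comparison in this notebook is done by reading the outputs side by side. As a rough, optional complement, a word-overlap score can put a number on how much a generated summary shares with the human one. The `unigram_f1` function below is a simplified, hypothetical stand-in in the spirit of ROUGE-1; it was not used to produce any result in this notebook and does not replace a proper ROUGE implementation.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 score over lowercase, whitespace-separated word overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Summaries copied from the few-shot output shown earlier in this notebook.
human = "#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system."
model_output = ("Person2 is considering upgrading her system. She would like to add a painting "
                "program to her software. She needs a faster processor, more memory, a faster "
                "modem and a CD-ROM drive.")
print(round(unigram_f1(model_output, human), 3))
```

Word overlap rewards shared vocabulary, not correctness, so a summary can score low while still being accurate. Scores like this are best read alongside the outputs themselves.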

# 9. Conclusion
Through these different output scenarios, we observe that prompt engineering and example-driven inference significantly enhance the ability of large language models to generate accurate, context-rich summaries. The results underline the importance of crafting appropriate prompts and selecting inference methods that align with the complexity and requirements of the task, ensuring that the generated outputs are useful, relevant, and aligned with human expectations.