# Summerize the text

In [1]:
#for using huggingface datasets
!pip install datasets



In [2]:
# import dependencies

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

## Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding manually labeled summaries and topics.

In [3]:
# load dataset of huggingface dialouge and base human summary

huggfaceDataset = "knkarthick/dialogsum"
dataset = load_dataset(huggfaceDataset)

README.md: 0.00B [00:00, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

##  check dialogsum dataset

In [5]:

# name of columns
dataset.column_names

{'train': ['id', 'dialogue', 'summary', 'topic'],
 'validation': ['id', 'dialogue', 'summary', 'topic'],
 'test': ['id', 'dialogue', 'summary', 'topic']}

In [12]:
# number of rows
dataset.num_rows

{'train': 12460, 'validation': 500, 'test': 1500}

In [14]:
# number of columns
dataset.num_columns

{'train': 4, 'validation': 4, 'test': 4}

In [15]:
dataset['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 12460
})

In [31]:
# check sample of 999 of dataset
print(dataset['train'][999]['dialogue'])

#Person1#: do you like animals? I really like dogs. 
#Person2#: so do i. I don't like cats. 
#Person1#: why? I think cats are ok. 
#Person2#: I can't bear being near cats. They don't seem to like me either. 
#Person1#: I like wild animals. I don't like spiders and snakes. I think spiders and snakes are disgusting. 
#Person2#: I'm fond of snakes. I think they're great. I agree with you about spiders though. I think spiders are horrible. I think it's because they have so many legs. 
#Person1#: I think bears are wonderful. Pandas are fantastic. I low the people who kill them for their fur. 
#Person2#: I agree. I'm carzy about mice. I think they're so cute! 
#Person1#: really? I don't see the attraction. I'm afraid of mice. 


In [32]:
print(dataset['train'][999]['summary'])

#Person1# and #Person2# are sharing their different attitudes towards different animals. They have opposite preferences of some animals, like snacks and mice.


In [34]:
dataset['train'][999]['topic']

'discuss animals'

In [68]:
# sample of 500 and 1200
example_indices = [500, 1200]

dash_line = '*'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1 ,'\n')
    print(dash_line)
    print('INPUT DIALOGUE:','\n')
    print(dataset['test'][index]['dialogue'],'\n')
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:','\n')
    print(dataset['test'][index]['summary'],'\n')
    print(dash_line)
    print()

***************************************************************************************************
Example  1 

***************************************************************************************************
INPUT DIALOGUE: 

#Person1#: Are you going anywhere for your vacation?
#Person2#: Yes, we're making plans for a tour.
#Person1#: That'll be lovely. Where are you going?
#Person2#: Well, we will start out from Long Island this Friday. We've planned a four day drive to Salt Lake City, where we'll join my brother and his family on his fortieth birthday.
#Person1#: Well, you've got to prepare a lot of food and enough sleeping bags then.
#Person2#: Oh, we'll spend the nights in hotels and enjoy local food as we pass by. How does it sound, David?
#Person1#: It sounds good. You can do a lot of sightseeing, too.
#Person2#: Yes, we'll take our time. And we'll go to Five Lake Strict and the Wall Street.
#Person1#: So, you're going to have a really nice vacation.
#Person2#: You can say t

# [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)

## Overview
FLAN-T5 was released in the paper Scaling Instruction-Finetuned Language Models - it is an enhanced version of T5 that has been finetuned in a mixture of tasks.

An example of FLAN-T5 from huggingface

The goal of this code is to use a pre-trained sequence-to-sequence language model to generate text — specifically, to complete or continue a given text prompt as a helpful assistant

In [40]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

inputs = tokenizer("A step by step recipe to make bolognese pasta:" ,return_tensors = 'pt')
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['Pour a cup of bolognese into a large bowl and add the pasta to']


### now step by step explain how to use this model to summerize.



In [41]:
# load the model flan-t5 base


model_name='google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Download the tokenizer for the **FLAN-T5** model using `AutoTokenizer.from_pretrained()` method. Parameter `use_fast` switches on fast tokenizer. At this stage, there is no need to go into the details of that, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [42]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

`Text Input → Tokenization → Numerical Encoding → Vector Representation → Decoding → Text Output`


## Detailed Technical Breakdown

1. Text Acquisition & Preprocessing

Input Reception: Capture raw textual input from source (user query, document, API request)

Normalization: Standardize casing, remove extraneous whitespace, handle special characters

Sanitization: Filter inappropriate content, validate input boundaries

2. Tokenization Phase

Segmentation: Divide continuous text into discrete linguistic units (tokens)

Methodology: Employ subword tokenization (e.g., WordPiece, Byte-Pair Encoding)

Special Tokens: Insert control tokens ([CLS], [SEP], [PAD]) for model-specific processing

3. Numerical Encoding

Vocabulary Mapping: Convert each token to corresponding integer ID from pretrained vocabulary

Vector Creation: Generate tensor representation for batch processing

Attention Masks: Create binary masks distinguishing actual tokens from padding

4. Model Processing (Vector Operations)

Embedding Lookup: Convert token IDs to dense vector representations

Neural Transformation: Apply transformer architecture (self-attention, feed-forward layers)

Contextualization: Generate context-aware representations via multi-head attention

5. Decoding & Text Reconstruction

Token Generation: Produce output token IDs through autoregressive sampling

Detokenization: Map numerical IDs back to string tokens

Post-processing: Remove special tokens, reconstruct original formatting

In [60]:
# sentence
txt = "Test for encoding and decoding ?"

#tokenize the sentence
txt_encoded = tokenizer(txt , return_tensors='pt')

#decode the sentence
txt_decoded = tokenizer.decode(txt_encoded["input_ids"][0] ,
                               skip_special_tokens=True)

print('encoded text:')
print(txt_encoded['input_ids'][0])
print('\ntxt_decoded:')
print(txt_decoded)



encoded text:
tensor([2300,   21,    3,   35, 9886,   11,   20, 9886,    3,   58,    1])

txt_decoded:
Test for encoding and decoding ?


In [54]:
txt_encoded

{'input_ids': tensor([[2300,   21,    3,   35, 9886,   11,   20, 9886,    3,   58,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [59]:
# With 3 sentences in a batch
sentences = ["What time is it?", "Hello world", "How are you?"]
batch_encoded = tokenizer(sentences, return_tensors='pt', padding=True)

print(batch_encoded["input_ids"])
# tensor([[  101,  2054,  2051,  2003,  2009,  1029,   102,     0,     0],
#         [  101,  7592,  2088,   102,     0,     0,     0,     0,     0],
#         [  101,  2129,  2024,  2017,  1029,   102,     0,     0,     0]])

# Access each sentence separately:
print(batch_encoded["input_ids"][0])  # First sentence token IDs
print(batch_encoded["input_ids"][1])  # Second sentence token IDs
print(batch_encoded["input_ids"][2])  # Third sentence token IDs

tensor([[ 363,   97,   19,   34,   58,    1],
        [8774,  296,    1,    0,    0,    0],
        [ 571,   33,   25,   58,    1,    0]])
tensor([363,  97,  19,  34,  58,   1])
tensor([8774,  296,    1,    0,    0,    0])
tensor([571,  33,  25,  58,   1,   0])


## It's time to explore how well the base LLM summarize dilogues without any prompt enginnering.

In [69]:
example_indices = [500, 1200]

for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line,'\n')
    print('Example ', i + 1,'\n')
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}','\n')
    print(dash_line,'\n')
    print(f'BASELINE HUMAN SUMMARY:\n{summary}','\n')
    print(dash_line,'\n')
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n','\n')

*************************************************************************************************** 

Example  1 

***************************************************************************************************
INPUT PROMPT:
#Person1#: Are you going anywhere for your vacation?
#Person2#: Yes, we're making plans for a tour.
#Person1#: That'll be lovely. Where are you going?
#Person2#: Well, we will start out from Long Island this Friday. We've planned a four day drive to Salt Lake City, where we'll join my brother and his family on his fortieth birthday.
#Person1#: Well, you've got to prepare a lot of food and enough sleeping bags then.
#Person2#: Oh, we'll spend the nights in hotels and enjoy local food as we pass by. How does it sound, David?
#Person1#: It sounds good. You can do a lot of sightseeing, too.
#Person2#: Yes, we'll take our time. And we'll go to Five Lake Strict and the Wall Street.
#Person1#: So, you're going to have a really nice vacation.
#Person2#: You can say tha

## Prompt engineering

In [72]:
for i, index in enumerate(example_indices):

  dialogue = dataset['test'][index]['dialogue']

  summary = dataset['test'][index]['summary']


  prompt = f"""

  Summarize the following conversation.

  {dialogue}

  Summary:
  """


  inputs = tokenizer(prompt, return_tensors='pt')
  outputs = tokenizer.decode(model.generate(inputs["input_ids"] ,
                                            max_new_tokens =50,)[0],
                             skip_special_tokens=True)
  print(dash_line,'\n')
  print('Example ', i + 1,'\n')
  print(dash_line,'\n')
  print(f'INPUT PROMPT:\n{prompt}','\n')
  print(dash_line,'\n')
  print(f'BASELINE HUMAN SUMMARY:\n{summary}','\n')
  print(dash_line,'\n')
  print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n','\n')


*************************************************************************************************** 

Example  1 

*************************************************************************************************** 

INPUT PROMPT:


  read the dialogue and summarize the following conversation.

  #Person1#: Are you going anywhere for your vacation?
#Person2#: Yes, we're making plans for a tour.
#Person1#: That'll be lovely. Where are you going?
#Person2#: Well, we will start out from Long Island this Friday. We've planned a four day drive to Salt Lake City, where we'll join my brother and his family on his fortieth birthday.
#Person1#: Well, you've got to prepare a lot of food and enough sleeping bags then.
#Person2#: Oh, we'll spend the nights in hotels and enjoy local food as we pass by. How does it sound, David?
#Person1#: It sounds good. You can do a lot of sightseeing, too.
#Person2#: Yes, we'll take our time. And we'll go to Five Lake Strict and the Wall Street.
#Person1#: So, yo

# Zero Shot Inference with the Prompt Template from FLAN-T5

In [74]:
for i ,index in enumerate(example_indices):

  dialogue = dataset['test'][index]['dialogue']
  summary = dataset['test'][index]['summary']


  prompt = f"""
  Dialogue:
  {dialogue}
  what was going on?
  """

  inputs = tokenizer(prompt, return_tensors='pt')

  outputs = tokenizer.decode(model.generate(inputs["input_ids"],
                                            max_new_tokens=50)[0],
                             skip_special_tokens = True)


  print(dash_line , '\n')
  print('Example ', i+1, '\n')
  print(dash_line, '\n')
  print(f'input prompt :\n{prompt}')
  print(dash_line, '\n')
  print(f'base line human summary:\n{summary}\n')
  print(dash_line, '\n')
  print(f'model generation - zero shot: \n{outputs}\n')




*************************************************************************************************** 

Example  1 

*************************************************************************************************** 

input prompt :

  Dialogue:
  #Person1#: Are you going anywhere for your vacation?
#Person2#: Yes, we're making plans for a tour.
#Person1#: That'll be lovely. Where are you going?
#Person2#: Well, we will start out from Long Island this Friday. We've planned a four day drive to Salt Lake City, where we'll join my brother and his family on his fortieth birthday.
#Person1#: Well, you've got to prepare a lot of food and enough sleeping bags then.
#Person2#: Oh, we'll spend the nights in hotels and enjoy local food as we pass by. How does it sound, David?
#Person1#: It sounds good. You can do a lot of sightseeing, too.
#Person2#: Yes, we'll take our time. And we'll go to Five Lake Strict and the Wall Street.
#Person1#: So, you're going to have a really nice vacation.
#Person2


# 4 -  One Shot Inference with the Prompt Template from FLAN-T5

In [78]:
def build_prompt(example_indices_full, example_index_to_summarize):

  prompt=''

  for index in example_indices_full:
      dialogue = dataset['test'][index]['dialogue']
      summary = dataset['test'][index]['summary']

      prompt += f"""

Dialogue:
{dialogue}

what was going on?
{summary}

      """

      dialogue = dataset['test'][example_index_to_summarize]['dialogue']

      prompt += f"""

Dialogue:
{dialogue}

what was going on?
"""
      return prompt


In [79]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = build_prompt(example_indices_full, example_index_to_summarize)
print(f'one shot prompt: {one_shot_prompt}')

one shot prompt: 

Dialogue:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

what was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

      

Dialogue:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin

In [80]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
outputs = tokenizer.decode(model.generate(inputs['input_ids'],max_new_tokens=50,)[0],
                           skip_special_tokens=True)


print(dash_line ,'\n')
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n','\n')
print(dash_line,'\n')
print(f'MODEL GENERATION - ONE SHOT:\n{output}','\n')

*************************************************************************************************** 

BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
 

*************************************************************************************************** 

MODEL GENERATION - ONE SHOT:
#Person1#: I'm going to call the Tenants Advocacy Resource Center. 



# Few Shot Inference with the Prompt Template from FLAN-T5

In [81]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = build_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)



Dialogue:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

what was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

      

Dialogue:
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you al

In [82]:

summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line,'\n')
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n','\n')
print(dash_line,'\n')
print(f'MODEL GENERATION - FEW SHOT:\n{output}','\n')

*************************************************************************************************** 

BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
 

*************************************************************************************************** 

MODEL GENERATION - FEW SHOT:
#Person1 wants to upgrade his system. #Person2 wants to add a painting program to his software. #Person1 wants to add a CD-ROM drive. 

