### 1. Install necessary libraries

In [1]:
%pip install datasets
%pip install transformers
%pip install evaluate
%pip install accelerate -U
%pip install transformers[torch]
%pip install peft

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.8 MB/s

### 2. Import the libraries

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer,GenerationConfig
import torch
device ='cuda' if torch.cuda.is_available() else 'cpu'

import evaluate

import pandas as pd
import numpy as np


### 3. Load Dataset and Model from hugging face
The load_dataset() function is a utility provided by the Hugging Face library to load datasets
The AutoModelForSeq2SeqLM.from_pretrained() method loads a pre-trained model for sequence-to-sequence learning as per the model name givne
The AutoTokenizer.from_pretrained() method initializes a tokenizer for the specified pre-trained model

In [3]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset =load_dataset(huggingface_dataset_name)

model_name = "google/flan-t5-base"
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [5]:
dataset['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 12460
})

In [6]:
dataset['train'][0]

{'id': 'train_0',
 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.",
 'summary': "Mr. Smith'

### 4. Define a function to check number of model parameters
The below defined function provides the size and trainability of the model’s parameters, which will be utilized during PEFT training to see how it reduces resource requirements.

In [7]:
def print_number_of_trainable_model_parameters(model):
	trainable_model_params = 0
	all_model_params = 0
	for _, param in model.named_parameters():
		all_model_params += param.numel()
		if param.requires_grad:
			trainable_model_params += param.numel()
	return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


print(print_number_of_trainable_model_parameters(base_model))


trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


### 5. Base model output
Let us check a random sample from test dataset and generate its output. Before generating the output, we prepare a simple prompt template as shown below.

In [8]:
i= 20
dialogue = dataset['test'][i]['dialogue']
summary = dataset['test'][i]['summary']


prompt = f"Summarize the following dialogue {dialogue} Summary:"


input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(f"input ids: {input_ids}\n")

output_ids=base_model.generate(input_ids, max_new_tokens=200)
print(f"output ids: {output_ids}\n")

output = tokenizer.decode(output_ids[0],skip_special_tokens=True)


print(f"Input Prompt : {prompt}")
print("--------------------------------------------------------------------")
print("Human evaluated summary ---->")
print(summary)
print("---------------------------------------------------------------------")
print("Baseline model generated summary : ---->")
print(output)


input ids: tensor([[12198,  1635,  1737,     8,   826,  7478,  1713,   345, 13515,   536,
          4663,    10,   363,    31,     7,  1786,    28,    25,    58,  1615,
            33,    25,  8629,    53,    78,   231,    58,  1713,   345, 13515,
           357,  4663,    10,    27,   473,    34, 11971,    55,    27,    54,
            31,    17,  1518,    34,  7595,    55,    27,   317,    27,   164,
            36,  1107,   323,    28,   424,     5,    27,   473,   659, 22248,
            11,  5676,     5,  1713,   345, 13515,   536,  4663,    10,  1563,
           140,    43,     3,     9,   320,     5,  2645,     9,    55,  1609,
           550,    45,   140,    55,  1713,   345, 13515,   357,  4663,    10,
           363,    31,     7,  1786,    58,  1713,   345, 13515,   536,  4663,
            10,    27,   317,    25,    43,  3832,  1977,   226,    55,   148,
            33,   975,  2408,  2936,    55,  1609,   550,    55,  1008,    31,
            17, 13418,    30,   140,    5

## important step
### 7. Define our dataset or input data prepration for training
In order to use our model we need to define a function that
Tokenizes each constructed prompt and the summary using the tokenizer.
The padding=”max_length” and truncation=True arguments ensure that all sequences have the same length by padding or truncating them accordingly.
Returns tensors of input IDs for the prompt and the summary.
The dataset.map() function applies the tokenize_function to each example in the dataset in batches.
We then filter the tokenized datasets to retain examples at every 100th index to speed our training

In [9]:
def tokenize_function(example):
	start_prompt = 'Summarize the following conversation.\n\n'
	end_prompt = '\n\nSummary: '
	prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
	example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
	example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

	return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

tokenized_datasets


Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})

In [11]:
tokenized_datasets['train'][0]

{'input_ids': [12198,
  1635,
  1737,
  8,
  826,
  3634,
  5,
  1713,
  345,
  13515,
  536,
  4663,
  10,
  2018,
  6,
  1363,
  5,
  3931,
  5,
  27,
  31,
  51,
  7582,
  12833,
  77,
  7,
  5,
  1615,
  33,
  25,
  270,
  469,
  58,
  1713,
  345,
  13515,
  357,
  4663,
  10,
  27,
  435,
  34,
  133,
  36,
  3,
  9,
  207,
  800,
  12,
  129,
  3,
  9,
  691,
  18,
  413,
  5,
  1713,
  345,
  13515,
  536,
  4663,
  10,
  2163,
  6,
  168,
  6,
  25,
  43,
  29,
  31,
  17,
  141,
  80,
  21,
  305,
  203,
  5,
  148,
  225,
  43,
  80,
  334,
  215,
  5,
  1713,
  345,
  13515,
  357,
  4663,
  10,
  27,
  214,
  5,
  27,
  2320,
  38,
  307,
  38,
  132,
  19,
  1327,
  1786,
  6,
  572,
  281,
  217,
  8,
  2472,
  58,
  1713,
  345,
  13515,
  536,
  4663,
  10,
  1548,
  6,
  8,
  200,
  194,
  12,
  1792,
  2261,
  21154,
  19,
  12,
  253,
  91,
  81,
  135,
  778,
  5,
  264,
  653,
  12,
  369,
  44,
  709,
  728,
  3,
  9,
  215,
  21,
  39,
  293,
  207,
  5,
  1713,

### 6. Define lora config, Peft model , training arguments and peft trianiger
Let us use a low rank matrix of size 32. We see that compared to model size we need to train only 1.41 % of parameters.




#### you can read about LoraConfig()

LoraConfig from PEFT :
1. https://huggingface.co/docs/peft/en/package_reference/lora
2. https://medium.com/@manyi.yim/more-about-loraconfig-from-peft-581cf54643db


In [12]:
from peft import LoraConfig, get_peft_model, TaskType


lora_config = LoraConfig(r=32,lora_alpha = 32, target_modules=["q","v"],
						lora_dropout = 0.5, bias ="none", task_type =TaskType.SEQ_2_SEQ_LM)

output_dir = f"./peft-dialogue-summary-training"

peft_model_train = get_peft_model(base_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model_train))   ## OR             peft_model_train.print_trainable_parameters()


trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


#### Let us define our training parameters and training for above peft model

peft_training_args is defined using TrainingArguments which specifies settings for the training process such as the output directory, batch size, learning rate, and number of train

A Trainer instance (peft_trainer) is created with the specified model (peft_model_train), training arguments (peft_training_args), and the training dataset (tokenized_datasets[“train”]).

The train() method is called on the peft_trainer object to start the training process.

In [21]:
peft_training_args = TrainingArguments(
	output_dir=output_dir,
	auto_find_batch_size=True,
	learning_rate=1e-3, # Higher learning rate than full fine-tuning.
	num_train_epochs=10,
)
peft_trainer = Trainer(
	model=peft_model_train,
	args=peft_training_args,
	train_dataset=tokenized_datasets["train"],
)

peft_trainer.train()


Step,Training Loss


TrainOutput(global_step=320, training_loss=0.16262210607528688, metrics={'train_runtime': 308.0454, 'train_samples_per_second': 4.058, 'train_steps_per_second': 1.039, 'total_flos': 869536235520000.0, 'train_loss': 0.16262210607528688, 'epoch': 10.0})

In [22]:


# Empty the cache to free up unused memory
torch.cuda.empty_cache()


### 8. Save our model and load it for inference
save_pretrained() method is called on the peft_trainer.model object to save the trained model to the specified path.

Similarly, the save_pretrained() method is called on the tokenizer object to save the tokenizer to the same path.

The AutoModelForSeq2SeqLM.from_pretrained() method is used to load the base model (google/flan-t5-base) for sequence-to-sequence learning
The AutoTokenizer.from_pretrained() method is used to load the tokenizer corresponding to the base model.

#### PeftModel.from_pretrained() is used to load the PEFT model from the saved checkpoint directory (peft-dialogue-summary-checkpoint-local). The is_trainable parameter is set to False to ensure that the loaded model is not trainable, indicating that it’s meant for inference only.

In [23]:
peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)




('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

### 9. Generate output

#### Load model


In [24]:

from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
									'./peft-dialogue-summary-checkpoint-local',
																			is_trainable=False)

#### testing on test data

In [25]:
peft_model_outputs = peft_model.generate(input_ids=input_ids, max_new_tokens=200)
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)


print(f"Input Prompt : {prompt}")
print("--------------------------------------------------------------------")
print("Human evaluated summary ---->")
print(summary)
print("---------------------------------------------------------------------")
print("Baseline model generated summary : ---->")
print(output)
print("---------------------------------------------------------------------")
print("Peft model generated summary : ---->")
print(peft_model_text_output)


Input Prompt : Summarize the following dialogue 
Mom: Hey sweetie, have you finished your homework yet?

Son: Not yet, Mom. I’m almost done with my math problems, though.

Mom: That’s good. Do you need any help with them?

Son: No, I think I’ve got it. But I’m worried about my science test tomorrow.

Mom: I can help you study for that. Do you have your notes ready?

Son: Yeah, they’re on the kitchen table. Can we go over them after dinner?

Mom: Of course. Let’s make sure you understand everything.

Son: Thanks, Mom. I want to do well on this test.

Mom: You’re welcome. I’m proud of you for working so hard.

Son: Thanks. Your help really makes a difference. Summary:
--------------------------------------------------------------------
Human evaluated summary ---->
#Person1# thinks #Person2# has chicken pox and warns #Person2# about the possible hazards but #Person2# thinks it will be fine.
---------------------------------------------------------------------
Baseline model generated sum

#### Testing on unseen data

In [26]:
dialogue='''
Mom: Hey sweetie, have you finished your homework yet?

Son: Not yet, Mom. I’m almost done with my math problems, though.

Mom: That’s good. Do you need any help with them?

Son: No, I think I’ve got it. But I’m worried about my science test tomorrow.

Mom: I can help you study for that. Do you have your notes ready?

Son: Yeah, they’re on the kitchen table. Can we go over them after dinner?

Mom: Of course. Let’s make sure you understand everything.

Son: Thanks, Mom. I want to do well on this test.

Mom: You’re welcome. I’m proud of you for working so hard.

Son: Thanks. Your help really makes a difference.'''

In [27]:
prompt = f"Summarize the following dialogue {dialogue} Summary:"

In [28]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids

tensor([[12198,  1635,  1737,     8,   826,  7478,  8822,    10,  9459,  2093,
            23,    15,     6,    43,    25,  2369,    39, 11920,   780,    58,
          3885,    10,   933,   780,     6,  8822,     5,    27,    22,    51,
           966,   612,    28,    82,  7270,   982,     6,   713,     5,  8822,
            10,   466,    22,     7,   207,     5,   531,    25,   174,   136,
           199,    28,   135,    58,  3885,    10,   465,     6,    27,   317,
            27,    22,   162,   530,    34,     5,   299,    27,    22,    51,
          9220,    81,    82,  2056,   794,  5721,     5,  8822,    10,    27,
            54,   199,    25,   810,    21,    24,     5,   531,    25,    43,
            39,  3358,  1065,    58,  3885,    10, 11475,     6,    79,    22,
            60,    30,     8,  1228,   953,     5,  1072,    62,   281,   147,
           135,   227,  2634,    58,  8822,    10,  1129,   503,     5,  1563,
            22,     7,   143,   417,    25,   734,  

In [29]:

peft_model_outputs = peft_model.generate(input_ids=input_ids, max_new_tokens=200)
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print("Peft model generated summary : ---->")
print(peft_model_text_output)


Peft model generated summary : ---->
Son is almost done with his math problems. Mom is worried about his science test tomorrow. Mom will help Son study for the test.
