# Fine-Tune the `BART` Model for Text Summarization

*  Fine-tune a BART model, facebook/bart-large-cnn, on the [SAMSum](https://huggingface.co/datasets/Samsung/samsum) dataset for summarization
*  Push the fine-tuned model to the Hugging Face model hub
*  Load the fine-tuned model from the hub for inference

-----------------

## Installing Dependencies

----------------

In [18]:
## downloading transformers and datasets from huggingface
!pip -q install transformers datasets

In [19]:
## downloading Accelerate
!pip -q install accelerate -U

In [20]:
# A dependency required for loading the SAMSum dataset
!pip -q install py7zr

In [45]:
## downloading ROUGE Score Metrics for Evaluation
!pip -q install rouge-score evaluate



## Importing Necessary Packages

In [21]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import TrainingArguments, Trainer

from datasets import load_dataset

import os
import torch

import warnings
warnings.filterwarnings('ignore')

In [22]:
## loading model from Huggingface Hub

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/bart-large-cnn"
)

## loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/bart-large-cnn"
)



In [23]:
## load the datasets
dataset = load_dataset("Samsung/samsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

### Testing the pre-trained Model

In [24]:
## dialogue and summary checking
dialogue_0 = dataset["train"][0]['dialogue']
summary_0 = dataset['train'][0]['summary']

print("\n")
print("Dialogue : ", "\n", dialogue_0)
print("\n")
print("Summary : ", "\n",  summary_0)



Dialogue :  
 Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)


Summary :  
 Amanda baked cookies and will bring Jerry some tomorrow.


#### Generating a Summary with the Pre-trained Model

In [73]:
## preparing prompt for generating summary

def generate_summary(text, model):
  """prepare prompt --> tokenize --> generate summary using model --> Detokenize to get Output"""
  input_prompt = f"""
                Summarise the following conversation.

                {text}

                Summary :
                """
  input_ids = tokenizer(input_prompt, return_tensors='pt')
  tokenize_outputs = model.generate(
      input_ids = input_ids['input_ids'],
      min_length = 30,
      max_length = 200
  )
  output = tokenizer.decode(
      tokenize_outputs[0],
      skip_special_tokens = True
  )

  return output
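
The f-string in `generate_summary` is indented inside the function body, so every prompt line reaches the tokenizer with its leading spaces. A minimal sketch of one way to build the same prompt without that indentation follows. `build_prompt` is a hypothetical helper that is not part of the notebook, and whether the cleaner prompt changes the generated summaries is not tested here.

```python
def build_prompt(text: str) -> str:
    """Hypothetical helper: same instruction and suffix as generate_summary, no indentation."""
    return (
        "Summarise the following conversation.\n\n"
        f"{text}\n\n"
        "Summary :"
    )

example = build_prompt("Amanda: I baked  cookies. Do you want some?\nJerry: Sure!")
print(example)
```

The returned string could be passed to the tokenizer in place of `input_prompt`.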

In [26]:
## looking into the summary

generate_summary(
    dialogue_0,
    model = model
)

"Amanda baked cookies. Jerry asked if he wanted some. Amanda said she'd bring them to him tomorrow. Jerry said he'd like them. The conversation went on and on."

#### Difference Between the Actual and the Model-Generated Summary

In [27]:
## looking into dialogue_2 and summary_2 and generating summary for dialogue_2

dialogue_2 = dataset["train"][1]["dialogue"]
summary_2 = dataset["train"][1]["summary"]

model_summary_2 = generate_summary(
    text = dialogue_2,
    model = model
)

###

print("\n")
print("Dialogue :", "\n")
print(dialogue_2)
print("Actual Summary : ", "\n")
print(summary_2)
print("Model Summary : ", "\n")
print(model_summary_2)



Dialogue : 

Olivia: Who are you voting for in this election? 
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great
Actual Summary :  

Olivia and Olivier are voting for liberals in this election. 
Model Summary :  

Olivia asks Oliver who he is voting for in the election. Oliver says he's voting for the Liberals as always. Olivia says she's going to vote for him too.


## Fine-Tuning

-----

### Prepare DataSet

----

In [28]:
## prepare datasets

def tokenize_inputs(text):
  start_prompt = 'Summarize the following text \n\n'
  end_prompt = '\n\nSummary :'
  prompt = [start_prompt + dialogue + end_prompt for dialogue in text['dialogue']]
  text['input_ids'] = tokenizer(
      prompt,
      padding = 'max_length',
      truncation = True,
      return_tensors = 'pt'
  ).input_ids

  text['labels'] = tokenizer(
      text['summary'],
      padding = 'max_length',
      truncation = True,
      return_tensors = 'pt'
  ).input_ids

  return text


In [29]:
def tokenize_inputs(text):
    start_prompt = 'Summarize the following text \n\n'
    end_prompt = '\n\nSummary :'
    prompts = [start_prompt + dialogue + end_prompt for dialogue in text['dialogue']]

    tokenized_inputs = tokenizer(
        prompts,
        padding='max_length',
        truncation=True,
        max_length=512,
        return_tensors='pt'
    )

    tokenized_labels = tokenizer(
        text['summary'],
        padding='max_length',
        truncation=True,
        max_length=128,
        return_tensors='pt'
    )

    text['input_ids'] = tokenized_inputs.input_ids
    text['labels'] = tokenized_labels.input_ids

    return text
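
Both versions of `tokenize_inputs` pad the labels to a fixed length, so the padding positions count toward the training loss. A common convention, assumed here and not used in the notebook, is to replace padding ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain lists; `mask_label_padding` and the token ids are illustrative only.

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_label_padding(label_rows, pad_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [IGNORE_INDEX if token_id == pad_id else token_id for token_id in row]
        for row in label_rows
    ]

# Illustrative ids only: 0 = start, 2 = end of sequence, 1 = padding
labels = [[0, 713, 16, 2, 1, 1], [0, 42, 2, 1, 1, 1]]
masked = mask_label_padding(labels, pad_id=1)
print(masked)
```

This only works as intended when the padding id differs from the end-of-sequence id. The notebook sets `tokenizer.pad_token = tokenizer.eos_token`, in which case masking by id would also hide the real end-of-sequence token.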

In [30]:
## prepare dataset

tokenizer.pad_token = tokenizer.eos_token
tokenized_datasets = dataset.map(
    tokenize_inputs,
    batched = True   ## batched = True for faster calculation
)

## removing unnecessary columns
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary'])


In [31]:
# Shortening the data: Just picking row index divisible by 100
# For learning purpose! It will reduce the compute resource requirement and training time

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)


In [32]:
tokenized_datasets['train'].shape, tokenized_datasets['test'].shape, tokenized_datasets['validation'].shape

((148, 2), (9, 2), (9, 2))
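
These shapes follow from the filter: keeping indices 0, 100, 200, and so on leaves ceil(n / 100) rows per split, and the two remaining columns are `input_ids` and `labels`. A quick check of that arithmetic, using the split sizes printed when the dataset was loaded:

```python
import math

split_sizes = {"train": 14732, "test": 819, "validation": 818}

# Indices divisible by 100 are 0, 100, 200, ..., so each split keeps ceil(n / 100) rows
kept = {name: math.ceil(n / 100) for name, n in split_sizes.items()}
print(kept)  # {'train': 148, 'test': 9, 'validation': 9}
```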

----  

### Define Training Arguments and Trainer Objects

---

In [33]:
## define the Training Arguments Module

from transformers import TrainingArguments, Trainer

training_arguments = TrainingArguments(
    output_dir = "./bart-cnn-samsum-finetunned",
    learning_rate = 1e-5,
    weight_decay = 0.01,
    num_train_epochs = 3,
    auto_find_batch_size = True,
    evaluation_strategy='epoch',
    logging_steps = 10
)

In [34]:
## define trainer Module

trainer = Trainer(
    model = model,
    tokenizer = tokenizer,
    args = training_arguments,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['validation']
)

In [36]:
## training the model

trainer.train()

Epoch,Training Loss,Validation Loss
1,1.3206,1.095176
2,0.7366,0.995161
3,0.5993,0.971707


TrainOutput(global_step=57, training_loss=0.8251190520169442, metrics={'train_runtime': 197.9986, 'train_samples_per_second': 2.242, 'train_steps_per_second': 0.288, 'total_flos': 493016296980480.0, 'train_loss': 0.8251190520169442, 'epoch': 3.0})
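
The step count and throughput reported in `TrainOutput` can be reproduced from the subset size. The sketch assumes the `TrainingArguments` default batch size of 8 per device, a single device, and no gradient accumulation. None of these are stated in the notebook, and `auto_find_batch_size` could have lowered the batch size had memory run out.

```python
import math

train_rows = 148          # rows left after the filter step
epochs = 3
batch_size = 8            # assumed: default per_device_train_batch_size
train_runtime = 197.9986  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # 57
print(round(samples_per_second, 3))  # 2.242
print(round(steps_per_second, 3))    # 0.288
```

All three values match the logged `global_step`, `train_samples_per_second`, and `train_steps_per_second`, which is consistent with the assumed batch size.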

### Save the Model on Local System

In [41]:
## defining the versioned save path

version = 1
output_dir = "./bart-cnn-samsum-finetunned"
model_path = f"{output_dir}_{version}"


## save the fine-tuned model
trainer.save_model(model_path)

## save the associated tokenizer
tokenizer.save_pretrained(model_path)

print(f"Saved at Path : {model_path}")

Saved at Path : ./bart-cnn-samsum-finetunned_1


_________________

### Load Model from Local System and Test

-----------------

In [43]:
## loading tokenizer and model
tokenizer4local = AutoTokenizer.from_pretrained(model_path)
model4local = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [44]:
## testing summarization task
model_summary_test = generate_summary(text = dataset['train'][2]['dialogue'],
                                      model = model4local)
print("Test Summarization Task :", "\n")
print("Dialogue : ", "\n")
print(dataset['train'][2]['dialogue'], "\n")
print("Summary : ", "\n")
print(dataset['train'][2]['summary'], "\n")
print("Model Summary : ", "\n")
print(model_summary_test)


Test Summarization Task : 

Dialogue :  

Tim: Hi, what's up?
Kim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating
Tim: What did you plan on doing?
Kim: Oh you know, uni stuff and unfucking my room
Kim: Maybe tomorrow I'll move my ass and do everything
Kim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies
Tim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores
Tim: It really helps
Kim: thanks, maybe I'll do that
Tim: I also like using post-its in kaban style
Summary :  

Kim may try the pomodoro technique recommended by Tim to get more stuff done.
Model Summary :  

 is about Kim and Tim's weekend plans. Kim was in a bad mood and was procrastinating. Kim was going to do lots of stuff but ended up procrastinating. Kim is going to do uni stuff and unfucking her room tomorrow.


-------------------------

## Testing Model Performance using `ROUGE` Score


---------------------------

In [49]:
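## generating both actual summaries and model summaries

## creating pipelines for the Summarization Task
from transformers import pipeline

pipe_finetunned = pipeline(
    "summarization",
    model = model4local,
    tokenizer = tokenizer4local,
    device = "cuda"
)

## NOTE: `model` is the same object that `trainer.train()` updated in place,
## so this pipeline does not hold the original facebook/bart-large-cnn weights.
## Reload the checkpoint with `from_pretrained` for a true pre-trained baseline.
pipe_t5_model = pipeline(
    "summarization",
    model = model,
    tokenizer = tokenizer,
    device = "cuda"
)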

Device set to use cuda
Device set to use cuda


In [55]:
## loading packages
from datasets import Dataset
from tqdm import tqdm

## Actual Dialogues and Summary

## taking only the validation dataset
dialogue_val = dataset['validation']['dialogue']
summary_val = dataset['validation']['summary']

# Convert dialogue_val to Hugging Face Dataset for easy mapping
dialogue_dataset = Dataset.from_dict({"dialogue": dialogue_val})

# Define the pipeline-based summary function under its own name so that it
# does not overwrite the earlier generate_summary(text, model) helper
def generate_summary_with_pipe(dialogue, pipe):
    return {"summary": pipe(dialogue)[0]['summary_text']}

# Apply mapping to generate summaries
dialogue_dataset = dialogue_dataset.map(lambda x: generate_summary_with_pipe(x["dialogue"], pipe_finetunned), batched=False)
predicted_summary_finetunned_model = dialogue_dataset['summary']

dialogue_dataset = dialogue_dataset.map(lambda x: generate_summary_with_pipe(x["dialogue"], pipe_t5_model), batched=False)
predicted_summary_t5_model = dialogue_dataset['summary']


Your max_length is set to 142, but your input_length is only 58. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=29)
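
The warning appears because the pipeline's `max_length` of 142 is larger than the tokenized dialogue. For an input of 58 tokens it suggests 29, which is half the input length rounded down. A minimal sketch of choosing a per-input cap that way follows. `suggested_max_length` is a hypothetical helper, and the halving rule is inferred from the logged suggestion, not taken from the library documentation.

```python
def suggested_max_length(input_length: int, default_max_length: int = 142) -> int:
    """Hypothetical helper: cap the summary at half the input length, never above the default."""
    return min(default_max_length, max(1, input_length // 2))

print(suggested_max_length(58))  # 29
```

The result could be passed as `max_length` in the pipeline call. The model's `min_length` setting may need lowering as well, since it should not exceed `max_length`.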

-------------------------

### Compute ROUGE Score


-------------------------

In [56]:
## loading the package
import evaluate

## Loading the ROUGE Metric
rouge = evaluate.load("rouge")

## calculating ROUGE Score
results_finetunned_model = rouge.compute(predictions =
                        predicted_summary_finetunned_model,
                        references = summary_val)

results_t5_model = rouge.compute(predictions =
                        predicted_summary_t5_model,
                        references = summary_val)

In [57]:
## Evaluating the Results

print(f"Pre-trained Model ROUGE Score : {results_t5_model}", "\n")
print(f"Fine-Tuned Model ROUGE Score : {results_finetunned_model}")

Pre-trained Model ROUGE Score : {'rouge1': 0.3508524765191202, 'rouge2': 0.1583501651154121, 'rougeL': 0.26192254977541957, 'rougeLsum': 0.26208377096427404} 

Fine-Tuned Model ROUGE Score : {'rouge1': 0.3508524765191202, 'rouge2': 0.1583501651154121, 'rougeL': 0.26192254977541957, 'rougeLsum': 0.26208377096427404}
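
The two score dictionaries are identical, which is what one would expect if both pipelines hold the same weights: `Trainer` updates the model object it is given, and `pipe_t5_model` was built from that same `model` variable after training. A minimal sketch of the aliasing follows, with a hypothetical `TinyModel` class and `train_in_place` function standing in for the BART model and `trainer.train()`. Taking a copy before training, or reloading the original checkpoint afterwards, is one way to keep a baseline for comparison.

```python
import copy

class TinyModel:
    """Hypothetical stand-in for a model whose weights are updated in place."""
    def __init__(self, weights):
        self.weights = list(weights)

def train_in_place(model, step=0.5):
    """Hypothetical stand-in for trainer.train(): mutates the object it receives."""
    model.weights = [w - step for w in model.weights]

model = TinyModel([1.0, 2.0])
baseline = copy.deepcopy(model)  # snapshot taken before training
alias = model                    # just another name for the same object

train_in_place(model)

print(alias.weights)     # [0.5, 1.5]  (changed together with `model`)
print(baseline.weights)  # [1.0, 2.0]  (still the pre-training values)
```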


-------------------

#### Customize ROUGE Evaluation

-----------------------

In [61]:
## loading the modules
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
## fine-tuned model score; RougeScorer.score expects (target, prediction)
rouge_scores_finetuned_model = [scorer.score(ref, pred) for pred, ref in zip(predicted_summary_finetunned_model, summary_val)]
## comparison model score
rouge_scores_t5_model = [scorer.score(ref, pred) for pred, ref in zip(predicted_summary_t5_model, summary_val)]

# Average the scores - fine-tuned model
avg_rouge1_finetunned = sum([score['rouge1'].fmeasure for score in rouge_scores_finetuned_model]) / len(rouge_scores_finetuned_model)
avg_rouge2_finetunned = sum([score['rouge2'].fmeasure for score in rouge_scores_finetuned_model]) / len(rouge_scores_finetuned_model)
avg_rougeL_finetunned = sum([score['rougeL'].fmeasure for score in rouge_scores_finetuned_model]) / len(rouge_scores_finetuned_model)

# Average the scores - comparison model
avg_rouge1_t5 = sum([score['rouge1'].fmeasure for score in rouge_scores_t5_model]) / len(rouge_scores_t5_model)
avg_rouge2_t5 = sum([score['rouge2'].fmeasure for score in rouge_scores_t5_model]) / len(rouge_scores_t5_model)
avg_rougeL_t5 = sum([score['rougeL'].fmeasure for score in rouge_scores_t5_model]) / len(rouge_scores_t5_model)

print(f"ROUGE-1: {avg_rouge1_finetunned}, ROUGE-2: {avg_rouge2_finetunned}, ROUGE-L: {avg_rougeL_finetunned}")

print(f"ROUGE-1: {avg_rouge1_t5}, ROUGE-2: {avg_rouge2_t5}, ROUGE-L: {avg_rougeL_t5}")



ROUGE-1: 0.3601602666374244, ROUGE-2: 0.16373721788433837, ROUGE-L: 0.2683430744253543
ROUGE-1: 0.3601602666374244, ROUGE-2: 0.16373721788433837, ROUGE-L: 0.2683430744253543
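
For intuition about what these numbers measure, ROUGE-1 can be computed by hand as the unigram overlap between a prediction and a reference. The sketch below is a simplified version that lowercases and splits on non-alphanumeric characters, without the stemming that `use_stemmer=True` enables in `rouge_score`, so its values can differ slightly from the library's. The prediction string is made up for illustration; the reference is the first SAMSum training summary shown earlier.

```python
import re
from collections import Counter

def rouge1(prediction: str, reference: str):
    """Simplified ROUGE-1: unigram precision, recall, and F-measure (no stemming)."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    # Clipped overlap: a token is matched at most as often as it occurs in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

reference = "Amanda baked cookies and will bring Jerry some tomorrow."
prediction = "Amanda baked cookies for Jerry."  # made-up prediction
precision, recall, f_measure = rouge1(prediction, reference)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.8 0.444 0.571
```

Four of the five predicted words appear among the nine reference words, giving a precision of 4/5 and a recall of 4/9. The F-measure is symmetric in the two texts, which is why averaging `fmeasure` is unaffected by argument order.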


### Uploading the Model to the Hugging Face Hub

In [63]:
# Run, and paste the Access token when prompted
!huggingface-cli login



    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) N
Token is valid (permission: write

In [66]:
# Push model

my_repo = "samsun_dialogue_Summary"

model.push_to_hub(repo_id=my_repo, commit_message="Upload fine-tuned model")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/saibalpatra/samsun_dialogue_Summary/commit/9ebaa11b050640be1ea402934a5cd76f00b9f2c6', commit_message='Upload fine-tuned model', commit_description='', oid='9ebaa11b050640be1ea402934a5cd76f00b9f2c6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/saibalpatra/samsun_dialogue_Summary', endpoint='https://huggingface.co', repo_type='model', repo_id='saibalpatra/samsun_dialogue_Summary'), pr_revision=None, pr_num=None)

In [67]:
# Push tokenizer

tokenizer.push_to_hub(repo_id= my_repo, commit_message= "Upload tokenizer used")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/saibalpatra/samsun_dialogue_Summary/commit/9ebaa11b050640be1ea402934a5cd76f00b9f2c6', commit_message='Upload tokenizer used', commit_description='', oid='9ebaa11b050640be1ea402934a5cd76f00b9f2c6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/saibalpatra/samsun_dialogue_Summary', endpoint='https://huggingface.co', repo_type='model', repo_id='saibalpatra/samsun_dialogue_Summary'), pr_revision=None, pr_num=None)

-----------------------------

### Test Own Fine-Tuned Model

------------------------------

In [68]:
## loading the model

model_own = AutoModelForSeq2SeqLM.from_pretrained(
    "saibalpatra/samsun_dialogue_Summary"
)

tokenizer_own = AutoTokenizer.from_pretrained(
    "saibalpatra/samsun_dialogue_Summary"
)


In [74]:
## testing the results

dialogue_test = dataset["test"][0]["dialogue"]
summary_test = dataset["test"][0]["summary"]

summary_own = generate_summary(text = dataset["test"][0]["dialogue"],
                               model = model_own)

print("Dialogue : ", "\n")
print(dialogue_test, "\n")
print("Summary Test : ", "\n")
print(summary_test, "\n")
print("Summary Own : ", "\n")
print(summary_own)

Dialogue :  

Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye 

Summary Test :  

Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry. 

Summary Own :  

 for the day. Amanda and Hannah are at the park together and Hannah asks Amanda for Betty's number. Amanda doesn't have Betty's number but she can't find it. Amanda asks Hannah to ask Larry to text Betty's number. Amanda is very nice and Larry is very nice.
