# 4. Evaluation and Analysis



In [15]:
pip install transformers datasets evaluate accelerate rouge_score



In [16]:
from tqdm import tqdm

In [17]:
import accelerate
import transformers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import evaluate
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
from datasets import load_dataset, load_dataset_builder
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm

## (a) Evaluate the fine-tuned model's performance on the test set. Compare these results to the model's pre-fine-tuning performance using the same dialogues to assess the impact of fine-tuning.

In [18]:
dataset = load_dataset("samsum")

In [19]:
train = load_dataset("samsum", split="train")
test = load_dataset("samsum", split="test")
val = load_dataset("samsum", split="validation")

In [20]:
test

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 819
})

In [21]:
pd.set_option('display.max_colwidth', None)

In [22]:
rouge_metric = evaluate.load("rouge")
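
To make the metric concrete, here is a minimal sketch of what ROUGE-1 F1 measures: clipped unigram overlap between a prediction and a reference. It is a simplification, assuming lowercased whitespace tokenization; the `rouge` metric loaded above uses its own tokenizer and also reports ROUGE-2 and ROUGE-L, so its numbers will differ. The function name `rouge1_f1` and the two example sentences are illustrative only.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy example (sentences invented for illustration, not taken from the dataset)
score = rouge1_f1("Irma will get some sleep tomorrow.", "Irma will sleep in tomorrow.")
print(f"ROUGE-1 F1 (simplified): {score:.3f}")
```

A prediction that copies many dialogue words can still score reasonably on ROUGE-1, which is why the qualitative comparison in part (b) is useful alongside these numbers.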

### Pre-fine-tuned performance on test set

##### {'rouge1': 0.2839869931922432,
##### 'rouge2': 0.07813235417373512,
##### 'rougeL': 0.2146845115342596,
##### 'rougeLsum': 0.2149136580655594}

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch_size = 128
pre_ft_t5_summarizer = pipeline("summarization",model="google-t5/t5-small",device=device)

In [None]:
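# Summarize the test set in batches with the base (not yet fine-tuned) model
for i in tqdm(range(0, len(test), batch_size)):
  batch = test[i:i+batch_size]  # slicing a Dataset returns a dict of column lists
  dialogues = batch['dialogue']
  references = batch['summary']

  outputs = pre_ft_t5_summarizer(dialogues, truncation=True)
  generated_summaries = [output['summary_text'] for output in outputs]
  # Accumulate predictions and references; scores are computed once at the end
  rouge_metric.add_batch(predictions=generated_summaries, references=references)

# Aggregate ROUGE over all accumulated batches
rouge_metric.compute()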


  0%|          | 0/7 [00:00<?, ?it/s]Your max_length is set to 200, but your input_length is only 133. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=66)

{'rouge1': 0.2839869931922432,
 'rouge2': 0.07813235417373512,
 'rougeL': 0.2146845115342596,
 'rougeLsum': 0.2149136580655594}

### Fine-tuned model's performance on test set

##### {'rouge1': 0.35386344587365093,
##### 'rouge2': 0.13875659543532526,
##### 'rougeL': 0.29093864767787203,
##### 'rougeLsum': 0.2911483466753183}

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
batch_size = 8
model = AutoModelForSeq2SeqLM.from_pretrained('/content/drive/MyDrive/LLMPA3-4/checkpoint-2700')
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/LLMPA3-4/checkpoint-2700")
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [9]:
# Summarize the test set in batches with the fine-tuned checkpoint
model.to(device)
for i in tqdm(range(0, len(test), batch_size)):
    batch = test[i:i + batch_size]
    dialogues = batch['dialogue']
    references = batch['summary']

    # Tokenize the dialogues (padded to the longest dialogue in the batch)
    inputs = tokenizer(dialogues, padding=True, truncation=True, return_tensors="pt").to(device)

    # Generate summaries from the model
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pass the padding mask explicitly
        max_length=128,  # maximum length of each generated summary, in tokens
        num_beams=4,     # beam search width
        early_stopping=True,
        decoder_start_token_id=tokenizer.pad_token_id  # T5 starts decoding from the pad token
    )

    # Move outputs to CPU for further processing
    outputs = outputs.cpu()

    # Decode the generated summaries
    generated_summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Add batch to ROUGE metric
    rouge_metric.add_batch(predictions=generated_summaries, references=references)

# Compute ROUGE scores
rouge_metric.compute()


100%|██████████| 103/103 [01:49<00:00,  1.06s/it]


{'rouge1': 0.35386344587365093,
 'rouge2': 0.13875659543532526,
 'rougeL': 0.29093864767787203,
 'rougeLsum': 0.2911483466753183}
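
To quantify the impact of fine-tuning, the two result dictionaries above can be compared directly. The sketch below copies the reported scores and computes the absolute and relative gain per metric; the helper name `rouge_gains` is hypothetical and not part of the `evaluate` library.

```python
# Scores copied from the two evaluation runs above
pre_ft = {
    "rouge1": 0.2839869931922432,
    "rouge2": 0.07813235417373512,
    "rougeL": 0.2146845115342596,
    "rougeLsum": 0.2149136580655594,
}
post_ft = {
    "rouge1": 0.35386344587365093,
    "rouge2": 0.13875659543532526,
    "rougeL": 0.29093864767787203,
    "rougeLsum": 0.2911483466753183,
}


def rouge_gains(before: dict, after: dict) -> dict:
    """Absolute and relative (percent) change for every metric in `before`."""
    gains = {}
    for name, old in before.items():
        delta = after[name] - old
        gains[name] = {"absolute": delta, "relative_pct": 100 * delta / old}
    return gains


gains = rouge_gains(pre_ft, post_ft)
for name, g in gains.items():
    print(f"{name}: {g['absolute']:+.4f} absolute, {g['relative_pct']:+.1f}% relative")
```

Relative gains are worth reporting next to absolute ones here because ROUGE-2 starts from a much lower baseline than ROUGE-1.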

## (b) Analyze the generated summaries for improvements in capturing key points, coherence, and fluency. Discuss the fine-tuning's impact on the model's ability to summarize dialogues.

### Training dataset sample

In [24]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [25]:
model = AutoModelForSeq2SeqLM.from_pretrained('/content/drive/MyDrive/LLMPA3-4/checkpoint-2700')
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/LLMPA3-4/checkpoint-2700")
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [26]:
train_data = pd.DataFrame(train)
sample = train_data.sample(5)

In [27]:
google_t5_small_summarizer = pipeline("summarization",model="google-t5/t5-small")


In [31]:
def fine_tuned_google_t5_small_output(dialogue,model,tokenizer,device):
  """Summarize one dialogue (or a list of dialogues) with the fine-tuned model."""
  model.to(device)
  inputs = tokenizer(dialogue, padding=True, truncation=True, return_tensors="pt").to(device)
  outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pass the padding mask explicitly
        max_length=128,  # maximum length of each generated summary, in tokens
        num_beams=4,     # beam search width
        early_stopping=True,
        decoder_start_token_id=tokenizer.pad_token_id  # T5 starts decoding from the pad token
    )
  outputs = outputs.cpu()
  # batch_decode returns a list of strings, one per input dialogue
  generated_summary = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  return generated_summary


In [29]:
summaries = pd.DataFrame(columns=['original summary','google_t5_small_summary','fine_tuned_google_t5_small_summary'])

In [32]:
for idx in range(sample.shape[0]):
  dialogue = sample.iloc[idx,1]          # 'dialogue' column
  original_summary = sample.iloc[idx,2]  # 'summary' column
  google_t5_small_summary = google_t5_small_summarizer(dialogue, max_length=60, min_length=10, do_sample=False)
  fine_tuned_google_t5_small_summary = fine_tuned_google_t5_small_output(dialogue,model,tokenizer,device)
  new_row = {'original summary':original_summary,
             'google_t5_small_summary':google_t5_small_summary,
             'fine_tuned_google_t5_small_summary':fine_tuned_google_t5_small_summary}

  # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
  summaries = pd.concat([summaries, pd.DataFrame([new_row])], ignore_index=True)





In [33]:
summaries

Unnamed: 0,original summary,google_t5_small_summary,fine_tuned_google_t5_small_summary
0,Irma informs Vivienne and Ferb that classes have been cancelled for tomorrow.,"[{'summary_text': 'irma: i will finally get some sleep tomorrow . i'm a sleeper, but i can't wait to get a good sleep .'}]",[Irma will get some sleep tomorrow.]
1,"Sophie wants to pass her driving licence. Clemence recommends Ornikar as one of the cheapest. He read that online schools are better than the traditional ones. He did his licence at CEF, having taken a month and a half for the theory and only 20 hours of driving lessons, but he had driven before.",[{'summary_text': 'Sophie: do you have any agencies to recommend me? Clemence: try Ornikar ou have a look in the center . if you work or not Sophie: how long does it take to pass the theoretical part?'}],[Clemence wants to pass her driving license before the end of summer and start driving course in september.]
2,Tina is on beach holidays in winter. There is a bad weather where Judith is.,"[{'summary_text': 'Judith: I wish, but I'm not sure if I'll be able to >_ . this is why winter is the best time to go on holiday .'}]",[is jealous of you! Tina is on holiday while I'm stuck in this awful weather.]
3,Markus went to the march yesterday and saw Neo-fascists fighting with Antifa. Tony watched the entire march online.,"[{'summary_text': 'Markus: I saw it all online. Awful violence, such a disgrace. I saw a couple of scuffles . Tony: Yes I saw that one. very violent. he pushed him over and started kicking him .'}]",[went to the march yesterday. Tony saw a couple of scuffles online. Tony saw a couple of scuffles. Markus pushed him over and started kicking him until his girlfriend got in between them.]
4,"The shop Sara was looking for is on the second floor next to Rossman, near the elevator.",[{'summary_text': 'i am walking in circles Sofie: it's right next to Rossmann Sara . i can't find the shop you were telling me about . Tom is right .'}],"[Sara can't find the shop you were telling Sara about, i am walking in circles Sofie is right next to Rossmann on the second floor.]"
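
The table mixes two output shapes: the pipeline column holds `[{'summary_text': ...}]` and the fine-tuned column holds a one-element list of strings. A small helper can flatten both to plain text before comparing them side by side. This is a sketch; `unwrap_summary` is a hypothetical name, and it assumes exactly the two shapes visible in the table above.

```python
def unwrap_summary(value):
    """Return plain summary text from either output shape seen in the table."""
    # Pipeline output: a list holding one dict with a 'summary_text' key
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0]["summary_text"]
    # batch_decode output: a list holding one string
    if isinstance(value, list) and value:
        return value[0]
    # Already plain text (or an empty list): return unchanged
    return value


print(unwrap_summary([{"summary_text": "Irma will sleep."}]))
print(unwrap_summary(["Irma will get some sleep tomorrow."]))
```

Assuming `summaries` is the DataFrame built above, the helper could be applied per column with something like `summaries['google_t5_small_summary'].map(unwrap_summary)`.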


### Testing dataset sample

In [34]:
test_data = pd.DataFrame(test)
sample = test_data.sample(5)

In [35]:
summaries = pd.DataFrame(columns=['original summary','google_t5_small_summary','fine_tuned_google_t5_small_summary'])
for idx in range(sample.shape[0]):
  dialogue = sample.iloc[idx,1]          # 'dialogue' column
  original_summary = sample.iloc[idx,2]  # 'summary' column
  google_t5_small_summary = google_t5_small_summarizer(dialogue, max_length=60, min_length=10, do_sample=False)
  fine_tuned_google_t5_small_summary = fine_tuned_google_t5_small_output(dialogue,model,tokenizer,device)
  new_row = {'original summary':original_summary,
             'google_t5_small_summary':google_t5_small_summary,
             'fine_tuned_google_t5_small_summary':fine_tuned_google_t5_small_summary}

  # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
  summaries = pd.concat([summaries, pd.DataFrame([new_row])], ignore_index=True)





In [36]:
summaries

Unnamed: 0,original summary,google_t5_small_summary,fine_tuned_google_t5_small_summary
0,Annette is sick. James is going to the Jesus bar. Oli couldn't find anyone near the bar.,"[{'summary_text': 'peadar: I'm home sick soz huns Annette: Got lung lurgy Oli: Are people at Jesus bar now? Helen: I cycled and ran around the bar, but couldn't find anyone .'}]","[to the bar, but couldn't find anyone.]"
1,"Ed, Valerie, Chris, Lor, Atnee, Jessica, Matt are laughing at the sinfulness of the sex without marriage.",[{'summary_text': 'sex is for married people Valerie: double sin if it’s anything but missionary style! Chris: Then you go get married & leave us all alone Lor: You better hope all non sin living is worth it or your consciousness after death is going to be but'}],"[, and Matt will enjoy Sex for married people. Chris and Atnee go get married & leave us all alone. Lor and Atnee are in hell.]"
2,David lands at 17:30 at Sevilla airport and Victor will pick him up.,[{'summary_text': 'jerez was too expensive David: you don't have to pick me up if you can't Victor: no it's okay to pick you up from the airport .'}],[David and Jerez were too expensive to pick David up from the airport.]
3,"Drade told her brother in the group chatting room that what he had said was wrong. Marenda thinks that he got out of it because he became angry, as he is short-tempered. Drade refuses to apologise and invite him again, since she feels that she is not the one to be blamed but him.","[{'summary_text': 'marenda: why did you act that aggressively? Drade: he failed in his business, didn't he?'}]",[Drade acted like an asshole first. Marenda is angry at his brother's temper.]
4,Blake will be waiting for Clara and Jenny on the platform a fabro-ficule.,"[{'summary_text': 'fabro-ficule Jenny: ""there's only one platform of course"" Blake: there's a bit indeed Clara: will you wait for us there?'}]",[Blake will wait for Clara to get off on the platform.]


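## Fine-tuning improved the model's ability to summarize dialogues on both the training and test samples, and the test-set ROUGE scores rose accordingly (ROUGE-1 from 0.284 to 0.354, ROUGE-2 from 0.078 to 0.139, ROUGE-L from 0.215 to 0.291). The original model did not capture the underlying semantics of the dialogues, and fine-tuning did not fix this: the fine-tuned summaries still confuse speakers or misattribute actions (for example, "Clemence wants to pass her driving license" when the reference says it is Sophie).

## However, the fine-tuned model's summaries are generally more coherent: they at least describe the content of the conversation instead of copying dialogue turns verbatim. They are also generally shorter than the original model's output.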
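
The two observations above (shorter output, less copied dialogue) can be checked with simple counts. The sketch below uses two baseline and two fine-tuned summaries copied from the tables above; the helper names and the speaker-tag regex (a word followed by a colon, as in `David: `) are assumptions made for illustration, and four examples are far too few to draw conclusions from.

```python
import re

# Heuristic: a word immediately followed by ": " suggests a copied speaker turn
SPEAKER_TAG = re.compile(r"\b[A-Za-z]+:\s")


def mean_word_count(texts):
    """Average number of whitespace-separated tokens per summary."""
    return sum(len(t.split()) for t in texts) / len(texts)


def speaker_tag_rate(texts):
    """Fraction of summaries that contain at least one speaker tag."""
    return sum(bool(SPEAKER_TAG.search(t)) for t in texts) / len(texts)


# Summaries copied from the sample tables above
baseline = [
    "irma: i will finally get some sleep tomorrow . i'm a sleeper, but i can't wait to get a good sleep .",
    "jerez was too expensive David: you don't have to pick me up if you can't Victor: no it's okay to pick you up from the airport .",
]
fine_tuned = [
    "Irma will get some sleep tomorrow.",
    "David and Jerez were too expensive to pick David up from the airport.",
]

for label, texts in [("baseline", baseline), ("fine-tuned", fine_tuned)]:
    print(f"{label}: {mean_word_count(texts):.1f} words on average, "
          f"{speaker_tag_rate(texts):.0%} contain speaker tags")
```

Running the same two functions over the full set of generated test summaries would give a more reliable picture than the five-row samples.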