<h2> Evaluation of Dialog Summarization </h2>

**LOADING DATASET**

**1. We install the packages and libraries needed for the evaluation**

In [1]:
!pip install evaluate
!pip install datasets
!pip install rouge_score
!pip install --user -U nltk
!pip install transformers
!pip install -U bitsandbytes


**2. We load the dataset we will use for the evaluation**

In [2]:
from datasets import load_dataset

dataset = load_dataset("Marouane50/Dialog-Summarization-Dataset", split="testing")
dataset


Dataset({
    features: ['dialog', 'summary'],
    num_rows: 110
})

**3. From the testing set of the dataset, we select 10 random dialog and summary pairs to use as reference values for the evaluation.**

In [3]:
import random

reference_summaries = []
reference_dialogs = []

def get_random_summaries(dataset, num_samples=10):
    # Draw distinct row indices so the same dialog is never selected twice
    indices = random.sample(range(len(dataset)), num_samples)
    for index in indices:
        reference_summaries.append(dataset["summary"][index])  # Append the summary to the reference summaries list
        reference_dialogs.append(dataset["dialog"][index])  # Append the dialog to the reference dialogs list

get_random_summaries(dataset)

In [4]:
print(reference_summaries[1])

Customer is complaining about the new updates IOS11 and can't even use some apps on phone. Agent asks to send a DM and work from there URL.
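
Because the samples are drawn at random, each run of the notebook evaluates a different subset. Below is a minimal sketch of one way to make the selection repeatable, assuming a fixed seed is acceptable for this evaluation. The helper name `sample_indices` and the seed value `42` are illustrative and not part of the original notebook.

```python
import random

def sample_indices(num_rows, k=10, seed=42):
    """Return k distinct row indices, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(range(num_rows), k)

indices = sample_indices(110)  # 110 rows in the testing split
print(indices)
```

The same indices could then be used to fill `reference_dialogs` and `reference_summaries`, so that both models are always scored on the same pairs.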


**LOADING MODELS**

**1. We load the base model and the fine-tuned model from Hugging Face Hub**

In [5]:
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
trained_model_name = "Marouane50/Dialog-Summarization-Finetuning"

**2. We set up a QLoRA configuration and use it to load the models using quantization (same code as in finetuning file)**

In [6]:
import torch
from transformers import BitsAndBytesConfig

compute_dtype = torch.float16

# 4-bit NF4 quantization with double quantization, computing in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
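
To see why 4-bit loading matters here, the following back-of-the-envelope estimate compares weight storage at 16 bits and at 4 bits per parameter. It is a rough sketch: the parameter count of 7 billion is a rounded assumption, and quantization constants, activations, and the KV cache are ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumed, rounded parameter count for a 7B model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the memory of the float16 weights, which is what lets a 7B model fit on a single GPU with limited memory.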

**3. We load the base model Llama2 7B chat and use it to run a text generation pipeline**

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config = bnb_config,
    device_map = {"": 0},
)

# Llama Tokenizer configuration
tokenizer = AutoTokenizer.from_pretrained(base_model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token_id = 18610
tokenizer.padding_side = "right"



In [41]:
from transformers import pipeline

# In this case, we need to specify in the prompt that the summary should only contain two sentences, to match the length of the summaries in our dataset.
system_prompt = "The following text is a conversation between a human and an AI agent. Write a summary of the conversation in two sentence."

pipe = pipeline(
    task="text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=1000 
)

# We summarize each of the 10 sampled dialogs so the outputs can be compared with the reference summaries
base_model_summaries = []
for i in range(0, 10):
    result = pipe(f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {reference_dialogs[i]} [/INST]")
    base_model_summaries.append(result[0]['generated_text'])
    
print(base_model_summaries[2])

<s>[INST] <<SYS>> The following text is a conversation between a human and an AI agent. Write a summary of the conversation in two sentence. <</SYS>> user: signed up for $75 package June of '17. We've been charged $200+ /month since then. Now received cancellation letter..??!!!? ?
agent: Hi, I would love to help. Please DM me the phone # associated with the account &amp; the service address. -Jules
user: 6157177213
1833 Winding Way Dr. White House, TN 37188
agent: Thanks, do you have the account # on hand as well? -Jules
user:   credit card  
agent: Thanks, I am seeing that this account was disconnected on 09/29/17. Did you just receive a letter now? -Jules
user: Yes. Are you able to review the billing charges?
agent: What is your full name? -MV
 [/INST]  Sure! Here is a summary of the conversation between the human and the AI agent:

The human has been charged $200+ per month since June 2017 after signing up for a $75 package. They have now received a cancellation letter, despite the 

In [46]:
print(base_model_summaries[])

<s>[INST] <<SYS>> The following text is a conversation between a human and an AI agent. Write a summary of the conversation in two sentence. <</SYS>> user:Even when someone tries their hardest to rectify the situation and you still don't have compassion? #notmybank 
agent: Hi, Giselle. I'd like to see how we can help with your fees.   Pls click below send your name/ZIP/phone #/acct type so our Social Care Team can follow up with you.  
user: I spoke to a manager in the customer service department this morning who told me you all could not help me with my overdraft fees  even after all the circumstances and me rarely over drafting and paying back the funds ASAP after I saw the overdraft.  I just feel like there was a total lack of compassion and when someone asks you why your account was so low, it's insulting.
agent: Please let us know if you change your mind and would like us to review your account with you. We are here to help. 
user: I tried clicking the link you sent to send my inf
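
The prompt passed to the pipeline wraps the system prompt in `<<SYS>>` tags inside an `[INST]` block. As a sketch, the f-string can be factored into a small helper. `build_prompt` is a hypothetical name, and the template is copied from the notebook's own f-string rather than from any official specification.

```python
def build_prompt(system_prompt, dialog):
    """Wrap a dialog in the same template string the notebook passes to the pipeline."""
    return f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {dialog} [/INST]"

example = build_prompt(
    "Write a summary of the conversation.",
    "user: my app crashes\nagent: please send us a DM",
)
print(example)
```

Using one helper for both models would keep the two prompts identical apart from the system prompt, which makes the comparison easier to reason about.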

**4. Then we load the fine-tuned model and use it to run another text generation pipeline**

In [9]:
# Trained Model configuration
finetuned_model = AutoModelForCausalLM.from_pretrained(
    trained_model_name,
    quantization_config = bnb_config,
    device_map = {"": 0},
)

# Llama Tokenizer configuration
tokenizer = AutoTokenizer.from_pretrained(trained_model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token_id = 18610
tokenizer.padding_side = "right"



In [10]:
from transformers import pipeline

system_prompt = "The following text is a conversation between a human and an AI agent. Write a summary of the conversation."

pipe = pipeline(
    task="text-generation",
    model=finetuned_model, 
    tokenizer=tokenizer, 
    max_length=1000 
)

# We summarize the same 10 sampled dialogs so the outputs can be compared with the reference summaries
trained_model_summaries = []
for i in range(0, 10):
    result = pipe(f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {reference_dialogs[i]} [/INST]")
    trained_model_summaries.append(result[0]['generated_text'])
    
trained_model_summaries[1]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


"<s>[INST] <<SYS>> The following text is a conversation between a human and an AI agent. Write a summary of the conversation. <</SYS>> user: the new update ios11 sucks. I can’t even use some apps on my phone.\nagent: We want your iPhone to work properly, and we are here for you. Which apps are giving you trouble, and which iPhone?\nuser: 6s. Words with friends  Words pro\nagent: Do you see app updates in App Store &gt; Updates? Also, are you using iOS 11.0.3?\nuser: I am using 11.0.3 and there are no updates for words pro that I can find\nagent: Thanks for checking. Next, what happens in that app that makes it unusable?\nuser: It’s says it’s not compatible.\nagent: Thanks for confirming this. Send us a DM and we'll work from there: \n [/INST] Customer is complaining about the new update iOS 11. Agent asks to send DM and they will work from there."

In [49]:
def clean_base_summaries(generated_summaries):
    # Keep only the model's answer: drop the echoed prompt, then the
    # introductory line that the base model writes before a blank line
    for i, text in enumerate(generated_summaries):
        answer = text.split("[/INST]", 1)[-1]
        if "\n\n" in answer:
            answer = answer.split("\n\n", 1)[1]
        generated_summaries[i] = answer.strip()

def clean_trained_summaries(generated_summaries):
    # Keep only the text generated after the closing [/INST] tag
    for i, text in enumerate(generated_summaries):
        generated_summaries[i] = text.split("[/INST]", 1)[-1].strip()

In [50]:
clean_base_summaries(base_model_summaries) 
base_model_summaries

['The human subscribed to a music streaming service but is unable to play songs as it keeps telling them to subscribe. The AI agent tries to help by asking the human to confirm their account and providing a link to connect with a specialist to sort out the issue.\n\nThe human provides screenshots and details of the problem, and the AI agent thanks them for the information. The AI agent then offers to connect the human with a Play Music specialist to help resolve the issue.',
 'The user is experiencing issues with the new iOS 11 update on their iPhone 6s, specifically with the Words With Friends and Words Pro apps. The AI agent is helping the user troubleshoot the issue by asking questions and offering support.',
 "The human has been charged $200+ per month since June 2017 after signing up for a $75 package. They have now received a cancellation letter, despite the account being disconnected on September 29, 2017. The AI agent is asking for the human's full name and account information 

In [28]:
clean_trained_summaries(trained_model_summaries)
trained_model_summaries

["Customer is complaining that he has subscribed to  and it isn't working. Agent updated that they will connect him with their play music specialist to sort this out.",
 'Customer is complaining about the new update iOS 11. Agent asks to send DM and they will work from there.',
 'Customer is complaining that they have been charged $200+/month since then. Agent updated that they are seeing that this account was disconnected on 09/29/17.',
 'Customer is complaining about the update that took 30+ minutes. Agent updated the URL to change the notification and action settings.',
 'Customer is complaining about the poor customer service. Agent updated that they have escalated the support inquiry to a specialist and someone will be reaching out to them momentarily.',
 'Customer is complaining that they are not able to get any help from customer services unless they are prepared to part with another £400 to get home with a disabled wife without changing trains. Agent updated that the only way t
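
Since both cleaning functions rely on string markers, it can be worth confirming that no prompt text survived the cleaning step before scoring. The helper below is a hypothetical sanity check, not part of the original notebook. The marker list is an assumption based on the prompt template used above.

```python
PROMPT_MARKERS = ("[INST]", "[/INST]", "<<SYS>>", "<</SYS>>")  # assumed from the prompt template

def find_uncleaned(summaries):
    """Return the indices of summaries that are empty or still contain a prompt marker."""
    return [
        i for i, text in enumerate(summaries)
        if not text.strip() or any(marker in text for marker in PROMPT_MARKERS)
    ]

sample = [
    "Customer is complaining about the new update iOS 11.",
    "<s>[INST] <<SYS>> leftover prompt <</SYS>> user: hi [/INST] Customer says hi.",
    "",
]
print(find_uncleaned(sample))
```

An empty result for both `base_model_summaries` and `trained_model_summaries` would suggest the lists are ready to be passed to the metric.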

**EVALUATING THE MODELS**

**We use the ROUGE metric from the evaluate library to evaluate the summaries**

In [11]:
import evaluate

rouge_score = evaluate.load("rouge")  # Load the ROUGE metric from the evaluate library, used to evaluate the performance of NLP models

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

**We create a function that puts the summary text into the right format and computes the ROUGE scores**

In [52]:
from nltk.tokenize import sent_tokenize

def compute_rouge_score(generated, reference):
    # ROUGE-Lsum expects one sentence per line, so we join the sentences of each summary with '\n'
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )

In [58]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

**We calculate the ROUGE score for the fine-tuned model**

In [59]:
compute_rouge_score(trained_model_summaries, reference_summaries)

{'rouge1': 0.4674913373800669,
 'rouge2': 0.2157533978391328,
 'rougeL': 0.39614838302200645,
 'rougeLsum': 0.43618860969167306}

**We calculate the ROUGE score for the base model**

In [62]:
compute_rouge_score(base_model_summaries, reference_summaries)

{'rouge1': 0.2879373375131953,
 'rouge2': 0.07449206180843443,
 'rougeL': 0.19704841133306294,
 'rougeLsum': 0.2580567512209999}

For every ROUGE metric (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum), a higher score means the generated summaries overlap more with the reference summaries in the dataset. In the run above, the fine-tuned model (`trained_model_summaries`) scores higher than the base model (`base_model_summaries`) on all four metrics, for example about 0.47 versus 0.29 on ROUGE-1, which indicates that the fine-tuning had an effect. Since only 10 sampled dialogs are evaluated, these scores are indicative rather than conclusive.
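
To make the metric concrete, here is a simplified sketch of how a ROUGE-N F1 score can be computed for one prediction and one reference. It assumes lowercase whitespace tokenization and no stemming, so its values will not match the `evaluate` implementation exactly. `ngrams` and `rouge_n_f1` are illustrative names, and the two example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(prediction, reference, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counts at most min(pred, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

prediction = "the agent asks for a DM"
reference = "agent asks customer to send a DM"
print(rouge_n_f1(prediction, reference, n=1))
print(rouge_n_f1(prediction, reference, n=2))
```

ROUGE-2 is lower than ROUGE-1 for the same pair because matching two consecutive words is a stricter requirement than matching single words, which is the same pattern seen in the scores reported above.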