<h2> Evaluation of Dialog Summarization </h2>

**LOADING DATASET**

**1. We install the packages and libraries needed for our evaluation**

In [1]:
!pip install evaluate
!pip install datasets
!pip install rouge_score
!pip install --user -U nltk
!pip install transformers
!pip install -U bitsandbytes

Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.2
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2
Collecting nltk
  Downloading nltk-3.8.1-py3-n

**2. We load the dataset we will use for the evaluation**

In [2]:
from datasets import load_dataset

dataset = load_dataset("Marouane50/Dialog-Summarization-Dataset", split="testing")
dataset

Dataset({
    features: ['dialog', 'summary'],
    num_rows: 110
})

In [101]:
dataset["summary"][1]

'Customer is asking about the ACC to link to the current  number. Agent says that they have updated their case manager.'

**3. From the testing split of the dataset, we select 10 random dialog and summary pairs that we will use as reference values for the evaluation.**

In [4]:
import random

reference_summaries = []
reference_dialogs = []

def get_random_summaries(dataset, n_samples=10):
    # Draw distinct indices so the same pair is never selected twice
    for index in random.sample(range(len(dataset)), n_samples):
        reference_summaries.append(dataset["summary"][index])  # Append the summary to the reference summaries list
        reference_dialogs.append(dataset["dialog"][index])  # Append the dialog to the reference dialogs list

get_random_summaries(dataset)
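
Because the indices are drawn at random, each run of the notebook evaluates a different set of pairs. One way to make the selection repeatable is sketched below. The helper name `sample_indices`, the seed value `42`, and the hard-coded size `110` (the number of rows in the testing split) are assumptions for illustration and are not part of the original notebook.

```python
import random

def sample_indices(num_rows, n_samples=10, seed=42):
    """Return n_samples distinct row indices, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(range(num_rows), n_samples)

indices = sample_indices(110)        # 110 rows in the testing split
same_indices = sample_indices(110)   # same seed, so the same indices are returned
```

The resulting indices could then be used to fill `reference_summaries` and `reference_dialogs` instead of drawing new random indices on every run.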

**LOADING MODELS**

**1. We load the base model and the fine-tuned model from the Hugging Face Hub**

In [5]:
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
trained_model_name = "Marouane50/Llama2-Dialog-Summarization"

**2. We set up the 4-bit quantization configuration used for QLoRA and use it to load the models (same code as in the fine-tuning file)**

In [6]:
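import torch
from transformers import BitsAndBytesConfig

compute_dtype = torch.float16  # dtype used for computation on the dequantized weights

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)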
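
To see why 4-bit loading matters, a rough back-of-the-envelope estimate of weight memory can help. The sketch below assumes roughly 7 billion parameters and counts the weights only. It ignores quantization constants, activations, and the KV cache, so real memory usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: rough parameter count of a 7B model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the half-precision footprint, which is what allows a 7B model to fit on a single consumer-grade GPU.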

**3. We load the base model Llama2 7B chat and use it to run a text generation pipeline**

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config = bnb_config,
    device_map = {"": 0},
)

# Llama Tokenizer configuration
tokenizer = AutoTokenizer.from_pretrained(base_model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token_id = 18610
tokenizer.padding_side = "right"


In [119]:
from transformers import pipeline

# In this case, we specify in the prompt that the summary should be 150 characters maximum, to match the length of the summaries in our dataset.
system_prompt = "The following text is a conversation between a human and an AI agent. Write a 150 characters maximum summary of the conversation."

pipe = pipeline(
    task="text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=1000 
)

# We use the 10 sampled dialogs as inputs, so that each output can be compared with its reference summary
base_model_summaries = []
for i in range(0, 10):
    result = pipe(f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {reference_dialogs[i]} [/INST]")
    base_model_summaries.append(result[0]['generated_text'])
    
print(base_model_summaries[1])

<s>[INST] <<SYS>> The following text is a conversation between a human and an AI agent. Write a 150 characters maximum summary of the conversation. <</SYS>> user: the new update ios11 sucks. I can’t even use some apps on my phone.
agent: We want your iPhone to work properly, and we are here for you. Which apps are giving you trouble, and which iPhone?
user: 6s. Words with friends  Words pro
agent: Do you see app updates in App Store &gt; Updates? Also, are you using iOS 11.0.3?
user: I am using 11.0.3 and there are no updates for words pro that I can find
agent: Thanks for checking. Next, what happens in that app that makes it unusable?
user: It’s says it’s not compatible.
agent: Thanks for confirming this. Send us a DM and we'll work from there: 
 [/INST]  Sure! Here is a 150 character summary of the conversation:

User: New iOS 11 update is terrible, can't use some apps on my iPhone 6s. Specifically, Words with Friends and Words Pro. Agent: Checked for updates in App Store, user is u
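
The prompt passed to the pipeline wraps the system prompt in `<<SYS>>` tags inside an `[INST]` block. As a small sketch, the formatting can be factored into a helper so that both models receive prompts built the same way. `build_prompt` is a hypothetical name, and the layout simply mirrors the f-string used in this notebook.

```python
def build_prompt(system_prompt, dialog):
    """Build the single-turn prompt string in the same layout as the notebook's f-string."""
    return f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {dialog} [/INST]"

example = build_prompt(
    "Write a summary of the conversation.",
    "user: my app crashes\nagent: please send us a DM",
)
print(example)
```

Keeping the template in one place makes it less likely that the two generation loops drift apart when the prompt is edited.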

**4. Then we load the fine-tuned model and use it to run another text generation pipeline**

In [9]:
# Trained Model configuration
finetuned_model = AutoModelForCausalLM.from_pretrained(
    trained_model_name,
    quantization_config = bnb_config,
    device_map = {"": 0},
)

# Llama Tokenizer configuration
tokenizer = AutoTokenizer.from_pretrained(trained_model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token_id = 18610
tokenizer.padding_side = "right"


In [99]:
from transformers import pipeline

system_prompt = "The following text is a conversation between a human and an AI agent. Write a summary of the conversation."

pipe = pipeline(
    task="text-generation",
    model=finetuned_model, 
    tokenizer=tokenizer, 
    max_length=1000 
)

# We use the 10 sampled dialogs as inputs, so that each output can be compared with its reference summary
trained_model_summaries = []
for i in range(0, 10):
    result = pipe(f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {reference_dialogs[i]} [/INST]")
    trained_model_summaries.append(result[0]['generated_text'])
    
trained_model_summaries[1]


"<s>[INST] <<SYS>> The following text is a conversation between a human and an AI agent. Write a summary of the conversation. <</SYS>> user: the new update ios11 sucks. I can’t even use some apps on my phone.\nagent: We want your iPhone to work properly, and we are here for you. Which apps are giving you trouble, and which iPhone?\nuser: 6s. Words with friends  Words pro\nagent: Do you see app updates in App Store &gt; Updates? Also, are you using iOS 11.0.3?\nuser: I am using 11.0.3 and there are no updates for words pro that I can find\nagent: Thanks for checking. Next, what happens in that app that makes it unusable?\nuser: It’s says it’s not compatible.\nagent: Thanks for confirming this. Send us a DM and we'll work from there: \n [/INST]\n    The customer is complaining about the new update ios11 sucks. The agent asked to DM and they will work from there. The customer is using 11.0.3 and there are no updates for words pro that he can find.</s>"

The raw output of both models cannot be scored directly: the generated text echoes the full prompt, and the summary is sometimes wrapped in irregular patterns (for example an introductory sentence or a trailing `</s>` token), so we did not automate the extraction.
Instead, we manually copy the summary out of each output and use it to compute the ROUGE score. Since we are only testing 10 summaries, this remains manageable.
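
For the two output patterns shown above, the manual copy step could be approximated with a small heuristic. This is only a sketch: `extract_summary` is a hypothetical helper, the shortened example strings stand in for real model outputs, and the rule for dropping an introductory "Sure! ...:" line is an assumption based on the single base-model output shown, so other outputs may need different handling.

```python
import re

def extract_summary(generated_text):
    """Return the text after the last '[/INST]' marker, cleaned of known wrappers."""
    # Keep only what the model wrote after the prompt
    answer = generated_text.rsplit("[/INST]", 1)[-1]
    # Drop the end-of-sequence token if present
    answer = answer.replace("</s>", "")
    # Drop a leading introductory line such as "Sure! Here is a ... summary of the conversation:"
    answer = re.sub(r"^\s*Sure![^\n]*:\s*", "", answer, count=1)
    return answer.strip()

# Shortened stand-ins for the two output styles seen in this notebook
base_output = (
    "<s>[INST] <<SYS>> sys <</SYS>> user: hi [/INST]  Sure! Here is a 150 character "
    "summary of the conversation:\n\nUser: New iOS 11 update is terrible."
)
tuned_output = (
    "<s>[INST] <<SYS>> sys <</SYS>> user: hi [/INST]\n"
    "    The customer is complaining about the new update.</s>"
)

print(extract_summary(base_output))
print(extract_summary(tuned_output))
```

The extracted strings could then be scored against `reference_summaries` in one call. Alternatively, the text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text and removes the need to strip the prompt.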

<h1> Evaluating the Models </h1>

**We use the ROUGE metric from the evaluate library to evaluate the summaries**

In [107]:
import evaluate

rouge_score = evaluate.load("rouge")  # Load the ROUGE metric from the evaluate library, used to evaluate the performance of NLP models

**We create a function to preprocess the summaries text into the right format for ROUGE evaluation**

In [109]:
from nltk.tokenize import sent_tokenize

def compute_rouge_score(generated, reference):
    # ROUGE-Lsum expects sentences separated by '\n', so we split each summary into sentences and join them with newlines
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )

**We set the variables containing the summaries data**

In [120]:
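# Note: this string is dataset["summary"][1]; the summaries below were generated from reference_dialogs[1], whose paired reference is reference_summaries[1].
reference_summary = ["Customer is asking about the ACC to link to the current  number. Agent says that they have updated their case manager."]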
base_model_summary  = ["User: New iOS 11 update is terrible, can't use some apps on my iPhone 6s. Specifically, Words with Friends and Words Pro. Agent: Checked for updates in App Store, user is using iOS 11.0.3. User: App says it's not compatible. Agent: Send a DM for further assistance."]
trained_model_summary = ["The customer is complaining about the new update ios11 sucks. The agent asked to DM and they will work from there. The customer is using 11.0.3 and there are no updates for words pro that he can find."]


In [121]:
reference_summary

['Customer is asking about the ACC to link to the current  number. Agent says that they have updated their case manager.']

**We calculate the ROUGE score for the fine-tuned model**

In [117]:
compute_rouge_score(trained_model_summary, reference_summary)

{'rouge1': 0.36065573770491804,
 'rouge2': 0.06779661016949151,
 'rougeL': 0.2622950819672132,
 'rougeLsum': 0.32786885245901637}

**We calculate the ROUGE score for the base model**

In [122]:
compute_rouge_score(base_model_summary, reference_summary)

{'rouge1': 0.1111111111111111,
 'rouge2': 0.0,
 'rougeL': 0.08333333333333333,
 'rougeLsum': 0.1111111111111111}

The higher the ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum), the closer the generated summaries are to the reference summaries in the dataset, and the better the model performs at dialog summarization by this measure. If fine-tuning had an effect, the fine-tuned model should score higher than the base model when both are compared against the testing set summaries.
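
To make the metric more concrete, the sketch below computes a simplified ROUGE-1 F1 score by hand from unigram overlap. It is an illustration only, not the implementation used by the `rouge_score` package: it lowercases and splits on whitespace, and applies no stemming, so its values may differ from those returned by `evaluate`. The function name `rouge1_f1` and the example sentences are made up for this illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace-tokenized, no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())  # share of predicted words found in the reference
    recall = overlap / sum(ref_counts.values())      # share of reference words found in the prediction
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score, 4))  # 5 of 6 words match on each side, so F1 = 5/6 ≈ 0.8333
```

ROUGE-2 applies the same idea to pairs of consecutive words, while ROUGE-L is based on the longest common subsequence between the two texts.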