# LLMs Evaluation

#### This notebook presents a methodology to evaluate the text generated by a LLM comparing it to a manual labelling created by human annotators through three metrics: BERTScore, ROUGE, and BLEU

In [None]:
import pandas as pd
import numpy as np
import re
import nltk

### Import the dataframe containing the generated text and the labels 

In [22]:
df = pd.read_excel(r"..\data\gpt4_output.xlsx")
df.head(5)

Unnamed: 0.1,Unnamed: 0,TEXT,label,extracted_needs
0,0,Hey fellow travel enthusiasts! 👋 So here's ...,suggestions upon best christmas activities and...,"Christmas-themed places recommendations, Trans..."
1,1,Hello! We will be taking a family trip (teens ...,accomodation advices,Hotel selection advice
2,2,Where could I deposit luggage nearby Penn Stat...,luggage storage options,"Luggage storage options, Reliable mobile appli..."
3,3,Hi i have posted a few times before and really...,"itinerary planning advices, shopping advices, ...","Itinerary planning advices, Shopping advices, ..."
4,4,Hi looking for advice on where to stay on a bu...,"accomodation advices, travelling advices from ...","Affordable accomadations, Safe accomodations f..."


<h2>Statistical Analysis</h2>

<p>In this work, we are interested in counting how many customer needs were extracted by GPT-4, to compare its output with the labels.</p>

In [24]:
def count_needs(df_col):
    count = 0
    for i in df_col:
        if type(i) != float:
            count = count + i.count(",") + 1
    return count

In [25]:
print("Number of needs extracted by GPT-4: ", count_needs(df["extracted_needs"]))
print("Number of needs identified by human annotators: ", count_needs(df["label"]))

Number of needs extracted by GPT-4:  769
Number of needs identified by human annotators:  561


<h2>Evaluation</h2>

In [44]:
label_example = df["label"][0]
print("Label example: ", label_example)
extracted_example = df["extracted_needs"][0]
print("Extracted needs example: ", extracted_example)

Label example:  suggestions upon best christmas activities and must sees, transportation advice, budget planning, restaurant recommendations
Extracted needs example:  Christmas-themed places recommendations, Transportation advices, Budget planning advices, Airport recommendations


### BERTScore 

In [29]:
from bert_score import score

def bertscore(label, predicted):
    P, R, F1 = score(predicted, label,model_type="microsoft/deberta-xlarge-mnli",rescale_with_baseline=False, lang="en", verbose=False)
    return P, R, F1

In [50]:
Prec, Rec, F_1 = bertscore([label_example], [extracted_example])
print("Precision: ", Prec.item())
print("Recall: ", Rec.item())
print("F1: ", F_1.item())

Some weights of the model checkpoint at microsoft/deberta-xlarge-mnli were not used when initializing DebertaModel: ['pooler.dense.bias', 'classifier.weight', 'pooler.dense.weight', 'classifier.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Precision:  0.7698975205421448
Recall:  0.7301522493362427
F1:  0.7494983673095703


### ROUGE 

In [51]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

def rougescore(label, predicted):
    scores = scorer.score(label, predicted)
    return scores

In [54]:
P1, R1, F1, PL, RL, FL = rougescore(label_example, extracted_example)
print("F1 (Rouge-1): ", F1)
print("F1 (Rouge-L): ", FL)

F1 (Rouge-1):  0.4799999999999999
F1 (Rouge-L):  0.4799999999999999


### BLEU  

In [56]:
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu_score(reference, candidate):
    # Tokenize the strings into lists of words
    reference_tokens = nltk.word_tokenize(reference.lower())
    candidate_tokens = nltk.word_tokenize(candidate.lower())

    # Calculate BLEU score
    bleu_score = sentence_bleu([reference_tokens], candidate_tokens)
    return bleu_score

In [58]:
print(calculate_bleu_score(label_example, extracted_example))

0.24450989066362577


### Evaluate the LLM's output all together with all the metrics 

In [64]:
def evaluation(df, col_label, col_eval):
    # take the number of rows of the dataframe
    n = len(df.index)
    # BERTScore can be calculated directly on the whole column
    Prec, Rec, F_1 = bertscore(df[col_eval].tolist(), df[col_label].tolist())
    # For Rouge and BLEU we need to iterate all the rows of the dataframe
    # initialize the variables to store the sum of the metrics, to be used to calculate the average
    f1_rouge1 = 0
    f1_rougeL = 0
    bleu_t = 0
    # iterate over all the rows 
    for index, row in df.iterrows():
        # calculate the metrics for the current row
        P1, R1, F1, PL, RL, FL = rougescore(row[col_label], row[col_eval])
        bleu = calculate_bleu_score(row[col_label], row[col_eval])
        # sum the calculated metrics to the variables defined before
        bleu_t += bleu
        f1_rouge1 += F1
        f1_rougeL += FL
    # calculate the mean of the metrics
    bleu_t = round(bleu_t/n, 3)
    f1_rouge1 = round(f1_rouge1/n, 3)
    f1_rougeL = round(f1_rougeL/n, 3)
    return round(Prec.mean().item(), 3), round(Rec.mean().item(), 3), round(F_1.mean().item(), 3), f1_rouge1, f1_rougeL, bleu_t

# define a function to clean the text from the symbols
def prepare(col1):
    col = col1.str.replace(r',', lambda x: x.group().replace(',', ' '), regex=True)
    col = col.str.replace(r',', lambda x: x.group().replace('-', ' '), regex=True)
    col = col.str.lower()
    return col
    

In [60]:
# clean the two texts
df["label"] = prepare(df["label"])
df["extracted_needs"] = prepare(df["extracted_needs"])

In [66]:
print("eval NEEDS GPT")
prec_bert, rec_bert, f1_bert, f1_rou1, f1_rouL, bleu = evaluation(df.head(10), "label", "extracted_needs")
print("Precision (BERTScore): ", prec_bert)
print("Recall (BERTScore): ", rec_bert)
print("F1 (BERTScore): ", f1_bert)
print("F1 (Rouge-1): ", f1_rou1)
print("F1 (Rouge-L): ", f1_rouL)
print("BLEU: ", bleu)


eval NEEDS GPT


Some weights of the model checkpoint at microsoft/deberta-xlarge-mnli were not used when initializing DebertaModel: ['pooler.dense.bias', 'classifier.weight', 'pooler.dense.weight', 'classifier.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Precision (BERTScore):  0.795
Recall (BERTScore):  0.759
F1 (BERTScore):  0.774
F1 (Rouge-1):  0.607
F1 (Rouge-L):  0.588
BLEU:  0.399
