# GenAI course - Evaluation


## Use pre-trained model

Select a specific task involving image or text generation: image transformation, translation, summarisation, Q&A.

Find a dataset with annotated data corresponding to your selected task and load it.

In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset("EdinburghNLP/xsum")

In [4]:
print((dataset['train'][0]))



Divide the dataset into a train and a test set.

In [5]:
train_dataset = dataset["train"]
test_dataset = dataset["test"]

In [6]:
print(f"Train size: {len(train_dataset)}")
print(f"Test size: {len(test_dataset)}")

Train size: 204045
Test size: 11334


Find a pre-trained model on the hugging face hub suitable for the selected task and load it.

In [7]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

In [8]:
import torch
print("Using GPU:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

Using GPU: True
Device: NVIDIA GeForce RTX 4060 Laptop GPU


In [9]:
model_name = "Falconsai/text_summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [10]:
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer , device=0)

Device set to use cuda:0


Select the appropriate metrics to evaluate the considered task.

==> I'll be choosing ROUGE

Use the selected metrics to evalute the model on your test set.

In [11]:
from evaluate import load

In [12]:
rouge = load('rouge')

In [13]:
from tqdm import tqdm

In [14]:
batch_size = 32
predictions = []
references = []

for i in tqdm(range(0, len(test_dataset[:11333]), batch_size)):
    batch = test_dataset[i: i + batch_size]
    inputs = batch["document"]
    refs = batch["summary"]

    #summarizing
    results = summarizer(
        inputs,
        max_length=60,
        min_length=10,
        do_sample=False,
        truncation=True
    )

    #results collection
    batch_preds = [res["summary_text"] for res in results]
    predictions.extend(batch_preds)
    references.extend(refs)

100%|██████████| 1/1 [00:17<00:00, 17.08s/it]


In [15]:
batch_preds = [res["summary_text"] for res in results]
predictions.extend(batch_preds)
references.extend(refs)

seeing some summary generation examples:  

In [16]:
for i in range(3):
    print(f"\n**Article #{i+1}**")
    print(f"\n Predicted Summary:\n{predictions[i]}")
    print(f"\n Reference Summary:\n{references[i]}")
    print("-" * 80)


**Article #1**

 Predicted Summary:
Prison Link Cymru said some ex-offenders were living rough for up to a year . The Welsh Government said more people than ever were getting help to address housing problems . Changes to the Housing Act in Wales removed the right for prison leavers to be given

 Reference Summary:
There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.
--------------------------------------------------------------------------------

**Article #2**

 Predicted Summary:
a 26-year-old man appeared at Edinburgh Sheriff Court on Thursday . Detectives said three firearms, ammunition and a five-figure sum of money were recovered .

 Reference Summary:
A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.
--------------------------------------------------------------------------------

**Article #3**

 Predicted Summary:
Jordan Hill, Brittany Covington and Tesfaye Cooper appear in court . Th

In [17]:
#rouge evaluation with stemmer
results = rouge.compute(predictions=predictions, references=references, use_stemmer=True)

In [18]:
print("ROUGE Evaluation Results with stemmer:\n")
for metric, score in results.items():
    print(f"{metric.upper()}: {score}")

ROUGE Evaluation Results with stemmer:

ROUGE1: 0.2203514837155019
ROUGE2: 0.04524804536021169
ROUGEL: 0.14859661051613854
ROUGELSUM: 0.1483190708967212


In [19]:
#rouge evaluation WITHOUT stemmer
results = rouge.compute(predictions=predictions, references=references, use_stemmer=False)

In [20]:
print("ROUGE Evaluation Results without stemmer:\n")
for metric, score in results.items():
    print(f"{metric.upper()}: {score}")

ROUGE Evaluation Results without stemmer:

ROUGE1: 0.20796872560730545
ROUGE2: 0.0430749737115319
ROUGEL: 0.14176782992317571
ROUGELSUM: 0.14136828931767265


Comment the results.

In [57]:
import pandas as pd

# ROUGE scores before and after fine-tuning
rouge_before = {
    "ROUGE1": 0.2205,
    "ROUGE2": 0.0452,
    "ROUGEL": 0.1486
}

rouge_after = {
    "ROUGE1": 0.2927,
    "ROUGE2": 0.0819,
    "ROUGEL": 0.2263
}

# Build table
df = pd.DataFrame({
    "Before Fine-tuning": rouge_before,
    "After Fine-tuning": rouge_after,
})

# Add column for change
df["Change"] = df["After Fine-tuning"] - df["Before Fine-tuning"]
df = df.round(4)  # Optional: round for readability

# Display the table
df

Unnamed: 0,Before Fine-tuning,After Fine-tuning,Change
ROUGE1,0.2205,0.2927,0.0722
ROUGE2,0.0452,0.0819,0.0367
ROUGEL,0.1486,0.2263,0.0777


## Model fine-tuning

Use the train set to fine-tune the pre-trained model.

In [43]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

In [44]:
#the tokenization function
def preprocess_function(examples):
    inputs = examples["document"]
    targets = examples["summary"]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=60, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [45]:
subset_train = train_dataset.select(range(2000))
subset_test = test_dataset.select(range(500))

In [46]:
tokenized_train = subset_train.map(preprocess_function, batched=True)
tokenized_test = subset_test.map(preprocess_function, batched=True)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]



Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Use the selected metrics to evalute the fine-tuned model on your test set.

In [47]:
#config train
training_args = Seq2SeqTrainingArguments(
    output_dir="./finetuned_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=3,  # Use 1 epoch for a quick run, increase for better results
    fp16=torch.cuda.is_available(),  # Use mixed precision if on GPU
    predict_with_generate=True,
    logging_dir='./logs',
)



In [48]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [49]:
#defining the trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)


  trainer = Seq2SeqTrainer(


In [50]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.6965,2.524773
2,2.6646,2.525098
3,2.6485,2.525505


TrainOutput(global_step=1500, training_loss=2.6698451334635416, metrics={'train_runtime': 185.3774, 'train_samples_per_second': 32.366, 'train_steps_per_second': 8.092, 'total_flos': 789839950381056.0, 'train_loss': 2.6698451334635416, 'epoch': 3.0})

Comment the results.

In [53]:
inputs = [example["document"] for example in subset_test]
references = [example["summary"] for example in subset_test]

In [54]:
predictions = []
for doc in inputs:
    output = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    summary_ids = model.generate(**output, max_length=60, min_length=10)
    pred = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    predictions.append(pred)

In [56]:
#evaluating with rouge
results = rouge.compute(predictions=predictions, references=references, use_stemmer=True)

print("ROUGE Evaluation After Fine-tuning:\n")
for metric, score in results.items():
    print(f"{metric.upper()}: {score}")

ROUGE Evaluation After Fine-tuning:

ROUGE1: 0.2927370305886142
ROUGE2: 0.08188269800878917
ROUGEL: 0.2263433231636263
ROUGELSUM: 0.22617259846394355


## To go further

Explore how to implement a model from scrath and train it on your train set.

Use the selected metrics to evalute the model trained from scratch on your test set.

Comment the results.