<a href="https://colab.research.google.com/github/armandoordonez/GenAI/blob/main/Lab_2_fine_tune_generative_ai_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune a Generative AI Model for Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face to improve its summarization of BBC news articles. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can already summarize text. To improve its inferences, you will apply full fine-tuning and evaluate the results with ROUGE metrics. Then you will apply Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its performance drawbacks.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

In [None]:
!python --version

Python 3.12.11


In [None]:
print("🚀 Installing stable PyTorch...")
!pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --upgrade

print("🚀 Installing the Hugging Face stack...")
!pip install transformers==4.44.2 peft==0.12.0 accelerate==0.34.2
!pip install datasets evaluate rouge_score
!pip install huggingface_hub safetensors

print("✨ Installation complete. Restart the kernel before verifying.")

🚀 Installing stable PyTorch...
🚀 Installing the Hugging Face stack...
✨ Installation complete. Restart the kernel before verifying.


In [None]:
# Verify the installations
print("\n🔍 Verifying installations...")
try:
    import torch
    import transformers
    import datasets
    import evaluate
    import peft

    print("✅ Verification successful:")
    print(f"   PyTorch: {torch.__version__}")
    print(f"   Transformers: {transformers.__version__}")
    print(f"   Datasets: {datasets.__version__}")
    print(f"   PEFT: {peft.__version__}")
    print("\n🎉 All set for fine-tuning!")

except ImportError as e:
    print(f"❌ Verification error: {e}")
    print("You may need to restart the notebook kernel.")


🔍 Verifying installations...
✅ Verification successful:
   PyTorch: 2.4.1+cu121
   Transformers: 4.44.2
   Datasets: 4.0.0
   PEFT: 0.12.0

🎉 All set for fine-tuning!


In [None]:
!pip show torch torchdata transformers datasets evaluate rouge_score loralib peft

Name: torch
Version: 2.4.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, fastai, peft, sentence-transformers, timm, torchaudio, torchdata, torchvision
---
Name: torchdata
Version: 0.11.0
Summary: Composable data loading modules for PyTorch
Home-page: https://github.com/pytorch/data
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD
Location: /usr/local/lib/python3.12/dist-packages
Requires: requests, torch, urllib3
Required-by: torchtu

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [None]:
# Import the required components.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM


In [None]:
dataset = load_dataset("gopalkalpande/bbc-news-summary")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
print(dataset["train"].features)

{'File_path': Value('string'), 'Articles': Value('string'), 'Summaries': Value('string')}


Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Passing `torch_dtype=torch.bfloat16` to `from_pretrained` would load the weights in bfloat16 to reduce memory use; the cell below omits it, so the weights are loaded in the default precision.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "google/flan-t5-base"

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)




You can extract the number of model parameters and find out how many of them are trainable.

In [None]:
def print_number_of_trainable_model_parameters(model):
    """Return a summary of trainable vs. total parameter counts."""
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return (
        f"Trainable model parameters: {trainable_model_params:,}\n"
        f"Total model parameters: {all_model_params:,}\n"
        f"Percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.0f}%"
    )

print(print_number_of_trainable_model_parameters(original_model))

Trainable model parameters: 247,577,856
Total model parameters: 247,577,856
Percentage of trainable model parameters: 100%
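
To get a feel for what this parameter count means in memory, the sketch below estimates the size of the weights alone at different precisions. It uses the total reported above and the standard byte widths of each dtype; the helper name `weights_bytes` is made up for this illustration, and the estimate ignores gradients, optimizer state, and activations, which add considerably more during full fine-tuning.

```python
# Bytes per parameter for common dtypes (standard sizes).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weights_bytes(num_params: int, dtype: str) -> int:
    """Bytes needed to store the weights only (no gradients or optimizer state)."""
    return num_params * BYTES_PER_PARAM[dtype]

NUM_PARAMS = 247_577_856  # total parameter count reported for flan-t5-base above

for dtype in ("float32", "bfloat16"):
    mib = weights_bytes(NUM_PARAMS, dtype) / 2**20
    print(f"{dtype}: {mib:,.0f} MiB")
```

Halving the bytes per parameter halves the weight footprint, which is the motivation for loading a model in bfloat16 when memory is tight.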


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the article compared to the reference summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand.
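
Zero-shot here means the prompt contains only an instruction and the input text, with no worked examples. As a minimal sketch, the template used in this notebook can be wrapped in a small helper so every cell builds the prompt the same way; `build_prompt` is a hypothetical name introduced for illustration, not part of the notebook.

```python
def build_prompt(article: str) -> str:
    """Wrap an article in the zero-shot instruction template used in this notebook."""
    return f"\nSummarize the following article.\n\n{article}\n\nSummary:\n"

example = build_prompt("Savvy searchers fail to spot ads.")
print(example)
```

Using one helper avoids small wording differences between prompts, which can change what the model generates.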

In [None]:
print("Dataset splits:")
print(dataset)

Dataset splits:
DatasetDict({
    train: Dataset({
        features: ['File_path', 'Articles', 'Summaries'],
        num_rows: 2224
    })
})


In [None]:
# Split the training data into train, validation, and test sets
train_testvalid = dataset["train"].train_test_split(test_size=0.2)
test_valid = train_testvalid["test"].train_test_split(test_size=0.5)

dataset["train"] = train_testvalid["train"]
dataset["validation"] = test_valid["train"]
dataset["test"] = test_valid["test"]

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['File_path', 'Articles', 'Summaries'],
        num_rows: 1779
    })
    validation: Dataset({
        features: ['File_path', 'Articles', 'Summaries'],
        num_rows: 222
    })
    test: Dataset({
        features: ['File_path', 'Articles', 'Summaries'],
        num_rows: 223
    })
})
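
The two chained calls to `train_test_split` produce an approximately 80/10/10 split. The sketch below reproduces the row counts printed above with plain arithmetic, under the assumption that the test portion is rounded up; `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test portion up (assumed rounding rule)."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(2224, 0.2)
# Second split: the held-out rows are halved into validation and test.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)
```

Because no seed is passed to `train_test_split` in the cell above, the rows that land in each split can differ between runs even though the sizes stay the same.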


In [None]:
index = 1

dialogue = dataset['test'][index]['Articles']
summary = dataset['test'][index]['Summaries']

prompt = f"""
Summarize the following article.

{dialogue}

Summary:
"""
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (634 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following article.

Savvy searchers fail to spot ads..Internet search engine users are an odd mix of naive and sophisticated, suggests a report into search habits...The report by the US Pew Research Center reveals that 87% of searchers usually find what they were looking for when using a search engine. It also shows that few can spot the difference between paid-for results and organic ones. The report reveals that 84% of net users say they regularly use Google, Ask Jeeves, MSN and Yahoo when online...Almost 50% of those questioned said they would trust search engines much less, if they knew information about who paid for results was being hidden. According to figures gathered by the Pew researchers the average users spends about 43 minutes per month carrying out 34 separate searches and looks at 1.9 webpages for each hunt. A significant chunk of net use

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the article-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend an instruction such as `Summarize the following article.` to the article and append `Summary:` to mark where the summary begins, as shown below:

Training prompt (article):
```
Summarize the following article.

This is the full text of the news article.

Summary:
```

Training response (summary):
```
This is the reference summary of the article.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).
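
To illustrate what `padding="max_length"` and `truncation=True` do to the resulting `input_ids`, here is a toy sketch with a whitespace "tokenizer". Everything in it (the `encode` helper, the vocabulary, the padding id) is hypothetical; the real FLAN-T5 tokenizer splits text into subword pieces and uses its own ids.

```python
PAD_ID = 0  # hypothetical padding id for this toy example

def encode(text: str, vocab: dict, max_length: int) -> list[int]:
    """Map whitespace-separated words to ids, then truncate or pad to max_length."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]                            # mimics truncation=True
    return ids + [PAD_ID] * (max_length - len(ids))   # mimics padding="max_length"

vocab = {}
short = encode("Summarize the following article.", vocab, max_length=6)
long = encode("one two three four five six seven eight", vocab, max_length=6)

print(short)  # shorter than max_length, so it is padded
print(long)   # longer than max_length, so it is cut off
```

Every sequence ends up with the same length, which is what allows a batch to be stacked into a single tensor. The cost is that anything past the maximum length is silently dropped.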

In [None]:
def tokenize_function(example):
    start_prompt = 'Summarize the following article.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["Articles"]]
    print("Size of prompt list: ", len(prompt))  # Print the size of the current batch
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["Summaries"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(['Articles', 'Summaries', 'File_path'])

Map:   0%|          | 0/1779 [00:00<?, ? examples/s]

Size of prompt list:  1000
Size of prompt list:  779


Map:   0%|          | 0/222 [00:00<?, ? examples/s]

Size of prompt list:  222


Map:   0%|          | 0/223 [00:00<?, ? examples/s]

Size of prompt list:  223


In [None]:
print (type(tokenized_datasets))
first_example = tokenized_datasets["train"].select([0])
print(first_example)
print(first_example['labels'])

<class 'datasets.dataset_dict.DatasetDict'>
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 1
})
Column([[486, 8, 414, 13, 112, 4551, 1932, 1469, 6, 1363, 25594, 243, 3, 88, 341, 3, 29099, 1363, 10689, 29, 12922, 11, 816, 112, 9322, 47, 17722, 5, 3845, 243, 1363, 10689, 29, 12922, 31, 7, 9322, 3, 7361, 17722, 10, 96, 3845, 47, 8, 568, 166, 13, 66, 113, 1380, 21, 48, 15736, 12, 36, 356, 95, 5, 7855, 526, 3271, 9137, 25594, 243, 34, 47, 97, 12, 3314, 3, 9, 689, 365, 8, 21760, 3825, 1955, 10689, 29, 12922, 5, 329, 52, 10689, 29, 12922, 10399, 38, 234, 15852, 336, 471, 227, 271, 1219, 16, 3245, 13, 8627, 12453, 31, 7, 7469, 5, 3845, 243, 10, 96, 26934, 97, 34, 2906, 34, 164, 59, 36, 81, 3, 9, 3, 29, 15159, 11, 70, 8359, 5, 7638, 2818, 8627, 12453, 243, 8, 917, 21, 1175, 12, 2367, 16, 8, 1270, 263, 57, 6777, 1152, 120, 14673, 29, 31, 7, 3, 29, 15159, 47, 8534, 16, 9065, 477, 6, 5864, 477, 3627, 145, 8, 1348, 5, 15944, 6, 1363, 13816, 243, 10, 96, 7238, 405, 174, 12, 36, 430, 

To save time, subsample the dataset by keeping only every 100th example of each split, then check the shapes of the three splits:

In [None]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/1779 [00:00<?, ? examples/s]

Filter:   0%|          | 0/222 [00:00<?, ? examples/s]

Filter:   0%|          | 0/223 [00:00<?, ? examples/s]

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (18, 2)
Validation: (3, 2)
Test: (3, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 18
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 3
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 3
    })
})
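
The shapes above follow directly from the filter `index % 100 == 0`, which keeps indices 0, 100, 200, and so on. A small sketch that checks the arithmetic against the split sizes shown earlier (`kept_indices` is a hypothetical helper for illustration):

```python
def kept_indices(num_rows: int, step: int = 100) -> list[int]:
    """Indices that survive a filter of the form `index % step == 0`."""
    return [i for i in range(num_rows) if i % step == 0]

for split, rows in {"train": 1779, "validation": 222, "test": 223}.items():
    print(split, len(kept_indices(rows)))
```

With only 18 training examples, this subsample is useful for checking that the pipeline runs end to end, not for obtaining a well-trained model.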


The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset together with a reference to the original model. The other training parameters were found experimentally and there is no need to go into them for now.
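
The cells shown here do not include the definition of the `trainer` object, so the following is only a sketch of how it might be built with `TrainingArguments` and `Trainer`. The helper name `build_trainer`, the output directory, and all hyperparameter values are illustrative assumptions, not the settings used to produce the checkpoint loaded later.

```python
def build_trainer(model, tokenized_datasets, output_dir="./dialogue-summary-training"):
    """Build a Trainer for full fine-tuning (all hyperparameters are placeholders)."""
    # Imported lazily so that defining this helper does not itself require transformers.
    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir=output_dir,   # hypothetical checkpoint directory
        learning_rate=1e-5,      # assumed value, not taken from this notebook
        num_train_epochs=1,      # assumed value
        weight_decay=0.01,       # assumed value
        logging_steps=1,
    )
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
    )

# Usage (assumes `original_model` and `tokenized_datasets` from the cells above):
# trainer = build_trainer(original_model, tokenized_datasets)
# trainer.train()
```

Full fine-tuning updates every parameter of the model, so even one epoch can take a long time without a GPU.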

In [None]:
torch.cuda.empty_cache()

In [None]:
#trainer.train()

In [None]:
# Use this when downloading the model from an S3 bucket
# !aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/
#!ls -alh ./flan-dialogue-summary-checkpoint/pytorch_model.bin
#instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16)

# This model checkpoint is downloaded from Hugging Face
instruct_model_name = 'truocpham/flan-dialogue-summary-checkpoint'
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_name, torch_dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below, you can see how the fine-tuned model is able to create a reasonable summary of the article, compared to the original model's inability to understand what is being asked of it.

In [None]:
index = 1

dialogue = dataset['test'][index]['Articles']
human_baseline_summary = dataset['test'][index]['Summaries']

prompt = f"""
Summarize the following article.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Almost 50% of those questioned said they would trust search engines much less, if they knew information about who paid for results was being hidden.Said the Pew report: "This finding is ironic, since nearly half of all users say they would stop using search engines if they thought engines were not being clear about how they presented paid results."Tony Macklin, spokesman for Ask Jeeves, said the results reflected its own research which showed that people use different search engines because the way the sites gather information means they can provide different results for the same query.Internet search engine users are an odd mix of naive and sophisticated, suggests a report into search habits.The report by the US Pew Research Center reveals that 87% of searchers usually find what they were looking for when using a search engine.A small number, 17%, said they wo

In [None]:
from transformers import GenerationConfig

# Example index
index = 1

dialogue = dataset['test'][index]['Articles']
human_baseline_summary = dataset['test'][index]['Summaries']

prompt = f"""
Summarize the following article.

{dialogue}

Summary:
"""

# Helper function to generate text with any model
def generate_summary(model, tokenizer, prompt, max_new_tokens=200, num_beams=1):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate text
    outputs = model.generate(
        **inputs,
        generation_config=GenerationConfig(max_new_tokens=max_new_tokens, num_beams=num_beams)
    )

    # Decode the output
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Generate with both models
original_model_text_output = generate_summary(original_model, tokenizer, prompt)
instruct_model_text_output = generate_summary(instruct_model, tokenizer, prompt)

# Show the results
dash_line = "-" * 80
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')


--------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Almost 50% of those questioned said they would trust search engines much less, if they knew information about who paid for results was being hidden.Said the Pew report: "This finding is ironic, since nearly half of all users say they would stop using search engines if they thought engines were not being clear about how they presented paid results."Tony Macklin, spokesman for Ask Jeeves, said the results reflected its own research which showed that people use different search engines because the way the sites gather information means they can provide different results for the same query.Internet search engine users are an odd mix of naive and sophisticated, suggests a report into search habits.The report by the US Pew Research Center reveals that 87% of searchers usually find what they were looking for when using a search engine.A small number, 17%, said they wouldn't really miss 

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of the summaries produced by models. It compares each generated summary to a reference summary, usually written by a human. While it is not perfect, it does indicate the overall increase in summarization effectiveness achieved by fine-tuning.
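
To make the comparison concrete, here is a simplified sketch of ROUGE-1, which scores unigram overlap between a prediction and a reference. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` implementation behind `evaluate.load('rouge')`; the function name `rouge1` is made up for this illustration.

```python
from collections import Counter

def rouge1(prediction: str, reference: str) -> dict[str, float]:
    """Simplified ROUGE-1: precision, recall and F1 over overlapping unigrams."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

In this example five of the six words overlap, so precision, recall, and F1 are all 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.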

In [None]:
rouge = evaluate.load('rouge')

Generate the outputs for a sample of the test dataset (only 10 articles and summaries, to save time) and save the results.

In [None]:
from transformers import GenerationConfig
import pandas as pd


dialogues = dataset['test'][0:10]['Articles']
human_baseline_summaries = dataset['test'][0:10]['Summaries']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following article.

{dialogue}

Summary: """

    # Generate with the original model
    inputs = tokenizer(prompt, return_tensors="pt").to(original_model.device)
    outputs = original_model.generate(
        **inputs,
        generation_config=GenerationConfig(max_new_tokens=200)
    )
    original_model_summaries.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

    # Generate with the instruction fine-tuned model
    inputs = tokenizer(prompt, return_tensors="pt").to(instruct_model.device)
    outputs = instruct_model.generate(
        **inputs,
        generation_config=GenerationConfig(max_new_tokens=200)
    )
    instruct_model_summaries.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Build the DataFrame
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df


| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | In order for a virtual office to succeed, keep... | ...and this week, we're hearing from Marianne ... | Marianne Petersen's plan to convert a barn int... |
| 1 | Almost 50% of those questioned said they would... | Internet search engine users are an odd mix of... | Internet search engine users are an odd mix of... |
| 2 | An O'Gara penalty put Ireland more than a conv... | Ireland's Brian O'Driscoll guided Ireland to a... | Brian O'Driscoll guides Ireland to a workmanli... |
| 3 | Labour will continue to pursue controversial r... | Labour's election chief has defended the party... | Labour's election chief Alan Milburn has said ... |
| 4 | Replacements: Everitt for Mapletoft (53), Hodg... | Wasps smashed London Irish's hopes of a Premie... | Wasps smashed London Irish by a tense first-ha... |
| 5 | The Birmingham athlete, who clocked a season's... | British long jumper Chris Tomlinson has withdr... | Chris Tomlinson has cut his schedule to ensure... |
| 6 | Fiat claims that GM is legally obliged to buy ... | Fiat boss Sergio Marchionne says GM's argument... | Fiat will meet car giant General Motors on Tue... |
| 7 | And he tells them that if they don't confess -... | The 'prisoner's dilemma' is a perverse logic t... | The key feature of an endless feud is that eve... |
| 8 | Nearly 9,000 business leaders in 104 countries... | Business leaders in Africa are failing to plan... | Business leaders in 104 countries have no poli... |
| 9 | The woman said she was assaulted after a recor... | A woman has sued US rapper Snoop Dogg for $25m... | US rapper Snoop Dogg has been sued for $25m by... |


Evaluate the models by computing the ROUGE metrics. Notice the improvement in the results!

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)

print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.15439675512291906), 'rouge2': np.float64(0.0937933676631198), 'rougeL': np.float64(0.12458924393464818), 'rougeLsum': np.float64(0.12472616585584786)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.22805756123772541), 'rouge2': np.float64(0.1534748496417074), 'rougeL': np.float64(0.19143048939990928), 'rougeLsum': np.float64(0.19314716055848202)}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results, which you can use to evaluate a larger portion of the data. That file is not loaded in this version of the notebook; the cell below only checks which device the original model is running on:

In [None]:
print(original_model.device)

cpu


The results show substantial improvement in all ROUGE metrics:

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

# Difference between the fine-tuned model's scores and the original model's scores
improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 7.37%
rouge2: 5.97%
rougeL: 6.68%
rougeLsum: 6.84%
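
These figures are absolute differences in ROUGE score (percentage points), not relative gains. As a sketch of the distinction, the snippet below recomputes both from the scores printed earlier for the original and instruct models; the variable names are chosen for this illustration.

```python
# ROUGE scores copied from the outputs printed above.
original = {
    "rouge1": 0.15439675512291906,
    "rouge2": 0.0937933676631198,
    "rougeL": 0.12458924393464818,
    "rougeLsum": 0.12472616585584786,
}
instruct = {
    "rouge1": 0.22805756123772541,
    "rouge2": 0.1534748496417074,
    "rougeL": 0.19143048939990928,
    "rougeLsum": 0.19314716055848202,
}

# Absolute gain: difference in score, expressed in percentage points.
absolute_points = {k: (instruct[k] - original[k]) * 100 for k in original}
# Relative gain: difference as a percentage of the original model's score.
relative_pct = {k: (instruct[k] - original[k]) / original[k] * 100 for k in original}

for k in original:
    print(f"{k}: +{absolute_points[k]:.2f} points absolute, +{relative_pct[k]:.1f}% relative")
```

Because the original scores are low, a gain of a few points corresponds to a much larger relative improvement. With only 10 evaluation examples, both numbers should be read as rough indications.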
