<a href="https://colab.research.google.com/github/Martin-Cifuentes/LR_NN_IA2/blob/master/Lab_2_fine_tune_generative_ai_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face to improve dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text. To improve the inferences, you will apply full fine-tuning and evaluate the results with ROUGE metrics. Then you will apply Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its slightly lower performance metrics.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

In [None]:
!python --version

Python 3.10.11


In [None]:
# Option 1: step-by-step installation with notebook shell commands
print("🚀 Installing PyTorch...")
!pip install torch==2.8.0+cpu torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

print("🚀 Installing Transformers...")
!pip install transformers==4.56.1

print("🚀 Installing Datasets...")
!pip install datasets==4.0.0

print("🚀 Installing Evaluate...")
!pip install evaluate==0.4.5

print("🚀 Installing ROUGE Score...")
!pip install rouge_score==0.1.2

print("🚀 Installing PEFT...")
!pip install peft==0.17.1

print("🚀 Installing additional dependencies...")
!pip install accelerate huggingface_hub safetensors

print("✨ Installation complete!")

In [None]:
# Verify installations
print("\n🔍 Verifying installations...")
try:
    import torch
    import transformers
    import datasets
    import evaluate
    import peft

    print("✅ Verification successful:")
    print(f"   PyTorch: {torch.__version__}")
    print(f"   Transformers: {transformers.__version__}")
    print(f"   Datasets: {datasets.__version__}")
    print(f"   PEFT: {peft.__version__}")
    print("\n🎉 All set for fine-tuning!")

except ImportError as e:
    print(f"❌ Verification error: {e}")
    print("You may need to restart the notebook kernel.")

In [None]:
!pip show torch torchdata transformers datasets evaluate rouge_score loralib peft

Name: torch
Version: 2.8.0+cpu
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: c:\code\fine_tunning\.finevenv\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, peft, torchaudio, torchvision
---
Name: transformers
Version: 4.56.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: c:\code\fine_tunning\.finevenv\lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft
---
Name: da



In [None]:
# Verify that transformers works

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Try the smallest model

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Basic test. Note: t5-small was trained on English-to-German/French/Romanian
# translation tasks, so a Spanish request is not supported and may yield German.
input_text = "translate English to Spanish: Hello world"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"✅ T5 works: {result}")

  from .autonotebook import tqdm as notebook_tqdm
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


✅ T5 works: Hallo Welt


<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [None]:
# Import the required components.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

[DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) is a Hugging Face dataset. It contains 10,000+ dialogues with corresponding manually labeled summaries and topics.

In [None]:
dataset = load_dataset("knkarthick/dialogsum")

In [None]:
print(dataset["train"].features)

{'id': Value('string'), 'dialogue': Value('string'), 'summary': Value('string'), 'topic': Value('string')}


Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `dtype=torch.bfloat16` specifies the numeric data type, and therefore the memory footprint, used for the model weights.

In [None]:
# Load the model and the tokenizer

model_name='google/flan-t5-base'
# model_name='t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

You can extract the number of model parameters and find out how many of them are trainable.

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"Trainable model parameters: {trainable_model_params:,.0f}\n Total model parameters: {all_model_params:,.0f}\n Percentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.0f}%"

print(print_number_of_trainable_model_parameters(original_model))

Trainable model parameters: 247,577,856
 Total model parameters: 247,577,856
 Percentage of trainable model parameters: 100%


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand.

In [None]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation` to the dialogue and append `Summary:` so that it introduces the summary, as shown below:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [None]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    print("Size of prompt list: ", len(prompt))  # Print the number of prompts in this batch
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Size of prompt list:  500


Map: 100%|██████████| 500/500 [00:00<00:00, 1011.41 examples/s]


In [None]:
print(type(tokenized_datasets))
first_example = tokenized_datasets["train"].select([0])
print(first_example)
print(first_example['labels'])

<class 'datasets.dataset_dict.DatasetDict'>
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 1
})
Column([[1363, 5, 3931, 31, 7, 652, 3, 9, 691, 18, 413, 6, 11, 7582, 12833, 77, 7, 7786, 7, 376, 12, 43, 80, 334, 215, 5, 12833, 77, 7, 31, 195, 428, 128, 251, 81, 70, 2287, 11, 11208, 12, 199, 1363, 5, 3931, 10399, 10257, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
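The trailing zeros in the `labels` shown above are padding tokens (the T5 tokenizer uses 0 as its pad token ID). A common refinement, which this notebook does not apply, is to replace padded label positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the training loss. The sketch below illustrates the idea on plain Python lists; `mask_padding_in_labels` is a hypothetical helper name, and the example sequence is a shortened version of the output above.

```python
def mask_padding_in_labels(labels, pad_token_id=0, ignore_index=-100):
    """Replace pad tokens in each label sequence with the ignore index."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in labels
    ]

# Shortened label sequence: real tokens, the end-of-sequence token (1), then padding (0).
example_labels = [[1363, 5, 3931, 10399, 10257, 5, 1, 0, 0, 0]]
masked = mask_padding_in_labels(example_labels)
print(masked)
```

If you adopt this, it would typically be applied to `example['labels']` inside `tokenize_function` before returning the batch.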

To save some time in the lab, you will subsample the dataset:

In [None]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter: 100%|██████████| 500/500 [00:00<00:00, 1577.02 examples/s]


Check the shapes of all three parts of the dataset:

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})
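The row counts above follow directly from keeping only the examples whose index is a multiple of 100. A minimal sketch of that arithmetic, assuming full split sizes of 12,460 / 500 / 1,500 (these sizes are not printed in this notebook; they are an assumption that is consistent with the counts shown), with `rows_kept` as a hypothetical helper:

```python
def rows_kept(num_rows, step=100):
    """Count the indices i in [0, num_rows) for which i % step == 0."""
    return len(range(0, num_rows, step))

# Assumed full split sizes (not printed in this notebook).
full_split_sizes = {"train": 12460, "validation": 500, "test": 1500}
subsampled = {split: rows_kept(n) for split, n in full_split_sizes.items()}
print(subsampled)
```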


The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with a reference to the original model. The other training parameters were found experimentally, and there is no need to go into detail about them for now.

In [None]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)
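Note that `max_steps=1` in the arguments above limits training to a single optimizer step: when `max_steps` is positive, it takes precedence over `num_train_epochs`. The rough sketch below shows the resulting step counts, assuming the `Trainer` default per-device batch size of 8, a single device, and no gradient accumulation. `planned_optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math

def planned_optimizer_steps(num_examples, batch_size=8, num_train_epochs=1, max_steps=-1):
    """Approximate number of optimizer steps a training run would take."""
    if max_steps > 0:
        # A positive max_steps overrides the epoch-based schedule.
        return max_steps
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_train_epochs

steps_with_cap = planned_optimizer_steps(125, max_steps=1)   # configuration used above
steps_full_epoch = planned_optimizer_steps(125)              # same data, no step cap
print(steps_with_cap, steps_full_epoch)
```

With the 125-example training subset, removing the cap would still mean only a handful of updates per epoch, which is why the notebook relies on a pre-trained checkpoint for the evaluation sections.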

Start training process...



In [None]:
# trainer.train()

Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, you can download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. The size of the downloaded instruct model is approximately 1 GB.

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [None]:
# This is used when downloading the model from an S3 bucket
# !aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/
# !ls -alh ./flan-dialogue-summary-checkpoint/pytorch_model.bin
# instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint", dtype=torch.bfloat16)

# This model checkpoint is downloaded from Hugging Face
instruct_model_name = 'truocpham/flan-dialogue-summary-checkpoint'
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_name, dtype=torch.bfloat16)


In [None]:

type(instruct_model)

transformers.models.t5.modeling_t5.T5ForConditionalGeneration

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below, you can see how the fine-tuned model is able to create a reasonable summary of the dialogue, whereas the original model fails to understand what is being asked of it.

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# suggests #Person2# adding a painting program to #Person2#'s software and upgrading the hardware. #Person2# also wants to add a CD-ROM drive.


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summaries produced by models. It compares each generated summary to a reference summary, usually written by a human. While not perfect, it does indicate the overall increase in summarization effectiveness achieved by fine-tuning.
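To give an intuition for what ROUGE-1 measures, here is a deliberately simplified sketch that computes a unigram-overlap F1 score with whitespace tokenization. It is an illustration only: it does not apply stemming or the tokenization rules used by the `rouge_score` package, so its values will generally differ from those reported by `evaluate`. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))
```

In this toy case every predicted word appears in the reference (precision 1.0) but only half of the reference words are covered (recall 0.5), which is the kind of trade-off the F1 score summarizes.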

In [None]:
rouge = evaluate.load('rouge')

Generate the outputs for a sample of the test dataset (only 10 dialogues and summaries to save time) and save the results.

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... |
| 5 | #Person2# complains to #Person1# about the tra... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy birthday, Brian. #Person2#: I... | Brian's birthday is coming. #Person1# invites ... |


Evaluate the models by computing the ROUGE metrics. Notice the improvement in the results!

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)

print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.24089921652421653), 'rouge2': np.float64(0.11769053708439897), 'rougeL': np.float64(0.22001958689458687), 'rougeLsum': np.float64(0.22134175465057818)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.4015906463624618), 'rouge2': np.float64(0.17568542724181807), 'rougeL': np.float64(0.2874569966059625), 'rougeLsum': np.float64(0.2886327613084294)}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results, which you can use to evaluate a larger section of the data. Let's do that for each of the models:

In [None]:
results = pd.read_csv("data/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.2334158581572823), 'rouge2': np.float64(0.07603964187010573), 'rougeL': np.float64(0.20145520923859048), 'rougeLsum': np.float64(0.20145899339006135)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.42161291557556113), 'rouge2': np.float64(0.18035380596301792), 'rougeL': np.float64(0.3384439349963909), 'rougeLsum': np.float64(0.33835653595561666)}


The results show substantial improvement in all ROUGE metrics:

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%
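The figures above are absolute differences in ROUGE score, expressed in percentage points. As a complementary view, the sketch below contrasts the absolute gain with the relative gain for `rouge1`, using the two `rouge1` values printed for the CSV-based evaluation. The helper names are illustrative only.

```python
def absolute_gain(new_score, old_score):
    """Difference between two scores, in percentage points."""
    return (new_score - old_score) * 100

def relative_gain(new_score, old_score):
    """Improvement expressed as a percentage of the old score."""
    return (new_score - old_score) / old_score * 100

original_rouge1 = 0.2334158581572823   # ORIGINAL MODEL rouge1 from the output above
instruct_rouge1 = 0.42161291557556113  # INSTRUCT MODEL rouge1 from the output above

abs_points = absolute_gain(instruct_rouge1, original_rouge1)
rel_percent = relative_gain(instruct_rouge1, original_rouge1)
print(f"absolute: {abs_points:.2f} points, relative: {rel_percent:.1f}%")
```

Stating which of the two is being reported avoids ambiguity when comparing results across models.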


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** instead of the full fine-tuning done above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they mean LoRA. At a very high level, LoRA allows you to fine-tune a model using fewer compute resources (in some cases, a single GPU). After fine-tuning with LoRA for a specific task, use case, or tenant, the original LLM remains unchanged and a new "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

That said, at inference time the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
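To make the memory argument concrete, here is a back-of-the-envelope sketch. It assumes the 247,577,856 base parameters reported earlier for FLAN-T5 base, an adapter of 3,538,944 parameters (the size that the rank-32 configuration used later in this notebook yields), and 2 bytes per parameter as with `bfloat16`. Activations and framework overhead are ignored, and both function names are hypothetical.

```python
BASE_PARAMS = 247_577_856      # FLAN-T5 base parameter count reported earlier
ADAPTER_PARAMS = 3_538_944     # LoRA adapter parameter count (rank 32, reported later)
BYTES_PER_PARAM = 2            # assumption: bfloat16 weights

def full_copies_bytes(num_tasks):
    """Weight memory if every task keeps its own fully fine-tuned model."""
    return num_tasks * BASE_PARAMS * BYTES_PER_PARAM

def shared_base_bytes(num_tasks):
    """Weight memory if all tasks share one base model plus one adapter each."""
    return (BASE_PARAMS + num_tasks * ADAPTER_PARAMS) * BYTES_PER_PARAM

for tasks in (1, 5, 20):
    print(tasks, full_copies_bytes(tasks), shared_base_bytes(tasks))
```

The gap widens with every additional task, since each new task adds only one adapter rather than a full copy of the model.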

<a name='3.1'></a>


### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. With PEFT/LoRA, you freeze the underlying LLM and train only the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.
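For reference, the LoRA update can be written as follows. The notation comes from the LoRA formulation rather than from anything defined in this notebook: $W_0$ is a frozen pre-trained weight matrix, $A$ and $B$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ corresponds to `lora_alpha`.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, so each adapted matrix adds $r\,(d + k)$ trainable parameters.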

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [None]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

Trainable model parameters: 3,538,944
 Total model parameters: 251,116,800
 Percentage of trainable model parameters: 1%
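The trainable count above can be reproduced by hand. The sketch assumes that FLAN-T5 base has a model dimension of 768 with 12 encoder and 12 decoder blocks, where each decoder block contains both self-attention and cross-attention. These architecture figures are assumptions not stated in this notebook, although they do reproduce the printed numbers.

```python
D_MODEL = 768              # assumed hidden size of FLAN-T5 base
ENCODER_BLOCKS = 12        # assumed; one self-attention layer per block
DECODER_BLOCKS = 12        # assumed; self-attention plus cross-attention per block
RANK = 32                  # r from lora_config
TARGETS_PER_ATTENTION = 2  # the "q" and "v" projections

attention_layers = ENCODER_BLOCKS + 2 * DECODER_BLOCKS
adapted_matrices = attention_layers * TARGETS_PER_ATTENTION
# Each adapted d x d projection gains two factors: A (r x d) and B (d x r).
params_per_matrix = RANK * (D_MODEL + D_MODEL)
lora_params = adapted_matrices * params_per_matrix

base_params = 247_577_856  # total reported for the original model
total_params = base_params + lora_params
print(lora_params, total_params, f"{100 * lora_params / total_params:.2f}%")
```

Because the formatting in `print_number_of_trainable_model_parameters` rounds to whole percent, the share is displayed as 1% in the output above.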




<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

Now everything is ready to train the PEFT adapter and save the model.



In [None]:
peft_trainer.train()
peft_model_path="./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



That training was performed on a subset of data. To load a fully trained PEFT model, read a checkpoint of a PEFT model from S3.

In [None]:
#!aws s3 cp --recursive s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/ ./peft-dialogue-summary-checkpoint-from-s3/

download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/adapter_config.json to peft-dialogue-summary-checkpoint-from-s3/adapter_config.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/special_tokens_map.json to peft-dialogue-summary-checkpoint-from-s3/special_tokens_map.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/tokenizer_config.json to peft-dialogue-summary-checkpoint-from-s3/tokenizer_config.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/tokenizer.json to peft-dialogue-summary-checkpoint-from-s3/tokenizer.json
download: s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/adapter_model.bin to peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin


Check that the size of this model is much smaller than the original LLM:

In [None]:
!dir /a .\peft-dialogue-summary-checkpoint-from-s3\adapter_model.bin

 Volume in drive C is S.O
 Volume Serial Number is 5CCB-CFE6

 Directory of c:\code\fine_tunning\peft-dialogue-summary-checkpoint-from-s3

08/09/2025  08:41 AM        14.208.525 adapter_model.bin
               1 File(s)     14.208.525 bytes
               0 Dir(s)  117.894.070.272 bytes free
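The file size is consistent with the adapter's parameter count. As a rough check, assuming the adapter weights are stored as 32-bit floats (4 bytes each, which is an assumption about the checkpoint format), 3,538,944 parameters account for almost all of the 14,208,525 bytes listed above, and the file is a small fraction of a full copy of the base model:

```python
ADAPTER_PARAMS = 3_538_944        # trainable LoRA parameters reported earlier
ADAPTER_FILE_BYTES = 14_208_525   # size of adapter_model.bin from the listing above
BASE_PARAMS = 247_577_856         # FLAN-T5 base parameters reported earlier
BYTES_PER_FLOAT32 = 4             # assumption: weights saved as float32

weights_bytes = ADAPTER_PARAMS * BYTES_PER_FLOAT32
other_bytes = ADAPTER_FILE_BYTES - weights_bytes   # serialization metadata, etc.
base_bytes = BASE_PARAMS * BYTES_PER_FLOAT32
size_ratio = ADAPTER_FILE_BYTES / base_bytes

print(f"weights: {weights_bytes:,} bytes, other file content: {other_bytes:,} bytes")
print(f"adapter is {size_ratio:.1%} of a float32 copy of the base model")
```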


Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       './peft-dialogue-summary-checkpoint-from-s3/',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` due to `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

Trainable model parameters: 0
 Total model parameters: 251,116,800
 Percentage of trainable model parameters: 0%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, fully fine-tuned and PEFT model.

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# suggests #Person2# adding a painting program to #Person2#'s software and upgrading the hardware. #Person2# also wants to add a CD-ROM drive.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person1# recommends adding a painting program to #Person2#'s software and upgrading hardware. #Person2# also wants to upgrade the hardware because it's outdated now.
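
The same prompt template is rebuilt in each evaluation cell. A minimal sketch of a helper that keeps it in one place is shown below; `PROMPT_TEMPLATE`, `build_prompt`, and the two-line example dialogue are hypothetical names and data, not part of the notebook.

```python
# Same text as the f-string prompt used in the evaluation cells.
PROMPT_TEMPLATE = """
Summarize the following conversation.

{dialogue}

Summary: """


def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the zero-shot summarization prompt."""
    return PROMPT_TEMPLATE.format(dialogue=dialogue)


# Made-up dialogue, only to show the resulting prompt layout.
example = "#Person1#: Hi.\n#Person2#: Hello."
print(build_prompt(example))
```

Keeping the template in a single constant ensures that all three models receive exactly the same prompt, so differences in the summaries come from the models rather than from the input.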


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 5 | #Person2# complains to #Person1# about the tra... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and #Person1# s... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. Kate can'... | Kate tells #Person2# Masha and Hero are gettin... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy birthday, Brian. #Person2#: I... | Brian's birthday is coming. #Person1# invites ... | Brian remembers his birthday and invites #Pers... |



Compute the ROUGE score for this subset of the data.

In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.24089921652421653), 'rouge2': np.float64(0.11769053708439897), 'rougeL': np.float64(0.22001958689458687), 'rougeLsum': np.float64(0.22134175465057818)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.4015906463624618), 'rouge2': np.float64(0.17568542724181807), 'rougeL': np.float64(0.2874569966059625), 'rougeLsum': np.float64(0.2886327613084294)}
PEFT MODEL:
{'rouge1': np.float64(0.3725351062275605), 'rouge2': np.float64(0.12138811933618107), 'rougeL': np.float64(0.27620639623170606), 'rougeLsum': np.float64(0.2758134870822362)}


Note that the PEFT model's results are close to those of the fully fine-tuned model on this sample (ROUGE-1 of about 0.37 versus 0.40), even though the training process was much cheaper.
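
For intuition about what these scores measure, here is a simplified sketch of ROUGE-N and ROUGE-L as F1 scores over whitespace tokens. It is not the implementation behind `evaluate.load('rouge')`, which tokenizes differently and applies stemming when `use_stemmer=True`, so its numbers will not match the output above. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """F1 over clipped n-gram matches between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter '&' keeps the minimum count
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """F1 based on the longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


pred = "the cat sat on the mat"
ref = "the cat is on the mat"
print(round(rouge_n(pred, ref, 1), 3))  # 0.833 (5 of 6 unigrams match)
print(round(rouge_n(pred, ref, 2), 3))  # 0.6   (3 of 5 bigrams match)
print(round(rouge_l(pred, ref), 3))     # 0.833 (LCS has 5 tokens)
```

ROUGE-2 is the strictest of the three because it requires adjacent word pairs to match, which is consistent with it being the lowest score for every model above.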

You already computed the ROUGE scores on the full dataset after loading the results from the `data/dialogue-summary-training-results.csv` file. Now load the values for the PEFT model and compare its performance with the other models.

In [None]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': np.float64(0.2334158581572823), 'rouge2': np.float64(0.07603964187010573), 'rougeL': np.float64(0.20145520923859048), 'rougeLsum': np.float64(0.20145899339006135)}
INSTRUCT MODEL:
{'rouge1': np.float64(0.42161291557556113), 'rouge2': np.float64(0.18035380596301792), 'rougeL': np.float64(0.3384439349963909), 'rougeLsum': np.float64(0.33835653595561666)}
PEFT MODEL:
{'rouge1': np.float64(0.40810631575616746), 'rouge2': np.float64(0.1633255794568712), 'rougeL': np.float64(0.32507074586565354), 'rougeLsum': np.float64(0.3248950182867091)}


The PEFT model improves on the original model less than full fine-tuning does, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%
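
The figures above are differences between raw ROUGE scores, so they are percentage points rather than relative gains. The sketch below contrasts the two views, using the full-dataset scores of the original and PEFT models rounded to four decimals; the `compare` helper is a hypothetical name introduced for illustration.

```python
# ROUGE scores copied from the full-dataset results above (rounded to 4 decimals).
original = {"rouge1": 0.2334, "rouge2": 0.0760, "rougeL": 0.2015, "rougeLsum": 0.2015}
peft = {"rouge1": 0.4081, "rouge2": 0.1633, "rougeL": 0.3251, "rougeLsum": 0.3249}


def compare(new, old):
    """Return {metric: (absolute change in points, relative change in percent)}."""
    return {
        key: ((new[key] - old[key]) * 100, (new[key] - old[key]) / old[key] * 100)
        for key in new
    }


for key, (points, relative) in compare(peft, original).items():
    print(f"{key}: {points:+.2f} points ({relative:+.1f}% relative)")
```

A gain of about 17 points in ROUGE-1 corresponds to a relative increase of roughly 75% over the original model's score, so it helps to state which of the two is being reported.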


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


Here you see a small decrease in the ROUGE metrics (about 1.3 to 1.7 percentage points) compared with the fully fine-tuned model. However, PEFT training requires far less compute and memory (often just a single GPU).
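
The resource savings come from how few weights are trained. Assuming LoRA as the PEFT method (as in section [3.1](#3.1)), each adapted weight matrix of shape d × k gets two low-rank factors with r · (d + k) trainable values instead of d · k. The sketch below works through that arithmetic; the layer dimensions and the rank are illustrative assumptions, not values taken from this notebook.

```python
def full_params(d: int, k: int) -> int:
    """Trainable weights when a d x k matrix is fine-tuned directly."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable weights in the two LoRA factors: B (d x r) and A (r x k)."""
    return r * (d + k)


d, k, r = 768, 768, 32  # hypothetical layer size and rank
full = full_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.1%}")
```

The share of trainable weights across a whole model is lower than this per-matrix ratio, because adapters are typically attached to only some of the weight matrices while everything else stays frozen.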