<a href="https://colab.research.google.com/github/armandoordonez/GenAI/blob/main/Lab_2_fine_tune_generative_ai_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune a Generative AI Model for Medical Text Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face to improve its summaries of scientific medical texts. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high-quality instruction-tuned model and can summarize specialized medical text. To improve the inferences, you will first apply full fine-tuning and evaluate the results with ROUGE metrics. Then you will apply Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its performance drawbacks.

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Medical Articles Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

In [1]:
!python --version

Python 3.10.11


In [2]:
# Option 1: step-by-step installation with magic commands
print("🚀 Installing PyTorch...")
%pip install torch==2.8.0+cpu torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

print("🚀 Installing Transformers...")
%pip install transformers==4.56.1

print("🚀 Installing Datasets...")
%pip install datasets==4.0.0

print("🚀 Installing Evaluate...")
%pip install evaluate==0.4.5

print("🚀 Installing ROUGE Score...")
%pip install rouge_score==0.1.2

print("🚀 Installing PEFT...")
%pip install peft==0.17.1

print("🚀 Installing additional dependencies...")
%pip install accelerate huggingface_hub safetensors

print("✨ Installation complete!")

# Verify the installations
print("\n🔍 Verifying installations...")
try:
    import torch
    import transformers
    import datasets
    import evaluate
    import peft

    print("✅ Verification successful:")
    print(f"   PyTorch: {torch.__version__}")
    print(f"   Transformers: {transformers.__version__}")
    print(f"   Datasets: {datasets.__version__}")
    print(f"   PEFT: {peft.__version__}")
    print("\n🎉 Everything is ready for fine-tuning!")

except ImportError as e:
    print(f"❌ Verification error: {e}")
    print("You may need to restart the notebook kernel.")

🚀 Installing PyTorch...
Looking in indexes: https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
🚀 Installing Transformers...

[notice] A new release of pip is available: 23.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.
🚀 Installing Datasets...
Note: you may need to restart the kernel to use updated packages.
🚀 Installing Evaluate...
Note: you may need to restart the kernel to use updated packages.
🚀 Installing ROUGE Score...
Note: you may need to restart the kernel to use updated packages.
🚀 Installing PEFT...
Note: you may need to restart the kernel to use updated packages.
🚀 Installing additional dependencies...
Note: you may need to restart the kernel to use updated packages.
✨ Installation complete!

🔍 Verifying installations...

✅ Verification successful:
   PyTorch: 2.8.0+cpu
   Transformers: 4.56.1
   Datasets: 4.0.0
   PEFT: 0.17.1

🎉 Everything is ready for fine-tuning!


In [3]:
%pip show torch torchdata transformers datasets evaluate rouge_score loralib peft

Name: torch
Version: 2.8.0+cpu
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: c:\users\anfep\appdata\local\programs\python\python310\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, peft, torchaudio, torchvision
---
Name: transformers
Version: 4.56.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: c:\users\anfep\appdata\local\programs\python\python310\lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors,



In [4]:
%pip install sentencepiece

Note: you may need to restart the kernel to use updated packages.





In [5]:
# Check that transformers works

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Try the smallest model

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Basic test. The original T5 checkpoints were trained to translate English to
# German, French, and Romanian only, so a request for Spanish still comes back
# in German. The test therefore asks for German explicitly.
input_text = "translate English to German: Hello world"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"✅ T5 works: {result}")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


✅ T5 works: Hallo Welt


<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [6]:
# Import the required components.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

[PubMed Summarization](https://huggingface.co/datasets/ccdv/pubmed-summarization) is a Hugging Face dataset of scientific medical articles from PubMed paired with their abstracts. It contains more than 133,000 articles, which makes it well suited to training summarization models specialized in medical and scientific terminology.

In [7]:
%pip install hf_xet

Note: you may need to restart the kernel to use updated packages.





In [8]:
#dataset = load_dataset("knkarthick/dialogsum")
dataset = load_dataset("ccdv/pubmed-summarization")
print(f"Available splits: {list(dataset.keys())}")

Available splits: ['train', 'validation', 'test']


In [9]:
print("DATASET STRUCTURE:")
print(dataset["train"].features)

print("\nSPLIT SIZES:")
for split_name in dataset.keys():
    print(f"  {split_name}: {len(dataset[split_name]):,}")

print("\nDATASET EXAMPLE:")
example = dataset["train"][0]
print(f"Article (first 300 chars):\n{example['article'][:300]}...\n")
print(f"Summary (abstract):\n{example['abstract']}")
print("\nLengths:")
print(f"  Article: {len(example['article'])} characters")
print(f"  Summary: {len(example['abstract'])} characters")

DATASET STRUCTURE:
{'article': Value('string'), 'abstract': Value('string')}

SPLIT SIZES:
  train: 119,924
  validation: 6,633
  test: 6,658

DATASET EXAMPLE:
Article (first 300 chars):
a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing countries . 
 in iran a study among 752 high school...

Summary (abstract):
background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7 - 13 years old ) based on advocacy approach in shiraz , iran . 
 the project provided nutritious snacks

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `dtype=torch.bfloat16` specifies the numeric data type in which the model weights are loaded, which halves their memory footprint relative to 32-bit floats.

In [10]:
# Load the model and the tokenizer

model_name='google/flan-t5-base'
# model_name='t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
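
To make the effect of the `dtype` choice concrete, the following sketch estimates the size of the weights in 32-bit and 16-bit formats. It is only a back-of-the-envelope calculation: `weight_memory_mb` is a hypothetical helper, and the parameter count is an assumed round figure in the range of FLAN-T5 base, not a value read from the model.

```python
# Bytes needed to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumed round parameter count (an illustration, not a measured value).
# In the notebook, the exact figure is:
#   sum(p.numel() for p in original_model.parameters())
ASSUMED_NUM_PARAMS = 250_000_000

for dtype in ("float32", "bfloat16"):
    size = weight_memory_mb(ASSUMED_NUM_PARAMS, dtype)
    print(f"{dtype}: ~{size:,.0f} MB")
```

The estimate covers the weights only; activations, gradients, and optimizer state add to the total during training.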

<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the article compared with the reference summary, but it does pull out important information from the text, which indicates that the model can be fine-tuned for the task at hand.

In [11]:
index = 5

article = dataset['test'][index]['article']
summary = dataset['test'][index]['abstract']

# Truncate the article; the truncated preview is also the text sent to the model
article_preview = article[:500] + "..." if len(article) > 500 else article

prompt = f"""
Summarize the following medical research article.

{article_preview}

Summary: """

inputs = tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=100,
    )[0],
    skip_special_tokens=True
)

dash_line = '-' * 100
print(dash_line)
print(f'MEDICAL ARTICLE (PREVIEW):\n{article_preview}')
print(dash_line)
print(f'HUMAN SUMMARY (BASELINE):\n{summary}\n')
print(dash_line)
print(f'ZERO-SHOT GENERATED SUMMARY:\n{output}')

----------------------------------------------------------------------------------------------------
MEDICAL ARTICLE (PREVIEW):
determinar a presena de anticorpos ige especficos para superantgenos estafiloccicos e o grau de sensibilizao mediada por esses , assim como se esses esto associados  gravidade da asma em pacientes adultos . 
 estudo transversal incluindo asmticos adultos em acompanhamento ambulatorial em um hospital universitrio tercirio no rio de janeiro ( rj ) . 
 os pacientes foram alocados consecutivamente em dois grupos de gravidade da asma segundo critrios da global initiative for asthma : asma leve ( al )...
----------------------------------------------------------------------------------------------------
HUMAN SUMMARY (BASELINE):
abstractobjective : to determine the presence of staphylococcal superantigen - specific ige antibodies and degree of ige - mediated sensitization , as well as whether or not those are associated with the severity of asthma in adult patients

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Medical Articles Dataset

You need to convert the article-summary pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following medical research article.` to the article and append `Summary:` after it, as shown below:

Training prompt (article):
Summarize the following medical research article.
Background: Diabetes mellitus is a chronic metabolic disorder characterized by hyperglycemia resulting from defects in insulin secretion, insulin action, or both...
Summary:

Training response (abstract):

This study investigates the relationship between diabetes and cardiovascular complications in a cohort of 500 patients over 5 years.


Then tokenize the prompt-response pairs and extract their `input_ids` (one per token).

In [12]:
def tokenize_function(example):
    start_prompt = 'Summarize the following medical research article.\n\n'
    end_prompt = '\n\nSummary: '
    
    # Truncate very long articles
    max_article_length = 500  # characters

    prompt = []
    for article in example["article"]:
        # Truncate the article
        truncated_article = article[:max_article_length] + "..." if len(article) > max_article_length else article
        full_prompt = start_prompt + truncated_article + end_prompt
        prompt.append(full_prompt)

    print("Size of prompt list: ", len(prompt))

    # Tokenize
    example['input_ids'] = tokenizer(
        prompt, 
        padding="max_length", 
        truncation=True, 
        max_length=512, 
        return_tensors="pt"
    ).input_ids
    
    example['labels'] = tokenizer(
        example["abstract"], 
        padding="max_length", 
        truncation=True, 
        max_length=150,
        return_tensors="pt"
    ).input_ids

    return example

print("Starting tokenization of the medical dataset...")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['article', 'abstract'])

print("Tokenization complete")
print(tokenized_datasets)

Starting tokenization of the medical dataset...

Size of prompt list:  1000
Size of prompt list:  1000
Size of prompt list:  1000
Size of prompt list:  1000
Size of prompt list:  1000
Size of prompt list:  1000
Size of prompt list:  658

Map: 100%|██████████| 6658/6658 [00:01<00:00, 3506.29 examples/s]

Tokenization complete
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 6658
    })
})
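
The label sequences above are padded to a fixed length with the tokenizer's pad token. A common refinement, which this notebook does not apply, is to replace those padding positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the training loss. The sketch below illustrates the idea on plain Python lists. `mask_label_padding` is a hypothetical helper, and the pad id of `0` is an assumption that matches T5 tokenizers; in the notebook you would pass `tokenizer.pad_token_id` instead.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Toy batch: two label sequences padded to length 6 with the assumed pad id 0.
toy_labels = [
    [2458, 3, 10, 1, 0, 0],
    [8, 915, 1, 0, 0, 0],
]
print(mask_label_padding(toy_labels))
```

If you apply this inside `tokenize_function`, map `-100` back to the pad id before decoding labels, since `-100` is not a valid token id.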





In [13]:
print("VERIFYING TOKENIZATION:")
print(f"Dataset type: {type(tokenized_datasets)}")

first_example = tokenized_datasets["train"].select([0])
print("\nFirst tokenized example:")
print(first_example)

input_ids = first_example['input_ids'][0]
labels = first_example['labels'][0]

print(f"\nType of input_ids: {type(input_ids)}")
print(f"Type of labels: {type(labels)}")

# Datasets returns plain Python lists by default; torch.as_tensor accepts both
# lists and tensors, so no type check is needed to report the shapes.
print(f"Shape of input_ids: {torch.as_tensor(input_ids).shape}")
print(f"Shape of labels: {torch.as_tensor(labels).shape}")

# Decode to verify the contents
sample_input = tokenizer.decode(input_ids, skip_special_tokens=True)
sample_label = tokenizer.decode(labels, skip_special_tokens=True)

print(f"\nDecoded input (first 300 chars):\n{sample_input[:300]}...")
print(f"\nDecoded label:\n{sample_label}")

# Check lengths
print("\nSTATISTICS:")
print(f"  Length of input_ids: {len(input_ids)}")
print(f"  Length of labels: {len(labels)}")
print(f"  Non-padding tokens in input: {sum(1 for x in input_ids if x != tokenizer.pad_token_id)}")
print(f"  Non-padding tokens in labels: {sum(1 for x in labels if x != tokenizer.pad_token_id)}")

VERIFYING TOKENIZATION:
Dataset type: <class 'datasets.dataset_dict.DatasetDict'>

First tokenized example:
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 1
})

Type of input_ids: <class 'list'>
Type of labels: <class 'list'>
Shape of input_ids: torch.Size([512])
Shape of labels: torch.Size([150])

Decoded input (first 300 chars):
Summarize the following medical research article. a recent systematic analysis showed that in 2011 , 314 ( 296 - 331 ) million children younger than 5 years were mildly , moderately or severely stunted and 258 ( 240 - 274 ) million were mildly , moderately or severely underweight in the developing c...

Decoded label:
background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school - aged children in shiraz , iran.materials and methods : this case - control nutritional intervention has been done between 2008 and 2009 on 28

In [14]:
def smart_subsample(dataset, min_examples=50, max_examples=500):
    """Subsample the dataset while keeping a minimum number of useful examples."""
    total_size = len(dataset)

    if total_size <= min_examples:
        return dataset
    elif total_size <= max_examples:
        step = max(2, total_size // (min_examples * 2))
    else:
        # Integer division rounds the step down, so the result can slightly
        # exceed max_examples (for example, 502 instead of 500).
        step = total_size // max_examples
    return dataset.filter(lambda example, index: index % step == 0, with_indices=True)

for split in tokenized_datasets.keys():
    original_size = len(tokenized_datasets[split])
    tokenized_datasets[split] = smart_subsample(tokenized_datasets[split])
    new_size = len(tokenized_datasets[split])
    print(f"{split}: {original_size} → {new_size} examples")

print("\nFINAL SIZES:")
print(f"  Training: {tokenized_datasets['train'].shape}")
print(f"  Validation: {tokenized_datasets['validation'].shape}")
print(f"  Test: {tokenized_datasets['test'].shape}")

train: 119924 → 502 examples
validation: 6633 → 511 examples
test: 6658 → 513 examples

FINAL SIZES:
  Training: (502, 2)
  Validation: (511, 2)
  Test: (513, 2)
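
The split sizes above land slightly over the 500-example cap because the stride is computed with integer division. The following sketch reproduces the counts without loading any data; `strided_sample_size` is a hypothetical helper that mirrors the `index % step == 0` filter for the case where the split is larger than `max_examples`.

```python
def strided_sample_size(total_size: int, max_examples: int) -> int:
    """Count the indices kept by `index % step == 0` when step = total_size // max_examples."""
    step = total_size // max_examples
    # range(0, total_size, step) enumerates exactly the indices divisible by step.
    return len(range(0, total_size, step))


# Split sizes reported earlier in the notebook.
for name, size in [("train", 119_924), ("validation", 6_633), ("test", 6_658)]:
    print(f"{name}: {size} -> {strided_sample_size(size, 500)}")
```

Stride sampling keeps the selection deterministic and spread across the whole split; a shuffled `select` would be an alternative if an exact count matters.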





<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass in the preprocessed medical articles dataset along with a reference to the original model. The training parameters are configured specifically for the medical summarization task.

In [15]:
output_dir = f'./medical-summarization-training-{int(time.time())}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=3,  # has no effect here: max_steps below takes precedence
    weight_decay=0.01,
    logging_steps=10,
    max_steps=100,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

print("Training configuration ready for medical summarization")
print(f"Training data: {len(tokenized_datasets['train'])} examples")
print(f"Validation data: {len(tokenized_datasets['validation'])} examples")

Training configuration ready for medical summarization
Training data: 502 examples
Validation data: 511 examples
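
To see what this configuration amounts to in practice, the sketch below works out how much of the data a step-capped run covers. It is a rough calculation under stated assumptions: a single device, no gradient accumulation, and a final partial batch that is kept. `training_schedule` is a hypothetical helper, not part of the `Trainer` API.

```python
import math


def training_schedule(num_examples, batch_size, max_steps, eval_steps):
    """Summarize how far a step-capped run gets (one device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "epochs_covered": max_steps / steps_per_epoch,
        "eval_at_steps": list(range(eval_steps, max_steps + 1, eval_steps)),
    }


# Values taken from the configuration cell: 502 training examples, batch size 4,
# max_steps=100, eval_steps=50.
schedule = training_schedule(num_examples=502, batch_size=4, max_steps=100, eval_steps=50)
for key, value in schedule.items():
    print(f"{key}: {value}")
```

Under these assumptions the run stops before completing a single pass over the 502 training examples, which is why `num_train_epochs=3` never comes into play.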


Start training process...



In [16]:
# trainer.train()
# Training is commented out to save time; see the note below.

Fully fine-tuning the model takes hours on a GPU. To save time, you can download a checkpoint of the fully fine-tuned model and use it in the rest of the notebook. The size of the downloaded instruct model is approximately 1 GB.

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model. Note that the next cell loads `google/flan-t5-base`, the same checkpoint as `original_model`, so it acts as a stand-in for a fine-tuned model rather than being one. To compare against genuinely fine-tuned weights, run `trainer.train()` and load the checkpoint it saves.

In [17]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', dtype=torch.bfloat16)

print("'Fine-tuned' model")

'Fine-tuned' model
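
The cell above loads `google/flan-t5-base` again rather than fine-tuned weights. If you do run the training, one possible way to obtain a separate fine-tuned model is sketched below. It is an outline under assumptions, not the notebook's own procedure: `final_model_dir` and `train_and_reload` are hypothetical helpers, and the `final` subfolder name is chosen only for illustration. `Trainer.train`, `Trainer.save_model`, and `from_pretrained` are used as they are elsewhere in the Transformers library.

```python
from pathlib import Path


def final_model_dir(output_dir: str) -> Path:
    """Hypothetical folder inside the training output directory for the final weights."""
    return Path(output_dir) / "final"


def train_and_reload(trainer, output_dir: str):
    """Train, save the resulting weights, and reload them as a separate model object."""
    # Imported here so that defining the function does not require transformers.
    from transformers import AutoModelForSeq2SeqLM

    trainer.train()
    save_dir = final_model_dir(output_dir)
    trainer.save_model(str(save_dir))  # writes the model weights and config
    return AutoModelForSeq2SeqLM.from_pretrained(str(save_dir))


# Only the path helper is exercised here; training itself needs the notebook's trainer.
print(final_model_dir("./medical-summarization-training-example"))
```

In the notebook this would be called as `instruct_model = train_and_reload(trainer, output_dir)`. Because `Trainer` updates the model it is given in place, `original_model` would then also hold the fine-tuned weights, so reload it from `google/flan-t5-base` before any before-and-after comparison.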


In [18]:

type(instruct_model)

transformers.models.t5.modeling_t5.T5ForConditionalGeneration

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below, you can see how the fine-tuned model should be able to produce more accurate and specific summaries of medical articles than the original model, whose ability to handle specialized medical terminology is limited.

In [19]:
index = 25
article = dataset['test'][index]['article']
human_baseline_summary = dataset['test'][index]['abstract']

article_for_model = article[:1000] + "..." if len(article) > 1000 else article

prompt = f"""
Summarize the following medical research article.

{article_for_model}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).input_ids


original_model_outputs = original_model.generate(
    input_ids=input_ids, 
    generation_config=GenerationConfig(max_new_tokens=120, num_beams=1)
)
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(
    input_ids=input_ids, 
    generation_config=GenerationConfig(max_new_tokens=120, num_beams=1)
)
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

dash_line = '-' * 100
print(dash_line)
print('MEDICAL ARTICLE (SAMPLE):')
print(f'{article_for_model[:400]}...')
print(dash_line)
print(f'HUMAN SUMMARY (BASELINE):\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'FINE-TUNED MODEL:\n{instruct_model_text_output}')
print(dash_line)

def analyze_medical_summary(summary, reference):
    """Compute simple quality heuristics for a medical summary."""
    summary_words = set(summary.lower().split())
    reference_words = set(reference.lower().split())

    medical_terms = {'patient', 'treatment', 'study', 'clinical', 'therapy', 'disease', 'diagnosis', 'medical', 'health', 'hospital', 'doctor', 'medicine'}

    summary_medical = len(summary_words.intersection(medical_terms))

    overlap = len(summary_words.intersection(reference_words))

    return {
        'length': len(summary.split()),
        'medical_terms': summary_medical,
        'word_overlap': overlap,
        # Fraction of the summary's unique words that also appear in the reference
        'similarity_score': overlap / max(len(summary_words), 1)
    }

print("\nQUALITY ANALYSIS:")
orig_analysis = analyze_medical_summary(original_model_text_output, human_baseline_summary)
inst_analysis = analyze_medical_summary(instruct_model_text_output, human_baseline_summary)

print("Original model:")
print(f"  Length: {orig_analysis['length']} words")
print(f"  Medical terms: {orig_analysis['medical_terms']}")
print(f"  Similarity to reference: {orig_analysis['similarity_score']:.3f}")

print("Fine-tuned model:")
print(f"  Length: {inst_analysis['length']} words")
print(f"  Medical terms: {inst_analysis['medical_terms']}")
print(f"  Similarity to reference: {inst_analysis['similarity_score']:.3f}")

----------------------------------------------------------------------------------------------------
MEDICAL ARTICLE (SAMPLE):
radiocontrast - induced nephropathy ( rin ) can lead to acute renal failure ( arf ) , which may require dialysis therapy . 
 arf increases treatment cost due to sepsis , hemorrhage , respiratory failure , and a long hospitalization.[13 ] rin is an important cause of hospital - acquired arf and is responsible for 12% of cases . 
 renal medullary hypoxia and the direct toxic effects of iodinated con...
----------------------------------------------------------------------------------------------------
HUMAN SUMMARY (BASELINE):
radiocontrast administration is an important cause of acute renal failure . in this study , 
 compared the plasma creatinine levels with spot urine il-18 levels following radiocontrast administration . 
 twenty patients ( 11 males , 9 females ) underwent radiocontrast diagnostic and therapeutic - enhanced examinations . 
 the rin mehran r

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) metric helps quantify the quality of the summaries generated by the models by comparing them with reference summaries, here the abstracts of the articles. It measures n-gram overlap with the reference, which for medical articles gives an indication of how much key information and specialized terminology is preserved.
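
To show what n-gram overlap means in practice, here is a minimal, simplified sketch of ROUGE-N. It is not the implementation used by the `evaluate` library: it tokenizes on whitespace, applies no stemming, and scores a single prediction against a single reference. `ngrams` and `rouge_n` are hypothetical helper names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, and F1."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: a short prediction scored against a longer reference.
toy_prediction = "the cat"
toy_reference = "the cat sat on mat"
print("ROUGE-1:", rouge_n(toy_prediction, toy_reference, n=1))
print("ROUGE-2:", rouge_n(toy_prediction, toy_reference, n=2))
```

A very short summary can score high precision and low recall at the same time, which is worth keeping in mind when the generated summaries are much shorter than the reference abstracts.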

In [20]:
rouge = evaluate.load('rouge')

Generate the outputs for a sample of the test dataset (only 10 articles and summaries, to save time) and save the results.

In [21]:
# Generate summaries for ROUGE evaluation on a sample of medical articles
articles = dataset['test'][0:10]['article']
human_baseline_summaries = dataset['test'][0:10]['abstract']

original_model_summaries = []
instruct_model_summaries = []

print("Generating medical article summaries for evaluation...")


for idx, article in enumerate(articles):
    article_truncated = article[:1000] + "..." if len(article) > 1000 else article

    prompt = f"""
Summarize the following medical research article.

{article_truncated}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).input_ids

    # Original model
    original_model_outputs = original_model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=120, num_beams=2)
    )
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    # "Fine-tuned" model
    instruct_model_outputs = instruct_model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(max_new_tokens=120, num_beams=2)
    )
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

    print(f"Processed medical article {idx + 1}/{len(articles)}")

# Build a DataFrame to inspect the results
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])

print("\nFIRST 3 EXAMPLES OF MEDICAL SUMMARIES:")
for i in range(min(3, len(df))):
    print(f"\n--- MEDICAL ARTICLE {i+1} ---")
    print(f"Human (abstract): {df.iloc[i]['human_baseline_summaries'][:150]}...")
    print(f"Original: {df.iloc[i]['original_model_summaries'][:150]}...")
    print(f"Fine-tuned: {df.iloc[i]['instruct_model_summaries'][:150]}...")

print(f"\nFull DataFrame with {len(df)} medical summaries:")
df

Generating medical article summaries for evaluation...
Processed medical article 1/10
Processed medical article 2/10
Processed medical article 3/10
Processed medical article 4/10
Processed medical article 5/10
Processed medical article 6/10
Processed medical article 7/10
Processed medical article 8/10
Processed medical article 9/10
Processed medical article 10/10

FIRST 3 EXAMPLES OF MEDICAL SUMMARIES:

--- MEDICAL ARTICLE 1 ---
Human (abstract): research on the implications of anxiety in parkinson 's disease ( pd ) has been neglected despite its prevalence in nearly 50% of patients and its neg...
Original: anxiety and depression are common in pd patients...
Fine-tuned: anxiety and depression are common in pd patients...

--- MEDICAL ARTICLE 2 ---
Human (abstract): small non - coding rnas include sirna , mirna , pirna and snorna . 
 the involvement of mirnas in the regulation of mammary gland tumorigenesis has be...
Original: Micrornas are small non - coding rnas that regulat

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | research on the implications of anxiety in par... | anxiety and depression are common in pd patients | anxiety and depression are common in pd patients |
| 1 | small non - coding rnas include sirna , mirna ... | Micrornas are small non - coding rnas that reg... | Micrornas are small non - coding rnas that reg... |
| 2 | objective : to evaluate the efficacy and safet... | ohss is a complication of ovulation induction.... | ohss is a complication of ovulation induction.... |
| 3 | congenital adrenal hyperplasia is a group of a... | Congenital adrenal hyperplasia is a group of a... | Congenital adrenal hyperplasia is a group of a... |
| 4 | objective(s):pentoxifylline is an immunomodula... | Type 1 diabetes is a autoimmune disease charac... | Type 1 diabetes is a autoimmune disease charac... |
| 5 | abstractobjective : to determine the presence ... | A transversal study of asma gravidade in adult... | A transversal study of asma gravidade in adult... |
| 6 | background : since the family is a social syst... | Stress in the family is a source of stress for... | Stress in the family is a source of stress for... |
| 7 | background and objective : anxiety and depre... | Increasing number of cardiovascular diseases i... | Increasing number of cardiovascular diseases i... |
| 8 | worldwide emergence of variant viruses has pro... | nepali nephritid patients are screened for inf... | nepali nephritid patients are screened for inf... |
| 9 | excess weight has generally been associated wi... | Obesity is becoming a worldwide phenomenon and... | Obesity is becoming a worldwide phenomenon and... |


Evaluate both models by computing the ROUGE metrics and compare the results. In the run recorded below, the two sets of scores are identical, so no improvement is measured.

In [22]:
# Evaluate the original model
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# Evaluate the fine-tuned model
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('MODELO ORIGINAL (Resúmenes Médicos):')
for metric, score in original_model_results.items():
    print(f'   {metric}: {score:.4f}')

print('\nMODELO CON FINE-TUNING (Resúmenes Médicos):')
for metric, score in instruct_model_results.items():
    print(f'   {metric}: {score:.4f}')

print('\nMEJORA ABSOLUTA:')
improvement = {k: instruct_model_results[k] - original_model_results[k] for k in original_model_results.keys()}
for metric, improvement_score in improvement.items():
    print(f'   {metric}: {improvement_score*100:+.2f}%')

print('\nINTERPRETACIÓN DE MÉTRICAS ROUGE:')
print('   - ROUGE-1: Overlap de palabras individuales (precisión general)')
print('   - ROUGE-2: Overlap de bigramas (fluidez y coherencia)')
print('   - ROUGE-L: Subsecuencia común más larga (estructura)')
print('   - ROUGE-Lsum: ROUGE-L para resúmenes multi-oración')

MODELO ORIGINAL (Resúmenes Médicos):
   rouge1: 0.1112
   rouge2: 0.0446
   rougeL: 0.0953
   rougeLsum: 0.1026

MODELO CON FINE-TUNING (Resúmenes Médicos):
   rouge1: 0.1112
   rouge2: 0.0446
   rougeL: 0.0953
   rougeLsum: 0.1026

MEJORA ABSOLUTA:
   rouge1: +0.00%
   rouge2: +0.00%
   rougeL: +0.00%
   rougeLsum: +0.00%

INTERPRETACIÓN DE MÉTRICAS ROUGE:
   - ROUGE-1: Overlap de palabras individuales (precisión general)
   - ROUGE-2: Overlap de bigramas (fluidez y coherencia)
   - ROUGE-L: Subsecuencia común más larga (estructura)
   - ROUGE-Lsum: ROUGE-L para resúmenes multi-oración
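
To make the metric descriptions above more concrete, here is a minimal pure-Python sketch of a simplified ROUGE-1 F1 score and of the longest-common-subsequence length that ROUGE-L builds on. It assumes whitespace tokenization and no stemming, so it will not reproduce the numbers from `evaluate.load('rouge')`. The helper names `rouge1_f1` and `lcs_length` and the toy reference string are hypothetical and chosen only for illustration.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 (simplified ROUGE-1: whitespace tokens, no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, the quantity behind ROUGE-L."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# The prediction is taken from the outputs above; the reference is a toy string.
pred = "anxiety and depression are common in pd patients"
ref = "anxiety in pd patients has been neglected"
print(f"ROUGE-1 F1 (simplified): {rouge1_f1(pred, ref):.3f}")
print(f"LCS length: {lcs_length(pred.split(), ref.split())}")
```

A short prediction compared against a long abstract shares few of the reference's words, so its recall is low. This is one plausible reason the scores reported above are small.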


<a name='3'></a>
## 3 - Parameter Efficient Fine-Tuning (PEFT) for Medical Summaries

Next, perform **PEFT** fine-tuning instead of the full fine-tuning done above. PEFT is a much more efficient form of instruction fine-tuning than full fine-tuning, with comparable evaluation results, as you will see shortly.

PEFT is an umbrella term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they mean LoRA. At a very high level, LoRA lets you fine-tune a model with fewer compute resources (in some cases, a single GPU).

For medical summarization, LoRA is especially useful because it:
- Preserves the general medical knowledge of the base model
- Focuses on learning patterns specific to medical summarization
- Requires significantly less memory and training time
- Allows multiple adapters to be created for different medical specialties

After fine-tuning for medical summarization with LoRA, the original LLM remains unchanged and a new "LoRA adapter" is produced. This adapter is much smaller than the original LLM: roughly a single-digit percentage of its size (MBs vs. GBs).
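
The idea of a frozen weight plus a small trainable adapter can be sketched with NumPy. This is an illustrative toy under stated assumptions, not the PEFT library's implementation. The matrix sizes, the random initialization scale, and the helper name `lora_forward` are all hypothetical. The zero initialization of `B` follows the usual LoRA convention, so that the adapter starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 768, 768, 32, 32    # illustrative sizes, not read from the model

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so the update starts at 0

def lora_forward(x):
    # Frozen path plus the scaled low-rank path: x W^T + (alpha / r) * x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x), x @ W.T)

full_params = W.size
adapter_params = A.size + B.size
print(f"frozen weight: {full_params:,} parameters")
print(f"LoRA adapter:  {adapter_params:,} parameters "
      f"({100 * adapter_params / full_params:.2f}% of the frozen weight)")
```

Only `A` and `B` would receive gradients during training. `W` is never modified, which is why the original LLM stays unchanged and the adapter can be stored separately.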

<a name='3.1'></a>
### 3.1 - Set Up the PEFT/LoRA Model for Medical Summarization

You need to set up the PEFT/LoRA model for fine-tuning with a new adapter layer/parameters. With PEFT/LoRA, the underlying LLM is frozen and only the adapter is trained. Look at the LoRA configuration below, chosen for the medical summarization task.

Note the rank hyperparameter (`r`), which defines the rank/dimension of the adapter to be trained. For medical summaries, a slightly larger rank is used to better capture the complexity of medical terminology.

In [23]:
from peft import LoraConfig, get_peft_model, TaskType

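lora_config = LoraConfig(
    r=32,                            # rank of the low-rank update matrices
    lora_alpha=32,                   # scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q", "v"],       # apply LoRA to the query and value projections
    lora_dropout=0.05,               # dropout applied on the LoRA path during training
    bias="none",                     # leave bias terms frozen
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5 is an encoder-decoder (seq2seq) model
)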


Add the LoRA adapter layers/parameters to the original LLM so they can be trained specifically for medical summarization.

In [25]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"Parametros entrenables del modelo: {trainable_model_params:,.0f}\n Total de parametros del modelo: {all_model_params:,.0f}\n Porcentaje de parametros entrenables {100 * trainable_model_params / all_model_params:.0f}%"


In [26]:
peft_model = get_peft_model(original_model, lora_config)

print(print_number_of_trainable_model_parameters(peft_model))


print(f"\nCOMPARACIÓN DE PARÁMETROS:")
total_params = sum(p.numel() for p in peft_model.parameters())
trainable_params = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)


Parametros entrenables del modelo: 3,538,944
 Total de parametros del modelo: 251,116,800
 Porcentaje de parametros entrenables 1%

COMPARACIÓN DE PARÁMETROS:
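
The trainable-parameter count reported above can be reproduced by hand from the LoRA configuration. The sketch below assumes the standard `google/flan-t5-base` architecture (hidden size 768, 12 encoder layers, 12 decoder layers, and square `q`/`v` projections). These figures are not printed anywhere in this notebook, so treat them as assumptions. The variable names are illustrative.

```python
# Assumed architecture facts for google/flan-t5-base (not printed in this notebook):
d_model = 768              # hidden size; q and v projections assumed to be 768 x 768
encoder_layers = 12
decoder_layers = 12
r = 32                     # LoRA rank from lora_config
targets_per_attention = 2  # the "q" and "v" projections

# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_matrices = attention_blocks * targets_per_attention

# One LoRA pair per adapted matrix: A is (r x d_model), B is (d_model x r).
params_per_matrix = r * d_model + d_model * r
trainable = adapted_matrices * params_per_matrix

total = 251_116_800        # total reported above (base model + adapter)
print(f"adapted matrices:  {adapted_matrices}")
print(f"trainable params:  {trainable:,}")
print(f"trainable share:   {100 * trainable / total:.2f}%")
print(f"adapter size fp32: {trainable * 4 / 1e6:.1f} MB")
```

Under these assumptions the count comes out to 3,538,944, matching the cell output. The helper above formats the share with `:.0f`, which is why it prints `1%` for a value of about 1.41%.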




<a name='3.2'></a>
### 3.2 - Train the PEFT Adapter for Medical Summarization

Define the PEFT-specific training arguments and create a `Trainer` instance. PEFT training for medical summarization uses its own learning rate and training schedule.

In [27]:
output_dir = f'./peft-medical-summarization-training-{str(int(time.time()))}'

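peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,      # retry with a smaller batch size on out-of-memory errors
    learning_rate=1e-3,             # LoRA adapters usually tolerate a higher rate than full fine-tuning
    num_train_epochs=1,             # overridden by max_steps below
    logging_steps=1,
    max_steps=50,                   # short demonstration run
    eval_strategy="steps",
    eval_steps=25,
    save_steps=25,                  # checkpoints align with evaluations
    warmup_steps=5,
    load_best_model_at_end=True,    # reload the checkpoint with the lowest validation loss
    dataloader_drop_last=False,
    prediction_loss_only=True,      # report only the loss during evaluation
)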

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

Now everything is ready to train the PEFT adapter and save the model.



In [28]:
peft_trainer.train()
peft_model_path="./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path) 



Step,Training Loss,Validation Loss
25,6.0,3.274676
50,2.7031,2.848214




('./peft-dialogue-summary-checkpoint-local\\tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local\\special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local\\spiece.model',
 './peft-dialogue-summary-checkpoint-local\\added_tokens.json',
 './peft-dialogue-summary-checkpoint-local\\tokenizer.json')



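That training run used only a small subset of the data. A fully trained PEFT checkpoint can optionally be downloaded from S3; the command is commented out here, and the rest of this notebook uses the local checkpoint saved above.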

In [None]:
#!aws s3 cp --recursive s3://dlai-generative-ai/models/peft-dialogue-summary-checkpoint/ ./peft-dialogue-summary-checkpoint-from-s3/



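Check that the size of this model is much smaller than the original LLM. The command below targets the S3 download directory, which does not exist in this run because the download step above is commented out, so it reports an error:

In [None]:
!dir /a .\peft-dialogue-summary-checkpoint-from-s3\adapter_model.bin

The system cannot find the path specified.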


Prepare this model by adding the adapter to the original FLAN-T5 model. If the plan is to run inference only with this PEFT model, set `is_trainable=False`; if the model is being prepared for further training, set `is_trainable=True`. The cell below passes `is_trainable=True`, so the adapter weights stay trainable.

In [36]:
from peft import PeftModel, PeftConfig


peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    './peft-dialogue-summary-checkpoint-local/', 
    torch_dtype=torch.bfloat16,
    is_trainable=True  
)

Because the adapter was loaded with `is_trainable=True`, the number of trainable parameters matches the LoRA adapter size; with `is_trainable=False` it would be `0`:

In [37]:
print(print_number_of_trainable_model_parameters(peft_model))

Parametros entrenables del modelo: 3,538,944
 Total de parametros del modelo: 251,116,800
 Porcentaje de parametros entrenables 1%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, fully fine-tuned and PEFT model.

In [39]:
index = 25
article = dataset['test'][index]['article']
baseline_human_summary = dataset['test'][index]['abstract']


article_for_models = article[:1000] + "..." if len(article) > 1000 else article

prompt = f"""
Summarize the following medical research article.

{article_for_models}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).input_ids


# Original model (no fine-tuning)
original_model_outputs = original_model.generate(
    input_ids=input_ids, 
    generation_config=GenerationConfig(max_new_tokens=120, num_beams=1)
)
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# Fully fine-tuned model (simulated)
instruct_model_outputs = instruct_model.generate(
    input_ids=input_ids, 
    generation_config=GenerationConfig(max_new_tokens=120, num_beams=1)
)
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

# PEFT model (simulated)
peft_model_outputs = peft_model.generate(
    input_ids=input_ids, 
    generation_config=GenerationConfig(max_new_tokens=120, num_beams=1)
)
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

dash_line = '-' * 100
print(dash_line)
print(f'ARTÍCULO MÉDICO (MUESTRA):')
print(f'{article_for_models[:300]}...')
print(dash_line)
print(f'RESUMEN HUMANO (BASELINE):\n{baseline_human_summary}')
print(dash_line)
print(f'MODELO ORIGINAL (Sin Fine-tuning):\n{original_model_text_output}')
print(dash_line)
print(f'MODELO FULL FINE-TUNING:\n{instruct_model_text_output}')
print(dash_line)
print(f'MODELO PEFT:\n{peft_model_text_output}')

print(f"\nANÁLISIS COMPARATIVO:")

def analyze_medical_summary_advanced(summary, reference):
    """Simple lexical analysis of a medical summary against a reference."""
    summary_words = set(summary.lower().split())
    reference_words = set(reference.lower().split())
    
    # Specialized medical terms
    medical_terms = {
        'patient', 'treatment', 'study', 'clinical', 'therapy', 'disease', 
        'diagnosis', 'medical', 'health', 'hospital', 'symptoms', 'condition',
        'research', 'analysis', 'results', 'outcome', 'intervention', 'care'
    }
    
    # Metrics: medical-term count and word overlap with the reference
    summary_medical = len(summary_words.intersection(medical_terms))
    overlap = len(summary_words.intersection(reference_words))
    
    return {
        'length': len(summary.split()),
        'medical_terms': summary_medical,
        'word_overlap': overlap,
        'similarity': overlap / max(len(summary_words), 1)
    }

# Analyze the three models
orig_analysis = analyze_medical_summary_advanced(original_model_text_output, baseline_human_summary)
inst_analysis = analyze_medical_summary_advanced(instruct_model_text_output, baseline_human_summary)
peft_analysis = analyze_medical_summary_advanced(peft_model_text_output, baseline_human_summary)

print(f"MÉTRICAS COMPARATIVAS:")
print(f"                    Original  Full-FT   PEFT")
print(f"Longitud:          {orig_analysis['length']:8d}  {inst_analysis['length']:7d}  {peft_analysis['length']:4d}")
print(f"Términos médicos:  {orig_analysis['medical_terms']:8d}  {inst_analysis['medical_terms']:7d}  {peft_analysis['medical_terms']:4d}")
print(f"Similitud:         {orig_analysis['similarity']:8.3f}  {inst_analysis['similarity']:7.3f}  {peft_analysis['similarity']:4.3f}")

----------------------------------------------------------------------------------------------------
ARTÍCULO MÉDICO (MUESTRA):
radiocontrast - induced nephropathy ( rin ) can lead to acute renal failure ( arf ) , which may require dialysis therapy . 
 arf increases treatment cost due to sepsis , hemorrhage , respiratory failure , and a long hospitalization.[13 ] rin is an important cause of hospital - acquired arf and is re...
----------------------------------------------------------------------------------------------------
RESUMEN HUMANO (BASELINE):
radiocontrast administration is an important cause of acute renal failure . in this study , 
 compared the plasma creatinine levels with spot urine il-18 levels following radiocontrast administration . 
 twenty patients ( 11 males , 9 females ) underwent radiocontrast diagnostic and therapeutic - enhanced examinations . 
 the rin mehran risk score was low ( 5 ) . 
 the radiocontrast agents used were 623 mg / ml iopromid ( 1.5 ml / kg ) 

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (only 10 articles and abstracts, to save time).

In [40]:

articles = dataset['test'][0:10]['article']
human_baseline_summaries = dataset['test'][0:10]['abstract']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

print("EVALUACIÓN CUANTITATIVA DE RESÚMENES MÉDICOS...")
print("Generando resúmenes con los tres enfoques...")

for idx, article in enumerate(articles):
    article_truncated = article[:1000] + "..." if len(article) > 1000 else article
    
    prompt = f"""
Summarize the following medical research article.

{article_truncated}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).input_ids

    # Human baseline summary
    human_baseline_text_output = human_baseline_summaries[idx]

    # Original model
    original_model_outputs = original_model.generate(
        input_ids=input_ids, 
        generation_config=GenerationConfig(max_new_tokens=120)
    )
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    # Fully fine-tuned model
    instruct_model_outputs = instruct_model.generate(
        input_ids=input_ids, 
        generation_config=GenerationConfig(max_new_tokens=120)
    )
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    # PEFT model
    peft_model_outputs = peft_model.generate(
        input_ids=input_ids, 
        generation_config=GenerationConfig(max_new_tokens=120)
    )
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    # Store the summaries
    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
    
    print(f"Artículo médico {idx+1}/10 procesado")

# Build the comparison DataFrame
zipped_summaries = list(zip(
    human_baseline_summaries, 
    original_model_summaries, 
    instruct_model_summaries, 
    peft_model_summaries
))

df = pd.DataFrame(zipped_summaries, columns=[
    'human_baseline_summaries', 
    'original_model_summaries', 
    'instruct_model_summaries', 
    'peft_model_summaries'
])

print("\nMUESTRA DE RESULTADOS (Primeros 2 ejemplos):")
for i in range(min(2, len(df))):
    print(f"\n--- ARTÍCULO MÉDICO {i+1} ---")
    print(f"Humano: {df.iloc[i]['human_baseline_summaries'][:100]}...")
    print(f"Original: {df.iloc[i]['original_model_summaries'][:100]}...")
    print(f"Full-FT: {df.iloc[i]['instruct_model_summaries'][:100]}...")
    print(f"PEFT: {df.iloc[i]['peft_model_summaries'][:100]}...")

df

🔬 EVALUACIÓN CUANTITATIVA DE RESÚMENES MÉDICOS...
⏱️  Generando resúmenes con los tres enfoques...
Artículo médico 1/10 procesado
Artículo médico 2/10 procesado
Artículo médico 3/10 procesado
Artículo médico 4/10 procesado
Artículo médico 5/10 procesado
Artículo médico 6/10 procesado
Artículo médico 7/10 procesado
Artículo médico 8/10 procesado
Artículo médico 9/10 procesado
Artículo médico 10/10 procesado

MUESTRA DE RESULTADOS (Primeros 2 ejemplos):

--- ARTÍCULO MÉDICO 1 ---
Humano: research on the implications of anxiety in parkinson 's disease ( pd ) has been neglected despite it...
Original: anxiety and depression are common in pd patients . anxiety and depression are common in pd patients ...
Full-FT: anxiety and depression are common in pd patients...
PEFT: anxiety and depression are common in pd patients . anxiety and depression are common in pd patients ...

--- ARTÍCULO MÉDICO 2 ---
Humano: small non - coding rnas include sirna , mirna , pirna and snorna . 
 the involvement 

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,research on the implications of anxiety in par...,anxiety and depression are common in pd patien...,anxiety and depression are common in pd patients,anxiety and depression are common in pd patien...
1,"small non - coding rnas include sirna , mirna ...","a class of small non - coding rnas , known as ...",mrnas are small non - coding RNAs that regulat...,"a class of small non - coding rnas , known as ..."
2,objective : to evaluate the efficacy and safet...,ohss is a complication of ovulation induction ...,ohss is a complication of ovulation induction....,ohss is a complication of ovulation induction ...
3,congenital adrenal hyperplasia is a group of a...,adreno cortico trophic hormone ( ac ) deficien...,Acute congenital adrenal hyperplasia in children.,adreno cortico trophic hormone ( ac ) deficien...
4,objective(s):pentoxifylline is an immunomodula...,t1d is a disease that is mediated by the immun...,Type 1 diabetes is a autoimmune disease charac...,t1d is a disease that is mediated by the immun...
5,abstractobjective : to determine the presence ...,"a transversal study , a transversal study , a ...",A transversal study of asma gravidade in adult...,"a transversal study , a transversal study , a ..."
6,background : since the family is a social syst...,stress in the family is a source of stress for...,Stress in the family is a source of stress for...,stress in the family is a source of stress for...
7,background and objective : anxiety and depre...,cardiovascular disease is a disease that cause...,The disease pattern has changed from tradition...,Cardiovascular diseases are a major health pro...
8,worldwide emergence of variant viruses has pro...,nepal 's armed forces research institute for m...,nepali nephritid patients with influenza-like ...,nepal 's armed forces research institute for m...
9,excess weight has generally been associated wi...,obesity is a health problem . obesity is a ris...,obesity is becoming a worldwide phenomenon,obesity is a health problem . obesity is a ris...



Compute the ROUGE scores for this subset of the data.

In [41]:
# Compute ROUGE metrics for the three approaches
rouge = evaluate.load('rouge')

print("CALCULANDO MÉTRICAS ROUGE COMPARATIVAS...")

# Original model
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# Fully fine-tuned model
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# PEFT model
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('RESULTADOS ROUGE PARA RESÚMENES MÉDICOS:')
print('='*60)

print('MODELO ORIGINAL:')
for metric, score in original_model_results.items():
    print(f'   {metric}: {score:.4f}')

print('\nMODELO FULL FINE-TUNING:')
for metric, score in instruct_model_results.items():
    print(f'   {metric}: {score:.4f}')

print('\nMODELO PEFT:')
for metric, score in peft_model_results.items():
    print(f'   {metric}: {score:.4f}')

# Side-by-side comparison
print(f"\nTABLA COMPARATIVA:")
print(f"{'Métrica':<12} {'Original':<10} {'Full-FT':<10} {'PEFT':<10}")
print(f"{'-'*12} {'-'*10} {'-'*10} {'-'*10}")
for metric in original_model_results.keys():
    print(f"{metric:<12} {original_model_results[metric]:<10.4f} "
          f"{instruct_model_results[metric]:<10.4f} {peft_model_results[metric]:<10.4f}")

CALCULANDO MÉTRICAS ROUGE COMPARATIVAS...
RESULTADOS ROUGE PARA RESÚMENES MÉDICOS:
MODELO ORIGINAL:
   rouge1: 0.1410
   rouge2: 0.0252
   rougeL: 0.1119
   rougeLsum: 0.1323

MODELO FULL FINE-TUNING:
   rouge1: 0.0677
   rouge2: 0.0210
   rougeL: 0.0573
   rougeLsum: 0.0602

MODELO PEFT:
   rouge1: 0.1414
   rouge2: 0.0254
   rougeL: 0.1127
   rougeLsum: 0.1333

TABLA COMPARATIVA:
Métrica      Original   Full-FT    PEFT      
------------ ---------- ---------- ----------
rouge1       0.1410     0.0677     0.1414    
rouge2       0.0252     0.0210     0.0254    
rougeL       0.1119     0.0573     0.1127    
rougeLsum    0.1323     0.0602     0.1333    


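Notice that in this run the PEFT model scores almost the same as the original model (ROUGE-1 of 0.1414 vs. 0.1410) and higher than the fully fine-tuned model (0.0677), even though only about 1% of the parameters were trained.

With only 50 training steps and 10 evaluation articles, these differences should be read with caution. In general, PEFT may score slightly lower than full fine-tuning, but its savings in memory and training time typically outweigh that gap.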

Calculate the improvement of PEFT over the original model:

In [43]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 0.05%
rouge2: 0.02%
rougeL: 0.09%
rougeLsum: 0.10%


Now calculate the improvement of PEFT over a full fine-tuned model:

In [44]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: 7.37%
rouge2: 0.44%
rougeL: 5.55%
rougeLsum: 7.31%
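
The figures printed above are absolute differences in ROUGE score, expressed in percentage points. The sketch below contrasts that with the relative change, which can tell a different story when the baseline is small. It uses the four-decimal values copied from the comparison table, so the last digit may differ slightly from the cell outputs, which use unrounded scores. The `scores` dictionary and the `compare` helper are hypothetical names introduced only for this illustration.

```python
# ROUGE scores copied from the comparison table above.
scores = {
    "original": {"rouge1": 0.1410, "rouge2": 0.0252, "rougeL": 0.1119, "rougeLsum": 0.1323},
    "full_ft":  {"rouge1": 0.0677, "rouge2": 0.0210, "rougeL": 0.0573, "rougeLsum": 0.0602},
    "peft":     {"rouge1": 0.1414, "rouge2": 0.0254, "rougeL": 0.1127, "rougeLsum": 0.1333},
}

def compare(candidate: str, baseline: str) -> dict:
    """Return (absolute difference in points, relative change in %) per metric."""
    out = {}
    for metric, base in scores[baseline].items():
        cand = scores[candidate][metric]
        out[metric] = ((cand - base) * 100, (cand - base) / base * 100)
    return out

for baseline in ("original", "full_ft"):
    print(f"PEFT vs {baseline}:")
    for metric, (abs_pts, rel_pct) in compare("peft", baseline).items():
        print(f"  {metric:<10} {abs_pts:+6.2f} points   {rel_pct:+7.1f}% relative")
```

Reporting both numbers side by side avoids ambiguity about whether a "percentage improvement" is measured in points or relative to the baseline score.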
