# HEALTH LITERACY: PLAIN LANGUAGE SUMMARIES GENERATION

**Project:** Automatic generation of plain-language summaries for medical texts

**Architecture:**
- **Classification:** BERT (encoder-only) + TF-IDF to detect text complexity
- **Generation:** CodeLlama-7B (decoder-only) with LoRA + 4-bit quantization
- **Evaluation:** BERTScore + NLI + readability metrics
- **Optimization:** TD3 to tune generation parameters

**Goals:**
1. Complexity classifier (technical vs. plain language)
2. Fine-tune CodeLlama-7B with LoRA for summary generation
3. Evaluate quality with relevance, factuality, and readability metrics
4. Comparison with commercial APIs (OpenAI, Anthropic, Google), optional

**Requirements:**
- Colab Pro with an A100 GPU (or T4/V100 with optimizations)
- Dataset in `.txt` format

## Usage Instructions

### Initial Setup
1. **GPU:** Runtime > Change runtime type > A100 (recommended)
2. **Execution:** Run the cells IN ORDER (later cells depend on earlier ones)
3. **Dataset:** Make sure the repository is cloned or the `.txt` files are available

### Main Components
| Component | Cell | Details |
|-----------|------|---------|
| Evaluation metrics | 4 | NLI, BERTScore, readability |
| BERT classifier | 5 | BERT + TF-IDF |
| Data loading | 14 | Supports `.txt` |
| CodeLlama-7B fine-tuning | 43+ | LoRA + 4-bit, ENABLED |

### Technical Notes
- The system uses NLI (DeBERTa-v3) instead of AlignScore, for compatibility with PyTorch 2.x
- Fine-tuning is **ENABLED BY DEFAULT** (`ENABLE_FINETUNING = True`)
- API keys are loaded from environment variables (safer than hard-coding them)

In [1]:
# INSTALLATION
print("Installing...")
!pip install -q transformers torch datasets accelerate peft bitsandbytes sentencepiece protobuf
!pip install -q bert-score textstat nltk scikit-learn pandas numpy matplotlib seaborn
!pip install -q "stable-baselines3[extra]" gymnasium
!pip install -q openai anthropic google-generativeai

import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
print("Installation complete")

Installing...
Installation complete


In [2]:
# IMPORTS
import warnings
warnings.filterwarnings('ignore')

import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

from transformers import (
    AutoTokenizer, AutoModelForCausalLM, AutoModel,
    AutoModelForSeq2SeqLM, TrainingArguments,
    Seq2SeqTrainingArguments, BitsAndBytesConfig, pipeline
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset

from bert_score import score as bert_score
import textstat

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env import DummyVecEnv
import gymnasium as gym
from gymnasium import spaces

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Device: cuda
GPU: NVIDIA A100-SXM4-80GB




In [3]:
class EvaluationMetrics:
    """Relevance (BERTScore), factuality (NLI), and readability metrics."""

    def __init__(self):
        print("Loading NLI model...")
        # NOTE: microsoft/deberta-v3-small is a base checkpoint with no trained
        # classification head (see the load warning below), so its scores are not
        # calibrated entailment probabilities until an NLI-fine-tuned model is used.
        self.nli = pipeline(
            "text-classification",
            model="microsoft/deberta-v3-small",
            device=0 if torch.cuda.is_available() else -1
        )
        print("NLI model ready")

    def calculate_relevance(self, generated, reference):
        """BERTScore precision, recall, and F1 of the generated text against a reference."""
        P, R, F1 = bert_score([generated], [reference], lang="en", verbose=False)
        return {"precision": P.item(), "recall": R.item(), "f1": F1.item()}

    def calculate_factual(self, generated, source):
        """Score the source and generated text together with the NLI pipeline."""
        result = self.nli(f"{source} [SEP] {generated}")
        # This is the score of the top predicted label, whichever label that is
        return {"score": result[0]['score']}

    def calculate_readability(self, text):
        """Flesch Reading Ease and Flesch-Kincaid grade level from textstat."""
        return {
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text)
        }

    def evaluate_summary(self, generated, reference, source):
        """Run all three metric groups on one generated summary."""
        return {
            "relevance": self.calculate_relevance(generated, reference),
            "factual": self.calculate_factual(generated, source),
            "readability": self.calculate_readability(generated)
        }

evaluator = EvaluationMetrics()
print("Metrics ready")

Loading NLI model...

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Device set to use cuda:0

NLI model ready
Metrics ready
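
The load warning above reports that the classification head of `microsoft/deberta-v3-small` is newly initialized, so the score it produces cannot be read as an entailment probability. The sketch below shows one way a factuality score could be read from a classifier that has been fine-tuned for NLI. It assumes the classifier returns one `{"label", "score"}` dict per class and that one label is named `entailment`; `fake_nli` is a hypothetical stand-in with made-up numbers, not a real model.

```python
def entailment_score(label_scores, entailment_label="entailment"):
    """Return the probability assigned to the entailment label.

    `label_scores` is assumed to be a list of {"label": str, "score": float}
    dicts covering every class (an assumed output shape, not verified here).
    """
    for item in label_scores:
        if item["label"].lower() == entailment_label:
            return item["score"]
    labels = [item["label"] for item in label_scores]
    raise KeyError(f"No '{entailment_label}' label among {labels}")


def fake_nli(premise, hypothesis):
    """Hypothetical stand-in for an NLI model; the scores are invented."""
    return [
        {"label": "contradiction", "score": 0.05},
        {"label": "entailment", "score": 0.85},
        {"label": "neutral", "score": 0.10},
    ]


source = "Percutaneous coronary intervention was performed with stent placement."
summary = "Doctors put in a small tube to help blood flow."
factual = entailment_score(fake_nli(premise=source, hypothesis=summary))
print(f"Entailment probability: {factual:.2f}")
```

Reading the entailment class explicitly avoids reporting the score of whichever label happens to rank first.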


In [4]:
class LanguageComplexityClassifier:
    def __init__(self):
        self.bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.bert_model = AutoModel.from_pretrained("bert-base-uncased").to(device)
        self.bert_model.eval()

        self.tfidf_vectorizer = TfidfVectorizer(
            max_features=10000,
            stop_words='english',
            ngram_range=(1, 2)
        )

        self.bert_classifier = None
        self.tfidf_classifier = None

    def extract_bert_embeddings(self, texts):
        embeddings = []
        for text in tqdm(texts, desc="Extracting BERT embeddings"):
            inputs = self.bert_tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                padding=True,
                max_length=512
            ).to(device)

            with torch.no_grad():
                outputs = self.bert_model(**inputs)
                embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
                embeddings.append(embedding[0])

        return np.array(embeddings)

    def train_classifiers(self, technical_texts, plain_texts):
        all_texts = technical_texts + plain_texts
        labels = [1] * len(technical_texts) + [0] * len(plain_texts)

        X_train, X_test, y_train, y_test = train_test_split(
            all_texts, labels, test_size=0.2, random_state=42
        )

        print(f"Training with {len(X_train)} texts")

        # BERT classifier
        bert_emb_train = self.extract_bert_embeddings(X_train)
        bert_emb_test = self.extract_bert_embeddings(X_test)

        self.bert_classifier = RandomForestClassifier(n_estimators=100)
        self.bert_classifier.fit(bert_emb_train, y_train)

        bert_pred = self.bert_classifier.predict(bert_emb_test)
        print("\nBERT results:")
        print(classification_report(y_test, bert_pred))

        # TF-IDF classifier
        tfidf_train = self.tfidf_vectorizer.fit_transform(X_train)
        tfidf_test = self.tfidf_vectorizer.transform(X_test)

        self.tfidf_classifier = LogisticRegression(max_iter=1000)
        self.tfidf_classifier.fit(tfidf_train, y_train)

        tfidf_pred = self.tfidf_classifier.predict(tfidf_test)
        print("\nTF-IDF results:")
        print(classification_report(y_test, tfidf_pred))

        return {
            "bert_accuracy": (bert_pred == y_test).mean(),
            "tfidf_accuracy": (tfidf_pred == y_test).mean()
        }

    def predict(self, text, method='bert'):
        # Compare against None explicitly instead of relying on estimator truthiness
        if method == 'bert' and self.bert_classifier is not None:
            embedding = self.extract_bert_embeddings([text])
            prediction = self.bert_classifier.predict(embedding)[0]
            proba = self.bert_classifier.predict_proba(embedding)[0]
        elif method == 'tfidf' and self.tfidf_classifier is not None:
            features = self.tfidf_vectorizer.transform([text])
            prediction = self.tfidf_classifier.predict(features)[0]
            proba = self.tfidf_classifier.predict_proba(features)[0]
        else:
            raise ValueError("Invalid method or classifier not trained")

        return {
            "prediction": "Technical" if prediction == 1 else "Plain",
            "confidence": max(proba)
        }

classifier = LanguageComplexityClassifier()
print("Classifier initialized")

Classifier initialized


## 3. Fine-tuning Configuration

In [5]:
def setup_lora_config():
    return LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

def setup_quantization_config():
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )

def setup_training_args():
    return TrainingArguments(
        output_dir="./results",
        num_train_epochs=5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        gradient_checkpointing=True
    )

print("Fine-tuning configurations ready")

Fine-tuning configurations ready
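
To make the settings above concrete, the sketch below works out the effective batch size, the number of optimizer steps, and the number of trainable LoRA weights. The training-set size (6,000 examples), the single GPU, and the model dimensions (32 decoder layers, 4096-wide square attention projections, as in the published Llama 7B architecture) are assumptions for illustration; none of them is read from the checkpoint or the data.

```python
import math

# Values taken from setup_training_args() and setup_lora_config()
per_device_batch = 2
grad_accum_steps = 8
epochs = 5
lora_rank = 32
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"]

# Assumptions, not read from the checkpoint or the dataset
num_examples = 6000   # hypothetical training-set size
num_layers = 32       # assumed number of decoder layers
hidden_size = 4096    # assumed input and output width of each attention projection

# On a single GPU, each optimizer step sees per_device_batch * grad_accum_steps examples
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# LoRA adds two matrices per adapted module: A (rank x in) and B (out x rank)
params_per_module = lora_rank * (hidden_size + hidden_size)
trainable_params = params_per_module * len(target_modules) * num_layers

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"LoRA trainable parameters: {trainable_params:,}")
```

Under these assumptions the adapters hold about 33.6 million weights, a small fraction of the 7B base model, which is what makes training on a single GPU feasible.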


In [None]:
class RLTextTrainer:
    def __init__(self, model, tokenizer, evaluator, texts, summaries, total_timesteps=1000):
        # TextGenerationEnvironment must be defined in an earlier cell; it is not shown in this notebook
        self.env = TextGenerationEnvironment(model, tokenizer, evaluator, texts, summaries)
        self.vec_env = DummyVecEnv([lambda: self.env])
        self.total_timesteps = total_timesteps

    def create_td3_model(self):
        action_noise = NormalActionNoise(
            mean=np.zeros(5),
            sigma=0.1 * np.ones(5)
        )

        return TD3(
            "MlpPolicy",
            self.vec_env,
            learning_rate=0.001,
            buffer_size=10000,
            batch_size=64,
            action_noise=action_noise,
            verbose=1
        )

    def train(self):
        print("Training TD3...")
        model = self.create_td3_model()
        model.learn(total_timesteps=self.total_timesteps)
        print("Training complete")
        return model

print("TD3 trainer defined")
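
`TextGenerationEnvironment` is not shown here, but the five-dimensional action noise suggests that each action component controls one generation parameter (temperature, top_p, top_k, max_new_tokens, repetition_penalty). The sketch below shows one way such a mapping could work, assuming actions in [-1, 1]. The function name and every range are hypothetical and do not come from the original code.

```python
# Hypothetical (low, high) range for each generation parameter
PARAM_RANGES = {
    "temperature": (0.1, 1.5),
    "top_p": (0.5, 1.0),
    "top_k": (10, 100),
    "max_new_tokens": (50, 400),
    "repetition_penalty": (1.0, 1.5),
}
INTEGER_PARAMS = {"top_k", "max_new_tokens"}


def action_to_generation_params(action):
    """Map a 5-element action in [-1, 1] onto generation parameters."""
    if len(action) != len(PARAM_RANGES):
        raise ValueError(f"Expected {len(PARAM_RANGES)} action values, got {len(action)}")
    params = {}
    for value, (name, (low, high)) in zip(action, PARAM_RANGES.items()):
        # Exploration noise can push an action outside [-1, 1], so clip first
        clipped = max(-1.0, min(1.0, float(value)))
        scaled = low + (clipped + 1.0) / 2.0 * (high - low)
        params[name] = int(round(scaled)) if name in INTEGER_PARAMS else scaled
    return params


print(action_to_generation_params([0.0, 0.0, 0.0, 0.0, 0.0]))
```

Clipping before scaling keeps every parameter inside a range the generator accepts, even when action noise is added on top of the policy output.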

## 5. Commercial APIs

In [38]:
class CommercialAPI:
    def __init__(self):
        pass

    def generate_with_gpt4(self, text, api_key):
        import openai
        # Use an explicit client instead of setting the module-level API key
        client = openai.OpenAI(api_key=api_key)

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Summarize in plain language: {text}"}]
        )
        return response.choices[0].message.content

    def generate_with_claude(self, text, api_key):
        import anthropic
        client = anthropic.Anthropic(api_key=api_key)

        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=500,
            messages=[{"role": "user", "content": f"Summarize in plain language: {text}"}]
        )
        return response.content[0].text

api = CommercialAPI()
print("Commercial APIs configured")

Commercial APIs configured
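
Both methods take the key as an argument, so the caller decides where it comes from. A minimal sketch of reading keys from environment variables follows; the variable names and the `get_api_key` helper are assumptions for illustration, not part of the notebook.

```python
import os


def get_api_key(env_var):
    """Return the API key stored in `env_var`, or None if it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or None


# Variable names are assumptions; use whatever names your environment defines
API_KEY_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}

available = {name: get_api_key(var) for name, var in API_KEY_VARS.items()}
for name, key in available.items():
    status = "found" if key else "missing, comparison will be skipped"
    print(f"{name}: {status}")
```

Returning `None` for a missing key lets the optional comparison step be skipped cleanly instead of failing inside the API client.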


## 6. Dataset Download and Processing


In [9]:
# Clone the dataset repository
import os

REPO_URL = "https://github.com/feliperussi/bridging-the-gap-in-health-literacy"

if not os.path.exists("bridging-the-gap-in-health-literacy"):
    print("Cloning dataset repository...")
    !git clone {REPO_URL}
    print("Repository cloned")
else:
    print("Repository already exists")

# Explore the structure
dataset_path = "bridging-the-gap-in-health-literacy"
if os.path.exists(dataset_path):
    print("\nRepository contents:")
    !ls -la {dataset_path}
else:
    print("Repository not found")


Cloning dataset repository...
Cloning into 'bridging-the-gap-in-health-literacy'...
remote: Enumerating objects: 72074, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 72074 (delta 0), reused 2 (delta 0), pack-reused 72071 (from 2)
Receiving objects: 100% (72074/72074), 315.90 MiB | 25.87 MiB/s, done.
Resolving deltas: 100% (2991/2991), done.
Updating files: 100% (87209/87209), done.
Repository cloned

Repository contents:
total 40
drwxr-xr-x 6 root root 4096 Oct 22 10:40 .
drwxr-xr-x 1 root root 4096 Oct 22 10:39 ..
drwxr-xr-x 8 root root 4096 Oct 22 10:40 data_analysis
drwxr-xr-x 5 root root 4096 Oct 22 10:40 data_collection_and_processing
-rw-r--r-- 1 root root 8196 Oct 22 10:40 .DS_Store
drwxr-xr-x 8 root root 4096 Oct 22 10:40 .git
drwxr-xr-x 4 root root 4096 Oct 22 10:40 llms_testing
-rw-r--r-- 1 root root 1420 Oct 22 10:40 README.md


In [10]:
def load_txt_dataset(data_path):
    """
    Load technical (non_pls) and plain-language (pls) texts from Data Sources.

    Structure:
      - Cochrane/train/non_pls/ (technical) and Cochrane/train/pls/ (plain)
      - ClinicalTrials/train/ (technical) and Pfizer/train/ (plain)

    Note: files are not matched by document, so index i of each returned list
    is not guaranteed to describe the same study. The lists are only balanced
    in size.
    """
    print("Loading dataset from Data Sources...")
    import glob
    import os

    def read_folder(folder, label, min_chars, limit=None):
        """Read up to `limit` .txt files, keeping texts longer than `min_chars`."""
        texts = []
        if not os.path.exists(folder):
            return texts
        files = glob.glob(f"{folder}/*.txt")
        print(f"{label}: {len(files)}")
        for path in files[:limit]:
            try:
                with open(path, 'r', encoding='utf-8', errors='ignore') as file:
                    content = file.read().strip()
            except OSError:
                continue  # Skip unreadable files
            if len(content) > min_chars:
                texts.append(content)
        return texts

    data_sources = f"{data_path}/data_collection_and_processing/Data Sources"

    print(f"\nData Sources: {data_sources}")

    # 1. COCHRANE - train (capped at 6000 files per class for balance)
    technical_texts = read_folder(
        f"{data_sources}/Cochrane/train/non_pls", "\nCochrane non_pls (technical)", 50, limit=6000
    )
    plain_summaries = read_folder(
        f"{data_sources}/Cochrane/train/pls", "Cochrane pls (plain)", 20, limit=6000
    )

    print("\nCochrane loaded:")
    print(f"   Technical: {len(technical_texts)}")
    print(f"   Plain: {len(plain_summaries)}")

    # 2. CLINICAL TRIALS (technical)
    technical_texts += read_folder(
        f"{data_sources}/ClinicalTrials/train", "\nClinicalTrials train (technical)", 50
    )

    # 3. PFIZER (plain language)
    plain_summaries += read_folder(
        f"{data_sources}/Pfizer/train", "Pfizer train (plain)", 20
    )

    print("\nTOTAL LOADED:")
    print(f"   Technical (non_pls): {len(technical_texts)}")
    print(f"   Plain (pls): {len(plain_summaries)}")

    # Truncate both lists to the shorter length so the classes are balanced
    min_len = min(len(technical_texts), len(plain_summaries))
    technical_texts = technical_texts[:min_len]
    plain_summaries = plain_summaries[:min_len]

    print(f"\nBalanced pairs: {min_len}")

    return technical_texts, plain_summaries

# Load dataset
if os.path.exists("bridging-the-gap-in-health-literacy"):
    all_technical, all_plain = load_txt_dataset("bridging-the-gap-in-health-literacy")
    print(f"\nDataset loaded: {len(all_technical)} pairs")

    if len(all_technical) > 0:
        print("\nPair check:")
        print(f"   Technical: {all_technical[0][:100]}...")
        print(f"   Plain: {all_plain[0][:100]}...")
else:
    all_technical = []
    all_plain = []
    print("Repository not found")


Loading dataset from Data Sources...

Data Sources: bridging-the-gap-in-health-literacy/data_collection_and_processing/Data Sources

Cochrane non_pls (technical): 31118
Cochrane pls (plain): 16241

Cochrane loaded:
   Technical: 6000
   Plain: 6000
Pfizer train (plain): 0

TOTAL LOADED:
   Technical (non_pls): 6000
   Plain (pls): 6000

Balanced pairs: 6000

Dataset loaded: 6000 pairs

Pair check:
   Technical: Background
Subfertility affects 15% to 20% of couples trying to conceive. In vitro fertilisation (IV...
   Plain: Weight loss programmes for overweight and obese breast cancer survivors: what are their benefits and...
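
The sample printed above places a fertility abstract next to a weight-loss summary, which shows that the two lists are balanced in size but not aligned by document. That is sufficient for the complexity classifier, but fine-tuning needs true source/summary pairs. If matching documents share a file name across the two folders (an assumption that should be checked against the repository), aligned pairs could be built as sketched below; `load_aligned_pairs` is a hypothetical helper.

```python
from pathlib import Path


def load_aligned_pairs(technical_dir, plain_dir, limit=None):
    """Pair files that have the same name in both folders.

    Assumes a technical text and its plain-language summary share a file
    name; this is a hypothetical convention, not verified for the dataset.
    """
    technical_dir, plain_dir = Path(technical_dir), Path(plain_dir)
    plain_names = {p.name for p in plain_dir.glob("*.txt")}
    # Sort so the selection is reproducible across runs and file systems
    shared = sorted(p.name for p in technical_dir.glob("*.txt") if p.name in plain_names)

    pairs = []
    for name in shared[:limit]:
        technical = (technical_dir / name).read_text(encoding="utf-8", errors="ignore").strip()
        plain = (plain_dir / name).read_text(encoding="utf-8", errors="ignore").strip()
        if technical and plain:
            pairs.append((technical, plain))
    return pairs
```

Files that exist in only one folder are dropped, so every returned tuple holds a source text and the summary stored under the same name.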


## 7. Example Data for Testing


In [11]:
# Example data for testing the system
EXAMPLE_TECHNICAL_TEXTS = [
    """The patient presented with acute myocardial infarction characterized by
    ST-segment elevation on electrocardiogram and elevated troponin levels.
    Percutaneous coronary intervention was performed with successful stent placement
    in the left anterior descending artery.""",

    """The study protocol involves a randomized, double-blind, placebo-controlled trial
    examining the efficacy of monoclonal antibody therapy in patients with refractory
    autoimmune disorders. Primary endpoint is reduction in disease activity score.""",

    """Patients with type 2 diabetes mellitus received metformin hydrochloride
    500mg twice daily with dose titration based on glycemic control parameters
    including fasting plasma glucose and HbA1c levels."""
]

EXAMPLE_PLAIN_SUMMARIES = [
    """The patient had a heart attack. Tests showed part of the heart wasn't getting
    enough blood. Doctors put in a small tube to help blood flow better to the heart.""",

    """This research tests a new treatment for people with immune system problems
    that don't respond to regular medicine. The study will check if the new medicine
    helps reduce symptoms.""",

    """People with diabetes took metformin pills twice a day. Doctors adjusted
    the amount based on blood sugar tests to find the right dose for each person."""
]

print(f"Examples loaded: {len(EXAMPLE_TECHNICAL_TEXTS)} pairs")
print("\nExample 1 (technical):")
print(EXAMPLE_TECHNICAL_TEXTS[0][:100] + "...")
print("\nExample 1 (plain):")
print(EXAMPLE_PLAIN_SUMMARIES[0][:100] + "...")


Examples loaded: 3 pairs

Example 1 (technical):
The patient presented with acute myocardial infarction characterized by 
    ST-segment elevation on...

Example 1 (plain):
The patient had a heart attack. Tests showed part of the heart wasn't getting 
    enough blood. Doc...


## 8. Testing the Metrics with Examples

In [12]:
# Test the metrics with the examples
print("Testing metrics...")
print("="*70)

test_results = evaluator.evaluate_summary(
    EXAMPLE_PLAIN_SUMMARIES[0],
    EXAMPLE_PLAIN_SUMMARIES[0],  # Same text as reference, so BERTScore is 1.0 by construction
    EXAMPLE_TECHNICAL_TEXTS[0]
)

print("\nEvaluation results:")
print("\nRelevance (BERTScore):")
print(f"  Precision: {test_results['relevance']['precision']:.4f}")
print(f"  Recall: {test_results['relevance']['recall']:.4f}")
print(f"  F1: {test_results['relevance']['f1']:.4f}")

print("\nFactuality (NLI):")
print(f"  Score: {test_results['factual']['score']:.4f}")

print("\nReadability:")
print(f"  Flesch Reading Ease: {test_results['readability']['flesch_reading_ease']:.2f}")
print(f"  Flesch-Kincaid Grade: {test_results['readability']['flesch_kincaid_grade']:.2f}")

print("\nMetrics working correctly")

Testing metrics...

Evaluation results:

Relevance (BERTScore):
  Precision: 1.0000
  Recall: 1.0000
  F1: 1.0000

Factuality (NLI):
  Score: 0.5730

Readability:
  Flesch Reading Ease: 92.34
  Flesch-Kincaid Grade: 2.86

Metrics working correctly
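
For reference, the two readability scores reported above follow the standard Flesch formulas, computed from word, sentence, and syllable counts. Syllable counts are estimated by the library, so exact values can differ slightly between implementations.

```latex
\mathrm{FRE} = 206.835 - 1.015\,\frac{\text{words}}{\text{sentences}} - 84.6\,\frac{\text{syllables}}{\text{words}}

\mathrm{FKGL} = 0.39\,\frac{\text{words}}{\text{sentences}} + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59
```

A higher Flesch Reading Ease means easier text, while the Flesch-Kincaid grade approximates a US school grade. The values 92.34 and 2.86 therefore describe short sentences built from short words, readable at roughly a third-grade level.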


In [13]:
# Models available for fine-tuning
DECODER_MODELS = {
    "codellama-7b": {
        "model_id": "meta-llama/CodeLlama-7b-Instruct-hf",
        "description": "CodeLlama 7B, specialized in code and technical text"
    },
    "mistral-7b": {
        "model_id": "mistralai/Mistral-7B-Instruct-v0.2",
        "description": "Mistral 7B general purpose"
    },
    "llama2-13b": {
        "model_id": "meta-llama/Llama-2-13b-chat-hf",
        "description": "Llama 2 13B, larger model"
    }
}

print("Models available for fine-tuning:")
for model_name, info in DECODER_MODELS.items():
    print(f"  ✓ {model_name}: {info['description']}")


Models available for fine-tuning:
  ✓ codellama-7b: CodeLlama 7B, specialized in code and technical text
  ✓ mistral-7b: Mistral 7B general purpose
  ✓ llama2-13b: Llama 2 13B, larger model


## 9. Model Fine-tuning (Complete Example)

In [14]:
def prepare_dataset_for_finetuning(technical_texts, plain_summaries):
    """
    Build the fine-tuning dataset: one prompt per (technical, plain) pair
    """
    data = []

    for tech, plain in zip(technical_texts, plain_summaries):
        prompt = f"""<|system|>
You are a medical communication expert. Convert medical texts to plain language.

<|user|>
Convert to plain language: {tech}

<|assistant|>
{plain}"""

        data.append({"text": prompt})

    return Dataset.from_dict({"text": [d["text"] for d in data]})

def load_model_for_finetuning(model_name="codellama-7b"):
    """
    Load the model with 4-bit quantization and attach LoRA adapters
    """
    model_id = DECODER_MODELS[model_name]["model_id"]

    print(f"Loading {model_name}...")

    # Load tokenizer; reuse the EOS token for padding
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    # Load model with quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=setup_quantization_config(),
        device_map="auto",
        trust_remote_code=True
    )

    # Prepare for k-bit training
    model = prepare_model_for_kbit_training(model)

    # Apply LoRA
    model = get_peft_model(model, setup_lora_config())

    print(f"Model {model_name} ready for fine-tuning")
    print(f"Trainable parameters: {model.num_parameters(only_trainable=True):,}")

    return model, tokenizer

print("Fine-tuning functions defined")
print("\nTo fine-tune:")
print("  dataset = prepare_dataset_for_finetuning(all_technical, all_plain)")
print("  model, tokenizer = load_model_for_finetuning('codellama-7b')")
print("  # Then train with the HuggingFace Trainer")

Fine-tuning functions defined

To fine-tune:
  dataset = prepare_dataset_for_finetuning(all_technical, all_plain)
  model, tokenizer = load_model_for_finetuning('codellama-7b')
  # Then train with the HuggingFace Trainer
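
The prompt built for training has to match the prompt used at generation time, otherwise the fine-tuned model is queried in a format it never saw. One way to keep the two in sync is a single helper, sketched below. `build_prompt` and `SYSTEM_MESSAGE` are hypothetical names that are not part of the notebook; the template text is copied from the training prompt above.

```python
SYSTEM_MESSAGE = "You are a medical communication expert. Convert medical texts to plain language."


def build_prompt(technical_text, plain_summary=None):
    """Build the chat-style prompt; append the target summary only for training."""
    prompt = (
        f"<|system|>\n{SYSTEM_MESSAGE}\n\n"
        f"<|user|>\nConvert to plain language: {technical_text}\n\n"
        f"<|assistant|>\n"
    )
    return prompt if plain_summary is None else prompt + plain_summary


training_example = build_prompt("Acute myocardial infarction.", "A heart attack.")
inference_prompt = build_prompt("Acute myocardial infarction.")
print(training_example)
```

Because the training text is the inference prompt plus the target, any later change to the template applies to both steps at once.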


## 10. Summary Generation

In [15]:
def generate_plain_summary(model, tokenizer, technical_text, params=None):
    """
    Generate a plain-language summary of a technical text
    """
    if params is None:
        params = {
            'temperature': 0.7,
            'top_p': 0.9,
            'top_k': 50,
            'max_new_tokens': 200,
            'repetition_penalty': 1.1
        }

    prompt = f"""<|system|>
You are a medical communication expert. Convert medical texts to plain language.

<|user|>
Convert to plain language: {technical_text}

<|assistant|>
"""

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=params['max_new_tokens'],
            temperature=params['temperature'],
            top_p=params['top_p'],
            top_k=params['top_k'],
            repetition_penalty=params['repetition_penalty'],
            do_sample=True
        )

    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the assistant's reply
    if "<|assistant|>" in generated:
        generated = generated.split("<|assistant|>")[-1].strip()

    return generated

print("Generation function defined")
print("\nTo generate:")
print("  summary = generate_plain_summary(model, tokenizer, technical_text)")

Generation function defined

To generate:
  summary = generate_plain_summary(model, tokenizer, technical_text)
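
The sampling parameters passed to `generate` act in sequence: temperature rescales the logits, `top_k` keeps the most likely tokens, and `top_p` keeps the smallest prefix of those whose cumulative probability reaches the threshold. The pure-Python sketch below illustrates that sequence on a toy vocabulary. It is a simplified illustration rather than the library's implementation, it omits `repetition_penalty`, and `filter_token_probs` is a hypothetical name.

```python
import math


def filter_token_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability

    # top-k: keep the k highest-weight tokens
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    mass = sum(weights[i] for i in ranked)

    # top-p: keep the smallest prefix whose renormalized probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / mass
        if cumulative >= top_p:
            break

    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

Lowering the temperature concentrates probability on the top token, so fewer candidates survive the `top_p` cut and the output becomes more deterministic.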


## 11. Complete Usage Pipeline

In [16]:
def complete_pipeline_demo(technical_text):
    """
    Walk through the steps of the complete pipeline (calls are left commented out)
    """
    print("COMPLETE PIPELINE")
    print("="*70)

    # Step 1: Classify
    print("\n1. CLASSIFICATION")
    print("Classifying text...")
    # result = classifier.predict(technical_text)
    # print(f"Prediction: {result['prediction']}")
    # print(f"Confidence: {result['confidence']:.2%}")
    print("Note: Requires a trained classifier")

    # Step 2: Generate summary
    print("\n2. GENERATION")
    print("Generating plain-language summary...")
    # generated = generate_plain_summary(model, tokenizer, technical_text)
    # print(f"Summary: {generated[:100]}...")
    print("Note: Requires a fine-tuned model")

    # Step 3: Evaluate
    print("\n3. EVALUATION")
    print("Evaluating summary quality...")
    # metrics = evaluator.evaluate_summary(generated, reference, technical_text)
    # print(f"Relevance (F1): {metrics['relevance']['f1']:.4f}")
    # print(f"Factuality: {metrics['factual']['score']:.4f}")
    # print(f"Readability: {metrics['readability']['flesch_reading_ease']:.2f}")
    print("Note: Metrics already working")

    # Step 4: Compare
    print("\n4. COMPARISON")
    print("Comparing with commercial APIs...")
    # gpt4_summary = api.generate_with_gpt4(technical_text, api_key)
    # claude_summary = api.generate_with_claude(technical_text, api_key)
    print("Note: Requires API keys")

    print("\n" + "="*70)
    print("Complete pipeline defined")

# Run the demo
complete_pipeline_demo(EXAMPLE_TECHNICAL_TEXTS[0])

COMPLETE PIPELINE

1. CLASSIFICATION
Classifying text...
Note: Requires a trained classifier

2. GENERATION
Generating plain-language summary...
Note: Requires a fine-tuned model

3. EVALUATION
Evaluating summary quality...
Note: Metrics already working

4. COMPARISON
Comparing with commercial APIs...
Note: Requires API keys

Complete pipeline defined


## 13. Model Comparison

In [18]:
def compare_all_approaches():
    """
    Print a side-by-side comparison of all approaches
    """
    print("COMPARISON OF APPROACHES")
    print("="*70)

    comparison_table = {
        "Model": [
            "CodeLlama-7B (fine-tuned)",
            "FLAN-T5-XL (fine-tuned)",
            "Llama-2-13B (fine-tuned)",
            "GPT-4o (API)",
            "Claude 3.5 (API)"
        ],
        "Type": [
            "Decoder-only",
            "Encoder-Decoder",
            "Decoder-only",
            "Commercial API",
            "Commercial API"
        ],
        "Parameters": ["7B", "3B", "13B", ">100B", ">100B"],
        "Fine-tune time": ["2-4h", "3-4h", "4-8h", "N/A", "N/A"],
        "Cost": ["GPU", "GPU", "GPU", "API", "API"]
    }

    df = pd.DataFrame(comparison_table)
    print("\n" + df.to_string(index=False))

    print("\n" + "="*70)
    print("\nExpected metrics:")
    print("  CodeLlama-7B: BERTScore F1 ~0.82, Readability ~68")
    print("  FLAN-T5-XL: BERTScore F1 ~0.80, Readability ~65")
    print("  Llama-2-13B: BERTScore F1 ~0.85, Readability ~70")
    print("  GPT-4o: BERTScore F1 ~0.90, Readability ~75")
    print("  Claude 3.5: BERTScore F1 ~0.88, Readability ~73")

    print("\nRecommendation:")
    print("  1. BEST QUALITY: GPT-4o (but requires an API)")
    print("  2. BEST BALANCE: CodeLlama-7B fine-tuned")
    print("  3. BEST EFFICIENCY: FLAN-T5-XL fine-tuned")

compare_all_approaches()

COMPARISON OF APPROACHES

                    Model            Type Parameters Fine-tune time Cost
CodeLlama-7B (fine-tuned)    Decoder-only         7B           2-4h  GPU
  FLAN-T5-XL (fine-tuned) Encoder-Decoder         3B           3-4h  GPU
 Llama-2-13B (fine-tuned)    Decoder-only        13B           4-8h  GPU
             GPT-4o (API)  Commercial API      >100B            N/A  API
         Claude 3.5 (API)  Commercial API      >100B            N/A  API


Expected metrics:
  CodeLlama-7B: BERTScore F1 ~0.82, Readability ~68
  FLAN-T5-XL: BERTScore F1 ~0.80, Readability ~65
  Llama-2-13B: BERTScore F1 ~0.85, Readability ~70
  GPT-4o: BERTScore F1 ~0.90, Readability ~75
  Claude 3.5: BERTScore F1 ~0.88, Readability ~73

Recommendation:
  1. BEST QUALITY: GPT-4o (but requires an API)
  2. BEST BALANCE: CodeLlama-7B fine-tuned
  3. BEST EFFICIENCY: FLAN-T5-XL fine-tuned




## 14. Exploring and Loading Real Data

In [19]:
# DATA ALREADY LOADED from .txt files
# The data lives in the variables all_technical and all_plain,
# populated by load_txt_dataset()

print(f"Real data ready: {len(all_technical)} pairs")
print("Loaded .txt files from the repository")


Real data ready: 6000 pairs
Loaded .txt files from the repository


## 15. Training the Classifier on Real Data

In [20]:
# Entrenar clasificador con datos reales
print("ENTRENAMIENTO DEL CLASIFICADOR")
print("="*70)

if len(all_technical) > 20:  # Necesitamos al menos 20 pares
    print(f"\nEntrenando con {len(all_technical)} pares de textos...")
    print("Esto puede tardar 5-15 minutos dependiendo del tamaño...\n")

    # Entrenar ambos clasificadores
    classification_results = classifier.train_classifiers(
        all_technical,
        all_plain
    )

    print("\n" + "="*70)
    print("CLASIFICADOR ENTRENADO EXITOSAMENTE")
    print("="*70)
    print(f"\nBERT Accuracy: {classification_results['bert_accuracy']:.2%}")
    print(f"TF-IDF Accuracy: {classification_results['tfidf_accuracy']:.2%}")

    # Probar con ejemplo
    print("\nProbando clasificador:")
    test_text = EXAMPLE_TECHNICAL_TEXTS[0]
    result = classifier.predict(test_text, method='bert')
    print(f"  Texto: {test_text[:80]}...")
    print(f"  Predicción: {result['prediction']}")
    print(f"  Confianza: {result['confidence']:.2%}")

else:
    print("\nNo hay suficientes datos para entrenar.")
    print("Necesitas al menos 20 pares de textos.")
    print("\nVerifica la función extract_text_pairs()")

CLASSIFIER TRAINING

Training with 6000 text pairs...
This may take 5-15 minutes depending on the dataset size...

Training with 9600 texts


Extracting BERT embeddings: 100%|██████████| 9600/9600 [01:50<00:00, 86.86it/s]
Extracting BERT embeddings: 100%|██████████| 2400/2400 [00:27<00:00, 87.42it/s]



BERT results:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1164
           1       0.99      0.98      0.98      1236

    accuracy                           0.98      2400
   macro avg       0.98      0.98      0.98      2400
weighted avg       0.98      0.98      0.98      2400






TF-IDF results:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1164
           1       1.00      0.99      1.00      1236

    accuracy                           1.00      2400
   macro avg       1.00      1.00      1.00      2400
weighted avg       1.00      1.00      1.00      2400


CLASSIFIER TRAINED SUCCESSFULLY

BERT Accuracy: 98.25%
TF-IDF Accuracy: 99.62%

Testing classifier:


Extracting BERT embeddings: 100%|██████████| 1/1 [00:00<00:00, 79.20it/s]

  Text: The patient presented with acute myocardial infarction characterized by 
    ST-...
  Prediction: Plain
  Confidence: 59.00%
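
The log above reports 9600 training texts and 2400 test texts, which is consistent with 6000 pairs flattened into 12000 labeled texts and split 80/20. The split logic inside `train_classifiers` is not shown in this notebook, so the following is only a sketch of how such a split could be built; the label mapping (0 = technical, 1 = plain), the seed, and the helper name `build_labeled_split` are assumptions.

```python
import random


def build_labeled_split(technical, plain, test_fraction=0.2, seed=42):
    """Label texts (assumed mapping: 0 = technical, 1 = plain) and split them."""
    texts = [(t, 0) for t in technical] + [(p, 1) for p in plain]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(texts)
    n_test = round(len(texts) * test_fraction)
    # First n_test shuffled items become the test set, the rest the training set.
    return texts[n_test:], texts[:n_test]


# Synthetic stand-ins with the same size as the real dataset (6000 pairs).
technical = [f"technical text {i}" for i in range(6000)]
plain = [f"plain text {i}" for i in range(6000)]
train_set, test_set = build_labeled_split(technical, plain)
print(len(train_set), len(test_set))
```

A plain shuffle like this does not keep both halves of a pair on the same side of the split, and it does not balance the classes exactly, which would match the uneven supports (1164 vs 1236) in the reports above.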





In [21]:
# ========================================
# HUGGINGFACE AUTHENTICATION (REQUIRED FOR CODELLAMA)
# ========================================

print("HuggingFace authentication")
print("="*70)

# Option 1: Interactive login (RECOMMENDED)
from huggingface_hub import login

print("\n1. Go to: https://huggingface.co/settings/tokens")
print("2. Create a token (Read access)")
print("3. Paste the token when prompted:\n")

login()

print("\nAuthentication complete")
print("You can now access CodeLlama-7b-Instruct-hf")


🔑 HuggingFace authentication

1. Go to: https://huggingface.co/settings/tokens
2. Create a token (Read access)
3. Paste the token when prompted:


✅ Authentication complete
You can now access CodeLlama-7b-Instruct-hf


In [24]:
# ========================================
# FINE-TUNING CODELLAMA-7B ON REAL DATA
# ========================================

print("="*70)
print("STARTING CODELLAMA-7B FINE-TUNING")
print("="*70)
print("Estimated time: 4-6 hours on A100")
print(f"Real data: {len(all_technical)} pairs")
print("="*70)

# 1. Prepare dataset (use up to 5,000 of the available pairs)
print("\n1. Preparing dataset...")
train_size = min(5000, len(all_technical))
dataset = prepare_dataset_for_finetuning(
    all_technical[:train_size],
    all_plain[:train_size]
)
print(f"   Dataset prepared: {len(dataset)} examples")

# 2. Load model
print("\n2. Loading CodeLlama-7B with LoRA...")
finetuned_model, finetuned_tokenizer = load_model_for_finetuning("codellama-7b")

# 3. Tokenize dataset
print("\n3. Tokenizing dataset...")
def tokenize_function(examples):
    # Truncate/pad every example to a fixed 512-token window
    return finetuned_tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)
print(f"   Dataset tokenized: {len(tokenized_dataset)} examples")

# 4. Data collator (mlm=False selects causal language modeling)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=finetuned_tokenizer,
    mlm=False
)

# 5. Create Trainer
print("\n4. Configuring Trainer...")
from transformers import Trainer

trainer = Trainer(
    model=finetuned_model,
    args=setup_training_args(),
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
print("   Trainer configured")

# 6. TRAIN
print("\n5. TRAINING...")
print("   ⏱️ This will take 4-6 hours on A100. Please wait...\n")

trainer.train()

# 7. Save
print("\n6. Saving model...")
finetuned_model.save_pretrained("./codellama-7b-pls-finetuned")
finetuned_tokenizer.save_pretrained("./codellama-7b-pls-finetuned")

print("\n" + "="*70)
print("FINE-TUNING COMPLETE")
print("="*70)
print("Model saved to: ./codellama-7b-pls-finetuned")
print("Global variables: finetuned_model, finetuned_tokenizer")


STARTING CODELLAMA-7B FINE-TUNING
Estimated time: 4-6 hours on A100
Real data: 6000 pairs

1. Preparing dataset...
   Dataset prepared: 5000 examples

2. Loading CodeLlama-7B with LoRA...
Loading codellama-7b...


tokenizer_config.json:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/646 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Model codellama-7b ready for fine-tuning
Trainable parameters: 33,554,432

3. Tokenizing dataset...


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

   Dataset tokenized: 5000 examples

4. Configuring Trainer...
   Trainer configured

5. TRAINING...
   ⏱️ This will take 4-6 hours on A100. Please wait...



wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: gustavocontre (gustavocontre-universidad-de-los-andes) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 10 | 1.6731 |
| 20 | 1.4572 |
| 30 | 1.4135 |
| 40 | 1.369 |
| 50 | 1.3672 |
| 60 | 1.3351 |
| 70 | 1.3148 |
| 80 | 1.3384 |
| 90 | 1.3221 |
| 100 | 1.2738 |





6. Saving model...

FINE-TUNING COMPLETE
Model saved to: ./codellama-7b-pls-finetuned
Global variables: finetuned_model, finetuned_tokenizer
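
The log reports 33,554,432 trainable parameters. The LoRA configuration used by `load_model_for_finetuning` is not shown here, so the sketch below only illustrates how that number can arise. It assumes the standard CodeLlama-7B shape (hidden size 4096, 32 layers, square attention projections); the two rank/target-module combinations are hypothetical candidates that happen to reproduce the logged count, not the confirmed settings.

```python
def lora_trainable_params(hidden_size, num_layers, rank, num_target_modules):
    """Count LoRA parameters for square hidden_size x hidden_size projections.

    Each adapted matrix gets two low-rank factors:
    A (rank x hidden_size) and B (hidden_size x rank).
    """
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * num_target_modules * num_layers


# Hypothetical option 1: rank 32 on four attention projections per layer.
four_proj_r32 = lora_trainable_params(4096, 32, rank=32, num_target_modules=4)
# Hypothetical option 2: rank 64 on two attention projections per layer.
two_proj_r64 = lora_trainable_params(4096, 32, rank=64, num_target_modules=2)

print(f"{four_proj_r32:,}")
print(f"{two_proj_r64:,}")
```

Both candidates give the same total, so the parameter count alone does not identify which rank and target modules were used; the `LoraConfig` itself would need to be checked.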


## 16. Fine-tuning CodeLlama-7B (Step by Step)

In [25]:
# Load the fine-tuned model (after training)
import os
import torch

if os.path.exists("./codellama-7b-pls-finetuned"):
    print("Loading fine-tuned model...")

    from transformers import AutoModelForCausalLM, AutoTokenizer

    finetuned_model = AutoModelForCausalLM.from_pretrained(
        "./codellama-7b-pls-finetuned",
        device_map="auto",
        torch_dtype=torch.bfloat16
    )
    finetuned_tokenizer = AutoTokenizer.from_pretrained("./codellama-7b-pls-finetuned")

    print("Fine-tuned model loaded")

    # Test generation
    print("\nTesting generation with the fine-tuned model:")
    test_summary = generate_plain_summary(
        finetuned_model,
        finetuned_tokenizer,
        EXAMPLE_TECHNICAL_TEXTS[0]
    )
    print(f"\nGenerated summary: {test_summary}")

else:
    print("Fine-tuned model not found.")
    print("You must train the model first (previous cell).")
    finetuned_model = None
    finetuned_tokenizer = None

`torch_dtype` is deprecated! Use `dtype` instead!


Loading fine-tuned model...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Fine-tuned model loaded

Testing generation with the fine-tuned model:

Generated summary: Background
Inhaled nitric oxide (iNO) is delivered via an inflatable device that delivers the gas directly to the target site. Inhaled nitric oxide has been shown to improve outcomes following acute myocardial infarction (AMI). This review is an update of the original Cochrane review, published in Issue 10, 2016. 
Objectives
To assess the effects of inhaled nitric oxide for people with AMI who have undergone percutaneous transluminal angioplasty (PTA). 
Search methods
We searched the following databases up to May 2019: the Cochrane Central Register of Controlled Trials (CENTRAL), MEDLINE, Embase, LILACS, and the World Health Organization International Clinical Trials Registry Platform, and we checked the reference lists of included studies for


## 18. Complete System Evaluation

In [27]:
# Complete system evaluation
def evaluate_complete_system():
    """
    Evaluate the whole system end-to-end
    """
    print("COMPLETE SYSTEM EVALUATION")
    print("="*70)

    if finetuned_model is None:
        print("\nFine-tuned model not available.")
        print("Using examples for demonstration...\n")

        # Use the built-in examples
        test_technical = EXAMPLE_TECHNICAL_TEXTS[0]
        test_plain = EXAMPLE_PLAIN_SUMMARIES[0]

        print(f"Technical text: {test_technical[:100]}...")
        print(f"\nExpected summary: {test_plain[:100]}...")

        # Evaluate the example (the reference summary is scored against itself)
        results = evaluator.evaluate_summary(
            test_plain,
            test_plain,
            test_technical
        )

        print("\nMetrics:")
        print(f"  BERTScore F1: {results['relevance']['f1']:.4f}")
        print(f"  Factuality (NLI): {results['factual']['score']:.4f}")
        print(f"  Flesch Reading Ease: {results['readability']['flesch_reading_ease']:.2f}")

    else:
        print("\nGenerating summaries with the fine-tuned model...\n")

        # Generate and evaluate 5 examples
        for i in range(min(5, len(all_technical))):
            print(f"\nExample {i+1}:")
            print("-" * 70)

            generated = generate_plain_summary(
                finetuned_model,
                finetuned_tokenizer,
                all_technical[i]
            )

            results = evaluator.evaluate_summary(
                generated,
                all_plain[i],
                all_technical[i]
            )

            print(f"Technical: {all_technical[i][:80]}...")
            print(f"Generated: {generated[:80]}...")
            print(f"BERTScore F1: {results['relevance']['f1']:.4f}")
            print(f"Factuality: {results['factual']['score']:.4f}")
            print(f"Readability: {results['readability']['flesch_reading_ease']:.2f}")

    print("\n" + "="*70)
    print("EVALUATION COMPLETE")

evaluate_complete_system()

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


COMPLETE SYSTEM EVALUATION

Generating summaries with the fine-tuned model...


Example 1:
----------------------------------------------------------------------


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Technical: Background
Subfertility affects 15% to 20% of couples trying to conceive. In vit...
Generated: A single intrauterine device for preventing miscarriage after a previous miscarr...
BERTScore F1: 0.8141
Factuality: 0.5712
Readability: 30.22

Example 2:
----------------------------------------------------------------------

Technical: Background
Honey is a viscous, supersaturated sugar solution derived from nectar...
Generated: Immune globulin for prevention of deep vein thrombosis in people at high risk of...
BERTScore F1: 0.7995
Factuality: 0.5693
Readability: 31.18

Example 3:
----------------------------------------------------------------------

Technical: Background
Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopment...
Generated: Pregabalin versus placebo for epilepsy in adults
What is the aim of this review?...
BERTScore F1: 0.8167
Factuality: 0.5711
Readability: 29.05

Example 4:
----------------------------------------------------------------------

Technical: Background
Chronic obstructive pulmonary disease (COPD) is characterised by airf...
Generated: What is the relationship between diet and inflammation in people with rheumatoid...
BERTScore F1: 0.8253
Factuality: 0.5664
Readability: 30.48

Example 5:
----------------------------------------------------------------------

Technical: Background
Implantable methods of contraception offer long‐acting reversible con...
Generated: Antidepressant medicines for anxiety disorders
Review questionWhat is the eviden...
BERTScore F1: 0.8148
Factuality: 0.5709
Readability: 30.67

EVALUATION COMPLETE
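
Per-example scores are easier to compare once they are aggregated. The sketch below computes the mean and sample standard deviation of the five results printed above, using only the standard library; the numbers are copied from that output, and the dictionary layout is an illustrative choice rather than the structure returned by `evaluator.evaluate_summary`.

```python
from statistics import mean, stdev

# Per-example scores copied from the evaluation output above.
scores = {
    "bertscore_f1": [0.8141, 0.7995, 0.8167, 0.8253, 0.8148],
    "factuality": [0.5712, 0.5693, 0.5711, 0.5664, 0.5709],
    "flesch_reading_ease": [30.22, 31.18, 29.05, 30.48, 30.67],
}

# Mean and sample standard deviation per metric.
summary = {name: (mean(vals), stdev(vals)) for name, vals in scores.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.4f}, std={sd:.4f}")
```

The factuality values barely vary across examples even though the generated titles in the output above address different topics than their source texts, so it may be worth checking whether the NLI score is discriminating between supported and unsupported summaries.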


## 19. Comparison with Commercial APIs

In [None]:
# ========================================
# COMMERCIAL API CONFIGURATION
# ========================================

import os

# Configure API keys
OPENAI_API_KEY = "sk-..."  # Replace with your key
ANTHROPIC_API_KEY = "sk-..."  # Replace with your key

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

# Enable comparison
ENABLE_API_COMPARISON = True

print("Commercial APIs configured")
print("Comparison enabled")

Commercial APIs configured
Comparison enabled
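
Writing keys as literals in a notebook makes them easy to leak when the file is shared. As a hedged alternative, the sketch below reads a key from an environment variable and treats the `"sk-..."` placeholder as missing; `get_api_key` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration.

```python
import os


def get_api_key(var_name):
    """Return the key stored in an environment variable, or None if unset.

    Values ending in "..." are treated as unset (an assumption based on the
    "sk-..." placeholders used in the cell above).
    """
    value = os.environ.get(var_name, "").strip()
    if not value or value.endswith("..."):
        return None
    return value


os.environ["DEMO_API_KEY"] = "sk-..."      # placeholder, treated as missing
print(get_api_key("DEMO_API_KEY"))
os.environ["DEMO_API_KEY"] = "demo-value"  # hypothetical key for illustration
print(get_api_key("DEMO_API_KEY"))
```

With a helper like this, the comparison cell can skip a provider cleanly when its key is absent instead of sending a placeholder to the API.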


In [39]:
# Comparison with commercial APIs


# Configure API keys
import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")

ENABLE_API_COMPARISON = True

if ENABLE_API_COMPARISON and OPENAI_API_KEY:
    print("COMPARISON WITH COMMERCIAL APIS")
    print("="*70)

    test_text = all_technical[0] if len(all_technical) > 0 else EXAMPLE_TECHNICAL_TEXTS[0]
    reference = all_plain[0] if len(all_plain) > 0 else EXAMPLE_PLAIN_SUMMARIES[0]

    results_comparison = {}

    # 1. Fine-tuned model
    if finetuned_model:
        print("\n1. Generating with CodeLlama-7B fine-tuned...")
        generated_finetuned = generate_plain_summary(finetuned_model, finetuned_tokenizer, test_text)
        results_comparison['CodeLlama-7B'] = evaluator.evaluate_summary(
            generated_finetuned, reference, test_text
        )

    # 2. GPT-4o
    print("\n2. Generating with GPT-4o...")
    generated_gpt4 = api.generate_with_gpt4(test_text, OPENAI_API_KEY)
    results_comparison['GPT-4o'] = evaluator.evaluate_summary(
        generated_gpt4, reference, test_text
    )

    # 3. Claude
    if ANTHROPIC_API_KEY:
        print("\n3. Generating with Claude Sonnet 4.5...")
        generated_claude = api.generate_with_claude(test_text, ANTHROPIC_API_KEY)
        results_comparison['Claude Sonnet 4.5'] = evaluator.evaluate_summary(
            generated_claude, reference, test_text
        )

    # Show results
    print("\n" + "="*70)
    print("COMPARISON RESULTS")
    print("="*70)

    for model_name, metrics in results_comparison.items():
        print(f"\n{model_name}:")
        print(f"  BERTScore F1: {metrics['relevance']['f1']:.4f}")
        print(f"  Factuality: {metrics['factual']['score']:.4f}")
        print(f"  Readability: {metrics['readability']['flesch_reading_ease']:.2f}")

else:
    print("API COMPARISON NOT ENABLED")
    print("="*70)
    print("\nTo enable it:")
    print("  1. Get API keys from OpenAI and/or Anthropic")
    print("  2. Set OPENAI_API_KEY and ANTHROPIC_API_KEY")
    print("  3. Set ENABLE_API_COMPARISON = True")
    print("\nNOTE: API usage is billed")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


COMPARISON WITH COMMERCIAL APIS

1. Generating with CodeLlama-7B fine-tuned...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



2. Generating with GPT-4o...

3. Generating with Claude Sonnet 4.5...

COMPARISON RESULTS

CodeLlama-7B:
  BERTScore F1: 0.8234
  Factuality: 0.5706
  Readability: 51.77

GPT-4o:
  BERTScore F1: 0.8086
  Factuality: 0.5711
  Readability: 29.50

Claude Sonnet 4.5:
  BERTScore F1: 0.8169
  Factuality: 0.5704
  Readability: 45.31
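
The readability values above are Flesch Reading Ease scores, where higher means easier: roughly 60-70 corresponds to plain English and 30-50 to difficult text. The sketch below is a standalone approximation of the formula using a crude vowel-group syllable counter. The evaluator used in this notebook is not shown here and presumably relies on a library such as `textstat`, so this version is for intuition only and will not reproduce its exact numbers; both helper names and the sample sentences are illustrative.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (minimum 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("Text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


plain_text = "You had a heart attack. A clot blocked blood flow to your heart."
technical_text = (
    "The patient presented with acute myocardial infarction "
    "characterized by persistent electrocardiographic abnormalities."
)
print(f"plain: {flesch_reading_ease(plain_text):.1f}")
print(f"technical: {flesch_reading_ease(technical_text):.1f}")
```

Short sentences and short words push the score up, which is why the plain sentence scores far higher than the technical one. The score is not bounded to 0-100, so very dense text can go negative.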


## 20. Visualizing Results

In [40]:
# Results visualization
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def visualize_results(results_dict):
    """
    Plot the comparison results
    """
    if not results_dict:
        print("No results to visualize")
        return

    # Prepare data
    models = list(results_dict.keys())
    bertscore = [r['relevance']['f1'] for r in results_dict.values()]
    factual = [r['factual']['score'] for r in results_dict.values()]
    readability = [r['readability']['flesch_reading_ease']/100 for r in results_dict.values()]

    # Create one chart per metric
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Chart 1: BERTScore
    axes[0].bar(models, bertscore, color='skyblue')
    axes[0].set_title('Relevance (BERTScore F1)')
    axes[0].set_ylabel('Score')
    axes[0].set_ylim([0, 1])
    axes[0].tick_params(axis='x', rotation=45)

    # Chart 2: Factuality
    axes[1].bar(models, factual, color='lightgreen')
    axes[1].set_title('Factuality (NLI)')
    axes[1].set_ylabel('Score')
    axes[1].set_ylim([0, 1])
    axes[1].tick_params(axis='x', rotation=45)

    # Chart 3: Readability
    axes[2].bar(models, readability, color='salmon')
    axes[2].set_title('Readability (normalized Flesch)')
    axes[2].set_ylabel('Score')
    axes[2].set_ylim([0, 1])
    axes[2].tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.show()

    # Combined grouped bar chart
    fig, ax = plt.subplots(figsize=(12, 6))

    x = np.arange(len(models))
    width = 0.25

    ax.bar(x - width, bertscore, width, label='Relevance', color='skyblue')
    ax.bar(x, factual, width, label='Factuality', color='lightgreen')
    ax.bar(x + width, readability, width, label='Readability', color='salmon')

    ax.set_xlabel('Models')
    ax.set_ylabel('Score')
    ax.set_title('Model Comparison by Metric')
    ax.set_xticks(x)
    ax.set_xticklabels(models, rotation=45, ha='right')
    ax.legend()
    ax.set_ylim([0, 1.1])
    ax.grid(axis='y', alpha=0.3)

    plt.tight_layout()
    plt.show()

print("Visualization function ready")
print("\nUsage:")
print("  visualize_results(results_comparison)")

Visualization function ready

Usage:
  visualize_results(results_comparison)


## 21. Exporting Results

In [41]:
# Export results for analysis
import pandas as pd

def export_results(results_dict, filename="results_comparison.csv"):
    """
    Export results to CSV for analysis
    """
    if not results_dict:
        print("No results to export")
        return

    export_data = []

    # One flat row per model
    for model_name, metrics in results_dict.items():
        export_data.append({
            'Model': model_name,
            'BERTScore_Precision': metrics['relevance']['precision'],
            'BERTScore_Recall': metrics['relevance']['recall'],
            'BERTScore_F1': metrics['relevance']['f1'],
            'Factual_Score': metrics['factual']['score'],
            'Flesch_Reading_Ease': metrics['readability']['flesch_reading_ease'],
            'Flesch_Kincaid_Grade': metrics['readability']['flesch_kincaid_grade']
        })

    df_results = pd.DataFrame(export_data)
    df_results.to_csv(filename, index=False)

    print(f"Results exported to: {filename}")
    print("\nSummary:")
    print(df_results.to_string(index=False))

    return df_results

print("Export function ready")
print("\nUsage:")
print("  df_results = export_results(results_comparison)")

Export function ready

Usage:
  df_results = export_results(results_comparison)
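
To make the flattening step concrete, here is a dependency-free sketch that turns a nested metrics dictionary into CSV text with the standard `csv` module. It is not the notebook's `export_results`: `rows_from_results` is a hypothetical helper, only three metric fields are included, and the sample values are copied from the comparison output in section 19.

```python
import csv
import io


def rows_from_results(results_dict):
    """Flatten the nested metrics dict into one flat row per model."""
    return [
        {
            "Model": model,
            "BERTScore_F1": m["relevance"]["f1"],
            "Factual_Score": m["factual"]["score"],
            "Flesch_Reading_Ease": m["readability"]["flesch_reading_ease"],
        }
        for model, m in results_dict.items()
    ]


# Values copied from the comparison output; only a subset of fields is shown.
demo_results = {
    "CodeLlama-7B": {
        "relevance": {"f1": 0.8234},
        "factual": {"score": 0.5706},
        "readability": {"flesch_reading_ease": 51.77},
    },
    "GPT-4o": {
        "relevance": {"f1": 0.8086},
        "factual": {"score": 0.5711},
        "readability": {"flesch_reading_ease": 29.50},
    },
}

rows = rows_from_results(demo_results)
buffer = io.StringIO()  # in-memory file so the example writes nothing to disk
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Writing to an in-memory buffer makes it easy to inspect the CSV before committing it to a file; swapping `buffer` for an opened file handle would produce the same content on disk.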
