<a href="https://colab.research.google.com/github/cbadenes/curso-pln/blob/main/notebooks/proyecto_apoyo/03_EntrenamientoModelos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Dependencias Necesarias

In [1]:
# Instalar las dependencias necesarias
!pip install transformers datasets
!pip install torch

Successfully installed nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nvjitlink-cu12-12.4.127


#Generación de Texto (AutoModeloForCausalLM)

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset
import torch

# Dataset con texto continuo sobre Estopa
datos = [
    {"text": "Estopa es un dúo musical español formado por los hermanos David y José Manuel Muñoz. La banda se fundó en 1999 en Cornellà de Llobregat, Barcelona. Su estilo musical combina flamenco, rock y rumba catalana, creando un sonido único que los ha llevado a la fama."},
    {"text": "El primer álbum de Estopa, titulado 'Estopa', se lanzó en  1999 y fue un éxito inmediato, vendiendo más de un millón de copias. Canciones como 'La raja de tu falda' y 'Como Camarón' se convirtieron en clásicos."},
    {"text": "A lo largo de su carrera, Estopa ha lanzado más de 10 discos de estudio, manteniendo su característico estilo y evolucionando con nuevas influencias. Su álbum 'Destrangis' consolidó aún más su éxito con canciones como 'Vino tinto'."},
    {"text": "Estopa ha ganado numerosos premios, incluidos los Premios Ondas y los 40 Principales, que reconocen su contribución a la música española. Sus conciertos son conocidos por su energía y conexión con el público."},
    {"text": "La ciudad natal de los hermanos, Cornellà de Llobregat, influyó profundamente en su música. La mezcla cultural y las tradiciones flamencas del lugar se reflejan en sus letras y melodías."},
    {"text": "Además de su música, Estopa es conocido por sus letras llenas de humor y referencias cotidianas. Estas características los han hecho destacar y conectar con una audiencia amplia y diversa."},
    {"text": "Estopa continúa siendo una de las bandas más queridas en España, manteniendo su esencia mientras exploran nuevas direcciones en su música. Su legado perdurará como un símbolo de creatividad y autenticidad en la música española."}
]

# Crear un Dataset compatible con Hugging Face
dataset = Dataset.from_list(datos)

# Cargar el modelo y el tokenizador preentrenados en español
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
modelo = AutoModelForCausalLM.from_pretrained(model_name)
tokenizador = AutoTokenizer.from_pretrained(model_name)

# Usar eos_token_id como pad_token_id
tokenizador.pad_token = tokenizador.eos_token
modelo.config.pad_token_id = tokenizador.eos_token_id

# Tokenizar el dataset
def procesar_datos(ejemplo):
    tokenizado = tokenizador(
        ejemplo["text"], max_length=128, truncation=True, padding="max_length", return_tensors="pt"
    )
    return {key: tensor.squeeze() for key, tensor in tokenizado.items()}

dataset_procesado = dataset.map(procesar_datos)

# Usar DataCollatorForLanguageModeling (para gestionar el padding correctamente)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizador, mlm=False  # mlm=False porque es modelado causal, no enmascarado
)

# Configuración del entrenamiento
argumentos = TrainingArguments(
    output_dir="./resultados",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=1,
    logging_steps=10,
    report_to="none"  # Desactiva W&B
)

# Ajustar las dimensiones del modelo al tokenizador
modelo.resize_token_embeddings(len(tokenizador))

# Crear el Trainer
trainer = Trainer(
    model=modelo,
    args=argumentos,
    data_collator=data_collator,
    train_dataset=dataset_procesado,
    tokenizer=tokenizador,
    eval_dataset=dataset_procesado
)

def generar_texto(prompt, modelo, tokenizador, max_length=100):
    """
    Función para generar texto con el modelo actual.
    """
    inputs = tokenizador(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    modelo.to("cuda" if torch.cuda.is_available() else "cpu")

    with torch.no_grad():
        output = modelo.generate(**inputs, max_length=max_length, do_sample=True, top_k=50, top_p=0.95)

    return tokenizador.decode(output[0], skip_special_tokens=True)

# Prueba antes del ajuste fino
prompt_test = "Estopa es una banda española conocida por"
print("\n🔹 **Generación de texto ANTES del ajuste fino:**")
print(generar_texto(prompt_test, modelo, tokenizador))



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

  trainer = Trainer(



🔹 **Generación de texto ANTES del ajuste fino:**
Estopa es una banda española conocida por su estilo hard rock y su sonido punk. Formada en 1982 en Oviedo por Estel Mena y Juan Pablo Estel. Entre 1987 y 1993 estuvo integrada por Estel Mena, el bajista de la banda, que pasaría luego a formar la banda Maná.

Historia 

Estoló fue


Entrenamiento

In [3]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.699319
2,No log,0.28886
3,1.191700,0.172364


TrainOutput(global_step=12, training_loss=1.026526893178622, metrics={'train_runtime': 46.2751, 'train_samples_per_second': 0.454, 'train_steps_per_second': 0.259, 'total_flos': 16684615729152.0, 'train_loss': 1.026526893178622, 'epoch': 3.0})

Después del Ajuste fino

In [4]:
print("\n **Generación de texto DESPUÉS del ajuste fino:**")
print(generar_texto(prompt_test, modelo, tokenizador))


🔹 **Generación de texto DESPUÉS del ajuste fino:**
Estopa es una banda española conocida por sus letras llenas de humor y referencias cotidianas. Comenzaron a sonar en los años 90, siendo un de los grupos más queridos y queridos en España. Su legado perdurará como un símbolo de creatividad y autenticidad en la música española. Su álbum 'Qué bueno' se llevó a una raja de éxitos in


#2. Clasificación de Texto (AutoModelForSequenceClassification)

Antes de ajuste fino:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report
import torch

# Tokenizador y modelo preentrenado
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Positivo y Negativo

#EVALUACIÓN
texts = ["Amazing movie!", "Terrible plot.", "Loved the characters!", "Not my taste."]
true_labels = [1, 0, 1, 0]  # Etiquetas reales

# Predicciones
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
predicted_classes = torch.argmax(outputs.logits, dim=1).tolist()

# Calcular métricas
accuracy = accuracy_score(true_labels, predicted_classes)
print("Accuracy:", accuracy)
print("Reporte de clasificación:")
print(classification_report(true_labels, predicted_classes, target_names=["Negative", "Positive"]))


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.5
Reporte de clasificación:
              precision    recall  f1-score   support

    Negative       0.50      1.00      0.67         2
    Positive       0.00      0.00      0.00         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Texto de prueba
text = "This movie is amazing, I loved it!"

# Tokenización y predicción
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Mover las entradas al mismo dispositivo que el modelo (GPU si está disponible)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model(**inputs)
logits = outputs.logits  # predicciones sin normalizar, es decir, los valores antes de aplicar una función como softmax.
predicted_class = torch.argmax(logits, dim=1).item()

# Mostrar resultado
label_map = {0: "Negative", 1: "Positive"}  # Cambia según las etiquetas de tu modelo
print(f"Texto: {text}")
print(f"Predicción: {label_map[predicted_class]}")

Texto: This movie is amazing, I loved it!
Predicción: Negative


AJUSTE FINO CON DATASET DE IMBD

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset

# Cargar el dataset IMDb (análisis de sentimientos)
dataset = load_dataset("imdb")

# Tokenizador y modelo preentrenado
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Positivo y Negativo

# Preprocesar datos
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=False)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Usar DataCollator para hacer padding dinámico
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Configuración del entrenamiento
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Configurar el Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42).select(range(500)),
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Entrenar el modelo
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,No log,0.285049
2,No log,0.319991
3,No log,0.338733


TrainOutput(global_step=375, training_loss=0.250618896484375, metrics={'train_runtime': 636.865, 'train_samples_per_second': 9.421, 'train_steps_per_second': 0.589, 'total_flos': 1555677003697920.0, 'train_loss': 0.250618896484375, 'epoch': 3.0})

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Texto de prueba
text = "This movie is amazing, I loved it!"

# Tokenización y predicción
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Mover las entradas al mismo dispositivo que el modelo (GPU si está disponible)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model(**inputs)
logits = outputs.logits  # predicciones sin normalizar, es decir, los valores antes de aplicar una función como softmax.
predicted_class = torch.argmax(logits, dim=1).item()

# Mostrar resultado
label_map = {0: "Negative", 1: "Positive"}  # Cambia según las etiquetas de tu modelo
print(f"Texto: {text}")
print(f"Predicción: {label_map[predicted_class]}")


Texto: This movie is amazing, I loved it!
Predicción: Positive


In [None]:
from sklearn.metrics import accuracy_score, classification_report

#EVALUACIÓN
texts = ["Amazing movie!", "Terrible plot.", "Loved the characters!", "Not my taste."]
true_labels = [1, 0, 1, 0]  # Etiquetas reales

# Predicciones
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
predicted_classes = torch.argmax(outputs.logits, dim=1).tolist()

# Calcular métricas
accuracy = accuracy_score(true_labels, predicted_classes)
print("Accuracy:", accuracy)
print("Reporte de clasificación:")
print(classification_report(true_labels, predicted_classes, target_names=["Negative", "Positive"]))


Accuracy: 1.0
Reporte de clasificación:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         2
    Positive       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4



#2. Reconocimiento de Entidades Nombradas (AutoModelForTokenClassification)

In [6]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
from datasets import load_dataset

# Cargar el dataset CoNLL-2003
dataset = load_dataset("conll2003")

# Tokenizador y modelo preentrenado
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(dataset["train"].features["ner_tags"].feature.names))

# Preprocesar datos
def preprocess_function(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Usar DataCollator para alineación de etiquetas
data_collator = DataCollatorForTokenClassification(tokenizer)

# Configuración del entrenamiento
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Configurar el Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42).select(range(500)),
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Texto de prueba
text = "My name is Wolfgang and I live in Berlin"

# Tokenización y predicción
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=2)

# Obtener las etiquetas predichas
predicted_labels = [dataset["train"].features["ner_tags"].feature.names[p] for p in predictions[0].tolist()]

# Imprimir los tokens y sus etiquetas predichas
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print("Etiquetas predichas:", predicted_labels)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

  trainer = Trainer(


Tokens: ['[CLS]', 'My', 'name', 'is', 'Wolfgang', 'and', 'I', 'live', 'in', 'Berlin', '[SEP]']
Etiquetas predichas: ['I-ORG', 'I-ORG', 'I-MISC', 'I-ORG', 'I-MISC', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-MISC']


Entrenamiento

In [7]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.149108
2,No log,0.124064
3,No log,0.100066


TrainOutput(global_step=375, training_loss=0.16891073608398438, metrics={'train_runtime': 29.2864, 'train_samples_per_second': 204.873, 'train_steps_per_second': 12.805, 'total_flos': 152435476445472.0, 'train_loss': 0.16891073608398438, 'epoch': 3.0})

Después de Ajuste Fino

In [9]:
import torch

# Texto de prueba
text = "My name is Wolfgang and I live in Berlin"

# Tokenización y predicción
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=2)

# Obtener las etiquetas predichas
predicted_labels = [dataset["train"].features["ner_tags"].feature.names[p] for p in predictions[0].tolist()]

# Imprimir los tokens y sus etiquetas predichas
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print("Etiquetas predichas:", predicted_labels)

Tokens: ['[CLS]', 'My', 'name', 'is', 'Wolfgang', 'and', 'I', 'live', 'in', 'Berlin', '[SEP]']
Etiquetas predichas: ['O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O', 'B-LOC', 'O']


#3. Preguntas y Respuestas (AutoModelForQuestionAnswering)

In [21]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset

# Cargar el dataset SQuAD v2
dataset = load_dataset("squad_v2")

# Tokenizador y modelo preentrenado
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Preprocesar datos
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        truncation=True,
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding=False
    )
    sample_mapping = inputs.pop("overflow_to_sample_mapping")
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offsets in enumerate(offset_mapping):
        input_ids = inputs["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = inputs.sequence_ids(i)
        sample_index = sample_mapping[i]
        answer = answers[sample_index]
        if len(answer["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            start_char = answer["answer_start"][0]
            end_char = start_char + len(answer["text"][0])
            token_start_index = 0
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1
            if offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)
            else:
                start_positions.append(cls_index)
                end_positions.append(cls_index)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)

# Usar DefaultDataCollator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='longest', max_length=512) #max_lenght para evitar error de memoria
# Configuración del entrenamiento
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Configurar el Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42).select(range(500)),
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Ejemplo de pregunta y contexto
context = """"Cáceres is a city located in the autonomous community of Extremadura, in western Spain, near the border with Portugal.
It was declared a UNESCO World Heritage City in 1986 due to the mixture of Roman, Islamic, Gothic, and Italian Renaissance
architecture found in its old town. The medieval walled enclosure preserves much of its walls and towers, making it
a significant tourist destination. The city also stands out for its cultural life, hosting the Womad festival and
several events throughout the year. Moreover, Cáceres is the capital of the province with the same name and
is situated in an area with a Mediterranean climate featuring mild winters and very hot summers. The local economy
relies on tourism, agriculture, and a growing industry related to food processing. Extremadura is known for its
diverse landscapes, including meadows, plains, and mountains, and for its gastronomic traditions, particularly
Iberian ham and Torta del Casar."""
question = "Where is Caceres placed?"
# Tokenizar la pregunta y el contexto
inputs = tokenizer(question, context, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Realizar la inferencia
with torch.no_grad():
    outputs = model(**inputs)

# Obtener las respuestas
answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits)

predict_answer_tokens = inputs["input_ids"][0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens)

# Imprimir la respuesta
print(f"Pregunta: {question}")
print(f"Contexto: {context}")
print(f"Respuesta: {answer}")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/11873 [00:00<?, ? examples/s]

Pregunta: Where is Caceres placed?
Contexto: "Cáceres is a city located in the autonomous community of Extremadura, in western Spain, near the border with Portugal. 
It was declared a UNESCO World Heritage City in 1986 due to the mixture of Roman, Islamic, Gothic, and Italian Renaissance 
architecture found in its old town. The medieval walled enclosure preserves much of its walls and towers, making it 
a significant tourist destination. The city also stands out for its cultural life, hosting the Womad festival and 
several events throughout the year. Moreover, Cáceres is the capital of the province with the same name and 
is situated in an area with a Mediterranean climate featuring mild winters and very hot summers. The local economy 
relies on tourism, agriculture, and a growing industry related to food processing. Extremadura is known for its 
diverse landscapes, including meadows, plains, and mountains, and for its gastronomic traditions, particularly 
Iberian ham and Torta del Ca

  trainer = Trainer(


Entrenar Modelo

In [22]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.382369
2,2.746200,1.944828
3,2.746200,2.081701
4,1.111400,2.302031
5,1.111400,2.540076


TrainOutput(global_step=1250, training_loss=1.6604977172851563, metrics={'train_runtime': 77.4986, 'train_samples_per_second': 129.035, 'train_steps_per_second': 16.129, 'total_flos': 680578122430752.0, 'train_loss': 1.6604977172851563, 'epoch': 5.0})

Después de Ajuste Fino

In [23]:
# Ejemplo de pregunta y contexto
context = """"Cáceres is a city located in the autonomous community of Extremadura, in western Spain, near the border with Portugal.
It was declared a UNESCO World Heritage City in 1986 due to the mixture of Roman, Islamic, Gothic, and Italian Renaissance
architecture found in its old town. The medieval walled enclosure preserves much of its walls and towers, making it
a significant tourist destination. The city also stands out for its cultural life, hosting the Womad festival and
several events throughout the year. Moreover, Cáceres is the capital of the province with the same name and
is situated in an area with a Mediterranean climate featuring mild winters and very hot summers. The local economy
relies on tourism, agriculture, and a growing industry related to food processing. Extremadura is known for its
diverse landscapes, including meadows, plains, and mountains, and for its gastronomic traditions, particularly
Iberian ham and Torta del Casar."""
question = "Where is Caceres?"
# Tokenizar la pregunta y el contexto
inputs = tokenizer(question, context, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Realizar la inferencia
with torch.no_grad():
    outputs = model(**inputs)

# Obtener las respuestas
answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits)

predict_answer_tokens = inputs["input_ids"][0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens)

# Imprimir la respuesta
print(f"Pregunta: {question}")
print(f"Contexto: {context}")
print(f"Respuesta: {answer}")

Pregunta: Where is Caceres?
Contexto: "Cáceres is a city located in the autonomous community of Extremadura, in western Spain, near the border with Portugal. 
It was declared a UNESCO World Heritage City in 1986 due to the mixture of Roman, Islamic, Gothic, and Italian Renaissance 
architecture found in its old town. The medieval walled enclosure preserves much of its walls and towers, making it 
a significant tourist destination. The city also stands out for its cultural life, hosting the Womad festival and 
several events throughout the year. Moreover, Cáceres is the capital of the province with the same name and 
is situated in an area with a Mediterranean climate featuring mild winters and very hot summers. The local economy 
relies on tourism, agriculture, and a growing industry related to food processing. Extremadura is known for its 
diverse landscapes, including meadows, plains, and mountains, and for its gastronomic traditions, particularly 
Iberian ham and Torta del Casar.
Re

#AutoModelForMaskedLM.from_pretrained (Masked Language Modeling)

Antes de ajuste Fino

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset

# Dataset para Masked LM
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Tokenizador y modelo
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenización
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# DataCollator para Masked LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Configuración de entrenamiento
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Comprobar rendimiento antes del ajuste fino
text = "The capital of Extremadura is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predicted_token_id = outputs.logits[0, inputs.input_ids[0].tolist().index(tokenizer.mask_token_id)].argmax().item()
print("Antes del ajuste fino:")
print(f"Texto: {text}")
print(f"Predicción: {tokenizer.decode(predicted_token_id)}\n")

# Entrenamiento
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer)
trainer.train()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

Map:   0%|          | 0/4358 [00:00<?, ? examples/s]

Map:   0%|          | 0/36718 [00:00<?, ? examples/s]

Map:   0%|          | 0/3760 [00:00<?, ? examples/s]



Antes del ajuste fino:
Texto: The capital of Extremadura is [MASK].
Predicción: madrid



  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,1.9025,
2,1.7645,
3,1.669,1.756723


TrainOutput(global_step=6885, training_loss=1.7913919242465506, metrics={'train_runtime': 3110.1472, 'train_samples_per_second': 35.418, 'train_steps_per_second': 2.214, 'total_flos': 7248265737292800.0, 'train_loss': 1.7913919242465506, 'epoch': 3.0})

Después del Ajuste fino

In [None]:
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Comprobar rendimiento después del ajuste fino
outputs = model(**inputs)
predicted_token_id = outputs.logits[0, inputs['input_ids'][0].tolist().index(tokenizer.mask_token_id)].argmax().item()
print("Después del ajuste fino:")
print(f"Texto: {text}")
print(f"Predicción: {tokenizer.decode(predicted_token_id)}\n")

Después del ajuste fino:
Texto: The capital of Extremadura is [MASK].
Predicción: merida



#3. AutoModelForMultipleChoice (Selección Múltiple)

Antes del ajuste fino

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMultipleChoice, Trainer, TrainingArguments
import torch

dataset = load_dataset("swag", "regular")
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)



# Tokenización
def preprocess_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    second_sentences = [[f"{examples['sent2'][i]}" for i in range(len(examples["sent2"]))] * 4]

    # Aplanar las listas
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenizar
    tokenized_inputs = tokenizer(
        first_sentences,
        second_sentences,
        truncation=True,  # Asegurar que todas las secuencias tengan la misma longitud
        padding='max_length',  # Rellenar con padding hasta la longitud máxima
        max_length=128,  # Longitud máxima de la secuencia
    )

    # Reagrupar los ejemplos para que cada uno tenga 4 opciones
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_inputs.items()}


# Aplicamos la función de preprocesado
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Cargamos el modelo
model = AutoModelForMultipleChoice.from_pretrained(model_name)

# Configuramos entrenamiento
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Comprobar logits antes del entrenamiento (opcional)
prompt = ["Alice started to read a book about cooking."]
choices = ["She found a recipe for cookies.", "She hated reading.", "Alice decided to go for a run.", "She turned on the TV instead."]
inputs = tokenizer(prompt * len(choices), choices, return_tensors="pt", padding=True, truncation=True)
# Re-dimensionamos para que sea (batch_size=1, num_choices=4, seq_length)
for k in inputs:
    inputs[k] = inputs[k].unsqueeze(0)

outputs = model(**inputs)
predicted_idx = torch.argmax(outputs.logits, dim=1).item()
best_choice = choices[predicted_idx]
print("Antes del ajuste fino")
print(f"Logits: {outputs.logits}\n")
print("Opción elegida:", best_choice)

# Entrenamos con Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    # Para el ejemplo, tomamos un subset pequeño
    train_dataset=tokenized_dataset["train"].select(range(2000)),
    eval_dataset=tokenized_dataset["validation"].select(range(500)),
    tokenizer=tokenizer
)
trainer.train()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/9.20k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.8M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/4.81M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/4.78M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/73546 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/20006 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/20005 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/73546 [00:00<?, ? examples/s]

Map:   0%|          | 0/20006 [00:00<?, ? examples/s]

Map:   0%|          | 0/20005 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Antes del ajuste fino
Logits: tensor([[-1.4649, -1.4547, -1.4532, -1.1536]], grad_fn=<ViewBackward0>)

Opción elegida: She turned on the TV instead.


  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,No log,1.385377


TrainOutput(global_step=125, training_loss=1.3960203857421876, metrics={'train_runtime': 47.2977, 'train_samples_per_second': 42.285, 'train_steps_per_second': 2.643, 'total_flos': 526217385984000.0, 'train_loss': 1.3960203857421876, 'epoch': 1.0})

Después del ajuste fino

In [None]:
# Comprobar rendimiento después del ajuste fino
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model(**inputs)
predicted_idx = torch.argmax(outputs.logits, dim=1).item()
best_choice = choices[predicted_idx]
print("Después del ajuste fino:")
print(f"Logits: {outputs.logits}\n")
print("Opción elegida:", best_choice)

Después del ajuste fino:
Logits: tensor([[0.3694, 0.3624, 0.4118, 0.2889]], device='cuda:0',
       grad_fn=<ViewBackward0>)

Opción elegida: Alice decided to go for a run.


#4. AutoModelForSeq2SeqLM.from_pretrained (Traducción/Secuencia a Secuencia)

Antes del ajuste fino

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset

# Cargar dataset de traducción
dataset = load_dataset("opus_books", "en-es")
#Creamos split de validación (10% del train)
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
dataset['validation'] = dataset['test']
# Tokenizador y modelo
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Preprocesar datos para traducción
def preprocess_function(examples):
    # "translation" es un diccionario con {"en": ..., "es": ...}
    # Recuperamos la lista de textos en inglés y español
    en_texts = [item["en"] for item in examples["translation"]]
    es_texts = [item["es"] for item in examples["translation"]]

    # Construimos el prompt: "translate English to Spanish: <texto_en>"
    inputs = [f"translate English to Spanish: {text}" for text in en_texts]
    # El target será directamente el texto en español
    targets = [text for text in es_texts]

    # Tokenizamos entradas y salidas
    model_inputs = tokenizer(inputs, truncation=True, padding="max_length", max_length=128)
    labels = tokenizer(targets, truncation=True, padding="max_length", max_length=128)

    # Añadimos las labels al diccionario de tokens
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Usar DataCollator para Seq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Configuración de entrenamiento
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Comprobar rendimiento antes del ajuste fino
def test_translation(model, tokenizer, input_text):
    inputs = tokenizer(f"translate English to Spanish: {input_text}", return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_length=50)
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

print("Antes del ajuste fino:")
input_text = "The book is on the table."
print(f"Entrada: {input_text}")
print(f"Traducción: {test_translation(model, tokenizer, input_text)}\n")

# Entrenar el modelo
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42),
    eval_dataset=tokenized_datasets["validation"].shuffle(seed=42),
    tokenizer=tokenizer,
    data_collator=data_collator,
  )
trainer.train()


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Map:   0%|          | 0/84123 [00:00<?, ? examples/s]

Map:   0%|          | 0/9347 [00:00<?, ? examples/s]

  trainer = Trainer(


Antes del ajuste fino:
Entrada: The book is on the table.
Traducción: Das Buch ist auf dem Tisch.



Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,1.1053,0.994683
2,1.0522,0.923904
3,1.009,0.905965


TrainOutput(global_step=15774, training_loss=1.1174980832004753, metrics={'train_runtime': 1368.16, 'train_samples_per_second': 184.459, 'train_steps_per_second': 11.529, 'total_flos': 8539018773921792.0, 'train_loss': 1.1174980832004753, 'epoch': 3.0})

Después del Ajsute Fino

In [None]:
# Comprobar rendimiento después del ajuste fino

print("Después del ajuste fino:")
print(f"Entrada: {input_text}")
print(f"Traducción: {test_translation(model, tokenizer, input_text)}\n")

Después del ajuste fino:
Entrada: The book is on the table.
Traducción: El libro está en la mesa.

