<a href="https://colab.research.google.com/github/OptimoCX/BootCampIA/blob/main/Decoder_GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decoder (GPT-2 for text generation)

In [None]:
# Block 1: Install dependencies
# Install the Hugging Face transformers and datasets libraries
!pip install transformers datasets




In [None]:
# Block 2: Import the required libraries
# These classes load GPT-2 for text generation
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


In [None]:
# Block 3: Load the tokenizer and the decoder model (DistilGPT-2)
# GPT-2 is a "decoder-only" model, trained to predict the next token.
# We use "distilgpt2", a distilled version that is lighter and faster for training and demos.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")



In [None]:
# Block 4: Create a text-generation pipeline
# The pipeline takes an initial text (the prompt)
# and the model automatically generates a continuation.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)


Device set to use cpu
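
To make "the model generates a continuation" concrete, here is a minimal sketch of the autoregressive loop that a decoder-only model runs: predict the next token, append it, and repeat. The table `NEXT_TOKEN_PROBS` and the helper `greedy_generate` are hypothetical stand-ins for GPT-2's learned distribution and are not part of transformers. A real model conditions on the whole prefix, not only on the last token.

```python
# Toy stand-in for a language model: maps the last token to a
# probability distribution over the next token (hypothetical values).
NEXT_TOKEN_PROBS = {
    "once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "hill": 0.3},
    "time": {"<eos>": 1.0},
    "hill": {"<eos>": 1.0},
    "more": {"<eos>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly append the most probable next token (greedy decoding)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:
            break  # context not covered by the toy table
        next_token = max(dist, key=dist.get)
        if next_token == "<eos>":
            break  # end-of-sequence token stops generation
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["once"])
print(" ".join(generated))  # once upon a time
```

The `max_new_tokens` argument here plays the same role as the generation length limit in the pipeline: the loop stops either at the end-of-sequence token or when the budget runs out.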


In [None]:
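# Block 5: Text generation example
# The model predicts how to continue the given sentence.
# Note: recent transformers versions give this pipeline a default max_new_tokens,
# which takes precedence over max_length (see the warning in the output);
# pass max_new_tokens instead to cap the output length reliably.
output = generator("Once upon a time, in a faraway land,",
                   max_length=50,   # maximum total length in tokens (prompt + generated text)
                   num_return_sequences=2,  # how many alternative continuations to generate
                   temperature=0.7) # sampling temperature (lower = more predictable, higher = more varied)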

print(output)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'Once upon a time, in a faraway land, there is a strange, magical figure that is in a strange place that is so strange and beautiful. It has a curious, strange appearance, and it seems that it needs to be taken into account, because it is a man, a woman, a woman, and a man. It has a strange appearance, and it seems that it needs to be taken into account, because it is a man, a woman, and a man. It has a strange appearance, and it seems that it needs to be taken into account, because it is a man, a woman, and a man. It has a strange appearance, and it seems that it needs to be taken into account, because it is a man, a woman, and a man. It has a strange appearance, and it seems that it needs to be taken into account, because it is a man, a woman, and a man. It has a strange appearance, and it seems that it needs to be taken into account, because it is a man, a woman, and a man. It has a strange appearance, and it seems that it needs to be taken into account, because 
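
The effect of `temperature` can be illustrated without a model. The sketch below, which uses made-up scores for three candidate tokens, divides the scores by the temperature before applying softmax; `softmax_with_temperature` is a hypothetical helper written for this illustration, not a transformers function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make less likely tokens easier to sample.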

In [None]:
# Block 6: Use your own dataset (optional)
# You can load a CSV with a text column and fine-tune the GPT-2 model on it.
# Example: a dataset of sentences for training a specific style.

from datasets import Dataset
import pandas as pd

# Example of a custom dataset
data = {"text": [
    "El clima hoy está muy soleado.",
    "La inteligencia artificial está transformando el mundo.",
    "Había una vez un dragón que vivía en las montañas."
]}
df = pd.DataFrame(data)

dataset = Dataset.from_pandas(df)
print(dataset[0])


{'text': 'El clima hoy está muy soleado.'}
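
The comments mention loading a CSV with a text column. A minimal sketch of that step using only the standard library follows, assuming the column is named `text`. The in-memory `csv_content` and the helper `load_text_column` are hypothetical; with a real file you would open it with `open(path, newline="", encoding="utf-8")` instead.

```python
import csv
import io

# Hypothetical CSV content standing in for a real file on disk.
csv_content = (
    "text\n"
    "Machine learning is a subset of AI.\n"
    "Transformers are powerful architectures.\n"
)


def load_text_column(file_obj, column="text"):
    """Read one column from a CSV file object, skipping empty cells."""
    reader = csv.DictReader(file_obj)
    texts = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            texts.append(value)
    return {column: texts}


csv_data = load_text_column(io.StringIO(csv_content))
print(csv_data)
```

The resulting dictionary has the same shape as `data`, so it could be passed to `pd.DataFrame` in the same way.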


In [None]:
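# Block 7: Tokenize the dataset
# GPT-2 tokenizers have no padding token by default; reuse the end-of-text token if none is set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    tokens = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=50
    )
    # For causal language modeling, the labels are a copy of input_ids
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True)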



Map:   0%|          | 0/3 [00:00<?, ? examples/s]
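
Because the sequences are padded to `max_length`, copying `input_ids` into `labels` also asks the model to predict the padding tokens. A common refinement is to set the labels at padded positions to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding_labels` is a hypothetical helper and the token IDs are made up, except 50256, the end-of-text ID that appears in the log output.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with -100."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Two sequences of length 4; the first one has two padded positions.
batch = {
    "input_ids": [[464, 2003, 50256, 50256], [3198, 2402, 257, 640]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
batch["labels"] = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(batch["labels"])
```

Inside a tokenization function, the same helper could be applied to the `input_ids` and `attention_mask` lists that the tokenizer returns for a batch.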

In [None]:
# Block 8: Small fine-tuning run (optional)
# If you want, you can train GPT-2 on your dataset to adapt it.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results_decoder",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=5,
    report_to="none"  # disables external logging integrations such as wandb
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)

trainer.train()




TrainOutput(global_step=2, training_loss=3.6226391792297363, metrics={'train_runtime': 15.32, 'train_samples_per_second': 0.196, 'train_steps_per_second': 0.131, 'total_flos': 22965534720.0, 'train_loss': 3.6226391792297363, 'epoch': 1.0})
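
The `global_step=2` in this output follows from the dataset size and batch size: three examples with a batch size of 2 give two batches per epoch, and the run lasts one epoch. A small worked check, assuming a single device and no gradient accumulation; `expected_optimizer_steps` is a hypothetical helper:

```python
import math


def expected_optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps for a single device without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs


steps = expected_optimizer_steps(num_examples=3, batch_size=2, epochs=1)
print(steps)  # 2, matching global_step in the TrainOutput
```

Since `logging_steps=5` is larger than the two steps taken, no intermediate loss values are logged during this run; only the final `training_loss` is reported.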

In [None]:
# Block 9: Generate text with the fine-tuned model
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The future of AI is a matter of time.


# Example 1: Decoder for text completion (GPT-2)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small model so it runs quickly
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Make sure a pad token is defined (GPT-2 has none by default)
tokenizer.pad_token = tokenizer.eos_token

# User input
input_text = "In the future, artificial intelligence will"
inputs = tokenizer(input_text, return_tensors="pt")

# Generation: sample three continuations, restricting each step to the 50 most likely tokens
outputs = model.generate(**inputs, max_length=40, num_return_sequences=3, do_sample=True, top_k=50)
for i, output in enumerate(outputs):
    print(f"\n✨ Result {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



✨ Result 1: In the future, artificial intelligence will become one of the central features in society that is increasingly capable of keeping us on our toes. But until we live in a new age of AI tech that is capable

✨ Result 2: In the future, artificial intelligence will have to overcome the problems of a "human body", the future is not in vain. AI is fundamentally an outlier, but it's in our personal and society

✨ Result 3: In the future, artificial intelligence will be able to do so by creating intelligent, autonomous, and intelligent robotic machines whose function will be based on the same system,” he said.


# Example 2: Decoder for answering simple questions (generative-style Q&A)

In [None]:
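question = "What is the capital of Spain?"

# Format the question as a Q/A prompt so the model continues with an answer
inputs = tokenizer("Q: " + question + "\nA:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=30, num_return_sequences=1)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

print("❓ Question:", question)
print("💡 Generated answer:", answer)


❓ Question: What is the capital of Spain?
💡 Generated answer: Madrid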


# Trying a more robust model

In [None]:
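from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# FLAN-T5 is an instruction-tuned encoder-decoder (seq2seq) model, not a decoder-only one.
# Separate variable names keep the GPT-2 tokenizer and model available for the later examples.
t5_name = "google/flan-t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_name)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_name)

t5_inputs = t5_tokenizer("What is the capital of Spain?", return_tensors="pt")
t5_outputs = t5_model.generate(**t5_inputs, max_length=50)

print(t5_tokenizer.decode(t5_outputs[0], skip_special_tokens=True))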


santos


# Example 3: Decoder for creative writing

In [None]:
prompt = "Once upon a time in a magical forest,"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=2,   # We want two different texts
    do_sample=True,           # Enable sampling
    temperature=0.9,          # Controls randomness (higher = more varied)
    top_p=0.95                # Nucleus sampling
)

for i, output in enumerate(outputs):
    print(f"\n📖 Story {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



📖 Story 1: Once upon a time in a magical forest, the light that moves by will become a glow. It’s one of the most powerful elements in the universe.’
The darkness begins to engulf the entire world. The darkness begins to engulf

📖 Story 2: Once upon a time in a magical forest, in the middle of a forest, the light of the moon appears.



The first person who witnessed the scene had it made her laugh. She was very pleased. This person had not seen
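
Nucleus sampling (`top_p`) restricts each sampling step to the smallest set of tokens whose cumulative probability reaches the threshold. The following is a simplified sketch of that filtering step on a made-up distribution; `nucleus_filter` is a hypothetical helper, and the library's own implementation differs in details such as how the boundary token is handled.

```python
def nucleus_filter(probs, top_p=0.95):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution after "in a magical".
probs = {"forest": 0.5, "castle": 0.3, "river": 0.15, "zebra": 0.05}
filtered = nucleus_filter(probs, top_p=0.9)
print(filtered)
```

With `top_p=0.9`, the unlikely token `zebra` is dropped and the remaining three probabilities are rescaled to sum to 1, so sampling can still vary without drifting into the tail of the distribution.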


# Example 4: Decoder for interactive chat in Colab

In [None]:
def chat_with_model():
    print("💬 Welcome to the chat with distilgpt2. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break
        inputs = tokenizer(user_input, return_tensors="pt")
        outputs = model.generate(**inputs, max_length=60, pad_token_id=tokenizer.eos_token_id)
        print("🤖 Model:", tokenizer.decode(outputs[0], skip_special_tokens=True))

chat_with_model()


💬 Welcome to the chat with distilgpt2. Type 'exit' to quit.

You: hola
🤖 Model: hola
You: how are you?
🤖 Model: a sexy person
You: exit


# Example 5: Decoder with a custom dataset

In [None]:
from datasets import Dataset

# Small dataset of custom sentences
texts = [
    "Artificial Intelligence will change the world.",
    "Machine learning is a subset of AI.",
    "Transformers are powerful architectures."
]

dataset = Dataset.from_dict({"text": texts})

# Tokenization with labels
def tokenize_function(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=30)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True)
print(tokenized_dataset[0])
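
Setting `labels` equal to `input_ids` works because causal language models in transformers shift the labels internally: the prediction made at position i is scored against the token at position i+1. The pure-Python sketch below only illustrates that alignment; `next_token_pairs` is a hypothetical helper, and the word-level tokens are a simplification of the real subword tokens.

```python
def next_token_pairs(tokens):
    """Pair each position with the token it is trained to predict."""
    inputs = tokens[:-1]   # positions that make a prediction
    targets = tokens[1:]   # the token each position is scored against
    return list(zip(inputs, targets))


tokens = ["Machine", "learning", "is", "a", "subset", "of", "AI", "."]
for last_seen, target in next_token_pairs(tokens):
    # The model sees every token up to and including last_seen.
    print(f"{last_seen!r} -> {target!r}")
```

A sequence of n tokens therefore yields n-1 training targets, and the final token has nothing left to predict.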


In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load a small model for testing (fast on Colab)
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

training_args = TrainingArguments(
    output_dir="./results_decoder",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=5,
    report_to="none"  # disables external logging integrations such as wandb
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)

trainer.train()







TrainOutput(global_step=2, training_loss=6.524113655090332, metrics={'train_runtime': 27.9172, 'train_samples_per_second': 0.107, 'train_steps_per_second': 0.072, 'total_flos': 22965534720.0, 'train_loss': 6.524113655090332, 'epoch': 1.0})

In [None]:
prompt = "Artificial Intelligence"
# Move the inputs to the model's device (the Trainer may have placed the model on a GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_length=40,
    num_return_sequences=1,
    do_sample=True,   # required for temperature and top_p to take effect
    temperature=0.7,
    top_p=0.9
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Artificial Intelligence (AI) is a technology that is designed to help people to understand and understand the world.
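
`temperature` and `top_p` only influence generation when sampling is enabled. Under greedy decoding the model always takes the highest-scoring token, and dividing the scores by any positive temperature does not change which one that is. A minimal check with made-up scores; `greedy_pick` is a hypothetical helper:

```python
def greedy_pick(logits, temperature=1.0):
    """Index of the highest score after temperature scaling.

    Softmax is monotonic, so the argmax of the probabilities equals
    the argmax of the scaled scores and softmax can be skipped here.
    """
    scaled = [x / temperature for x in logits]
    return scaled.index(max(scaled))


logits = [1.2, 3.4, 0.5, 2.9]  # hypothetical next-token scores
picks = {t: greedy_pick(logits, t) for t in (0.3, 0.7, 1.0, 2.0)}
print(picks)  # every temperature selects index 1
```

This is why the sampling parameters need `do_sample=True` in `generate` to have any visible effect on the text.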




















Next steps:
Try models such as:

distilgpt2 → quick demo.

gpt2 or facebook/opt-125m → for somewhat better results.

bigscience/bloom-560m → to see text generation in Spanish.