# Natural Language Processing - 2025 - B4 - Challenge 4
**Artificial Intelligence - CEIA - FIUBA**

## Author

- **Mendoza Dante**.
- **SIU: e2206**.

**Note:** I started from the code shared by the instructors. Since the goal was to port it from Keras to PyTorch, I used AI assistants (Copilot and ChatGPT) to look up errors and adapt the code.

## LSTM Bot QA
### Assignment
The goal is to use data from the ConvAI2 challenge (Conversational Intelligence Challenge 2), which contains conversations in English, to build a bot that answers user questions (QA).

Link: http://convai.io/data/

### Recommendations:
- MAX_VOCAB_SIZE = 8000
- max_length ~ 10
- 300-dimensional FastText embeddings
- n_units = 128
- LSTM Dropout 0.2
- Epochs 30~50

### Interesting questions:
- Do you read?
- Do you have any pet?
- Where are you from?

In [None]:
# ================================================================================
# Load the data and apply a light normalization
# ================================================================================
import os
import re
import json
import random
from pathlib import Path
import gdown

# Parameters
DATA_FILENAME = "data_volunteers.json"
DRIVE_ID = "1awUxYwImF84MIT5-jCaYAPe2QwSgS1hN"
MAX_LENGTH = 10   # max tokens per sentence
SEED = 42
random.seed(SEED)

# Download the dataset if it is not already present
if not Path(DATA_FILENAME).exists():
    print("Descargando dataset...")
    url = f"https://drive.google.com/uc?id={DRIVE_ID}&export=download"
    gdown.download(url, DATA_FILENAME, quiet=False)
else:
    print("El dataset ya se encuentra descargado:", DATA_FILENAME)

# Load the JSON
with open(DATA_FILENAME, "r", encoding="utf-8") as f:
    data = json.load(f)

print("Registros en JSON:", len(data))
print("Ejemplo keys de la primera entrada:", list(data[0].keys()))

# Cleaning / normalization
def clean_text(txt):
    if not isinstance(txt, str):
        return ""
    txt = txt.lower().strip()
    # simple contraction expansions
    txt = txt.replace("i'm", "i am")
    txt = txt.replace("you're", "you are")
    txt = txt.replace("he's", "he is")
    txt = txt.replace("she's", "she is")
    txt = txt.replace("it's", "it is")
    txt = txt.replace("that's", "that is")
    txt = txt.replace("what's", "what is")
    txt = txt.replace("where's", "where is")
    txt = txt.replace("don't", "do not")
    txt = txt.replace("doesn't", "does not")
    txt = txt.replace("didn't", "did not")
    txt = txt.replace("won't", "will not")
    txt = txt.replace("can't", "can not")
    txt = txt.replace("'ll", " will")
    txt = txt.replace("'ve", " have")
    txt = txt.replace("'re", " are")
    txt = txt.replace("'d", " would")
    # replace every character that is not a lowercase letter, digit, or whitespace with a space
    txt = re.sub(r"[^a-z0-9\s]", " ", txt)
    # collapse repeated whitespace
    txt = re.sub(r"\s+", " ", txt).strip()
    return txt

# Extract input-output pairs
input_sentences = []
output_sentences = []
output_sentences_inputs = []  # decoder inputs with <sos>
max_len_in_chars = 0

for entry in data:
    dialog = entry.get("dialog", [])
    # require at least 2 turns
    if not dialog or len(dialog) < 2:
        continue
    for i in range(len(dialog) - 1):
        a = clean_text(dialog[i].get("text", ""))
        b = clean_text(dialog[i+1].get("text", ""))
        if not a or not b:
            continue
        # token-count check: skip pairs where either side exceeds MAX_LENGTH
        tokens_a = a.split()
        tokens_b = b.split()
        if len(tokens_a) > MAX_LENGTH or len(tokens_b) > MAX_LENGTH:
            continue
        # build the encoder/decoder strings
        input_sentence = a
        output_sentence = b + " <eos>" # decoder target ends with <eos>
        output_sentence_input = "<sos> " + b # decoder input starts with <sos>

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

        max_len_in_chars = max(max_len_in_chars, len(a), len(b))

print("Cantidad de pares (después de filtrar por MAX_LENGTH={} tokens): {}".format(MAX_LENGTH, len(input_sentences)))
print("Longest example length (chars):", max_len_in_chars)

# Show a few random examples
print("\nAlgunos ejemplos de pares (input -> output):\n")
n_show = 8
idxs = random.sample(range(len(input_sentences)), min(n_show, len(input_sentences)))
for i in idxs:
    print("IN : ", input_sentences[i])
    print("OUT: ", output_sentences_inputs[i], " -> target:", output_sentences[i])
    print("-" * 60)

# Save the processed pairs
out_obj = {
    "input_sentences": input_sentences,
    "output_sentences_inputs": output_sentences_inputs,
    "output_sentences": output_sentences
}
with open("pairs_preprocessed.json", "w", encoding="utf-8") as f:
    json.dump(out_obj, f, ensure_ascii=False, indent=2)

print("\nSe guardó pairs_preprocessed.json con los pares procesados.")

El dataset ya se encuentra descargado: data_volunteers.json
Registros en JSON: 1111
Ejemplo keys de la primera entrada: ['dialog', 'start_time', 'end_time', 'bot_profile', 'user_profile', 'eval_score', 'profile_match', 'participant1_id', 'participant2_id']
Cantidad de pares (después de filtrar por MAX_LENGTH=10 tokens): 9886
Longest example length (chars): 87

Algunos ejemplos de pares (input -> output):

IN :  no
OUT:  <sos> oh okay you have any pets  -> target: oh okay you have any pets <eos>
------------------------------------------------------------
IN :  which sports
OUT:  <sos> i work in a computer company i love american sports  -> target: i work in a computer company i love american sports <eos>
------------------------------------------------------------
IN :  what do you do for work
OUT:  <sos> struggle  -> target: struggle <eos>
------------------------------------------------------------
IN :  fuking
OUT:  <sos> i love to read  -> target: i love to read <eos>
-------------
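
As a small worked example of the pairing logic above, the sketch below builds the three strings for a toy dialog. The helper name `build_pairs` and the toy utterances are illustrative assumptions, not part of the notebook, and the inputs are assumed to be already cleaned.

```python
MAX_LENGTH = 10  # same token limit as in the cell above


def build_pairs(turns, max_length=MAX_LENGTH):
    """Turn consecutive, already-cleaned utterances into seq2seq triples.

    Each triple is (encoder input, decoder input with <sos>, decoder target with <eos>).
    """
    triples = []
    for prev, reply in zip(turns, turns[1:]):
        if not prev or not reply:
            continue  # skip empty utterances
        if len(prev.split()) > max_length or len(reply.split()) > max_length:
            continue  # skip pairs where either side is too long
        triples.append((prev, "<sos> " + reply, reply + " <eos>"))
    return triples


# Hypothetical toy dialog
toy_turns = ["hi how are you", "i am fine and you", "great do you have any pets"]
for enc_in, dec_in, dec_target in build_pairs(toy_turns):
    print(f"{enc_in!r} -> {dec_in!r} / {dec_target!r}")
```

Because each reply gains one marker token, a 10-token reply produces an 11-token decoder sequence.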

In [None]:
# ================================================================================
# Tokenization and vocabulary construction
# ================================================================================
import json
from collections import Counter
import numpy as np
import torch

# Parameters
MAX_VOCAB_SIZE = 8000

# Load the preprocessed pairs
with open("pairs_preprocessed.json", "r", encoding="utf-8") as f:
    pairs = json.load(f)

input_sentences = pairs["input_sentences"]
output_sentences_inputs = pairs["output_sentences_inputs"]
output_sentences = pairs["output_sentences"]

print("Total de pares cargados:", len(input_sentences))

# ================================================================================
# Tokenize
# ================================================================================
input_tokens = [s.split() for s in input_sentences]
output_tokens_in = [s.split() for s in output_sentences_inputs]
output_tokens_out = [s.split() for s in output_sentences]

# ================================================================================
# Build vocabularies
# ================================================================================
def build_vocab(token_lists, max_vocab_size, add_specials=True):
    freq = Counter([tok for sent in token_lists for tok in sent])
    most_common = freq.most_common(max_vocab_size - 4 if add_specials else max_vocab_size)

    word2idx = {}
    idx2word = {}
    specials = ["<pad>", "<unk>", "<sos>", "<eos>"] if add_specials else []

    for idx, word in enumerate(specials + [w for w, _ in most_common]):
        word2idx[word] = idx
        idx2word[idx] = word

    return word2idx, idx2word

word2idx_inputs, idx2word_inputs = build_vocab(input_tokens, MAX_VOCAB_SIZE)
word2idx_outputs, idx2word_outputs = build_vocab(output_tokens_out, MAX_VOCAB_SIZE)

num_words_input = len(word2idx_inputs)
num_words_output = len(word2idx_outputs)

print(f"Vocabulario INPUT: {num_words_input} palabras")
print(f"Vocabulario OUTPUT: {num_words_output} palabras")

# ================================================================================
# Convert tokens -> indices (with padding)
# ================================================================================
def encode_sentences(token_lists, word2idx, max_len=None):
    sequences = []
    if not max_len:
        max_len = max(len(s) for s in token_lists)
    for sent in token_lists:
        seq = [word2idx.get(tok, word2idx["<unk>"]) for tok in sent]
        # padding
        if len(seq) < max_len:
            seq += [word2idx["<pad>"]] * (max_len - len(seq))
        else:
            seq = seq[:max_len]
        sequences.append(seq)
    return np.array(sequences), max_len

encoder_input_sequences, max_input_len = encode_sentences(input_tokens, word2idx_inputs)
decoder_input_sequences, max_out_len_in = encode_sentences(output_tokens_in, word2idx_outputs)
decoder_target_sequences, max_out_len_out = encode_sentences(output_tokens_out, word2idx_outputs)

# Use the same maximum output length for the decoder input and the target
max_out_len = max(max_out_len_in, max_out_len_out)

print(f"max_input_len = {max_input_len}")
print(f"max_out_len = {max_out_len}")

# ================================================================================
# Convert to PyTorch tensors
# ================================================================================
encoder_input_sequences = torch.tensor(encoder_input_sequences, dtype=torch.long)
decoder_input_sequences = torch.tensor(decoder_input_sequences, dtype=torch.long)
decoder_target_sequences = torch.tensor(decoder_target_sequences, dtype=torch.long)

print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_target_sequences.shape:", decoder_target_sequences.shape)

# ================================================================================
# Show a decoded example
# ================================================================================
def decode_sequence(seq, idx2word):
    # drop padding by token, so the function does not depend on the input vocabulary's <pad> index
    words = [idx2word.get(idx, "<unk>") for idx in seq]
    return " ".join(w for w in words if w != "<pad>")

example_idx = np.random.randint(0, len(encoder_input_sequences))
print("\nEjemplo de secuencia codificada y decodificada:")
print("Encoder input:", decode_sequence(encoder_input_sequences[example_idx].tolist(), idx2word_inputs))
print("Decoder input:", decode_sequence(decoder_input_sequences[example_idx].tolist(), idx2word_outputs))
print("Target:", decode_sequence(decoder_target_sequences[example_idx].tolist(), idx2word_outputs))

Total de pares cargados: 9886
Vocabulario INPUT: 2932 palabras
Vocabulario OUTPUT: 2935 palabras
max_input_len = 10
max_out_len = 11
encoder_input_sequences.shape: torch.Size([9886, 10])
decoder_input_sequences.shape: torch.Size([9886, 11])
decoder_target_sequences.shape: torch.Size([9886, 11])

Ejemplo de secuencia codificada y decodificada:
Encoder input: what kind of dog is he
Decoder input: <sos> german shepherd
Target: german shepherd <eos>


In [None]:
# ================================================================================
# Prepare GloVe 300d embeddings
# ================================================================================
import numpy as np
import torch
import os

# Parameters
EMBEDDING_DIM = 300
glove_zip = "glove.6B.zip"
glove_path = f"glove.6B.{EMBEDDING_DIM}d.txt"

# Download and unzip GloVe only if the extracted vectors are missing
if not os.path.exists(glove_path):
    if not os.path.exists(glove_zip):
        print("Descargando GloVe (puede tardar unos minutos)...")
        !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -q -o glove.6B.zip
else:
    print("GloVe ya está descargado.")

# Load the embeddings into a dictionary
print("Cargando vectores GloVe en memoria...")
embeddings_index = {}
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = vector

print(f"Total de vectores GloVe cargados: {len(embeddings_index):,}")

# ================================================================================
# Build the embedding matrix for the INPUT vocabulary
# ================================================================================
embedding_matrix_inputs = np.zeros((len(word2idx_inputs), EMBEDDING_DIM))
not_found_in_glove = 0

for word, idx in word2idx_inputs.items():
    if word in embeddings_index:
        embedding_matrix_inputs[idx] = embeddings_index[word]
    else:
        not_found_in_glove += 1
        embedding_matrix_inputs[idx] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

print(f"Palabras del vocabulario INPUT no encontradas en GloVe: {not_found_in_glove}")

# Convert to a PyTorch tensor
embedding_matrix_inputs = torch.tensor(embedding_matrix_inputs, dtype=torch.float32)

# ================================================================================
# Create the embedding layer
# ================================================================================
embedding_inputs = torch.nn.Embedding.from_pretrained(embedding_matrix_inputs, freeze=False)
print("Capa de embeddings (INPUT) creada:", embedding_inputs)

Descargando GloVe (puede tardar unos minutos)...
--2025-10-12 16:15:42--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-10-12 16:15:42--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-10-12 16:15:42--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) 
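
A compact variant of the matrix construction above, shown on toy data: it keeps the `<pad>` row at zero, seeds the random initialization for reproducibility, and returns the list of words missing from the pretrained vectors. The function names, the seed, and the toy vectors are assumptions for illustration; only the file layout (a word followed by space-separated floats) matches the GloVe text files.

```python
import io

import numpy as np


def load_vectors(handle):
    """Parse lines of the form '<word> <float> <float> ...' into a dict."""
    vectors = {}
    for line in handle:
        word, *values = line.rstrip().split(" ")
        vectors[word] = np.asarray(values, dtype="float32")
    return vectors


def build_embedding_matrix(word2idx, vectors, dim, pad_token="<pad>", seed=0):
    """Return (matrix, missing_words); the padding row stays all zeros."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word2idx), dim), dtype="float32")
    missing = []
    for word, idx in word2idx.items():
        if word == pad_token:
            continue  # padding carries no information, keep it at zero
        if word in vectors:
            matrix[idx] = vectors[word]
        else:
            # same random fallback as the cell above, but reproducible
            matrix[idx] = rng.normal(scale=0.6, size=dim)
            missing.append(word)
    return matrix, missing


# Hypothetical 3-dimensional vectors standing in for the GloVe file
toy_file = io.StringIO("dog 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_vocab = {"<pad>": 0, "<unk>": 1, "dog": 2, "cat": 3, "zzz": 4}
matrix, missing = build_embedding_matrix(toy_vocab, load_vectors(toy_file), dim=3)
print(matrix.shape, missing)
```

Inspecting `missing` makes it easier to tell whether out-of-vocabulary words are typos, slang, or special tokens.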

In [None]:
# ================================================================================
# Fix out-of-range indices in the targets
# ================================================================================
# Check the bounds of the output vocabulary
vocab_size_out = len(word2idx_outputs)
print("Tamaño del vocabulario de salida:", vocab_size_out)
print("Índices reservados:")
print(f"<pad>: {word2idx_outputs['<pad>']}, <sos>: {word2idx_outputs['<sos>']}, <eos>: {word2idx_outputs['<eos>']}, <unk>: {word2idx_outputs['<unk>']}")

# Clamp every target index into the valid range
decoder_target_sequences = torch.clamp(decoder_target_sequences, max=vocab_size_out - 1)

# Same for decoder_input_sequences
decoder_input_sequences = torch.clamp(decoder_input_sequences, max=vocab_size_out - 1)

print("Corrección aplicada: todos los índices están dentro del rango válido [0, vocab_size-1].")

Tamaño del vocabulario de salida: 2935
Índices reservados:
<pad>: 0, <sos>: 2, <eos>: 4, <unk>: 1
Corrección aplicada: todos los índices están dentro del rango válido [0, vocab_size-1].
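
The out-of-range index that the clamp hides has a plausible source in `build_vocab`: the target sentences already contain `<eos>`, so it is counted as a regular word and appended again after the special tokens. The word list then has one more entry than the dictionary has keys, which would explain why `<eos>` is reported at index 4 instead of 3 and why the largest index equals the vocabulary size. Clamping maps that last index onto a different word rather than removing the cause. The sketch below reproduces the effect on toy data and shows one possible fix; `build_vocab_as_written` and `build_vocab_dedup` are hypothetical names.

```python
from collections import Counter

SPECIALS = ["<pad>", "<unk>", "<sos>", "<eos>"]


def build_vocab_as_written(token_lists, max_vocab_size):
    """Mirrors the logic of build_vocab: special tokens are also counted as words."""
    freq = Counter(tok for sent in token_lists for tok in sent)
    most_common = freq.most_common(max_vocab_size - len(SPECIALS))
    word2idx = {}
    for idx, word in enumerate(SPECIALS + [w for w, _ in most_common]):
        word2idx[word] = idx  # a repeated word overwrites its earlier index
    return word2idx


def build_vocab_dedup(token_lists, max_vocab_size):
    """Possible fix: leave special tokens out of the frequency count."""
    freq = Counter(
        tok for sent in token_lists for tok in sent if tok not in SPECIALS
    )
    most_common = freq.most_common(max_vocab_size - len(SPECIALS))
    words = SPECIALS + [w for w, _ in most_common]
    return {word: idx for idx, word in enumerate(words)}


# Toy decoder targets, each ending with <eos>
toy_targets = [
    ["i", "am", "fine", "<eos>"],
    ["fine", "<eos>"],
    ["hello", "<eos>"],
]
old = build_vocab_as_written(toy_targets, 8000)
new = build_vocab_dedup(toy_targets, 8000)
print("as written:", len(old), "entries, max index", max(old.values()))
print("dedup     :", len(new), "entries, max index", max(new.values()))
```

With the deduplicated version every index is smaller than the vocabulary size, so no clamping should be needed.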


In [None]:
# ================================================================================
# Seq2Seq model (LSTM) + training
# ================================================================================
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import random
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Usando dispositivo:", device)

# ================================================================================
# Create the Dataset and DataLoader
# ================================================================================
BATCH_SIZE = 64  # 64 worked well for me

dataset = TensorDataset(encoder_input_sequences, decoder_input_sequences, decoder_target_sequences)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

pad_idx = word2idx_outputs["<pad>"]
sos_idx = word2idx_outputs["<sos>"]
eos_idx = word2idx_outputs["<eos>"]

# ================================================================================
# Define the Encoder and Decoder
# ================================================================================
class Encoder(nn.Module):
    def __init__(self, embedding, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.embedding = embedding
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(
            input_size=embedding.embedding_dim,
            hidden_size=hidden_dim,
            batch_first=True,
            dropout=dropout
        )

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, (hidden, cell) = self.lstm(embedded)
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, dropout=dropout)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input, hidden, cell):
        # input: [batch]
        input = input.unsqueeze(1)
        embedded = self.embedding(input)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.6):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        vocab_size = self.decoder.fc_out.out_features

        outputs = torch.zeros(batch_size, trg_len, vocab_size).to(self.device)
        hidden, cell = self.encoder(src)
        input = trg[:, 0]  # first token: <sos>

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[:, t] = output
            top1 = output.argmax(1)
            input = trg[:, t] if random.random() < teacher_forcing_ratio else top1

        return outputs

# ================================================================================
# Instantiate the model, loss, and optimizer
# ================================================================================
HIDDEN_DIM = 128
DROPOUT = 0.2
EPOCHS = 50 # 30 / 50
LEARNING_RATE = 0.001

encoder = Encoder(embedding_inputs, hidden_dim=HIDDEN_DIM, dropout=DROPOUT)
decoder = Decoder(vocab_size=len(word2idx_outputs), embedding_dim=300, hidden_dim=HIDDEN_DIM, dropout=DROPOUT)
model = Seq2Seq(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# ================================================================================
# Training
# ================================================================================
print("Comenzando entrenamiento...")
model.train()

for epoch in range(EPOCHS):
    epoch_loss = 0
    for src, trg_in, trg_out in tqdm(dataloader, desc=f"Epoch {epoch+1}/{EPOCHS}"):
        src, trg_in, trg_out = src.to(device), trg_in.to(device), trg_out.to(device)
        optimizer.zero_grad()

        output = model(src, trg_in)
        # output: [batch, trg_len, vocab_size]
        output_dim = output.shape[-1]

        output = output[:, 1:].reshape(-1, output_dim)   # skip position 0, which the decoder loop never fills
        trg_out = trg_out[:, 1:].reshape(-1)

        loss = criterion(output, trg_out)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{EPOCHS} - Pérdida promedio: {avg_loss:.4f}")

print("Entrenamiento finalizado")

# Save the model
torch.save(model.state_dict(), "seq2seq_glove_lstm.pt")
print("Modelo guardado como seq2seq_glove_lstm.pt")

Usando dispositivo: cpu
Comenzando entrenamiento...
Epoch 1/50 - Pérdida promedio: 4.8476
Epoch 2/50 - Pérdida promedio: 4.2308
Epoch 3/50 - Pérdida promedio: 4.1195
Epoch 4/50 - Pérdida promedio: 3.9719
Epoch 5/50 - Pérdida promedio: 3.8805
Epoch 6/50 - Pérdida promedio: 3.8107
Epoch 7/50 - Pérdida promedio: 3.7068
Epoch 8/50 - Pérdida promedio: 3.6318
Epoch 9/50 - Pérdida promedio: 3.5405
Epoch 10/50 - Pérdida promedio: 3.4887
Epoch 11/50 - Pérdida promedio: 3.4269
Epoch 12/50 - Pérdida promedio: 3.3470
Epoch 13/50 - Pérdida promedio: 3.2957
Epoch 14/50 - Pérdida promedio: 3.2076
Epoch 15/50 - Pérdida promedio: 3.1431
Epoch 16/50 - Pérdida promedio: 3.1041
Epoch 17/50 - Pérdida promedio: 3.0708
Epoch 18/50 - Pérdida promedio: 2.9859
Epoch 19/50 - Pérdida promedio: 2.9119
Epoch 20/50 - Pérdida promedio: 2.8720
Epoch 21/50 - Pérdida promedio: 2.8392
Epoch 22/50 - Pérdida promedio: 2.7562
Epoch 23/50 - Pérdida promedio: 2.7016
Epoch 24/50 - Pérdida promedio: 2.6618
Epoch 25/50 - Pérdida promedio: 2.6327
Epoch 26/50 - Pérdida promedio: 2.5814
Epoch 27/50 - Pérdida promedio: 2.5187
Epoch 28/50 - Pérdida promedio: 2.4693
Epoch 29/50 - Pérdida promedio: 2.4448
Epoch 30/50 - Pérdida promedio: 2.3835
Epoch 31/50 - Pérdida promedio: 2.3333
Epoch 32/50 - Pérdida promedio: 2.2786
Epoch 33/50 - Pérdida promedio: 2.2722
Epoch 34/50 - Pérdida promedio: 2.2047
Epoch 35/50 - Pérdida promedio: 2.1973
Epoch 36/50 - Pérdida promedio: 2.1281
Epoch 37/50 - Pérdida promedio: 2.1144
Epoch 38/50 - Pérdida promedio: 2.0418
Epoch 39/50 - Pérdida promedio: 2.0198
Epoch 40/50 - Pérdida promedio: 1.9520
Epoch 41/50 - Pérdida promedio: 1.9813
Epoch 42/50 - Pérdida promedio: 1.9191
Epoch 43/50 - Pérdida promedio: 1.8650
Epoch 44/50 - Pérdida promedio: 1.8466
Epoch 45/50 - Pérdida promedio: 1.8352
Epoch 46/50 - Pérdida promedio: 1.7515
Epoch 47/50 - Pérdida promedio: 1.7539
Epoch 48/50 - Pérdida promedio: 1.7331
Epoch 49/50 - Pérdida promedio: 1.7269
Epoch 50/50 - Pérdida promedio: 1.6974
Entrenamiento finalizado
Modelo guardado como seq2seq_glove_lstm.pt
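
One detail in the model definition above is worth noting: `nn.LSTM` applies its `dropout` argument only between stacked layers, so with a single layer (the default `num_layers=1`) the value 0.2 has no effect and PyTorch emits a warning. A minimal sketch of one alternative, assuming dropout on the embedded inputs is acceptable, is shown below. The class name `EncoderWithDropout` and the toy sizes are hypothetical, and this variant is not the model that produced the log above.

```python
import warnings

import torch
import torch.nn as nn

# A single-layer LSTM with dropout > 0 triggers a warning and applies no dropout.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    nn.LSTM(input_size=8, hidden_size=4, batch_first=True, dropout=0.2)
print("warnings raised:", len(caught))


class EncoderWithDropout(nn.Module):
    """Encoder variant that applies dropout explicitly to the embeddings."""

    def __init__(self, vocab_size, embedding_dim=300, hidden_dim=128, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: [batch, src_len] of token indices
        embedded = self.dropout(self.embedding(src))
        _, (hidden, cell) = self.lstm(embedded)
        return hidden, cell


# Toy check of the output shapes: [num_layers, batch, hidden_dim]
enc = EncoderWithDropout(vocab_size=50, embedding_dim=8, hidden_dim=4)
hidden, cell = enc(torch.zeros(2, 5, dtype=torch.long))
print(tuple(hidden.shape), tuple(cell.shape))
```

The same pattern could be applied to the decoder; calling `model.eval()` disables the dropout layer at inference time.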





In [None]:
# ================================================================================
# Seq2Seq model inference
# ================================================================================
import torch
import torch.nn.functional as F
import numpy as np

# Load the saved model
encoder = Encoder(embedding_inputs, hidden_dim=128, dropout=0.2)
decoder = Decoder(vocab_size=len(word2idx_outputs), embedding_dim=300, hidden_dim=128, dropout=0.2)
model = Seq2Seq(encoder, decoder, device).to(device)
model.load_state_dict(torch.load("seq2seq_glove_lstm.pt", map_location=device))
model.eval()
print("Modelo cargado correctamente")

# Build the inverse dictionaries
idx2word_inputs = {v: k for k, v in word2idx_inputs.items()}
idx2word_outputs = {v: k for k, v in word2idx_outputs.items()}

# Helper: text -> tensor
def sentence_to_tensor(sentence, word2idx, max_len):
    # reuse the training-time normalization so contractions and punctuation are handled the same way
    tokens = clean_text(sentence).split()
    seq = [word2idx.get(w, word2idx["<unk>"]) for w in tokens]
    seq = seq[:max_len]
    seq += [word2idx["<pad>"]] * (max_len - len(seq))
    return torch.tensor(seq).unsqueeze(0)  # shape [1, max_len]

# Generation function (step-by-step inference)
def generate_reply(sentence, max_len=15):
    with torch.no_grad():
        src = sentence_to_tensor(sentence, word2idx_inputs, max_input_len).to(device)
        hidden, cell = model.encoder(src)

        # First decoder input token (<sos>)
        input_token = torch.tensor([word2idx_outputs["<sos>"]]).to(device)
        output_sentence = []

        for _ in range(max_len):
            output, hidden, cell = model.decoder(input_token, hidden, cell)
            pred_token = output.argmax(1).item()
            pred_word = idx2word_outputs.get(pred_token, "<unk>")

            if pred_word == "<eos>" or pred_word == "<pad>":
                break

            output_sentence.append(pred_word)
            input_token = torch.tensor([pred_token]).to(device)

    return " ".join(output_sentence)

test_questions = [
    "Do you read?",
    "Do you have any pet?",
    "Where are you from?",
    "What do you like to eat?",
    "Are you a student?"
]

print("Chatbot automático — probando preguntas típicas:\n")

for q in test_questions:
    reply = generate_reply(q)
    print(f"You: {q}")
    print(f"Bot: {reply}\n")

Modelo cargado correctamente
Chatbot automático — probando preguntas típicas:

You: Do you read?
Bot: do not like to i am

You: Do you have any pet?
Bot: have but

You: Where are you from?
Bot: am from the how about you

You: What do you like to eat?
Bot: i not play video

You: Are you a student?
Bot: i am i you you
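
Several replies above seem to start mid-sentence ("am from the how about you", "have but"). One possible explanation, not verified by retraining, lies in how predictions and labels are paired during training. In `Seq2Seq.forward`, `outputs[:, t]` is the prediction made after feeding `trg_in[:, t-1]`, so its natural label is `trg_out[:, t-1]`. The training loop compares `output[:, 1:]` with `trg_out[:, 1:]`, which pairs each prediction with the token one step later. The sketch below lists both pairings for a toy sequence; `step_labels` and the toy tokens are illustrative assumptions.

```python
def step_labels(decoder_input, decoder_target):
    """For each decoder step, return the token fed in, the label selected by
    trg_out[:, 1:], and the label selected by trg_out[:, :-1]."""
    rows = []
    for t in range(1, len(decoder_input)):
        fed = decoder_input[t - 1]             # token given to the decoder at step t (teacher forcing)
        label_used = decoder_target[t]         # label paired with outputs[:, t] by trg_out[:, 1:]
        label_aligned = decoder_target[t - 1]  # token that actually follows `fed`
        rows.append((fed, label_used, label_aligned))
    return rows


# Toy sequences in the same layout as decoder_input_sequences / decoder_target_sequences
toy_decoder_input = ["<sos>", "i", "am", "from", "<pad>"]
toy_decoder_target = ["i", "am", "from", "<eos>", "<pad>"]

rows = step_labels(toy_decoder_input, toy_decoder_target)
for fed, used, aligned in rows:
    print(f"fed={fed:<6} label used={used:<6} aligned label={aligned}")
```

If this reading is right, slicing the labels as `trg_out[:, :-1]` in the training loop would pair each prediction with the token that follows its input. The final label position would still go unpredicted because the loop starts at `t = 1`, so extending the loop to cover every step may also be needed.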



The model learned some basic dialogue patterns, such as common phrases ("I am", "how about you"), but its replies are still incoherent or ungrammatical. I think this is expected for a model this simple.

Overall it achieves some fluency, but it struggles to keep context and to produce varied replies. I believe this is mainly due to the small dataset and the lack of more advanced components or techniques.

To improve it, one could:
- Train for more epochs.
- Improve the data or use pretrained models.
- Tune teacher forcing so the model relies less on ground-truth tokens when generating text.
- Clean the training data more thoroughly to avoid confusing replies.
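
As one way to act on the teacher forcing point, the ratio could be decayed over training instead of staying fixed at 0.6. The sketch below uses a linear schedule; the function name and the start and end values (0.9 and 0.3) are assumptions chosen for illustration, not tuned settings.

```python
def teacher_forcing_ratio(epoch, total_epochs, start=0.9, end=0.3):
    """Linearly decay the teacher forcing ratio from `start` to `end`.

    `epoch` is zero-based; the last epoch (total_epochs - 1) returns `end`.
    """
    if total_epochs <= 1:
        return end
    progress = epoch / (total_epochs - 1)
    return start + (end - start) * progress


# Ratio at the beginning, middle, and end of a 50-epoch run
for epoch in (0, 25, 49):
    print(epoch, round(teacher_forcing_ratio(epoch, 50), 3))
```

Inside the training loop, the value would be passed through the existing parameter, for example `model(src, trg_in, teacher_forcing_ratio=teacher_forcing_ratio(epoch, EPOCHS))`.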