# Transformer - Attention is all you need

## Masks in the Transformer: Padding and Causal Masks

### Why are masks needed?

In a Transformer model, **masks** control which positions of a sequence can attend to each other during attention. They matter in both the **encoder** and the **decoder**, but for different reasons:

- **Encoder** → needs a **padding mask** to ignore padding tokens (`<PAD>`).
- **Decoder** → needs both:
  - a **padding mask**, and
  - a **causal (look-ahead) mask** that prevents each position from attending to future tokens.

---

| Mask type | Code / example | Shape | Description |
|---|---|---|---|
| **Padding mask (source)** | `(source != 0).unsqueeze(1).unsqueeze(2)` | `(batch_size, 1, 1, source_seq_len)` | Ignores padding tokens (id 0) in the encoder input. |
| **Padding mask (target)** | `(target != 0).unsqueeze(1).unsqueeze(2)` | `(batch_size, 1, 1, target_seq_len)` | Ignores padding tokens in the decoder input. |
| **Causal (look-ahead) mask** | `torch.tril(torch.ones(1, size, size)).bool()` <br> *(with `size = target.size(1)`)* | `(1, target_seq_len, target_seq_len)` | Lower triangular: the token at position *i* can see only tokens up to and including position *i*. |
| **Combined target mask** | `(target != 0).unsqueeze(1).unsqueeze(2) & no_mask` <br> *(where `no_mask` is the causal mask)* | `(batch_size, 1, target_seq_len, target_seq_len)` | Combines the padding and causal masks for the decoder; broadcasting aligns the dimensions. |
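
To make the shapes in the table concrete, here is a minimal sketch on a toy batch. The tensor values and the `PAD_ID` name are assumptions made for illustration; only the mask expressions come from the table.

```python
import torch

PAD_ID = 0  # assumption: id 0 is the padding token, as in the table

# Hypothetical toy batch: 2 target sequences of length 4, the second one padded.
target = torch.tensor([[5, 7, 9, 2],
                       [5, 8, 0, 0]])

# Padding mask: True where the token is real. Shape (batch_size, 1, 1, T).
padding_mask = (target != PAD_ID).unsqueeze(1).unsqueeze(2)

# Causal mask: lower triangular. Shape (1, T, T).
size = target.size(1)
causal_mask = torch.tril(torch.ones(1, size, size)).bool()

# Broadcasting combines both into shape (batch_size, 1, T, T).
target_mask = padding_mask & causal_mask

print(padding_mask.shape)  # torch.Size([2, 1, 1, 4])
print(causal_mask.shape)   # torch.Size([1, 4, 4])
print(target_mask.shape)   # torch.Size([2, 1, 4, 4])
print(target_mask[1, 0].int())
```

For the padded sequence, the last two columns of the combined mask are zero in every row, so no position can attend to padding, and the upper triangle stays zero because of the causal constraint.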




In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re

# Reproducibility seed
torch.manual_seed(23)

<torch._C.Generator at 0x7f7ff889e0f0>

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [77]:
MAX_SEQ_LEN = 128

In [78]:
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len = MAX_SEQ_LEN):
        super().__init__()
        pos_embed_matrix = torch.zeros(max_seq_len, d_model)
        token_pos = torch.arange(0, max_seq_len, dtype = torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() 
                             * (-math.log(10000.0)/d_model))
        
        pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)
        # Shape (1, max_seq_len, d_model) so it broadcasts over the batch dimension.
        # Registered as a non-persistent buffer: it follows model.to(device)
        # but is not stored in the state_dict.
        self.register_buffer('pos_embed_matrix', pos_embed_matrix.unsqueeze(0), persistent=False)
        
    def forward(self, x):
        # The model is batch-first, so positions live in dimension 1.
        # x: (batch_size, seq_len, d_model)
        # pos_embed_matrix[:, :seq_len]: (1, seq_len, d_model)
        # result: (batch_size, seq_len, d_model)
        return x + self.pos_embed_matrix[:, :x.size(1)]
    
class MultiHeadAttention(nn.Module):
    # d_model must be divisible by num_heads: d_k = d_v = 512 / 8 = 64, so concatenating
    # the 8 heads (8 * 64 = 512) recovers the embedding size.
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size not compatible with num heads'
        
        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads
        
        self.W_q = nn.Linear(d_model, d_model) # One 512x512 projection instead of eight 512x64 ones (more efficient)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Q, K, V arrive as (batch_size, seq_len, d_model), with d_model = num_heads * d_k.
        # After view:      (batch_size, seq_len, num_heads, d_k), e.g. (batch_size, 10, 8, 64)
        # After transpose: (batch_size, num_heads, seq_len, d_k), e.g. (batch_size, 8, 10, 64)
        # Each token ends up with one d_k-sized sub-vector per head.
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1,2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1,2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1,2)
        
        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)
        
        # (batch_size, num_heads, seq_len, d_k) -> (batch_size, seq_len, num_heads * d_k)
        weighted_values = weighted_values.transpose(1,2).contiguous().view(batch_size, -1, self.num_heads*self.d_k)
        weighted_values = self.W_o(weighted_values)
        
        return weighted_values, attention
        
    def scale_dot_product(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None: # Encoder: padding mask. Decoder: padding mask plus causal mask
            scores = scores.masked_fill(mask == 0, -1e9) # Masked positions get ~0 probability after softmax
            # scores.shape = (batch_size, num_heads, seq_len_q, seq_len_k)
        attention = F.softmax(scores, dim=-1) # dim=-1 normalizes each row (over the keys)
        weighted_values = torch.matmul(attention, V)
        
        return weighted_values, attention
    
class PositionFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        
    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))
    
    
class EncoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        attn_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.dropout1(attn_score) # Residual (skip) connection
        x = self.norm1(x) # Layer normalization
        x = x + self.dropout2(self.ffn(x)) # Residual (skip) connection
        return self.norm2(x) # Layer normalization

class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]) # N stacked layers
        self.norm = nn.LayerNorm(d_model)
        
    def forward(self, x, mask=None):
        # mask: padding mask for the source sequence
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class DecoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
    
    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)
        
        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)
        
        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output)
        return self.norm3(x)
        
class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)
        
    def forward(self, x, encoder_output, target_mask, encoder_mask):
        # Cross-attention needs encoder_mask so the decoder does not attend to
        # padding positions in the source
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)
        return self.norm(x)
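
The sinusoidal table built in `PositionalEmbedding` follows PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). As a rough check of that formula, the sketch below computes a single position in plain Python; the helper name `sinusoidal_position` is hypothetical and is not part of the notebook.

```python
import math

def sinusoidal_position(pos, d_model):
    """Return the encoding of one position as a list of length d_model (d_model even)."""
    encoding = []
    for i in range(0, d_model, 2):
        # Same frequency term as div_term in the class: exp(i * -ln(10000) / d_model)
        freq = math.exp(i * (-math.log(10000.0) / d_model))
        encoding.append(math.sin(pos * freq))  # even index
        encoding.append(math.cos(pos * freq))  # odd index
    return encoding

pe0 = sinusoidal_position(0, 4)
pe1 = sinusoidal_position(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)
```

Position 0 always encodes to alternating 0 and 1, and higher dimensions oscillate more slowly, which is what lets the model tell nearby positions from distant ones.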

In [79]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, input_vocab_size, target_vocab_size,
                max_len=MAX_SEQ_LEN, dropout=0.1):
        # d_model: embedding size
        # num_heads: number of parallel attention heads
        # d_ff: hidden size of the position-wise feed-forward networks
        # num_layers: number of stacked layers in both the encoder and the decoder
        # input_vocab_size: source vocabulary size
        # target_vocab_size: target vocabulary size
        # max_len: maximum sequence length (context window)
        
        super().__init__()
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
        self.pos_embedding = PositionalEmbedding(d_model, max_len)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.output_layer = nn.Linear(d_model, target_vocab_size)
        
    def forward(self, source, target):
        # Build the source (padding) and target (padding + causal) masks
        source_mask, target_mask = self.mask(source, target)
        # Embedding and positional encoding
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim) # Scale embeddings by sqrt(d_model), as in the original paper
        source = self.pos_embedding(source)
        # Encoder
        encoder_output = self.encoder(source, source_mask)
        
        # Decoder embedding and positional encoding
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)
        # Decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)
        
        return self.output_layer(output)
        
    def mask(self, source, target):
        # Token id 0 is padding.
        # All other ids are special tokens (<sos>, <eos>) or words (here each word is one token).
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)  # (B, 1, 1, S)
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)  # (B, 1, 1, T)
        size = target.size(1)  # Dimension 1 is the target sequence length in this batch
        # Lower-triangular causal mask: blocks attention to future tokens that have not been generated yet
        no_mask = torch.tril(torch.ones(1, size, size, device=target.device)).bool()
        target_mask = target_mask & no_mask  # Broadcasting gives (B, 1, T, T)
        return source_mask, target_mask
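
The masks returned by `mask` take effect inside `scale_dot_product`, where masked scores are replaced by `-1e9` before the softmax. The sketch below shows that step in isolation; the score values and the mask are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw attention scores for one query over four keys.
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])
# Mask: the last two keys are padding (False means "ignore this key").
mask = torch.tensor([[True, True, False, False]])

# Same operation as in scale_dot_product: masked positions get a very negative score.
masked_scores = scores.masked_fill(mask == 0, -1e9)
attention = F.softmax(masked_scores, dim=-1)

print(attention)
```

The two masked keys receive zero weight, and the remaining weights are renormalized over the visible keys only, even though the fourth key had the highest raw score.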

### Simple test

In [80]:
seq_len_source = 10
seq_len_target = 10
batch_size = 2
input_vocab_size = 50
target_vocab_size = 50

source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

model = Transformer(d_model, num_heads, d_ff, num_layers, input_vocab_size, target_vocab_size,
                max_len=MAX_SEQ_LEN, dropout=0.1)

model = model.to(device)
source = source.to(device)
target = target.to(device)

In [81]:
output = model(source, target)

In [82]:
output.shape

torch.Size([2, 10, 50])

In [83]:
import pandas as pd

# Path to the original file
PATH = './Parejas de oraciones en InglésEspañol - 2025-04-01.csv'

# 1. Load the CSV (the empty columns are dropped in the next step)
df = pd.read_csv(PATH, sep=';', header=None)

# 2. Keep only the text columns: English (col 1) and Spanish (col 3).
#    .copy() avoids pandas' SettingWithCopyWarning when a column is added below.
eng_spa_cols = df.iloc[:, [1, 3]].copy()
eng_spa_cols.columns = ['en', 'es']  # optional, but more readable

# 3. Compute the length of each English sentence
eng_spa_cols['length'] = eng_spa_cols['en'].str.len()

# 4. Sort by length (shortest sentences first)
eng_spa_cols = eng_spa_cols.sort_values(by='length')

# 5. Drop the length column
eng_spa_cols = eng_spa_cols.drop(columns=['length'])

# 6. Save the result as a tab-separated text file
output_file_path = './eng-spa.txt'
eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)



In [84]:
eng_spa_cols.head()

| | en | es |
|---|---|---|
| 165605 | Go. | Vaya. |
| 159526 | Ah! | ¡Anda! |
| 95238 | Go! | ¡Fuera! |
| 95239 | Go! | ¡Ya! |
| 162092 | OK. | Bueno. |


In [85]:
eng_spa_cols.tail()

| | en | es |
|---|---|---|
| 145656 | As far as I understand despite my limited know... | Hasta donde comprendo a pesar de mi poco conoc... |
| 161607 | Three hours later, the King was loitering arou... | Tres horas después, el rey estaba merodeando p... |
| 265346 | In Japanese, conjugation is fundamental to pie... | En japonés, la conjugación es fundamental para... |
| 73305 | There is no such thing, at this stage of the w... | No existe tal cosa, en esta etapa de la histor... |
| 250695 | I know some things about reading when it comes... | Sé algunas cosas sobre la lectura cuando se tr... |


In [86]:
PATH = './eng-spa.txt'

In [87]:
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [88]:
eng_spa_pairs[:10]

[['Go.', 'Vaya.'],
 ['Ah!', '¡Anda!'],
 ['Go!', '¡Fuera!'],
 ['Go!', '¡Ya!'],
 ['OK.', 'Bueno.'],
 ['Go!', '¡Sal!'],
 ['Go!', '¡Ve!'],
 ['Hi.', 'Hola.'],
 ['Go!', 'Vete'],
 ['No.', 'No.']]

In [89]:
eng_sentences = [pair[0] for pair in eng_spa_pairs]
spa_sentences = [pair[1] for pair in eng_spa_pairs]

In [90]:
print(eng_sentences[:10])
print(spa_sentences[:10])

['Go.', 'Ah!', 'Go!', 'Go!', 'OK.', 'Go!', 'Go!', 'Hi.', 'Go!', 'No.']
['Vaya.', '¡Anda!', '¡Fuera!', '¡Ya!', 'Bueno.', '¡Sal!', '¡Ve!', 'Hola.', 'Vete', 'No.']


In [91]:
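def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    # Replace double quotes and runs of spaces with a single space
    sentence = re.sub(r'[" "]+', " ", sentence)
    # Map accented vowels to their unaccented form
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    # Everything outside a-z becomes a space (digits, punctuation, and also 'ñ' and 'ü')
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = sentence.strip()
    # Add start- and end-of-sequence markers
    sentence = '<sos> ' + sentence + ' <eos>'
    return sentence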
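
Because `preprocess_sentence` replaces every character outside `a-z` with a space, a word such as "niño" is split into "ni" and "o". One possible alternative, shown here only as a sketch and not as the notebook's method, is to strip accents through Unicode decomposition so that "ñ" becomes "n". The names `strip_accents` and `preprocess_sentence_nfd` are hypothetical, and adopting this variant would change the vocabulary sizes reported further down.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks, so 'ñ' becomes 'n'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def preprocess_sentence_nfd(sentence):
    # Hypothetical variant of preprocess_sentence; not used elsewhere in the notebook.
    sentence = strip_accents(sentence.lower().strip())
    sentence = re.sub(r"[^a-z]+", " ", sentence).strip()
    return "<sos> " + sentence + " <eos>"

print(preprocess_sentence_nfd("¡El niño llegó!"))
# <sos> el nino llego <eos>
```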

In [92]:
s1 = '¿Hola @ cómo estás? 123'

In [93]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>


In [94]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [95]:
spa_sentences[:10]

['<sos> vaya <eos>',
 '<sos> anda <eos>',
 '<sos> fuera <eos>',
 '<sos> ya <eos>',
 '<sos> bueno <eos>',
 '<sos> sal <eos>',
 '<sos> ve <eos>',
 '<sos> hola <eos>',
 '<sos> vete <eos>',
 '<sos> no <eos>']

In [96]:
def build_vocab(sentences):
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

In [97]:
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [98]:
print(eng_vocab_size, spa_vocab_size)

27945 47492
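
To see how the indices are assigned, the following sketch repeats the logic of `build_vocab` on a two-sentence toy corpus (the sentences are invented for illustration). Words are numbered from 2 in order of decreasing frequency, and any unseen word falls back to `<unk>`.

```python
from collections import Counter

# Toy corpus (hypothetical); tokens are whitespace-separated as in the notebook.
sentences = ["<sos> go <eos>", "<sos> go now <eos>"]

counts = Counter(word for s in sentences for word in s.split())
# Most frequent words get the smallest indices, starting at 2.
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
word2idx = {word: idx for idx, (word, _) in enumerate(ordered, 2)}
word2idx["<pad>"] = 0
word2idx["<unk>"] = 1

print(word2idx)
# {'<sos>': 2, 'go': 3, '<eos>': 4, 'now': 5, '<pad>': 0, '<unk>': 1}

# 'home' is not in the vocabulary, so it maps to <unk> (id 1).
encoded = [word2idx.get(w, word2idx["<unk>"]) for w in "<sos> go home <eos>".split()]
print(encoded)  # [2, 3, 1, 4]
```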


In [99]:
class EngSpaDataset(Dataset):
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx
        
    def __len__(self):
        return len(self.eng_sentences)
    
    def __getitem__(self, idx):
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # return tokens idxs
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]
        
        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

In [100]:
def collate_fn(batch):
    eng_batch, spa_batch = zip(*batch)
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch
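
The padding step in `collate_fn` relies on `pad_sequence`. A minimal sketch with two made-up sequences shows what it returns: every sequence is right-padded with `padding_value` up to the longest length in the batch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences of different lengths.
sequences = [torch.tensor([2, 7, 3]), torch.tensor([2, 9, 8, 6, 3])]

# batch_first=True gives shape (batch_size, max_len_in_batch).
batch = pad_sequence(sequences, batch_first=True, padding_value=0)

print(batch.shape)  # torch.Size([2, 5])
print(batch)
```

Since padding uses id 0, the same value the masks and the loss treat as `<pad>`, the padded positions are ignored downstream.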

In [101]:
def train(model, dataloader, loss_function, optimiser, epochs):
    model.train()
    for epoch in range(epochs):
        total_loss = 0 
        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)
            # Teacher forcing: the decoder input drops the last token, the expected output drops the first
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)
            # Zero grads
            optimiser.zero_grad()
            # run model
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))
            # loss
            loss = loss_function(output, target_output)
            # gradient and update parameters
            loss.backward()
            optimiser.step()
            total_loss += loss.item()
            
        avg_loss = total_loss/len(dataloader)
        print(f'Epoch: {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}')
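
The slicing in `train` implements teacher forcing: the decoder reads the sequence without its last token and is asked to predict the same sequence shifted one step to the left. A small sketch with invented token ids makes the alignment visible.

```python
# Hypothetical token ids: 2 = <sos>, 3 = <eos>, 0 = <pad>.
spa_sequence = [2, 15, 27, 3, 0]

target_input = spa_sequence[:-1]   # what the decoder reads
target_output = spa_sequence[1:]   # what it must predict at each position

for step, (seen, expected) in enumerate(zip(target_input, target_output)):
    print(step, seen, "->", expected)
```

The last pair predicts a padding token; with `ignore_index=0` in the loss, as configured in the notebook, such positions do not contribute to the gradient.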

In [102]:
BATCH_SIZE = 64
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

In [103]:
model = Transformer(d_model=512, num_heads=8, d_ff=2048, num_layers=6,
                    input_vocab_size=eng_vocab_size, target_vocab_size=spa_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

In [104]:
model = model.to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)

In [105]:
train(model, dataloader, loss_function, optimiser, epochs = 10)

KeyboardInterrupt: 

In [None]:
def sentence_to_indices(sentence, word2idx):
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

def indices_to_sentence(indices, idx2word):
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])

def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    model.eval()
    sentence = preprocess_sentence(sentence)
    input_indices = sentence_to_indices(sentence, eng_word2idx)
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

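    # Recover the special-token ids from spa_idx2word instead of relying on
    # a global spa_word2idx defined elsewhere in the notebook
    spa_word2idx = {word: idx for idx, word in spa_idx2word.items()}

    # Initialize the target tensor with <sos> token
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            output = model(input_tensor, tgt_tensor)
            output = output.squeeze(0)
            # Greedy decoding: take the most likely token at the last position
            next_token = output.argmax(dim=-1)[-1].item()
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)
            if next_token == spa_word2idx['<eos>']:
                break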

    return indices_to_sentence(tgt_indices, spa_idx2word)
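
`translate_sentence` decodes greedily: at each step it appends the highest-scoring token and stops at `<eos>` or after `max_len` steps. The sketch below isolates that loop with a stand-in scoring function; `toy_next_token_scores`, the token ids, and the vocabulary size of 6 are all hypothetical.

```python
SOS, EOS = 2, 3  # hypothetical ids for <sos> and <eos>

def toy_next_token_scores(prefix):
    """Stand-in for the model: returns one score per vocabulary id (size 6).

    Hypothetical rule: emit id 4, then id 5, then <eos>.
    """
    script = {1: 4, 2: 5}
    best = script.get(len(prefix), EOS)
    return [1.0 if token_id == best else 0.0 for token_id in range(6)]

def greedy_decode(max_len=10):
    tokens = [SOS]
    for _ in range(max_len):
        scores = toy_next_token_scores(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

print(greedy_decode())  # [2, 4, 5, 3]
```

Greedy decoding is simple but commits to each choice immediately; beam search is a common alternative when translation quality matters more than speed.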

In [None]:
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    for sentence in sentences:
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Translation: {translation}')
        print()

# Example sentences to test the translator
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!"
]

# Assuming the model is trained and loaded
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)

In [106]:
# Save weights
import torch

# --- After training the model ---
torch.save(model.state_dict(), "transformer_weights.pth")
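
Saving only the `state_dict` means the architecture has to be rebuilt before the weights can be restored. The sketch below shows the round trip on a small stand-in module; in the notebook the module would be a `Transformer` created with the same hyperparameters, and the temporary file name here is an assumption.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in module; in the notebook this would be the Transformer instance.
model_a = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "weights.pth")  # hypothetical file name
torch.save(model_a.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it.
model_b = nn.Linear(4, 2)
model_b.load_state_dict(torch.load(path, map_location="cpu"))
model_b.eval()  # disable dropout before inference

print(torch.equal(model_a.weight, model_b.weight))  # True
```

Passing `map_location` lets weights saved on a GPU be loaded on a CPU-only machine.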