# Demo de un Transformer desde cero para un modelo de traducción Español-Ingles

**Autor:** [Armand](https://armandds.github.io/)<br>
**Creado:** 2021/11/20<br>
**Modificado:** 2021/12/16<br>
**Descripción:** implementar un modelo transformer para traduccion español-ingles, solo para aprendizaje, del [post como funciona el transformer](https://txtdatos.digital/2021/11/23/transformers-arquitectura-encoder/)

## Cargamos las librerias a utilizar

In [135]:
import spacy
import string
import re
import pathlib
import random
import string
import re
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import time
import math


## Descargamos los datos a utilizar

Para esta demo vamos a usar el dataset de traducción español-ingles

In [136]:

if not os.path.isdir('spa-eng'):
    !wget -O spa-eng.zip -q "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
    !unzip -q spa-eng.zip

In [137]:
text_file = "/content/spa-eng/spa.txt"


with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    text_pairs.append(( spa, eng))

Veremos como lucen nuestros datos:

In [138]:
for _ in range(5):
    print(random.choice(text_pairs))

('Tom le pidió un menú al mesero.', 'Tom asked the waiter for a menu.')
('Tom tiene una pila entera de boletas de estacionamiento sin pagar.', 'Tom has a whole pile of unpaid parking tickets.')
('Tom almorzó con Mary.', 'Tom had lunch with Mary.')
('Si estás aquí quiere decir que te importa.', 'If you are here, it means you care.')
('¡Que te mejores!', 'Get well soon!')


## Procesado de datos

Nos apoyeramos en la libreria spacy para crear tokenizadores para el idioma ingles y español

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download es_core_news_sm

In [140]:
!python -m spacy link en_core_web_sm en_core_web_sm
!python -m spacy link es_core_news_sm es_core_news_sm


[38;5;1m✘ Link 'en_core_web_sm' already exists[0m
To overwrite an existing link, use the --force flag


[38;5;1m✘ Link 'es_core_news_sm' already exists[0m
To overwrite an existing link, use the --force flag



Creamos los 2 tokenizadores, es decir funciones que divididen la oración en tokens

In [141]:
nlp_en = spacy.load('en_core_web_sm')

def tokenizer_en(sentence):
    
    sentence = re.sub(
    r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(sentence))
    sentence = re.sub(r"[ ]+", " ", sentence)
    sentence = re.sub(r"\!+", "!", sentence)
    sentence = re.sub(r"\,+", ",", sentence)
    sentence = re.sub(r"\?+", "?", sentence)
    sentence = sentence.lower()
    return [tok.text for tok in nlp_en.tokenizer(sentence) if tok.text != " "]

In [142]:
nlp_es = spacy.load('es_core_news_sm')
def tokenizer_es(sentence):
    
    sentence = re.sub(
    r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(sentence))
    sentence = re.sub(r"[ ]+", " ", sentence)
    sentence = re.sub(r"\!+", "!", sentence)
    sentence = re.sub(r"\,+", ",", sentence)
    sentence = re.sub(r"\?+", "?", sentence)
    sentence = sentence.lower()
    return [tok.text for tok in nlp_es.tokenizer(sentence) if tok.text != " "]

Probamos su funcionamiento

In [143]:
tokenizer_es("¿quieres sonar como un hablante nativo?")

['¿', 'quieres', 'sonar', 'como', 'un', 'hablante', 'nativo', '?']

In [144]:
tokenizer_en("hello, how are you?")

['hello', ',', 'how', 'are', 'you', '?']

Ahora crearemos el diccionario que contiene todas las palabras presentes en nuestro dataset tanto en español como en ingles

In [145]:
word2idx_spa = {}
word2idx_eng = {}
idx2word_eng = {}
idx2word_spa= {}


strip_chars = string.punctuation + "¿"
j = 1
z = 3
word2idx_eng["[PAD]"] = 0
word2idx_eng["[SOS]"] = 1
word2idx_eng["[EOS]"] = 2
word2idx_spa["[PAD]"] = 0

for i, (spa, eng) in enumerate(text_pairs):
  for w1 in tokenizer_es(spa):
    if w1 not in word2idx_spa:
      word2idx_spa[w1] = j
      j +=1  
  for w in tokenizer_en(eng):
    if w not in word2idx_eng:
      word2idx_eng[w] = z
      z +=1   
def ws2seq(s):
    return [word2idx_spa[i] for i in s if i in word2idx_spa.keys()] 
    
def seq2ws(s):
    return [idx2word_spa[i] for i in s if idx2word_spa[i]]

def we2seq(s):
    return [word2idx_eng['[SOS]']] + [word2idx_eng[i] for i in s if i in word2idx_eng.keys()] + [word2idx_eng['[EOS]']]

def seq2we(s):
    return [idx2word_eng[i] for i in s]


idx2word_eng = {v:k for k,v in word2idx_eng.items() }
idx2word_spa= {v:k for k,v in word2idx_spa.items() }

El diccionario que nos permitira convertir una palabra a número queda de la siguiente manera:

Para el Español:

In [146]:
dict(random.sample(word2idx_spa.items(), 10))

{'amañado': 6881,
 'bistec': 8430,
 'chingón': 20377,
 'despiertas': 12326,
 'entrenado': 13015,
 'estupendos': 4393,
 'llamará': 21023,
 'masticaste': 21418,
 'pelón': 1588,
 'préstamo': 2664}

Para el Inglés:

In [147]:
dict(random.sample(word2idx_eng.items(), 10))

{'basking': 6582,
 'cancer': 1190,
 'dilemma': 3975,
 'fix': 124,
 'heartbreaking': 7561,
 'hero': 423,
 'labor': 8558,
 'lifetime': 9823,
 'race': 2619,
 'unless': 8697}

Y la cantidad de palabras o tokens presentes es:

In [148]:
len(word2idx_eng), len(word2idx_spa)

(13193, 26041)

Tenemos que hay 13K palabras para el ingles y 26K palabras para el español

In [149]:
word2idx_spa["[PAD]"]

0

Ahora creamos las clases que nos permitiran cargar los datos por lotes (batch)

In [150]:

random.seed(30)
torch.manual_seed(30)
torch.cuda.manual_seed(30)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
batch_size = 64


class TextLoader(torch.utils.data.Dataset):
    def __init__(self, path= "/content/spa-eng/spa.txt"):
        self.x, self.y = [], []
        with open(text_file) as f:
            lines = f.read().split("\n")[:-1]
        for line in lines:
            eng, spa = line.split("\t")
            eng_l = tokenizer_en(eng)
            spa_l = tokenizer_en(spa)
            self.x.append(ws2seq(spa_l))
            self.y.append(we2seq(eng_l))
    def __getitem__(self, index):
        return (torch.LongTensor(self.x[index]), torch.LongTensor(self.y[index]))

    def __len__(self):
        return len(self.x)


class TextCollate():
    def __call__(self, batch):
        max_x_len = max([i[0].size(0) for i in batch])
        x_padded = torch.LongTensor( len(batch), max_x_len)
        x_padded.zero_()

        max_y_len = max([i[1].size(0) for i in batch])
        y_padded = torch.LongTensor( len(batch), max_y_len)
        y_padded.zero_()

        for i in range(len(batch)):
            x = batch[i][0]
            x_padded[i, :x.size(0)] = x
            y = batch[i][1]
            y_padded[i,:y.size(0)] = y

        return x_padded, y_padded

separamos en datos de entrenamiento (train) y datos de validación (val), con un ratio de 90:10 es decir entrenaremos con el 90% de nuestros datos y validaremos con el 10% restante

In [None]:
pin_memory = True
num_workers = 2

dataset = TextLoader()
train_len = int(len(dataset) * 0.9)
trainset, valset = torch.utils.data.random_split(dataset, [train_len, len(dataset) - train_len])

collate_fn = TextCollate()

train_loader = torch.utils.data.DataLoader(trainset, num_workers=num_workers, shuffle=True,
                          batch_size=batch_size, pin_memory=pin_memory,
                          drop_last=True, collate_fn=collate_fn)

val_loader = torch.utils.data.DataLoader(valset, num_workers=num_workers, shuffle=False,
                        batch_size=batch_size, pin_memory=pin_memory,
                        drop_last=False, collate_fn=collate_fn)

Comprobamos el funcionamiento:

In [None]:
for x,y in train_loader:
  print(x.shape, y.shape)
  print(y[:,1])
  print(y[1:,1])
  break

In [None]:
for x,y in zip(dataset.x, dataset.y):
  print(x)
  print(seq2ws(x))
  print(y)
  print(seq2we(y))
  break

Es decir solo hemos reemplazado cada palabra por un número y le hemos agregado los tokens especiales SOS, y EOS

# Implementanción de la arquitectura Transfomers


Para el funcionamiento teorico por favor leer el [blog](http://txtdatos.digital/)

Debemos destacar que la capa de embedding y la capa de normalización se utilizará la implementacion que trae pytorch

### Embedding

Utilizaremos el de pytorch

### Positional encoding

Aquí definimos la capa que crea la matriz de posiciones

In [None]:

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len = 80):
        super().__init__()
        self.d_model = d_model
        
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                # se calcula las posiciones pares 
                pe[pos, i] = \
                math.sin(pos / (10000 ** ((2 * i)/d_model)))
                # se calcula las posiciones impares
                pe[pos, i + 1] = \
                math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))
                
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
 
    
    def forward(self, x):
        # Aqui incrementamos el valor del embedding para que la matriz de posiciones no domine
        x = x * math.sqrt(self.d_model)
        #sumamos para obtener el vector de entrada final tal como se comenta en el post
        seq_len = x.size(1)
        x = x + self.pe[:,:seq_len]
        return x

## Crearemos la capa de auto-atención

Se define una función auxiliar para la multiplicación de matrices Q,K y V

In [None]:


def attention(q, k, v, d_k, mask=None, dropout=None):
    
    scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)
    if mask is not None:
            mask = mask.unsqueeze(1)
            scores = scores.masked_fill(mask == 0, -1e9)
    scores = F.softmax(scores, dim=-1)
    
    if dropout is not None:
        scores = dropout(scores)
        
    output = torch.matmul(scores, v)
    return output

### Ahora definimos la capa de atención:

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        #Aqui definimos las matrices WQ, WK y WV explicadas en el post
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)
    
    def forward(self, q, k, v, mask=None):
        
        bs = q.size(0)
        
        # realizamos operaciones para obtener las dimensiones adecuadas
        
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        
       
        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)

        # calculamos la atención
        scores = attention(q, k, v, self.d_k, mask, self.dropout)
        
        # concatemos y obtenemos la salida
        concat = scores.transpose(1,2).contiguous()\
        .view(bs, -1, self.d_model)
        
        output = self.out(concat)
    
        return output

### La capa de Feed Forward 

In [None]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout = 0.1):
        super().__init__() 
        # definimos el tamaño por defecto a 2048
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        x = x.cuda()
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x

### Capa de normalización

vamos a usar la que ya implementa pytorch

## Definimos el Encoder

Cada decoder es 1 capa de multiatencion y 1 capa feed-forward

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout = 0.1):
        super().__init__()
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model).cuda()
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2,x2,x2,mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x

In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads):
        super().__init__()
        self.N = N
        self.embed =  nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(EncoderLayer(d_model, heads), N)
        self.norm = nn.LayerNorm(d_model)
    def forward(self, src, mask):
        src = src.cuda()
        x = self.embed(src)
        x = self.pe(x)
        for i in range(N):
            x = self.layers[i](x, mask)
        return self.norm(x)

In [None]:
import copy

def get_clones(module, N):
    return torch.nn.ModuleList([copy.deepcopy(module) for i in range(N)])

### y el Decoder

cada decoder contiene principalmente 2 capas de multi-head attention (1 enmascarada como se vio en el post) y una capa feed forward


In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = nn.LayerNorm(d_model)
        self.norm_2 = nn.LayerNorm(d_model)
        self.norm_3 = nn.LayerNorm(d_model)
        
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        
        self.attn_1 = MultiHeadAttention(heads, d_model)
        self.attn_2 = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model).cuda()
    def forward(self, x, e_outputs, src_mask, trg_mask):

        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs,
        src_mask))
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x


In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads):
        super().__init__()
        self.N = N
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(DecoderLayer(d_model, heads), N)
        self.norm = nn.LayerNorm(d_model)
    def forward(self, trg, e_outputs, src_mask, trg_mask):
        trg = trg.cuda()
        x = self.embed(trg)
        x = self.pe(x)
        for decoder in self.layers:
            x = decoder(x, e_outputs, src_mask, trg_mask)
        return self.norm(x)

## Ahora el Transformer Final es:

In [None]:
class Transformer(nn.Module):
    def __init__(self, src_vocab, trg_vocab, d_model, N, heads):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads)
        self.decoder = Decoder(trg_vocab, d_model, N, heads)
        self.out = nn.Linear(d_model, trg_vocab)
        self.apply(self._init_weights)
      
      #esto para inicializar los pesos
    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
        
        elif isinstance(module, nn.LayerNorm):
            module.weight.data.fill_(1.0)

        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
    def forward(self, src, trg, src_mask, trg_mask):
        e_outputs = self.encoder(src, src_mask)
        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)
        output = self.out(d_output)
        return output

# Entrenamiento y evaluación

Vamos a crear un transfomer con 1 Encoder y 1 Decoder

In [None]:
# Configuramos los parametros del modelo
d_model = 256
n_heads = 8
N = 1

src_vocab_size = len(word2idx_spa) +1
trg_vocab_size = len(word2idx_eng) +1

In [None]:
# instaciamos el modelo con los parametros y lo enviamos a la GPU
model = Transformer(src_vocab_size, trg_vocab_size, d_model, N, n_heads)
model.cuda()

In [None]:
for p in model.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform(p)

In [None]:
# Elegimos el optimizador a utilizar, en este caso el Adam con decaimiento
optimizer = torch.optim.AdamW(model.parameters())

In [None]:
import numpy as np
import time


def create_mask(src_input, trg_input):
    # mascara de entrada para evitar el padding
    pad = word2idx_spa["[PAD]"]
    src_mask = (src_input != pad).unsqueeze(1)
    
    # mascara de salida
    trg_mask = (trg_input != pad).unsqueeze(1)
    
    seq_len = trg_input.size(1)
    nopeak_mask = np.tril(np.ones((1, seq_len, seq_len)), k=0).astype('uint8')
    nopeak_mask = torch.from_numpy(nopeak_mask) != 0
    trg_mask = trg_mask & nopeak_mask
    
    return src_mask.cuda(), trg_mask.cuda()

In [None]:
def train(model, optimizer, criterion, iterator):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        src_input = batch[0]# tamaño (batch_size, seq_len)
        trg = batch[1]# tamaño (batch_size, seq_len)
        
        trg_input = trg[:, :-1]
        ys = trg[:, 1:].contiguous().view(-1).cuda()
        
        # creamos las mascaras
        src_mask, trg_mask = create_mask(src_input, trg_input)
        preds = model(src_input, trg_input, src_mask, trg_mask)
        # aplicamos el backpropagation
        optimizer.zero_grad()
        loss = criterion(preds.view(-1, preds.size(-1)), ys)
        loss.backward()
        # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, criterion, iterator):

    model.eval()

    epoch_loss = 0

    with torch.no_grad():    
        for i, batch in enumerate(iterator):
          src_input = batch[0]# tamaño (batch_size, seq_len)
          trg = batch[1]# tamaño (batch_size, seq_len)
          
          trg_input = trg[:, :-1]
          ys = trg[:, 1:].contiguous().view(-1).cuda()
          src_mask, trg_mask = create_mask(src_input, trg_input)
          preds = model(src_input, trg_input, src_mask, trg_mask)
          loss = criterion(preds.view(-1, preds.size(-1)), ys)
          epoch_loss += loss.item()

          if i%32 == 0:
            for i in [5,12, 15]:
              out = F.softmax(preds[i], dim=-1)
              val, ix = out.data.topk(1)
              print("Oración en Español: ",  seq2ws(src_input[i].tolist()))
              print("Oración Real en Ingles: ", seq2we(trg_input[i].tolist()))
              print("Oración Predicha en Ingles: ",seq2we([ix1[0] for ix1 in ix.tolist()]))
              print("\n")

    return epoch_loss / len(iterator)

In [None]:
def translate(model, src, max_len = 80, custom_string=False):
    
    model.eval()

    if custom_string == True:
            src = tokenizer_es(src)
            src=(torch.LongTensor([[word2idx_spa[tok] for tok in src]])).cuda()
    src_mask = (src != word2idx_spa["[PAD]"]).unsqueeze(-2)
    e_outputs = model.encoder(src, src_mask)
    
    outputs = torch.zeros(max_len).type_as(src.data)
    outputs[0] = torch.LongTensor([word2idx_eng['[SOS]']])
    for i in range(1, max_len):    
            
        trg_mask = np.triu(np.ones((1, i, i))).astype('uint8')
        trg_mask= (torch.from_numpy(trg_mask) == 0).cuda()
        
        out = model.out(model.decoder(outputs[:i].unsqueeze(0),
        e_outputs, src_mask, trg_mask))
        out = F.softmax(out, dim=-1)
        val, ix = out[:, -1].data.topk(1)
        
        outputs[i] = ix[0][0]
        if ix[0][0] == word2idx_eng['[EOS]']: # TRG.vocab.stoi['<eos>']:
            break
    return ' '.join( seq2we(outputs[:i].tolist()))

In [None]:
# Seleccionamos el cross entropy como nuestra funcion de perdida e ignoramos el padding
criterion = nn.CrossEntropyLoss(ignore_index = word2idx_spa["[PAD]"])

Al fin, entrenamos:

In [173]:
%%time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

N_EPOCHS = 7

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    print(f'Epoch: {epoch+1:02}')

    start_time = time.time()

    train_loss = train(model, optimizer, criterion, train_loader)
    valid_loss = evaluate(model, criterion, val_loader)

    epoch_mins, epoch_secs = epoch_time(start_time, time.time())

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # torch.save(model.state_dict(), 's2e_model.pt') # si quieres guardar el modelo

    print(f'Time: {epoch_mins}m {epoch_secs}s')
    print(f'Train Loss: {train_loss:.3f}')
    print(f'Val   Loss: {valid_loss:.3f}')
print(best_valid_loss)

Epoch: 01
Oración en Español:  ['las', 'malas', 'hierbas', 'aparecieron', 'súbitamente', 'en', 'el', 'jardín', '.', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Oración Real en Ingles:  ['[SOS]', 'weeds', 'sprang', 'up', 'in', 'the', 'garden', '.', '[EOS]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Oración Predicha en Ingles:  ['the', 'is', 'out', 'in', 'the', 'garden', '.', '[EOS]', 'in', 'in', 'in', 'in', 'in', 'in', 'in', 'in', '.', '.']


Oración en Español:  ['muéstrenme', 'el', 'dinero', '.', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Oración Real en Ingles:  ['[SOS]', 'show', 'me', 'the', 'money', '.', '[EOS]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Oración Predicha en Ingles:  ['i', 'me', 'the', 'money', '.', '[EOS]', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.']


Oración en Español: 

### Probamos con algunas oraciones para ver el resultado del entrenamiento:

In [179]:
translate(model, "voy a tomar café", custom_string=True)

"[SOS] i 'm going to drink coffee ."

In [184]:
translate(model, "¿cómo estás tú?", custom_string=True)

'[SOS] how are you ?'

In [185]:
translate(model, "dime la verdad",  custom_string=True)

'[SOS] tell me the truth the truth .'

Nada mal para un modelo con solo 1 encoder/decoder, recuerda que el modelo original fue entrenado con millones de datos durante varios días.

así es el Transformer de principio a fin, si quieres utilizar el transformer es mejor utilizar los modelos de hugginface y keras ya que son modelos robustos y altamente optimizados, el modelo de acá es solo para aprendizaje.
Recuerda suscribirte al blog