# Translation with Seq2Seq Models

This notebook is heavily based on the PyTorch tutorial [*NLP From Scratch: Translation with a Sequence to Sequence Network and Attention*](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) by Sean Robertson.

We will look at how to translate sentences from French to English.

    IN: il est en train de peindre un tableau .
    TRG: he is painting a picture .
    OUT: he is painting a picture .

    IN: pourquoi ne pas essayer ce vin delicieux ?
    TRG: why not try that delicious wine ?
    OUT: why not try that delicious wine ?

    IN: elle n est pas poete mais romanciere .
    TRG: she is not a poet but a novelist .
    OUT: she not not a poet but a novelist .

    IN: vous etes trop maigre .
    TRG: you re too skinny .
    OUT: you re all alone .

... with varying degrees of success.


### Imports

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import pandas as pd
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


Preparing the Data
==================

The data for this problem is a set of many thousands of English to French sentence pairs.

    I am cold.    J'ai froid.

In [10]:
# !wget https://download.pytorch.org/tutorial/data.zip
# !unzip data.zip

In [11]:
# Take a peek at the dataset
dataset = pd.read_csv("data/eng-fra.txt", sep="\t", header=None)
dataset.columns = ["English", "French"]
dataset

| | English | French |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Run! | Cours ! |
| 2 | Run! | Courez ! |
| 3 | Wow! | Ça alors ! |
| 4 | Fire! | Au feu ! |
| ... | ... | ... |
| 135837 | A carbon footprint is the amount of carbon dio... | Une empreinte carbone est la somme de pollutio... |
| 135838 | Death is something that we're often discourage... | La mort est une chose qu'on nous décourage sou... |
| 135839 | Since there are usually multiple websites on a... | Puisqu'il y a de multiples sites web sur chaqu... |
| 135840 | If someone who doesn't know your background sa... | Si quelqu'un qui ne connaît pas vos antécédent... |


We will represent each word in a language as a one-hot vector: a giant vector of zeros except for a single one at the index of the word. To do this we build a vocabulary, and we cheat a bit by trimming the data so that we only use a few thousand words per language.


![Word encoding](https://drive.google.com/uc?id=1aLm__m9YWaKRZdDmdInE5rT-et0jXTci "Word encoding")
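
As a small illustration of the encoding described above, the sketch below builds a toy vocabulary and turns a word into its index and the corresponding one-hot vector. The vocabulary and the helper name `one_hot` are made up for this example. In practice the model only receives the index, which `nn.Embedding` uses to look up a dense vector.

```python
# Toy vocabulary; the words and indices are illustrative only.
word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4}


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


index = word2index["cold"]
vector = one_hot(index, len(word2index))
print(index, vector)  # 4 [0, 0, 0, 0, 1]
```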



We need a unique index per word, so (as we have done before) we build a vocabulary. In this case we use a helper class `Lang` that holds:
  - word → index (``word2index``)
  - index → word (``index2word``)
  - ``word2count``, the number of occurrences of each word

In [23]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
  

The files are in Unicode. To simplify, we convert them to ASCII, lowercase everything, and strip most punctuation.

In [24]:
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicode2ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalize_string(s):
    s = unicode2ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
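
To see why dropping characters of category `Mn` removes accents, it can help to look at what NFD normalization does to an accented word. The snippet below is a standalone illustration using only the standard library; the sample word is arbitrary.

```python
import unicodedata

word = "\u00e9t\u00e9"  # "été", an arbitrary sample with two accented letters

# NFD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", word)
categories = [unicodedata.category(c) for c in decomposed]
print(len(word), len(decomposed))
print(categories)

# Keeping everything except the combining marks ("Mn") leaves plain ASCII.
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(stripped)
```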

To read the data we split the file into lines and then split each line into its two sentences. All the files we downloaded are English → Other Language, so to translate from Other Language → English we use the `reverse` flag to swap the pairs.

In [25]:
def read_langs(lang1, lang2, reverse=False):
    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalize_string(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
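
The effect of `reverse` on a single line of the file can be sketched as follows. The sample line is taken from the dataset preview; normalization is left out to keep the example focused on the split and the swap.

```python
line = "Go.\tVa !"  # one tab-separated line: English, then French

pair = line.split("\t")               # [English, French], as stored in the file
reversed_pair = list(reversed(pair))  # [French, English], used for fra -> eng

print(pair)
print(reversed_pair)
```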

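Since there are many example sentences and we want to train something quickly, we trim the dataset to short, simple sentences. Both sides must have fewer than 10 tokens (punctuation included), and the English side must start with a form such as "I am" or "He is". Contracted forms are covered too, since apostrophes were replaced by spaces during normalization.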

In [26]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filter_pair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)


def filter_pairs(pairs):
    return [pair for pair in pairs if filter_pair(pair)]
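
A quick check of the filter logic on hand-written pairs, as a sketch. The pairs, the shortened prefix tuple, and the names `keep`, `prefixes` and `max_len` are assumptions for this example. The point is that `str.startswith` accepts a tuple of prefixes and that both sides must be shorter than the length limit.

```python
max_len = 10
prefixes = ("i am ", "i m ", "he is", "he s ")  # shortened list, for illustration

# Pairs are [French, English] after reversing, already normalized.
candidates = [
    ["j ai froid .", "i am cold ."],   # kept: short and starts with "i am "
    ["cours !", "run !"],              # dropped: no matching prefix
    ["il est " + "tres " * 10 + "grand .", "he is very tall ."],  # dropped: too long
]


def keep(pair):
    return (
        len(pair[0].split(" ")) < max_len
        and len(pair[1].split(" ")) < max_len
        and pair[1].startswith(prefixes)
    )


kept = [pair for pair in candidates if keep(pair)]
print(kept)
```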

The full process for preparing the data is:

- Read the text file, split it into lines, and split each line into pairs
- Normalize and filter the text
- Build the vocabularies from the sentence pairs




In [27]:
def prepare_data(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = read_langs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))

    pairs = filter_pairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))

    for pair in pairs:
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])

    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepare_data('eng', 'fra', True)
print(random.choice(pairs))

Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counted words:
fra 4345
eng 2803
['je suis heureuse de vous avoir invitee .', 'i m glad i invited you .']


Model
=================

![Seq2Seq Architecture](https://drive.google.com/uc?id=14XIFBXqpos7Z_spBMtK5gyWbl5MJyMud "Seq2Seq Architecture")



Encoder
-----------

![Encoder Network](https://drive.google.com/uc?id=17D4YBVh630jJBo6TVquS2a1R6XmPs3qK "Encoder Network")

In [38]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size,embedding_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.gru = nn.GRU(embedding_size, hidden_size)

    def forward(self, input_data, hidden):
        # Embeddings are computed one word at a time, so we reshape to
        # (seq_len=1, batch=1, embedding_size) with .view(1, 1, -1)
        embedded = self.embedding(input_data).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
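
The shapes involved in a single encoder step can be checked with a small standalone sketch. The sizes below are arbitrary and the layers are freshly initialized, so only the shapes are meaningful. It assumes the default `nn.GRU` layout of `(seq_len, batch, features)`.

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, hidden_size = 20, 10, 16  # arbitrary sizes

embedding = nn.Embedding(vocab_size, embedding_size)
gru = nn.GRU(embedding_size, hidden_size)

word_index = torch.tensor([7])           # one word, shape (1,)
hidden = torch.zeros(1, 1, hidden_size)  # (num_layers, batch, hidden_size)

# Reshape the embedding to (seq_len=1, batch=1, embedding_size) for the GRU.
embedded = embedding(word_index).view(1, 1, -1)
output, hidden = gru(embedded, hidden)

print(embedded.shape)  # torch.Size([1, 1, 10])
print(output.shape)    # torch.Size([1, 1, 16])
print(hidden.shape)    # torch.Size([1, 1, 16])
```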

## Simple Decoder

![Decoder Network](https://drive.google.com/uc?id=13kddnNWcPFku6SUS4ZbTMnDZLmyB2baD "Decoder Network")

In [95]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, embedding_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.gru = nn.GRU(embedding_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, encoder_outputs):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output[0])
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

# Helper Functions

### Preparing the Training Data

We turn each sentence pair into a tuple of tensors of word indices. When building them, we append the EOS token to both sequences.

In [96]:
def indexes_from_sentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(" ")]


def tensor_from_sentence(lang, sentence):
    indexes = indexes_from_sentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensors_from_pair(pair):
    input_tensor = tensor_from_sentence(input_lang, pair[0])
    target_tensor = tensor_from_sentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
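
As a sketch of what `tensor_from_sentence` produces, the snippet below repeats the same steps on a toy vocabulary. The vocabulary is invented for the example and the tensor is kept on the CPU for simplicity.

```python
import torch

EOS_token = 1
word2index = {"i": 2, "am": 3, "cold": 4, ".": 5}  # toy vocabulary

sentence = "i am cold ."
indexes = [word2index[word] for word in sentence.split(" ")]
indexes.append(EOS_token)  # mark the end of the sentence

# view(-1, 1) gives one row per token, so tensor[i] is the i-th word.
tensor = torch.tensor(indexes, dtype=torch.long).view(-1, 1)
print(tensor.shape)              # torch.Size([5, 1])
print(tensor.view(-1).tolist())  # [2, 3, 4, 5, 1]
```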

### Training the Model
------------------

To train the model, we pass the input sentence through the encoder one word at a time and keep its last hidden state. (In this notebook `encoder_outputs` is allocated and passed to the decoder, but the simple decoder does not read it.) The decoder then receives the `<SOS>` token as its first input and the encoder's last hidden state as its initial hidden state.

"Teacher forcing" means feeding the real target tokens as the next decoder inputs, instead of the decoder's own predictions. It speeds up convergence, but the trained network may show [instability](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf) when it is later exploited, that is, run on its own predictions.

Thanks to the freedom PyTorch's autograd gives us, we can choose whether to use teacher forcing on each call with a simple `if` statement, and the optimizers work unchanged. Here we set `teacher_forcing_ratio` to 0.5, so roughly half of the training examples use it.


In [97]:
teacher_forcing_ratio = 0.5


def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):

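    # Reset the gradients of both optimizers; otherwise the decoder's
    # gradients accumulate across calls
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()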

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Feed inputs to the encoder one by one ~3 Lines
    encoder_hidden = encoder.init_hidden()

    for i in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[i], encoder_hidden)

    # Initialize decoder input and hidden state ~2 Lines
    decoder_hidden = encoder_hidden
    decoder_input = torch.tensor([[SOS_token]], device=device)

    # Randomly choose whether to use teacher forcing for this example.
    # It needs the known target, so it may lead to suboptimal performance at
    # inference time, when the target is not available.
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        # Feed the decoder inputs one by one, add the loss at each timestep and move forward using the next target as input
        # ~3 Lines
        for j in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output.squeeze(1), target_tensor[j])
            decoder_input = target_tensor[j]

    else:
        # Without teacher forcing: use its own predictions as the next input
        # Feed the decoder inputs one by one, find the predicted next token and set it as next input (use .detach() on this tensor)
        # Compute the loss at each timestep
        # If the decoder generates an EOS, stop.
        # ~8 Lines
        for j in range(target_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output.squeeze(1), target_tensor[j])
            
            decoder_input = torch.argmax(decoder_output).detach()
            if decoder_input == EOS_token:
                break
    # Backprop! ~3 Lines
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
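
The difference between the two branches can be seen without any neural network. In the sketch below, `toy_decoder` is a hypothetical stand-in that always predicts its input plus one; the only thing that changes between the two runs is where the next input comes from.

```python
def toy_decoder(token):
    """Hypothetical stand-in for the decoder: predicts `token + 1`."""
    return token + 1


def decode(targets, start_token, use_teacher_forcing):
    """Return the inputs the decoder saw at each step."""
    decoder_input = start_token
    seen_inputs = []
    for target in targets:
        seen_inputs.append(decoder_input)
        prediction = toy_decoder(decoder_input)
        # Teacher forcing feeds the true target; otherwise feed the prediction.
        decoder_input = target if use_teacher_forcing else prediction
    return seen_inputs


targets = [5, 9, 2]
print(decode(targets, 0, use_teacher_forcing=True))   # [0, 5, 9]
print(decode(targets, 0, use_teacher_forcing=False))  # [0, 1, 2]
```

With teacher forcing the decoder is always conditioned on the correct history, so an early mistake does not propagate; without it, each step builds on whatever was predicted before.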

Helper functions to print the elapsed time and estimate the time remaining, given the start time and the fraction of progress.


In [98]:
import time
import math


def as_minutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def time_since(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (as_minutes(s), as_minutes(rs))
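
As a worked example of the estimate in `time_since`: if 30 seconds have elapsed and 25% of the iterations are done, the projected total is 30 / 0.25 = 120 seconds, which leaves 90 seconds. The snippet repeats that arithmetic with fixed numbers instead of the wall clock.

```python
import math

elapsed = 30.0  # seconds spent so far (fixed value for the example)
percent = 0.25  # fraction of iterations completed

estimated_total = elapsed / percent
remaining = estimated_total - elapsed

minutes = math.floor(remaining / 60)
seconds = remaining - minutes * 60
formatted = "%dm %ds" % (minutes, seconds)
print(formatted)  # 1m 30s
```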

The whole training process looks like this:

-  Start a timer
-  Initialize the optimizers and the loss. We use `NLLLoss` as the loss.
-  Create the set of training pairs
-  Start an empty list of losses for plotting

Then we call ``train`` many times and occasionally print the progress.

In [99]:
def train_iters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [
        tensors_from_pair(random.choice(pairs))
        for i in range(n_iters)
    ]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(
            input_tensor, target_tensor, encoder, decoder,
            encoder_optimizer, decoder_optimizer, criterion
        )
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (
                time_since(start, iter / n_iters),
                iter,
                iter / n_iters * 100,
                print_loss_avg
            ))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    show_plot(plot_losses)
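
`nn.NLLLoss` expects log-probabilities, which is why the decoder ends in `LogSoftmax`. For a single prediction, the loss is the negative log-probability assigned to the correct word. The sketch below reproduces that computation by hand with made-up scores, using only the standard library.

```python
import math

scores = [2.0, 1.0, 0.1]  # made-up raw decoder scores for a 3-word vocabulary
target = 0                # index of the correct word

# Log-softmax: subtract the log of the summed exponentials from each score.
log_norm = math.log(sum(math.exp(s) for s in scores))
log_probs = [s - log_norm for s in scores]

# Negative log-likelihood of the correct word.
loss = -log_probs[target]
print(round(loss, 4))
```

A per-token loss near zero means the model puts almost all probability on the correct word, while a uniform guess over a vocabulary of size V gives a loss of log(V).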

Plotting the Results
----------------


In [100]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np

%matplotlib inline

def show_plot(points):
    # plt.subplots() already creates the figure, so no extra plt.figure() call
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)

Evaluation
==========

Evaluation works much like training, but since there are no targets we feed the decoder's own predictions back in as its next inputs. We do this until the decoder produces an EOS token or `max_length` steps have been taken.




In [101]:
def evaluate(encoder, decoder, sequence, max_length=MAX_LENGTH):
    with torch.no_grad():
        # Run one evaluation pass: encode the input sentence, then decode
        # greedily without targets
        # ~21 Lines
        input_tensor = tensor_from_sentence(input_lang, sequence)
        input_length = input_tensor.size(0)
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)


        # Feed inputs to the encoder one by one ~3 Lines
        encoder_hidden = encoder.init_hidden()

        for i in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[i], encoder_hidden)

        # Initialize decoder input and hidden state ~2 Lines
        decoder_hidden = encoder_hidden
        decoder_input = torch.tensor([[SOS_token]], device=device)
        output_indexes = []

        # Greedy decoding: feed each prediction back in until EOS or max_length
        for _ in range(max_length):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)
            decoder_input = torch.argmax(decoder_output)
            # Store a plain int: index2word is keyed by ints, not tensors
            output_indexes.append(decoder_input.item())
            if decoder_input.item() == EOS_token:
                break

        decoded_words = [output_lang.index2word[index] for index in output_indexes]
        
        return decoded_words

In [102]:
def evaluate_randomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('IN:', pair[0])
        print('TRG:', pair[1])
        output_words = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('OUT:', output_sentence)
        print('')

Training and Evaluating
=======================



In [103]:
hidden_size = 256
embedding_size = 10
encoder = EncoderRNN(input_lang.n_words, hidden_size, embedding_size).to(device)
decoder = DecoderRNN(hidden_size, output_lang.n_words, embedding_size=embedding_size).to(device)

train_iters(encoder, decoder, 75000, print_every=200)

0m 27s (- 168m 23s) (200 0%) 5.5046
0m 32s (- 100m 51s) (400 0%) 20.1626
0m 37s (- 78m 5s) (600 0%) 86.2402
0m 43s (- 66m 38s) (800 1%) 223.1924
0m 48s (- 59m 43s) (1000 1%) 435.3788
0m 53s (- 55m 2s) (1200 1%) 691.8427
1m 0s (- 53m 2s) (1400 1%) 965.3614
1m 13s (- 56m 0s) (1600 2%) 1195.3542
1m 24s (- 57m 3s) (1800 2%) 1694.8058
1m 32s (- 56m 18s) (2000 2%) 2176.8437
1m 42s (- 56m 19s) (2200 2%) 2386.3550
1m 47s (- 53m 59s) (2400 3%) 2750.4694
1m 52s (- 52m 5s) (2600 3%) 2847.6209
1m 56s (- 49m 56s) (2800 3%) 3472.7681
2m 0s (- 48m 14s) (3000 4%) 4270.5136
2m 4s (- 46m 39s) (3200 4%) 4128.7411
2m 8s (- 45m 15s) (3400 4%) 4276.1327
2m 14s (- 44m 21s) (3600 4%) 4445.7848
2m 18s (- 43m 15s) (3800 5%) 4514.9044
2m 22s (- 42m 9s) (4000 5%) 4809.1805
2m 26s (- 41m 7s) (4200 5%) 5490.5834
2m 30s (- 40m 11s) (4400 5%) 5304.3302
2m 34s (- 39m 20s) (4600 6%) 5564.7890
2m 38s (- 38m 35s) (4800 6%) 6368.1662
2m 43s (- 38m 2s) (5000 6%) 7375.3547
2m 48s (- 37m 35s) (5200 6%) 6795.0613
2m 52s (- 37

Locator attempting to generate 3206823 ticks ([-29147.800000000003, ..., 612216.6]), which exceeds Locator.MAXTICKS (1000).


Still needs fixing; check the class video (something minor).

In [None]:
evaluate_randomly(encoder, decoder)