# Introduction to Sequence to Sequence Models
These models are used for machine translation. Unlike the many-to-many approach, a sequence to sequence model can output a sequence whose length differs from that of the input sequence.
In this example we will learn about the encoder/decoder architecture (without attention).

### Encoder / Decoder architecture.
The sequence to sequence model is divided into an encoder network that condenses the input sequence into a hidden vector, and a decoder network that expands that hidden vector into another sequence.
![alt text](imgs/seq_to_seq_anim.gif "Sequence to Sequence")
The problem with this method is that it condenses a sequence of arbitrary length into a hidden vector of fixed size, which means some information will be lost. To mitigate this we can use the attention mechanism.
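
To make the fixed-size bottleneck concrete, here is a minimal sketch that feeds two sequences of different lengths through the same `nn.GRU`. The sizes and the random inputs are illustrative assumptions, not values used later in this notebook. Whatever the input length, the final hidden state comes out with the same shape.

```python
import torch
import torch.nn as nn

# Illustrative sizes, chosen only for this sketch
feature_size, hidden_size = 8, 16
gru = nn.GRU(feature_size, hidden_size)

def encode(seq_len):
    # Random stand-in for an embedded sentence: (seq_len, batch=1, features)
    sequence = torch.randn(seq_len, 1, feature_size)
    _, hidden = gru(sequence)
    return hidden

short_hidden = encode(3)    # a 3-step sequence
long_hidden = encode(12)    # a 12-step sequence
print(short_hidden.shape, long_hidden.shape)
```

Both hidden states have shape `[1, 1, 16]`, so the 12-step sequence has to fit into exactly as many numbers as the 3-step one.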

### References
* https://towardsdatascience.com/transformers-141e32e69591
* https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
* http://nlp.seas.harvard.edu/2018/04/03/attention.html
* https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
* https://github.com/harvardnlp/annotated-transformer
* https://arxiv.org/pdf/1902.10525.pdf
* https://distill.pub/2017/ctc/
* https://distill.pub/2019/memorization-in-rnns/
* https://jalammar.github.io/illustrated-word2vec/

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
import time
import math

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
#from torchsummary import summary

from utils_seq_to_seq import *

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Compute device:',device)

SOS_token = 0
EOS_token = 1
MAX_LENGTH = 10
hidden_size = 256
teacher_forcing_ratio = 0.5

Compute device: cpu


In [2]:
class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
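
As a quick illustration of how such a vocabulary is used, the sketch below builds one from two sentences and turns a sentence into a list of indices. `TinyLang` is a hypothetical, condensed stand-in for `Lang` so that the snippet runs on its own, and appending `EOS_token` at the end is an assumption about what the helpers in `utils_seq_to_seq` do.

```python
class TinyLang:
    """Condensed stand-in for the Lang class (hypothetical name)."""
    def __init__(self):
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # indices 0 and 1 are reserved for SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.index2word[self.n_words] = word
                self.n_words += 1

EOS_token = 1
lang = TinyLang()
lang.add_sentence("he s waiting for you at home .")
lang.add_sentence("he s a cardiologist .")

# Encode a sentence as word indices and close it with EOS (assumed convention)
words = "he s a cardiologist .".split(' ')
indexes = [lang.word2index[w] for w in words] + [EOS_token]
print(indexes)
```

Words that were already seen keep their first index, so `he`, `s` and `.` map to the same numbers in both sentences.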


### Load the dataset

In [3]:
# Convert to lowercase and simplify expressions
print(normalizeString('Hi hello world! What\'s your name?'))
# Load dataset
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
# Get some X-Y data
print(random.choice(pairs))

hi hello world ! what s your name ?
Reading lines...
Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counting words...
Counted words:
fra 4345
eng 2803
['il t attend a la maison .', 'he s waiting for you at home .']
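
`normalizeString` comes from `utils_seq_to_seq`, which is not shown in this notebook. As a rough sketch, assuming it behaves like the helper in the PyTorch seq2seq translation tutorial listed in the references, the normalization could look like this (the snake_case names are hypothetical):

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else becomes one space
    return s

print(normalize_string("Hi hello world! What's your name?"))
print(normalize_string("Il t'attend à la maison."))
```

Under these assumptions the apostrophe in `What's` turns into a space and `à` loses its accent, which is consistent with the output printed above.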


### Encoder
The encoder receives one word index from the input sequence at a time, together with its previous hidden state. It has an embedding layer, so it learns an embedding from your data during training.
#### Shapes
```
    Input shape: torch.Size([1])
    Hidden shape: torch.Size([1, 1, 256])
```
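
The shapes above can be traced with a small standalone sketch. The vocabulary size of 10 is an illustrative assumption. The hidden size of 256 matches `hidden_size` in this notebook.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 10, 256  # vocab_size is an illustrative assumption
embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

word_index = torch.tensor([3])            # one word index, shape [1]
hidden = torch.zeros(1, 1, hidden_size)   # initial hidden state

# Embedding lookup gives [1, 256]; the GRU expects (seq_len, batch, features)
embedded = embedding(word_index).view(1, 1, -1)
output, hidden = gru(embedded, hidden)
print(word_index.shape, embedded.shape, output.shape, hidden.shape)
```

The `view(1, 1, -1)` call adds the sequence and batch dimensions, both of size 1, because the encoder processes one word of one sentence per step.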

In [36]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # Cast to integer indices without moving the tensor off its device
        input = input.long()
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

### Decoder
We pull the output sequence out of the decoder, starting by giving it the hidden vector accumulated by the encoder and the SOS (start of sequence) token as input. After that, the decoder receives the last decoded word and its own hidden state.

#### Shapes
```
Input shape: torch.Size([1, 1])
Hidden shape: torch.Size([1, 1, 256])
```
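
At each step the decoder turns its scores over the output vocabulary into log-probabilities, and the most probable word becomes the next input. The plain-Python sketch below reproduces that arithmetic on made-up scores for a 4-word vocabulary. It illustrates the idea and is not the PyTorch code path used in this notebook.

```python
import math

def log_softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

scores = [1.0, 3.0, 0.5, 2.0]  # made-up scores, one per vocabulary word
log_probs = log_softmax(scores)

# Greedy choice: index of the highest log-probability
next_word = max(range(len(log_probs)), key=lambda i: log_probs[i])
print(next_word)
```

Exponentiating `log_probs` gives values that sum to 1, and the greedy pick is simply the index of the largest score.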

In [37]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        # The embedding layer learns word vectors from your training data
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):        
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

### Training Functions

In [38]:
def train(input_tensor, target_tensor, encoder, decoder, 
          encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)    

    loss = 0

    # Push input sequence into encoder and accumulate into encoder_hidden
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)        

    # Prepare Decoder input (SOS(Start of sequence) and encoder hidden state)
    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden

    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
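
The difference between the two branches is only in which token is fed back into the decoder. The sketch below lists the decoder inputs step by step for both cases. The token values and the `decoder_inputs` helper are hypothetical and exist only for this illustration.

```python
SOS_token = 0

def decoder_inputs(targets, predictions, teacher_forcing):
    """Return the token fed to the decoder at each step."""
    inputs = [SOS_token]  # the first input is always SOS
    source = targets if teacher_forcing else predictions
    inputs.extend(source[:-1])  # step t receives the token from step t-1
    return inputs

targets = [12, 7, 30, 1]      # ground truth, ending with EOS (1)
predictions = [12, 9, 30, 1]  # hypothetical model guesses, wrong at step 2

print(decoder_inputs(targets, predictions, teacher_forcing=True))
print(decoder_inputs(targets, predictions, teacher_forcing=False))
```

With teacher forcing the wrong guess `9` never reaches the decoder. Without it the mistake is fed back, so the model is exposed to its own errors during training.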

In [39]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs), input_lang, output_lang, device)
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)
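
The criterion used above, `nn.NLLLoss`, expects log-probabilities and returns the negative log-probability of the correct word. `train` sums that value over the decoding steps and divides by the target length. The plain-Python sketch below shows the same arithmetic on made-up probabilities. The `nll_loss` helper is a hypothetical stand-in, not the PyTorch implementation.

```python
import math

def nll_loss(log_probs, target_index):
    # Negative log-likelihood of the correct class
    return -log_probs[target_index]

# Made-up log-probabilities for two decoding steps over a 3-word vocabulary
step_log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
]
targets = [0, 1]  # index of the correct word at each step

total = sum(nll_loss(lp, t) for lp, t in zip(step_log_probs, targets))
average = total / len(targets)
print(round(average, 4))
```

A confident correct prediction (probability close to 1) contributes a loss close to 0, while a low probability on the correct word contributes a large loss.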

### Evaluate Functions

In [40]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence, device)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()        

        # Push input sequence into encoder and accumulate into encoder_hidden
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)            

        # Prepare Decoder input (SOS(Start of sequence) and encoder hidden state)
        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS
        decoder_hidden = encoder_hidden

        # List with output words
        decoded_words = []

        # Pull the sequence out of the decoder and append results into decoded_words
        for di in range(max_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            
            # Get first topk results (Greedy most probable word from output)
            topv, topi = decoder_output.topk(1)
            
            # Append more words to the output or finish (received EOS)
            if topi.item() == EOS_token:
                # Stop if End of sequence <EOS>
                decoded_words.append('<EOS>')
                break
            else:
                # Append word to decoded_words
                decoded_words.append(output_lang.index2word[topi.item()])

            # Put back to the input the word with highest score
            decoder_input = topi.squeeze().detach()

        return decoded_words


def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')
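
The control flow of `evaluate` can be seen in isolation with a toy decoder. In the sketch below the network is replaced by a hypothetical lookup table (`toy_next_token`) that maps the previous token to the next one, so only the greedy loop, the EOS check and the `max_length` cap remain.

```python
SOS_token, EOS_token, MAX_LENGTH = 0, 1, 10
index2word = {0: "SOS", 1: "EOS", 2: "we", 3: "re", 4: "fine"}

# Hypothetical stand-in for the decoder: previous token -> next token
toy_next_token = {0: 2, 2: 3, 3: 4, 4: 1}

def greedy_decode(max_length=MAX_LENGTH):
    decoded, token = [], SOS_token
    for _ in range(max_length):
        token = toy_next_token[token]  # the "most probable" next token
        if token == EOS_token:
            decoded.append('<EOS>')
            break
        decoded.append(index2word[token])
    return decoded

print(greedy_decode())
print(greedy_decode(max_length=2))
```

The second call shows the cap at work: if EOS is not produced within `max_length` steps, decoding stops anyway and the output carries no `<EOS>` marker.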

### Train and Evaluate

In [10]:
# Instantiate Encoder and Decoder Networks
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder1 = DecoderRNN(hidden_size, output_lang.n_words).to(device)

# Train
trainIters(encoder1, decoder1, 75000, print_every=5000)

# Evaluate
evaluateRandomly(encoder1, decoder1)

> je suis sure que tout ira bien .
= i m sure everything will be fine .
< outgoing wasting oldest bride boyfriend everybody laws research owl absorbed

> tu n es pas tres amusant .
= you re not very funny .
< outgoing wasting oldest bride boyfriend everybody laws research owl absorbed

> je vais avoir trente ans en octobre .
= i m turning thirty in october .
< south outgoing wasting cream boat propose propose boyfriend boat grounded

> nous ne sommes pas impressionnes .
= we re not impressed .
< outgoing wasting oldest bride boyfriend everybody laws research owl absorbed

> tu es distrait .
= you re forgetful .
< outgoing wasting oldest bride boyfriend everybody laws research owl absorbed

> c est un cardiologue .
= he s a cardiologist .
< outgoing wasting oldest bride boyfriend everybody laws research owl absorbed

> je suis desole si je vous ai effraye .
= i m sorry if i frightened you .
< south outgoing wasting conservative sixteen conservative facing play teaser seen

> je ne suis 

### Summary of the networks

In [None]:
a = torch.arange(10,)
print('Some tensor:',a)
values_topk, indices_topk = a.topk(2)
print('Top k values:',values_topk)