### Objective
- We will implement a sequence-to-sequence model that translates German (the source language, read by the encoder) into English (the target language, produced by the decoder).

### Dataset Used
- We use the Multi30k dataset, which contains ~30,000 parallel English, German and French sentences of ~12 words each. Only the German and English sides are used here.

### Introduction to sequence-to-sequence (seq2seq)
- **The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which typically use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector.**
- **This single vector, called the context vector, is then decoded by a second RNN, which learns to output the target (output) sentence by generating it one word at a time.**


#### Import Libraries 

In [1]:
import spacy
import time
import random
import math
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator



In [2]:
# We will use seed to get deterministic results
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
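
As a small illustration of what seeding buys us, the sketch below (standard library only, not part of the original notebook) re-seeds Python's `random` module and checks that the same numbers come out again. The helper name `draw` is made up for this example; the same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random

SEED = 1234

def draw(n, seed):
    # Re-seed the generator, then draw n pseudo-random floats
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(3, SEED)
second = draw(3, SEED)
print(first == second)  # True: same seed, same sequence
```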

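#### Load the spaCy models `de_core_news_sm` & `en_core_web_sm`
- The models must be downloaded first, e.g. with `python -m spacy download de_core_news_sm` and `python -m spacy download en_core_web_sm`.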

In [None]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

#### Create the tokenizer functions

In [4]:
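def tokenize_german(text):
    # Tokenize German text into a list of strings and reverse the token order
    # (reversing the source sentence is a trick from the original seq2seq paper)
    return [token.text for token in spacy_de.tokenizer(text)][::-1]

def tokenize_english(text):
    # Tokenize English text into a list of strings (order is kept)
    return [token.text for token in spacy_en.tokenizer(text)]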

#### Fields in torchtext: defining how the data is processed
- The `tokenize` argument sets the tokenizer for each field: German is the source field (`SOURCE`) and English is the target field (`TARGET`).
- Each field also adds the "start of sequence" and "end of sequence" tokens via the `init_token` and `eos_token` arguments, and `lower=True` converts all words to lowercase.

In [5]:
SOURCE = Field(tokenize=tokenize_german,
               init_token='<sos>',
               eos_token='<eos>',
               lower=True)
TARGET = Field(tokenize=tokenize_english,
               init_token='<sos>',
               eos_token='<eos>',
               lower=True)
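
To make the net effect of these fields concrete, here is a minimal pure-Python sketch. It is not how torchtext implements `Field` internally (there, the special tokens are added when batches are built, which is why they do not show up in the raw examples); the function `preprocess` and the whitespace tokenizer are assumptions made for illustration.

```python
def preprocess(text, reverse=False, init_token='<sos>', eos_token='<eos>'):
    # Stand-in tokenizer: whitespace split instead of spaCy (assumption)
    tokens = text.lower().split()
    if reverse:
        # Mirrors the [::-1] in tokenize_german
        tokens = tokens[::-1]
    return [init_token] + tokens + [eos_token]

print(preprocess("Zwei junge Männer", reverse=True))
# ['<sos>', 'männer', 'junge', 'zwei', '<eos>']
print(preprocess("Two young men"))
# ['<sos>', 'two', 'young', 'men', '<eos>']
```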

#### Download Multi30k dataset
- `exts` specifies which languages to use as the source and target (source goes first), and `fields` specifies which field to use for each of them.

In [6]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SOURCE, TARGET))

In [7]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


#### Build the vocabulary for the source and target languages
- With the `min_freq` argument, only tokens that appear at least 2 times in the training set are added to the vocabulary; rarer tokens are mapped to the `<unk>` token.

In [8]:
SOURCE.build_vocab(train_data, min_freq=2)
TARGET.build_vocab(train_data, min_freq=2)
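
The effect of `min_freq` can be sketched without torchtext. The helper `build_vocab` below is hypothetical and only approximates what the library does (special tokens first, then tokens by descending frequency); the toy sentences are made up.

```python
from collections import Counter

def build_vocab(sentences, min_freq=2,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    counts = Counter(tok for sent in sentences for tok in sent)
    # Special tokens come first, then kept tokens by descending frequency
    itos = list(specials)
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

sentences = [['two', 'men', 'walk'], ['two', 'dogs', 'walk'], ['a', 'cat']]
itos, stoi = build_vocab(sentences)
print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'two', 'walk']

# Tokens below the frequency threshold fall back to the <unk> index
unk = stoi['<unk>']
print([stoi.get(tok, unk) for tok in ['two', 'cats', 'walk']])  # [4, 0, 5]
```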

In [9]:
print("length of source vocabulary", len(SOURCE.vocab))
print("length of target vocabulary", len(TARGET.vocab))

length of source vocabulary 7853
length of target vocabulary 5893


In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

#### Create Iterators
- `BucketIterator` groups sentences of similar length into the same batch, which minimizes the amount of padding in both the source and target sentences.

In [11]:
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)
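
Why does grouping by length reduce padding? The sketch below counts pad tokens for a toy list of sentence lengths, first batched in arrival order and then batched after sorting. It only illustrates the idea; `BucketIterator` itself does more (for example shuffling), and `padding_needed` is a made-up helper.

```python
def padding_needed(lengths, batch_size):
    # Total number of <pad> tokens when every batch is padded
    # to the length of its longest sentence
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [5, 20, 6, 19, 7, 18, 8, 17]  # toy sentence lengths

print(padding_needed(lengths, batch_size=4))          # 52
print(padding_needed(sorted(lengths), batch_size=4))  # 12
```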

#### Building the Seq2Seq Model
**1. Encoder**
- The encoder is a 2-layer LSTM. The `Encoder` class takes the following arguments:
- `input_dim` is the size of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions.
- `hid_dim` is the dimensionality of the hidden and cell states.
- `n_layers` is the number of layers in the RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter.

In [12]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, src):
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        #embedded = [src len, batch size, emb dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        return hidden, cell
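
The shape comments in `forward` can be checked on a toy example. The sketch below uses small, arbitrary sizes (all constants are assumptions for illustration) and no dropout; it also shows that, for a unidirectional LSTM, the top layer's final hidden state is the last time step of `outputs`.

```python
import torch
import torch.nn as nn

SRC_LEN, BATCH, VOCAB, EMB_DIM, HID_DIM, N_LAYERS = 7, 3, 50, 8, 16, 2

embedding = nn.Embedding(VOCAB, EMB_DIM)
rnn = nn.LSTM(EMB_DIM, HID_DIM, N_LAYERS)

src = torch.randint(0, VOCAB, (SRC_LEN, BATCH))  # [src len, batch size]
outputs, (hidden, cell) = rnn(embedding(src))

print(outputs.shape)  # torch.Size([7, 3, 16])
print(hidden.shape)   # torch.Size([2, 3, 16])
print(cell.shape)     # torch.Size([2, 3, 16])

# The top layer's final hidden state equals the last time step of outputs
print(torch.allclose(outputs[-1], hidden[-1]))  # True
```

Only `hidden` and `cell` are returned by the encoder: together they form the context that the decoder starts from.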

**2. Decoder**
- The decoder is also a 2-layer LSTM.
- The arguments and initialization are similar to the `Encoder` class, except that we now have an `output_dim`, which is the size of the output (target) vocabulary.
- A `Linear` layer is added to make the predictions from the top-layer hidden state.
- The `forward` method accepts a batch of input tokens, the previous hidden states and the previous cell states.

In [13]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, inp, hidden, cell):
        #inp = [batch size]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        inp = inp.unsqueeze(0)  # add a sentence length dimension of 1
        #inp = [1, batch size]
        embedded = self.dropout(self.embedding(inp))
        #embedded = [1, batch size, emb dim]
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        #output = [1, batch size, hid dim]
        prediction = self.fc(output.squeeze(0))
        #prediction = [batch size, output dim]
        return prediction, hidden, cell
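
A single decoding step can be traced with toy sizes. In this sketch the layers are created standalone rather than through the `Decoder` class, dropout is left out, and the zero-filled `hidden` and `cell` tensors merely stand in for the states an encoder would provide; all sizes are assumptions.

```python
import torch
import torch.nn as nn

BATCH, VOCAB, EMB_DIM, HID_DIM, N_LAYERS = 3, 40, 8, 16, 2

embedding = nn.Embedding(VOCAB, EMB_DIM)
rnn = nn.LSTM(EMB_DIM, HID_DIM, N_LAYERS)
fc = nn.Linear(HID_DIM, VOCAB)

inp = torch.randint(0, VOCAB, (BATCH,))         # one token per sentence
hidden = torch.zeros(N_LAYERS, BATCH, HID_DIM)  # stand-in for encoder state
cell = torch.zeros(N_LAYERS, BATCH, HID_DIM)

step_in = embedding(inp.unsqueeze(0))           # [1, batch size, emb dim]
output, (hidden, cell) = rnn(step_in, (hidden, cell))
prediction = fc(output.squeeze(0))              # [batch size, vocab size]
next_token = prediction.argmax(1)               # [batch size]

print(step_in.shape, prediction.shape, next_token.shape)
```

Each row of `prediction` holds one unnormalized score per target-vocabulary entry, and `argmax(1)` picks the highest-scoring token for each sentence in the batch.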
        

**3. Seq2Seq**
- For the final part of the implementation, we'll implement the seq2seq model, which handles:
    - receiving the input sentence
    - using the encoder to produce the context vector
    - using the decoder to produce the predicted sentence

- The `Sequence2Sequence` model takes in an `Encoder`, a `Decoder` and a device.
- We have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the `Encoder` and `Decoder`.
- Our `forward` method takes the source sentence, target sentence and a teacher-forcing ratio.
- The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions.
- We then feed the input sentence into the encoder and receive the final hidden and cell states.
- The first input to the decoder is the `<sos>` token, `target[0, :]`.

During each iteration of the loop, we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$/`output`, in our tensor of predictions, $\hat{Y}$/`outputs`
- decide if we are going to "teacher force" or not
    - if we do, the next input `inp` is the ground-truth next token in the sequence, $y_{t+1}$/`target[t]`
    - if we don't, the next input `inp` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top`, which we get by doing an `argmax` over the output tensor

In [14]:
class Sequence2Sequence(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions must be Equal"
        assert encoder.n_layers == decoder.n_layers, "encoder and decoder should have equal no of layers"
        
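    def forward(self, source, target, teacher_forcing_ratio=0.5):
        # source = [src len, batch size], target = [trg len, batch size]
        batch_size = target.shape[1]
        target_len = target.shape[0]
        target_vocab_size = self.decoder.output_dim
        # tensor that stores the decoder predictions; outputs[0] stays all zeros
        outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(self.device)
        # the encoder's final hidden and cell states initialize the decoder
        hidden, cell = self.encoder(source)
        # the first input to the decoder is the <sos> token
        inp = target[0, :]

        for t in range(1, target_len):
            output, hidden, cell = self.decoder(inp, hidden, cell)
            outputs[t] = output
            # feed either the ground truth or the model's own prediction next
            teacher_force = random.random() < teacher_forcing_ratio
            top = output.argmax(1)
            inp = target[t] if teacher_force else top
        return outputs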
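
The teacher-forcing decision can be isolated from the neural network. In the toy sketch below, `toy_decoder` is a hypothetical stand-in that simply upper-cases its input, so predictions are easy to tell apart from ground-truth tokens; `decode` mirrors the loop structure of `forward` but is not part of the notebook.

```python
import random

def toy_decoder(prev_token):
    # Hypothetical stand-in for the decoder: "predicts" the input upper-cased
    return prev_token.upper()

def decode(target, teacher_forcing_ratio, seed=0):
    rng = random.Random(seed)
    inp = target[0]  # start from <sos>
    inputs_used, predictions = [], []
    for t in range(1, len(target)):
        inputs_used.append(inp)
        pred = toy_decoder(inp)
        predictions.append(pred)
        # With probability teacher_forcing_ratio, feed the ground truth next
        teacher_force = rng.random() < teacher_forcing_ratio
        inp = target[t] if teacher_force else pred
    return inputs_used, predictions

target = ['<sos>', 'two', 'men', 'walk', '<eos>']

print(decode(target, teacher_forcing_ratio=1.0)[0])
# ['<sos>', 'two', 'men', 'walk']  -> always the ground truth
print(decode(target, teacher_forcing_ratio=0.0)[0])
# ['<sos>', '<SOS>', '<SOS>', '<SOS>']  -> always the model's own output
```

With a ratio of 0 the model only ever sees its own predictions, which is the setting used at evaluation time.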

#### Training the Seq2Seq Model

In [15]:
INPUT_DIM = len(SOURCE.vocab)
OUTPUT_DIM = len(TARGET.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Sequence2Sequence(enc, dec, device).to(device)

In [16]:
model

Sequence2Sequence(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

Next, we initialize the weights of our model. The original seq2seq paper initializes all weights from a uniform distribution between -0.08 and +0.08, i.e. $\mathcal{U}(-0.08, 0.08)$.

In [17]:
def init_weights(m):
    # `model.apply` calls this function on every submodule and on the model itself
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

In [18]:
model.apply(init_weights)

Sequence2Sequence(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
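
A quick sanity check of the initialization, sketched on a small standalone LSTM rather than the full model: every parameter should lie within the range, and the number of trainable parameters can be counted along the way. The helper `count_parameters` and the toy sizes are assumptions for illustration.

```python
import torch.nn as nn

def count_parameters(model):
    # Number of trainable scalar parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

toy = nn.LSTM(4, 8, num_layers=2)
toy.apply(init_weights)

lo = min(p.min().item() for p in toy.parameters())
hi = max(p.max().item() for p in toy.parameters())

print(count_parameters(toy))        # 1024
print(-0.08 <= lo and hi <= 0.08)   # True
```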

#### Define our optimizer, which we use to update our parameters in the training loop.

In [19]:
optimizer = optim.Adam(model.parameters())

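#### Define our loss function: `CrossEntropyLoss`
- `ignore_index` is set to the index of the `<pad>` token, so padded positions do not contribute to the loss.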

In [20]:
TARGET_PAD_INDEX = TARGET.vocab.stoi[TARGET.pad_token]

In [21]:
criterion = nn.CrossEntropyLoss(ignore_index=TARGET_PAD_INDEX)
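
The effect of `ignore_index` can be verified on a toy batch: the loss with the padded position ignored should equal the plain loss computed over the non-padded positions only. The logits, targets and `PAD_IDX = 1` below are made-up values for illustration.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed padding index for this toy example

logits = torch.tensor([[2.0, 0.0, 0.0],
                       [0.0, 0.0, 3.0],
                       [5.0, 0.0, 0.0]])
targets = torch.tensor([0, 2, PAD_IDX])  # the last position is padding

# Loss that skips the padded position
masked = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits, targets)
# Plain loss over the two real positions only
manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.isclose(masked, manual).item())  # True
```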

In [22]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        # drop the first position (the <sos> slot, whose output row is all zeros), then
        # flatten with .view, since the loss expects 2D inputs with 1D targets
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
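
The `clip_grad_norm_` call rescales all gradients together when their overall norm exceeds `clip`. The pure-Python sketch below shows the arithmetic on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper, not the PyTorch function, which operates in place on parameter gradients.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # L2 norm over all gradient values taken together
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only if the norm exceeds max_norm
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)                            # 5.0
print([round(g, 3) for g in clipped])  # [0.6, 0.8]

# Gradients that are already small are left unchanged
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)                           # [0.3, 0.4]
```

The direction of the gradient is preserved; only its length is capped, which guards against the exploding gradients that RNNs are prone to.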

#### Evaluating the Seq2Seq Model

In [23]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [24]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
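
A brief usage example for the helper, repeated here so the snippet runs on its own; the timestamps are made-up values.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 125.4 seconds is 2 full minutes and 5 full seconds
print(epoch_time(0.0, 125.4))  # (2, 5)
```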

In [25]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 2m 0s
	Train Loss: 5.064 | Train PPL: 158.249
	 Val. Loss: 4.940 |  Val. PPL: 139.814
Epoch: 02 | Time: 1m 18s
	Train Loss: 4.507 | Train PPL:  90.687
	 Val. Loss: 4.823 |  Val. PPL: 124.390
Epoch: 03 | Time: 0m 34s
	Train Loss: 4.199 | Train PPL:  66.639
	 Val. Loss: 4.635 |  Val. PPL: 102.979
Epoch: 04 | Time: 0m 50s
	Train Loss: 3.973 | Train PPL:  53.135
	 Val. Loss: 4.451 |  Val. PPL:  85.702
Epoch: 05 | Time: 0m 39s
	Train Loss: 3.800 | Train PPL:  44.693
	 Val. Loss: 4.351 |  Val. PPL:  77.573
Epoch: 06 | Time: 0m 49s
	Train Loss: 3.654 | Train PPL:  38.633
	 Val. Loss: 4.221 |  Val. PPL:  68.099
Epoch: 07 | Time: 1m 9s
	Train Loss: 3.515 | Train PPL:  33.624
	 Val. Loss: 4.155 |  Val. PPL:  63.740
Epoch: 08 | Time: 1m 29s
	Train Loss: 3.379 | Train PPL:  29.350
	 Val. Loss: 4.065 |  Val. PPL:  58.289
Epoch: 09 | Time: 1m 22s
	Train Loss: 3.251 | Train PPL:  25.828
	 Val. Loss: 4.015 |  Val. PPL:  55.438
Epoch: 10 | Time: 1m 20s
	Train Loss: 3.164 | Train PPL: 

In [26]:
model.load_state_dict(torch.load('model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.906 | Test PPL:  49.694 |
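
The perplexity (PPL) reported alongside each loss is simply the exponential of the average cross-entropy loss. The sketch below recomputes it from the rounded test loss and adds one reference point: a model that guesses uniformly over the 5,893-token target vocabulary would have a perplexity equal to the vocabulary size. The function name `perplexity` is an assumption for illustration.

```python
import math

def perplexity(loss):
    # Perplexity is exp(average cross-entropy loss), with the loss in nats
    return math.exp(loss)

# Rounded test loss from the output above
print(round(perplexity(3.906), 1))            # 49.7

# Uniform guessing over a 5893-token vocabulary has loss log(5893)
print(round(perplexity(math.log(5893)), 1))   # 5893.0
```

The small difference from the printed 49.694 comes from using the loss rounded to three decimals.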
