### Language Translation with TorchText.

This notebook is based on [pytorch tutorial](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html) tutorial. In this notebook we are going to learn how we can use `torchtext` with a well known dataset to create a model that translates sentences from German to English.

### Data Processing

In [1]:
import torchtext
import torch

from torch import nn
from torch.nn import functional as F

from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.legacy.vocab import Vocab
from torchtext.utils import download_from_url, extract_archive

import os, io, time, random

import numpy as np

### SEEDS

In [2]:
SEED = 42

np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### URLs and Paths

In [3]:
url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'

train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')


In [4]:
train_filepaths = [
      extract_archive(
          download_from_url(url_base + url)
      )[0] for url in train_urls
]
test_filepaths = [
      extract_archive(
          download_from_url(url_base + url)
      )[0] for url in test_urls
]
val_filepaths = [
      extract_archive(
          download_from_url(url_base + url)
      )[0] for url in test_urls
]

In [5]:
train_filepaths

['/content/.data/train.de', '/content/.data/train.en']

### Creating Tokenizers

In [None]:
!python -m spacy download en
!python -m spacy download de

In [7]:
de_tokenizer = get_tokenizer('spacy', language="de")
en_tokenizer = get_tokenizer('spacy', language="en")

### `build_vocab()`
This function will build the vocabulary for us.

In [8]:
def build_vocab(filepath, tokenizer):
  counter = Counter()
  with io.open(filepath, encoding="utf8") as f:
    for string_ in f:
      counter.update(tokenizer(string_))
  return Vocab(counter,specials=['<unk>', '<pad>', '<bos>', '<eos>'] )


In [9]:
de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

### `data_process()`

This function will process the data for us. This function will read the text files line by line then tokenize each line, after tokenizing then they will convert each sentence to a vector of numbers.

In [10]:
def data_process(filepaths):
  raw_de_iter = iter(io.open(filepaths[0], encoding="utf8"))
  raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
  data = []
  for raw_de, raw_en in zip(raw_de_iter, raw_en_iter):
    
    de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de)])
    en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_de)])
    
    data.append((de_tensor_, en_tensor_))
  return data
  

Creating the sets

In [11]:
train_data = data_process(train_filepaths)
test_data = data_process(test_filepaths)
val_data = data_process(val_filepaths)

val_data[:1]

[(tensor([  6,  13,  11,   7, 179, 109,   9,  17,  79,   0,   5,   4]),
  tensor([ 0,  0,  0,  0,  0,  0, 16,  0,  0,  0,  6,  5]))]

#### DataLoader

We are going to use the DataLoader class to create the dataset. DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

### Device

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Hyper params

In [13]:
BATCH_SIZE = 128
PAD_IDX = de_vocab['<pad>']
BOS_IDX = de_vocab['<bos>']
EOS_IDX = de_vocab['<eos>']

PAD_IDX, BOS_IDX, EOS_IDX

(1, 2, 3)

### Imports 2

In [14]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

### `generate_batch()`

In [15]:
def generate_batch(data_batch):
  de_batch = []
  en_batch = []
  for de_item, en_item in data_batch:
    de_batch.append(
        torch.cat([torch.tensor([BOS_IDX]),
                   de_item, torch.tensor([EOS_IDX])], dim=0
                  )
    )
    en_batch.append(
        torch.cat([torch.tensor([BOS_IDX]),
                   en_item, torch.tensor([EOS_IDX])], dim=0
                  )
    )
  
  de_batch = pad_sequence(de_batch, padding_value=PAD_IDX)
  en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
  return de_batch, en_batch

### Creating the Iterators

In [16]:
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE,
                       shuffle=True, collate_fn=generate_batch)

### Creating our model.

* The model achitecture that we will be using if found [here](https://arxiv.org/abs/1409.0473)
* The coded version of it will be found [here](https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb)


### Import 3

In [17]:
from typing import Tuple
from torch import Tensor
from torch import optim

### Encoder

In [18]:
class Encoder(nn.Module):
  def __init__(self, 
               input_dim,
               emb_dim,
               enc_hid_dim,
               dec_hid_dim,
               dropout
               ):
    super(Encoder, self).__init__()
    self.input_dim = input_dim
    self.emb_dim = emb_dim
    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim
    self.dropout = dropout

    self.embedding = nn.Embedding(input_dim, emb_dim)
    self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
    self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src):
    embedded = self.dropout(self.embedding(src))
    outputs, hidden = self.rnn(embedded)
    hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
    return outputs, hidden
  

### Attention

In [19]:
class Attention(nn.Module):
  def __init__(self,
              enc_hid_dim,
              dec_hid_dim,
              attn_dim
               ):
    super(Attention, self).__init__()

    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim
    self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
    self.attn = nn.Linear(self.attn_in, attn_dim)

  def forward(self,
              decoder_hidden,
              encoder_outputs
              ):
    src_len = encoder_outputs.shape[0]
    repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, 
                                                                 src_len
                                                                 , 1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    energy = torch.tanh(self.attn(torch.cat((
        repeated_decoder_hidden,
        encoder_outputs),
        dim = 2)))
    attention = torch.sum(energy, dim=2)
    return F.softmax(attention, dim=1)


### Decoder

In [20]:
class Decoder(nn.Module):
  def __init__(self,
               output_dim,
               emb_dim,
               enc_hid_dim,
               dec_hid_dim,
               dropout,
               attention
               ):
    super(Decoder, self).__init__()
    self.emb_dim = emb_dim
    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim
    self.output_dim = output_dim
    self.dropout = dropout
    self.attention = attention

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
    self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)
    self.dropout = nn.Dropout(dropout)

  def _weighted_encoder_rep(self,
                            decoder_hidden,
                            encoder_outputs
                            ):
    a = self.attention(decoder_hidden, encoder_outputs)
    a = a.unsqueeze(1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    weighted_encoder_rep = torch.bmm(a, encoder_outputs)
    weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)
    return weighted_encoder_rep

  def forward(self,
              input,
              decoder_hidden,
              encoder_outputs,
             ):
    input = input.unsqueeze(0)
    embedded = self.dropout(self.embedding(input))
    weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden,
                                                      encoder_outputs)
    rnn_input = torch.cat((embedded, weighted_encoder_rep), dim = 2)
    output, decoder_hidden = self.rnn(rnn_input, 
                                      decoder_hidden.unsqueeze(0)
                                      )
    embedded = embedded.squeeze(0)
    output = output.squeeze(0)
    weighted_encoder_rep = weighted_encoder_rep.squeeze(0)
    output = self.out(torch.cat((output,
                                  weighted_encoder_rep,
                                  embedded), dim = 1))

    return output, decoder_hidden.squeeze(0)
      

### Seq2Seq

In [21]:
class Seq2Seq(nn.Module):
  def __init__(self,
                 encoder,
                 decoder,
                 device):
    super(Seq2Seq, self).__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.device = device
    
  def forward(self, src, trg, teacher_forcing_ratio=.5):
    batch_size = src.shape[1]
    max_len = trg.shape[0]
    trg_vocab_size = self.decoder.output_dim
    outputs = torch.zeros(max_len, batch_size,
                          trg_vocab_size).to(self.device)
    encoder_outputs, hidden = self.encoder(src)
    # first input to the decoder is the <sos> token
    output = trg[0,:]
    for t in range(1, max_len):
      output, hidden = self.decoder(output, hidden, encoder_outputs)
      outputs[t] = output
      teacher_force = random.random() < teacher_forcing_ratio
      top1 = output.max(1)[1]
      output = (trg[t] if teacher_force else top1)
      
    return outputs
  

### Hyper parameters

We are going to use these hyper params:

```
ENC_EMB_DIM = 32
DEC_EMB_DIM = 32
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ATTN_DIM = 8
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
```

Insteady of these:
```
# ENC_EMB_DIM = 256
# DEC_EMB_DIM = 256
# ENC_HID_DIM = 512
# DEC_HID_DIM = 512
# ATTN_DIM = 64
# ENC_DROPOUT = 0.5
# DEC_DROPOUT = 0.5
```
**This is so the model will not takes long to train**.

In [22]:
INPUT_DIM = len(de_vocab)
OUTPUT_DIM = len(en_vocab)

ENC_EMB_DIM = 32
DEC_EMB_DIM = 32
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ATTN_DIM = 8
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)

attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)

dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
model


Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(19206, 32)
    (rnn): GRU(32, 64, bidirectional=True)
    (fc): Linear(in_features=128, out_features=64, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=192, out_features=8, bias=True)
    )
    (embedding): Embedding(10840, 32)
    (rnn): GRU(160, 64)
    (out): Linear(in_features=224, out_features=10840, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Initializing model weights

In [24]:
def init_weights(m: nn.Module):
  for name, param in m.named_parameters():
    if 'weight' in name:
      nn.init.normal_(param.data, mean=0, std=0.01)
    else:
      nn.init.constant_(param.data, 0)

In [25]:
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(19206, 32)
    (rnn): GRU(32, 64, bidirectional=True)
    (fc): Linear(in_features=128, out_features=64, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=192, out_features=8, bias=True)
    )
    (embedding): Embedding(10840, 32)
    (rnn): GRU(160, 64)
    (out): Linear(in_features=224, out_features=10840, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Model parameters

In [26]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 3,491,296
Total tainable parameters: 3,491,296


### Optimizer.

In [27]:
optimizer = optim.Adam(model.parameters())

### Criterion

In [28]:
PAD_IDX = en_vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX).to(device)

### Training and evaluation functions

In [29]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for _, (src, trg) in enumerate(iterator):
      src, trg = src.to(device), trg.to(device)
      optimizer.zero_grad()

      output = model(src, trg)
      output = output[1:].view(-1, output.shape[-1])
      trg = trg[1:].view(-1)

      loss = criterion(output, trg)
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

      optimizer.step()
      epoch_loss += loss.item()

  return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for _, (src, trg) in enumerate(iterator):
      src, trg = src.to(device), trg.to(device)
      output = model(src, trg, 0) #turn off teacher forcing
      output = output[1:].view(-1, output.shape[-1])
      trg = trg[1:].view(-1)
      loss = criterion(output, trg)
      epoch_loss += loss.item()

  return epoch_loss / len(iterator)

A function that will help us to visualize how long an epoch took to train.

In [30]:
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs
  

Train loop.

In [31]:
import math
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
  start_time = time.time()

  train_loss = train(model, train_iter, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iter, criterion)

  end_time = time.time()
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)

  print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
  print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
  print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

test_loss = evaluate(model, test_iter, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')


Epoch: 01 | Time: 1m 8s
	Train Loss: 2.137 | Train PPL:   8.472
	 Val. Loss: 1.403 |  Val. PPL:   4.069
Epoch: 02 | Time: 1m 7s
	Train Loss: 1.104 | Train PPL:   3.015
	 Val. Loss: 1.439 |  Val. PPL:   4.216
Epoch: 03 | Time: 1m 7s
	Train Loss: 1.076 | Train PPL:   2.933
	 Val. Loss: 1.422 |  Val. PPL:   4.144
Epoch: 04 | Time: 1m 8s
	Train Loss: 1.033 | Train PPL:   2.809
	 Val. Loss: 1.400 |  Val. PPL:   4.054
Epoch: 05 | Time: 1m 7s
	Train Loss: 1.032 | Train PPL:   2.807
	 Val. Loss: 1.393 |  Val. PPL:   4.028
Epoch: 06 | Time: 1m 8s
	Train Loss: 1.013 | Train PPL:   2.753
	 Val. Loss: 1.264 |  Val. PPL:   3.539
Epoch: 07 | Time: 1m 8s
	Train Loss: 0.905 | Train PPL:   2.473
	 Val. Loss: 1.034 |  Val. PPL:   2.812
Epoch: 08 | Time: 1m 7s
	Train Loss: 0.830 | Train PPL:   2.294
	 Val. Loss: 1.027 |  Val. PPL:   2.793
Epoch: 09 | Time: 1m 7s
	Train Loss: 0.803 | Train PPL:   2.233
	 Val. Loss: 0.960 |  Val. PPL:   2.613
Epoch: 10 | Time: 1m 7s
	Train Loss: 0.738 | Train PPL:   2.091


### Conclusion

This tutorial give me a clear vision of how we can load our custom dataset. In this tutorial we use the `Multi30K` which is familiar.

### Ref
* [Torch Turorials](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html)

### Usefull resources
* [Torch Turorials](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html)
* [SethHWeidman](https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb)


### Next

We are going to expand from this notebook and forcus on how we can load our custom dataset from files.
