### Custom dataset + Translation.

Based on our previous notebook and othor previous notebooks we are going to try and use our custom dataset to create and train a translation model that translate text from french to english.

### Path

In [1]:
base_path = '/content/drive/MyDrive/NLP Data/seq2seq/fr-eng'

### Imports

In [2]:
import os
import torch
from torchtext.legacy import data, datasets
import json
import pandas as pd
from sklearn.model_selection import train_test_split

We have two text files for the french and english sentences with the following file names:

```py
fr = "europarl-v7.fr-en.fr"
en = "europarl-v7.fr-en.en"
```

In [3]:
fr_path = "europarl-v7.fr-en.fr"
en_path = "europarl-v7.fr-en.en"

Now let's load the text into list of strings. We are going to use the new line as the surperator of each sentence.

In [4]:
eng_sentences = open(os.path.join(base_path, en_path), encoding='utf8').read().split('\n')
fr_sentences = open(os.path.join(base_path, fr_path), encoding='utf8').read().split('\n')

### Next we will check how many examples do we have for each language.

In [5]:
print("eng: ", len(eng_sentences))
print("fr: ", len(fr_sentences))

eng:  2007724
fr:  2007724


### Creating a pandas dataframe
Creatting the pd dataframe will help us to split the sets into train and test and the convert the splitted dataframes into either `.json` or `.csv` files which are the formats that are accepted by the `torchtext`. To make this very simple Im going to use only `500` sentence french to english pairs.

In [6]:
size = 500
raw_data ={
    'eng': [sent for sent in eng_sentences[:size]],
    'fr': [sent for sent in fr_sentences[:size]],
}

dataframe = pd.DataFrame(raw_data, columns=['eng', 'fr'])

### Checking our dataframe

In [7]:
dataframe.head(4)

Unnamed: 0,eng,fr
0,Resumption of the session,Reprise de la session
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog..."
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...


### Spliting the datasets.
We are going to use `sklearn` `train_test_split` to split these two datasets for the train and validation sets.

In [8]:
train, val = train_test_split(dataframe, test_size=.005)
train, test = train_test_split(train, test_size=.005)
len(train), len(val), len(test)

(494, 3, 3)

### Creating json files.

We are going to create `json` files and save them to the `base_path` for these two sets. We will be using the `.to_json()` method to do this. 

**Note** you can also use the `.to_csv()` to create `csv` files for example:

```py
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)
```

**Note**: When you are using `.to_json()` we should pass the arg `orient="records"` so that these json files will be the files that can be accepted by the `torchtext`. Basically what this is doing is to add json files as records by removing the list `[]` brakets

In [9]:
train.to_json(os.path.join(base_path, 'train.json'), orient="records", lines=True)
val.to_json(os.path.join(base_path, 'val.json'), orient="records", lines=True)
test.to_json(os.path.join(base_path, 'test.json'), orient="records", lines=True)

Now each record has the following format:

```json
{"eng":"For us new members, it was the first time, and this was a very interesting process.","fr":"C' \u00e9tait pour nous, nouveaux d\u00e9put\u00e9s, la premi\u00e8re fois, et c' est un processus extr\u00eamement int\u00e9ressant."}
```

### Let's load the tokenizer models

In [10]:
import spacy
import spacy.cli
spacy.cli.download('fr_core_news_sm')
import fr_core_news_sm, en_core_web_sm
spacy_fr = spacy.load('fr_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('fr_core_news_sm')


In [11]:
def tokenize_fr(sent):
  sent = sent.lower()
  return [tok for tok in spacy_fr.tokenizer(sent)]

def tokenize_en(sent):
  sent = sent.lower()
  return [tok for tok in spacy_en.tokenizer(sent)]

### Creating fields

In [12]:
SRC = data.Field(
    tokenize = tokenize_fr,
    init_token = "<sos>",
     eos_token = "<eos>",
     include_lengths =True
)
TRG = data.Field(
    tokenize = tokenize_en,
    init_token = "<sos>",
     eos_token = "<eos>"
)

In [13]:
fields ={
    "fr": ("src", SRC),
    "eng": ("trg", TRG)
}

### We are now ready to create our dataset.

We are going to use the `TabularDataset.splits()` method to create the train and validation datasets.

In [14]:
train_data, test_data, val_data = data.TabularDataset.splits(
  base_path,
  format="json",
  train="train.json",
  test="test.json",
  validation= 'val.json',
  fields=fields
)

In [15]:
print(vars(train_data.examples[0]))

{'src': [je, propose, que, nous, votions, sur, la, demande, du, groupe, socialiste, visant, à, réinscrire, la, déclaration, de, la, commission, sur, ses, objectifs, stratégiques, .], 'trg': [i, propose, that, we, vote, on, the, request, of, the, group, of, the, party, of, european, socialists, that, the, commission, statement, on, its, strategic, objectives, should, be, reinstated, .]}


### Building the vocabulary
Now we are ready to build the vocabulary.

**Note** In this simple example we will build the vocab on both sets. It is recomended that _when building the vocabulary we only need to build it on the train set_.

We will be building the vocab as follows without `min_freq=2` args since our dataset is small:

**Note**: The `min_freq=2` allows us to set the minimum frequency of each word meaning a word that appears less than two times will be converted to `<unk>` token.

```py
SRC.build_vocab(train_data, val_data, max_size=1000)
TRG.build_vocab(train_data, val_data, max_size=1000)
```



In [16]:
max_size = 25000

SRC.build_vocab(train_data, max_size=max_size)
TRG.build_vocab(train_data, max_size=max_size)

In [17]:
len(SRC.vocab.itos)

16321

### Creating iterators

 Now you can create iterators and then load the iterators to the models. Again we are going to use the `BucketIterator`.

In [18]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 1

In [19]:
train_iter,val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort_key=lambda x: len(x.src)
)

### Checking the a single batch

In [20]:
batch = next(iter(train_iter))
batch.src

(tensor([[    2],
         [  273],
         [  669],
         [  877],
         [ 1194],
         [ 1657],
         [ 2302],
         [ 3472],
         [ 3743],
         [ 4236],
         [ 4654],
         [ 4986],
         [ 5225],
         [ 6099],
         [ 6721],
         [ 7004],
         [ 7693],
         [ 7909],
         [ 8246],
         [ 8495],
         [ 8819],
         [ 9418],
         [ 9592],
         [10032],
         [10685],
         [10830],
         [10926],
         [11360],
         [11537],
         [11856],
         [    3]], device='cuda:0'), tensor([31], device='cuda:0'))

In [21]:
from torch import nn
from torch.nn import functional as F
import random

### Encoder, Attention, Decoder and `Seq2Seq`

In [22]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
    super(Encoder, self).__init__()

    self.embedding = nn.Embedding(input_dim, embedding_dim=emb_dim)
    self.gru = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
    self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src, src_len):

    embedded = self.dropout(self.embedding(src)) # embedded = [src len, batch size, emb dim]
    # need to explicitly put lengths on cpu!
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len.to('cpu'),  enforce_sorted=False)
    packed_outputs, hidden = self.gru(packed_embedded)

    outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs) 
    hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
    return outputs, hidden

class Attention(nn.Module):
  def __init__(self, enc_hid_dim, dec_hid_dim):
    super(Attention, self).__init__()
    self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
    self.v = nn.Linear(dec_hid_dim, 1, bias = False)

  def forward(self, hidden, encoder_outputs, mask):

    batch_size = encoder_outputs.shape[1]
    src_len = encoder_outputs.shape[0]
    # repeat decoder hidden state src_len times
    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) # energy = [batch size, src len, dec hid dim]
    attention = self.v(energy).squeeze(2) # attention= [batch size, src len]
    attention = attention.masked_fill(mask == 0, -1e10)
    return F.softmax(attention, dim=1)

class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
    super(Decoder, self).__init__()
    self.output_dim = output_dim
    self.attention = attention

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.gru = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
    self.fc = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
    self.dropout = nn.Dropout(dropout)
        
  def forward(self, input, hidden, encoder_outputs, mask):
    input = input.unsqueeze(0) # input = [1, batch size]
    embedded = self.dropout(self.embedding(input)) # embedded = [1, batch size, emb dim]
    a = self.attention(hidden, encoder_outputs, mask)# a = [batch size, src len]
    a = a.unsqueeze(1) # a = [batch size, 1, src len]
    encoder_outputs = encoder_outputs.permute(1, 0, 2) # encoder_outputs = [batch size, src len, enc hid dim * 2]
    weighted = torch.bmm(a, encoder_outputs) # weighted = [batch size, 1, enc hid dim * 2]
    weighted = weighted.permute(1, 0, 2) # weighted = [1, batch size, enc hid dim * 2]
    rnn_input = torch.cat((embedded, weighted), dim = 2) # rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
    output, hidden = self.gru(rnn_input, hidden.unsqueeze(0))
    
    assert (output == hidden).all()
    embedded = embedded.squeeze(0)
    output = output.squeeze(0)
    weighted = weighted.squeeze(0)

    prediction = self.fc(torch.cat((output, weighted, embedded), dim = 1)) # prediction = [batch size, output dim]
    return prediction, hidden.squeeze(0), a.squeeze(1)

class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, src_pad_idx, device):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.device = device
    self.src_pad_idx = src_pad_idx
  
  def create_mask(self, src):
    mask = (src != self.src_pad_idx).permute(1, 0)
    return mask
  def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):

    trg_len, batch_size = trg.shape
    trg_vocab_size = self.decoder.output_dim
        
    # tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
    encoder_outputs, hidden = self.encoder(src, src_len)     
    # first input to the decoder is the <sos> tokens
    input = trg[0,:]
    mask = self.create_mask(src) # mask = [batch size, src len]
    for t in range(1, trg_len):
      # insert input token embedding, previous hidden state and all encoder hidden states and mask
      # receive output tensor (predictions) and new hidden state
      output, hidden, _ = self.decoder(input, hidden, encoder_outputs, mask)
      
      # place predictions in a tensor holding predictions for each token
      outputs[t] = output
      
      # decide if we are going to use teacher forcing or not
      teacher_force = random.random() < teacher_forcing_ratio
      
      # get the highest predicted token from our predictions
      top1 = output.argmax(1) 
      
      # if teacher forcing, use actual next token as next input
      # if not, use predicted token
      input = trg[t] if teacher_force else top1
    return outputs


### Creating the model instance

In [23]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = DEC_EMB_DIM = 256
ENC_HID_DIM = DEC_HID_DIM = 512
ENC_DROPOUT = DEC_DROPOUT = 0.5
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(16321, 256)
    (gru): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(14928, 256)
    (gru): GRU(1280, 512)
    (fc): Linear(in_features=1792, out_features=14928, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Initilize model weights

In [24]:
def init_weights(m):
  for name, param in m.named_parameters():
    if 'weight' in name:
        nn.init.normal_(param.data, mean=0, std=0.01)
    else:
        nn.init.constant_(param.data, 0)   
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(16321, 256)
    (gru): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(14928, 256)
    (gru): GRU(1280, 512)
    (fc): Linear(in_features=1792, out_features=14928, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Counting model parameters

In [25]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 41,198,928
Total tainable parameters: 41,198,928


### Optimizer and Loss

In [26]:
optimizer = torch.optim.Adam(model.parameters())
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

### Training and evaluation function.

In [27]:
import math

In [28]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src, src_len = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output = model(src, src_len, trg)
        """
        trg = [trg len, batch size]
        output = [trg len, batch size, output dim]
        """
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        """
        trg = [(trg len - 1) * batch size]
        output = [(trg len - 1) * batch size, output dim]
        """
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
      for i, batch in enumerate(iterator):
          src, src_len = batch.src
          trg = batch.trg
          optimizer.zero_grad()
          output = model(src, src_len, trg, 0) ## Turn off the teacher forcing ratio.
          """
          trg = [trg len, batch size]
          output = [trg len, batch size, output dim]
          """
          output_dim = output.shape[-1]
          output = output[1:].view(-1, output_dim)
          trg = trg[1:].view(-1)
          """
          trg = [(trg len - 1) * batch size]
          output = [(trg len - 1) * batch size, output dim]
          """
          loss =  criterion(output, trg)
          epoch_loss += loss.item()
    return epoch_loss / len(iterator)

### Helper function that tell us how long each epoch took

In [29]:
def epoch_time(start_time, end_time):
  elapsed_time = end_time - start_time
  elapsed_mins = int(elapsed_time / 60)
  elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  return elapsed_mins, elapsed_secs

### Running the training loop

In [30]:
import time
N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
  start_time = time.time()
  train_loss = train(model, train_iter, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, val_iter, criterion)
  end_time = time.time()
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(model.state_dict(), 'best-model.pt')
  
  print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
  print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
  print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 2m 19s
	Train Loss: 9.412 | Train PPL: 12229.659
	 Val. Loss: 9.294 |  Val. PPL: 10873.185
Epoch: 02 | Time: 2m 18s
	Train Loss: 9.481 | Train PPL: 13107.400
	 Val. Loss: 9.959 |  Val. PPL: 21148.458
Epoch: 03 | Time: 2m 19s
	Train Loss: 8.831 | Train PPL: 6846.489
	 Val. Loss: 10.588 |  Val. PPL: 39672.123
Epoch: 04 | Time: 2m 18s
	Train Loss: 7.804 | Train PPL: 2449.558
	 Val. Loss: 11.436 |  Val. PPL: 92563.089
Epoch: 05 | Time: 2m 18s
	Train Loss: 6.571 | Train PPL: 713.912
	 Val. Loss: 11.768 |  Val. PPL: 129096.973
Epoch: 06 | Time: 2m 19s
	Train Loss: 5.345 | Train PPL: 209.572
	 Val. Loss: 12.472 |  Val. PPL: 260952.143
Epoch: 07 | Time: 2m 19s
	Train Loss: 3.942 | Train PPL:  51.512
	 Val. Loss: 13.818 |  Val. PPL: 1002316.166
Epoch: 08 | Time: 2m 19s
	Train Loss: 2.691 | Train PPL:  14.749
	 Val. Loss: 14.749 |  Val. PPL: 2542320.809
Epoch: 09 | Time: 2m 19s
	Train Loss: 2.224 | Train PPL:   9.240
	 Val. Loss: 15.224 |  Val. PPL: 4091551.732
Epoch: 10 | Time

### Conclusion
As we can see that the model is `Terrible`. We will find the way to improve the training speed and training metrics as well.

### Resources used.

1. [This Blog Post](https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95)
2. [Datasets List](http://www.statmt.org/europarl/)
3. [Alen Nie](https://anie.me/On-Torchtext/)

### Extra resources
1 [Harvard](http://nlp.seas.harvard.edu/2018/04/03/attention.html)