### Frencsh t english short phrases

In this notebook we are going to create a simple model that translate senetence from french to english. We are going to leran how we can load the dataset from our local file and prepare it for training. We are in google colab, as usual we are going to use google drive as our local storage. I've a text file that contains the following data in it:

```
Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Wow!	Ça alors !
...
```

In this toy dataset we are not going to have a test dataset since the dataset is very small.

We are going to create two files:

```
fr.txt -> contains all french phrases that matches the eng phrases
en.txt -> contains all english phrases that matches the french phrases
```


In [1]:
import os 
import time
import io

### Mounting the google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [8]:

base_path = '/content/drive/MyDrive/NLP Data/seq2seq/fr-en-short-phrases'

Creating two text files one for french and the otherone for english.

In [23]:
fr_path = "fr.txt"
en_path = "en.txt"

rows = open(os.path.join(base_path, 'fra.txt'), encoding='utf8').read().split('\n')

eng = []
fra = []
for row in rows:
  try:
    en, fr = row.split('\t')
    eng.append(en)
    fra.append(fr)
  except:
    pass

with open(os.path.join(base_path, 'fr.txt'), 'w') as f:
  for line in fra:
    f.write(line+'\n')
with open(os.path.join(base_path, 'en.txt'), 'w') as f:
  for line in eng:
    f.write(line+'\n')
print("done")

done


### Imports

In [33]:
import torch
import time, os, math, random

from torch import nn
from torch.nn import functional as F

from torchtext.legacy.vocab import Vocab
import numpy as np
from torchtext.data.utils import get_tokenizer

from collections import Counter

### SEEDS

In [25]:
SEED = 42

np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### Next we will create `tokenizers`

We are going to use two tokenizers, one for `english` and the other one for `french`

In [None]:
!python -m spacy download en
!python -m spacy download fr

In [28]:
fr_tokenizer = get_tokenizer('spacy', language="fr")
en_tokenizer = get_tokenizer('spacy', language="en")

### Building the vocabulary
WE are goig to create a function called `build_vocab()` that will read data from our local files and build the vocabulary for us. We are going to set the `min_freq=1` so that the token that appears less than 1 times will be automatically converted to unknown.

In [59]:
def build_vocab(filepath, tokenizer):
  counter = Counter()

  with open(os.path.join(base_path, filepath), 'r', encoding="utf8") as f:
    for line in f:
      counter.update(tokenizer(line.lower()))
  return Vocab(counter=counter, min_freq=2, specials=('<unk>', '<pad>', '<sos>', '<eos>'), specials_first=True)

In [60]:
fr_vocab = build_vocab(fr_path, fr_tokenizer)
en_vocab = build_vocab(en_path, en_tokenizer)

### Data processing.

We are going to create a function that is called `data_process` that will tokenize each sentence and returns vectors representation of each word in a senetence.

In [61]:
def data_process(fr_path:str, en_path: str):
  raw_fr_iter = iter(io.open(os.path.join(base_path, fr_path), encoding="utf8"))
  raw_en_iter = iter(io.open(os.path.join(base_path, en_path), encoding="utf8"))
  data = []
  for raw_fr, raw_en in zip(raw_fr_iter, raw_en_iter):
    fr_tensor = torch.tensor([fr_vocab[token] for token in fr_tokenizer(raw_fr)])
    en_tensor = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en)])
    data.append([fr_tensor, en_tensor])
  return data

In [62]:
data = data_process(fr_path, en_path)

In [63]:
data[:2]

[[tensor([ 0, 42,  4]), tensor([0, 5, 4])],
 [tensor([ 0, 42,  4]), tensor([0, 5, 4])]]

### Splitting the train and validation data.

In this case we are going to use `sklearn` to split the data into train and validation sets.

In [64]:
from sklearn.model_selection import train_test_split

In [65]:
train_data, val_data = train_test_split(data, test_size=.2, random_state=SEED)

### Checking how many examples Do we have for each set.

In [66]:
from prettytable import PrettyTable
def tabulate(column_names, data):
  table = PrettyTable(column_names)
  table.title= "VISUALIZING SETS EXAMPLES"
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'r'
  for row in data:
    table.add_row(row)
  print(table)

column_names = ["SUBSET", "EXAMPLE(s)"]
row_data = [
        ["training", len(train_data)],
        ['validation', len(val_data)],
]
tabulate(column_names, row_data)

+-----------------------------+
|  VISUALIZING SETS EXAMPLES  |
+--------------+--------------+
| SUBSET       |   EXAMPLE(s) |
+--------------+--------------+
| training     |       128697 |
| validation   |        32175 |
+--------------+--------------+


### Helper functions.

Let's create a function that converts sequences to string representation.

In [None]:
def seq_to_text(seq):
  reversed_vocab_eng = dict((v, k) for (k, v) in en_vocab.stoi.items())
  reversed_vocab_fra = dict((v, k) for (k, v) in fr_vocab.stoi.items())
  en = " ".join(reversed_vocab_eng[i.item()] for i in seq[1])
  fr = " ".join(reversed_vocab_fra[i.item()] for i in seq[0])
  return en, fr

for i in range(10, 20):
  print(seq_to_text(train_data[i]))

### DataLoader

Now it's time to create our dataset from tensors using dataloader. But first let's create a device variable.

### Device

In [77]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Hyper parameters

In [78]:
BATCH_SIZE = 128
PAD_IDX = fr_vocab['<pad>']
SOS_IDX = fr_vocab['<sos>']
EOS_IDX = fr_vocab['<eos>']
PAD_IDX, SOS_IDX, EOS_IDX

(1, 2, 3)

In [79]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

### Generating bactches

We are going to create a `generate_batch` function that will generate the batch for us.


In [80]:
def generate_batch(data_batch):
  fr_batch = []
  en_batch = []
  for fr_item, en_item in data_batch:
    fr_batch.append(
        torch.cat([
            torch.tensor([SOS_IDX]),
            fr_item,
            torch.tensor([EOS_IDX])
        ], dim=0)
    )
    en_batch.append(
        torch.cat([
            torch.tensor([SOS_IDX]),
            en_item,
            torch.tensor([EOS_IDX])
        ], dim=0)
    )

  fr_batch = pad_sequence(fr_batch, padding_value=PAD_IDX)
  en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
  return fr_batch, en_batch


### Creating iterators.

Now we will then create an iterators, we only have two iterators which are the `train_iterator` and the `test_iterator`.

In [81]:
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)

### Finally we create the model.

We are going to create an `Seq2Seq` model based on [this](https://github.com/CrispenGari/pytorch-python/blob/main/09_TorchText/03_Sequence_To_Sequence/11_Language_Translation_With_Torchtext_Custom_datasetworked.ipynb) notebook.


### Encoder.

In [98]:
class Encoder(nn.Module):
  def __init__(self, 
               input_dim,
               emb_dim,
               enc_hid_dim,
               dec_hid_dim,
               dropout
               ):
    super(Encoder, self).__init__()
    self.input_dim = input_dim
    self.emb_dim = emb_dim
    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim
    self.dropout = dropout

    self.embedding = nn.Embedding(input_dim, emb_dim)
    self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
    self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src):
    embedded = self.dropout(self.embedding(src))
    outputs, hidden = self.rnn(embedded)
    hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
    return outputs, hidden

### Attention

In [99]:
class Attention(nn.Module):
  def __init__(self,
              enc_hid_dim,
              dec_hid_dim,
              attn_dim
               ):
    super(Attention, self).__init__()

    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim
    self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
    self.attn = nn.Linear(self.attn_in, attn_dim)

  def forward(self,
              decoder_hidden,
              encoder_outputs
              ):
    src_len = encoder_outputs.shape[0]
    repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    energy = torch.tanh(self.attn(torch.cat((
        repeated_decoder_hidden,
        encoder_outputs),
        dim = 2)))
    attention = torch.sum(energy, dim=2)
    return F.softmax(attention, dim=1)

### Decoder

In [100]:
class Decoder(nn.Module):
  def __init__(self,
               output_dim,
               emb_dim,
               enc_hid_dim,
               dec_hid_dim,
               dropout,
               attention
               ):
    super(Decoder, self).__init__()
    self.emb_dim = emb_dim
    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim
    self.output_dim = output_dim
    self.dropout = dropout
    self.attention = attention

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
    self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)
    self.dropout = nn.Dropout(dropout)

  def _weighted_encoder_rep(self,
                            decoder_hidden,
                            encoder_outputs
                            ):
    a = self.attention(decoder_hidden, encoder_outputs)
    a = a.unsqueeze(1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)
    weighted_encoder_rep = torch.bmm(a, encoder_outputs)
    weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)
    return weighted_encoder_rep

  def forward(self,
              input,
              decoder_hidden,
              encoder_outputs,
             ):
    input = input.unsqueeze(0)
    embedded = self.dropout(self.embedding(input))
    weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden,
                                                      encoder_outputs)
    rnn_input = torch.cat((embedded, weighted_encoder_rep), dim = 2)
    output, decoder_hidden = self.rnn(rnn_input, 
                                      decoder_hidden.unsqueeze(0)
                                      )
    embedded = embedded.squeeze(0)
    output = output.squeeze(0)
    weighted_encoder_rep = weighted_encoder_rep.squeeze(0)
    output = self.out(torch.cat((output,
                                  weighted_encoder_rep,
                                  embedded), dim = 1))

    return output, decoder_hidden.squeeze(0)

### Seq2Seq

In [101]:
class Seq2Seq(nn.Module):
  def __init__(self,
                 encoder,
                 decoder,
                 device):
    super(Seq2Seq, self).__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.device = device
    
  def forward(self, src, trg, teacher_forcing_ratio=.5):
    batch_size = src.shape[1]
    max_len = trg.shape[0]
    trg_vocab_size = self.decoder.output_dim
    outputs = torch.zeros(max_len, batch_size,
                          trg_vocab_size).to(self.device)
    encoder_outputs, hidden = self.encoder(src)
    # first input to the decoder is the <sos> token
    output = trg[0,:]
    for t in range(1, max_len):
      output, hidden = self.decoder(output, hidden, encoder_outputs)
      outputs[t] = output
      teacher_force = random.random() < teacher_forcing_ratio
      top1 = output.max(1)[1]
      output = (trg[t] if teacher_force else top1)
      
    return outputs

### Model Hyper parameters

In [112]:
# ENC_EMB_DIM = 256
# DEC_EMB_DIM = 256
# ENC_HID_DIM = 512
# DEC_HID_DIM = 512
# ATTN_DIM = 64
# ENC_DROPOUT = 0.5
# DEC_DROPOUT = 0.5

ENC_EMB_DIM = 32
DEC_EMB_DIM = 32
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ATTN_DIM = 8
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

INPUT_DIM = len(fr_vocab)
OUTPUT_DIM = len(en_vocab)

### Model Creation

In [113]:

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(15265, 32)
    (rnn): GRU(32, 64, bidirectional=True)
    (fc): Linear(in_features=128, out_features=64, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=192, out_features=8, bias=True)
    )
    (embedding): Embedding(9540, 32)
    (rnn): GRU(160, 64)
    (out): Linear(in_features=224, out_features=9540, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Initializing the weights for the `model`.

In [114]:
def init_weights(m: nn.Module):
  for name, param in m.named_parameters():
    if 'weight' in name:
      nn.init.normal_(param.data, mean=0, std=0.01)
    else:
      nn.init.constant_(param.data, 0)

### Applying the weights to the model.

In [115]:
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(15265, 32)
    (rnn): GRU(32, 64, bidirectional=True)
    (fc): Linear(in_features=128, out_features=64, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=192, out_features=8, bias=True)
    )
    (embedding): Embedding(9540, 32)
    (rnn): GRU(160, 64)
    (out): Linear(in_features=224, out_features=9540, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Model parameters.

In [116]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 3,031,084
Total tainable parameters: 3,031,084


The model have `~30M` trainable parameters.


### Next.

We are going to create the `optimizer` and `criterion`

In [117]:
optimizer = torch.optim.Adam(model.parameters())

PAD_IDX = en_vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX).to(device)

### Training and Evaluation loops.

In [118]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for _, (src, trg) in enumerate(iterator):
      src, trg = src.to(device), trg.to(device)
      optimizer.zero_grad()
      output = model(src, trg)
      output = output[1:].view(-1, output.shape[-1])
      trg = trg[1:].view(-1)

      loss = criterion(output, trg)
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

      optimizer.step()
      epoch_loss += loss.item()

  return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for _, (src, trg) in enumerate(iterator):
      src, trg = src.to(device), trg.to(device)
      output = model(src, trg, 0) #turn off teacher forcing
      output = output[1:].view(-1, output.shape[-1])
      trg = trg[1:].view(-1)
      loss = criterion(output, trg)
      epoch_loss += loss.item()

  return epoch_loss / len(iterator)

### Helper functions.

1. time to string

In [119]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

2. Tabulate training epoch

In [120]:
def tabulate_training(column_names, data, title):
  table = PrettyTable(column_names)
  table.title= title
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'r'
  table.align[column_names[2]] = 'r'
  table.align[column_names[3]] = 'r'
  for row in data:
    table.add_row(row)
  print(table)

In [121]:
N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
column_names = ["SET", "LOSS", "PPL", "ETA"]
print("TRAINING STARTS....")
for epoch in range(N_EPOCHS):
  start = time.time()
  train_loss = train(model, train_iter, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iter, criterion)
  end = time.time()
  title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} | {'saving model...' if valid_loss < best_valid_loss else 'not saving...'}" 
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(model.state_dict(), 'best-model.pt')
  rows_data =[
        ["train", f"{train_loss:.3f}", f"{math.exp(train_loss):7.3f}", hms_string(end - start) ],
        ["val", f"{valid_loss:.3f}", f"{math.exp(train_loss):7.3f}", '' ]
  ]
  tabulate_training(column_names, rows_data, title)

print("TRAINING ENDS....")


TRAINING STARTS....
+--------------------------------------+
|    EPOCH: 01/10 | saving model...    |
+-------+-------+---------+------------+
| SET   |  LOSS |     PPL |        ETA |
+-------+-------+---------+------------+
| train | 4.222 |  68.154 | 0:03:17.99 |
| val   | 4.154 |  68.154 |            |
+-------+-------+---------+------------+
+--------------------------------------+
|    EPOCH: 02/10 | saving model...    |
+-------+-------+---------+------------+
| SET   |  LOSS |     PPL |        ETA |
+-------+-------+---------+------------+
| train | 3.601 |  36.643 | 0:03:18.18 |
| val   | 3.974 |  36.643 |            |
+-------+-------+---------+------------+
+--------------------------------------+
|    EPOCH: 03/10 | saving model...    |
+-------+-------+---------+------------+
| SET   |  LOSS |     PPL |        ETA |
+-------+-------+---------+------------+
| train | 3.426 |  30.750 | 0:03:18.33 |
| val   | 3.764 |  30.750 |            |
+-------+-------+---------+----------

### Evaluating the best model

In [123]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss = evaluate(model, valid_iter, criterion)
title = "Model Evaluation Summary"
data_rows = [["Test", f'{test_loss:.3f}', f'{math.exp(test_loss):7.3f}', ""]]

tabulate_training(["SET", "LOSS", "PPL", "ETA"], data_rows, title)

+------------------------------+
|   Model Evaluation Summary   |
+------+-------+---------+-----+
| SET  |  LOSS |     PPL | ETA |
+------+-------+---------+-----+
| Test | 2.523 |  12.467 |     |
+------+-------+---------+-----+


### Model Inference

In [125]:
en_vocab.stoi["<sos>"]

en_tokenizer("this is the boy")

['this', 'is', 'the', 'boy']

In [162]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):
    model.eval()
    if isinstance(sentence, str):
        tokens = [token.lower() for token in fr_tokenizer(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = ["<sos>"] + tokens + ["<eos>"]
    src_indexes = [src_field.stoi[token] for token in tokens]

    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    with torch.no_grad():
      encoder_outputs, hidden = model.encoder(src_tensor)
    trg_indexes = [trg_field.stoi["<sos>"]]
    for i in range(max_len):
      trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
      with torch.no_grad():
        output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)
      pred_token = output.argmax(1).item()
      trg_indexes.append(pred_token)
      if pred_token == trg_field.stoi['<eos>']:
        break

    trg_tokens = [trg_field.itos[i] for i in trg_indexes]
    return trg_tokens
translate_sentence("va", fr_vocab, en_vocab, model, device, max_len = 50)

['<sos>', '<unk>', 'has', 'a', 'talented', '.', '\n', '<eos>']

### Let's make some predictions.

In [169]:
def tabulate_translations(column_names, data, title, max_characters=25):
  table = PrettyTable(column_names)
  table.title= title
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'l'

  table._max_width = {column_names[0] :max_characters, column_names[1] :max_characters}
  for row in data:
    table.add_row(row)
  print(table)

columns_names = [
    "FR (real src sentence)", "EN (translated version)"
]
title = "FRENCH to ENGLISH TRANSLATOR"

In [170]:
i = 0
with open(os.path.join(base_path, fr_path), encoding="utf8") as f:
  for index, line in enumerate(f):
    if index % 12 == 0:
      i += 1
      translation = translate_sentence(line, fr_vocab, en_vocab, model, device, max_len = 50)
      tabulate_translations(columns_names, [[line, " ".join(translation)]], title=title)
    if i == 10:
      break


+--------------------------------------------------+
|           FRENCH to ENGLISH TRANSLATOR           |
+------------------------+-------------------------+
| FR (real src sentence) | EN (translated version) |
+------------------------+-------------------------+
| Va !                   | <sos> <unk> 's !        |
|                        |  <eos>                  |
+------------------------+-------------------------+
+--------------------------------------------------+
|           FRENCH to ENGLISH TRANSLATOR           |
+------------------------+-------------------------+
| FR (real src sentence) | EN (translated version) |
+------------------------+-------------------------+
| Attendez !             | <sos> <unk> 's !        |
|                        |  <eos>                  |
+------------------------+-------------------------+
+--------------------------------------------------+
|           FRENCH to ENGLISH TRANSLATOR           |
+------------------------+--------------------

### Conclusion.

There's still a lot of things to cover. Next we will try to make accurate translation as you can see that this model did not perform well during inference.