# Seq2Seq Translation with Attention

## Introduction

<span style="font-size:1.15em;">In this notebook we build a sequence-to-sequence (Seq2Seq) model for machine translation. The model uses an encoder to read the source sentence and an attention-based decoder to generate the translation one token at a time. We train it on an Arabic-to-English translation task and then look at how it performs.</span>


In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import math
import random
import re
import time
import unicodedata

import nltk
import numpy as np
import pandas as pd

import pyarabic.araby as araby
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.nn.functional as F
from torch.utils.data import Dataset

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.metrics import bleu_score

from tqdm import tqdm
tqdm.pandas()

In [3]:
# %%script echo skipping

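device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# get_device_name raises on CPU-only machines, so only call it when a GPU is present.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))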

cuda
Tesla P100-PCIE-16GB


## Download and preprocess our data

In [4]:
#https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

nltk.download('punkt')
def tokenize_ar(text):
    return [tok for tok in nltk.tokenize.wordpunct_tokenize(unicodeToAscii(text))]

def tokenize_en(text):
    return [tok for tok in nltk.tokenize.wordpunct_tokenize(unicodeToAscii(text))]

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
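Despite its name, `unicodeToAscii` does not produce ASCII: it decomposes each character (NFD) and drops the combining marks (Unicode category `Mn`). For Arabic this removes the short-vowel diacritics, and it also strips the hamza or madda from letters such as أ and آ, because those letters decompose into a bare alef plus a combining mark. The sketch below illustrates this with the standard library only; `strip_marks` is just an illustrative rename of the helper above.

```python
import unicodedata

def strip_marks(s):
    # Same logic as unicodeToAscii: decompose, then drop nonspacing marks (Mn).
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

latin = strip_marks('caf\u00e9')                                   # accent removed
vowels = strip_marks('\u0643\u064e\u062a\u064e\u0628\u064e')       # fatha marks removed
hamza = strip_marks('\u0623')                                      # alef with hamza above
print(latin, vowels, hamza)
```

Whether folding أ, إ and آ into a plain alef is desirable depends on the task; it shrinks the source vocabulary at the cost of merging some distinct spellings.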


In [5]:
%%capture
!pip install datasets
from datasets import load_dataset
dataset = load_dataset('opus100', 'ar-en')

df=dataset['train']['translation']
df = pd.DataFrame(df)
df = df.head(100000)

<span style="font-size:1.1em;">OPUS-100 ('opus100') is a parallel corpus; its 'ar-en' configuration contains Arabic-English sentence pairs. We load it with the 'load_dataset' function from the Hugging Face 'datasets' library and keep the first 100,000 training pairs.</span>


In [6]:
def tashkeel(text):
    text = araby.strip_diacritics(text)
    text = text.strip()
    return text

def clean(text):
    # Inside a character class, | + . are literal, so pipes, plus signs and full stops are stripped too.
    return re.sub(r'[_|+\d\\\-؛،,\[\]()"/%!.:♪«»]', '', text)

abbr_dict = {
    "what's": "what is",
    "what're": "what are",
    "who's": "who is",
    "who're": "who are",
    "where's": "where is",
    "where're": "where are",
    "when's": "when is",
    "when're": "when are",
    "how's": "how is",
    "how're": "how are",
    "i'm": "i am",
    "we're": "we are",
    "you're": "you are",
    "they're": "they are",
    "it's": "it is",
    "he's": "he is",
    "she's": "she is",
    "that's": "that is",
    "there's": "there is",
    "there're": "there are",
    "i've": "i have",
    "we've": "we have",
    "you've": "you have",
    "they've": "they have",
    "who've": "who have",
    "would've": "would have",
    "not've": "not have",
    "i'll": "i will",
    "we'll": "we will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "it'll": "it will",
    "they'll": "they will",
    "isn't": "is not",
    "wasn't": "was not",
    "aren't": "are not",
    "weren't": "were not",
    "can't": "can not",
    "couldn't": "could not",
    "don't": "do not",
    "didn't": "did not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "doesn't": "does not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "won't": "will not",
    "let's": "let us",
    "here's":"here is",
    "y'all": "you all",
    "ain't": "am not",
    "dont": "do not",
    "havent":"have not",
    "cant":"can not",
    "cannot":"can not",
    "wouldnt":"would not",
    "hasnt":"has not",
    "hadnt":"had not",
    "doesnt": "does not",
    "didnt": "did not",
    "wasnt": "was not",
    "werent": "were not",
    "'cause": "because",
    "could've": "could have",
    "he'd": "he would",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "i'd": "i would",
    "i'd've": "i would have",
    "wouldn't've": "would not have",
    "won't've": "will not have",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are"
}

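def replace_abbreviations(text):
    # Normalise curly apostrophes so that the dictionary keys match.
    text = re.sub('’', '\'', text)
    text = re.sub(r'\bdon t\b', 'do not', text, flags=re.IGNORECASE)
    # Drop tokens such as "m12" or "12f".
    text = re.sub(r'\b[mf](\d+)\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(\d+)[mf]\b', '', text, flags=re.IGNORECASE)

    for word in text.split():
        # Keys in abbr_dict are lowercase, so look words up case-insensitively.
        if word.lower() in abbr_dict:
            # re.escape keeps the token from being read as a regex pattern.
            text = re.sub(r'\b{}\b'.format(re.escape(word)), abbr_dict[word.lower()], text, flags=re.IGNORECASE)
    return text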

In [7]:
df['en'] = df['en'].str.lower()
df['ar'] = df['ar'].progress_apply(tashkeel)
df['ar'] = df['ar'].progress_apply(clean)
df['en'] = df['en'].progress_apply(clean)
df['en'] = df['en'].progress_apply(replace_abbreviations)

100%|██████████| 100000/100000 [00:00<00:00, 131893.82it/s]
100%|██████████| 100000/100000 [00:00<00:00, 299897.32it/s]
100%|██████████| 100000/100000 [00:00<00:00, 303919.88it/s]
100%|██████████| 100000/100000 [00:01<00:00, 71711.86it/s]


In [8]:
df.iloc[5:10]

| | ar | en |
|---|---|---|
| 5 | مقرف | ugh disgusting |
| 6 | لا أحب ذلك | i do not like it |
| 7 | هل حصلت على جزء ? | did you get the part? |
| 8 | إتركه | leave him |
| 9 | هذا ليس من شأنك | it is none of your business |


In [9]:
train, val, test = np.split(df.sample(frac=1, random_state=42), 
                                [int(.9*len(df)), int(.95*len(df))])

print(train.shape, val.shape, test.shape, sep='\n')

(90000, 2)
(5000, 2)
(5000, 2)
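The split above shuffles the rows once and then cuts the shuffled frame at 90% and 95%. A minimal sketch of the same idea on plain indices, using only the standard library; `split_indices` is a hypothetical helper, and its shuffle will not reproduce the exact row order pandas produces with `random_state=42`.

```python
import random

def split_indices(n, cuts=(0.9, 0.95), seed=42):
    """Shuffle range(n) and cut it at the two fractional cut points."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    first, second = int(cuts[0] * n), int(cuts[1] * n)
    return idx[:first], idx[first:second], idx[second:]

train_idx, val_idx, test_idx = split_indices(100_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three slices come from a single shuffled list, no sentence pair can end up in more than one split.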


In [10]:
def yield_tokens(data_iter, src=True):
    for text in data_iter:
        if src:
            yield tokenize_ar(text)
        else:
            yield tokenize_en(text)

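# Special-token names are assumed here. "<pad>" is listed first so that it gets
# index 0, matching the zero-padding applied in collate_fn.
src_vocab = build_vocab_from_iterator(yield_tokens(iter(train['ar'])),
                                      min_freq=2, 
                                      specials=["<pad>", "<unk>", "<sos>", "<eos>"])
src_vocab.set_default_index(src_vocab["<unk>"])

trg_vocab = build_vocab_from_iterator(yield_tokens(iter(train['en']),src=False), 
                                      min_freq=2, 
                                      specials=["<pad>", "<unk>", "<sos>", "<eos>"])
trg_vocab.set_default_index(trg_vocab["<unk>"])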

In [11]:
print(len(src_vocab))
print(len(trg_vocab))

37705
18343
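To make the roles of `min_freq`, the special tokens, and the default index concrete, here is a toy stand-in for `build_vocab_from_iterator` written with the standard library. It is a simplified sketch, not torchtext's implementation, and the special-token names are assumptions chosen for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, specials=("<pad>", "<unk>", "<sos>", "<eos>")):
    """Toy vocabulary builder (special-token names are assumed)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Specials take the first indices; the rest are ordered by descending frequency.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    return {tok: i for i, tok in enumerate(itos)}

def lookup(vocab, token, default="<unk>"):
    # Mirrors set_default_index: unseen or rare tokens map to the default entry.
    return vocab.get(token, vocab[default])

corpus = [["i", "do", "not", "like", "it"], ["i", "like", "it"], ["leave", "him"]]
vocab = build_vocab(corpus)
print(vocab)
print(lookup(vocab, "like"), lookup(vocab, "him"))
```

Tokens that occur only once ("him", "leave", "do", "not") never enter the vocabulary, so at lookup time they all collapse onto the unknown-token index.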


In [12]:
def preprocess(sequence, vocab, src=True):
    if src:
        tokens = tokenize_ar(sequence.lower())
    else:
        tokens = tokenize_en(sequence.lower())

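    sequence = []
    sequence.append(vocab['<sos>'])  # start-of-sequence marker (assumed token name)
    sequence.extend([vocab[token] for token in tokens])
    sequence.append(vocab['<eos>'])  # end-of-sequence marker (assumed token name)
    # Token ids are integers, so build a long tensor rather than a float one.
    sequence = torch.tensor(sequence, dtype=torch.long)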
    return sequence

## Load our data into a DataLoader

In [13]:
class CustomDataset(Dataset):
    def __init__(self, src, trg, src_vocab, trg_vocab):
        self.src_seqs = src
        self.trg_seqs = trg
        self.num_total_seqs = len(self.src_seqs)
        self.src_vocab= src_vocab
        self.trg_vocab = trg_vocab

    def __getitem__(self, index):
        src_seq = self.src_seqs.iloc[index]
        trg_seq = self.trg_seqs.iloc[index]
        src_seq = self.preprocess(src_seq, self.src_vocab)
        trg_seq = self.preprocess(trg_seq, self.trg_vocab, src=False)
        return src_seq, trg_seq

    def __len__(self):
        return self.num_total_seqs

    def preprocess(self, sequence, vocab, src=True):
        if src:
            tokens = tokenize_ar(sequence.lower())
        else:
            tokens = tokenize_en(sequence.lower())

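        sequence = []
        sequence.append(vocab['<sos>'])  # start-of-sequence marker (assumed token name)
        sequence.extend([vocab[token] for token in tokens])
        sequence.append(vocab['<eos>'])  # end-of-sequence marker (assumed token name)
        # Token ids are integers, so build a long tensor rather than a float one.
        sequence = torch.tensor(sequence, dtype=torch.long)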
        return sequence


def collate_fn(data):
    def merge(sequences):
        lengths = [len(seq) for seq in sequences]
        padded_seqs = torch.zeros(len(sequences), max(lengths)).long()
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq[:end]
        return padded_seqs, lengths

    data.sort(key=lambda x: len(x[0]), reverse=True)
    src_seqs, trg_seqs = zip(*data)
    src_seqs, src_lengths = merge(src_seqs)
    trg_seqs, trg_lengths = merge(trg_seqs)

    return src_seqs, src_lengths, trg_seqs, trg_lengths
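What `collate_fn` does to one batch can be sketched on plain lists: sort the pairs by source length, then right-pad each side with zeros up to its longest sequence. The ids below are hypothetical, and `pad_batch` is an illustrative helper rather than part of the notebook.

```python
def pad_batch(sequences, pad_index=0):
    """Right-pad lists of token ids to the length of the longest one."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

# Hypothetical (source, target) id pairs of different lengths.
pairs = [([2, 7, 3], [2, 9, 8, 3]), ([2, 5, 6, 4, 3], [2, 9, 3])]
pairs.sort(key=lambda p: len(p[0]), reverse=True)  # longest source first
src, trg = zip(*pairs)
src_padded, src_lengths = pad_batch(list(src))
trg_padded, trg_lengths = pad_batch(list(trg))
print(src_padded, src_lengths)
print(trg_padded, trg_lengths)
```

Note that the padding value is hard-coded to 0, so the loss only ignores padded positions if the padding token really has index 0 in the target vocabulary.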

In [14]:
def get_loader(src, trg, src_vocab, trg_vocab, batch_size=128):
    dataset = CustomDataset(src, trg, src_vocab, trg_vocab)
    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=batch_size,
                                              shuffle=True,
                                              collate_fn=collate_fn)

    return data_loader

In [15]:
BATCH_SIZE = 16

In [16]:
train_loader = get_loader(train['ar'], train['en'], src_vocab, trg_vocab, batch_size=BATCH_SIZE)
val_loader = get_loader(val['ar'], val['en'], src_vocab, trg_vocab, batch_size=BATCH_SIZE)
test_loader = get_loader(test['ar'], test['en'], src_vocab, trg_vocab, batch_size=BATCH_SIZE)

# Create our Model

## Encoder

In [17]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        return outputs, hidden

## Implementing Attention

In [18]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)

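    def forward(self, hidden, encoder_outputs):
        # hidden: [batch, dec_hid_dim]; encoder_outputs: [src_len, batch, enc_hid_dim * 2]
        src_len = encoder_outputs.shape[0]
        # Repeat the decoder state once per source position: [batch, src_len, dec_hid_dim]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch, src_len, enc_hid_dim * 2]
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))  # [batch, src_len, dec_hid_dim]
        attention = self.v(energy).squeeze(2)  # [batch, src_len]
        # Normalise over source positions so that the weights sum to 1.
        return F.softmax(attention, dim=1)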

## Decoder

In [19]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        a = self.attention(hidden, encoder_outputs)
        a = a.unsqueeze(1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        return prediction, hidden.squeeze(0)

In [20]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src)
        
        input = trg[0,:]
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[t] if teacher_force else top1
        
        return outputs
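The teacher-forcing logic in `Seq2Seq.forward` draws one random number per time step, so the choice between the gold token and the model's own prediction applies to the whole batch at that step. The sketch below isolates that control flow. It is a simplification: the `predictions` list is fixed and hypothetical, whereas in the real model each prediction depends on what was fed in before.

```python
import random

def decode_inputs(gold, predictions, teacher_forcing_ratio, seed=0):
    """Return the token fed to the decoder at each step t >= 1."""
    rng = random.Random(seed)
    current = gold[0]              # decoding starts from the first target token
    fed = []
    for t in range(1, len(gold)):
        fed.append(current)
        teacher_force = rng.random() < teacher_forcing_ratio
        # Next input: the gold token if teacher forcing, else the model's own guess.
        current = gold[t] if teacher_force else predictions[t]
    return fed

gold = ["<sos>", "how", "are", "you", "<eos>"]
predictions = [None, "how", "is", "you", "<eos>"]  # hypothetical model guesses per step
always_forced = decode_inputs(gold, predictions, 1.0)
never_forced = decode_inputs(gold, predictions, 0.0)
print(always_forced)
print(never_forced)
```

With a ratio of 1.0 the decoder only ever sees gold tokens; with 0.0 (as used in `evaluate`) it has to live with its own mistakes, which is the situation it faces at inference time.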

In [21]:
INPUT_DIM = len(src_vocab)
OUTPUT_DIM = len(trg_vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.3
DEC_DROPOUT = 0.3


attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT).to(device)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn).to(device)

model = Seq2Seq(enc, dec, device).to(device)

In [22]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(37705, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(18343, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=18343, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
)

<span style="font-size:1.1em;">Our Seq2Seq model consists of an encoder and an attention-based decoder. The encoder embeds the input tokens (with dropout), runs them through a bidirectional GRU, and passes the final forward and backward hidden states through a fully connected layer to produce the decoder's initial hidden state. At each step the decoder uses attention to weight the encoder outputs, feeds the embedded previous token together with the resulting context vector into a unidirectional GRU, and predicts the next token with a fully connected layer.</span>

<div style="text-align: center;">
    <img src="general_scheme.png" alt="general_scheme" width="800"/>
</div>

[Voita, L. (2021). Sequence-to-Sequence Models with Attention.](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
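The core of the attention step can be sketched without any learned layers: given one score per source position, a softmax turns the scores into weights, and the context vector is the weighted sum of the encoder states. In the `Attention` module above the scores come from `v(tanh(attn(...)))`; here they are simply made-up numbers, and `attend` is an illustrative helper.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, encoder_states):
    """Turn one score per source position into weights and a context vector."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Hypothetical scores for a 3-token source and 2-dimensional encoder states.
scores = [2.0, 0.5, -1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(scores, encoder_states)
print(weights, context)
```

The decoder's `torch.bmm(a, encoder_outputs)` call performs the same weighted sum for every sentence in the batch at once.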


In [23]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 53,670,567 trainable parameters


# Train our model

In [24]:
optimizer = optim.Adam(model.parameters())
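# Padded target positions should not contribute to the loss ("<pad>" is an assumed token name).
criterion = nn.CrossEntropyLoss(ignore_index = trg_vocab['<pad>']).to(device)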

In [25]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = torch.transpose(batch[0], 0,1).to(device)
        trg = torch.transpose(batch[2], 0,1).to(device)
        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].reshape(-1)
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [26]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = torch.transpose(batch[0], 0, 1).to(device)
            trg = torch.transpose(batch[2], 0, 1).to(device)
            output = model(src, trg, 0)
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].reshape(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [27]:
def epoch_time(start_time, end_time):
    elapsed = end_time - start_time  # avoid shadowing the `time` module
    mins = int(elapsed / 60)
    secs = int(elapsed - (mins * 60))
    return mins, secs

In [28]:
%%time
N_EPOCHS = 25
CLIP = 1
PATIENCE = 3

best_valid_loss = float('inf')
early_stop_counter = 0
best_valid_epoch = 0

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, val_loader, criterion)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_valid_epoch = epoch
        torch.save(model.state_dict(), 'model.pt')
        early_stop_counter = 0
    else:
        early_stop_counter += 1
        if early_stop_counter >= PATIENCE:
            print(f'Early stopping after epoch {epoch+1}: no improvement for {PATIENCE} epochs.')
            break

if early_stop_counter < PATIENCE:
    print(f'Best validation loss of {best_valid_loss:.3f} at epoch {best_valid_epoch+1}.')

Epoch: 01 | Time: 21m 28s
	Train Loss: 5.630 | Train PPL: 278.648
	 Val. Loss: 5.393 |  Val. PPL: 219.891
Epoch: 02 | Time: 21m 32s
	Train Loss: 4.529 | Train PPL:  92.630
	 Val. Loss: 5.107 |  Val. PPL: 165.123
Epoch: 03 | Time: 20m 7s
	Train Loss: 3.859 | Train PPL:  47.402
	 Val. Loss: 5.102 |  Val. PPL: 164.297
Epoch: 04 | Time: 20m 13s
	Train Loss: 3.375 | Train PPL:  29.222
	 Val. Loss: 5.262 |  Val. PPL: 192.843
Epoch: 05 | Time: 20m 11s
	Train Loss: 3.061 | Train PPL:  21.341
	 Val. Loss: 5.385 |  Val. PPL: 218.133
Epoch: 06 | Time: 20m 17s
	Train Loss: 2.889 | Train PPL:  17.978
	 Val. Loss: 5.506 |  Val. PPL: 246.208
Early stopping after epoch 6: no improvement for 3 epochs.
CPU times: user 1h 50min, sys: 13min 22s, total: 2h 3min 22s
Wall time: 2h 3min 53s
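The early-stopping rule in the training loop can be replayed on the validation losses from the log above. This is a small standalone sketch of the same bookkeeping; `early_stopping` is an illustrative function, not part of the notebook.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

logged = [5.393, 5.107, 5.102, 5.262, 5.385, 5.506]  # validation losses from the log above
result = early_stopping(logged)
print(result)
```

The best epoch is the third, and three consecutive epochs without improvement end training after the sixth, so the saved `model.pt` holds the epoch-3 weights.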


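<span style="font-size:1.1em;">Perplexity (PPL) is a standard metric for language models: it is the exponential of the average per-token cross-entropy loss, which is why the training loop reports it as math.exp(loss). It can be read as the effective number of tokens the model is choosing between at each step, so lower is better. In the log above, training perplexity keeps falling while validation perplexity bottoms out at epoch 3 and then rises, a sign of overfitting; early stopping therefore keeps the epoch-3 checkpoint.</span>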
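The relation between loss and perplexity fits in a few lines. This is an illustrative standard-library example rather than the notebook's evaluation code; `perplexity` and the probabilities passed to it are made up for the demonstration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]   # per-token cross-entropy
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)

uniform = perplexity([1 / 50] * 10)     # model guessing uniformly among 50 tokens
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

A model that spreads its probability uniformly over 50 tokens has a perplexity of 50, while a model that is usually right stays close to 1. By the same formula, a loss of 5.102 corresponds to a perplexity of roughly 164.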


In [29]:
model.load_state_dict(torch.load('model.pt'))
test_loss = evaluate(model, test_loader, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 5.118 | Test PPL: 167.046 |


## The Translation Process

In [30]:
source = "كيف حالك ؟"
input = preprocess(source, src_vocab)
input = input[:,None].to(torch.int64).to(device)

In [31]:
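# The decoder runs for a fixed number of steps: source word count + 2.
target = torch.zeros(len(source.split(' ')) + 2, 1, dtype=torch.int64)
# Start decoding from the start-of-sequence token ("<sos>" is an assumed token name).
target[0, 0] = trg_vocab['<sos>']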

with torch.no_grad():
    model.eval()
    input = input.to(device)
    target = target.to(device)
    output = model(input, target, 0)
    output_dim = output.shape[-1]
    output = output[1:].view(-1, output_dim)

In [32]:
prediction = []
for i in output:
    prediction.append(torch.argmax(i).item())
tokens = trg_vocab.lookup_tokens(prediction)

In [33]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
# Strip the end-of-sequence marker if it was generated ("<eos>" is an assumed token name).
TreebankWordDetokenizer().detokenize(tokens).replace('<eos>', "").replace('"',"").strip()

'how are you ?'
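The translation above decodes for a fixed number of steps derived from the source length. A common alternative, sketched here as an assumption rather than something the notebook does, is greedy decoding that feeds each prediction back in and stops at the end-of-sequence token. `step_fn` is a hypothetical stand-in for one decoder step, and the token names are assumed.

```python
def greedy_decode(step_fn, sos="<sos>", eos="<eos>", max_len=20):
    """Feed each prediction back in until <eos> or max_len is reached."""
    tokens = []
    current = sos
    for _ in range(max_len):
        current = step_fn(current)
        if current == eos:
            break
        tokens.append(current)
    return tokens

# Hypothetical stand-in for one decoder step: a fixed lookup table.
canned = {"<sos>": "how", "how": "are", "are": "you", "you": "?", "?": "<eos>"}
decoded = greedy_decode(canned.get)
print(decoded)
```

With the real model, `step_fn` would wrap a call to the decoder plus an `argmax`, and `max_len` guards against outputs that never produce the end token.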