<a href="https://colab.research.google.com/github/Mohamad-Atif1/paper2code/blob/main/Transformers/Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is made by **Eng. Mohammed Alshabrawi**

**Attention is all you need**

Transformers are the dominant neural network architecture for sequential data because they can model long-range dependencies and process sequences in parallel.

To gain a deep understanding of how they work, I built a Transformer model from scratch, implementing core components such as multi-head self-attention, positional encoding, and cross-attention.



<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:676/1*MU9no9JcYWJCeDE7zc5vsQ.png" height=500>
</div>

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import math
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
from collections import Counter
import random
import numpy as np

# Positional Encoding

Let us start with **Positional Encoding**

Transformers have no built-in sense of order (unlike RNNs), so positional encodings let the model recover the relative and absolute positions of tokens by adding sinusoidal vectors to the input token embeddings.

Alternatively, this can be seen as each input token being conditioned on its corresponding positional encoding vector. The positional encoding formula is:

$$ PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) $$

$$ PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) $$

For efficiency and numerical stability, we implement the reciprocal of the denominator with exp and ln:


$$ {10000^{\frac{-2i}{d_{\text{model}}}}} = e^{ \ln\left(10000^{\frac{-2i}{d_{\text{model}}}}\right)} $$
$$ {10000^{\frac{-2i}{d_{\text{model}}}}} = e^{ \frac{-2i}{{d_{\text{model}}}}  \ln(10000)} $$
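
As a quick sanity check of the identity above, here is a minimal pure-Python sketch (no PyTorch) that computes the frequency term both ways and builds the encoding of a single position. The helper names `div_term_direct`, `div_term_exp_log`, and `positional_encoding` are made up for this illustration, and an even `d_model` is assumed.

```python
import math

def div_term_direct(i, d_model):
    # 10000^(-2i / d_model), computed directly with a power
    return 10000.0 ** (-2 * i / d_model)

def div_term_exp_log(i, d_model):
    # the same quantity computed with exp and ln, as in the identity above
    return math.exp((-2 * i / d_model) * math.log(10000.0))

def positional_encoding(pos, d_model):
    """Encoding of one position; assumes d_model is even."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos * div_term_exp_log(i, d_model)
        pe[2 * i] = math.sin(angle)      # even index -> sine
        pe[2 * i + 1] = math.cos(angle)  # odd index  -> cosine
    return pe

d_model = 8
for i in range(d_model // 2):
    # both forms should agree up to floating-point rounding
    print(i, div_term_direct(i, d_model), div_term_exp_log(i, d_model))

pe0 = positional_encoding(0, d_model)
print(pe0)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

At position 0 every sine term is 0 and every cosine term is 1, which is an easy pattern to look for when debugging a positional-encoding table.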


In [2]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_len, embed_size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1) # shape: (max_len, 1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))

        pe[:, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices

        pe = pe.unsqueeze(0)  # add bs dim , shape: (1, max_len, embed_size)
        self.register_buffer('pe', pe) # a buffer is saved and moved with the module but is not trained

    def forward(self, x):
        # x: (batch_size, seq_len (num of tokens), embed_size)
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len]


# Multi-head attention (MHA)

The main component in Transformers is the attention layer. There are several variants of attention layers, such as Multi-head attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA). The variants are closely related: GQA and MQA share key/value projections across heads and therefore use fewer parameters than MHA.
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*r-3sWaUT4K-5ogX99hqT0A.png" height=200>

The core idea of an attention layer revolves around three matrices: **Query (Q)**, **Key (K)**, and **Value (V)**, each produced from the input embeddings by a learnable linear projection. Conceptually, a Query seeks out relevant Keys. When a Query vector is multiplied with all Key vectors (typically with a dot product to measure similarity), it generates attention scores. These scores indicate how related each Key is to the Query (that is, how relevant each token is to another token).

These raw attention scores (called `energy` in the code) are scaled by the square root of the head dimension and normalized with a softmax function, which turns them into a probability distribution: the attention weights. Finally, the attention weights are used to compute a weighted sum of the Value vectors.
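
To make the score, softmax, and weighted-sum steps concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single query. The tiny `keys`, `values`, and `query` vectors are made-up numbers, and the helper names are only for illustration; the notebook's own implementation works on batched tensors.

```python
import math

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d_k = len(query)
    # dot product of the query with every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # weighted sum of the value vectors, one output coordinate at a time
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, out = attention(query, keys, values)
print(weights)  # the first and third keys get equal, larger weights
print(out)
```

The query points in the same direction as the first key and overlaps the third, so those two keys receive more weight than the second, and the output leans toward their values.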

**Multi-Head Attention (MHA)** splits the projected embedding into several heads that are processed independently and in parallel; their outputs are then concatenated and passed through a final linear layer.
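
The head split itself is just slicing: each token's vector of size `embed_size` is cut into `heads` contiguous chunks of size `head_dim`. The sketch below shows this for one token with plain lists; `split_heads` and `merge_heads` are hypothetical helper names, and the integer "embedding" is only there to make the layout visible.

```python
def split_heads(token_embedding, heads):
    embed_size = len(token_embedding)
    assert embed_size % heads == 0, "embed_size must be divisible by heads"
    head_dim = embed_size // heads
    # chunk h holds coordinates [h * head_dim, (h + 1) * head_dim)
    return [token_embedding[h * head_dim:(h + 1) * head_dim] for h in range(heads)]

def merge_heads(chunks):
    # concatenating the chunks in order restores the original vector
    return [x for chunk in chunks for x in chunk]

token = [0, 1, 2, 3, 4, 5, 6, 7]   # embed_size = 8
chunks = split_heads(token, heads=2)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(merge_heads(chunks) == token)  # True
```

With batched tensors the same idea means viewing `(batch, tokens, embed_size)` as `(batch, tokens, heads, head_dim)` and then moving the head axis next to the batch axis, so that a token is never mixed with another token's chunk.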

In [3]:
class MultiHeadAttention(nn.Module):
    def __init__(self,embed_size,heads):
        super(MultiHeadAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads


        assert (self.head_dim * heads == embed_size), "embed_size must be divisible by heads"

        self.queries = nn.Linear(self.embed_size,self.embed_size)
        self.keys = nn.Linear(self.embed_size,self.embed_size)
        self.values = nn.Linear(self.embed_size,self.embed_size)

        self.fc = nn.Linear(self.embed_size, self.embed_size)

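    def forward(self,q,k,v,mask):
        bs, q_len, k_len = q.shape[0], q.shape[1], k.shape[1]
        q = self.queries(q)
        k = self.keys(k)
        v = self.values(v)
        # q,k,v shapes: [bs, num_of_tokens (N), heads * head_dim]
        # Split the embedding into heads, then fold the heads into the batch dim so that
        # each head is processed independently, as if there were batches of heads.
        # A plain reshape to (bs * heads, N, head_dim) would mix tokens with heads, so we
        # first expose the head axis and move it in front of the batch axis.
        # Head-major order (row = head * bs + batch) matches masks built with mask.repeat(heads, 1, 1).
        # (bs, N, heads, head_dim) -> (heads, bs, N, head_dim) -> (heads * bs, N, head_dim)
        q = q.reshape(bs, q_len, self.heads, self.head_dim).permute(2,0,1,3).reshape(self.heads*bs, q_len, self.head_dim)
        k = k.reshape(bs, k_len, self.heads, self.head_dim).permute(2,0,1,3).reshape(self.heads*bs, k_len, self.head_dim)
        v = v.reshape(bs, k_len, self.heads, self.head_dim).permute(2,0,1,3).reshape(self.heads*bs, k_len, self.head_dim)

        energy = torch.bmm(q,k.permute(0,2,1)) # Q*K^T -> (heads * bs, q_N, k_N)
        if mask is not None:
            energy = energy.masked_fill(mask==0,float("-1e20"))

        attention = torch.softmax(energy/(self.head_dim ** (0.5)), dim=2)

        out = torch.bmm(attention,v) # (heads * bs, q_N, head_dim)
        # merge the heads back: (heads, bs, q_N, head_dim) -> (bs, q_N, heads, head_dim) -> (bs, q_N, embed_size)
        out = out.reshape(self.heads, bs, q_len, self.head_dim).permute(1,2,0,3).reshape(bs, q_len, self.embed_size)
        out = self.fc(out)
        return out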




# Encoder

Each layer within the Encoder consists of two key sub-layers: a Multi-Head Attention mechanism (as described above, allowing each token to attend to all other tokens in the sequence) and a simple feed-forward neural network. After each sub-layer, a residual connection and layer normalization are applied, as shown in the figure.

Encoders are mainly used to process the input tokens.
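
The "add and norm" step after each sub-layer can be sketched in a few lines of plain Python. This is only an illustration for a single token vector: the learnable scale and shift that `nn.LayerNorm` also applies are omitted, and the input numbers are made up.

```python
import math

def layer_norm(x, eps=1e-5):
    # normalize one vector to zero mean and (approximately) unit variance
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    # residual (skip) connection first, then layer normalization
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]             # input to the sub-layer
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # hypothetical sub-layer output
y = add_and_norm(x, sublayer_out)
print(y)
```

The residual path lets the input flow around the sub-layer unchanged, while the normalization keeps the scale of activations stable from layer to layer.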

<img src="https://www.researchgate.net/publication/334288604/figure/fig1/AS:778232232148992@1562556431066/The-Transformer-encoder-structure.ppm" width=200 >

In [4]:
class EncoderBlock(nn.Module):

    def __init__(self,embed_size,heads, ff_expansion,dout=0.1):
        super(EncoderBlock,self).__init__()
        self.attention = MultiHeadAttention(embed_size,heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size,embed_size*ff_expansion),
            nn.ReLU(),
            nn.Linear(embed_size*ff_expansion,embed_size))
        self.dout = nn.Dropout(dout)


    def forward(self,q,k,v,mask):
        sub_layer_one = self.attention(q,k,v,mask)
        sub_layer_one = self.dout(sub_layer_one)
        sub_layer_one += q # skip connection
        sub_layer_one = self.norm1(sub_layer_one)

        sub_layer_two = self.feed_forward(sub_layer_one)
        sub_layer_two = self.dout(sub_layer_two)
        sub_layer_two += sub_layer_one
        sub_layer_two = self.norm2(sub_layer_two)



        return sub_layer_two


In [5]:
class Encoder(nn.Module):

    def __init__(self,vocab_size,embed_size,heads,ff_expansion,num_layers=6, dout=0.1,max_len=5000):

        super(Encoder,self).__init__()
        self.word_embedding = nn.Embedding(vocab_size,embed_size)
        self.layers = nn.ModuleList()
        self.pe = PositionalEncoding(embed_size,max_len)
        self.dout = nn.Dropout(dout)
        for i in range(num_layers):
            self.layers.append(EncoderBlock(embed_size,heads, ff_expansion,dout=dout))


    def forward(self,x,mask):
        x = self.word_embedding(x)
        x = self.dout(self.pe(x))
        for layer in self.layers:
            x = layer(x,x,x,mask)
        return x



# Decoder

The Decoder in a Transformer shares many similarities with the Encoder and also comprises a stack of identical layers. However, it is designed for generating output sequences, so each position can only attend to itself and the previous tokens. We use a causal mask to prevent the decoder from paying attention to future tokens. The causal mask is simply a lower triangular matrix.

The Decoder consists of two attention layers and a feed-forward layer. The first one is for paying attention to the output embedding or the previously generated tokens (self-attention). The next one pays attention to the output of the Encoder layer.
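
The causal mask mentioned above is easy to picture with a small example. The sketch below builds it with plain Python lists (a hypothetical `causal_mask` helper); it has the same pattern as `torch.tril(torch.ones(n, n))`.

```python
def causal_mask(seq_len):
    # row = query position, column = key position
    # a query may look at key positions up to and including its own
    return [[1 if key <= query else 0 for key in range(seq_len)]
            for query in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Positions marked 0 are the ones whose scores get replaced by a large negative number before the softmax, so they receive practically zero attention weight.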


<img src="https://miro.medium.com/v2/resize:fit:676/1*MU9no9JcYWJCeDE7zc5vsQ.png" height=400>

In [6]:

class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, ff_expansion, dout):
        super(DecoderBlock, self).__init__()
        self.norm = nn.LayerNorm(embed_size)
        self.attention = MultiHeadAttention(embed_size, heads=heads)
        self.encoder_block = EncoderBlock(
            embed_size, heads, ff_expansion,dout
        )
        self.dout = nn.Dropout(dout)

    def forward(self, x, value, key, cross_mask, trg_mask):
        self_attention = self.attention(x, x, x, trg_mask)
        self_attention = self.dout(self_attention)
        self_attention += x
        self_attention = self.norm(self_attention)
        # the output of the self-attention layer is the query for the cross-attention
        # Cross-attention:
        # key and value (from the encoder) and the query (from the decoder) are fed into an
        # EncoderBlock, which is reused here for cross-attention + feed-forward.
        out = self.encoder_block(self_attention,key,value, cross_mask)
        return out




In [7]:
class Decoder(nn.Module):
    def __init__(
        self,
        trg_vocab_size,
        embed_size,
        num_layers,
        heads,
        ff_expansion,
        dout,
        max_len,
    ):
        super(Decoder, self).__init__()
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.pe = PositionalEncoding(embed_size,max_len)
        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_size, heads, ff_expansion, dout)
                for _ in range(num_layers)
            ]
        )
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dout = nn.Dropout(dout)

    def forward(self, x, enc_out, cross_mask, trg_mask):
        x = self.word_embedding(x)
        x = self.dout(self.pe(x))
        for layer in self.layers:
            x = layer(x, enc_out, enc_out, cross_mask, trg_mask)

        out = self.fc_out(x)

        return out


# Putting it all together

In [8]:
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        forward_expansion=4,
        embed_size=512,
        num_layers=6,
        ff_expansion=4,
        heads=8,
        dout=0.1,
        max_len=5000,
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    ):
        super(Transformer, self).__init__()

        self.heads = heads
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

        self.encoder = Encoder(
            src_vocab_size,
            embed_size,
            heads,
            ff_expansion,
            num_layers,
            dout,
            max_len
        )

        self.decoder = Decoder(
            trg_vocab_size,
            embed_size,
            num_layers,
            heads,
            ff_expansion,
            dout,
            max_len,
        )

    def make_src_mask(self, src):
        N, src_len = src.shape

        pad_mask = (src != self.src_pad_idx)

        # expand to (N, src_len, src_len) to match self-attention mask shapes
        src_mask = pad_mask.unsqueeze(1).expand(N, src_len, src_len)

        # repeat for all heads: (N * heads, src_len, src_len) to match self-attention mask shapes
        src_mask = src_mask.repeat(self.heads, 1, 1)

        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape

        # Padding is appended at the end of each sequence, so for every real (non-pad)
        # query the causal mask already hides the padded keys.
        causal_mask = torch.tril(torch.ones(trg_len, trg_len)).bool()

        # repeat for all heads: (N * heads, trg_len, trg_len) to match self-attention mask shapes
        trg_mask = causal_mask.repeat(self.heads*N, 1, 1)

        return trg_mask.to(self.device)

    def make_cross_mask(self, src, trg):
        """(decoder attending to encoder)"""
        N, src_len = src.shape
        _, trg_len = trg.shape

        src_pad_mask = (src != self.src_pad_idx)

        trg_pad_mask = (trg != self.trg_pad_idx)

        # cross-attention mask: (N, trg_len, src_len)
        cross_mask = trg_pad_mask.unsqueeze(2) & src_pad_mask.unsqueeze(1)

        # repeat for all heads: (N * heads, trg_len, src_len) to match the cross-attention score shape
        cross_mask = cross_mask.repeat(self.heads, 1, 1)

        return cross_mask.to(self.device)

    def forward(self, src, trg):
        # Create masks
        src_mask = self.make_src_mask(src)           # For encoder self-attention
        trg_mask = self.make_trg_mask(trg)           # For decoder self-attention
        cross_mask = self.make_cross_mask(src, trg)  # For decoder cross-attention

        # Forward pass
        enc_src = self.encoder(src, src_mask)

        # Pass cross_mask to decoder instead of src_mask
        out = self.decoder(trg, enc_src, cross_mask, trg_mask)
        return out
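
To see what the padding-based masks look like, here is a minimal pure-Python sketch of the cross-attention mask for one sentence pair. The token ids are made up, `0` is assumed to be the padding index as in this notebook, and `pad_mask` / `cross_mask` are hypothetical stand-ins for the tensor code above.

```python
def pad_mask(tokens, pad_idx=0):
    # True for real tokens, False for padding
    return [tok != pad_idx for tok in tokens]

def cross_mask(src, trg, pad_idx=0):
    src_keep = pad_mask(src, pad_idx)
    trg_keep = pad_mask(trg, pad_idx)
    # entry [t][s] is True only if target token t and source token s are both real
    return [[t and s for s in src_keep] for t in trg_keep]

src = [1, 5, 2, 0]   # last source position is padding
trg = [1, 7, 0]      # last target position is padding
m = cross_mask(src, trg)
for row in m:
    print(row)
# [True, True, True, False]
# [True, True, True, False]
# [False, False, False, False]
```

The last row is fully masked because that target position is itself padding. Since the masked scores are filled with a finite `-1e20` rather than negative infinity, the softmax over such a row stays well defined, and the loss skips those positions through `ignore_index=0`.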

In [9]:
# Test code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(
    device
)
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)

src_pad_idx = 0
trg_pad_idx = 0
src_vocab_size = 80
trg_vocab_size = 80
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(
    device
)
out = model(x, trg)
print(out.shape)

cuda
torch.Size([2, 8, 80])


# Let us test it on a Machine Translation task!

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
import math
import random
import numpy as np
from collections import Counter
import pickle



In [None]:
class Vocabulary:
    def __init__(self):
        self.word2idx = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
        self.idx2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.word_count = Counter()

    def add_sentence(self, sentence):
        for word in sentence.split():
            self.word_count[word] += 1
            if word not in self.word2idx:
                idx = len(self.word2idx)
                self.word2idx[word] = idx
                self.idx2word[idx] = word

    def build_vocab(self, sentences, min_freq=1):
        for sentence in sentences:
            for word in sentence.split():
                self.word_count[word] += 1

        for word, count in self.word_count.items():
            if count >= min_freq and word not in self.word2idx:
                idx = len(self.word2idx)
                self.word2idx[word] = idx
                self.idx2word[idx] = word

    def sentence_to_indices(self, sentence):
        return [self.word2idx.get(word, self.word2idx["<unk>"]) for word in sentence.split()]

    def indices_to_sentence(self, indices):
        return " ".join([self.idx2word[idx] for idx in indices if idx not in [0, 1, 2]])

    def __len__(self):
        return len(self.word2idx)
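
A small illustration of how the lookup behaves for unseen words may help here. This is a stripped-down sketch using a plain dictionary instead of the `Vocabulary` class, with two made-up training sentences; the `encode` helper is hypothetical.

```python
# the four special tokens take the first four indices
word2idx = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}

# build the vocabulary from the (toy) training sentences only
for sentence in ["i am a student", "i am a teacher"]:
    for word in sentence.split():
        if word not in word2idx:
            word2idx[word] = len(word2idx)

def encode(sentence):
    # words that were never seen during vocabulary building map to <unk>
    return [word2idx.get(word, word2idx["<unk>"]) for word in sentence.split()]

print(encode("i am a doctor"))  # [4, 5, 6, 3]
```

Any test sentence containing words outside the training vocabulary is therefore partly encoded as `<unk>`, which is worth keeping in mind when reading test-set metrics on a very small corpus.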

In [None]:
class TranslationDataset(Dataset):
    def __init__(self, src_sentences, tgt_sentences, src_vocab, tgt_vocab, max_len=100):
        self.src_sentences = src_sentences
        self.tgt_sentences = tgt_sentences
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.src_sentences)

    def __getitem__(self, idx):
        src_sentence = self.src_sentences[idx]
        tgt_sentence = self.tgt_sentences[idx]

        src_indices = [self.src_vocab.word2idx["<sos>"]] + \
                     self.src_vocab.sentence_to_indices(src_sentence) + \
                     [self.src_vocab.word2idx["<eos>"]]

        tgt_indices = [self.tgt_vocab.word2idx["<sos>"]] + \
                     self.tgt_vocab.sentence_to_indices(tgt_sentence) + \
                     [self.tgt_vocab.word2idx["<eos>"]]

        src_indices = src_indices[:self.max_len]
        tgt_indices = tgt_indices[:self.max_len]

        return torch.tensor(src_indices), torch.tensor(tgt_indices)

def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=0)
    tgt_batch = pad_sequence(tgt_batch, batch_first=True, padding_value=0)
    return src_batch, tgt_batch
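
For intuition, the padding done in `collate_fn` can be mimicked with plain lists. The sketch below is a hypothetical `pad_batch` helper that right-pads every sequence to the longest one in the batch, which is the behaviour of `pad_sequence(..., batch_first=True, padding_value=0)` on 1-D tensors.

```python
def pad_batch(sequences, padding_value=0):
    # pad every sequence on the right up to the longest length in the batch
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

batch = [[1, 4, 5, 2], [1, 6, 2]]
padded = pad_batch(batch)
print(padded)  # [[1, 4, 5, 2], [1, 6, 2, 0]]
```

Because the padding value equals the `<pad>` index (0), the padding masks and `CrossEntropyLoss(ignore_index=0)` can both identify these positions later.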

In [None]:
def create_translation_data():
    """Create 110 English-Arabic translation pairs for training, plus 20 test pairs"""

    # Training data - 110 pairs
    en_train = [
        # Greetings and basic conversation
        "hello how are you today",
        "good morning my friend",
        "good evening everyone",
        "have a nice day",
        "see you later",
        "goodbye and take care",
        "nice to meet you",
        "how have you been",
        "i hope you are well",
        "thank you very much",

        # Personal information
        "what is your name",
        "my name is ahmed",
        "my name is sara",
        "my name is mohammed",
        "where are you from",
        "i am from egypt",
        "i am from saudi arabia",
        "i am from jordan",
        "i am from lebanon",
        "i live in cairo",

        # Age and family
        "how old are you",
        "i am twenty years old",
        "i am thirty years old",
        "i am a young man",
        "i am a young woman",
        "do you have siblings",
        "i have two brothers",
        "i have one sister",
        "i have a big family",
        "my family is small",

        # Occupation and education
        "what do you do",
        "i am a student",
        "i am a teacher",
        "i am a doctor",
        "i am an engineer",
        "where do you work",
        "i work in a hospital",
        "i work in a school",
        "i study at university",
        "i study medicine",

        # Hobbies and interests
        "what do you like",
        "i like reading books",
        "i like watching movies",
        "i love playing football",
        "i enjoy listening to music",
        "do you like sports",
        "yes i love sports",
        "i play basketball",
        "i like swimming",
        "i enjoy cooking",

        # Weather and time
        "how is the weather",
        "the weather is nice",
        "it is sunny today",
        "it is raining outside",
        "it is very hot",
        "it is cold today",
        "what time is it",
        "it is three o clock",
        "it is morning time",
        "it is evening now",

        # Food and drink
        "are you hungry",
        "yes i am hungry",
        "what do you want to eat",
        "i want to eat rice",
        "i like arabic food",
        "do you want some tea",
        "yes please give me tea",
        "i prefer coffee",
        "the food is delicious",
        "i am thirsty",

        # Places and directions
        "where is the mosque",
        "the mosque is near here",
        "where is the hospital",
        "go straight then turn right",
        "the school is far",
        "the market is close",
        "i want to go home",
        "where do you live",
        "i live in this city",
        "the house is big",

        # Daily activities
        "what are you doing",
        "i am reading a book",
        "i am watching television",
        "i am going to sleep",
        "i wake up early",
        "i go to work",
        "i study every day",
        "i help my mother",
        "i play with friends",
        "i eat breakfast",

        # Shopping and money
        "how much does this cost",
        "this costs ten dollars",
        "it is very expensive",
        "it is cheap",
        "where can i buy this",
        "you can buy it here",
        "i need to go shopping",
        "do you have money",
        "i have some money",
        "i want to buy clothes",

        # Travel and transportation
        "i want to travel",
        "where do you want to go",
        "i want to visit mecca",
        "how do you go to work",
        "i go by car",
        "i take the bus",
        "the plane is fast",
        "the trip was long",
        "i love traveling",
        "when will you return"
    ]

    ar_train = [
        # Greetings and basic conversation
        "مرحباً كيف حالك اليوم",
        "صباح الخير يا صديقي",
        "مساء الخير للجميع",
        "أتمنى لك يوماً سعيداً",
        "أراك لاحقاً",
        "وداعاً واعتن بنفسك",
        "سررت بلقائك",
        "كيف كان حالك",
        "أتمنى أن تكون بخير",
        "شكراً لك جزيلاً",

        # Personal information
        "ما اسمك",
        "اسمي أحمد",
        "اسمي سارة",
        "اسمي محمد",
        "من أين أنت",
        "أنا من مصر",
        "أنا من السعودية",
        "أنا من الأردن",
        "أنا من لبنان",
        "أعيش في القاهرة",

        # Age and family
        "كم عمرك",
        "عمري عشرون سنة",
        "عمري ثلاثون سنة",
        "أنا شاب",
        "أنا شابة",
        "هل لديك إخوة",
        "لدي أخوان",
        "لدي أخت واحدة",
        "لدي عائلة كبيرة",
        "عائلتي صغيرة",

        # Occupation and education
        "ماذا تعمل",
        "أنا طالب",
        "أنا مدرس",
        "أنا طبيب",
        "أنا مهندس",
        "أين تعمل",
        "أعمل في مستشفى",
        "أعمل في مدرسة",
        "أدرس في الجامعة",
        "أدرس الطب",

        # Hobbies and interests
        "ماذا تحب",
        "أحب قراءة الكتب",
        "أحب مشاهدة الأفلام",
        "أحب لعب كرة القدم",
        "أستمتع بسماع الموسيقى",
        "هل تحب الرياضة",
        "نعم أحب الرياضة",
        "ألعب كرة السلة",
        "أحب السباحة",
        "أستمتع بالطبخ",

        # Weather and time
        "كيف الطقس",
        "الطقس جميل",
        "الجو مشمس اليوم",
        "الجو ممطر بالخارج",
        "الجو حار جداً",
        "الجو بارد اليوم",
        "كم الساعة",
        "الساعة الثالثة",
        "الوقت صباح",
        "الوقت مساء الآن",

        # Food and drink
        "هل أنت جائع",
        "نعم أنا جائع",
        "ماذا تريد أن تأكل",
        "أريد أن آكل رز",
        "أحب الطعام العربي",
        "هل تريد بعض الشاي",
        "نعم من فضلك أعطني شاي",
        "أفضل القهوة",
        "الطعام لذيذ",
        "أنا عطشان",

        # Places and directions
        "أين المسجد",
        "المسجد قريب من هنا",
        "أين المستشفى",
        "اذهب مستقيماً ثم انعطف يميناً",
        "المدرسة بعيدة",
        "السوق قريب",
        "أريد أن أذهب للبيت",
        "أين تسكن",
        "أسكن في هذه المدينة",
        "البيت كبير",

        # Daily activities
        "ماذا تفعل",
        "أقرأ كتاباً",
        "أشاهد التلفزيون",
        "سأذهب للنوم",
        "أستيقظ مبكراً",
        "أذهب للعمل",
        "أدرس كل يوم",
        "أساعد أمي",
        "ألعب مع الأصدقاء",
        "آكل الفطور",

        # Shopping and money
        "كم يكلف هذا",
        "هذا يكلف عشرة دولارات",
        "إنه غالي جداً",
        "إنه رخيص",
        "أين يمكنني شراء هذا",
        "يمكنك شراؤه هنا",
        "أحتاج للذهاب للتسوق",
        "هل لديك مال",
        "لدي بعض المال",
        "أريد شراء ملابس",

        # Travel and transportation
        "أريد أن أسافر",
        "أين تريد أن تذهب",
        "أريد زيارة مكة",
        "كيف تذهب للعمل",
        "أذهب بالسيارة",
        "آخذ الحافلة",
        "الطائرة سريعة",
        "الرحلة كانت طويلة",
        "أحب السفر",
        "متى ستعود"
    ]

    # Test data - 20 pairs
    en_test = [
        "hello my name is omar",
        "i am a student at university",
        "the weather is beautiful today",
        "i want to drink some water",
        "where is the nearest restaurant",
        "i like to read books in arabic",
        "how much does this book cost",
        "i am going to visit my family",
        "do you speak english well",
        "what time does the store open",
        "i need to buy some groceries",
        "the city is very crowded",
        "i work as a software engineer",
        "can you help me please",
        "i am learning arabic language",
        "the hospital is on the main street",
        "i have been living here for two years",
        "what is your favorite food",
        "i want to travel to dubai",
        "thank you for your help"
    ]

    ar_test = [
        "مرحباً اسمي عمر",
        "أنا طالب في الجامعة",
        "الطقس جميل اليوم",
        "أريد أن أشرب بعض الماء",
        "أين أقرب مطعم",
        "أحب قراءة الكتب بالعربية",
        "كم يكلف هذا الكتاب",
        "سأذهب لزيارة عائلتي",
        "هل تتكلم الإنجليزية جيداً",
        "متى يفتح المتجر",
        "أحتاج لشراء بعض البقالة",
        "المدينة مزدحمة جداً",
        "أعمل كمهندس برمجيات",
        "هل يمكنك مساعدتي من فضلك",
        "أتعلم اللغة العربية",
        "المستشفى في الشارع الرئيسي",
        "أعيش هنا منذ سنتين",
        "ما طعامك المفضل",
        "أريد السفر إلى دبي",
        "شكراً لك على مساعدتك"
    ]

    return en_train, ar_train, en_test, ar_test

In [None]:
def evaluate_model(model, test_loader, src_vocab, tgt_vocab, device):
    """Evaluate model on test data"""
    model.eval()
    total_loss = 0
    num_samples = 0
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    with torch.no_grad():
        for src, tgt in test_loader:
            src, tgt = src.to(device), tgt.to(device)

            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]

            output = model(src, tgt_input)

            output = output.reshape(-1, output.size(-1))
            tgt_output = tgt_output.reshape(-1)

            loss = criterion(output, tgt_output)
            total_loss += loss.item()
            num_samples += 1

    avg_loss = total_loss / num_samples
    perplexity = math.exp(avg_loss)

    return avg_loss, perplexity

def translate_sentence(model, sentence, src_vocab, tgt_vocab, device, max_len=50):
    """Translate a single sentence"""
    model.eval()

    with torch.no_grad():
        src_indices = [src_vocab.word2idx["<sos>"]] + \
                     src_vocab.sentence_to_indices(sentence) + \
                     [src_vocab.word2idx["<eos>"]]
        src_tensor = torch.tensor(src_indices).unsqueeze(0).to(device)

        tgt_indices = [tgt_vocab.word2idx["<sos>"]]

        for _ in range(max_len):
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)
            output = model(src_tensor, tgt_tensor)
            next_token = output[0, -1, :].argmax().item()
            tgt_indices.append(next_token)

            if next_token == tgt_vocab.word2idx["<eos>"]:
                break

        translated = tgt_vocab.indices_to_sentence(tgt_indices[1:])
        return translated

def train_model_with_validation(model, train_loader, test_loader, optimizer, criterion,
                              src_vocab, tgt_vocab, device, num_epochs=50):
    """Train model with validation"""
    model.train()

    train_losses = []
    test_losses = []

    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        num_batches = 0

        for batch_idx, (src, tgt) in enumerate(train_loader):
            src, tgt = src.to(device), tgt.to(device)

            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]

            optimizer.zero_grad()
            output = model(src, tgt_input)

            output = output.reshape(-1, output.size(-1))
            tgt_output = tgt_output.reshape(-1)

            loss = criterion(output, tgt_output)
            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            total_train_loss += loss.item()
            num_batches += 1

        avg_train_loss = total_train_loss / num_batches
        train_losses.append(avg_train_loss)

        test_loss, perplexity = evaluate_model(model, test_loader, src_vocab, tgt_vocab, device)
        test_losses.append(test_loss)

        print(f'Epoch {epoch+1}/{num_epochs}:')
        print(f'  Train Loss: {avg_train_loss:.4f}')
        print(f'  Test Loss: {test_loss:.4f}')
        print(f'  Perplexity: {perplexity:.2f}')

        # test some translations every 10 epochs
        if (epoch + 1) % 10 == 0:
            print("\nSample translations:")
            test_sentences = ["hello how are you", "i am a student", "what is your name"]
            for sentence in test_sentences:
                translation = translate_sentence(model, sentence, src_vocab, tgt_vocab, device)
                print(f"EN: {sentence} -> AR: {translation}")
            print()

    return train_losses, test_losses
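
Two details in the training and evaluation loops are worth a tiny worked example: the one-token shift between `tgt_input` and `tgt_output` (teacher forcing), and perplexity as the exponential of the average cross-entropy loss. The token ids below are made up, and 196 is the target vocabulary size reported in the training log.

```python
import math

tgt = [1, 7, 4, 2]        # <sos> w1 w2 <eos> (hypothetical ids)
tgt_input = tgt[:-1]      # what the decoder sees:     [1, 7, 4]
tgt_output = tgt[1:]      # what it should predict:    [7, 4, 2]

# at each step the decoder reads one token and is trained to emit the next one
pairs = list(zip(tgt_input, tgt_output))
print(pairs)  # [(1, 7), (7, 4), (4, 2)]

def perplexity(avg_loss):
    return math.exp(avg_loss)

# a model that guesses uniformly over V tokens has loss ln(V), i.e. perplexity V
V = 196
print(perplexity(math.log(V)))
```

A perplexity close to the vocabulary size therefore means the model is doing little better than a uniform guess on that data.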


In [10]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

en_train, ar_train, en_test, ar_test = create_translation_data()

print(f"Training samples: {len(en_train)}")
print(f"Test samples: {len(en_test)}")

# Build vocabularies on training data only
src_vocab = Vocabulary()
tgt_vocab = Vocabulary()

src_vocab.build_vocab(en_train, min_freq=1)
tgt_vocab.build_vocab(ar_train, min_freq=1)

print(f"Source vocabulary size: {len(src_vocab)}")
print(f"Target vocabulary size: {len(tgt_vocab)}")

train_dataset = TranslationDataset(en_train, ar_train, src_vocab, tgt_vocab)
test_dataset = TranslationDataset(en_test, ar_test, src_vocab, tgt_vocab)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_fn)

src_pad_idx = 0
trg_pad_idx = 0

model = Transformer(len(src_vocab), len(tgt_vocab), src_pad_idx, trg_pad_idx).to(device)

optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss(ignore_index=0)

print("Ready to train with expanded dataset!")

train_losses, test_losses = train_model_with_validation(
    model, train_loader, test_loader, optimizer, criterion,
    src_vocab, tgt_vocab, device, num_epochs=50
)

Using device: cuda
Training samples: 110
Test samples: 20
Source vocabulary size: 182
Target vocabulary size: 196
Ready to train with expanded dataset!
Epoch 1/50:
  Train Loss: 4.3911
  Test Loss: 4.9349
  Perplexity: 139.06
Epoch 2/50:
  Train Loss: 3.2054
  Test Loss: 4.6247
  Perplexity: 101.97
Epoch 3/50:
  Train Loss: 2.4582
  Test Loss: 4.6906
  Perplexity: 108.92
Epoch 4/50:
  Train Loss: 1.9074
  Test Loss: 4.6410
  Perplexity: 103.65
Epoch 5/50:
  Train Loss: 1.5615
  Test Loss: 4.8004
  Perplexity: 121.56
Epoch 6/50:
  Train Loss: 1.3368
  Test Loss: 4.9773
  Perplexity: 145.08
Epoch 7/50:
  Train Loss: 1.1771
  Test Loss: 4.7552
  Perplexity: 116.19
Epoch 8/50:
  Train Loss: 1.1373
  Test Loss: 4.9644
  Perplexity: 143.22
Epoch 9/50:
  Train Loss: 0.9428
  Test Loss: 4.9482
  Perplexity: 140.92
Epoch 10/50:
  Train Loss: 0.9236
  Test Loss: 5.0070
  Perplexity: 149.45

Sample translations:
EN: hello how are you -> AR: يمكنك شراؤه هنا
EN: i am a student -> AR: سررت
EN: what 

In [11]:
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)

trg_pad_mask = (trg != 0).unsqueeze(1).unsqueeze(2)
trg_pad_mask = trg_pad_mask.expand(2, 1, 8, 8)

In [12]:
trg_pad_mask = (trg != 0).unsqueeze(1).unsqueeze(2)
trg_pad_mask.shape

torch.Size([2, 1, 1, 8])

In [13]:
print("\nTesting translations:")
test_sentences = [
    "انا طالب",
    "what is your name",
    "i am a student"
]

for sentence in test_sentences:
    translation = translate_sentence(model, sentence, src_vocab, tgt_vocab, device)
    print(f"EN: {sentence}")
    print(f"AR: {translation}")
    print()




Testing translations:
EN: انا طالب
AR: أحب قراءة الكتب

EN: what is your name
AR: كم الساعة

EN: i am a student
AR: أتمنى



Ref:

[Original paper](https://arxiv.org/abs/1706.03762)

[Jay Alammar blog](https://jalammar.github.io/illustrated-transformer/)