**NORMALIZATION AND TOKENIZATION**

In [17]:

# Full Urdu Text Normalization & Tokenization
# ============================================

from google.colab import files
import pandas as pd
from sklearn.model_selection import train_test_split
import sentencepiece as spm

# Install SentencePiece
!pip install sentencepiece --quiet

# -----------------------------
# Upload Dataset
# -----------------------------
uploaded = files.upload()
filename = list(uploaded.keys())[0]  # e.g., 'final_main_dataset.tsv'

# Load dataset and keep only 'sentence' column
df = pd.read_csv(filename, sep='\t')
df = df[['sentence']].dropna()

# -----------------------------
# Normalization Functions
# -----------------------------
def remove_diacritics(text):
    diacritics = ['َ', 'ً', 'ُ', 'ٌ', 'ِ', 'ٍ', 'ْ', 'ّ', 'ْ', 'ٰ', 'ٔ']
    for d in diacritics:
        text = text.replace(d, "")
    return text

def standardize_aleph_yeh(text):
    text = text.replace('ے','ی' )  # Standardize Yeh
    text = text.replace('آ','ٰا' )  # Standardize Alef
    return text

def normalize_text(text):
    text = remove_diacritics(text)
    text = standardize_aleph_yeh(text)
    return text

#  Apply Normalization
# -----------------------------
df['normalized'] = df['sentence'].apply(normalize_text)

# Display first 5 sentences with original and normalized form
print("Sample sentences after normalization:\n")
for i in range(min(5, len(df))):
    print(f"Original   : {df['sentence'].iloc[i]}")
    print(f"Normalized : {df['normalized'].iloc[i]}\n")


#Save Normalized Sentences for SentencePiece
# -----------------------------
df['normalized'].to_csv("corpus.txt", index=False, header=False)

# -----------------------------
#  Train SentencePiece Tokenizer
# -----------------------------
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="ur_model",
    vocab_size=8000,
    character_coverage=0.9995,
    model_type='bpe',
    user_defined_symbols='[MASK],<sos>,<eos>,<pad>'
)

# Load trained model
sp = spm.SentencePieceProcessor(model_file='ur_model.model')

# -----------------------------
#  Tokenize Sentences
# -----------------------------
df['tokenized'] = df['normalized'].apply(lambda x: sp.encode(x, out_type=int))

# Display first 5 sentences with normalized text and tokenized form
print("\nSample tokenized sentences:\n")
for i in range(min(5, len(df))):
    print(f"Normalized : {df['normalized'].iloc[i]}")
    print(f"Tokens     : {df['tokenized'].iloc[i]}\n")

# -----------------------------
# Split Dataset
# -----------------------------
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Print split sizes
print(f"Training set size: {len(train_df)} ({len(train_df)/len(df)*100:.1f}%)")
print(f"Validation set size: {len(val_df)} ({len(val_df)/len(df)*100:.1f}%)")
print(f"Test set size: {len(test_df)} ({len(test_df)/len(df)*100:.1f}%)")

# Save splits (optional)
train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
test_df.to_csv("test.csv", index=False)

# -----------------------------
#  Example of tokenized output
# -----------------------------
print("\nExample tokenized sentences from training set:\n")
for i in range(min(3, len(train_df))):
    print(f"Sentence: {train_df['normalized'].iloc[i]}")
    print(f"Tokens  : {train_df['tokenized'].iloc[i]}\n")


Saving final_main_dataset.tsv to final_main_dataset (1).tsv
Sample sentences after normalization:

Original   : کبھی کبھار ہی خیالی پلاو بناتا ہوں
Normalized : کبھی کبھار ہی خیالی پلاو بناتا ہوں

Original   : اور پھر ممکن ہے کہ پاکستان بھی ہو
Normalized : اور پھر ممکن ہی کہ پاکستان بھی ہو

Original   : یہ فیصلہ بھی گزشتہ دو سال میں
Normalized : یہ فیصلہ بھی گزشتہ دو سال میں

Original   : ان کے بلے بازوں کے سامنے ہو گا
Normalized : ان کی بلی بازوں کی سامنی ہو گا

Original   : آبی جانور میں بطخ بگلا اور دُوسْرا آبی پرندہ شامل ہونا
Normalized : ٰابی جانور میں بطخ بگلا اور دوسرا ٰابی پرندہ شامل ہونا


Sample tokenized sentences:

Normalized : کبھی کبھار ہی خیالی پلاو بناتا ہوں
Tokens     : [390, 1933, 17, 3220, 2325, 7927, 3840, 94]

Normalized : اور پھر ممکن ہی کہ پاکستان بھی ہو
Tokens     : [56, 221, 1282, 17, 68, 125, 77, 24]

Normalized : یہ فیصلہ بھی گزشتہ دو سال میں
Tokens     : [54, 392, 77, 1884, 105, 355, 27]

Normalized : ان کی بلی بازوں کی سامنی ہو گا
Tokens     : [51, 11, 1757,

**ARCHITECTURE: TRANSFORMER**

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_v = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        Q = self.w_q(q).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        K = self.w_k(k).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        V = self.w_v(v).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        att_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_v)
        if mask is not None:
            att_scores = att_scores.masked_fill(mask == 0, -1e9)
        att_probs = F.softmax(att_scores, dim=-1)
        att_output = torch.matmul(att_probs, V).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.w_o(att_output)

# Feed-Forward Network
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=1024):
        super().__init__()
        self.lin1 = nn.Linear(d_model, d_ff)
        self.lin2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.lin2(F.relu(self.lin1(x)))

# Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.3):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.dropout(self.attn(x, x, x, mask))
        x = self.norm1(x + attn_output)
        ff_output = self.dropout(self.ff(x))
        x = self.norm2(x + ff_output)
        return x

# Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.3):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        self_attn_output = self.dropout(self.self_attn(x, x, x, tgt_mask))
        x = self.norm1(x + self_attn_output)
        enc_dec_attn_output = self.dropout(self.enc_dec_attn(x, enc_output, enc_output, src_mask))
        x = self.norm2(x + enc_dec_attn_output)
        ff_output = self.dropout(self.ff(x))
        x = self.norm3(x + ff_output)
        return x

# Encoder
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, mask):
        x = self.pe(self.embed(src))
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

# Decoder
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tgt, enc_output, src_mask, tgt_mask):
        x = self.pe(self.embed(tgt))
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return self.norm(x)

# Transformer Model
class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_layers=2, num_heads=2, dropout=0.1):
        super().__init__()
        self.encoder = Encoder(vocab_size, d_model, num_layers, num_heads, dropout)
        self.decoder = Decoder(vocab_size, d_model, num_layers, num_heads, dropout)
        self.linear = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_output = self.encoder(src, src_mask)
        dec_output = self.decoder(tgt, enc_output, src_mask, tgt_mask)
        return self.linear(dec_output)

# Example instantiation
if __name__ == "__main__":
    vocab_size = 8000  # Example vocab size from SentencePiece
    model = Transformer(vocab_size=vocab_size, d_model=256, num_layers=2, num_heads=2, dropout=0.1)
    print(model)

Transformer(
  (encoder): Encoder(
    (embed): Embedding(8000, 256)
    (pe): PositionalEncoding()
    (layers): ModuleList(
      (0-1): 2 x EncoderLayer(
        (attn): MultiHeadAttention(
          (w_q): Linear(in_features=256, out_features=256, bias=True)
          (w_k): Linear(in_features=256, out_features=256, bias=True)
          (w_v): Linear(in_features=256, out_features=256, bias=True)
          (w_o): Linear(in_features=256, out_features=256, bias=True)
        )
        (ff): FeedForward(
          (lin1): Linear(in_features=256, out_features=1024, bias=True)
          (lin2): Linear(in_features=1024, out_features=256, bias=True)
        )
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): Decoder(
    (embed): Embedding(8000, 256)
    (

In [4]:
!pip install sacrebleu --quiet

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/104.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25h

**TRAINING AND PARAMETERS**

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import sacrebleu
from tqdm import tqdm
import ast
import pandas as pd
import math


# ============================================
# Dataset Class
# ============================================
class UrduDataset(Dataset):
    def __init__(self, df, pad_id=0, max_len=100):
        # Convert string representation of list back to list if needed
        if isinstance(df['tokenized'].iloc[0], str):
            self.sentences = df['tokenized'].apply(ast.literal_eval).tolist()
        else:
            self.sentences = df['tokenized'].tolist()
        self.pad_id = pad_id
        self.max_len = max_len

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx][:self.max_len - 2]  # Reserve for <sos>, <eos>
        src = [1] + tokens + [2]  # <sos>=1, <eos>=2
        tgt = [1] + tokens + [2]

        # Pad sequences
        src += [self.pad_id] * (self.max_len - len(src))
        tgt += [self.pad_id] * (self.max_len - len(tgt))

        return torch.tensor(src), torch.tensor(tgt)

# ============================================
# Masking Function
# ============================================
def create_mask(src, tgt, pad_id=0):
    # src: (B, src_len), tgt: (B, tgt_len)
    src_mask = (src != pad_id).unsqueeze(1).unsqueeze(2)  # (B,1,1,src_len)
    tgt_pad_mask = (tgt != pad_id).unsqueeze(1).unsqueeze(2)
    tgt_len = tgt.size(1)
    tgt_sub_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
    tgt_mask = tgt_pad_mask & tgt_sub_mask
    return src_mask, tgt_mask


# ============================================
# Translation Function
# ============================================
def translate_sentence(model, sentence, sp, device='cuda' if torch.cuda.is_available() else 'cpu', max_len=50):
    model.eval()
    with torch.no_grad():
        # Tokenize input sentence
        tokens = sp.encode(sentence, out_type=int)
        src = [1] + tokens + [2]  # Add <sos> and <eos>
        src = torch.tensor(src).unsqueeze(0).to(device) # Add batch dimension and move to device

        # Create source mask
        src_mask, _ = create_mask(src, src)

        # Encode the source sentence
        enc_out = model.encoder(src, src_mask)

        # Initialize decoded sequence with <sos>
        decoded = torch.ones((1, 1), dtype=torch.long, device=device)

        # Decode token by token
        for _ in range(max_len):
            _, tgt_mask = create_mask(src, decoded)
            out = model.decoder(decoded, enc_out, src_mask, tgt_mask)
            logits = model.linear(out[:, -1, :])
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
            decoded = torch.cat((decoded, next_token), dim=1)

            if next_token.item() == 2:  # Stop if <eos> is predicted
                break

        # Decode the tokenized output back to text
        translation = sp.decode([t for t in decoded.squeeze(0).tolist() if t not in [0, 1, 2]]) # remove pad, sos, eos
        return translation


# ============================================
# Training Function
# ============================================
def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for src, tgt in tqdm(dataloader, desc="Training", leave=False):
        src, tgt = src.to(device), tgt.to(device)

        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]

        src_mask, tgt_mask = create_mask(src, tgt_input)

        optimizer.zero_grad()
        logits = model(src, tgt_input, src_mask, tgt_mask)
        logits = logits.view(-1, logits.size(-1))
        tgt_output = tgt_output.reshape(-1)

        loss = criterion(logits, tgt_output)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

# ============================================
# BLEU Evaluation
# ============================================
def evaluate_bleu(model, dataloader, sp, device, max_len=50):
    model.eval()
    references, hypotheses = [], []

    with torch.no_grad():
        for src, tgt in tqdm(dataloader, desc="Evaluating", leave=False):
            src = src.to(device)
            src_mask, _ = create_mask(src, src)
            enc_out = model.encoder(src, src_mask)

            decoded = torch.ones((src.size(0), 1), dtype=torch.long, device=device)  # start with <sos>

            for _ in range(max_len):
                _, tgt_mask = create_mask(src, decoded)
                out = model.decoder(decoded, enc_out, src_mask, tgt_mask)
                logits = model.linear(out[:, -1, :])
                next_token = torch.argmax(logits, dim=-1, keepdim=True)
                decoded = torch.cat((decoded, next_token), dim=1)
                if (next_token == 2).all():  # stop if all finished
                    break

            for ref, hyp in zip(tgt.cpu().tolist(), decoded.cpu().tolist()):
                ref_text = sp.decode([t for t in ref if t not in [0, 1, 2]])  # remove pad, sos, eos
                hyp_text = sp.decode([t for t in hyp if t not in [0, 1, 2]])
                references.append([ref_text])
                hypotheses.append(hyp_text)

    bleu = sacrebleu.corpus_bleu(hypotheses, list(zip(*references)))
    return bleu.score



# ============================================
# Main Training Script
# ============================================
def main():


    # Positional Encoding
    class PositionalEncoding(nn.Module):
        def __init__(self, d_model, max_len=5000):
            super().__init__()
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
            div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            pe = pe.unsqueeze(0)
            self.register_buffer('pe', pe)

        def forward(self, x):
            return x + self.pe[:, :x.size(1)]

    # Multi-Head Attention
    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super().__init__()
            assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
            self.num_heads = num_heads
            self.d_model = d_model
            self.d_v = d_model // num_heads
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)

        def forward(self, q, k, v, mask=None):
            batch_size = q.size(0)
            Q = self.w_q(q).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
            K = self.w_k(k).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
            V = self.w_v(v).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
            att_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_v)
            if mask is not None:
                att_scores = att_scores.masked_fill(mask == 0, -1e9)
            att_probs = F.softmax(att_scores, dim=-1)
            att_output = torch.matmul(att_probs, V).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
            return self.w_o(att_output)

    # Feed-Forward Network
    class FeedForward(nn.Module):
        def __init__(self, d_model, d_ff=1024):
            super().__init__()
            self.lin1 = nn.Linear(d_model, d_ff)
            self.lin2 = nn.Linear(d_ff, d_model)

        def forward(self, x):
            return self.lin2(F.relu(self.lin1(x)))

    # Encoder Layer
    class EncoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, dropout=0.3):
            super().__init__()
            self.attn = MultiHeadAttention(d_model, num_heads)
            self.ff = FeedForward(d_model)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, mask):
            attn_output = self.dropout(self.attn(x, x, x, mask))
            x = self.norm1(x + attn_output)
            ff_output = self.dropout(self.ff(x))
            x = self.norm2(x + ff_output)
            return x

    # Decoder Layer
    class DecoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, dropout=0.3):
            super().__init__()
            self.self_attn = MultiHeadAttention(d_model, num_heads)
            self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)
            self.ff = FeedForward(d_model)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, enc_output, src_mask, tgt_mask):
            self_attn_output = self.dropout(self.self_attn(x, x, x, tgt_mask))
            x = self.norm1(x + self_attn_output)
            enc_dec_attn_output = self.dropout(self.enc_dec_attn(x, enc_output, enc_output, src_mask))
            x = self.norm2(x + enc_dec_attn_output)
            ff_output = self.dropout(self.ff(x))
            x = self.norm3(x + ff_output)
            return x

    # Encoder
    class Encoder(nn.Module):
        def __init__(self, vocab_size, d_model, num_layers, num_heads, dropout):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pe = PositionalEncoding(d_model)
            self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, dropout) for _ in range(num_layers)])
            self.norm = nn.LayerNorm(d_model)

        def forward(self, src, mask):
            x = self.pe(self.embed(src))
            for layer in self.layers:
                x = layer(x, mask)
            return self.norm(x)

    # Decoder
    class Decoder(nn.Module):
        def __init__(self, vocab_size, d_model, num_layers, num_heads, dropout):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pe = PositionalEncoding(d_model)
            self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, dropout) for _ in range(num_layers)])
            self.norm = nn.LayerNorm(d_model)

        def forward(self, tgt, enc_output, src_mask, tgt_mask):
            x = self.pe(self.embed(tgt))
            for layer in self.layers:
                x = layer(x, enc_output, src_mask, tgt_mask)
            return self.norm(x)

    # Transformer Model
    class Transformer(nn.Module):
        def __init__(self, vocab_size, d_model=256, num_layers=2, num_heads=2, dropout=0.1):
            super().__init__()
            self.encoder = Encoder(vocab_size, d_model, num_layers, num_heads, dropout)
            self.decoder = Decoder(vocab_size, d_model, num_layers, num_heads, dropout)
            self.linear = nn.Linear(d_model, vocab_size)

        def forward(self, src, tgt, src_mask, tgt_mask):
            enc_output = self.encoder(src, src_mask)
            dec_output = self.decoder(tgt, enc_output, src_mask, tgt_mask)
            return self.linear(dec_output)


    # ---- Load Data ----
    train_df = pd.read_csv("train.csv")
    val_df = pd.read_csv("val.csv")

    # ---- Dataset & Dataloader ----
    train_dataset = UrduDataset(train_df, pad_id=0)
    val_dataset = UrduDataset(val_df, pad_id=0)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=32)

    # ---- Load SentencePiece ----
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor(model_file='ur_model.model')

    # ---- Model, Optimizer, Loss ----
    vocab_size = sp.get_piece_size()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = Transformer(vocab_size=vocab_size, d_model=256, num_layers=2, num_heads=2, dropout=0.1)
    model.to(device)

    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    best_bleu = 0.0
    num_epochs = 5

    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")

        train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
        val_bleu = evaluate_bleu(model, val_loader, sp, device)

        print(f"Train Loss: {train_loss:.4f} | Validation BLEU: {val_bleu:.2f}")

        if val_bleu > best_bleu:
            best_bleu = val_bleu
            torch.save(model.state_dict(), "best_transformer.pt")
            print(f"✅ Saved new best model with BLEU = {best_bleu:.2f}")

    # -----------------------------
    # Example sentences to test
    # -----------------------------
    test_sentences = [
        "میں لاہور گیا تھا۔",
        "آج موسم بہت اچھا ہے۔",
        "آپ کیسے ہیں؟",
        "کل اسکول جانا ہے۔",
        "یہ ایک بہت خوبصورت جگہ ہے۔"
    ]

    # Apply normalization first
    def normalize_text(text):
        diacritics = ['َ', 'ً', 'ُ', 'ٌ', 'ِ', 'ٍ', 'ْ', 'ّ', 'ْ', 'ٰ', 'ٔ']
        for d in diacritics:
            text = text.replace(d, "")
        text = text.replace('ے','ی' )  # Standardize Yeh
        text = text.replace('آ','ٰا' )  # Standardize Alef (corrected from 'آ' -> 'ٰا' to 'ٰا' -> 'آ')
        return text
    normalized_sentences = [normalize_text(s) for s in test_sentences]

    # Display normalized sentences
    print("Normalized Sentences:\n")
    for orig, norm in zip(test_sentences, normalized_sentences):
        print(f"Original   : {orig}")
        print(f"Normalized : {norm}\n")

    # -----------------------------
    # Translate using trained model
    # -----------------------------
    print("\nTranslated Sentences:\n")
    for sentence in normalized_sentences:
        translation = translate_sentence(model, sentence, sp)
        print(f"Input : {sentence}")
        print(f"Output: {translation}\n")


if __name__ == "__main__":
    main()


Epoch 1/5




Train Loss: 6.1356 | Validation BLEU: 3.96
✅ Saved new best model with BLEU = 3.96

Epoch 2/5




Train Loss: 4.1912 | Validation BLEU: 18.49
✅ Saved new best model with BLEU = 18.49

Epoch 3/5




Train Loss: 2.9594 | Validation BLEU: 33.85
✅ Saved new best model with BLEU = 33.85

Epoch 4/5




Train Loss: 2.0958 | Validation BLEU: 49.18
✅ Saved new best model with BLEU = 49.18

Epoch 5/5




Train Loss: 1.4905 | Validation BLEU: 59.77
✅ Saved new best model with BLEU = 59.77
Normalized Sentences:

Original   : میں لاہور گیا تھا۔
Normalized : میں لاہور گیا تھا۔

Original   : آج موسم بہت اچھا ہے۔
Normalized : ٰاج موسم بہت اچھا ہی۔

Original   : آپ کیسے ہیں؟
Normalized : ٰاپ کیسی ہیں؟

Original   : کل اسکول جانا ہے۔
Normalized : کل اسکول جانا ہی۔

Original   : یہ ایک بہت خوبصورت جگہ ہے۔
Normalized : یہ ایک بہت خوبصورت جگہ ہی۔


Translated Sentences:

Input : میں لاہور گیا تھا۔
Output: میں لاہور گیا تھا۔

Input : ٰاج موسم بہت اچھا ہی۔
Output: ٰاج موسم بہت اچھا ہی۔

Input : ٰاپ کیسی ہیں؟
Output: ٰاپ کیسی ہیں؟

Input : کل اسکول جانا ہی۔
Output: کل فلم جانا ہی۔

Input : یہ ایک بہت خوبصورت جگہ ہی۔
Output: یہ ایک بہت خوبصورت جگہ ہی۔



In [21]:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import sacrebleu
from tqdm import tqdm
import pandas as pd
import math
import sentencepiece as spm
import torch.nn.functional as F

# ============================================
# Dataset Class
# ============================================
class UrduConversationDataset(Dataset):
    def __init__(self, df, pad_id=0, max_len=50):
        self.sentences = df['sentence'].tolist()
        self.pad_id = pad_id
        self.max_len = max_len

    def __len__(self):
        return max(1, len(self.sentences) - 1)  # Ensure at least 1 pair

    def __getitem__(self, idx):
        src_text = self.sentences[idx]
        tgt_text = self.sentences[min(idx + 1, len(self.sentences) - 1)]  # Avoid index out of range
        return src_text, tgt_text

# ============================================
# Masking Function
# ============================================
def create_mask(src, tgt, pad_id=0):
    src_mask = (src != pad_id).unsqueeze(1).unsqueeze(2)
    tgt_pad_mask = (tgt != pad_id).unsqueeze(1).unsqueeze(2)
    tgt_len = tgt.size(1)
    tgt_sub_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
    tgt_mask = tgt_pad_mask & tgt_sub_mask
    return src_mask, tgt_mask

# ============================================
# Translation Function
# ============================================
def translate_sentence(model, sentence, sp, device='cuda' if torch.cuda.is_available() else 'cpu', max_len=50):
    model.eval()
    with torch.no_grad():
        try:
            tokens = sp.encode(sentence, out_type=int)
            src = torch.tensor([1] + tokens[:max_len - 2] + [2]).unsqueeze(0).to(device)
            src_mask, _ = create_mask(src, src)
            enc_out = model.encoder(src, src_mask)
            decoded = torch.ones((1, 1), dtype=torch.long, device=device)
            for _ in range(max_len):
                _, tgt_mask = create_mask(src, decoded)
                out = model.decoder(decoded, enc_out, src_mask, tgt_mask)
                logits = model.linear(out[:, -1, :])
                next_token = torch.argmax(logits, dim=-1, keepdim=True)
                decoded = torch.cat((decoded, next_token), dim=1)
                if next_token.item() == 2:
                    break
            translation = sp.decode([t for t in decoded.squeeze(0).tolist() if t not in [0, 1, 2]])
            return translation
        except Exception as e:
            print(f"Translation error for '{sentence}': {e}")
            return ""

# ============================================
# Training Function
# ============================================
def train_epoch(model, dataloader, optimizer, criterion, sp, device, max_len=50):
    model.train()
    total_loss = 0
    for src_text, tgt_text in tqdm(dataloader, desc="Training", leave=False):
        try:
            src_tokens = [sp.encode(s, out_type=int)[:max_len - 2] for s in src_text]
            tgt_tokens = [sp.encode(t, out_type=int)[:max_len - 2] for t in tgt_text]
            if not src_tokens or not tgt_tokens:
                continue  # Skip empty batches
            src = [torch.tensor([1] + t + [2]) for t in src_tokens]
            tgt = [torch.tensor([1] + t + [2]) for t in tgt_tokens]
            src = torch.nn.utils.rnn.pad_sequence(src, batch_first=True, padding_value=0).to(device)
            tgt = torch.nn.utils.rnn.pad_sequence(tgt, batch_first=True, padding_value=0).to(device)
            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]
            src_mask, tgt_mask = create_mask(src, tgt_input)
            optimizer.zero_grad()
            logits = model(src, tgt_input, src_mask, tgt_mask)
            logits = logits.view(-1, logits.size(-1))
            tgt_output = tgt_output.reshape(-1)
            loss = criterion(logits, tgt_output)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        except Exception as e:
            print(f"Training batch error: {e}")
            continue
    return total_loss / max(1, len(dataloader))

# ============================================
# BLEU Evaluation
# ============================================
def evaluate_bleu(model, dataloader, sp, device, max_len=50):
    model.eval()
    references, hypotheses = [], []
    with torch.no_grad():
        for src_text, tgt_text in tqdm(dataloader, desc="Evaluating", leave=False):
            try:
                src_tokens = [sp.encode(s, out_type=int)[:max_len - 2] for s in src_text]
                src = torch.nn.utils.rnn.pad_sequence(
                    [torch.tensor([1] + t + [2]) for t in src_tokens], batch_first=True, padding_value=0
                ).to(device)
                src_mask, _ = create_mask(src, src)
                enc_out = model.encoder(src, src_mask)
                decoded = torch.ones((src.size(0), 1), dtype=torch.long, device=device)
                for _ in range(max_len):
                    _, tgt_mask = create_mask(src, decoded)
                    out = model.decoder(decoded, enc_out, src_mask, tgt_mask)
                    logits = model.linear(out[:, -1, :])
                    next_token = torch.argmax(logits, dim=-1, keepdim=True)
                    decoded = torch.cat((decoded, next_token), dim=1)
                    if (next_token == 2).all():
                        break
                for ref, hyp in zip(tgt_text, decoded.cpu().tolist()):
                    ref_text = sp.decode([t for t in sp.encode(ref, out_type=int) if t not in [0, 1, 2]])
                    hyp_text = sp.decode([t for t in hyp if t not in [0, 1, 2]])
                    references.append([ref_text])
                    hypotheses.append(hyp_text)
            except Exception as e:
                print(f"Evaluation error: {e}")
                continue
    try:
        bleu = sacrebleu.corpus_bleu(hypotheses, list(zip(*references)))
        return bleu.score
    except Exception as e:
        print(f"BLEU calculation error: {e}")
        return 0.0

# ============================================
# Model Architecture
# ============================================
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_v = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        Q = self.w_q(q).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        K = self.w_k(k).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        V = self.w_v(v).view(batch_size, -1, self.num_heads, self.d_v).transpose(1, 2)
        att_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_v)
        if mask is not None:
            att_scores = att_scores.masked_fill(mask == 0, -1e9)
        att_probs = F.softmax(att_scores, dim=-1)
        att_output = torch.matmul(att_probs, V).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.w_o(att_output)

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=1024):
        super().__init__()
        self.lin1 = nn.Linear(d_model, d_ff)
        self.lin2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.lin2(F.relu(self.lin1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.3):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.dropout(self.attn(x, x, x, mask))
        x = self.norm1(x + attn_output)
        ff_output = self.dropout(self.ff(x))
        x = self.norm2(x + ff_output)
        return x

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.3):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        self_attn_output = self.dropout(self.self_attn(x, x, x, tgt_mask))
        x = self.norm1(x + self_attn_output)
        enc_dec_attn_output = self.dropout(self.enc_dec_attn(x, enc_output, enc_output, src_mask))
        x = self.norm2(x + enc_dec_attn_output)
        ff_output = self.dropout(self.ff(x))
        x = self.norm3(x + ff_output)
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src, mask):
        x = self.pe(self.embed(src))
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, dropout):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tgt, enc_output, src_mask, tgt_mask):
        x = self.pe(self.embed(tgt))
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return self.norm(x)

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_layers=2, num_heads=2, dropout=0.1):
        super().__init__()
        self.encoder = Encoder(vocab_size, d_model, num_layers, num_heads, dropout)
        self.decoder = Decoder(vocab_size, d_model, num_layers, num_heads, dropout)
        self.linear = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_output = self.encoder(src, src_mask)
        dec_output = self.decoder(tgt, enc_output, src_mask, tgt_mask)
        return self.linear(dec_output)

# ============================================
# Main Training Script
# ============================================
def main():
    # ---- Normalize Text ----
    def normalize_text(text):
        diacritics = ['َ', 'ً', 'ُ', 'ٌ', 'ِ', 'ٍ', 'ْ', 'ّ', 'ْ', 'ٰ', 'ٔ']
        for d in diacritics:
            text = text.replace(d, "")
        text = text.replace('ے', 'ی')  # Standardize Yeh
        text = text.replace('آ', 'ٰا')  # Standardize Alef
        return text

    # ---- Load and Preprocess Data ----
    try:
        df = pd.read_csv("final_main_dataset.tsv", sep='\t')
        print(f"Loaded dataset with {len(df)} rows")
        print(df.head())
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return

    df['sentence'] = df['sentence'].apply(normalize_text)
    train_size = int(0.8 * len(df))
    val_size = int(0.1 * len(df))
    train_df = df[:train_size]
    val_df = df[train_size:train_size + val_size]

    # ---- Load SentencePiece ----
    try:
        sp = spm.SentencePieceProcessor(model_file='ur_model.model')
        print(f"SentencePiece vocab size: {sp.get_piece_size()}")
    except Exception as e:
        print(f"Error loading SentencePiece model: {e}")
        return

    # ---- Dataset & Dataloader ----
    train_dataset = UrduConversationDataset(train_df, pad_id=0, max_len=50)
    val_dataset = UrduConversationDataset(val_df, pad_id=0, max_len=50)
    print(f"Train dataset size: {len(train_dataset)}, Val dataset size: {len(val_dataset)}")
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16)

    # ---- Model, Optimizer, Loss ----
    vocab_size = sp.get_piece_size()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = Transformer(vocab_size=vocab_size, d_model=256, num_layers=2, num_heads=2, dropout=0.1)
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    best_bleu = 0.0
    num_epochs = 10

    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")
        train_loss = train_epoch(model, train_loader, optimizer, criterion, sp, device)
        val_bleu = evaluate_bleu(model, val_loader, sp, device)
        print(f"Train Loss: {train_loss:.4f} | Validation BLEU: {val_bleu:.2f}")
        if val_bleu > best_bleu:
            best_bleu = val_bleu
            torch.save(model.state_dict(), "best_transformer.pt")
            print(f"✅ Saved new best model with BLEU = {best_bleu:.2f}")

    # ---- Test Sentences ----
    test_sentences = [
        "میں لاہور گیا تھا۔",
        "آج موسم بہت اچھا ہے۔",
        "آپ کیسے ہیں؟",
        "کل اسکول جانا ہے۔",
        "یہ ایک بہت خوبصورت جگہ ہے۔"
    ]
    normalized_sentences = [normalize_text(s) for s in test_sentences]

    print("\nNormalized Sentences:\n")
    for orig, norm in zip(test_sentences, normalized_sentences):
        print(f"Original   : {orig}")
        print(f"Normalized : {norm}\n")

    print("\nGenerated Responses:\n")
    for sentence in normalized_sentences:
        response = translate_sentence(model, sentence, sp, device)
        print(f"Input : {sentence}")
        print(f"Response: {response}\n")

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print(f"Error in main: {e}")


Loaded dataset with 20000 rows
                                           client_id  \
0  e53f84d151d6cc6d45a57decde08a99efe47d7751a4ca6...   
1  e53f84d151d6cc6d45a57decde08a99efe47d7751a4ca6...   
2  e53f84d151d6cc6d45a57decde08a99efe47d7751a4ca6...   
3  e53f84d151d6cc6d45a57decde08a99efe47d7751a4ca6...   
4  e53f84d151d6cc6d45a57decde08a99efe47d7751a4ca6...   

                           path  \
0  common_voice_ur_31771683.mp3   
1  common_voice_ur_31771684.mp3   
2  common_voice_ur_31771685.mp3   
3  common_voice_ur_31771730.mp3   
4  common_voice_ur_31771732.mp3   

                                            sentence  up_votes  down_votes  \
0                 کبھی کبھار ہی خیالی پلاو بناتا ہوں         2           0   
1                  اور پھر ممکن ہے کہ پاکستان بھی ہو         2           1   
2                      یہ فیصلہ بھی گزشتہ دو سال میں         2           0   
3                     ان کے بلے بازوں کے سامنے ہو گا         3           0   
4  آبی جانور میں بطخ بگلا اور د



Train Loss: 6.5414 | Validation BLEU: 0.17
✅ Saved new best model with BLEU = 0.17

Epoch 2/10




Train Loss: 5.8890 | Validation BLEU: 0.05

Epoch 3/10




Train Loss: 5.4677 | Validation BLEU: 0.12

Epoch 4/10




Train Loss: 5.0900 | Validation BLEU: 0.14

Epoch 5/10




Train Loss: 4.7265 | Validation BLEU: 0.10

Epoch 6/10




Train Loss: 4.3646 | Validation BLEU: 0.09

Epoch 7/10




Train Loss: 4.0035 | Validation BLEU: 0.12

Epoch 8/10




Train Loss: 3.6446 | Validation BLEU: 0.17

Epoch 9/10




Train Loss: 3.3014 | Validation BLEU: 0.17

Epoch 10/10




Train Loss: 2.9678 | Validation BLEU: 0.22
✅ Saved new best model with BLEU = 0.22

Normalized Sentences:

Original   : میں لاہور گیا تھا۔
Normalized : میں لاہور گیا تھا۔

Original   : آج موسم بہت اچھا ہے۔
Normalized : ٰاج موسم بہت اچھا ہی۔

Original   : آپ کیسے ہیں؟
Normalized : ٰاپ کیسی ہیں؟

Original   : کل اسکول جانا ہے۔
Normalized : کل اسکول جانا ہی۔

Original   : یہ ایک بہت خوبصورت جگہ ہے۔
Normalized : یہ ایک بہت خوبصورت جگہ ہی۔


Generated Responses:

Input : میں لاہور گیا تھا۔
Response: میں نی اپنی سی ایک کو مسترد کر بریک لگائی ہوئی ہی

Input : ٰاج موسم بہت اچھا ہی۔
Response: یہ ایک اچھی پلئیرز دیی ہیں۔

Input : ٰاپ کیسی ہیں؟
Response: لیکن اب سنا ہی۔

Input : کل اسکول جانا ہی۔
Response: وہ بھی نہ تھی۔

Input : یہ ایک بہت خوبصورت جگہ ہی۔
Response: اب تو ہم کیوں نہیں کر سکتی؟

