In [None]:
!pip install torchtext==0.17.1

In [1]:
!ls /kaggle/input/urdu-ghazals-rekhta

dataset


In [2]:
!cp -r /kaggle/input/urdu-ghazals-rekhta/dataset /kaggle/working/dataset

## Building the Parallel Dataset (Urdu ↔ Roman Urdu)

### **In the code below, we will:**

1. **Iterate through each poet’s folder** (e.g., `ahmad-faraz`, `faiz-ahmad-faiz`, etc.).  
2. Look inside the **`ur/`** and **`en/`** subfolders.  
3. **Read each poem file** from both subfolders.  
4. **Collect the parallel lines**:
   - Urdu (from `ur/`)  
   - Roman Urdu (from `en/`)  
5. **Save everything into two files**:
   - `/kaggle/working/source.txt` → contains **all Urdu lines**.  
   - `/kaggle/working/target.txt` → contains **all Roman Urdu lines**.  

---

### **Why is this important?**
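- Pairs the *i*-th non-empty line of each Urdu file with the *i*-th non-empty line of its Roman Urdu counterpart, so the corpus is aligned **line-by-line** (assuming both files have the same number of non-empty lines).  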
- Produces a clean parallel dataset ready for **Neural Machine Translation (NMT)** training.  
- Keeps the workflow simple: just two files (`source.txt`, `target.txt`) that will be fed into the model later.  
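
Because the pairing relies on `zip`, a poem whose `ur/` and `en/` files differ in non-empty line count is paired only up to the shorter file, without any warning. The following is a minimal sketch of a check that could be run before pairing. The helper names `count_nonempty_lines` and `find_mismatched_files` are hypothetical and not part of the notebook, and the demo uses a throwaway directory rather than the real dataset.

```python
import os
import tempfile

def count_nonempty_lines(path):
    """Count lines that contain something other than whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def find_mismatched_files(ur_dir, en_dir):
    """Return (filename, urdu_count, roman_count) for files whose counts differ."""
    mismatched = []
    for name in sorted(os.listdir(ur_dir)):
        ur_file = os.path.join(ur_dir, name)
        en_file = os.path.join(en_dir, name)
        if not os.path.isfile(en_file):
            continue  # no counterpart, nothing to compare
        n_ur = count_nonempty_lines(ur_file)
        n_en = count_nonempty_lines(en_file)
        if n_ur != n_en:
            mismatched.append((name, n_ur, n_en))
    return mismatched

# Demo on a temporary directory with made-up file contents
with tempfile.TemporaryDirectory() as root:
    ur_dir = os.path.join(root, "ur")
    en_dir = os.path.join(root, "en")
    os.makedirs(ur_dir)
    os.makedirs(en_dir)
    samples = {
        "poem-a": ("a\nb\n", "x\ny\n"),   # 2 vs 2 lines: aligned
        "poem-b": ("a\n\nb\n", "x\n"),    # 2 vs 1 lines: mismatch
    }
    for name, (ur_text, en_text) in samples.items():
        with open(os.path.join(ur_dir, name), "w", encoding="utf-8") as f:
            f.write(ur_text)
        with open(os.path.join(en_dir, name), "w", encoding="utf-8") as f:
            f.write(en_text)
    demo_result = find_mismatched_files(ur_dir, en_dir)

print(demo_result)
```

Files reported by such a check could be inspected by hand or skipped, depending on how strict the alignment needs to be.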


In [52]:
import os

base_dir = "/kaggle/working/dataset"
authors = os.listdir(base_dir)

src_lines, tgt_lines = [], []

for author in authors:
    author_path = os.path.join(base_dir, author)
    if not os.path.isdir(author_path):
        continue
    
    ur_dir = os.path.join(author_path, "ur")
    en_dir = os.path.join(author_path, "en")
    
    if not os.path.exists(ur_dir) or not os.path.exists(en_dir):
        continue
    
    files = os.listdir(ur_dir)
    for f in files:
        ur_file = os.path.join(ur_dir, f)
        en_file = os.path.join(en_dir, f)
        if not os.path.exists(en_file):
            continue
        
        with open(ur_file, "r", encoding="utf-8") as fu, \
             open(en_file, "r", encoding="utf-8") as fe:
            
            ur_lines = [l.strip() for l in fu if l.strip()]
            en_lines = [l.strip() for l in fe if l.strip()]
            
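            # zip() stops at the shorter list, so if the two files differ in
            # non-empty line count, the extra lines are silently dropped
            for u, e in zip(ur_lines, en_lines):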
                src_lines.append(u)
                tgt_lines.append(e)

# Save parallel corpus
with open("/kaggle/working/source.txt", "w", encoding="utf-8") as fs, \
     open("/kaggle/working/target.txt", "w", encoding="utf-8") as ft:
    for u, e in zip(src_lines, tgt_lines):
        fs.write(u + "\n")
        ft.write(e + "\n")

print(f"Saved {len(src_lines)} parallel lines.")


Saved 21003 parallel lines.


## Dataset Splitting for Urdu ↔ Roman Urdu NMT

### **What are we doing?**
We are dividing our collected parallel dataset (**`source.txt`** for Urdu and **`target.txt`** for Roman Urdu) into **three subsets**:

- **Training set (50%)** → used to fit the model.  
- **Validation set (25%)** → used during training to tune hyperparameters and prevent overfitting.  
- **Test set (25%)** → used **only after training** to evaluate the final model’s real performance.  

---

### **How is it done?**
1. **Read** all Urdu and Roman Urdu lines.  
2. **Shuffle** them randomly (with a fixed seed = **42** for reproducibility).  
3. **Calculate** the split sizes (**50/25/25**); any rounding remainder goes to the test set.  
4. **Write** them into **six separate files**:
   - `train.source`, `train.target`  
   - `val.source`, `val.target`  
   - `test.source`, `test.target`  

This ensures that Urdu and Roman Urdu lines remain **aligned**.

---

### **Why is this important?**
- **NMT models** require separate **train, validation, and test** splits to evaluate generalization.  
- **50/25/25 split** trades training data for larger evaluation sets → with about 21,000 pairs, roughly 10,500 remain for training and about 5,250 each for validation and testing.  
- **Reproducibility** (fixed seed) ensures results can be compared consistently across experiments.  
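
The size calculation can be sketched on its own. This is a minimal illustration, assuming the 21,003 pairs saved earlier; the function name `split_sizes` is hypothetical. Truncating with `int()` and giving the remainder to the test set is why the three parts are not exactly 50/25/25.

```python
def split_sizes(n, train_frac=0.50, val_frac=0.25):
    """Return (train, val, test) sizes; the rounding remainder goes to test."""
    train_n = int(train_frac * n)   # truncates toward zero
    val_n = int(val_frac * n)
    test_n = n - train_n - val_n    # whatever is left
    return train_n, val_n, test_n

sizes = split_sizes(21003)
print(sizes)  # (10501, 5250, 5252)
```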

---


In [53]:
# Split dataset into train/val/test (50/25/25) with reproducible shuffle
import random
from pathlib import Path

random.seed(42)

base = Path("/kaggle/working")
src_path = base / "source.txt"
tgt_path = base / "target.txt"

# read
with src_path.open("r", encoding="utf-8") as f:
    src_lines = [l.rstrip("\n") for l in f]
with tgt_path.open("r", encoding="utf-8") as f:
    tgt_lines = [l.rstrip("\n") for l in f]

assert len(src_lines) == len(tgt_lines), "Source and target line counts differ!"

n = len(src_lines)
indices = list(range(n))
random.shuffle(indices)

def write_split(name, idxs):
    out_src = base / f"{name}.source"
    out_tgt = base / f"{name}.target"
    with out_src.open("w", encoding="utf-8") as fs, out_tgt.open("w", encoding="utf-8") as ft:
        for i in idxs:
            fs.write(src_lines[i] + "\n")
            ft.write(tgt_lines[i] + "\n")
    print(f"Wrote {len(idxs)} lines to {out_src.name} and {out_tgt.name}")

# sizes
train_n = int(0.50 * n)
val_n   = int(0.25 * n)
test_n  = n - train_n - val_n

train_idx = indices[:train_n]
val_idx   = indices[train_n:train_n+val_n]
test_idx  = indices[train_n+val_n:]

write_split("train", train_idx)
write_split("val", val_idx)
write_split("test", test_idx)

print(f"Total lines: {n}  -> train: {len(train_idx)}, val: {len(val_idx)}, test: {len(test_idx)}")


Wrote 10501 lines to train.source and train.target
Wrote 5250 lines to val.source and val.target
Wrote 5252 lines to test.source and test.target
Total lines: 21003  -> train: 10501, val: 5250, test: 5252


## Normalizing Urdu ↔ Roman Urdu Dataset

We normalize the **source (Urdu)** and **target (Roman Urdu)** lines for each **data split** and write the cleaned files to:

- `/kaggle/working/normalized/{train,val,test}.source`
- `/kaggle/working/normalized/{train,val,test}.target`

### **🔧 How**

#### **Urdu normalization (`normalize_urdu`)**
- Remove **tatweel (ـ)** and, optionally, **diacritics (tashkeel)** to reduce unseen variants.  
- Normalize multiple orthographic variants into single forms (note that mapping `آ` to `ا` also drops the madda):  
  - `إ أ آ ٱ → ا`  
  - `ي → ی`  
  - `ك → ک`  
- Replace the lam-alef **presentation ligature** (`ﻻ → لا`).  
- Collapse multiple whitespace.  
- *Optional*: Remove punctuation while preserving Urdu-specific marks like `،` `۔` `؟`.  

#### **Roman normalization (`normalize_roman`)**
- Strip leading/trailing whitespace.  
- Convert to lowercase.  
- Collapse repeated spaces.  
- *Optional*: Remove punctuation, but **keep diacritic letters** (e.g., `ā`, `ī`) since they are informative for transliteration.  
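
The character-level mappings listed above can be illustrated with a small standalone sketch. It is not the notebook's `normalize_urdu`; it covers only the yeh, kaf, and tatweel rules plus whitespace collapsing, and the names `ARABIC_TO_URDU` and `normalize_sketch` are hypothetical. Code points are written as escapes so the mapping is unambiguous.

```python
import re

# Map Arabic code points to their Urdu counterparts; None deletes the character
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh  -> Farsi/Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf  -> keheh (Urdu kaf)
    "\u0640": None,      # tatweel (kashida) is removed
})

def normalize_sketch(text):
    """Apply the character mapping, then collapse runs of whitespace."""
    text = text.translate(ARABIC_TO_URDU)
    return re.sub(r"\s+", " ", text).strip()

# Arabic kaf + two tatweels + Arabic yeh + three spaces + alef
sample = "\u0643\u0640\u0640\u064A   \u0627"
result = normalize_sketch(sample)
print([f"U+{ord(ch):04X}" for ch in result])
```

Printing code points instead of glyphs makes it easy to confirm that visually similar letters were actually replaced.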

---

### **❓ Why**

- **Poetry** often contains stylistic marks + inconsistent spellings → this increases vocabulary sparsity and harms model learning.  
- **Normalization reduces irrelevant variation** so the model can focus on phonetic/orthographic mapping.  
- We **do not aggressively strip punctuation by default** because punctuation in poetry encodes structure (pauses, line breaks).  
  - You can enable `remove_extra_punct=True` for a stricter dataset (e.g., ablation experiments).  
- Keeping **diacritics in Roman target** helps the model learn precise **phonetic correspondences** → useful for **transliteration tasks**.  
- If plain ASCII Roman Urdu is required, we can later **remove or map diacritics**.  
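
One way to remove diacritics from the Roman side later is Unicode decomposition. This is a hedged sketch rather than part of the notebook: the function name `strip_diacritics` is hypothetical, and it only removes combining marks, so letters without a decomposition would pass through unchanged. It also merges distinctions such as `ā` versus `a`, which is exactly the information the default pipeline keeps.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "ḳhizr kī hūñ" written with escapes to avoid encoding surprises
folded = strip_diacritics("\u1e33hizr k\u012b h\u016b\u00f1")
print(folded)  # khizr ki hun
```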

---


In [54]:
# Enhanced normalization with quality filtering
import re
import unicodedata
from pathlib import Path

def normalize_urdu(text, remove_tashkeel=True, remove_tatweel=True,
                  normalize_alef=True, normalize_ye=True, normalize_kaf=True,
                  remove_extra_punct=False):
    """Enhanced Urdu normalization with better error handling"""
    if text is None or not text.strip():
        return ""
    
    s = text.strip()
    
    # Remove tatweel (kashida)
    if remove_tatweel:
        s = re.sub('\u0640+', '', s)
    
    # Remove tashkeel / diacritics / vowel marks (optional)
    if remove_tashkeel:
        s = re.sub(r'[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]', '', s)
    
    # Normalize alef variants to bare alef
    if normalize_alef:
        s = re.sub('[إأآٱ]', 'ا', s)
    
    # Normalize Arabic Yeh (ي) to Persian/Urdu Yeh (ی)
    if normalize_ye:
        s = s.replace('ي', 'ی')
    
    # Normalize Arabic Kaf (ك) to Persian/Urdu Kaf (ک)
    if normalize_kaf:
        s = s.replace('ك', 'ک')
    
    # Replace some presentation-form ligatures if present
    s = s.replace('ﻻ', 'لا')
    
    # Collapse multiple spaces/newlines
    s = re.sub(r'\s+', ' ', s).strip()
    
    # Optionally remove most punctuation while keeping Urdu punctuation
    if remove_extra_punct:
        keep = set(['،', '۔', '؟'])
        s = ''.join(ch for ch in s if (unicodedata.category(ch)[0] != 'P') or (ch in keep))
        s = re.sub(r'\s+', ' ', s).strip()
    
    return s


def normalize_roman(text, lower=True, remove_extra_punct=False):
    """Enhanced Roman Urdu normalization"""
    if text is None or not text.strip():
        return ""
    
    s = text.strip()
    
    if lower:
        s = s.lower()
    
    s = re.sub(r'\s+', ' ', s)
    
    if remove_extra_punct:
        s = ''.join(ch for ch in s if unicodedata.category(ch)[0] != 'P')
        s = re.sub(r'\s+', ' ', s).strip()
    
    return s


In [55]:
def is_valid_pair(urdu_text, roman_text, min_words=2, max_words=50, min_chars=5, max_chars=200):
    """Quality filtering for sentence pairs"""
    # Check if either is empty after normalization
    if not urdu_text.strip() or not roman_text.strip():
        return False
    
    # Word count filtering
    urdu_words = len(urdu_text.split())
    roman_words = len(roman_text.split())
    
    if urdu_words < min_words or roman_words < min_words:
        return False
    if urdu_words > max_words or roman_words > max_words:
        return False
    
    # Character count filtering  
    if len(urdu_text) < min_chars or len(roman_text) < min_chars:
        return False
    if len(urdu_text) > max_chars or len(roman_text) > max_chars:
        return False
    
    # Length ratio check (avoid severely misaligned pairs)
    word_ratio = max(urdu_words, roman_words) / min(urdu_words, roman_words)
    if word_ratio > 2.5:  # One sentence is >2.5x longer than the other
        return False
    
    # Check if sentences are identical (likely error)
    if urdu_text == roman_text:
        return False
    
    # Check for reasonable character diversity
    urdu_unique_chars = len(set(urdu_text))
    roman_unique_chars = len(set(roman_text))
    
    if urdu_unique_chars < 3 or roman_unique_chars < 3:  # Too repetitive
        return False
    
    return True


# Enhanced normalization with quality filtering
base = Path("/kaggle/working")
splits = ["train", "val", "test"]
out_dir = base / "normalized"
out_dir.mkdir(exist_ok=True)

summary = {}
total_filtered = 0

print("Starting enhanced normalization with quality filtering...")

for sp in splits:
    src_in = base / f"{sp}.source"
    tgt_in = base / f"{sp}.target"
    src_out = out_dir / f"{sp}.source"
    tgt_out = out_dir / f"{sp}.target"
    
    if not src_in.exists() or not tgt_in.exists():
        print(f"Skipping {sp}: missing {src_in.name} or {tgt_in.name}")
        continue
    
    with src_in.open("r", encoding="utf-8") as f:
        src_lines = [l.rstrip("\n") for l in f]
    with tgt_in.open("r", encoding="utf-8") as f:
        tgt_lines = [l.rstrip("\n") for l in f]
    
    assert len(src_lines) == len(tgt_lines), f"Line count mismatch in {sp}"
    
    # Normalize and filter
    norm_src, norm_tgt = [], []
    filtered_count = 0
    
    for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines)):
        # Normalize
        norm_s = normalize_urdu(src, remove_tashkeel=True, remove_tatweel=True,
                               normalize_alef=True, normalize_ye=True, normalize_kaf=True,
                               remove_extra_punct=False)
        norm_t = normalize_roman(tgt, lower=True, remove_extra_punct=False)
        
        # Quality filter
        if is_valid_pair(norm_s, norm_t):
            norm_src.append(norm_s)
            norm_tgt.append(norm_t)
        else:
            filtered_count += 1
    
    # Save filtered, normalized files
    with src_out.open("w", encoding="utf-8") as f:
        f.write("\n".join(norm_src))
    with tgt_out.open("w", encoding="utf-8") as f:
        f.write("\n".join(norm_tgt))
    
    # Statistics
    original_count = len(src_lines)
    final_count = len(norm_src)
    filter_rate = (filtered_count / original_count) * 100 if original_count else 0.0  # guard against an empty split
    
    print(f"\n{sp.upper()}: {original_count} -> {final_count} ({filtered_count} filtered, {filter_rate:.1f}%)")
    
    # Sample pairs for verification
    sample_pairs = list(zip(norm_src[:5], norm_tgt[:5]))
    summary[sp] = {
        "original": original_count,
        "final": final_count, 
        "filtered": filtered_count,
        "sample": sample_pairs
    }
    
    total_filtered += filtered_count

print(f"\n=== SUMMARY ===")
print(f"Total pairs filtered out: {total_filtered}")

# Print samples
for sp, info in summary.items():
    print(f"\n--- {sp.upper()} SAMPLES ({info['final']} pairs) ---")
    for i, (u, r) in enumerate(info["sample"], 1):
        print(f"{i}) URDU:  {u}")
        print(f"   ROMAN: {r}")
        print(f"   WORDS: {len(u.split())} -> {len(r.split())}")

# Additional quality checks
print(f"\n=== QUALITY ANALYSIS ===")
all_src = []
all_tgt = []

for sp in ["train", "val", "test"]:
    src_file = out_dir / f"{sp}.source"
    tgt_file = out_dir / f"{sp}.target"
    
    if src_file.exists() and tgt_file.exists():
        with src_file.open("r", encoding="utf-8") as f:
            all_src.extend([l.strip() for l in f if l.strip()])
        with tgt_file.open("r", encoding="utf-8") as f:
            all_tgt.extend([l.strip() for l in f if l.strip()])

if all_src and all_tgt:
    src_lengths = [len(s.split()) for s in all_src]
    tgt_lengths = [len(t.split()) for t in all_tgt]
    
    print(f"Source word count - Min: {min(src_lengths)}, Max: {max(src_lengths)}, Avg: {sum(src_lengths)/len(src_lengths):.1f}")
    print(f"Target word count - Min: {min(tgt_lengths)}, Max: {max(tgt_lengths)}, Avg: {sum(tgt_lengths)/len(tgt_lengths):.1f}")
    
    # Vocabulary size estimation
    src_vocab = set(' '.join(all_src).split())
    tgt_vocab = set(' '.join(all_tgt).split())
    
    print(f"Estimated source vocabulary: {len(src_vocab):,}")
    print(f"Estimated target vocabulary: {len(tgt_vocab):,}")

print("\n✅ Enhanced normalization completed!")

Starting enhanced normalization with quality filtering...

TRAIN: 10501 -> 10445 (56 filtered, 0.5%)

VAL: 5250 -> 5219 (31 filtered, 0.6%)

TEST: 5252 -> 5223 (29 filtered, 0.6%)

=== SUMMARY ===
Total pairs filtered out: 116

--- TRAIN SAMPLES (10445 pairs) ---
1) URDU:  جو ازل سے چھڑ گیا ہے اس فسانے کی کہو
   ROMAN: jo azal se chhiḍ gayā hai us fasāne kī kaho
   WORDS: 10 -> 10
2) URDU:  ایک بے چہرہ سی امید ہے چہرہ چہرہ
   ROMAN: ek be-chehra sī ummīd hai chehra chehra
   WORDS: 8 -> 7
3) URDU:  حسن کے خضر نے کیا لبریز
   ROMAN: husn ke ḳhizr ne kiyā labrez
   WORDS: 6 -> 6
4) URDU:  برسا بھی تو کس دشت کے بے فیض بدن پر
   ROMAN: barsā bhī to kis dasht ke be-faiz badan par
   WORDS: 10 -> 9
5) URDU:  کچھ دور چل کے راستے سب ایک سے لگے
   ROMAN: kuchh duur chal ke rāste sab ek se lage
   WORDS: 9 -> 9

--- VAL SAMPLES (5219 pairs) ---
1) URDU:  استان یار سے اٹھ جائیں کیا
   ROMAN: āstān-e-yār se uth jaa.eñ kyā
   WORDS: 6 -> 5
2) URDU:  اتا ہے ہوش مجھ کو اب تو پہر پہر میں
   ROMAN: aat

## SentencePiece Tokenization for Urdu → Roman Urdu

---

### What we are doing
We will **train a SentencePiece tokenizer** on our **Urdu → Roman Urdu dataset** and then apply it to all splits (**train, validation, test**).  
This will convert raw text into **token IDs** that can be fed into the **BiLSTM seq2seq model**.  

---

### How we are doing it
1. **Train two SentencePiece models** (on the training split only):
   - One for **Urdu (source language)**.  
   - One for **Roman Urdu (target language)**.  
2. Save the trained models (`.model`, `.vocab`).  
3. Encode the normalized dataset files (`train.source`, `train.target`, `val.*`, `test.*`) into **integer token IDs**, written as `.src` / `.tgt` files.  
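
The on-disk format of the encoded files is one sentence per line, with token IDs separated by spaces and wrapped in BOS/EOS. Below is a minimal sketch of that round trip, assuming BOS ID 2 and EOS ID 3 as configured in the training cell. The helper names and the sample IDs are illustrative only.

```python
BOS_ID, EOS_ID = 2, 3  # assumed to match the SentencePiece training options

def ids_to_line(ids, add_bos=True, add_eos=True):
    """Wrap token IDs with BOS/EOS and serialize them as one text line."""
    if add_bos:
        ids = [BOS_ID] + ids
    if add_eos:
        ids = ids + [EOS_ID]
    return " ".join(map(str, ids))

def line_to_ids(line):
    """Parse a serialized line back into a list of integers."""
    return [int(tok) for tok in line.split()]

piece_ids = [20, 576, 179, 2073, 959, 92]  # illustrative IDs for one sentence
line = ids_to_line(piece_ids)
print(line)               # 2 20 576 179 2073 959 92 3
print(line_to_ids(line))  # [2, 20, 576, 179, 2073, 959, 92, 3]
```

Storing BOS/EOS in the files means later stages must not add them a second time.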

---

### Why we are doing this
- Neural networks work with **numbers, not raw text**.  
- **Tokenization bridges this gap** by mapping text → tokens → IDs.  

---

### Why SentencePiece over others?

| Method | Characteristics |
|--------|-----------------|
| **One-hot encoding** | Inefficient, huge sparse vectors, no subword knowledge. |
| **Word-level tokenization** | Breaks on unseen/rare poetic words in Urdu. |
| **Character-level tokenization** | Handles everything, but creates long sequences & loses semantics. |
| **Subword (SentencePiece / BPE)** | Handles rare & unseen words.<br> Keeps vocabulary compact.<br> Learns reusable chunks (e.g., *mohabbat* might split into `["moh", "abbat"]`). |

---

**Conclusion:**  
**SentencePiece** is a good fit for our low-resource dataset (**Urdu ghazals**) because subword units cope with rare and unseen words better than a word-level vocabulary and yield shorter sequences than character-level tokenization, while still keeping vocabulary size manageable.  
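
To make the subword idea concrete, here is a toy version of a single BPE merge step in plain Python. It is a simplified illustration of the general algorithm, not SentencePiece's implementation, and the word frequencies are made up. The function names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> made-up frequency, each word starts as single characters
corpus = {"chehra": 3, "chal": 2, "kuchh": 2}
words = {tuple(word): freq for word, freq in corpus.items()}

best = most_frequent_pair(words)
words = merge_pair(words, best)
print(best)   # ('c', 'h')
print(words)
```

Repeating this step until the vocabulary reaches the target size is what produces chunks such as `ch` that recur across many words.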


In [56]:
import sentencepiece as spm
import os

# ----------------------------
# Paths
# ----------------------------
data_dir = "/kaggle/working/normalized"   # use normalized data
smp_dir  = "/kaggle/working/spm_models"   # save trained models
tok_dir  = "/kaggle/working/tokenized"    # save tokenized data

os.makedirs(smp_dir, exist_ok=True)
os.makedirs(tok_dir, exist_ok=True)

train_src = os.path.join(data_dir, "train.source")
train_tgt = os.path.join(data_dir, "train.target")

print("Training SentencePiece models...")

# ----------------------------
# 1. Train SentencePiece Models
# ----------------------------
# Urdu (source)
spm.SentencePieceTrainer.train(
    input=train_src,
    model_prefix=os.path.join(smp_dir, "spm_ur"),
    vocab_size=8000,  # Increased from 5000 for better coverage
    character_coverage=1.0,   # cover full Urdu script
    model_type="bpe",
    unk_id=0, pad_id=1, bos_id=2, eos_id=3,
    max_sentence_length=512,  # Add length limit
    shuffle_input_sentence=True,  # Better training
    split_by_whitespace=True
)

# Roman Urdu (target)
spm.SentencePieceTrainer.train(
    input=train_tgt,
    model_prefix=os.path.join(smp_dir, "spm_ro"),
    vocab_size=8000,  # Increased from 5000 for better coverage
    character_coverage=1.0,
    model_type="bpe",
    unk_id=0, pad_id=1, bos_id=2, eos_id=3,
    max_sentence_length=512,  # Add length limit
    shuffle_input_sentence=True,  # Better training
    split_by_whitespace=True
)

print("✅ SentencePiece models trained and saved!")

# ----------------------------
# 2. Load Tokenizers and Validate
# ----------------------------
sp_ur = spm.SentencePieceProcessor(model_file=os.path.join(smp_dir, "spm_ur.model"))
sp_ro = spm.SentencePieceProcessor(model_file=os.path.join(smp_dir, "spm_ro.model"))

print(f"Urdu vocab size: {sp_ur.get_piece_size()}")
print(f"Roman vocab size: {sp_ro.get_piece_size()}")

# Validate special tokens
print(f"Urdu - UNK: {sp_ur.unk_id()}, PAD: {sp_ur.pad_id()}, BOS: {sp_ur.bos_id()}, EOS: {sp_ur.eos_id()}")
print(f"Roman - UNK: {sp_ro.unk_id()}, PAD: {sp_ro.pad_id()}, BOS: {sp_ro.bos_id()}, EOS: {sp_ro.eos_id()}")

# ----------------------------
# 3. Test Tokenization Quality
# ----------------------------
print("\n=== TOKENIZATION QUALITY TEST ===")

test_sentences_ur = [
    "میں تمہیں بہت پسند کرتا ہوں",
    "یہ ایک خوبصورت دن ہے",
    "شاعری کی دنیا بہت وسیع ہے"
]

test_sentences_ro = [
    "maiñ tumheñ bahut pasand kartā hūñ",
    "ye aik ḳhūbsūrat din hai", 
    "shā.irī kī duniyā bahut vasī.a hai"
]

for i, (ur_sent, ro_sent) in enumerate(zip(test_sentences_ur, test_sentences_ro), 1):
    print(f"\nTest {i}:")
    
    # Urdu tokenization
    ur_tokens = sp_ur.encode(ur_sent, out_type=str)
    ur_ids = sp_ur.encode(ur_sent, out_type=int)
    ur_decoded = sp_ur.decode(ur_ids)
    
    print(f"Urdu Original:  {ur_sent}")
    print(f"Urdu Tokens:    {ur_tokens}")
    print(f"Urdu IDs:       {ur_ids}")
    print(f"Urdu Decoded:   {ur_decoded}")
    
    # Roman tokenization
    ro_tokens = sp_ro.encode(ro_sent, out_type=str)
    ro_ids = sp_ro.encode(ro_sent, out_type=int)
    ro_decoded = sp_ro.decode(ro_ids)
    
    print(f"Roman Original: {ro_sent}")
    print(f"Roman Tokens:   {ro_tokens}")
    print(f"Roman IDs:      {ro_ids}")
    print(f"Roman Decoded:  {ro_decoded}")

# ----------------------------
# 4. IMPROVED Encoding with BOS/EOS tokens
# ----------------------------
def encode_file_with_special_tokens(input_path, output_path, sp, add_bos=True, add_eos=True):
    """Encode file with proper BOS/EOS token handling"""
    if not os.path.exists(input_path):
        print(f"Warning: {input_path} does not exist!")
        return
    
    encoded_count = 0
    with open(input_path, "r", encoding="utf-8") as fin, \
         open(output_path, "w", encoding="utf-8") as fout:
        
        for line in fin:
            line = line.strip()
            if not line:
                continue
            
            # Encode the sentence
            ids = sp.encode(line, out_type=int)
            
            # Add special tokens
            if add_bos:
                ids = [sp.bos_id()] + ids
            if add_eos:
                ids = ids + [sp.eos_id()]
            
            # Write to file
            fout.write(" ".join(map(str, ids)) + "\n")
            encoded_count += 1
    
    print(f"  Encoded {encoded_count} sentences to {os.path.basename(output_path)}")

print("\n=== ENCODING DATASETS ===")

splits = ["train", "val", "test"]
for split in splits:
    print(f"Processing {split} split...")
    
    # Source (Urdu) - add BOS/EOS for encoder input
    src_input = f"{data_dir}/{split}.source"
    src_output = f"{tok_dir}/{split}.src"
    encode_file_with_special_tokens(src_input, src_output, sp_ur, add_bos=True, add_eos=True)
    
    # Target (Roman) - add BOS/EOS for decoder
    tgt_input = f"{data_dir}/{split}.target"
    tgt_output = f"{tok_dir}/{split}.tgt"
    encode_file_with_special_tokens(tgt_input, tgt_output, sp_ro, add_bos=True, add_eos=True)

print("✅ All dataset splits tokenized with BOS/EOS tokens!")

# ----------------------------
# 5. Validate Tokenized Data
# ----------------------------
print("\n=== TOKENIZED DATA VALIDATION ===")

for split in splits:
    src_file = f"{tok_dir}/{split}.src"
    tgt_file = f"{tok_dir}/{split}.tgt"
    
    if os.path.exists(src_file) and os.path.exists(tgt_file):
        with open(src_file, "r") as f_src, open(tgt_file, "r") as f_tgt:
            src_lines = [l.strip() for l in f_src if l.strip()]
            tgt_lines = [l.strip() for l in f_tgt if l.strip()]
        
        print(f"{split.upper()}: {len(src_lines)} src, {len(tgt_lines)} tgt")
        
        if len(src_lines) != len(tgt_lines):
            print(f"  ⚠️  WARNING: Mismatched counts in {split}!")
        
        # Show sample tokenized pairs
        if src_lines and tgt_lines:
            print(f"  Sample src: {src_lines[0]}")
            print(f"  Sample tgt: {tgt_lines[0]}")
            
            # Decode to verify
            src_ids = list(map(int, src_lines[0].split()))
            tgt_ids = list(map(int, tgt_lines[0].split()))
            
            src_decoded = sp_ur.decode(src_ids)
            tgt_decoded = sp_ro.decode(tgt_ids)
            
            print(f"  Decoded src: {src_decoded}")
            print(f"  Decoded tgt: {tgt_decoded}")
            print()

# ----------------------------
# 6. Generate Statistics
# ----------------------------
print("=== TOKENIZATION STATISTICS ===")

all_src_lengths = []
all_tgt_lengths = []

for split in splits:
    src_file = f"{tok_dir}/{split}.src"
    if os.path.exists(src_file):
        with open(src_file, "r") as f:
            for line in f:
                if line.strip():
                    length = len(line.strip().split())
                    all_src_lengths.append(length)

for split in splits:
    tgt_file = f"{tok_dir}/{split}.tgt"
    if os.path.exists(tgt_file):
        with open(tgt_file, "r") as f:
            for line in f:
                if line.strip():
                    length = len(line.strip().split())
                    all_tgt_lengths.append(length)

if all_src_lengths and all_tgt_lengths:
    print(f"Source token lengths - Min: {min(all_src_lengths)}, Max: {max(all_src_lengths)}, Avg: {sum(all_src_lengths)/len(all_src_lengths):.1f}")
    print(f"Target token lengths - Min: {min(all_tgt_lengths)}, Max: {max(all_tgt_lengths)}, Avg: {sum(all_tgt_lengths)/len(all_tgt_lengths):.1f}")

print("\n✅ Improved tokenization completed!")

Training SentencePiece models...
✅ SentencePiece models trained and saved!
Urdu vocab size: 8000
Roman vocab size: 8000
Urdu - UNK: 0, PAD: 1, BOS: 2, EOS: 3
Roman - UNK: 0, PAD: 1, BOS: 2, EOS: 3

=== TOKENIZATION QUALITY TEST ===

Test 1:
Urdu Original:  میں تمہیں بہت پسند کرتا ہوں
Urdu Tokens:    ['▁میں', '▁تمہیں', '▁بہت', '▁پسند', '▁کرتا', '▁ہوں']
Urdu IDs:       [20, 576, 179, 2073, 959, 92]
Urdu Decoded:   میں تمہیں بہت پسند کرتا ہوں
Roman Original: maiñ tumheñ bahut pasand kartā hūñ
Roman Tokens:   ['▁maiñ', '▁tumheñ', '▁bahut', '▁pasand', '▁kartā', '▁h', 'ūñ']
Roman IDs:      [112, 811, 239, 4185, 1108, 18, 121]
Roman Decoded:  maiñ tumheñ bahut pasand kartā hūñ

Test 2:
Urdu Original:  یہ ایک خوبصورت دن ہے
Urdu Tokens:    ['▁یہ', '▁ایک', '▁خوب', 'صورت', '▁دن', '▁ہے']
Urdu IDs:       [59, 170, 1066, 6512, 151, 15]
Urdu Decoded:   یہ ایک خوبصورت دن ہے
Roman Original: ye aik ḳhūbsūrat din hai
Roman Tokens:   ['▁ye', '▁a', 'ik', '▁ḳhūb', 'sūrat', '▁din', '▁hai']
Roman IDs:      [8

## Dataset & DataLoader in PyTorch

---

### What we are doing
We need to prepare our tokenized data (`tokenized/train.src`, `tokenized/train.tgt`, etc.) so the BiLSTM model can use it.  

- PyTorch requires data in batches of padded sequences.  
- The SentencePiece tokenizer gives us IDs for each token.  
- We will build a custom Dataset class and use DataLoader to efficiently feed data during training.  

---

### How we do it
1. Load the SentencePiece models (`spm_ur.model` and `spm_ro.model`).  
2. Read the tokenized files (train/val/test splits).  
3. Parse each line of space-separated IDs into a list of integers (the files are already tokenized and already include BOS/EOS).  
4. Pad sequences so that all examples in a batch have the same length.  
5. Wrap everything inside a PyTorch Dataset and feed with DataLoader.  

---

### Why we do it
- Each batch must be a single rectangular tensor, but sentences naturally have variable lengths.  
- Padding and batching allow the model to process multiple sentences at once, making training much faster.  
- A clean Dataset class keeps the code modular, reusable, and easy to debug.  
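
What padding does to a batch can be shown without PyTorch. The sketch below is a plain-Python stand-in for the padding step, assuming pad ID 1 as in the tokenizer settings. The names `pad_batch` and `padding_mask` are hypothetical, and the token IDs are made up.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def padding_mask(padded, pad_id=1):
    """True where a position holds a real token, False where it is padding."""
    return [[tok != pad_id for tok in row] for row in padded]

batch = [[2, 57, 3], [2, 14, 88, 9, 3]]  # two sentences of different length
padded = pad_batch(batch)
mask = padding_mask(padded)
print(padded)  # [[2, 57, 3, 1, 1], [2, 14, 88, 9, 3]]
print(mask)
```

A mask like this is what lets a loss function or an attention layer ignore the padded positions.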


In [57]:
# ✅ FIXED Dataset & DataLoader in PyTorch (using pre-tokenized files)
import torch
from torch.utils.data import Dataset, DataLoader
import sentencepiece as spm

# Load trained SentencePiece models
sp_ur = spm.SentencePieceProcessor(model_file="/kaggle/working/spm_models/spm_ur.model")
sp_ro = spm.SentencePieceProcessor(model_file="/kaggle/working/spm_models/spm_ro.model")

# Utility function: read tokenized IDs from file
def read_file(path):
    with open(path, "r", encoding="utf-8") as f:
        return [[int(tok) for tok in line.strip().split()] for line in f if line.strip()]

# FIXED Custom Dataset (tokens already include BOS/EOS)
class TranslationDataset(Dataset):
    def __init__(self, source_file, target_file):
        self.src_ids = read_file(source_file)
        self.tgt_ids = read_file(target_file)
        
        # Verify lengths match
        assert len(self.src_ids) == len(self.tgt_ids), f"Mismatched lengths: {len(self.src_ids)} vs {len(self.tgt_ids)}"
        
        # Verify BOS/EOS tokens are already present (optional check)
        print(f"Sample src tokens: {self.src_ids[0][:5]}...{self.src_ids[0][-5:]}")
        print(f"Sample tgt tokens: {self.tgt_ids[0][:5]}...{self.tgt_ids[0][-5:]}")
        print(f"BOS ID: {sp_ur.bos_id()}, EOS ID: {sp_ur.eos_id()}")
    
    def __len__(self):
        return len(self.src_ids)
    
    def __getitem__(self, idx):
        # ✅ NO DOUBLE BOS/EOS - tokens already have them!
        src = torch.tensor(self.src_ids[idx], dtype=torch.long)
        tgt = torch.tensor(self.tgt_ids[idx], dtype=torch.long)
        return src, tgt

# Collate function to pad batches
def collate_fn(batch):
    src_batch, tgt_batch = zip(*batch)
    
    # Pad sequences
    src_padded = torch.nn.utils.rnn.pad_sequence(src_batch, batch_first=True, padding_value=sp_ur.pad_id())
    tgt_padded = torch.nn.utils.rnn.pad_sequence(tgt_batch, batch_first=True, padding_value=sp_ro.pad_id())
    
    return src_padded, tgt_padded

# Create datasets
train_dataset = TranslationDataset("/kaggle/working/tokenized/train.src", "/kaggle/working/tokenized/train.tgt")
val_dataset   = TranslationDataset("/kaggle/working/tokenized/val.src", "/kaggle/working/tokenized/val.tgt")
test_dataset  = TranslationDataset("/kaggle/working/tokenized/test.src", "/kaggle/working/tokenized/test.tgt")

# Create dataloaders
BATCH_SIZE = 32  # Smaller batches: more (noisier) gradient updates per epoch
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn, drop_last=True)
val_loader   = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_loader  = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

print("✅ FIXED DataLoaders ready!")
print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"Test batches: {len(test_loader)}")

# Quick verification
sample_batch = next(iter(train_loader))
src_batch, tgt_batch = sample_batch
print(f"\nBatch shapes: src={src_batch.shape}, tgt={tgt_batch.shape}")
print(f"Sample src: {src_batch[0]}")
print(f"Sample tgt: {tgt_batch[0]}")

Sample src tokens: [2, 69, 1874, 27, 7030]...[42, 3522, 28, 854, 3]
Sample tgt tokens: [2, 94, 3075, 27, 4042]...[91, 4149, 43, 961, 3]
BOS ID: 2, EOS ID: 3
Sample src tokens: [2, 6357, 258, 27, 407]...[27, 407, 491, 51, 3]
Sample tgt tokens: [2, 452, 1700, 7974, 7964]...[88, 7988, 21, 79, 3]
BOS ID: 2, EOS ID: 3
Sample src tokens: [2, 6294, 35, 12, 435]...[378, 266, 47, 6919, 3]
Sample tgt tokens: [2, 5740, 7964, 45, 1810]...[446, 309, 59, 5747, 3]
BOS ID: 2, EOS ID: 3
✅ FIXED DataLoaders ready!
Train batches: 326
Val batches: 164
Test batches: 164

Batch shapes: src=torch.Size([32, 14]), tgt=torch.Size([32, 16])
Sample src: tensor([   2, 7778,  979,   85,   20,   76,  549,   32,   56,  616,  121,    3,
           1,    1])
Sample tgt: tensor([   2, 5783,  103, 2840,  109,  112,   99,  639,   53,   82, 1641, 7974,
         476,    3,    1,    1])
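
The batch counts printed above follow from the filtered split sizes (10,445 / 5,219 / 5,223) and `BATCH_SIZE = 32`. A short sketch of the arithmetic, with a hypothetical helper `num_batches`: the training loader uses `drop_last=True`, so its last partial batch is discarded, while the other two loaders keep theirs.

```python
import math

def num_batches(n_examples, batch_size, drop_last=False):
    """Number of batches a loader yields for n_examples items."""
    if drop_last:
        return n_examples // batch_size       # partial batch is discarded
    return math.ceil(n_examples / batch_size)  # partial batch is kept

train_batches = num_batches(10445, 32, drop_last=True)
val_batches = num_batches(5219, 32)
test_batches = num_batches(5223, 32)
print(train_batches, val_batches, test_batches)  # 326 164 164
```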


## Seq2Seq Model Architecture (BiLSTM Encoder → LSTM Decoder)

---

### 1. Encoder (BiLSTM)

**What?**  
Converts the Urdu input sequence into hidden states.  

**How?**  
- Uses embeddings to represent tokens.  
- BiLSTM captures both forward and backward context.  
- The last layer's final forward and backward hidden states are concatenated and projected through a fully connected layer (followed by `tanh`) to the decoder's hidden size.  

**Why?**  
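Poetic Urdu often has dependencies across long spans. A BiLSTM reads each line in both directions, so every position's representation reflects both its left and right context, which a single-direction LSTM cannot provide.  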
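
Selecting the last layer's forward and backward states depends on how PyTorch orders the final hidden states of a multi-layer bidirectional LSTM: by layer first, then by direction. The sketch below only models that ordering with string labels (no tensors involved). The function name `hidden_state_labels` is hypothetical.

```python
def hidden_state_labels(n_layers):
    """Label each slot of h_n for a bidirectional LSTM, in PyTorch's order."""
    return [
        f"layer{layer}_{direction}"
        for layer in range(n_layers)
        for direction in ("fwd", "bwd")
    ]

labels = hidden_state_labels(2)
print(labels)      # ['layer0_fwd', 'layer0_bwd', 'layer1_fwd', 'layer1_bwd']
print(labels[-2])  # layer1_fwd  (last layer, forward direction)
print(labels[-1])  # layer1_bwd  (last layer, backward direction)
```

This is why indexing with `-2` and `-1` picks the top layer's two directions regardless of how many layers the encoder has.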

---

### 2. Decoder (LSTM)

**What?**  
Generates Roman Urdu output tokens one by one.  

**How?**  
- Embedding → LSTM → Linear layer → Vocabulary distribution.  
- Uses teacher forcing (sometimes feeding the true token, sometimes the model’s prediction).  

**Why?**  
Teacher forcing speeds up training and stabilizes learning.  
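
The teacher forcing decision can be sketched in isolation. This is a plain-Python illustration with made-up token IDs, not the notebook's decoder; the names `choose_next_input` and `decoder_inputs` are hypothetical. At each step the next decoder input is the gold token with probability `teacher_forcing_ratio`, otherwise the model's own prediction.

```python
import random

def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the gold token with the given probability, else the prediction."""
    return gold_token if rng.random() < teacher_forcing_ratio else predicted_token

def decoder_inputs(gold, predicted, teacher_forcing_ratio, seed=0):
    """Build the token sequence fed to the decoder, position by position."""
    rng = random.Random(seed)
    inputs = [gold[0]]  # decoding always starts from BOS
    for t in range(1, len(gold)):
        inputs.append(
            choose_next_input(gold[t], predicted[t], teacher_forcing_ratio, rng)
        )
    return inputs

gold = [2, 10, 11, 12]       # BOS followed by three gold tokens (made up)
predicted = [2, 10, 99, 12]  # hypothetical model predictions, one per position

always_gold = decoder_inputs(gold, predicted, teacher_forcing_ratio=1.0)
never_gold = decoder_inputs(gold, predicted, teacher_forcing_ratio=0.0)
print(always_gold)  # [2, 10, 11, 12]
print(never_gold)   # [2, 10, 99, 12]
```

Ratios between 0 and 1 mix the two behaviours, exposing the decoder to its own mistakes during training while still anchoring it to the reference.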

---

### 3. Seq2Seq Model

**What?**  
Combines Encoder and Decoder into an end-to-end NMT model.  

**How?**  
- Encoder processes the entire source sentence.  
- Decoder starts with the BOS token (`<s>`, ID 2 in our SentencePiece models) and generates output step by step.  
- Teacher forcing ratio controls learning stability.  

**Why?**  
This architecture is the core requirement of the project: **BiLSTM encoder → LSTM decoder**.  


In [58]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import random

# -------------------------------
# 1. Improved Encoder (BiLSTM)
# -------------------------------
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout):
        super().__init__()
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim, padding_idx=1)  # pad_idx=1
        self.rnn = nn.LSTM(emb_dim, enc_hidden_dim, num_layers=n_layers,
                          bidirectional=True, dropout=dropout if n_layers > 1 else 0.0, 
                          batch_first=True)
        
        # Project bidirectional hidden to decoder hidden size
        self.fc_hidden = nn.Linear(enc_hidden_dim * 2, dec_hidden_dim)
        self.fc_cell = nn.Linear(enc_hidden_dim * 2, dec_hidden_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len=None):
        # src: [batch_size, src_len]
        embedded = self.dropout(self.embedding(src))  # [batch, src_len, emb_dim]
        
        # Pack if lengths provided (optional optimization)
        if src_len is not None:
            embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len, batch_first=True, enforce_sorted=False)
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        # Unpack if packed
        if src_len is not None:
            outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
        
        # outputs: [batch, src_len, enc_hidden_dim * 2]
        # hidden: [n_layers * 2, batch, enc_hidden_dim]
        # cell: [n_layers * 2, batch, enc_hidden_dim]
        
        # Concatenate final forward and backward hidden states
        # Take the last layer's forward and backward states
        hidden_fwd = hidden[-2, :, :]  # [batch, enc_hidden_dim]
        hidden_bwd = hidden[-1, :, :]  # [batch, enc_hidden_dim] 
        hidden_cat = torch.cat((hidden_fwd, hidden_bwd), dim=1)  # [batch, enc_hidden_dim * 2]
        
        cell_fwd = cell[-2, :, :]
        cell_bwd = cell[-1, :, :]
        cell_cat = torch.cat((cell_fwd, cell_bwd), dim=1)
        
        # Project to decoder dimensions and repeat for all decoder layers
        hidden_proj = torch.tanh(self.fc_hidden(hidden_cat))  # [batch, dec_hidden_dim]
        cell_proj = torch.tanh(self.fc_cell(cell_cat))  # [batch, dec_hidden_dim]
        
        # Repeat for all decoder layers
        hidden_proj = hidden_proj.unsqueeze(0).repeat(self.n_layers, 1, 1)  # [n_layers, batch, dec_hidden_dim]
        cell_proj = cell_proj.unsqueeze(0).repeat(self.n_layers, 1, 1)  # [n_layers, batch, dec_hidden_dim]
        
        return outputs, hidden_proj, cell_proj

# -------------------------------
# 2. Attention Mechanism
# -------------------------------
class Attention(nn.Module):
    def __init__(self, enc_hidden_dim, dec_hidden_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hidden_dim * 2 + dec_hidden_dim, dec_hidden_dim)
        self.v = nn.Linear(dec_hidden_dim, 1, bias=False)
        
    def forward(self, hidden, encoder_outputs, mask=None):
        # hidden: [batch, dec_hidden_dim]
        # encoder_outputs: [batch, src_len, enc_hidden_dim * 2]
        
        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]
        
        # Repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # [batch, src_len, dec_hidden_dim]
        
        # Calculate attention energies
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))  # [batch, src_len, dec_hidden_dim]
        attention = self.v(energy).squeeze(2)  # [batch, src_len]
        
        # Apply mask if provided (for padding)
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        
        return F.softmax(attention, dim=1)

# -------------------------------
# 3. Improved Decoder with Attention
# -------------------------------
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim, padding_idx=1)
        self.rnn = nn.LSTM(emb_dim + enc_hidden_dim * 2, dec_hidden_dim, num_layers=n_layers,
                          dropout=dropout if n_layers > 1 else 0.0, batch_first=True)
        
        self.fc_out = nn.Linear(emb_dim + enc_hidden_dim * 2 + dec_hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell, encoder_outputs, mask=None):
        # input: [batch]
        # hidden: [n_layers, batch, dec_hidden_dim]
        # cell: [n_layers, batch, dec_hidden_dim]
        # encoder_outputs: [batch, src_len, enc_hidden_dim * 2]
        
        input = input.unsqueeze(1)  # [batch, 1]
        embedded = self.dropout(self.embedding(input))  # [batch, 1, emb_dim]
        
        # Calculate attention using the top layer hidden state
        a = self.attention(hidden[-1], encoder_outputs, mask)  # [batch, src_len]
        a = a.unsqueeze(1)  # [batch, 1, src_len]
        
        # Calculate weighted encoder outputs (context vector)
        weighted = torch.bmm(a, encoder_outputs)  # [batch, 1, enc_hidden_dim * 2]
        
        # Concatenate embedding with context
        rnn_input = torch.cat((embedded, weighted), dim=2)  # [batch, 1, emb_dim + enc_hidden_dim * 2]
        
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        # output: [batch, 1, dec_hidden_dim]
        
        # Prediction
        prediction = self.fc_out(torch.cat((output.squeeze(1), weighted.squeeze(1), embedded.squeeze(1)), dim=1))
        # [batch, output_dim]
        
        return prediction, hidden, cell, a.squeeze(1)

# -------------------------------
# 4. Complete Seq2Seq Model
# -------------------------------
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device
        
    def create_mask(self, src):
        # Create mask for padding tokens
        mask = (src != self.src_pad_idx)  # [batch, src_len]
        return mask
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim
        
        # Create mask for source padding
        mask = self.create_mask(src)
        
        # Encode
        encoder_outputs, hidden, cell = self.encoder(src)
        
        # Initialize outputs tensor
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=self.device)
        attentions = torch.zeros(batch_size, trg_len, src.shape[1], device=self.device)
        
        # First input to decoder is the BOS token
        input = trg[:, 0]
        
        for t in range(1, trg_len):
            # Pass through decoder
            output, hidden, cell, attention = self.decoder(input, hidden, cell, encoder_outputs, mask)
            
            # Store output and attention
            outputs[:, t] = output
            attentions[:, t] = attention
            
            # Decide whether to use teacher forcing
            teacher_force = random.random() < teacher_forcing_ratio
            
            # Get the highest predicted token
            top1 = output.argmax(1)
            
            # Update input for next time step
            input = trg[:, t] if teacher_force else top1
            
        return outputs, attentions

# -------------------------------
# 5. Model Initialization Function
# -------------------------------
def create_model(src_vocab_size, trg_vocab_size, device, 
                emb_dim=256, enc_hidden_dim=256, dec_hidden_dim=256, 
                n_layers=2, dropout=0.3, src_pad_idx=1):
    
    attention = Attention(enc_hidden_dim, dec_hidden_dim)
    encoder = Encoder(src_vocab_size, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout)
    decoder = Decoder(trg_vocab_size, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout, attention)
    
    model = Seq2Seq(encoder, decoder, src_pad_idx, device).to(device)
    
    return model

# -------------------------------
# 6. Model Parameter Initialization
# -------------------------------
def init_weights(model):
    for name, param in model.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

# Usage example:
# model = create_model(src_vocab_size=8000, trg_vocab_size=8000, device=device)
# init_weights(model)
# print(f"Model has {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters")

## Training Setup (Loss, Optimizer, Hyperparameters)

---

### What we are doing
We set up the model with vocabulary sizes, dimensions, loss function, optimizer, and training parameters.  

---

### How
- SentencePiece provides `ur_vocab_size` and `ro_vocab_size`.  
- The encoder is initialized with BiLSTM layers and the decoder with unidirectional LSTM layers, both with dropout.  
- **Loss function**: `CrossEntropyLoss`, ignoring `<pad>` so padding does not affect training.  
- **Optimizer**: AdamW (Adam with decoupled weight decay), which adapts the step size per parameter.  
- **Training hyperparameters**:  
  - Batch size  
  - Number of epochs  
  - Teacher forcing ratio (controls how often the decoder uses the true token vs. its own prediction)  

---

### Why
- Padding should not influence the gradient updates, so we mask it in the loss.  
- AdamW generally works well for sequence models because it adapts the step size per parameter.  
- Proper hyperparameter settings (batch size, epochs, teacher forcing ratio) are crucial for stable training and good performance.  
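
To make the padding point concrete, here is a small plain-Python sketch of a cross-entropy that skips pad positions, which is the behaviour `ignore_index` is used for. The function name, the four-token vocabulary, and the probabilities are hypothetical; pad ID 1 is an assumption taken from the `padding_idx=1` used in the model code.

```python
import math

PAD_ID = 1  # assumed pad ID

def cross_entropy_ignore_pad(probs_per_step, targets, pad_id=PAD_ID):
    """Mean negative log-likelihood over non-pad target positions."""
    losses = [-math.log(probs[t])
              for probs, t in zip(probs_per_step, targets)
              if t != pad_id]
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical predicted distributions over a 4-token vocabulary, one per step.
probs_per_step = [
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.2, 0.5],
    [0.4, 0.2, 0.2, 0.2],  # padded step: ignored by the masked loss
    [0.4, 0.2, 0.2, 0.2],  # padded step: ignored by the masked loss
]
targets = [2, 3, PAD_ID, PAD_ID]

masked = cross_entropy_ignore_pad(probs_per_step, targets)
unmasked = sum(-math.log(p[t]) for p, t in zip(probs_per_step, targets)) / len(targets)
print(f"masked={masked:.4f} unmasked={unmasked:.4f}")
```

The unmasked average is pulled toward whatever the model predicts at padded positions, which says nothing about translation quality; the masked version averages over real tokens only.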


In [77]:
# COMPLETE FIXED TRAINING SETUP
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
import math
import time
# --------------------------
# 1. Load vocab sizes from SentencePiece
# --------------------------
ur_vocab_size = sp_ur.get_piece_size()   # Urdu vocab size
ro_vocab_size = sp_ro.get_piece_size()   # Roman Urdu vocab size
print("Urdu vocab size:", ur_vocab_size)
print("Roman vocab size:", ro_vocab_size)
# --------------------------
# 2. FIXED Hyperparameters
# --------------------------
# Model Architecture
EMB_DIM = 256           # ✅ Increased from 128
ENC_HIDDEN_DIM = 384    # ✅ Encoder hidden dimension
DEC_HIDDEN_DIM = 384   # ✅ Decoder hidden dimension
N_LAYERS = 2            # ✅ Same for encoder and decoder
DROPOUT = 0.3           # ✅ Reduced from 0.4
# Training Parameters
LEARNING_RATE = 0.001   # ✅ Increased from 1e-4
BATCH_SIZE = 32         # ✅ Match your dataloader
NUM_EPOCHS = 30         # ✅ More epochs
TEACHER_FORCING_RATIO = 0.6  # ✅ Start lower
CLIP = 0.5              # ✅ Gradient clipping
PATIENCE = 5            # ✅ Early stopping patience
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# --------------------------
# 3. Initialize IMPROVED Model
# --------------------------
# Build the attention-based Seq2Seq model defined in the model cell above
model = create_model(
    src_vocab_size=ur_vocab_size,
    trg_vocab_size=ro_vocab_size,
    device=device,
    emb_dim=EMB_DIM,
    enc_hidden_dim=ENC_HIDDEN_DIM,
    dec_hidden_dim=DEC_HIDDEN_DIM,
    n_layers=N_LAYERS,
    dropout=DROPOUT,
    src_pad_idx=sp_ur.pad_id()  # ✅ Proper pad index
)
# Initialize weights
init_weights(model)
# Count parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')
# --------------------------
# 4. Define Loss Function & Optimizer
# --------------------------
PAD_IDX = sp_ro.pad_id()  # Roman Urdu pad token
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
# ✅ Better optimizer with weight decay
optimizer = optim.AdamW(model.parameters(), 
                       lr=LEARNING_RATE, 
                       weight_decay=1e-5,
                       betas=(0.9, 0.98),
                       eps=1e-9)
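# ✅ Learning rate scheduler
# Scheduler patience is kept below the early-stopping PATIENCE (5); with equal values
# training would stop before the learning rate is ever reduced.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2, verbose=True)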

Urdu vocab size: 8000
Roman vocab size: 8000
Using device: cuda
The model has 25,267,520 trainable parameters
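
The reported parameter count can be checked by hand. The sketch below redoes the arithmetic in plain Python from the hyperparameters above (vocabulary 8000, embedding 256, hidden size 384, 2 layers), assuming PyTorch's standard LSTM layout of two weight matrices and two bias vectors per layer and direction. The helper names are hypothetical.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one LSTM layer in one direction
    (weight_ih, weight_hh, bias_ih, bias_hh across the four gates)."""
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

def linear_params(in_features, out_features, bias=True):
    return in_features * out_features + (out_features if bias else 0)

V, E, H, D = 8000, 256, 384, 384  # vocab, emb_dim, enc_hidden_dim, dec_hidden_dim

encoder = (
    V * E                              # embedding
    + 2 * lstm_params(E, H)            # layer 1, forward + backward
    + 2 * lstm_params(2 * H, H)        # layer 2 takes the concatenated directions
    + 2 * linear_params(2 * H, D)      # fc_hidden and fc_cell
)
attention = linear_params(2 * H + D, D) + linear_params(D, 1, bias=False)
decoder = (
    V * E                              # embedding
    + lstm_params(E + 2 * H, D)        # layer 1 sees embedding + context
    + lstm_params(D, D)                # layer 2
    + linear_params(E + 2 * H + D, V)  # fc_out
)
total = encoder + attention + decoder
print(f"{total:,}")  # → 25,267,520
```

Roughly 11.3M of these parameters sit in the output projection `fc_out`, which is worth keeping in mind when changing the vocabulary size.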


In [78]:
# --------------------------
# 5. IMPROVED Training Functions
# --------------------------
def train_one_epoch(model, dataloader, optimizer, criterion, clip, device):
    model.train()
    epoch_loss = 0
    total_tokens = 0
    
    for batch_idx, (src, tgt) in enumerate(dataloader):
        src, tgt = src.to(device), tgt.to(device)
        
        optimizer.zero_grad()
        
        # Forward pass - returns (outputs, attentions)
        outputs, _ = model(src, tgt, teacher_forcing_ratio=TEACHER_FORCING_RATIO)
        
        # Reshape for loss calculation
        # outputs: [batch_size, trg_len, vocab_size]
        output_dim = outputs.shape[-1]
        
        # Skip BOS token (index 0) for loss calculation
        outputs = outputs[:, 1:].contiguous().view(-1, output_dim)
        tgt = tgt[:, 1:].contiguous().view(-1)
        
        # Calculate loss
        loss = criterion(outputs, tgt)
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        # Accumulate loss and token count
        epoch_loss += loss.item()
        total_tokens += (tgt != PAD_IDX).sum().item()
        
        # Print progress every 50 batches
        if batch_idx % 50 == 0:
            print(f'  Batch {batch_idx}/{len(dataloader)}, Loss: {loss.item():.4f}')
    
    return epoch_loss / len(dataloader)

def evaluate(model, dataloader, criterion, device):
    model.eval()
    epoch_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for src, tgt in dataloader:
            src, tgt = src.to(device), tgt.to(device)
            
            # No teacher forcing during evaluation
            outputs, _ = model(src, tgt, teacher_forcing_ratio=0.0)
            
            # Reshape for loss calculation
            output_dim = outputs.shape[-1]
            outputs = outputs[:, 1:].contiguous().view(-1, output_dim)
            tgt = tgt[:, 1:].contiguous().view(-1)
            
            loss = criterion(outputs, tgt)
            epoch_loss += loss.item()
            total_tokens += (tgt != PAD_IDX).sum().item()
    
    return epoch_loss / len(dataloader)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# --------------------------
# 6. TRAINING LOOP with Early Stopping
# --------------------------
SAVE_PATH = "/kaggle/working/best_seq2seq_model.pt"
best_valid_loss = float('inf')
patience_counter = 0

print("Starting training...")
print("-" * 50)

for epoch in range(NUM_EPOCHS):
    start_time = time.time()
    
    print(f'Epoch {epoch+1:02}/{NUM_EPOCHS}')
    
    # Training
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion, CLIP, device)
    
    # Validation
    valid_loss = evaluate(model, val_loader, criterion, device)
    
    # Learning rate scheduling
    scheduler.step(valid_loss)
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # Save best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        patience_counter = 0
        torch.save({
            'epoch': epoch + 1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'best_valid_loss': best_valid_loss,
            'hyperparameters': {
                'emb_dim': EMB_DIM,
                'enc_hidden_dim': ENC_HIDDEN_DIM,
                'dec_hidden_dim': DEC_HIDDEN_DIM,
                'n_layers': N_LAYERS,
                'dropout': DROPOUT
            }
        }, SAVE_PATH)
        print(f'  ✅ New best model saved!')
    else:
        patience_counter += 1
    
    # Current learning rate
    current_lr = optimizer.param_groups[0]['lr']
    
    print(f'  Time: {epoch_mins}m {epoch_secs}s')
    print(f'  Train Loss: {train_loss:.4f} | Valid Loss: {valid_loss:.4f}')
    print(f'  Learning Rate: {current_lr:.6f}')
    print(f'  Best Valid Loss: {best_valid_loss:.4f}')
    print(f'  Patience: {patience_counter}/{PATIENCE}')
    print("-" * 50)
    
    # Early stopping
    if patience_counter >= PATIENCE:
        print(f'Early stopping triggered after {epoch+1} epochs')
        break

print("Training complete!")
print(f"Best validation loss: {best_valid_loss:.4f}")

# --------------------------
# 7. Load Best Model for Testing
# --------------------------
print("Loading best model for evaluation...")
checkpoint = torch.load(SAVE_PATH, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])

# Test evaluation
test_loss = evaluate(model, test_loader, criterion, device)
print(f'Test Loss: {test_loss:.4f}')
print(f'Test Perplexity: {math.exp(test_loss):.4f}')

Starting training...
--------------------------------------------------
Epoch 01/30
  Batch 0/326, Loss: 8.9873
  Batch 50/326, Loss: 6.3916
  Batch 100/326, Loss: 6.1001
  Batch 150/326, Loss: 6.0167
  Batch 200/326, Loss: 5.6326
  Batch 250/326, Loss: 5.9795
  Batch 300/326, Loss: 5.5891
  ✅ New best model saved!
  Time: 0m 29s
  Train Loss: 5.9757 | Valid Loss: 6.1227
  Learning Rate: 0.001000
  Best Valid Loss: 6.1227
  Patience: 0/5
--------------------------------------------------
Epoch 02/30
  Batch 0/326, Loss: 5.1740
  Batch 50/326, Loss: 5.3226
  Batch 100/326, Loss: 4.9432
  Batch 150/326, Loss: 4.9394
  Batch 200/326, Loss: 5.0912
  Batch 250/326, Loss: 4.7332
  Batch 300/326, Loss: 4.2444
  ✅ New best model saved!
  Time: 0m 30s
  Train Loss: 4.9673 | Valid Loss: 5.1839
  Learning Rate: 0.001000
  Best Valid Loss: 5.1839
  Patience: 0/5
--------------------------------------------------
Epoch 03/30
  Batch 0/326, Loss: 4.1789
  Batch 50/326, Loss: 3.9374
  Batch 100/326, 
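
One caveat about the reported losses: `train_one_epoch` and `evaluate` tally `total_tokens` but return `epoch_loss / len(dataloader)`, an average of per-batch means. When batches contain different numbers of real tokens, this differs from the per-token average that perplexity is normally defined on. A minimal sketch of the token-weighted alternative follows; the batch losses and token counts are made-up numbers and the helper name is hypothetical.

```python
import math

def token_weighted_loss(batch_losses, batch_token_counts):
    """Average per-token loss, weighting each batch by its non-pad token count."""
    total_tokens = sum(batch_token_counts)
    weighted_sum = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    return weighted_sum / total_tokens

batch_losses = [5.0, 4.0]   # hypothetical mean loss of each batch
batch_tokens = [300, 100]   # hypothetical non-pad token counts

mean_of_means = sum(batch_losses) / len(batch_losses)
weighted = token_weighted_loss(batch_losses, batch_tokens)
perplexity = math.exp(weighted)
print(mean_of_means, weighted, round(perplexity, 2))
```

The two averages coincide only when every batch holds the same number of non-pad tokens, so a perplexity computed as `math.exp` of the mean of batch means is an approximation.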

In [79]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.data.metrics import bleu_score
import math
import sentencepiece as spm  # alias must match the spm.SentencePieceProcessor calls below
import editdistance   # pip install editdistance
import random

# =====================================================
# 1. Load SentencePiece Models
# =====================================================
sp_ur = spm.SentencePieceProcessor(model_file="spm_models/spm_ur.model")
sp_ro = spm.SentencePieceProcessor(model_file="spm_models/spm_ro.model")

ur_vocab_size = sp_ur.get_piece_size()
ro_vocab_size = sp_ro.get_piece_size()

print(f"Loaded vocabularies: Urdu={ur_vocab_size}, Roman={ro_vocab_size}")

# =====================================================
# 2. UPDATED Model Architecture (same as training)
# =====================================================
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout):
        super().__init__()
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim, padding_idx=1)  # pad_idx=1
        self.rnn = nn.LSTM(emb_dim, enc_hidden_dim, num_layers=n_layers,
                          bidirectional=True, dropout=dropout if n_layers > 1 else 0.0, 
                          batch_first=True)
        
        # Project bidirectional hidden to decoder hidden size
        self.fc_hidden = nn.Linear(enc_hidden_dim * 2, dec_hidden_dim)
        self.fc_cell = nn.Linear(enc_hidden_dim * 2, dec_hidden_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len=None):
        # src: [batch_size, src_len]
        embedded = self.dropout(self.embedding(src))  # [batch, src_len, emb_dim]
        
        # Pack if lengths provided (optional optimization)
        if src_len is not None:
            embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len, batch_first=True, enforce_sorted=False)
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        # Unpack if packed
        if src_len is not None:
            outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
        
        # outputs: [batch, src_len, enc_hidden_dim * 2]
        # hidden: [n_layers * 2, batch, enc_hidden_dim]
        # cell: [n_layers * 2, batch, enc_hidden_dim]
        
        # Concatenate final forward and backward hidden states
        # Take the last layer's forward and backward states
        hidden_fwd = hidden[-2, :, :]  # [batch, enc_hidden_dim]
        hidden_bwd = hidden[-1, :, :]  # [batch, enc_hidden_dim] 
        hidden_cat = torch.cat((hidden_fwd, hidden_bwd), dim=1)  # [batch, enc_hidden_dim * 2]
        
        cell_fwd = cell[-2, :, :]
        cell_bwd = cell[-1, :, :]
        cell_cat = torch.cat((cell_fwd, cell_bwd), dim=1)
        
        # Project to decoder dimensions and repeat for all decoder layers
        hidden_proj = torch.tanh(self.fc_hidden(hidden_cat))  # [batch, dec_hidden_dim]
        cell_proj = torch.tanh(self.fc_cell(cell_cat))  # [batch, dec_hidden_dim]
        
        # Repeat for all decoder layers
        hidden_proj = hidden_proj.unsqueeze(0).repeat(self.n_layers, 1, 1)  # [n_layers, batch, dec_hidden_dim]
        cell_proj = cell_proj.unsqueeze(0).repeat(self.n_layers, 1, 1)  # [n_layers, batch, dec_hidden_dim]
        
        return outputs, hidden_proj, cell_proj

class Attention(nn.Module):
    def __init__(self, enc_hidden_dim, dec_hidden_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hidden_dim * 2 + dec_hidden_dim, dec_hidden_dim)
        self.v = nn.Linear(dec_hidden_dim, 1, bias=False)
        
    def forward(self, hidden, encoder_outputs, mask=None):
        # hidden: [batch, dec_hidden_dim]
        # encoder_outputs: [batch, src_len, enc_hidden_dim * 2]
        
        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]
        
        # Repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # [batch, src_len, dec_hidden_dim]
        
        # Calculate attention energies
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))  # [batch, src_len, dec_hidden_dim]
        attention = self.v(energy).squeeze(2)  # [batch, src_len]
        
        # Apply mask if provided (for padding)
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        
        return F.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim, padding_idx=1)
        self.rnn = nn.LSTM(emb_dim + enc_hidden_dim * 2, dec_hidden_dim, num_layers=n_layers,
                          dropout=dropout if n_layers > 1 else 0.0, batch_first=True)
        
        self.fc_out = nn.Linear(emb_dim + enc_hidden_dim * 2 + dec_hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell, encoder_outputs, mask=None):
        # input: [batch]
        # hidden: [n_layers, batch, dec_hidden_dim]
        # cell: [n_layers, batch, dec_hidden_dim]
        # encoder_outputs: [batch, src_len, enc_hidden_dim * 2]
        
        input = input.unsqueeze(1)  # [batch, 1]
        embedded = self.dropout(self.embedding(input))  # [batch, 1, emb_dim]
        
        # Calculate attention using the top layer hidden state
        a = self.attention(hidden[-1], encoder_outputs, mask)  # [batch, src_len]
        a = a.unsqueeze(1)  # [batch, 1, src_len]
        
        # Calculate weighted encoder outputs (context vector)
        weighted = torch.bmm(a, encoder_outputs)  # [batch, 1, enc_hidden_dim * 2]
        
        # Concatenate embedding with context
        rnn_input = torch.cat((embedded, weighted), dim=2)  # [batch, 1, emb_dim + enc_hidden_dim * 2]
        
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        # output: [batch, 1, dec_hidden_dim]
        
        # Prediction
        prediction = self.fc_out(torch.cat((output.squeeze(1), weighted.squeeze(1), embedded.squeeze(1)), dim=1))
        # [batch, output_dim]
        
        return prediction, hidden, cell, a.squeeze(1)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device
        
    def create_mask(self, src):
        # Create mask for padding tokens
        mask = (src != self.src_pad_idx)  # [batch, src_len]
        return mask
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim
        
        # Create mask for source padding
        mask = self.create_mask(src)
        
        # Encode
        encoder_outputs, hidden, cell = self.encoder(src)
        
        # Initialize outputs tensor
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=self.device)
        attentions = torch.zeros(batch_size, trg_len, src.shape[1], device=self.device)
        
        # First input to decoder is the BOS token
        input = trg[:, 0]
        
        for t in range(1, trg_len):
            # Pass through decoder
            output, hidden, cell, attention = self.decoder(input, hidden, cell, encoder_outputs, mask)
            
            # Store output and attention
            outputs[:, t] = output
            attentions[:, t] = attention
            
            # Decide whether to use teacher forcing
            teacher_force = random.random() < teacher_forcing_ratio
            
            # Get the highest predicted token
            top1 = output.argmax(1)
            
            # Update input for next time step
            input = trg[:, t] if teacher_force else top1
            
        return outputs, attentions

def create_model(src_vocab_size, trg_vocab_size, device, 
                emb_dim=256, enc_hidden_dim=256, dec_hidden_dim=256, 
                n_layers=2, dropout=0.3, src_pad_idx=1):
    
    attention = Attention(enc_hidden_dim, dec_hidden_dim)
    encoder = Encoder(src_vocab_size, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout)
    decoder = Decoder(trg_vocab_size, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers, dropout, attention)
    
    model = Seq2Seq(encoder, decoder, src_pad_idx, device).to(device)
    
    return model

# =====================================================
# 3. Load Trained Model with CORRECT Parameters
# =====================================================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ✅ UPDATED Hyperparameters (match your training)
EMB_DIM = 256           
ENC_HIDDEN_DIM = 384    
DEC_HIDDEN_DIM = 384    
N_LAYERS = 2            
DROPOUT = 0.3   

# Create model with updated architecture
model = create_model(
    src_vocab_size=ur_vocab_size,
    trg_vocab_size=ro_vocab_size,
    device=device,
    emb_dim=EMB_DIM,
    enc_hidden_dim=ENC_HIDDEN_DIM,
    dec_hidden_dim=DEC_HIDDEN_DIM,
    n_layers=N_LAYERS,
    dropout=DROPOUT,
    src_pad_idx=sp_ur.pad_id()
)

# Load the trained weights
try:
    checkpoint = torch.load("/kaggle/working/best_seq2seq_model.pt", map_location=device)
    if isinstance(checkpoint, dict) and 'model_state_dict' in checkpoint:
        model.load_state_dict(checkpoint['model_state_dict'])
        print("✅ Loaded model from checkpoint")
    else:
        model.load_state_dict(checkpoint)
        print("✅ Loaded model state dict directly")
    model.eval()
except FileNotFoundError:
    print("❌ Model file not found! Please check the path.")
    raise  # re-raise instead of exit(1), which would shut down the notebook kernel
except Exception as e:
    print(f"❌ Error loading model: {e}")
    raise

# =====================================================
# 4. BEAM SEARCH AND TRANSLATION FUNCTIONS
# =====================================================
from collections import namedtuple

BeamNode = namedtuple('BeamNode', ['tokens', 'log_prob', 'hidden', 'cell', 'finished'])

def beam_search_translate(model, sentence, sp_src, sp_trg, beam_size=5, max_len=50, device='cpu'):
    """
    Beam search translation - significantly better than greedy search
    """
    model.eval()
    
    # Tokenize input sentence
    tokens = [sp_src.bos_id()] + sp_src.encode(sentence.strip(), out_type=int) + [sp_src.eos_id()]
    src_tensor = torch.LongTensor(tokens).unsqueeze(0).to(device)
    
    with torch.no_grad():
        # Create mask and encode
        mask = model.create_mask(src_tensor)
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        
        # Initialize beam with BOS token
        initial_token = sp_trg.bos_id()
        beams = [BeamNode(
            tokens=[initial_token],
            log_prob=0.0,
            hidden=hidden,
            cell=cell,
            finished=False
        )]
        
        finished_beams = []
        
        # Generate tokens step by step
        for step in range(max_len):
            if not beams:
                break
                
            candidates = []
            
            for beam in beams:
                if beam.finished:
                    finished_beams.append(beam)
                    continue
                
                # Get predictions for current beam
                input_token = torch.LongTensor([beam.tokens[-1]]).to(device)
                output, new_hidden, new_cell, _ = model.decoder(
                    input_token, beam.hidden, beam.cell, encoder_outputs, mask
                )
                
                # Get top beam_size candidates
                log_probs = F.log_softmax(output, dim=-1)
                top_log_probs, top_indices = torch.topk(log_probs, beam_size)
                
                # Expand beam
                for i in range(beam_size):
                    token_id = top_indices[0, i].item()
                    token_log_prob = top_log_probs[0, i].item()
                    
                    new_tokens = beam.tokens + [token_id]
                    new_log_prob = beam.log_prob + token_log_prob
                    
                    # Check if finished
                    is_finished = (token_id == sp_trg.eos_id())
                    
                    candidates.append(BeamNode(
                        tokens=new_tokens,
                        log_prob=new_log_prob,
                        hidden=new_hidden.clone(),
                        cell=new_cell.clone(),
                        finished=is_finished
                    ))
            
            # Keep top beam_size candidates (length normalized)
            candidates.sort(key=lambda x: x.log_prob / len(x.tokens), reverse=True)
            top_candidates = candidates[:beam_size]
            beams = [beam for beam in top_candidates if not beam.finished]
            
            # Add finished beams, but only those that survived pruning
            finished_beams.extend([beam for beam in top_candidates if beam.finished])
        
        # Add remaining beams to finished
        finished_beams.extend(beams)
        
        if not finished_beams:
            return ""
        
        # Get best beam (length normalized score)
        best_beam = max(finished_beams, key=lambda x: x.log_prob / len(x.tokens))
        
        # Decode tokens (remove BOS/EOS)
        decoded_tokens = [t for t in best_beam.tokens[1:] if t != sp_trg.eos_id()]
        return sp_trg.decode(decoded_tokens).strip()

def translate_sentence(sentence, model, sp_src, sp_trg, max_len=50, device='cpu'):
    """Greedy search translation (original method)"""
    model.eval()
    
    # Tokenize input sentence
    tokens = [sp_src.bos_id()] + sp_src.encode(sentence.strip(), out_type=int) + [sp_src.eos_id()]
    src_tensor = torch.LongTensor(tokens).unsqueeze(0).to(device)

    with torch.no_grad():
        # Create mask
        mask = model.create_mask(src_tensor)
        
        # Encode
        encoder_outputs, hidden, cell = model.encoder(src_tensor)

    # Start with BOS token
    trg_indexes = [sp_trg.bos_id()]
    
    for _ in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
        
        with torch.no_grad():
            # Use decoder with attention
            output, hidden, cell, _ = model.decoder(trg_tensor, hidden, cell, encoder_outputs, mask)
            pred_token = output.argmax(1).item()
        
        trg_indexes.append(pred_token)
        
        # Stop if EOS token is generated
        if pred_token == sp_trg.eos_id():
            break
    
    # Decode without BOS/EOS tokens
    translated_tokens = [t for t in trg_indexes[1:] if t != sp_trg.eos_id()]
    translated = sp_trg.decode(translated_tokens)
    return translated.strip()

def translate_sentence_improved(sentence, model, sp_src, sp_trg, use_beam_search=True, beam_size=5, max_len=50, device='cpu'):
    """
    Improved translation with beam search option
    """
    if use_beam_search:
        return beam_search_translate(model, sentence, sp_src, sp_trg, beam_size, max_len, device)
    else:
        return translate_sentence(sentence, model, sp_src, sp_trg, max_len, device)

def calculate_bleu_correct(references, hypotheses):
    """Calculate BLEU score with proper format"""
    if not references or not hypotheses:
        return 0.0
    
    # Convert to proper format for torchtext BLEU
    refs_tokenized = [[ref.split()] for ref in references]  # List of [list of tokens]
    hyps_tokenized = [hyp.split() for hyp in hypotheses]    # List of tokens
    
    try:
        return bleu_score(hyps_tokenized, refs_tokenized)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return 0.0

def calculate_cer_correct(references, hypotheses):
    """Calculate Character Error Rate correctly"""
    if not references or not hypotheses:
        return 1.0
        
    total_edits = 0
    total_chars = 0
    
    for ref, hyp in zip(references, hypotheses):
        # Use character-level comparison
        total_edits += editdistance.eval(ref, hyp)
        total_chars += len(ref)
    
    return total_edits / total_chars if total_chars > 0 else 1.0

def calculate_wer(references, hypotheses):
    """Calculate Word Error Rate"""
    if not references or not hypotheses:
        return 1.0
        
    total_edits = 0
    total_words = 0
    
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        
        total_edits += editdistance.eval(ref_words, hyp_words)
        total_words += len(ref_words)
    
    return total_edits / total_words if total_words > 0 else 1.0

# =====================================================
# 5. UPDATED Evaluation Loop
# =====================================================
print("Starting evaluation...")

references = []  # Ground truth translations
hypotheses = []  # Model predictions

# ✅ Use your actual tokenized validation files
try:
    # Read from your tokenized files and decode them
    with open("/kaggle/working/tokenized/val.src", "r") as f_src, \
         open("/kaggle/working/tokenized/val.tgt", "r") as f_trg:

        src_lines = f_src.readlines()
        trg_lines = f_trg.readlines()
        
        print(f"Evaluating {len(src_lines)} sentences...")
        
        for i, (src_line, trg_line) in enumerate(zip(src_lines, trg_lines)):
            if i % 100 == 0:
                print(f"Processed {i} sentences...")
            
            # Parse tokenized lines (space-separated integers)
            try:
                src_tokens = [int(t) for t in src_line.strip().split() if t]
                trg_tokens = [int(t) for t in trg_line.strip().split() if t]
                
                if not src_tokens or not trg_tokens:
                    continue
                
                # Decode tokens to text (removing BOS/EOS)
                src_text = sp_ur.decode([t for t in src_tokens if t not in [sp_ur.bos_id(), sp_ur.eos_id()]])
                trg_text = sp_ro.decode([t for t in trg_tokens if t not in [sp_ro.bos_id(), sp_ro.eos_id()]])
                
                # Translate with BEAM SEARCH (improved method)
                hypothesis = translate_sentence_improved(src_text, model, sp_ur, sp_ro, 
                                                       use_beam_search=True, beam_size=5, device=device)
                
                references.append(trg_text)
                hypotheses.append(hypothesis)
                
                # Print first few examples with comparison
                if i < 5:
                    # Also get greedy result for comparison
                    greedy_result = translate_sentence_improved(src_text, model, sp_ur, sp_ro, 
                                                              use_beam_search=False, device=device)
                    
                    print(f"\nExample {i+1}:")
                    print(f"Source:     {src_text}")
                    print(f"Reference:  {trg_text}")
                    print(f"Greedy:     {greedy_result}")
                    print(f"Beam(5):    {hypothesis}")
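                    # Note: this marker only flags whether the beam output has more
                    # words than the greedy output; it does not measure translation quality
                    print(f"Improvement: {'✅' if len(hypothesis.split()) > len(greedy_result.split()) else '🔄'}")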
                    
            except Exception as e:
                print(f"Error processing sentence {i}: {e}")
                continue

            # Limit evaluation to the first 1000 sentences to keep runtime reasonable
            if i + 1 >= 1000:
                print(f"Stopping evaluation at {i+1} sentences for efficiency...")
                break

except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("Make sure tokenized validation files exist")
    raise  # re-raise to stop the cell; exit() would shut down the notebook kernel

# =====================================================
# 6. Calculate Metrics
# =====================================================
print(f"\nCalculating metrics for {len(references)} sentences...")

if references and hypotheses:
    # BLEU Score
    try:
        bleu = calculate_bleu_correct(references, hypotheses)
        print(f"BLEU Score: {bleu*100:.2f}")
    except Exception as e:
        print(f"Error calculating BLEU: {e}")

    # Character Error Rate
    try:
        cer = calculate_cer_correct(references, hypotheses)
        print(f"CER: {cer:.4f} ({cer*100:.2f}%)")
    except Exception as e:
        print(f"Error calculating CER: {e}")

    # Word Error Rate
    try:
        wer = calculate_wer(references, hypotheses)
        print(f"WER: {wer:.4f} ({wer*100:.2f}%)")
    except Exception as e:
        print(f"Error calculating WER: {e}")
else:
    print("❌ No valid sentence pairs found for evaluation!")

print("\n✅ Evaluation completed!")

Loaded vocabularies: Urdu=8000, Roman=8000
✅ Loaded model from checkpoint
Starting evaluation...
Evaluating 5219 sentences...
Processed 0 sentences...

Example 1:
Source:     استان یار سے اٹھ جائیں کیا
Reference:  āstān-e-yār se uth jaa.eñ kyā
Greedy:     pūchhiye yaar se uth jaa.e kyā kyā
Beam(5):    ḳhūb-e se yaar jaa.e kyā kyā
Improvement: 🔄

Example 2:
Source:     اتا ہے ہوش مجھ کو اب تو پہر پہر میں
Reference:  aatā hai hosh mujh ko ab to pahr pahr meñ
Greedy:     aatā hai jin mujh ko ab to dikhāyā pesh meñ
Beam(5):    aatā hai jin mujh ko ab to dikhāyāgī meñ
Improvement: 🔄

Example 3:
Source:     صبح کے درد کو راتوں کی جلن کو بھولیں
Reference:  sub.h ke dard ko rātoñ kī jalan ko bhūleñ
Greedy:     sub. dard ke dard ko pahchān kī jal. ko ko ko
Beam(5):    sub. dard ke dard ko pahchān kī jal.an ko bhūleñ
Improvement: 🔄

Example 4:
Source:     یہ ہمیں تھے جن کے لباس پر سر رہ سیاہی لکھی گئی
Reference:  ye hamīñ the jin ke libās par sar-e-rah siyāhī likhī ga.ī
Greedy:     ye hameñ the 