<a href="https://colab.research.google.com/github/Sameensanobarsubiya/African-Vulture-Optimization/blob/main/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2018 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm # For progress bars

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Assuming common defaults, adjust if needed
pad_idx = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0 # Often 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return x

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, mask):
        src = self.embedding(src)
        src = self.pos_encoding(src)
        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        tgt = self.pos_encoding(tgt)
        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0):
        super(Transformer, self).__init__()
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout)
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx).to(device)

# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences
        encoded_english = encoded_english[:self.max_length] + [pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [pad_idx] * (self.max_length - len(encoded_urdu))

        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
max_length = 128 # Define max length again for dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx) # Ignore padding in loss calculation
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval()
    total_val_loss = 0

    print("Evaluating...")
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval()
    with torch.no_grad():
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding
        for _ in range(max_length):
            output = model(src_tensor, tgt_tensor)
            last_token_logits = output[:, -1, :]
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token if it was added)
        # Adjust the slicing if your tokenizer's decode handles the start token differently
        translated_text = tokenizer.decode_ids(target_sequence[1:]) # Exclude the initial start token

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

Using device: cuda
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN         NaN   
3  اگر آپ نے اپنے آپ سے ایسے سوالات کیے ہیں، تو آ...         NaN         NaN   
4   اس بات پر منحصر ہے کہ آپ رہنمائی کے لیے کہاں ...         NaN         NaN   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         NaN         NaN         NaN        

Training:   0%|          | 0/1509 [00:00<?, ?it/s]

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## **Introduction to Colab and Python**

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

try:
    df = pd.read_excel('parallel-corpus.xlsx')
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    exit()

df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

pad_idx = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

max_length = max_actual_length + 10
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting max_length for padding, truncation, and model to: {max_length}")


def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        if x.size(0) > self.pe.size(0):
            raise RuntimeError(f"Input sequence length ({x.size(0)}) exceeds positional encoding max_len ({self.pe.size(0)}). Increase max_length.")
        x = x + self.pe[:x.size(0), :]
        return x

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, mask):
        src = self.embedding(src)
        src = self.pos_encoding(src)
        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        tgt = self.pos_encoding(tgt)
        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len)
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

src_vocab_size = tokenizer.get_piece_size()
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512
num_layers = 6
num_heads = 8
d_ff = 2048
dropout = 0.1

model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length).to(device)


class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device)

        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    model.eval()
    total_val_loss = 0

    print("Evaluating...")
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval()
    with torch.no_grad():
        encoded_input = tokenizer.encode_as_ids(text)
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        target_sequence = [start_token_id]
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        for _ in range(max_length):
            output = model(src_tensor, tgt_tensor)
            last_token_logits = output[:, -1, :]
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            target_sequence.append(predicted_token_id.item())
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

Using device: cuda
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN         NaN   
3  اگر آپ نے اپنے آپ سے ایسے سوالات کیے ہیں، تو آ...         NaN         NaN   
4   اس بات پر منحصر ہے کہ آپ رہنمائی کے لیے کہاں ...         NaN         NaN   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         NaN         NaN         NaN        

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [5]:
# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Assuming common defaults, adjust if needed
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset to determine appropriate max_length
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    # Add 2 for start and end tokens in target sequence
    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set max_length to be slightly larger than the maximum actual length found
# This provides some buffer for variations in sentence length during generation
max_length = max_actual_length + 10
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        # Calculate positions
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Calculate division term for sinusoidal functions
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Apply sin to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cos to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        # Reshape for broadcasting and register as buffer
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Ensure input sequence length does not exceed positional encoding max_len
        if x.size(0) > self.pe.size(0):
            raise RuntimeError(f"Input sequence length ({x.size(0)}) exceeds positional encoding max_len ({self.pe.size(0)}). Increase max_length.")
        # Add positional encoding to the input embeddings
        x = x + self.pe[:x.size(0), :]
        return x

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        # Calculate dimension of each head
        self.d_k = d_model // num_heads

        # Linear layers for queries, keys, and values
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        # Output linear layer
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Calculate attention scores (Q * K^T / sqrt(d_k))
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply mask if provided (fill masked positions with a large negative value)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        # Apply softmax to get attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        # Calculate weighted sum of values (Attention * V)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Apply linear transformations and split into heads
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Perform scaled dot-product attention
        output = self.scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads and apply output linear layer
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        # Two linear layers with a ReLU activation in between
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Apply the feed-forward network
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        # Self-attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Feed-forward sub-layer
        self.ffn = FeedForward(d_model, d_ff)
        # Layer normalization and dropout
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Apply self-attention with residual connection and normalization
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        # Apply feed-forward network with residual connection and normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        # Stack of encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, mask):
        # Apply embedding and positional encoding
        src = self.embedding(src)
        src = self.pos_encoding(src)
        # Apply dropout
        src = self.dropout(src)
        # Pass through encoder layers
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        # Masked self-attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Cross-attention sub-layer (attention over encoder output)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        # Feed-forward sub-layer
        self.ffn = FeedForward(d_model, d_ff)
        # Layer normalization and dropout
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Apply masked self-attention with residual connection and normalization
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        # Apply cross-attention with residual connection and normalization
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        # Apply feed-forward network with residual connection and normalization
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        # Stack of decoder layers
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        # Output linear layer to predict vocabulary
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Apply embedding and positional encoding
        tgt = self.embedding(tgt)
        tgt = self.pos_encoding(tgt)
        # Apply dropout
        tgt = self.dropout(tgt)
        # Pass through decoder layers
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        # Apply output linear layer
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# Initialize and move the model to the selected device
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length).to(device)


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            # Forward pass through the model
            output = model(src, tgt_input)
            # Reshape output for loss calculation
            output = output.view(-1, tgt_vocab_size)

            # Calculate loss
            loss = criterion(output, labels)
            total_val_loss += loss.item()

    # Calculate average validation loss for the epoch
    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")


Special Token IDs: Padding: 0, Start: 1, End: 2, Unknown: 0

Maximum tokenized sequence length in the dataset (including special tokens): 1040
Setting max_length for padding, truncation, and model to: 1050


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset to determine appropriate max_length
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    # Add 2 for start and end tokens in target sequence
    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set max_length to be slightly larger than the maximum actual length found
# This provides some buffer for variations in sentence length during generation
max_length = max_actual_length + 10
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        # Calculate positions
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Calculate division term for sinusoidal functions
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # Apply sin to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cos to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)
        # Reshape for broadcasting and register as buffer
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Ensure input sequence length does not exceed positional encoding max_len
        # Note: This check is for runtime, the buffer is created at init
        if x.size(1) > self.pe.size(0): # Check sequence length dim (dim 1)
            # While this check is good for runtime, the error is happening at .to(device)
            # so the issue might be with pe itself.
            raise RuntimeError(f"Input sequence length ({x.size(1)}) exceeds positional encoding max_len ({self.pe.size(0)}). Increase max_length.")
        # Add positional encoding to the input embeddings
        # Adjust slicing to match batch_size, seq_len, d_model
        x = x + self.pe[:x.size(1), :].unsqueeze(0) # Add unsqueeze(0) for broadcasting with batch
        return x


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        # Calculate dimension of each head
        self.d_k = d_model // num_heads

        # Linear layers for queries, keys, and values
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        # Output linear layer
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Calculate attention scores (Q * K^T / sqrt(d_k))
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply mask if provided (fill masked positions with a large negative value)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        # Apply softmax to get attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        # Calculate weighted sum of values (Attention * V)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Apply linear transformations and split into heads
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Perform scaled dot-product attention
        output = self.scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads and apply output linear layer
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        # Two linear layers with a ReLU activation in between
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Apply the feed-forward network
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        # Self-attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Feed-forward sub-layer
        self.ffn = FeedForward(d_model, d_ff)
        # Layer normalization and dropout
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Apply self-attention with residual connection and normalization
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        # Apply feed-forward network with residual connection and normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        # Stack of encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, mask):
        # Apply embedding
        src = self.embedding(src)
        # Apply positional encoding
        src = self.pos_encoding(src)
        # Apply dropout
        src = self.dropout(src)
        # Pass through encoder layers
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        # Masked self-attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Cross-attention sub-layer (attention over encoder output)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        # Feed-forward sub-layer
        self.ffn = FeedForward(d_model, d_ff)
        # Layer normalization and dropout
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Apply masked self-attention with residual connection and normalization
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        # Apply cross-attention with residual connection and normalization
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        # Apply feed-forward network with residual connection and normalization
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        # Stack of decoder layers
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        # Output linear layer to predict vocabulary
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Apply embedding
        tgt = self.embedding(tgt)
        # Apply positional encoding
        tgt = self.pos_encoding(tgt)
        # Apply dropout
        tgt = self.dropout(tgt)
        # Pass through decoder layers
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        # Apply output linear layer
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# Initialize the model
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length)

# --- Diagnostic: Print min/max of positional encoding buffer ---
print("\nChecking positional encoding buffer:")
try:
    pe_buffer = model.encoder.pos_encoding.pe
    print(f"Positional Encoding buffer shape: {pe_buffer.shape}")
    print(f"Positional Encoding buffer min value: {pe_buffer.min().item()}")
    print(f"Positional Encoding buffer max value: {pe_buffer.max().item()}")
except AttributeError:
    print("Could not access positional encoding buffer.")
print("-" * 20)
# --- End Diagnostic ---


# Move the model to the selected device
model.to(device)
print("Model moved to device.")


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN         NaN   
3  اگر آپ نے اپنے آپ سے ایسے سوالات کیے ہیں، تو آ...         NaN         NaN   
4   اس بات پر منحصر ہے کہ آپ رہنمائی کے لیے کہاں ...         NaN         NaN   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         Na

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset to determine appropriate max_length
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    # Add 2 for start and end tokens in target sequence
    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set max_length to be slightly larger than the maximum actual length found
# This provides some buffer for variations in sentence length during generation
max_length = max_actual_length + 10
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# Modified PositionalEncoding - now just a helper to calculate PE, not a Module buffer
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    pe = pe.unsqueeze(0) # Add batch dimension
    return pe.to(device)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        # Calculate dimension of each head
        self.d_k = d_model // num_heads

        # Linear layers for queries, keys, and values
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        # Output linear layer
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Calculate attention scores (Q * K^T / sqrt(d_k))
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply mask if provided (fill masked positions with a large negative value)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        # Apply softmax to get attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        # Calculate weighted sum of values (Attention * V)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Apply linear transformations and split into heads
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Perform scaled dot-product attention
        output = self.scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads and apply output linear layer
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        # Two linear layers with a ReLU activation in between
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Apply the feed-forward network
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        # Self-attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Feed-forward sub-layer
        self.ffn = FeedForward(d_model, d_ff)
        # Layer normalization and dropout
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Apply self-attention with residual connection and normalization
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        # Apply feed-forward network with residual connection and normalization
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding is now calculated in forward
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Apply embedding
        src = self.embedding(src)
        # Add positional encoding
        # Ensure sequence length does not exceed pre-calculated PE length
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :] # Slice PE to match current sequence length


        # Apply dropout
        src = self.dropout(src)
        # Pass through encoder layers
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        # Masked self-attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Cross-attention sub-layer (attention over encoder output)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Layer normalization and dropout
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        # Feed-forward sub-layer
        self.ffn = FeedForward(d_model, d_ff)
        # Layer normalization and dropout
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Apply masked self-attention with residual connection and normalization
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        # Apply cross-attention with residual connection and normalization
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        # Apply feed-forward network with residual connection and normalization
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding is now calculated in forward
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Apply embedding
        tgt = self.embedding(tgt)
        # Add positional encoding
        # Ensure sequence length does not exceed pre-calculated PE length
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :] # Slice PE to match current sequence length

        # Apply dropout
        tgt = self.dropout(tgt)
        # Pass through decoder layers
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        # Apply output linear layer
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder
        # Pass device to Encoder/Decoder so PE is calculated on the correct device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# Initialize the model
# Device is handled inside Encoder/Decoder for PE
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length)

# Move the entire model to the selected device
# This should now work more smoothly as PE is already on the correct device
model.to(device)
print("Model moved to device.")


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN         NaN   
3  اگر آپ نے اپنے آپ سے ایسے سوالات کیے ہیں، تو آ...         NaN         NaN   
4   اس بات پر منحصر ہے کہ آپ رہنمائی کے لیے کہاں ...         NaN         NaN   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         Na

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [8]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset to determine appropriate max_length
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    # Add 2 for start and end tokens in target sequence
    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set max_length to be slightly larger than the maximum actual length found
# This provides some buffer for variations in sentence length during generation
max_length = max_actual_length + 10
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    pe = pe.unsqueeze(0) # Add batch dimension
    return pe.to(device)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        src = self.embedding(src)
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]

        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# Initialize the model components
encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_length)
decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_length)

# Move encoder and decoder individually to the device
print("Attempting to move Encoder to device...")
encoder.to(device)
print("Encoder moved to device.")

print("Attempting to move Decoder to device...")
decoder.to(device)
print("Decoder moved to device.")

# Initialize the Transformer model with the moved components
# Note: This might still fail if the issue is with the Transformer container itself
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length)
# Replace the initially created encoder/decoder with the ones already on the device
model.encoder = encoder
model.decoder = decoder

print("Transformer model assembled with components on device.")

# The model is now on the device, so no further .to(device) call here.


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN         NaN   
3  اگر آپ نے اپنے آپ سے ایسے سوالات کیے ہیں، تو آ...         NaN         NaN   
4   اس بات پر منحصر ہے کہ آپ رہنمائی کے لیے کہاں ...         NaN         NaN   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         Na

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(0)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")
    # Check if the detected CUDA version matches PyTorch's expected version
    if torch.backends.cuda.is_built() and torch.version.cuda:
        print(f"PyTorch expected CUDA version: {torch.version.cuda}")
        # This is just an informational print, the actual compatibility depends on drivers etc.
else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset to determine appropriate max_length
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    # Add 2 for start and end tokens in target sequence
    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set max_length to be slightly larger than the maximum actual length found
# This provides some buffer for variations in sentence length during generation
max_length = max_actual_length + 10
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    pe = pe.unsqueeze(0) # Add batch dimension
    return pe.to(device)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        src = self.embedding(src)
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]

        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder
        # Pass device to Encoder/Decoder so PE is calculated on the correct device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# Initialize the Transformer model (components will be moved individually below)
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length)

# --- Attempt to move components individually for diagnosis ---
print("Attempting to move Encoder to device...")
try:
    model.encoder.to(device)
    print("Encoder moved to device successfully.")
except Exception as e:
    print(f"Error moving Encoder to device: {e}")
    # You can add more specific checks here if needed


print("Attempting to move Decoder to device...")
try:
    model.decoder.to(device)
    print("Decoder moved to device successfully.")
except Exception as e:
    print(f"Error moving Decoder to device: {e}")
    # You can add more specific checks here if needed

# At this point, if the error occurred, the traceback should indicate whether it was
# during the encoder or decoder transfer.
# If both were successful, the issue might be elsewhere, but the original traceback
# points specifically to the .to(device) on the whole model.
# If both succeeded, the model is now effectively on the device, so we can proceed.


print("Transformer model initialized and components moved to device.")

# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
CUDA version: 12.4
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True
PyTorch expected CUDA version: 12.4
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN       

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [10]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(0)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")
    # Check if the detected CUDA version matches PyTorch's expected version
    if torch.backends.cuda.is_built() and torch.version.cuda:
        print(f"PyTorch expected CUDA version: {torch.version.cuda}")
        # This is just an informational print, the actual compatibility depends on drivers etc.
else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
CUDA version: 12.4
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True
PyTorch expected CUDA version: 12.4
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN       

NameError: name 'start_' is not defined

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm

# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(0)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")
    # Check if the detected CUDA version matches PyTorch's expected version
    if torch.backends.cuda.is_built() and torch.version.cuda:
        print(f"PyTorch expected CUDA version: {torch.version.cuda}")
        # This is just an informational print, the actual compatibility depends on drivers etc.
else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0).to(device)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties before moving to device ---
    print("\n--- Positional Encoding Tensor Properties (before moving to device) ---")
    print(f"Shape: {pe.shape}")
    print(f"dtype: {pe.dtype}")
    print(f"device: {pe.device}")
    print(f"Min value: {pe.min().item()}")
    print(f"Max value: {pe.max().item()}")
    # Estimate memory usage (assuming float32 = 4 bytes)
    memory_bytes = pe.numel() * pe.element_size()
    print(f"Estimated CPU memory usage: {memory_bytes / (1024*1024):.2f} MB")
    print("--------------------------------------------------------------------")
    # --- End Diagnostic ---

    return pe.to(device)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        src = self.embedding(src)
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]

        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder
        # Pass device to Encoder/Decoder so PE is calculated on the correct device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# Initialize the Transformer model
# The positional encoding tensors will be created and moved to the device during
# the initialization of the Encoder and Decoder modules.
model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length)

# The model is now on the device because PE was moved during init, and other
# parameters are implicitly moved with the module structure.
# Explicitly calling model.to(device) here is redundant for moving the PE,
# but it's good practice to ensure all other parameters/buffers are on device.
# However, since the error is at PE creation/move, this line will not be reached.
# model.to(device) # This line is not needed for the error diagnosis


print("Transformer model initialized.")
# If the code reaches this point, the PE tensors were moved successfully.
# The error might then occur later during training.


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * (max_length - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        translated_text = tokenizer.decode_ids(target_sequence[1:])

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=max_length, start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
CUDA version: 12.4
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True
PyTorch expected CUDA version: 12.4
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0                 میں اپنے والدین سے کیسے بات کروں ؟         NaN         NaN   
1                             میں دوست کیسے بنائوں ؟         NaN         NaN   
2                           میں اتنا اداس کیوں ہوں؟.         NaN       

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os

# --- Step 1: Set CUDA_LAUNCH_BLOCKING=1 for better error messages ---
# This needs to be set before any CUDA operations are performed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# --- Step 2: Set device to GPU if available, else CPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Step 3: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    # Get details for the current device (device 0)
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
    print("-----------------------------")

else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
    print("If CUDA is not available, the rest of the code requiring GPU will not work.")

print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---

# Exit if CUDA is not available, as the model is designed for GPU
if not torch.cuda.is_available():
    print("Exiting because CUDA is not available.")
    exit()


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    # Masks should be created on the same device as the input tensors
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    # Ensure nopeak_mask is created on the same device as tgt
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
# This function now takes device as an argument and creates the tensor directly on that device
def get_positional_encoding(max_len, d_model, device):
    # Create positional encoding tensor directly on the target device
    pe = torch.zeros(max_len, d_model, device=device)
    position = torch.arange(0, max_len, dtype=torch.float, device=device).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model), device=device)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties ---
    print("\n--- Positional Encoding Tensor Properties ---")
    print(f"Shape: {pe.shape}")
    print(f"dtype: {pe.dtype}")
    print(f"device: {pe.device}")
    print(f"Min value: {pe.min().item()}")
    print(f"Max value: {pe.max().item()}")
    # Estimate memory usage (assuming float32 = 4 bytes)
    memory_bytes = pe.numel() * pe.element_size()
    print(f"Estimated Device memory usage: {memory_bytes / (1024*1024):.2f} MB")
    print("-------------------------------------------")
    # --- End Diagnostic ---

    return pe


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Ensure calculations are on the same device
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding directly on the specified device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Ensure embedding output is on the same device as PE
        src = self.embedding(src) # Embedding output is on the device of the input tensor (src)
        if src.device != self.pe.device:
            # This should ideally not happen if inputs are moved to device, but adding a safety check
            print(f"Warning: Encoder input device ({src.device}) does not match PE device ({self.pe.device}). Moving PE.")
            self.pe = self.pe.to(src.device) # Move PE to match input device

        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]

        src = self.dropout(src)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding directly on the specified device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Ensure embedding output is on the same device as PE
        tgt = self.embedding(tgt) # Embedding output is on the device of the input tensor (tgt)
        if tgt.device != self.pe.device:
             # This should ideally not happen if inputs are moved to device, but adding a safety check
            print(f"Warning: Decoder input device ({tgt.device}) does not match PE device ({self.pe.device}). Moving PE.")
            self.pe = self.pe.to(tgt.device) # Move PE to match input device


        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder, passing the target device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences (should be on the same device as src/tgt)
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# --- Initialize the Transformer model, passing the device ---
# Positional encoding tensors will be created directly on the device during
# the initialization of the Encoder and Decoder modules.
print("\nInitializing Transformer model...")
try:
    model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device)
    print("Transformer model initialized successfully.")
    # The model should largely be on the device now because its components were.
    # A final .to(device) here ensures any remaining parameters/buffers are moved,
    # but might not be strictly necessary if PE move was the only issue.
    # If the error happens below, it's *not* the PE move causing the direct error anymore.
    model.to(device)
    print("Model confirmed to be on device.")

except Exception as e:
    print(f"Error during Transformer model initialization or final .to(device): {e}")
    print("Please check the traceback above for the specific line that caused the error.")
    # If the error was at get_positional_encoding(..., device=device),
    # the issue is likely with the CUDA environment itself.
    exit() # Exit if model initialization fails


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * (self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

print("\nStarting training loop...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    for batch in tqdm(custom_train_dataloader, desc="Training"):
        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        # Zero out gradients
        optimizer.zero_grad()
        # Forward pass through the model
        output = model(src, tgt_input)
        # Reshape output for loss calculation
        output = output.view(-1, tgt_vocab_size)

        # Calculate loss
        loss = criterion(output, labels)
        total_train_loss += loss.item()

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        # Iterate over the validation data
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:model.encoder.max_len] + [pad_idx] * (model.encoder.max_len - len(encoded_input))
        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the model's max_len for generation limit
        for _ in range(model.decoder.max_len):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
# Use the model's max_length for the translation function as well
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=model.encoder.max_len, # Pass the model's max_length
    start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")



CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True

--- Simple CUDA Tensor Test ---
Error creating/moving test tensor to cuda: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

-----------------------------
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yo

Training:   0%|          | 0/1509 [00:00<?, ?it/s]

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os

# --- Step 1: Set CUDA_LAUNCH_BLOCKING=1 for better error messages ---
# This needs to be set before any CUDA operations are performed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# --- Step 2: Set device to GPU if available, else CPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Step 3: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    # Get details for the current device (device 0)
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
    print("-----------------------------")

else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
    print("If CUDA is not available, the rest of the code requiring GPU will not work.")

print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---

# Exit if CUDA is not available, as the model is designed for GPU
if not torch.cuda.is_available():
    print("Exiting because CUDA is not available.")
    exit()


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    # Masks should be created on the same device as the input tensors
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    # Ensure nopeak_mask is created on the same device as tgt
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
# This function now takes device as an argument and creates the tensor directly on that device
def get_positional_encoding(max_len, d_model, device):
    # Create positional encoding tensor directly on the target device
    pe = torch.zeros(max_len, d_model, device=device)
    position = torch.arange(0, max_len, dtype=torch.float, device=device).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model), device=device)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties ---
    print("\n--- Positional Encoding Tensor Properties ---")
    print(f"Shape: {pe.shape}")
    print(f"dtype: {pe.dtype}")
    print(f"device: {pe.device}")
    print(f"Min value: {pe.min().item()}")
    print(f"Max value: {pe.max().item()}")
    # Estimate memory usage (assuming float32 = 4 bytes)
    memory_bytes = pe.numel() * pe.element_size()
    print(f"Estimated Device memory usage: {memory_bytes / (1024*1024):.2f} MB")
    print("-------------------------------------------")
    # --- End Diagnostic ---

    return pe


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Ensure calculations are on the same device
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding directly on the specified device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Ensure embedding output is on the same device as PE
        src = self.embedding(src) # Embedding output is on the device of the input tensor (src)
        # No need to check/move PE device here if PE is already on the correct device from init

        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        # Ensure PE slicing is correct for the current batch's sequence length
        src = src + self.pe[:, :src.size(1), :]


        src = self.dropout(src)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding directly on the specified device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Ensure embedding output is on the same device as PE
        tgt = self.embedding(tgt) # Embedding output is on the device of the input tensor (tgt)
        # No need to check/move PE device here if PE is already on the correct device from init

        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
         # Ensure PE slicing is correct for the current batch's sequence length
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder, passing the target device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences (should be on the same device as src/tgt)
        # Ensure masks are created on the correct device.
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# --- Initialize the Transformer model, passing the device ---
# Positional encoding tensors will be created directly on the device during
# the initialization of the Encoder and Decoder modules.
print("\nInitializing Transformer model...")
try:
    model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device)
    print("Transformer model initialized successfully.")
    # The model should largely be on the device now because its components were.
    # A final .to(device) here ensures any remaining parameters/buffers are moved,
    # but might not be strictly necessary if PE move was the only issue.
    # If the error happens below, it's *not* the PE move causing the direct error anymore.
    # model.to(device) # This was causing an error before, likely due to PE not being on device first.
                      # Now PE is created on device, so this might work, but let's rely on
                      # PE initialization moving the core parts and inputs being moved in the loop.


except Exception as e:
    print(f"Error during Transformer model initialization: {e}")
    print("Please check the traceback above for the specific line that caused the error.")
    # If the error was at get_positional_encoding(..., device=device),
    # the issue is likely with the CUDA environment itself.
    exit() # Exit if model initialization fails


print("Transformer model should be on device after initialization.")


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_english)) # Use max(0, ...) to avoid negative padding
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_urdu)) # Use max(0, ...) to avoid negative padding


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

print("\nStarting training loop...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    # --- Add diagnostic prints for batch tensors before moving to device ---
    print("\n--- Diagnostic: Inspecting batch tensors before moving to device ---")
    for i, batch in enumerate(tqdm(custom_train_dataloader, desc="Training")):
        if i == 0: # Inspect only the first batch
            src_batch_cpu = batch['input_ids']
            tgt_batch_cpu = batch['labels']

            print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
            print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
            print(f"Source vocab size: {src_vocab_size}") # Check max_value vs vocab_size

            print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
            print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
            print(f"Target vocab size: {tgt_vocab_size}") # Check max_value vs vocab_size
            print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---


        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:model.encoder.max_len] + [pad_idx] * max(0, model.encoder.max_len - len(encoded_input)) # Use max(0, ...)

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the model's max_len for generation limit
        for _ in range(model.decoder.max_len):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
# Use the model's max_length for the translation function as well
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=model.encoder.max_len, # Pass the model's max_length
    start_token_id=start_token_id,
    end_token_id=end_token_id, pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True

--- Simple CUDA Tensor Test ---
Successfully created and moved a test tensor to cuda.
tensor([-2.0334, -0.5138, -0.3794,  1.0689,  0.0168,  0.4950, -1.0198, -0.2708,
         0.0856, -1.2105], device='cuda:0')
-----------------------------
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unn

NameError: name 'model' is not defined

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os

# --- Step 1: Set CUDA_LAUNCH_BLOCKING=1 for better error messages ---
# This needs to be set before any CUDA operations are performed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# --- Step 2: Set device to GPU if available, else CPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Step 3: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    # Get details for the current device (device 0)
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
        # If this fails, there is a fundamental CUDA/PyTorch installation issue.
    print("-----------------------------")

else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
    print("If CUDA is not available, the rest of the code requiring GPU will not work.")

print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---

# Exit if CUDA is not available, as the model is designed for GPU
if not torch.cuda.is_available():
    print("Exiting because CUDA is not available.")
    # In a script, use sys.exit()
    import sys
    sys.exit()


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    import sys
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    import sys
    sys.exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    # Masks should be created on the same device as the input tensors
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    # Ensure nopeak_mask is created on the same device as tgt
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
# Modified to calculate exp on CPU then move to device
def get_positional_encoding(max_len, d_model, device):
    # Create positional encoding tensor on CPU first
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    # Calculate div_term on CPU
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties ---
    print("\n--- Positional Encoding Tensor Properties (before move) ---")
    print(f"Shape: {pe.shape}")
    print(f"dtype: {pe.dtype}")
    print(f"device: {pe.device}")
    print(f"Min value: {pe.min().item()}")
    print(f"Max value: {pe.max().item()}")
    # Estimate memory usage (assuming float32 = 4 bytes)
    memory_bytes = pe.numel() * pe.element_size()
    print(f"Estimated CPU memory usage: {memory_bytes / (1024*1024):.2f} MB")
    print("-------------------------------------------------------")
    # --- End Diagnostic ---

    # Now move the completed PE tensor to the target device
    return pe.to(device)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Ensure calculations are on the same device
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        # This call is where the error was happening previously.
        # get_positional_encoding now creates PE on CPU then moves it.
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Ensure embedding output is on the same device as PE
        src = self.embedding(src) # Embedding output is on the device of the input tensor (src)
        # No need to check/move PE device here if PE is already on the correct device from init

        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        # Ensure PE slicing is correct for the current batch's sequence length
        src = src + self.pe[:, :src.size(1), :]


        src = self.dropout(src)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        # This call is where the error was happening previously.
        # get_positional_encoding now creates PE on CPU then moves it.
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Ensure embedding output is on the same device as PE
        tgt = self.embedding(tgt) # Embedding output is on the device of the input tensor (tgt)
        # No need to check/move PE device here if PE is already on the correct device from init

        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
         # Ensure PE slicing is correct for the current batch's sequence length
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder, passing the target device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences (should be on the same device as src/tgt)
        # Ensure masks are created on the correct device.
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# --- Initialize the Transformer model, passing the device ---
# Positional encoding tensors will be created directly on the device during
# the initialization of the Encoder and Decoder modules.
print("\nInitializing Transformer model...")
try:
    model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device)
    print("Transformer model initialized successfully.")
    # The model should largely be on the device now because its components were.
    # A final .to(device) here ensures any remaining parameters/buffers are moved,
    # but might not be strictly necessary if PE move was the only issue.
    # If the error happens below, it's *not* the PE move causing the direct error anymore.
    # model.to(device) # Removed this redundant/problematic call
                      # Now PE is created on device during init, so this might work, but let's rely on
                      # PE initialization moving the core parts and inputs being moved in the loop.


except Exception as e:
    print(f"Error during Transformer model initialization: {e}")
    print("Please check the traceback above for the specific line that caused the error.")
    # If the error was at get_positional_encoding(..., device=device),
    # the issue is likely with the CUDA environment itself.
    import sys
    sys.exit() # Exit if model initialization fails


print("Transformer model should be on device after initialization.")


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        # Use max(0, ...) to avoid negative padding length calculation if somehow max_length < len
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

print("\nStarting training loop...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data
    # --- Add diagnostic prints for batch tensors before moving to device ---
    # Moved these checks to just before the .to(device) call inside the loop
    # to ensure we are inspecting the exact tensors being moved when the error occurs.
    # print("\n--- Diagnostic: Inspecting batch tensors before moving to device ---")

    for i, batch in enumerate(tqdm(custom_train_dataloader, desc="Training")):
        # if i == 0: # Inspect only the first batch
        #     src_batch_cpu = batch['input_ids']
        #     tgt_batch_cpu = batch['labels']

        #     print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
        #     print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
        #     print(f"Source vocab size: {src_vocab_size}") # Check max_value vs vocab_size

        #     print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
        #     print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
        #     print(f"Target vocab size: {tgt_vocab_size}") # Check max_value vs vocab_size
        #     print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---


        # Move batch tensors to the selected device
        # --- Diagnostic: Inspecting batch tensors before moving to device (within loop) ---
        # This will print for the batch that actually triggers the error
        if i < 5: # Print for the first few batches to see if values are consistent
            src_batch_cpu = batch['input_ids']
            tgt_batch_cpu = batch['labels']

            print(f"\nBatch {i}: Inspecting tensors before moving to {device}...")
            print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
            print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
            print(f"Source vocab size: {src_vocab_size}")

            print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
            print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
            print(f"Target vocab size: {tgt_vocab_size}")
            print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---

        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:model.encoder.max_len] + [pad_idx] * max(0, model.encoder.max_len - len(encoded_input)) # Use max(0, ...)

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the model's max_len for generation limit
        for _ in range(model.decoder.max_len):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
# Use the model's max_length for the translation function as well
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=model.encoder.max_len, # Pass the model's max_length
    start_token_id=start_token_id,
    end_token_id

SyntaxError: incomplete input (<ipython-input-1-1209113918>, line 590)

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys # Import sys for sys.exit()

# --- Step 1: Set CUDA_LAUNCH_BLOCKING=1 for better error messages ---
# This needs to be set before any CUDA operations are performed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# --- Step 2: Set device to GPU if available, else CPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Step 3: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    # Get details for the current device (device 0)
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
        # If this fails, there is a fundamental CUDA/PyTorch installation issue.
    print("-----------------------------")

else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
    print("If CUDA is not available, the rest of the code requiring GPU will not work.")

print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---

# Exit if CUDA is not available, as the model is designed for GPU
if not torch.cuda.is_available():
    print("Exiting because CUDA is not available.")
    sys.exit()


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    sys.exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    # Masks should be created on the same device as the input tensors
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    # Ensure nopeak_mask is created on the same device as tgt
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
# Modified to calculate exp on CPU then move to device
def get_positional_encoding(max_len, d_model, device):
    # Create positional encoding tensor on CPU first
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    # Calculate div_term on CPU
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties ---
    print("\n--- Positional Encoding Tensor Properties (before move) ---")
    print(f"Shape: {pe.shape}")
    print(f"dtype: {pe.dtype}")
    print(f"device: {pe.device}")
    print(f"Min value: {pe.min().item()}")
    print(f"Max value: {pe.max().item()}")
    # Estimate memory usage (assuming float32 = 4 bytes)
    memory_bytes = pe.numel() * pe.element_size()
    print(f"Estimated CPU memory usage: {memory_bytes / (1024*1024):.2f} MB")
    print("-------------------------------------------------------")
    # --- End Diagnostic ---

    # Now move the completed PE tensor to the target device
    return pe.to(device)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Ensure calculations are on the same device
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Ensure embedding output is on the same device as PE
        src = self.embedding(src) # Embedding output is on the device of the input tensor (src)
        # No need to check/move PE device here if PE is already on the correct device from init

        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        # Ensure PE slicing is correct for the current batch's sequence length
        src = src + self.pe[:, :src.size(1), :]


        src = self.dropout(src)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Ensure embedding output is on the same device as PE
        tgt = self.embedding(tgt) # Embedding output is on the device of the input tensor (tgt)
        # No need to check/move PE device here if PE is already on the correct device from init

        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
         # Ensure PE slicing is correct for the current batch's sequence length
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder, passing the target device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences (should be on the same device as src/tgt)
        # Ensure masks are created on the correct device.
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# --- Initialize the Transformer model, passing the device ---
# Positional encoding tensors will be created directly on the device during
# the initialization of the Encoder and Decoder modules.
print("\nInitializing Transformer model...")
try:
    model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device)
    print("Transformer model initialized successfully.")
    # The model should largely be on the device now because its components were.
    # A final .to(device) here ensures any remaining parameters/buffers are moved,
    # but might not be strictly necessary if PE move was the only issue.
    # If the error happens below, it's *not* the PE move causing the direct error anymore.
    # model.to(device) # Removed this redundant/problematic call


except Exception as e:
    print(f"Error during Transformer model initialization: {e}")
    print("Please check the traceback above for the specific line that caused the error.")
    # If the error was at get_positional_encoding(..., device=device),
    # the issue is likely with the CUDA environment itself.
    sys.exit() # Exit if model initialization fails


print("Transformer model should be on device after initialization.")


# 4. Data Loading and Preprocessing with Custom Tokenizer

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        # Use max(0, ...) to avoid negative padding length calculation if somehow max_length < len
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

print("\nStarting training loop...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data

    for i, batch in enumerate(tqdm(custom_train_dataloader, desc="Training")):
        # Move batch tensors to the selected device
        # --- Diagnostic: Inspecting batch tensors before moving to device ---
        # This will print for the batch that actually triggers the error
        if i < 5: # Print for the first few batches to see if values are consistent
            src_batch_cpu = batch['input_ids']
            tgt_batch_cpu = batch['labels']

            print(f"\nBatch {i}: Inspecting tensors before moving to {device}...")
            print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
            print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
            print(f"Source vocab size: {src_vocab_size}")

            print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
            print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
            print(f"Target vocab size: {tgt_vocab_size}")
            print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---

        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:model.encoder.max_len] + [pad_idx] * max(0, model.encoder.max_len - len(encoded_input)) # Use max(0, ...)

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the model's max_len for generation limit
        for _ in range(model.decoder.max_len):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
# Use the model's max_length for the translation function as well
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=model.encoder.max_len, # Pass the model's max_length
    start_token_id=start_token_id,
    end_token_id=end_token_id, # Corrected this line
    pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True

--- Simple CUDA Tensor Test ---
Successfully created and moved a test tensor to cuda.
tensor([-0.2243, -1.1342,  0.5697, -1.1392,  0.5348,  0.8518, -0.9881,  0.1923,
        -0.5311, -1.0137], device='cuda:0')
-----------------------------
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unn

Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31971
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31968
Target vocab size: 32000
-----------------------------------------------------------------


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys # Import sys for sys.exit()

# --- Step 1: Set CUDA_LAUNCH_BLOCKING=1 for better error messages ---
# This needs to be set before any CUDA operations are performed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# --- Step 2: Set device to GPU if available, else CPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Step 3: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    # Get details for the current device (device 0)
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
        # If this fails, there is a fundamental CUDA/PyTorch installation issue.
        # In this case, the PE error was likely a symptom, not the root cause.
        print("Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.")
        import sys
        sys.exit() # Exit if fundamental CUDA test fails

else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
    print("If CUDA is not available, the rest of the code requiring GPU will not work.")
    import sys
    sys.exit() # Exit if CUDA is not available


print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    sys.exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    # Masks should be created on the same device as the input tensors
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    # Ensure nopeak_mask is created on the same device as tgt
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
# Modified to calculate exp on CPU then move to device
def get_positional_encoding(max_len, d_model, device):
    # Create positional encoding tensor on CPU first
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    # Calculate div_term on CPU
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties ---
    # Removed CPU properties print to avoid clutter, focus on device properties if successful
    # print("\n--- Positional Encoding Tensor Properties (before move) ---")
    # print(f"Shape: {pe.shape}")
    # print(f"dtype: {pe.dtype}")
    # print(f"device: {pe.device}")
    # print(f"Min value: {pe.min().item()}")
    # print(f"Max value: {pe.max().item()}")
    # # Estimate memory usage (assuming float32 = 4 bytes)
    # memory_bytes = pe.numel() * pe.element_size()
    # print(f"Estimated CPU memory usage: {memory_bytes / (1024*1024):.2f} MB")
    # print("-------------------------------------------------------")
    # --- End Diagnostic ---

    # Now move the completed PE tensor to the target device
    pe_on_device = pe.to(device)

    # --- Diagnostic: Print PE tensor properties (after move) ---
    print("\n--- Positional Encoding Tensor Properties (on device) ---")
    print(f"Shape: {pe_on_device.shape}")
    print(f"dtype: {pe_on_device.dtype}")
    print(f"device: {pe_on_device.device}")
    print(f"Min value: {pe_on_device.min().item()}")
    print(f"Max value: {pe_on_device.max().item()}")
    # Estimate memory usage on device
    memory_bytes_device = pe_on_device.numel() * pe_on_device.element_size()
    print(f"Estimated Device memory usage: {memory_bytes_device / (1024*1024):.2f} MB")
    print("-------------------------------------------------------")
    # --- End Diagnostic ---


    return pe_on_device


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Ensure calculations are on the same device
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Ensure embedding output is on the same device as PE
        src = self.embedding(src) # Embedding output is on the device of the input tensor (src)
        # No need to check/move PE device here if PE is already on the correct device from init

        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        # Ensure PE slicing is correct for the current batch's sequence length
        src = src + self.pe[:, :src.size(1), :]


        src = self.dropout(src)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Ensure embedding output is on the same device as PE
        tgt = self.embedding(tgt) # Embedding output is on the device of the input tensor (tgt)
        # No need to check/move PE device here if PE is already on the correct device from init

        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
         # Ensure PE slicing is correct for the current batch's sequence length
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder, passing the target device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences (should be on the same device as src/tgt)
        # Ensure masks are created on the correct device.
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# --- Initialize the Transformer model, passing the device ---
# Positional encoding tensors will be created directly on the device during
# the initialization of the Encoder and Decoder modules.
print("\nInitializing Transformer model...")
try:
    model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device)
    print("Transformer model initialized successfully.")
    # The model should largely be on the device now because its components were.
    # A final .to(device) here ensures any remaining parameters/buffers are moved,
    # but might not be strictly necessary if PE move was the only issue.
    # If the error happens below, it's *not* the PE move causing the direct error anymore.
    # model.to(device) # Removed this redundant/problematic call


except Exception as e:
    print(f"Error during Transformer model initialization: {e}")
    print("Please check the traceback above for the specific line that caused the error.")
    sys.exit() # Exit if model initialization fails


print("Transformer model should be on device after initialization.")


# 4. Data Loading and Preprocessing with Custom Dataset

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        # Use max(0, ...) to avoid negative padding length calculation if somehow max_length < len
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

print("\nStarting training loop...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data

    for i, batch in enumerate(tqdm(custom_train_dataloader, desc="Training")):
        # --- Diagnostic: Inspecting batch tensors before moving to device ---
        # This will print for the batch that actually triggers the error
        if i < 5: # Print for the first few batches to see if values are consistent
            src_batch_cpu = batch['input_ids']
            tgt_batch_cpu = batch['labels']

            print(f"\nBatch {i}: Inspecting tensors before moving to {device}...")
            print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
            print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
            print(f"Source vocab size: {src_vocab_size}")

            print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
            print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
            print(f"Target vocab size: {tgt_vocab_size}")
            print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---

        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:model.encoder.max_len] + [pad_idx] * max(0, model.encoder.max_len - len(encoded_input)) # Use max(0, ...)

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the model's max_len for generation limit
        for _ in range(model.decoder.max_len):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# Translate a sample sentence after training
sample_english_sentence = "This is a test sentence."
# Use the model's max_length for the translation function as well
translated_urdu = custom_translate_english_to_urdu(
    sample_english_sentence, model, tokenizer, device,
    max_length=model.encoder.max_len, # Pass the model's max_length
    start_token_id=start_token_id,
    end_token_id=end_token_id,
    pad_idx=pad_idx
)
print(f"\nOriginal English: {sample_english_sentence}")
print(f"Translated Urdu: {translated_urdu}")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True

--- Simple CUDA Tensor Test ---
Successfully created and moved a test tensor to cuda.
tensor([ 1.0400,  0.0233, -1.2081,  1.3635,  0.3674, -0.7474,  0.1935,  0.2716,
        -1.0764, -0.2894], device='cuda:0')
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0      

Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31990
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31966
Target vocab size: 32000
-----------------------------------------------------------------


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys # Import sys for sys.exit()

# --- Step 1: Set CUDA_LAUNCH_BLOCKING=1 for better error messages ---
# This needs to be set before any CUDA operations are performed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# --- Step 2: Set device to GPU if available, else CPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Step 3: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    # Get details for the current device (device 0)
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    # Check if PyTorch is built with CUDA
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
        # If this fails, there is a fundamental CUDA/PyTorch installation issue.
        # In this case, the PE error was likely a symptom, not the root cause.
        print("Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.")
        sys.exit() # Exit if fundamental CUDA test fails

else:
    print("CUDA is not available. Please ensure you have a CUDA-enabled GPU and necessary drivers installed.")
    print("If CUDA is not available, the rest of the code requiring GPU will not work.")
    sys.exit() # Exit if CUDA is not available


print("--------------------------------------------")
# --- End Environment and Compatibility Checks ---


# 1. Data Loading
# Assuming your dataset file is named 'parallel-corpus.xlsx'
# And it has columns "SENTENCES " (English) and "MEANING" (Urdu)
try:
    df = pd.read_excel('parallel-corpus.xlsx') # Use read_excel for .xlsx files
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'parallel-corpus.xlsx' not found. Please check the file name and path.")
    # Exit or handle the error appropriately
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    # Exit or handle the error appropriately
    sys.exit()

# Rename columns for easier access
df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)

# 2. Prepare Data for Tokenizer Training and Train SentencePiece + BPE Tokenizer

# Combine English and Urdu sentences into a single file
with open("corpus.txt", "w", encoding="utf-8") as f:
    for index, row in df.iterrows():
        f.write(str(row['english']) + "\n")
        f.write(str(row['urdu']) + "\n")

print("\nCorpus file created for tokenizer training.")

# Define tokenizer training parameters
# vocab_size: The desired size of your vocabulary
# model_prefix: Prefix for the output model files
# model_type: 'bpe' for BPE-based SentencePiece
spm.SentencePieceTrainer.train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=32000 --model_type=bpe'
)

print("SentencePiece + BPE tokenizer trained.")
print("Model files 'm_bpe.model' and 'm_bpe.vocab' created.")

# Load the trained tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("m_bpe.model")

# Get special token IDs (these might vary based on SentencePiece default)
# Common special tokens are <unk> (unknown), <s> (start), </s> (end), <pad> (padding)
# You can check the m_bpe.vocab file to confirm
# Set pad_idx to 0 explicitly as it's the standard for padding
pad_idx = 0
start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1 # Often 1
end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2 # Often 2
unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3 # Often 3

print(f"\nSpecial Token IDs: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")

# Find the maximum actual length in the dataset (informational now, will use fixed max_length)
max_actual_length = 0
for index, row in df.iterrows():
    english_text = str(row['english'])
    urdu_text = str(row['urdu'])

    encoded_english = tokenizer.encode_as_ids(english_text)
    encoded_urdu = tokenizer.encode_as_ids(urdu_text)

    urdu_sequence_length = len(encoded_urdu) + 2

    max_actual_length = max(max_actual_length, len(encoded_english), urdu_sequence_length)

# Set a fixed, smaller max_length for testing
max_length = 128
print(f"\nMaximum tokenized sequence length in the dataset (including special tokens): {max_actual_length}")
print(f"Setting fixed max_length for padding, truncation, and model to: {max_length}")


# 3. Implement Transformer Encoder-Decoder from Scratch

# Helper function for creating masks
def create_masks(src, tgt, pad_idx):
    # Masks should be created on the same device as the input tensors
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    # Create a triangular mask to prevent attention to future tokens in the target sequence
    # Ensure nopeak_mask is created on the same device as tgt
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper
# Modified to calculate exp on CPU then move to device
def get_positional_encoding(max_len, d_model, device):
    # Create positional encoding tensor on CPU first
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    # Calculate div_term on CPU
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0) # Add batch dimension

    # --- Diagnostic: Print PE tensor properties ---
    # Removed CPU properties print to avoid clutter, focus on device properties if successful
    # print("\n--- Positional Encoding Tensor Properties (before move) ---")
    # print(f"Shape: {pe.shape}")
    # print(f"dtype: {pe.dtype}")
    # print(f"device: {pe.device}")
    # print(f"Min value: {pe.min().item()}")
    # print(f"Max value: {pe.max().item()}")
    # # Estimate memory usage (assuming float32 = 4 bytes)
    # memory_bytes = pe.numel() * pe.element_size()
    # print(f"Estimated CPU memory usage: {memory_bytes / (1024*1024):.2f} MB")
    # print("-------------------------------------------------------")
    # --- End Diagnostic ---

    # Now move the completed PE tensor to the target device
    pe_on_device = pe.to(device)

    # --- Diagnostic: Print PE tensor properties (on device) ---
    print("\n--- Positional Encoding Tensor Properties (on device) ---")
    print(f"Shape: {pe_on_device.shape}")
    print(f"dtype: {pe_on_device.dtype}")
    print(f"device: {pe_on_device.device}")
    print(f"Min value: {pe_on_device.min().item()}")
    print(f"Max value: {pe_on_device.max().item()}")
    # Estimate memory usage on device
    memory_bytes_device = pe_on_device.numel() * pe_on_device.element_size()
    print(f"Estimated Device memory usage: {memory_bytes_device / (1024*1024):.2f} MB")
    print("-------------------------------------------------------")
    # --- End Diagnostic ---


    return pe_on_device


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # Ensure calculations are on the same device
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, src, mask):
        # Ensure embedding output is on the same device as PE
        src = self.embedding(src) # Embedding output is on the device of the input tensor (src)
        # No need to check/move PE device here if PE is already on the correct device from init

        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        # Ensure PE slicing is correct for the current batch's sequence length
        src = src + self.pe[:, :src.size(1), :]


        src = self.dropout(src)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            src = layer(src, mask)
        return src

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        # Pre-calculate positional encoding and keep it on the device
        self.pe = get_positional_encoding(max_len, d_model, device)


    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        # Ensure embedding output is on the same device as PE
        tgt = self.embedding(tgt) # Embedding output is on the device of the input tensor (tgt)
        # No need to check/move PE device here if PE is already on the correct device from init

        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
         # Ensure PE slicing is correct for the current batch's sequence length
        tgt = tgt + self.pe[:, :tgt.size(1), :]

        tgt = self.dropout(tgt)
        for layer in self.layers:
            # Layers should handle tensors on their own device
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Initialize encoder and decoder, passing the target device
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Store padding index
        self.pad_idx = pad_idx

    def forward(self, src, tgt):
        # Create masks for source and target sequences (should be on the same device as src/tgt)
        # Ensure masks are created on the correct device.
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        # Pass source through encoder
        encoder_output = self.encoder(src, src_mask)
        # Pass target and encoder output through decoder
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output

# Define model parameters
src_vocab_size = tokenizer.get_piece_size() # Size of the vocabulary
tgt_vocab_size = tokenizer.get_piece_size()
d_model = 512 # Embedding dimension
num_layers = 6 # Number of encoder/decoder layers
num_heads = 8 # Number of attention heads
d_ff = 2048 # Dimension of feed-forward network
dropout = 0.1

# --- Initialize the Transformer model, passing the device ---
# Positional encoding tensors will be created directly on the device during
# the initialization of the Encoder and Decoder modules.
print("\nInitializing Transformer model...")
try:
    model = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device)
    print("Transformer model initialized successfully.")
    # --- Explicitly move the entire model to the device ---
    # This ensures all parameters and buffers (including embedding weights) are on the GPU.
    print("Moving model to device...")
    model.to(device)
    print("Model moved to device successfully.")


except Exception as e:
    print(f"Error during Transformer model initialization or device transfer: {e}")
    print("Please check the traceback above for the specific line that caused the error.")
    sys.exit() # Exit if model initialization fails


# print("Transformer model should be on device after initialization.") # This print is now redundant


# 4. Data Loading and Preprocessing with Custom Dataset

class CustomTranslationDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # Add start and end tokens to the target sequence
        encoded_urdu = [self.start_token_id] + encoded_urdu + [self.end_token_id]

        # Pad or truncate sequences to the defined max_length
        # Use max(0, ...) to avoid negative padding length calculation if somehow max_length < len
        encoded_english = encoded_english[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_english))
        encoded_urdu = encoded_urdu[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_urdu))


        return {
            'input_ids': torch.tensor(encoded_english, dtype=torch.long),
            'labels': torch.tensor(encoded_urdu, dtype=torch.long)
        }

# Create custom dataset
custom_full_dataset = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
custom_train_size = int(0.8 * len(custom_full_dataset))
custom_val_size = len(custom_full_dataset) - custom_train_size
custom_train_dataset, custom_val_dataset = random_split(custom_full_dataset, [custom_train_size, custom_val_size])

# Define batch size and create data loaders
custom_batch_size = 16
custom_train_dataloader = DataLoader(custom_train_dataset, batch_size=custom_batch_size, shuffle=True)
custom_val_dataloader = DataLoader(custom_val_dataset, batch_size=custom_batch_size)


# 5. Training Loop (with custom model and data)

# Define loss function (CrossEntropyLoss) and ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Define optimizer (Adam) with specified learning rate and parameters
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

num_epochs = 5 # Adjust as needed

print("\nStarting training loop...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    # Iterate over the training data

    for i, batch in enumerate(tqdm(custom_train_dataloader, desc="Training")):
        # --- Diagnostic: Inspecting batch tensors before moving to device ---
        # This will print for the batch that actually triggers the error
        # Only print for the first batch to avoid flooding output
        if i == 0:
            src_batch_cpu = batch['input_ids']
            tgt_batch_cpu = batch['labels']

            print(f"\nBatch {i}: Inspecting tensors before moving to {device}...")
            print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
            print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
            print(f"Source vocab size: {src_vocab_size}")

            print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
            print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
            print(f"Target vocab size: {tgt_vocab_size}")
            print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---

        # Move batch tensors to the selected device
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device) # Target input for decoder

        # The target for loss calculation is the target sequence shifted by one position
        # because the decoder predicts the next token based on the previous ones.
        # We also remove the last token from the target input for the decoder.
        # The labels for the loss calculation are the actual target sequence excluding the first token.
        # This is standard for sequence-to-sequence training.
        tgt_input = tgt[:, :-1]
        labels = tgt[:, 1:].contiguous().view(-1)

        optimizer.zero_grad()
        output = model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(custom_train_dataloader)
    print(f"Training Loss: {avg_train_loss:.4f}")

    # 6. Evaluation (with custom model and data)

    model.eval() # Set model to evaluation mode
    total_val_loss = 0

    print("Evaluating...")
    # Disable gradient calculation during evaluation
    with torch.no_grad():
        for batch in tqdm(custom_val_dataloader, desc="Validation"):
            # Move batch tensors to the selected device
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            # Prepare target input and labels similarly to training
            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(custom_val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")

# 7. Inference (Translation with custom model)

def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length=128, start_token_id=1, end_token_id=2, pad_idx=0):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:model.encoder.max_len] + [pad_idx] * max(0, model.encoder.max_len - len(encoded_input)) # Use max(0, ...)

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the model's max_len for generation limit
        for _ in range(model.decoder.max_len):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# Translate a sample sentence

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True

--- Simple CUDA Tensor Test ---
Successfully created and moved a test tensor to cuda.
tensor([-1.4512, -0.4681,  1.1984, -1.2393, -0.3966,  0.7614, -0.3013,  0.6769,
        -0.3374,  0.8470], device='cuda:0')
--------------------------------------------
Dataset loaded successfully.
                                          SENTENCES   \
0             How can I communicate with my parents?   
1                           How can I make friends?’   
2                              Why do I get so sad?’   
3  If you’ve asked yourself such questions, you’r...   
4  Depending on where you’ve turned for guidance,...   

                                             MEANING  Unnamed: 2  Unnamed: 3  \
0      

Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31992
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31968
Target vocab size: 32000
-----------------------------------------------------------------
Training Loss: 5.2224
Evaluating...


Validation:   0%|          | 0/378 [00:00<?, ?it/s]

Validation Loss: 4.4681

Epoch 2/5


Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31979
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31979
Target vocab size: 32000
-----------------------------------------------------------------
Training Loss: 4.0322
Evaluating...


Validation:   0%|          | 0/378 [00:00<?, ?it/s]

Validation Loss: 3.9376

Epoch 3/5


Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31937
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31924
Target vocab size: 32000
-----------------------------------------------------------------
Training Loss: 3.5234
Evaluating...


Validation:   0%|          | 0/378 [00:00<?, ?it/s]

Validation Loss: 3.6286

Epoch 4/5


Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31953
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31924
Target vocab size: 32000
-----------------------------------------------------------------
Training Loss: 3.1235
Evaluating...


Validation:   0%|          | 0/378 [00:00<?, ?it/s]

Validation Loss: 3.4340

Epoch 5/5


Training:   0%|          | 0/1509 [00:00<?, ?it/s]


Batch 0: Inspecting tensors before moving to cuda...
Source batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Source batch min value: 0, max value: 31974
Source vocab size: 32000
Target batch shape: torch.Size([16, 128]), dtype: torch.int64, device: cpu
Target batch min value: 0, max value: 31965
Target vocab size: 32000
-----------------------------------------------------------------
Training Loss: 2.8033
Evaluating...


Validation:   0%|          | 0/378 [00:00<?, ?it/s]

Validation Loss: 3.3058

Training complete!


In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys
from nltk.translate.bleu_score import corpus_bleu
# Assuming you have already defined your Transformer model classes (Encoder, Decoder, Transformer, etc.)
# and the get_positional_encoding function, and have trained your 'model' object.
# Assuming device, tokenizer, pad_idx, start_token_id, end_token_id, max_length,
# src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout,
# and the original dataframe `df` are already defined from your training code.

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Assuming df, tokenizer, pad_idx, start_token_id, end_token_id, max_length
# and model (trained) are already defined and available in the environment
# from your previous training steps.
# If running in a new session, you would need to load the model here.
# See the "Example of Loading the trained model state_dict" section below.


# Re-define the translate function needed for evaluation and inference
def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        # Use the model's max_length for inference consistency
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input)) # Use max(0, ...)

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the max_length for generation limit
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text


# Assuming your validation dataset `custom_val_dataset` or a separate test set is available.
# If you used random_split on the full dataframe, you might need access to the original dataframe `df`
# and the indices of the test/validation split to get the original text.
# For simplicity, let's assume `df` and the indices used for the *last* split (your validation set
# from the training code) are available. We'll use the original validation set as our test set for demonstration.
# In a real project, always use a test set that was *not* used for training or validation.

# --- Recreate the validation dataset and splits if necessary ---
# This block is for demonstrating in a new session. If you run this right after training,
# the variables custom_full_dataset, custom_train_dataset, custom_val_dataset, df, tokenizer,
# max_length, pad_idx, start_token_id, end_token_id should be available.
# If you are running in a new session, make sure df, tokenizer, max_length, pad_idx,
# start_token_id, end_token_id are defined first (e.g., by running the initial data loading and tokenizer parts).

try:
    # Attempt to access the existing custom_full_dataset and splits
    print("\nAttempting to use existing dataset splits...")
    # This block will only work if you run this code in the same session after training
    _ = custom_full_dataset # Check if it exists
    _ = custom_train_dataset # Check if train split exists
    _ = custom_val_dataset # Check if val split exists
    print("Existing dataset splits found.")

    # Use the validation dataset as the test set for BLEU
    test_dataset_for_bleu = custom_val_dataset
    print(f"Using validation set (size {len(test_dataset_for_bleu)}) as test set for BLEU evaluation.")

except NameError:
    # If dataset objects are not found (e.g., new session), recreate them and the splits
    print("\nExisting dataset splits not found. Recreating datasets and splits...")
    # Assuming df, tokenizer, max_length, start_token_id, end_token_id, pad_idx are defined
    try:
        custom_full_dataset_recreated = CustomTranslationDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)
        custom_train_size_recreated = int(0.8 * len(custom_full_dataset_recreated))
        custom_val_size_recreated = len(custom_full_dataset_recreated) - custom_train_size_recreated
        custom_train_dataset_recreated, custom_val_dataset_recreated = random_split(custom_full_dataset_recreated, [custom_train_size_recreated, custom_val_size_recreated])

        # Use the recreated validation dataset as the test set for BLEU
        test_dataset_for_bleu = custom_val_dataset_recreated
        print(f"Recreated datasets. Using validation set (size {len(test_dataset_for_bleu)}) as test set for BLEU evaluation.")
    except Exception as e:
         print(f"An unexpected error occurred while recreating dataset: {e}")
         test_dataset_for_bleu = None # Ensure test_dataset_for_bleu is defined


# --- Model Saving ---

# 8. Save the trained model
model_save_path = "transformer_translation_model.pth"
print(f"\nSaving model state_dict to {model_save_path}...")
try:
    # Save only the state dictionary (recommended)
    torch.save(model.state_dict(), model_save_path)
    print("Model state_dict saved successfully.")
except Exception as e:
    print(f"Error saving model state_dict: {e}")

# --- Example of Loading the trained model state_dict (if needed in a new session) ---
# Note: You would typically load this in a separate script or cell for inference/evaluation
# if you were not continuing directly after training.
# If you are running this immediately after training, the 'model' object is already available.
# This section is commented out but shows how to load.

# print(f"\nExample: Loading model state_dict from {model_save_path

Using device: cuda

Attempting to use existing dataset splits...
Existing dataset splits found.
Using validation set (size 6033) as test set for BLEU evaluation.

Saving model state_dict to transformer_translation_model.pth...
Model state_dict saved successfully.


In [6]:
import torch
import torch.nn as nn
# No need for optim, math, pandas, spm, Dataset, DataLoader, random_split, tqdm, os, sys for just inference

# Assuming device, tokenizer, pad_idx, start_token_id, end_token_id, max_length
# and model (trained and on the correct device) are already defined and available
# in the environment from your previous training or loading steps.

# Set device to GPU if available, else CPU (redundant if already set, but safe)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


# Re-define the translate function if not already in the current cell
def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input))

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the max_length for generation limit
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text

# --- Example Translation ---

# Define a sample English sentence to translate
sample_english_sentence = "This is a test sentence for translation."
# Make sure your model is on the correct device and in evaluation mode
model.to(device) # Ensure model is on device
model.eval() # Set to evaluation mode

# Perform the translation
print(f"\nOriginal English: {sample_english_sentence}")

try:
    translated_urdu = custom_translate_english_to_urdu(
        sample_english_sentence,
        model,
        tokenizer,
        device,
        max_length=max_length, # Use the max_length consistent with your training
        start_token_id=start_token_id,
        end_token_id=end_token_id,
        pad_idx=pad_idx
    )
    print(f"Translated Urdu: {translated_urdu}")

except Exception as e:
    print(f"Error during translation: {e}")

print("\nTranslation example complete.")

# To translate other sentences, just change the `sample_english_sentence` variable
# and re-run the translation block.

Using device: cuda

Original English: This is a test sentence for translation.
Translated Urdu: یہ امتحان کے لیے ایک ٹیسٹ ہے۔

Translation example complete.


In [7]:
import torch
# No need for nn, optim, math, os, sys for just evaluation if model is already loaded
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split # Needed for splitting test set
from tqdm.notebook import tqdm
from nltk.translate.bleu_score import corpus_bleu

# Assuming device, tokenizer, pad_idx, start_token_id, end_token_id, max_length
# and model (trained and on the correct device) are already defined and available
# in the environment from your previous training or loading steps.
# Assuming the original dataframe `df` is also available.

# Set device to GPU if available, else CPU (redundant if already set, but safe)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device for evaluation: {device}")


# Re-define the translate function if not already in the current cell
def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input))

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the max_length for generation limit
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text


# --- Prepare a Test Set for BLEU Evaluation ---
# For demonstration, we'll split the original dataframe into train/test indices.
# In a real scenario, you MUST use a completely separate test set that was NOT
# used for training or validation splits.

# Create a dummy Dataset to use random_split on the dataframe indices
class IndexDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe
    def __len__(self):
        return len(self.dataframe)
    def __getitem__(self, idx):
         # Just return the index
        return idx

full_index_dataset = IndexDataset(df)

# Split indices for train and test. This is just for demonstration.
# Using 80/20 split for train/test indices
train_size = int(0.8 * len(full_index_dataset))
test_size_for_bleu = len(full_index_dataset) - train_size

if test_size_for_bleu > 0:
    # Split indices, not the actual data. Use a fixed seed for reproducibility if desired.
    # torch.manual_seed(42) # Optional: uncomment for reproducible split
    train_indices, test_indices_for_bleu = random_split(full_index_dataset, [train_size, test_size_for_bleu])
    print(f"\nUsing {len(test_indices_for_bleu)} samples from the original dataframe as test set for BLEU evaluation.")
else:
     print("\nNot enough data to create a test set for BLEU evaluation.")
     test_indices_for_bleu = []


# --- Calculate BLEU Score ---
if len(test_indices_for_bleu) > 0:
    print("\nCalculating Corpus BLEU score on the test set...")
    # Ensure the model is on the correct device and in evaluation mode
    model.to(device)
    model.eval()

    hypotheses = [] # Model's translated token lists
    references = [] # Actual target token lists (list of lists)

    print("Generating translations for BLEU evaluation...")
    # Iterate through the test set indices to get original text and generate translations
    for original_idx_in_df in tqdm(test_indices_for_bleu, desc="Translating test set"):
        english_text = str(df.iloc[original_idx_in_df]['english']) # Get original English text from dataframe
        original_urdu_text = str(df.iloc[original_idx_in_df]['urdu']) # Get original Urdu text (reference)

        # Generate translation using the custom_translate function
        translated_urdu = custom_translate_english_to_urdu(
            english_text, model, tokenizer, device,
            max_length=max_length, # Use the max_length consistent with your training
            start_token_id=start_token_id,
            end_token_id=end_token_id,
            pad_idx=pad_idx
        )

        # Tokenize both the hypothesis and reference using the SentencePiece tokenizer
        # NLTK's corpus_bleu expects lists of tokens (strings)
        hypothesis_tokens = tokenizer.encode_as_pieces(translated_urdu)
        # References need to be a list of lists, even if you only have one reference per sentence
        reference_tokens = [tokenizer.encode_as_pieces(original_urdu_text)] # List containing one reference token list

        hypotheses.append(hypothesis_tokens)
        references.append(reference_tokens)

    # Calculate corpus BLEU score
    # Ensure hypotheses and references are not empty lists and have the same length
    if hypotheses and references and len(hypotheses) == len(references):
        try:
            # NLTK's corpus_bleu takes a list of reference sentences (each a list of token strings)
            # and a list of hypothesis sentences (each a list of token strings).
            # Our 'references' list structure matches what corpus_bleu expects directly.
            bleu_score = corpus_bleu(references, hypotheses)
            print(f"\nCorpus BLEU Score: {bleu_score:.4f}")

        except Exception as e:
             print(f"Error calculating BLEU score: {e}")
             print("This might happen if there are empty sequences or other issues with tokenization.")
    else:
        print("\nCould not calculate BLEU score: Hypotheses or references list is empty or their lengths mismatch.")
        print(f"Number of hypotheses: {len(hypotheses)}")
        print(f"Number of references: {len(references)}")

else:
    print("\nNo test dataset samples available for BLEU evaluation.")


print("\nBLEU evaluation complete.")

Using device for evaluation: cuda

Using 6033 samples from the original dataframe as test set for BLEU evaluation.

Calculating Corpus BLEU score on the test set...
Generating translations for BLEU evaluation...


Translating test set:   0%|          | 0/6033 [00:00<?, ?it/s]


Corpus BLEU Score: 0.2362

BLEU evaluation complete.


In [11]:
import torch
# No need for nn, optim, math, os, sys for just inference

# Assuming device, tokenizer, pad_idx, start_token_id, end_token_id, max_length
# and model (trained and on the correct device) are already defined and available
# in the environment from your previous training or loading steps.
# Assuming the original dataframe `df` is also available (though not strictly
# needed for just a single example translation).

# Set device to GPU if available, else CPU (redundant if already set, but safe)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device for translation: {device}")


# Re-define the translate function if not already in the current cell
def custom_translate_english_to_urdu(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence
        encoded_input = tokenizer.encode_as_ids(text)
        # Pad or truncate the encoded input
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input))

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token
        target_sequence = [start_token_id]
        # Convert to tensor, add batch dimension, and move to device
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop - use the max_length for generation limit
        for _ in range(max_length):
            # Get model output for the current target sequence
            output = model(src_tensor, tgt_tensor)
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (excluding the start token)
        # Ensure we handle the case where only the start token was generated
        translated_text = tokenizer.decode_ids(target_sequence[1:]) if len(target_sequence) > 1 else ""

        return translated_text


# --- Single Example Translation ---

print("\n--- Single Example Translation ---")
# Ensure the model is on the correct device and in evaluation mode
model.to(device) # Ensure model is on device
model.eval() # Set to evaluation mode

# Define a sample English sentence to translate
# You can change this sentence to test different inputs
sample_english_sentence = "Who is your father?"
print(f"Original English: {sample_english_sentence}")

try:
    translated_urdu = custom_translate_english_to_urdu(
        sample_english_sentence,
        model,
        tokenizer,
        device,
        max_length=max_length, # Use the max_length consistent with your training
        start_token_id=start_token_id,
        end_token_id=end_token_id,
        pad_idx=pad_idx
    )
    print(f"Translated Urdu: {translated_urdu}")

except Exception as e:
    print(f"Error during translation: {e}")

print("----------------------------------")

# To translate other sentences, just change the `sample_english_sentence` variable
# and re-run this block. You can add more print statements and function calls
# to translate multiple sentences if needed.

Using device for translation: cuda

--- Single Example Translation ---
Original English: Who is your father?
Translated Urdu: تمہارا پروردگار ہے؟
----------------------------------


In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys
from nltk.translate.bleu_score import corpus_bleu

# --- Step 0: Configuration ---
# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define paths and names
tokenizer_model_path = "m_bpe.model" # Path to your SentencePiece model file (same for both directions)
data_file = 'parallel-corpus.xlsx' # Path to your Excel data file
urdu_to_english_model_save_path = "transformer_urdu_to_english_model.pth" # New path for U->E model

# Model Hyperparameters (MUST match the English->Urdu model architecture if using the same Transformer class)
# These are needed to load the model correctly.
# vocab_size will be the same for both source and target as we use one joint tokenizer
d_model = 512
num_layers = 6
num_heads = 8
d_ff = 2048
dropout = 0.1
max_length = 128 # Use the same max_length as English->Urdu

# Training Parameters
num_epochs = 5 # Adjust as needed
custom_batch_size = 16 # Adjust as needed

# --- Step 1: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
        print("Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.")
        sys.exit()
else:
    print("CUDA is not available. Exiting.")
    sys.exit()
print("--------------------------------------------")


# --- Step 2: Load Data ---
try:
    df = pd.read_excel(data_file)
    print("Dataset loaded successfully.")
    df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)
except FileNotFoundError:
    print(f"Error: {data_file} not found.")
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    sys.exit()


# --- Step 3: Load Tokenizer ---
# Assuming the tokenizer model 'm_bpe.model' and 'm_bpe.vocab' were already trained
# using the combined English and Urdu corpus in your previous run.
try:
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(tokenizer_model_path)
    print(f"Tokenizer loaded successfully from {tokenizer_model_path}.")
    pad_idx = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0
    start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1
    end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2
    unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3
    print(f"Special Token IDs from tokenizer: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")
    vocab_size = tokenizer.get_piece_size() # Joint vocabulary size
    print(f"Joint Vocabulary Size: {vocab_size}")

except FileNotFoundError:
    print(f"Error: Tokenizer model file {tokenizer_model_path} not found.")
    print("Please train the tokenizer first by running the initial part of your English->Urdu script.")
    sys.exit()
except Exception as e:
    print(f"An error occurred while loading tokenizer: {e}")
    sys.exit()

# Set source and target vocab sizes - they are the same for a joint tokenizer
src_vocab_size = vocab_size
tgt_vocab_size = vocab_size


# --- Step 4: Implement Transformer Model and Data Handling (Adapt for U->E) ---

# Helper function for creating masks (remains the same)
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper (remains the same)
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0)
    pe_on_device = pe.to(device)
    # print("\n--- Positional Encoding Tensor Properties (on device) ---")
    # print(f"Shape: {pe_on_device.shape}, dtype: {pe_on_device.dtype}, device: {pe_on_device.device}")
    # print(f"Min value: {pe_on_device.min().item()}, Max value: {pe_on_device.max().item()}")
    # print(f"Estimated Device memory usage: {pe_on_device.numel() * pe_on_device.element_size() / (1024*1024):.2f} MB")
    # print("-------------------------------------------------------")
    return pe_on_device

# MultiHeadAttention (remains the same)
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)
    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

# FeedForward (remains the same)
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

# EncoderLayer (remains the same)
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

# Encoder (remains the same, but processes Urdu tokens)
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.pe = get_positional_encoding(max_len, d_model, device)
    def forward(self, src, mask):
        src = self.embedding(src)
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]
        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

# DecoderLayer (remains the same)
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)
    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

# Decoder (remains the same, but generates English tokens)
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        self.pe = get_positional_encoding(max_len, d_model, device)
    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]
        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

# Transformer (remains the same architecture, handles U->E data)
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Encoder processes source language (Urdu)
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Decoder generates target language (English)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.pad_idx = pad_idx
    def forward(self, src, tgt):
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output


# Custom Dataset for Urdu -> English
class UrduToEnglishDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # For Urdu->English, Urdu is input, English is labels
        encoded_input = encoded_urdu
        encoded_labels = [self.start_token_id] + encoded_english + [self.end_token_id] # Add start/end to target English

        # Pad or truncate sequences to the defined max_length
        encoded_input = encoded_input[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_input))
        encoded_labels = encoded_labels[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_labels))


        return {
            'input_ids': torch.tensor(encoded_input, dtype=torch.long), # Urdu tokens
            'labels': torch.tensor(encoded_labels, dtype=torch.long)  # English tokens with start/end
        }


# Inference function for Urdu -> English (Greedy)
def custom_translate_urdu_to_english(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence (Urdu)
        encoded_input = tokenizer.encode_as_ids(text) # Input is Urdu text
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input))

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token (for English)
        target_sequence = [start_token_id] # Start token for English generation
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            output = model(src_tensor, tgt_tensor) # src is Urdu, tgt_tensor is generated English
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (English)
        translated_text = tokenizer.decode_ids(target_sequence[1:]) # Decode generated English tokens

        return translated_text

# --- Step 5: Create and Split Dataset (Urdu -> English) ---
print("\nCreating Urdu->English Dataset...")
urdu_to_english_full_dataset = UrduToEnglishDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
urdu_to_english_train_size = int(0.8 * len(urdu_to_english_full_dataset))
urdu_to_english_val_size = len(urdu_to_english_full_dataset) - urdu_to_english_train_size
urdu_to_english_train_dataset, urdu_to_english_val_dataset = random_split(urdu_to_english_full_dataset, [urdu_to_english_train_size, urdu_to_english_val_size])

# Define batch size and create data loaders
urdu_to_english_train_dataloader = DataLoader(urdu_to_english_train_dataset, batch_size=custom_batch_size, shuffle=True)
urdu_to_english_val_dataloader = DataLoader(urdu_to_english_val_dataset, batch_size=custom_batch_size)

print(f"Urdu->English Train dataset size: {len(urdu_to_english_train_dataset)}")
print(f"Urdu->English Validation dataset size: {len(urdu_to_english_val_dataset)}")


# --- Step 6: Initialize and Train Urdu->English Model ---
print("\nInitializing Urdu->English Transformer model...")
# Create a NEW model instance for Urdu->English
urdu_to_english_model = Transformer(
    src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device
)
print("Urdu->English Transformer model initialized successfully.")

# Move the new model to the device
print("Moving Urdu->English model to device...")
urdu_to_english_model.to(device)
print("Urdu->English model moved to device successfully.")


# Define loss function and optimizer for U->E model
urdu_to_english_criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
urdu_to_english_optimizer = optim.Adam(urdu_to_english_model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

print("\nStarting Urdu->English training loop...")
for epoch in range(num_epochs):
    urdu_to_english_model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nUrdu->English Epoch {epoch + 1}/{num_epochs}")
    for i, batch in enumerate(tqdm(urdu_to_english_train_dataloader, desc="Urdu->English Training")):
        src = batch['input_ids'].to(device) # Urdu input
        tgt = batch['labels'].to(device)    # English labels

        tgt_input = tgt[:, :-1] # English input for decoder
        labels = tgt[:, 1:].contiguous().view(-1) # English labels for loss

        urdu_to_english_optimizer.zero_grad()
        output = urdu_to_english_model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = urdu_to_english_criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        urdu_to_english_optimizer.step()

    avg_train_loss = total_train_loss / len(urdu_to_english_train_dataloader)
    print(f"Urdu->English Training Loss: {avg_train_loss:.4f}")

    # Evaluation
    urdu_to_english_model.eval()
    total_val_loss = 0

    print("Urdu->English Evaluating...")
    with torch.no_grad():
        for batch in tqdm(urdu_to_english_val_dataloader, desc="Urdu->English Validation"):
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = urdu_to_english_model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = urdu_to_english_criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(urdu_to_english_val_dataloader)
    print(f"Urdu->English Validation Loss: {avg_val_loss:.4f}")

print("\nUrdu->English Training complete!")


# --- Step 7: Save the Trained Urdu->English Model ---
print(f"\nSaving Urdu->English model state_dict to {urdu_to_english_model_save_path}...")
try:
    torch.save(urdu_to_english_model.state_dict(), urdu_to_english_model_save_path)
    print("Urdu->English model state_dict saved successfully.")
except Exception as e:
    print(f"Error saving Urdu->English model state_dict: {e}")


# --- Step 8: Prepare Test Set for Urdu->English BLEU Evaluation ---
# Use the validation set indices from the U->E split as the test set for BLEU
# In a real scenario, use a completely separate test set indices.
urdu_to_english_test_indices_for_bleu = urdu_to_english_val_dataset.indices if hasattr(urdu_to_english_val_dataset, 'indices') else range(len(urdu_to_english_val_dataset))

if len(urdu_to_english_test_indices_for_bleu) > 0:
    print(f"\nUsing {len(urdu_to_english_test_indices_for_bleu)} samples from the original dataframe indices (from U->E validation split) as test set for BLEU evaluation.")
else:
     print("\nNot enough data to create a test set for Urdu->English BLEU evaluation.")


# --- Step 9: Calculate Urdu->English BLEU Score ---
if len(urdu_to_english_test_indices_for_bleu) > 0:
    print("\nCalculating Corpus BLEU score on the Urdu->English test set...")
    urdu_to_english_model.eval() # Ensure the trained model is in evaluation mode

    hypotheses = [] # Model's translated token lists (English)
    references = [] # Actual target token lists (list of lists) (English)

    print("Generating translations for Urdu->English BLEU evaluation...")
    for original_idx_in_df in tqdm(urdu_to_english_test_indices_for_bleu, desc="Translating U->E test set"):
        original_urdu_text = str(df.iloc[original_idx_in_df]['urdu']) # Get original Urdu text from dataframe
        original_english_text = str(df.iloc[original_idx_in_df]['english']) # Get original English text (reference)

        # Generate translation using the custom_translate_urdu_to_english function
        translated_english = custom_translate_urdu_to_english(
            original_urdu_text, urdu_to_english_model, tokenizer, device,
            max_length=max_length,
            start_token_id=start_token_id, # Start token for English
            end_token_id=end_token_id,     # End token for English
            pad_idx=pad_idx
        )

        # Tokenize both the hypothesis (translated English) and reference (original English)
        hypothesis_tokens = tokenizer.encode_as_pieces(translated_english)
        reference_tokens = [tokenizer.encode_as_pieces(original_english_text)] # Reference is original English

        hypotheses.append(hypothesis_tokens)
        references.append(reference_tokens)

    # Calculate corpus BLEU score
    if hypotheses and references and len(hypotheses) == len(references):
        try:
            bleu_score_urdu_to_english = corpus_bleu(references, hypotheses)
            print(f"\nCorpus BLEU Score (Urdu->English): {bleu_score_urdu_to_english:.4f}")

        except Exception as e:
             print(f"Error calculating Urdu->English BLEU score: {e}")
             print("This might happen if there are empty sequences or other issues with tokenization.")
    else:
        print("\nCould not calculate Urdu->English BLEU score: Hypotheses or references list is empty or their lengths mismatch.")
        print(f"Number of hypotheses: {len(hypotheses)}")
        print(f"Number of references: {len(references)}")

else:
    print("\nNo test dataset samples available for Urdu->English BLEU evaluation.")

print("\nUrdu->English BLEU evaluation complete.")

CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True
Successfully created and moved a test tensor to cuda.
--------------------------------------------
Dataset loaded successfully.
Tokenizer loaded successfully from m_bpe.model.
Special Token IDs from tokenizer: Padding: -1, Start: 1, End: 2, Unknown: 0
Joint Vocabulary Size: 32000

Creating Urdu->English Dataset...
Urdu->English Train dataset size: 24131
Urdu->English Validation dataset size: 6033

Initializing Urdu->English Transformer model...
Urdu->English Transformer model initialized successfully.
Moving Urdu->English model to device...
Urdu->English model moved to device successfully.

Starting Urdu->English training loop...

Urdu->English Epoch 1/5


Urdu->English Training:   0%|          | 0/1509 [00:00<?, ?it/s]

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys
from nltk.translate.bleu_score import corpus_bleu

# --- Step 0: Configuration ---
# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define paths and names
tokenizer_model_path = "m_bpe.model" # Path to your SentencePiece model file (same for both directions)
data_file = 'parallel-corpus.xlsx' # Path to your Excel data file
urdu_to_english_model_save_path = "transformer_urdu_to_english_model.pth" # New path for U->E model

# Model Hyperparameters (MUST match the English->Urdu model architecture if using the same Transformer class)
# These are needed to load the model correctly.
# vocab_size will be the same for both source and target as we use one joint tokenizer
d_model = 512
num_layers = 6
num_heads = 8
d_ff = 2048
dropout = 0.1
max_length = 128 # Use the same max_length as English->Urdu

# Training Parameters
num_epochs = 5 # Adjust as needed
custom_batch_size = 16 # Adjust as needed

# --- Step 1: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")
    try:
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
    except Exception as e:
        print(f"Error creating/moving test tensor to {device}: {e}")
        print("Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.")
        sys.exit()
else:
    print("CUDA is not available. Exiting.")
    sys.exit()
print("--------------------------------------------")


# --- Step 2: Load Data ---
try:
    df = pd.read_excel(data_file)
    print("Dataset loaded successfully.")
    df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)
except FileNotFoundError:
    print(f"Error: {data_file} not found.")
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    sys.exit()


# --- Step 3: Load Tokenizer ---
# Assuming the tokenizer model 'm_bpe.model' and 'm_bpe.vocab' were already trained
# using the combined English and Urdu corpus in your previous run.
try:
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(tokenizer_model_path)
    print(f"Tokenizer loaded successfully from {tokenizer_model_path}.")
    pad_idx = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0
    start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1
    end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2
    unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3
    print(f"Special Token IDs from tokenizer: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")
    vocab_size = tokenizer.get_piece_size() # Joint vocabulary size
    print(f"Joint Vocabulary Size: {vocab_size}")

except FileNotFoundError:
    print(f"Error: Tokenizer model file {tokenizer_model_path} not found.")
    print("Please train the tokenizer first by running the initial part of your English->Urdu script.")
    sys.exit()
except Exception as e:
    print(f"An error occurred while loading tokenizer: {e}")
    sys.exit()

# Set source and target vocab sizes - they are the same for a joint tokenizer
src_vocab_size = vocab_size
tgt_vocab_size = vocab_size


# --- Step 4: Implement Transformer Model and Data Handling (Adapt for U->E) ---

# Helper function for creating masks (remains the same)
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper (remains the same)
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0)
    pe_on_device = pe.to(device)
    # print("\n--- Positional Encoding Tensor Properties (on device) ---")
    # print(f"Shape: {pe_on_device.shape}, dtype: {pe_on_device.dtype}, device: {pe_on_device.device}")
    # print(f"Min value: {pe_on_device.min().item()}, Max value: {pe_on_device.max().item()}")
    # print(f"Estimated Device memory usage: {pe_on_device.numel() * pe_on_device.element_size() / (1024*1024):.2f} MB")
    # print("-------------------------------------------------------")
    return pe_on_device

# MultiHeadAttention (remains the same)
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)
    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

# FeedForward (remains the same)
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

# EncoderLayer (remains the same)
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

# Encoder (remains the same, but processes Urdu tokens)
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.pe = get_positional_encoding(max_len, d_model, device)
    def forward(self, src, mask):
        src = self.embedding(src)
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]
        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

# DecoderLayer (remains the same)
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)
    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

# Decoder (remains the same, but generates English tokens)
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        self.pe = get_positional_encoding(max_len, d_model, device)
    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]
        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

# Transformer (remains the same architecture, handles U->E data)
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Encoder processes source language (Urdu)
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Decoder generates target language (English)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.pad_idx = pad_idx
    def forward(self, src, tgt):
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output


# Custom Dataset for Urdu -> English
class UrduToEnglishDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # For Urdu->English, Urdu is input, English is labels
        encoded_input = encoded_urdu
        encoded_labels = [self.start_token_id] + encoded_english + [self.end_token_id] # Add start/end to target English

        # Pad or truncate sequences to the defined max_length
        encoded_input = encoded_input[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_input))
        encoded_labels = encoded_labels[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_labels))


        return {
            'input_ids': torch.tensor(encoded_input, dtype=torch.long), # Urdu tokens
            'labels': torch.tensor(encoded_labels, dtype=torch.long)  # English tokens with start/end
        }


# Inference function for Urdu -> English (Greedy)
def custom_translate_urdu_to_english(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence (Urdu)
        encoded_input = tokenizer.encode_as_ids(text) # Input is Urdu text
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input))

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token (for English)
        target_sequence = [start_token_id] # Start token for English generation
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            output = model(src_tensor, tgt_tensor) # src is Urdu, tgt_tensor is generated English
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (English)
        translated_text = tokenizer.decode_ids(target_sequence[1:]) # Decode generated English tokens

        return translated_text

# --- Step 5: Create and Split Dataset (Urdu -> English) ---
print("\nCreating Urdu->English Dataset...")
urdu_to_english_full_dataset = UrduToEnglishDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
urdu_to_english_train_size = int(0.8 * len(urdu_to_english_full_dataset))
urdu_to_english_val_size = len(urdu_to_english_full_dataset) - urdu_to_english_train_size
urdu_to_english_train_dataset, urdu_to_english_val_dataset = random_split(urdu_to_english_full_dataset, [urdu_to_english_train_size, urdu_to_english_val_size])

# Define batch size and create data loaders
urdu_to_english_train_dataloader = DataLoader(urdu_to_english_train_dataset, batch_size=custom_batch_size, shuffle=True)
urdu_to_english_val_dataloader = DataLoader(urdu_to_english_val_dataset, batch_size=custom_batch_size)

print(f"Urdu->English Train dataset size: {len(urdu_to_english_train_dataset)}")
print(f"Urdu->English Validation dataset size: {len(urdu_to_english_val_dataset)}")


# --- Step 6: Initialize and Train Urdu->English Model ---
print("\nInitializing Urdu->English Transformer model...")
# Create a NEW model instance for Urdu->English
urdu_to_english_model = Transformer(
    src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device
)
print("Urdu->English Transformer model initialized successfully.")

# Move the new model to the device
print("Moving Urdu->English model to device...")
urdu_to_english_model.to(device)
print("Urdu->English model moved to device successfully.")


# Define loss function and optimizer for U->E model
urdu_to_english_criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
urdu_to_english_optimizer = optim.Adam(urdu_to_english_model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

print("\nStarting Urdu->English training loop...")
for epoch in range(num_epochs):
    urdu_to_english_model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nUrdu->English Epoch {epoch + 1}/{num_epochs}")
    for i, batch in enumerate(tqdm(urdu_to_english_train_dataloader, desc="Urdu->English Training")):

        # --- Diagnostic: Inspecting batch tensors before moving to device (within loop) ---
        # Print for the first batch of each epoch
        if i == 0:
            src_batch_cpu = batch['input_ids']
            tgt_batch_cpu = batch['labels']

            print(f"\nBatch {i}: Inspecting U->E tensors before moving to {device}...")
            print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
            print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
            print(f"Source vocab size (Urdu input): {src_vocab_size}") # src is Urdu for this model

            print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
            print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
            print(f"Target vocab size (English labels): {tgt_vocab_size}") # tgt is English for this model
            print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---


        src = batch['input_ids'].to(device) # Urdu input
        tgt = batch['labels'].to(device)    # English labels

        tgt_input = tgt[:, :-1] # English input for decoder
        labels = tgt[:, 1:].contiguous().view(-1) # English labels for loss

        urdu_to_english_optimizer.zero_grad()
        output = urdu_to_english_model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = urdu_to_english_criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        urdu_to_english_optimizer.step()

    avg_train_loss = total_train_loss / len(urdu_to_english_train_dataloader)
    print(f"Urdu->English Training Loss: {avg_train_loss:.4f}")

    # Evaluation
    urdu_to_english_model.eval()
    total_val_loss = 0

    print("Urdu->English Evaluating...")
    with torch.no_grad():
        for i, batch in enumerate(tqdm(urdu_to_english_val_dataloader, desc="Urdu->English Validation")):
             # --- Diagnostic: Inspecting batch tensors before moving to device (within loop) ---
            # Print for the first batch of evaluation
            if i == 0 and epoch == 0: # Only print once per training run
                src_batch_cpu = batch['input_ids']
                tgt_batch_cpu = batch['labels']

                print(f"\nBatch {i}: Inspecting U->E Val tensors before moving to {device}...")
                print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
                print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
                print(f"Source vocab size (Urdu input): {src_vocab_size}")

                print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
                print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
                print(f"Target vocab size (English labels): {tgt_vocab_size}")
                print("-----------------------------------------------------------------")
            # --- End diagnostic prints ---

            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = urdu_to_english_model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = urdu_to_english_criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(urdu_to_english_val_dataloader)
    print(f"Urdu->English Validation Loss: {avg_val_loss:.4f}")

print("\nUrdu->English Training complete!")


# --- Step 7: Save the Trained Urdu->English Model ---
print(f"\nSaving Urdu->English model state_dict to {urdu_to_english_model_save_path}...")
try:
    torch.save(urdu_to_english_model.state_dict(), urdu_to_english_model_save_path)
    print("Urdu->English model state_dict saved successfully.")
except Exception as e:
    print(f"Error saving Urdu->English model state_dict: {e}")


# --- Step 8: Prepare Test Set for Urdu->English BLEU Evaluation ---
# Use the validation set indices from the U->E split as the test set for BLEU
# In a real scenario, use a completely separate test set indices.
urdu_to_english_test_indices_for_bleu = urdu_to_english_val_dataset.indices if hasattr(urdu_to_english_val_dataset, 'indices') else range(len(urdu_to_english_val_dataset))

if len(urdu_to_english_test_indices_for_bleu) > 0:
    print(f"\nUsing {len(urdu_to_english_test_indices_for_bleu)} samples from the original dataframe indices (from U->E validation split) as test set for BLEU evaluation.")
else:
     print("\nNot enough data to create a test set for Urdu->English BLEU evaluation.")


# --- Step 9: Calculate Urdu->English BLEU Score ---
if len(urdu_to_english_test_indices_for_bleu) > 0:
    print("\nCalculating Corpus BLEU score on the Urdu->English test set...")
    urdu_to_english_model.eval() # Ensure the trained model is in evaluation mode

    hypotheses = [] # Model's translated token lists (English)
    references = [] # Actual target token lists (list of lists) (English)

    print("Generating translations for Urdu->English BLEU evaluation...")
    for original_idx_in_df in tqdm(urdu_to_english_test_indices_for_bleu, desc="Translating U->E test set"):
        original_urdu_text = str(df.iloc[original_idx_in_df]['urdu']) # Get original Urdu text from dataframe
        original_english_text = str(df.iloc[original_idx_in_df]['english']) # Get original English text (reference)

        # Generate translation using the custom_translate_urdu_to_english function
        translated_english = custom_translate_urdu_to_english(
            original_urdu_text, urdu_to_english_model, tokenizer, device,
            max_length=max_length,
            start_token_id=start_token_id, # Start token for English
            end_token_id=end_token_id,     # End token for English
            pad_idx=pad_idx
        )

        # Tokenize both the hypothesis (translated English) and reference (original English)
        hypothesis_tokens = tokenizer.encode_as_pieces(translated_english)
        reference_tokens = [tokenizer.encode_as_pieces(original_english_text)] # Reference is original English

        hypotheses.append(hypothesis_tokens)
        references.append(reference_tokens)

    # Calculate corpus BLEU score
    if hypotheses and references and len(hypotheses) == len(references):
        try:
            bleu_score_urdu_to_english = corpus_bleu(references, hypotheses)
            print(f"\nCorpus BLEU Score (Urdu->English): {bleu_score_urdu_to_english:.4f}")

        except Exception as e:
             print(f"Error calculating Urdu->English BLEU score: {e}")
             print("This might happen if there are empty sequences or other issues with tokenization.")
    else:
        print("\nCould not calculate Urdu->English BLEU score: Hypotheses or references list is empty or their lengths mismatch.")
        print(f"Number of hypotheses: {len(hypotheses)}")
        print(f"Number of references: {len(references)}")

else:
    print("\nNo test dataset samples available for Urdu->English BLEU evaluation.")

print("\nUrdu->English BLEU evaluation complete.")

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True
Error creating/moving test tensor to cuda: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.
Traceback (most recent call last):
  File "<ipython-input-13-4141178837>", line 55, in <cell line: 0>
    test_tensor = torch.randn(10).to(device)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages

TypeError: object of type 'NoneType' has no len()

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import pandas as pd
import sentencepiece as spm
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm.notebook import tqdm
import os
import sys
from nltk.translate.bleu_score import corpus_bleu

# --- Step 0: Configuration ---
# Set CUDA_LAUNCH_BLOCKING=1 for better error messages
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("CUDA_LAUNCH_BLOCKING is set to 1")

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define paths and names
tokenizer_model_path = "m_bpe.model" # Path to your SentencePiece model file (same for both directions)
data_file = 'parallel-corpus.xlsx' # Path to your Excel data file
urdu_to_english_model_save_path = "transformer_urdu_to_english_model.pth" # New path for U->E model

# Model Hyperparameters (MUST match the English->Urdu model architecture if using the same Transformer class)
# These are needed to load the model correctly.
# vocab_size will be the same for both source and target as we use one joint tokenizer
d_model = 512
num_layers = 6
num_heads = 8
d_ff = 2048
dropout = 0.1
max_length = 128 # Use the same max_length as English->Urdu

# Training Parameters
num_epochs = 5 # Adjust as needed
custom_batch_size = 16 # Adjust as needed

# --- Step 1: Environment and Compatibility Checks ---
print("\n--- Environment and Compatibility Checks ---")
print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA version: {torch.version.cuda}")
    current_device = torch.cuda.current_device()
    print(f"Current CUDA device: {current_device}")
    print(f"GPU Name: {torch.cuda.get_device_name(current_device)}")
    print(f"GPU Capability: {torch.cuda.get_device_capability(current_device)}")
    print(f"PyTorch built with CUDA: {torch.backends.cuda.is_built()}")

    # --- Simple CUDA Tensor Test ---
    print("\n--- Simple CUDA Tensor Test ---")
    try:
        # Attempt to create and move a small tensor to the GPU
        test_tensor = torch.randn(10).to(device)
        print(f"Successfully created and moved a test tensor to {device}.")
        print(test_tensor)
    except Exception as e:
        # If this fails, there is a fundamental CUDA/PyTorch installation issue.
        print(f"Error creating/moving test tensor to {device}: {e}")
        print("Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.")
        # We exit here because training requiring CUDA will fail anyway
        sys.exit()
else:
    print("CUDA is not available. Exiting as GPU is required for this model.")
    sys.exit()
print("--------------------------------------------")


# --- Step 2: Load Data ---
try:
    df = pd.read_excel(data_file)
    print("Dataset loaded successfully.")
    df.rename(columns={"SENTENCES ": "english", "MEANING": "urdu"}, inplace=True)
except FileNotFoundError:
    print(f"Error: {data_file} not found.")
    sys.exit()
except Exception as e:
    print(f"An error occurred while reading the Excel file: {e}")
    sys.exit()


# --- Step 3: Load Tokenizer ---
# Assuming the tokenizer model 'm_bpe.model' and 'm_bpe.vocab' were already trained
# using the combined English and Urdu corpus in your previous run.
try:
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(tokenizer_model_path)
    print(f"Tokenizer loaded successfully from {tokenizer_model_path}.")
    pad_idx = tokenizer.pad_id() if hasattr(tokenizer, 'pad_id') else 0
    start_token_id = tokenizer.bos_id() if hasattr(tokenizer, 'bos_id') else 1
    end_token_id = tokenizer.eos_id() if hasattr(tokenizer, 'eos_id') else 2
    unk_token_id = tokenizer.unk_id() if hasattr(tokenizer, 'unk_id') else 3
    print(f"Special Token IDs from tokenizer: Padding: {pad_idx}, Start: {start_token_id}, End: {end_token_id}, Unknown: {unk_token_id}")
    vocab_size = tokenizer.get_piece_size() # Joint vocabulary size
    print(f"Joint Vocabulary Size: {vocab_size}")

except FileNotFoundError:
    print(f"Error: Tokenizer model file {tokenizer_model_path} not found.")
    print("Please train the tokenizer first by running the initial part of your English->Urdu script.")
    sys.exit()
except Exception as e:
    print(f"An error occurred while loading tokenizer: {e}")
    sys.exit()

# Set source and target vocab sizes - they are the same for a joint tokenizer
src_vocab_size = vocab_size
tgt_vocab_size = vocab_size


# --- Step 4: Implement Transformer Model and Data Handling (Adapt for U->E) ---

# Helper function for creating masks (remains the same)
def create_masks(src, tgt, pad_idx):
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    tgt_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.shape[1]
    nopeak_mask = (torch.triu(torch.ones(seq_length, seq_length, device=tgt.device), diagonal=1).type(torch.bool) == False).unsqueeze(0).unsqueeze(0)
    tgt_mask = tgt_mask & nopeak_mask
    return src_mask, tgt_mask

# PositionalEncoding helper (remains the same)
def get_positional_encoding(max_len, d_model, device):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term_cpu = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term_cpu)
    pe[:, 1::2] = torch.cos(position * div_term_cpu)
    pe = pe.unsqueeze(0)
    pe_on_device = pe.to(device)
    # print("\n--- Positional Encoding Tensor Properties (on device) ---")
    # print(f"Shape: {pe_on_device.shape}, dtype: {pe_on_device.dtype}, device: {pe_on_device.device}")
    # print(f"Min value: {pe_on_device.min().item()}, Max value: {pe_on_device.max().item()}")
    # print(f"Estimated Device memory usage: {pe_on_device.numel() * pe_on_device.element_size() / (1024*1024):.2f} MB")
    # print("-------------------------------------------------------")
    return pe_on_device

# MultiHeadAttention (remains the same)
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.linear = nn.Linear(d_model, d_model)
    def scaled_dot_product_attention(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)
        return output
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        q = self.wq(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.wk(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.wv(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        output = self.scaled_dot_product_attention(q, k, v, mask)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.linear(output)
        return output

# FeedForward (remains the same)
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

# EncoderLayer (remains the same)
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x

# Encoder (remains the same, but processes Urdu tokens)
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.pe = get_positional_encoding(max_len, d_model, device)
    def forward(self, src, mask):
        src = self.embedding(src)
        if src.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({src.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        src = src + self.pe[:, :src.size(1), :]
        src = self.dropout(src)
        for layer in self.layers:
            src = layer(src, mask)
        return src

# DecoderLayer (remains the same)
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout3 = nn.Dropout(dropout)
    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        attn_output = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = self.norm1(tgt + self.dropout1(attn_output))
        cross_attn_output = self.cross_attn(tgt, encoder_output, encoder_output, src_mask)
        tgt = self.norm2(tgt + self.dropout2(cross_attn_output))
        ffn_output = self.ffn(tgt)
        tgt = self.norm3(tgt + self.dropout3(ffn_output))
        return tgt

# Decoder (remains the same, but generates English tokens)
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, max_len=5000, device=None):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.max_len = max_len
        self.d_model = d_model
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)
        self.fc_out = nn.Linear(d_model, vocab_size)
        self.pe = get_positional_encoding(max_len, d_model, device)
    def forward(self, tgt, encoder_output, tgt_mask, src_mask):
        tgt = self.embedding(tgt)
        if tgt.size(1) > self.max_len:
             raise RuntimeError(f"Input sequence length ({tgt.size(1)}) exceeds positional encoding max_len ({self.max_len}). Increase max_length.")
        tgt = tgt + self.pe[:, :tgt.size(1), :]
        tgt = self.dropout(tgt)
        for layer in self.layers:
            tgt = layer(tgt, encoder_output, tgt_mask, src_mask)
        output = self.fc_out(tgt)
        return output

# Transformer (remains the same architecture, handles U->E data)
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1, pad_idx=0, max_len=5000, device=None):
        super(Transformer, self).__init__()
        # Encoder processes source language (Urdu)
        self.encoder = Encoder(src_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        # Decoder generates target language (English)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_len=max_len, device=device)
        self.pad_idx = pad_idx
    def forward(self, src, tgt):
        src_mask, tgt_mask = create_masks(src, tgt, self.pad_idx)
        encoder_output = self.encoder(src, src_mask)
        decoder_output = self.decoder(tgt, encoder_output, tgt_mask, src_mask)
        return decoder_output


# Custom Dataset for Urdu -> English
class UrduToEnglishDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, start_token_id, end_token_id, pad_idx):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id
        self.pad_idx = pad_idx

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        english_text = str(self.dataframe.iloc[idx]['english'])
        urdu_text = str(self.dataframe.iloc[idx]['urdu'])

        # Encode sentences using the custom tokenizer
        encoded_english = self.tokenizer.encode_as_ids(english_text)
        encoded_urdu = self.tokenizer.encode_as_ids(urdu_text)

        # For Urdu->English, Urdu is input, English is labels
        encoded_input = encoded_urdu
        encoded_labels = [self.start_token_id] + encoded_english + [self.end_token_id] # Add start/end to target English

        # Pad or truncate sequences to the defined max_length
        encoded_input = encoded_input[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_input))
        encoded_labels = encoded_labels[:self.max_length] + [self.pad_idx] * max(0, self.max_length - len(encoded_labels))


        return {
            'input_ids': torch.tensor(encoded_input, dtype=torch.long), # Urdu tokens
            'labels': torch.tensor(encoded_labels, dtype=torch.long)  # English tokens with start/end
        }


# Inference function for Urdu -> English (Greedy)
def custom_translate_urdu_to_english(text, model, tokenizer, device, max_length, start_token_id, end_token_id, pad_idx):
    model.eval() # Set model to evaluation mode
    with torch.no_grad(): # Disable gradient calculation
        # Encode the source sentence (Urdu)
        encoded_input = tokenizer.encode_as_ids(text) # Input is Urdu text
        encoded_input = encoded_input[:max_length] + [pad_idx] * max(0, max_length - len(encoded_input))

        # Convert to tensor, add batch dimension, and move to device
        src_tensor = torch.tensor(encoded_input, dtype=torch.long).unsqueeze(0).to(device)

        # Initialize the target sequence with the start token (for English)
        target_sequence = [start_token_id] # Start token for English generation
        tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Greedy decoding loop
        for _ in range(max_length):
            output = model(src_tensor, tgt_tensor) # src is Urdu, tgt_tensor is generated English
            # Get logits for the last token in the target sequence
            last_token_logits = output[:, -1, :]
            # Get the predicted token ID with the highest probability
            _, predicted_token_id = torch.max(last_token_logits, dim=-1)

            # If the predicted token is the end token or padding, stop generation
            if predicted_token_id.item() == end_token_id or predicted_token_id.item() == pad_idx:
                break

            # Append the predicted token to the target sequence
            target_sequence.append(predicted_token_id.item())
            # Update the target tensor with the new sequence
            tgt_tensor = torch.tensor(target_sequence, dtype=torch.long).unsqueeze(0).to(device)

        # Decode the generated sequence (English)
        translated_text = tokenizer.decode_ids(target_sequence[1:]) # Decode generated English tokens

        return translated_text

# --- Step 5: Create and Split Dataset (Urdu -> English) ---
print("\nCreating Urdu->English Dataset...")
urdu_to_english_full_dataset = UrduToEnglishDataset(df, tokenizer, max_length, start_token_id, end_token_id, pad_idx)

# Split dataset into training and validation sets
urdu_to_english_train_size = int(0.8 * len(urdu_to_english_full_dataset))
urdu_to_english_val_size = len(urdu_to_english_full_dataset) - urdu_to_english_train_size
urdu_to_english_train_dataset, urdu_to_english_val_dataset = random_split(urdu_to_english_full_dataset, [urdu_to_english_train_size, urdu_to_english_val_size])

# Define batch size and create data loaders
urdu_to_english_train_dataloader = DataLoader(urdu_to_english_train_dataset, batch_size=custom_batch_size, shuffle=True)
urdu_to_english_val_dataloader = DataLoader(urdu_to_english_val_dataset, batch_size=custom_batch_size)

print(f"Urdu->English Train dataset size: {len(urdu_to_english_train_dataset)}")
print(f"Urdu->English Validation dataset size: {len(urdu_to_english_val_dataset)}")


# --- Step 6: Initialize and Train Urdu->English Model ---
print("\nInitializing Urdu->English Transformer model...")
# Create a NEW model instance for Urdu->English
urdu_to_english_model = Transformer(
    src_vocab_size, tgt_vocab_size, d_model, num_layers, num_heads, d_ff, dropout, pad_idx, max_length, device=device
)
print("Urdu->English Transformer model initialized successfully.")

# Move the new model to the device
print("Moving Urdu->English model to device...")
urdu_to_english_model.to(device)
print("Urdu->English model moved to device successfully.")


# Define loss function and optimizer for U->E model
urdu_to_english_criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
urdu_to_english_optimizer = optim.Adam(urdu_to_english_model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

print("\nStarting Urdu->English training loop...")
for epoch in range(num_epochs):
    urdu_to_english_model.train() # Set model to training mode
    total_train_loss = 0

    print(f"\nUrdu->English Epoch {epoch + 1}/{num_epochs}")
    for i, batch in enumerate(tqdm(urdu_to_english_train_dataloader, desc="Urdu->English Training")):

        # --- Diagnostic: Inspecting batch tensors before moving to device (within loop) ---
        # Print for the first batch of each epoch
        if i == 0: # Only print for the very first batch overall (i==0 and epoch==0)
             if epoch == 0:
                src_batch_cpu = batch['input_ids']
                tgt_batch_cpu = batch['labels']

                print(f"\nBatch {i}: Inspecting U->E tensors before moving to {device}...")
                print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
                print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
                print(f"Source vocab size (Urdu input): {src_vocab_size}") # src is Urdu for this model

                print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
                print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
                print(f"Target vocab size (English labels): {tgt_vocab_size}") # tgt is English for this model
                print("-----------------------------------------------------------------")
        # --- End diagnostic prints ---


        src = batch['input_ids'].to(device) # Urdu input
        tgt = batch['labels'].to(device)    # English labels

        tgt_input = tgt[:, :-1] # English input for decoder
        labels = tgt[:, 1:].contiguous().view(-1) # English labels for loss

        urdu_to_english_optimizer.zero_grad()
        output = urdu_to_english_model(src, tgt_input)
        output = output.view(-1, tgt_vocab_size)

        loss = urdu_to_english_criterion(output, labels)
        total_train_loss += loss.item()

        loss.backward()
        urdu_to_english_optimizer.step()

    avg_train_loss = total_train_loss / len(urdu_to_english_train_dataloader)
    print(f"Urdu->English Training Loss: {avg_train_loss:.4f}")

    # Evaluation
    urdu_to_english_model.eval()
    total_val_loss = 0

    print("Urdu->English Evaluating...")
    with torch.no_grad():
        for i, batch in enumerate(tqdm(urdu_to_english_val_dataloader, desc="Urdu->English Validation")):
             # --- Diagnostic: Inspecting batch tensors before moving to device (within loop) ---
            # Print for the first batch of evaluation
            if i == 0 and epoch == 0: # Only print once per training run
                src_batch_cpu = batch['input_ids']
                tgt_batch_cpu = batch['labels']

                print(f"\nBatch {i}: Inspecting U->E Val tensors before moving to {device}...")
                print(f"Source batch shape: {src_batch_cpu.shape}, dtype: {src_batch_cpu.dtype}, device: {src_batch_cpu.device}")
                print(f"Source batch min value: {src_batch_cpu.min().item()}, max value: {src_batch_cpu.max().item()}")
                print(f"Source vocab size (Urdu input): {src_vocab_size}")

                print(f"Target batch shape: {tgt_batch_cpu.shape}, dtype: {tgt_batch_cpu.dtype}, device: {tgt_batch_cpu.device}")
                print(f"Target batch min value: {tgt_batch_cpu.min().item()}, max value: {tgt_batch_cpu.max().item()}")
                print(f"Target vocab size (English labels): {tgt_vocab_size}")
                print("-----------------------------------------------------------------")
            # --- End diagnostic prints ---

            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            tgt_input = tgt[:, :-1]
            labels = tgt[:, 1:].contiguous().view(-1)

            output = urdu_to_english_model(src, tgt_input)
            output = output.view(-1, tgt_vocab_size)

            loss = urdu_to_english_criterion(output, labels)
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(urdu_to_english_val_dataloader)
    print(f"Urdu->English Validation Loss: {avg_val_loss:.4f}")

print("\nUrdu->English Training complete!")


# --- Step 7: Save the Trained Urdu->English Model ---
print(f"\nSaving Urdu->English model state_dict to {urdu_to_english_model_save_path}...")
try:
    torch.save(urdu_to_english_model.state_dict(), urdu_to_english_model_save_path)
    print("Urdu->English model state_dict saved successfully.")
except Exception as e:
    print(f"Error saving Urdu->English model state_dict: {e}")


# --- Step 8: Prepare Test Set for Urdu->English BLEU Evaluation ---
# Use the validation set indices from the U->E split as the test set for BLEU
# In a real scenario, use a completely separate test set indices.
urdu_to_english_test_indices_for_bleu = urdu_to_english_val_dataset.indices if hasattr(urdu_to_english_val_dataset, 'indices') else range(len(urdu_to_english_val_dataset))

if len(urdu_to_english_test_indices_for_bleu) > 0:
    print(f"\nUsing {len(urdu_to_english_test_indices_for_bleu)} samples from the original dataframe indices (from U->E validation split) as test set for BLEU evaluation.")
else:
     print("\nNot enough data to create a test set for Urdu->English BLEU evaluation.")


# --- Step 9: Calculate Urdu->English BLEU Score ---
if len(urdu_to_english_test_indices_for_bleu) > 0:
    print("\nCalculating Corpus BLEU score on the Urdu->English test set...")
    urdu_to_english_model.eval() # Ensure the trained model is in evaluation mode

    hypotheses = [] # Model's translated token lists (English)
    references = [] # Actual target token lists (list of lists) (English)

    print("Generating translations for Urdu->English BLEU evaluation...")
    for original_idx_in_df in tqdm(urdu_to_english_test_indices_for_bleu, desc="Translating U->E test set"):
        original_urdu_text = str(df.iloc[original_idx_in_df]['urdu']) # Get original Urdu text from dataframe
        original_english_text = str(df.iloc[original_idx_in_df]['english']) # Get original English text (reference)

        # Generate translation using the custom_translate_urdu_to_english function
        translated_english = custom_translate_urdu_to_english(
            original_urdu_text, urdu_to_english_model, tokenizer, device,
            max_length=max_length,
            start_token_id=start_token_id, # Start token for English
            end_token_id=end_token_id,     # End token for English
            pad_idx=pad_idx
        )

        # Tokenize both the hypothesis (translated English) and reference (original English)
        hypothesis_tokens = tokenizer.encode_as_pieces(translated_english)
        reference_tokens = [tokenizer.encode_as_pieces(original_english_text)] # Reference is original English

        hypotheses.append(hypothesis_tokens)
        references.append(reference_tokens)

    # Calculate corpus BLEU score
    if hypotheses and references and len(hypotheses) == len(references):
        try:
            bleu_score_urdu_to_english = corpus_bleu(references, hypotheses)
            print(f"\nCorpus BLEU Score (Urdu->English): {bleu_score_urdu_to_english:.4f}")

        except Exception as e:
             print(f"Error calculating Urdu->English BLEU score: {e}")
             print("This might happen if there are empty sequences or other issues with tokenization.")
    else:
        print("\nCould not calculate Urdu->English BLEU score: Hypotheses or references list is empty or their lengths mismatch.")
        print(f"Number of hypotheses: {len(hypotheses)}")
        print(f"Number of references: {len(references)}")

else:
    print("\nNo test dataset samples available for Urdu->English BLEU evaluation.")

print("\nUrdu->English BLEU evaluation complete.")

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



CUDA_LAUNCH_BLOCKING is set to 1
Using device: cuda

--- Environment and Compatibility Checks ---
PyTorch version: 2.6.0+cu124
Is CUDA available: True
Number of CUDA devices: 1
CUDA version: 12.4
Current CUDA device: 0
GPU Name: Tesla T4
GPU Capability: (7, 5)
PyTorch built with CUDA: True

--- Simple CUDA Tensor Test ---
Error creating/moving test tensor to cuda: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Fundamental CUDA test failed. Your CUDA/PyTorch setup may be incompatible or corrupted.
Traceback (most recent call last):
  File "<ipython-input-14-1246981847>", line 59, in <cell line: 0>
    test_tensor = torch.randn(10).to(device)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/l

TypeError: object of type 'NoneType' has no len()

In [15]:
# WARNING: Running complex environment setup in a single cell might not work reliably.
# This is only a last resort if dedicated terminal access is unavailable.

# Try installing PyTorch and related libraries in the current Colab environment
# Replace the command below with the exact one from the PyTorch website for your OS/Pip/CUDA 12.1 or 12.4
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other libraries
!pip install pandas openpyxl sentencepiece tqdm nltk

# --- Minimal CUDA Test (within the same cell) ---
import torch
import sys

print(f"PyTorch version: {torch.__version__}")
print(f"Is CUDA available: {torch.cuda.is_available()}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if device.type == 'cuda':
    print(f"CUDA version reported by PyTorch: {torch.version.cuda}")
    try:
        print(f"Number of CUDA devices: {torch.cuda.device_count()}")
        print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    except Exception as e:
         print(f"Error getting GPU details: {e}")
         print("This might indicate a deeper driver/CUDA toolkit issue.")

    print("\n--- Attempting simple CUDA tensor test ---")
    try:
        test_tensor = torch.randn(10).to(device)
        print("Successfully created and moved a test tensor to CUDA.")
        print(test_tensor)
        print("\nBasic CUDA test PASSED.")
    except Exception as e:
        print(f"Error with CUDA tensor test: {e}")
        print("\nBasic CUDA functionality FAILED.")
        print("This confirms an issue with your PyTorch/CUDA/Driver setup.")
        # sys.exit(1) # Avoid sys.exit in notebooks if you want to continue other cells

else:
    print("\nCUDA is not available (PyTorch reports).")
    print("Basic CUDA test SKIPPED (no CUDA device found by PyTorch).")
    # sys.exit(1) # Avoid sys.exit in notebooks

print("Script finished.") # This line will only print if the CUDA test passes (or is skipped)


Looking in indexes: https://download.pytorch.org/whl/cu121
INFO: pip is looking at multiple versions of torch to determine which version is compatible with other requirements. This could take a while.
Collecting torch
  Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp311-cp311-linux_x86_64.whl (780.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m780.5/780.5 MB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m90.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PyTorch version: 2.6.0+cu124
Is CUDA available: True
Using device: cuda
CUDA version reported by PyTorch: 12.4
Number of CUDA devices: 1
GPU Name: Tesla T4

--- Attempting simple CUDA tensor test ---
Error with CUDA tensor test: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Basic CUDA functionality FAILED.
This confirms an issue with your PyTorch/CUDA/Driver setup.
Script finished.


Welcome to this Colab where you will get a quick introduction to the Python programming language and the environment used for the course's exercises: Colab.

Colab is a Python development environment that runs in the browser using Google Cloud.

For example, to print "Hello World", just hover the mouse over [ ] and press the play button to the upper left. Or press shift-enter to execute.

In [3]:
print("Hello World")

Hello World


## Functions, Conditionals, and Iteration
Let's create a Python function, and call it from a loop.

In [None]:
def HelloWorldXY(x, y):
  if (x < 10):
    print("Hello World, x was < 10")
  elif (x < 20):
    print("Hello World, x was >= 10 but < 20")
  else:
    print("Hello World, x was >= 20")
  return x + y

for i in range(8, 25, 5):  # i=8, 13, 18, 23 (start, stop, step)
  print("--- Now running with i: {}".format(i))
  r = HelloWorldXY(i,i)
  print("Result from HelloWorld: {}".format(r))

In [None]:
print(HelloWorldXY(1,2))

Easy, right?

If you want a loop starting at 0 to 2 (exclusive) you could do any of the following

In [None]:
print("Iterate over the items. `range(2)` is like a list [0,1].")
for i in range(2):
  print(i)

print("Iterate over an actual list.")
for i in [0,1]:
  print(i)

print("While works")
i = 0
while i < 2:
  print(i)
  i += 1

In [None]:
print("Python supports standard key words like continue and break")
while True:
  print("Entered while")
  break

## Numpy and lists
Python has lists built into the language.
However, we will use a library called numpy for this.
Numpy gives you lots of support functions that are useful when doing Machine Learning.

Here, you will also see an import statement. This statement makes the entire numpy package available and we can access those symbols using the abbreviated 'np' syntax.

In [None]:
import numpy as np  # Make numpy available using np.

# Create a numpy array, and append an element
a = np.array(["Hello", "World"])
a = np.append(a, "!")
print("Current array: {}".format(a))
print("Printing each element")
for i in a:
  print(i)

print("\nPrinting each element and their index")
for i,e in enumerate(a):
  print("Index: {}, was: {}".format(i, e))

In [None]:
print("\nShowing some basic math on arrays")
b = np.array([0,1,4,3,2])
print("Max: {}".format(np.max(b)))
print("Average: {}".format(np.average(b)))
print("Max index: {}".format(np.argmax(b)))

In [None]:
print("\nYou can print the type of anything")
print("Type of b: {}, type of b[0]: {}".format(type(b), type(b[0])))

In [None]:
print("\nUse numpy to create a [3,3] dimension array with random number")
c = np.random.rand(3, 3)
print(c)

In [None]:
print("\nYou can print the dimensions of arrays")
print("Shape of a: {}".format(a.shape))
print("Shape of b: {}".format(b.shape))
print("Shape of c: {}".format(c.shape))
print("...Observe, Python uses both [0,1,2] and (0,1,2) to specify lists")

## Colab Specifics

Colab is a virtual machine you can access directly. To run commands at the VM's terminal, prefix the line with an exclamation point (!).


In [None]:
print("\nDoing $ls on filesystem")
!ls -l
!pwd

In [None]:
print("Install numpy")  # Just for test, numpy is actually preinstalled in all Colab instances
!pip install numpy

**Exercise**

Create a code cell underneath this text cell and add code to:


*   List the path of the current directory (pwd)
* Go to / (cd) and list the content (ls -l)

In [None]:
!pwd
!cd /
!ls -l
print("Hello")

All usage of Colab in this course is completely free or charge. Even GPU usage is provided free of charge for some hours of usage every day.

**Using GPUs**
* Many of the exercises in the course executes more quickly by using GPU runtime: Runtime | Change runtime type | Hardware accelerator | GPU

**Some final words on Colab**
*   You execute each cell in order, you can edit & re-execute cells if you want
*   Sometimes, this could have unintended consequences. For example, if you add a dimension to an array and execute the cell multiple times, then the cells after may not work. If you encounter problem reset your environment:
  *   Runtime -> Restart runtime... Resets your Python shell
  *   Runtime -> Restart all runtimes... Will reset the Colab image, and get you back to a 100% clean environment
* You can also clear the output in the Colab by doing: Edit -> Clear all outputs
* Colabs in this course are loaded from GitHub. Save to your Google Drive if you want a copy with your code/output: File -> Save a copy in Drive...

**Learn More**
*   Check out [this](https://www.youtube.com/watch?v=inN8seMm7UI&list=PLQY2H8rRoyvwLbzbnKJ59NkZvQAW9wLbx&index=3) episode of #CodingTensorFlow, and don't forget to subscribe to the YouTube channel ;)
