# Sentence Tokenization for Transformer Neural Network (English → Kannada Translation)

This notebook prepares the **data processing pipeline** for a Transformer model that translates English to Kannada (a South Indian language). The pipeline covers:

1. **Vocabulary Creation** — Building character-level vocabularies for both languages
2. **Dataset Loading & Filtering** — Loading sentence pairs, removing invalid/long sentences
3. **Tokenization** — Converting characters to integer indices with special tokens (START, END, PADDING)
4. **Batching** — Grouping sentence pairs using PyTorch DataLoader for parallel training
5. **Masking** — Creating padding masks and look-ahead masks to control attention
6. **Sentence Embedding** — Combining token embeddings with positional encoding

**Key design choice:** We use **character-level** tokenization (not word-level). This keeps each vocabulary, and therefore the embedding table and output layer, small, at the cost of longer token sequences.

## 1. Import Libraries

- **torch**: PyTorch deep learning framework, used for tensor operations, building the model, and training.
- **numpy**: Numerical computing library, used here for percentile calculations and array operations.

In [None]:
import torch
import numpy as np

## 2. Define File Paths, Special Tokens & Language Vocabularies

### Dataset
The dataset comes from a research paper covering English paired with 11 Indian languages. We use the English-Kannada (`.en` / `.kn`) pair. The full dataset is ~20 GB with ~4 million sentence pairs.

### Special Tokens
Three special tokens are essential for sequence modeling:
- **`<START>`**: Marks the beginning of a sentence (used for Kannada/target during generation)
- **`<END>`**: Marks the end of a sentence
- **`<PADDING>`**: Fills shorter sentences to a fixed length so all sentences in a batch have equal size

### Vocabularies
We build **character-level** vocabularies for each language:
- **Kannada** is an **alpha-syllabary** (abugida): each consonant letter carries an inherent vowel, and a vowel marker replaces it (e.g., consonant 'ಕ' + vowel marker 'ಾ' = 'ಕಾ' "kaa"). The vocabulary includes vowels, consonants, vowel markers (matras), digits, and punctuation.
- **English** uses the **Latin alphabet**: uppercase and lowercase letters, plus digits and punctuation.

Each vocabulary maps every possible character to a unique integer index for embedding.

In [None]:
# --- File paths to the parallel corpus (English-Kannada sentence pairs) ---
# These files are stored on Google Drive; each line contains one sentence.
english_file = 'drive/MyDrive/translation_en_kn/train.en'
kannada_file = 'drive/MyDrive/translation_en_kn/train.kn'

# --- Special Tokens ---
# START_TOKEN: Prepended to target (Kannada) sentences so the decoder has an initial input to begin generation.
# PADDING_TOKEN: Appended to short sentences to make all sequences the same fixed length (required for batching).
# END_TOKEN: Appended to mark where a sentence ends, so the model learns when to stop generating.
START_TOKEN = '<START>'
PADDING_TOKEN = '<PADDING>'
END_TOKEN = '<END>'

# --- Kannada Vocabulary (character-level) ---
# Kannada is an alpha-syllabary script. Characters include:
#   - Independent vowels (ಅ, ಆ, ಇ, ... ಔ)
#   - Consonants (ಕ, ಖ, ಗ, ... ಹ) organized in groups (velar, palatal, retroflex, dental, labial, etc.)
#   - Dependent vowel signs / matras (ಾ, ಿ, ೀ, ು, ೂ, ... ್) that attach to consonants
#   - Kannada digits (೦-೯), punctuation, and the special tokens
# Total vocab size ≈ 120+ characters
kannada_vocabulary = [START_TOKEN, ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', 
                      '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '<', '=', '>', '?', 'ˌ', 
                      'ँ', 'ఆ', 'ఇ', 'ా', 'ి', 'ీ', 'ు', 'ూ', 
                      'ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ', 'ಊ', 'ಋ', 'ೠ', 'ಌ', 'ಎ', 'ಏ', 'ಐ', 'ಒ', 'ಓ', 'ಔ',  # Vowels
                      'ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙ',           # Velar consonants
                      'ಚ', 'ಛ', 'ಜ', 'ಝ', 'ಞ',           # Palatal consonants
                      'ಟ', 'ಠ', 'ಡ', 'ಢ', 'ಣ',           # Retroflex consonants
                      'ತ', 'ಥ', 'ದ', 'ಧ', 'ನ',           # Dental consonants
                      'ಪ', 'ಫ', 'ಬ', 'ಭ', 'ಮ',           # Labial consonants
                      'ಯ', 'ರ', 'ಱ', 'ಲ', 'ಳ', 'ವ', 'ಶ', 'ಷ', 'ಸ', 'ಹ',  # Other consonants
                      '಼', 'ಽ', 'ಾ', 'ಿ', 'ೀ', 'ು', 'ೂ', 'ೃ', 'ೄ', 'ೆ', 'ೇ', 'ೈ', 'ೊ', 'ೋ', 'ೌ', '್', 'ೕ', 'ೖ', 'ೞ', 'ೣ', 'ಂ', 'ಃ',  # Vowel signs (matras) & modifiers
                      '೦', '೧', '೨', '೩', '೪', '೫', '೬', '೭', '೮', '೯',  # Kannada digits 0-9
                      PADDING_TOKEN, END_TOKEN]

# --- English Vocabulary (character-level) ---
# Standard ASCII characters: uppercase A-Z, lowercase a-z, digits 0-9, punctuation, and special tokens.
# Total vocab size ≈ 100 characters
english_vocabulary = [START_TOKEN, ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', 
                        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                        ':', '<', '=', '>', '?', '@', 
                        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 
                        'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 
                        'Y', 'Z',
                        '[', '\\', ']', '^', '_', '`', 
                        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
                        'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 
                        'y', 'z', 
                        '{', '|', '}', '~', PADDING_TOKEN, END_TOKEN]

## 3. Understanding Kannada Script Structure

Kannada is an **alpha-syllabary** (abugida): the basic written unit is a consonant with an inherent vowel, optionally modified by a vowel sign, rather than separate consonant and vowel letters as in English.

- A consonant like **ಕ** ('ka') inherently carries the vowel 'a'.
- Adding a **vowel sign (matra)** changes the vowel sound: **ಕ** + **ಾ** = **ಕಾ** ('kaa').
- The word **ಕನ್ನಡ** ("Kannada") is split into individual Unicode characters by Python's `list()`.

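This structural difference means the Kannada character set is larger than the English one (roughly 120 versus roughly 100 entries in the vocabularies above), and a single visible syllable can span several code points, each of which becomes its own token.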

In [None]:
# Demo: Break the Kannada word "ಕನ್ನಡ" (Kannada) into individual Unicode characters.
# Python's list() splits the string into its constituent Unicode code points.
# This shows how the model will see each character as a separate token.
text = 'ಕನ್ನಡ'
list(text)

['ಕ', 'ನ', '್', 'ನ', 'ಡ']

In [None]:
# Demo: Combining a Kannada consonant with a vowel sign (matra).
# 'ಕ' (ka) + 'ಾ' (aa matra) = 'ಕಾ' (kaa)
# This illustrates how Kannada syllables are formed by combining base consonants with vowel markers.
'ಕ' + 'ಾ'

'ಕಾ'
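
To see why `list()` returns five items for this word, the standard-library `unicodedata` module can name each code point and report whether it is a letter or a combining mark. This is a small sketch; the escape sequences are assumed to spell the same word as the demo above.

```python
import unicodedata

# "Kannada" written in Kannada script, spelled with explicit escapes
# (assumed to match the demo string used in the cells above).
word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"

for ch in word:
    # The category starts with "L" for letters and "M" for combining marks.
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

# Combining marks attach to the preceding letter when rendered,
# but they are still separate code points, and so separate tokens here.
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]
print(len(word), len(marks))
```

The virama (`U+0CCD`) is the one combining mark in this word: it suppresses the inherent vowel of the preceding consonant, which is how the doubled "nn" is written.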

## 4. Create Vocabulary Index Mappings

We create **bidirectional dictionaries** for each language:
- **`index_to_language`**: Maps integer index → character (used when decoding model output back to text)
- **`language_to_index`**: Maps character → integer index (used when converting input text to numerical tokens)

These mappings are the foundation of **character-level tokenization** — every character in a sentence will be replaced by its integer index before being fed to the model's embedding layer.
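
A round trip through both dictionaries is a quick sanity check. The sketch below uses a toy vocabulary and two hypothetical helpers, `encode` and `decode`, which are not part of this notebook.

```python
# Toy vocabulary (hypothetical, much smaller than the real ones)
toy_vocabulary = ['<START>', ' ', 'a', 'b', 'c', '<PADDING>', '<END>']

index_to_char = dict(enumerate(toy_vocabulary))
char_to_index = {ch: i for i, ch in enumerate(toy_vocabulary)}

def encode(text):
    """Map each character to its integer index."""
    return [char_to_index[ch] for ch in text]

def decode(indices):
    """Map integer indices back to characters and join them."""
    return ''.join(index_to_char[i] for i in indices)

encoded = encode('cab a')
print(encoded)          # [4, 2, 3, 1, 2]
print(decode(encoded))  # cab a
```

If `decode(encode(text))` does not return `text`, the vocabulary most likely contains a duplicate entry, because the later index overwrites the earlier one in the character-to-index dictionary.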

In [None]:
# Build index-to-character and character-to-index dictionaries for both languages.
# enumerate() gives each character a unique integer starting from 0.
# Example: index_to_kannada = {0: '<START>', 1: ' ', 2: '!', ...}
# Example: kannada_to_index = {'<START>': 0, ' ': 1, '!': 2, ...}
index_to_kannada = {k:v for k,v in enumerate(kannada_vocabulary)}
kannada_to_index = {v:k for k,v in enumerate(kannada_vocabulary)}
index_to_english = {k:v for k,v in enumerate(english_vocabulary)}
english_to_index = {v:k for k,v in enumerate(english_vocabulary)}

## 5. Load the Dataset

The dataset contains ~4 million English-Kannada sentence pairs. For faster training, we keep only the **first 100,000 pairs**.

Each file has one sentence per line, so `readlines()` loads all sentences into a list. We strip trailing newline characters (`\n`) for clean processing.
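
Because `readlines()` reads every line before the list is sliced, a lazier alternative is to stop reading after the first N lines. The following is a sketch only: `read_first_lines` is a hypothetical helper, and a small temporary file stands in for the real corpus.

```python
import os
import tempfile
from itertools import islice

def read_first_lines(path, limit):
    """Return at most `limit` lines from `path`, without trailing newlines."""
    with open(path, 'r', encoding='utf-8') as file:
        # islice stops iterating after `limit` lines, so the rest is never read
        return [line.rstrip('\n') for line in islice(file, limit)]

# Demonstration on a small temporary file standing in for train.en
with tempfile.NamedTemporaryFile('w', suffix='.en', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first\nsecond\nthird\nfourth\n')
    demo_path = tmp.name

first_two = read_first_lines(demo_path, 2)
os.remove(demo_path)
print(first_two)  # ['first', 'second']
```

Both files would need the same `limit` so that the English and Kannada lists stay aligned.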

In [None]:
# Load English and Kannada sentence files (one sentence per line)
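# encoding='utf-8' is set explicitly so Kannada text decodes the same way on every platform
with open(english_file, 'r', encoding='utf-8') as file:
    english_sentences = file.readlines()
with open(kannada_file, 'r', encoding='utf-8') as file:
    kannada_sentences = file.readlines()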

# Limit to the first 100,000 sentence pairs for faster training
# The full dataset has ~4 million pairs but would take much longer to train on
TOTAL_SENTENCES = 100000
english_sentences = english_sentences[:TOTAL_SENTENCES]
kannada_sentences = kannada_sentences[:TOTAL_SENTENCES]

# Strip trailing newline characters from each sentence for clean processing
english_sentences = [sentence.rstrip('\n') for sentence in english_sentences]
kannada_sentences = [sentence.rstrip('\n') for sentence in kannada_sentences]

### Preview: Sample English and Kannada Sentences

Let's look at the first 10 sentences from each language to verify the data loaded correctly.

In [None]:
# Preview first 10 English sentences (source language)
english_sentences[:10]

['Hes a scientist.',
 "'But we speak the truth aur ye sach hai ke Gujarat mein vikas pagal hogaya hai,'' Rahul Gandhi further said in Banaskantha",
 '8 lakh crore have been looted.',
 'I read a lot into this as well.',
 "She was found dead with the phone's battery exploded close to her head the following morning.",
 'How did mankind come under Satans rival sovereignty?',
 'And then I became Prime Minister.',
 'What about corruption?',
 'No differences',
 '"""The shooting of the film is 90 percent done."']

In [None]:
# Preview first 10 Kannada sentences (target language)
kannada_sentences[:10]

['ಇವರು ಸಂಶೋಧಕ ಸ್ವಭಾವದವರು.',
 '"ಆದರೆ ಸತ್ಯ ಹೊರ ಬಂದೇ ಬರುತ್ತದೆ ಎಂದು ಹೇಳಿದ ರಾಹುಲ್ ಗಾಂಧಿ, ""ಸೂರತ್ ಜನರು ಚೀನಾದ ಜತೆ ಸ್ಪರ್ಧೆ ನಡೆಸುತ್ತಿದ್ದಾರೆ"',
 'ಕಳ್ಳತನವಾಗಿದ್ದ 8 ಲಕ್ಷ ರೂ.',
 'ಇದರ ಬಗ್ಗೆ ನಾನೂ ಸಾಕಷ್ಟು ಓದಿದ್ದೇನೆ.',
 'ಆಕೆಯ ತಲೆಯ ಹತ್ತಿರ ಇರಿಸಿಕೊಂಡಿದ್ದ ಫೋನ್\u200cನ ಬ್ಯಾಟರಿ ಸ್ಫೋಟಗೊಂಡು ಆಕೆ ಮೃತಪಟ್ಟಿದ್ದಾಳೆ ಎನ್ನಲಾಗಿದೆ.',
 'ಮಾನವಕುಲವು ಸೈತಾನನ ಆಳಿಕೆಯ ಕೆಳಗೆ ಬಂದದ್ದು ಹೇಗೆ?',
 'ನಂತರ ಪ್ರಧಾನಿ ಕೂಡ ಆಗುತ್ತೇನೆ.',
 'ಭ್ರಷ್ಟಾಚಾರ ಏಕಿದೆ?',
 '‘ಅನುಪಾತದಲ್ಲಿ ವ್ಯತ್ಯಾಸವಿಲ್ಲ’',
 'ಆ ಚಿತ್ರದ ಶೇ 90ರಷ್ಟು ಚಿತ್ರೀಕರಣವೂ ಈಗಾಗಲೇ ಮುಗಿದು ಹೋಗಿದೆ.']

## 6. Analyze Sentence Length Distribution

Since we use **character-level** tokenization, each character becomes one token. We need to understand the distribution of sentence lengths to choose an appropriate `max_sequence_length`.

- **Maximum lengths** show the longest sentences (outliers) — these are rare and can be discarded.
- **Percentile analysis** (e.g., 97th percentile) shows the length below which 97% of sentences fall, helping us choose a practical cutoff that covers most data without wasting computation on very long sentences.

In [None]:
# Find the maximum character-level sentence length in each language.
# Kannada max ≈ 639 chars, English max ≈ 722 chars — these are outlier sentences.
max(len(x) for x in kannada_sentences), max(len(x) for x in english_sentences),

(639, 722)

In [None]:
# Calculate the 97th percentile of sentence lengths.
# This tells us: about 97% of sentences are at or below this length.
# Helps us pick a max_sequence_length that covers most sentences without being too large.
PERCENTILE = 97
print( f"{PERCENTILE}th percentile length Kannada: {np.percentile([len(x) for x in kannada_sentences], PERCENTILE)}" )
print( f"{PERCENTILE}th percentile length English: {np.percentile([len(x) for x in english_sentences], PERCENTILE)}" )

97th percentile length Kannada: 172.0
97th percentile length English: 179.0
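
A percentile answers "which length covers 97% of sentences?". The reverse question, "what fraction does a given cutoff cover?", is just as useful when choosing `max_sequence_length`. The sketch below uses a hypothetical `coverage` helper and made-up lengths, not values from the dataset.

```python
def coverage(lengths, cutoff):
    """Fraction of sentences whose length is at most `cutoff`."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Hypothetical sentence lengths, for illustration only
toy_lengths = [12, 35, 48, 60, 75, 90, 120, 150, 210, 640]

for cutoff in (100, 200, 700):
    print(cutoff, coverage(toy_lengths, cutoff))
```

On the real data, the same function could be called with `[len(x) for x in kannada_sentences]` to compare candidate cutoffs directly.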


## 7. Filter Sentences by Length and Vocabulary Validity

We set **`max_sequence_length = 200`** characters and discard sentences that:
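1. **Exceed the max length**: this caps the padded sequence length, and with it the size of every attention matrix (which grows quadratically with length), while dropping only a small tail of very long sentences.
2. **Contain characters not in our vocabulary**: this ensures every remaining character can be tokenized (the code below applies this check to the Kannada side only).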

Two helper functions are used:
- **`is_valid_tokens()`**: Checks that every unique character in the sentence exists in the vocabulary.
- **`is_valid_length()`**: Checks that the sentence length is strictly less than `max_sequence_length - 1`, i.e. at most `max_sequence_length - 2` characters, which leaves room for both the START and END tokens.

After filtering, the dataset reduces from 100,000 to ~81,900 valid sentence pairs.

In [None]:
# Maximum number of characters (tokens) per sentence, including special tokens.
# Sentences longer than this will be discarded to reduce model complexity.
max_sequence_length = 200

def is_valid_tokens(sentence, vocab):
    """Check if every unique character in the sentence exists in the vocabulary.
    Returns False immediately if any character is not found (out-of-vocabulary)."""
    for token in list(set(sentence)):
        if token not in vocab:
            return False
    return True

def is_valid_length(sentence, max_sequence_length):
    """Check if the sentence fits within max_sequence_length.
    The strict "<" against (max_sequence_length - 1) allows at most max_sequence_length - 2
    characters, so a START and an END token still fit after tokenization."""
    return len(sentence) < (max_sequence_length - 1)

# Iterate over all sentence pairs and collect indices of valid ones.
# A sentence pair is valid if:
#   1. Both Kannada and English sentences are within the max length
#   2. All characters in the Kannada sentence are in the Kannada vocabulary
# Note: English sentences are NOT checked against english_vocabulary here.
# Any English character missing from that list (for example ';') would raise a KeyError in tokenize() later.
valid_sentence_indicies = []
for index in range(len(kannada_sentences)):
    kannada_sentence, english_sentence = kannada_sentences[index], english_sentences[index]
    if is_valid_length(kannada_sentence, max_sequence_length) \
      and is_valid_length(english_sentence, max_sequence_length) \
      and is_valid_tokens(kannada_sentence, kannada_vocabulary):
        valid_sentence_indicies.append(index)

print(f"Number of sentences: {len(kannada_sentences)}")
print(f"Number of valid sentences: {len(valid_sentence_indicies)}")

Number of sentences: 100000
Number of valid sentences: 81916
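
The filter above validates characters on the Kannada side only. A variant that checks both languages might look like the sketch below. `filter_pairs` and `is_in_vocabulary` are hypothetical names, the data is made up, and applying this to the real corpus would change the count shown above.

```python
def is_in_vocabulary(sentence, vocab_set):
    """True if every character of `sentence` is in `vocab_set`."""
    return set(sentence) <= vocab_set

def filter_pairs(source_sentences, target_sentences,
                 source_vocab, target_vocab, max_len):
    """Return indices of pairs that pass the length and vocabulary checks."""
    # Sets make each membership test O(1) instead of scanning a list
    source_set, target_set = set(source_vocab), set(target_vocab)
    kept = []
    for i, (src, tgt) in enumerate(zip(source_sentences, target_sentences)):
        if (len(src) < max_len - 1 and len(tgt) < max_len - 1
                and is_in_vocabulary(src, source_set)
                and is_in_vocabulary(tgt, target_set)):
            kept.append(i)
    return kept

# Toy data: ';' is missing from the toy source vocabulary,
# and the third source sentence is too long for max_len=6
toy_source = ['ab', 'a;b', 'abbab']
toy_target = ['xy', 'xy', 'xyz']
kept = filter_pairs(toy_source, toy_target, list('ab'), list('xyz'), max_len=6)
print(kept)  # [0]
```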


### Apply the Filter

Keep only the valid sentence pairs by selecting sentences at the valid indices. Both English and Kannada lists are filtered in parallel to maintain alignment.

In [None]:
# Rebuild sentence lists using only the valid indices.
# This ensures English and Kannada sentences remain aligned (same index = same pair).
kannada_sentences = [kannada_sentences[i] for i in valid_sentence_indicies]
english_sentences = [english_sentences[i] for i in valid_sentence_indicies]

In [None]:
# Verify: preview first 3 filtered Kannada sentences
kannada_sentences[:3]

['ಇವರು ಸಂಶೋಧಕ ಸ್ವಭಾವದವರು.',
 '"ಆದರೆ ಸತ್ಯ ಹೊರ ಬಂದೇ ಬರುತ್ತದೆ ಎಂದು ಹೇಳಿದ ರಾಹುಲ್ ಗಾಂಧಿ, ""ಸೂರತ್ ಜನರು ಚೀನಾದ ಜತೆ ಸ್ಪರ್ಧೆ ನಡೆಸುತ್ತಿದ್ದಾರೆ"',
 'ಕಳ್ಳತನವಾಗಿದ್ದ 8 ಲಕ್ಷ ರೂ.']

## 8. Create a Custom PyTorch Dataset

PyTorch's `Dataset` class provides a standard interface for loading data. We create a custom `TextDataset` that:
- Stores parallel English and Kannada sentence lists
- Returns a **(English, Kannada) tuple** for a given index via `__getitem__`
- Reports dataset size via `__len__`

This is needed because default PyTorch datasets don't handle paired multilingual text data out of the box.

In [None]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Custom PyTorch Dataset for paired English-Kannada sentences.
    
    Each sample is a tuple: (english_sentence_string, kannada_sentence_string).
    The DataLoader will batch these tuples together for training.
    """

    def __init__(self, english_sentences, kannada_sentences):
        """Store the parallel sentence lists."""
        self.english_sentences = english_sentences
        self.kannada_sentences = kannada_sentences

    def __len__(self):
        """Return the total number of sentence pairs."""
        return len(self.english_sentences)

    def __getitem__(self, idx):
        """Return the (English, Kannada) sentence pair at the given index."""
        return self.english_sentences[idx], self.kannada_sentences[idx]

In [None]:
# Instantiate the dataset with our filtered sentence pairs
dataset = TextDataset(english_sentences, kannada_sentences)

In [None]:
# Check the total number of valid sentence pairs in the dataset (~81,900)
len(dataset)

81916

In [None]:
# Access a single sample: returns (english_sentence, kannada_sentence) tuple
dataset[1]

("'But we speak the truth aur ye sach hai ke Gujarat mein vikas pagal hogaya hai,'' Rahul Gandhi further said in Banaskantha",
 '"ಆದರೆ ಸತ್ಯ ಹೊರ ಬಂದೇ ಬರುತ್ತದೆ ಎಂದು ಹೇಳಿದ ರಾಹುಲ್ ಗಾಂಧಿ, ""ಸೂರತ್ ಜನರು ಚೀನಾದ ಜತೆ ಸ್ಪರ್ಧೆ ನಡೆಸುತ್ತಿದ್ದಾರೆ"')

## 9. Batching with DataLoader

**Batching** groups multiple sentence pairs to be processed simultaneously. Benefits:
- **Reduces parameter updates**: One batch → one gradient update (instead of one per sentence).
- **Speeds up training**: Parallelizes computation across samples in the batch.
- **Smoother gradients**: Averaging loss over multiple samples produces more stable gradient signals.

We use `batch_size = 3` here for demonstration. In practice, larger batch sizes (e.g., 32, 64) are used.

The `DataLoader` automatically handles splitting the dataset into batches.

In [None]:
# Create a DataLoader that groups sentence pairs into batches of 3.
# Each batch will contain (list_of_3_english_sentences, list_of_3_kannada_sentences).
batch_size = 3 
train_loader = DataLoader(dataset, batch_size)
iterator = iter(train_loader)  # Create an iterator to manually step through batches

In [None]:
# Print the first five batches to see the structure (the loop stops once batch_num reaches 4).
# Each batch is a list of 2 tuples: batch[0] = English sentences, batch[1] = Kannada sentences.
# Each contains `batch_size` (3) sentences.
for batch_num, batch in enumerate(iterator):
    print(batch)
    if batch_num > 3:
        break

[('Hes a scientist.', "'But we speak the truth aur ye sach hai ke Gujarat mein vikas pagal hogaya hai,'' Rahul Gandhi further said in Banaskantha", '8 lakh crore have been looted.'), ('ಇವರು ಸಂಶೋಧಕ ಸ್ವಭಾವದವರು.', '"ಆದರೆ ಸತ್ಯ ಹೊರ ಬಂದೇ ಬರುತ್ತದೆ ಎಂದು ಹೇಳಿದ ರಾಹುಲ್ ಗಾಂಧಿ, ""ಸೂರತ್ ಜನರು ಚೀನಾದ ಜತೆ ಸ್ಪರ್ಧೆ ನಡೆಸುತ್ತಿದ್ದಾರೆ"', 'ಕಳ್ಳತನವಾಗಿದ್ದ 8 ಲಕ್ಷ ರೂ.')]
[('I read a lot into this as well.', 'How did mankind come under Satans rival sovereignty?', 'And then I became Prime Minister.'), ('ಇದರ ಬಗ್ಗೆ ನಾನೂ ಸಾಕಷ್ಟು ಓದಿದ್ದೇನೆ.', 'ಮಾನವಕುಲವು ಸೈತಾನನ ಆಳಿಕೆಯ ಕೆಳಗೆ ಬಂದದ್ದು ಹೇಗೆ?', 'ನಂತರ ಪ್ರಧಾನಿ ಕೂಡ ಆಗುತ್ತೇನೆ.')]
[('What about corruption?', '"""The shooting of the film is 90 percent done."', 'the Special Statute'), ('ಭ್ರಷ್ಟಾಚಾರ ಏಕಿದೆ?', 'ಆ ಚಿತ್ರದ ಶೇ 90ರಷ್ಟು ಚಿತ್ರೀಕರಣವೂ ಈಗಾಗಲೇ ಮುಗಿದು ಹೋಗಿದೆ.', 'ವಿಶೇಷ ಕಾನೂನು')]
[('"Then the king said to Ittai the Gittite, ""Why do you also go with us? Return, and stay with the king. for you are a foreigner, and also an exile. Return to your own place."', 'What happened at the UN Ge
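
The printed structure, a list holding one tuple of English sentences and one tuple of Kannada sentences, is a transposition of the individual `(English, Kannada)` samples. The sketch below imitates that regrouping for string pairs in plain Python. `collate_pairs` is a hypothetical helper, not PyTorch's implementation.

```python
def collate_pairs(samples):
    """Transpose [(en_1, kn_1), (en_2, kn_2), ...]
    into [(en_1, en_2, ...), (kn_1, kn_2, ...)]."""
    return [tuple(column) for column in zip(*samples)]

# Placeholder strings stand in for real sentence pairs
samples = [('en 1', 'kn 1'), ('en 2', 'kn 2'), ('en 3', 'kn 3')]
batch_sketch = collate_pairs(samples)
print(batch_sketch[0])  # ('en 1', 'en 2', 'en 3')
print(batch_sketch[1])  # ('kn 1', 'kn 2', 'kn 3')
```

This is why `batch[0][i]` and `batch[1][i]` always refer to the same sentence pair.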

## 10. Tokenization — Converting Characters to Integer Indices

**Tokenization** translates each character in a sentence into its corresponding integer index from the vocabulary dictionary. The process:

1. **Map each character** → its integer index using `language_to_index` dictionary.
2. **Optionally prepend START token** — Used for Kannada (target) to give the decoder an initial input.
3. **Optionally append END token** — Marks where the sentence ends so the model learns to stop generating.
4. **Pad with PADDING tokens** — Fill remaining positions up to `max_sequence_length` so all sequences have equal length.
5. **Convert to PyTorch tensor** — Ready for input to the model's embedding layer.

### Token usage by language:
| Token | English (Encoder input) | Kannada (Decoder input) |
|-------|------------------------|------------------------|
| START | **No** — entire sentence processed at once | **Yes** — needed as initial input for generation |
| END   | **No** | **Yes** — model learns when to stop |
| PADDING | **Yes** — to fill to fixed length | **Yes** — to fill to fixed length |
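
The five steps and the table above can be traced on a toy example. The sketch below returns plain Python lists instead of tensors, and uses a hypothetical vocabulary and a hypothetical function name, `tokenize_to_list`, so that the effect of each flag is easy to read.

```python
START, END, PAD = '<START>', '<END>', '<PADDING>'
toy_vocab = [START, ' ', 'a', 'b', 'c', PAD, END]
toy_to_index = {ch: i for i, ch in enumerate(toy_vocab)}

def tokenize_to_list(sentence, to_index, max_len, start_token, end_token):
    """Character indices, optional START/END, then padding up to max_len."""
    indices = [to_index[ch] for ch in sentence]
    if start_token:
        indices.insert(0, to_index[START])
    if end_token:
        indices.append(to_index[END])
    # Fill the remaining positions with the padding index
    indices += [to_index[PAD]] * (max_len - len(indices))
    return indices

encoder_input = tokenize_to_list('cab', toy_to_index, 8,
                                 start_token=False, end_token=False)
decoder_input = tokenize_to_list('cab', toy_to_index, 8,
                                 start_token=True, end_token=True)
print(encoder_input)  # [4, 2, 3, 5, 5, 5, 5, 5]
print(decoder_input)  # [0, 4, 2, 3, 6, 5, 5, 5]
```

Both results have the same length, which is what allows sentences of different lengths to be stacked into one batch tensor.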

In [None]:
def tokenize(sentence, language_to_index, start_token=True, end_token=True):
    """Convert a sentence string into a fixed-length tensor of integer indices.
    
    Args:
        sentence: Raw sentence string (e.g., "Hello" or "ನಮಸ್ಕಾರ")
        language_to_index: Dict mapping each character → integer index
        start_token: If True, prepend the START token index (used for Kannada/target)
        end_token: If True, append the END token index
    
    Returns:
        torch.Tensor of shape (max_sequence_length,) with integer token indices
    """
    # Step 1: Convert each character to its integer index
    sentence_char_indices = [language_to_index[token] for token in list(sentence)]

    # Step 2: Optionally add START token at the beginning (for decoder/target sentences)
    if start_token:
        sentence_char_indices.insert(0, language_to_index[START_TOKEN])

    # Step 3: Optionally add END token at the end (signals sentence completion)
    if end_token:
        sentence_char_indices.append(language_to_index[END_TOKEN])

    # Step 4: Pad with PADDING tokens to reach max_sequence_length
    # (max_sequence_length is read from the notebook's global scope).
    # This ensures all sequences in a batch have the same length
    for _ in range(len(sentence_char_indices), max_sequence_length):
        sentence_char_indices.append(language_to_index[PADDING_TOKEN])

    return torch.tensor(sentence_char_indices)

In [None]:
# Inspect the current batch (the last one printed by the loop above).
# It is a list of two tuples: [english_sentences, kannada_sentences]
batch

[('It has been under discussion for a long time.',
  'Buses cannot get there.',
  'Why then this tradition was not thought of?'),
 ('ಎಂಬುದು ಬಹಳ ದೀರ್ಘ ಕಾಲದಿಂದಲೂ ಚರ್ಚಿತವಾಗುತ್ತಿರುವ ವಿಷಯ.',
  'ಇಲ್ಲಿಗೆ ಬರಲು ಬಸ್ ಸೌಕರ್ಯವೂ ಇಲ್ಲ.',
  'ಆ ಪರಂಪರೆ ಯಾಕೆ ಮುನ್ನೆಲೆಗೆ ಬರಲಿಲ್ಲ?')]

In [None]:
# Access the English sentences of the batch (batch[0] = English, batch[1] = Kannada)
batch[0]

('It has been under discussion for a long time.',
 'Buses cannot get there.',
 'Why then this tradition was not thought of?')

### Tokenize a Batch of Sentences

For each sentence pair in the batch:
- **English** (encoder input): Tokenized **without** START or END tokens — the encoder processes the entire source sentence at once.
- **Kannada** (decoder input): Tokenized **with** both START and END tokens — the decoder needs START as initial input and END to learn when to stop generating.

The tokenized sequences are stacked into tensors of shape `(batch_size, max_sequence_length)`.

In [None]:
# Tokenize all sentences in the current batch
eng_tokenized, kn_tokenized = [], []
for sentence_num in range(batch_size):
    eng_sentence, kn_sentence = batch[0][sentence_num], batch[1][sentence_num]
    
    # English (encoder): NO start/end tokens — encoder sees the full source sentence at once
    eng_tokenized.append( tokenize(eng_sentence, english_to_index, start_token=False, end_token=False) )
    
    # Kannada (decoder): WITH start/end tokens — decoder generates one token at a time
    kn_tokenized.append( tokenize(kn_sentence, kannada_to_index, start_token=True, end_token=True) )

# Stack individual sentence tensors into batch tensors: shape (batch_size, max_sequence_length)
eng_tokenized = torch.stack(eng_tokenized)
kn_tokenized = torch.stack(kn_tokenized)

In [None]:
# Inspect the tokenized English batch — shape: (3, 200)
# Each row is a sentence: [char_indices..., PADDING_TOKEN_indices...]
# No START or END tokens present for English
eng_tokenized

tensor([[41, 84,  1, 72, 65, 83,  1, 66, 69, 69, 78,  1, 85, 78, 68, 69, 82,  1,
         68, 73, 83, 67, 85, 83, 83, 73, 79, 78,  1, 70, 79, 82,  1, 65,  1, 76,
         79, 78, 71,  1, 84, 73, 77, 69, 15, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95,
         95, 95],
        [34, 85, 83, 69, 83,  1, 67, 65, 78, 78, 79, 84,  1, 71, 69, 84,  1, 84,
         7

## 11. Attention Masking — Controlling What the Model Can "See"

Masking is critical in Transformers to ensure the model only attends to valid, appropriate tokens. Three types of masks are created:

### 1. Encoder Self-Attention Mask (Padding Mask)
- Prevents the encoder from attending to **padding tokens**.
- Shape: `(batch_size, max_seq_len, max_seq_len)` — one 2D mask per sentence.
- Padded positions are set to a large negative number; valid positions are 0.

### 2. Decoder Self-Attention Mask (Look-Ahead + Padding Mask)
- **Look-ahead mask**: An upper-triangular matrix that prevents the decoder from "seeing" **future tokens** during training. This is essential for autoregressive generation — the model should only use current and previous tokens to predict the next one.
- Combined with padding mask to also block attention on padded positions.

### 3. Decoder Cross-Attention Mask (Padding Mask)
- Applied when the decoder attends to the encoder's output.
- Prevents the decoder from attending to **padded positions** in the source (English) sentence.
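
The way a look-ahead mask and a padding mask combine can be seen on a 5-position sequence with 3 real tokens. The following is a plain-Python sketch with hypothetical helper names. `num_valid` is assumed to count every non-padding position, special tokens included.

```python
NEG = -1e9  # stand-in for -inf, the same value this notebook uses

def look_ahead_mask(seq_len):
    """True where column j is a future position relative to row i."""
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]

def padding_mask(seq_len, num_valid):
    """True where the row or the column is a padding position."""
    return [[i >= num_valid or j >= num_valid for j in range(seq_len)]
            for i in range(seq_len)]

def to_additive(*bool_masks):
    """Combine boolean masks with OR, then map True -> NEG and False -> 0.0."""
    size = len(bool_masks[0])
    return [[NEG if any(m[i][j] for m in bool_masks) else 0.0
             for j in range(size)] for i in range(size)]

decoder_mask = to_additive(look_ahead_mask(5), padding_mask(5, num_valid=3))
for row in decoder_mask:
    # 'x' marks a blocked position, '.' an allowed one
    print(' '.join('x' if value == NEG else '.' for value in row))
```

The allowed positions form a lower triangle inside the first three rows and columns. The last two rows are blocked entirely, which matters for the choice of mask value discussed next.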

### Why use large negative numbers instead of literal `-inf`?
- The mask is added to the attention scores before softmax: adding `0` leaves a score unchanged, while adding `-∞` drives its softmax weight to 0 (the position is blocked).
- Using literal `-inf` can cause **NaN** values if an entire row is masked (softmax of all `-inf`).
- Instead, we use `-1e9` (negative one billion) — large enough to effectively zero out attention, but numerically stable.
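
The NaN problem can be reproduced without PyTorch. The sketch below uses a hand-written `softmax` over a plain list (an illustrative helper, not the function the model uses) and compares a fully masked row under both mask values.

```python
import math

def softmax(scores):
    """Numerically stabilised softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float('-inf')

# Fully masked row with literal -inf: (-inf) - (-inf) is NaN, and it spreads
all_inf = softmax([neg_inf, neg_inf, neg_inf])

# Fully masked row with -1e9: every score is equal, so weights are uniform
all_big = softmax([-1e9, -1e9, -1e9, -1e9])

# Partially masked row: the blocked position receives weight 0.0
partial = softmax([2.0, 1.0, -1e9])

print(all_inf)
print(all_big)
print(partial)
```

A fully masked row occurs for every padding position in the masks built below, since padded rows are blocked across all columns, so this case is the norm rather than an edge case.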

In [None]:
# Large negative value used as mask value instead of -inf to avoid NaN in softmax.
# softmax(-1e9) ≈ 0 (effectively blocks attention), but won't produce NaN like -inf would.
NEG_INFTY = -1e9

def create_masks(eng_batch, kn_batch):
    """Create attention masks for the encoder and decoder.
    
    Args:
        eng_batch: List of raw English sentence strings (source)
        kn_batch: List of raw Kannada sentence strings (target)
    
    Returns:
        encoder_self_attention_mask: Blocks padding in encoder self-attention
        decoder_self_attention_mask: Blocks future tokens + padding in decoder self-attention
        decoder_cross_attention_mask: Blocks padding when decoder attends to encoder output
    """
    num_sentences = len(eng_batch)
    
    # --- Look-Ahead Mask (for decoder self-attention) ---
    # Upper triangular matrix of True values: prevents attending to future positions.
    # Example for seq_len=4:  [[F, T, T, T],   (position 0 can only see position 0)
    #                          [F, F, T, T],   (position 1 can see 0 and 1)
    #                          [F, F, F, T],   (position 2 can see 0, 1, and 2)
    #                          [F, F, F, F]]   (position 3 can see all previous)
    look_ahead_mask = torch.full([max_sequence_length, max_sequence_length] , True)
    look_ahead_mask = torch.triu(look_ahead_mask, diagonal=1)  # Upper triangle (above main diagonal)
    
    # --- Initialize padding masks as all False (no masking) ---
    # Shape: (batch_size, max_seq_len, max_seq_len) — one 2D attention matrix per sentence
    encoder_padding_mask = torch.full([num_sentences, max_sequence_length, max_sequence_length] , False)
    decoder_padding_mask_self_attention = torch.full([num_sentences, max_sequence_length, max_sequence_length] , False)
    decoder_padding_mask_cross_attention = torch.full([num_sentences, max_sequence_length, max_sequence_length] , False)

    for idx in range(num_sentences):
      eng_sentence_length, kn_sentence_length = len(eng_batch[idx]), len(kn_batch[idx])
      
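      # Indices of positions that should be masked (padding positions).
      # The +1 leaves one slot after the raw characters unmasked, i.e. it assumes exactly one
      # special token per sequence. With the tokenization shown in section 10 (no special tokens
      # for English, START and END for Kannada) the boundary is off by one in each language.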
      eng_chars_to_padding_mask = np.arange(eng_sentence_length + 1, max_sequence_length)
      kn_chars_to_padding_mask = np.arange(kn_sentence_length + 1, max_sequence_length)
      
      # Encoder padding mask: block attention TO and FROM padding positions
      # Setting both row and column ensures padded tokens neither attend nor are attended to
      encoder_padding_mask[idx, :, eng_chars_to_padding_mask] = True   # No token attends TO padding
      encoder_padding_mask[idx, eng_chars_to_padding_mask, :] = True   # Padding doesn't attend TO any token
      
      # Decoder self-attention padding mask: same logic for Kannada padding positions
      decoder_padding_mask_self_attention[idx, :, kn_chars_to_padding_mask] = True
      decoder_padding_mask_self_attention[idx, kn_chars_to_padding_mask, :] = True
      
      # Decoder cross-attention padding mask: decoder should not attend to English padding
      decoder_padding_mask_cross_attention[idx, :, eng_chars_to_padding_mask] = True
      decoder_padding_mask_cross_attention[idx, kn_chars_to_padding_mask, :] = True

    # Convert boolean masks to additive float masks: True → NEG_INFTY (blocked), False → 0 (allowed).
    # The mask is added to the attention scores before softmax: adding 0 leaves a score unchanged,
    # adding NEG_INFTY pushes its softmax weight to ~0.
    encoder_self_attention_mask = torch.where(encoder_padding_mask, NEG_INFTY, 0)

    # Decoder self-attention: combine look-ahead mask AND padding mask.
    # A position is masked if EITHER the look-ahead OR padding mask is True,
    # so the two boolean tensors are combined with logical OR ('|').
    decoder_self_attention_mask = torch.where(look_ahead_mask | decoder_padding_mask_self_attention, NEG_INFTY, 0)

    decoder_cross_attention_mask = torch.where(decoder_padding_mask_cross_attention, NEG_INFTY, 0)
    
    # Debug: print mask shapes and a 10x10 corner to visualize the pattern
    print(f"encoder_self_attention_mask {encoder_self_attention_mask.size()}: {encoder_self_attention_mask[0, :10, :10]}")
    print(f"decoder_self_attention_mask {decoder_self_attention_mask.size()}: {decoder_self_attention_mask[0, :10, :10]}")
    print(f"decoder_cross_attention_mask {decoder_cross_attention_mask.size()}: {decoder_cross_attention_mask[0, :10, :10]}")
    return encoder_self_attention_mask, decoder_self_attention_mask, decoder_cross_attention_mask

In [None]:
# Generate and inspect the three attention masks for the current batch.
# The 10x10 slices show:
#   - 0.0 = "allowed to attend" (will pass through softmax normally)
#   - -1e9 = "blocked" (will be zeroed out by softmax)
create_masks(batch[0], batch[1])

encoder_self_attention_mask torch.Size([3, 200, 200]): tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
decoder_self_attention_mask torch.Size([3, 200, 200]): tensor([[ 0.0000e+00, -1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09,
         -1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09],
        [ 0.0000e+00,  0.0000e+00, -1.0000e+09, -1.0000e+09, -1.0000e+09,
         -1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00, -1.0000e+09, -1.0000e+09,
         -1.0000e

(tensor([[[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          ...,
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09]],
 
         [[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          ...,
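
To see why `0.0` and `-1e9` behave as "allow" and "block", the sketch below adds one mask row to some made-up attention scores and runs a plain-Python softmax, so the arithmetic stays visible without PyTorch. The scores and the `softmax` helper are illustrative assumptions and are not part of `create_masks`.

```python
import math

NEG_INFTY = -1e9  # same "blocked" value that appears in the mask printout

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for one query position over four keys.
raw_scores = [2.0, 1.0, 0.5, 3.0]

# Mask row: the first two keys are allowed, the last two are blocked.
mask_row = [0.0, 0.0, NEG_INFTY, NEG_INFTY]
weights = softmax([s + m for s, m in zip(raw_scores, mask_row)])
print(weights)  # roughly [0.731, 0.269, 0.0, 0.0]

# A row that is blocked everywhere: every score is shifted by the same amount.
fully_masked = softmax([s + NEG_INFTY for s in raw_scores])
print(sum(fully_masked))  # still sums to 1 (up to rounding)
```

The blocked keys receive a weight of exactly 0.0 because `exp` of a hugely negative number underflows to zero. A row that is blocked everywhere, like the all `-1e9` rows at padded positions, is not zeroed by the mask alone: softmax is unchanged by a constant shift, so its weights still sum to 1.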
    

## 12. Sentence Embedding Class — Putting It All Together

The `SentenceEmbedding` class integrates all preprocessing steps into a single PyTorch module that will be used inside the Transformer:

1. **Batch Tokenization** — Converts a batch of raw sentence strings into integer index tensors (with optional START/END tokens and padding).
2. **Embedding Lookup** — Maps each integer index to a dense vector of size `d_model` using `nn.Embedding`.
3. **Positional Encoding** — Adds sinusoidal position information so the model knows the order of tokens (since self-attention is position-agnostic).
4. **Dropout** — Randomly zeroes some values during training to prevent overfitting.

This class will be instantiated separately for the **encoder** (English) and **decoder** (Kannada), each with its own vocabulary and embedding weights.
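
As a rough illustration of step 3, the sketch below builds a small sinusoidal position table in plain Python and adds one row to a token vector. It assumes the standard formulation from the original Transformer (sine on even dimensions, cosine on odd ones, base 10000); the notebook's own `PositionalEncoding` class may differ in detail. The function name `sinusoidal_positions` and the toy sizes are hypothetical.

```python
import math

def sinusoidal_positions(max_sequence_length, d_model):
    """Return a max_sequence_length x d_model table of position encodings."""
    table = []
    for pos in range(max_sequence_length):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension i
            row.append(math.cos(angle))  # odd dimension i + 1
        table.append(row[:d_model])      # trim in case d_model is odd
    return table

# Toy sizes for readability; the notebook uses 200 positions and a larger d_model.
pe = sinusoidal_positions(max_sequence_length=4, d_model=6)

# Adding position information to one (made-up) token embedding at position 2.
token_embedding = [0.1] * 6
combined = [e + p for e, p in zip(token_embedding, pe[2])]
print(pe[0])     # position 0: sines are 0.0, cosines are 1.0
print(combined)
```

Because the table depends only on the position and `d_model`, it contains no learnable parameters and can be shared by every sentence in a batch.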

In [None]:
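from torch import nn  # no-op if `nn` was already imported earlier in the notebook

# PositionalEncoding and get_device() are not defined in this cell; they must already be in scope.
class SentenceEmbedding(nn.Module):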
    """Converts raw sentence strings into dense vector representations for the Transformer.
    
    Pipeline: Raw text → Tokenize → Embedding lookup → Add positional encoding → Dropout
    
    This module is used for BOTH encoder (English) and decoder (Kannada) inputs,
    each instantiated with its own vocabulary and index mappings.
    """
    def __init__(self, max_sequence_length, d_model, language_to_index, START_TOKEN, END_TOKEN, PADDING_TOKEN):
        """
        Args:
            max_sequence_length: Fixed length all sequences are padded to (200). No truncation
                happens in this class, so longer sentences must be filtered out beforehand.
            d_model: Dimensionality of embeddings (e.g., 512 in original Transformer)
            language_to_index: Dict mapping characters → integer indices for this language
            START_TOKEN, END_TOKEN, PADDING_TOKEN: Special token strings
        """
        super().__init__()
        self.vocab_size = len(language_to_index)          # Total number of unique tokens in this language
        self.max_sequence_length = max_sequence_length
        self.embedding = nn.Embedding(self.vocab_size, d_model)  # Learnable lookup table: index → d_model vector
        self.language_to_index = language_to_index
        self.position_encoder = PositionalEncoding(d_model, max_sequence_length)  # Sinusoidal position info
        self.dropout = nn.Dropout(p=0.1)  # 10% dropout for regularization
        self.START_TOKEN = START_TOKEN
        self.END_TOKEN = END_TOKEN
        self.PADDING_TOKEN = PADDING_TOKEN
    
    def batch_tokenize(self, batch, start_token=True, end_token=True):
        """Convert a batch of raw sentence strings into a padded integer tensor.
        
        Args:
            batch: List of sentence strings
            start_token: Whether to prepend START token (True for Kannada decoder input)
            end_token: Whether to append END token
        
        Returns:
            Tensor of shape (batch_size, max_sequence_length) on the appropriate device
        """

        def tokenize(sentence, start_token=True, end_token=True):
            """Tokenize a single sentence into a fixed-length tensor of integer indices."""
            # Map each character to its vocabulary index (raises KeyError for unknown characters)
            sentence_indices = [self.language_to_index[token] for token in sentence]

            # Optionally add START token at the beginning
            if start_token:
                sentence_indices.insert(0, self.language_to_index[self.START_TOKEN])

            # Optionally add END token at the end
            if end_token:
                sentence_indices.append(self.language_to_index[self.END_TOKEN])

            # Pad remaining positions with PADDING token to reach the fixed length.
            # Nothing is truncated here: an over-long sentence keeps its full length.
            padding_index = self.language_to_index[self.PADDING_TOKEN]
            sentence_indices.extend([padding_index] * (self.max_sequence_length - len(sentence_indices)))

            return torch.tensor(sentence_indices)

        # Tokenize each sentence in the batch and stack into a 2D tensor
        tokenized = torch.stack(
            [tokenize(sentence, start_token, end_token) for sentence in batch]
        )  # Shape: (batch_size, max_sequence_length)
        return tokenized.to(get_device())   # Move to GPU if available
    
    def forward(self, x, start_token=True, end_token=True):
        """Forward pass: raw sentences → embedded + positionally-encoded vectors.

        Args:
            x: List of raw sentence strings (one batch)
            start_token: Whether to add START token during tokenization
            end_token: Whether to add END token during tokenization

        Returns:
            Tensor of shape (batch_size, max_sequence_length, d_model)
        """
        # Pass both flags by keyword: a lone positional flag would bind to `start_token`
        # in batch_tokenize and leave `end_token` at its default.
        x = self.batch_tokenize(x, start_token=start_token, end_token=end_token)  # (batch, seq_len): integer indices
        x = self.embedding(x)                          # (batch, seq_len, d_model): dense vectors
        pos = self.position_encoder().to(get_device()) # (seq_len, d_model): position information
        x = self.dropout(x + pos)                      # Add position info + apply dropout
        return x
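
The effect of the `start_token` and `end_token` flags is easiest to see on a tiny example. The sketch below mirrors the tokenization logic in plain Python with a hypothetical seven-entry vocabulary (`toy_vocabulary`) and a length of 8 instead of 200; it returns a list rather than a tensor and is not the class itself.

```python
START_TOKEN, END_TOKEN, PADDING_TOKEN = "<START>", "<END>", "<PADDING>"

# Hypothetical toy vocabulary; the real ones cover the full English and Kannada character sets.
toy_vocabulary = [START_TOKEN, "a", "b", "c", " ", END_TOKEN, PADDING_TOKEN]
toy_to_index = {token: index for index, token in enumerate(toy_vocabulary)}

def tokenize(sentence, max_sequence_length, start_token=True, end_token=True):
    """Character-level tokenization with optional START/END and right-padding."""
    indices = [toy_to_index[ch] for ch in sentence]
    if start_token:
        indices.insert(0, toy_to_index[START_TOKEN])
    if end_token:
        indices.append(toy_to_index[END_TOKEN])
    padding = [toy_to_index[PADDING_TOKEN]] * (max_sequence_length - len(indices))
    return indices + padding

without_special = tokenize("ab c", 8, start_token=False, end_token=False)
with_special = tokenize("ab c", 8, start_token=True, end_token=True)
print(without_special)  # [1, 2, 4, 3, 6, 6, 6, 6]
print(with_special)     # [0, 1, 2, 4, 3, 5, 6, 6]
```

Both results have the same fixed length, so sentences tokenized this way can be stacked into a single `(batch_size, max_sequence_length)` tensor regardless of which flags are set.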