# Full Pipeline for English → Hindi Translation Using a Transformer Built from Scratch

🔧 Overview:

We will build a Transformer model using:

*   Custom Embeddings
*   Positional Encoding
*   Encoder & Decoder with Multi-Head Attention
*   Masking
*   Teacher Forcing
*   Training on a custom parallel English-Hindi dataset

# ✅ Step-by-Step Practical Structure:
# 🔹 Step 1: Install Required Libraries

In [26]:
!pip install tensorflow
!pip install tensorflow-text



# 🔹 Step 2: Load English-Hindi Translation Dataset
We will use a small sample for quick testing:

In [27]:
# Sample data for demo purposes
english_sentences = [
    "hello", "how are you", "i am fine", "what is your name", "good morning"
]
hindi_sentences = [
    "नमस्ते", "आप कैसे हैं", "मैं ठीक हूँ", "तुम्हारा नाम क्या है", "सुप्रभात"
]


# 🔹 Step 3: Tokenization


In [28]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# -------------------------------
# Step 1: Tokenize English Sentences
# -------------------------------

# Create a tokenizer for English with:
# - No filtering of punctuation (`filters=''`)
# - An out-of-vocabulary token to handle unseen words
eng_tokenizer = Tokenizer(filters='', oov_token='<OOV>')

# Fit the tokenizer on the list of English sentences
eng_tokenizer.fit_on_texts(english_sentences)

# Convert each English sentence into a sequence of integers (word indices)
eng_tensor = eng_tokenizer.texts_to_sequences(english_sentences)

# Pad sequences so that all input sentences have equal length
# Padding is added to the end ('post') to preserve input ordering
eng_tensor = pad_sequences(eng_tensor, padding='post')

# -------------------------------
# Step 2: Tokenize Hindi Sentences
# -------------------------------

# Create a tokenizer for Hindi with the same configuration
hin_tokenizer = Tokenizer(filters='', oov_token='<OOV>')
hin_tokenizer.fit_on_texts(hindi_sentences)
hin_tensor = hin_tokenizer.texts_to_sequences(hindi_sentences)
hin_tensor = pad_sequences(hin_tensor, padding='post')

# -------------------------------
# Step 3: Vocabulary Sizes
# -------------------------------

# Vocabulary size is the total number of unique tokens + 1 for padding (0 index)
eng_vocab_size = len(eng_tokenizer.word_index) + 1
hin_vocab_size = len(hin_tokenizer.word_index) + 1

# Now `eng_tensor` and `hin_tensor` contain the integer-encoded and padded input/output data
# These can now be used to train the sequence-to-sequence transformer model
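
To make the index layout concrete, here is a minimal pure-Python sketch of what the tokenization and padding steps produce. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index`, `encode`, and `pad_post` are hypothetical, and the ordering rule (most frequent word first, ties in order of first appearance) is an assumption about how indices are assigned.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Map each word to an integer ID; 0 is reserved for padding, 1 for OOV."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent words first; ties keep their order of first appearance
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: idx for idx, word in enumerate([oov_token] + ordered, start=1)}

def encode(sentence, word_index, oov_token="<OOV>"):
    """Replace each word by its ID, falling back to the OOV ID for unseen words."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in sentence.lower().split()]

def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

sentences = ["hello", "how are you", "i am fine", "what is your name", "good morning"]
word_index = build_word_index(sentences)
padded = pad_post([encode(s, word_index) for s in sentences])
vocab_size = len(word_index) + 1  # +1 for the padding index 0

print(padded[0])                          # "hello" padded to the longest length
print(encode("hello world", word_index))  # unseen "world" maps to the OOV ID
print(vocab_size)
```

The point of the sketch is the reserved indices: 0 never appears in `word_index` because it is kept for padding, and the OOV token occupies index 1.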


# 🔹 Step 4: Positional Encoding

In [29]:
def positional_encoding(position, d_model):
    """
    Compute the positional encoding matrix.

    Args:
        position: Maximum length of the sequence (e.g., 100).
        d_model: Dimensionality of the embedding/hidden layer (e.g., 512).

    Returns:
        A tensor of shape (1, position, d_model) containing the positional encodings.
    """

    # Step 1: Compute the angle matrix of shape (position, d_model)
    # Rows are position indices (0 to position-1); columns are embedding dimensions
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000,
        (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)
    )
    # np.arange(d_model)[np.newaxis, :] creates a shape (1, d_model) row of dimension indices
    # (// 2) maps dimensions 2i and 2i+1 to the same i, so each sin/cos pair shares one frequency

    # Step 2: Apply sin to even indices in the array; 0, 2, 4, ...
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # Step 3: Apply cos to odd indices in the array; 1, 3, 5, ...
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Step 4: Add a batch dimension (1, position, d_model) and convert to Tensor
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)
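
The function above follows the sinusoidal scheme in which even dimensions use `sin(pos / 10000^(2i/d_model))` and odd dimensions use the matching `cos`. As a quick sanity check, the sketch below recomputes the encoding with NumPy only for a small case. The name `positional_encoding_np` and the sizes `position=4`, `d_model=6` are illustrative choices, not part of the notebook.

```python
import numpy as np

def positional_encoding_np(position, d_model):
    """NumPy-only sinusoidal encoding of shape (position, d_model)."""
    pos = np.arange(position)[:, np.newaxis]   # (position, 1)
    dims = np.arange(d_model)[np.newaxis, :]   # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((position, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

pe = positional_encoding_np(position=4, d_model=6)
print(pe.shape)
print(pe[0])  # position 0: sin(0) = 0 in even dims, cos(0) = 1 in odd dims

# Each (sin, cos) pair shares one frequency, so sin^2 + cos^2 = 1 for every pair
print(np.allclose(pe[:, 0::2] ** 2 + pe[:, 1::2] ** 2, 1.0))
```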


# 🔹 Step 5: Scaled Dot-Product Attention

In [30]:
def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Calculate the scaled dot-product attention.

    Args:
        q: query tensor of shape (..., seq_len_q, depth)
        k: key tensor of shape (..., seq_len_k, depth)
        v: value tensor of shape (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output: Attention output after applying weights to V
        attention_weights: Softmax weights used for attention
    """

    # Debug output: print the input shapes (remove once the shapes are verified)
    print("Shape of q in scaled_dot_product_attention:", tf.shape(q))
    print("Shape of k in scaled_dot_product_attention:", tf.shape(k))
    print("Shape of v in scaled_dot_product_attention:", tf.shape(v))
    print("Shape of mask in scaled_dot_product_attention:", tf.shape(mask) if mask is not None else "None")

    # Step 1: Calculate the dot product between Q and K^T
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # Step 2: Scale the dot products by the square root of the dimension of the key vectors
    dk = tf.cast(tf.shape(k)[-1], tf.float32)       # depth of key vectors
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)  # Stabilize gradients

    # Step 3: Apply the mask (if any) to hide padding positions or future tokens (in decoder)
    if mask is not None:
        # Mask entries equal to 1 mark positions to hide; their logits are replaced
        # with -1e9 so that softmax assigns them a weight close to zero
        scaled_attention_logits = tf.where(mask == 0, scaled_attention_logits, -1e9)


    # Step 4: Apply softmax to get attention weights (probabilities)
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    # Step 5: Multiply attention weights with values (V)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights
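
The same computation, `softmax(Q K^T / sqrt(d_k)) V`, can be traced on tiny arrays without TensorFlow. The following is a minimal NumPy sketch under the same mask convention as above (1 means "hide this key"); the names `softmax` and `attention_np` and the random inputs are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_np(q, k, v, mask=None):
    """Scaled dot-product attention; mask entries equal to 1 are hidden."""
    dk = k.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(dk)  # (..., seq_len_q, seq_len_k)
    if mask is not None:
        logits = np.where(mask == 0, logits, -1e9)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))  # 2 queries, depth 4
k = rng.normal(size=(3, 4))  # 3 keys, depth 4
v = rng.normal(size=(3, 5))  # 3 values, depth_v 5
mask = np.array([[0.0, 0.0, 1.0]])  # hide the third key for every query

out, w = attention_np(q, k, v, mask)
print(out.shape)       # one output row per query, with depth_v columns
print(w.sum(axis=-1))  # each row of weights sums to 1
print(w[:, 2])         # the hidden key receives zero weight
```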

# 🔹 Step 6: Multi-Head Attention

In [31]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        """
        Initializes the MultiHeadAttention layer.

        Args:
            d_model: Dimension of the model (output of attention layer)
            num_heads: Number of parallel attention heads
        """
        super(MultiHeadAttention, self).__init__()

        # Ensure d_model is divisible by number of heads (for equal splitting)
        assert d_model % num_heads == 0

        self.num_heads = num_heads
        self.depth = d_model // num_heads  # Depth of each head

        # Dense layers to transform input into Q, K, V matrices
        self.wq = tf.keras.layers.Dense(d_model)  # Linear layer for query
        self.wk = tf.keras.layers.Dense(d_model)  # Linear layer for key
        self.wv = tf.keras.layers.Dense(d_model)  # Linear layer for value

        # Final dense layer to combine all heads
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """
        Splits the last dimension into (num_heads, depth) and rearranges to shape:
        (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))  # reshape
        return tf.transpose(x, perm=[0, 2, 1, 3])  # move num_heads before seq_len

    def call(self, v, k, q, mask):
        """
        Forward pass of the MultiHeadAttention layer.

        Args:
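            v: Value tensor of shape (batch_size, seq_len_v, d_model)
            k: Key tensor of shape (batch_size, seq_len_k, d_model)
            q: Query tensor of shape (batch_size, seq_len_q, d_model)
            mask: Mask broadcastable to (batch_size, num_heads, seq_len_q, seq_len_k), or None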

        Returns:
            output: Final attention output of shape (batch_size, seq_len_q, d_model)
        """
        batch_size = tf.shape(q)[0]  # Get dynamic batch size

        # Linear projections of Q, K, V
        q = self.wq(q)  # (batch_size, seq_len_q, d_model)
        k = self.wk(k)  # (batch_size, seq_len_k, d_model)
        v = self.wv(v)  # (batch_size, seq_len_v, d_model)

        print("Shape of q after dense:", tf.shape(q))
        print("Shape of k after dense:", tf.shape(k))
        print("Shape of v after dense:", tf.shape(v))

        # Split Q, K, V into multiple heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        print("Shape of q after split_heads:", tf.shape(q))
        print("Shape of k after split_heads:", tf.shape(k))
        print("Shape of v after split_heads:", tf.shape(v))


        # Apply scaled dot-product attention to each head
        scaled_attention, _ = scaled_dot_product_attention(q, k, v, mask)
        # Transpose back: (batch_size, seq_len_q, num_heads, depth)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

        # Concatenate all heads together
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.num_heads * self.depth))
        # Final linear projection
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output
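
The reshape and transpose in `split_heads`, and the inverse step that concatenates the heads again, are easy to verify on a small array. This is a NumPy sketch with made-up sizes (`batch_size=2`, `seq_len=3`, `d_model=8`, `num_heads=4`), not the layer itself.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 3, 8, 4
depth = d_model // num_heads  # width of each head

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, seq, heads, depth) -> (batch, heads, seq, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: undo the transpose, then flatten heads * depth back into d_model
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, num_heads * depth)

print(heads.shape)
print(np.array_equal(merged, x))  # splitting and merging is lossless
print(heads[0, 1, 0], x[0, 0, depth:2 * depth])  # head 1 sees columns depth..2*depth-1
```

Each head therefore works on a contiguous slice of the projected features, which is why `d_model` has to be divisible by `num_heads`.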

# 🔹 Step 7: Feed Forward Network

In [32]:
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model)
    ])
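
"Point-wise" means the same two dense layers are applied to every position independently. The sketch below illustrates this with NumPy and randomly initialised weights; the names `W1`, `b1`, `W2`, `b2`, and `ffn` are hypothetical stand-ins for the parameters Keras would create.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff, seq_len = 4, 8, 3

# Hypothetical weights standing in for the two Dense layers
W1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
W2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(x):
    """Dense(dff, relu) followed by Dense(d_model)."""
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU
    return hidden @ W2 + b2

x = rng.normal(size=(seq_len, d_model))
whole_sequence = ffn(x)
per_position = np.stack([ffn(x[i]) for i in range(seq_len)])

# Applying the network to the whole sequence equals applying it position by position
print(np.allclose(whole_sequence, per_position))
```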


# 🔹 Step 8: Encoder and Decoder Layers

In [33]:
# Encoder Layer definition
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff):
        super().__init__()

        # Multi-head self-attention mechanism
        self.mha = MultiHeadAttention(d_model, num_heads)

        # Feed Forward Network (two dense layers)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        # Layer normalization layers (helps with training stability)
        self.layernorm1 = tf.keras.layers.LayerNormalization()
        self.layernorm2 = tf.keras.layers.LayerNormalization()

    def call(self, x, mask):
        # Self-attention over input (query, key, value all = x)
        attn_output = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)

        # Add & Norm: Residual connection + normalization
        out1 = self.layernorm1(x + attn_output)

        # Feed Forward Network
        ffn_output = self.ffn(out1)

        # Add & Norm again
        return self.layernorm2(out1 + ffn_output)


# Decoder Layer definition
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff):
        super().__init__()

        # First multi-head attention (masked self-attention)
        self.mha1 = MultiHeadAttention(d_model, num_heads)

        # Second multi-head attention (cross-attention: query = decoder, key/value = encoder)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        # Feed Forward Network
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        # Layer normalizations for each sub-layer
        self.layernorm1 = tf.keras.layers.LayerNormalization()
        self.layernorm2 = tf.keras.layers.LayerNormalization()
        self.layernorm3 = tf.keras.layers.LayerNormalization()

    def call(self, x, enc_output, look_ahead_mask, padding_mask):
        # First attention block: masked self-attention
        attn1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        out1 = self.layernorm1(attn1 + x)  # Add & Norm

        # Second attention block: cross-attention with encoder output
        attn2 = self.mha2(enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        out2 = self.layernorm2(attn2 + out1)  # Add & Norm

        # Feed Forward Network
        ffn_output = self.ffn(out2)

        # Final Add & Norm
        return self.layernorm3(ffn_output + out2)
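
Every sub-layer in both classes is wrapped in the same "Add & Norm" pattern: add the sub-layer output to its input, then normalise over the feature axis. Here is a minimal NumPy sketch of that pattern. It omits the learnable scale and offset that `LayerNormalization` also holds, and the `eps` value and the function names are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalise each position over its feature axis (no learnable scale/offset)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))             # (batch, seq_len, d_model)
sublayer_output = rng.normal(size=x.shape)  # e.g. attention or FFN output

out = add_and_norm(x, sublayer_output)
print(out.shape)                               # shape is preserved
print(np.allclose(out.mean(axis=-1), 0.0))     # each position has zero mean
```

Because the residual sum requires matching shapes, every sub-layer has to return tensors with `d_model` features.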


# 🔹 Step 9: Build Transformer Model

In [34]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding):
        super().__init__()

        # Embedding layer to convert token indices to dense vectors
        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)

        # Positional encoding helps model understand position of words in sequence
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        # Stack of encoder layers
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff) for _ in range(num_layers)]

        # Dropout layer for regularization
        self.dropout = tf.keras.layers.Dropout(0.1)

    def call(self, x, mask):
        seq_len = tf.shape(x)[1]  # Length of input sequence

        # Token embeddings + positional encoding
        x = self.embedding(x)  # (batch_size, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.embedding.output_dim, tf.float32)) # Scale embeddings
        x += self.pos_encoding[:, :seq_len, :]  # Add positional information

        x = self.dropout(x)  # Apply dropout

        # Pass through all encoder layers
        for enc_layer in self.enc_layers:
            x = enc_layer(x, mask)  # Apply self-attention and feed-forward

        return x  # Output of the encoder (batch_size, input_seq_len, d_model)


class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, maximum_position_encoding):
        super().__init__()

        # Embedding for target tokens
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)

        # Positional encoding for decoder input
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        # Stack of decoder layers
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff) for _ in range(num_layers)]

        # Dropout for regularization
        self.dropout = tf.keras.layers.Dropout(0.1)

    def call(self, x, enc_output, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]  # Length of target sequence

        # Token embeddings + positional encoding
        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.embedding.output_dim, tf.float32)) # Scale embeddings
        x += self.pos_encoding[:, :seq_len, :]  # Add positional information

        x = self.dropout(x)  # Apply dropout

        # Pass through each decoder layer
        for dec_layer in self.dec_layers:
            x = dec_layer(x, enc_output, look_ahead_mask, padding_mask)

        return x  # Output of the decoder (batch_size, target_seq_len, d_model)

# 🔹 Step 10: Final Transformer

In [35]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, pe_input, pe_target):
        super().__init__()

        # Encoder to process the input sequence
        self.encoder = Encoder(
            num_layers, d_model, num_heads, dff,
            input_vocab_size, pe_input
        )

        # Decoder to process the target sequence using encoder output
        self.decoder = Decoder(
            num_layers, d_model, num_heads, dff,
            target_vocab_size, pe_target
        )

        # Final dense layer to project decoder output to vocabulary logits
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, look_ahead_mask, dec_padding_mask):
        # Create padding mask for the encoder input
        enc_padding_mask = create_padding_mask(inp)

        # Pass input through the encoder
        enc_output = self.encoder(inp, enc_padding_mask)  # (batch_size, input_seq_len, d_model)

        # Pass target tokens, encoder output, and masks through decoder
        dec_output = self.decoder(
            tar, enc_output,
            look_ahead_mask, dec_padding_mask
        )  # (batch_size, target_seq_len, d_model)

        # Output layer converts decoder output into logits over target vocabulary
        return self.final_layer(dec_output)  # (batch_size, target_seq_len, target_vocab_size)

def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add extra dimensions to add the padding to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
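
To see why the padding mask has shape `(batch_size, 1, 1, seq_len)`, the following NumPy sketch builds it for two toy sequences and broadcasts it against attention logits of shape `(batch_size, num_heads, seq_len_q, seq_len_k)`. The token IDs and the name `create_padding_mask_np` are illustrative.

```python
import numpy as np

def create_padding_mask_np(seq):
    """Return 1.0 where seq is padding (ID 0), shaped (batch, 1, 1, seq_len)."""
    mask = (seq == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

seq = np.array([[7, 6, 0, 0],
                [1, 2, 3, 0]])
mask = create_padding_mask_np(seq)
print(mask.shape)
print(mask[0, 0, 0])  # the last two keys of the first sequence are padding

# The two singleton axes broadcast over heads and query positions
logits = np.zeros((2, 8, 4, 4))
masked = np.where(mask == 0, logits, -1e9)
print(masked.shape)
print(masked[0, 3, 1])  # same hidden keys for every head and every query
```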

# Task
Add code to the notebook to translate an English sentence to Hindi using the trained Transformer model.

## Create a function to preprocess the input English sentence

### Subtask:
Create a function that preprocesses a raw English sentence for the Transformer model. This involves tokenizing the sentence and padding it to the maximum input sequence length.


**Reasoning**:
Define a function to preprocess the English sentence by tokenizing and padding it.



In [36]:
def preprocess_sentence(sentence):
    """
    Tokenizes and pads a single English sentence for the Transformer model.

    Args:
        sentence: The raw English sentence string.

    Returns:
        A TensorFlow tensor containing the padded tokenized sentence.
    """
    # Tokenize the sentence using the pre-fitted English tokenizer
    sentence_sequence = eng_tokenizer.texts_to_sequences([sentence])

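    # Pad the sequence to the maximum input length. `pe_input` must already be
    # defined in the notebook; longer inputs are truncated from the front, which
    # is the `pad_sequences` default (`truncating='pre'`).
    padded_sequence = pad_sequences(sentence_sequence, maxlen=pe_input, padding='post')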

    # Convert the padded sequence to a TensorFlow tensor
    return tf.cast(padded_sequence, dtype=tf.int64)


## Create a function to generate the translation

### Subtask:
Create a function that takes a preprocessed English input tensor and the Transformer model and generates the Hindi translation using greedy decoding.


**Reasoning**:
Define the `translate` function which implements greedy decoding to generate the Hindi translation from the English input tensor using the Transformer model.
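
Greedy decoding itself is independent of the Transformer: at each step the highest-scoring token is appended to the sequence and fed back in. The toy sketch below shows only that loop. The scoring function `toy_scores` and its `TRANSITIONS` table are hypothetical stand-ins for a trained model, and the explicit `start_id` / `end_id` are assumptions that the current tokenizer does not provide.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Append the highest-scoring next token until end_id or max_length is reached."""
    sequence = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(sequence)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == end_id:
            break
        sequence.append(next_id)
    return sequence[1:]  # drop the start token

# Hypothetical "model": the next token depends only on the last one
TRANSITIONS = {1: 3, 3: 4, 4: 5, 5: 2}
VOCAB_SIZE = 6

def toy_scores(sequence):
    scores = [0.0] * VOCAB_SIZE
    scores[TRANSITIONS[sequence[-1]]] = 1.0
    return scores

print(greedy_decode(toy_scores, start_id=1, end_id=2, max_length=10))  # [3, 4, 5]
print(greedy_decode(toy_scores, start_id=1, end_id=2, max_length=2))   # [3, 4]
```

The second call shows the role of `max_length`: it bounds the loop when the end token is never predicted.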



In [37]:
def translate(input_sentence_tensor, transformer):
    """
    Translates an English input tensor to Hindi using greedy decoding.

    Args:
        input_sentence_tensor: TensorFlow tensor containing the padded
                               tokenized English sentence.
        transformer: The trained Transformer model.

    Returns:
        The translated Hindi sentence string.
    """
    # The Hindi tokenizer was fitted without explicit '<start>' and '<end>' tokens,
    # which a Transformer decoder normally relies on to begin and stop generation.
    # As a workaround for this small demo, index 1 is used as a start-like
    # placeholder. Because the tokenizer was created with `oov_token='<OOV>'`,
    # index 1 is its '<OOV>' entry, so this is a simplification and not a real
    # start token. A proper setup would add '<start>' and '<end>' to every Hindi
    # sentence before fitting the tokenizer and would use
    # `hin_tokenizer.word_index['<start>']` here.
    decoder_input = tf.expand_dims([1], axis=0)  # Shape: (1, 1)

    # Set a maximum translation length
    max_length = pe_target # Use the target positional encoding length

    # Store the generated token IDs
    output_sequence = []

    # Start the greedy decoding loop
    for i in range(max_length):
        # Create look-ahead mask for the decoder input
        look_ahead_mask = create_look_ahead_mask(tf.shape(decoder_input)[1])
        # Create padding mask for the decoder input
        dec_padding_mask = create_padding_mask(decoder_input)

        # Combine masks (look-ahead and padding on the decoder side)
        combined_mask = tf.maximum(look_ahead_mask, dec_padding_mask)


        # Get predictions from the transformer
        # The transformer's call method expects: inp, tar, look_ahead_mask, dec_padding_mask
        # Here, inp is the encoder input, tar is the current decoder input
        predictions = transformer(
            inp=input_sentence_tensor,
            tar=decoder_input,
            look_ahead_mask=combined_mask, # Use the combined mask for decoder self-attention
            dec_padding_mask=create_padding_mask(input_sentence_tensor) # Padding mask for encoder-decoder attention
        )

        # Get the prediction for the next token (from the last time step)
        predictions = predictions[:, -1, :]  # Shape: (batch_size, target_vocab_size)
        predicted_id = tf.argmax(predictions, axis=-1).numpy()[0] # Get the token ID (scalar)

        # Stop when the model predicts index 0. There is no explicit '<end>' token
        # in this demo, so padding (0) serves as a stand-in for it.
        if predicted_id == 0:
            break

        # Append the predicted token ID to the output sequence
        output_sequence.append(predicted_id)

        # Concatenate the predicted ID to the decoder input for the next step
        decoder_input = tf.concat([decoder_input, tf.expand_dims([predicted_id], axis=0)], axis=-1)

    # Convert the predicted IDs back to a Hindi sentence. `sequences_to_texts`
    # expects a list of sequences, so the single output list is wrapped.
    translated_sentence = hin_tokenizer.sequences_to_texts([output_sequence])

    # `sequences_to_texts` returns a list of strings (one string per sequence).
    # We only have one sequence, so take the first element.
    return translated_sentence[0]

# Helper function for creating the look-ahead mask
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
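
The look-ahead mask and its combination with a padding mask (as done with `tf.maximum` in `translate`) can be inspected directly on a short sequence. This is a NumPy sketch with an illustrative target sequence; `create_look_ahead_mask_np` is a hypothetical NumPy counterpart of the helper above.

```python
import numpy as np

def create_look_ahead_mask_np(size):
    """1.0 above the diagonal: position i may not attend to positions j > i."""
    return 1.0 - np.tril(np.ones((size, size)))

look_ahead = create_look_ahead_mask_np(4)
print(look_ahead)

# Toy target sequence whose last position is padding (ID 0)
target = np.array([[5, 3, 9, 0]])
padding = (target == 0).astype(np.float64)[:, np.newaxis, np.newaxis, :]  # (1, 1, 1, 4)

# A position is hidden if either mask hides it
combined = np.maximum(look_ahead, padding)  # broadcasts to (1, 1, 4, 4)
print(combined.shape)
print(combined[0, 0])
```

In the combined mask the last column is hidden for every query, because the final key is padding, in addition to the strictly upper triangle hidden by the look-ahead mask.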

## Use the translation function with a sample English sentence

### Subtask:
Translate a sample English sentence using the `preprocess_sentence` and `translate` functions.


**Reasoning**:
Define a sample English sentence, preprocess it using the `preprocess_sentence` function, and then use the `translate` function with the transformer model to get the Hindi translation. Finally, print both the original and translated sentences.



In [38]:
# 1. Define a sample English sentence
sample_english_sentence = "how are you"

# 2. Call the preprocess_sentence function
input_tensor = preprocess_sentence(sample_english_sentence)

# 3. Call the translate function with the input tensor and the transformer model
translated_hindi_sentence = translate(input_tensor, transformer)

# 4. Print the original and translated sentences
print("Original English:", sample_english_sentence)
print("Translated Hindi:", translated_hindi_sentence)

Shape of q in scaled_dot_product_attention: tf.Tensor([ 1  8  4 16], shape=(4,), dtype=int32)
Shape of k in scaled_dot_product_attention: tf.Tensor([ 1  8  4 16], shape=(4,), dtype=int32)
Shape of v in scaled_dot_product_attention: tf.Tensor([ 1  8  4 16], shape=(4,), dtype=int32)
Shape of mask in scaled_dot_product_attention: tf.Tensor([1 1 1 4], shape=(4,), dtype=int32)
Shape of q in scaled_dot_product_attention: tf.Tensor([ 1  8  4 16], shape=(4,), dtype=int32)
Shape of k in scaled_dot_product_attention: tf.Tensor([ 1  8  4 16], shape=(4,), dtype=int32)
Shape of v in scaled_dot_product_attention: tf.Tensor([ 1  8  4 16], shape=(4,), dtype=int32)
Shape of mask in scaled_dot_product_attention: tf.Tensor([1 1 1 4], shape=(4,), dtype=int32)
Shape of q in scaled_dot_product_attention: tf.Tensor([ 1  8  1 16], shape=(4,), dtype=int32)
Shape of k in scaled_dot_product_attention: tf.Tensor([ 1  8  1 16], shape=(4,), dtype=int32)
Shape of v in scaled_dot_product_attention: tf.Tensor([ 1  8  

## Summary:

### Data Analysis Key Findings

*   A Python function `preprocess_sentence` was successfully created to tokenize and pad an English sentence to a fixed length using a pre-fitted English tokenizer.
*   A Python function `translate` was successfully created to perform greedy decoding using the trained Transformer model. This function takes the preprocessed English input, iteratively predicts the next Hindi token, and stops based on a maximum length or a designated end token (simulated by padding in this example).
*   The `translate` function utilizes look-ahead and padding masks during the decoding process.
*   A sample English sentence "how are you" was preprocessed and translated using the created functions and the Transformer model.
*   The translation output for "how are you" was "कैसे कैसे कैसे कैसे".

### Insights or Next Steps

*   The current implementation of the `translate` function uses padding (index 0) as a proxy for the end token. A more robust approach would involve explicitly adding `<start>` and `<end>` tokens to the Hindi vocabulary and modifying the tokenizer accordingly.
*   The translation "कैसे कैसे कैसे कैसे" for "how are you" repeats a single token instead of forming a meaningful sentence. This points to problems with the training data (the sample contains only five sentence pairs), the model capacity, or the training process. Inspecting the training data and the model's evaluation metrics is the recommended next step.
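
As a rough sketch of the first next step, the snippet below wraps each Hindi sentence in explicit `<start>` and `<end>` tokens and derives the shifted decoder input and target used for teacher forcing, where the decoder is fed the ground-truth previous tokens during training and not its own predictions. It is plain Python with a hand-built vocabulary. The token strings and the `vocab` layout are assumptions for illustration, not the notebook's tokenizer.

```python
START, END, PAD = "<start>", "<end>", "<pad>"

hindi_sentences = ["नमस्ते", "आप कैसे हैं"]

# Wrap every target sentence in explicit start and end tokens
wrapped = [f"{START} {s} {END}".split() for s in hindi_sentences]

# Hand-built vocabulary: padding gets ID 0, other tokens are numbered as they appear
vocab = {PAD: 0}
for tokens in wrapped:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# Encode and right-pad to the longest sentence
max_len = max(len(tokens) for tokens in wrapped)
ids = [[vocab[t] for t in tokens] + [vocab[PAD]] * (max_len - len(tokens))
       for tokens in wrapped]

# Teacher forcing: the decoder reads tokens 0..n-2 and must predict tokens 1..n-1
decoder_input = [row[:-1] for row in ids]
decoder_target = [row[1:] for row in ids]

print(vocab[START], vocab[END])
print(decoder_input[1])   # starts with the <start> ID
print(decoder_target[1])  # ends with the <end> ID
```

With such a vocabulary, inference can begin from `vocab["<start>"]` and stop when `vocab["<end>"]` is predicted, so that padding no longer has to double as an end marker.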
