# Task 3 (55 points): NLP and Attention Mechanism

## *Part 1 (10 points):*
Implement scaled dot-product attention from scratch, as discussed in class
(lecture 14). Use NumPy and pandas only; no deep learning libraries are
allowed for this step.

In [None]:
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exps / np.sum(exps, axis=-1, keepdims=True)

def scaled_dot_product_attention(query, key, value):
    # Calculate the dot product of Q and K^T, which measures the similarity between queries and keys.
    matmul_qk = np.matmul(query, np.transpose(key, axes=(0, 2, 1)))
    # Scale the dot product by the square root of the key dimension.
    #    This scaling prevents the dot product from growing too large, which can
    #    push the softmax function into regions with extremely small gradients.
    dk = key.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(dk)
    # Apply softmax over the key axis to normalize the similarity scores into probabilities.
    attention_weights = softmax(scaled_attention_logits)
    # Calculate the weighted sum of the values.
    #    This is the context vector, which is a weighted sum of the values,
    #    where the weights are the attention weights.
    output = np.matmul(attention_weights, value)
    return attention_weights, output

In [None]:
import tensorflow as tf
class SelfAttentionEncoder(tf.keras.layers.Layer):
    def __init__(self, units):
        super(SelfAttentionEncoder, self).__init__()
        self.units = units
        self.wq = tf.keras.layers.Dense(units)
        self.wk = tf.keras.layers.Dense(units)
        self.wv = tf.keras.layers.Dense(units)

    def call(self, x):
        # Keras layers implement `call`; the base class `__call__` wraps it.
        # x shape: (batch_size, seq_len, embedding_dim)
        Q = self.wq(x)  # (batch_size, seq_len, units)
        K = self.wk(x)  # (batch_size, seq_len, units)
        V = self.wv(x)  # (batch_size, seq_len, units)

        attention_weights, output = scaled_dot_product_attention(Q, K, V)
        return output, attention_weights

In [None]:
# --- Example Usage ---
# Assume we have Q, K, and V matrices:
# Q: (batch_size, query_length, d_model)
# K: (batch_size, key_length, d_model)
# V: (batch_size, value_length, d_model)
# For simplicity, let's create some dummy data:
Q = np.random.rand(2, 8, 64)  # batch_size=2, query_length=8, d_model=64
K = np.random.rand(2, 10, 64) # batch_size=2, key_length=10, d_model=64
V = np.random.rand(2, 10, 64) # batch_size=2, value_length=10, d_model=64

# Calculate attention
attention_weights, output = scaled_dot_product_attention(Q, K, V)

# Print the shapes to verify
print("Attention Weights Shape:", attention_weights.shape)
print("Output Shape:", output.shape)

Attention Weights Shape: (2, 8, 10)
Output Shape: (2, 8, 64)
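
As a quick sanity check on the function above, the sketch below re-declares it so the block runs on its own and verifies two properties that follow from the definition: every row of the attention weights sums to 1, and when all keys are identical the weights are uniform, so the output reduces to the plain mean of the values. The random seed and tensor sizes are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exps / np.sum(exps, axis=-1, keepdims=True)

def scaled_dot_product_attention(query, key, value):
    scores = np.matmul(query, np.transpose(key, axes=(0, 2, 1)))
    weights = softmax(scores / np.sqrt(key.shape[-1]))
    return weights, np.matmul(weights, value)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
Q = rng.random((2, 8, 64))
K = rng.random((2, 10, 64))
V = rng.random((2, 10, 64))

weights, output = scaled_dot_product_attention(Q, K, V)
# Each query's weights form a probability distribution over the 10 keys.
rows_sum_to_one = np.allclose(weights.sum(axis=-1), 1.0)

# With identical keys every score in a row is equal, so softmax is uniform
# and the output collapses to the mean of the value vectors.
K_same = np.repeat(K[:, :1, :], 10, axis=1)
uniform_weights, mean_output = scaled_dot_product_attention(Q, K_same, V)
matches_mean = np.allclose(mean_output, V.mean(axis=1, keepdims=True))

print(rows_sum_to_one, matches_mean)
```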


## *Part 2 (10 points):*

Pick any encoder-decoder seq2seq model (as discussed in class) and
integrate the scaled dot-product attention into the encoder architecture. You may come
up with your own integration technique or adopt one from the literature. Hint: See the
Bahdanau or Luong attention papers presented in class (lecture 14).

In [None]:
import numpy as np
import pandas as pd

class Seq2SeqModel:
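    def __init__(self, encoder_vocab_size, decoder_vocab_size, embedding_dim, units):
        self.encoder = Encoder(encoder_vocab_size, embedding_dim, units)
        self.decoder = Decoder(decoder_vocab_size, embedding_dim, units)

        # Name -> array mapping of the trainable NumPy weights, read by
        # compute_gradients and SGDOptimizer in Part 3. The arrays are shared with
        # the layers, so in-place updates change the model directly. The TensorFlow
        # projections inside Encoder.attention are not included.
        self.parameters = {
            'encoder_embedding': self.encoder.embedding,
            'decoder_embedding': self.decoder.embedding,
            'decoder_fc': self.decoder.fc,
        }
        for prefix, layer in [('encoder_gru', self.encoder.gru),
                              ('decoder_gru', self.decoder.gru),
                              ('decoder_attention', self.decoder.attention)]:
            for name, value in vars(layer).items():
                if isinstance(value, np.ndarray):
                    self.parameters[f'{prefix}_{name}'] = value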

    def forward(self, inputs, targets, target_token_start):
        encoder_output, encoder_hidden = self.encoder.forward(inputs)

        decoder_hidden = encoder_hidden

        # Teacher forcing - feeding the target as the next input
        decoder_input = np.array([[target_token_start]] * targets.shape[0])

        predictions = []
        for t in range(1, targets.shape[1]):
            # Decoder.forward embeds the token, attends over the encoder outputs,
            # and runs one GRU step, so attention and the GRU are not called separately here.
            prediction, decoder_hidden = self.decoder.forward(decoder_input, decoder_hidden, encoder_output)

            predictions.append(prediction)

            decoder_input = np.expand_dims(targets[:, t], 1) # Using teacher forcing

        return np.stack(predictions, axis=1)

class Encoder:
    def __init__(self, vocab_size, embedding_dim, units):
        self.embedding = np.random.randn(vocab_size, embedding_dim)
        self.attention = SelfAttentionEncoder(units)
        self.gru = GRU(units, embedding_dim)
        self.units = units

    def forward(self, x):
        batch_size = x.shape[0]
        x = self.embedding[x]
        # Initialize hidden state
        hidden = np.zeros((batch_size, self.units))
        outputs = []

        for t in range(x.shape[1]):
            output, hidden = self.gru(x[:, t, :], hidden)
            outputs.append(output)

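        outputs = np.stack(outputs, axis=1)
        # Integrate scaled dot-product self-attention: every time step attends over
        # all GRU outputs. The Dense projections are TensorFlow layers, so the result
        # is converted back to a NumPy array for the NumPy decoder.
        outputs, _ = self.attention(outputs.astype(np.float32))
        return np.asarray(outputs), hidden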

class Decoder:
    def __init__(self, vocab_size, embedding_dim, units):
        self.embedding = np.random.randn(vocab_size, embedding_dim)
        # The GRU input is the token embedding concatenated with the context vector.
        self.gru = GRU(units, embedding_dim + units)
        self.fc = np.random.randn(units, vocab_size)

        # Attention mechanism (BahdanauAttention owns its own W1, W2, and V)
        self.attention = BahdanauAttention(units)

    def forward(self, x, hidden, encoder_output):
        # x: token ids of shape (batch_size,) or (batch_size, 1)
        x = self.embedding[np.reshape(x, -1)]  # (batch_size, embedding_dim)

        # context_vector: (batch_size, units)
        context_vector, attention_weights = self.attention(hidden, encoder_output)
        x = np.concatenate([x, context_vector], axis=-1)  # (batch_size, embedding_dim + units)

        output, state = self.gru(x, hidden)  # both (batch_size, units)
        logits = np.dot(output, self.fc)  # (batch_size, vocab_size), unnormalized scores
        return logits, state

class BahdanauAttention:
    def __init__(self, units):
        self.W1 = np.random.randn(units, units)
        self.W2 = np.random.randn(units, units)
        self.V = np.random.randn(units, 1)

    def __call__(self, query, values):
        query_with_time_axis = np.expand_dims(query, 1)
        score = np.dot(np.tanh(np.dot(query_with_time_axis, self.W1) + np.dot(values, self.W2)), self.V)
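        # Softmax over the time axis, shifted by the max score for numerical stability
        exp_score = np.exp(score - np.max(score, axis=1, keepdims=True))
        attention_weights = exp_score / np.sum(exp_score, axis=1, keepdims=True)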
        context_vector = np.sum(attention_weights * values, axis=1)
        return context_vector, attention_weights

class GRU:
    def __init__(self, units, input_size):
        self.units = units
        self.W_z = np.random.randn(input_size, units)
        self.W_r = np.random.randn(input_size, units)
        self.W_h = np.random.randn(input_size, units)
        self.U_z = np.random.randn(units, units)
        self.U_r = np.random.randn(units, units)
        self.U_h = np.random.randn(units, units)
        self.b_z = np.zeros(units)
        self.b_r = np.zeros(units)
        self.b_h = np.zeros(units)

    def __call__(self, x, hidden):
        z = self.sigmoid(np.dot(x, self.W_z) + np.dot(hidden, self.U_z) + self.b_z)
        r = self.sigmoid(np.dot(x, self.W_r) + np.dot(hidden, self.U_r) + self.b_r)
        h_tilde = np.tanh(np.dot(x, self.W_h) + np.dot(r * hidden, self.U_h) + self.b_h)
        hidden = (1 - z) * hidden + z * h_tilde
        return hidden, hidden

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
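
One possible way to integrate scaled dot-product attention into the encoder, sketched here in NumPy only, is to let every GRU output attend over all encoder time steps and add the result back as a residual. The function name `self_attend`, the projection matrices `Wq`, `Wk`, `Wv`, the residual connection, and the padding mask are illustrative assumptions, not part of the code above.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exps / np.sum(exps, axis=-1, keepdims=True)

def self_attend(encoder_outputs, Wq, Wk, Wv, pad_mask=None):
    """Hypothetical helper: self-attention over GRU outputs with a residual.

    encoder_outputs: (batch, seq_len, units)
    pad_mask: (batch, seq_len) booleans, True where the token is padding.
    """
    Q = encoder_outputs @ Wq
    K = encoder_outputs @ Wk
    V = encoder_outputs @ Wv
    scores = Q @ np.transpose(K, (0, 2, 1)) / np.sqrt(K.shape[-1])
    if pad_mask is not None:
        # A large negative score drives the softmax weight of padded keys to ~0.
        scores = np.where(pad_mask[:, None, :], -1e9, scores)
    weights = softmax(scores)
    # Residual connection: keep the GRU output and add the attended context.
    return encoder_outputs + weights @ V, weights

rng = np.random.default_rng(0)
batch, seq_len, units = 2, 5, 8  # small illustrative sizes
outputs = rng.standard_normal((batch, seq_len, units))
Wq, Wk, Wv = (rng.standard_normal((units, units)) * 0.1 for _ in range(3))
pad_mask = np.array([[False, False, False, True, True],
                     [False, False, False, False, False]])

attended, weights = self_attend(outputs, Wq, Wk, Wv, pad_mask)
print(attended.shape, weights.shape)
```

The attended outputs keep the shape of the GRU outputs, so a decoder that consumes `(batch, seq_len, units)` encoder outputs would not need to change.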

## *Part 3 (5 points):*

Pick any public dataset of your choice (use a small-scale dataset such as a
subset of the Tatoeba or Multi30k dataset) for a machine translation task. Train your
model from Part 2 on this task. Evaluate it on the test set and report the
BLEU score.
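
For reference, corpus-level BLEU can be sketched in plain Python as below: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is a minimal, unsmoothed version that assumes one reference per hypothesis and whitespace tokenization; the function names are local to this sketch, and an established implementation would normally be preferred for reported scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

refs = ["deux hommes aux fourneaux préparent à manger .".split()]
perfect = corpus_bleu(refs, refs)
partial = corpus_bleu(refs, ["deux hommes aux fourneaux préparent le repas .".split()])
print(perfect, partial)
```

An exact match scores 1, while the second hypothesis loses n-gram matches around the substituted words and scores lower.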

In [None]:
splits = {'train': 'train.jsonl', 'validation': 'val.jsonl', 'test': 'test.jsonl'}
df = pd.read_json("hf://datasets/ryan82/multi30k_fr/" + splits["train"], lines=True)

In [None]:
df

| Unnamed: 0 | en | fr |
|---|---|---|
| 0 | Two young, White males are outside near many b... | Deux jeunes hommes blancs sont dehors près de ... |
| 1 | Several men in hard hats are operating a giant... | Plusieurs hommes en casque font fonctionner un... |
| 2 | A little girl climbing into a wooden playhouse. | Une petite fille grimpe dans une maisonnette e... |
| 3 | A man in a blue shirt is standing on a ladder ... | Un homme dans une chemise bleue se tient sur u... |
| 4 | Two men are at the stove preparing food. | Deux hommes aux fourneaux préparent à manger. |
| ... | ... | ... |
| 28996 | A rock climber practices on a rock climbing wall. | Un alpinisme s'exerce sur un mur d'escalade. |
| 28997 | Two male construction workers are working on a... | Deux ouvriers travaillent sur la rue à l'extér... |
| 28998 | An elderly man sits outside a storefront accom... | Un vieil homme est assis devant une vitrine ac... |
| 28999 | A man in shorts and a Hawaiian shirt leans ove... | Un homme en short et chemise hawaïenne se penc... |


In [None]:
from collections import Counter

english_sentences = df['en'].tolist()
french_sentences = df['fr'].tolist()

# Function to build a vocabulary from sentences
def build_vocab(sentences, max_tokens=None):
    token_counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # Simple tokenization by splitting on spaces
        token_counter.update(tokens)

    # Limit vocabulary size if max_tokens is specified
    if max_tokens:
        vocab = [token for token, _ in token_counter.most_common(max_tokens)]
    else:
        vocab = list(token_counter.keys())

    # Add special tokens
    vocab = ['<pad>', '<start>', '<end>', '<unk>'] + vocab
    return {token: idx for idx, token in enumerate(vocab)}

# Function to tokenize and vectorize sentences
def vectorize_sentences(sentences, vocab, max_sequence_length):
    sequences = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        sequence = [vocab.get(token, vocab['<unk>']) for token in tokens]  # Convert tokens to indices
        sequence = sequence[:max_sequence_length]  # Truncate if longer than max_sequence_length
        sequence = sequence + [vocab['<pad>']] * (max_sequence_length - len(sequence))  # Pad if shorter
        sequences.append(sequence)
    return np.array(sequences)

# Build vocabularies
max_tokens = 30000
max_sequence_length = 50
english_vocab = build_vocab(english_sentences, max_tokens)
french_vocab = build_vocab(french_sentences, max_tokens)

# Vectorize sentences
english_sequences = vectorize_sentences(english_sentences, english_vocab, max_sequence_length)
french_sequences = vectorize_sentences(french_sentences, french_vocab, max_sequence_length)

# Get vocabulary sizes
encoder_vocab_size = len(english_vocab)
decoder_vocab_size = len(french_vocab)
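
The decoder in Part 2 starts from `<start>`, and generation typically stops at `<end>`, but `vectorize_sentences` above does not insert either token into the target sequences. A minimal sketch of one way to add them is shown below; the helper name `vectorize_targets` and the toy vocabulary are assumptions for illustration.

```python
import numpy as np

def vectorize_targets(sentences, vocab, max_sequence_length):
    """Hypothetical variant of vectorize_sentences that wraps each sentence
    in <start> ... <end> before padding."""
    sequences = []
    for sentence in sentences:
        ids = [vocab.get(tok, vocab['<unk>']) for tok in sentence.lower().split()]
        # Reserve two positions for the markers so <end> survives truncation.
        ids = ids[:max_sequence_length - 2]
        ids = [vocab['<start>']] + ids + [vocab['<end>']]
        ids += [vocab['<pad>']] * (max_sequence_length - len(ids))
        sequences.append(ids)
    return np.array(sequences)

# Toy vocabulary using the same special-token layout as build_vocab.
toy_vocab = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'deux': 4, 'hommes': 5}
batch = vectorize_targets(['Deux hommes', 'Deux chiens courent vite'], toy_vocab, 6)
print(batch)
```

With this layout the teacher-forcing input is `batch[:, :-1]` and the prediction target is `batch[:, 1:]`.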


In [None]:
print(f"encoder_vocab_size: {encoder_vocab_size}, decoder_vocab_size: {decoder_vocab_size}")
print("English sequences sample:", english_sequences[0])
print("French sequences sample:", french_sequences[0])

encoder_vocab_size: 14450, decoder_vocab_size: 16694
English sequences sample: [  13 1347   22  824   15   63   72  167 1648    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]
French sequences sample: [  17   69   36  221   23  130   59    6 1904    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]


In [None]:
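def loss_function(real, pred):
    # real: (batch_size, T) token ids; pred: (batch_size, T, vocab_size) unnormalized logits
    # Create a mask to ignore padding tokens (assume padding token index is 0)
    mask = (real != 0).astype(np.float32)

    # Log-softmax over the vocabulary axis, shifted by the max for numerical stability
    shifted = pred - np.max(pred, axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

    # Pick the log-probability of the correct token at each position
    token_log_probs = np.take_along_axis(log_probs, real[..., None], axis=-1).squeeze(-1)

    # Masked mean of the negative log-likelihood
    loss = -np.sum(token_log_probs * mask) / np.sum(mask)

    return loss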

# Training loop
def train_step(model, inputs, targets, optimizer, target_token_start):
    # Forward pass
    predictions = model.forward(inputs, targets, target_token_start)

    # Compute loss (ignore the first token in targets for teacher forcing)
    loss = loss_function(targets[:, 1:], predictions)

    # Backward pass (numerical gradients via central finite differences)
    gradients = compute_gradients(model, inputs, targets, loss_function, target_token_start)

    # Update model parameters using the optimizer
    optimizer.update_parameters(model, gradients)

    return loss

# Function to compute gradients numerically.
# Each scalar parameter needs two forward passes, so this is only practical for very small models.
def compute_gradients(model, inputs, targets, loss_function, target_token_start):
    gradients = {}
    epsilon = 1e-5  # Small value for finite differences

    for param_name, param_value in model.parameters.items():
        grad = np.zeros_like(param_value)

        # Iterate over each parameter value and compute the gradient
        for idx in np.ndindex(param_value.shape):
            original_value = param_value[idx]

            # Perturb the parameter
            param_value[idx] += epsilon
            loss_plus = loss_function(targets[:, 1:], model.forward(inputs, targets, target_token_start))

            param_value[idx] -= 2 * epsilon
            loss_minus = loss_function(targets[:, 1:], model.forward(inputs, targets, target_token_start))

            # Reset the parameter
            param_value[idx] = original_value

            # Compute the gradient using finite differences
            grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)

        gradients[param_name] = grad

    return gradients
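
Central finite differences, as used in `compute_gradients`, cost two forward passes per scalar parameter, so they are most useful as a check on analytic gradients. The sketch below illustrates this on softmax cross-entropy, whose gradient with respect to the logits is known in closed form (softmax probabilities minus the one-hot target). The function names and sizes are illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of a single softmax prediction against an integer target."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

def analytic_grad(logits, target):
    shifted = logits - np.max(logits)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    probs[target] -= 1.0  # d(loss)/d(logits) = softmax(logits) - one_hot(target)
    return probs

def numeric_grad(logits, target, epsilon=1e-5):
    grad = np.zeros_like(logits)
    for i in range(logits.size):
        plus, minus = logits.copy(), logits.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        grad[i] = (cross_entropy(plus, target) - cross_entropy(minus, target)) / (2 * epsilon)
    return grad

rng = np.random.default_rng(0)
logits = rng.standard_normal(6)
max_diff = np.max(np.abs(analytic_grad(logits, 2) - numeric_grad(logits, 2)))
print(max_diff)
```

If the two gradients agree to within a small tolerance, the analytic version can replace the numerical one in the training loop at a fraction of the cost.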

In [None]:
class SGDOptimizer:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def update_parameters(self, model, gradients):
        """
        Updates model parameters using SGD.

        Args:
            model: The Seq2Seq model.
            gradients: A dictionary of gradients for each model parameter.
        """
        for param_name, param_value in model.parameters.items():
            param_value -= self.learning_rate * gradients[param_name]

In [None]:
# Hyperparameters
embedding_dim = 256
units = 512
batch_size = 64
epochs = 10
learning_rate = 0.01
# Define the target_token_start token
target_token_start = french_vocab['<start>']

# Initialize the model
model = Seq2SeqModel(encoder_vocab_size, decoder_vocab_size, embedding_dim, units)
optimizer = SGDOptimizer(learning_rate)

In [None]:
# Training the model
for epoch in range(epochs):
    total_loss = 0
    for inputs, targets in zip(english_sequences, french_sequences):
        inputs = np.expand_dims(inputs, axis=0)  # Add batch dimension
        targets = np.expand_dims(targets, axis=0)  # Add batch dimension

        loss = train_step(model, inputs, targets, optimizer, target_token_start)
        total_loss += loss

    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(english_sequences)}")

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 1) + inhomogeneous part.

## *Part 4 (30 points):*
In this part you are required to implement a simplified Transformer
model from scratch (using Python and NumPy/PyTorch/TensorFlow with minimal
high-level abstractions) and apply it to a machine translation task (e.g., English-to-French or
English-to-German translation) using the same dataset as in Part 3.

We discussed the Transformer architecture in depth in class (Vaswani et al., "Attention Is
All You Need"). Apply the following simplifications to the original model architecture:
1. Reduced Model Depth: Use 2 encoder layers and 2 decoder layers instead of the standard 6.
2. Limited Attention Heads: Use 2 attention heads in the multi-head attention mechanism rather than 8.
3. Smaller Embedding Size: Set the embedding dimension to 64 instead of 512.
4. Reduced Feedforward Network Size: Use a feedforward dimension of 128 instead of 2048.
5. Smaller Dataset: Use a small dataset (e.g., about 10k sentence pairs).
6. Tokenization Simplifications: Use a basic subword tokenizer (like Byte Pair Encoding - BPE) or word-level tokenization instead of complex language-specific tokenizers.
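
The simplifications above can be collected in a single configuration object, sketched below. The class name `TransformerConfig` and its field names are assumptions for illustration; only the numeric values come from the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    num_encoder_layers: int = 2   # instead of 6
    num_decoder_layers: int = 2   # instead of 6
    num_heads: int = 2            # instead of 8
    d_model: int = 64             # embedding size, instead of 512
    d_ff: int = 128               # feedforward size, instead of 2048

    @property
    def head_dim(self) -> int:
        # Each head works on an equal slice of the embedding.
        return self.d_model // self.num_heads

config = TransformerConfig()
print(config.head_dim)
```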

**Key components to implement:**
1. Positional Encoding: Implement sinusoidal positional encoding.
2. Scaled dot-product attention: Use the same implementation from Part 1.
3. Multi-Head Attention: Integrate the scaled dot-product attention into a multi-head attention framework using the specified simplifications.
4. Encoder and Decoder Blocks: Implement simplified encoder and decoder layers with layer normalization, residual connections, and masked attention in the decoder for autoregressive generation.
5. Final Output Layer: Implement a linear layer followed by a softmax activation for generating translated tokens.

Projects in Machine Learning and AI (RPI Spring 2025)
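
Three of the components listed above (sinusoidal positional encoding, multi-head attention, and the decoder's causal mask) can be sketched in NumPy as follows. The function names, the weight initialization, and the boolean mask convention (True means blocked) are assumptions for this sketch, not a prescribed implementation.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    positions = np.arange(max_len)[:, None]
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

def softmax(x):
    exps = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exps / np.sum(exps, axis=-1, keepdims=True)

def multi_head_attention(x_q, x_kv, Wq, Wk, Wv, Wo, num_heads, mask=None):
    """x_q: (batch, len_q, d_model), x_kv: (batch, len_k, d_model).
    mask: broadcastable to (batch, heads, len_q, len_k), True = blocked."""
    batch, len_q, d_model = x_q.shape
    head_dim = d_model // num_heads

    def split_heads(x):
        # (batch, len, d_model) -> (batch, heads, len, head_dim)
        return x.reshape(batch, -1, num_heads, head_dim).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(x_q @ Wq), split_heads(x_kv @ Wk), split_heads(x_kv @ Wv)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores)
    heads = weights @ V  # (batch, heads, len_q, head_dim)
    merged = heads.transpose(0, 2, 1, 3).reshape(batch, len_q, d_model)
    return merged @ Wo, weights

d_model, num_heads, seq_len = 64, 2, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((1, seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
# Causal mask: position i may not attend to positions j > i.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

out, weights = multi_head_attention(x, x, Wq, Wk, Wv, Wo, num_heads, causal_mask)
print(out.shape, weights.shape)
```

Passing the encoder output as `x_kv` and omitting the mask would give the decoder's cross-attention under the same function.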


**Evaluation:** Compute the BLEU score on a validation set and compare the performance
with your model from Part 2. Explain why the performance differs. Also
discuss any other differences you notice, such as runtime.