

Recurrent Neural Networks (RNNs) : neural networks specialized for sequential data, i.e. data that has a temporal or sequential order.

An RNN is a type of neural network that iterates over a sequence of vectors while keeping an internal state (memory) that depends on the previous elements of the sequence.
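The recurrence described above can be sketched in NumPy. This is a minimal illustration that assumes a plain tanh cell; the names `rnn_forward`, `W_x`, `W_h`, and `b` are hypothetical and do not come from the notes.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a simple tanh RNN over a sequence and return every hidden state."""
    h = np.zeros(W_h.shape[0])               # internal state (memory), starts at zero
    states = []
    for x in xs:                             # iterate over the sequence in order
        h = np.tanh(W_x @ x + W_h @ h + b)   # new state depends on the input and the previous state
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                 # 5 time steps, 3 features per step
W_x = rng.normal(size=(4, 3)) * 0.5          # input-to-state weights (assumed shapes)
W_h = rng.normal(size=(4, 4)) * 0.5          # state-to-state weights
b = np.zeros(4)

states = rnn_forward(xs, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

Each row of `states` is the memory after reading one more element, so the last row summarizes the whole sequence.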

Vanishing Gradient Problem : the gradients become too small as they are backpropagated through many time steps, making it hard for the model to learn long-term dependencies.
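A toy illustration of why the gradient shrinks, assuming a scalar RNN h_t = tanh(w * h_{t-1} + x_t) with |w| < 1. The function name `gradient_through_time` and the chosen values are illustrative only.

```python
import numpy as np

def gradient_through_time(w, xs):
    """Return |d h_t / d h_0| at every step of a scalar tanh RNN."""
    h, grad, grads = 0.0, 1.0, []
    for x in xs:
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)   # chain rule: one local derivative per time step
        grads.append(abs(grad))
    return np.array(grads)

xs = np.zeros(50)                    # 50 time steps with zero input
grads = gradient_through_time(w=0.5, xs=xs)
print(grads[0], grads[9], grads[49]) # the gradient shrinks geometrically with the number of steps
```

Since each step multiplies the gradient by a factor smaller than 1, the contribution of early elements to the loss becomes negligible after a few dozen steps.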

Transformers : Architecture :

ENCODER : understand the input sequence and extract its features


-input embedding : converts input tokens into vectors
for text : one-hot encoding (the number of "classes" is the size of the vocabulary)
embedding (projection): projects the one-hot encoded vectors into a lower-dimensional space that the model takes as input
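The two steps above (one-hot encoding, then projection) are equivalent to looking up a row of the embedding matrix. A small NumPy sketch with toy sizes; `E` and `token_ids` are illustrative names:

```python
import numpy as np

vocab_size, embedding_dim = 6, 3                     # toy sizes for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim))     # embedding (projection) matrix

token_ids = np.array([2, 5, 2])                      # a sequence of 3 token IDs
one_hot = np.eye(vocab_size)[token_ids]              # shape (3, vocab_size)

projected = one_hot @ E                              # project the one-hot vectors
looked_up = E[token_ids]                             # equivalent direct row lookup

print(np.allclose(projected, looked_up))             # True
```

This is why embedding layers are implemented as table lookups: the one-hot vectors never need to be materialized.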


-positional encoding : adds positional information to the input embeddings



-self-attention mechanism : calculates the importance of each token in the sequence

on their own, the embedded sequence elements are not aware of one another.
the self-attention mechanism allows each element to take the other elements of the sequence into account when building its representation.
similarity scores are calculated between each pair of tokens in the sequence to determine how much each token matters to the other.
example : x1 x2 x3 x4 x5
x1 is compared to x1, x2, x3, x4, x5
x2 is compared to x1, x2, x3, x4, x5, and so on
each comparison is a scalar (dot) product between the two vectors

then we apply a softmax function to the similarity scores to get the attention weights (normalized similarity scores that sum to 1 for each token).
the attention weights are multiplied by the input tokens to get the weighted sum, which is the output of the self-attention mechanism.
new x1 = a11*x1 + a12*x2 + a13*x3 + a14*x4 + a15*x5, where a1j is the attention weight of x1 on xj

we take each vector and stack them as the rows of a matrix M (one row per token)
we multiply M by M transposed -> we get the similarity matrix (one row and one column per token)
we apply softmax to each row of the similarity matrix -> we get the attention matrix
we multiply the attention matrix by M -> we get the output matrix
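The matrix form of these steps can be sketched in NumPy. This assumes the token vectors are stacked as the rows of `M`, so that the similarity matrix is `M @ M.T`; the function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def simple_self_attention(M):
    """Self-attention without learnable weights; M has one row per token."""
    scores = M @ M.T                    # similarity matrix: dot product of every pair of tokens
    weights = softmax(scores, axis=-1)  # attention matrix: each row sums to 1
    return weights @ M, weights         # output matrix: weighted sums of the token vectors

M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # 3 tokens with 2 features each
out, weights = simple_self_attention(M)
print(out.shape)                        # (3, 2)
print(weights.sum(axis=1))              # every row of the attention matrix sums to 1
```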

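add learnable weights so the model can learn how to perform the self-attention mechanism :
the learnable weights are called the query, key, and value matrices; each token xi is projected into a query qi, a key ki, and a value vi.
example with x1 -> its query q1 is compared to the key of every token :
x1(k1) x2(k2)  x3(k3)  x4(k4) ... xN(kN)
v1 v2  v3  v4 ... vN
new x1 = softmax(q1 * [k1 ... kN]) * [v1 ... vN]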
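A NumPy sketch of self-attention with learnable query, key, and value matrices. The names `W_q`, `W_k`, `W_v` are illustrative, and the division by sqrt(d_k) is the scaling used in the standard Transformer; it is an addition that the notes above do not mention.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention with learnable projections; X has one row per token."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # project every token into query, key, value
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot product (scaling is an assumption, see above)
    weights = softmax(scores, axis=-1)    # attention weights, one row per query
    return weights @ V                    # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # 5 tokens, embedding dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```

In a trained model the three matrices are learned by backpropagation; here they are random only to make the shapes concrete.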

multi-head attention : the self-attention mechanism is applied multiple times in parallel, each with different learnable weights.
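A sketch of multi-head attention in NumPy: the same attention computation runs once per head with its own weights, and the head outputs are concatenated. The final output projection `W_o` follows the standard Transformer and is an assumption here, as are all the names.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:                       # each head has its own learnable weights
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    concat = np.concatenate(outputs, axis=-1)         # concatenate the heads feature-wise
    return concat @ W_o                               # mix the heads back to the model dimension

rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (5, 8)
```

The loop is written for readability; real implementations compute all heads in one batched matrix product.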



-feed forward neural network : processes the self-attention output

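takes the self-attention output matrix (one row per token) as input and applies the same dense layers to each row independently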

-residual connection : adds the input of a sub-layer (self-attention or feed-forward neural network) to its output to mitigate the vanishing gradient problem.
-layer normalization : normalizes the output of the residual connection to stabilize training.
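The feed-forward network, residual connection, and layer normalization can be sketched together in NumPy. This simplified `layer_norm` omits the learnable scale and shift that framework layers usually include, and all names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (no learnable scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two dense layers with a ReLU in between, applied to every token independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                            # 5 tokens, dimension 8
W1, b1 = rng.normal(size=(8, 32)) * 0.1, np.zeros(32)  # expand to a larger hidden size
W2, b2 = rng.normal(size=(32, 8)) * 0.1, np.zeros(8)   # project back to the model dimension

y = layer_norm(x + feed_forward(x, W1, b1, W2, b2))    # residual connection, then normalization
print(y.shape)  # (5, 8)
```

Because the input `x` is added back before normalization, the gradient always has a direct path around the feed-forward network.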

example : "the dog is sleeping and the cat is playing"
the model can tell that the cat is playing and the dog is sleeping (and not the other way around) because we add positional encoding to the input embeddings.
p(pos, 2i)   = sin(pos / 10000^(2i/d))
p(pos, 2i+1) = cos(pos / 10000^(2i/d))
pos : position of the token in the sequence
i : index of the sin/cos pair (0 <= i < d/2); even embedding dimensions 2i use sin, odd dimensions 2i+1 use cos
d : dimension of the embedding
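A vectorized NumPy sketch of the sinusoidal positional encoding, with sin on the even embedding dimensions and cos on the odd ones. The function name `positional_encoding` is illustrative, and `d` is assumed to be even.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal positional encoding of shape (max_len, d); d is assumed to be even."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1): position of each token
    i = np.arange(d // 2)[None, :]                # (1, d/2): index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)   # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions 2i+1
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # [0. 1. 0. 1.]
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1; later positions get distinct patterns across the frequencies.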

the output of the encoder is a sequence of vectors that represent the input sequence's features.

1 layer contains : self-attention mechanism -> feed-forward neural network (each followed by a residual connection and layer normalization)

the input embedding and positional encoding are not repeated each time we pass through a layer; they are applied only once, before the first layer, because they prepare the input and are not part of the repeated encoder layer.


how to perform a classification task on a sequence ? 
At the end of the encoder, we can add a classification head that takes the encoder's output, which is a sequence of vectors.
We add a global average pooling layer to reduce the sequence of vectors to a single vector that is passed to the classification head.
Then, we apply an MLP to the pooled vector to make predictions.
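The classification head described above can be sketched in NumPy: average the encoder output over the sequence axis, then apply a small MLP. The layer sizes and the name `classify` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def classify(encoded, W1, b1, W2, b2):
    """Global average pooling followed by a two-layer MLP with a softmax output."""
    pooled = encoded.mean(axis=0)                  # sequence of vectors -> single vector
    hidden = np.maximum(0.0, pooled @ W1 + b1)     # hidden dense layer with ReLU
    return softmax(hidden @ W2 + b2)               # class probabilities

rng = np.random.default_rng(0)
encoded = rng.normal(size=(10, 8))                 # encoder output: 10 tokens, 8 features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)     # 3 classes in this toy example

probs = classify(encoded, W1, b1, W2, b2)
print(probs.shape)  # (3,)
```

Average pooling makes the head independent of the sequence length, which is convenient when inputs are padded to a common size.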



DECODER : generate the output sequence based on the encoder's features



Lab Session

1. Code a transformer encoder model using:
    - `tfm.nlp.layers.TransformerEncoderBlock`
    - `tf.keras.layers.Embedding`
    - Positional encoding
    - `tf.keras.layers.GlobalAveragePooling1D`

2. Train the model on the Reuters newswire dataset.
    - Remember to pad the sequences for batching.

3. Experiment with different hyperparameters (vector dimension, number of heads, number of layers, etc.). Add comments to explain every choice and every hyperparameter, and compare the results based on a metric of your choice (justify the metric used).


In [None]:
import tensorflow as tf
import numpy as np

# Hyperparameters & Data Setup
vocab_size = 5000      # Reuters articles typically use common words; we limit to the top 5k words.
max_len = 300           # Maximum sequence length; chosen to capture most article content while keeping compute reasonable.
embedding_dim = 64      # Embedding size: a moderate dimension that balances capacity and speed.
num_heads = 4           # Multi-head attention: 4 heads allow the model to attend to different subspaces.
ff_dim = 512            # Feed-forward network dimension: larger than embedding_dim to increase capacity.
num_layers = 5          # Number of transformer encoder layers; can be increased in experiments.
num_classes = 46        # Reuters dataset has 46 different topics.

# Load Reuters dataset (already tokenized into integer sequences)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(num_words=vocab_size)

# Pad sequences for uniform length (important for batching)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_len, padding='post')
x_test  = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_len, padding='post')


# Model Construction
# Input layer for sequences of token IDs
inputs = tf.keras.Input(shape=(max_len,))

# Embedding layer converts integer tokens to dense vectors
x = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)(inputs)

# Positional Encoding
# Create a fixed positional encoding matrix (max_len x embedding_dim)
# Formula: p(pos, 2i) = sin(pos / 10000^(2i/embedding_dim))
#          p(pos, 2i+1) = cos(pos / 10000^(2i/embedding_dim))


pos_encoding = np.zeros((max_len, embedding_dim))
for pos in range(max_len):
    for i in range(embedding_dim):
        angle = pos / np.power(10000, (2 * (i // 2)) / embedding_dim)
        if i % 2 == 0:
            pos_encoding[pos, i] = np.sin(angle)
        else:
            pos_encoding[pos, i] = np.cos(angle)
# Convert to a TensorFlow constant so it can be added to the embeddings.
pos_encoding = tf.constant(pos_encoding, dtype=tf.float32)

# Add positional encoding to the embedding (broadcasts across the batch)
x = x + pos_encoding

# Transformer Encoder Blocks
for _ in range(num_layers):
    # Multi-head self-attention: each token attends to all others.
    # tf.keras.layers.MultiHeadAttention creates the query, key, and value projections internally.
    # key_dim is the size of each head's query/key vectors (here embedding_dim per head).
    attn_output = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                     key_dim=embedding_dim)(x, x)
    # Add & Normalize: Residual connection to help with gradient flow.
    x = tf.keras.layers.Add()([x, attn_output])
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    
    # Feed-Forward Network: Two dense layers with a ReLU activation in between.
    ff_output = tf.keras.layers.Dense(ff_dim, activation='relu')(x)
    ff_output = tf.keras.layers.Dense(embedding_dim)(ff_output)
    
    # Another residual connection and normalization.
    x = tf.keras.layers.Add()([x, ff_output])
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)


# Classification Head 
# Global average pooling aggregates the sequence dimension into a single vector.
x = tf.keras.layers.GlobalAveragePooling1D()(x)
# A Dense layer to further process the pooled features.
x = tf.keras.layers.Dense(64, activation='relu')(x)
# Output layer with softmax activation for multi-class classification.
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)

# Define the complete model.
model = tf.keras.Model(inputs=inputs, outputs=outputs)


# Compile & Summarize the Model
# We use 'adam' optimizer and sparse categorical crossentropy (labels are integers).
# Accuracy is chosen as the metric because it directly measures the percentage of correct classifications.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

# Training the Model
# Batch size of 32 is a common default that balances training speed and stability.
# Using 10 epochs as an initial experiment; further tuning may increase epochs if needed.
history = model.fit(x_train, y_train, validation_split=0.2, epochs=10, batch_size=32)


# Evaluate on Test Data

test_loss, test_accuracy = model.evaluate(x_test, y_test)
print("Test Accuracy:", test_accuracy)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz
2110848/2110848 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


Epoch 1/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 35s 141ms/step - accuracy: 0.3799 - loss: 2.4674 - val_accuracy: 0.5214 - val_loss: 1.7755
Epoch 2/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 33s 147ms/step - accuracy: 0.5770 - loss: 1.6588 - val_accuracy: 0.6205 - val_loss: 1.5158
Epoch 3/10
214/225 ━━━━━━━━━━━━━━━━━━━━ 1s 154ms/step - accuracy: 0.6459 - loss: 1.4227

KeyboardInterrupt: 

In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

# Plot training & validation loss values
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')

plt.show()