<a href="https://colab.research.google.com/github/Datalincy/EDA/blob/main/Transformers_Model_with_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing libraries

In [31]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load the dataset

In [32]:
# Load and preprocess the English-French translation dataset
data_path = keras.utils.get_file(
    "fra-eng.zip", origin="http://storage.googleapis.com/download.tensorflow.org/data/fra-eng.zip", extract=True,)
data_path = data_path.replace(".zip", "")

keras.utils.get_file: This function is used to download a file from a URL and cache it locally. It takes in several arguments:

"fra-eng.zip": This is the name you want to assign to the downloaded file.

origin: The URL from which to download the dataset. In this case, it's pointing to a dataset of English-French translations hosted on Google's cloud storage.

extract=True: This argument specifies that the zip file should be extracted automatically after being downloaded.

In [33]:
with open(data_path + "/fra.txt", "r", encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    english, french = line.split("\t")
    french = "[start]" + french + "[end]"
    text_pairs.append((english, french))
english_tokenizer = keras.preprocessing.text.Tokenizer(filters="")
french_tokenizer = keras.preprocessing.text.Tokenizer(filters="")
english_tokenizer.fit_on_texts(pair[0] for pair in text_pairs)
french_tokenizer.fit_on_texts(pair[0] for pair in text_pairs)

english_vocab_size = len(english_tokenizer.word_index) + 1
french_vocab_size = len(french_tokenizer.word_index) + 1

sequence_length = 20
batch_size = 64

It loads the English-French translation dataset from a file (fra.txt).

It creates pairs of English and French sentences.

It tokenizes both the English and French sentences.

It calculates the vocabulary sizes for both languages.

It sets the sequence length (maximum number of tokens per sentence) and batch size for training the model.

In [39]:
def transformer_decoder(inputs, enc_outputs, head_size, num_heads, ff_dim, dropout=0):
    # Attention and Normalization
    x = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads, dropout=dropout)(inputs, inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x) # Changed le to 1e
    res = x + inputs
    # Feed Forward Part (Add this part from transformer_encoder)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res

The transformer_decoder function implements the following sequence of operations in a Transformer architecture:

Multi-Head Attention: Allows the decoder to focus on different parts of the sequence for each position.

Dropout: Applied for regularization.

Layer Normalization: Normalizes the output to prevent unstable gradients.

Residual Connection: Adds the input back to the output for better gradient flow.

Feed-Forward Network: A fully connected layer with ReLU activation to learn higher-level representations.

Final Residual Connection: The output of the feed-forward network is added back to the input.

This decoder can be used as part of a larger Transformer-based architecture for tasks like machine translation, text generation, etc.

# Define the transformer

In [40]:
# Define the transformer modek architecture
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
  # Attention and Normalization
  x = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads, dropout=dropout)(inputs, inputs)
  x = layers.Dropout(dropout)(x)
  x = layers.LayerNormalization(epsilon=1e-6)(x)
  res = x + inputs

  # Feed forward part
  x = layers.conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
  x = layers.Dropout(dropout)(x)
  x = layers.conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
  x = layers.LayerNormalization(epsilon=1e-6)(x)
  return x + res

The transformer_encoder function implements the following sequence of operations:

Multi-Head Attention: The encoder attends to the input sequence using self-attention, allowing it to capture dependencies at different positions in the sequence.

Dropout: Regularization to prevent overfitting.

Layer Normalization: Normalizes the output of the attention layer to improve stability during training.

Residual Connection: Adds the original input back to the output of the attention block, helping with gradient flow.

Feed-Forward Network: A fully connected layer (applied position-wise) followed by another convolution, transforming the intermediate representation.

Final Layer Normalization: Stabilizes the output of the feed-forward network.

Final Residual Connection: Adds the original input back to the output after the feed-forward network, which helps preserve information.

This encoder can be stacked multiple times to form a deep Transformer model, which is the core architecture used in models like BERT and GPT.

In [41]:
# Define the transformer modek architecture
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
  # Attention and Normalization
  x = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads, dropout=dropout)(inputs, inputs)
  x = layers.Dropout(dropout)(x)
  x = layers.LayerNormalization(epsilon=1e-6)(x)
  res = x + inputs

  # Feed forward part
  x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
  x = layers.Dropout(dropout)(x)
  x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
  x = layers.LayerNormalization(epsilon=1e-6)(x)
  return x + res


def build_model(input_vocab_size, target_size, max_length):
  inputs = keras.Input(shape=(None,), dtype="int64", name="inputs")
  dec_inputs = keras.Input(shape=(None,), dtype="int64", name="dec_inputs")


  # Encoder
  enc_padding_mask = keras.layers.Lambda(
      lambda x: keras.backend.cast(keras.backend.equal(x,0), keras.backend.floatx()))(inputs)
  enc_outputs = keras.layers.Embedding(input_vocab_size, 128)(inputs)
  # Use the custom PositionalEncoding class you defined
  enc_outputs = PositionalEncoding(max_length, 128)(enc_outputs)
  enc_outputs = transformer_encoder(enc_outputs, head_size=128, num_heads=8, ff_dim=512, dropout=0.1) # Changed num_head to num_heads
  enc_outputs = layers.Dropout(0.1)(enc_outputs)

  # Decoder
  look_ahead_mask = keras.layers.Lambda(
      lambda x: keras.backend.cast(keras.backend.equal(x,0), keras.backend.floatx()))(dec_inputs)
  dec_padding_mask = keras.layers.Lambda(
      lambda x: keras.backend.cast(keras.backend.equal(x,0), keras.backend.floatx()))(inputs)

  dec_outputs = keras.layers.Embedding(target_vocab_size, 128)(dec_inputs)
  # Use the custom PositionalEncoding class you defined
  dec_outputs = PositionalEncoding(max_length, 128)(dec_outputs)
  dec_outputs = transformer_decoder(dec_outputs, enc_outputs, head_size=128, num_heads=8, ff_dim=512, dropout=0.1) # Changed num_head to num_heads
  dec_outputs = layers.Dropout(0.1)(dec_outputs)
  outputs = layers.Dense(target_vocab_size, activation="softmax")(dec_outputs)

  model = keras.Model(inputs=[inputs, dec_inputs], outputs=outputs)
  return model

Summary of the Transformer Architecture
Encoder:

The encoder takes the input sequence and passes it through a series of self-attention layers (with multi-head attention), followed by a feed-forward network.

It uses a positional encoding to add information about the order of tokens in the sequence.

Decoder:

The decoder takes the target sequence and generates predictions by attending to both the target sequence itself (self-attention) and the encoder's output (cross-attention).

Like the encoder, the decoder uses positional encoding to preserve the order of tokens.

Final Output:

The decoder output is passed through a dense layer with softmax activation to generate a probability distribution over the target vocabulary, from which the next token can be sampled or predicted.

This architecture is suitable for sequence-to-sequence tasks, such as machine translation, where the model converts an input sequence (e.g., English) into an output sequence (e.g., French).

In [42]:
def transformer_decoder(inputs, enc_outputs, head_size,num_heads, ff_dim, dropout=0):
  # Attention and Normalization
  x = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads, dropout=dropout)(inputs, inputs)
  x = layers.Dropout(dropout)(x)
  x = layers.LayerNormalization(epsilon=le-6)(x)
  res = x + inputs

In [43]:
def build_model(input_vocab_size, target_size, max_length): # The parameter target_size is being passed
  inputs = keras.Input(shape=(None,), dtype="int64", name="inputs")
  dec_inputs = keras.Input(shape=(None,), dtype="int64", name="dec_inputs")


  # Encoder
  enc_padding_mask = keras.layers.Lambda(
      lambda x: keras.backend.cast(keras.backend.equal(x,0), keras.backend.floatx()))(inputs)
  enc_outputs = keras.layers.Embedding(input_vocab_size, 128)(inputs)
  # Use the custom PositionalEncoding class you defined
  enc_outputs = PositionalEncoding(max_length, 128)(enc_outputs)
  enc_outputs = transformer_encoder(enc_outputs, head_size=128, num_heads=8, ff_dim=512, dropout=0.1) # Changed num_head to num_heads
  enc_outputs = layers.Dropout(0.1)(enc_outputs)

  # Decoder
  look_ahead_mask = keras.layers.Lambda(
      lambda x: keras.backend.cast(keras.backend.equal(x,0), keras.backend.floatx()))(dec_inputs)
  dec_padding_mask = keras.layers.Lambda(
      lambda x: keras.backend.cast(keras.backend.equal(x,0), keras.backend.floatx()))(inputs)

  dec_outputs = keras.layers.Embedding(target_size, 128)(dec_inputs) # Changed target_vocab_size to target_size
  # Use the custom PositionalEncoding class you defined
  dec_outputs = PositionalEncoding(max_length, 128)(dec_outputs)
  dec_outputs = transformer_decoder(dec_outputs, enc_outputs, head_size=128, num_heads=8, ff_dim=512, dropout=0.1) # Changed num_head to num_heads
  dec_outputs = layers.Dropout(0.1)(dec_outputs)
  outputs = layers.Dense(target_size, activation="softmax")(dec_outputs) # Changed target_vocab_size to target_size

  model = keras.Model(inputs=[inputs, dec_inputs], outputs=outputs)
  return model

In [45]:
def transformer_decoder(inputs, enc_outputs, head_size,num_heads, ff_dim, dropout=0):
  # Attention and Normalization
  x = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads, dropout=dropout)(inputs, inputs)
  x = layers.Dropout(dropout)(x)
  x = layers.LayerNormalization(epsilon=1e-6)(x) # Changed le to 1e-6
  res = x + inputs
  # Feed Forward Part (Add this part from transformer_encoder)
  x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
  x = layers.Dropout(dropout)(x)
  x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
  x = layers.LayerNormalization(epsilon=1e-6)(x)
  return x + res

# Train the model

In [47]:
# Build and train the model
transformer_model = build_model(english_vocab_size, french_vocab_size, sequence_length)
transformer_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

english_sequence = english_tokenizer.texts_to_sequences(pair[0] for pair in text_pairs)
french_sequence = french_tokenizer.texts_to_sequences(pair[1] for pair in text_pairs)

english_sequence = keras.preprocessing.sequence.pad_sequences(english_sequence, maxlen=sequence_length, padding="post")
french_sequence = keras.preprocessing.sequence.pad_sequences(french_sequence, maxlen=sequence_length, padding="post")

# Remove the sample_weight parameter and keep only batch_size
transformer_model.fit([english_sequence, french_sequence[:, :-1]],
                    french_sequence[:, 1:],
                    batch_size=batch_size,
                    epochs=10)

Epoch 1/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m88s[0m 29ms/step - accuracy: 0.9900 - loss: 0.2315
Epoch 2/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m129s[0m 27ms/step - accuracy: 0.9982 - loss: 0.0104
Epoch 3/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m70s[0m 27ms/step - accuracy: 0.9987 - loss: 0.0069
Epoch 4/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m70s[0m 27ms/step - accuracy: 0.9990 - loss: 0.0050
Epoch 5/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 27ms/step - accuracy: 0.9992 - loss: 0.0039
Epoch 6/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 27ms/step - accuracy: 0.9993 - loss: 0.0032
Epoch 7/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m70s[0m 27ms/step - accuracy: 0.9995 - loss: 0.0024
Epoch 8/10
[1m2612/2612[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 27ms/step - accuracy: 0.9995 - loss: 0.0021
Epoch 9

<keras.src.callbacks.history.History at 0x7f6e00c0b490>

Data Preparation:

Convert English and French sentences into sequences of integers.

Pad the sequences to ensure consistent length.

Model Building:

Build the Transformer model with encoder and decoder components.

Model Compilation:

Compile the model using Adam optimizer and sparse categorical cross-entropy loss.

Model Training:

Train the model using the input English sequences and target French sequences for 10 epochs.