#Environment Setup and Hyperparameter Initialization

This part lays the groundwork for the project by preparing the environment, defining dataset paths, and configuring key parameters for the image captioning model.

###Mount Google Drive

We mount Google Drive so the datasets and supporting files can be read directly from cloud storage. Adjust the mount point and the dataset paths below to match where your copy of the dataset is stored.

###Library Imports

* Core libraries like **os** (file system operations), **re** (regular expressions), **numpy** (numerical computations), and **tensorflow/keras** (deep learning framework) are imported.

* Additional tools like **sklearn** are imported for splitting datasets into training and testing subsets.

###Image and Sequence Configuration

* **IMAGE_SIZE:** Images are resized to dimensions of (299, 299) to standardize inputs for the neural network.

* **SEQ_LENGTH:** Sets a fixed maximum length of 25 tokens for textual sequences (captions). This ensures consistency across data batches.

###Vocabulary and Embedding Dimensions

* **VOCAB_SIZE:** Limits the vocabulary to 10,000 words, which keeps the language model manageable while leaving enough variety for meaningful captions.

* **EMBED_DIM:** Sets the embedding dimension for both image and text features to 512, so the model can learn compact yet expressive representations.

###Transformer Model Parameters

* **FF_DIM:** Sets the size of the feed-forward network within the transformer to 512 units, balancing model capacity against computational cost.

* **NUM_HEADS:** Sets the number of attention heads used by the multi-head attention layers in the transformer to 2.

###Training Parameters:

* **BATCH_SIZE:** Sets the number of samples processed together in each training step to 256.

* **EPOCHS:** Sets the number of complete passes through the dataset for model training, here defined as 30.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras import layers
from keras.applications import efficientnet
from keras.layers import TextVectorization
from keras.utils import register_keras_serializable
from sklearn.model_selection import train_test_split

# Define constants
IMAGES_PATH = "/content/drive/MyDrive/Colab/flickr30k_images"
CAPTIONS_PATH = "/content/drive/MyDrive/Colab/results.csv"

# Desired image dimensions
IMAGE_SIZE = (299, 299)

# Fixed length allowed for any sequence
SEQ_LENGTH = 25

# Vocabulary size
VOCAB_SIZE = 10000

# Dimension for the image embeddings and token embeddings
EMBED_DIM = 512

# Number of units in the feed-forward network
FF_DIM = 512

# Number of attention heads
NUM_HEADS = 2

# Batch size
BATCH_SIZE = 256

# Number of epochs
EPOCHS = 30


#Data Loading, Preprocessing, and Splitting

In this section, we have focused on:

* Loading and preprocessing the captions dataset to prepare it for model training.
* Filtering and validating the data to ensure captions meet the specified length constraints.
* Splitting the dataset into training, validation, and test subsets to enable robust model evaluation.

###(load_captions_data) Function:

* **Purpose:** To load and preprocess the captions dataset from a file.

* **Process:** Opens the captions file specified by filename and reads its lines, skipping the header.

 *Initializes:*

 * **(caption_mapping):** A dictionary mapping image file paths to their corresponding captions.
 * **(text_data):** A list of all processed captions.
 * **(images_to_skip):** A set to track images that should be excluded based on caption length criteria.

 *Iterates over each line in the file:*

 * Splits the line on the **"| "** separator into the image name, a caption index, and the caption. Lines that yield only two fields are handled by a fallback branch.
 * Filters out captions that are too short (< 4 tokens) or too long **(> SEQ_LENGTH)** and adds such images to the **images_to_skip** set.
 * Appends captions to **text_data** after adding <**start**> and <**end**> tokens to mark the beginning and end of the captions.
 * Updates **caption_mapping** by appending captions to the corresponding image path.

 *Removes skipped images from caption_mapping to ensure only valid entries remain.*

* **Returns:**

 * **caption_mapping:** Dictionary mapping image paths to captions.

 * **text_data:** List of processed captions.
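
The parsing and length filter described above can be sketched on a few in-memory lines. This is a minimal illustration rather than the notebook's loader: the sample lines and the `keep_caption` helper are hypothetical, and the three-field `image| index| caption` layout is assumed from the way the loader splits each line.

```python
SEQ_LENGTH = 25  # same limit as in the configuration cell

def keep_caption(caption, min_tokens=4, max_tokens=SEQ_LENGTH):
    """Return True when the whitespace-token count is inside the allowed range."""
    n_tokens = len(caption.strip().split())
    return min_tokens <= n_tokens <= max_tokens

# Hypothetical lines in the assumed "image| index| caption" layout.
sample_lines = [
    "img_001.jpg| 0| Two dogs run across a grassy field .",
    "img_002.jpg| 0| A man .",
]

kept = {}
for line in sample_lines:
    img_name, _, caption = line.split("| ")
    if keep_caption(caption):
        # Wrap accepted captions in the start and end markers.
        kept.setdefault(img_name, []).append("<start> " + caption.strip() + " <end>")

print(kept)
```

Unlike this sketch, the loader also discards every other caption of an image once one of its captions fails the length check.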

###(train_val_split) Function:

* **Purpose:** To split the dataset into training, validation, and test subsets.

* **Process:**

 * Extracts all image keys (all_images) from caption_data.
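 * Optionally shuffles the image keys for randomization. This shuffle is not seeded, so the subsets can differ between runs even though **train_test_split** is called with **random_state=42**.
 * Splits the dataset in two stages: first into training and held-out sets based on **validation_size**, then the held-out set into validation and test subsets based on **test_size**. Because **test_size** is a fraction of the held-out set, the defaults (0.2 and 0.02) put about 0.4% of all images in the test set.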
 * Creates new dictionaries (**training_data, validation_data, and test_data**) mapping image paths to captions for each subset.

* **Returns:**

 * **training_data:** Training subset.
 * **validation_data:** Validation subset.
 * **test_data:** Test subset.
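
To make the two-stage arithmetic concrete, the sketch below estimates the subset sizes for a given number of images. The `split_sizes` helper is hypothetical, and rounding the held-out and test counts up is an assumption about how the splitter resolves fractional sizes, so treat the numbers as approximate.

```python
import math

def split_sizes(n_images, validation_size=0.2, test_size=0.02):
    """Approximate subset sizes for the two-stage split (rounding rule is an assumption)."""
    held_out = math.ceil(n_images * validation_size)  # first split: held-out portion
    n_train = n_images - held_out
    n_test = math.ceil(held_out * test_size)          # second split: test carved from held-out
    n_val = held_out - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)
```

With the default fractions, the test subset is a small slice of the held-out images rather than 2% of the whole dataset.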

###Loading and Splitting the Dataset:

* Calls **load_captions_data** to load the captions and map them to images.

* Calls **train_val_split** to partition the dataset into training, validation, and test sets.

* Prints the total number of samples and the count for each subset.

In [None]:
def load_captions_data(filename):
    with open(filename, encoding='utf-8') as caption_file:
        caption_data = caption_file.readlines()[1:]
        caption_mapping = {}
        text_data = []
        images_to_skip = set()

        for line in caption_data:
            line = line.rstrip("\n")
            try:
                img_name, _, caption = line.split("| ")
            except ValueError:
                img_name, caption = line.split("| ")
                caption = caption[4:]

            img_name = os.path.join(IMAGES_PATH, img_name.strip())
            tokens = caption.strip().split()
            if len(tokens) < 4 or len(tokens) > SEQ_LENGTH:
                images_to_skip.add(img_name)
                continue

            if img_name.endswith("jpg") and img_name not in images_to_skip:
                caption = "<start> " + caption.strip() + " <end>"
                text_data.append(caption)

                if img_name in caption_mapping:
                    caption_mapping[img_name].append(caption)
                else:
                    caption_mapping[img_name] = [caption]

        for img_name in images_to_skip:
            if img_name in caption_mapping:
                del caption_mapping[img_name]

        return caption_mapping, text_data

def train_val_split(caption_data, validation_size=0.2, test_size=0.02, shuffle=True):
    all_images = list(caption_data.keys())
    if shuffle:
        np.random.shuffle(all_images)

    train_keys, validation_keys = train_test_split(
        all_images, test_size=validation_size, random_state=42
    )
    validation_keys, test_keys = train_test_split(
        validation_keys, test_size=test_size, random_state=42
    )

    training_data = {img_name: caption_data[img_name] for img_name in train_keys}
    validation_data = {img_name: caption_data[img_name] for img_name in validation_keys}
    test_data = {img_name: caption_data[img_name] for img_name in test_keys}

    return training_data, validation_data, test_data

# Load the dataset
captions_mapping, text_data = load_captions_data(CAPTIONS_PATH)

# Split the dataset
train_data, validation_data, test_data = train_val_split(captions_mapping)

print(f"Total samples: {len(captions_mapping)}")
print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(validation_data)}")
print(f"Test samples: {len(test_data)}")

#Text Vectorization and Custom Preprocessing

In this section, we have focused on:

* Defining a custom preprocessing pipeline to clean and standardize the captions, ensuring they are consistent and free from unnecessary characters.

* Configuring a **TextVectorization** layer to tokenize the captions and convert them into fixed-length integer sequences.

* Adapting the vectorizer to the processed text data to generate a vocabulary and prepare the captions for input into the model.

###(custom_standardization) Function:

* **Purpose:** To preprocess and clean text input by standardizing it before tokenization.

* **Process:**

 * Converts the input string to lowercase using **tf.strings.lower()** to ensure uniformity.
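 * Defines a set of characters to strip **(strip_chars)**, which includes punctuation, special characters, and digits. The characters **<** and **>** are left out of this set so the **<start>** and **<end>** markers survive cleaning.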
 * Uses **tf.strings.regex_replace()** to remove all occurrences of the specified **strip_chars** from the input text.

* **Returns:** A cleaned, lowercase version of the input text with specified characters removed.
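
The same cleaning rule can be tried outside TensorFlow with Python's `re` module. This is a sketch for experimentation only: `standardize` is a hypothetical stand-in for `custom_standardization`, and it assumes Python's regex engine treats this character class the same way as the one TensorFlow uses.

```python
import re

# Same characters as the notebook's strip_chars; "<" and ">" are intentionally absent.
STRIP_CHARS = "!\"#$%&'()*+,-./:;=?@[\\]^_`{|}~1234567890"
STRIP_PATTERN = re.compile("[%s]" % re.escape(STRIP_CHARS))

def standardize(text):
    """Lowercase the text and delete every character listed in STRIP_CHARS."""
    return STRIP_PATTERN.sub("", text.lower())

cleaned = standardize("<start> A man, aged 25, rides a bike! <end>")
print(cleaned.split())
```

Removing the digits leaves a doubled space in the cleaned string, which is harmless because tokenization splits on whitespace.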

###Text Vectorizer (TextVectorization Layer):

* **Purpose:** Converts raw text into sequences of integers for model training.

* **Configuration:**

 * **max_tokens:** Limits the vocabulary size to **VOCAB_SIZE** (10,000 words) to control complexity.
 * **output_mode:** Set to **"int"**, so text is tokenized and mapped to integer sequences.
 * **output_sequence_length:** Pads or truncates sequences to a fixed length of **SEQ_LENGTH** (25 tokens) for uniformity.

 * **standardize:** Applies the custom preprocessing function **(custom_standardization)** to clean the text before tokenization.

###Adapting the Vectorizer:

* **Purpose:** To build the vocabulary for the text data.

* **Process:**

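 * **vectorization.adapt(text_data)** is called, where **text_data** is the list of captions, wrapped in start and end tokens, returned by **load_captions_data**. Cleaning happens inside the layer through **custom_standardization**.
 * This step analyzes the text data and builds a vocabulary of the most frequent words, capped at **VOCAB_SIZE** entries including the reserved padding and out-of-vocabulary tokens, and assigns an integer index to each word.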
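
The effect of adapting and then vectorizing can be imitated in plain Python. The sketch below is not the Keras layer: `build_vocab` and `vectorize` are hypothetical helpers, and they assume index 0 is reserved for padding and index 1 for unknown words, with the remaining ids assigned by descending frequency.

```python
from collections import Counter

def build_vocab(captions, max_tokens):
    """Most frequent words first; index 0 is padding and index 1 is the unknown token."""
    counts = Counter(word for caption in captions for word in caption.split())
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(words)}

def vectorize(caption, vocab, sequence_length):
    """Map words to ids, truncate to sequence_length, then right-pad with 0."""
    ids = [vocab.get(word, 1) for word in caption.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

captions = ["<start> a dog runs <end>", "<start> a cat sleeps <end>"]
vocab = build_vocab(captions, max_tokens=6)
encoded = vectorize("<start> a bird runs <end>", vocab, sequence_length=8)
print(vocab)
print(encoded)
```

Words that were never seen, or that fall outside the size cap, map to the unknown id, and short captions are padded with zeros up to the fixed length.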

In [None]:
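def custom_standardization(input_string):
    # Lowercase first so the vocabulary is case-insensitive.
    lowercase = tf.strings.lower(input_string)
    # Punctuation and digits to delete; "<" and ">" are left out so <start>/<end> survive.
    strip_chars = "!\"#$%&'()*+,-./:;=?@[\\]^_`{|}~1234567890"
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")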

# Define the vectorizer
vectorization = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=SEQ_LENGTH,
    standardize=custom_standardization
)

# Adapt the vectorizer to the text data
vectorization.adapt(text_data)

# Image Augmentation Pipeline

In this section, we have created an image augmentation pipeline to:

* Enhance training data diversity by simulating real-world variations in images.

* Improve the robustness and generalization of the image captioning model by exposing it to transformed versions of the same images.



###(image_augmentation) Definition:

* **Purpose:** To enhance the training dataset by applying random transformations to images, improving the model's ability to generalize to new data.

* **Process:** A sequential model is created using **keras.Sequential** to apply a series of image augmentations.

 *Augmentation Layers:*

  * **layers.RandomFlip("horizontal"):** Randomly flips images horizontally, simulating different orientations.
  * **layers.RandomRotation(0.2):** Randomly rotates images by up to ±20% of a full turn (about ±72 degrees), simulating various angles of view.
  * **layers.RandomContrast(0.3):** Adjusts image contrast randomly by up to ±30%, simulating varying lighting conditions.

* **Integration:** The **image_augmentation** pipeline can be applied to images during training to increase diversity in the dataset and make the model more robust to variations in image orientation, rotation, and contrast.
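
The two numeric factors can be unpacked with a little arithmetic. The helpers below are hypothetical and only mirror the documented behaviour as we understand it: the rotation factor is read as a fraction of a full turn, and contrast adjustment is assumed to scale each pixel's distance from the mean by a factor drawn from [1 - 0.3, 1 + 0.3].

```python
def rotation_range_degrees(factor):
    """Interpret the rotation factor as a fraction of a full turn (360 degrees)."""
    return -factor * 360.0, factor * 360.0

def adjust_contrast(pixels, contrast_factor):
    """Scale each pixel's distance from the mean by contrast_factor."""
    mean = sum(pixels) / len(pixels)
    return [(p - mean) * contrast_factor + mean for p in pixels]

low_deg, high_deg = rotation_range_degrees(0.2)
stretched = adjust_contrast([50.0, 100.0, 150.0], 1.3)  # strongest increase for factor 0.3
print(low_deg, high_deg)
print(stretched)
```

A contrast factor above 1 pushes pixel values away from the mean, while a factor below 1 pulls them toward it.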

In [None]:
image_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.2),
    layers.RandomContrast(0.3)
])

#Custom Transformer Encoder Block

In this section, we have implemented a custom transformer encoder block to:

* Extract high-level contextual features from sequential input data using attention mechanisms.

* Enhance model performance by enabling it to focus on relevant parts of the sequence dynamically.

* Integrate normalization and feed-forward layers for improved training stability and expressive power.

###Defining TransformerEncoderBlock:

* **Purpose:** Implements a custom transformer encoder block for processing sequential data, such as image embeddings or text token embeddings.

* **@register_keras_serializable:** Registers the layer as a custom Keras serializable object, allowing the model to save and reload this layer seamlessly.

###init Method:

* Initializes the transformer encoder block by defining the following sub-layers:

 **MultiHeadAttention (self.attention_1):** A multi-head attention mechanism with:
 * **num_heads:** Number of attention heads (defined as **self.num_heads**).
 * **key_dim:** Size of each attention head for queries and keys (set to **self.embed_dim**, so every head works at the full embedding width).
 * Allows the model to focus on different parts of the input sequence simultaneously.

 **LayerNormalization (self.layernorm_1 and self.layernorm_2):** Normalizes inputs to stabilize training and prevent gradient explosion/vanishing.

 **Dense Layer (self.dense_1):** A dense layer with ReLU activation applied to the normalized input. Its output size is **self.embed_dim**; the **dense_dim** argument is stored for serialization but is not used to size any layer.
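
Because **key_dim** is a per-head size, setting it to the full embedding width makes the attention layer larger than splitting the width across heads would. The sketch below estimates the weight count under stated assumptions: biases are enabled, the value size equals `key_dim`, and the output is projected back to the input width. `mha_param_count` is a hypothetical helper, not a Keras function.

```python
def mha_param_count(input_dim, num_heads, key_dim):
    """Weights in a multi-head attention layer whose value size equals key_dim."""
    per_projection = input_dim * num_heads * key_dim + num_heads * key_dim  # kernel + bias
    qkv = 3 * per_projection                               # query, key and value projections
    output = num_heads * key_dim * input_dim + input_dim   # projection back to input_dim
    return qkv + output

full_width = mha_param_count(input_dim=512, num_heads=2, key_dim=512)
split_width = mha_param_count(input_dim=512, num_heads=2, key_dim=256)
print(full_width, split_width)
```

Under these assumptions the notebook's setting uses about twice as many attention weights as a configuration where the two heads share the 512 dimensions.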

###call Method:

* **Purpose:** Defines the forward pass of the encoder block.

* **Process:**
 1. Normalizes the input embeddings using **self.layernorm_1**.
 2. Passes the normalized inputs through the dense layer **(self.dense_1)**.
 3. Applies multi-head attention using **self.attention_1**, where the queries, keys, and values are all derived from the dense layer outputs.
 4. Combines the dense layer output with the attention output (via skip connection) and applies a second normalization **(self.layernorm_2)**.

* **Returns:** The final output of the encoder block.

###get_config Method:

* **Purpose:** Ensures the custom layer can be serialized and deserialized.

* **Process:** Updates the configuration dictionary with the block's parameters: **embed_dim, dense_dim,** and **num_heads**.

In [None]:
@register_keras_serializable()
class TransformerEncoderBlock(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super(TransformerEncoderBlock, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

        # Initialize sub-layers in __init__
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.dense_1 = layers.Dense(self.embed_dim, activation="relu")

    def call(self, inputs, training, mask=None):
        inputs_norm = self.layernorm_1(inputs)
        inputs_dense = self.dense_1(inputs_norm)
        attention_output = self.attention_1(
            query=inputs_dense, value=inputs_dense, key=inputs_dense, training=training
        )
        out = self.layernorm_2(inputs_dense + attention_output)
        return out

    def get_config(self):
        config = super(TransformerEncoderBlock, self).get_config()
        config.update({
            'embed_dim': self.embed_dim,
            'dense_dim': self.dense_dim,
            'num_heads': self.num_heads,
        })
        return config

#Positional Embedding Layer for Token and Sequence Representation

In this section, we have implemented a **PositionalEmbedding** layer to:

* Embed input tokens (words) as dense vectors.

* Incorporate positional information to preserve sequence order in the transformer model.

* Scale embeddings for numerical stability and compute masks to handle padding tokens.

###Defining PositionalEmbedding:

* **Purpose:** Implements a custom layer to combine token embeddings and positional embeddings, which are critical for maintaining sequence order in transformer models.

* **@register_keras_serializable:** Registers this layer as a custom Keras serializable component, ensuring compatibility for saving and loading models.


###init Method:

* **Components:**

 * **token_embeddings:** An embedding layer that converts input tokens (words) into dense vectors of size **embed_dim**.
 * **position_embeddings:** An embedding layer that assigns a positional vector to each token position in the sequence, with the same size **(embed_dim)**.
 * **Parameters:**  
  
   1. **sequence_length:** Maximum number of tokens in a sequence.
   2. **vocab_size:** Total number of unique tokens in the vocabulary.
   3. **embed_dim:** Dimensionality of the embedding vectors.

###call Method:

* **Purpose:** Combines token embeddings with positional embeddings to incorporate sequence order information.

* **Process:**
 1. Calculates the sequence length dynamically from the input using **tf.shape(inputs)[-1]**.
 2. Generates a range of positions **(positions)** for the tokens in the sequence (e.g., [0, 1, 2, ...]).
 3. Applies the token embedding layer **(self.token_embeddings)** to convert input tokens into dense vectors and scales the embeddings by the square root of **embed_dim (embed_scale = sqrt(embed_dim))** for stability.
 4. Generates positional embeddings **(self.position_embeddings)** for the positions.
 5. Combines the scaled token embeddings and positional embeddings via element-wise addition.

* **Returns:** A tensor of shape **(batch_size, sequence_length, embed_dim)** containing the combined embeddings.
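
The scale-and-add step can be followed by hand on tiny vectors. The sketch below uses hypothetical 4-dimensional embeddings in plain Python lists; `combine_embeddings` is an illustrative helper, not part of the layer.

```python
import math

def combine_embeddings(token_vectors, position_vectors):
    """Scale token vectors by sqrt(embed_dim), then add the position vector at each step."""
    embed_dim = len(token_vectors[0])
    scale = math.sqrt(embed_dim)
    return [
        [tok * scale + pos for tok, pos in zip(token_vec, position_vec)]
        for token_vec, position_vec in zip(token_vectors, position_vectors)
    ]

# Hypothetical 4-dimensional embeddings for a two-token sequence.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.5, 0.5]]
positions = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
combined = combine_embeddings(tokens, positions)
model_scale = math.sqrt(512)  # scale factor for EMBED_DIM = 512
print(combined)
print(round(model_scale, 3))
```

With **EMBED_DIM** set to 512 the token embeddings are multiplied by roughly 22.6 before the positional vectors are added.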

###compute_mask Method:

* **Purpose:** Computes a mask to ignore padding tokens (tokens with value 0) during training.

* **Process:** Uses **tf.math.not_equal(inputs, 0)** to generate a boolean mask indicating which tokens are non-padding tokens.

###get_config Method:

* **Purpose:** Ensures the layer's parameters are included for serialization.

* **Process:** Updates the configuration dictionary with **sequence_length, vocab_size,** and **embed_dim**.

In [None]:
@register_keras_serializable()
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        self.token_embeddings = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=embed_dim)
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        # The embedding scale, sqrt(embed_dim), is computed in call() rather than stored here.

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embed_dim = tf.cast(self.embed_dim, tf.float32)
        embed_scale = tf.math.sqrt(embed_dim)
        embedded_tokens = self.token_embeddings(inputs) * embed_scale
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            'sequence_length': self.sequence_length,
            'vocab_size': self.vocab_size,
            'embed_dim': self.embed_dim,
        })
        return config

#Custom Transformer Decoder Block

In this section, we have built a custom transformer decoder block to:

* Process input sequences with attention mechanisms that integrate encoded inputs and prior outputs.

* Maintain autoregressive behavior using causal masking.

* Generate token probabilities for caption generation in the image captioning model.

###Defining TransformerDecoderBlock:

* **Purpose:** Implements a custom transformer decoder block for generating sequential outputs, such as text captions, by leveraging encoded input and prior context.

* **@register_keras_serializable:** Registers this layer as a custom Keras serializable component, ensuring compatibility for model saving and loading.

###init Method:

* Initializes the decoder block with the following components:

 **Attention Layers:**

 * **attention_1:** A self-attention mechanism allowing the model to focus on previously generated tokens.
 * **cross_attention_2:** A cross-attention mechanism that incorporates information from encoder outputs.

 **Feed-Forward Network:**

 * **ffn_layer_1** and **ffn_layer_2:** Dense layers for transforming and refining intermediate representations.

 **Normalization Layers:**

 * **layernorm_1, layernorm_2, layernorm_3:** Layer normalization to stabilize training and improve convergence.

 **Positional Embedding:**

 * **embedding:** Adds token and positional embeddings to input tokens for preserving sequence order.

 **Output Layer:**

 * **out:** A dense layer with a **softmax** activation that maps the output to probabilities over the vocabulary.

 **Dropout:**

 * **dropout_1** and **dropout_2:** Dropout layers to prevent overfitting.

 **Support for Masking:**

 * Enables handling of padding and causal masks to ensure proper attention behavior during training.

###call Method:

* **Purpose:** Defines the forward pass for the decoder block.

* **Process:**

 1. **Embedding:** Converts input tokens into dense embeddings using **self.embedding**.
 2. **Causal Mask:** Generates a causal mask **(get_causal_attention_mask)** to ensure attention only attends to prior tokens, preserving autoregressive behavior.
 3. **Self-Attention:** Applies self-attention **(self.attention_1)** to allow the model to focus on previously generated tokens. Then combines the result with the input embeddings using a residual connection and normalizes with **layernorm_1**.
 4. **Cross-Attention:** Applies cross-attention **(self.cross_attention_2)** using the encoder outputs, allowing the decoder to leverage encoded input features. After that, adds the result to the output of the self-attention layer and normalizes with **layernorm_2**.
 5. **Feed-Forward Network:** Passes the output through **ffn_layer_1**, applies **dropout_1**, and then passes it through **ffn_layer_2**. The result is added to the cross-attention output via a residual connection, normalized with **layernorm_3**, and passed through **dropout_2**.
 6. **Output:** Passes the refined output through the dense output layer **(self.out)** with a **softmax** activation to generate token probabilities.

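* **Returns:** A tensor of shape **(batch_size, sequence_length, VOCAB_SIZE)** holding, for each position, a probability distribution over the vocabulary for the token that follows.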

###get_causal_attention_mask Method:

* **Purpose:** Creates a causal mask to prevent attention to future tokens.

* **Process:**

 * Constructs a lower triangular matrix where each token can only attend to itself and previous tokens.
 * Expands the mask to match the batch size and sequence dimensions.
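
For a single sequence, the causal mask and its combination with a padding mask can be written out with nested lists. This is a plain-Python sketch of the idea; `causal_mask` and `combine_with_padding` are hypothetical helpers, and the padding pattern is made up for illustration.

```python
def causal_mask(sequence_length):
    """Lower-triangular matrix: row i (query) may attend to columns j <= i (keys)."""
    return [[1 if i >= j else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

def combine_with_padding(causal, padding):
    """Element-wise minimum, so a key is visible only if it is both allowed and real."""
    return [[min(cell, padding[j]) for j, cell in enumerate(row)] for row in causal]

mask = causal_mask(4)
padding = [1, 1, 1, 0]  # hypothetical sequence whose last position is padding
combined_mask = combine_with_padding(mask, padding)
for row in combined_mask:
    print(row)
```

Each row lists the positions one query may attend to; the padded last column is zeroed out in every row.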

###get_config Method:

* **Purpose:** Ensures all parameters are serializable for model saving and loading.

* **Process:** Updates the configuration dictionary with **embed_dim, ff_dim,** and **num_heads**.

In [None]:
@register_keras_serializable()
class TransformerDecoderBlock(layers.Layer):
    def __init__(self, embed_dim, ff_dim, num_heads, **kwargs):
        super(TransformerDecoderBlock, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.ff_dim = ff_dim
        self.num_heads = num_heads

        # Initialize sub-layers in __init__
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.cross_attention_2 = layers.MultiHeadAttention(
            num_heads=self.num_heads, key_dim=self.embed_dim
        )
        self.ffn_layer_1 = layers.Dense(self.ff_dim, activation="relu")
        self.ffn_layer_2 = layers.Dense(self.embed_dim)
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.embedding = PositionalEmbedding(
            sequence_length=SEQ_LENGTH,
            vocab_size=VOCAB_SIZE,
            embed_dim=EMBED_DIM
        )
        self.out = layers.Dense(VOCAB_SIZE, activation="softmax")
        self.dropout_1 = layers.Dropout(0.3)
        self.dropout_2 = layers.Dropout(0.5)
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, training, mask=None):
        inputs_embedded = self.embedding(inputs)
        causal_mask = self.get_causal_attention_mask(inputs_embedded)

        if mask is not None:
            padding_mask = tf.cast(mask[:, :, tf.newaxis], dtype=tf.int32)
            combined_mask = tf.cast(mask[:, tf.newaxis, :], dtype=tf.int32)
            combined_mask = tf.minimum(combined_mask, causal_mask)
        else:
            combined_mask = causal_mask

        attention_output = self.attention_1(
            query=inputs_embedded,
            value=inputs_embedded,
            key=inputs_embedded,
            attention_mask=combined_mask,
            training=training
        )
        out1 = self.layernorm_1(inputs_embedded + attention_output)

        cross_attention_output = self.cross_attention_2(
            query=out1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask if mask is not None else None,
            training=training
        )
        out2 = self.layernorm_2(out1 + cross_attention_output)

        ffn_out = self.ffn_layer_1(out2)
        ffn_out = self.dropout_1(ffn_out, training=training)
        ffn_out = self.ffn_layer_2(ffn_out)

        ffn_out = self.layernorm_3(ffn_out + out2)
        ffn_out = self.dropout_2(ffn_out, training=training)

        preds = self.out(ffn_out)
        return preds

    def get_causal_attention_mask(self, inputs):
        batch_size, sequence_length = tf.shape(inputs)[0], tf.shape(inputs)[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, sequence_length, sequence_length))
        return tf.tile(mask, [batch_size, 1, 1])

    def get_config(self):
        config = super(TransformerDecoderBlock, self).get_config()
        config.update({
            'embed_dim': self.embed_dim,
            'ff_dim': self.ff_dim,
            'num_heads': self.num_heads,
        })
        return config

#Feature Extraction with Pre-trained CNN Model

In this section, we have defined and instantiated a pre-trained CNN model for:

* Extracting meaningful, high-level features from input images using EfficientNetB0.

* Preparing image embeddings that will be processed by the transformer-based captioning model.


###get_cnn_model Function:

* **Purpose:** Creates a Convolutional Neural Network (CNN) model for extracting high-level image features to be used as inputs for the image captioning transformer model.

* **Process:**

 *Base Model:*

  * Uses a pre-trained **EfficientNetB0** model from **keras.applications.efficientnet.**
  * Configures the base model:

   1. **input_shape:** Sets the input shape to **(*IMAGE_SIZE, 3)** (299x299 images with 3 color channels).
   2. **include_top=False:** Excludes the fully connected classification head to focus only on feature extraction.
   3. **weights="imagenet":** Initializes the model with weights pre-trained on the ImageNet dataset for leveraging learned features.

 *Freezing the Base Model:*

  * Sets **base_model.trainable = False** to freeze all layers of the EfficientNetB0 model, preventing weight updates during training and reducing computational load.

 *Reshaping the Output:*

  * Reshapes the output of the CNN to **(-1, base_model_out.shape[-1])**. **-1** flattens the spatial dimensions (height and width). **base_model_out.shape[-1]** retains the feature depth (channel dimension).

 *Model Definition:*

  * Creates the CNN model with **keras.models.Model**, specifying the base model's input and the reshaped output.

* **Instantiation:**

 * **cnn_model:** Calls **get_cnn_model()** to instantiate the CNN model, which will be used to extract image features for subsequent processing in the transformer model.
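
The shape that reaches the encoder can be estimated without loading the network. The sketch below rests on two assumptions that are not stated in the notebook: the backbone halves the spatial size five times, rounding up each time, and its final feature map has 1280 channels. `feature_grid` is a hypothetical helper.

```python
import math

def feature_grid(image_size, num_halvings=5):
    """Spatial size after repeated stride-2 stages, assuming each one rounds up."""
    height, width = image_size
    for _ in range(num_halvings):
        height, width = math.ceil(height / 2), math.ceil(width / 2)
    return height, width

IMAGE_SIZE = (299, 299)
CHANNELS = 1280  # assumed channel depth of the backbone's final feature map

grid_h, grid_w = feature_grid(IMAGE_SIZE)
sequence_shape = (grid_h * grid_w, CHANNELS)
print(sequence_shape)
```

Under these assumptions each image becomes a sequence of 100 feature vectors, which the encoder's dense layer then projects to **EMBED_DIM**. Calling `cnn_model.summary()` shows the actual output shape.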

In [None]:
def get_cnn_model():
    base_model = efficientnet.EfficientNetB0(
        input_shape=(*IMAGE_SIZE, 3),
        include_top=False,
        weights="imagenet"
    )
    base_model.trainable = False
    base_model_out = base_model.output
    base_model_out = layers.Reshape((-1, base_model_out.shape[-1]))(base_model_out)
    cnn_model = keras.models.Model(base_model.input, base_model_out)
    return cnn_model

cnn_model = get_cnn_model()

#Instantiating Transformer Encoder and Decoder Blocks

In this section, we have instantiated the core components of the transformer-based architecture:

* **Encoder:** Processes and encodes image embeddings into meaningful feature representations.

* **Decoder:** Decodes the sequence embeddings, integrating the encoder's outputs to predict captions token by token.

###Encoder Initialization:

* **TransformerEncoderBlock:** Instantiates the custom encoder block with the following parameters:

 * **embed_dim:** Embedding dimensionality set to **EMBED_DIM** (512).
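 * **dense_dim:** Set to **FF_DIM** (512). The block stores this value for serialization, but its single dense layer is sized by **embed_dim**, so **dense_dim** does not change the layer's shape.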
 * **num_heads:** Number of attention heads in the multi-head attention mechanism set to **NUM_HEADS** (2).

The encoder processes image embeddings, enabling the model to extract high-level contextual representations through a dense projection followed by multi-head attention.

###Decoder Initialization:

* **TransformerDecoderBlock:** Instantiates the custom decoder block with the same parameters:

 * **embed_dim:** Embedding dimensionality set to **EMBED_DIM** (512).
 * **ff_dim:** Dimensionality of the feed-forward network set to **FF_DIM** (512).
 * **num_heads:** Number of attention heads in the self-attention and cross-attention mechanisms set to **NUM_HEADS** (2).

The decoder processes sequence embeddings (token embeddings) and incorporates encoder outputs to generate meaningful captions.

In [None]:
encoder = TransformerEncoderBlock(embed_dim=EMBED_DIM, dense_dim=FF_DIM, num_heads=NUM_HEADS)
decoder = TransformerDecoderBlock(embed_dim=EMBED_DIM, ff_dim=FF_DIM, num_heads=NUM_HEADS)

#Image Captioning Model Implementation

In this section, we have implemented the complete image captioning model, integrating:

* **Image Processing:** Feature extraction with the CNN.

* **Sequence Modeling:** Image embedding processing with the encoder and caption generation with the decoder.

* **Training/Evaluation Workflow:** Loss, accuracy calculation, and training/testing steps.

* **Serialization:** Mechanisms for saving and loading the model configuration and weights.

###Defining ImageCaptioningModel:

* **Purpose:** Implements a custom Keras model for end-to-end image captioning by combining the CNN, encoder, and decoder components with loss and accuracy tracking.

###Initialization (init Method):

* **cnn_model:** Pre-trained CNN for feature extraction.

* **encoder:** Transformer encoder to process image embeddings.

* **decoder:** Transformer decoder to generate captions.

* **image_aug:** Optional image augmentation pipeline.

* **Metrics:**

 * **loss_tracker:** Tracks the mean loss during training/testing.
 * **acc_tracker:** Tracks the mean accuracy during training/testing.

###Forward Pass (call Method):

* **Inputs:** A batch of images **(batch_img)** and their corresponding token sequences **(batch_seq)**.

* **Process:**

 1. Applies image augmentation **(image_aug)** during training, if provided.
 2. Extracts image features using **cnn_model**.
 3. Passes image embeddings through the encoder to generate **encoder_out**.
 4. Splits input sequences into:
   * **batch_seq_inp:** Input tokens for the decoder (all but the last token).
   * **batch_seq_true:** Ground truth tokens (all but the first token).
 5. Computes a mask for padding tokens in **batch_seq_inp**.
 6. Passes input tokens and **encoder_out** through the decoder to generate predictions **(batch_seq_pred)**.
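
The split in step 4 and the mask in step 5 can be illustrated on a toy batch. This is a minimal NumPy sketch, not the model code itself; the token ids are made up, and 0 is assumed to be the padding id, as in the **call** method.

```python
import numpy as np

# Toy batch of two tokenized captions, padded with 0 (ids are invented).
batch_seq = np.array([
    [2, 7, 9, 3, 0, 0],   # e.g. <start> w1 w2 <end> pad pad
    [2, 5, 3, 0, 0, 0],   # e.g. <start> w1 <end> pad pad pad
])

batch_seq_inp = batch_seq[:, :-1]   # decoder input: all but the last token
batch_seq_true = batch_seq[:, 1:]   # targets: all but the first token
mask = batch_seq_inp != 0           # True where the decoder input is a real token

print(batch_seq_inp)
print(batch_seq_true)
print(mask.astype(int))
```

At every position the decoder sees the tokens up to that point and is trained to predict the next one, which is why the targets are the inputs shifted by one step.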

###Loss Calculation (calculate_loss Method):

* **Purpose:** Computes loss while ignoring padding tokens.

* **Process:**

 * Uses the model's compiled loss function.
 * Multiplies the loss by the mask to ignore padding tokens.
 * Averages the loss over valid tokens.
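
As a rough illustration of the masking arithmetic, the sketch below averages a per-token loss over valid positions only. It uses NumPy instead of TensorFlow, and the helper name `masked_mean` and the loss values are assumptions made for this example.

```python
import numpy as np

def masked_mean(per_token_loss, mask):
    """Average per-token losses over non-padding positions only."""
    mask = mask.astype(per_token_loss.dtype)
    return (per_token_loss * mask).sum() / mask.sum()

# Invented per-token losses; the 9.0 entries sit on padding positions.
per_token_loss = np.array([[0.5, 1.5, 9.0],
                           [2.0, 9.0, 9.0]])
mask = np.array([[True, True, False],
                 [True, False, False]])

loss = masked_mean(per_token_loss, mask)
print(loss)  # (0.5 + 1.5 + 2.0) / 3
```

Dividing by the number of valid tokens rather than by the full tensor size keeps heavily padded batches from reporting an artificially small loss.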

###Accuracy Calculation (calculate_accuracy Method):

* **Purpose:** Computes sequence accuracy while ignoring padding tokens.

* **Process:**

 * Compares predicted tokens **(y_pred)** with ground truth tokens **(y_true)**.
 * Applies the mask to ignore padding tokens.
 * Averages accuracy over valid tokens.
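
The same idea applies to accuracy. The following NumPy sketch mirrors the steps above on invented values; `masked_accuracy` is a hypothetical helper, and 0 is again assumed to be the padding id.

```python
import numpy as np

def masked_accuracy(y_true, y_pred_probs, mask):
    """Fraction of non-padding positions where the argmax matches the target."""
    pred_ids = y_pred_probs.argmax(axis=2)
    correct = (pred_ids == y_true) & mask
    return correct.sum() / mask.sum()

y_true = np.array([[1, 2, 0]])                    # last position is padding
y_pred_probs = np.array([[[0.1, 0.8, 0.1],        # predicts 1 (correct)
                          [0.6, 0.3, 0.1],        # predicts 0 (wrong, target is 2)
                          [0.9, 0.05, 0.05]]])    # predicts 0 on a padded slot
mask = y_true != 0

acc = masked_accuracy(y_true, y_pred_probs, mask)
print(acc)
```

Without the mask, the padded slot would count as a correct prediction and inflate the score.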

###Training Step (train_step Method):

* **Purpose:** Defines a single training step for the model.

* **Process:**

 1. Computes predictions **(batch_seq_pred)** and loss.
 2. Backpropagates the loss using **GradientTape**.
 3. Updates only the encoder and decoder weights using the optimizer; the CNN's weights are not part of the update.
 4. Tracks loss and accuracy using **loss_tracker** and **acc_tracker**.

###Testing Step (test_step Method):

* **Purpose:** Defines a single testing step for the model.

* **Process:**

 1. Computes predictions **(batch_seq_pred)** without updating weights.
 2. Computes and tracks loss and accuracy using the same methods as **train_step**.

###Metrics Property:

* Exposes **loss_tracker** and **acc_tracker** through the **metrics** property so that Keras resets them at the start of each epoch and at the start of evaluation.

###Serialization Methods:

* **get_config:** Ensures model components **(cnn_model, encoder, decoder,** and **image_aug)** are serializable.

* **from_config:** Deserializes components and reinstantiates the model for loading saved configurations.

In [None]:
@register_keras_serializable()
class ImageCaptioningModel(keras.Model):
    def __init__(
        self,
        cnn_model,
        encoder,
        decoder,
        image_aug=None,
        **kwargs
    ):
        super(ImageCaptioningModel, self).__init__(**kwargs)
        self.cnn_model = cnn_model
        self.encoder = encoder
        self.decoder = decoder
        self.image_aug = image_aug

        self.loss_tracker = keras.metrics.Mean(name="loss")
        self.acc_tracker = keras.metrics.Mean(name="accuracy")

    def call(self, inputs, training=False, mask=None):
        batch_img, batch_seq = inputs

        if self.image_aug and training:
            batch_img = self.image_aug(batch_img)

        img_embed = self.cnn_model(batch_img, training=False)
        encoder_out = self.encoder(img_embed, training=training)

        batch_seq_inp = batch_seq[:, :-1]
        batch_seq_true = batch_seq[:, 1:]

        mask = tf.math.not_equal(batch_seq_inp, 0)

        batch_seq_pred = self.decoder(
            batch_seq_inp, encoder_out, training=training, mask=mask
        )

        return batch_seq_pred

    def calculate_loss(self, y_true, y_pred, mask):
        loss = self.compiled_loss(y_true, y_pred)
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask
        return tf.reduce_sum(loss) / tf.reduce_sum(mask)

    def calculate_accuracy(self, y_true, y_pred, mask):
        accuracy = tf.equal(y_true, tf.argmax(y_pred, axis=2))
        accuracy = tf.math.logical_and(mask, accuracy)
        accuracy = tf.cast(accuracy, dtype=tf.float32)
        mask = tf.cast(mask, dtype=tf.float32)
        return tf.reduce_sum(accuracy) / tf.reduce_sum(mask)

    def train_step(self, batch_data):
        batch_img, batch_seq = batch_data

        with tf.GradientTape() as tape:
            batch_seq_pred = self((batch_img, batch_seq), training=True)
            batch_seq_true = batch_seq[:, 1:]
            mask = tf.math.not_equal(batch_seq_true, 0)
            loss = self.calculate_loss(batch_seq_true, batch_seq_pred, mask)

        train_vars = (
            self.encoder.trainable_variables +
            self.decoder.trainable_variables
        )
        gradients = tape.gradient(loss, train_vars)
        self.optimizer.apply_gradients(zip(gradients, train_vars))

        acc = self.calculate_accuracy(batch_seq_true, batch_seq_pred, mask)

        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)

        return {"loss": self.loss_tracker.result(), "accuracy": self.acc_tracker.result()}

    def test_step(self, batch_data):
        batch_img, batch_seq = batch_data

        batch_seq_pred = self((batch_img, batch_seq), training=False)
        batch_seq_true = batch_seq[:, 1:]
        mask = tf.math.not_equal(batch_seq_true, 0)
        loss = self.calculate_loss(batch_seq_true, batch_seq_pred, mask)
        acc = self.calculate_accuracy(batch_seq_true, batch_seq_pred, mask)

        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)

        return {"loss": self.loss_tracker.result(), "accuracy": self.acc_tracker.result()}

    @property
    def metrics(self):
        return [self.loss_tracker, self.acc_tracker]

    def get_config(self):
        config = super(ImageCaptioningModel, self).get_config()
        config.update({
            'cnn_model': keras.layers.serialize(self.cnn_model),
            'encoder': keras.layers.serialize(self.encoder),
            'decoder': keras.layers.serialize(self.decoder),
            'image_aug': keras.layers.serialize(self.image_aug) if self.image_aug else None,
        })
        return config

    @classmethod
    def from_config(cls, config):
        cnn_model_config = config.pop('cnn_model')
        encoder_config = config.pop('encoder')
        decoder_config = config.pop('decoder')
        image_aug_config = config.pop('image_aug', None)

        cnn_model = keras.layers.deserialize(cnn_model_config, custom_objects={'EfficientNetB0': efficientnet.EfficientNetB0})
        encoder = keras.layers.deserialize(encoder_config, custom_objects={'TransformerEncoderBlock': TransformerEncoderBlock})
        decoder = keras.layers.deserialize(decoder_config, custom_objects={
            'TransformerDecoderBlock': TransformerDecoderBlock,
            'PositionalEmbedding': PositionalEmbedding
        })
        image_aug = keras.layers.deserialize(image_aug_config) if image_aug_config else None

        return cls(
            cnn_model=cnn_model,
            encoder=encoder,
            decoder=decoder,
            image_aug=image_aug,
            **config
        )

#Data Preparation and Dataset Creation

In this section, we have created a robust pipeline for data preparation and dataset creation to:

* Load and preprocess images and captions.

* Organize and align data for efficient input into the model.

* Optimize data loading with parallel processing, shuffling, batching, and prefetching.

###decode_and_resize Function:

* **Purpose:** Loads and preprocesses an image for input into the model.

* **Process:**

 * Reads the image file from the provided path using **tf.io.read_file**.
 * Decodes the image into a tensor with 3 color channels **(tf.image.decode_jpeg)**.
 * Resizes the image to the target dimensions **(IMAGE_SIZE)**.
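 * Passes the tensor through **tf.image.convert_image_dtype** with a float32 target. Because **tf.image.resize** already returns float32, this call does not rescale the values: they stay in the 0-255 range rather than being normalized to 0-1, which matches the input range expected by Keras EfficientNet models such as the **EfficientNetB0** referenced in **from_config**.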

* **Returns:** A processed image tensor ready for input into the CNN.

###process_input Function:

* **Purpose:** Prepares an image and its corresponding caption for training.

* **Process:**

 * Calls **decode_and_resize** to preprocess the image.
 * Tokenizes the caption using the text vectorizer **(vectorization)**.

* **Returns:** A tuple of the processed image tensor and the tokenized caption.

###flatten_data Function:

* **Purpose:** Flattens nested image-caption pairs into separate lists for processing.

* **Process:**

 * Iterates through the list of images and their corresponding captions.
 * For each caption of an image, duplicates the image path, ensuring alignment for training.

* **Returns:** Two flattened lists:

 * **flattened_images:** A list of image paths, one for each caption.
 * **flattened_captions:** A list of captions, aligned with the image paths.
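
A small worked example may make the alignment clearer. The sketch below applies the same flattening logic to a toy dictionary; the file names, captions, and the helper name `flatten_pairs` are invented for illustration.

```python
def flatten_pairs(images, captions):
    """Repeat each image path once per caption so both lists stay aligned."""
    flat_images, flat_captions = [], []
    for img, caps in zip(images, captions):
        for cap in caps:
            flat_images.append(img)
            flat_captions.append(cap)
    return flat_images, flat_captions

# Toy stand-in for a {image_path: [captions]} mapping.
toy_data = {
    "a.jpg": ["<start> a dog <end>", "<start> a puppy <end>"],
    "b.jpg": ["<start> a cat <end>"],
}

flat_images, flat_captions = flatten_pairs(list(toy_data.keys()), list(toy_data.values()))
print(flat_images)
print(flat_captions)
```

Each image path appears once per caption, so an image with several captions contributes several training pairs.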

###make_dataset Function:

* **Purpose:** Creates a tf.data.Dataset pipeline for efficient data loading and preprocessing.

* **Process:**

 * Calls **flatten_data** to prepare lists of image paths and captions.
 * Creates a TensorFlow dataset using **tf.data.Dataset.from_tensor_slices**.
 * Maps **process_input** to preprocess each image-caption pair in parallel **(num_parallel_calls=tf.data.AUTOTUNE)**.
 * Shuffles the dataset with a buffer of 8 × **BATCH_SIZE** elements **(dataset.shuffle)**.
 * Batches the data into chunks of size **BATCH_SIZE** **(dataset.batch)** and prefetches batches so that data loading overlaps with training **(dataset.prefetch(tf.data.AUTOTUNE))**.

* **Returns:** A batched and preprocessed dataset.

###Dataset Preparation:

* **train_dataset:**

 * Calls **make_dataset** with training data **(train_data)**.
 * Produces a dataset of preprocessed images and tokenized captions for training.

* **validation_dataset:**

 * Calls **make_dataset** with validation data **(validation_data)**.
 * Produces a dataset for evaluating the model during training.

In [None]:
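def decode_and_resize(img_path):
    # Read the raw bytes and decode them into a 3-channel RGB tensor
    img = tf.io.read_file(img_path)
    img = tf.image.decode_jpeg(img, channels=3)
    # tf.image.resize returns float32, so pixel values stay in [0, 255]
    img = tf.image.resize(img, IMAGE_SIZE)
    # The tensor is already float32 here, so this conversion does not rescale it
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img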

def process_input(img_path, caption):
    img = decode_and_resize(img_path)
    caption = vectorization(caption)
    return img, caption

def flatten_data(images, captions):
    flattened_images = []
    flattened_captions = []
    for img, caps in zip(images, captions):
        for cap in caps:
            flattened_images.append(img)
            flattened_captions.append(cap)
    return flattened_images, flattened_captions

def make_dataset(images, captions):
    images, captions = flatten_data(images, captions)
    dataset = tf.data.Dataset.from_tensor_slices((images, captions))
    dataset = dataset.map(process_input, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.shuffle(BATCH_SIZE * 8).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
    return dataset

# Prepare the datasets
train_dataset = make_dataset(list(train_data.keys()), list(train_data.values()))
validation_dataset = make_dataset(list(validation_data.keys()), list(validation_data.values()))

#Model Compilation, Training, and Early Stopping

In this section, we have focused on:

* Defining a loss function suitable for sequence-to-sequence tasks.

* Compiling the image captioning model with appropriate optimization and loss strategies.

* Implementing early stopping to prevent overfitting and reduce training time.

* Training the model while tracking performance on the validation dataset.

###Defining the Loss Function:

* **cross_entropy:** Computes the loss for sequence prediction by comparing the predicted token probabilities with the ground truth tokens.

 *Configuration:*

 * **from_logits=False:** Indicates that the output probabilities are already normalized (softmax applied).
 * **reduction='none':** Computes the loss for each token individually, enabling custom weighting or masking (e.g., ignoring padding tokens).
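
To show what an unreduced loss looks like, here is a NumPy sketch of sparse categorical cross-entropy computed per token from already-normalized probabilities. It is a simplified stand-in for the Keras loss, not its implementation: it omits the numerical clipping Keras applies, and the function name and values are assumptions.

```python
import numpy as np

def per_token_cross_entropy(y_true, y_prob):
    """Return -log p(true token) for every (batch, position) pair, unreduced."""
    # Pick the probability assigned to the ground-truth token at each position.
    true_prob = np.take_along_axis(y_prob, y_true[..., None], axis=-1)[..., 0]
    return -np.log(true_prob)

y_true = np.array([[1, 2]])                  # one caption, two target tokens
y_prob = np.array([[[0.1, 0.8, 0.1],         # p(token 1) = 0.8
                    [0.25, 0.25, 0.5]]])     # p(token 2) = 0.5

loss = per_token_cross_entropy(y_true, y_prob)
print(loss.shape)  # one loss value per token, same shape as y_true
print(loss)
```

Because the result keeps the (batch, sequence) shape, it can be multiplied element-wise by a padding mask before averaging.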

###Instantiating the Image Captioning Model:

* **caption_model:**

 * Combines the previously defined components **(cnn_model, encoder, decoder,** and **image_augmentation)** into the custom **ImageCaptioningModel**.
 * Ensures the entire image captioning pipeline is encapsulated in a single model for training and inference.

###Compiling the Model:

* **Purpose:** Specifies the optimizer and loss function for training.

* **Configuration:**

 * **optimizer:** Adam optimizer with a learning rate of **1e-4**
 * **loss:** Uses the **cross_entropy** function for token-by-token comparison.

###Defining Early Stopping:

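* **early_stopping:** A Keras callback that limits overfitting by stopping training when the monitored quantity stops improving. No **monitor** argument is passed, so the callback watches the validation loss (**val_loss**, the Keras default).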

 *Configuration:*

 * **patience=3:** Waits for three epochs of no improvement before stopping.
 * **restore_best_weights=True:** Restores the model weights from the epoch with the best validation loss.
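
The patience rule can be traced by hand with a small simulation. This is a simplified pure-Python sketch of the idea, not the Keras callback itself; the function name and the validation losses are invented.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training runs through the last epoch.
    return len(val_losses) - 1, best_epoch

toy_val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In this toy run the loss last improves at epoch index 2, three epochs pass without improvement, and training stops at index 5. The later value of 0.6 is never reached, which is the trade-off that a small patience makes.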

###Training the Model:

* **caption_model.fit:** Trains the model on the **train_dataset** and evaluates on the **validation_dataset** at each epoch.

 *Configuration:*

 * **epochs=EPOCHS:** Specifies the total number of training epochs (30 in this case).
 * **validation_data:** Provides the validation dataset for monitoring model performance.
 * **callbacks=[early_stopping]:** Uses the early stopping mechanism to improve training efficiency.

###Tracking Training History:

* **history:**

 * Captures the training and validation loss and accuracy metrics over all epochs.
 * Useful for visualizing the model's performance trends during training.
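
As a sketch of how the recorded metrics might be inspected, the snippet below works on a toy dictionary shaped like **history.history**. The keys are assumed to be **loss**, **accuracy**, **val_loss**, and **val_accuracy**, following the metric names returned by **train_step** and **test_step**; all numbers are invented.

```python
# Toy stand-in for history.history (values are made up for illustration).
toy_history = {
    "loss": [3.2, 2.6, 2.3, 2.1],
    "accuracy": [0.30, 0.38, 0.42, 0.45],
    "val_loss": [3.0, 2.7, 2.8, 2.9],
    "val_accuracy": [0.32, 0.37, 0.36, 0.36],
}

# Print one row per epoch to compare training and validation metrics.
for epoch, row in enumerate(zip(*toy_history.values()), start=1):
    loss, acc, val_loss, val_acc = row
    print(f"epoch {epoch}: loss={loss:.2f} acc={acc:.2f} "
          f"val_loss={val_loss:.2f} val_acc={val_acc:.2f}")

# A growing gap between validation and training loss hints at overfitting.
final_gap = toy_history["val_loss"][-1] - toy_history["loss"][-1]
lowest_val_epoch = toy_history["val_loss"].index(min(toy_history["val_loss"])) + 1
print(final_gap, lowest_val_epoch)
```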


In [None]:
# Define the loss function
cross_entropy = keras.losses.SparseCategoricalCrossentropy(from_logits=False, reduction='none')

In [None]:
# Instantiate the Image Captioning Model
caption_model = ImageCaptioningModel(
    cnn_model=cnn_model,
    encoder=encoder,
    decoder=decoder,
    image_aug=image_augmentation
)

# Compile the model
caption_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss=cross_entropy
)

# Define early stopping
early_stopping = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

# Train the model
history = caption_model.fit(
    train_dataset,
    epochs=EPOCHS,
    validation_data=validation_dataset,
    callbacks=[early_stopping]
)

#Model Saving, Vocabulary Exporting, and Model Reloading

In this section, we have:

* Saved the trained model to disk for future use.

* Exported the vocabulary, enabling interpretation of model outputs.

* Defined custom objects to ensure compatibility when reloading the model.

* Reloaded the model to demonstrate successful serialization and deserialization.

###Saving the Model:

* **caption_model.save:** Saves the trained ImageCaptioningModel to a file named caption_model_test.keras. The saved file contains:

 * Model architecture.
 * Trained weights.
 * Optimizer state (if any).

###Extracting and Saving Vocabulary:

* **Extracting Vocabulary:** Calls **vectorization.get_vocabulary()** to retrieve the vocabulary used during training, which maps integer indices to tokens.

* **Saving Vocabulary:**

 * Uses the **pickle** library to save the vocabulary as a binary file **(vocab.pkl).**
 * This file can be loaded later to interpret model outputs (e.g., mapping token indices back to words).
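
Loading the file back is the mirror image of saving it. The sketch below round-trips a toy vocabulary through **pickle** in a temporary directory and uses it to decode token indices. The vocabulary contents and the indices are invented; the leading empty string and **[UNK]** entries follow the usual layout of a **TextVectorization** vocabulary in integer output mode.

```python
import os
import pickle
import tempfile

# Toy vocabulary: position in the list is the token index.
toy_vocab = ["", "[UNK]", "<start>", "<end>", "a", "dog"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_vocab, f)        # save, as in the notebook
    with open(path, "rb") as f:
        restored_vocab = pickle.load(f)  # load in a later session

# Rebuild the index-to-word lookup and decode a few token ids.
index_to_word = dict(enumerate(restored_vocab))
decoded = " ".join(index_to_word[i] for i in [4, 5])
print(decoded)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.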

###Defining Custom Objects:

* **Purpose:** Ensures that custom model components (e.g., **ImageCaptioningModel**, **TransformerEncoderBlock**) are recognized during model loading.

* **Configuration:** A dictionary **(custom_objects)** maps the names of custom classes to their definitions.

###Loading the Model:

* **keras.models.load_model:**

 * Reloads the saved model **(caption_model_test.keras)** into memory for inference or further training.
 * Uses the **custom_objects** dictionary to resolve custom layers and components during deserialization.

In [None]:
# Save the model
caption_model.save('caption_model_test.keras')

In [None]:
# Get the vocabulary from the TextVectorization layer
vocab = vectorization.get_vocabulary()

# Save the vocabulary to a file
import pickle
with open('vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)

In [None]:
# Define custom objects
custom_objects = {
    'ImageCaptioningModel': ImageCaptioningModel,
    'TransformerEncoderBlock': TransformerEncoderBlock,
    'TransformerDecoderBlock': TransformerDecoderBlock,
    'PositionalEmbedding': PositionalEmbedding
}

# Load the model
loaded_model = keras.models.load_model('caption_model_test.keras', custom_objects=custom_objects)

#Caption Generation Using Greedy Algorithm

In this section, we have implemented a function to:

* Generate captions for test images using the trained image captioning model.

* Apply a greedy decoding strategy for simplicity, selecting the most probable token at each step.

* Decode the predicted token indices into human-readable text using the trained vocabulary.

###Vocabulary Mapping:

* **vocab:** Retrieves the trained vocabulary from the **vectorization** layer.

* **INDEX_TO_WORD:** Creates a dictionary that maps token indices to their corresponding words for decoding predictions into human-readable captions.

###Defining greedy_algorithm Function:

* **Purpose:** Generates captions for an image using a greedy decoding strategy, where the most likely token is chosen at each step.

###Greedy Decoding Steps:

* **Image Preprocessing:** The input image is decoded and resized using **decode_and_resize**, then given a batch dimension **(tf.expand_dims)** for processing by the CNN.

* **Feature Extraction:** The preprocessed image is passed through the model's CNN **(caption_model.cnn_model)** to extract image features.

* **Encoding Features:** The extracted features are passed to the transformer encoder **(caption_model.encoder)** to generate encoded representations of the image.

* **Caption Generation:**

 * Starts with the token <**start**> to initialize the decoding process.
 * Iteratively:
   1. Converts the partially generated caption into tokenized format using **vectorization**.
   2. Computes the attention mask to handle padding tokens.
   3. Passes the tokenized caption and encoded image features to the transformer decoder **(caption_model.decoder)**.
   4. Selects the token with the highest probability **(np.argmax)** from the predictions.
   5. Stops if the selected token is <**end**>; otherwise appends it to **decoded_caption**.
   6. Ends after at most **MAX_DECODED_SENTENCE_LENGTH** iterations if <**end**> is never predicted.

* **Postprocessing:** Removes <**start**> and <**end**> tokens from the generated caption for readability. Returns the final decoded caption as a string.

* **Parameters and Constants:**

 * **MAX_DECODED_SENTENCE_LENGTH:** Limits the maximum length of the generated caption to avoid overly long predictions.
 * **test_images:** A list of image paths from the test dataset to be used for caption generation.
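
The control flow of the loop can be seen in isolation with a toy decoder. In the pure-Python sketch below, `toy_next_token_scores` is a hypothetical stand-in for the transformer decoder that follows a fixed word chain; the vocabulary and function names are assumptions for illustration only.

```python
VOCAB = ["", "[UNK]", "<start>", "<end>", "a", "dog", "runs"]
WORD_TO_INDEX = {word: idx for idx, word in enumerate(VOCAB)}

def toy_next_token_scores(tokens):
    """Hypothetical decoder stand-in: score the next token from the last one."""
    follow = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}
    scores = [0.0] * len(VOCAB)
    scores[WORD_TO_INDEX[follow.get(tokens[-1], "<end>")]] = 1.0
    return scores

def greedy_decode(max_len=10):
    tokens = ["<start>"]
    for _ in range(max_len):
        scores = toy_next_token_scores(tokens)
        # Greedy choice: take the single highest-scoring token (argmax).
        best_index = max(range(len(scores)), key=scores.__getitem__)
        word = VOCAB[best_index]
        if word == "<end>":
            break  # stop before the end marker is added to the caption
        tokens.append(word)
    return " ".join(tokens[1:])  # drop the <start> marker

caption = greedy_decode()
print(caption)
```

Greedy decoding commits to one token per step and never revisits earlier choices, which keeps it simple but can miss captions that a wider search such as beam search would find.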

In [None]:
vocab = vectorization.get_vocabulary()
INDEX_TO_WORD = {idx: word for idx, word in enumerate(vocab)}
MAX_DECODED_SENTENCE_LENGTH = SEQ_LENGTH - 1
test_images = list(test_data.keys())

def greedy_algorithm(image):
    # Read the image from the disk
    image = decode_and_resize(image)

    # Pass the image to the CNN
    image = tf.expand_dims(image, 0)
    image = caption_model.cnn_model(image)

    # Pass the image features to the Transformer encoder
    encoded_img = caption_model.encoder(image, training=False)

    # Generate the caption using the Transformer decoder
    decoded_caption = "<start> "
    for i in range(MAX_DECODED_SENTENCE_LENGTH):
        tokenized_caption = vectorization([decoded_caption])[:, :-1]
        mask = tf.math.not_equal(tokenized_caption, 0)
        predictions = caption_model.decoder(tokenized_caption, encoded_img, training=False, mask=mask)
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = INDEX_TO_WORD[sampled_token_index]
        if sampled_token == "<end>":
            break
        decoded_caption += " " + sampled_token

    decoded_caption = decoded_caption.replace("<start> ", "")
    decoded_caption = decoded_caption.replace(" <end>", "").strip()

    return decoded_caption

#Testing the Caption Generation with a Sample Image

In this section, we have tested the captioning model by:

* Applying the greedy_algorithm function to a real image.

* Verifying the generated caption text against the visual content of the image.

* Displaying the image for manual evaluation of the model's performance.

In [None]:
from PIL import Image
img = "/content/drive/MyDrive/Colab/000000000785.jpg"
caption = greedy_algorithm(img)
print(f'Generated Caption: {caption}\n')
Image.open(img)