**Loads Captions:** It reads the captions.txt file, where each line contains an image ID and a caption. It splits each line at the first comma into the image ID and the caption, and groups the captions by image ID.

**Cleans Captions:** It converts all the captions to lowercase to ensure uniformity.

In [1]:
import os

# Define paths for the dataset
CAPTION_PATH = "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/captions.txt"
DATASET_PATH = "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images"

# Function to load captions from the captions.txt file
def load_captions(filepath):
    captions_dict = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split(',', 1)
            if len(parts) < 2:
                continue
            image_id, caption = parts
            image_id = image_id.strip().split('.')[0]  # Remove file extension
            if image_id not in captions_dict:
                captions_dict[image_id] = []
            captions_dict[image_id].append("startseq " + caption.strip() + " endseq")  # Add startseq and endseq
    return captions_dict

# Function to clean the captions (convert to lowercase)
def clean_captions(captions):
    for img_id in captions:
        captions[img_id] = [cap.lower() for cap in captions[img_id]]
    return captions

# Load and clean captions
print("Loading and cleaning captions...")
captions = load_captions(CAPTION_PATH)
captions = clean_captions(captions)
print(f"Loaded {len(captions)} image captions.")


Loading and cleaning captions...
Loaded 8092 image captions.
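
One thing worth checking about the count above: if this copy of captions.txt begins with a header row such as `image,caption` (an assumption, since the file contents are not shown), the loader would store that row as an image ID named `image`. The sketch below uses a made-up in-memory sample and a hypothetical helper, `load_captions_from`, to show how such a row could be skipped. It illustrates the check and is not a replacement for the loader used in this notebook.

```python
import io

# Hypothetical sample data; the real captions.txt is not shown in this notebook.
SAMPLE = """image,caption
img_001.jpg,A dog runs across a field .
img_001.jpg,A dog runs, then jumps
img_002.jpg,A child climbs the stairs
"""

def load_captions_from(handle, header=("image", "caption")):
    """Parse 'image_name,caption' lines, skipping an optional header row."""
    captions = {}
    for line in handle:
        parts = line.strip().split(",", 1)  # split at the first comma only
        if len(parts) < 2:
            continue
        image_name, caption = (p.strip() for p in parts)
        if (image_name.lower(), caption.lower()) == header:
            continue  # header row, not an image
        image_id = image_name.split(".")[0]  # drop the file extension
        captions.setdefault(image_id, []).append(
            f"startseq {caption.lower()} endseq"
        )
    return captions

sample_captions = load_captions_from(io.StringIO(SAMPLE))
print(len(sample_captions), "images")
print("header stored as image:", "image" in sample_captions)
```

Splitting with a limit of one keeps commas inside the caption text intact, which is why the second sample caption survives whole.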


**Shuffling Image IDs:** It shuffles the list of image IDs to ensure randomness.

**Splits the Dataset:** It divides the shuffled data into three parts:

70% for training

15% for validation

15% for testing

**Outputs the Split Sizes:** After splitting, it prints how many images are in each set.

In [2]:
import numpy as np

# Function to create random splits for train, validation, and test sets
def create_splits(captions):
    all_img_ids = list(captions.keys())  # Get all image IDs
    np.random.seed(42)  # For reproducibility
    np.random.shuffle(all_img_ids)  # Shuffle image IDs randomly
    train_split = int(0.7 * len(all_img_ids))  # 70% for training
    val_split = int(0.85 * len(all_img_ids))  # 15% for validation (total 85% for train + validation)
    return all_img_ids[:train_split], all_img_ids[train_split:val_split], all_img_ids[val_split:]

# Create the splits
print("Creating train, validation, and test splits...")
train_ids, val_ids, test_ids = create_splits(captions)

print(f"Train: {len(train_ids)} images, Validation: {len(val_ids)} images, Test: {len(test_ids)} images")


Creating train, validation, and test splits...
Train: 5664 images, Validation: 1214 images, Test: 1214 images
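
The split sizes printed above follow directly from the two cut points, `int(0.7 * n)` and `int(0.85 * n)`. As a small cross-check, the sketch below reproduces the arithmetic with placeholder IDs. It uses the standard-library `random` module instead of `np.random`, so the shuffled order differs from the notebook's, but the sizes depend only on the number of IDs.

```python
import random

def split_ids(ids, seed=42):
    """Shuffle a copy of ids and cut it at 70% and 85%."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    train_end = int(0.7 * len(ids))   # first 70% goes to training
    val_end = int(0.85 * len(ids))    # next 15% goes to validation
    return ids[:train_end], ids[train_end:val_end], ids[val_end:]

# Placeholder IDs standing in for the 8092 keys of the captions dictionary
demo_ids = [f"img_{i}" for i in range(8092)]
demo_train, demo_val, demo_test = split_ids(demo_ids)

print(len(demo_train), len(demo_val), len(demo_test))
```

Because the split is made on image IDs rather than on individual captions, all captions of one image land in the same set.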


**Creates Vocabulary:** It flattens all the captions into a single list and then creates a tokenizer, which learns the word-to-integer mapping.

**Finds Max Length:** It calculates the maximum length of the captions (in terms of the number of words) to determine the padding length.

**Outputs Vocabulary Size and Max Length:** After creating the tokenizer, it prints the size of the vocabulary and the maximum caption length.

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Function to create the vocabulary and tokenizer
def create_vocabulary(captions):
    all_captions = [caption for cap_list in captions.values() for caption in cap_list]  # Flatten all captions into a list
    tokenizer = Tokenizer()  # Initialize the tokenizer
    tokenizer.fit_on_texts(all_captions)  # Fit tokenizer on the captions
    vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding token
    max_length = max(len(c.split()) for c in all_captions)  # Maximum length of captions
    return tokenizer, vocab_size, max_length

# Create vocabulary and tokenizer
print("Creating vocabulary and tokenizer...")
tokenizer, vocab_size, max_length = create_vocabulary(captions)

print(f"Vocabulary size: {vocab_size}, Max caption length: {max_length}")


Creating vocabulary and tokenizer...
Vocabulary size: 8497, Max caption length: 40
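
To make the `+ 1` in the vocabulary size concrete, here is a simplified pure-Python stand-in for the word-to-integer mapping. It is only a sketch: the helper `build_word_index` is hypothetical, and the Keras `Tokenizer` additionally strips punctuation. What it does show is that indices start at 1, so index 0 stays free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Assign 1-based indices to words, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {
        word: index
        for index, (word, _) in enumerate(counts.most_common(), start=1)
    }

toy_captions = [
    "startseq a dog runs endseq",
    "startseq a dog jumps over a log endseq",
]

toy_word_index = build_word_index(toy_captions)
toy_vocab_size = len(toy_word_index) + 1          # +1 for the unused index 0
toy_max_length = max(len(c.split()) for c in toy_captions)

print(toy_word_index)
print(toy_vocab_size, toy_max_length)
```

The same relation explains the numbers in this notebook: a vocabulary size of 8497 corresponds to 8496 distinct words plus the reserved padding index.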


**InceptionV3 Model:** It loads the InceptionV3 model pre-trained on ImageNet without the top classification layer, as we just need the feature extraction part.

**Feature Extraction:** It processes each image, resizes it, converts it to an array, and applies the necessary preprocessing. Then, it uses the model to extract image features.

**Saving Features:** The extracted features are saved to a .pkl file to avoid re-processing the images in the future.

**Loading Features:** If the features are already saved in the .pkl file, it loads them directly.

In [4]:
import pickle
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tqdm import tqdm

# Define the path to save features
feature_file = "D:/Desktop/DL project/Mini project - dataset/outputs/custom_image_features.pkl"

# Check if features are already extracted
if os.path.exists(feature_file):
    print("Loading pre-extracted features...")
    with open(feature_file, 'rb') as f:
        features = pickle.load(f)
else:
    print("Extracting image features using InceptionV3...")
    base_model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')  # Load InceptionV3 without the top layer

    # Function to extract features from images
    def extract_features(image_dir, image_ids):
        features = {}
        for img_id in tqdm(image_ids):
            file_path = os.path.join(image_dir, img_id + ".jpg")
            if not os.path.exists(file_path):
                continue  # Skip if image file doesn't exist
            img = load_img(file_path, target_size=(299, 299))  # Load image
            img = img_to_array(img)  # Convert image to array
            img = np.expand_dims(img, axis=0)  # Add batch dimension
            img = preprocess_input(img)  # Preprocess image for InceptionV3
            feature = base_model.predict(img, verbose=0)  # Get image features
            features[img_id] = feature.flatten()  # Flatten the feature vector
        return features

    # Extract features from training, validation, and test images
    features = extract_features(DATASET_PATH, list(set(train_ids + val_ids + test_ids)))

    # Save features to a pickle file for future use
    with open(feature_file, "wb") as f:
        pickle.dump(features, f)
    print("Image features saved.")


Extracting image features using InceptionV3...


100%|██████████| 8092/8092 [11:31<00:00, 11.69it/s]


FileNotFoundError: [Errno 2] No such file or directory: 'D:/Desktop/DL project/Mini project - dataset/outputs/custom_image_features.pkl'

**Directory Creation:** It ensures that the outputs directory is created using os.makedirs(feature_dir, exist_ok=True) before trying to save the feature file.

**File Path:** The feature_file path is updated to save inside the outputs folder correctly.
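
The corrected cell is not shown here, so the following is a minimal sketch of the fix described above, assuming `feature_dir` is simply the parent directory of `feature_file`. The helper name `save_features` is hypothetical, and a temporary directory stands in for the real outputs folder.

```python
import os
import pickle
import tempfile

def save_features(features, feature_file):
    """Create the parent directory if needed, then pickle the features."""
    feature_dir = os.path.dirname(feature_file)
    if feature_dir:
        os.makedirs(feature_dir, exist_ok=True)  # no error if it already exists
    with open(feature_file, "wb") as f:
        pickle.dump(features, f)

with tempfile.TemporaryDirectory() as tmp:
    # "outputs" does not exist yet, which is the situation that raised the error
    demo_file = os.path.join(tmp, "outputs", "custom_image_features.pkl")
    save_features({"img_001": [0.1, 0.2]}, demo_file)
    with open(demo_file, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Creating the directory before the extraction loop starts, rather than after it, would also avoid losing an 11-minute run to a missing folder.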

**Creates Sequences:** It converts the captions into integer sequences using the tokenizer. For each caption, the image feature is paired with the caption sequence, where each word is converted to an integer.

**Pads Sequences:** It ensures that the sequences have a uniform length by padding them with zeros where necessary, up to the max_length (the maximum length of captions calculated earlier).

**Returns Arrays:** It returns three arrays:

**input_images:** The features of the images.

**input_seqs:** The padded sequences of words (captions).

**output_words:** The next word in the sequence, which will be predicted by the model.

In [6]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Function to create sequences from captions and image features
def create_sequences(tokenizer, max_length, captions, features, image_ids):
    input_images, input_seqs, output_words = [], [], []

    for img_id in image_ids:
        if img_id not in captions or img_id not in features:
            continue
        for caption in captions[img_id]:
            seq = tokenizer.texts_to_sequences([caption])[0]
            for i in range(1, len(seq)):
                input_images.append(features[img_id])
                input_seqs.append(seq[:i])
                output_words.append(seq[i])

    # Pad the sequences to ensure uniform length
    padded_seqs = pad_sequences(input_seqs, maxlen=max_length, padding='post')
    return np.array(input_images), np.array(padded_seqs), np.array(output_words)

# Create the sequences for training and validation sets
print("Creating training sequences...")
train_img, train_seq, train_out = create_sequences(tokenizer, max_length, captions, features, train_ids)
val_img, val_seq, val_out = create_sequences(tokenizer, max_length, captions, features, val_ids)

print(f"Train samples: {len(train_img)}")
print(f"Validation samples: {len(val_img)}")


Creating training sequences...
Train samples: 335100
Validation samples: 71484
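
The sample counts above are much larger than the number of captions because every caption is expanded into one training sample per predicted word. The sketch below shows that expansion for a single caption in plain Python. The token IDs are made up, and `expand_caption` is a hypothetical helper that mirrors the inner loop of `create_sequences` with zero post-padding.

```python
def expand_caption(token_ids, max_length):
    """Turn one tokenized caption into (padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i]
        # Post-padding: zeros are appended after the real tokens
        padded = prefix + [0] * (max_length - len(prefix))
        pairs.append((padded, token_ids[i]))
    return pairs

# Hypothetical IDs for "startseq dog runs endseq"
toy_ids = [1, 4, 7, 2]
toy_pairs = expand_caption(toy_ids, max_length=6)

for padded, target in toy_pairs:
    print(padded, "->", target)
```

A caption with n tokens therefore yields n - 1 samples, and each of those samples carries its own copy of the image feature vector.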


**Image Feature Input:** The image features are passed through a Dense layer after applying dropout for regularization.

**Caption Input:** The captions are processed through an embedding layer, followed by another dropout layer and an LSTM layer to handle the sequential nature of the captions.

**Merging the Inputs:** The image features and caption features are merged using the add function and passed through a final dense layer.

**Output Layer:** The final output layer is a dense layer with a softmax activation, which predicts the next word in the caption.

**Model Compilation:** The model is compiled with a sparse categorical cross-entropy loss function (since we're predicting words) and the Adam optimizer.

In [7]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Input, add
from tensorflow.keras.optimizers import Adam

# Function to build the CNN-LSTM model
def build_model(vocab_size, max_length, feature_size):
    # Image feature input
    inputs1 = Input(shape=(feature_size,))
    fe1 = Dropout(0.5)(inputs1)  # Apply dropout for regularization
    fe2 = Dense(256, activation='relu')(fe1)  # Dense layer for image features

    # Caption input
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)  # Embedding layer for captions
    se2 = Dropout(0.5)(se1)  # Dropout layer for regularization
    se3 = LSTM(256)(se2)  # LSTM layer for caption generation

    # Decoder (merging image features and captions)
    decoder1 = add([fe2, se3])  # Merge image features and caption features
    decoder2 = Dense(256, activation='relu')(decoder1)  # Another Dense layer
    outputs = Dense(vocab_size, activation='softmax')(decoder2)  # Final softmax layer for output

    # Build and compile the model
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    
    return model

# Build the model
feature_size = train_img.shape[1]  # Image feature size (length of flattened feature vector)
print("Building the CNN-LSTM model...")
model = build_model(vocab_size, max_length, feature_size)

# Summarize the model architecture
model.summary()


Building the CNN-LSTM model...
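
The output of `model.summary()` is not reproduced above. As a rough cross-check, the trainable parameter counts can be worked out by hand from the layer sizes. The sketch below assumes a feature size of 2048 (the length of the InceptionV3 vector with `pooling='avg'`) and the vocabulary size of 8497 printed earlier. The helper functions are illustrative and use the standard formulas for Dense, Embedding, and LSTM layers with biases.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases

def embedding_params(vocab, dim):
    return vocab * dim                   # one vector per token index

def lstm_params(n_in, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (n_in + units) + units)

demo_vocab = 8497      # printed by the vocabulary cell
demo_features = 2048   # assumption: pooled InceptionV3 output length

param_counts = {
    "image dense": dense_params(demo_features, 256),
    "embedding": embedding_params(demo_vocab, 256),
    "lstm": lstm_params(256, 256),
    "decoder dense": dense_params(256, 256),
    "output dense": dense_params(256, demo_vocab),
}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(f"{name:>14}: {count:,}")
print(f"{'total':>14}: {total_params:,}")
```

Under these assumptions, the embedding and the output layer together account for most of the parameters, since both scale with the vocabulary size. Dropout and the add merge contribute none.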


**Model Checkpoint:** Saves the model with the lowest validation loss during training.

**Early Stopping:** Stops training if the validation loss does not improve for a set number of epochs (patience = 5), and restores the best weights.

**Model Training:** Trains the model using the training data (train_img, train_seq, and train_out) and validates it on the validation data (val_img, val_seq, and val_out).



In [8]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Define the callback for saving the best model during training
checkpoint_path = "D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5"
checkpoint = ModelCheckpoint(
    checkpoint_path,
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    mode='min'
)

# Define early stopping to avoid overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
print("Training the CNN-LSTM model...")
history = model.fit(
    [train_img, train_seq], train_out,  # Inputs and outputs for training
    validation_data=([val_img, val_seq], val_out),  # Validation data
    epochs=20,  # Number of epochs (you can adjust this)
    batch_size=64,  # Batch size
    callbacks=[early_stopping, checkpoint]  # Callbacks for early stopping and saving the best model
)


Training the CNN-LSTM model...
Epoch 1/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 78ms/step - accuracy: 0.2724 - loss: 4.3620
Epoch 1: val_loss improved from inf to 3.47563, saving model to D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 429s 82ms/step - accuracy: 0.2724 - loss: 4.3619 - val_accuracy: 0.3584 - val_loss: 3.4756
Epoch 2/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 76ms/step - accuracy: 0.3646 - loss: 3.2347
Epoch 2: val_loss improved from 3.47563 to 3.36156, saving model to D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 415s 79ms/step - accuracy: 0.3646 - loss: 3.2347 - val_accuracy: 0.3763 - val_loss: 3.3616
Epoch 3/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step - accuracy: 0.3847 - loss: 2.9645
Epoch 3: val_loss improved from 3.36156 to 3.34273, saving model to D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 389s 74ms/step - accuracy: 0.3847 - loss: 2.9645 - val_accuracy: 0.3853 - val_loss: 3.3427
Epoch 4/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 70ms/step - accuracy: 0.3958 - loss: 2.8112
Epoch 4: val_loss did not improve from 3.34273
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 384s 73ms/step - accuracy: 0.3958 - loss: 2.8112 - val_accuracy: 0.3901 - val_loss: 3.3669
Epoch 5/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step - accuracy: 0.4052 - loss: 2.7065
Epoch 5: val_loss did not improve from 3.34273
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 389s 74ms/step - accuracy: 0.4052 - loss: 2.7065 - val_accuracy: 0.3905 - val_loss: 3.4017
Epoch 6/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step - accuracy: 0.4128 - loss: 2.6340
Epoch 6: val_loss did not improve from 3.34273
5236/5236

In [4]:
import os

# Define path to your captions file
CAPTION_PATH = "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/captions.txt"

# Reload captions from file
def load_captions(filepath):
    captions_dict = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split(',', 1)
            if len(parts) < 2:
                continue
            image_id, caption = parts
            image_id = image_id.strip().split('.')[0]
            if image_id not in captions_dict:
                captions_dict[image_id] = []
            captions_dict[image_id].append("startseq " + caption.strip() + " endseq")
    return captions_dict

# Clean captions (lowercase)
def clean_captions(captions):
    for img_id in captions:
        captions[img_id] = [cap.lower() for cap in captions[img_id]]
    return captions

# Execute loading
print(" Reloading captions...")
captions = load_captions(CAPTION_PATH)
captions = clean_captions(captions)
print(f" Loaded {len(captions)} image captions.")


 Reloading captions...
 Loaded 8092 image captions.


In [6]:
import pickle  # needed below to load the saved features after a kernel restart
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Recreate the image sequences from existing variables
def create_sequences(tokenizer, max_length, captions, features, image_ids):
    input_images, input_seqs, output_words = [], [], []

    for img_id in image_ids:
        if img_id not in captions or img_id not in features:
            continue
        for caption in captions[img_id]:
            seq = tokenizer.texts_to_sequences([caption])[0]
            for i in range(1, len(seq)):
                input_images.append(features[img_id])
                input_seqs.append(seq[:i])
                output_words.append(seq[i])

    padded_seqs = pad_sequences(input_seqs, maxlen=max_length, padding='post')
    return np.array(input_images), np.array(padded_seqs), np.array(output_words)

# Paths
feature_file = "D:/Desktop/DL project/Mini project - dataset/outputs/custom_image_features.pkl"

# Load pre-extracted image features
with open(feature_file, 'rb') as f:
    features = pickle.load(f)

# Recreate train/val/test splits if not available
def create_splits(captions):
    all_img_ids = list(captions.keys())
    np.random.seed(42)
    np.random.shuffle(all_img_ids)
    train_split = int(0.7 * len(all_img_ids))
    val_split = int(0.85 * len(all_img_ids))
    return all_img_ids[:train_split], all_img_ids[train_split:val_split], all_img_ids[val_split:]

train_ids, val_ids, test_ids = create_splits(captions)

# Create sequences
print(" Recreating sequences...")
train_img, train_seq, train_out = create_sequences(tokenizer, max_length, captions, features, train_ids)
val_img, val_seq, val_out = create_sequences(tokenizer, max_length, captions, features, val_ids)

print(f" Train samples: {len(train_img)}, Validation samples: {len(val_img)}")


 Recreating sequences...
 Train samples: 335100, Validation samples: 71484


In [7]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Input, add
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# === 1. Recreate tokenizer and calculate vocab_size, max_length ===
all_captions = [cap for cap_list in captions.values() for cap in cap_list]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)

vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(cap.split()) for cap in all_captions)
feature_size = train_img.shape[1]

print(f" Vocab size: {vocab_size}, Max length: {max_length}")

# === 2. Rebuild the exact model architecture ===
def build_model(vocab_size, max_length, feature_size):
    inputs1 = Input(shape=(feature_size,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    return model

model = build_model(vocab_size, max_length, feature_size)

# === 3. Load the saved checkpoint (save_best_only=True, so this is the best-val_loss epoch, which may be earlier than epoch 8) ===
model.load_weights("D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5")

# === 4. Callbacks ===
checkpoint = ModelCheckpoint(
    "D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5",
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    mode='min'
)
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# === 5. Continue training from epoch 9 ===
print("🚀 Resuming training from Epoch 9...")

history = model.fit(
    [train_img, train_seq], train_out,
    validation_data=([val_img, val_seq], val_out),
    epochs=20,
    initial_epoch=8,
    batch_size=64,
    callbacks=[early_stopping, checkpoint]
)


 Vocab size: 8497, Max length: 40
🚀 Resuming training from Epoch 9...
Epoch 9/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 74ms/step - accuracy: 0.3915 - loss: 2.9442
Epoch 9: val_loss improved from inf to 3.33909, saving model to D:/Desktop/DL project/Mini project - dataset/outputs/best_model.h5
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 406s 77ms/step - accuracy: 0.3915 - loss: 2.9442 - val_accuracy: 0.3863 - val_loss: 3.3391
Epoch 10/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 74ms/step - accuracy: 0.4016 - loss: 2.8406
Epoch 10: val_loss did not improve from 3.33909
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 402s 77ms/step - accuracy: 0.4016 - loss: 2.8406 - val_accuracy: 0.3898 - val_loss: 3.3542
Epoch 11/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 75ms/step - accuracy: 0.4076 - loss: 2.7486
Epoch 11: val_loss did not improve from 3.33909
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 407s 78ms/step - accuracy: 0.4076 - loss: 2.7486 - val_accuracy: 0.3904 - val_loss: 3.3677
Epoch 12/20
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 0s 75ms/step - accuracy: 0.4137 - loss: 2.6685
Epoch 12: val_loss did not improve from 3.33909
5236/5236

In [8]:
history = model.fit(
    [train_img, train_seq], train_out,
    validation_data=([val_img, val_seq], val_out),
    epochs=20,  # Final epoch number to reach (not the number of additional epochs)
    initial_epoch=14,  # 0-indexed, so training resumes at the epoch logged as 15
    batch_size=64,
    verbose=2,  # Show the full logs for each epoch
    callbacks=[early_stopping, checkpoint]
)


Epoch 15/20

Epoch 15: val_loss did not improve from 3.33909
5236/5236 - 422s - 81ms/step - accuracy: 0.4006 - loss: 2.8361 - val_accuracy: 0.3877 - val_loss: 3.3780
Epoch 16/20

Epoch 16: val_loss did not improve from 3.33909
5236/5236 - 409s - 78ms/step - accuracy: 0.4073 - loss: 2.7463 - val_accuracy: 0.3915 - val_loss: 3.3930
Epoch 17/20

Epoch 17: val_loss did not improve from 3.33909
5236/5236 - 407s - 78ms/step - accuracy: 0.4121 - loss: 2.6811 - val_accuracy: 0.3940 - val_loss: 3.4353
Epoch 18/20

Epoch 18: val_loss did not improve from 3.33909
5236/5236 - 405s - 77ms/step - accuracy: 0.4162 - loss: 2.6285 - val_accuracy: 0.3939 - val_loss: 3.4685
Epoch 19/20

Epoch 19: val_loss did not improve from 3.33909
5236/5236 - 408s - 78ms/step - accuracy: 0.4202 - loss: 2.5875 - val_accuracy: 0.3929 - val_loss: 3.5271
Epoch 20/20

Epoch 20: val_loss did not improve from 3.33909
5236/5236 - 406s - 78ms/step - accuracy: 0.4232 - loss: 2.5516 - val_accuracy: 0.3927 - val_loss: 3.5171
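
The patience rule can be traced by hand against the validation losses logged in this notebook. The sketch below is a simplified model of early stopping (the function `simulate_early_stopping` is hypothetical and ignores options such as `min_delta`). It assumes the callback starts each `fit()` call with a fresh best value, whereas the reused `checkpoint` object keeps comparing against 3.33909.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (index where training stops or None, index of the best loss)."""
    best = float("inf")
    best_index = None
    wait = 0
    for index, loss in enumerate(val_losses):
        if loss < best:
            best, best_index, wait = loss, index, 0   # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return index, best_index
    return None, best_index

# val_loss values copied from the logs above
first_run = [3.4756, 3.3616, 3.3427, 3.3669, 3.4017]            # epochs 1-5
final_run = [3.3780, 3.3930, 3.4353, 3.4685, 3.5271, 3.5171]    # epochs 15-20

first_result = simulate_early_stopping(first_run, patience=5)
final_result = simulate_early_stopping(final_run, patience=5)

print("first run:", first_result)
print("final run:", final_result)
```

Under this simplified model, the first five epochs do not trigger a stop (best at index 2, epoch 3), while the final run reaches the patience limit exactly at its last entry, epoch 20, with its best value at epoch 15. If the real callback behaves the same way, `restore_best_weights=True` would leave the in-memory model at the epoch 15 weights, which are not the ones stored in best_model.h5.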


Evaluate the Model on Test Data:

In [9]:
test_img, test_seq, test_out = create_sequences(tokenizer, max_length, captions, features, test_ids)

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate([test_img, test_seq], test_out, verbose=1)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_acc}")


2232/2232 ━━━━━━━━━━━━━━━━━━━━ 35s 16ms/step - accuracy: 0.3885 - loss: 3.3939
Test Loss: 3.4419095516204834, Test Accuracy: 0.384755402803421


Generate Captions for Test Images:

In [15]:
def generate_caption(model, image_path, tokenizer, max_length, feature_file=None):
    feature = None

    # Prefer a cached feature vector when a feature file is given
    if feature_file:
        with open(feature_file, "rb") as f:
            features = pickle.load(f)
        img_id = os.path.basename(image_path).split('.')[0]
        feature = features.get(img_id)  # None if this image was never cached

    # Otherwise extract the features with InceptionV3.
    # base_model must already exist; cell 4 only creates it in its extraction branch.
    if feature is None:
        img = load_img(image_path, target_size=(299, 299))
        img = img_to_array(img)
        img = np.expand_dims(img, axis=0)
        img = preprocess_input(img)  # Preprocess for InceptionV3
        feature = base_model.predict(img, verbose=0).flatten()  # Flatten to match model input

    # Initialize the caption sequence with 'startseq'
    in_text = 'startseq'

    for _ in range(max_length):
        # Convert caption to sequence and pad
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length, padding='post')

        # Predict the next word (image feature is passed as a batch of 1)
        yhat = model.predict([np.array([feature]), sequence], verbose=0)
        yhat = int(np.argmax(yhat))

        # Map the predicted index back to its word (index 0 is padding and has no word)
        word = tokenizer.index_word.get(yhat)

        # Stop if no word matches or the word is 'endseq'
        if word is None or word == 'endseq':
            break

        # Append predicted word to the input caption
        in_text += ' ' + word

    # Clean up the caption
    caption = in_text.replace('startseq', '').replace('endseq', '').strip()
    return caption
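
The loop in `generate_caption` is a greedy decoder: at each step it keeps only the single most likely next word. The sketch below isolates that idea without TensorFlow. The lookup table `TOY_TABLE` is a made-up stand-in for `model.predict`, and the function names are hypothetical.

```python
def greedy_decode(next_word_scores, max_length, start="startseq", end="endseq"):
    """Repeatedly append the highest-scoring next word until `end` or max_length."""
    words = [start]
    for _ in range(max_length):
        scores = next_word_scores(words)
        word = max(scores, key=scores.get)   # greedy choice: argmax over the scores
        if word == end:
            break
        words.append(word)
    return " ".join(words[1:])               # drop the start token

# Hypothetical scores that depend only on the previous word
TOY_TABLE = {
    "startseq": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.7, "field": 0.3},
    "dog": {"runs": 0.6, "endseq": 0.4},
    "runs": {"endseq": 0.8, "a": 0.2},
}

def toy_scores(words):
    return TOY_TABLE[words[-1]]

toy_caption = greedy_decode(toy_scores, max_length=10)
short_caption = greedy_decode(toy_scores, max_length=2)
print(toy_caption)
print(short_caption)
```

Greedy decoding is simple and fast, but it can miss captions whose first words score slightly lower. Beam search is a common alternative; it is not used in this notebook.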


In [23]:
import os
import pickle
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.sequence import pad_sequences  # used in generate_caption
import numpy as np

# Function to extract image features using InceptionV3
def extract_image_features(image_paths, feature_file=None):
    base_model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')
    features = {}

    if feature_file and os.path.exists(feature_file):
        print("Loading pre-extracted features...")
        with open(feature_file, 'rb') as f:
            features = pickle.load(f)
    
    for image_path in image_paths:
        img_id = os.path.basename(image_path).split('.')[0]
        if img_id not in features:  # If feature is not already extracted
            print(f"Extracting features for new image: {image_path}")
            img = load_img(image_path, target_size=(299, 299))  # Resize to InceptionV3 input size
            img = img_to_array(img)  # Convert image to numpy array
            img = np.expand_dims(img, axis=0)  # Add batch dimension
            img = preprocess_input(img)  # Preprocess image for InceptionV3

            feature = base_model.predict(img, verbose=0)  # Get image features from InceptionV3
            features[img_id] = feature.flatten()  # Flatten the feature vector

    # Save the features to the file for future use
    if feature_file:
        with open(feature_file, 'wb') as f:
            pickle.dump(features, f)
        print("Features saved.")

    return features

# Function to generate caption for a given image
def generate_caption(model, image_path, tokenizer, max_length, features, feature_file=None):
    img_id = os.path.basename(image_path).split('.')[0]
    feature = features.get(img_id)

    if feature is None:
        print(f"Feature not found for {image_path}. Extracting features...")
        features = extract_image_features([image_path], feature_file)
        feature = features.get(img_id)

    # Initialize caption sequence with 'startseq'
    in_text = 'startseq'

    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]  # Convert to sequence
        sequence = pad_sequences([sequence], maxlen=max_length, padding='post')  # Pad sequence

        # Predict the next word in the caption
        yhat = model.predict([np.array([feature]), sequence], verbose=0)  # Model prediction
        yhat = np.argmax(yhat)  # Get the index of the highest probability

        # Map the predicted index back to its word (index 0 is padding and has no word)
        word = tokenizer.index_word.get(int(yhat))

        # Break if 'endseq' is predicted or the index has no word
        if word is None or word == 'endseq':
            break

        # Append predicted word to input sequence
        in_text += ' ' + word

    caption = in_text.replace('startseq', '').replace('endseq', '').strip()
    return caption

# Function to generate captions for a list of images
def generate_captions_for_images(model, image_paths, tokenizer, max_length, feature_file=None):
    features = extract_image_features(image_paths, feature_file)  # Extract features for all images
    for image_path in image_paths:
        print(f"Generating caption for image: {image_path}")
        try:
            caption = generate_caption(model, image_path, tokenizer, max_length, features, feature_file)
            print(f"Generated Caption: {caption}")
        except Exception as e:
            print(f"Error generating caption for {image_path}: {e}")
        print("-" * 50)

# List of images from the dataset and new test images

image_paths = [
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/335588286_f67ed8c9f9.jpg",
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/507758961_e63ca126cc.jpg",
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2021602343_03023e1fd1.jpg",
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2255685792_f70474c6db.jpg",
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2439031566_2e0c0d3550.jpg",
    "D:/Downloads/test.jpg",  # Existing test image
    "D:/Downloads/test1.jpg",  # Existing test image
    "D:/Downloads/test3.jpg"  # New test image
]

# Generate captions for all images including test3.jpg
generate_captions_for_images(model, image_paths, tokenizer, max_length, feature_file="D:/Desktop/DL project/Mini project - dataset/outputs/custom_image_features.pkl")



Loading pre-extracted features...
Extracting features for new image: D:/Downloads/test3.jpg
Features saved.
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/335588286_f67ed8c9f9.jpg
Generated Caption: a dog is running through a field of water
--------------------------------------------------
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/507758961_e63ca126cc.jpg
Generated Caption: a young boy in a red shirt is jumping over a trampoline
--------------------------------------------------
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2021602343_03023e1fd1.jpg
Generated Caption: a basketball player in a white shirt is holding a basketball
--------------------------------------------------
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2255685792_f70474c6db.jpg
G

**Hyperparameter Finalization and Fine-tuning:**

Hyperparameters:

Batch size is set to 32.

Learning rate is 0.001 (you can adjust based on training behavior).

Dropout rate is set to 0.3 for regularization.

LSTM units are set to 128.

Callbacks:

ModelCheckpoint: Saves the model weights whenever there is improvement in the validation loss.

EarlyStopping: Stops the training if validation loss does not improve for 5 epochs, and restores the best weights.

Model Compilation:

The model is compiled with the Adam optimizer at the defined learning rate and a sparse categorical cross-entropy loss, which suits multi-class classification with integer word labels.

Model Saving:

After fine-tuning, the model is saved as final_finetuned_model.h5 in the given output directory.

In [26]:
# Create the tokenizer from scratch using the captions dataset
from tensorflow.keras.preprocessing.text import Tokenizer

# Recreate the tokenizer using all captions
all_captions = [caption for cap_list in captions.values() for caption in cap_list]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)

# Save the tokenizer as a pickle file
tokenizer_path = "D:/Desktop/DL project/Mini project - dataset/outputs/tokenizer.pkl"
with open(tokenizer_path, 'wb') as f:
    pickle.dump(tokenizer, f)

# Save max_length (which you can also load again)
max_length = max(len(c.split()) for c in all_captions)
maxlen_path = "D:/Desktop/DL project/Mini project - dataset/outputs/max_length.pkl"
with open(maxlen_path, 'wb') as f:
    pickle.dump(max_length, f)

print(f"Tokenizer and max_length saved to {tokenizer_path} and {maxlen_path}")


Tokenizer and max_length saved to D:/Desktop/DL project/Mini project - dataset/outputs/tokenizer.pkl and D:/Desktop/DL project/Mini project - dataset/outputs/max_length.pkl


In [27]:
import pickle
import numpy as np

# Load tokenizer and max_length
tokenizer_path = "D:/Desktop/DL project/Mini project - dataset/outputs/tokenizer.pkl"
maxlen_path = "D:/Desktop/DL project/Mini project - dataset/outputs/max_length.pkl"

# Load tokenizer
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)

# Load max_length
with open(maxlen_path, "rb") as f:
    max_length = pickle.load(f)

# Verify the tokenizer and max_length loaded correctly
print(f"Tokenizer loaded with {len(tokenizer.word_index)} words.")
print(f"Max length for captions: {max_length}")


Tokenizer loaded with 8496 words.
Max length for captions: 40
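
The model cells later in the notebook use a vocabulary size of 8497, one more than the 8496 words reported here. A minimal sketch of why, assuming the Keras Tokenizer convention that word indices start at 1 and index 0 is reserved for padding. The toy captions and the hand-built `word_index` are illustrative and are not the notebook's tokenizer.

```python
from collections import Counter

# Two toy captions, for illustration only
toy_captions = ["startseq a dog runs endseq", "startseq a cat sits endseq"]

# Count words and assign indices starting at 1, most frequent first
counts = Counter(word for cap in toy_captions for word in cap.split())
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

# Index 0 is never assigned to a word, so the embedding and softmax layers
# need len(word_index) + 1 entries to cover indices 0..len(word_index)
vocab_size = len(word_index) + 1

print(f"{len(word_index)} words -> vocab_size = {vocab_size}")
```

Applied to the output above, 8496 words give a vocabulary size of 8496 + 1 = 8497.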


In [29]:
import numpy as np

# Save training data (train_img, train_seq, train_out) to .npy files
np.save("D:/Desktop/DL project/Mini project - dataset/outputs/train_img_data.npy", train_img)
np.save("D:/Desktop/DL project/Mini project - dataset/outputs/train_seq_data.npy", train_seq)
np.save("D:/Desktop/DL project/Mini project - dataset/outputs/train_out_data.npy", train_out)

# Save validation data (val_img, val_seq, val_out) to .npy files
np.save("D:/Desktop/DL project/Mini project - dataset/outputs/val_img_data.npy", val_img)
np.save("D:/Desktop/DL project/Mini project - dataset/outputs/val_seq_data.npy", val_seq)
np.save("D:/Desktop/DL project/Mini project - dataset/outputs/val_out_data.npy", val_out)

print("Training and validation data saved successfully to .npy files!")


Training and validation data saved successfully to .npy files!


In [32]:
import numpy as np

# Load the saved .npy files for training and validation data
train_img = np.load("D:/Desktop/DL project/Mini project - dataset/outputs/train_img_data.npy")
train_seq = np.load("D:/Desktop/DL project/Mini project - dataset/outputs/train_seq_data.npy")
train_out = np.load("D:/Desktop/DL project/Mini project - dataset/outputs/train_out_data.npy")
val_img = np.load("D:/Desktop/DL project/Mini project - dataset/outputs/val_img_data.npy")
val_seq = np.load("D:/Desktop/DL project/Mini project - dataset/outputs/val_seq_data.npy")
val_out = np.load("D:/Desktop/DL project/Mini project - dataset/outputs/val_out_data.npy")

# Print the shape of the loaded data to ensure everything is correct
print(f"Training images shape: {train_img.shape}")
print(f"Training sequences shape: {train_seq.shape}")
print(f"Training outputs shape: {train_out.shape}")
print(f"Validation images shape: {val_img.shape}")
print(f"Validation sequences shape: {val_seq.shape}")
print(f"Validation outputs shape: {val_out.shape}")

# Now you can proceed to train the model with the data you just loaded


Training images shape: (335100, 2048)
Training sequences shape: (335100, 40)
Training outputs shape: (335100,)
Validation images shape: (71484, 2048)
Validation sequences shape: (71484, 40)
Validation outputs shape: (71484,)
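
The shapes above (one row of 40 token ids and one integer target per sample) suggest that each caption was expanded into prefix and next-word pairs, with the image feature vector repeated for every pair. The sketch below shows one common way to build such pairs. The helper `make_training_pairs`, the token ids, and the choice of left-padding with zeros are assumptions for illustration; the notebook's own preprocessing may differ.

```python
import numpy as np


def make_training_pairs(token_ids, max_length):
    """Expand one tokenized caption into (padded prefix, next word) pairs."""
    inputs, targets = [], []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i]
        # Left-pad with the reserved index 0 up to max_length (assumed convention)
        padded = [0] * (max_length - len(prefix)) + prefix
        inputs.append(padded)
        targets.append(token_ids[i])  # Integer target for sparse cross-entropy
    return np.array(inputs), np.array(targets)


# Hypothetical token ids for "startseq a dog runs endseq"
caption_ids = [1, 4, 7, 9, 2]
X, y = make_training_pairs(caption_ids, max_length=8)

print(X.shape, y.shape)
print(X)
print(y)
```

A caption of n tokens yields n - 1 pairs, which is why the number of training rows (335100) is far larger than the number of images.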


Hyperparameter Tuning with Keras Tuner:
We use Keras Tuner to search for the best hyperparameters for the model, such as learning_rate, dropout_rate, and lstm_units.

In [34]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Input, add
from tensorflow.keras.optimizers import Adam

# Vocabulary size: tokenizer words plus one for the reserved padding index 0
vocab_size = len(tokenizer.word_index) + 1

# Define the model-building function with hyperparameters
def build_model_with_tuning(hp):
    # Register each hyperparameter once and reuse the sampled value
    dropout_rate = hp.Float('dropout_rate', 0.2, 0.5, step=0.1)
    lstm_units = hp.Choice('lstm_units', [128, 256, 512])
    learning_rate = hp.Choice('learning_rate', [1e-3, 1e-4, 1e-5])

    # Image feature input
    inputs1 = Input(shape=(train_img.shape[1],))  # One feature vector per sample
    fe1 = Dropout(dropout_rate)(inputs1)  # Tuned dropout rate
    fe2 = Dense(256, activation='relu')(fe1)  # Dense layer for image features

    # Caption input
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)  # Embedding layer
    se2 = Dropout(dropout_rate)(se1)  # Same tuned dropout rate as the image branch
    se3 = LSTM(lstm_units)(se2)  # Tuned number of LSTM units

    # Project the LSTM output to 256 dimensions so it can be added to the image features
    se3 = Dense(256)(se3)

    # Decoder (merging image features and caption features)
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    # Compile the model
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=Adam(learning_rate=learning_rate),
                  metrics=['accuracy'])

    return model


Running the Hyperparameter Tuning:
This code will perform hyperparameter tuning by optimizing:

dropout_rate (from 0.2 to 0.5),

lstm_units (with options 128, 256, 512),

learning_rate (with options 1e-3, 1e-4, and 1e-5).
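
To get a feel for the size of this search space, the options listed above can be enumerated directly. This is a sketch only: Keras Tuner draws its own samples internally, and the use of `random.sample` with a fixed seed here is an illustrative stand-in. The value of 5 trials matches the `max_trials` setting used in the tuner cell.

```python
import itertools
import random

# Options as listed above (dropout from 0.2 to 0.5 in steps of 0.1)
dropout_rates = [0.2, 0.3, 0.4, 0.5]
lstm_units = [128, 256, 512]
learning_rates = [1e-3, 1e-4, 1e-5]

# Every possible combination of the three hyperparameters
search_space = list(itertools.product(dropout_rates, lstm_units, learning_rates))

# A random search with 5 trials only visits a small part of the space
rng = random.Random(0)  # Fixed seed so the illustration is repeatable
sampled = rng.sample(search_space, k=5)

print(f"Total combinations: {len(search_space)}")
for dropout, units, lr in sampled:
    print(f"dropout_rate={dropout}, lstm_units={units}, learning_rate={lr}")
```

With 4 x 3 x 3 = 36 combinations, five trials cover roughly one in seven of them, so the best result found is not guaranteed to be the best in the space.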

In [35]:
import keras_tuner as kt

# Initialize the Keras Tuner
tuner = kt.RandomSearch(
    build_model_with_tuning,
    objective='val_loss',
    max_trials=5,  # Number of trials for hyperparameter search
    executions_per_trial=1,
    directory='tuner_dir',
    project_name='image_captioning_tuning'
)

# Start the hyperparameter search
tuner.search([train_img, train_seq], train_out, 
             validation_data=([val_img, val_seq], val_out),
             epochs=10,  # Number of epochs to run for each trial
             batch_size=64)

# Get the best hyperparameters
best_hp = tuner.get_best_hyperparameters(1)[0]
print("Best hyperparameters:", best_hp.values)

# Build and train the model with the best hyperparameters
model = build_model_with_tuning(best_hp)
history = model.fit([train_img, train_seq], train_out, 
                    validation_data=([val_img, val_seq], val_out), 
                    epochs=20, batch_size=64)


Trial 1 Complete [01h 09m 05s]
val_loss: 3.836247444152832

Best val_loss So Far: 3.836247444152832
Total elapsed time: 01h 09m 05s

Search: Running Trial #2

Value             |Best Value So Far |Hyperparameter
0.4               |0.2               |dropout_rate
128               |256               |lstm_units
0.001             |1e-05             |learning_rate

Epoch 1/10
5236/5236 ━━━━━━━━━━━━━━━━━━━━ 352s 67ms/step - accuracy: 0.2740 - loss: 4.3605 - val_accuracy: 0.3564 - val_loss: 3.5096
Epoch 2/10
1169/5236 ━━━━━━━━━━━━━━━━━━━━ 4:22 65ms/step - accuracy: 0.3582 - loss: 3.2570

KeyboardInterrupt: 
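
The log above makes it possible to estimate how long the full search would take. This is a rough back-of-the-envelope sketch that assumes every epoch takes about as long as the 352 s epoch logged for trial 2; actual times vary with the sampled LSTM size and hardware load.

```python
import math

n_train = 335100          # Training samples, from the loaded array shapes
batch_size = 64           # Batch size passed to tuner.search
seconds_per_epoch = 352   # First epoch of trial 2 in the log above
epochs_per_trial = 10
max_trials = 5

# Number of batches per epoch; the last batch is partial
steps_per_epoch = math.ceil(n_train / batch_size)

# Estimated wall-clock time for the whole search
total_hours = seconds_per_epoch * epochs_per_trial * max_trials / 3600

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Estimated search time: {total_hours:.1f} hours")
```

The computed step count matches the 5236 steps shown in the progress bar, and the estimate comes out at just under five hours for five trials.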

Build the CNN-LSTM Model

In [56]:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, add
from tensorflow.keras.models import Model

# Function to build the CNN-LSTM model
def build_model(vocab_size, max_length, feature_size):
    # Image feature input
    inputs1 = Input(shape=(feature_size,))
    fe1 = Dropout(0.5)(inputs1)  # Apply dropout for regularization
    fe2 = Dense(256, activation='relu')(fe1)  # Dense layer for image features

    # Caption input
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)  # Embedding layer for captions
    se2 = Dropout(0.5)(se1)  # Dropout layer for regularization
    se3 = LSTM(256)(se2)  # LSTM layer for caption generation

    # Merge image features and caption features
    merged = add([fe2, se3])  # Merging image features and LSTM output
    dense1 = Dense(256, activation='relu')(merged)
    outputs = Dense(vocab_size, activation='softmax')(dense1)  # Final softmax layer for output

    # Build and compile the model
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

# Example usage:
vocab_size = len(tokenizer.word_index) + 1  # 8496 words + 1 for the padding index 0 = 8497
max_length = 40      # Longest caption length, as loaded from max_length.pkl
feature_size = 2048  # Image feature size from InceptionV3 (matches train_img.shape[1])

# Build the model
model = build_model(vocab_size, max_length, feature_size)

# Print model summary
model.summary()
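
The output of model.summary() is not captured here, but the parameter counts can be cross-checked by hand. The sketch below assumes the standard Keras formulas for Dense, Embedding, and LSTM layers and the sizes used in the cell above (vocabulary 8497, feature size 2048, 256 units everywhere); the helper names are illustrative.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out


def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary index
    return vocab * dim


def lstm_params(n_in, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((n_in + units) * units + units)


vocab_size = 8497
feature_size = 2048

param_counts = {
    "image_dense": dense_params(feature_size, 256),
    "embedding": embedding_params(vocab_size, 256),
    "lstm": lstm_params(256, 256),
    "decoder_dense": dense_params(256, 256),
    "output_dense": dense_params(256, vocab_size),
}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(f"{name:>14}: {count:,}")
print(f"{'total':>14}: {total_params:,}")
```

The embedding and the output layer each hold about 2.2 million parameters, so the vocabulary size dominates the model size far more than the LSTM does.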


Cross-validation Setup

In [57]:
from sklearn.model_selection import train_test_split
import numpy as np

# Quick validation on a single 80/20 hold-out split of the training data.
# Note: despite the name, this is not k-fold cross-validation; n_splits is currently unused.
def quick_cross_validate_model(model, train_img, train_seq, train_out, n_splits=2):
    # Split the data once into training and validation sets
    X_train_img, X_val_img, X_train_seq, X_val_seq, X_train_out, X_val_out = train_test_split(
        train_img, train_seq, train_out, test_size=0.2, random_state=42
    )

    # Train the model on the training data
    history = model.fit(
        [X_train_img, X_train_seq], X_train_out,
        validation_data=([X_val_img, X_val_seq], X_val_out),
        epochs=5,  # Reduced number of epochs to save time
        batch_size=64,
        verbose=1
    )
    
    # Evaluate the model on the validation set
    val_loss, val_acc = model.evaluate([X_val_img, X_val_seq], X_val_out, verbose=0)
    print(f"Validation Loss: {val_loss}")
    print(f"Validation Accuracy: {val_acc}")

# Assuming 'model' is already built and 'train_img', 'train_seq', 'train_out' are available
quick_cross_validate_model(model, train_img, train_seq, train_out)


Epoch 1/5
4189/4189 ━━━━━━━━━━━━━━━━━━━━ 354s 84ms/step - accuracy: 0.2621 - loss: 4.4691 - val_accuracy: 0.3551 - val_loss: 3.5048
Epoch 2/5
4189/4189 ━━━━━━━━━━━━━━━━━━━━ 334s 80ms/step - accuracy: 0.3579 - loss: 3.2806 - val_accuracy: 0.3755 - val_loss: 3.3240
Epoch 3/5
4189/4189 ━━━━━━━━━━━━━━━━━━━━ 339s 81ms/step - accuracy: 0.3793 - loss: 2.9879 - val_accuracy: 0.3861 - val_loss: 3.2853
Epoch 4/5
4189/4189 ━━━━━━━━━━━━━━━━━━━━ 344s 82ms/step - accuracy: 0.3932 - loss: 2.8094 - val_accuracy: 0.3927 - val_loss: 3.2905
Epoch 5/5
4189/4189 ━━━━━━━━━━━━━━━━━━━━ 351s 84ms/step - accuracy: 0.4039 - loss: 2.6813 - val_accuracy: 0.3960 - val_loss: 3.3302
Validation Loss: 3.330164909362793
Validation Accuracy: 0.395956426858902
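
The run above trains once on a single 80/20 split. For comparison, a k-fold setup rotates the validation fold so that every sample is validated exactly once. The sketch below only generates the index splits with NumPy; the generator `kfold_indices` is a hypothetical helper, and building and fitting a fresh model per fold is left out.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=2, seed=42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold splitting."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        # All remaining folds together form the training set
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx


# Small illustration with 10 samples and 2 folds
splits = list(kfold_indices(10, n_splits=2))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: {len(train_idx)} train, {len(val_idx)} validation samples")
```

In practice each fold would index the arrays as `train_img[train_idx]`, `train_seq[train_idx]`, and so on, with the model rebuilt before each fold so that weights do not carry over. Because the rows here are prefix and next-word pairs, splitting by row can place pairs from the same image in both sets; splitting by image ID before expanding the pairs would likely give a more honest validation score.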


Model Evaluation on Test Set

In [58]:
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate([test_img, test_seq], test_out, verbose=1)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_acc}")


2232/2232 ━━━━━━━━━━━━━━━━━━━━ 40s 18ms/step - accuracy: 0.3888 - loss: 3.4877
Test Loss: 3.5352697372436523, Test Accuracy: 0.3839573264122009
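
The test loss is easier to interpret as a perplexity. Since the cross-entropy loss is measured in nats, the perplexity is its exponential:

$$\text{perplexity} = e^{\text{loss}}$$

A minimal sketch of the conversion, using the test loss printed above:

```python
import math

test_loss = 3.5352697372436523  # Test loss reported above (cross-entropy in nats)

# Perplexity: the effective number of equally likely next words
perplexity = math.exp(test_loss)

print(f"Perplexity: {perplexity:.1f}")
```

A perplexity of roughly 34 means that, on average, the model is about as uncertain as if it were choosing uniformly among 34 words out of the 8497 in the vocabulary. Both this figure and the accuracy are next-word scores measured with the true caption prefix as input, so they do not directly measure the quality of full generated captions.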


Generate Captions for Test Images

In [60]:
# Function to generate captions for a list of test images
def generate_captions_for_test_images(model, image_paths, tokenizer, max_length, feature_file="D:/Desktop/DL project/Mini project - dataset/outputs/custom_image_features.pkl"):
    features = extract_image_features(image_paths, feature_file)  # Extract features for test images
    for image_path in image_paths:
        print(f"Generating caption for image: {image_path}")
        try:
            caption = generate_caption(model, image_path, tokenizer, max_length, features, feature_file)
            print(f"Generated Caption: {caption}")
        except Exception as e:
            print(f"Error generating caption for {image_path}: {e}")
        print("-" * 50)

# List of images to generate captions for
image_paths = [
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/335588286_f67ed8c9f9.jpg",
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/507758961_e63ca126cc.jpg",
    "D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2021602343_03023e1fd1.jpg"
]

# Generate captions for the test images
generate_captions_for_test_images(model, image_paths, tokenizer, max_length)


Loading pre-extracted features...
Features saved.
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/335588286_f67ed8c9f9.jpg
Generated Caption: a dog is running through the sand
--------------------------------------------------
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/507758961_e63ca126cc.jpg
Generated Caption: a little girl in a pink shirt is jumping on a trampoline
--------------------------------------------------
Generating caption for image: D:/Desktop/DL project/Mini project - dataset/Mini project - dataset/Images/2021602343_03023e1fd1.jpg
Generated Caption: a man in a white shirt is playing a game of basketball players
--------------------------------------------------
