<a href="https://colab.research.google.com/github/DRKAFLE123/ImageCaptioning/blob/main/ImageCaptioningFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Image Captioning[Computer Vision + NLP]
- What is Image Captioning ?
---
Image Captioning is the process of generating textual description of an image. It uses both Natural Language Processing and Computer Vision to generate the captions.
This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
### CNNs + RNNs (LSTMs)

To perform Image Captioning we will require two deep learning models combined into one for the training purpose
- CNNs extract the features from the image of some vector size aka the vector embeddings. The size of these embeddings depend on the type of pretrained network being used for the feature extraction

- LSTMs are used for the text generation process. The image embeddings are concatenated with the word embeddings and passed to the LSTM to generate the next word
For a more illustrative explanation of this architecture check the Modelling section for a picture representation

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### I am using Flickr8K [image-caption] dataset from kaggle which consisits of 8000+ images. with 5 captions for each images

### We have taken less image dataset so we will be using Transfer Learning techniques with pretrained model like CNNS(Resnet50) which is trained in 'Imagenet',RNNS(LSTM) for text processing and generating

### Transfer learning is a technique that can be used to improve the performance of a machine learning model when there is a limited amount of training data available. The idea behind transfer learning is to use a pre-trained model that has been trained on a large dataset of images, such as ImageNet, and then fine-tune the model on the smaller dataset.

In [2]:
# !pip install tensorflow

In [3]:
# !pip install tensorflow
# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install pillow
# !pip install nltk


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint
from nltk.translate.bleu_score import corpus_bleu
import pickle
import os
import urllib.request
import warnings
warnings.filterwarnings("ignore")


In [5]:
# image_dir = '/content/drive/MyDrive/IMAGE-Caption_DATASET/images'
# caption_file = '/content/drive/MyDrive/IMAGE-Caption_DATASET/captions.txt'
# df = pd.read_csv(caption_file)
# df.head()


In [6]:
# Downloading the pre-trained ResNet50 model
# CNN_model = ResNet50(weights='imagenet', include_top=False, input_tensor=Input(shape=(224, 224, 3)))



In [7]:
# CNN_model.summary()

 Output shape:(None, 7, 7, 2048)    

```
# This is formatted as code
```



### Declaring the functions for cleaning captions

In [8]:
import re

#Decelaring function to clean or process the text of caption
def preprocess_caption(caption):
    # print("Before preprocessing:", caption)
    # Handling contractions
    contractions = {
        "I'll": "I will",
        "can't": "cannot",
        "don't": "do not",
        "isn't" : "is not",
        "it's": "it is",
        # Add more contractions as needed
    }
    words = caption.split()
    expanded_words = [contractions.get(word, word) for word in words]
    caption = ' '.join(expanded_words)

    # Text preprocessing using regular expressions
    caption = caption.lower()
    caption = re.sub(r'\d+', '', caption)  # Remove numbers in captions
    caption = re.sub(r'[^\w\s]', '', caption)  # Remove punctuation marks
    caption = re.sub(r'\b(s|a)\b', '', caption)  # Remove hanging "s" and "a"
    caption = "<start> " + caption.strip() + " <end>"  # Add start and end tokens
    # # Write proceesed_captionthe processed captions to the text file
    # with open('/content/drive/MyDrive/IMAGE-Caption_DATASET/proceesed_caption.txt', 'w') as file:
    #     for caption in caption:
    #         file.write(caption + '\n')
    return caption


    # for taking single image id with merged captions
def preprocess_data(image_dir, caption_file):
    print("Iam here")
    df = pd.read_csv(caption_file)
    # print(df.tail())
    print(df.columns) #from here we known that columns has one name ['image,caption']
    # Split the 'image,caption' column into two separate columns
    # df[['image', 'caption']] = df['image,caption'].str.split(',', expand=True)
    # print("Iam here")  #only to debug

    print(df['image'].unique())
    print("Iam here")

    image_paths = []
    captions = []

    unique_image_ids =df['image'].unique()  #stores unique image ids
    for image_id in unique_image_ids:
        image_id_captions = df[df['image'] == image_id]['caption'].values.tolist()  #converts all captions related to this unique keys values to list
        # print(image_id_captions[0])
        image_path = os.path.join(image_dir, image_id)
        merged_caption = preprocess_caption(' '.join(str(caption) for caption in image_id_captions))
 #function call for cleaning

        image_paths.append(image_path)
        captions.append(merged_caption)

    return image_paths,captions



# def preprocess_data(image_dir, caption_file):
#     df = pd.read_csv(caption_file)

#     image_paths = []
#     captions = []


#     for i in range(len(df)):
#         image_id = df["image"][i]  # data from image column
#         image_path = os.path.join(image_dir, image_id)
#         #The os.path.join() function takes two or more paths as input and returns a new path that is the concatenation of the input paths.
#         caption = df["caption"][i]  # It reads all caption from caption columns in Dataset

#         image_paths.append(image_path)  # It will add in empty lsit created above
#         captions.append(preprocess_caption(caption))  # we are calling function  [preprocess_caption]
#         #after separating captions from dataset for preprocessing caption text
#         # print(image_id)

#     return image_paths, captions

In [9]:
image_dir = '/content/drive/MyDrive/IMAGE-Caption_DATASET/images'
caption_file = '/content/drive/MyDrive/IMAGE-Caption_DATASET/captions.txt'

image_paths, captions = preprocess_data(image_dir, caption_file)
# Function call started from here...
# here it will get image_path with image directory + image id from dataset
# seperate captions after processed ,

# for caption in captions:
print(captions[0])

Iam here
Index(['image', 'caption'], dtype='object')
['1000268201_693b08cb0e.jpg' '1001773457_577c3a7d70.jpg'
 '1002674143_1b742ab4b8.jpg' ... '3333675897_0043f992d3.jpg'
 '3333826465_9c84c1b3c6.jpg' '3333921867_6cc7d7c73d.jpg']
Iam here
<start> child in  pink dress is climbing up  set of stairs in an entry way   girl going into  wooden building   little girl climbing into  wooden playhouse   little girl climbing the stairs to her playhouse   little girl in  pink dress going into  wooden cabin <end>


In [10]:
type(captions)

list

In [11]:
print(len(captions))
print(len(image_paths))

5112
5112


### Building Vocabulary

In [12]:
def build_vocabulary(captions):
    vocab = set()
    for caption in captions:
        if caption is not None:
            words = caption.split()
            vocab.update(words)
    return vocab

vocabulary = build_vocabulary(captions)
print("Length of vocabulary =", len(vocabulary))
print(vocabulary)


Length of vocabulary = 7125


### Tokenization

In [13]:
tokenizer = Tokenizer()  #object of class Tokenizer()
tokenizer_vocab=tokenizer.fit_on_texts(captions)  #from the keras.preprocessing.text
#this function uses word tokenization to create a vocabulary of words that are used in the captions.

# # Add special tokens "<start>" and "<end>" to the tokenizer's vocabulary
tokenizer.word_index["<start>"] = len(tokenizer.word_index) + 1
tokenizer.word_index["<end>"] = len(tokenizer.word_index) + 1

vocab_size = len(tokenizer.word_index) + 1   # + 1 operator adds 1 to the vocabulary size to include the <unk> token.
max_length = max(len(caption.split()) for caption in captions)

print(max_length)
# captions = tokenizer.word_index
print(captions)
print(tokenizer_vocab)

89
None


In [14]:
#Evaluating vocabulary:
print("Words in build_vocabulary():", vocabulary)
print("Words in Tokenizer's vocabulary:", tokenizer.word_index)

# Compare the difference in words
diff_words = vocabulary.symmetric_difference(tokenizer.word_index)
print("Different words:", diff_words)



Different words: set()


### Load Image fun and convert to array

In [15]:
def load_image(image_path):
    img = load_img(image_path, target_size=(224,224))     #load_img()
    img = img_to_array(img)  # It will convert to (224,224,3) array
    img = np.expand_dims(img, axis=0)
    # The final step is to expand the dimensions of the array by adding a new axis at the beginning.
    #This is done because the CNN expects the input data to be in a 4D tensor with the shape (batch_size, height, width, channels).
    img = tf.keras.applications.resnet50.preprocess_input(img)
    # The preprocess_input() function from the tf.keras.applications.resnet50 module applies the ResNet50 preprocessing to the image.
    # This includes resizing the image to 224x224x3,and normalizing by subtracting the mean from each channel, and dividing by the standard deviation.
    # img = img/255.0  #another way of normalizing but we have used resnet50.preprocess_input which also normalize
    return img #returns 4D like:[1,224,224,3] here 1 is batch size
    # return Preprocessed numpy.array or a tf.Tensor with type float32



# Feature Extraction

In [16]:
# Downloading the pre-trained ResNet50 model
CNN_model = ResNet50(weights='imagenet', include_top=False, input_tensor=Input(shape=(224, 224, 3)))



Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


# Batch Processing

What is Embedding Layer?

Embedding layer is one of the available layers in Keras. This is mainly used in Natural Language Processing related applications such as language modeling, but it can also be used with other tasks that involve neural networks. While dealing with NLP problems, we can use pre-trained word embeddings such as GloVe. Alternatively we can also train our own embeddings using Keras embedding layer.
- Need of Embeddings
Word embeddings can be thought of as an alternate to one-hot encoding along with dimensionality reduction.

### Defining the captioning model

In [18]:


# Load the pre-trained ResNet50 model
# model = ResNet50(weights='imagenet', include_top=False)

# Define a function to extract features from a batch of images
def extract_features(image_paths, batch_size=32):
    features = []
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i+batch_size]
        batch_images = []

        for path in batch_paths:
            img = load_image(path)  # function call which is described above
            batch_images.append(img)
        batch_images = np.concatenate(batch_images, axis=0)
        batch_features = CNN_model.predict(batch_images)
        batch_features = np.reshape(batch_features, (batch_features.shape[0],-1))
        features.append(np.squeeze(batch_features))
        # Handle the last batch which might have a smaller size
        if i + batch_size >= len(image_paths):
            batch_features = batch_features[:len(image_paths) - i]

        features.extend(batch_features)
    features = np.array(features)
    # with open('/content/drive/MyDrive/IMAGE-Caption_DATASET/features.txt', 'w') as file:
    # file.write(features + '\n')
    return features

# # Example usage
# image_paths = ['/content/drive/MyDrive/IMAGE-Caption_DATASET/images/1000268201_693b08cb0e.jpg']
image_features = extract_features(image_paths)
# print(len(image_features))
# print(captions)




In [22]:
# Specify the file path to save the features
save_path = '/content/drive/MyDrive/IMAGE-Caption_DATASET/imagefeature.npy'

# Save the features to a file
np.save(save_path, image_features)

# Print a message to confirm the save
print("Features saved to:", save_path)

Features saved to: /content/drive/MyDrive/IMAGE-Caption_DATASET/imagefeature.npy


In [26]:
# Specify the file path where the features are saved
load_path = '/content/drive/MyDrive/IMAGE-Caption_DATASET/imagefeature.npy'

# Load the features from the file
image_features = np.load(load_path)

# Print a message to confirm the load
print("Features loaded from:", load_path)

ValueError: ignored

In [27]:
print(image_features.shape)
print(image_features[1])

(5272,)
[0.        0.        0.        ... 0.        1.7696791 0.       ]


In [28]:
# # Flatten the features
# flattened_features = image_features.flatten()

# # Print the shape of the flattened features
# print("Shape of flattened features:", flattened_features.shape

# images['1000268201_693b08cb0e']

In [30]:
# Find the indices of mismatched samples
mismatched_indices = []
for i in range(len(image_features)):
  if i >= len(captions) or i >= len(image_features):
      mismatched_indices.append(i)

# # Remove the mismatched samples
image_features = np.delete(image_features, mismatched_indices, axis=0)
captions = np.delete(captions, mismatched_indices, axis=0)

print("Number of mismatched samples: ", len(mismatched_indices))
print("Number of remaining samples: ", len(image_features))


Number of mismatched samples:  0
Number of remaining samples:  5112


In [31]:
#  Remove the mismatched samples
num_images = len(image_features)
image_features = image_features[:num_images]
captions = captions[:num_images]
print(len(image_features))
print(len(captions))

5112
5112


In [32]:
# Verify the number of samples
if len(image_features) != len(captions):
    raise ValueError("Number of images and captions does not match!")


In [33]:
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense
from tensorflow import keras
from sklearn.model_selection import train_test_split
import tensorflow as tf
# import numpy as np

# Define captioning model
def create_caption_model(vocab_size, max_length, embedding_dim):
    # Image input
    image_input = Input(shape=(2048,))
    # Image feature embedding
    image_model = Dense(embedding_dim, activation='relu')(image_input)
    # Sequence input
    caption_input = Input(shape=(max_length,))
    # Caption embedding
    caption_model = Embedding(vocab_size, embedding_dim, mask_zero=True)(caption_input)
    caption_model = LSTM(embedding_dim)(caption_model)
    # Merge image and caption models
    merged = keras.layers.concatenate([image_model, caption_model])
    # Language model
    language_model = Dense(embedding_dim, activation='relu')(merged)
    output = Dense(vocab_size, activation='softmax')(language_model)
    # Compile the model
    model = Model(inputs=[image_input, caption_input], outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

# Define training data
# image_features = image_features
# captions = captions
print(image_features.shape)
# print(captions.shape)
# # Define vocabulary size and maximum caption length
vocab_size =7125
max_length = 89

# Define embedding dimension
embedding_dim = 256

# # Verify the number of samples
# num_samples = len(image_features)  # or captions, assuming they have the same length
# print(num_samples)
# len_caption = len(captions)
# print(len_caption)

# if num_samples != len(captions):
#     raise ValueError("Number of images and captions does not match!")
#  Split the data into training and validation sets
train_image_features, val_image_features, train_captions, val_captions = train_test_split(image_features, captions, test_size=0.2, random_state=42)


# Prepare the data
train_image_input = np.array(train_image_features)
train_caption_input = np.array(train_captions[ :-1])
train_caption_output = np.array(train_captions[ 1:])

val_image_input = np.array(val_image_features)
val_caption_input = np.array(val_captions[:-1])
val_caption_output = np.array(val_captions[1:])

# train_captions= captions

train_captions_tokenized = tokenizer.texts_to_sequences(train_caption_input) #Transforms each text in texts to a sequence of integers(list).
train_captions_padded = pad_sequences(train_captions_tokenized, padding='post') #List is return to Numpy array with shape (len(sequences), maxlen)-2D NumPy array
captions_input = train_captions_padded[ :-1] # upto n-1 captions 0f 2d numphy array
captions_output = train_captions_padded[ 1:]  #upto n of 2D NumPy array
 # Create the captioning model
caption_model = create_caption_model(vocab_size, max_length, embedding_dim)



# train_images = train_image_features
# train_images = np.array(train_images)  #list image to numpy array shape

#  Trim the input arrays to have the same number of samples
num_samples = min(len(train_image_input), len(train_caption_input), len(train_caption_output))
train_image_input = train_image_input[:num_samples]
train_caption_input = train_caption_input[:num_samples]
train_caption_output = train_caption_output[:num_samples]

# Verify the number of samples
if len(train_image_input) != len(train_caption_input) or len(train_image_input) != len(train_caption_output):
    raise ValueError("Number of samples in input arrays do not match!")


# Train the captioning model
caption_model.fit([train_image_input, train_caption_input], train_caption_output,
                  validation_data=([val_image_input, val_caption_input], val_caption_output),
                  epochs=10, batch_size=32)

# Save the trained model
caption_model.save('caption_model.h5')

# Generate captions for new images
def generate_caption(image_features):
    start_token = '<start>'
    end_token = '<end>'
    caption_in = np.zeros((1, max_length))
    caption_in[0, 0] = word_to_index[start_token]
    caption = []
    for i in range(1, max_length):
        caption_out = caption_model.predict([image_features, caption_in])
        next_word_index = np.argmax(caption_out[0, i, :])
        next_word = index_to_word[next_word_index]
        caption.append(next_word)
        caption_in[0, i] = next_word_index
        if next_word == end_token:
            break
    return ' '.join(caption)

# Load the trained captioning model
trained_model = keras.models.load_model('caption_model.h5')

# Generate captions for new images
new_image_features = np.load('new_image_features.npy')
generated_caption = generate_caption(new_image_features)
print(generated_caption)


(5112,)


ValueError: ignored

In [34]:
epochs = 10
batch_size = 64

checkpoint = ModelCheckpoint('model.h5', monitor='loss', save_best_only=True)

history = model.fit(
    [train_images, train_captions_input],
    train_captions_output,
    epochs=epochs,
    batch_size=batch_size,
    callbacks=[checkpoint]
)


NameError: ignored