# Image Captioning using the Flickr8k Dataset
A neural network built with Keras and trained on the Flickr8k dataset. Given an image, the model outputs a caption describing it.

Since each image has multiple captions corresponding to it, we'll be using an approach called multi-caption training in which we create multiple training examples for each image, where each example pairs the image with one of its corresponding captions.
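
As a minimal sketch of multi-caption training, each image is simply repeated once per caption. The filenames and captions below are made up for illustration and are not taken from the dataset.

```python
# Hypothetical mapping from image filename to its captions.
captions_by_image = {
    "img_a.jpg": ["a dog runs", "a brown dog on grass"],
    "img_b.jpg": ["two cats sleeping"],
}

# Flatten the mapping into one (image, caption) training example per caption.
training_pairs = [
    (filename, caption)
    for filename, image_captions in captions_by_image.items()
    for caption in image_captions
]

for filename, caption in training_pairs:
    print(filename, "->", caption)
```

An image with five captions therefore contributes five examples that share the same image input.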

In [12]:
# Data Handling
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import img_to_array, load_img

# Models and Layers
from keras.models import Model
from keras.applications import MobileNetV2

# Misc
from tqdm import tqdm

Our image data is stored in the `flickr8k/Images` directory in this repository and our captions are stored in `flickr8k/captions.txt`. Let's read the captions and create a map from the filenames to the captions.

In [13]:
# Read the data.
df = pd.read_csv('flickr8k/captions.txt', header=0, delimiter=',', names=['image', 'caption'])

# Create a map of image to captions.
captions = df.groupby('image')['caption'].apply(list).to_dict()

# Display the captions for a sample image.
captions["1001773457_577c3a7d70.jpg"]

['A black dog and a spotted dog are fighting',
 'A black dog and a tri-colored dog playing with each other on the road .',
 'A black dog and a white dog with brown spots are staring at each other in the street .',
 'Two dogs of different breeds looking at each other on the road .',
 'Two dogs on pavement moving toward each other .']
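
To see what the `groupby` step produces, here is a small self-contained sketch on made-up rows (the filenames and captions are hypothetical). It assumes the file starts with a header row, which `header=0` together with `names` replaces.

```python
import io

import pandas as pd

# Hypothetical stand-in for flickr8k/captions.txt.
raw = io.StringIO(
    "image,caption\n"
    "img_a.jpg,A dog runs .\n"
    "img_a.jpg,A brown dog on grass .\n"
    "img_b.jpg,Two cats sleeping .\n"
)

# header=0 consumes the first row; names supplies the column labels.
toy_df = pd.read_csv(raw, header=0, delimiter=",", names=["image", "caption"])

# Group the caption column by filename and collect each group into a list.
toy_captions = toy_df.groupby("image")["caption"].apply(list).to_dict()
print(toy_captions)
```

Each key is a filename and each value is the list of captions for that image, in file order.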

## Label Preprocessing
First we need to preprocess the labels, cleaning them and converting them into vectors of integers. Here, we lowercase each caption and wrap it in `<start>` and `<end>` tokens; punctuation is stripped later by the Keras tokenizer. We also add each caption to a list called `all_captions` that we can pass to the tokenizer.

In [14]:
# Create a list of all the captions.
all_captions = []

# Iterate over all the filenames.
for filename, img_captions in captions.items():
    # Iterate over each caption for the specific file.
    for i, caption in enumerate(img_captions):
        # Convert the caption to lowercase.
        caption = caption.lower()

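        # NOTE: str.replace matches its argument literally, not as a regular
        # expression, so the two calls below leave the caption unchanged (the
        # punctuation is still present in the output). The Tokenizer's default
        # filters strip punctuation later.
        caption = caption.replace("[^a-z]+", "")
        caption = caption.replace(r"\s+", " ")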

        # Add starting and ending tokens.
        caption = f"<start>{caption}<end>"

        # Update the caption.
        img_captions[i] = caption

        # Append the caption to the list of all captions.
        all_captions.append(caption)

# Display the updated caption.
captions["1001773457_577c3a7d70.jpg"]

['<start>a black dog and a spotted dog are fighting<end>',
 '<start>a black dog and a tri-colored dog playing with each other on the road .<end>',
 '<start>a black dog and a white dog with brown spots are staring at each other in the street .<end>',
 '<start>two dogs of different breeds looking at each other on the road .<end>',
 '<start>two dogs on pavement moving toward each other .<end>']
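
If the intent is to strip punctuation at this stage, `str.replace` cannot do it, since it matches its argument literally; a regular-expression substitution can. The following is a minimal sketch using Python's `re.sub`. The helper name `clean_caption` is hypothetical, and applying it would change the captions and token IDs shown in this notebook (for example, `tri-colored` becomes `tricolored`).

```python
import re


def clean_caption(caption: str) -> str:
    """Lowercase, drop non-letter characters, collapse spaces, add tokens."""
    caption = caption.lower()
    # Keep only lowercase letters and whitespace.
    caption = re.sub(r"[^a-z\s]+", "", caption)
    # Collapse runs of whitespace and trim the ends.
    caption = re.sub(r"\s+", " ", caption).strip()
    return f"<start> {caption} <end>"


print(clean_caption("A black dog and a tri-colored dog playing with each other on the road ."))
```

The character class keeps whitespace (`\s`) so that words are not fused together when punctuation is removed.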

Now we use the Keras `Tokenizer` to tokenize the text, transforming each caption into a sequence of integers. This is necessary because neural networks operate on numeric tensors, not raw strings.

In [15]:
# Initialize the tokenizer. Leave num_words unset so the full vocabulary is kept.
tokenizer = Tokenizer()

# Fit the tokenizer on the captions.
tokenizer.fit_on_texts(all_captions)

# Iterate over all of the files.
for filename in captions.keys():
    # Replace the raw text captions with the sequences.
    captions[filename] = tokenizer.texts_to_sequences(captions[filename])

# Display the tokenized captions for the sample image.
captions["1001773457_577c3a7d70.jpg"]

[[3, 1, 15, 9, 8, 1, 842, 9, 17, 343, 2],
 [3, 1, 15, 9, 8, 1, 1574, 235, 9, 34, 10, 137, 82, 6, 5, 151, 2],
 [3, 1, 15, 9, 8, 1, 14, 9, 10, 27, 1000, 17, 640, 22, 137, 82, 4, 5, 72, 2],
 [3, 13, 31, 12, 740, 2651, 89, 22, 137, 82, 6, 5, 151, 2],
 [3, 13, 31, 6, 726, 804, 321, 137, 82, 2]]
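
To make the mapping from words to integers concrete, here is a simplified pure-Python stand-in for what the tokenizer does: lowercase the text, replace punctuation with spaces, split on whitespace, then number the words by descending frequency starting at 1. The function names are hypothetical and the filter string is assumed to match the Keras default; this is an illustration, not the Keras implementation.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed Keras default).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans(FILTERS, " " * len(FILTERS))


def split_words(text):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    return text.lower().translate(_TO_SPACE).split()


def fit_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(word for text in texts for word in split_words(text))
    ranked = sorted(counts, key=lambda word: -counts[word])  # stable for ties
    return {word: index for index, word in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


toy_texts = ["<start>a black dog<end>", "<start>a dog runs .<end>"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Because `<`, `>`, `.`, and `-` are all filtered, `<start>a` becomes the two tokens `start` and `a`, the lone `.` disappears, and a hyphenated word such as `tri-colored` would become two tokens, which matches the sequence lengths in the output above.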

## Image Preprocessing
Now that we've done some basic setup of the labels, it's time to focus on the images. First, we'll load the `MobileNetV2` convolutional neural network with weights pre-trained on ImageNet.

In [16]:
image_net = MobileNetV2(weights='imagenet', include_top=True)
image_net.summary()

Model: "mobilenetv2_1.00_224"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 Conv1 (Conv2D)                 (None, 112, 112, 32  864         ['input_2[0][0]']                
                                )                                                                 
                                                                                                  
 bn_Conv1 (BatchNormalization)  (None, 112, 112, 32  128         ['Conv1[0][0]']                  
                                )                                              

Now that we've loaded the pre-trained model, we need to build our own model on top of it. We only want the image features produced by the pre-trained CNN, so we take the output of its second-to-last layer and drop the final classification layer. Training on these compact feature vectors instead of raw pixels is far cheaper, since the CNN runs only once per image.

In [17]:
# Model to extract features from the images.
model = Model(inputs=image_net.input, outputs=image_net.layers[-2].output)
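
Assuming `image_net.layers[-2]` is MobileNetV2's global-average-pooling layer (the truncated summary above does not show it, so treat this as an assumption), the feature extractor collapses the final convolutional feature map into one vector per image. Here is a NumPy sketch of that pooling step on random data, with the 7x7x1280 feature-map shape also assumed for a 224x224 input:

```python
import numpy as np

# Random stand-in for the last convolutional feature map:
# (batch, height, width, channels). The shape is an assumption.
rng = np.random.default_rng(0)
feature_map = rng.random((1, 7, 7, 1280))

# Global average pooling: average over the two spatial axes.
pooled = feature_map.mean(axis=(1, 2))

# Drop the batch dimension to get one flat feature vector.
feature_vector = pooled[0]
print(pooled.shape, feature_vector.shape)
```

Under this assumption, each image is represented by a single 1280-value vector rather than by its 224x224x3 pixels.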

Next, we'll create a dictionary that maps each image filename to the features produced by the `MobileNetV2`-based model. We cache the features because the same image is reused across many training instances, each with a slightly different caption input and label. This will make more sense later.

In [18]:
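# MobileNetV2 expects pixel values scaled to [-1, 1].
from keras.applications.mobilenet_v2 import preprocess_input

# Create a dictionary to map the image name to the features.
image_features = {}

# Iterate over all the images.
for filename in tqdm(captions.keys()):
    # Load the image at MobileNetV2's expected input size of 224x224.
    img = load_img(f"./flickr8k/Images/{filename}", target_size=(224, 224))

    # Convert the image to a float array of shape (224, 224, 3).
    x = img_to_array(img)

    # Add a batch dimension: (1, 224, 224, 3).
    x = np.expand_dims(x, axis=0)

    # Scale pixel values from [0, 255] to [-1, 1], as the pre-trained weights expect.
    x = preprocess_input(x)

    # Compute the features and drop the batch dimension.
    features = model.predict(x, verbose=0)
    features = np.reshape(features, (features.shape[1],))

    # Store the features array.
    image_features[filename] = features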

100%|██████████| 8091/8091 [04:28<00:00, 30.09it/s]


## Train-test Split
Now that we've preprocessed all of our data, it's time to split it up in preparation for training.

In [19]:
# Separate our data into training and testing data.
# 7000 training images and 1091 test images.
train_images = list(captions.keys())[:7000]
test_images = list(captions.keys())[7000:]

len(train_images), len(test_images)

(7000, 1091)
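
The slice above splits the filenames in dictionary order. If a shuffled split is preferred, a seeded shuffle keeps it reproducible. This is a sketch of an alternative, not what the notebook does; the helper name `split_filenames` and the seed value are hypothetical.

```python
import random


def split_filenames(filenames, n_train, seed=42):
    """Shuffle filenames deterministically, then split into train and test."""
    shuffled = sorted(filenames)  # fixed starting order, independent of input order
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


toy_filenames = [f"img_{i}.jpg" for i in range(10)]
toy_train, toy_test = split_filenames(toy_filenames, n_train=7)
print(len(toy_train), len(toy_test))
```

Using a dedicated `random.Random(seed)` instance avoids touching the global random state, so the same split is produced on every run.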

## Creating Training Instances
Now that we've set pretty much everything up, we need to create "training instances". In order to understand what we're doing, we first need to look into how the model will work.

The model, when given an image, will generate a text sequence, but it is only able to generate a single word at a time. This means that to generate the entire text sequence, 1) we'll need to repeatedly call the `model.predict` method, and 2) the model will need two input layers. One input layer will represent the image data and the second one will be the text sequence.
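
The word-by-word generation loop can be sketched as follows. Since the captioning model is not defined yet, `predict_next` is a hypothetical stand-in that returns canned next-word probabilities; the token IDs, vocabulary size, and function names are all made up for illustration.

```python
import numpy as np

VOCAB_SIZE = 6
START_ID, END_ID = 1, 2
# Canned caption the stand-in "model" will reproduce token by token.
CANNED = [START_ID, 3, 4, 5, END_ID]


def predict_next(image_features, sequence):
    """Hypothetical stand-in for model.predict: next-word probabilities."""
    probs = np.full(VOCAB_SIZE, 0.01)
    next_id = CANNED[len(sequence)] if len(sequence) < len(CANNED) else END_ID
    probs[next_id] = 1.0
    return probs / probs.sum()


def greedy_caption(image_features, max_length=10):
    """Repeatedly pick the most likely next word until <end> or max_length."""
    sequence = [START_ID]
    for _ in range(max_length - 1):
        next_id = int(np.argmax(predict_next(image_features, sequence)))
        sequence.append(next_id)
        if next_id == END_ID:
            break
    return sequence


print(greedy_caption(np.zeros(1280)))
```

Each iteration feeds both inputs, the image features and the sequence so far, which is why the model needs two input layers.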

Because the model predicts one word at a time, we'll need to create multiple training instances for each caption. Each instance pairs the same image features with a progressively longer prefix of the caption, and its label is the word that follows that prefix.

For example, say we have an image with the caption `<start> a big fat cat <end>`. Its six tokens yield five training instances, each pairing an input prefix with the next word as the target:
```
<start>                 -> a
<start> a               -> big
<start> a big           -> fat
<start> a big fat       -> cat
<start> a big fat cat   -> <end>
```

Training on every prefix teaches the model to predict the next word from the image and the words generated so far.
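
As a worked example, the sketch below expands the five tokenized captions of the sample image (copied from the tokenizer output above) into prefix/next-word pairs. The helper name `make_instances` is hypothetical.

```python
def make_instances(sequence):
    """Return (input_prefix, next_token) pairs for one tokenized caption."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# Tokenized captions of the sample image, copied from the output above.
sample_sequences = [
    [3, 1, 15, 9, 8, 1, 842, 9, 17, 343, 2],
    [3, 1, 15, 9, 8, 1, 1574, 235, 9, 34, 10, 137, 82, 6, 5, 151, 2],
    [3, 1, 15, 9, 8, 1, 14, 9, 10, 27, 1000, 17, 640, 22, 137, 82, 4, 5, 72, 2],
    [3, 13, 31, 12, 740, 2651, 89, 22, 137, 82, 6, 5, 151, 2],
    [3, 13, 31, 6, 726, 804, 321, 137, 82, 2],
]

instances = [pair for seq in sample_sequences for pair in make_instances(seq)]
print(len(instances))
print(instances[0])
```

A caption of n tokens yields n - 1 instances, so this one image contributes 10 + 16 + 19 + 13 + 9 = 67 instances, all sharing the same image features.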

In [22]:
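# Find the maximum caption length in tokens. We measure the tokenized
# sequences rather than `caption.split()`, because whitespace-split word
# counts do not match the tokenizer's token counts.
maxlen = max(len(sequence) for sequences in captions.values() for sequence in sequences)

# Create three empty lists:
# one for image inputs, one for caption inputs, and one for the next-word labels.
x_img, x_caption, y = [], [], []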

# Iterate over each training image.
for filename in train_images:
    # Retrieve the image features.
    image = image_features[filename]

    # Iterate over each caption 
    for sequence in captions[filename]:
        # Iterate over each value (word) in the sequence.
        # We skip the first word in the sequence because that's just the start code.
        for i in range(1, len(sequence)):
            # Our input value will be the sequence up until point `i`.
            input_sequence = sequence[:i]
            # Our output sequence is simply the next word.
            output_sequence = sequence[i]          