# Develop a deep learning photo caption generator

In this project I will be developing a deep learning model which is able to generate textual description for a given photograph. For such a task, our model needs to utilize features learned from various images using computer vision and understanding the natural language as well. The task is to produce a meaningful caption given an image.

I have trained my model using GPU and I will be sharing my environment configuration as well.

In [1]:
from os import listdir
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.models import Model

## Photo and caption dataset

For this project I will be using Flickr8k dataset. It consists of 8001 images and per image it has 5 captions. These captions describe the entities and the events in the image very clearly. The dataset can be downloaded from my Github repository. Download the datasets and then unzip them in your current working directory. You will have two directories:
<ul>
    <li>Flicker8k_Dataset: It consists of 8091 images.</li>
    <li>Flickr8k_text: It contains a number of file describing the dataset.</li>
</ul>

The dataset has been defined and divided into 6000 training images, 1000 development images and 1000 test images.

### Prepare photo data

I will be using a pre-trained model to extract features from different images. For this, I will use VGG16 model which is provided as part of Keras. If you do not have the model already, then the model would be downloaded on running the code. Depending on your internet connectivity and bandwidth this might take time. The download size is roughly 500 megabytes.

For this task, I am only interested in finding the features of the images and so I will remove the output layer (last layer) from the model. These features would serve as one the input features for my caption generator model. Once I have all the features for all the images I will dump them as a pickle file.

In [2]:
from pickle import dump

def extract_features(directory):
    # create an instance of VGG 16 model.
    # if this model is not already downloaded, it would first be downloaded
    model = VGG16()

    # we create a new model with the same input shape as the VGG model,
    # but we are not interested in the output of the model
    # we are rather interested in the features just before the output layer
    model = Model(inputs=model.input, outputs=model.layers[-2].output)
    model.summary()

    # a directory to store the features from all the images
    # key: image id, value: image feature
    features = dict()
    for image in listdir(directory):
        # unique image identifier
        image_id = image.split(".")[0]

        # load the image and preprocess it to VGG16 input format
        # VGG imposes restrictions on the image size and expects image to be of size 224x224
        image = load_img(directory + "/" + image, target_size=(224, 224))
        image = img_to_array(image)
        image = image.reshape(1, image.shape[0], image.shape[1], image.shape[2])

        # VGG16 expects image to be in BRG format rather than RGB format
        image = preprocess_input(image)

        feature = model.predict(image)
        features[image_id] = feature
    return features

features = extract_features("Flicker8k_Dataset")
# store the features as a dictionary in pickle format
dump(features, open("features.pkl", "wb"))

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0     

### Prepare text data

The file Flickr8k.token.txt inside the folder Flickr8k_text contains descriptions about the images. Each line has a fomrat _imageName_#__commentNumber _description_. We will create a dictionary containing the list of all descriptions for a given image Id.

In [3]:
def load_doc(filename):
    file = open(filename, "r")
    content = file.read()
    file.close()
    return content

# load the content of the file
doc = load_doc("Flickr8k_text/Flickr8k.token.txt")

In [4]:
def load_descriptions(doc):
    mapping = dict()
    for line in doc.split("\n"):
        if len(line) < 2:
            continue
        tokens = line.split()
        image_id, image_desc = tokens[0].split(".")[0], tokens[1:]
        image_desc = " ".join(image_desc)
        
        # create a dictionary item, such that the key is the image id
        # and the value is the list of all descriptions for a given image
        if image_id not in mapping:
            mapping[image_id] = list()
        mapping[image_id].append(image_desc)
    return mapping

descriptions = load_descriptions(doc)

Once the image descriptions have been loaded it's time to pre-process and clean our text. This process involves:
<ol>
    <li>Converting all text to lowercase</li>
    <li>Removing all punctuation marks</li>
    <li>Removing one letter words like "a", hanging s ('s), etc.</li>
    <li>Freeing the text of all numbers</li>
</ol>

In [5]:
import string

def clean_descriptions(descriptions):
    table = str.maketrans("", "", string.punctuation)
    for _, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            desc = desc.split()
            desc = [word.lower() for word in desc]
            desc = [word.translate(table) for word in desc]
            desc = [word for word in desc if len(word) > 1]
            desc = [word for word in desc if word.isalpha()]
            desc_list[i] = " ".join(desc)

clean_descriptions(descriptions)

Although the following code is not mandatory for this task, but it gives an approximation, as to how big the vocabulary is.

In [6]:
def to_vocabulary(descriptions):
    all_desc = set()
    for key in descriptions.keys():
        [all_desc.update(desc.split()) for desc in descriptions[key]]
    return all_desc

vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))

Vocabulary Size: 8763


Finally, save the descriptions in a file such that each line consists of an image Id and a description.

In [7]:
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + " " + desc)
    data = "\n".join(lines)
    file = open(filename, "w")
    file.write(data)
    file.close()

save_descriptions(descriptions, "descriptions.txt")  

## Deep Learning Model

Load a pre-defined set of identifiers given a file name. As we have three different files for train, dev and test, we have three sets of image identifiers. The ```load_set``` function will return the list of all unique image identifiers for a given type of dataset.

In [8]:
def load_doc(filename):
    file = open(filename, "r")
    doc = file.read()
    file.close()
    return doc

def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    for line in doc.split("\n"):
        if len(line) < 1:
            continue
        image_id = line.split(".")[0]
        dataset.append(image_id)
    return set(dataset)

To identify when a string starts and ends, we process each description belonging to the training set to contain _startseq_ as the beginning identifier and _endseq_ as the last identifier. This would be very necessary for our caption generation, as we will produce caption word by word and will stop once the _endseq_ identifier is encountered.

In [9]:
def load_clean_descriptions(filename, dataset):
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split("\n"):
        tokens = line.split()
        image_id, image_desc = tokens[0], tokens[1:]
        if image_id in dataset:
            if image_id not in descriptions:
                descriptions[image_id] = list()
            image_desc = "startseq " + " ".join(image_desc) + " endseq"
            descriptions[image_id].append(image_desc)
    return descriptions

The ```load_photo_features``` function is used to load the _features.pkl_ file and return a dictionary containing all pairs of image identifier and descriptions.

In [10]:
from pickle import load

def load_photo_features(filename, dataset):
    all_features = load(open(filename, "rb"))
    features = {k: all_features[k] for k in dataset}  ## why feature returned is a dictionary
    return features    

Load the training dataset before we start to train our model.

In [11]:
filename = "Flickr8k_text/Flickr_8k.trainImages.txt"
train = load_set(filename)
print('Dataset: %d' % len(train))
train_descriptions = load_clean_descriptions("descriptions.txt", train)
print('Descriptions: train=%d' % len(train_descriptions))
train_features = load_photo_features("features.pkl", train)
print('Photos: train=%d' % len(train_features))

Dataset: 6000
Descriptions: train=6000
Photos: train=6000


Create tokenizer and fit it on the training descriptions.

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer

def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(desc) for desc in descriptions[key]]
    return all_desc

def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# save the tokenizer in pickle format
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open("tokenizer.pkl", "wb"))

vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 7579


In [13]:
# Calculate the length of the longest description
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

The function ```create_sequences``` takes in a tokenizer, the maximum sequence length, a list of all the descriptions, a photo feature and the vocabulary size as input argumenta. It returns an input output pair of training data for our model. Our model has two inputs: photo feature and the encoded text and outputs the next word in the sequence. We will keep predicting the next word in the sequence until we meet the end sequence identifier, i.e. _endseq_.

In [14]:
from numpy import array

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, desc_list, photo, vocab_size):
    for desc in desc_list:
        X1, X2, y = list(), list(), list()
        seq = tokenizer.texts_to_sequences([desc])[0]
        for i in range(1, len(seq)):
            in_seq, out_seq = seq[:i], seq[i]
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            out_seq = to_categorical([out_seq], vocab_size)[0]
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)

The function ```data_generator``` is used to create training sets with progressive loading. I am training my model using NVIDIA GTX 970 with 4GB VRAM, which is not sufficient for our task. Rather than loading all the data at once, we will create batches of 1 training set and train progressively.

In [15]:
def data_generator(descriptions, photos, tokenizer, max_length, vocab_size):
    while 1:
        for key, desc_list in descriptions.items():
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo, vocab_size)
            yield [in_img, in_seq], out_word

Our model is actually a merge model. A merge model means, it merges the output of two models together, feeds them as an input into a new layer and finally gets an output. In the following section you can see we have two _inputs_. 
1. The first input (_input1_) is going to be the features extracted from our VGG model which would then be fed to a dense layer consisting of 256 neurons.
2. The second input (_input2_) is going to be a input sequence derived from the descriptions with the maximum length of the lonngest description. This input is fed to an embedding layer which generates embeddings, which are then fed into an LSTM layer. The output of the LSTM layer is then fed to a dense layer consisting of 256 neurons.
3. Before the outputs from the above two models is fed into the dense layer, the outputs are combined into a single representation using _add_ layer.
4. Finally, the output of the first dense layer is fed into a new dense layer that makes a softmax prediction over the entire vocabulary for the next word in the sequence.


In [16]:
from tensorflow.keras.layers import Input, Dropout, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, length):
    # feature extracted from VGG16 model
    input1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(input1)
    fe2 = Dense(256, activation="relu")(fe1)
    
    # generate embeddings for the sequences and feed them to LSTM
    input2 = Input(shape=(length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(input2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    
    # combine the output of the above two models before feeding forward to the dense layer
    decoder1 = add([fe2, se3])
    
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)
    
    model = Model(inputs=[input1, input2], outputs=outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    
    model.summary()
    return model

In [17]:
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))

# load clean descriptions for the training set
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))

# load training set image features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

# tokenize the training set descriptions
tokenizer = create_tokenizer(train_descriptions)

vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

# determine the length of the longest description in training set
length = max_length(train_descriptions)
print('Description Length: %d' % length)

Dataset: 6000
Descriptions: train=6000
Photos: train=6000
Vocabulary Size: 7579
Description Length: 34


In [18]:
# create the model
model = define_model(vocab_size, length)

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 34)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 4096)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 34, 256)      1940224     input_3[0][0]                    
__________________________________________________________________________________________________
dropout (Dropout)               (None, 4096)         0           input_2[0][0]                    
____________________________________________________________________________________________

In [19]:
# start model training

epochs = 20
steps = len(train_descriptions)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, length, vocab_size)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('model_' + str(i) + '.h5')



   1/6000 [..............................] - ETA: 6:32 - loss: 4.3799





### Evaluate model

We will evaluate my model on the held out testing dataset. We will generate descriptions for the testing dataset and evaluate those predictions with a standard cost function. For each description generation, our start token would be _startseq_ and would run until the token _endseq_ is generated.

In [20]:
# return back the original word for a given token id
def detokenize(tokenizer, integer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

In [33]:
from numpy import argmax

def generate_description(model, photo, tokenizer, length):
    # start sequence for our string
    in_text = "startseq"

    # a sequence can have maximum "length" number of words,
    # thus iterate over the max possible length
    for i in range(length):
        # tokenize the input text sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad the input to a maximum length of the longest description
        sequence = pad_sequences([sequence], maxlen=length)
        # predict the next word
        yhat = model.predict([photo, sequence], verbose=0)
        yhat = argmax(yhat)
        # map the integer to the word
        word = detokenize(tokenizer, yhat)
        # if text could not be predicted then stop
        if word is None:
            break
        in_text += " " + word
        # if the end sequence token has been generated then stop
        if word == "endseq":
            break
    return in_text

I will be using BLEU scores to compare the qulaity of text generated using the model with the actual textual descriptions. Wikipedia says BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.

In [34]:
from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, descriptions, photo, tokenizer, length):
    actual, predicted = list(), list()
    for key, desc_list in descriptions.items():
        yhat = generate_description(model, photo[key], tokenizer, length)
        references = [desc.split() for desc in desc_list]
        actual.append(references)
        predicted.append(yhat.split())
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

Then test the model on the heldout testing dataset.

In [35]:
from tensorflow.keras.models import load_model

# load test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))

# test set descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))

# test set photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = "model_19.h5"
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions, test_features, tokenizer, length)

Dataset: 1000
Descriptions: test=1000
Photos: test=1000
BLEU-1: 0.417263
BLEU-2: 0.178178
BLEU-3: 0.108978
BLEU-4: 0.043869


### Generate new captions

In [38]:
def extract_feature_for_single_image(image):
    model = VGG16()
    model = Model(inputs=model.input, outputs=model.layers[-2].output)

    image = load_img(image, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape(1, image.shape[0], image.shape[1], image.shape[2])

    image = preprocess_input(image)

    feature = model.predict(image)
    return feature

In [45]:
tokenizer = load(open("tokenizer.pkl", "rb"))
features = extract_feature_for_single_image("example.jpg")
description = generate_description(model, features, tokenizer, length)
print(description)

startseq two dogs run through the water endseq


To complete this task I referred to Jason Brownlee's [machine learning blog](machinelearningmastery.com). The blog could be found [here](https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/).