In [None]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""


This code example generates feature vectors corresponding to images in the input dataset. These feature vectors are used by the image captioning network in the code example v9_5_image_captioning. More context for this code example can be found in video 9.5 "Programming Example: Image Captioning with TensorFlow" in the video series "Learning Deep Learning: From Perceptron to Large Language Models" by Magnus Ekman (Video ISBN-13: 9780138177614). This is notebook 1 of 2 for this example.

This programming example assumes that the following resources from the COCO dataset are available:
The file captions_train2014.json should be located in the directory ../data/coco/
All the training images should be located in the directory ../data/coco/train2014/

The resulting feature vectors will be stored in the directory tf_data/feature_vectors/

The import statements are shown in the first code snippet.

In [None]:
import json
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import \
    preprocess_input
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.preprocessing.image import img_to_array
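import os
import pickle
import gzip
import logging
tf.get_logger().setLevel(logging.ERROR)

TRAINING_FILE_DIR = '../data/coco/'
OUTPUT_FILE_DIR = 'tf_data/feature_vectors/'
# Create the output directory if it does not already exist,
# because gzip.open will not create missing directories.
os.makedirs(OUTPUT_FILE_DIR, exist_ok=True)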


The parts of the dataset that we will use are contained in two resources. The first resource is a JSON file that contains captions as well as filenames and some other information for the images. We assume that you have placed that file in the directory pointed to by the variable TRAINING_FILE_DIR. The images themselves are stored as individual image files and are assumed to be located in a directory named train2014 inside the directory pointed to by TRAINING_FILE_DIR. The COCO dataset comes with elaborate tools to parse and read the rich information about the various images, but because we are only interested in the image captions, we choose to access the JSON file directly and extract the limited data that we need ourselves. The code snippet below opens the JSON file and creates a dictionary that, for each image, maps the unique image ID to a list of strings. The first string in each list is the image filename, and the subsequent strings are alternative captions for the image.


In [None]:
with open(TRAINING_FILE_DIR \
          + 'captions_train2014.json') as json_file:
    data = json.load(json_file)
image_dict = {}
for image in data['images']:
    image_dict[image['id']] = [image['file_name']]
for anno in data['annotations']:
    image_dict[anno['image_id']].append(anno['caption'])
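
To make the resulting structure concrete, here is a minimal sketch that applies the same two loops to a tiny, made-up JSON string. The filenames, IDs, and captions are invented for illustration and are not taken from the COCO dataset; the name toy_dict is likewise hypothetical.

```python
import json

# Made-up data that mimics only the two fields the notebook reads.
toy_json = '''
{
  "images": [
    {"id": 1, "file_name": "img_a.jpg"},
    {"id": 2, "file_name": "img_b.jpg"}
  ],
  "annotations": [
    {"image_id": 1, "caption": "A dog on a beach."},
    {"image_id": 2, "caption": "A red bus."},
    {"image_id": 1, "caption": "A dog running near water."}
  ]
}
'''
data = json.loads(toy_json)

# First pass: one entry per image, starting with the filename.
toy_dict = {}
for image in data['images']:
    toy_dict[image['id']] = [image['file_name']]

# Second pass: append each caption to the list of its image.
for anno in data['annotations']:
    toy_dict[anno['image_id']].append(anno['caption'])

print(toy_dict[1])
```

Note that captions for the same image do not need to be adjacent in the annotations list; each one is appended to the entry selected by its image_id.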


The next step is to create our pretrained VGG19 model, which is done in the next code snippet. We first obtain the full VGG19 model with weights trained on the ImageNet dataset. We then create a new model (model_new) from that model by stating that we want to use the layer named block5_conv4 as output. A fair question is how we figured out that name. As you can see in the code snippet, we first print the summary of the full VGG19 model. This summary includes the layer names, and it shows that the last convolutional layer is named block5_conv4.

In [None]:
# Load the full network, then create a model that stops
# at the last convolutional layer (no top layers).
model = VGG19(weights='imagenet')
model.summary()
model_new = Model(inputs=model.input,
                  outputs=model.get_layer('block5_conv4').output)
model_new.summary()
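
As an alternative to reading the summary by eye, the name of the last convolutional layer can also be looked up programmatically. The sketch below is illustrative only. It builds VGG19 with weights=None and include_top=False, which are assumptions made here to avoid downloading weights (the notebook itself loads the ImageNet weights), and the names sketch_model, conv_names, last_conv_name, and feature_shape are hypothetical.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19

# Assumption: random weights and no top layers, to keep the sketch light.
sketch_model = VGG19(weights=None, include_top=False,
                     input_shape=(224, 224, 3))

# Collect the names of all convolutional layers, in network order.
conv_names = [layer.name for layer in sketch_model.layers
              if isinstance(layer, tf.keras.layers.Conv2D)]
last_conv_name = conv_names[-1]

# Output shape of that layer, with the batch dimension dropped.
feature_shape = tuple(
    sketch_model.get_layer(last_conv_name).output.shape)[1:]
print(last_conv_name, feature_shape)
```

The shape reported here, without the batch dimension, should match the per-image feature array that the notebook saves to disk.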


We are now ready to run all the images through the network, extract the feature vectors, and save them to disk. This is done by the code snippet below. We traverse the dictionary to obtain the image filenames. Every loop iteration processes a single image and saves the feature vectors for that image in a single file. Before running the image through the network, we perform some preprocessing. The image sizes in the COCO dataset vary from image to image, so we first read the file to determine its width and height. We determine the aspect ratio and then reread the image scaled to a size at which the shortest side ends up being 256 pixels. We then crop the center 224×224 region of the resulting image to end up with the input dimensions that our VGG19 network expects. We finally run the VGG19 preprocessing function, which standardizes the data values in the image (it converts the image from RGB to BGR and zero-centers each color channel with respect to the ImageNet dataset), before we run the image through the network. The output of the network will be an array with the shape (1, 14, 14, 512), representing the results from a batch of images, where the first dimension indicates that the batch size is 1. Therefore, we extract the first (and only) element from this array (y[0]) and save it as a gzipped pickle file named after the image file with .pickle.gzip appended, in the directory pointed to by OUTPUT_FILE_DIR. When we have looped through all images, we also save the dictionary as caption_file.pickle.gz so we do not need to parse the JSON file again later in the code that does the actual training.
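
The resize-and-crop arithmetic described above can be checked in isolation before running the full loop. The following is a minimal sketch that uses no image library; the function name resize_and_crop_geometry and its parameters are hypothetical and only mirror the calculations in the text.

```python
def resize_and_crop_geometry(width, height, short_side=256, crop=224):
    """Return ((resized_height, resized_width), (h_start, w_start)).

    The shortest side is scaled to short_side while keeping the aspect
    ratio, and the crop offsets select the center crop x crop region.
    """
    if height > width:
        # Portrait: width is the shortest side.
        resized_h = int(height / width * short_side)
        resized_w = short_side
    else:
        # Landscape or square: height is the shortest side.
        resized_h = short_side
        resized_w = int(width / height * short_side)
    h_start = (resized_h - crop) // 2
    w_start = (resized_w - crop) // 2
    return (resized_h, resized_w), (h_start, w_start)

# A 640x480 landscape image and a 480x640 portrait image.
landscape = resize_and_crop_geometry(640, 480)
portrait = resize_and_crop_geometry(480, 640)
print(landscape)  # ((256, 341), (16, 58))
print(portrait)   # ((341, 256), (58, 16))
```

The first tuple is ordered (height, width) because that is the order the target_size argument of load_img expects, whereas the size attribute of the loaded image is ordered (width, height).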


In [None]:
# Run all images through the network and save the output.
for i, key in enumerate(image_dict.keys()):
    if i % 1000 == 0:
        print('Progress: ' + str(i) + ' images processed')
    item = image_dict.get(key)
    filename = TRAINING_FILE_DIR + 'train2014/' + item[0]

    # Determine dimensions.
    image = load_img(filename)
    width = image.size[0]
    height = image.size[1]

    # Resize so shortest side is 256 pixels.
    if height > width:
        image = load_img(filename, target_size=(
            int(height/width*256), 256))
    else:
        image = load_img(filename, target_size=(
            256, int(width/height*256)))
    width = image.size[0]
    height = image.size[1]
    image_np = img_to_array(image)

    # Crop to center 224x224 region.
    h_start = int((height-224)/2)
    w_start = int((width-224)/2)
    image_np = image_np[h_start:h_start+224,
                        w_start:w_start+224]

    # Rearrange array to have one more
    # dimension representing batch size = 1.
    image_np = np.expand_dims(image_np, axis=0)

    # Call model and save resulting tensor to disk.
    X = preprocess_input(image_np)
    y = model_new.predict(X)
    save_filename = OUTPUT_FILE_DIR + \
        item[0] + '.pickle.gzip'
    # The with statement closes the file even if pickling fails.
    with gzip.open(save_filename, 'wb') as pickle_file:
        pickle.dump(y[0], pickle_file)

# Save the dictionary containing captions and filenames.
save_filename = OUTPUT_FILE_DIR + 'caption_file.pickle.gz'
with gzip.open(save_filename, 'wb') as pickle_file:
    pickle.dump(image_dict, pickle_file)
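
Files saved this way can be read back with the same two modules. The sketch below shows a round trip through a gzipped pickle file using a temporary directory and a small made-up nested list in place of a real feature array. The helper names save_gz_pickle and load_gz_pickle are hypothetical, and this is not necessarily how the second notebook loads its data.

```python
import gzip
import os
import pickle
import tempfile

def save_gz_pickle(obj, path):
    """Pickle obj and write it to a gzip-compressed file."""
    with gzip.open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_gz_pickle(path):
    """Read a gzip-compressed pickle file and return the object."""
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)

# Made-up stand-in for one image's feature array.
toy_features = [[0.0, 1.5], [2.0, 3.5]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same naming pattern as above: image filename + '.pickle.gzip'.
    path = os.path.join(tmp_dir, 'example.jpg.pickle.gzip')
    save_gz_pickle(toy_features, path)
    restored = load_gz_pickle(path)

print(restored == toy_features)
```

Only unpickle files that you created yourself or otherwise trust, because loading a pickle file can execute arbitrary code.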
